

Improving Food Recipe Suggestions With Hierarchical Classification of Food Recipes

PEDRAM FATHOLLAHZADEH KAFSHI

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE
Stockholm, Sweden 2018



Title (English): Improving Food Recipe Suggestions with Hierarchical Classification of Food Recipes
Title (Swedish): Förbättrande rekommendationer av matrecept genom hierarkisk klassificering av matrecept
Author: Pedram Fathollahzadeh Kafshi
Email: pedramfk@kth.se
Topic: computer science, machine learning, data mining
Program: Electrical Engineering with a master's degree in Systems, Control and Robotics
Supervisor: Pawel Herman
Examiner: Patric Jensfelt
Principal: Eatit
Date: 2018-03-23


Abstract

Making personalized recommendations has become a central part in many platforms, and is continuing to grow with more access to massive amounts of data online. Giving recommendations based on the interests of the individual, rather than recommending items that are popular, increases the user experience and can potentially attract more customers when done right.

In order to make personalized recommendations, many platforms resort to machine learning algorithms. In the context of food recipes, these machine learning algorithms tend to consist of hybrid methods between collaborative filtering, content-based methods and matrix factorization. Most content-based approaches are ingredient based and can be very fruitful. However, fetching every single ingredient for recipes and processing them can be computationally expensive. Therefore, this paper investigates if clustering recipes according to what cuisine they belong to and what the main protein is can also improve rating predictions compared to when only collaborative filtering and matrix factorization methods are employed. This suggested content-based approach has a structure of a hierarchical classification, where recipes are first clustered into what cuisine group they belong to, then the specific cuisine and finally what the main protein is.

The results suggest that the content-based approach can improve the predictions slightly but not significantly, and can help reduce the sparsity of the rating matrix to some extent. However, it suffers from heavily sparse data with respect to how many rating predictions it can give.


Sammanfattning

Att ge personliga rekommendationer har blivit en central del av många plattformar och fortsätter att bli det då tillgången till stora mängder data har ökat. Genom att ge personliga rekommendationer baserat på användares intressen, istället för att rekommendera det som är populärt, förbättras användarupplevelsen och fler kunder kan attraheras.

För att kunna producera personliga rekommendationer så vänder sig många plattformar till maskininlärningsalgoritmer. När det kommer till matrecept, så brukar dessa maskininlärningsalgoritmer bestå av hybrida metoder som sammanfogar collaborative filtering, innehållsbaserande metoder och matrisfaktorisering. De flesta innehållsbaserande metoderna baseras på ingredienser och har visats vara effektiva. Däremot så kan det vara kostsamt för datorer att ta hänsyn till varenda ingrediens i varje matrecept. Därför undersöker denna artikel om att klassificera recept hierarkiskt efter matkultur och huvudprotein också kan förbättra rekommendationer när bara collaborative filtering och matrisfaktorisering används. Denna innehållsbaserande metod har en struktur av hierarkisk klassificering, där recept först indelas efter matkultur, specifik matkultur och till slut vad huvudproteinet är.

Resultaten visar att den innehållsbaserande metoden kan förbättra receptförslagen, men inte på en statistiskt signifikant nivå, och kan reducera gleshet i en matris med tillsatta betyg från olika användare med olika recept något. Däremot så påverkas den ansenligt när det är glest med tillgänglighet av data.


Improving Food Recipe Suggestions With Hierarchical Classification of Food Recipes

Pedram Fathollahzadeh Kafshi

Abstract—Making personalized recommendations has become a central part in many platforms, and is continuing to grow with more access to massive amounts of data online. Giving recommendations based on the interests of the individual, rather than recommending items that are popular, increases the user experience and can potentially attract more customers when done right.

In order to make personalized recommendations, many platforms resort to machine learning algorithms. In the context of food recipes, these machine learning algorithms tend to consist of hybrid methods between collaborative filtering, content-based methods and matrix factorization. Most content-based approaches are ingredient based and can be very fruitful. However, fetching every single ingredient for recipes and processing them can be computationally expensive. Therefore, this paper investigates if clustering recipes according to what cuisine they belong to and what the main protein is can also improve rating predictions compared to when only collaborative filtering and matrix factorization methods are employed. This suggested content-based approach has a structure of a hierarchical classification, where recipes are first clustered into what cuisine group they belong to, then the specific cuisine and finally what the main protein is.

The results suggest that the content-based approach can improve the predictions slightly but not significantly, and can help reduce the sparsity of the rating matrix to some extent.

However, it suffers from heavily sparse data with respect to how many rating predictions it can give.

Keywords—collaborative filtering, content-based method, matrix factorization, recommender systems, hierarchical classification, recipes.

I. INTRODUCTION

As the access to huge amounts of data online increases, many platforms such as Netflix and Amazon want to employ machine learning algorithms in order to predict what items a user wants to consume. The reason why platforms resort to machine learning algorithms instead of simply recommending popular items is that unique user profiles can otherwise be missed, and it is essential to give a unique user experience in order to attract and keep customers.

In the context of food recipes, many users can potentially have very unique taste profiles due to cultural and historical differences, but also because of personal preferences [1].

Genetics also plays a role here to some extent. For example, neuroscientist Dr. Charles Zuker claims that humans tend to prefer the taste of sweetness to bitterness, but that in the social context the taste of bitterness can outweigh initial preferences (coffee, beer, etc.) [1]. Consequently, taste can be a very complex phenomenon.

Pedram Fathollahzadeh Kafshi is with the Department of Electrical Engineering, Royal Institute of Technology, Stockholm, Sweden, e-mail: pedramfk@kth.se.

One popular method in recommender systems is collaborative filtering (CF). There are two CF approaches: user-to-item CF and item-to-item CF [8]. User-to-item CF predicts how a user would rate an item based on how other similar users have rated that item, while item-to-item CF predicts how a user would rate an item based on how other similar items have been rated. These two CF approaches are dual to each other in the sense that one looks for similar users and the other for similar items when predicting ratings. Which CF approach to use clearly depends on the application; if it is more difficult to find similar users than items due to users being more complex, then item-to-item CF is an appropriate choice, and vice versa.

Traditional CF looks at local effects between items and users, and is a powerful and popular method as it does not require any additional information about the features of the items. However, CF tends to have a popularity bias and unique items can easily be crowded out by popular items [2]. To solve this, and some other shortcomings of CF such as sparsity and the first-rater problem [3], hybrid methods can be employed.

These hybrid methods can be a combination between CF and different content-based (CB) methods [3,4]. CB methods look into features of the items in a catalog. By doing this, CB methods can recommend a specific type or genre of items that a user is potentially interested in, and can therefore give more unique and relevant recommendations compared to what CF can offer [3]. However, CF can recommend items that are dissimilar to what a user is generally interested in with respect to their features, as it does not take the features into account, and can therefore give more diverse recommendations.

The state-of-the-art recommender systems generally employ some hybrid method between CF and CB together with different matrix factorization (MF) methods such as Singular Value Decomposition (SVD) or Factored Item Similarity Models (FISM) [5]. MF methods look into regional effects by factorizing a rating matrix, which deals with the sparsity issue and has proven to be very efficient, and is subsequently used in many platforms such as Netflix [7].

For recommender systems in the context of food and recipes, different hybrid recommender systems have been employed given different pre-requisites. For instance, Svensson et al. [12] built a social navigation system based on what recipes users choose in order to predict what users want to consume, while van Pinxteren et al. [9] derived and evaluated a recipe similarity measure for recommending healthy recipes. Other recommender systems that have been investigated use feature-based CB methods where recipes are represented as vectors with, for example, ingredients as the entries [6,11]. Hanai et al. [13] managed to instead cluster recipes successfully based on their names and ingredients as well as seasoning.

Many recipe recommender systems use ingredient-based methods [6,9,11,12,13]. However, there can be cases where all ingredients for recipes are not available. Also, it can prove to be computationally expensive to iterate through and gather every ingredient and feature of an item in order to produce predictions. For instance, a mobile application might need to fetch recipes online and only wants to gather all relevant information about recipes that are deemed relevant for users.

Subsequently, gathering only enough information about recipes such that they can be appropriately clustered, and fetching complete information only for recipes that users find relevant, will be less of a computational burden. Therefore, in this paper, a CB method is proposed and evaluated which takes into account what cuisine a recipe belongs to and what the main protein is. Instead of representing a recipe as a vector of features such as ingredients, this paper investigates whether the suggested high-level representation of recipes (see Fig. 1) can also provide better recommendations than what traditional CF together with MF methods can provide. The research question is therefore whether grouping recipes and classifying them after their main protein and cuisine (as implemented in Algorithm 1) will help improve the predictions when CF and MF are also employed.

A. Outline

In Section II, notations and terminology are described. Section III explains the different approaches in recommender systems and how they have been combined and implemented to produce recommendations in food platforms and other domains. Section IV gives a concrete proposal of a practical implementation of the suggested CB approach. Section V describes the working data set and the evaluation methodologies used to quantify prediction and recommendation accuracy. In Section VI the results are documented, and Section VII provides a discussion about the results. Finally, in Section VIII, a conclusion is made with a brief discussion of potential future work and improvements.

II. NOTATIONS AND TERMINOLOGY

A. Notations

• v̄ - vector

• A^T - transpose of matrix A ∈ R^{m×n}

• R_{m×n} - (reserved for) rating matrix with m items (rows) and n users (columns)

• R_ij - element (i, j) of rating matrix R

• R̃_ij - estimated value of R_ij

• a ← ζ(b) - assign object/value ζ(b) to a

• C_{i,j,k} - cluster (i: cuisine, j: specific cuisine, k: protein)

• µ - mean rating in the catalog/rating matrix

• µ_u - mean user rating in the catalog/rating matrix

• µ_i - mean item rating in the catalog/rating matrix

• b_i - item rating bias (b_i = µ_i − µ)

• b_u - user rating bias (b_u = µ_u − µ)

B. Terminology

• CF - collaborative filtering

• CB - content-based (method)

• GB - global baseline

• CBCF - content-boosted collaborative filtering

• SVD - Singular Value Decomposition

• MF - matrix factorization

• GD - gradient descent

• SGD - stochastic gradient descent

• rating matrix - a matrix where entries correspond to ratings (each row is an item and each column is a user) and also referred to as a catalog

• RMSE - root-mean-square error

• ACC - accuracy

• IoI - a specific item that is of interest

• UoI - a specific user that is of interest

• neighborhood clusters - clusters with the same cuisine group and protein (C_{i=I,j,k=K})

• DS - data set

• TS - training set

III. BACKGROUND & RELATED WORK

A. Collaborative Filtering

A pure CF approach is a neighborhood-based algorithm that predicts ratings based on either how similar users have rated items or how other similar items have been rated [8]. Once a set of nearest neighbors (items or users) is found, a weighted average is taken over the items or users with respect to their similarities (2). One popular way of quantifying the similarities is by calculating the Pearson Correlation Coefficient (1).

s_xy = (x̄ · ȳ) / (|x̄| · |ȳ|)    (1)

r_ui = ( Σ_{j=1}^{N} s_ij · r_uj ) / ( Σ_{j=1}^{N} s_ij )    (2)

However, this approach on its own does not capture the intuition that some users are tougher raters than others in the sense that some tend to give lower ratings. One way of capturing this intuition is by normalizing the rating matrix before employing CF such that the ratings for all users are centered around zero. Positive ratings would then suggest that the user liked the item more than average and vice versa.
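The similarity (1) and weighted-average prediction (2) can be illustrated with a minimal item-to-item CF sketch in Python. This is an illustration, not the paper's implementation: it assumes an items × users rating matrix with NaN marking missing ratings, and the function names are hypothetical.

```python
import numpy as np

def pearson_sim(x, y):
    """Similarity per (1): cosine of mean-centered rating vectors,
    computed over co-rated entries only (NaN marks a missing rating)."""
    mask = ~np.isnan(x) & ~np.isnan(y)
    if mask.sum() < 2:
        return 0.0
    xc = x[mask] - x[mask].mean()
    yc = y[mask] - y[mask].mean()
    denom = np.linalg.norm(xc) * np.linalg.norm(yc)
    return float(xc @ yc / denom) if denom > 0 else 0.0

def predict_item_to_item(R, u, i):
    """Predict R[i, u] per (2): a similarity-weighted average of user u's
    ratings on other items (rows of R are items, columns are users)."""
    num = den = 0.0
    for j in range(R.shape[0]):
        if j == i or np.isnan(R[j, u]):
            continue
        s = pearson_sim(R[i], R[j])
        if s > 0:  # keep positively similar items only
            num += s * R[j, u]
            den += s
    return num / den if den > 0 else np.nan
```

Mean-centering inside the similarity is what turns plain cosine into the Pearson coefficient, which also provides the per-vector normalization discussed above.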

B. Global Baseline

A global baseline (GB) estimate has three components: the mean rating in the catalog (µ), the mean user rating (µ_u) and the mean item rating (µ_i). By linearly combining these components, where b_i is the item bias (µ_i − µ) and b_u the user bias (µ_u − µ), GB gives estimates based on how the average rating of users and items deviates from the overall mean rating (3).

b_{u,i} = µ + b_i + b_u = µ + (µ_i − µ) + (µ_u − µ)    (3)
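A minimal sketch of the GB estimate (3), under the same items × users convention with NaN for missing ratings (the function name is illustrative):

```python
import numpy as np

def global_baseline(R, i, u):
    """GB estimate per (3): b_{u,i} = mu + (mu_i - mu) + (mu_u - mu),
    with means taken over the available (non-NaN) ratings only.
    Rows of R are items, columns are users."""
    mu = np.nanmean(R)        # overall mean rating
    mu_i = np.nanmean(R[i, :])  # item i's mean rating
    mu_u = np.nanmean(R[:, u])  # user u's mean rating
    return mu + (mu_i - mu) + (mu_u - mu)
```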


Hierarchical Tree Structure for Food Representation: Cuisine (e.g. Mediterranean, Scandinavian) ⇒ Specific Cuisine (e.g. Italian, Greek) ⇒ Specific Protein (e.g. beef, lamb, chicken).

Fig. 1: A suggested hierarchical tree structure for a high-level representation of food/recipes. Recipes are first clustered in cuisines, and then main protein. Italian pasta with beef would here fall under Mediterranean ⇒ Italian ⇒ beef.

C. Content-Based Method

CB methods look into the features of the items and often represent items as vectors of these features. Different CB methods, such as a Bayesian text classifier [3], a weighted average [9] or simply linear regression, are popular in recommender systems.

D. Matrix Factorization

MF decomposes a matrix into sub-matrices. One popular matrix-decomposition method is called SVD and decomposes a matrix into three matrices (4).

R_{m×n} = U_{m×k} Σ_{k×k} V^T_{n×k}    (4)

If R is considered a rating matrix (each available entry contains a rating represented as a real value between 1 and 5) where the rows in R represent items and the columns represent users (item-to-user), then U is item-to-feature (m × k), Σ feature-to-feature (k × k) and V user-to-feature (n × k). It can be tempting to have as large a number of features (or latent factors) k as possible; however, having too many features can prove to be computationally expensive and result in overfitting due to noise in R. Therefore, when applying SVD in recommender systems on rating matrices, a common practice is to choose a number of features k̃ in (6) that balances accuracy against the risk of overfitting and the complexity [7]. Generally, the objective function for SVD is to minimize the mean-square error (5) over all entries in R.

E(R, U, Σ, V) = (1/N) Σ_{(i,j)∈S} (R_ij − R̃_ij)²    (5)

R̃_ij = U_{m×k̃} Σ_{k̃×k̃} V^T_{n×k̃}    (6)
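The truncated reconstruction (6) can be sketched with NumPy's SVD, assuming a fully specified rating matrix (the handling of missing entries is discussed below); the function name is illustrative:

```python
import numpy as np

def svd_reconstruct(R, k):
    """Rank-k estimate per (6): keep the k largest singular values of a
    fully specified rating matrix R and reconstruct U_k Sigma_k V_k^T."""
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    # Multiplying columns of U by the singular values is U_k @ diag(s_k).
    return U[:, :k] * s[:k] @ Vt[:k, :]
```

For a matrix whose true rank is at most k, this reconstruction is exact; for larger ranks it is the best rank-k approximation in the least-squares sense.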


SVD gives the best low-rank approximation of the rating matrix R, but has two inherent problems: overfitting, and the fact that SVD requires that all entries in R are available, as the objective function (5) is defined over all entries.

As previously mentioned, one way to reduce overfitting is to choose an appropriate number of features, but this can also be solved by adding a regularization term [7] in the objective function (5) which allows for a rich model where there is a lot of data and reduces the complexity on parts where data is sparse.

There are different ways of addressing the issue with SVD requiring that all entries in R are available. One way is to initialize the missing entries in R by taking the average rating in the system, using GB or picking random values. Another way is to redefine the objective function (5) such that the mean-square error is defined only over all available entries in R.

Together with regularization, the objective can then be defined as estimating matrices P and Q (7) such that (8) is minimized.

R̃_{m×n} = P_{m×l} Q^T_{n×l}    (7)

E_reg(R, P, Q) = Σ_{(i,j)∈S} (R_ij − R̃_ij)² + β Σ_l (||P||² + ||Q||²)    (8)

Finding the matrices P and Q with respect to the objective function (8) can be done through gradient descent (GD) and does not require that all entries in the rating matrix are available. In the context of recommender systems, this can be a more practical approach due to sparsity, and different variants of this method have proven to be efficient [5].
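A minimal SGD sketch for the factorization (7) under the regularized objective (8). This is an illustration, not the paper's implementation: NaN marks missing entries, and the learning rate, epoch count, number of factors and regularization weight β are illustrative choices.

```python
import numpy as np

def sgd_mf(R, l=2, beta=0.02, lr=0.01, epochs=2000, seed=0):
    """Factorize R ~ P Q^T per (7) by stochastic gradient descent on the
    regularized squared error (8), using only the observed entries."""
    rng = np.random.default_rng(seed)
    m, n = R.shape
    P = 0.1 * rng.standard_normal((m, l))
    Q = 0.1 * rng.standard_normal((n, l))
    observed = [(i, j) for i in range(m) for j in range(n)
                if not np.isnan(R[i, j])]
    for _ in range(epochs):
        for i, j in observed:
            e = R[i, j] - P[i] @ Q[j]              # error on one rating
            P[i] += lr * (e * Q[j] - beta * P[i])  # gradient step + L2 penalty
            Q[j] += lr * (e * P[i] - beta * Q[j])
    return P, Q
```

Because the loss is summed only over observed entries, no GB or random filling of the missing ratings is needed before factorizing, which is exactly the advantage exploited by Algorithm 3 later.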

E. Hybrid Approaches

Each method has its strengths and shortcomings. CF can recommend items based only on explicit feedback, but has a popularity bias and its ability to predict ratings becomes compromised when the rating matrix is sparse. CB methods can recommend items based on what users have previously shown to like even when CF does not recommend those due to sparsity. However, CB methods by their nature tend to only recommend items that are similar to items users have already shown interest in, unlike CF.

To overcome these shortcomings, many recommender systems combine CF and CB in order to produce diverse and accurate recommendations. Melville et al. [3] proposed a method called content-boosted collaborative filtering (CBCF) where CF is first applied to reduce sparsity in the rating matrix, and then a CB method is applied to fill in the rest of the partially empty matrix. CBCF was found to perform 9.2% better than a pure CB method and 4% better than pure CF.

For recommender systems in the context of recipes, Zhengxian et al. [10] investigated a hybrid approach between CB methods and MF (7). Jagithyala [11] proposed CB methods for recipe recommender systems where recommendations were based on ingredients and review text, and Freyne and Berkovsky [6] proposed a CB method based on ingredients. When representing items as a vector of ingredients, Jagithyala [11] produced recommendations based on a similarity measure between different items using the Pearson Correlation Coefficient (1), and Freyne et al. [6] applied a method where a weighted average between ingredients is taken based on explicit feedback. Jagithyala [11] found that the ingredient-based similarity approach was the most fruitful, and Freyne [6] managed to improve predictions with a hybrid approach between CF and the suggested CB method with weighted averages.

IV. METHODS

In Fig. 1 it can be seen that the suggested approach first classifies what cuisine group a recipe belongs to (such as Mediterranean), then a specific cuisine (Italian) and finally what the main protein is (beef). The idea is that this information is adequate for clustering recipes in the sense that users can have a clear preference towards a certain group of recipes.

If the ratings for different recipes in the same cluster indicate a preference (the rating distribution is narrow and therefore the rating variance is low) given that enough recipes have been rated in that cluster, then a meaningful average of the ratings can be drawn as a prediction for other recipes belonging to the same cluster (see Fig. 2 (a)).

If there is no clear concentration of ratings (trend) in that particular cluster (high rating variance) or not enough information (few ratings), then other clusters belonging to the same cuisine group with the same protein (neighborhood clusters) are checked (such as Mediterranean ⇒ Greek ⇒ beef in Fig. 2 (b)). If every single specific cuisine belonging to the same cuisine group with the same protein does not indicate a trend, then this CB method fails in predicting a rating (see Fig. 2 (c)). How different cuisines should be grouped together (such as Italian and Greek in Fig. 1) can be based on some qualitative or subjective measure of similarity, or they can be geographically and historically clustered.

The proposed CB method in this paper is formalized in Algorithm 1, and is used in Algorithms 2 and 3. Algorithm 2 uses SVD as MF while Algorithm 3 uses stochastic gradient descent (SGD) to find the decomposed matrices (7) with respect to the objective function (8) (see Fig. 3). Because SVD requires that all entries in the rating matrix are available, the missing ones are estimated using GB in Algorithm 2, while this is not necessary for Algorithm 3. However, the GB estimate is still introduced in Algorithm 3 after MF is employed to capture more global effects. CF is performed on the matrices before CB in both Algorithm 2 and 3 in order to reduce sparsity of the rating matrix as done by Melville et al. [3]. Therefore, the CB method can predict more missing entries. CF only produces ratings for items when there is at least one other item with a similarity measure greater than 0.

Algorithm 1 shows the pseudo-code for the suggested practical implementation of the CB high-level hierarchical approach (see Fig. 1 and 2). The predicted rating of the item of interest (IoI) for the user of interest (UoI) is stored in the variable predictedRating and is initialized in line 1. Assuming that the IoI belongs to a specific cluster C_{i=I,j=J,k=K}, line 2 checks if any reasonable conclusion can be drawn about the predicted rating for the IoI given its cluster C_{i=I,j=J,k=K}.


This is quantified by analyzing if the statistical rating variance for all items rated by the UoI in the cluster C_{i=I,j=J,k=K} is below a given threshold and if the number of rated items in the cluster is above a certain threshold.

Algorithm 1 CB Method - Hierarchical Clusters

1:  predictedRating = ∅
2:  if Var{ratings(C_{i=I,j=J,k=K})} < T_ov and size(C_{i=I,j=J,k=K}) > T_os then
3:      predictedRating = mean{C_{i=I,j=J,k=K}}
4:  else
5:      ratingClusters = [ ]
6:      N = 0
7:      for all C_{i=I,j,k=K} ∉ ∅ do
8:          if Var{ratings(C_{i=I,j,k=K})} < T_ov and size(C_{i=I,j,k=K}) > T_os then
9:              ratingClusters.add(mean{C_{i=I,j,k=K}})
10:             N = N + 1
11:     if Var{ratingClusters} < T_v and N > T_s then
12:         predictedRating = mean{ratingClusters}
13: return predictedRating

These thresholds can be chosen arbitrarily. If the rating distribution is narrow and enough items have been rated in the cluster, an average of all the ratings within that cluster is taken and stored in predictedRating. Otherwise, no meaningful conclusion can be drawn from that cluster. If this is the case, lines 5-10 gather ratings from all neighborhood clusters (C_{i=I,j,k=K}) that belong to the same main cuisine and have the same protein, given that the neighborhood cluster satisfies similar constraints (line 8) as in line 2. If the constraints in line 11 are satisfied, then an average of the ratings between the neighborhood clusters that are available in ratingClusters is taken and stored in predictedRating in line 12. However, if at least one of the constraints in line 11 is not satisfied, then that suggests either a disagreement between the neighborhood clusters (high rating variance) or that enough information is not available (few ratings), and the algorithm will return the value predictedRating was initialized with (∅).
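Algorithm 1 can be sketched in Python as follows. This is an illustrative sketch, not the paper's code: it assumes the UoI's ratings are pre-grouped into a dict keyed by (cuisine group, cuisine, protein) triples, the default thresholds mirror the lenient setting (variance threshold 1.7, size threshold 0), and sample variance is used, matching the variance values quoted in Fig. 2.

```python
from statistics import mean, variance

def _var(xs):
    # Sample variance; a single value gives no measurable spread.
    return variance(xs) if len(xs) > 1 else 0.0

def cb_predict(clusters, key, T_ov=1.7, T_os=0, T_v=1.7, T_s=0):
    """Hierarchical CB prediction per Algorithm 1. `clusters` maps a
    (cuisine_group, cuisine, protein) triple to the UoI's ratings in that
    cluster; `key` is the cluster of the item of interest."""
    ratings = clusters.get(key, [])
    # Lines 2-3: trust the IoI's own cluster if it shows a clear trend.
    if len(ratings) > T_os and _var(ratings) < T_ov:
        return mean(ratings)
    group, _, protein = key
    # Lines 5-10: fall back on neighborhood clusters (same group & protein).
    neighborhood = []
    for (g, _, p), r in clusters.items():
        if g == group and p == protein and r:
            if len(r) > T_os and _var(r) < T_ov:
                neighborhood.append(mean(r))
    # Lines 11-12: average the agreeing neighborhood clusters.
    if len(neighborhood) > T_s and _var(neighborhood) < T_v:
        return mean(neighborhood)
    return None  # line 13 with predictedRating still empty
```

Run on the three examples of Fig. 2, this sketch reproduces the outcomes 3.8, 2.4 and no prediction, respectively.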

Algorithm 2 Recommender System - Content-Boosted CF with GB & SVD

1:  R_normalized ← Normalize(R)
2:  R ← CollaborativeFiltering(R_normalized)
3:  R ← ContentBased(R)
4:  for all (i, j) ∈ R do
5:      if R_ij ∈ ∅ then
6:          R_ij ← GlobalBaseline(R_ij)
7:      else
8:          R_ij ← (GlobalBaseline(R_ij) + R_ij) / 2
9:  R ← SingularValueDecomposition(R)
10: return R

Fig. 2: Here, three examples are demonstrated where only two cuisines under the same cuisine group with the same protein (here defined as clusters C_{1,1,1} and C_{1,2,1}) are considered. In the first case (a), an IoI belongs to the left cluster where the rating variance is low (Var{ratings(C_{1,1,1})} = 0.2) and there are enough rated items in that cluster (size(C_{1,1,1}) = 5), so an average is taken in predicting the rating of the IoI (predictedRating = 3.8). In (b), an IoI belongs to the left cluster, but the rating variance in that cluster is too high (Var{ratings(C_{1,1,1})} = 2.8), so an average is taken in the right neighborhood cluster instead as the predicted rating (predictedRating = 2.4), as the rating variance in the neighborhood cluster is low (Var{C_{1,2,1}} = 0.3). In (c), the rating variance in both clusters is too high (Var{ratings(C_{1,1,1})} = 2.8 and Var{ratings(C_{1,2,1})} = 2.3). Therefore, no predicted rating should be returned (predictedRating = ∅).


Fig. 3: This plot shows how the error measure typically decreases when using SGD for MF with respect to the objective function (8), which is the error measure.

Algorithm 2 shows a procedure for predicting missing entries in the rating matrix using CBCF with GB and SVD. In line 1, the rating matrix is normalized and stored in a separate object R_normalized. In line 2, some, if not all, missing entries are filled using CF. In line 3, some, if not all, of the remaining missing entries are filled using the CB approach specified in Algorithm 1. In lines 4-8, the remaining missing entries are predicted using GB, and the entries that are not missing are averaged with the GB estimate. Finally, in line 9, SVD is employed on R.
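Lines 4-9 of Algorithm 2 can be sketched as follows (a NumPy illustration, assuming CF and CB have already been applied and NaN marks the still-missing ratings; the rank k and function name are illustrative):

```python
import numpy as np

def gb_fill_and_svd(R, k=2):
    """Lines 4-9 of Algorithm 2: missing entries get the GB estimate (3),
    known entries are averaged with it, and a rank-k SVD then smooths the
    now-complete matrix. Rows of R are items, columns are users."""
    mu = np.nanmean(R)
    b_i = np.nanmean(R, axis=1) - mu            # item biases
    b_u = np.nanmean(R, axis=0) - mu            # user biases
    GB = mu + b_i[:, None] + b_u[None, :]       # baseline estimate per (3)
    filled = np.where(np.isnan(R), GB, (GB + R) / 2)
    U, s, Vt = np.linalg.svd(filled, full_matrices=False)
    return U[:, :k] * s[:k] @ Vt[:k, :]         # rank-k reconstruction (6)
```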

Algorithm 3 is similar to Algorithm 2, but after the CBCF (lines 2-3), MF is applied on R in line 4 and the result is stored in R̃. The difference here is that MatrixFactorization(R) does not require that all entries of R are available. Therefore, entries that could not be estimated through CBCF do not have to be estimated using GB.

Algorithm 3 Recommender System - Content-Boosted CF with GB & MF

1:  R_normalized ← Normalize(R)
2:  R ← CollaborativeFiltering(R_normalized)
3:  R ← ContentBased(R)
4:  R̃ ← MatrixFactorization(R)
5:  for all (i, j) ∈ R do
6:      if R_ij ∈ ∅ then
7:          R_ij ← R̃_ij
8:      else
9:          R_ij ← (GlobalBaseline(R_ij) + R̃_ij) / 2
10: return R

How to choose the thresholds in lines 2, 8 and 11 of Algorithm 1 is a trade-off between stricter restrictions and more predictions. Stricter thresholds consist of requiring more rated recipes (size(C_{i,j,k})) and a lower rating variance (Var{ratings(C_{i,j,k})}) in a cluster before letting that cluster take part in predicting the rating for the IoI. Stricter requirements will likely lead to more accurate and reasonable predictions, but will compromise the number of different items that can be predicted.

In order to dismiss the suggested CB approach, line 3 in both Algorithms 2 and 3 can be omitted.

V. EXPERIMENTAL EVALUATION

A. Data Set

User ratings for different recipes are fetched from Allrecipes [14], where each rating is a real value between 1 and 5 (five-star rating system). Ratings from 11,603 users with 358 unique recipes are gathered, where most users have rated five items in the catalog (see Fig. 4). This sparse data set (DS), or catalog, constitutes the ground truth for the rating matrix, and only 17,116 entries in the DS are available out of 4,153,874 entries (358 × 11,603). Therefore, only 0.41% (or 0.0041) of the ratings in the DS are available. From Fig. 5 it can be seen that a five-star rating is the most frequent rating in the DS (over 11,000 out of 17,116). The least common given rating is a one-star rating. Fig. 5 clearly shows that the catalog has a bias towards higher ratings, with an average rating, µ, of approximately 4.5 and a median rating of 5.

Using 10-fold cross-validation, ten non-overlapping test sets are constructed where, in each case, 10% of the available data in the DS is used as a test set (1,712 entries) and the remaining 90% as a TS (15,404 entries).
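The fold construction can be sketched as follows (assuming the observed (item, user) rating entries are given as a list; the seed and function name are arbitrary choices):

```python
import random

def ten_fold_splits(observed, seed=0):
    """Split the observed rating entries into ten non-overlapping folds;
    each fold serves once as the 10% test set, with the remaining 90%
    forming the training set."""
    entries = list(observed)
    random.Random(seed).shuffle(entries)
    folds = [entries[i::10] for i in range(10)]
    for k in range(10):
        test = folds[k]
        train = [e for j, fold in enumerate(folds) if j != k for e in fold]
        yield train, test
```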

Fig. 4: This figure shows how the ratings per user are distributed in the DS. Most users (6,000 out of 11,603) have rated five recipes, while there are users in the catalog that have rated over 120 different recipes.

B. Evaluation Methodology

In order to quantify the performance of the algorithms, the root-mean-square error (RMSE) is estimated (9).


Fig. 5: This figure shows how the ratings in the DS are distributed. The rating frequency seems to grow exponentially with higher ratings, with a one-star rating being the least frequently and a five-star rating the most frequently given rating.

RMSE = √( (1/N) Σ_{(i,j)∈S} (R_ij − R̃_ij)² )    (9)

Another way to quantify the performance is to estimate the accuracy (one minus the error rate), where both the predicted and actual ratings are rounded to the nearest integer (10).

ACC = (1/N) Σ_{(i,j)∈S} I(round(R_ij) = round(R̃_ij))    (10)

I(x = y) is an indicator variable that equals 1 when x equals y and 0 otherwise.
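Both metrics can be sketched directly from (9) and (10); note that Python's built-in round uses banker's rounding, so for example 4.5 rounds to 4 (the function names are illustrative):

```python
from math import sqrt

def rmse(actual, predicted):
    """Root-mean-square error per (9) over the predicted test entries."""
    n = len(actual)
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

def acc(actual, predicted):
    """Accuracy per (10): fraction of predictions matching the actual
    ratings after rounding both to the nearest integer."""
    n = len(actual)
    return sum(round(a) == round(p) for a, p in zip(actual, predicted)) / n
```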

C. Statistical Evaluation

The standard error (11) is evaluated for quantifying the perturbations around the obtained mean RMSE and ACC values, i.e. given an average quantity mean(x) from a set of values x (with standard deviation σ and sample size N), the perturbations are given by mean(x) ± 2 · SE(x).

SE(x) = σ / √N    (11)
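A minimal sketch of the ±2·SE band (11) around a sample mean, using the sample standard deviation (the function name is illustrative):

```python
from math import sqrt
from statistics import mean, stdev

def se_interval(xs):
    """Return (mean - 2*SE, mean, mean + 2*SE) per (11), SE = sigma/sqrt(N)."""
    m = mean(xs)
    se = stdev(xs) / sqrt(len(xs))  # sample standard deviation over sqrt(N)
    return m - 2 * se, m, m + 2 * se
```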

In order to determine whether one sample mean differs significantly from another, the t-statistic is evaluated and in turn the p-value. The significance level is here chosen as 5%. Also, the one-way ANOVA test is evaluated for multiple different sample means.

VI. RESULTS

A. Algorithm 1 - CB Method

The proposed CB approach alone (Algorithm 1) manages to predict, on average, the ratings of 145.5 out of 1,712 test entries (8.50% of the test entries). This gives an average RMSE (9) of approximately 1.05 and an accuracy (10) of 0.53 (accounting for only the predicted entries), as outlined in Table I. Using slightly stricter thresholds (T_oS = 1 and T_S = 1 with a threshold rating variance of 1.7) gives on average 12.7 predictions (0.74% of the test entries). Increasing the thresholds further gives an average of 2.7 predicted entries, which heavily compromises the prediction rate. The RMSE and ACC values here are NaN due to the CB approach not being able to predict any ratings in some test sets. As can be noted in Table I, having stricter threshold requirements on the rating variance does not affect the results. Because the prediction rate is sensitive with respect to the chosen thresholds (cluster sizes), using stricter thresholds results in few rating predictions and, subsequently, unreliable RMSE and ACC values. Therefore, the first setting in Table I is chosen as a reference for further analysis.

TABLE I: The suggested CB method (Algorithm 1) evaluated using 10-fold cross-validation. ToS/TS and ToV/TV are the cluster size and variance thresholds in Algorithm 1. The average number of predictions, RMSE and ACC are outlined with standard errors for the different thresholds.

ToS/TS  ToV/TV  entries predicted  RMSE         ACC
0       1.7     145.5 ± 2.8        1.05 ± 0.03  0.53 ± 0.01
1       1.7     12.7 ± 1.1         0.88 ± 0.07  0.50 ± 0.04
2       1.7     2.7 ± 0.69         NaN          NaN
0       1.0     145.1 ± 2.9        1.05 ± 0.03  0.53 ± 0.01
1       1.0     11.9 ± 0.8         0.88 ± 0.07  0.51 ± 0.04
2       1.0     1.9 ± 0.4          NaN          NaN
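The thresholded cluster prediction in Algorithm 1 can be sketched as follows; the function and variable names, and the exact acceptance rule, are assumptions for illustration rather than the thesis pseudocode:

```python
import numpy as np

def cb_predict(user_ratings, cluster_of, T_S=0, T_V=1.7):
    """Sketch of the cluster-based prediction step: predict a rating for
    an unrated recipe from the mean of the user's ratings in the recipe's
    cluster, but only when the cluster holds more than T_S rated items
    and their rating variance is below T_V."""
    clusters = {}
    for recipe, rating in user_ratings.items():
        clusters.setdefault(cluster_of[recipe], []).append(rating)
    predictions = {}
    for recipe, cluster in cluster_of.items():
        if recipe in user_ratings:
            continue  # already rated, nothing to predict
        rated = clusters.get(cluster, [])
        if len(rated) > T_S and np.var(rated) < T_V:
            predictions[recipe] = float(np.mean(rated))
    return predictions

# Hypothetical toy data: two recipes share the "italian/beef" cluster
cluster_of = {"lasagne": "italian/beef", "ragu": "italian/beef", "pho": "vietnamese/beef"}
print(cb_predict({"lasagne": 5}, cluster_of))  # {'ragu': 5.0} with T_S = 0
```

With T_S = 0 a single rated item in a cluster suffices, which mirrors the first setting in Table I; raising T_S quickly removes clusters from consideration.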

Applying CF before CB in order to reduce the sparsity increases the prediction rate (as outlined in Table II). With CF, ratings for 486.1 test entries can be predicted on average, with an average RMSE of approximately 1.00 and an ACC of approximately 0.52. Applying CB after CF produces predicted ratings for 540.5 missing entries on average, with an RMSE of 1.00 and an ACC of 0.52 over the entries that CBCF can predict (540.5 out of 1712 on average). Even though the prediction rate increases when the CB method is applied after CF, there is no difference between the approaches with respect to their RMSE and ACC values (as can be seen in Table II, where both CF and CBCF have approximately an average RMSE of 1.00 and an ACC of 0.52). Applying GB to predict all entries and comparing the results obtained with and without the CB method, no significant difference is found (the t-statistic is approximately 0.01 and the p-value is estimated to be 0.50).
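The GB (global baseline) estimate used as a fallback for entries the other methods cannot predict can be sketched in its usual textbook form; the thesis' exact GB variant may differ slightly:

```python
import numpy as np

def global_baseline(R):
    """Global-baseline estimate for every entry of a rating matrix R
    (np.nan marks unrated entries): overall mean plus a user bias
    plus an item bias."""
    mu = np.nanmean(R)                   # overall mean rating
    b_user = np.nanmean(R, axis=1) - mu  # per-user deviation from the mean
    b_item = np.nanmean(R, axis=0) - mu  # per-item deviation from the mean
    return mu + b_user[:, None] + b_item[None, :]

# Toy 2x2 matrix with one missing rating
R = np.array([[5.0, np.nan], [4.0, 3.0]])
print(global_baseline(R))
```

In practice the estimate is often clipped to the rating scale, since the sum of biases can overshoot the maximum rating.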

B. Algorithm 2 - SVD

Applying SVD on the rating matrix in order to take the regional effects into account requires that all entries of the rating matrix are available. Therefore, the results from performing CBCF & GB are incorporated with SVD (see Algorithm 2) with different numbers of latent factors. At best, SVD gives an average RMSE of 0.89 and an ACC of 0.54. Evaluating the t-statistic and p-value shows that the difference between SVD and CBCF & GB is not significant. However, as can be seen in Table III, the confidence interval for SVD with 350 latent factors is narrower than that of CBCF & GB.

TABLE II: The suggested CB method (Algorithm 1) evaluated together with CF using 10-fold cross-validation. The average number of predictions, RMSE and ACC are outlined with standard errors.

approach   entries predicted  RMSE         ACC
CB         145.5 ± 2.8        1.05 ± 0.03  0.53 ± 0.01
CF         486.1 ± 5.5        1.00 ± 0.02  0.52 ± 0.01
CBCF       540.5 ± 6.3        1.00 ± 0.02  0.52 ± 0.01
CF & GB    1712 (all)         0.89 ± 0.02  0.54 ± 0.00
CBCF & GB  1712 (all)         0.89 ± 0.02  0.54 ± 0.00
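Once every entry has been filled by CBCF & GB, keeping only the first k latent factors of the SVD amounts to a rank-k reconstruction. A minimal sketch of this step, not the thesis implementation:

```python
import numpy as np

def truncated_svd_estimate(R_filled, k):
    """Rank-k SVD reconstruction of a fully filled rating matrix:
    decompose, keep the k largest singular values, and recompose."""
    U, s, Vt = np.linalg.svd(R_filled, full_matrices=False)
    return U[:, :k] * s[:k] @ Vt[:k, :]  # keep k latent factors

# Toy 3x3 matrix; small k smooths the matrix, k = rank recovers it exactly
R = np.array([[5.0, 4.0, 4.0], [4.0, 5.0, 3.0], [5.0, 5.0, 4.0]])
approx = truncated_svd_estimate(R, k=2)
```

Because singular values are returned in descending order, slicing the first k columns of U (and rows of Vt) keeps the strongest latent factors, mirroring the role of k in Table III.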

TABLE III: Average RMSE and ACC values and their standard errors for different numbers of latent factors, k.

approach   entries predicted  k (SVD)  RMSE         ACC
CBCF & GB  1712 (all)         /        0.89 ± 0.02  0.54 ± 0.00
Alg. 2     1712 (all)         50       2.59 ± 0.01  0.32 ± 0.01
Alg. 2     1712 (all)         150      1.59 ± 0.02  0.46 ± 0.00
Alg. 2     1712 (all)         250      1.14 ± 0.02  0.52 ± 0.00
Alg. 2     1712 (all)         350      0.89 ± 0.01  0.54 ± 0.00

C. Algorithm 3 - MF

Table IV outlines the performances when MF is employed under different circumstances. Fig. 5 illustrates how the error measure typically decreases when using SGD to determine the decomposed matrices. The first two approaches outlined in Table IV (CF & MF and CF & MF & GB) show the average RMSE and ACC performance without the CB method. The first approach uses CF and MF, which gives an RMSE of 0.98 and an ACC of 0.50. Introducing GB, the second approach manages to decrease the RMSE substantially, by approximately 10%, and increase the ACC by approximately 4%. The last two approaches in Table IV use the CB method.
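The SGD-based factorization referred to above can, in generic form, be sketched as follows; the learning rate, regularization, initialization and epoch count are assumptions, not the thesis settings:

```python
import numpy as np

def mf_sgd(R, k=2, lr=0.02, reg=0.02, epochs=1000, seed=0):
    """Matrix factorization R ~ P @ Q.T fitted with SGD on the observed
    entries only (np.nan marks missing ratings)."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    P = 0.1 * rng.standard_normal((n_users, k))  # user latent factors
    Q = 0.1 * rng.standard_normal((n_items, k))  # item latent factors
    rows, cols = np.where(~np.isnan(R))
    for _ in range(epochs):
        for i, j in zip(rows, cols):
            err = R[i, j] - P[i] @ Q[j]    # residual on one observed rating
            p_old = P[i].copy()            # use the pre-update value for Q's step
            P[i] += lr * (err * Q[j] - reg * P[i])
            Q[j] += lr * (err * p_old - reg * Q[j])
    return P @ Q.T

R = np.array([[5.0, 4.0, np.nan], [4.0, np.nan, 3.0], [np.nan, 4.0, 4.0]])
R_hat = mf_sgd(R)  # every entry, including the missing ones, is now estimated
```

The per-entry error typically shrinks epoch by epoch, which is the decreasing curve described for Fig. 5.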

The third one, without GB, gives an average RMSE of 0.94 and an ACC of 0.51. This approach gives an RMSE that is approximately 4% lower and an ACC that is approximately 2% higher than the corresponding approach without the CB method (the first approach). The difference in RMSE between approaches 1 and 3 is significant, as the t-statistic is approximately 3.78 with a p-value of 0.00068.

Introducing GB when the CB method is applied, as shown in approach 4, decreases the RMSE substantially to 0.87 and increases the ACC to 0.53. The RMSE of this approach is slightly lower (1%) and the ACC slightly higher (2%) than that of its corresponding approach (approach 2). The confidence interval for the ACC is slightly narrower but, when comparing the two approaches (approach 2 and 4), the t-statistic for the RMSE values is approximately 1.07 with a p-value of 0.15, which indicates no significant difference between the mean RMSE values. Evaluating the t-statistic for the ACC values of approach 2 and 4 gives approximately 1.66 with a p-value of 0.06, which indicates that there is no significant difference between approach 2 and 4 with respect to the ACC either.

TABLE IV: Different approaches involving MF and their RMSE and ACC values together with their standard errors. The second approach is Algorithm 3 without the suggested CB approach, and the fourth approach is Algorithm 3.

approach        entries predicted  k   RMSE         ACC
CF & MF         1712 (all)         50  0.98 ± 0.01  0.50 ± 0.00
CF & MF & GB    1712 (all)         50  0.88 ± 0.01  0.52 ± 0.01
CBCF & MF       1712 (all)         50  0.94 ± 0.01  0.51 ± 0.00
CBCF & MF & GB  1712 (all)         50  0.87 ± 0.01  0.53 ± 0.00

D. Outline of Results

The approach that gives the lowest RMSE without the CB method is Algorithm 3 with line 3 omitted, with an RMSE of approximately 0.88. The approach that gives the lowest RMSE with the CB method is Algorithm 3 with line 3 included, with an RMSE of approximately 0.87.

These results show a slight improvement of 1% when CB is employed. However, evaluating the p-value shows that there is no significant difference between these methods.

The approach that gives the highest ACC without the CB method is CF & GB with an ACC of approximately 0.54. The approach that gives the highest ACC with the CB method is Algorithm 2 with an ACC that is also approximately 0.54.

TABLE V: The results for the best approaches with respect to RMSE and ACC.

approach        RMSE         ACC
GB              0.88 ± 0.01  0.55 ± 0.00
µ               0.88 ± 0.01  0.55 ± 0.00
CF & GB         0.89 ± 0.02  0.54 ± 0.00
Alg. 2          0.89 ± 0.01  0.54 ± 0.00
Alg. 3 (no CB)  0.88 ± 0.01  0.52 ± 0.01
Alg. 3          0.87 ± 0.01  0.53 ± 0.01

As can be seen in Table V, only using GB gives an RMSE of approximately 0.88 and an ACC of 0.55. Using the average rating in the TS (µ = 4.5 for all TS in the 10-fold cross-validation sets) as the predicted rating for all test entries also gives an RMSE of approximately 0.88 and an ACC of 0.55. Therefore, the most successful approach with respect to RMSE is Algorithm 3, and with respect to ACC it is GB (see Fig. 6). However, evaluating the five best approaches outlined in Table V using a one-way ANOVA test gives a p-value of approximately 0.54 for the RMSE values and 5.2 × 10⁻⁶ for the ACC values. This suggests that no single approach in Table V is significantly better or worse than the others with respect to the RMSE values, but that at least one approach is significantly better or worse with respect to the ACC values. Comparing the ACC samples for Algorithm 2 and 3 by evaluating the t-statistic shows that Algorithm 2 is significantly better than Algorithm 3, with a p-value of approximately 0.043.

Fig. 6: The most successful approach (Algorithm 3) in terms of RMSE and the most successful approach (GB and µ) in terms of ACC, together with Algorithm 2 and Algorithm 3 with the CB method omitted, shown with standard errors. Panel (a) plots the RMSE values (approximately 0.887, 0.881, 0.879 and 0.872) and panel (b) the ACC values (approximately 0.540, 0.518, 0.547 and 0.528) for Alg. 2, Alg. 3 (no CB), GB/µ and Alg. 3, respectively.

VII. DISCUSSION

The goal of this paper is to investigate if clustering food recipes according to their cuisine and protein can improve predictions compared to when the state-of-the-art methods (CF, MF and GB) are employed, as a hierarchical structure based on just a few key descriptors is relatively computationally cheap to process. Its structure is also very intuitive and interpretable.

The best approach in terms of RMSE is given by Algorithm 3 (0.87), and the best approach in terms of ACC is given by GB (0.55). Considering the approach with the lowest RMSE, Algorithm 3 is slightly better when the suggested CB method is integrated than when it is not; however, this effect is not statistically significant. The suggested CB approach therefore cannot improve the predictions when integrated with the state-of-the-art methods (neither in terms of RMSE nor ACC).

It can be hypothesized that the reason the CB method does not improve the predictions is the heavily sparse DS; indeed, the method is heavily compromised with respect to how many rating predictions it can give. The CB method on its own only manages to predict, on average, 145.5 out of 1712 entries (8.5% of the missing entries). Even when CF is applied to the rating matrix to reduce the sparsity before the CB method, only 540.5 out of 1712 entries can be predicted on average (31.6% of the missing entries). Setting the threshold for the number of ratings in a cluster to 0 (meaning that a single rated item in a cluster is enough to produce a predicted rating) gives, unsurprisingly, the most predictions. Using stricter thresholds on the number of rated items in a cluster causes the number of predictions to degrade quickly. The number of predicted ratings that the CB method can produce is thus very sensitive with respect to the threshold values ToS and TS (see Table I). The CB method has two degrees of freedom, required cluster size and rating variance, and can be sensitive when the ratings in the clusters are too spread out (high rating variance) or when the rating matrix is too sparse (few ratings in clusters).

Since picking stricter threshold values for the rating variance affects neither the results nor the number of entries that can be predicted to any significant degree, as outlined in Table I, one can conclude that the relatively small prediction rate when only applying CB is caused by the sparsity, as only the cluster size thresholds seem to have an impact. This becomes even clearer when observing Fig. 5: most of the ratings are concentrated towards four- and five-star ratings (with µ = 4.5 and a median rating of 5), suggesting that the rating variance should not be high within or even between clusters. If the ratings were more spread out, the number of predicted entries should also degrade when picking stricter thresholds for the rating variance in clusters. This is a major weak point for a hierarchical clustering that is based solely on which cluster food recipes belong to and what the ratings are, when users have rated few recipes belonging to different clusters (the most common number of rated recipes in the DS is five, as shown in Fig. 4). The suggested CB approach tries to mitigate this weak point to some extent by also looking at neighboring clusters, but it is clearly not adequate in this case, and in reality sparsity cannot be avoided. This shows that the state-of-the-art CB methods that look into the features clearly have an advantage when it comes to the number of produced predictions [6, 9, 11, 12, 13]. Extracting and comparing low-level features of items can show similarities between items that are seemingly different with respect to cuisine and protein, but similar with respect to other ingredients. Even the social context of what the peers of a user are interested in can provide recommendations [12], and using just the cuisine and protein misses out on predictions that these CB methods can potentially provide.

Recommendation platforms for food recipes should therefore not omit the low-level features, or even the social context, when designing recommender systems. The take-away lesson is to use all information that is available: the more that is used, the more robust and accurate the predictions can become.

One interesting finding is that Algorithm 2 seems to be significantly better than Algorithm 3 with respect to the ACC values. This can be due to the data itself; most of the ratings are five-star ratings (see Fig. 5), with µ = 4.5 and a median rating of 5. If most of the entries in the DS are 5, then using the average rating 4.5 as a predictor all of the time will produce more hits than misses when evaluating the ACC (I(round(4.5) = round(5)) = 1), which is why GB and µ have ACC values larger than 0.50. Because of the sparsity and rating bias in the DS, the GB estimate effectively becomes the same as using µ as a predictor (µ ≈ µ_i ≈ µ_u ⇒ GB ≈ µ), which is why GB and µ have the same RMSE and ACC values (see Table V and Fig. 6). Because, in Algorithm 2, all entries need to be predicted before employing SVD, the GB estimate is used for the entries that CBCF fails to predict (see lines 5-6 in Algorithm 2). As shown in Table II, CBCF cannot predict even one third of the total number of entries that need to be predicted, so most of the entries in Algorithm 2 are predicted using only the GB estimate before applying SVD. Using ACC instead of RMSE as an evaluation methodology can therefore be criticized, as some information is lost when evaluating ACC rather than RMSE; ACC uses a binary indicator variable (hit or miss) while RMSE takes all deviations into account. Also, according to the ACC values, using the same predicted rating, µ, for all entries produces the best results, but this defeats the purpose of using a recommender system to give personalized recommendations when all predicted ratings are the same.
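The effect of a constant predictor on ACC under a five-star-skewed rating distribution can be illustrated directly; the ratings below are hypothetical, and half-up rounding is assumed (note that Python's built-in round() would round 4.5 down to 4, since it rounds halves to even):

```python
import numpy as np

def acc_constant_predictor(actual, mu):
    """ACC of predicting the same value mu for every entry,
    using half-up rounding so that round(4.5) = 5."""
    half_up = lambda x: np.floor(np.asarray(x, dtype=float) + 0.5)
    return (half_up(actual) == half_up(mu)).mean()

# Hypothetical ratings skewed towards 5, as in Fig. 5
ratings = [5, 5, 5, 5, 4, 4, 3, 5, 5, 4]
print(acc_constant_predictor(ratings, 4.5))  # 0.6: every 5-star rating is a "hit"
```

The constant predictor scores above 0.5 whenever more than half of the ratings round to 5, which is exactly the rating-bias mechanism described above.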

Therefore, RMSE is seemingly a more appropriate choice for evaluating the different approaches, as are evaluation methods that only consider items that users find interesting, such as the one used in [5]. Many recommender algorithms, such as FISM [5], try to find a set of items that a user might be most interested in, and are trained to find those instead of trying to predict all ratings. It can therefore be argued that ranking items, and only considering whether a machine learning algorithm can produce recommendations users will likely prefer, is a more relevant way of training the algorithms. RMSE can show poor performance while an evaluation that counts only the number of relevant recommendations shows good performance.

As mentioned, the biggest limiting factor of the suggested CB method is sparsity. Another problem with the CB method is that the choice of clusters can be arbitrary. For instance, it can be argued that Swedish cuisine should be part of the European cuisine, and therefore clustered together with the German cuisine. It can also be argued that Swedish cuisine should be part of the Scandinavian cuisine and therefore not part of the European cuisine. How the cuisines are structured will affect the results, and there is no ground truth suggesting that Swedish cuisine should be part of the Scandinavian cuisine but not the European one, or vice versa. A hierarchical clustering based on subjective measures of similarity introduces variability, in that one tree structure (such as the one in Fig. 1) can be very different from another and perform differently on the same data set.

Using a low-level feature based CB method eliminates this variability. Furthermore, this variability and heavy dependency on data structure (sparsity and rating variance) can lead to the hypothesis becoming an ad-hoc hypothesis; if the suggested CB method does not produce statistically significant results, then it can be argued that this is due to the fact that the clusters are not structured appropriately or that the data is too sparse.

On an ethical note, this CB method is less intrusive as it gathers less data about user activity and only gathers data that users explicitly give (ratings). One can argue that building a social navigation system [9] or keeping track of the number of times a user has selected an item [8] is a dangerous leap towards a future where there is a lot of private data and information gathered on people that can be used to manipulate larger populations through, for example, their food habits.

There is a trade-off between gathering more data for better recommendations and respecting the privacy of users. Open discussions between companies employing machine learning algorithms and their customers are necessary in order to know where to draw the line. What kind of recommendations to give is also important to discuss for healthy, sustainable consumption among users, especially in the context of food recipes; recipe recommendations can be based solely on what users like to consume, with the health aspect completely omitted.

More research on recipe recommender systems should take the health aspect into account, as in [9].

VIII. CONCLUSION

The suggested CB approach manages to perform slightly better when integrated with the state-of-the-art methods, but the difference is not statistically significant, which could be due to an insufficient amount of data. The CB method suffers from sparsity in that the number of predictions it can give becomes compromised, so it is hypothesized that the true potential of the algorithm will not be discovered until it is applied to less sparse data. However, in reality, rating matrices are usually sparse, so the suggested CB method should be integrated with other feature-based algorithms, even though the purpose of this paper is to investigate a method that is not computationally complex.

A. Future Work

To further investigate how well the CB approach works, it could be applied to a less sparse dataset. Also, more complex versions of the hierarchical clustering, as demonstrated in Fig. 7, can be evaluated. For example, another layer of classification besides cuisine and protein can be added, such as the dish type: for instance, whether a recipe is served as a pie, with spaghetti, or as a soup.

ACKNOWLEDGMENT

The author would like to thank Pawel Herman for his supervision during this project. The author would also like to thank Theresia Silander Hagstrom (CEO at Eatit), Oytun Yildirimdemir (CEO at Eatit) and Baran Topal (advisor at Eatit) for their support and help with the logistical aspects of this project.


[Fig. 7: a hierarchical tree with the levels Cuisine, Specific Cuisine, Protein, Specific Protein and Dish, e.g. Mediterranean → Italian → Red Meat → Beef → Pasta, with Sandwich and Plate as other dish examples.]

Fig. 7: A suggested hierarchical tree structure for a more complex high-level representation of food/recipes compared to Fig. 1. Recipes are first clustered into cuisines, then main protein and finally dish. Italian pasta with beef would here fall under Mediterranean ⇒ Italian ⇒ beef ⇒ pasta.
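The extended hierarchy in Fig. 7 could be represented as a fixed-order path of labels per recipe; the field names and the example recipe below are illustrative, not from the thesis:

```python
def cluster_path(recipe):
    """Return the recipe's high-level representation as the ordered tuple
    (cuisine group, specific cuisine, protein group, specific protein, dish)."""
    return (recipe["cuisine_group"], recipe["cuisine"],
            recipe["protein_group"], recipe["protein"], recipe["dish"])

# Hypothetical recipe matching the example in the Fig. 7 caption
recipe = {"cuisine_group": "Mediterranean", "cuisine": "Italian",
          "protein_group": "Red Meat", "protein": "Beef", "dish": "Pasta"}
print(" => ".join(cluster_path(recipe)))
# Mediterranean => Italian => Red Meat => Beef => Pasta
```

Two recipes could then be clustered together at depth d whenever the first d elements of their paths agree, which is how an extra layer such as dish refines the clustering.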

REFERENCES

[1] Sean Coughlan. Why do we love and hate different tastes? BBC Business, 10 November 2016. http://www.bbc.com/news/business-37800097

[2] Xiangyu Zhao, Zhendong Niu and Wei Chen. Opinion-Based Collaborative Filtering to Solve Popularity Bias in Recommender Systems. Lecture Notes in Computer Science, vol. 8056, pp. 426-433. Database and Expert Systems Applications, Springer, Heidelberg, 2013.

[3] Prem Melville, Raymond J. Mooney and Ramadass Nagarajan. Content-Boosted Collaborative Filtering. Proceedings of the SIGIR-2001 Workshop on Recommender Systems, New Orleans, LA, September 2001.

[4] Luis M. de Campos, Juan M. Fernández-Luna, Juan F. Huete and Miguel A. Rueda-Morales. Combining content-based and collaborative recommendations: A hybrid approach based on Bayesian networks. International Journal of Approximate Reasoning 51 (2010) 785-799.

[5] Santosh Kabbur, Xia Ning and George Karypis. FISM: Factored Item Similarity Models for Top-N Recommender Systems. 19th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 2013.

[6] Jill Freyne and Shlomo Berkovsky. Recommending Food: Reasoning on Recipes and Ingredients. CSIRO, Tasmanian ICT Center, GPO Box 1538, Hobart, 7001, Australia.

[7] Yunhong Zhou, Dennis Wilkinson, Robert Schreiber and Rong Pan. Large-scale Parallel Collaborative Filtering for the Netflix Prize. HP Labs, 1501 Page Mill Rd, Palo Alto, CA, 94304.

[8] Maryam Khanian Najafabadi, Mohd Naz'ri Mahrin, Suriayati Chuprat and Haslina Md Sarkan. Improving the accuracy of collaborative filtering recommendations using clustering and association rules mining on implicit data. Computers in Human Behavior 67 (2017) 113-128.

[9] Youri van Pinxteren, Gijs Geleijnse and Paul Kamsteeg. Deriving a recipe similarity measure for recommending healthful meals. In Proceedings of the 16th International Conference on Intelligent User Interfaces. ACM, 105-114, 2011.

[10] Zhengxian Li, Jinlong Hu, Jiazhao Shen and Yong Xu. A scalable recipe recommendation system for mobile application. 2016 3rd International Conference on Information Science and Control Engineering.

[11] Anirudh Jagithyala. Recommending recipes based on ingredients and user reviews. B.Tech, Jawaharlal Nehru Technology University (JNTU), India, 2010.

[12] Martin Svensson, Kristina Höök and Rickard Cöster. Designing and evaluating Kalas: A social navigation system for food recipes. ACM Transactions on Computer-Human Interaction (TOCHI) 12, 3 (2005), 374-400.

[13] Shunsuke Hanai, Hidetsugu Nanba and Akiyo Nadamoto. Clustering for Closely Similar Recipes to Extract Spam Recipes in User-generated Recipe Sites. iiWAS '15 Proceedings of the 17th International Conference on Information Integration and Web-based Applications & Services, Article No. 31, 2015.

[14] Allrecipes. http://www.allrecipes.com
