DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, FIRST CYCLE, 300 CREDITS

STOCKHOLM, SWEDEN 2016

A Random Indexing Approach to User Preference Prediction

ISAK STENSÖ AND ANDREAS ROSENBACK

KTH ROYAL INSTITUTE OF TECHNOLOGY


A Random Indexing Approach to User Preference Prediction

ISAK STENSÖ AND ANDREAS ROSENBACK

Degree Project in Computer Science, DD143X

Supervisor: Kevin Smith

Examiner: Örjan Ekeberg

CSC, KTH, 2016-05-04


Abstract

Predicting user preferences is a common problem for many companies and services. With the growth of Internet services, it becomes both more important and more lucrative to be able to predict what products a user would like and then recommend these to them. There are many established ways of attempting this; this study instead attempts to use random indexing to solve the problem.

Random indexing is a method that has been used successfully when studying the similarity between words, and allows entities to be represented as vectors of relatively small dimensionality. This would allow for fast and memory-efficient implementations of prediction systems.

This study uses the dataset Amazon Fine Food Reviews, which contains reviews of products with a rating. The study attempts to predict these ratings, and the results of random indexing are compared to the results of collaborative filtering on the same dataset. Various parameters used in the random indexing method are also varied, to study their effect on the results. The methods are evaluated based on root mean square error and mean absolute error.

The results indicate that random indexing does not generate as good results as collaborative filtering. However, the difference is small enough to warrant further study into the other strengths of random indexing, such as speed and memory efficiency. It is theorized that the sparsity of the dataset might have caused the differences in errors between the methods, and with a dense dataset the results might be better.


Sammanfattning

Predicting user preferences is a common problem for many companies and services. With ever more and ever larger Internet services, it has become both more important and more lucrative to be able to predict what a user would like and then recommend it to them. There are many ways to accomplish this, but this study attempts to use random indexing to solve the problem. Random indexing is a method that has mainly been used to examine the similarity between words, and it lets different entities be represented as vectors of relatively low dimensionality. This opens the way for fast and memory-efficient implementations of systems that aim to predict user preferences.

This study uses Amazon Fine Food Reviews, a dataset containing reviews of products, each with a rating. The study attempts to predict this rating, and the results of the random indexing method are compared with the results obtained using collaborative filtering. Various parameters used with random indexing are also varied in order to examine their effect on the results. The methods are evaluated based on root mean square error and mean absolute error.

The results indicate that random indexing does not generate as good results as collaborative filtering. The difference is, however, small enough to motivate further study of random indexing's other strengths, such as speed and memory efficiency. It is possible that the sparsity of the dataset underlies the difference in error between the two methods, and that a denser dataset could yield a better result.


Contents

1 Introduction
  1.1 Problem Statement
  1.2 Dataset
  1.3 Scope
  1.4 Disposition
2 Background
  2.1 User Behavior Prediction
    2.1.1 K-Nearest Neighbor
    2.1.2 Naive-Bayes Classifiers
    2.1.3 Regression Models
    2.1.4 Collaborative Filtering
  2.2 Vector Space Modeling
    2.2.1 Random Indexing
    2.2.2 Similarity Measures
3 Method
  3.1 Dataset
    3.1.1 Curating Data
  3.2 Random Indexing
    3.2.1 Predicting a Review
    3.2.2 Variation of Parameters
    3.2.3 Helpfulness
  3.3 Collaborative Filtering
  3.4 Programming Languages
4 Results
  4.1 Random Indexing
    4.1.1 Variation of Parameters
    4.1.2 Helpfulness
  4.2 Collaborative Filtering
5 Discussion
  5.1 Results
  5.2 Method
  5.3 Future Work
6 Conclusion
7 References


Chapter 1

Introduction

In order to satisfy their customers, many companies devote considerable resources to predicting their preferences and behavior, based on previous behavior exhibited by the customer. This is the topic of user behavior prediction, and it is applicable to everything from dating sites to movie streaming. As an example, Netflix predicts what movies a user might like by taking into account reviews they have given (Bennett & Lanning 2007). This can also be done on retail websites, such as Amazon.com. By considering what the users have bought, looked at, and what reviews they have given, other products can be proposed to the user.

Another area, one that seemingly has little to do with prediction models, is vector-space modeling. Vector-space models have long been used in many areas, especially for linguistic problems. One method that has been used with great success to calculate the similarity of words based on context is random indexing (Sahlgren 2005). It manages to abstract the context of an entity (often a word) into a single vector of very low dimensionality, after which the similarity between vectors can be calculated. While this method is common in word analysis, it has not yet been tried for preference prediction.

We want to investigate whether random indexing can be used to predict a user's preferences by abstracting products as vectors, in the same way as is usually done for words. By letting the users and their ratings become the context of the products, it should be simple to compare the product vectors to find similar products, and from these, a prediction can be given. This could potentially be faster than current implementations, as it would require far smaller vectors to calculate with. There are several potential areas where this could be useful. One example is web-based applications, where it is often important to generate results fast, as the user does not want to wait. A method that generates predictions faster could potentially be better than a slower method, even if the results are slightly worse. It is thus important to see how good the results of random indexing can be.

We use the dataset Amazon Fine Food Reviews to calculate what products a user would prefer. This dataset contains information about what reviews users have given about Amazon products.


1.1 Problem Statement

This study sets out to explore whether it is possible to accurately predict, using random indexing, what rating a user would give a product based on previous reviews the user has given. This is done using a database called Amazon Fine Food Reviews, which contains ratings of products by users. The results are then compared to those of collaborative filtering. The study further examines how varying several parameters used in random indexing affects the result.

1.2 Dataset

This paper uses the dataset Amazon Fine Food Reviews, a dataset containing reviews of food products from amazon.com. A user and a product are associated with each review, as well as a rating on a scale from 1 to 5. Every review also has a helpfulness rating: how many users found the review helpful, out of all who rated the helpfulness of the review. More information about the dataset and its other fields can be found in Section 3.1.

1.3 Scope

When calculating what products a user would most likely prefer, there are several factors that can be taken into consideration. This paper bases its prediction on which users have reviewed the target product, and then finds other similar products among those that the target user has reviewed. We thus do not take into consideration what products the user has bought or previously viewed. In fact, no data about the user, such as age or gender, will be taken into consideration, because we do not have access to that data.

This paper also ignores the text of the review. Trying to parse text and ascertain its meaning is far beyond the scope of this study. We will instead rely entirely on the rating that every review contains. However, we will also conduct further experiments to see how the helpfulness rating that every review is given can help the method.

The time complexity of our implementation is not considered, only the results it gives. The time the implementation takes to run is not relevant to the analysis of the results, and can be improved at a later date.

Random indexing has been chosen as the method since it has not been well explored in this area, while it has proven very capable in language technology. When studying the similarity between vectors, this paper uses cosine similarity, which compares the angle between the vectors. This is because it is simple to implement, and because comparing the lengths of the vectors is not meaningful, since they depend on the number of reviews given by the user or about the product.

To compare against the random indexing method, we use item-item collaborative filtering. This is a commonly used method in recommender systems, and is thus a good baseline.


1.4 Disposition

This paper is structured as follows. Background relating to user behavior prediction in general is presented in Section 2, where collaborative filtering is also described, followed by descriptions of random indexing and cosine similarity. The next section, Section 3, presents the methods used, both how the data was processed and how the prediction algorithm works. Different variations of the algorithm are presented there as well, together with how the algorithms were tested. Further, the collaborative filtering algorithm designed to compare random indexing against is also described there. Section 4 presents the results for the different variations and methods: what error rate each method had and how they compared to each other. Section 5 discusses these results with regard both to the results and to the method used, and gives recommendations for interesting future research. The paper ends with a conclusion that summarizes the study.


Chapter 2

Background

2.1 User Behavior Prediction

The traits of users of different systems can vary immensely and therefore also their preferences. In order to tailor the user experience to each individual it is important to understand the users and to use the information about them that is available. By doing this it is possible to predict what the user’s behavior might look like in the near future. Predicting that behavior is done through different predictive models.

Common methods used for prediction can be divided into statistical methods and artificial intelligence methods (El-Baz & Tzscheutschler 2015). Here we will provide a brief overview of some methods that can be used for prediction.

2.1.1 K-Nearest Neighbor

One way to predict behavior is to establish a set of predetermined categories and try to classify new observations. K-Nearest Neighbor is one of the simplest methods used for classification. One advantage of the nearest neighbor classifier is that it requires no preprocessing (Keller, Gray & Givens 1985). The idea of the K-Nearest Neighbor algorithm is to assign an observation of unknown classification to the same class as the majority of its K nearest neighbors, where K is a predefined number. The observation acts as the input of the algorithm and can vary depending on the problem instance; in pattern recognition, for example, the input would be some kind of pattern. The nearest neighbors are found by some kind of distance measure, for example the Euclidean distance, and the input then receives the classification held by the majority of its K nearest neighbors.
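To make the classification step concrete, the following is a minimal Java sketch of K-Nearest Neighbor with Euclidean distance and majority voting. All names are illustrative; this is not code from the implementation described later in this paper.

```java
import java.util.*;

public class KnnSketch {
    // Majority-vote K-Nearest Neighbor classification with Euclidean
    // distance (illustrative only).
    static String classify(double[] query, double[][] points, String[] labels, int k) {
        Integer[] order = new Integer[points.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        // Sort training points by distance to the query.
        Arrays.sort(order, Comparator.comparingDouble(i -> euclidean(query, points[i])));
        // Count the class labels among the k nearest neighbors.
        Map<String, Integer> votes = new HashMap<>();
        for (int i = 0; i < k; i++) votes.merge(labels[order[i]], 1, Integer::sum);
        return Collections.max(votes.entrySet(), Map.Entry.comparingByValue()).getKey();
    }

    static double euclidean(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) sum += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(sum);
    }
}
```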

There are two major drawbacks with K-Nearest Neighbor (Hassanat, Abbadi, Altarawneh & Alhasanat 2014). The first is that no trained model is output, which means that the algorithm needs to use all training examples on every test. The second is that its classification performance depends on the number of neighbors K, whose best value can differ between data samples.


2.1.2 Naive-Bayes Classifiers

Another approach to prediction is to use Naive-Bayes classifiers, a family of simple Bayesian classification algorithms. One example of their use is text categorization (Dai, Xue, Yang & Yu 2007). In order to classify a data instance, for example a document, a Naive-Bayes classifier requires a set of class labels and a set of already classified data instances. A data instance is represented by a feature vector, which contains the features of the data instance as variables. The Naive-Bayes classifier assumes that the features of data instances are conditionally independent given a class, meaning that every feature of a data instance is assumed to be unrelated to the others (Rish 2001). By applying Bayes' theorem to the known information about the data instances, with this independence assumption taken into consideration, a class can be assigned to the data instance.
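The following is a minimal, illustrative sketch of this classification rule for categorical features. The add-one smoothing, its binary-feature denominator, and all names are assumptions made for brevity, not taken from the cited works.

```java
import java.util.*;

public class NaiveBayesSketch {
    // Pick the class c maximizing P(c) * prod_f P(feature_f = value_f | c),
    // with probabilities estimated from counts in the training data.
    // Add-one smoothing; the denominator assumes binary features.
    static String classify(String[] instance, String[][] train, String[] labels) {
        String best = null;
        double bestLog = Double.NEGATIVE_INFINITY;
        for (String c : new HashSet<>(Arrays.asList(labels))) {
            int classCount = 0;
            for (String l : labels) if (l.equals(c)) classCount++;
            double logP = Math.log((double) classCount / labels.length); // log P(c)
            for (int f = 0; f < instance.length; f++) {
                int match = 0;
                for (int i = 0; i < train.length; i++)
                    if (labels[i].equals(c) && train[i][f].equals(instance[f])) match++;
                logP += Math.log((match + 1.0) / (classCount + 2.0));
            }
            if (logP > bestLog) { bestLog = logP; best = c; }
        }
        return best;
    }
}
```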

The problem with Naive Bayes is the independence assumption, which almost never holds true in real-life applications (Lewis 1998; Webb, Boughton & Wang 2005). Another limitation of Naive Bayes is that it can only make predictions for predetermined classes.

2.1.3 Regression Models

While Naive-Bayes classifiers and similar methods are mostly used for classification, regression models are also commonly used. Regression models calculate a dependent variable when several independent variables are known. By studying the correlation between attributes of some entity, it is possible to predict one of them if it is assumed to depend on the others. The simplest model is linear regression, which models this correlation as linear (Birkes & Dodge 1993, page 4). There are many other regression models that can predict the dependent variable in various ways. For example, if the dependent variable is assumed to be binary, binary logistic regression can be used instead (Harrell 2015, page 220).
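As a small illustration of the simplest case, the following sketch fits a one-variable linear regression by ordinary least squares; the function and variable names are illustrative.

```java
public class LeastSquaresSketch {
    // Fit y = a + b*x by ordinary least squares; returns {a, b}.
    static double[] fit(double[] x, double[] y) {
        double meanX = 0, meanY = 0;
        for (int i = 0; i < x.length; i++) { meanX += x[i]; meanY += y[i]; }
        meanX /= x.length;
        meanY /= y.length;
        double covXY = 0, varX = 0;
        for (int i = 0; i < x.length; i++) {
            covXY += (x[i] - meanX) * (y[i] - meanY);
            varX  += (x[i] - meanX) * (x[i] - meanX);
        }
        double b = covXY / varX;       // slope
        double a = meanY - b * meanX;  // intercept
        return new double[] { a, b };
    }
}
```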

2.1.4 Collaborative Filtering

Collaborative filtering is yet another common method for making predictions. In this section collaborative filtering is explored in general, and item-based collaborative filtering in greater detail.

The idea of a collaborative filtering algorithm is to recommend an item to a user, or to predict what kind of items the user might like, using data on the opinions of users with similar taste. The user that the prediction is made for will be referred to as the target user, and the item the prediction refers to as the target item. There are many different collaborative filtering algorithms, generally divided into two main categories: user-based and item-based algorithms (Sarwar, Karypis, Konstan & Riedl 2001).

User-based algorithms focus on the users. They use statistical methods to find a set of users that have a history of similar preferences to the target user, and combine these to make a prediction of what the target user might like (Sarwar, Karypis, Konstan & Riedl 2001). Item-based collaborative filtering examines the items that the target user has rated before and calculates how similar they are to the target item. It then chooses the most similar items and their corresponding similarities. The prediction is made by taking the weighted average of the ratings of these items. This is more a probabilistic than a statistical approach.

Item-based collaborative filtering needs to calculate the similarity between different items. This can be done in different ways. One way, among others, is to compare the items with a correlation-based similarity, like the Pearson correlation (Sarwar, Karypis, Konstan & Riedl 2001). Pearson correlation is a well-known correlation measure and measures the linear relationship between two variables (Herlocker, Konstan, Terveen & Riedl 2004).

The prediction of a user's rating of an item can also be made in various ways. One way for the item-based collaborative filtering algorithm to make a prediction is through a weighted sum of the ratings of the similar items (Sarwar, Karypis, Konstan & Riedl 2001). The prediction P_{u,i} can be defined as:

P_{u,i} = \frac{\sum_{n \in N} s_{i,n} \cdot R_{u,n}}{\sum_{n \in N} |s_{i,n}|},    (2.1)

where u is the target user, i is the target item, s_{i,n} is the similarity between the items i and n, R_{u,n} is the rating of the user u on the item n, and N is the set of the most similar items to i.

Two weaknesses of user-based collaborative filtering are sparsity and scalability (Sarwar, Karypis, Konstan & Riedl 2001). Sparsity refers to the fact that the datasets in many applications are very sparse, because a user might have bought or reviewed only a few of the available items from a retailer; this can lead to bad accuracy of the algorithm. Scalability refers to the computational cost of the algorithm, which grows with both the number of items and the number of users; this can become an issue as datasets grow larger. Item-based collaborative filtering, on the other hand, is often time-consuming (Xue et al. 2005).

2.2 Vector Space Modeling

One way to study the similarity between two entities is with a vector space model. The idea is to abstract the entities as high-dimensional vectors and calculate the similarity between them. There are many methods to achieve this abstraction, and it is frequently used in word space models. Word space models attempt to study the similarity between words by analyzing their contexts (Sahlgren 2006). An example of how they are often used is presented below, to facilitate the reader's understanding.

The foundation of word space models is that words with similar contexts will be semantically related (e.g. synonyms or antonyms) (Sahlgren 2005). A context vector is created for every separate word, which will be its representation in the vector space (Sahlgren 2005). It originally consists only of zeroes. This context vector can be built in different ways, depending on how the context is defined. Two common definitions are either the documents that contain the word, or a predefined number of words surrounding the word in a given text (called the context window).

If the latter method is used, the context vectors have the same length as the number of different words. Every word in the dictionary is then given an individual index, from 1 to n, where n is the number of words. When the text to be studied is parsed, every time a word is encountered its context is analyzed. The context in this case contains a predefined number of words surrounding the studied word. The context vector of the currently studied word is then modified by adding one at the index corresponding to every word in the context (Sahlgren 2005). By doing so, every word can be modeled as a vector and compared to other words (Wan, Jönsson, Wang, Li & Yang 2011).

However, these methods have some inherent problems. First and foremost, the matrix built from the context vectors of every word will be very large (Sahlgren 2005). In the case studied above, it will be n×n, where n is the number of words in the dictionary. This matrix will be large and cumbersome. Even worse, the matrix will also be very sparse: most words will not occur in each other's contexts, and thus the matrix will be filled with zeroes. Further, every time a new word is added to the dictionary, the dimensionality of all the context vectors would have to be expanded.

2.2.1 Random Indexing

One way to solve the problems with vector space modeling is with random indexing. Random indexing is also a vector space model, but it does not have the problem of large matrices, and the dimensionality of the context vectors is predefined.

In random indexing, every entity (in the word space model above, every word in the dictionary) is given an index vector. It has a predefined dimensionality, usually in the order of hundreds or thousands (Sahlgren 2005). This index vector consists almost entirely of zeroes, except for a few elements that are either +1 or −1. These ±1s are randomly distributed, so that each index vector is different from the rest (Wan et al. 2011).

After this, the text is scanned as in the example in section 2.2. However, in random indexing the index vector of an entity is added to the context vector if it exists in the context of the currently studied entity. As an example, in the word space model above, the index vector of a word a would be added to the context vector of a word b if a exists inside the context of b (Wan et al. 2011). In this way, context vectors can be created with much smaller dimensionality than otherwise.
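A minimal sketch of these two steps in Java, assuming integer vectors; the parameter names are illustrative. A call like indexVector(1000, 8, new Random()) would correspond to the configuration later used in section 3.2.

```java
import java.util.Random;

public class RandomIndexingSketch {
    // Create a sparse ternary index vector: all zeroes except 'nonZeros'
    // randomly placed elements that are +1 or -1.
    static int[] indexVector(int dimensionality, int nonZeros, Random rng) {
        int[] v = new int[dimensionality];
        int placed = 0;
        while (placed < nonZeros) {
            int pos = rng.nextInt(dimensionality);
            if (v[pos] == 0) {                    // keep positions distinct
                v[pos] = rng.nextBoolean() ? 1 : -1;
                placed++;
            }
        }
        return v;
    }

    // When entity a occurs in the context of entity b, add a's index
    // vector into b's context vector.
    static void addToContext(int[] contextVector, int[] indexVector) {
        for (int i = 0; i < contextVector.length; i++)
            contextVector[i] += indexVector[i];
    }
}
```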


2.2.2 Similarity Measures

When comparing the similarity between vectors, there are several ways to go about it. Often the Euclidean distance between the vectors is calculated (Huang 2008), but other measures can be more effective. One such method is cosine similarity, where the angle between the vectors determines the similarity (Huang 2008). It is calculated using the following formula:

\mathrm{SIM}_C(u, v) = \cos\beta = \frac{u \cdot v}{|u||v|},    (2.2)

where β is the angle between the two vectors. The similarity is a number between −1 and 1, and the higher it is, the more similar the vectors are. Since this measure only takes the angle between the vectors into consideration, it is suitable when the lengths of the vectors are unimportant.
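A direct implementation of equation 2.2 might look as follows; the zero-vector guard is an assumption added for safety, not something prescribed by the formula.

```java
public class CosineSimilaritySketch {
    // Equation 2.2: dot product divided by the product of the norms.
    static double cosine(double[] u, double[] v) {
        double dot = 0, normU = 0, normV = 0;
        for (int i = 0; i < u.length; i++) {
            dot += u[i] * v[i];
            normU += u[i] * u[i];
            normV += v[i] * v[i];
        }
        if (normU == 0 || normV == 0) return 0; // guard against zero vectors
        return dot / (Math.sqrt(normU) * Math.sqrt(normV));
    }
}
```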


Chapter 3

Method

Since the goal of this paper is to study the performance of random indexing, an algorithm using it was developed. The results this algorithm yielded were compared to the results from an algorithm using collaborative filtering.

3.1 Dataset

The dataset that was used to compare the methods was Amazon Fine Food Reviews. It contains 568,454 different reviews of Amazon products. Each review has an individual id, the id of the reviewed product and the id of the user that gave the review. Apart from that, every review contains a rating of the product on a scale from 1 to 5, plus a text review the user has given. It also contains a helpfulness rating: the number of users who found the review helpful, as well as the number of users that indicated whether they found it helpful or not. There is also a short summary of the review and a timestamp from when it was made. The reviewed products number 74,258, and in total 256,059 different users have given at least one review.

3.1.1 Curating Data

This study used cross-validation when predicting a review. Since the random indexing algorithm predicts a rating based on other ratings, the reviews that were not part of the testing set were used to train the algorithm. In order not to compromise the results, leave-one-out cross-validation was used, so that only one rating at a time was removed and predicted.

However, this cannot be done for all reviews; some reviews are impossible to predict using our method. Thus, several so-called testing sets were created, each containing a subset of the reviews. More specifically, our method cannot predict reviews belonging to users with only one review, or products that have only been reviewed once: if that review is removed, no data remains with which to predict whether the user would like the product. The method also has difficulty predicting ratings for reviews by users with few reviews, or concerning seldom-reviewed products. Thus users or products with few reviews were removed from those that are tested. Four different limits were used for the minimum number of reviews made by the user or about the product: 3, 6, 9 and 12. The method was therefore tested on 16 different testing sets, as there is no need for the users and products to have the same limit. See Table 3.1 for the number of reviews in each testing set. In the training data, however, all reviews were present.

Table 3.1: Size of testing sets

                                Least amount of reviews for product
                                    3        6        9       12
Least amount of         3      293308   272852   258297   245943
reviews by user         6      191115   179267   170775   163383
                        9      136341   127574   121511   116319
                       12       99473    92611    87765    83593

These testing sets were parsed out and made into their own SQLite files. Apart from the previously mentioned limits, some other restrictions were applied. If there were several reviews by the same user concerning the same product, these were removed from the testing sets. Sometimes users would give very conflicting reviews, such as rating a product both one and five; this makes it hard to predict an accurate rating, since predicting 3 in such a case is not entirely correct either. Further, a list of problematic users and products was created, and reviews concerning these were not added to the testing sets. This was done to catch several special cases where the above restrictions were insufficient. A sketch of how such a testing set can be selected is given below.
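This sketch selects one of the testing sets through JDBC. The Reviews table and its column names are assumptions about the dataset's SQLite schema, not confirmed by this paper.

```java
import java.sql.*;

public class TestingSetSketch {
    // Select reviews whose user and product each meet a minimum review
    // count, mirroring the limits described in section 3.1.1.
    static ResultSet testingSet(Connection db, int minUserReviews, int minProductReviews)
            throws SQLException {
        String sql =
            "SELECT r.Id, r.UserId, r.ProductId, r.Score FROM Reviews r " +
            "WHERE (SELECT COUNT(*) FROM Reviews WHERE UserId = r.UserId) >= ? " +
            "  AND (SELECT COUNT(*) FROM Reviews WHERE ProductId = r.ProductId) >= ?";
        PreparedStatement stmt = db.prepareStatement(sql);
        stmt.setInt(1, minUserReviews);
        stmt.setInt(2, minProductReviews);
        return stmt.executeQuery();
    }
}
```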

3.2 Random Indexing

For the random indexing algorithm, the most important step is to let every product be modeled as a vector. The first step is to give every user in the database a randomized index vector. This index vector consisted of 1000 elements, as proposed by Sahlgren (2005), of which four were +1s, four were −1s and the rest were zeroes. After this, each product was given a vector with each element set to zero. These vectors were of the same dimensionality as the index vectors of the users.

For each review in the database, the index vector of the user, multiplied with a factor based on the score of the review, was added to the product vector, according to the following formula:

P_p = \sum_{u \in N} k_{u,p} \cdot i_u,    (3.1)

where P_p is the product vector for the product p, N is the set of users that have reviewed p, k_{u,p} is the score factor of the rating the user u has given the product p, and i_u is the index vector of the user u. This factor was calculated according to Table 3.2. The score factors were picked so as to allow scores of 1 and 5, or 2 and 4, to cancel each other out, while allowing a score of 3 to still matter.

Table 3.2: Score modification factor based on rating

Score   Factor
  1       -3
  2       -2
  3        1
  4        2
  5        3
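A minimal sketch of equation 3.1 together with the score factors of Table 3.2, assuming the reviewer index vectors and ratings are given as parallel lists; the names are illustrative.

```java
import java.util.List;

public class ProductVectorSketch {
    // Score factors from Table 3.2; index 0 is unused (ratings are 1..5).
    static final int[] SCORE_FACTOR = { 0, -3, -2, 1, 2, 3 };

    // Equation 3.1: sum the reviewers' index vectors, each scaled by the
    // score factor of the rating they gave. Entry r of each list holds
    // reviewer r's index vector and rating.
    static int[] productVector(List<int[]> reviewerIndexVectors, List<Integer> ratings,
                               int dimensionality) {
        int[] p = new int[dimensionality];
        for (int r = 0; r < ratings.size(); r++) {
            int factor = SCORE_FACTOR[ratings.get(r)];
            int[] indexVector = reviewerIndexVectors.get(r);
            for (int d = 0; d < dimensionality; d++)
                p[d] += factor * indexVector[d];
        }
        return p;
    }
}
```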

3.2.1 Predicting a Review

After this, one of the testing sets is chosen. The testing set contains a subset of all the reviews in the database. Then, the task is to predict the rating of one of these reviews.

First, the review itself must not influence the calculation of the prediction. Thus, the index vector of the target user, multiplied with the score factor, is subtracted from the vector of the target product. Then, using cosine similarity as in equation 2.2, the 5 most similar products to the target product are found among those the target user has reviewed. When they are found, a predicted rating for the target product is calculated using the weighted sum of equation 2.1.
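The prediction procedure could be sketched as follows, under the assumption that the user's reviewed product vectors and ratings are available in maps; this is an illustration of the steps above, not the thesis implementation.

```java
import java.util.*;

public class PredictionSketch {
    // Leave-one-out prediction of one held-out review: remove its
    // contribution, rank the user's other products by cosine similarity,
    // and take the weighted sum of the 5 most similar ones (equation 2.1).
    static double predict(int[] targetProduct, int[] userIndexVector, int scoreFactor,
                          Map<String, int[]> userProductVectors,  // product id -> vector
                          Map<String, Integer> userRatings) {     // product id -> rating
        // 1. Subtract the held-out review from the target product vector.
        int[] adjusted = targetProduct.clone();
        for (int d = 0; d < adjusted.length; d++)
            adjusted[d] -= scoreFactor * userIndexVector[d];

        // 2. Sort the user's rated products by descending similarity.
        List<String> ids = new ArrayList<>(userProductVectors.keySet());
        ids.sort(Comparator.comparingDouble(
                (String id) -> -cosine(adjusted, userProductVectors.get(id))));

        // 3. Weighted sum over the 5 most similar products.
        double numerator = 0, denominator = 0;
        for (String id : ids.subList(0, Math.min(5, ids.size()))) {
            double sim = cosine(adjusted, userProductVectors.get(id));
            numerator += sim * userRatings.get(id);
            denominator += Math.abs(sim);
        }
        return denominator == 0 ? 0 : numerator / denominator;
    }

    static double cosine(int[] u, int[] v) {
        double dot = 0, nu = 0, nv = 0;
        for (int i = 0; i < u.length; i++) {
            dot += u[i] * v[i]; nu += u[i] * u[i]; nv += v[i] * v[i];
        }
        return (nu == 0 || nv == 0) ? 0 : dot / Math.sqrt(nu * nv);
    }
}
```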

After a review has been predicted, the algorithm adds the index vector of the target user, multiplied with the score factor, back to the vector of the target product. It then continues with predicting the next review in the testing set. When all reviews in the testing set have been predicted, the root mean square error (RMSE) and the mean absolute error (MAE) are calculated, as used by Sarwar, Karypis, Konstan and Riedl (2001).
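For reference, the two error measures can be computed as follows; this is the standard formulation of RMSE and MAE, shown as a sketch.

```java
public class ErrorMeasuresSketch {
    // Root mean square error over a testing set.
    static double rmse(double[] predicted, double[] actual) {
        double sum = 0;
        for (int i = 0; i < predicted.length; i++) {
            double e = predicted[i] - actual[i];
            sum += e * e;
        }
        return Math.sqrt(sum / predicted.length);
    }

    // Mean absolute error over a testing set.
    static double mae(double[] predicted, double[] actual) {
        double sum = 0;
        for (int i = 0; i < predicted.length; i++)
            sum += Math.abs(predicted[i] - actual[i]);
        return sum / predicted.length;
    }
}
```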

3.2.2 Variation of Parameters

There are two parameters that are of interest to vary, to study their effects on the result: the dimensionality of the vectors used in random indexing, and how many ±1s are used in the index vectors. After determining the testing set that yields the best result, that testing set was used when varying these parameters.

First the dimensionality was varied; values of 100, 500, 1000 and 2000 were tested. Then the number of ±1s was changed, while keeping the dimensionality at 1000. The number was originally 8, but values of 2, 4 and 16 were also tried. In all other aspects the tests were run as previously.

3.2.3 Helpfulness

In order to test whether using the helpfulness rating associated with every review had any effect on the result, the above algorithm was modified slightly. The helpfulness rating was calculated using the helpfulness numerator field, meaning the number of users who found the review helpful, and the helpfulness denominator, being the number of users that indicated whether they found it helpful or not. These values were used in the following formula:

H = \frac{h/n + 1}{2},    (3.2)

where H is the new helpfulness modifier, h is the helpfulness numerator value and n is the helpfulness denominator value. This allowed the helpfulness modifier to range from 0.5 to 1. The modification was made to allow reviews that were not found helpful to still matter, though not as much. If no users had rated a review as helpful or not, it was given a helpfulness modifier of 1. This helpfulness value was then multiplied with the score factor used in the base random indexing method. Apart from the helpfulness modification, the algorithm was the same, and the same tests were carried out.
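Equation 3.2, including the default for reviews with no votes, in code form (illustrative):

```java
public class HelpfulnessSketch {
    // Equation 3.2: map helpfulness votes to a modifier in [0.5, 1].
    // Reviews with no votes default to 1, as described above.
    static double helpfulnessModifier(int helpfulVotes, int totalVotes) {
        if (totalVotes == 0) return 1.0;
        return ((double) helpfulVotes / totalVotes + 1.0) / 2.0;
    }
}
```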

3.3 Collaborative Filtering

Apart from the random indexing algorithm, an item-item collaborative filtering method was implemented, in order to have something to compare the results of the random indexing method against. Item-item filtering was chosen to combat the sparsity problem of the dataset. Since time is not a relevant factor in this study, the time consumption of this method was not taken into consideration. The item-item collaborative filtering algorithm was run on the same testing sets, and its results were used for the comparison. Item-item collaborative filtering normally uses a matrix, with the users on one axis and the products on the other. However, this would generate a matrix too large to keep in memory. In order to preserve memory, another representation had to be adopted: an array containing a list of associated review ids for each product was used instead, together with an array containing user, product and score for each review.

To predict a review, the five most similar products to the reviewed one were determined using cosine similarity between the products, according to equation 2.2. This similarity was then modified according to the following formula:


\mathrm{SIM} = \frac{\cos\beta + 1}{2},    (3.3)

where cos β is the cosine similarity between the two products. This allowed the similarity to range from 0 to 1, instead of from −1 to 1. The five most similar products had to have been reviewed by the user. A rating was then predicted using the weighted sum of equation 2.1.
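The rescaling in equation 3.3 is a one-liner (illustrative):

```java
public class RescaledSimilaritySketch {
    // Equation 3.3: shift cosine similarity from [-1, 1] into [0, 1].
    static double rescale(double cosineSimilarity) {
        return (cosineSimilarity + 1.0) / 2.0;
    }
}
```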

3.4 Programming Languages

The algorithms were written in Java, since that is the language the authors were most skilled in. If time performance were relevant to the study, C++ or a similar language could have been used, but in this case it was unnecessary. JDBC was used to query the database.

The dataset was modeled as a database in SQLite. This is the form in which the dataset was available, and it was thus easiest to keep it this way. Having the dataset as an SQL database allowed for easy querying and portability.


Chapter 4

Results

In this chapter the results are presented for the different tests that were made. In section 4.1 the results for random indexing are presented. Section 4.1.1 then presents the results of the tests where the parameters of the basic random indexing algorithm were changed, and section 4.1.2 presents the results when helpfulness was taken into consideration. Lastly, section 4.2 presents the results for the collaborative filtering algorithm. The results are presented in tables, showing either RMSE or MAE. All the numbers that are presented have 5 significant digits.

4.1 Random Indexing

Table 4.1: RMSE for random indexing

                                Least amount of reviews for product
                                    3        6        9       12
Least amount of         3      0.7929   0.7572   0.7290   0.7053
reviews by user         6      0.7769   0.7424   0.7141   0.6917
                        9      0.7965   0.7636   0.7418   0.7148
                       12      0.8363   0.8053   0.7801   0.7620

In Table 4.1 the calculated RMSE for the random indexing algorithm is shown. The table is divided into 16 cells, each holding the RMSE for one of the tests made on the random indexing algorithm. Each cell corresponds to one of the testing sets defined in section 3.1.1: each row defines the minimum number of reviews the target user must have given, and each column defines the minimum number of reviews the target product must have received. The RMSE for the different limits varies between around 0.7 and slightly above 0.8. The results got better the fewer products the users had reviewed and the more users had reviewed the products. The worst result can be seen in the bottom left corner of Table 4.1, while the best result is located at the far right of the second row of the table.

In Table 4.2 the calculated MAE for the random indexing algorithm is shown. The MAE for the different limits varies between around 0.3 and slightly above 0.4. The same trend as in Table 4.1 can be seen, with better results further to the right and further up in the table. However, this time the best result is in the first row rather than the second.

Table 4.2: MAE for random indexing

                                Least amount of reviews for product
                                    3        6        9       12
Least amount of         3      0.3621   0.3382   0.3194   0.3041
reviews by user         6      0.3699   0.3453   0.3267   0.3104
                        9      0.3933   0.3686   0.3511   0.3334
                       12      0.4375   0.4121   0.3930   0.3775

4.1.1 Variation of Parameters

In this section the results for the random indexing algorithm are presented when some of the parameters were changed, specifically the dimensionality of the vectors and the number of ±1s used in the index vectors. All tests were made on the same testing set, namely the one corresponding to the top right corner of Table 4.2, which was the one that generated the best results. Its parameters were a dimensionality of 1000 and a total of 8 ±1s.

Table 4.3: RMSE with varied dimensionality of vectors

Dimensionality      100      500     1000     2000
RMSE             0.7093   0.7070   0.7053   0.7050

Tables 4.3 and 4.4 show the RMSE and MAE, respectively, for the random indexing algorithm when the dimensionality of the vectors is changed. The four dimensionalities tested were 100, 500, 1000 and 2000.

Table 4.4: MAE with varied dimensionality of vectors

Dimensionality      100      500     1000     2000
MAE              0.3073   0.3050   0.3041   0.3024

As can be seen in Tables 4.3 and 4.4, a dimensionality of 2000 yielded the best result, while the worst result was acquired with a dimensionality of 100. However, the difference between 2000 and 1000 is extremely small for the RMSE in Table 4.3.

Tables 4.5 and 4.6 contain the RMSE and MAE results for the tests where the number of ±1s in the index vectors was varied. The values tested were 2, 4, 8 and 16. The RMSE and the MAE behave differently in these tests. There is no clear pattern in the RMSE, although Table 4.5 shows that the best result is acquired when the number of ±1s is 16. The best MAE was acquired when the number of ±1s was 2, and it gets worse as the number of ±1s grows.

Table 4.5: RMSE with varied number of ±1s in index vectors

Number of ±1s        2        4        8       16
RMSE            0.7058   0.7074   0.7053   0.7046

Table 4.6: MAE with varied number of ±1s in index vectors

Number of ±1s        2        4        8       16
MAE             0.2963   0.3020   0.3041   0.3043


4.1.2 Helpfulness

In Tables 4.7 and 4.8 the RMSE and MAE for the random indexing algorithm with the helpfulness rating taken into consideration are presented. They have the same structure as Table 4.1. As can be seen when comparing these tables with Tables 4.1 and 4.2, the results with helpfulness are slightly worse. The trend that the best result is located to the right of the table has not changed, and the best result is acquired in the first row for MAE and in the second row for RMSE. The worst results in both Table 4.7 and Table 4.8 are located in the bottom left corners of the tables.

Table 4.7: RMSE for random indexing with helpfulness taken into consideration

                                Least amount of reviews for product
                                    3        6        9       12
Least amount of         3      0.7949   0.7585   0.7297   0.7059
reviews by user         6      0.7808   0.7433   0.7209   0.6962
                        9      0.8007   0.7643   0.7420   0.7212
                       12      0.8394   0.8080   0.7830   0.7656

Table 4.8: MAE for random indexing with helpfulness taken into consideration

                                Least amount of reviews for product
                                    3        6        9       12
Least amount of         3      0.3634   0.3394   0.3201   0.3043
reviews by user         6      0.3721   0.3459   0.3293   0.3135
                        9      0.3958   0.3687   0.3522   0.3371
                       12      0.4393   0.4140   0.3960   0.3804

4.2 Collaborative Filtering

In Table 4.9 the calculated RMSE for the collaborative filtering algorithm is shown. The table is divided into the same 16 cells as the previous tables. The worst result can be seen in the bottom left corner of Table 4.9, while the best result is located in the top right corner of the table. The same trend can be seen in Table 4.10, which shows the MAE of the collaborative filtering algorithm.

Table 4.9: RMSE for collaborative filtering

                                Least amount of reviews for product
                                    3        6        9       12
Least amount of         3      0.5422   0.5407   0.5363   0.5307
reviews by user         6      0.5871   0.5841   0.5793   0.5729
                        9      0.6343   0.6305   0.6242   0.6168
                       12      0.6887   0.6841   0.6775   0.6704

Table 4.10: MAE for collaborative filtering

                                Least amount of reviews for product
                                    3        6        9       12
Least amount of         3      0.1807   0.1808   0.1793   0.1774
reviews by user         6      0.2199   0.2190   0.2170   0.2145
                        9      0.2593   0.2581   0.2550   0.2517
                       12      0.3075   0.3057   0.3024   0.2990


Chapter 5

Discussion

5.1 Results

The results indicate that random indexing is not as good a method for predicting ratings in recommender systems as collaborative filtering. This study shows that there is a difference in error between random indexing and collaborative filtering: with a mean absolute error of about 0.2 to 0.3, as seen in Table 4.10, the predictions of collaborative filtering were better than those of random indexing, which yielded a mean absolute error between 0.3 and 0.4, as seen in Table 4.2.

The goal of this study was to investigate whether random indexing can be considered an alternative to traditional collaborative filtering. This had not been done before, and as random indexing uses less memory and works faster than collaborative filtering, it could be a great improvement. However, while it may be too early to completely dismiss random indexing as a concept in this field, these results indicate that it could be difficult to generate as good predictions with a random indexing method as with collaborative filtering.

When studying the results of the random indexing algorithm, we see a trend: when it only predicted reviews concerning products that had been rated many times, the results were better than when more reviews were predicted (see Table 4.2). This is hardly surprising, as the algorithm works by determining the most similar products to the target product, and the more that is known about a product, the better the comparisons will be. More surprising is the fact that the same cannot be said about the users. The MAE is lowest for the testing set with the lowest user restrictions; that is, reviews made by users that had reviewed few products were easier to predict. This could indicate that users who reviewed few products were more likely to give those products the same rating. Users with more reviews had more products to compare, since the 5 most similar products among those the user had reviewed were chosen as the foundation for the prediction; with more reviewed products, there is less chance that all ratings are the same, and small errors will occur. The same seems to hold for collaborative filtering, seeing as the same trend can be seen there (see Table 4.10).

Another interesting observation is that the RMSE of the random indexing algorithm was lowest when users had reviewed 6 items or more, as can be seen in Table 4.1. Thus, the RMSE and MAE measures were not lowest for the same testing sets, as can be seen when comparing Tables 4.1 and 4.2. This indicates that while the mean error was lowest when users had reviewed as few products as possible, the variation between the individual predictions was higher then, compared to when users had reviewed at least 6 items. This trend cannot be seen for collaborative filtering, where RMSE and MAE were smallest for the same testing sets, as seen in Tables 4.9 and 4.10.

By increasing the dimensionality of the vectors for random indexing, slightly better results were obtained, as can be seen in Tables 4.3 and 4.4. The changes are very small, however, and do not seem to be worth the increase in both computation time and memory cost. Thus, it is probably best to use a dimensionality of 1000 or below.

A quite surprising result can be seen when varying the number of ±1s in the index vectors of the users: the fewer ±1s used, the better the MAE (see Table 4.6). This could be an effect of overlapping indices. If the vectors contain few non-zero elements, the chance is small that two index vectors will have non-zero elements at the same index, so it is rare that two index vectors are similar. Since the assumption when creating the index vectors is that no two will be alike, this matters: the more alike two index vectors are, the more difficult it becomes to compare the similarity of the product vectors later. With few non-zero elements, the chance that two index vectors share elements is small, but when they do coincide, the effect is much more significant than when there are many non-zero elements. Thus, as can be seen in Table 4.5, the RMSE tends to be larger with fewer ±1s.

When calculating with the helpfulness rating, the results are consistently worse than without it. This is probably not due to chance, since the results are worse for every testing set, as seen when comparing Table 4.7 with 4.1 and Table 4.8 with 4.2. However, the differences are extremely small, which indicates that the helpfulness rating is not very significant either way. It also seems that the helpfulness rating is not a good measure of how good a review actually is. The helpfulness rating varies widely between reviews: some have never been rated either helpful or unhelpful, while others have received many such ratings. It is likely that the helpfulness rating refers more to the text of the review than to the actual rating.

We can see that collaborative filtering is consistently better than random indexing, and no changes to the parameters of the random indexing algorithm change this. The main difference between the two algorithms is that in collaborative filtering the products are represented by the ratings given to them by each user; each product is thus represented in an exact way. In random indexing, each product is instead represented by adding the index vectors of its reviewers, multiplied with the rating factors. It is not very surprising that the results are slightly worse; however, the magnitude of the difference between the methods seems to indicate that random indexing might not be a suitable method for user preference prediction.

Still, it is too early to discount random indexing completely. It has several advantages that were not studied here. With smaller vectors, it is both faster and less memory-consuming than collaborative filtering. It also scales very well when adding more products or users: a new user is simply given a new index vector, and the vectors of the products they have reviewed are modified accordingly, while a new product simply gets a new vector. When adding a product or user in collaborative filtering, the entire matrix, which is already very large, has to be expanded. This could be important in applications where speed is vital. In dynamic real-time applications, where the user expects a result quickly, it could be more important to generate a fast result with an acceptable degree of correctness than a slow but more exact prediction. This holds especially for denser datasets, where there is much data to process, and where it is possible that random indexing performs better.

As mentioned, one more possible explanation for the poor performance of the random indexing algorithm is the sparsity of the dataset. As we have seen, the more that is known about a product, the better the results. It is thus possible that far better results could be obtained on a denser dataset.

5.2 Method

The algorithms were implemented in Java, which has several drawbacks. An important one is how slow the algorithms became: some took several hours to run, and they had to be run many times on separate testing sets. This severely slowed down progress, and every change to the algorithms cost several days' worth of results. This problem could have been mitigated by using a dedicated mathematics environment, such as MATLAB.

There is also a risk of poor numerical precision. There is a limit to how many significant digits Java's arithmetic can represent, and this could cause progressively larger errors the more computations are made. However, this problem is probably not very large, since more than a few significant digits would be unnecessary for these results. Also, as the data was so sparse, not enough computations were made for a very small error to grow enough to make any difference.

Another problem that was encountered was users who had reviewed products in a way that made the algorithm unusable. An example is a user who reviews one product twice with opposite ratings (e.g. 1 and 5) and another product once. This left the algorithm with no data from which to build a user vector, and thus nothing with which to compare products. This was solved by simply removing these users from the testing sets, which is a valid solution, as there were so few of them that the results were not undermined. With more time, however, it would probably have been better to find a way to deal with these users.


It is reasonable to conclude that the dataset used was very sparse, and that with a better dataset, better and more reliable results could have been achieved.

5.3 Future Work

This study does not conclusively show that random indexing cannot be used for user preference prediction. For one, a better and denser dataset could be used to properly determine the usefulness of the method in that sort of environment. It would also be interesting to test the algorithm on several similar datasets with different sparsity; it would then be possible to study whether the results are affected by the sparsity of the dataset.

Also, it would be interesting to compare the execution time and memory cost of random indexing to those of another widely used method, as well as to compare the scalability of the two. These are the strong points of random indexing, and while this study set out to see whether random indexing could yield good prediction results, it is necessary to study these other properties of the algorithm before deciding whether it can be used or not.


Chapter 6

Conclusion

In this study a random indexing approach to rating prediction was suggested and compared to an item-based collaborative filtering algorithm. The suggested algorithm generated acceptable results, but not as good as those of the comparative algorithm. The difference between the results seems to stem from how the products are represented. In the collaborative filtering method, the products are represented by their ratings, so each product is defined exactly. In random indexing, the products are instead represented as abstract vectors based on which users have reviewed them; some information is lost in this process, and thus the results become worse.

In conclusion, a random indexing approach to prediction might not be valid in terms of yielding the best prediction results. It might, however, be worthwhile to further explore how random indexing fares against other widely used prediction algorithms in terms of scalability and speed. It could also be interesting to see how random indexing performs on a denser dataset.


Chapter 7

References

Bennett, J. & Lanning, S. (2007). The Netflix Prize. In KDD Cup'07. San Jose, California, USA, August 12, 2007.

Birkes, D. & Dodge, Y. (1993). Alternative Methods of Regression. New York: John Wiley & Sons, Inc.

Dai, W., Xue, G., Yang, Q. & Yu, Y. (2007). Transferring naive Bayes classifiers for text classification. In Proceedings of the National Conference on Artificial Intelligence, Vol. 22, No. 1. Menlo Park, CA; Cambridge, MA; London: AAAI Press; MIT Press.

El-Baz, W. & Tzscheutschler, P. (2015). Short-term smart learning electrical load prediction algorithm for home energy management systems. Applied Energy, 147, 10-19.

Guzdial, M. (1993). Deriving software usage patterns from log files. Georgia Institute of Technology.

Harrell, F. (2015). Regression Modeling Strategies: With Applications to Linear Models, Logistic and Ordinal Regression, and Survival Analysis. Second edition. Springer International Publishing Switzerland. DOI: 10.1007/978-3-319-19425-7

Hassanat, A. B., Abbadi, M. A., Altarawneh, G. A. & Alhasanat, A. A. (2014). Solving the problem of the K parameter in the KNN classifier using an ensemble learning approach. arXiv preprint arXiv:1409.0919.

Herlocker, J. L., Konstan, J. A., Terveen, L. G. & Riedl, J. T. (2004). Evaluating collaborative filtering recommender systems. ACM Transactions on Information Systems (TOIS), 22(1), 5-53.

Huang, A. (2008). Similarity Measures for Text Document Clustering. In New Zealand Computer Science Research Student Conference 2008. Christchurch, New Zealand, April 14-18, 2008.

Keller, J. M., Gray, M. R. & Givens, J. A. (1985). A Fuzzy K-Nearest Neighbor Algorithm. IEEE Transactions on Systems, Man, and Cybernetics, 15(4), 580-585. DOI: 10.1109/TSMC.1985.6313426

Lewis, D. D. (1998). Naive (Bayes) at forty: The independence assumption in information retrieval. In Machine Learning: ECML-98 (pp. 4-15). Springer Berlin Heidelberg.

Nazemi, K., Stab, C. & Fellner, D. W. (2010). Interaction analysis: An algorithm for interaction prediction and activity recognition in adaptive systems. In Intelligent Computing and Intelligent Systems (ICIS), 2010 IEEE International Conference. Xiamen, China, October 29-31, 2010, pp. 607-612.

Rish, I. (2001). An Empirical Study of the Naive Bayes Classifier. In IJCAI 2001. Seattle, Washington, USA, August 4-10, 2001.

Sahlgren, M. (2005). An Introduction to Random Indexing. In Methods and Applications of Semantic Indexing Workshop at the 7th International Conference on Terminology and Knowledge Engineering. Copenhagen, Denmark, August 16, 2005.

Sahlgren, M. (2006). The Word-Space Model. Diss. Stockholm: Stockholm University.

Sarwar, B., Karypis, G., Konstan, J. & Riedl, J. (2001). Item-based collaborative filtering recommendation algorithms. In Proceedings of the 10th International Conference on World Wide Web (pp. 285-295). ACM.

Wan, M., Jönsson, A., Wang, C., Li, L. & Yang, Y. (2011). A Random Indexing Approach for Web User Clustering and Web Prefetching. In New Frontiers in Applied Data Mining. Shenzhen, China, May 24-27, 2011, pp. 40-52.

Webb, G. I., Boughton, J. R. & Wang, Z. (2005). Not so naive Bayes: aggregating one-dependence estimators. Machine Learning, 58(1), 5-24.

Xue, G. R., Lin, C., Yang, Q., Xi, W., Zeng, H. J., Yu, Y. & Chen, Z. (2005). Scalable collaborative filtering using cluster-based smoothing. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 114-121). ACM.

