
Cognitive Search Engine Optimization

JOAKIM EDLUND

KTH ROYAL INSTITUTE OF TECHNOLOGY

SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE


JOAKIM EDLUND

Master in Computer Science
Date: August 10, 2020

Supervisor: Johan Gustavsson
Examiner: Olov Engwall

School of Electrical Engineering and Computer Science
Host company: Pacing

Swedish title: Kognitiv sökmotoroptimering


Information retrieval is the field of study concerned with finding documents in large unstructured collections. Within this field there are widely researched baseline solutions to this problem, as well as more advanced techniques (often based on machine learning) that aim to improve the relevance of the results further. However, picking the right algorithm or technique when implementing a search engine is no trivial task, and deciding which one performs better can be hard.

This project takes a commonly used baseline search engine implementation (Elasticsearch) and measures its relevance score using standard measurements within the field of information retrieval (precision, recall, f-measure). After establishing a baseline configuration, a query expansion algorithm (based on Word2Vec) is implemented in parallel with a recommendation algorithm (collaborative filtering), so that the two can be compared against each other and against the baseline configuration. Finally, a combined model using both the query expansion algorithm and collaborative filtering is evaluated to see whether the two can use each other's strengths to make an even better setup.

Findings show that both Word2Vec and collaborative filtering improve relevance across all three measurements (precision, recall, f-measure), and statistical analysis confirms that these improvements are significant. Collaborative filtering performs better than Word2Vec for the topmost results, while Word2Vec improves more as the result set is allowed to grow. The combined model showed a significant improvement on all measurements for result sets of sizes 3 and 5, but for larger result sets it showed a smaller improvement or even worse performance.


Information retrieval studies methods for finding documents within large unstructured collections of documents. There are several standard solutions in the field intended to solve this problem, as well as a number of more advanced techniques, often based on machine learning, whose goal is to increase the relevance of the results further. Choosing the right algorithm is not trivial, however, and determining which one gives the best results can seem difficult.

In this project a commonly used search engine, Elasticsearch, is used in its default configuration and evaluated against metrics commonly used in information retrieval (precision, recall and f-measure). Once the results of the default configuration have been established, a query expansion algorithm based on Word2Vec and a recommendation algorithm based on collaborative filtering are implemented. All three models are then compared against each other on the three metrics. Finally, a combined model of both Word2Vec and collaborative filtering is implemented to see whether the strengths of both models can be exploited for an even better model.

The results show that both Word2Vec and collaborative filtering give better results for all metrics. The improvements could be verified as statistically significant. Collaborative filtering appears to perform best when only a few documents are allowed in the result set, while Word2Vec improves more the larger the result set is. The combined model showed a significant improvement for result sets of sizes 3 and 5. Larger result sets, however, showed no improvement, or even a deterioration, compared to Word2Vec and collaborative filtering.


Contents

1 Introduction
  1.1 Problem Description
  1.2 Research Question
  1.3 Scope of the Project
2 Background
  2.1 Information Retrieval
    2.1.1 Search Engine
    2.1.2 Types of Search Engines
    2.1.3 Search Engine Software
  2.2 How Does a Search Engine Work?
    2.2.1 Relevance and Result Ranking
    2.2.2 Term Frequency and Inverse Document Frequency
  2.3 Machine Learning in Information Retrieval
  2.4 Machine Learning Algorithms
    2.4.1 Latent Dirichlet Allocation
    2.4.2 Latent Semantic Analysis
    2.4.3 Word2Vec
  2.5 Recommender Engines
    2.5.1 Collaborative Filtering
  2.6 Evaluation Metrics
    2.6.1 Precision
    2.6.2 Recall
    2.6.3 F-measure
  2.7 Improving Search Results
  2.8 Related Work
  2.9 Summary/Conclusion
3 Methods
  3.1 Research Design
  3.2 Data
    3.2.1 Data Gathering
    3.2.2 Processing
    3.2.3 Data validation
  3.3 Models
    3.3.1 Baseline Model
    3.3.2 Word2Vec
    3.3.3 Collaborative Filtering
    3.3.4 Combining Models
  3.4 Implementation
    3.4.1 Tools
    3.4.2 Tuning
  3.5 Result Gathering
  3.6 Analysis
4 Results
  4.1 Word2Vec
  4.2 Collaborative Filtering
  4.3 Model Comparison
5 Discussion
  5.1 Comparing models
  5.2 Combined Model
  5.3 Sustainability/Ethics/Societal Impact
  5.4 Future Work
6 Conclusions
Bibliography


1 Introduction

As the Internet keeps growing and the world becomes more digitized and complex, the need for digital tools to navigate and access information has grown massively. The need is not new, but it has grown at a faster pace in recent years. Search engines today can serve millions of pages to hundreds of millions of search queries on a daily basis [1].

The need to search through large amounts of documents with quick responses can be considered fulfilled. The need to find relevant information, however, is harder to satisfy, partly because relevance is subjective to the user asking for the information, and partly because it requires a semantic understanding of both the query and the searchable data.

Sometimes a user browsing for information might not know their own explicit needs. A recommendation engine tries to solve this problem for the user. The purpose of a recommender engine is to predict the preference a user would have for an item or document using various methods. The intention is to provide documents or items with the user's highest predicted preference without the user explicitly asking for them.

1.1 Problem Description

Today there are companies and organizations providing tools for implementing search engines that others can use on their own collections of data. These tools use well-known document indexing techniques for fast search and document ranking, which makes it quick and easy to set up a search engine for anyone willing to learn the software tools.

However, while a baseline search engine installation like this might fulfill the need for quick and responsive search, the question remains whether the results are the most relevant ones. This problem is widely researched, and several complex and advanced algorithms that aim to improve the relevance of search engine results have been studied.

In this project a baseline search engine for food items will be benchmarked against common search engine metrics. Then, a machine learning approach (using Word2Vec) and a recommender engine (using collaborative filtering), both used to improve ranking, will be implemented on top of the baseline configuration to compare benchmarking scores. Finally, a combined version of the two introduced algorithms will also be tested.

This project aims to find out whether a baseline search engine can see significant improvements to its relevance score by applying one or more complex machine learning techniques.

1.2 Research Question

What is the measurable impact of implementing a known machine learning algorithm on a baseline search engine's relevance score, and how does it compare to collaborative filtering? Can these methods be combined to achieve improved performance?

1.3 Scope of the Project

The aim of this project is to evaluate potential improvements to relevance score by applying machine learning techniques to a baseline search engine. There are several information retrieval software packages implementing well-known search techniques that are available for installation and for indexing existing data. In this project one of the more popular packages, Elasticsearch, is picked for evaluation. The picked search engine is a crawler-based search engine, as that is the most commonly used type of search engine at the time of writing.

Data for both indexing and evaluation is provided by Pacing Sweden AB, which has access to a large food items database and historical item searches with expected results for evaluation. After deciding on a machine learning algorithm for search engine improvement, it is evaluated against the baseline configuration and collaborative filtering. Determining the best machine learning algorithm is not in the scope of this project; rather, the goal is to evaluate a methodology for finding and measuring improvements gained by introducing machine learning algorithms on top of baseline configurations.

The database used to match the incoming claims to relevant food items is Dabas, which is publicly available and contains almost 40,000 items (https://www.dabas.com/) [2]. One item document in the database has 160 distinct properties. The database stores information about food items with varying degrees of structure. Some fields, such as nutritional information, country of origin and weight, are stored in a structured manner. For example, nutritional information is divided into several properties such as name, value and unit, which makes it easy and unambiguous to interpret. However, there is little to no validation of what the vendors of the different food items enter into the database, and several fields are open free-text input (e.g. the list of ingredients field). The database is updated every night to make sure the system can present the latest daily information. This is very important for the client using the search engine, since the data needs to be up to date.


2 Background

2.1 Information Retrieval

To explain what a search engine is, it is best to start by looking at the broader subject of information retrieval. The term information retrieval, as an academic field of study, can be defined as:

Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collec- tions (usually stored on computers). [1]

The difference between information retrieval and a traditional database lookup is the so-called unstructured nature of the system. In a traditional database system a typical search query would, for instance, look up a specific OrderID in a table of orders to find the information needed. The problem is that the ID of the desired document must be known beforehand. In an information retrieval system a query can be much broader. For instance, a query in an IR system could be phrased as "find movies with aliens" and the system responds with a list of movies featuring aliens.

The term "unstructured data" refers to data which does not have a clear, semantically overt, easy-for-a-computer structure. IR systems also have the ability to make "semistructured" searches. An example would be to query the system for movies whose titles contain "star wars" and whose descriptions contain "Obiwan".


2.1.1 Search Engine

A search engine, usually a website, is an information retrieval system that gathers information from different sources on the internet and indexes it. The search engine then allows a user to look up desired information from the gathered content via a query interface. Most commonly this is done via a text input field, and the engine then tries to interpret the input text to match the indexed information. Results are usually displayed to the user as a list of links to the gathered content that the search engine deems relevant to the query. Examples of well-known and frequently used search engines are Google, Bing and Baidu. These search engines index the world wide web and make it searchable for anyone to use.

2.1.2 Types of Search Engines

Crawler-Based

A crawler-based search engine uses a crawling bot, or spider, to index information into the underlying database of the search engine. A crawler-based search engine usually operates in four steps.

1. Crawl for information

2. Index the documents into the search engine database

3. Calculate relevancy for all documents

4. Retrieve results for incoming search queries

This is the most common type among today's popular search engines, such as Google, Bing and Baidu. This is also the type of search engine used in this study, and more in-depth descriptions of these steps can be found in the following sections.

Human Powered Directories

Human powered directories are, in contrast to an automated crawler, indexed manually. The owner of a website submits a short description of the site to the search engine database. The submission is then reviewed manually by a database administrator and either added to the appropriate category or rejected. This means that the search engine ranking will only be based on the description and keywords submitted by the site owner, and will not take changes to the web page content into consideration. This type of search engine is no longer used after the wild success of automated engines such as Google.

Hybrid Search Engines

Hybrid search engines utilize a mix of crawling and human powered directories. They use a crawler-based information retrieval bot to gather information but use manual methods to improve result ranking. For instance, a hybrid search engine may display a manually submitted description in the search results but base the ranking on a mixture of the submitted description and the information crawled from the actual web site itself.

Others

Besides the previously mentioned types of search engines, there are search engines designed for searching specific types of media. For example, Google has a search engine specifically made for searching images. These types of engines usually utilize other techniques, since the search medium might, for instance, require the search engine to analyze the contextual meaning of the search query.

2.1.3 Search Engine Software

There are a number of different search engine software packages provided by different organizations. The website db-engines [3] ranks search engine software by popularity and updates the list monthly. A detailed description of how they calculate this ranking is available on their website. At the time of writing, the top three ranked search engine software packages were the following.

Elasticsearch

Elasticsearch is a distributed open-source search engine. It is used by many large corporations around the world to facilitate and speed up their search engine development. Elasticsearch provides tools for indexing documents and building complex search queries. The default installation of Elasticsearch builds an inverted index, mapping every word in the indexed documents to all the documents the word appears in, together with its locations in those documents. By default it uses different techniques for different data types to optimize search responsiveness. It is a highly customizable search engine and can easily be tailored to a specific use case [4].

Splunk

Splunk is an enterprise software product that comes with many prepackaged connectors to make it easy to index and import data from various sources. Splunk is mainly operated from a web interface, used to create various reports with advanced visualization tools, and it does most indexing automatically. To extend its functionality, Splunk provides a library of apps called Splunkbase, a collection of configurations, knowledge objects, views, and dashboards that run on the Splunk platform [5].

Solr

Solr is an open source search engine developed by The Apache Software Foundation. It is a schemaless, document-based search engine much like Elasticsearch, and their features are very similar, as both projects are based on the Lucene engine [6].

2.2 How Does a Search Engine Work?

2.2.1 Relevance and Result Ranking

A search engine returns a list of results to the user, sorted by their relevance to the user's query. This is called result ranking and is one of the fundamental problems that a search engine tries to solve. Classically, relevance is expressed as a numerical summary of several parameters that together define the relevance of a document to the query. The resulting list of matching documents is then sorted by this relevance number, placing the document with the highest relevance score at the top and the document with the lowest relevance at the bottom. How these parameters are gathered and how the relevance number is calculated is up to the people who design the search engine and to what the users of the search engine define as relevant. More general search engines that go through any web page on the internet might focus on the number of words matching between the web page and the user's query. Another search engine, focused on articles in a web store, might include parameters such as an article's popularity with other customers or how recently the article was published in the store. To solve the problem of result ranking one has to consider the user's definition of relevance as well as the different techniques behind a search engine for gathering and calculating the final relevance score. It is not uncommon to have to reiterate and reevaluate the search engine multiple times before being able to solve the problem [7].

2.2.2 Term Frequency and Inverse Document Frequency

The search engine studied in this project is built with Elasticsearch, which by default uses a numerical statistic called term frequency–inverse document frequency (tf-idf) to rank its search results. Tf-idf is intended to reflect the importance of each word to a single document in a collection.

Term frequency (tf) is defined as the number of times a word occurs in a document. This means that documents with a high search term frequency will rank higher. However, it also means that long documents are more likely to receive higher ranking scores than short documents. To counter that problem one can use inverse document frequency (idf), where terms that occur in many documents (e.g. "the") are weighted less than rarer terms that also match the document content. The calculation of inverse document frequency can be altered and tailored depending on the search problem. The general method of calculation derives from the following formula [1]:

$$\mathrm{idf} = \log\frac{|\{\text{documents in the collection}\}|}{|\{\text{documents the term appears in}\}| + 1}$$

Tf-idf is calculated by multiplying the term frequency and the inverse document frequency:

$$\text{tf-idf} = \mathrm{tf} \cdot \mathrm{idf}$$


A survey from 2016 showed that tf-idf was the most frequently used weighting scheme. Tf-idf has varying performance when it comes to ranking, but because of its wide use and relative ease of use it remains popular. It is usually altered to meet the needs of specific implementations, which leads to the varying results. Other, more advanced techniques can be hard to adopt, since some authors provide little information about their algorithms, making them hard to replicate [8].
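As an illustration of the weighting scheme above, the following minimal Python sketch computes tf-idf weights for a toy collection; the example documents are invented and the +1 smoothing follows the formula given earlier.

import math
from collections import Counter

documents = [
    "frozen salmon fillet skin and bone free",
    "strawberry jam no colorants",
    "frozen chicken leg grilled",
]

# Term frequency: how often a term occurs in one document.
def tf(term, doc):
    return Counter(doc.split())[term]

# Inverse document frequency with the +1 smoothing from the formula above.
def idf(term, docs):
    containing = sum(1 for d in docs if term in d.split())
    return math.log(len(docs) / (containing + 1))

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(tf_idf("frozen", documents[0], documents))   # appears in 2 of 3 docs -> low weight
print(tf_idf("salmon", documents[0], documents))   # appears in 1 of 3 docs -> higher weight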

2.3 Machine Learning in Information Retrieval

Search engines today utilize many different machine learning algorithms to improve the experienced relevance of the resulting documents. Machine learning can be used, for instance, in query analysis to correct spelling, expand queries with synonyms and disambiguate intent, and in document analysis for page classification, sentiment analysis, detection of entity relationships and more. Choosing the right algorithm can prove difficult because of the number of algorithms and their sometimes sparse documentation in the literature. There are still many open problems and questions that have yet to be solved [9]. In the field of Natural Language Processing (NLP) there are several machine learning based algorithms that can be utilized for improved information retrieval. NLP, in the context of computer science, is about programming computers to process and understand natural language data. In information retrieval it is used to extract semantic understanding from documents or queries to improve the classification and retrieval process [10]. One machine learning field within natural language processing is called semantic analysis. Semantic analysis is the process of linguistically parsing sentences and paragraphs into key concepts, verbs and proper nouns. Using statistics-backed technology, these words are then compared to a taxonomy [10]. Two examples of algorithms within this field are latent Dirichlet allocation and latent semantic analysis.


2.4 Machine Learning Algorithms

There are several machine learning algorithms and approaches to choose from when it comes to information retrieval, and it is an actively developing research area. Unfortunately, not all algorithms are well documented and some are developed for very specific data sets. The algorithms described below are some commonly used algorithms found in recent natural language processing trends that have good documentation and can be replicated [11].

2.4.1 Latent Dirichlet Allocation

Latent Dirichlet allocation (LDA) assumes that a document consists of many small topics and tries to categorize documents based on their relevance to certain topics. For instance, LDA can identify a topic called bread_related, in which words such as flour, butter or cheese have a high probability of belonging to the topic [12].

2.4.2 Latent Semantic Analysis

Latent semantic analysis is a natural language processing technique that analyzes the relationship between a set of documents and the terms they contain. This can be used to predict desired properties from unstructured text data. It assumes that words with similar meaning appear in similar sentences [13].

2.4.3 Word2Vec

Word2Vec is a group of models used to reconstruct the linguistic context of words. The models are shallow, two-layer neural networks. They try to capture the meaning of a word by building a vector space of all words, with the distance between words representing their distance in semantic meaning. The models assume that words appearing in the same context share semantic meaning. Word2Vec comes in two different approaches (continuous bag of words and the skip-gram model) for predicting the semantic relations between words. These are particularly computationally efficient (not requiring much CPU time), which is one of the reasons why they are appropriate for improving search engines, as search engines are often expected to respond quickly [14].

Continuous bag of words

Continuous bag of words (CBOW) tries to predict a target word based on the surrounding words. Consider the sentence "the quick brown fox jumps over the lazy dog". If the target word is "fox" and the context window is 2 (the number of surrounding words), CBOW will store the context words "brown" and "jumps". This way the model can try to predict the target word based on the surrounding words [15].

Skip-gram

The skip-gram model, which is primarily used when larger data sets are available, is roughly the reverse of CBOW. It differs from CBOW by trying to predict the surrounding words given the current word, instead of predicting the current word from the surrounding words. This model is more expensive to compute, since it tries to predict multiple surrounding words instead of a single target word. Increasing the window size thus increases the computational time, and using a large window can make training the model take a very long time [15].

2.5 Recommender Engines

A recommender system provides an alternative way of finding information compared to traditional search, by recommending items the user might not have found by themselves. Since collaborative filtering has been widely used for recommendation systems in recent years [9], it is the recommendation technique described in this chapter.

2.5.1 Collaborative Filtering

Collaborative filtering is the most widely used technique in recommendation systems because of its accuracy and simple algorithm design. In general, collaborative filtering identifies users who have given similar ratings to, or made similar purchases of, the same items. It can then pick items from similar users and present them as recommendations [16].

K-Nearest Neighbors (KNN)

One way of implementing CF is to use the K-Nearest Neighbors learning method. KNN is a non-parametrized, lazy learning method. This means that the generalization of the data is not done until a document is queried. When a query is made to the system, the feature similarity distance from the evaluated item to every other item in the database is calculated. Then the k nearest neighbors are gathered and regarded as the most similar item recommendations [17]. KNN is useful since it makes no assumptions about the data, and it is a relatively simple and versatile algorithm. KNN is, however, not a memory efficient algorithm, since it stores all of the training data in memory, and it is quite computationally expensive [18]. KNN document classification is explained in pseudo-code in Algorithm 1:

trainData ← load("knn_traindata")
document ← selected_document
k ← selected_k_value
distances ← list()
forall point in trainData do
    distances.push(euclidean_distance(point, document))
end
sort(distances)
nearestNeighbors ← distances[0..k]
documentClass ← get_class_of_majority(nearestNeighbors)

Algorithm 1: K-Nearest Neighbors
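A minimal runnable sketch of the same classification idea follows, assuming documents are already represented as numeric feature vectors; the vectors and class labels below are invented for illustration.

import numpy as np
from collections import Counter

# Toy training data: feature vectors with known classes (invented for illustration).
train_vectors = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]])
train_classes = ["frozen", "frozen", "colonial", "colonial"]

def knn_classify(document, k=3):
    # Euclidean distance from the query document to every training point.
    distances = np.linalg.norm(train_vectors - document, axis=1)
    nearest = np.argsort(distances)[:k]
    # Majority vote among the k nearest neighbours.
    votes = Counter(train_classes[i] for i in nearest)
    return votes.most_common(1)[0][0]

print(knn_classify(np.array([0.85, 0.15])))  # -> "frozen"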

Non-Negative Matrix Factorization (NMF)

NMF is a matrix factorization method that constrains the factor matrices to be non-negative. In a topic modeling scenario such as in this project, it would for instance create a document matrix where each column is a document and each element a tf-idf word weight. When this matrix is decomposed into two factors, it creates one matrix in which each column represents a topic and each row a word, and another matrix in which each column represents a document and each row a topic. This way it is possible to weight each word and/or document towards certain topics and build recommendations based on this [19].
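The thesis does not give an implementation of this step, but a small sketch of the decomposition described above could look like the following with scikit-learn; the toy documents and the choice of two topics are made up, and scikit-learn orients the matrix with documents as rows rather than columns.

from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents (invented); each becomes a row of tf-idf word weights.
docs = [
    "rye flour bread baking",
    "wheat flour bread loaf",
    "strawberry jam sugar berries",
    "raspberry jam sugar jar",
]
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)              # documents x words matrix

nmf = NMF(n_components=2, init="nndsvd", random_state=0)
W = nmf.fit_transform(X)                   # documents x topics weights
H = nmf.components_                        # topics x words weights

# Strongest words per topic, e.g. a bread-like topic and a jam-like topic.
words = tfidf.get_feature_names_out()
for topic in H:
    print([words[i] for i in topic.argsort()[-3:]])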

Singular Value Decomposition (SVD)

SVD is used to address the data sparsity problem that can arise when using CF. SVD is commonly used to reduce the dimensionality of user–item rating matrices. This increases the rating density and thus makes it possible to find hidden relationships in the matrix, giving the recommendation engine more rating information to work with [16].
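A minimal sketch of that dimensionality reduction idea, using a made-up user–item matrix where zeros are unrated items:

import numpy as np

# Toy user-item rating matrix (invented); zeros are unrated items.
ratings = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 0.0, 0.0],
    [0.0, 1.0, 5.0, 4.0],
    [1.0, 0.0, 4.0, 5.0],
])

# Full SVD, then keep only the k strongest singular values (rank-k approximation).
U, s, Vt = np.linalg.svd(ratings, full_matrices=False)
k = 2
approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# The reconstructed matrix is dense: the former zero entries now hold estimated
# ratings that a recommender could use to rank unrated items.
print(np.round(approx, 2))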

2.6 Evaluation Metrics

To determine whether or not the search engine performs well according to the end user, there must be a way to evaluate the results. In this section, common evaluation measures used in information retrieval are presented.

2.6.1 Precision

Precision measures the ability of the search engine to find relevant documents relative to the number of returned results. Precision is calculated by dividing the number of relevant documents in the search result by the total number of results [20].

$$\text{precision} = \frac{|\{\text{relevant documents}\} \cap \{\text{retrieved documents}\}|}{|\{\text{retrieved documents}\}|} \qquad (2.1)$$

2.6.2 Recall

Recall works similarly to precision, but is measured relative to the total number of relevant documents instead of the number of retrieved results. Recall is calculated as the fraction of relevant documents in the search results over the total number of relevant documents [20].

$$\text{recall} = \frac{|\{\text{relevant documents}\} \cap \{\text{retrieved documents}\}|}{|\{\text{relevant documents}\}|} \qquad (2.2)$$


2.6.3 F-measure

A way to combine both precision and recall into a single measurable number is to use the F-measure.

$$\text{F-measure} = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \qquad (2.3)$$

When precision and recall are close to each other, the F-measure is close to their average, but in general it behaves like the harmonic mean. The F-measure is also called the $F_1$-measure because precision and recall are evenly weighted. To change the weighting between precision and recall, the traditional F-measure can be modified into the $F_\beta$-score.

$$F_\beta = \frac{(1 + \beta^2) \cdot \text{precision} \cdot \text{recall}}{\beta^2 \cdot \text{precision} + \text{recall}} \qquad (2.4)$$

When β > 1 recall is weighted higher, and when β < 1 precision is weighted higher. Commonly used β values are 2 and 0.5, chosen depending on how the user prioritizes recall over precision and vice versa [20].

2.7 Improving Search Results

The most common ways to evaluate search engines are precision, recall and f-measure as described in section 2.6. Characteristics of good search engine results are [21]:

• It should match the user's query

• It should be properly organized according to the relevancy of the content

• It should be properly defined such that it is understood well by the user

• It should be robust

• It should provide satisfaction and quality to the user

• It should not be ambiguous

• It should be readable


One common method of improving search results is the tf-idf vector space model approach (described in section 2.2.2), but there are several other approaches.

One example is using a meta search engine. This solution takes a search query, passes it on to several already existing search engines, aggregates their respective results and re-ranks them. This resolves the problem of having an outdated database, assuming that at least one of the search engines is up to date for the query [21].

Another example is storing search history and profiling information about the user. This usually requires the user to log in and answer forms before using the search engine. It gives the results a sense of personalization for the end user.

Clustering is also a technique for improving results. It is an unsupervised data mining technique that automatically classifies documents without any predefined information on how to do so. It is used to minimize the space the search has to cover, which improves query responsiveness and ranking. It also overcomes the problem of vocabulary differences [21].

2.8 Related Work

There have been several earlier attempts to evaluate different approaches and techniques for improving search engine result ranking. An attempt to improve PubMed's result ranking was made in 2018, where the ranking algorithm was modified using machine learning, which showed a great improvement in user click-through rate. The previously used system was a tf-idf based ranking algorithm that was maintained and developed through manual experiments and analyses. The authors wanted to see if machine learning based ranking algorithms could improve the search engine. They developed a custom algorithm called Best Match, inspired by machine learning based ranking algorithms such as L2R, BM25 and LambdaMART. After deploying the new algorithm they could measure a 20% increase in user click-through rate [22].

In 2016, Roy et al. implemented automated query expansion using Word2Vec, leading to improved result ranking. This was done by analyzing the incoming query to the search engine and then reformulating it using the trained Word2Vec corpus they had created. The expanded queries were found to almost always outperform their baseline model significantly [23].

Another interesting related work is the attempt to combine Word2Vec and collaborative filtering to improve a recommendation engine. The results of that experiment suggest that the hybrid algorithm greatly improves the efficiency and accuracy of the recommender system for large datasets [24].

2.9 Summary/Conclusion

The algorithms examined in this study are latent Dirichlet allocation, latent semantic analysis, Word2Vec and collaborative filtering. The choice came down to Word2Vec and collaborative filtering, as they seem to be the more popular choices in recent trends [11] [9] while using quite different approaches. LDA and latent semantic analysis were not evaluated further, as they would be used in scenarios similar to Word2Vec and would make the scope of the project too big.

Word2Vec could potentially be useful for the project in determining whether certain properties are related to each other or possibly synonymous. An example of what it is hoped to resolve is finding out that a certain food label also implies that the item is organic. Hypothetically, after training the model with information about tender documents and articles, it will be able to widen the possible matches of certain search words.

For this project, collaborative filtering might prove useful if it can successfully find similarities between claims and thus recommend items that the automated search query cannot find, or rank certain items higher because it finds them relevant to recommend.

In this study, Word2Vec and collaborative filtering will be examined to measure how much improvement can be made on a search engine optimization problem. They are both well-known algorithms used to improve the relevance of information retrieval systems, but with different approaches. Word2Vec focuses on word semantics, both in the search query and in the searchable data, while collaborative filtering tries to match users' previous actions to find similarities between documents. Another part of this study is trying to combine both strategies, to see whether it is possible and whether it could improve the search results even more. As has been seen in earlier attempts, both Word2Vec on its own and attempts to combine the algorithms have resulted in improvements. This suggests that these algorithms should show significant improvements compared to the baseline algorithm; in addition, this project will benchmark the algorithms against each other. The intention is to see whether one algorithm can improve the results more than the other, and whether the hybrid algorithm outperforms the individual algorithms.

In this project, precision, recall and the combined f-measure metric will be used to analyze the results. These metrics are commonly used in the information retrieval field to measure the performance and accuracy of a search engine or recommendation engine. For example, in 1999 Gordon et al. compared eight different search engines using precision and recall to benchmark performance, calling the method traditional within the information retrieval field. They also used statistical comparisons to determine the significance of the differences between the results, as will be done in this project [25].


3 Methods

3.1 Research Design

This study applies an experimental design to answer how different information retrieval algorithms change the retrieved search results. In brief, two categorically different information retrieval algorithms (Word2Vec and collaborative filtering) and a custom hybrid variant were implemented and trained with learning data from a food items database. The models were evaluated by comparing precision, recall and f-measure to a baseline model based on common tf-idf document indexing. The resulting deviations in the chosen measurements were deemed significant or not by applying statistical analysis. The constant variables of the experiment were the learning and test data sets fed to the algorithms. The research design is shown in figure 3.1.

Figure 3.1: The experimental design schema


3.2 Data

Evaluation of the different models required proper data sets in order to run simulations and tests. For both simulations and tests, completed tender documents were used to build the test and validation data sets. This was possible because completed tender documents contain the necessary information about both the requested and the delivered articles. Details about data gathering and pre-processing are given in the following sections.

3.2.1 Data Gathering

The process of gathering data was done in two major steps: first, gathering all the necessary test data from previously completed tender documents; second, gathering all food item information stored in the article database. Figure 3.2 shows the general flow of the data gathering process.

Figure 3.2: The data gathering process

Incoming Procurements

A procurement is the process of obtaining goods for an organization from an external vendor. The procurement process starts with a purchaser creating a procurement order by submitting a tender document. This is done by entering a list of loose definitions of desired articles and their desired properties into a computer system (e.g. chicken leg, frozen, grilled). How this system works in detail is unknown to the principal and the tender strategist. What they see is only the resulting exported file that the purchaser sends to the tender strategist. This file mostly comes in the form of a comma separated values (CSV) file containing a list of all the desired article descriptions and the constraints that go with them. This file structure makes it easy for a computer system to import and interpret the input data. A few example rows from a tender document can be seen in Table 3.1.

Table 3.1: Tender Position

Product Family  | Product Class       | Product Description | Properties                                                                                      | Quantity
Frozen          | Fish                | Unprepared Salmon   | Fillet, Salmo Salar (salmon), Frozen pieces, Portion, Skin and bone free                        | 1,397
Colonial        | Spices              | Lemon/Lime Pepper   | Lemon, Free from MSG, Spice mix, Plastic container                                              | 12
Colonial        | Jam/Marmalade/Jelly | Strawberry jam      | Berries >= 35.00%, Bucket, No colorants or flavorings                                           | 510
Pastry/Desserts | Ice Cream           | Milk free ice cream | Fat content <= 10%, free from milk protein, Vanilla, free from soy, free from lactose, oat base | 181

The first line in Table 3.1 shows that the purchaser is asking for 1,397 unprepared frozen salmon products fulfilling certain properties. The problem for the search engine is to find the most appropriate articles to present to the tender strategist to offer the purchaser. Most likely there will be more than one article matching the description provided, so the engine will have to rank the articles it finds by each matching article's individual relevance score.

Completed Procurements

As mentioned in section 3.2.1, incoming procurements usually come in the form of a CSV file. The same goes for the completed documents. Each row in a file is represented as shown in Table 3.2.


Table 3.2: Tender Header

Pos | Prod Family | Prod Class | Prod Desc | Properties | GTIN
INT | TEXT        | TEXT       | TEXT      | TEXT       | INT

Each row in the incoming tender document was identified by its Pos (position) number. To match the completed tender document with the incoming document file, all that needed to be done was to match the position numbers between the files for each row. Tender documents were processed by an in-house script and stored in a document database in JSON format, to easily allow JSON-based queries towards Elasticsearch. The resulting database, consisting of close to 10,000 positions from 14 tender documents with claims in all product categories, provided the data set for both testing and evaluating the different algorithms.

Articles Database

The article database is accessible over the internet through a REST API sending responses in JSON format and consists of a little more than 35,000 items. Since Elasticsearch is JSON-based, the simplest solution was to query the database for all its information and index it directly into the local Elasticsearch engine. This was also done with an in-house script. Elasticsearch automatically indexed all the database information based on its configuration, setting up tf-idf models and interpreting data types. Without any configuration, Elasticsearch builds a default index with basic optimizations. After this was done, all that remained before running the tests for the baseline model was to define the search query used to fetch items from the database.

Because Dabas is designed not to allow fetching the entire database in one call, fetching all of it took a lot of time. To avoid having to go through that process multiple times, the in-house script was designed to first store all the information in MongoDB, serving as a cache, before sending it to Elasticsearch. This way, in case Elasticsearch had to be reindexed with a new configuration or for some other reason, it could be done directly from MongoDB instead of fetching the information over the internet. This saved a great amount of time, since rebuilding Elasticsearch had to be done several times. It also ensured the test data was always the same between index rebuilds.

3.2.2 Processing

To utilize the gathered data for testing and machine learning, some processing had to be done. The process of creating the test data was quite straightforward. During the data gathering process, the in-house script used automated tools for identifying each column's data type (string, integer, floating point etc.) before importing it into the database. After that, another post-gathering processing step was carried out to simplify the testing process later in the project. An example of this processing is the properties column, which is a comma separated text string describing each required property of the desired food item. The extra processing had to go through all of the documents and split the string on the delimiter to create a more flexible data structure for later follow-up. An example of a document before and after processing is shown in Listing 3.1 and Listing 3.2 below.

Listing 3.1: Tender document data before Processing

{
    "procurement" : "boden",
    "pos" : 7,
    "prod_area" : "Colonial/Groceries",
    "prod_division" : "Flour",
    "prod_commodity" : "Sifted rye",
    "properties" : "Free from Nuts, Peanuts, Almonds, apricot seeds and sesame seeds, Only natural sugars",
    "selected_articles" : "17321575952076,7311140730102"
}


Listing 3.2: After Processing

{
    "procurement" : "boden",
    "pos" : 7,
    "prod_area" : "colonial/groceries",
    "prod_division" : "flour",
    "prod_commodity" : "sifted rye",
    "properties" : [
        "free from nuts",
        "peanuts",
        "almonds",
        "apricot seeds and sesame seeds",
        "only natural sugars"
    ],
    "selected_articles" : [
        "17321575952076",
        "07311140730102"
    ]
}

In Listing 3.2 the properties field has been transformed into a list of properties, where all strings have been lowercased for easier string comparison. The selected_articles string in Listing 3.1 has also been turned into a list, as seen in Listing 3.2, and the second GTIN (Global Trade Item Number) has had a 0 prepended so that the number conforms to the GTIN standard of having 14 digits. This error was very common in these CSV files, as they were usually made in spreadsheet software that removed leading zeros from numbers before saving the file. The processing had to fix this in order to find the selected article by its identifying GTIN; otherwise the test validation would not work.
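The thesis does not show the processing script itself; a minimal sketch of the two transformations described above might look like this, where the field names follow the listings and the padding length of 14 follows the GTIN-14 format.

def process_position(doc: dict) -> dict:
    """Split free-text fields into lists and normalize GTINs, as in Listings 3.1 -> 3.2."""
    processed = dict(doc)

    # Lowercase simple text fields for easier string comparison.
    for field in ("prod_area", "prod_division", "prod_commodity"):
        processed[field] = doc[field].lower()

    # Split the comma separated properties string into a list of lowercased properties.
    processed["properties"] = [p.strip().lower() for p in doc["properties"].split(",")]

    # Split selected articles and left-pad each GTIN with zeros to 14 digits,
    # restoring leading zeros dropped by spreadsheet software.
    processed["selected_articles"] = [
        gtin.strip().zfill(14) for gtin in doc["selected_articles"].split(",")
    ]
    return processed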

3.2.3 Data validation

The incoming data had to be validated before being used in the final implementation. This was done with a small script that imported the CSV files into the MongoDB instance used in the project. During this process, every row and column was checked to contain valid data and converted to its proper data type before being inserted into the database. Any information that did not pass the criteria was ignored, since it would not be possible to process it later anyway. An example of a finalized and validated document:

Listing 3.3: Validated Tender Position Document

{
    "procurement" : "boden",
    "pos" : 7,
    "prod_area" : "colonial/groceries",
    "prod_division" : "flour",
    "prod_commodity" : "sifted rye",
    "properties" : [
        "free from nuts",
        "peanuts",
        "almonds",
        "apricot seeds and sesame seeds",
        "only natural sugars"
    ],
    "selected_articles" : [
        "17321575952076",
        "07311140730102"
    ]
}

3.3 Models

3.3.1 Baseline Model

As mentioned in the previous section, most preparations for the baseline model were already finished by setting up Elasticsearch and storing the procurement data in a database. All that remained was to define the search query. The query was modeled as a JSON document with the root property called query; this property in turn contains all the configuration needed for the query to run as intended.


Search Query

The search query is divided into two groups of constraints, as shown in Listing 3.4. One group consists of sub-queries that must be fulfilled for an article to end up in the match results. The other group contains queries that should match, which only boost the score when they do.

The sub-queries that are required for a match are the queries for product family, product class and product description. These properties are required to match since they define which category the article needs to be in. For instance, if the procurement claim asks for a frozen (product area), bird product (product division), chicken leg (product commodity), then all results are expected to match these requirements. Listing non-frozen items or non-bird products, for instance, would never be relevant.

The optional sub-queries are in place to rank the matching articles against each other and are more numerous than the required ones. One of the simpler of these queries ranks the user's wholesale company's self-branded goods higher, i.e. if the article's brand name matches the wholesale company's name it ranks higher. Other optional queries are the weight queries. These are left optional because the search engine has trouble identifying and matching the correct gross and net weight values for the articles. Since weight information about an article is typically part of some free-text description of the item, it is not certain that the search query will find it or match it correctly. Thus, these queries are left optional but rank articles that actually match higher. The last set of optional queries are the ones for finding properties in articles. These are divided into two groups: one group of properties with no value and one group of properties with an associated value. All of these properties are matched by searching a text string of concatenated properties from the different article documents. This is where the search engine struggles the most to find matches, as properties can be written in so many different forms and synonyms. The properties with an associated value are especially hard to find, as there are so many ways to express this type of information in free text.

Listing 3.4: Search Query

{
    "size": 10,
    "query": {
        "bool": {
            "must": [
                {"term": {"Product Code": procurement["prod_area"]}},
                {"term": {"Product Code": procurement["prod_division"]}},
                {"term": {"Article Description": procurement["prod_commodity"]}}
            ],
            "should": [
                {"range": {"Size": {
                    "gte": position["min_weight"],
                    "lte": position["max_weight"]
                }}},
                {"term": {"manufacturer": wholesale_company_name, "boost": 2.0}},
                {"term": {"Ingredients": position["props"][0]}},
                ...
            ]
        }
    }
}
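The thesis does not show how such a query is submitted; with the official Python client the call might look like the sketch below, where the index name "articles" and the inlined filter values are assumptions for illustration, and older client versions would wrap the query in a body dict instead of passing it directly.

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# A reduced version of the query in Listing 3.4, with the filter values inlined.
query = {
    "bool": {
        "must": [{"term": {"Product Code": "frozen"}}],
        "should": [{"term": {"Ingredients": "salmon"}}],
    }
}

response = es.search(index="articles", query=query, size=10)
for hit in response["hits"]["hits"]:
    print(hit["_score"], hit["_source"].get("Article Description"))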

3.3.2 Word2Vec

When the baseline model was up and running, Word2Vec could be used to try to improve it. As explained earlier, Word2Vec can be used to expand the search queries to include more words similar to the original query. The model first had to be trained before it could be used.

Training Word2Vec

To begin with, Word2Vec had to learn the domain that the search engine was querying. The training data consisted of both procurement information and article information from the article database. The training data was built by concatenating all columns of each row in all procurements into a single text string. For each row, the Dabas information for the selected article was also gathered and appended to the end of the string. Finally, the list of all text strings could be passed to Word2Vec for training. Word2Vec needs the training data in this format to be able to run its models on it, as described in earlier chapters. The process is described in pseudo-code in Algorithm 2:

trainData ← list()
forall row in procurements do
    articleInfo ← getArticleInfo(row["selectedArticle"])
    text ← ""
    forall column in row do
        text ← text + column + " "
    end
    forall column in articleInfo do
        text ← text + column + " "
    end
    trainData.append(text)
end
word2vec.train(trainData, size = 150, window = 5, minCount = 2, workers = 10)

Algorithm 2: Training the Word2Vec Model

For the training, different values of the size, window, minCount and workers parameters were used (see section 3.4.2). The parameters' impact on the results is presented in the results chapter. One model was created for each parameter setup for later testing.
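The thesis implements this with gensim (Table 3.4); a minimal sketch of the training step with the parameter values from Algorithm 2 follows. The example sentences are invented, and gensim 4.x names the dimensionality parameter vector_size, while older versions call it size.

from gensim.models import Word2Vec

# Each training "sentence" is one concatenated procurement row plus the
# selected article's Dabas information, tokenized into words.
train_data = [
    "colonial groceries flour sifted rye only natural sugars".split(),
    "colonial groceries flour wheat flour bread".split(),
    "frozen fish unprepared salmon fillet frozen pieces".split(),
]

model = Word2Vec(
    sentences=train_data,
    vector_size=150,   # "size" in Algorithm 2 / gensim 3.x
    window=5,
    min_count=2,
    workers=10,
    sg=0,              # 0 = CBOW, the variant chosen in section 3.4.2
)
model.save("word2vec.model")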

Rebuilding the Elastic Index

After Word2Vec was trained, it was necessary to rebuild the Elasticsearch index to incorporate the model into the search engine. This was done by iterating through each document in the article database copy in MongoDB and complementing mainly its ingredients property with the words Word2Vec deemed related to the existing ingredients text. The process is described in pseudo-code in Algorithm 3:


model ← load("word2vec.model")
forall item in article database do
    similarWords ← model.mostSimilar(item["ingredients"], similarity >= 0.8)
    forall word in similarWords do
        item["ingredients"].append(word)
    end
    elastic.index(item)
end

Algorithm 3: Rebuilding the Elastic Index

After this process, the documents in the elastic index contained much more information about the articles, giving the search query a wider range since the documents were longer. The similarity parameter filters the similarWords list to only return words that Word2Vec finds 80% similar or more to the item, similarity being the cosine similarity between words in the Word2Vec vector space. This parameter was determined empirically. Setting it too low gave the new elastic index irrelevant new words for the items' ingredients descriptions, resulting in poor search results. Setting it too high resulted in Word2Vec not recommending any relevant words at all, or so few that they barely made any impact on the search results. The same expansion was also done for the "Article Description" property of each item. This process was repeated for each training model with its different training parameter setup, creating one elastic index per model for the future testing.
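A sketch of the expansion step in Algorithm 3 follows, assuming the gensim model trained in the earlier sketch and using most_similar, which returns (word, cosine similarity) pairs that can be filtered against the 0.8 threshold; the document structure is simplified for illustration.

from gensim.models import Word2Vec

model = Word2Vec.load("word2vec.model")

def expand_ingredients(item: dict, threshold: float = 0.8) -> dict:
    """Append Word2Vec neighbours above the similarity threshold to the ingredients list."""
    expanded = list(item["ingredients"])
    for word in item["ingredients"]:
        if word not in model.wv:          # skip words unseen during training
            continue
        for neighbour, similarity in model.wv.most_similar(word, topn=10):
            if similarity >= threshold and neighbour not in expanded:
                expanded.append(neighbour)
    return {**item, "ingredients": expanded}

item = {"GTIN": "17321575952076", "ingredients": ["flour", "rye"]}
print(expand_ingredients(item)["ingredients"])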

Expanding the Search Query

Word2Vec was also used to expand the search query built in the baseline model. The structure was kept the same as in Listing 3.4, but the search strings were altered. To give the search queries a greater chance of finding a wider set of documents, the Word2Vec model was used to expand the search strings from the incoming tender documents. This was done in the same way as the index expansion in Algorithm 3, but for the "Article Description" search string and all the "Ingredients" search strings. The other search strings were not altered, as they function as a filter rather than a full document search. For the search query expansion a similarity threshold of 0.9 was used instead, which was decided upon by running tests.

3.3.3 Collaborative Filtering

Collaborative filtering clusters documents by similarity based on user feedback. This approach is different from Word2Vec and is incorporated later in the search engine process: instead of expanding the search query, collaborative filtering was used to rerank and improve the search result ranking. Tuning collaborative filtering was done by testing different algorithmic approaches. A description of the different algorithms can be found in section 2.5.1, and a short description of each is given in Table 3.3.

Table 3.3: Collaborative Filtering Algorithms

KNNBaseline   | An algorithm taking into account a baseline rating.
KNNWithMeans  | An algorithm taking into account the mean ratings of each user.
KNNWithZScore | An algorithm taking into account the z-score normalization of each user.
NMF           | An algorithm based on Non-negative Matrix Factorization.
SVD           | An algorithm based on Singular Value Decomposition.
SVD++         | The SVD++ algorithm, an extension of SVD taking into account implicit ratings.

Training Collaborative Filtering

First, the collaborative filtering algorithm had to be trained. This was done by extracting information about already completed tender documents and building a dataframe. The dataframe consisted of identifying information about the article, descriptive information about the query, and a rating scale variable. In this particular implementation these were the article's GTIN, the search query's article commodity, and how many times the article had been used in previous tender documents with the same article commodity requirement. This dataframe was then fed to the collaborative filtering algorithm to calculate the similarities.
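The thesis uses the surprise package for this step (Table 3.4); a minimal sketch of loading such a dataframe and training one of the algorithms from Table 3.3 follows, where the column names, GTIN values and the rating scale upper bound are assumptions for illustration.

import pandas as pd
from surprise import Dataset, KNNBaseline, Reader

# Toy dataframe mirroring the description above:
# query commodity ("user"), article GTIN ("item") and a usage count ("rating").
df = pd.DataFrame({
    "commodity": ["sifted rye", "sifted rye", "strawberry jam"],
    "gtin": ["17321575952076", "07311140730102", "17331575952083"],
    "times_used": [5, 2, 3],
})

reader = Reader(rating_scale=(1, 10))
data = Dataset.load_from_df(df[["commodity", "gtin", "times_used"]], reader)

algo = KNNBaseline()
algo.fit(data.build_full_trainset())

# Predicted "rating" of an article for a given commodity, used later for reranking.
print(algo.predict("sifted rye", "17321575952076").est)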


Altering the Result Rankings

After Elasticsearch had processed the search query and generated a list of search results, the collaborative filtering model was used. The search result list was reranked according to the rating scale value the collaborative filtering model predicted for the search query's required article commodity. This way the result ranking relied on the item similarity calculated by collaborative filtering instead of the inverse document frequency the engine originally used.
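Continuing the sketch above, the reranking step could look like this; the hit structure is simplified and the GTIN field name is an assumption.

def rerank(hits, commodity, algo):
    """Sort Elasticsearch hits by the rating predicted by the trained surprise model."""
    return sorted(
        hits,
        key=lambda hit: algo.predict(commodity, hit["_source"]["GTIN"]).est,
        reverse=True,
    )

# hits would normally come from es.search(...)["hits"]["hits"] as in the earlier sketch.
toy_hits = [
    {"_source": {"GTIN": "17321575952076"}},
    {"_source": {"GTIN": "07311140730102"}},
]
print([h["_source"]["GTIN"] for h in rerank(toy_hits, "sifted rye", algo)])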

3.3.4 Combining Models

Combining the models was possible since they target different parts of the information retrieval process. Both models were trained just as described in sections 3.3.2 and 3.3.3. The search query and the elastic index were expanded and rebuilt using the Word2Vec techniques. Then, after the search engine had retrieved its results, the documents were reranked according to the collaborative filtering predictions. In this combined version the search engine had both a wider knowledge of the lexical meaning of the search queries and the searchable data set, and an understanding of how popular each article was based on previous user behavior.

3.4 Implementation

The programming language of choice in this project was Python, for its extensive library of tools and prebuilt algorithms. This made it easier and faster to run tests and tunings.

3.4.1 Tools

To run all the scenarios and tests for this project, a number of tools were used. Table 3.4 lists all tools and software packages used to conduct the experiment.


Table 3.4: Tools

Name          | Description                               | Use Case
MongoDB       | Document database                         | Used to store test and validation data
Elasticsearch | Document based search engine              | Used for the baseline search engine implementation
Python        | Programming language                      | Used for implementation
SciPy         | Python package for scientific computing   | Used for statistical analysis functions
gensim        | Software framework for topic modelling    | Used to implement Word2Vec models
surprise      | Python package for recommendation systems | Used to implement Collaborative Filtering models
pandas        | Python Data Analysis Library              | Used to build dataframes

3.4.2 Tuning

To find the best performing implementation of each algorithm, some tuning had to be done. The baseline implementation of Elasticsearch was not altered, to make sure the result comparison would be fair. Only the added algorithms were tuned to improve their performance.

Word2Vec

Tuning of the Word2Vec algorithm was done mainly by altering the window size, min_count and the training algorithm. The window size is the maximum allowed distance between the current and the predicted word in a sentence. The min_count parameter specifies the minimum word frequency a word needs to have to be included in the training model. The two training algorithms tested were the skip-gram model and CBOW, as described in section 2.3.1; the final implementation used the CBOW model. There are more ways to alter the Word2Vec model for tuning, but these parameters showed the most impact on the result rankings. To be able to evaluate the result ranking impact of each parameter change, one pre-trained model file was created for each parameter setup. This way the same Word2Vec implementation could be rerun with different pre-trained models to compare and find the best tuned setup for the algorithm. Tuning result examples for Word2Vec are presented in section 4.1.
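
As a sketch of this setup, one pre-trained model per window size could be produced roughly as follows with gensim. The toy corpus is a placeholder for the tokenized searchable data set, the remaining parameter values are those later reported in table 4.2, and the parameter names follow gensim 4.x (vector_size was called size in gensim 3.x).

from gensim.models import Word2Vec

# Placeholder corpus; in the project this would be tokenized text from
# the searchable article data set.
corpus = [
    ["surgical", "gloves", "nitrile", "powder", "free"],
    ["examination", "gloves", "latex", "powder", "free"],
]

for window in (3, 5, 10, 15, 20, 50):
    model = Word2Vec(
        sentences=corpus,
        vector_size=150,  # embedding dimensionality ("size" in the report)
        window=window,    # max distance between the current and predicted word
        min_count=2,      # drop words occurring fewer than 2 times
        sg=0,             # 0 = CBOW (the model used here), 1 = skip-gram
        workers=10,
    )
    model.save(f"word2vec_window_{window}.model")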

Collaborative Filtering

Similar to Word2Vec, the collaborative filtering algorithm can be implemented with various tuning parameters. The tool used in this project, surprise, makes this task straightforward. The package ships with several ready-made collaborative filtering algorithms, so the tuning process was mainly to try each one and see which performed best. All that had to be done was to alter the algorithm parameter before retraining the model. Results of this tuning process can be found in section 4.2.
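
A sketch of that comparison loop, assuming data is the surprise Dataset built from the training dataframe. The candidate list only covers algorithms named in this report, and cross_validate's RMSE is used here as a quick proxy, whereas the actual evaluation in this project was done on the reranked search results.

from surprise import KNNWithMeans, KNNWithZScore, NMF, SVD, SVDpp
from surprise.model_selection import cross_validate

candidates = {
    "KNNWithMeans": KNNWithMeans(),
    "KNNWithZScore": KNNWithZScore(),
    "NMF": NMF(),
    "SVD": SVD(),
    "SVD++": SVDpp(),
}

for name, algo in candidates.items():
    # 5-fold cross-validation of the rating predictions for each algorithm.
    scores = cross_validate(algo, data, measures=["RMSE"], cv=5, verbose=False)
    print(name, scores["test_rmse"].mean())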

3.5 Result Gathering

When gathering results, both to determine optimal tuning parameters and to evaluate the final model setup, the tender document dataset was partitioned into 14 chunks. One partition was used for evaluation (calculating precision, recall and f-measure) and another for training. This was repeated for every partition in every result gathering run. The final result was then calculated as the average over all partition runs together with the deviation between them.
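
A small sketch of the relevance metrics and of how the per-partition scores could be averaged. The retrieved/relevant lists below are dummy data and the partitioning itself is omitted; the helper names are illustrative, not the project's actual code.

import numpy as np

def precision_at(retrieved, relevant, k):
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / k

def recall_at(retrieved, relevant, k):
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant) if relevant else 0.0

def f_measure(p, r):
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0

# Dummy per-partition scores; in the experiment one score is collected per
# partition run and summarised as mean and deviation.
per_partition = [precision_at(["a", "b", "c"], ["a", "c", "d"], k=3) for _ in range(14)]
print(np.mean(per_partition), np.std(per_partition))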

3.6 Analysis

After gathering precision, recall and f-measure results for each model and tuning variant it was time to analyze the results. For this the SciPy tool was used to apply statistical analysis methods to the results. When all results had been gathered, a statistical analysis using t-test and ANOVA methods was performed to determine whether or not the observed averages of the models/tunings differ significantly. The null hypothesis was that the compared results have identical average values. If the p-value is small enough (less than 0.05) we can reject the null hypothesis of equal averages and assume we have observed a significant difference in results.
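
A minimal sketch of that analysis step with SciPy, using dummy score arrays in place of the gathered per-partition relevance scores.

import numpy as np
from scipy import stats

# Dummy relevance scores per partition run for three model variants.
baseline = np.array([0.21, 0.24, 0.19, 0.23, 0.22])
word2vec = np.array([0.26, 0.28, 0.25, 0.27, 0.29])
collab = np.array([0.27, 0.30, 0.26, 0.28, 0.31])

# Pairwise comparison of two models with an independent two-sample t-test.
t_stat, p_ttest = stats.ttest_ind(baseline, word2vec)

# Comparison across more than two groups with a one-way ANOVA.
f_stat, p_anova = stats.f_oneway(baseline, word2vec, collab)

# Reject the null hypothesis of equal means when p < 0.05.
print(p_ttest < 0.05, p_anova < 0.05)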


Results

This chapter presents the gathered information that later led to the conclusions drawn in chapter 6. During analysis, statistical methods were applied to the gathered evaluation metrics for further insight. This was all done using the tools described in section 3.4.1. The intention with the gathered results was to build a solid basis of information for evaluating which method and tuning had the biggest impact on the evaluation metrics.

4.1 Word2Vec

The first step was gathering information about tuning the Word2Vec algorithm. The following figures (4.1, 4.2 and 4.3) show the mean values and standard deviation for each relevance metric for different choices of window size. More variables were evaluated in the project (size, window, min_count, workers), but since the process was exactly the same for all variables, only window size is included in the report as an example.

Figure 4.1 shows the precision values for depths 3, 5, 10 and 25 for window sizes 3, 5, 10, 15, 20 and 50. The precision can be seen to increase with the depth, but the window size has no apparent influence on precision. The same pattern can be observed for recall, as shown in figure 4.2, and f-measure, as shown in figure 4.3.

Observing the means in the resulting graphs of section 4.1 it looks like the window size had no apparent impact on the relevance scores. To confirm this an ANOVA test was performed.


The null hypothesis was that there is no significant difference between the mean relevance score values measured for the different chosen window sizes passed to the Word2Vec algorithm.

The alternative hypothesis was that there is a measurable difference in mean relevance score between the different window sizes passed to the Word2Vec algorithm, beyond what random causes would explain.

The results from running an ANOVA test on all the observed means for each depth value are shown in table 4.1.

Figure 4.1: Word2Vec Window Size Results. Precision against window size (3, 5, 10, 15, 20, 50) for depths 3, 5, 10 and 25.

Figure 4.2: Word2Vec Window Size Results. Recall against window size (3, 5, 10, 15, 20, 50) for depths 3, 5, 10 and 25.

Figure 4.3: Word2Vec Window Size Results. F-measure against window size (3, 5, 10, 15, 20, 50) for depths 3, 5, 10 and 25.

Table 4.1: ANOVA Word2Vec Optimization

Precision
Depth     3       5       10      25
P-value   0.869   0.998   0.998   0.994

Recall
Depth     3       5       10      25
P-value   0.872   0.996   0.997   0.988

F-measure
Depth     3       5       10      25
P-value   0.856   0.997   0.998   0.994

As we can see from table 4.1, all the p-values gathered from the test runs are too large (larger than 0.05). This means we cannot reject the null hypothesis, and thus we cannot claim that the measured differences in mean values are caused by the change in window size.

The results suggest that the window size parameter does not impact the results in a meaningful way. Window size 15 was nevertheless used for the optimized model, as it seemed to increase the f-measure for medium depth levels in figure 4.3. After following the same procedure for the other Word2Vec parameters as well, the optimized model used the values shown in table 4.2.

Table 4.2: Word2Vec Optimized Parameters

size    window    min_count    workers
150     15        2            10

4.2 Collaborative Filtering

The second step was gathering information about how different implementations of the collaborative filtering algorithm influenced the measured relevance score.

Figures 4.4, 4.5 and 4.6 show the average precision, recall and f-measure for all tested algorithms for depth levels 3, 5, 10 and 25.

The KNN-based algorithms show the most consistent results. The baseline algorithm does not see a significant increase in performance over the depth levels for either measurement variable.


KNNWithMeans and KNNWithZScore have very similar results for precision, recall and f-measure, with the scores improving as the depth level increases.

The NMF algorithm shows a slight increase in precision and recall (figures 4.4 and 4.5) with increasing depth level. Looking at the f-measure (figure 4.6), a very poor result is shown for depth level 3, but it improves significantly for higher depths.

The SVD algorithms seem to have a very inconsistent performance. They both have a very large standard error for the precision measures, except for SVD at depth levels 10 and 25 (figure 4.4).

SVD performs better when it comes to recall (figure 4.5), improving with the depth level, while the SVD++ algorithm shows the opposite behaviour and gets worse. In the final figure (4.6) the results are more consistent. SVD performs better with increasing depth level but suddenly drops at depth level 25.

The SVD++ algorithm sees a very small increase in f-measure with the depth level, but with a high standard error.

Observing the means in the resulting graphs (4.4, 4.5 and 4.6), it looks like the choice of collaborative filtering algorithm had no apparent impact on the relevance scores, except possibly for the standard deviation. To confirm this an ANOVA test was performed.

The null hypothesis was that there is no significant difference between the mean relevance score values measured for the different chosen collaborative filtering algorithms.

The alternative hypothesis was that there is a measurable difference in mean relevance score between the different collaborative filtering algorithms, beyond what random causes would explain.

The results from running an ANOVA test on all the observed means for each depth value can be found in table 4.3.
