Dynamic Query Completion Through Search Result Clustering

Dynamiska sökordsförslag genom klustring av sökresultat

AMELIA ANDERSSON

ameliaa@kth.se - May 2, 2015

Master's Thesis at CSC within Computer Science
Master Programme in Computer Science and Engineering

Supervisor: Hedvig Kjellström
Examiner: Danica Kragic Jensfelt


Abstract

Query completion is a feature employed by most modern search engines. These completions can be derived by different means. The most popular algorithm ranks completions by the frequency with which they appear in a database of old query logs. This project aims to investigate a new method for finding completions, namely through clustering search results and extracting terms from the clusters. To test the capabilities of this method, the project implemented the back-end of a search system, which includes the search result clustering algorithm Lingo. The system uses the output cluster labels as query completions. Two experiments were conducted, one for Informational queries and one for Navigational queries, each comparing the system to Apache Solr's Suggester component. For Informational queries, a new way of scoring query completions was invented. The experiments showed that clustering performed better than the Suggester component for Informational queries; the results were inconclusive for Navigational queries.


Dynamiska sökordsförslag genom klustring av sökresultat

Query completion is a feature offered by most modern search engines. These completions can be derived in different ways. The most commonly used method ranks its completions by the number of times they occur in a database of old query logs. The goal of this project is to investigate a new method for finding query completions, namely by clustering search results and extracting terms from the clusters. To test the possibilities of this method, we implemented a back-end system, which included the search result clustering algorithm Lingo. The resulting cluster labels from Lingo were used as query completions. Two experiments were carried out, one for Informational queries and one for Navigational queries. In both experiments the system was compared to Apache Solr's Suggester component. For the Informational experiment we devised an entirely new way of scoring query completions. The results showed that the clustering method performed better than the Suggester component for Informational queries. No conclusions could be drawn from the results of the Navigational experiment.


Contents

1 Introduction
   1.1 Motivation
   1.2 Goal and Scope of the project
   1.3 Delimitations
   1.4 Thesis Structure

2 Background
   2.1 Preliminary Concepts
      2.1.1 Tf-Idf
      2.1.2 The Vector Space Model
      2.1.3 Latent Semantic Indexing
      2.1.4 Suffix Arrays
      2.1.5 Mean Reciprocal Rank
      2.1.6 Query Completion vs. Query Suggest
   2.2 Search Result Clustering
      2.2.1 Challenges
   2.3 Search Result Clustering - Implementations
      2.3.1 Scatter/Gather
      2.3.2 Suffix Tree Clustering
      2.3.3 Lingo
      2.3.4 Dual C-Means
   2.4 Interactive query expansion
      2.4.1 Methods
      2.4.2 Tools
      2.4.3 Evaluation of Query Completion Systems
   2.5 Studies on Search Behaviour
      2.5.1 User Search Goals
      2.5.2 Search Strategies
      2.5.3 How are we Searching?
      2.5.4 When do People use Query Suggestion?
      2.5.5 Interaction with Dynamic Query Completion

3 The Search System
   3.1 Resources
      3.1.1 Tools and Components
      3.1.2 Libraries
      3.1.3 Dataset
   3.2 Query Completion
      3.2.1 Clustering algorithm
      3.2.2 Cleaning
      3.2.3 Presenting the Completion List
   3.3 Practical implementation
   3.4 Parameters
   3.5 Example
      3.5.1 Step 1. Querying Solr
      3.5.2 Step 2.1 Lingo - Preprocessing
      3.5.3 Step 2.2 Lingo - Phrase Extraction
      3.5.4 Step 2.3 Lingo - Cluster Label Induction
      3.5.5 Step 2.4 Lingo - Cluster Content Discovery
      3.5.6 Step 3. Cleaning

4 Evaluation and Results
   4.1 Motivation
   4.2 Start words
   4.3 Baseline
   4.4 Evaluation of Informational Queries
      4.4.1 Scoring
      4.4.2 Results
      4.4.3 Discussion
   4.5 Evaluation of Navigational Queries
      4.5.1 Results
      4.5.2 Discussion

5 Conclusions
   5.1 Contributions
   5.2 Future Work

Bibliography

Appendices
   A Stop Words
   B Start words for Informational Queries
   C Informational Results


Chapter 1

Introduction

The use of search engines has in recent years become a significant part of our Internet activity. They have made it possible for us to quickly find information we previously needed to go to the library for, as well as provided us with an easy-to-use interface to the websites offering services we need. To use a search engine, a user typically has to submit a query: a phrase containing all the words the user thinks are needed to retrieve the documents relevant to his or her intent. It is then up to the search engine to interpret these keywords and return the search results the user wants. One can easily view this as a one-sided obligation: the user types whatever springs to mind and the search engine should immediately be able to guess what he or she actually meant. In reality, if the user's query is a bad one, the search engine is set an impossible task. Some responsibility should be assumed by the user, to not only type in a search query, but to type in a good search query. One of the tasks of the search engine should therefore be to aid him or her in that quest.

1.1 Motivation

A typical design of a search engine interface can be seen in Figure 1.1. The query is written in an entry form and, when submitted, the results are presented in a vertical list. Searching is a task that requires concentration from the user, and thus the standard search user interface is simple, with few distractions. Along with the features already mentioned, many search engines today offer a list of term suggestions as the user types a query, in this work referred to as a query completion list. A search user interface should be designed to help users express their information needs, formulate their queries, understand the search results and keep track of their information seeking progress [19]. Query completions are the means by which many search engines try to fulfil the second point of this list, as users have been seen to be inspired by the suggestions presented [9]. Another benefit of presenting suggestions in this way is that choosing from the list may save the user the effort of typing the entire phrase he or she had in mind.


Figure 1.1. An example of a search engine interface

The problem query completion algorithms have to solve is a difficult one. They have to properly match the suggestions in the list to the users' possible needs, while at the same time making sure the suggestions lead to relevant and sufficiently covering search results. There are currently many variations of query completion algorithms that prioritize different features. A common approach is to draw suggestions from related queries made by past users [35]. One of the problems with this technique, however, is that queries follow trends. A large proportion of today's searchers want information on current events [24], and their queries may therefore be underrepresented in the search engine's logs. Back in 2009, for example, 20% of Google's daily queries were ones that had not been seen in the previous 90 days [28]. The method also assumes that previous user queries are good queries with regard to the search results, which may not be the case.

The search solution consulting company Findwise uses another common method in many of their search systems, where the suggestions do not rely on query logs but on the search data itself. This method is described in more detail in Chapter 3.1.1. The suggestions of this method may return the relevant documents, but tend to be too unvarying to really satisfy the user's direct need.

In order to explore alternatives to currently available query completion methods, this thesis project will attempt a different approach entirely: given an original query, the project’s algorithm will come up with suggestions through grouping, or more specifically clustering, the search results and extracting key terms from each of the different clusters.

Commercial search systems clearly benefit from users finding what they are looking for faster than when using their competitors' systems, and a lot of research has gone into trying to minimize the time and effort of searching. The method of clustering search results to improve user experience and effectiveness has been studied for a relatively long time. One of the more popular studies on the subject is “Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections” [17]. A common feature of a lot of implementations is that the user


is often expected to navigate the search results by clicking their way through the clusters or facets. To our knowledge, there have been no studies on the effects of incorporating this feature into a query-completion function, which is the object of this thesis project.

1.2 Goal and Scope of the project

The main contribution of the thesis project is to investigate the suitability of using clustering as a means of deriving query completion terms. This goal can be expressed in the following subgoals:

(a) Study various search clustering algorithms and determine which of these would be most suitable as a query completion algorithm.

(b) Implement a search system that employs the chosen search clustering algorithm as its query completion algorithm.

(c) Find a good method for testing the system and apply this method.

1.3 Delimitations

This project is delimited to:

• The implementation and evaluation of a query completion algorithm for the search of Wikipedia articles written in English.

• Wikipedia is the only dataset used for evaluation.

• The project does not concern improving the ranking of search results.

1.4 Thesis Structure

The thesis is structured in the following way:

Chapter 2 begins with a section going through the preliminary concepts needed

to understand the rest of the chapter. It then provides a background to the many areas touched by the thesis problem: Search Result Clustering, Query Completion and User behaviour.

Chapter 3 describes the practical part of the thesis project, namely the search

system. It begins by describing the resources used for the building of the system, before moving on to the details of implementation and the incorporation of the clustering algorithm.


Chapter 4 begins with a motivation of the choice of evaluating for Informational queries and Navigational queries. It then proceeds to describe the method, the scoring system and the results of each experiment. The chapter is concluded with a discussion of the results.

Chapter 5 concludes the thesis project with a summary and proposals for future work.


Chapter 2

Background

The problem we try to solve in this thesis touches on many areas within information retrieval. Choosing a suitable search result clustering algorithm for our query completion system relies on our knowledge of the algorithms that already exist, their benefits and flaws, and how they are used in practical implementations. In order to justify our choice of approach, we also have to have a sound knowledge of the different types of query completion methods used today, and of their drawbacks.

In this chapter we will first try to give a brief run-through of the theory needed to understand some of the concepts mentioned in the following subsections. We will then present the main points and challenges of search result clustering and query completion respectively, as well as the previous work done within these fields.

2.1 Preliminary Concepts

The subsections below explain some basic concepts in more detail, in order to help the reader's understanding of the sections ahead.

2.1.1 Tf-Idf

A simple approach to assigning a weight of importance to a term in a document is to count the number of times it occurs in the document. The number of occurrences, or the term frequency (denoted tf), can be seen as a quantitative representation of the document. This view of a document is known as the bag of words model, in which the weight is not in any way influenced by the order of the terms. With this view, the document “Mary is quicker than John” is identical to the document “John is quicker than Mary”. Although the two documents are different, the high similarity of the bag of words representations indicates that they are similar in content.

Using only term frequency for assigning weights, however, comes with the problem of attaching too much importance to words that are commonplace. Stop words such as the and a may occur frequently but do not contribute to expressing the actual contents of the document.

Figure 2.1. Cosine similarity between documents in the same vector space model, where sim(d1, d2) = cos θ.

To balance this, the document frequency (df), i.e., the number of documents in which the term appears, is used to scale the term frequency. Where N denotes the total number of documents in a document set, the inverse document frequency (idf) of a term t is defined as:

    \mathrm{idf}_t = \log \frac{N}{\mathrm{df}_t}

This frequency definition gives a high score for a rare term, but a low score for a term that appears in many documents, e.g. stop words.

Together, these two weights form the term frequency-inverse document frequency (tf-idf) weighting scheme, given by:

    \text{tf-idf}_{t,d} = \mathrm{tf}_{t,d} \times \mathrm{idf}_t

for a term t in a document d. This gives a score that is high when a term occurs many times in a small number of documents and low when a term occurs in many of the documents in the document set [27].
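As an illustration of the scheme, consider the following sketch (the class and method names are our own, and the documents are assumed to be pre-tokenized term lists; this is not code from the thesis system):

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;

    // Illustrative tf-idf sketch. Assumes 'doc' is one of the documents in 'corpus'.
    class TfIdf {
        static Map<String, Double> weights(List<String> doc, List<List<String>> corpus) {
            Map<String, Double> w = new HashMap<>();
            int n = corpus.size();                                   // N: total number of documents
            for (String term : new HashSet<>(doc)) {
                long tf = doc.stream().filter(term::equals).count(); // term frequency in doc
                long df = corpus.stream().filter(d -> d.contains(term)).count(); // document frequency
                w.put(term, tf * Math.log((double) n / df));         // tf-idf = tf * log(N / df)
            }
            return w;
        }
    }

A stop word such as the appears in nearly every document, so df approaches N and its weight collapses towards zero, which is exactly the balancing effect described above.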

2.1.2 The Vector Space Model

The vector space model in the context of document processing refers to the representation of a document set as vectors in a common vector space. Each document


vector can be computed using the tf-idf weighting scheme, where each axis in the vector space represents a term. The vector space model can be used to measure the similarity between documents: the closer two document vectors are, the more similar the content. Proximity in the vector space model is often given by the cosine similarity score, where the angle between the vectors is measured. See Figure 2.1 for an illustration of the cosine angle. A document set is often represented by all its document vectors in a term-document matrix. Each term in the dataset is represented by a row in the matrix and each document is represented by a column. Thus, each matrix cell a_ij contains some type of score (perhaps the tf-idf score) for the term i in the document j. Using this model, one can measure the similarity of two documents by computing the cosine similarity of their document vectors, i.e., their columns in the term-document matrix. It is a model commonly used when ranking retrieved documents in a search system. The term-document matrix is constructed from search result documents, and the query is represented by a vector constructed as though it were a document vector. The query vector is then multiplied with the term-document matrix in order to get the cosine score between the query vector and each document vector. The higher the score, the stronger the resemblance, allowing the documents to be sorted according to similarity to the query [27].
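For illustration, assuming two documents are already represented as tf-idf vectors over the same term ordering, the cosine similarity reduces to a normalized dot product (a sketch of the standard formula, not the thesis system's code):

    // Cosine similarity between two document vectors sharing a term ordering.
    static double cosineSimilarity(double[] d1, double[] d2) {
        double dot = 0, norm1 = 0, norm2 = 0;
        for (int i = 0; i < d1.length; i++) {
            dot   += d1[i] * d2[i];
            norm1 += d1[i] * d1[i];
            norm2 += d2[i] * d2[i];
        }
        return dot / (Math.sqrt(norm1) * Math.sqrt(norm2));          // cos(theta)
    }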

2.1.3 Latent Semantic Indexing

The vector space representation of the term-document matrix comes with some disadvantages, in that similarity is only measured through documents containing the same terms. There are many ways to describe a topic, and the model fails to address such problems as synonymy (car - automobile) and polysemy (charge - steed or electron?). The term-document matrix can therefore be assumed to contain a lot of different words that appear in the same context and have the same meaning. To minimize this sort of noise, Latent Semantic Indexing (LSI) can be applied. This method finds a low-rank approximation to the original term-document matrix by using a mathematical technique called Singular Value Decomposition (SVD). SVD decomposes the matrix into two orthogonal matrices and one diagonal matrix, where the diagonal matrix contains the singular values of the term-document matrix. Setting all the singular values except the top k to 0 lets us retain a rank-k approximation of the term-document matrix when composing the matrices back into a single matrix [27][18].
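In symbols, with A denoting the term-document matrix, the decomposition and its rank-k truncation can be written as:

    A = U \Sigma V^T, \qquad A_k = U_k \Sigma_k V_k^T

where \Sigma_k keeps only the k largest singular values of the diagonal matrix \Sigma, and U_k and V_k keep the corresponding columns of the two orthogonal matrices.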

2.1.4 Suffix Arrays

A suffix array is a simple data structure introduced by Manber and Myers in 1990 [26]. A suffix in this context denotes a substring of a string that starts at some position of the string and ends at the very last character. The data structure stores all suffixes of a text efficiently in an array, by storing in each cell of the array an integer value denoting where in the text the suffix starts. The suffixes are sorted according to alphabetical order. In the search result clustering algorithm Lingo [30] (described in Section 2.3.3), suffix arrays are used to find frequent phrases [26].
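A naive construction illustrates the idea (our own sketch; Manber and Myers [26] give far more efficient constructions):

    import java.util.Arrays;

    // Naive suffix array: sort suffix start positions by the suffixes they denote.
    // O(n^2 log n) because of the substring comparisons.
    static Integer[] suffixArray(String text) {
        Integer[] sa = new Integer[text.length()];
        for (int i = 0; i < sa.length; i++) sa[i] = i;   // one entry per suffix start
        Arrays.sort(sa, (a, b) -> text.substring(a).compareTo(text.substring(b)));
        return sa;  // sa[r] = start position of the r-th suffix in alphabetical order
    }

For example, suffixArray("banana") yields [5, 3, 1, 0, 4, 2], corresponding to the sorted suffixes a, ana, anana, banana, na, nana.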

2.1.5 Mean Reciprocal Rank

The mean reciprocal rank (MRR) is a measurement that can be used for evaluating any process that returns an ordered list of possible correct answers; a query completion system is an example of such a process. The reciprocal rank for a single query is defined as the inverse of the rank of the first correct answer in the returned list. MRR measures the average of the reciprocal ranks of results over a sample of queries. This means that the score for an individual query can take only as many values as the size of the returned list, and no credit is given for more than one correct answer [36].
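For a sample Q of queries, where rank_i denotes the rank of the first correct answer for the i-th query, this amounts to:

    \text{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\text{rank}_i}

If, say, the correct completion appears at rank 1 for one query and at rank 4 for another, the MRR is (1/1 + 1/4)/2 = 0.625.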

2.1.6 Query Completion vs. Query Suggest

While query completion is sometimes referred to as query suggestion or variations thereof, Kato et al. make the following distinction between query suggestion and query completion:

“... while query suggestion provides a static list of possible queries given a complete initial query, query completion aims to provide a dynamic list of possibilities given a prefix string of an initial query, often within the search query box.” [23]

This is the distinction we will make in this paper as well.

2.2 Search Result Clustering

Clustering is the grouping of items according to some measure of similarity [19]. When speaking of document clustering, these items are a set of texts or documents. The aim of document clustering algorithms is to group a set of documents into subsets that are distinctively different from one another and coherent internally [27]. Document clustering can be applied to group search results. The assumption behind document clustering in the area of information retrieval is that similar documents will tend to be relevant to the same queries [17], and there exist several search engines, both commercial and otherwise, that have integrated this functionality into their systems. See [4] [2] for two examples.

In A Survey of Web Clustering Engines [16], Carpineto et al. discuss and compare a number of different web clustering search engines. They define search result clustering as grouping the search results returned by a search engine into a hierarchy of labelled clusters.

According to Carpineto et al., clustering engines are usually seen as complementary to regular search engines, such as the one seen in Figure 1.1. They identify three situations in which clustering engines can complement search engines:


• Fast subtopic retrieval The user can easily locate a document belonging to a subtopic if the document is placed within the cluster of a larger topic.

• Topic exploration Informational searches in unknown or dynamic domains can benefit from getting a visual overview of the topic at hand.

• Alleviating information overlook Clustering engines summarize the content of many search results and thereby relieve the user of the work needed to browse through several search results.

Facets are another form of grouping documents. Here the data is organized beforehand according to a manually constructed category system with meaningful labels, as opposed to clustering's automatically calculated labels. When comparing clustering to facets, Hearst [19] emphasizes that clusters have the advantage over facets of being fully automatable and applicable to any text collection without having to be labelled by hand, but come with the problem of users finding it difficult to understand what they contain. There is a lack of predictability in search result clustering implementations, and the list of clusters is often incomplete and inconsistent.

2.2.1 Challenges

Although there is a clear relation between regular document clustering and search result clustering, Carpineto et al. [16] argue that the latter poses unique challenges concerning both the effectiveness and the efficiency of the underlying algorithms. One of these challenges is the issue of cluster labelling. The labelling of clusters was of somewhat lesser importance in earlier research on document clustering, and its new-found importance led to the development of new algorithms that emphasize expressive labels rather than optimal clusters. Other challenges for Web clustering engines include:

• Computational efficiency A common feature of most current clustering search engines is that they do not maintain their own index of documents, but rather rely on the search results from one or more publicly accessible search engines. The efficiency of the acquisition of search results is therefore more of an issue than the efficiency of the cluster construction algorithm.

• Short input data description Search engines create snippets of each document at each search by assembling two or three spans of text occurring in the document that best match the query. This method of creating snippets is called kwic (keyword in context) [38]. The data used for clustering often amount to a URL, an optional title and a short snippet of the document.

• Unknown number of clusters Many of the classic document clustering methods require the number of clusters as input. This is a difficult number to foresee when dealing with search result clustering.


• Overlapping clusters The same document can be assigned to different categories.

• Graphical user interfaces The graphical user interface should allow interactive browsing and reflect the clusters and their relationships in an informative way.

Similarly, Zamir & Etzioni [40] identify two additional requirements for Web document clustering methods:

• Relevance Documents relevant to the user's query should be grouped separately from irrelevant ones.

• Browsable summaries The user needs to be able to quickly determine whether the contents of a cluster are relevant.

2.3 Search Result Clustering - Implementations

There are several different approaches one can take when clustering search results. Carpineto et al. [16] define three types of algorithms for search result clustering:

• Data-centric algorithms These algorithms focus on the data within the clusters, rather than trying to construct meaningful labels. The algorithms used here are very often conventional data clustering algorithms. The clusters are typically labelled by keywords extracted directly from the centre of the cluster.

• Description-aware algorithms These algorithms are aware of the description labelling problem and try to construct descriptions that can be interpreted by a user.

• Description-centric algorithms These algorithms are designed specifically for clustering search results. The quality of the label is usually given higher priority than that of document allocation. Description-centric algorithms are often found in commercial search result clustering systems such as Vivísimo and Carrot Search.

In the same way, Hearst [19] distinguishes between clustering via inter-document similarity and clustering according to a shared common term. Inter-document similarity clustering groups documents by overall similarity, and can thus be said to correspond to Carpineto et al.'s data-centric algorithms. Discussing the drawbacks of this approach, Hearst reasons that clustering by inter-document similarity can easily result in cluster labels of varying levels of description, due to the unsupervised nature of clustering. Another problem is that documents can be similar to each other in different ways, and there are therefore many different aspects in which documents can be grouped. As an example, Hearst mentions the auto industry: should articles


about Japanese versus German car emissions be grouped by geographical location or into an alternative energy group? Clustering according to shared common terms seems more promising. While these types of algorithms are more likely to construct understandable labels, however, the labels do not always correspond to the contents of the cluster [20] [19].

2.3.1 Scatter/Gather

An example of a data-centric system is the Scatter/Gather project, the best known and earliest research on document clustering for search user interfaces [19]. Scatter/Gather [17] was described by its inventors, Cutting et al., as a browsing method made to navigate a collection of documents. It was expected to help users formulate their search requests or organize search results, rather than be used to find a particular document.

The system initially performed a clustering on the entire document set (scattering the data). From the textual summaries of these clusters, the user could select those relevant to him or her, leaving the system to gather and re-cluster the chosen documents. This was repeated until there were sufficiently few documents to choose from.

Figure 2.2. An illustration of the Scatter/Gather system. [17]

Algorithm The method uses agglomerative hierarchical clustering (AHC), which works by putting each document in its own cluster and then iteratively merging the clusters that are closest to each other until k clusters are reached; a naive sketch of this loop is given below. Scatter/Gather also uses heuristics called Buckshot and Fractionation to quickly find centres (or seeds) in the vector space model [16].

The labels are determined in the following way: the documents whose term vectors, or profiles, are closest to the cluster's centroid are put in their own set. These profiles are then summed up, creating a trimmed sum profile. From this, the set of most frequently occurring keywords in the trimmed sum's documents (cluster digests) is extracted [17].
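The merge loop itself can be sketched naively as follows (our own illustration using centroids and Euclidean distance; the actual Scatter/Gather implementation relies on the heuristics and trimmed profiles described above):

    import java.util.ArrayList;
    import java.util.List;

    // Naive AHC: start from singleton clusters, repeatedly merge the closest pair.
    class Ahc {
        static List<List<double[]>> cluster(List<double[]> docs, int k) {
            List<List<double[]>> clusters = new ArrayList<>();
            for (double[] d : docs) {
                List<double[]> c = new ArrayList<>();
                c.add(d);
                clusters.add(c);                               // one singleton cluster per document
            }
            while (clusters.size() > k) {
                int bi = 0, bj = 1;
                double best = Double.MAX_VALUE;
                for (int i = 0; i < clusters.size(); i++)      // find the closest pair of clusters
                    for (int j = i + 1; j < clusters.size(); j++) {
                        double d = dist(centroid(clusters.get(i)), centroid(clusters.get(j)));
                        if (d < best) { best = d; bi = i; bj = j; }
                    }
                clusters.get(bi).addAll(clusters.remove(bj));  // merge the pair
            }
            return clusters;
        }

        static double[] centroid(List<double[]> cluster) {
            double[] c = new double[cluster.get(0).length];
            for (double[] v : cluster)
                for (int i = 0; i < c.length; i++) c[i] += v[i] / cluster.size();
            return c;
        }

        static double dist(double[] a, double[] b) {           // Euclidean distance
            double s = 0;
            for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
            return Math.sqrt(s);
        }
    }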

2.3.2 Suffix Tree Clustering

Suffix Tree Clustering (STC) [40] is a description-aware algorithm. In contrast to many document clustering algorithms, STC treats an input document as an ordered sequence of words, not a bag of words. This makes it possible to utilise the proximity information between the words.

Algorithm

1. Document cleaning i.e., preprocessing. This included removing non-letter characters. No stemming was used.

2. Identifying base clusters This step can be seen as creating an inverted index of phrases for the document collection. A generalized suffix tree with words as basic elements is constructed. Each node in the tree holds the information of where in the documents each phrase appears, and represents a base cluster. The base clusters are given scores based on the number of documents they contain and the number of words their phrase contains.

3. Combining base clusters into clusters The base clusters are merged based on a similarity measure of their document sets. They are scored and sorted based on the scores of their base clusters and their overlap.

2.3.3 Lingo

Lingo [30] is an example of a Description-Centric algorithm. It uses a combination of several different concepts such as Vector Space Modelling, Suffix Arrays and Latent Semantic Indexing in order to achieve meaningful cluster descriptions.

Algorithm The algorithm works as follows:

1. Preprocessing The input documents are tokenized, and HTML tags, entities and non-letter characters are removed. This step also includes stemming.

2. Phrase extraction The phrases extracted in this phase are the ones that occur frequently in the document set and are therefore assumed to be the most informative potential cluster labels. The algorithm ensures that the


extracted phrases are as long as possible, do not cross sentence boundaries, and neither begin nor end with a stop word. Stop words that appear in the middle of a phrase are kept, as they can be an important part of the phrase (as in, for example, Ludwig II of Bavaria).

3. Cluster label induction This step has three sub-steps of its own:

a) Term-document matrix construction The matrix is constructed out of single terms that exceed a predefined term frequency threshold. The weight of each term is calculated using the standard term frequency-inverse document frequency (tf-idf) scheme. Scores of terms that appear in the document titles are multiplied by a constant, as such terms are more likely to have greater descriptive power.

b) Abstract concept discovery SVD is applied to the term-document matrix to find the orthogonal basis of its column space, which is the key to finding labels. In order to find the k most significant basis vectors, the original term-document matrix A is approximated by a rank-k matrix A_k. The value k is the smallest value satisfying the following condition:

    \frac{\|A_k\|_F}{\|A\|_F} \geq q

where \|\cdot\|_F denotes the Frobenius norm of a matrix, and q is a percentage of desired quality.

c) Phrase matching and label pruning Cosine similarity is used to measure how close each phrase extracted in the phrase extraction phase is to the abstract concepts obtained from the basis vectors. If the two vectors are close, the human-readable phrase is determined to be worthy of representing the abstract concept, and a term-label matrix is constructed. Labels that are too similar to another label, i.e., whose cosine similarity to another label exceeds a certain label similarity threshold, are removed.

4. Cluster content discovery The term-label matrix and the term-document matrix are multiplied in order to get a measure of similarity between each label and each document. The content of the resulting matrix assigns each document to a cluster label.

5. Final cluster formation Clusters are ranked by a score, calculated by multiplying their label score with the number of documents assigned to that particular cluster [30][38].
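Because the Frobenius norm is determined by the singular values alone (\|A_k\|_F^2 is the sum of the k largest squared singular values), the k of step 3b can be read directly off the SVD. A sketch using EJML, the matrix library the implementation in Chapter 3 relies on (our own illustration, not the project's actual code):

    import org.ejml.simple.SimpleMatrix;
    import org.ejml.simple.SimpleSVD;

    // Choose the smallest k with ||A_k||_F / ||A||_F >= q.
    static int chooseK(SimpleMatrix a, double q) {
        SimpleSVD<SimpleMatrix> svd = a.svd();
        SimpleMatrix w = svd.getW();                 // diagonal matrix of singular values,
        int n = Math.min(w.numRows(), w.numCols()); //  ordered from largest to smallest
        double total = 0;
        for (int i = 0; i < n; i++) total += w.get(i, i) * w.get(i, i);  // ||A||_F^2
        double partial = 0;
        for (int k = 1; k <= n; k++) {
            double s = w.get(k - 1, k - 1);
            partial += s * s;                        // ||A_k||_F^2
            if (Math.sqrt(partial / total) >= q) return k;
        }
        return n;
    }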

2.3.4 Dual C-Means

Dual C-Means is a more recent algorithm developed by Moreno et al. [29] that aims to combine the advantages of the data-centric algorithms with the advantages of the description-centric algorithms. Dual C-Means extends the standard document clustering algorithm K-means, which will be described here for the sake of readability: the algorithm works in the vector space, and starts by selecting a set of K initial cluster centers, or seeds. It then assigns each document to the nearest seed; this is called the assignment step. The next step is the update step. Here, new cluster centers are calculated and the documents are reassigned. The algorithm iterates through the two steps until a terminating condition is satisfied.

When discussing in what way Dual C-Means improves on its predecessors, the authors define three important factors that determine the quality of a search result clustering algorithm: clustering accuracy, labelling quality and partitioning shape.

Algorithm Dual C-Means can be seen as an extension of the K-means algorithm to two representation spaces. It combines the vector space model representation with the query log based representation of documents to retrieve the clusters and cluster labels. The authors model a dissimilarity measure d between any data point in the search result set and any cluster centroid in the query log set. The object of the algorithm is to form clusters in the search result set by assigning each document to a query log set centroid while minimizing this dissimilarity measure. They do this by iterating over the following two steps until the dissimilarity measure converges:

1. Update: Given an initial fixed partition π of size c in the search result set, calculate the corresponding optimal mean representatives of these partitions in the query log set.

2. Assignment: Construct a new partition in the search result set by assigning each document to the query log mean that minimizes the dissimilarity measure for each document.

Here the query log centroids represent the cluster labels, so a separate analysis of the clusters after they have been determined is not needed.
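For reference, the single-space K-means that Dual C-Means generalizes can be sketched as follows (our own illustration, using Euclidean distance and a fixed iteration count as the terminating condition):

    // Minimal K-means: alternate assignment and update steps.
    static int[] kMeans(double[][] docs, double[][] seeds, int iterations) {
        int k = seeds.length, dim = docs[0].length;
        int[] assign = new int[docs.length];
        for (int it = 0; it < iterations; it++) {
            // Assignment step: each document goes to the nearest centre.
            for (int d = 0; d < docs.length; d++) {
                int best = 0;
                double bestDist = Double.MAX_VALUE;
                for (int c = 0; c < k; c++) {
                    double dist = 0;
                    for (int i = 0; i < dim; i++) {
                        double diff = docs[d][i] - seeds[c][i];
                        dist += diff * diff;
                    }
                    if (dist < bestDist) { bestDist = dist; best = c; }
                }
                assign[d] = best;
            }
            // Update step: recompute each centre as the mean of its documents.
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (int d = 0; d < docs.length; d++) {
                counts[assign[d]]++;
                for (int i = 0; i < dim; i++) sums[assign[d]][i] += docs[d][i];
            }
            for (int c = 0; c < k; c++)
                if (counts[c] > 0)
                    for (int i = 0; i < dim; i++) seeds[c][i] = sums[c][i] / counts[c];
        }
        return assign;
    }

Dual C-Means keeps this alternation but computes the update step in the query log space and the assignment step in the search result space, linked by the dissimilarity measure d.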

2.4 Interactive query expansion

Dynamic query completion is an intermediate solution between requiring users to think of terms themselves and having them choose from a dictionary of term suggestions [19]. Variations of query completion are offered by all major search engines today [12]. Shah et al. define two types of query completion methods: the first is query expansion, in which the system automatically expands the query with search terms; the second is interactive query suggestion, in which the user interacts with the system, choosing expansion terms and maintaining some control over the search process [32].


2.4.1 Methods

Query completion

Suggest-as-you-type, or query auto-completion (QAC), is, according to Bar-Yossef et al., a tool that offers a user suggestions as he or she types in the search box. A QAC algorithm takes the sequence of characters in the search box, often only a prefix of the complete query that the user intends to enter, and returns a list of completions. Most of these algorithms have access to a database of queries mined from query logs. If the search engine does not have enough traffic, however, the database is constructed from the search material. This is done at index time. The algorithm uses a data structure such as a TRIE or a hash table for quick and efficient look-ups at runtime [12].

MostPopularCompletion The most popular algorithm for QAC, according to Bar-Yossef et al., is what they call MostPopularCompletion. The ranking of the query suggestions is determined by the frequency with which the suggestion appeared in the database. In other words, for an input query x, the algorithm returns the most frequent queries in the database that have x as a prefix [12].
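A minimal sketch of the idea (the class and method names are ours, not from [12]; production systems use a TRIE or similar structure rather than a sorted map):

    import java.util.List;
    import java.util.Map;
    import java.util.TreeMap;
    import java.util.stream.Collectors;

    // MostPopularCompletion sketch: count logged queries, then rank all
    // logged queries sharing the typed prefix by frequency.
    class MostPopularCompletion {
        private final TreeMap<String, Integer> queryLog = new TreeMap<>();

        void record(String query) {
            queryLog.merge(query, 1, Integer::sum);
        }

        List<String> complete(String prefix, int n) {
            // Every key in [prefix, prefix + '\uffff') starts with the prefix.
            return queryLog.subMap(prefix, prefix + Character.MAX_VALUE)
                    .entrySet().stream()
                    .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                    .limit(n)
                    .map(Map.Entry::getKey)
                    .collect(Collectors.toList());
        }
    }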

NearestCompletion In an attempt to improve QAC for very short queries (i.e., 1-2 characters long), Bar-Yossef et al. introduced the NearestCompletion algorithm. The algorithm uses the user's previous queries as context for the input query, letting these queries influence the resulting list of query completions. To evaluate the algorithm, MRR was used as the measurement (see Chapter 2.1.5). It was shown that the NearestCompletion algorithm's MRR was on average 48% higher relative to the MRR of the MostPopularCompletion algorithm with a single character as input. However, this was only true while the context was relevant to the current query; when the context was irrelevant, NearestCompletion's MRR was essentially zero. The authors then introduced a hybrid of the two, which performed much better: on average, its MRR was 31.5% better relative to MostPopularCompletion [12].

Temporal The problem with relying on old query logs is that queries relating to ongoing events are likely to be under-represented and will thus get a lower score than they should. To address this issue, Whiting & Jose [39] propose methods that pay heed to trending as well as consistent queries. These methods include: (1) using a sliding window of a fixed number of days for query log evidence, (2) using a sliding window for query log evidence where the number of days depended on the popularity of the prefix, (3) using time-series modelling to predict the next N query distributions. The results showed that relying on only recent query popularity evidence gave a small but consistent MRR gain for 2-character prefixes [39].


2.4.2 Tools

Two studies of commercial tools that implement query suggest and/or QAC can be found below:

Prisma

A study was conducted on the logs of AltaVista’s Prisma to assess the effects of using terminological feedback for query refinement. The tool was embedded directly into the standard AltaVista search results page and presented the users with relevant terms that could potentially help the user refine her query. Clicking on the terms launched a new search for the combined terms. Phrases which contained a query term were displayed first, the following two columns were other multi-word phrases and the last column was dedicated to single word terms. The study found that users preferred choosing phrases that contained an original query term.

Anick classified the refinements the users were making with the tool into eleven categories. Among these, head (adding a linguistic head term to the original query, e.g., triassic → triassic period), modifier (adding a linguistic modifier to a term in the original query, e.g., buckets wholesale → plastic buckets) and elaboration (an entirely new phrase which restricts or adds further context to the original query) were the most popular. After merging one of the eleven categories with elaboration, they concluded that the top three categories represented 68% of the users' refinements. Despite having access to the tool, the study found that a vast majority of reformulations were done manually by the users. The users who did use the tool preferred phrases that modified terms in the original query [8].

Yahoo’s Search Assist

A longitudinal log-based study was conducted on the logs of Yahoo's Search Assist, which offers two separate refinement tools. The first tool, Gossip, offers suggest-as-you-type expansions on a character basis. These suggestions are based on frequently occurring queries, mined from query logs. The other tool, Viewpoint, suggests related concepts after the query has been executed. These suggestions come from an analysis of the top search results. The study was conducted on four distinct days over a period of 17 weeks with 100 000 users.

The dynamic suggestions in the Yahoo Search Assist tool were clicked on in 30-37% of the sessions. Through retrospective interviews, the study also found that the Gossip suggestions gave most users confidence in their search choices, and that the Viewpoint suggestions helped them reflect on other ways of expressing their needs. Another interesting finding was that the users' engagement with the query assistant increased over the course of the study, which suggested to the authors that it takes time to get comfortable with newly incorporated features in a search engine [9].


2.4.3 Evaluation of Query Completion Systems

There are different approaches one can take when evaluating a query completion system. Some studies in query expansion have taken a traditional information retrieval approach to their evaluation, using standardized test data for information retrieval, such as that from TREC (Text REtrieval Conference). TREC data comes with documents, detailed information needs (or topics) and relevance judgements for these topics [27]. This paradigm focuses on the retrieval aspect of information retrieval, i.e., the relevance of the search results, through such measures as precision

    \text{Precision} = \frac{\#(\text{relevant items retrieved})}{\#(\text{retrieved items})} \qquad (2.1)

and recall [27]

    \text{Recall} = \frac{\#(\text{relevant items retrieved})}{\#(\text{relevant items})} \qquad (2.2)

However, Shah et al. argue that:

“With research in IR expanding to take a broader perspective of the in-formation seeking process to explicitly include users, tasks and contexts in a dynamic setting rather than treating information search as static or as a sequence of unrelated events, the traditional evaluation approach is not appropriate in many cases.” [32]

Thus many studies have deviated from the traditional approach.

In the situation where the objective of the system in question is to make the search process more effective, and save the user the keystrokes needed to formulate their already thought-out queries, old query logs can be used to measure the predictive power of the algorithms. See [33] [34] for examples.

In other cases, e.g., when the system’s objective is to help the user formulate their query, a user study is often performed. Bhatia et al. [13] performed a user study where they let their users score the query completion terms generated from a sample of TREC queries in terms of meaningfulness. These scores where then compared to the scores of terms generated by two baseline methods. Ji et al. [22] deployed prototypes of their fuzzy search system to measure the efficiency of their algorithm.

2.5 Studies on Search Behaviour

2.5.1 User Search Goals

Broder [15] classified a user’s search intent into the following three types:


• Informational - The user is looking for information on a topic. The topic could be broad or niche, and the search intent could therefore have thousands of relevant search results, or only a couple.

• Navigational - The user wants to find one particular document. He or she wants to use the search engine to navigate to a specific website. Here the search intent only has one relevant search result.

• Transactional - The user wants to perform a specific action, such as booking a flight or downloading an open source text editor.

Rose & Levinson [31] introduced a framework of search goals which expanded on this list. The label transactional was changed to the more general term resource, which could represent queries such as beatles lyrics. See Table 2.1 for a complete outline of the framework. Rose & Levinson used the framework to manually categorize three sets of approximately 500 U.S. English queries, randomly selected from AltaVista's query logs. The results of the study were: 61.3-63.0% of the queries were informational, 21.7-27.0% were resource and 11.7-15.3% were navigational.

2.5.2 Search Strategies

A user study [10] regarding the strategies experienced web users have for information search and re-access found that some of the popular strategies to this end include:

• Having multiple browser windows or tabs open while searching. This strategy allows the user to do several things at once, for example, go through the result list while slow pages download.

• Using a search engine for re-accessing information. This could be problematic, however, as the query terms needed to retrieve the specific documents have to be remembered.

• Typing the URL in the URL field to re-access information.

• Saving documents as local files to re-access information.

• 52.3 % of the users had an incorrect understanding of the default operator or no idea of what it was.

• Some of the users used categorizing search engines along with their primary search engine as the categorizing helped them get an overview of the results set and the topic. It was also said to be useful in supplying additional search terms.

• Experience did not alter the way a user formulated queries, but increased the imaginative use of search engines.


A later study conducted by Aula et al. [11] showed that when formulating queries, users often use the main facets from the task description as keywords for the query. If this is unsuccessful, however, many users abandon the keyword approach and use natural language instead. In general, they often pick an initial strategy and make only small changes to the original query when reformulating, giving up and changing approach entirely if the initial strategy proves unsuccessful. For difficult search tasks, this can happen several times. Regarding the length of the queries, users tended to lengthen their queries towards the end of their search session for easy tasks, while for difficult tasks, the longest query occurred in the middle of the search session.

2.5.3 How are we Searching?

A study conducted by Jansen & Spink [21] summarized and compared the results of nine major Web searching studies. These studies were conducted on the logs of five different Web search engines based in either Europe or the US, with data ranging from 1997 to 2002. Although this study can be said to be outdated, some observations were made on the trends of Web search which may hold true even today:

• Query lengths are not increasing as measured by number of terms.

• The percentage of users who only view the first search result page is extremely high; users prefer to reformulate the query. This is consistent with [32].

• Web search topics were changing. The users were moving towards using the Web as a tool for information and commerce, rather than entertainment.

2.5.4 When do People use Query Suggestion?

The study conducted by Kato et al. [23] found that people use query suggestion, i.e., refine their query with suggestions after they have viewed the results of the original query:

1. When the original query is an infrequent query

2. When the original query is a single term

3. When query suggestions are unambiguous

4. When query suggestions generalize or correct the original query


2.5.5 Interaction with Dynamic Query Completion

Dynamic query completion is a relatively new feature and has only recently become the norm for search engines. A study conducted by Shah et al. [32] aimed to explore the effects of both dynamic query completion (what the referenced paper referred to as query suggestion) and dynamic search results on users' search behaviours. 36 participants were assigned to one of the following three conditions:

• Query-completion off, Google Instant off

• Query-completion on, Google Instant off

• Query-completion on, Google Instant on

The results showed that query completion did not lead to longer queries or more concepts in the queries. Neither did it change users' behavioural patterns. It did, however, lead to less user time spent on the search page and in reformulating queries. In general, if users could not find what they were looking for on the first search result page, they were almost five times more likely to reformulate their queries than to visit a second page. The results also showed that the dynamic search interface exposed users to more information relating to their requests, which the authors thought could be useful for learning and sense-making.


1. Navigational
   My goal is to go to a specific known website that I already have in mind. The only reason I'm searching is that it's more convenient than typing the URL, or perhaps I don't know the URL.
   Examples: aloha airlines; duke university hospital; kelly blue book

2. Informational
   My goal is to learn something by reading or viewing web pages.

2.1 Directed
   I want to learn something in particular about my topic.

2.1.1 Closed
   I want to get an answer to a question that has a single, unambiguous answer.
   Examples: what is a supercharger; 2004 election dates

2.1.2 Open
   I want to get an answer to an open-ended question, or one with unconstrained depth.
   Examples: baseball death and injury

2.2 Undirected
   I want to learn anything/everything about my topic. A query for topic X might be interpreted as "tell me about X".
   Examples: color blindness; jfk junior

2.3 Advice
   I want to get advice, ideas, suggestions, or instructions.
   Examples: help quitting smoking; walking with weights

2.4 Locate
   My goal is to find out whether/where some real world service or product can be obtained.
   Examples: pella windows; phone card

2.5 List
   My goal is to get a list of plausible suggested web sites (i.e., the search result list itself), each of which might be a candidate for helping me achieve some underlying, unspecified goal.
   Examples: travel; amsterdam universities; florida newspapers

3. Resource
   My goal is to obtain a resource (not information) available on web pages.

3.1 Download
   My goal is to download a resource that must be on my computer or other device to be useful.
   Examples: kazaa lite; mame roms

3.2 Entertainment
   My goal is to be entertained simply by viewing items available on the result page.
   Examples: live camera in l.a.

3.3 Interact
   My goal is to interact with a resource using another program/service available on the web site I find.
   Examples: weather; measure converter

3.4 Obtain
   My goal is to obtain a resource that does not require a computer to use. I may print it out, but I can also just look at it on the screen. I'm not obtaining it to learn some information, but because I want to use the resource itself.
   Examples: free jack o lantern patterns; ellis island lesson plans

Table 2.1. A complete table of the framework proposed by Rose & Levinson.


Chapter 3

The Search System

In order to reach the main goal of this thesis, which is to investigate the suitability of clustering search results as a means of coming up with query suggestions, a search system was designed. This system includes a search engine, a query completion feature that implements the search result clustering algorithm Lingo, and lastly a simple graphical user interface (GUI). In the following sections the reader will find a thorough description of the system, as well as of the resources used to implement it.

3.1 Resources

All the tools, components and libraries used by the search system are listed below.

3.1.1 Tools and Components

Apache Solr

Solr is an open source search platform written in Java. Its many features include advanced full-text search, with capabilities including phrases, wildcards and joins. It runs as a standalone server and has HTTP-like APIs which allow users to query it via GET requests [1]. Solr is used as the search engine for the system and provides a lot of the system's functionality, including the following (an example request combining these settings follows the list):

• Indexing and storing the search documents From the articles we extract, index and store document id, text, and title. During indexing, the texts are stemmed using a Porter stemmer. Stop words (see Appendix A) are omitted.

• Ranking the search results Many Wikipedia articles are short articles, redirecting the user to other articles that the article title may concern. Because of Solr's tf-idf scoring system, the default configuration led to these short articles being ranked unreasonably high. The ranking can be configured to boost certain fields. Our system is configured to put an extra emphasis on the title, letting a hit in the title be worth a lot more than a hit in the text. We have also made the default operator of the search engine the AND operator: all words in the search query have to be present in the text for it to be returned as a search result. The combination of these two configurations solves the problem of short, Wikipedia-specific, "X may refer to:" articles being ranked higher than articles with real content.

• Highlights Solr can also be configured to produce highlights, or snippets, of the search result documents. The highlighted texts include the query terms as well as the words surrounding the query terms up to a certain proximity. Using these highlights as input to the algorithm proved more effective than using the full texts, as it dramatically decreased the amount of noise. Limiting the input text to only include highlights also sped up the analysis part of the system.
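To make the configuration concrete, a request along the following lines combines the field boosting, the AND operator and the highlighting described above. The core name, boost factor and query terms are illustrative assumptions, not the project's recorded configuration:

    http://localhost:8983/solr/wikipedia/select?q=jaguar+car
        &defType=edismax          (query parser supporting field boosts)
        &qf=title^10+text         (a title hit weighs ten times a text hit)
        &q.op=AND                 (all query words must be present)
        &hl=true&hl.fl=text       (return highlighted snippets of the text field)
        &rows=100&wt=json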

Solr Suggester

Modern versions of Apache Solr come with a query completion component called Suggester that can be configured to suit many needs. It takes the prefix of a query and performs a look-up in a dictionary. One way to represent the dictionary is as a weighted finite state automaton (WFST), where the weight of each term is appended to the front of the term. A finite state automaton is constructed from the sorted terms, in which the root node has arcs labelled with all possible weights. At look-up time, the component traverses the prefix given by the partial query along the arcs with the highest weight, and then walks the n shortest paths to find the top-ranked suggestions. If it fails to find n suggestions at the highest weight, it traverses the lower weights until it does. See Figure 3.1 for an example [37][3]. This component is used as the baseline implementation for our evaluation.

Figure 3.1. An illustration of the WFST representation, taken from [37]. The automaton is built from the four weighted terms. During look-up, the algorithm traverses the given prefix (fou), then walks the n shortest paths to retrieve the top-ranked suggestions.


Findwise Jellyfish

Jellyfish acts as a service layer between the front-end and the back-end of a search system and can be integrated with a large number of different search platforms [5]. In practice, Jellyfish is used to help construct the input query to the Solr search engine. It makes it possible to tweak certain parameters, such as the number of documents to be displayed on the search page, and returns the search results in a structured, easily parsed JSON format.

3.1.2 Libraries

Efficient Java Matrix Library

Efficient Java Matrix Library (EJML) is a linear algebra library that allows for optimized matrix calculations [7]. The EJML library is used for the many difficult matrix computations required by the algorithm, such as singular value decomposition, matrix multiplication and finding the Frobenius norm.

Gson

Gson is a Java library which enables the conversion of Java objects into their JSON representation and vice versa [6]. The Gson library is used to parse the JSON objects returned by Solr, such as the titles of the articles, and turn them into Java Strings.
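As an illustration (our own sketch, not the project's code), extracting the article titles from a Solr JSON response, whose standard layout is {"response": {"docs": [...]}}, might look like this with a recent Gson version:

    import com.google.gson.JsonArray;
    import com.google.gson.JsonElement;
    import com.google.gson.JsonObject;
    import com.google.gson.JsonParser;

    // Pull the title of every returned document out of a Solr response.
    static void printTitles(String json) {
        JsonObject root = JsonParser.parseString(json).getAsJsonObject();
        JsonArray docs = root.getAsJsonObject("response").getAsJsonArray("docs");
        for (JsonElement doc : docs) {
            System.out.println(doc.getAsJsonObject().get("title").getAsString());
        }
    }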

3.1.3 Dataset

The dataset used as search data for the system is the latest dump of English Wikipedia. It was downloaded on October 30th, 2014 and contained current revisions of the articles, excluding talk and user pages.

Structure

The document set is a huge XML file, expanding to over 45 GB, and has the following syntax:

    <mediawiki xmlns="http://www.mediawiki.org/xml/export-0.9/"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.9/
                            http://www.mediawiki.org/xml/export-0.9.xsd"
        version="0.9" xml:lang="en">
      <siteinfo>
        Some site info
      </siteinfo>
      <page>
        <title> A very nice title </title>
        <ns>0</ns>
        <id>...</id>
        <revision>
          <id>628700936</id>
          <parentid>628699619</parentid>
          <timestamp>2014-10-07T20:35:08Z</timestamp>
          <contributor>
            <username>Username</username>
            <id>07734</id>
          </contributor>
          <comment>A comment about the latest revision</comment>
          <text xml:space="preserve"> A lot of text with [[links]] and
            references &lt;ref&gt;{{cite book |last=Andersson|
            first=Amelia |title=A reference |place= Stockholm|publisher=
            Amelia's publisher| year= 1999| isbn= 1337}}&lt;/ref&gt; </text>
          <sha1>a code</sha1>
          <model>wikitext</model>
          <format>text/x-wiki</format>
        </revision>
      </page>
      ... Many other pages ...
    </mediawiki>

The title, text and page id elements above are the contents of the fields stored in Solr: the title element gives us the title of the article, the text element gives us the article's text, and the page id gives us the unique key we need to identify the page.

Preprocessing

As we can see from the example above, the downloaded input data contains a lot of characters and strings that have nothing to do with the actual text of the article, such as links and references. As these types of information are useless to us, we employ a parser written by Bouda (see [14]). Some modifications were needed to extract the three fields we needed (id, title and text) and print the output in the format we wanted, but the parser proved very effective in filtering out the data we did not want. It returns a concatenated XML file, much like the one downloaded from Wikipedia, which can then easily be indexed into Solr.

3.2 Query Completion

The implementation takes a partial query as input and uses it to build a new query via Jellyfish. The new query is then used to query Solr. The first 100 snippets returned from Solr are collected and used as input for the clustering algorithm.
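The flow can be summarized as in the sketch below. Jellyfish's and Lingo's real interfaces are internal to the system, so the types and method names used here are illustrative placeholders, not actual APIs:

import java.util.List;

public class CompletionPipeline {

    interface QueryBuilder { String build(String partialQuery, int rows); }
    interface SearchClient { List<String> fetchSnippets(String solrQuery); }
    interface Clusterer    { List<String> clusterLabels(List<String> snippets); }

    private final QueryBuilder jellyfish;
    private final SearchClient solr;
    private final Clusterer lingo;

    CompletionPipeline(QueryBuilder jellyfish, SearchClient solr, Clusterer lingo) {
        this.jellyfish = jellyfish;
        this.solr = solr;
        this.lingo = lingo;
    }

    // Partial query in, ranked completion list out.
    List<String> complete(String partialQuery) {
        String solrQuery = jellyfish.build(partialQuery, 100); // request 100 documents
        List<String> snippets = solr.fetchSnippets(solrQuery);
        return lingo.clusterLabels(snippets); // cluster labels become completions
    }
}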


3.2.1 Clustering algorithm

The algorithm used for clustering search results is Lingo. Lingo was chosen because, when tested against other types of clustering algorithms (both data-centered and description-aware) by Carpineto et al. [16], it performed well in terms of computational efficiency, which is important in QAC, and label-driven subtopic reach time, which indicates meaningful labeling. The assumption is that if the label is meaningful enough to navigate to the right subtopics, it should help navigation in a query completion situation as well.

Lingo is implemented as closely as possible to the description in Chapter 2.3.3, though some adaptations had to be made to turn it into a query completion algorithm. This section will go through the deviations for each step.

Preprocessing

As mentioned in the previous chapter, the search data was heavily preprocessed before indexing, and therefore not as much preprocessing is needed when executing the actual algorithm. All end-of-sentence characters, such as '!', '.' and newline, are marked for the phrase extraction phase, making sure that frequent phrases do not cross sentence boundaries.

Phrase extraction

This phase did not differ from the original implementation in Chapter 2.3.3 except that heading and trailing stop words are kept. This was decided with the completion approach in mind. During search, a user will perhaps start off by giving a partial query such as Beauty. This partial query may also be part of the phrase Beauty and the Beast. Removing the two stop words from the frequent phrase and the Beast would take away the ability to present the phrase in its entirety to the user. Heading and trailing stop words are therefore not removed until the entire completion phrase has been constructed.

Cluster label induction

Here the implementation deviates slightly from the original description in Chapter 2.3.3: rather than applying a term frequency threshold during term document matrix construction, we decided to simply limit the size of the matrix. This is a more direct way of controlling the efficiency of the algorithm and the quality of the completions.

Cluster content discovery

In this phase, we made some alterations to the original algorithm in order to transform it into a query completion algorithm - the main contribution of this thesis. Here, the term label matrix is multiplied with the term document matrix in order to get a measure of similarity between each label and each document. In the original algorithm this was done to assign each document to a cluster label for the Final cluster formation phase. In our implementation, however, it is done for the purpose of ranking the labels: the number of documents assigned to a label functions as its final score and determines the order in which the labels are presented as query completions (see the sketch below).
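A minimal sketch of this ranking step, assuming Q is the term label matrix (one column per label) and A the normalized term document matrix produced earlier:

import org.ejml.simple.SimpleMatrix;

public class LabelRanking {

    // Scores each label by the number of documents for which it is the
    // best-matching label; this count orders the query completions.
    static int[] scoreLabels(SimpleMatrix Q, SimpleMatrix A) {
        // C(i, j) = similarity between label i and document j.
        SimpleMatrix C = Q.transpose().mult(A);
        int[] score = new int[C.numRows()];
        for (int doc = 0; doc < C.numCols(); doc++) {
            int best = 0;
            for (int label = 1; label < C.numRows(); label++) {
                if (C.get(label, doc) > C.get(best, doc)) {
                    best = label;
                }
            }
            score[best]++; // the document is assigned to its closest label
        }
        return score;
    }
}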

Final cluster formation

This step was skipped entirely, as we had no need to form actual clusters. The terms and phrases returned from Lingo were used as search terms rather than cluster labels.

3.2.2 Cleaning

This step was added to turn the ranked list resulting from the Cluster content discovery phase into a presentable query completion list. The list is sorted according to rank and the completion phrases are appended to the original query. If a phrase contains a term that appears in the initial query, the original term is discarded in favour of the phrase in its entirety. Heading and trailing stop words are then removed from the contents of the list, and the list can be presented to the user. A sketch of this procedure is given below.
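A minimal sketch of the cleaning step under simple assumptions: the labels arrive ranked, a stop word set is available, and matching is done on lower-cased, whitespace-separated tokens:

import java.util.*;

public class CompletionCleaner {

    // Turns ranked cluster labels into presentable completions.
    static List<String> clean(String query, List<String> rankedLabels, Set<String> stopWords) {
        List<String> completions = new ArrayList<>();
        for (String label : rankedLabels) {
            // If the label contains a query term, present the label in its
            // entirety; otherwise append it to the original query.
            String completion = containsQueryTerm(label, query)
                    ? label
                    : query + " " + label;
            completions.add(trimStopWords(completion, stopWords));
        }
        return completions;
    }

    static boolean containsQueryTerm(String label, String query) {
        Set<String> labelTokens =
                new HashSet<>(Arrays.asList(label.toLowerCase().split("\\s+")));
        for (String term : query.toLowerCase().split("\\s+")) {
            if (labelTokens.contains(term)) {
                return true;
            }
        }
        return false;
    }

    // Removes heading and trailing stop words from a completion phrase.
    static String trimStopWords(String phrase, Set<String> stopWords) {
        LinkedList<String> tokens =
                new LinkedList<>(Arrays.asList(phrase.split("\\s+")));
        while (!tokens.isEmpty() && stopWords.contains(tokens.getFirst().toLowerCase())) {
            tokens.removeFirst();
        }
        while (!tokens.isEmpty() && stopWords.contains(tokens.getLast().toLowerCase())) {
            tokens.removeLast();
        }
        return String.join(" ", tokens);
    }
}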

3.2.3 Presenting the Completion List

Finally, the query completion list is passed to the graphical user interface. See the following section for details on this.

3.3 Practical implementation

To be able to demonstrate the workings of the query completion algorithm a simple graphical user interface was designed. The interface follows the standard of search user interfaces, with a query box on the search page and a completion list rendered dynamically below the box (see Figure 3.2). The results of the query are then presented in a vertical list.

3.4 Parameters

A summary of the parameter settings used for the system is listed in Table 3.1 below.

Number of Documents           100
Number of Terms               1000
Quality Threshold             70 %
Label Similarity Threshold    30 %

Table 3.1. Parameter settings used for the system.

[Figure 3.2: a search page mock-up with a query box containing the partial query pig, the completions breeds of pigs, war pigs, bay of pigs and songs of pigs listed below it, and a Search button.]

Figure 3.2. An illustration of the design of the system. The figure shows a user being presented with a dynamically rendered query completion list, having typed the partial query pig.

3.5 Example

To illustrate the workflow of the system, this section provides an example of a partial query being given completion suggestions. The example is purposely built on the example given by [30] for the Lingo algorithm, extending it to show how the algorithm has been modified and expanded to work as a query completion algorithm.

3.5.1 Step 1. Querying Solr

Given that the user of the search system has typed the partial query q_input = “document”, the first step in our system is to call Solr in order to obtain the search results used for the analysis. Let us say that the results of this query are the seven documents below. Each title is followed by a short snippet (the number of snippets for a document is set to five in our system, but for brevity only one is given per document):

Document 1: Large Scale Singular Value Computations
Document on large scale singular value computations.

Document 2: Software Library for the Sparse Singular Value Decomposition
Document on software library for the sparse singular value decomposition.

Document 3: Introduction to modern Information Retrieval
Document on introduction to modern information retrieval.

Document 4: Using Linear Algebra for Intelligent Information Retrieval
Document on using linear algebra for intelligent information retrieval.

Document 5: Matrix Computations
Document on matrix computations.

Document 6: Singular Value Analysis of Cryptograms
Document on singular value analysis of cryptograms.

Document 7: Automatic Information Organization
Document on automatic information organization.

These documents are then used as input to the clustering algorithm.

3.5.2 Step 2.1 Lingo - Preprocessing

In the first phase, the snippet texts are stemmed and the title words are marked. As the assumption is that the title contains words that give a good, human-readable description of the article, the title words get a boost to their tf-idf score during construction of the term document matrix later on. Stop words are also marked, but for the opposite reason: they are excluded from the term-document matrix entirely.

The snippet texts are then indexed. Rare terms are filtered out, along with stop words. This applies firstly to terms that only appear once in the document set, but the document frequency threshold is raised depending on the number of unique words. The size of the index determines the size of the term document matrix, which is used in many of the costly matrix computations later on. Limiting the number of indexed terms is therefore key to limiting the computation time, and an important parameter. Our experiments showed that a thousand indexed terms render good results to this end.
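A sketch of this filtering, assuming a precomputed map from each stemmed term to its document frequency. The threshold starts by dropping terms that occur in only one document and is raised until the index fits the configured maximum:

import java.util.*;

public class TermFilter {

    // Keeps at most maxTerms terms (1000 in our system), raising the
    // document frequency threshold until the index is small enough.
    static Set<String> filter(Map<String, Integer> docFrequency, int maxTerms) {
        int threshold = 2; // drop terms that appear in a single document
        while (true) {
            Set<String> kept = new HashSet<>();
            for (Map.Entry<String, Integer> entry : docFrequency.entrySet()) {
                if (entry.getValue() >= threshold) {
                    kept.add(entry.getKey());
                }
            }
            if (kept.size() <= maxTerms) {
                return kept;
            }
            threshold++;
        }
    }
}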

The resulting stemmed terms in our example that are not filtered out are:

T1: document
T2: informat
T3: singular
T4: valu
T5: comput
T6: retrieval

and the resulting term document matrix is calculated:

A =
\begin{bmatrix}
0.00 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 \\
0.85 & 0.85 & 0.00 & 0.00 & 0.00 & 0.85 & 0.00 \\
0.85 & 0.85 & 0.00 & 0.00 & 0.00 & 0.85 & 0.00 \\
1.25 & 0.00 & 0.00 & 0.00 & 1.25 & 0.00 & 0.00 \\
0.00 & 0.00 & 1.25 & 1.25 & 0.00 & 0.00 & 0.00 \\
0.00 & 0.00 & 0.85 & 0.85 & 0.00 & 0.00 & 0.85
\end{bmatrix}

Note that the first row, representing the term document, has no score in any of the documents, even though the term appears in all of them. This is due to the inverse document frequency part of the tf-idf scoring: a term that appears in all of the documents is deemed useless (a small sketch of the weighting follows the normalized matrix below). For future calculations each column vector is normalized so that its length is equal to one:

A_{normalized} =
\begin{bmatrix}
0.00 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 \\
0.49 & 0.71 & 0.00 & 0.00 & 0.00 & 0.71 & 0.00 \\
0.49 & 0.71 & 0.00 & 0.00 & 0.00 & 0.71 & 0.00 \\
0.72 & 0.00 & 0.00 & 0.00 & 1.00 & 0.00 & 0.00 \\
0.00 & 0.00 & 0.83 & 0.83 & 0.00 & 0.00 & 0.00 \\
0.00 & 0.00 & 0.56 & 0.56 & 0.00 & 0.00 & 1.00
\end{bmatrix}
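For reference, a minimal sketch of a tf-idf weighting consistent with the matrix above; with seven documents, ln(7/3) ≈ 0.85 and ln(7/2) ≈ 1.25, which matches the non-zero entries of A:

public class TfIdf {

    // tf: frequency of the term in the document, df: number of documents
    // containing the term, numDocs: total number of documents (snippets).
    static double tfIdf(int tf, int df, int numDocs) {
        if (tf == 0 || df == 0) {
            return 0.0;
        }
        // A term occurring in every document gets tf * ln(N / N) = 0,
        // which is why the row for the term "document" is all zeros.
        return tf * Math.log((double) numDocs / df);
    }
}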

3.5.3 Step 2.2 Lingo - Phrase Extraction

For computational efficiency the stemmed terms are translated into integers and a list representation of the concatenated snippets is created, preserving the order of the terms for the Phrase Extraction phase. For our example, this resulted in the following list:

wordlist = [1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 9, 10, 11, 12, 13, 5, 6, 14, 15, 1, 2, 16, 17, 18, 19, 20, 21, 1, 2, 22, 23, 24, 11, 25, 19, 20, 26, 1, 2, 27, 7, 28, 1, 2, 5, 6, 29, 30, 31, 32, 1, 2, 33, 19, 34, 35]

Note that here the stop words are kept, as important phrases may include stop words (e.g. Conan the Barbarian). Another thing to note is that each sentence marker is translated into a unique integer; this stops frequent phrases from crossing sentence boundaries (a sketch of this translation follows the phrase list below). During the phrase extraction phase, frequent phrases are found using the methods described in Chapter 2.3.3. For our example, the following stemmed phrases were found:

P1: document on
P2: singular valu
P3: informat retrieval
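A minimal sketch of the integer translation mentioned above, assuming the snippets have already been tokenized and stemmed:

import java.util.*;

public class WordListBuilder {

    // Translates tokens into integers for phrase extraction. Each distinct
    // term gets one id, while every sentence marker occurrence gets a fresh
    // id, so no frequent phrase can span a sentence boundary.
    static List<Integer> build(List<String> tokens, Set<String> sentenceMarkers) {
        Map<String, Integer> termIds = new HashMap<>();
        List<Integer> wordList = new ArrayList<>();
        int nextId = 1;
        for (String token : tokens) {
            if (sentenceMarkers.contains(token)) {
                wordList.add(nextId++); // unique id per marker occurrence
            } else {
                Integer id = termIds.get(token);
                if (id == null) {
                    id = nextId++;
                    termIds.put(token, id);
                }
                wordList.add(id);
            }
        }
        return wordList;
    }
}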


3.5.4 Step 2.3 Lingo - Cluster Label Induction

To reduce the effect of noise, the term document matrix is now approximated using the method described in Chapter 2.3.3. How much the matrix should be approximated is determined by the quality threshold q, which is a percentage. The calculations to find the rank-k approximation include costly matrix computations, but as the term document matrix in our implementation has been limited to a maximum of 1000 terms, the time cost is acceptable. We found through experimenting that a good percentage for the quality threshold is around 70%, and for our example this corresponded to a rank k = 2.
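A sketch of one way to choose k from the quality threshold, following the Frobenius norm criterion described in Chapter 2.3.3: k is the smallest rank for which ||A_k||_F / ||A||_F >= q, and since ||A_k||_F depends only on the k largest singular values it can be read directly off the SVD:

import org.ejml.simple.SimpleMatrix;
import org.ejml.simple.SimpleSVD;

public class RankEstimator {

    // Returns the smallest k such that the rank-k approximation retains
    // at least the fraction q (e.g. 0.70) of the Frobenius norm of A.
    static int chooseRank(SimpleMatrix A, double q) {
        SimpleSVD<SimpleMatrix> svd = A.svd();
        SimpleMatrix W = svd.getW(); // diagonal matrix of singular values
        int n = Math.min(W.numRows(), W.numCols());

        double total = 0;
        for (int i = 0; i < n; i++) {
            total += W.get(i, i) * W.get(i, i);
        }

        double retained = 0;
        for (int k = 1; k <= n; k++) {
            retained += W.get(k - 1, k - 1) * W.get(k - 1, k - 1);
            if (Math.sqrt(retained / total) >= q) {
                return k;
            }
        }
        return n;
    }
}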

From the individual terms and the frequent phrases found in the Phrase Extraction phase a matrix P of size (t, t + p) is constructed, where t is the number of terms and p the number of frequent phrases. It can be thought of as a kind of term document matrix, only that here the phrases and terms are treated as documents. The matrix is then weighted with the tf-idf score of each word and normalized:

P =
\begin{bmatrix}
0.00 & 0.00 & 0.00 & 1.00 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 \\
0.71 & 0.00 & 0.00 & 0.00 & 1.00 & 0.00 & 0.00 & 0.00 & 0.00 \\
0.71 & 0.00 & 0.00 & 0.00 & 0.00 & 1.00 & 0.00 & 0.00 & 0.00 \\
0.00 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 & 1.00 & 0.00 & 0.00 \\
0.00 & 0.56 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 & 1.00 & 0.00 \\
0.00 & 0.83 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 & 1.00
\end{bmatrix}

To avoid returning query completions that are too similar, a label similarity score is calculated by measuring the cosine score between each pair of column vectors in the matrix. Of any column pair with a score above the label similarity threshold, set to 0.3 in our implementation, only one column is kept. Vectors with no scores at all (see for example the phrase document on) are also discarded (a sketch of the similarity check follows the label list below). Cutting these terms from our matrix resulted in the following pruned P matrix:

P_{pruned} =
\begin{bmatrix}
0.00 & 0.00 & 1.00 & 0.00 \\
0.71 & 0.00 & 0.00 & 0.00 \\
0.71 & 0.00 & 0.00 & 0.00 \\
0.00 & 0.00 & 0.00 & 1.00 \\
0.00 & 0.56 & 0.00 & 0.00 \\
0.00 & 0.83 & 0.00 & 0.00
\end{bmatrix}

The P matrix is responsible for providing the human readable phrases for the concepts found in the document set. The inclusion of single terms is justified by the fact that some concepts are best described by a single term (e.g. Madonna, McDonald's, famine). The terms and phrases still represented by the matrix after pruning are:

P1: singular valu
P2: informat retrieval
P3: document
P4: computations
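Because the columns of P are normalized to unit length, the cosine score used in the pruning above reduces to a plain dot product between column pairs; a minimal sketch:

import org.ejml.simple.SimpleMatrix;

public class LabelSimilarity {

    // Cosine similarity between two candidate labels (columns of P);
    // assumes the columns have been normalized to unit length.
    static double cosine(SimpleMatrix P, int colA, int colB) {
        double dot = 0;
        for (int row = 0; row < P.numRows(); row++) {
            dot += P.get(row, colA) * P.get(row, colB);
        }
        return dot;
    }
}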

Multiplying the P matrix, representing the human readable phrases, with the matrix of abstract concepts obtained from the singular value decomposition (as described in Chapter 2.3.3) gives each candidate label a score for each abstract concept; for every concept, the highest-scoring term or phrase is selected as its label.
