Similarity search in multimedia databases
Performance evaluation for similarity calculations in multimedia databases
JO TRYTI AND JOHAN CARLSSON
Bachelor’s Thesis at CSC Supervisor: Michael Minock
Contents

1 Introduction
2 Background
  2.1 Similarity search and information retrieval
    2.1.1 Recall and precision
    2.1.2 Multimedia objects
  2.2 Text analysis
    2.2.1 Term frequency
    2.2.2 Term discrimination
    2.2.3 Length normalization
    2.2.4 TF-IDF
    2.2.5 Stopwords
    2.2.6 Stemming
  2.3 Vector space model
Similarity search is an increasingly popular topic in the information retrieval field, in the academic as well as the commercial world. Many online companies and other services on the Internet strive to provide accurate and relevant recommendations to their users based upon their preferences.
For multimedia objects (movies, music, etc.) it can be hard to measure how well two objects correlate, as they often contain so-called "fuzzy" attributes: values to which it is hard to assign a specific value or meaning. There are many ways of computing the similarity between multimedia objects, but this report focuses mostly on text similarity between media objects and on comparing different heuristic implementations.
Our aim with this report is to investigate how to efficiently calculate the k nearest neighbours in a multimedia database.
Similarity search and information retrieval
2.1.1 Recall and precision
As this report focuses on information retrieval and heuristic methods, some specific terms from the field will be used throughout the text, namely precision and recall. Precision is the fraction of the returned results that are relevant. Recall is the fraction of all relevant objects in the collection that are returned.
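The two measures can be illustrated with a small Python sketch; the document-id sets below are made-up example values, not data from the thesis:

```python
# Hypothetical example: ids of documents returned by a search vs. the
# ids of documents that are actually relevant to the query.
retrieved = {1, 2, 3, 4, 5}
relevant = {1, 2, 6, 7}

true_positives = retrieved & relevant

precision = len(true_positives) / len(retrieved)  # fraction of results that are relevant
recall = len(true_positives) / len(relevant)      # fraction of relevant docs that were found

print(precision)  # 0.4
print(recall)     # 0.5
```

A search that returns every document in the corpus trivially has recall 1 but very low precision, which is why the two measures are always reported together.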
2.1.2 Multimedia objects
A multimedia object can consist of an arbitrary number of fields, but these are usually classified into three kinds:
Token or text — A description or category consisting of multiple tokens

Metric — An enumerable attribute, for example the date when a music track was released

Precalculated — Items with precalculated distances, for example countries with geographical distances

For each field a similarity score is calculated separately; the scores are then summed.
Each field will be weighted according to 2.1.
score(q, d) = (Σ_{i=1}^{n} c_i · sim(q_i, d_i)) / (Σ_{i=1}^{n} c_i)    (2.1)
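Equation 2.1 is a weighted average of the per-field similarities. A minimal Python sketch, with illustrative similarity values and weights that are not taken from the thesis:

```python
def weighted_score(sims, weights):
    """Weighted average of per-field similarities, as in equation 2.1.

    sims[i] is sim(q_i, d_i) for field i, weights[i] is the weight c_i.
    """
    return sum(c * s for c, s in zip(weights, sims)) / sum(weights)

# Example: plot similarity 0.8, genre similarity 0.5, year similarity 0.2,
# weighted 3, 2 and 1 respectively (made-up values).
print(weighted_score([0.8, 0.5, 0.2], [3, 2, 1]))  # ≈ 0.6
```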
When comparing text fields there are many different methods and implementations; in short, a given text can be analysed with semantic or with statistical tools. The implementations described below focus on statistical tools. When referring to a collection of documents, for example our database of movie descriptions, the term corpus will be used.
c(w, d) — the number of times word w occurs in document d
f(q, d) — probability score for document d given the query document q
df(w) — document frequency: the number of documents that contain w

Table 2.1. Definitions
2.2.1 Term frequency
Term frequency is a count of how many times a word appears in a document. As mentioned in the stopwords section 2.2.5, common words like "a", "and" and "or" will likely get a high count if stop words are not removed first. Term frequency is often an indicator of how closely related documents are, based upon how frequently certain keywords recur in them, albeit an unreliable one: even after removing stop words there are still recurring words which might be unrelated or add no further information. A common modification is to use the logarithm of the term frequency; if a term occurs twenty times more often than another term in a document, it is unlikely to be twenty times more significant [Christopher D. Manning and Schutze, , 127]. It is easy to see that the measure gets skewed if there is a large discrepancy in document sizes, so document lengths are normalized.
A formal definition is given by [Hui Fang, , 50]:

If q = w, |d_1| = |d_2| and c(w, d_1) > c(w, d_2), then f(q, d_1) > f(q, d_2).    (2.2)

Common TF heuristic implementations from [Gerard Salton, 1988] are listed in table 2.2.
boolean                 tf(w, d) = 1 if the term is in the document, else 0
raw                     tf(w, d) = c(w, d), the raw term frequency for the document
logarithmic             tf(w, d) = log(c(w, d) + 1), logarithmically scaled term frequency
augmented normalized    tf(w, d) = 0.5 + 0.5 · c(w, d) / max_w' c(w', d), where the maximum is taken over all terms w' in d

Table 2.2. Term frequency heuristics
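The four TF variants in table 2.2 can be sketched directly in Python; the word counts below are made-up example values:

```python
import math

def tf_boolean(count):
    # 1 if the term occurs in the document, else 0
    return 1 if count > 0 else 0

def tf_raw(count):
    # raw term frequency
    return count

def tf_log(count):
    # logarithmically scaled term frequency
    return math.log(count + 1)

def tf_augmented(count, max_count, k=0.5):
    # augmented normalized TF: k + (1 - k) * count / max_count,
    # where max_count is the highest term count in the document
    return k + (1 - k) * count / max_count

counts = {"movie": 4, "the": 20, "space": 1}  # illustrative counts
max_c = max(counts.values())
for word, c in counts.items():
    print(word, tf_boolean(c), tf_raw(c), round(tf_log(c), 3),
          tf_augmented(c, max_c))
```

Note how the logarithmic and augmented variants compress the gap between "the" (count 20) and "space" (count 1), which is exactly the dampening effect motivated above.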
2.2.2 Term discrimination
Term discrimination weights a term with respect to how often it occurs in the corpus, unlike term frequency, which only takes the terms of a single document into account. The inverse document frequency is used to get an idea of how frequent terms are in relation to all the given documents: common terms get a low value and uncommon terms get a high value. If the corpus only contains documents on a very specific topic, the topic-specific terms will get a high document frequency, and because the inverse frequency is used they will be given a low weight, as they are not as discriminating as, say, an uncommon word in the corpus.
raw            idf(w) = 1, no change in idf weight
logarithmic    idf(w) = log(N / n), where N is the total number of documents in the corpus and n is the number of documents containing w
probabilistic  idf(w) = log((N − n) / n), the probabilistic inverse frequency factor
Table 2.3. Term discrimination heuristics
2.2.3 Length normalization
As mentioned by [Hui Fang, , 50], length normalization is used to penalize long documents, so as not to favour documents simply because of their size.
2.2.4 TF-IDF

Term frequency (TF) coupled with inverse document frequency (IDF) gives the TF-IDF method. It assigns a weight to a term based upon how many times the term appears in a specific document in relation to how often it occurs in the corpus as a whole. It is a powerful statistical method used in many search engines.
2.2.5 Stopwords

One thing to take into consideration when comparing text fields is whether there are words not worth including; words like "a", "or" and "and" neither add nor remove any value from a given text, and can thus be safely removed. There are several widely used stop word lists, but many of them are context-dependent. In one of the implementations, described in the method section, a dynamic stop word list is used, based upon removing every term below a certain threshold in the idf table.
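The dynamic stop word list can be sketched as follows. This is an illustrative Python sketch, not the thesis implementation; the mini-corpus and the cut-off value are made up:

```python
import math

# Hypothetical mini-corpus; the thesis uses movie plot descriptions.
docs = [
    "a space adventure and a love story",
    "a crime drama and a mystery",
    "a space crime story",
    "a romantic comedy and a drama",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)

# Document frequency per term.
df = {}
for tokens in tokenized:
    for term in set(tokens):
        df[term] = df.get(term, 0) + 1

THRESHOLD = 0.5  # illustrative cut-off, chosen for this tiny corpus
stopwords = {t for t, n in df.items() if math.log(N / n) < THRESHOLD}
print(sorted(stopwords))  # terms occurring in nearly every document
```

Terms appearing in (almost) all documents get an idf near zero and fall below the threshold, so the list adapts to whatever corpus is indexed instead of relying on a fixed, context-dependent word list.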
2.2.6 Stemming

Words such as "search", "searching" and "searches" can all be reduced to the root "search". This comes with a reduction in precision, as some words can lose their actual meaning: many words share a similar root but have different meanings. This is called over-stemming. Just as with stop words, stemming is to some degree context-dependent, and there are different stemming algorithms which can be used; the best known is M. Porter's algorithm [Porter, 1980], which has been translated to several different programming languages since it was first published in 1980. We decided not to use stemming on our dataset, as it reduces precision and our documents are of limited size.
Vector space model
One way of measuring the similarity between two or more documents in a corpus is to index each document as a vector of term weights and compare the vectors using the cosine of the angle between them, derived from the scalar product as described in [Christopher D. Manning and Schutze, ] and shown in 2.3 below.

cosine(A, B) = (A · B) / (|A| · |B|)    (2.3)

where A and B are the term vectors of the two documents and |A| and |B| are their lengths.
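Equation 2.3 translates to a few lines of Python when the vectors are stored sparsely as term-to-weight dictionaries; the two example documents are made up:

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors (equation 2.3)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

d1 = {"space": 1.0, "adventure": 1.0}
d2 = {"space": 1.0, "crime": 1.0}
print(round(cosine(d1, d2), 3))  # 0.5: one shared term out of two per vector
```

Because the measure divides by the vector lengths, a long document does not automatically score higher than a short one, which is the same motivation as for the length normalization in section 2.2.3.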
In this section the different implementations used in this report are described further. The implementations mostly differ in which weights are applied. A base case, a naive implementation, is used as a reference when comparing the implementations' running time and, to some extent, memory consumption.
The most common approach when calculating similarities for information retrieval is to construct an index. A standard implementation is an inverted index [Christopher D. Manning and Schutze, , 67]; this index improves calculation speed at the cost of memory. An inverted index maps each term to a list of the documents containing it, together with the term's frequency in each document. When doing a similarity search to find the k nearest neighbors with an inverted index, it is only necessary to calculate similarities for documents containing at least one common term. For some web services the similarity search is just a minor part, and the cost of keeping a large index might make a simpler but slower implementation desirable. The computational drawback of an inverted index is that it makes document length calculations expensive.
We have implemented similarity search using both an inverted index and a simpler index. The simple index calculates similarities by comparing two documents at a time; it also links each term to its document frequency, as defined in table 2.1.
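The inverted-index approach can be sketched as below. This is an illustrative Python sketch, not the thesis code; it uses a toy corpus, raw-TF weights and a precomputed length table (sidestepping the length-calculation drawback mentioned above by paying for it at build time):

```python
import heapq
import math
from collections import Counter, defaultdict

# Toy corpus standing in for the movie descriptions (illustrative only).
docs = {
    0: "space adventure on a distant planet",
    1: "crime drama in the city",
    2: "adventure and crime on a space station",
}

# Build the inverted index: term -> list of (doc_id, term_count).
index = defaultdict(list)
lengths = {}  # vector length per document, precomputed at build time
for doc_id, text in docs.items():
    counts = Counter(text.split())
    lengths[doc_id] = math.sqrt(sum(c * c for c in counts.values()))
    for term, c in counts.items():
        index[term].append((doc_id, c))

def knn(query, k):
    """Score only documents sharing at least one query term, return top k."""
    q_counts = Counter(query.split())
    scores = defaultdict(float)
    for term, qc in q_counts.items():
        for doc_id, dc in index.get(term, []):
            scores[doc_id] += qc * dc        # raw-TF dot product, for simplicity
    for doc_id in scores:
        scores[doc_id] /= lengths[doc_id]    # length normalization
    return heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])

print(knn("space adventure", 2))
```

Document 1 shares no term with the query and is never touched, which is exactly the saving an inverted index provides over comparing the query against every document.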
The database the different methods and implementations are used on is a small subset of IMDB. We decided to focus on a few attributes deemed interesting, more precisely the plot, genre and release year.
The description of each movie is between 10 and 100 words, and the test database consists of about 100 thousand movies. Each movie object has a description, a release date and a genre descriptor. Both the description and the genres are compared using the vector space model detailed in 2.3, while the release year is compared as a metric, as can be seen in 3.1.

yearsim(A, B) = log |B − A|    (3.1)
To measure the heuristics used in the test cases, one implementation was chosen as a baseline. The baseline is defined in 3.2 and uses a commonly applied weighting scheme.
tf(t, d) = log(c(t, d) + 1)

idf(t) = log(N / |{d ∈ N : c(t, d) > 0}|)

cosim(d, q) = Σ_{t ∈ q} idf(t) · tf(t, d) · idf(t) · tf(t, q) / (|d| · |q|)    (3.2)
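The baseline of 3.2 can be sketched in Python as follows. The toy corpus is made up, and |d| and |q| are interpreted here as the norms of the tf·idf weight vectors:

```python
import math
from collections import Counter

# Toy corpus standing in for the thesis data (illustrative only).
docs = ["space adventure film", "crime drama", "space crime adventure"]
tokenized = [d.split() for d in docs]
N = len(tokenized)

# Document frequency per term.
df = Counter(t for tokens in tokenized for t in set(tokens))

def tf(term, counts):
    return math.log(counts[term] + 1)          # logarithmic TF, as in 3.2

def idf(term):
    return math.log(N / df[term]) if df[term] else 0.0

def weights(tokens):
    counts = Counter(tokens)
    return {t: tf(t, counts) * idf(t) for t in counts}

def cosim(d_tokens, q_tokens):
    wd, wq = weights(d_tokens), weights(q_tokens)
    dot = sum(w * wq.get(t, 0.0) for t, w in wd.items())
    nd = math.sqrt(sum(w * w for w in wd.values()))
    nq = math.sqrt(sum(w * w for w in wq.values()))
    return dot / (nd * nq) if nd and nq else 0.0

query = "space adventure".split()
for tokens in tokenized:
    print(" ".join(tokens), round(cosim(tokens, query), 3))
```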
Term frequency weights
Boolean, raw and logarithmic term frequency heuristics were implemented as described in table 2.2.
Inverse document frequency weights
Raw and logarithmic inverse document frequency heuristics were implemented as described in table 2.3.
The two length normalization heuristics used were the vector length and the number of terms in the document. The vector length based upon the number of terms, as described by [Christopher D. Manning and Schutze, ], is commonly used for inverted indexes.
The word removal technique used was to remove all words with a logarithmic idf value of more than 2. Over 1/22 of the document contains terms that have a
4.0.1 Computational time
Table 4.1 shows the average knn computation time over twenty queried movies, measured for the TF, IDF and length heuristics.
4.0.2 Index comparison
Table 4.2 shows the computational time for the different index implementations, averaged over twenty movie knn queries.
The average difference in results is shown for k = 10 over 20 movie knn queries.
The outcome of our experiments was in some sense surprising, as we thought that each weight optimization added or modified would result in a time reduction at the cost of precision. This was not the case: the running time was fairly constant, and the reduction in precision was lower than expected. The question is of course what an acceptable trade-off between fast computation and precision is.
Quite surprising was the accuracy of the raw TF test case 4.3, a case which we thought would have low precision. We are not sure whether this was caused by an error in our implementation or by insufficient testing to give an accurate value.
In the test database we used, searches were quick across the different implementations, and a loss of precision only showed with some of the heuristic methods.
The genres in our database had a dimension space of size six and we used TF-IDF, even though with such a low dimensionality the effect was negligible and the genre could have been treated as an enumerable discrete token.
One problem with our implementation is that building the TF-IDF index requires extensive computation, and each time the database is updated with a movie the entire index needs to be rebuilt. This makes the index difficult to maintain, but it is required when using the TF-IDF method.
Using statistical methods to improve retrieval performance is a valid approach, and it comes as no surprise that it is a favored option for commercial use. There is a trade-off between efficient search queries and the exhaustive index maintenance needed when the database is updated.
Using different heuristic methods in conjunction with the TF-IDF index improves query processing speed at the cost of precision, depending on the heuristic used. We found that using stop words greatly reduced the computational time and also had the lowest precision decrease of the heuristic methods.
[Christopher D. Manning and Schutze, ] Manning, C. D., Raghavan, P. and Schütze, H. Introduction to Information Retrieval. Cambridge University Press.

[Gerard Salton, 1988] Salton, G. and Buckley, C. (1988). Term-weighting approaches in automatic text retrieval. Information Processing and Management, 24(5):513–523.

[Hui Fang, ] Fang, H., Tao, T. and Zhai, C. A formal study of information retrieval heuristics. Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval.

[Jones, 2004] Jones, K. S. (2004). A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 60(5):493–502.

[Porter, 1980] Porter, M. F. (1980). An algorithm for suffix stripping. Program, 14(3):130–137.

[Raghavan and Wong, 1986] Raghavan, V. V. and Wong, S. K. M. (1986). A critical analysis of vector space model for information retrieval. Journal of the American Society for Information Science, 37(5):279–287.