### Object Ranking for Mobile 3D Visual Search

### HANWEI WU

Abstract

In this thesis, we study object ranking in mobile 3D visual search. Conventional methods of object ranking produce ranking results based on the appearance of objects in images captured by mobile devices, while ignoring the underlying 3D geometric information. Thus, we propose to use mobile 3D visual search to improve the ranking by exploiting the underlying 3D geometry of the objects. We develop a fast 3D geometric verification algorithm to re-rank the objects at low computational complexity. In this scheme, the geometry of the objects, such as round corners, sharp edges, or planar surfaces, as well as the appearance of the objects, is considered for 3D object ranking.

## Contents

1 Introduction

1.1 Motivation and Contribution

1.2 Outline of Thesis

2 Background and Literature Review

2.1 Mobile Visual Search System

2.1.1 System Architecture

2.1.2 Challenges of Mobile Visual Search

2.1.3 Bag of Features

2.1.4 Term Frequency Inverse Document Frequency (TF-IDF)

2.2 Image Features

2.2.1 Scale-invariant feature transform (SIFT)

2.2.2 Compressed Histogram of Gradients (CHoG)

2.2.3 Multi-View Image Features

2.3 Vocabulary Tree of Image Features

2.3.1 Vocabulary Tree

2.3.2 Adaptive Vocabulary Tree

2.3.3 Multi-View Vocabulary Tree

2.4 Geometric Verification

2.4.1 RANSAC-Based Method

2.4.2 Fast Geometry Verification

2.5 Basic Knowledge of 3D Geometry

2.5.1 Internal Camera Parameters

2.5.2 External Camera Parameters

2.5.3 Epipolar Constraint

2.5.4 3D Reconstruction from Two Views

2.5.5 3D Reconstruction from Multiple Views

2.6 Geometric Verification Embedded Matching

2.6.1 Joint Visual and Geometric Ranking

2.6.2 Iterative Ranking Algorithm

3 Object Ranking for Mobile 3D Visual Search

3.1 The Concept of Ranking

3.1.1 Ranking for Information Retrieval

3.1.2 Ranking for Visual Search

3.2 TF-IDF of Multi-View Vocabulary Trees

3.2.1 Reasons for Outliers in Visual Search

3.2.3 Credibility Value of Visual Words

3.3 Fast 3D Geometric Verification

3.3.1 3D Geometric Transformation of Object

3.3.2 Process of Fast 3D Geometric Verification

3.3.3 Selection of Visual Words

3.3.4 3D Misalignment

4 Experimental Results

4.1 Experimental Setup

4.2 Impact of the Credibility Value

4.3 Performance Window of Credibility Value

4.4 Comparison of Different Geometric Verification Methods

4.5 Ranking Results

## List of Figures

2.1 System Architecture for Mobile Visual Search

2.2 Challenges of Mobile Visual Search

2.3 SIFT Cell configuration

2.2.4 Hierarchical sets of features with four levels from four views

2.3.1 Generation of vocabulary tree

2.4.1 Fast Geometry Verification Procedure (Sam S. Tsai 2010)

2.5.1 Perspective projection camera (S. Carlsson, 2007)

2.5.2 Epipolar plane (S. Carlsson, 2007)

2.6.1 Vocabulary tree score over the candidate objects

3.1.1 Information Retrieval

3.2.1 The Weakness of the Quantization-Based Approach. Legend: ○ = centroids, ◻ = query descriptor, × = noisy version of query descriptors

3.3.1 Example of geometric verification

3.3.2 Generation of the Position Array

3.3.3 Width of visual words selection

3.3.4 Illustration of 3D misalignment. The dashed lines link two corresponding 3D points; the length of the dashed lines is the 3D misalignment

4.2.1 Comparison of different powers of the credibility value

4.3.1 The recall rate of the TF-IDF score with and without the credibility value score for different leaf node sizes in the high data-rate case

4.3.2 The recall rate of the TF-IDF score with the credibility value score for different leaf node sizes in the low data-rate case

4.4.1 Comparison of the recall-datarate using different geometric verification methods

4.5.1 Comparison of the ranking results using the credibility value and reference methods

### Chapter 1

## Introduction

In recent years, the development of wireless mobile devices and virtual reality has raised interest in mobile visual search (MVS) [1][2]. Mobile visual search refers to an emerging class of applications where a mobile device takes a photo of the real world, recognizes objects in the photo, and retrieves the corresponding information and metadata about these objects from a database [3]. It aims to provide augmented reality in a real-world environment by utilizing methods of image-based information retrieval.

Analogous to text query search, MVS uses a photo as a visual query to search against an image database. In the scope of this thesis, we focus on exact matching: finding more images of the exact same physical object in the database [4]. Some applications of MVS require not only retrieval of the exact object, but also a list of candidate objects ranked by the visual similarity between the query object and the candidate objects [5]. A successful object ranking by visual similarity should be very close to the human vision interpretation [6].

Therefore, a success scenario of the MVS process would be the following: the user takes a photo of an object with the mobile device's camera, where the object occupies the major part of the scene, and sends its visual information to the server. The server then returns a list of candidate objects, where the first positions of the list are images containing the query object, and the rest of the list is filled with images containing visually similar objects.

### 1.1

### Motivation and Contribution

When using image queries to search against a database, the visual information of a 3D object changes under different perspectives and lighting conditions. Scale- and rotation-invariant local image features such as SIFT [7] and CHoG [8], which are robust to image variations, have been developed to represent the visual information of images. After robust visual information is extracted from each object, efficient data structures are built to devise a reliable matching scheme for object retrieval.

A widely used matching scheme is based on vocabulary trees with a TF-IDF scoring function [9]. The image feature descriptors are organized in a data structure called a "vocabulary tree" in the database. However, the conventional method exhibits several weaknesses with respect to the high recall-datarate requirement and the short latency demand, especially for real-time applications. Furthermore, like other image-based applications, visual search suffers a dimension loss compared to the three-dimensional world that the human vision system interprets. These shortcomings serve as the motivation of this thesis.

In this thesis, we modify the TF-IDF scoring function based on the novel multi-view vocabulary tree. We introduce 3D world coordinates into the geometric verification stage. Finally, we discuss the impact of the proposed methods on the ranking result.

### 1.2

### Outline of Thesis

The thesis is organized as follows: Chapter 2 provides the background and basic knowledge related to mobile visual search and discusses the limitations and challenges that conventional mobile visual search faces. In Chapter 3, we propose a modified scoring function and fast 3D geometric verification to improve the ranking result from two perspectives. In Chapter 4, we present the experimental results of the proposed methods.

### Chapter 2

## Background and Literature

## Review

### 2.1

### Mobile Visual Search System

In this section, we introduce fundamental topics of a mobile visual search system. First, we introduce a typical system architecture of mobile visual search. Then we point out the challenges that mobile visual search faces. Finally, we introduce the concept of bag of features and its data indexing method, Term Frequency Inverse Document Frequency (TF-IDF), which are the fundamental concepts used in object-based image retrieval.

### 2.1.1

### System Architecture

The image-based retrieval framework is a type of client-server architecture. Salient image features are extracted from query images on the mobile client side and then sent to the server through wireless transmission. The server receives the extracted and encoded image features from the mobile client, decodes them, and matches them against the stored image features in the image database. Then the server returns a list of candidate images to the mobile client, which puts the matched object in first place, followed by other candidate objects that bear a resemblance to the query image. A typical system architecture of mobile visual search is presented in Fig. 2.1.

### 2.1.2

### Challenges of Mobile Visual Search

Figure 2.1: System Architecture for Mobile Visual Search

These constraints demand that both clients and servers have low computational complexity and that the system reduce the rate of data transmission as much as possible without sacrificing the precision of the results. The challenges are summarized in Fig. 2.2.

Figure 2.2: Challenges of Mobile Visual Search.

At the same time, mobile visual search offers opportunities. We can utilize the mobility of the client to take more than one image. We are therefore able to apply 3D reconstruction from multiple views on both the client and the server side. For example, we can use multi-view vocabulary trees to handle the 3D geometry of objects.

### 2.1.3

### Bag of Features

Local feature descriptors are quantized into a vocabulary of visual words, which makes the representation robust to descriptor noise and image variations. Thus, an image is no longer represented by its descriptors but by a visual-word frequency histogram, the so-called BoF vector. The concept of visual words and the TF-IDF scoring function serve as the foundation of the visual search framework.

### 2.1.4

### Term Frequency Inverse Document Frequency

### (TF-IDF)

The bag-of-features approach uses the TF-IDF scoring function, which stands for term frequency-inverse document frequency. It is borrowed from text-based retrieval methods. In this way, each visual region is mapped to a weighted set of words. Typically, the TF-IDF weight is composed of two terms. The first is the normalized term frequency (TF): the number of times a word appears in a document, divided by the total number of words in that document. The TF measures how frequently a term occurs in a document. Since documents differ in size, a term is likely to appear more often in a long document than in a short one. Thus, the term frequency is divided by the document length as a way of normalization:

$$TF(i) = \frac{n_{id}}{n_d} \tag{2.1.1}$$

where $n_{id}$ is the number of times term $i$ appears in document $d$, and $n_d$ is the total number of terms in the document.

The second term is the inverse document frequency (IDF), computed as the logarithm of the number of documents in the corpus divided by the number of documents in which the specific term appears. The IDF measures how important a term is. When computing the TF, all terms are considered equally important. However, certain terms may appear many times yet have little importance. Thus we need to weigh down the frequent terms while scaling up the rare ones, by computing the following:

$$IDF(i) = \log_e \frac{N}{n_i} \tag{2.1.2}$$

where $N$ is the total number of documents and $n_i$ is the number of documents containing term $i$.
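As a concrete illustration, Equations (2.1.1) and (2.1.2) can be computed as follows; the toy corpus of visual-word lists is invented for the example:

```python
import math
from collections import Counter

def tf(word, document):
    """Term frequency (Eq. 2.1.1): occurrences of `word` in `document`
    divided by the total number of terms in the document."""
    counts = Counter(document)
    return counts[word] / len(document)

def idf(word, corpus):
    """Inverse document frequency (Eq. 2.1.2): log of the number of
    documents divided by the number of documents containing `word`."""
    n_containing = sum(1 for doc in corpus if word in doc)
    return math.log(len(corpus) / n_containing)

def tf_idf(word, document, corpus):
    return tf(word, document) * idf(word, corpus)

# Toy corpus: each "document" is a list of visual-word ids.
corpus = [["a", "b", "a", "c"], ["b", "c"], ["a", "a", "d", "a"]]
score = tf_idf("a", corpus[0], corpus)   # 0.5 * log(3/2)
```

For an image database, each document is the list of quantized visual words of one database image, and the TF-IDF weights of the query words are accumulated into a similarity score.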

### 2.2

### Image Features

The bag-of-features approach treats the feature descriptors of images like the words in text content. In order to build a successful system, the feature points of the images should be highly distinctive. Moreover, the feature detectors are expected to be repeatable under scale change, illumination variation, rotation, and translation of the objects. Numerous feature detectors and descriptor computation methods have been proposed in the literature, such as SIFT. For a mobile visual search system, we also need to consider the limited computational power and memory of the mobile clients as well as the transmission bandwidth of the system. These requirements demand that the feature descriptors not only possess the qualities mentioned above, but also be low-dimensional. A wide range of descriptors has been tested, and gradient distribution-based descriptors have been shown to perform best [12]. Thus, we choose SIFT as our feature descriptor in the implementation. In this section, we first introduce the feature descriptors SIFT and CHoG. Then we introduce the work of [13] for selecting more robust image features. Our implementation is based on multi-view image features as summarized in Section 2.2.3.

### 2.2.1

### Scale-invariant feature transform (SIFT)

SIFT stands for Scale-Invariant Feature Transform, an image-based feature descriptor proposed in [7]. SIFT is invariant to changes such as scale and rotation. To generate the SIFT descriptor, scale-invariant feature points of the image are first detected by finding extrema in the Difference of Gaussians (DoG) images. Then the locations of the detected feature points are refined to sub-pixel accuracy by applying a Taylor expansion to the scale-space function and setting its derivative to zero. Points with low contrast and points lying on or close to edges are rejected during this process. An orientation is assigned to each keypoint. The keypoint descriptor is represented relative to this orientation and is therefore invariant to image rotation. With this, the keypoints are detected with location, scale, and orientation. The SIFT descriptor is based on the distribution of gradients within an image patch. To compute the feature descriptor, a 16×16 window around an interest point is taken at the detected scale. The image patch is then divided into a 4×4 grid of cells, and a histogram of image gradient directions with 8 bins is computed in each cell. Finally, a descriptor of 4×4×8 = 128 dimensions is obtained. The SIFT cell configuration is shown in Fig. 2.3.
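A minimal sketch of this 4×4×8 descriptor layout is given below. It omits Gaussian weighting, trilinear interpolation, and SIFT's clipping and renormalization steps; the `sift_descriptor` function and its simplifications are ours, not Lowe's reference implementation:

```python
import numpy as np

def sift_descriptor(patch):
    """Sketch of the SIFT descriptor layout: a 16x16 grayscale patch is
    divided into a 4x4 grid of cells, and an 8-bin histogram of gradient
    directions, weighted by gradient magnitude, is accumulated per cell,
    giving 4*4*8 = 128 dimensions."""
    assert patch.shape == (16, 16)
    dy, dx = np.gradient(patch.astype(float))
    mag = np.hypot(dx, dy)                          # gradient magnitude
    ori = np.arctan2(dy, dx) % (2 * np.pi)          # direction in [0, 2*pi)
    bins = np.floor(ori / (2 * np.pi) * 8).astype(int) % 8
    desc = np.zeros((4, 4, 8))
    for cy in range(4):
        for cx in range(4):
            cell = (slice(cy * 4, cy * 4 + 4), slice(cx * 4, cx * 4 + 4))
            np.add.at(desc[cy, cx], bins[cell].ravel(), mag[cell].ravel())
    desc = desc.ravel()
    norm = np.linalg.norm(desc)
    return desc / norm if norm > 0 else desc
```

The final L2 normalization makes the descriptor robust to affine illumination changes of the patch.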

### 2.2.2

### Compressed Histogram of Gradients(CHoG)

Figure 2.3: SIFT Cell configuration

CHoG is a low-bitrate feature descriptor designed with quantization and compression in mind, using tree-coded gradient histograms to create the feature descriptors. It can achieve the same performance as SIFT. Hence, it is one solution to the bandwidth limitation of mobile visual search. The pipeline of computing CHoG is briefly as follows. First, an image patch is selected around the interest point. Illumination changes of the patch appearance are modeled by a simple affine transformation of the pixel intensities, which is compensated by normalizing the mean and standard deviation of the pixel values of each patch. Next, an additional Gaussian smoothing of σ = 2.7 pixels is applied to the patch. Local image gradients dx and dy are computed using a centered derivative mask [-1, 0, 1]. Next, the patch is divided into localized spatial bins. Unlike SIFT, which uses a square 4×4 grid with 16 cells, CHoG divides the patch into a log-polar configuration called DAISY. CHoG uses overlapping regions for spatial binning, which improves the performance of the descriptor by making it more robust to interest-point localization error. The tests in [8] have shown that the DAISY-9 configuration matches the performance of the 4×4 square-grid configuration. CHoG also reduces the bit count by approximating the histogram of gradients with a small set of bins. Several bin configurations, VQ-3, VQ-5, VQ-7, and VQ-9, are proposed. All bin configurations have a bin center at (0, 0) to capture the central peak of the gradient distribution. The additional bin centers are evenly spaced (with respect to angle) over an ellipse, whose eccentricity is chosen in accordance with the observed skew in the gradient statistics. Then the codewords of each spatial bin are concatenated to form the Uncompressed Histogram of Gradients (UHoG).

The low-bitrate CHoG is produced by compressing the gradient histogram. One way to compress it is to construct and store a Huffman tree built from the distribution. For a small number of leaf nodes m, the number of possible trees is also small. Thus, CHoG is produced by enumerating all possible trees and mapping each one to an index. The indices are then encoded with either a fixed-length code or a variable-length code, exploiting the fact that not all trees are equally likely to occur given the gradient statistics. Finally, the compressed codewords from each spatial bin are concatenated to form the final descriptor.

For matching, the distances between compressed descriptors are pre-computed and stored in a distance table. The matching process then reduces to using indices as look-ups into the distance table. Since the distance computation only involves table look-ups, more effective histogram comparison measures can be used at no additional computational complexity.

### 2.2.3

### Multi-View Image Features

Large-scale objects such as buildings are usually hard to match due to significant changes of viewpoint and lighting conditions between the query and the server images. [13] shows that an image database at the server with considerable perspective diversity improves the recall-datarate performance of mobile visual search. [13] aligns feature correspondences from different perspectives to generate multi-view features. A multi-view feature is more discriminative than a single-view feature because it effectively eliminates outliers; for example, features from the foreground are more discriminative than those from the background. At the same time, it is more robust to viewpoint changes on the query side. Multi-view features are generated as follows:

A set of feature correspondences among $l$ images is

$$C^l_{i,j,\dots,k} = \{(f_i, f_j, \dots, f_k) \mid f_i \leftrightarrow f_j \leftrightarrow \dots \leftrightarrow f_k\}, \tag{2.2.1}$$

where $f_i$ denotes a feature point in the $i$-th image, $f_i \leftrightarrow f_j \leftrightarrow \dots \leftrightarrow f_k$ is a multi-view feature correspondence that indicates a correspondence among features in several images, and $l = |\{i, j, \dots, k\}|$ is the number of view images.

Figure 2.2.4: Hierarchical sets of features with four levels from four views.

For each multi-view feature in the set $C^l_{i,j,\dots,k}$, [13] uses a representative descriptor for the $l$ corresponding features by taking the element-wise median of the $l$ descriptors as a robust estimate:

$$\hat{d}^l(u) = \mathrm{Median}\{d^l_h(u) : h = i, j, \dots, k\}, \quad u = 1, \dots, 128. \tag{2.2.2}$$
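Equation (2.2.2) can be sketched with NumPy as follows, assuming the $l$ single-view descriptors of one multi-view feature are already aligned; the toy 4-dimensional descriptors are illustrative only (real SIFT descriptors have 128 dimensions):

```python
import numpy as np

def multiview_descriptor(descriptors):
    """Representative descriptor of a multi-view feature (Eq. 2.2.2):
    the element-wise median over the l aligned single-view descriptors,
    a robust estimate against per-view noise and outliers."""
    d = np.asarray(descriptors, dtype=float)   # shape (l, 128) for SIFT
    return np.median(d, axis=0)

# Toy example: three aligned 4-dimensional "descriptors" of one feature.
views = [[0.0, 2.0, 1.0, 5.0],
         [1.0, 3.0, 1.0, 5.0],
         [9.0, 2.5, 1.0, 5.0]]   # third view's first entry is an outlier
rep = multiview_descriptor(views)  # -> [1.0, 2.5, 1.0, 5.0]
```

The per-dimension median suppresses the outlying value 9.0, which a mean would not.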

### 2.3

### Vocabulary Tree of Image Features

In this section, we first introduce the generation of a vocabulary tree. Then we introduce the work of [14], which improves the performance of the vocabulary tree, in Sections 2.3.2 and 2.3.3. Our implementation is based on the methods introduced in Sections 2.3.2 and 2.3.3.

### 2.3.1

### Vocabulary Tree

For a database with a large number of images, a data structure is required that can efficiently return a short list of candidate images sharing the most similarities with the query image. It is infeasible to use pairwise feature matching to compare the query image against every database image. Moreover, the feature descriptors of images are high-dimensional, which further increases the computational burden. Thus, we need a data structure that is effective for searching a large database in high-dimensional spaces. [15] proposed the "vocabulary tree" data structure, built with a hierarchical k-means clustering scheme. Features are extracted from the database images to form a set of features $(x_1, x_2, \dots, x_n)$. The k-means clustering algorithm is applied to partition the $n$ observations into $k$ ($\le n$) sets $S = \{S_1, S_2, \dots, S_k\}$ so as to minimize the within-cluster sum of squares (WCSS). The objective of the k-means algorithm can be expressed as

$$\arg\min_S \sum_{i=1}^{k} \sum_{x \in S_i} \|x - \mu_i\|^2 \tag{2.3.1}$$

Then the same process is recursively applied to each set $S_i$, splitting each set into $k$ new parts. The process is illustrated in Fig. 2.3.1.

Figure 2.3.1: Generation of vocabulary tree

The size of the vocabulary depends on the branch factor k and the depth L assigned to the vocabulary tree. We will discuss the impact of k and L on the performance in Section 3.2.1.

### 2.3.2

### Adaptive Vocabulary Tree

Empirically, we notice that the performance of the bag-of-features (BoF) method for a vocabulary tree is significantly affected by the size of the vocabulary relative to the size of the dataset [16]. We tested the performance of different branch factors k and depths L of the vocabulary tree with the same number of leaf nodes. For instance, a vocabulary tree with branch factor k = 8 and depth L = 4 and a vocabulary tree with k = 4 and L = 6 both have 4096 leaf nodes. The results show that vocabulary trees with the same vocabulary size have similar recall-datarate performance across different branch factors and depths. On the other hand, the ratio between the vocabulary size and the dataset size essentially determines the number of descriptors per leaf node. Therefore, in order to improve the performance of the vocabulary tree, we need to build a larger tree with enough visual words relative to the dataset. To achieve this, there should be an upper limit on the number of descriptors in a leaf node.

In most cases, the distribution of the number of descriptors over the leaf nodes is imbalanced: some leaf nodes hold more descriptors than the rest. Thus, following the above ideas, we modify the implementation of the hierarchical k-means algorithm to generate a new tree that is adaptive to the data distribution. Instead of setting a fixed depth as in the conventional method, we let the nodes in the bottom level continue partitioning into child nodes until the child nodes contain fewer than k descriptors. We call this new type of vocabulary tree the "adaptive vocabulary tree". By limiting the number of descriptors associated with each leaf node, the adaptive tree is more robust to descriptor noise.
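The adaptive splitting rule can be sketched as follows. The minimal `kmeans` helper and the `max_leaf` parameter are illustrative simplifications, not the thesis implementation:

```python
import numpy as np

def kmeans(X, k, iters=10, seed=0):
    """Minimal k-means (Eq. 2.3.1 objective): returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids, labels

def build_adaptive_tree(X, k=4, max_leaf=50):
    """Adaptive vocabulary tree sketch: recursively split each node with
    k-means until a node holds at most `max_leaf` descriptors, instead of
    stopping at a fixed depth L."""
    if len(X) <= max_leaf or len(X) < k:
        return {"leaf": True, "descriptors": X}
    centroids, labels = kmeans(X, k)
    if len(np.unique(labels)) < 2:       # degenerate split: stop here
        return {"leaf": True, "descriptors": X}
    children = [build_adaptive_tree(X[labels == j], k, max_leaf)
                for j in range(k) if np.any(labels == j)]
    return {"leaf": False, "centroids": centroids, "children": children}
```

Dense regions of descriptor space are thus split deeper than sparse ones, which balances the number of descriptors per leaf.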

### 2.3.3

### Multi-View Vocabulary Tree

### 2.4

### Geometric Verification

Since the bag-of-features (BoF) representation of objects ignores the geometric layout of the features in the image, exploiting the spatial relations between query and database objects can improve retrieval performance. Hence, a geometric verification step is often applied to the short candidate list from VT matching. The essence of geometric verification is to measure the level of geometric consistency between the query object and the database objects. The level of geometric consistency is reflected by the number of features in a set of correspondences that agree with one specific geometric transformation model. Typically, a score for the final ranking is generated from geometric verification (GV) by measuring the geometric consistency between query and candidate objects. In this section, we first introduce the RANSAC-based method, which is the most widely used for geometric verification. Then we introduce the work of [17], which is a computationally efficient method.

### 2.4.1

### RANSAC-Based Method

The conventional method for GV is to use RANSAC to distinguish outliers from inliers by exploiting statistical consistency. RANSAC assumes that there exists a parameterized model which can explain or be fitted to the observations. In an implementation of RANSAC for GV, a sample of m feature points is drawn at random from the matched features. The parameters of the model are then estimated from this sample, and the residuals with respect to all feature points are computed. Features with residuals less than some threshold t are classified as hypothetical inliers. The estimated model is reasonably good if sufficiently many features have been classified as hypothetical inliers. The model is evaluated by estimating the error of the inliers relative to the model. Through each iteration, a model is produced that either is rejected because too few points are classified as inliers, or is a refined model with a corresponding error measure. A refined model is accepted if its error is lower than that of the last saved model. Then we extract the outliers from the feature point set and subtract their scores from the VT score. A typical model for object transformation has 6 parameters, as shown in Equations (2.4.1) and (2.4.2):

$$\begin{bmatrix} u \\ v \end{bmatrix} = \begin{bmatrix} m_1 & m_2 \\ m_3 & m_4 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} + \begin{bmatrix} t_x \\ t_y \end{bmatrix} \tag{2.4.1}$$

$$\begin{bmatrix} u \\ v \\ \vdots \end{bmatrix} = \begin{bmatrix} x & y & 0 & 0 & 1 & 0 \\ 0 & 0 & x & y & 0 & 1 \\ & & & \vdots & & \end{bmatrix} \begin{bmatrix} m_1 \\ m_2 \\ m_3 \\ m_4 \\ t_x \\ t_y \end{bmatrix} \tag{2.4.2}$$
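RANSAC for this 6-parameter affine model can be sketched as follows; the minimal sample size of 3 correspondences, the iteration count, and the inlier threshold are illustrative choices, not the values used in this thesis:

```python
import numpy as np

def fit_affine(src, dst):
    """Least-squares fit of the 6-parameter model u = A x + t from point
    correspondences, using the stacked linear system of Eq. (2.4.2)."""
    n = len(src)
    M = np.zeros((2 * n, 6))
    M[0::2, 0:2] = src      # rows [x, y, 0, 0, 1, 0]
    M[0::2, 4] = 1
    M[1::2, 2:4] = src      # rows [0, 0, x, y, 0, 1]
    M[1::2, 5] = 1
    p, *_ = np.linalg.lstsq(M, dst.reshape(-1), rcond=None)
    return p[:4].reshape(2, 2), p[4:]

def ransac_affine(src, dst, iters=200, thresh=3.0, seed=0):
    """RANSAC sketch: repeatedly fit the model on a random minimal sample
    (3 correspondences) and keep the model with the most inliers, then
    refit on all inliers of the best model."""
    rng = np.random.default_rng(seed)
    best_inliers = np.zeros(len(src), dtype=bool)
    for _ in range(iters):
        idx = rng.choice(len(src), size=3, replace=False)
        A, t = fit_affine(src[idx], dst[idx])
        residuals = np.linalg.norm(src @ A.T + t - dst, axis=1)
        inliers = residuals < thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    A, t = fit_affine(src[best_inliers], dst[best_inliers])
    return A, t, best_inliers
```

The returned inlier mask identifies the feature matches consistent with the estimated transformation; the remaining matches are treated as outliers.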

### 2.4.2

### Fast Geometry Verification

In [17], the authors proposed a geometric similarity scoring scheme that re-ranks the top candidates using only the position information of the features. The method is called fast geometry verification. In its implementation, a complete graph of matched feature points is formed within each candidate image and the query image, and a vector of corresponding log distance ratios is calculated:

$$S_{LDR} = \left\{ \log \frac{\mathrm{dist}(l_{q,i}, l_{q,m})}{\mathrm{dist}(l_{d,j}, l_{d,n})} \;\middle|\; (i,j), (m,n) \in M \right\} \tag{2.4.3}$$

where dist denotes the Euclidean distance between two feature points in the image.

The uniformity of the values indicates the geometric consistency between query and database images. A score is derived from these values. The higher the score is, the more likely that the query image and the database image match. The procedure of fast geometry verification is explained in Fig. 2.4.1.
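The log-distance-ratio computation of Eq. (2.4.3) and a simple uniformity score can be sketched as follows; the median-based score and its `width` parameter are our simplification of the histogram-based score actually used in [17]:

```python
import numpy as np
from itertools import combinations

def ldr_values(query_pts, db_pts):
    """Log distance ratios (Eq. 2.4.3) over all pairs of matched feature
    locations; query_pts[i] corresponds to db_pts[i]."""
    vals = []
    for i, m in combinations(range(len(query_pts)), 2):
        dq = np.linalg.norm(query_pts[i] - query_pts[m])
        dd = np.linalg.norm(db_pts[i] - db_pts[m])
        if dq > 0 and dd > 0:
            vals.append(np.log(dq / dd))
    return np.array(vals)

def ldr_score(query_pts, db_pts, width=0.1):
    """Uniformity-based consistency score sketch: the fraction of LDR
    values within `width` of their median. For a true match related by a
    similarity transform, all ratios coincide and the score approaches 1."""
    v = ldr_values(query_pts, db_pts)
    return float(np.mean(np.abs(v - np.median(v)) < width))
```

Because only pairwise distances are used, no transformation model has to be estimated, which is what makes the method fast compared to RANSAC.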

### 2.5

### Basic Knowledge of 3D Geometry

When a camera takes a photo of an object, an optical process relating the 3-dimensional object to the 2-dimensional image takes place. This process is known as projection. 3D reconstruction uses the epipolar constraints of the projection model to obtain the camera parameters [18][19][16]. The camera parameters can then be used to recover the 3D world coordinates of points. In this section, we introduce the basic knowledge used in 3D reconstruction from multiple images.

### 2.5.1

### Internal Camera Parameters

If we assume that the camera projection center coincides with the origin of the 3D coordinate system and that the camera is aligned with the system axes, the projection relation between an object and a camera can be described by Equation (2.5.1) and Fig. 2.5.1 (S. Carlsson, 2007):

$$\frac{x - x_0}{\sigma_x f} = \frac{X}{Z}, \qquad \frac{y - y_0}{\sigma_y f} = \frac{Y}{Z} \tag{2.5.1}$$

Figure 2.5.1: Perspective projection camera (S. Carlsson, 2007)

### 2.5.2

### External Camera Parameters

The external camera parameters are the rotation of the camera and the position, or translation, of the projection center. A general camera projection can be described as

$$\begin{pmatrix} x \\ y \\ 1 \end{pmatrix} = \lambda K R \begin{pmatrix} X - X_0 \\ Y - Y_0 \\ Z - Z_0 \end{pmatrix} \tag{2.5.3}$$

where $R$ represents the rotation matrix around the three axes of the camera, and $X_0, Y_0, Z_0$ represent the camera center. These are the external parameters of a camera. $K$ contains the internal parameters introduced in Section 2.5.1. Rearranging Equation (2.5.3), we get a more general way to represent the projection of a camera:

$$\begin{pmatrix} x \\ y \\ 1 \end{pmatrix} \sim \begin{pmatrix} m_{11} & m_{12} & m_{13} & m_{14} \\ m_{21} & m_{22} & m_{23} & m_{24} \\ m_{31} & m_{32} & m_{33} & m_{34} \end{pmatrix} \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix} \tag{2.5.4}$$

### 2.5.3

### Epipolar Constraint

From Fig. 2.5.2 (S. Carlsson, 2007), we can observe that the baseline and the projection rays of two corresponding points in the image planes are coplanar. If we express this in the reference frame of camera D, we can write the coplanarity as Equations (2.5.5) and (2.5.6):

Figure 2.5.2: Epipolar plane (S. Carlsson, 2007)

$$q^T (t \times Rp) = 0 \tag{2.5.5}$$

$$q^T E p = 0, \qquad E = [t]_\times R \tag{2.5.6}$$

where $E$ is called the essential matrix, $t$ is the translation vector, and $\times$ denotes the cross product.
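The constraint can be checked numerically as follows; the camera configuration at the end is an invented example, not data from this thesis:

```python
import numpy as np

def skew(t):
    """Cross-product matrix [t]_x, so that skew(t) @ p == np.cross(t, p)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def essential_matrix(R, t):
    """E = [t]_x R, so the epipolar constraint (2.5.6) reads q^T E p = 0."""
    return skew(t) @ R

def epipolar_residual(q, p, E):
    """q^T E p: zero (up to noise) for a true correspondence between the
    normalized image coordinates p and q of the same 3D point."""
    return float(q @ E @ p)

# A 3D point seen by two cameras related by X2 = R @ X1 + t satisfies the
# constraint exactly.
theta = 0.2
R = np.array([[np.cos(theta), 0.0, np.sin(theta)],
              [0.0, 1.0, 0.0],
              [-np.sin(theta), 0.0, np.cos(theta)]])
t = np.array([0.1, 0.05, 0.0])
X1 = np.array([1.0, 2.0, 5.0])        # point in camera-1 coordinates
X2 = R @ X1 + t                       # same point in camera-2 coordinates
p, q = X1 / X1[2], X2 / X2[2]         # normalized image coordinates
residual = epipolar_residual(q, p, essential_matrix(R, t))   # ~ 0
```

A mismatched correspondence yields a clearly nonzero residual, which is what geometric verification exploits.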

### 2.5.4

### 3D Reconstruction from Two Views

Based on the method in [20], given an essential matrix E, four solutions for the translation vector t and rotation matrix R can be calculated by applying singular value decomposition (SVD) to E. Physically impossible solutions are discarded by using the positive depth constraint. After the self-calibration parameters are known, the set of 3D world coordinates of the corresponding features can be calculated.
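The SVD-based decomposition can be sketched as follows. This follows the standard construction, which may differ in detail from [20], and the positive-depth selection step is omitted:

```python
import numpy as np

def decompose_essential(E):
    """Sketch of recovering relative pose from an essential matrix via SVD:
    E factors as [t]_x R in four ways, with R in {U W V^T, U W^T V^T} and
    t = +/- u3 (the left null vector of E). The physically valid solution
    is then selected with the positive-depth constraint (not shown here)."""
    U, _, Vt = np.linalg.svd(E)
    # Force proper rotations: flip signs so det(U) = det(V) = +1.
    if np.linalg.det(U) < 0:
        U = -U
    if np.linalg.det(Vt) < 0:
        Vt = -Vt
    W = np.array([[0.0, -1.0, 0.0],
                  [1.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0]])
    R1, R2 = U @ W @ Vt, U @ W.T @ Vt
    t = U[:, 2]                       # translation direction (up to scale)
    return [(R1, t), (R1, -t), (R2, t), (R2, -t)]
```

Each returned candidate is a proper rotation paired with a unit translation direction; triangulating a point with each candidate and keeping the one that places it in front of both cameras implements the positive-depth test.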

### 2.5.5

### 3D Reconstruction from Multiple Views

### 2.6

### Geometric Verification Embedded Matching

Usually, VT (vocabulary tree) matching and geometric verification are two separate stages of visual search [1][2]. In VT matching, a short list of candidate objects is generated by ranking the TF-IDF scores of the objects. Geometric verification is then applied to the shortlist to check the geometric consistency between the candidate objects and the query object. Based on the geometric consistency score, a final ranking of the candidate list is returned to the client. Hence, the final ranking after geometric verification is based purely on the geometric consistency scores of the shortlist. Since the VT matching score can be interpreted as a measure of the amount of similar visual elements between the query object and a candidate object, both the VT score and the geometric verification score can be critical to candidate object ranking. Moreover, the ranking by geometric verification is applied only to a small number of candidates rather than to all objects in the database. The work of [14] introduces a cost function that better utilizes the VT and geometric verification scores together and incorporates them into the matching process. The method can produce a global ranking of all the objects in the database at low computational cost. In this section, we introduce the method of [14], on which our implementation is based.

### 2.6.1

### Joint Visual and Geometric Ranking

To further improve the ranking result, [14] proposes a method for joint visual and geometric ranking of objects. It is stated as a constrained problem: if candidate objects have very similar TF-IDF scores, then the ranking of these candidate objects is determined by geometric similarity. The constrained optimization problem reads:

$$\min_k J_k \quad \text{s.t.} \quad |s_k - s_j| \le \delta, \ \forall j, k \in \Omega, \tag{2.6.1}$$

where $J_k$ is the cost of geometric inconsistency, $s_k$ is the TF-IDF score of the $k$-th object, and $\Omega = \{k \mid |s_k - s_j| \le \delta, k \ne j\}$ is the set of object indexes associated with objects on a similar score level, as defined by the small threshold $\delta$. By solving this problem, the ranking of a set of objects is determined by sorting visually similar objects according to their geometric similarity.

### 2.6.2

### Iterative Ranking Algorithm

The iterative ranking algorithm updates the set Ω as query features arrive and applies the geometric cost of the method proposed in Section 3.3.4. The objects of the set Ω are re-ranked by their geometric consistency. A description is given in Algorithm 1.


Figure 2.6.1: Vocabulary tree score over the candidate objects

Algorithm 1: 3D Geometric Verification Embedded Matching and Ranking

1. Initialize: set the score s_k = 0, k = 1, ..., N for all objects.
2. Repeat until all incoming query features have been used:
   a. Update the scores by matching the incoming query features against the vocabulary tree.
   b. Update Ω, for which |s_k − s_j| ≤ δ, k ≠ j, ∀ j, k ∈ Ω.
   c. Calculate the cost of 3D misalignment in Eqn. (3.3.2) for all objects in Ω.
   d. Update the ranking of objects in Ω using the above costs.
3. Output the result of object ranking.
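The re-ranking step of Algorithm 1 can be sketched as follows, with `geo_cost` as a hypothetical placeholder standing in for the 3D-misalignment cost of Eqn. (3.3.2):

```python
def rerank(scores, geo_cost, delta):
    """Re-ranking sketch for Algorithm 1: objects are sorted by TF-IDF
    score; within any run of objects whose consecutive scores differ by
    at most `delta`, the order is decided by ascending geometric cost."""
    if not scores:
        return []
    order = sorted(range(len(scores)), key=lambda k: -scores[k])
    result, group = [], [order[0]]
    for k in order[1:]:
        if abs(scores[k] - scores[group[-1]]) <= delta:
            group.append(k)          # visually indistinguishable: same group
        else:
            result.extend(sorted(group, key=lambda j: geo_cost[j]))
            group = [k]
    result.extend(sorted(group, key=lambda j: geo_cost[j]))
    return result

# Objects 0, 1, and 3 have near-equal visual scores, so geometry breaks
# the tie; object 2 is visually dissimilar and stays last.
ranking = rerank([10.0, 9.9, 5.0, 9.8], [3.0, 1.0, 0.0, 2.0], delta=0.5)
# -> [1, 3, 0, 2]
```

In the full algorithm this re-ranking is re-applied as each batch of query features arrives, so the ordering is refined incrementally rather than in one final pass.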

### Chapter 3

## Object Ranking for Mobile

## 3D Visual Search

### 3.1

### The Concept of Ranking

In this section, we first introduce the types of ranking in information retrieval. Then we analyze the ranking criteria for visual search in detail, show how conventional ranking methods reflect these criteria, and give directions for improvement.

### 3.1.1

### Ranking for Information Retrieval

Figure 3.1.1: Information Retrieval

### 3.1.2

### Ranking for Visual Search

For ranking in the context of visual search, the objective is to sort the candidate images in terms of the visual similarity between the query and the database objects. A ranking by visual similarity between objects should correspond to human judgment. In most cases, human vision, setting subjective factors aside, is sensitive to the colors, visual elements, and geometric structures that appear on an object. Considering that the SIFT descriptor we use only reflects image gradients in grayscale images, we exclude the color factor from the ranking.

We consider the visual similarity between objects according to two criteria. The first criterion concerns small blocks of visual elements on the objects, such as window patterns or the eaves of buildings. These small blocks of visual elements are reflected by the feature descriptors. The TF-IDF is the baseline feature vector for query-object pairs, and the summation of TF-IDF scores over all query descriptors is the baseline ranking function for vocabulary trees. Hence, the TF-IDF ranking function can be interpreted as a measure of the accumulated similar visual elements between objects.
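The baseline ranking function described above can be sketched as follows; this is a toy illustration with explicit word lists, whereas the real system accumulates the scores while traversing the vocabulary tree:

```python
import math
from collections import Counter

def tfidf_scores(query_words, db_objects):
    """Accumulate TF-IDF scores per database object.
    db_objects maps object id -> list of quantized visual-word ids."""
    n = len(db_objects)
    df = Counter()                                 # document frequency per word
    for words in db_objects.values():
        df.update(set(words))
    scores = {}
    for obj, words in db_objects.items():
        tf = Counter(words)                        # term frequency in this object
        scores[obj] = sum(tf[w] * math.log(n / df[w])
                          for w in query_words if w in tf)
    return scores
```

Words that occur in every object receive a zero IDF weight and hence contribute nothing, which is the intended behavior of the TF-IDF criterion.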

The second criterion of visual similarity compares the global geometry of objects. In geometric terms, two objects are similar if one can be mapped to the other by a geometric transformation, i.e., some combination of translation, reflection, rotation, and dilation (scaling up or down). For the geometric verification step of visual search, we use the feature point positions in 3D Euclidean space to describe the underlying geometry of objects. We assume there exists a geometric transformation for each query/candidate pair. The idea of geometric verification is to measure the degree to which the set of feature point correspondences is consistent with this transformation. This can also be interpreted as the geometric consistency between two objects. There are different indicators of geometric consistency; for example, the fast geometric verification method in Section 2.4.2 uses the log of distance ratios as the indicator.

The objective of ranking for visual search is thus to assign a numeric score to each query-document pair rather than to assign labels to documents. Hence our ranking method for visual search is determined by a numeric scoring function that reflects the relevance of query and database objects. To improve the ranking results with respect to the two criteria above, on one hand we reduce the impact of outliers on the scores that reflect the accumulated visually similar elements, and on the other hand we seek a better indicator of approximate geometric consistency.

### 3.2

### TF-IDF of Multi-View Vocabulary Trees

The ranking function of the vocabulary tree reflects the amount of shared similar visual elements between query and database objects, which is one criterion of visual search. The main challenge in implementing this criterion is the negative impact of outliers on the ranking function. Outliers are an unavoidable by-product of the vocabulary-tree-based quantization approach, and a large number of outliers is the cause of incorrect matches. We define outliers as descriptors that add more score to other database objects than to the object they are supposed to match. In Section 3.2.1, we analyze two types of noise that cause outliers in visual search and point out their trade-off relation.

Compared to the text retrieval field, where the query terms are exactly the same as the terms stored in the database, the image features in visual search always suffer from noise and may become outliers. The conventional TF-IDF ranking function, which is designed for text retrieval, does not account for these characteristics of visual search. In Section 3.2.2, we show how adaptive trees can help reduce the outliers caused by quantization noise, and in Section 3.2.3, we propose to add a credibility value factor to the ranking function to mitigate the impact of outliers caused by descriptor noise.

### 3.2.1

### Reasons for Outliers in Visual Search

Quantization reduces the discriminative power of local descriptors significantly. In the matching process, several descriptors are assumed to match if they are assigned to the same quantization cell, which contains a number of different descriptors from different objects. We regard this loss of discriminative power as quantization noise [25]. Fig. 3.2.1b illustrates quantization noise, where each small cell in the figure represents a leaf node of the vocabulary tree. In this case, the query descriptor is quantized into a big leaf node. Besides the correct object represented by the matched node, other objects that have descriptors in the matched leaf node also add to their scores. If the matched leaf node represents a large number of descriptors, the number of outliers generated in the matching process tends to be high, causing incorrect matches.

Another limit we may encounter in the matching process is descriptor noise. To varying degrees, the descriptors of query objects always suffer from descriptor noise, which is caused by lighting variation, rotation, and different perspectives. Hence, if the leaf node size is small, as shown in Fig. 3.2.1a, the probability increases that a noisy version of the descriptor is assigned to the incorrect leaf node.

We can thus observe that quantization noise and descriptor noise have a trade-off relation in terms of the size of the leaf nodes. As the leaf size shrinks, the dominant cause of outliers changes from quantization noise to descriptor noise. Our strategy to improve the ranking result is therefore to reduce the occurrence of outliers caused by quantization noise and to make the ranking function less affected by the outliers caused by descriptor noise.

### 3.2.2

### Adaptive Tree for Reducing Quantization Noise

In Section 2.3.2, we mentioned that the ratio of the size of the dataset to the size of the vocabulary tree is critical to the performance. Building a large vocabulary tree relative to its dataset size is an efficient way to improve performance. To achieve this, Section 2.3.2 introduces the concept of the adaptive vocabulary tree, which limits the number of descriptors each leaf node can represent. When building an adaptive vocabulary tree, leaf nodes whose size is larger than a given threshold are further partitioned into smaller nodes, while the other leaf nodes stop dividing. The adaptive tree can thus be seen as an efficient way to restrain quantization noise. For example, in Fig. 3.2.1b, the cell can be divided further into smaller cells, as the dotted lines indicate, and the quantization noise decreases.
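The splitting rule can be sketched as a recursive partition that only continues in leaves that are still too large; the `cluster` callback (e.g. one k-means step with `branch` centroids) is an assumption standing in for the actual tree-building code:

```python
def build_adaptive_tree(descriptors, branch, max_leaf_size, cluster):
    """Recursively split until each leaf represents at most max_leaf_size descriptors.
    cluster(descriptors, branch) -> list of descriptor groups (one per child)."""
    if len(descriptors) <= max_leaf_size:
        return {"leaf": True, "descriptors": descriptors}
    groups = cluster(descriptors, branch)
    return {"leaf": False,
            "children": [build_adaptive_tree(g, branch, max_leaf_size, cluster)
                         for g in groups if g]}
```

Unlike a fixed-depth tree, the recursion depth adapts locally to the descriptor density, which is exactly what bounds the leaf node size.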

### 3.2.3

### Credibility Value of Visual Words

After using the adaptive vocabulary tree to restrain quantization noise, descriptor noise becomes the dominant cause of outliers in most cases. In this scenario, the leaf nodes normally contain fewer than 10 descriptors, so the visual word cells in descriptor space are comparable in size to the variation of a single SIFT descriptor. Under the conventional TF-IDF scoring function, incorrectly quantized query descriptors add their full scores to the incorrect objects.

(a) Descriptor Noise of Vocabulary Tree

(b) Quantization Noise of Vocabulary Tree

Figure 3.2.1: The Weakness of the Quantization-Based Approach. Legend: ○=centroids, ◻=query descriptor, ×=noisy version of query descriptors

Each cluster has a virtual boundary separating it from its neighbors. If a descriptor lies close to the virtual boundary of a cluster, we consider this descriptor obscure; in terms of human visual perception, it corresponds to two image feature patches that are difficult to distinguish. We observe that for most outliers caused by descriptor noise, the query descriptor lies close to the border of two clusters. Hence, if we add a factor that penalizes data points lying close to the border, the majority of affected query terms will be descriptors suffering from descriptor noise. Thus, we propose to assign a credibility value to each term of the TF-IDF feature vector. We take the ratio of the two closest distances from the query descriptor to the centroids as the uncertainty of the visual word (see Fig. 3.2.1a). The credibility value is calculated as (3.2.1), where dis(1) is the Euclidean distance between the query descriptor and the centroid of the closest leaf node, and dis(2) that of the second-closest leaf node:

cred = 1 − dis(1)/dis(2) (3.2.1)

A query descriptor that is quantized with certainty, i.e., with dis(2) much larger than dis(1), keeps nearly its full original score. In this sense, the new scoring function lowers the impact of descriptors that cause outliers. Finally, the scoring function for vocabulary trees is modified by weighting each term's TF-IDF contribution with its credibility value.
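Equation (3.2.1) for one query descriptor, written as a numpy sketch with an illustrative centroid set:

```python
import numpy as np

def credibility(query, centroids):
    """cred = 1 - dis(1)/dis(2) (Eqn. 3.2.1): near 1 for an unambiguous
    quantization, near 0 when the descriptor lies on a cluster border."""
    d = np.sort(np.linalg.norm(centroids - query, axis=1))
    return 1.0 - d[0] / d[1]
```

A descriptor exactly on the border of two clusters has dis(1) = dis(2) and therefore credibility 0, so it contributes nothing to the score.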

### 3.3

### Fast 3D Geometric Verification

In this section, we first introduce the motivation for incorporating the 3D information of the object into the geometric verification stage. Then we present a general description of the process of fast 3D geometric verification and give detailed explanations of two critical steps: obtaining and selecting visual words, and the 3D misalignment measurement.

### 3.3.1

### 3D Geometric Transformation of Object

In the geometric verification stage, we utilize the feature point positions to estimate a geometric transformation model between query and candidate objects. The choice of geometric transformation model is a trade-off between the completeness of the transformation description and the computational complexity. The conventional RANSAC-based method gives a fairly complete description of the geometric transformation of an object with respect to translation, scaling, and rotation [26]. However, it has six parameters that need to be estimated, as shown previously in Equation (2.4.1), and is computationally expensive. The fast 2D geometric verification method proposed in [17] only utilizes the 2D information of the image. It exploits a weak geometric relation by regarding the consistency of the feature point positions as the single parameter to be estimated for the object transformation. However, this weak estimation of the object transformation is vulnerable to perspective changes, so it is mainly effective for objects with planar surfaces, like book covers, but fails for large 3D objects, like buildings. Compared to the 2D geometric transformation model, the 3D setting is more robust to perspective variations of the object [27]. With 3D geometric information of the object, we can extend this method to features in 3D space.

### 3.3.2

### Process of Fast 3D Geometric Verification

The mobility of mobile visual search gives us the opportunity to take several photos of one object from different viewpoints on the client side. The client takes pictures of the same object from two perspectives and then obtains the 3D world coordinates of the feature points by 3D reconstruction from multiple images (see Section 2.5). The 3D feature positions of the candidate objects are also reconstructed and stored in the leaf nodes of the vocabulary tree, along with their corresponding feature descriptors, in the database. In most cases, the feature positions of the database objects are reconstructed from a sufficient number of perspectives to achieve better precision.
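The two-view reconstruction referenced here can be sketched with standard linear (DLT) triangulation; the projection matrices in the usage below are illustrative, not the calibration of the actual system:

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Recover one 3D world point from a feature correspondence.
    P1, P2: 3x4 camera projection matrices; x1, x2: 2D image points (x, y)."""
    A = np.stack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, vt = np.linalg.svd(A)               # null space of A gives X (homogeneous)
    X = vt[-1]
    return X[:3] / X[3]                       # back to Euclidean coordinates
```

For example, with P1 = [I | 0] and P2 = [I | t], a world point reprojects consistently into both views and is recovered up to the usual scale ambiguity.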

Once the 3D feature positions are available, the matches are verified. Fig. 3.3.1 shows a simplified example of different cases of geometric verification.

Figure 3.3.1: Example of geometric verification

### 3.3.3

### Selection of Visual Words

The quantization approach used in the vocabulary tree method makes it difficult to use the feature point positions directly for geometric verification. The incoming query matches leaf nodes (visual words) that represent more than one descriptor. This process inevitably generates outliers: the incoming descriptor may match wrong candidate objects or wrong descriptors of the correct candidate object. Since our method is a weak approximation of the 3D geometric consistency between objects, a large number of outliers would degrade the geometric verification and cause unnecessary computation. Hence, visual word selection is a critical step for improving performance by mitigating the effects of outliers. Our approach is to add several rejection criteria to the process of obtaining feature point positions for geometric verification.

The process of obtaining feature point positions for geometric verification is similar to the matching process of the vocabulary tree. Specifically, each leaf node has a posting list l where the 2D positions (x and y coordinates) of the object's feature points in the different views are stored along with their object IDs:

Object ID    View 1 position    View 2 position    ...
M-th         (x1, y1)           (x2, y2)           ...
N-th         (x1, y1)           (x2, y2)           ...

Here L represents the leaf nodes of the vocabulary tree. After each query descriptor has been matched with one leaf node, the position array is updated by adding the feature point positions according to the object IDs of the posting list in the matched leaf node.

Figure 3.3.2: Generation of the Position Array

However, if the posting list of a leaf node is long, the matched leaf node not only tends to have large quantization noise, but also generates a number of outliers. As a result, we use the number of descriptors that one visual word represents as the criterion for selecting visual words. We denote this criterion by T_hits and apply this rejection criterion before updating the position array. The whole process of obtaining and selecting the feature point positions for geometric verification is shown in Algorithm 2; we denote this algorithm by q = z(p). In the algorithm, the posting list l is associated with the methods getCurrentID and moveToNextID, which retrieve the corresponding object ID and step forward to the next object ID after the current positions are stored in the position array. The position array is associated with the method addposition, which adds the feature point positions, together with the corresponding image ID, to the position array.

Algorithm 2 Feature Position Selection for Geometric Verification
Input: query descriptors Q, leaf nodes L
Output: position array P
Parameters: T_hits
for all queries w_i in Q do
    l_i ← match(w_i, L)
    if l_i.length ≤ T_hits then
        while l_i is not completed do
            d ← l_i.getCurrentID
            P_d ← P.addposition(l_i, d)
            l_i.moveToNextID
        end while
    end if
end for
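Algorithm 2 can be rendered in Python as follows; the posting lists are modeled here as plain lists of (object id, positions) pairs, an assumption standing in for the vocabulary-tree data structures:

```python
def select_feature_positions(query_descriptors, match, t_hits):
    """Build the position array P while rejecting visual words whose
    posting lists are longer than t_hits (likely outlier generators).
    match(w) -> posting list [(object_id, positions), ...]"""
    P = {}                                     # object id -> feature positions
    for w in query_descriptors:                # for all queries w_i in Q
        posting = match(w)                     # l_i <- match(w_i, L)
        if len(posting) > t_hits:              # T_hits rejection criterion
            continue
        for obj_id, positions in posting:      # walk the posting list
            P.setdefault(obj_id, []).append(positions)
    return P
```

Visual words with long posting lists are skipped entirely, so they contribute neither positions nor outliers to the subsequent geometric verification.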

### 3.3.4

### 3D Misalignment

Considering the low-latency requirement of mobile applications, we exploit a weak geometric consistency instead of estimating the six parameters that the RANSAC transformation requires, in order to lower the computational complexity. Since we have a discriminative feature selection based on the multi-view approach, and the 3D setting is more robust to perspective transformations, it is sufficient to use only the 3D feature point positions as the single quantity that describes the geometric consistency between query and database objects.

For the implementation, after obtaining and selecting the visual words as in Section 3.3.3, we obtain a set of reliable correspondences Q̃_k. Each feature correspondence in Q̃_k contains the query feature q and the database feature p. Using the 3D world coordinate generation method introduced in Section 2.5, we obtain the set of 3D world coordinates w_c from two views for the query features, and w_s for the database features. We calculate the Euclidean distance of every correspondence in 3D space according to Equation (3.3.1). The vector of 3D misalignments of the correspondences is then formed, and its spread indicates the geometric consistency between the query and database images. We therefore take the variance over this vector as the score of geometric verification, as in Equation (3.3.2). The smaller this score, the more likely the query image and the database image match.

g(q, p) = log2(1 + ∥w_c(q) − w_s(p)∥) (3.3.1)

J_k = var{log2[1 + d_3(p, q)]}, (p, q) ∈ Q̃_k (3.3.2)

A larger value of J_k indicates a geometric inconsistency between the objects. As introduced in Section 2.6.1, we hence regard this 3D misalignment J_k as the cost of geometric inconsistency and use it to form the joint cost function.

(a) Feature correspondences without visual word selection

(b) Feature correspondences with visual word selection

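Equations (3.3.1) and (3.3.2) amount to the following computation over the selected correspondences (numpy sketch):

```python
import numpy as np

def misalignment_cost(wc, ws):
    """J_k = var(log2(1 + ||w_c(q) - w_s(p)||)) over all correspondences.
    wc, ws: (n, 3) arrays of query and database 3D world coordinates.
    A consistent rigid offset gives constant log-distances, hence zero cost."""
    d3 = np.linalg.norm(wc - ws, axis=1)       # Euclidean 3D misalignments
    g = np.log2(1.0 + d3)                      # Eqn. (3.3.1)
    return float(np.var(g))                    # Eqn. (3.3.2)
```

Using the variance rather than raw distances means a query reconstructed in a different world frame still scores well, as long as all correspondences are displaced consistently.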

### Chapter 4

## Experimental Results

To evaluate the performance of visual search, we use the recall-datarate characteristic and the ranking as benchmarks. The recall rate is the percentage of queries for which the correct object appears in the first position of the returned ranking list. The ranking is evaluated by examining the similarities between the query object and the database objects in terms of their positions in the list.
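The recall benchmark described above reduces to a recall-at-1 computation (sketch):

```python
def recall_at_1(rankings, ground_truth):
    """Fraction of queries whose correct object is ranked first.
    rankings: {query id: ranked object list}; ground_truth: {query id: object id}."""
    hits = sum(1 for q, ranked in rankings.items()
               if ranked and ranked[0] == ground_truth[q])
    return hits / len(rankings)
```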

### 4.1

### Experimental Setup

We evaluate our proposed methods on the multi-view image dataset Stockholm Buildings, which comprises 50 buildings of that city. The server holds 254 images of the 50 buildings, with at least 2 views recorded for each building. The client may use up to 100 additional test images of the 50 buildings. We acquired server and test images from different viewpoints and at different times. The images were recorded by a Canon IXUS 50 digital camera at a resolution of 2592 × 1944 pixels.

### 4.2

### Impact of the Credibility Value

In this experiment, we compare the conventional TF-IDF ranking function to the updated ranking function with the credibility value. We test different powers of the credibility value to explore its impact on the recall-datarate performance. Specifically, we vary the power a of the credibility value in Eqn. (4.2.1).

In general, a higher power of the credibility value punishes the uncertainty of the visual word more strongly. The ranking function with the square of the credibility value performs best, with a 10 percent (5 objects) improvement in recall rate, and the ranking function with the 3rd power of the credibility value performs second best. Thus, we conclude that by properly enhancing the effect of the credibility value (taking powers of the value), the recall rate can potentially be improved further. For the following tests, we use the square of the credibility value in the ranking function.
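The scoring variant being compared can be sketched as weighting each TF-IDF term by cred^a; since Eqn. (4.2.1) is not reproduced in this text, the exact form below is an assumption based on the surrounding description:

```python
def weighted_score(terms, a):
    """Score one object by TF-IDF terms weighted with cred^a.
    terms: list of (tfidf, cred) pairs for the object's matched visual words."""
    return sum(tfidf * (cred ** a) for tfidf, cred in terms)
```

With a = 0 this degenerates to the conventional TF-IDF sum, so the power a directly controls how strongly uncertain visual words are suppressed.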

Figure 4.2.1: Comparison on different powers of credibility value

### 4.3

### Performance Window of Credibility Value

The effect of the credibility value on the performance depends on the type of the dominant noise. The type of noise is closely related to the number of descriptors associated with one leaf node (in what follows referred to as the "leaf node size" for simplicity). Since the vocabulary tree we use is based on the work of [14], the leaf node size is uniformly distributed. Without loss of generality, we define the average leaf node size of a vocabulary tree as the number of descriptors represented by the vocabulary tree divided by the number of leaf nodes. By changing the settings of the branch factor k and depth L, vocabulary trees with different leaf node sizes can be built from the same dataset. In this experiment, we compare the recall-datarate performance for different leaf node sizes.

As the leaf node size decreases, the effect of the credibility value becomes more obvious. Leaf node sizes in the range of 1–8 descriptors show the biggest performance gap, with an improvement of 6–8 objects.

A leaf node size larger than 20 yields a recall rate below 60 percent, which is not acceptable for most applications; hence, performance below 60 percent is outside the scope of our discussion. Intuitively, the credibility value has the best effect when the variation of a query descriptor is similar to the range of a leaf node. We conclude that the high data-rate and low data-rate queries have different ranges with the biggest performance gap because the low data-rate query descriptors are generated by a stricter feature selection criterion [7], which reduces the query descriptor noise. Smaller query descriptor noise means that the credibility value is already effective in smaller cells; hence, the region with the biggest performance gap shifts left for the low data-rate case. Based on our experimental observations, we call this region of biggest performance gap the "performance window". The performance window can be a factor that balances the bandwidth constraint and the storage constraint when designing a system.

Figure 4.3.1: The recall rate of TF-IDF score with and without credibility value score on different leaf node size in the high data-rate case

### 4.4

### Comparison of Different Geometric Verification Methods

Figure 4.3.2: The recall rate of TF-IDF score and with credibility value score on different leaf node size in the low data-rate case

[Plot: recall (%) vs. datarate (KB/query) for VT + fast 3D, VT + RANSAC 3D, and VT + RANSAC 2D]

Figure 4.4.1: Comparison of the recall-datarate using different geometric verifi-cation methods.

Both 3D methods outperform the 2D RANSAC algorithm, as the underlying 3D geometry is more discriminative than the 2D geometry.

We also investigate the average execution time of the different geometric verification methods on a MATLAB platform. To obtain the top five ranked images, the fast 3D geometric verification method needs only 0.16 seconds to achieve an average recall of 90 percent. The 2D RANSAC algorithm needs 3 seconds, and the 3D RANSAC algorithm 13 seconds, to achieve the same recall level.

### 4.5

### Ranking Results

The reference result is the returned candidate list ranked by the similarity of object appearance. With the improved scoring function, we observe that the ranking result is more sensitive to the small blocks of visual elements of the buildings, for example, the window patterns and eaves. Thus, objects that possess more visual elements similar to the query object tend to obtain higher scores and higher ranking positions. This impact on the ranking result also corresponds to the human user's subjective judgment.

(a) Reference Ranking. Example 1 (b) Credibility Ranking. Example 1

(c) Reference method. Example 2 (d) Credibility Ranking. Example 2

(e) Reference method. Example 3 (f) Credibility Ranking. Example 3

(g) Reference method. Example 4 (h) Credibility Ranking. Example 4

(i) Reference method. Example 5 (j) Credibility Ranking. Example 5

Figure 4.5.1: Comparison of the ranking results using credibility value and reference methods

(a) VT Ranking. Example 1 (b) Geometric Ranking. Example 1

(c) VT Ranking. Example 2 (d) Geometric Ranking. Example 2

(e) VT Ranking. Example 3 (f) Geometric Ranking. Example 3

### Chapter 5

## Summary and Discussion

In this master thesis, we investigate some of the most commonly used methods for mobile 3D visual search and propose modifications of current methods to improve the object ranking results. The proposed modifications have two aims: 1. to make the object ranking result closer to the subjective ranking of human users; 2. to keep the proposed method computationally efficient for mobile applications. To achieve this, we reduce the impact of outliers on the results. In the vocabulary tree stage, we modify the TF-IDF scoring function by punishing the impact of possible outliers. In the geometric verification stage, we introduce 3D world coordinates into the image feature positions, which makes the geometric consistency check more discriminative. To sum up, we interpret the improvement of our object ranking results in the following aspects:

1. By applying the updated TF-IDF scoring function, objects obtain better ranking positions when they share more similar visual elements with the query object.

2. We extend the fast geometric verification to 3D space to make it more robust to perspective changes of the object. This improves the ranking results in terms of the global geometric appearance of the objects.

## Bibliography

[1] B. Girod, V. Chandrasekhar, D. Chen, N.-M. Cheung, R. Grzeszczuk, Y. A. Reznik, G. Takacs, S. Tsai, and R. Vedantham. Mobile visual search. IEEE Signal Processing Magazine, 28(4), July 2011.

[2] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Object retrieval with large vocabularies and fast spatial matching. pages 1–8. IEEE Conference on Computer Vision and Pattern Recognition, June 2007.

[3] D. Chen. Memory efficient image database for mobile visual search. PhD thesis, Stanford University, April 2014.

[4] A. Shrivastava, T. Malisiewicz, A. Gupta, and A. Efros. Data-driven visual similarity for cross-domain image matching. ACM Transactions on Graphics, 30(6), December 2011.

[5] Y. Sun. Object-based visual attention for computer vision. Artificial Intelligence, 146(1), May 2003.

[6] T. Frese, C. Bouman, and J. Allebach. A methodology for designing image similarity metrics based on human visual system models. Technical report, Purdue University Department of Electrical and Computer Engineering, February 1997.

[7] D. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2), November 2004.

[8] V. Chandrasekhar, G. Takacs, D. Chen, S. Tsai, Y. A. Reznik, R. Grzeszczuk, and B. Girod. Compressed histogram of gradients: A low-bitrate descriptor. International Journal of Computer Vision, 96(3), 2012.

[9] T. Gneiting and A. Raftery. Strictly proper scoring rules, prediction, and estimation. Technical report 463, University of Washington Department of Statistics, September 2004.

[10] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Lost in quantization: improving particular object retrieval in large scale image databases. pages 1–8. IEEE Conference on Computer Vision and Pattern Recognition, June 2008.

[12] S. Tsai, D. Chen, G. Takacs, V. Chandrasekhar, J. Singh, and B. Girod. Location coding for mobile image retrieval. pages 1–7. IEEE International Conference on Image Processing, September 2009.

[13] X. Lyu, H. Li, and M. Flierl. Hierarchically structured multi-view features for mobile visual search. pages 23–32. Data Compression Conference, March 2014.

[14] D. Mars, H. Wu, H. Li, and M. Flierl. 3D geometric verification embedded matching and ranking using multi-view vocabulary tree for mobile visual search. Data Compression Conference, April 2015.

[15] D. Nister and H. Stewenius. Scalable recognition with a vocabulary tree. pages 2161–2168. IEEE Conference on Computer Vision and Pattern Recognition, June 2006.

[16] G. Schindler, M. Brown, and R. Szeliski. City-scale location recognition. IEEE Conference on Computer Vision and Pattern Recognition, June 2007.

[17] S. Tsai, D. Chen, G. Takacs, V. Chandrasekhar, R. Vedantham, R. Grzeszczuk, and B. Girod. Fast geometric re-ranking for image-based retrieval. pages 1029–1032. IEEE International Conference on Image Processing, September 2010.

[18] S. Carlsson. Geometric computing in image analysis and visualization. Lecture notes, KTH Royal Institute of Technology, March 2007.

[19] R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision, 2nd edition. Cambridge University Press, 2004.

[20] H. Li and M. Flierl. Mobile 3D visual search using the Helmert transformation of stereo features. pages 3470–3474. IEEE International Conference on Image Processing, September 2013.

[21] Y. Liu and T. Mei. Optimizing visual search reranking via pairwise learning. IEEE Transactions on Multimedia, 13(2), April 2012.

[22] C. J. C. Burges, R. Ragno, and Q. V. Le. Learning to rank with non-smooth cost functions. Neural Information Processing Systems, January 2006.

[23] H. Li. The Institute of Electronics, Information and Communication Engineers, 2011.

[24] W. Hsu, L. Kennedy, and S. Chang. Reranking methods for visual search. IEEE MultiMedia, 14(3), August 2007.

[25] H. Jegou, M. Douze, and C. Schmid. Hamming embedding and weak geometric consistency for large scale image search. European Conference on Computer Vision, August 2008.