
A Repeatability Test for Two Orientation Based Interest Point Detectors

Björn Johansson and Robert Söderberg

April 26, 2004

Technical report LiTH-ISY-R-2606, ISSN 1400-3902

Computer Vision Laboratory Department of Electrical Engineering

Linköping University, SE-581 83 Linköping, Sweden
bjorn@isy.liu.se, soderberg@isy.liu.se

Abstract

This report evaluates the stability of two image interest point detectors, star-pattern points and points based on the fourth order tensor. The Harris operator is also included for comparison. Different image transformations are applied, and the repeatability of points between a reference image and each of the transformed images is computed. The transforms are plane rotation, change in scale, change in view, and change in lighting conditions.

We conclude that the result largely depends on the image content. The star-pattern points and the fourth order tensor model the image as locally straight lines, while the Harris operator is based on simple/non-simple signals. The two methods evaluated here perform equally well as or better than the Harris operator if the model is valid, and worse otherwise.


Contents

1 Introduction
2 Interest point detectors
  2.1 Harris, nms
  2.2 Harris, subpixel
  2.3 Fourth order tensors
  2.4 Star patterns
3 Experimental setup
  3.1 Repeatability criterion
  3.2 Transformation of scale, rotation and view
  3.3 Transformation of light
4 Experimental results
  4.1 Comparison of the two Harris versions
  4.2 Rotation, Scale, and View
  4.3 Variation of illumination
5 Conclusions and discussion


1 Introduction

In the last few years, a number of experiments have been performed to evaluate the stability of interest point detectors and local descriptors, see e.g. [11, 8]. Stable interest points are useful for example in object recognition applications, see e.g. [6, 7, 5], where a local image content descriptor is computed at every interest point. These descriptors can be used in a feature matching algorithm to find the objects and object poses. Stable points may not always be crucial; it depends on what is done with them, e.g. on the choice of descriptor. In general, however, stable points should make the system more robust.

Among the detectors evaluated in Schmid et al. [11], the Harris operator was found to be the most stable one. This report evaluates two other methods for detection of interest points and compares them with the Harris operator. The experiments and philosophy are very much the same as in [11], and we refer to this reference for more details. Basically, different image transformations are applied and the repeatability of points between a reference image and each of the transformed images is computed. In [11] the transformations are generated by actually moving the camera, and the homographies are estimated by registering the images using a pattern that is projected onto the scene by a projector. We use a simpler approach in this paper: the transformations are simulated in the computer. This approach has the drawback that it introduces interpolation noise, and that the simple camera model may be unrealistic. On the other hand, we do not have to estimate the homography.

We will first describe the methods that are evaluated in this report. Then we explain some details of the experimental setup, and finally we present the results.

2 Interest point detectors

This section gives a short description of the methods included in the evaluation. Two versions of the Harris operator are considered. The parameters and thresholds of the methods are chosen such that they are all based on roughly the same region size and give approximately the same number of points.

2.1 Harris, nms

The Harris function is computed as

$$H = \det(T) - \alpha\,\operatorname{trace}^2(T), \qquad (1)$$

where $\alpha = 0.04$ and $T$ is the structure tensor

$$T = \int g(\mathbf{x})\, \nabla I \nabla I^T \, d\mathbf{x}. \qquad (2)$$

g is a Gaussian window function with σ = 2. The image gradient is computed using differentiated Gaussian filters with σ = 1. Local maxima points are found by non-maximum suppression. All filters can be made separable.
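A minimal sketch of this detector in Python, assuming NumPy/SciPy; the 5×5 suppression neighbourhood, the positive-response condition, and all names are our own choices, since the report only fixes α = 0.04 and the two σ values.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter

def harris_nms(I, alpha=0.04, sigma_grad=1.0, sigma_win=2.0):
    # Image gradient from differentiated Gaussian filters (separable).
    Ix = gaussian_filter(I, sigma_grad, order=(0, 1))
    Iy = gaussian_filter(I, sigma_grad, order=(1, 0))
    # Structure tensor T of eq. (2): Gaussian-windowed gradient outer products.
    Txx = gaussian_filter(Ix * Ix, sigma_win)
    Txy = gaussian_filter(Ix * Iy, sigma_win)
    Tyy = gaussian_filter(Iy * Iy, sigma_win)
    # Harris function of eq. (1): H = det(T) - alpha * trace(T)^2.
    H = (Txx * Tyy - Txy ** 2) - alpha * (Txx + Tyy) ** 2
    # Non-maximum suppression: keep local maxima with positive response.
    peaks = (H == maximum_filter(H, size=5)) & (H > 0)
    return np.argwhere(peaks), H  # (row, col) interest points and the response
```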


Figure 1: Local features detected by the tensor, from left to right: crossing, T-crossing, corner, non-parallel lines, and parallel lines.

2.2 Harris, subpixel

Same as the previous method, except that the local maxima are found with subpixel accuracy. A second order polynomial is fitted to the Harris image around each maximum pixel from the previous method, and the local maximum of the polynomial gives the subpixel position.
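One way such a refinement can be done is to estimate the polynomial coefficients from finite differences in the 3×3 neighbourhood of the maximum; the report does not state the exact fitting procedure, so the sketch below (assuming NumPy) only illustrates the idea.

```python
import numpy as np

def subpixel_peak(H, r, c):
    # Second order polynomial model of H around pixel (r, c),
    # with coefficients estimated by central differences.
    dx = (H[r, c + 1] - H[r, c - 1]) / 2.0
    dy = (H[r + 1, c] - H[r - 1, c]) / 2.0
    dxx = H[r, c + 1] - 2.0 * H[r, c] + H[r, c - 1]
    dyy = H[r + 1, c] - 2.0 * H[r, c] + H[r - 1, c]
    dxy = (H[r + 1, c + 1] - H[r + 1, c - 1]
           - H[r - 1, c + 1] + H[r - 1, c - 1]) / 4.0
    A = np.array([[dxx, dxy], [dxy, dyy]])
    g = np.array([dx, dy])
    # The polynomial peak is at offset -A^{-1} g; clamp to the pixel cell.
    off = np.clip(-np.linalg.solve(A, g), -0.5, 0.5)
    return r + off[1], c + off[0]  # subpixel (row, col)
```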

2.3 Fourth order tensors

This method uses the tensor representation explained in [9]. The tensor can represent one, two, or three line segments. If the tensor is reshaped to a matrix in a proper way, the number of line segments corresponds to the rank of the tensor. For example, a tensor representing a corner will have rank two. Using this property, the tensor representation can be used as an interest point detector, where the interest points are the local features illustrated in figure 1. The detection process is explained in detail in [10], but the basics are:

1. Compute the image gradient, where a differentiated Gaussian filter with σ = 1 is used (same as for the Harris methods).

2. The image gradient is improved to suit the tensor representation by using a method described in [4], where the responses from edges and lines are amplified and made more concentrated.

3. An orientation tensor T is estimated from the improved image gradient.

4. The fourth order tensor is estimated by applying a number of separable filters to the elements of the orientation tensor. These filters represent a subset of monomials up to the fourth order.

5. The fourth order tensor is reshaped to a matrix, and a measurement, $c_2$, of rank two is calculated (a numerical sketch follows this list):

$$c_2 = \frac{qt - 9d}{t^3 - 3qt + 3d}, \quad \text{where} \quad \begin{cases} t = \sigma_1 + \sigma_2 + \sigma_3 \\ d = \sigma_1\sigma_2\sigma_3 \\ q = \sigma_1\sigma_2 + \sigma_2\sigma_3 + \sigma_3\sigma_1 \end{cases}$$

and where $\sigma_1, \sigma_2, \sigma_3$ denote the three largest singular values of the matrix.

6. A selection of interesting tensors is performed by picking each tensor corresponding to a local maximum in the rank-two image, weighted with the tensor norm. The interest point position is then calculated as the crossing between the two line segments.
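A small numerical sketch of the rank-two measure in step 5, assuming NumPy; it transcribes the formula for $c_2$ as reconstructed above and checks it on diagonal matrices.

```python
import numpy as np

def rank_two_measure(M):
    # c2 from the three largest singular values of the reshaped tensor M
    # (assumes M is nonzero, so the denominator does not vanish).
    s = np.linalg.svd(M, compute_uv=False)[:3]
    t = s.sum()                                  # t = s1 + s2 + s3
    d = s[0] * s[1] * s[2]                       # d = s1 s2 s3
    q = s[0] * s[1] + s[1] * s[2] + s[2] * s[0]  # q = s1 s2 + s2 s3 + s3 s1
    return (q * t - 9.0 * d) / (t ** 3 - 3.0 * q * t + 3.0 * d)

# Sanity checks: c2 = 1 for an ideal rank-two matrix with equal singular
# values, and c2 = 0 for rank one and for three equal singular values.
print(rank_two_measure(np.diag([1.0, 1.0, 0.0])))  # -> 1.0
print(rank_two_measure(np.diag([1.0, 0.0, 0.0])))  # -> 0.0
print(rank_two_measure(np.diag([1.0, 1.0, 1.0])))  # -> 0.0
```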


2.4 Star patterns

The method we use to find star patterns is a combination of the ideas in [2, 3, 1]. It is explained in detail in [5, 4]. The basics are:

1. Compute the image gradient ∇I. A differentiated Gaussian filter with σ = 1 is used (same as for the Harris methods).

2. Star patterns are found as local maxima of the function

$$S_{\mathrm{star}} = \int g(\mathbf{x})\, \langle \nabla I, \mathbf{x} \rangle^2 \, d\mathbf{x}. \qquad (3)$$

g is a Gaussian window function with σ = 2. $S_{\mathrm{star}}$ is made more selective by inhibition with a measure for simple signals. Local maxima points are then found by non-maximum suppression.

3. The point positions are improved by minimizing the circle pattern function

$$S_{\mathrm{circle}}(\mathbf{p}) = \int g(\mathbf{x})\, \langle \nabla I, \mathbf{x} - \mathbf{p} \rangle^2 \, d\mathbf{x}, \qquad (4)$$

which is computed around each of the local maxima points.

The algorithm needs to compute a subset of monomials (or derivatives) up to the second order on the three images $I_x^2$, $I_y^2$, and $I_xI_y$. All filters can be made separable.
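As an illustration of this structure: expanding $\langle \nabla I, \mathbf{x} \rangle^2 = x^2 I_x^2 + 2xy\, I_xI_y + y^2 I_y^2$ turns eq. (3) into three correlations of those images with monomial-weighted Gaussian kernels. A sketch assuming NumPy/SciPy and a window centred on each pixel; the kernel radius is our own choice, and the inhibition with a simple-signal measure is omitted.

```python
import numpy as np
from scipy.ndimage import correlate, gaussian_filter

def star_response(I, sigma_grad=1.0, sigma_win=2.0, radius=6):
    # Image gradient as in the Harris methods.
    Ix = gaussian_filter(I, sigma_grad, order=(0, 1))
    Iy = gaussian_filter(I, sigma_grad, order=(1, 0))
    # Monomial-weighted Gaussian window; each kernel is separable,
    # e.g. g(x, y) x^2 = [g1(x) x^2] [g1(y)].
    y, x = np.mgrid[-radius:radius + 1, -radius:radius + 1].astype(float)
    g = np.exp(-(x ** 2 + y ** 2) / (2.0 * sigma_win ** 2))
    # S_star of eq. (3) as three correlations.
    return (correlate(Ix * Ix, g * x ** 2)
            + correlate(Ix * Iy, 2.0 * g * x * y)
            + correlate(Iy * Iy, g * y ** 2))
```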

3 Experimental setup

3.1 Repeatability criterion

The repeatability criterion is the same as in [11]. We give a short summary here. Let $I_r$ denote the reference image and let $I_i$ denote an image that has been transformed. Let $\{\mathbf{x}_r\}$ denote the interest points in the reference image $I_r$, and let $\{\mathbf{x}_i\}$ denote the interest points in the transformed image $I_i$. For two corresponding points $\mathbf{x}_r$ and $\mathbf{x}_i$ in images $I_r$ and $I_i$ we have

$$\mathbf{x}_i = H_{ri}\mathbf{x}_r, \qquad (5)$$

where $H_{ri}$ denotes the homography between the two images (the points are here represented in homogeneous coordinates). As in [11] we remove the points that do not lie in the common scene part of images $I_r$ and $I_i$. Let $R_i(\varepsilon)$ denote the set of corresponding point pairs within ε-distance, i.e.

$$R_i(\varepsilon) = \{(\mathbf{x}_r, \mathbf{x}_i) \mid \operatorname{dist}(H_{ri}\mathbf{x}_r, \mathbf{x}_i) < \varepsilon\}. \qquad (6)$$

The repeatability rate for image $I_i$ is defined as

$$r_i(\varepsilon) = \frac{|R_i(\varepsilon)|}{\min(n_r, n_i)}, \qquad (7)$$

where $n_r = |\{\mathbf{x}_r\}|$ and $n_i = |\{\mathbf{x}_i\}|$ are the numbers of points detected in the common part of the two images. Note that $0 \leq r_i \leq 1$.
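A sketch of the repeatability computation, assuming NumPy. The points are assumed to already lie in the common scene part, and the greedy one-to-one matching is our own simplification of the correspondence counting.

```python
import numpy as np

def repeatability(xr, xi, H_ri, eps):
    # xr, xi: (n, 2) arrays of points in the reference and transformed image;
    # H_ri: 3x3 homography of eq. (5); eps: distance threshold of eq. (6).
    xh = np.hstack([xr, np.ones((len(xr), 1))]) @ H_ri.T
    proj = xh[:, :2] / xh[:, 2:3]          # H_ri x_r in inhomogeneous coords
    d = np.linalg.norm(proj[:, None, :] - xi[None, :, :], axis=2)
    matched = 0
    while d.size and d.min() < eps:        # count pairs within eps, eq. (6)
        r, c = np.unravel_index(d.argmin(), d.shape)
        matched += 1
        d[r, :] = np.inf                   # each point is matched at most once
        d[:, c] = np.inf
    return matched / min(len(xr), len(xi))  # repeatability rate, eq. (7)
```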


3.2 Transformation of scale, rotation and view

The homographies for the rotation, scale, and view transformations can be found in many textbooks, but we still include a short derivation for the sake of completeness. The relation between a point $\mathbf{X} = (X\ Y\ Z)^T$ in the 3D world and the corresponding point $\mathbf{x} = (x\ y)^T$ in the image is

$$\lambda \begin{pmatrix} \mathbf{x} \\ 1 \end{pmatrix} = P \begin{pmatrix} \mathbf{X} \\ 1 \end{pmatrix}, \qquad (8)$$

where $P = K[R|\mathbf{t}]$ is the camera (projection) matrix. The matrix $K$ contains the camera parameters. We assume the simple model

$$K = \begin{pmatrix} fI & \mathbf{x}_0 \\ \mathbf{0}^T & 1 \end{pmatrix}, \qquad (9)$$

where $f$ is the focal length and $\mathbf{x}_0$ is the origin of the image coordinate system. The matrix $R$ and the vector $\mathbf{t}$ define the transformation of the 3D coordinate system. For the reference image we assume that the optical axis of the camera is orthogonal to the image in the 3D world and that the distance between the camera and the image is $d$. This gives $R = I$ and $\mathbf{t} = \mathbf{0}$, and from (8) we then get

$$\mathbf{X} = dK^{-1} \begin{pmatrix} \mathbf{x}_r \\ 1 \end{pmatrix}. \qquad (10)$$

We now use (8) and (10) to compute a relation between the point $\mathbf{x}_r$ in the reference image and a corresponding point $\mathbf{x}$ in another image taken with the camera in a different position, i.e. for a general choice of $R$ and $\mathbf{t}$. The relation becomes

$$\lambda \begin{pmatrix} \mathbf{x} \\ 1 \end{pmatrix} = K[R|\mathbf{t}] \begin{pmatrix} \mathbf{X} \\ 1 \end{pmatrix} = KR\mathbf{X} + K\mathbf{t} = KRdK^{-1} \begin{pmatrix} \mathbf{x}_r \\ 1 \end{pmatrix} + K\mathbf{t} = \left( dKRK^{-1} + [\mathbf{0}|K\mathbf{t}] \right) \begin{pmatrix} \mathbf{x}_r \\ 1 \end{pmatrix}, \qquad (11)$$

and we identify the general homography between the reference image and a transformed image as

$$H = dKRK^{-1} + [\mathbf{0}|K\mathbf{t}]. \qquad (12)$$

We get the following homographies for the special cases of rotation, scale, and view (a code sketch follows the list):

• Plane rotation:
$$H = KRK^{-1}, \quad \text{where} \quad R = \begin{pmatrix} \cos\varphi & -\sin\varphi & 0 \\ \sin\varphi & \cos\varphi & 0 \\ 0 & 0 & 1 \end{pmatrix}. \qquad (13)$$


• Scale change:
$$H = dI + [\mathbf{0}|K\mathbf{t}], \quad \text{where} \quad \mathbf{t} = \begin{pmatrix} 0 \\ 0 \\ d_0 \end{pmatrix}. \qquad (14)$$

• Viewpoint change: Equation (12) with
$$R = \begin{pmatrix} 1 & 0 & 0 \\ 0 & \cos\varphi & -\sin\varphi \\ 0 & \sin\varphi & \cos\varphi \end{pmatrix}, \quad \mathbf{t} = (I - R) \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix}. \qquad (15)$$
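A sketch of eqs. (12)-(14) in Python, assuming NumPy; the function names and the explicit principal point (x0, y0) are our own notation.

```python
import numpy as np

def make_K(f, x0, y0):
    # Simple camera matrix of eq. (9): focal length f, principal point (x0, y0).
    return np.array([[f, 0.0, x0], [0.0, f, y0], [0.0, 0.0, 1.0]])

def homography(K, R, t, d):
    # General homography of eq. (12): H = d K R K^{-1} + [0 | K t].
    H = d * K @ R @ np.linalg.inv(K)
    H[:, 2] += K @ t
    return H

def rotation_H(K, phi):
    # Plane rotation, eq. (13): rotation about the optical axis, t = 0
    # (the overall scale d drops out in homogeneous coordinates).
    c, s = np.cos(phi), np.sin(phi)
    R = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    return homography(K, R, np.zeros(3), 1.0)

def scale_H(K, d, d0):
    # Scale change, eq. (14): R = I and t = (0, 0, d0)^T give H = d I + [0|Kt].
    return homography(K, np.eye(3), np.array([0.0, 0.0, d0]), d)
```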

The list below contains data for the experiments:

• Number of test images: 6, see figure 2.

• Plane rotation: 18 images evenly spread between 0° and 180° ($\varphi = \pi k/(N-1)$, $k = 0, 1, \ldots, N-1$, $N = 18$). The first image is used as reference.

• Scale change: 9 images with a scale change (non-evenly spread) up to three times the original size ($d_0 = 1/c - 1$, where $c = 2(k/(N-1))^2 + 1$, $k = 0, \ldots, N-1$, $N = 9$). The first image is used as reference.

• View change: 21 images with a change in view between −45° and 45° ($\varphi = \pi k/(4N)$, $k = -N, \ldots, N$, $N = 10$). The middle image is used as reference.

• Two choices of ε are used, ε = 0.5 and ε = 1.5.

Figure 2 shows the test images that are used for the rotation, scale, and view transformations. They range from real images to synthetic images. Figure 3 shows an example of each of the transforms for one of the test images. The images have been expanded by zero padding before the transformation. This helps to avoid loss of points in the transformations (note, however, that the padding is not enough for the scale transformation).
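A sketch of how such a simulated sequence could be produced, assuming OpenCV for the warping; the padding width and the bilinear interpolation are our own assumptions, since the report does not specify them.

```python
import numpy as np
import cv2

def rotation_angles(N=18):
    # Plane rotation: phi = pi k / (N - 1), k = 0, ..., N - 1.
    return [np.pi * k / (N - 1) for k in range(N)]

def scale_factors(N=9):
    # Non-even scale spread: c = 2 (k / (N - 1))^2 + 1, so c runs from 1 to 3.
    return [2.0 * (k / (N - 1)) ** 2 + 1.0 for k in range(N)]

def warp_with_padding(I, H, pad=128):
    # Zero padding before the transformation, to reduce the loss of points
    # at the image borders.
    Ipad = cv2.copyMakeBorder(I, pad, pad, pad, pad,
                              cv2.BORDER_CONSTANT, value=0)
    # Conjugate H with the translation that accounts for the padding offset.
    T = np.array([[1.0, 0.0, pad], [0.0, 1.0, pad], [0.0, 0.0, 1.0]])
    Hp = T @ H @ np.linalg.inv(T)
    h, w = Ipad.shape[:2]
    return cv2.warpPerspective(Ipad, Hp, (w, h), flags=cv2.INTER_LINEAR)
```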

3.3 Transformation of light

The test images for the light transformations are, however, not simulated. These images are taken of 3D scenes by a stationary camera, and either the camera shutter or the light source is changed. Three sequences were taken; they are shown in figure 4. In the first sequence we change the camera shutter, and the middle image is used as reference. The last two sequences are taken by changing the light source position, and the first image is used as reference in both cases. The scene is not planar, and we therefore do not really have a ground truth in the last two cases. But we believe that the evaluation is still relevant, since similar situations appear for example in object recognition applications, where the training data for an object is taken under different lighting conditions than the query data, see e.g. [5].


Figure 2: The six test images. Top row: toy car, aerial image, corner test image. Bottom row: toy monastery, Picasso painting, semi-synthetic room.


Figure 3: Examples of the rotation, scale, and view transformations for one of the test images (selected frames from each sequence).


Figure 4: Test sequences for the light transformations. The first sequence (13 frames) varies the camera shutter; the last two sequences (7 frames each) vary the light source position.


Figure 5: Average results of the two Harris versions on the test images in figure 2 for the rotation, scale, and view transformations.

4 Experimental results

We show the average results over all test images, but also the individual results for each test image, since the outcome differs depending on the type of test image.

4.1 Comparison of the two Harris versions

From the results it was found that Harris with subpixel accuracy performed overall much better than Harris using only non-max suppression; figure 5 shows one example. The difference is most obvious for ε smaller than 1 pixel, as would be expected. Because of this result we will only include the subpixel Harris from now on, to keep the presentation uncluttered.

4.2 Rotation, Scale, and View

Figure 6 contains the average results for all methods except Harris nms. The results are somewhat inconclusive, but if we examine each test image separately we see that subpixel Harris performs best for natural images. The star patterns and the fourth order tensors perform equally well or better than Harris on images that better resemble their models, i.e. straight lines and sharp corner points, especially for the scale transformation (figures 9 and 12).

Figure 6: Average results on the test images in figure 2 for the rotation, scale, and view transformations.

4.3 Variation of illumination

The results on the light transformation sequences are shown in figure 13. We conclude that the differences between the methods are not significant.

Figure 7: Result on test image 1 for the rotation, scale, and view transformations.

Figure 8: Result on test image 2 for the rotation, scale, and view transformations.

Figure 9: Result on test image 3 for the rotation, scale, and view transformations.

Figure 10: Result on test image 4 for the rotation, scale, and view transformations.

Figure 11: Result on test image 5 for the rotation, scale, and view transformations.

Figure 12: Result on test image 6 for the rotation, scale, and view transformations.

Figure 13: Results on the three test sequences for the light transformations.

5 Conclusions and discussion

From the experiments we may conclude that the choice of operator depends on the image content. The star-pattern method and the fourth order tensor assume a somewhat more advanced model than the Harris operator. These methods have the best performance when their corresponding models can describe the image content, as would be expected. If the model is less valid, it seems better to use a cruder model, as in the Harris operator.

The operators described here are intended to be used in object recognition tasks. Other interest point operators have been used in this application; one of the most successful methods in recent years is to find local maxima in DoG (Difference of Gaussians) scale space, see e.g. [6, 7]. These points were also evaluated (using the implementation by Lowe) on the test images in this report, but the results were poor. However, it is not fair to evaluate the stability of an operator that is computed at several scales. The larger scales may be less stable, but that may not matter if the descriptor is computed in a region that is proportional to the scale of the interest point.

Acknowledgments

We gratefully acknowledge the support from the Swedish Research Council through a grant for the project A New Structure for Signal Processing and Learning, and the European Commission through the VISATEC project IST-2001-34220 [12].

References

[1] J. Bigün. Pattern recognition in images by symmetries and coordinate transformations. Computer Vision and Image Understanding, 68(3):290–307, 1997.

[2] W. Förstner. A framework for low level feature extraction. In Proceedings of the third European Conference on Computer Vision, volume II, pages 383–394, Stockholm, Sweden, May 1994.

[3] Björn Johansson. Multiscale curvature detection in computer vision. Lic. Thesis LiU-Tek-Lic-2001:14, Dept. EE, Linköping University, SE-581 83 Linköping, Sweden, March 2001. Thesis No. 877, ISBN 91-7219-999-7.

[4] Björn Johansson and Anders Moe. Patch-duplets for object recognition and pose estimation. Technical Report LiTH-ISY-R-2553, Dept. EE, Linköping University, SE-581 83 Linköping, Sweden, November 2003.

[5] Björn Johansson and Anders Moe. Patch-duplets for object recognition and pose estimation. In Ewert Bengtsson and Mats Eriksson, editors, Proceedings SSBA'04 Symposium on Image Analysis, pages 78–81, Uppsala, March 2004. SSBA.

[6] David G. Lowe. Object recognition from local scale-invariant features. In Proc. ICCV'99, 1999.

[7] David G. Lowe. Local feature view clustering for 3D object recognition. In Proc. CVPR'01, 2001.

[8] K. Mikolajczyk and C. Schmid. A performance evaluation of local descriptors. In IEEE Conference on Computer Vision and Pattern Recognition, pages 257–263, June 2003.

[9] K. Nordberg. A fourth order tensor for representation of orientation and position of oriented segments. Technical Report LiTH-ISY-R-2587, Dept. EE, Linköping University, SE-581 83 Linköping, Sweden, February 2004.

[10] Klas Nordberg and Robert Söderberg. Detection and estimation of features for estimation of position. In Ewert Bengtsson and Mats Eriksson, editors, Proceedings SSBA'04 Symposium on Image Analysis, pages 74–77, Uppsala, March 2004. SSBA.

[11] C. Schmid, R. Mohr, and C. Bauckhage. Evaluation of interest point detectors. Int. Journal of Computer Vision, 37(2):151–172, 2000.
