
Institutionen för systemteknik
Department of Electrical Engineering

Real-time Object Recognition on a GPU

Master's thesis (examensarbete) carried out in Image Processing
at Linköping Institute of Technology
by
Johan Pettersson

LITH-ISY-EX--07/4034--SE

Linköping 2007

Department of Electrical Engineering, Linköpings universitet


Supervisors: Henrik Turbell, SICK-IVP
             Maria Magnusson, ISY, Linköpings universitet
Examiner:    Maria Magnusson, ISY, Linköpings universitet


Division, Department: Computer Vision Laboratory, Department of Electrical Engineering, Linköpings universitet, SE-581 83 Linköping, Sweden
Date: 2007-10-31
Language: English
Report category: Master's thesis (examensarbete)
URL for electronic version: http://www.control.isy.liu.se, http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-10238
ISRN: LITH-ISY-EX--07/4034--SE
Title (Swedish): Objektlokalisering i realtid på en GPU
Title: Real-time Object Recognition on a GPU
Author: Johan Pettersson


Keywords: object recognition, pattern matching, GPU, CUDA, transformation, rotation, scale, noise, illumination, occlusion, clutter, evaluation


Abstract

Shape-Based Matching (SBM) is a known method for 2D object recognition that is rather robust against illumination variations, noise, clutter and partial occlusion. The objects to be recognized can be translated, rotated and scaled.

The translation of an object is determined by evaluating a similarity measure for all possible positions (similar to cross correlation). The similarity measure is based on dot products between normalized gradient directions in edges. Rotation and scale are determined by evaluating all possible combinations, spanning a huge search space. A resolution pyramid is used to form a search heuristic that achieves real-time performance.

In SBM, a model consisting of normalized edge gradient directions is constructed for every combination of rotation and scale. We have avoided this by using (bilinear) interpolation in the search gradient map, which greatly reduces the amount of storage required.

SBM is highly parallelizable by nature, and with our suggested improvements it becomes well suited for running on a GPU. This has been implemented and tested, and the results clearly outperform those of our reference CPU implementation (by factors in the hundreds). It is also very scalable and easily benefits from future devices.

Extensive evaluation material and tools for evaluating object recognition algorithms have been developed, and the implementation is evaluated and compared to two commercial 2D object recognition solutions. The results show that the method is very powerful when dealing with the distortions listed above and competes well with its opponents.


Acknowledgments

This thesis project has been carried out at SICK-IVP, a machine vision development company based in Linköping. First of all I would like to thank Henrik Turbell, my supervisor at SICK-IVP, for supporting me in this project in every way possible. Henrik has contributed many good ideas and suggestions along the way and has additionally assisted me in planning and keeping me on track.

I would also like to thank my examiner and supervisor, Maria Magnusson, at the Computer Vision Laboratory at Linköping Institute of Technology, for ideas, supervision and suggestions that improved the quality and results of this thesis.

I am very grateful to SICK-IVP for providing the resources and equipment, often at short notice, required to complete this work. Finally, I thank the staff at SICK-IVP for general advice, for sharing experiences regarding thesis projects and for being supportive.


Contents

1 Introduction
  1.1 Object recognition
    1.1.1 Correlation
    1.1.2 Search
  1.2 Previous results
  1.3 Alternative methods
    1.3.1 Edge based methods
    1.3.2 High level feature based methods
  1.4 Definitions
    1.4.1 Pose
    1.4.2 Occlusion and clutter
    1.4.3 Stages
  1.5 Requirements and constraints
  1.6 Purpose

2 Shape-Based Matching
  2.1 Overview
  2.2 Similarity measure
    2.2.1 Vector notation
    2.2.2 Component notation
    2.2.3 Angular notation
  2.3 Model
  2.4 Search
    2.4.1 Exhaustive search
    2.4.2 Tracking
    2.4.3 Optimization
    2.4.4 Reversed polarization

3 Analysis and improvements
  3.1 Similarity response
    3.1.1 Characteristics
    3.1.2 Distortion
  3.2 Model
    3.2.1 Directional non-maximum suppression
    3.2.2 Rotation requirements
    3.2.3 Scale requirements
    3.2.4 Match interpolation
  3.3 Search
    3.3.1 Exhaustive search
    3.3.2 Tracking
    3.3.3 Separated tracking
    3.3.4 Effort
  3.4 Rotationally symmetric objects
    3.4.1 Detection and modeling
    3.4.2 Search
    3.4.3 Weighting
    3.4.4 Effort
    3.4.5 Conclusions

4 GPU implementation
  4.1 GPU
    4.1.1 Hardware
    4.1.2 CUDA
  4.2 SBM on a GPU
  4.3 Implementation
    4.3.1 Modeling
    4.3.2 Resolution pyramid
    4.3.3 Exhaustive search
    4.3.4 Tracking
  4.4 Evaluation

5 Experiments
  5.1 Setup
    5.1.1 Obtaining evaluation material
    5.1.2 Ground truth
    5.1.3 Parameters
  5.2 Results
    5.2.1 Representation
    5.2.2 Correction
  5.3 Evaluation
    5.3.1 Recognition
    5.3.2 Accuracy
    5.3.3 Observations

6 Conclusions
  6.1 Improvements
  6.2 Evaluation
  6.3 GPU implementation
  6.4 Future work


Chapter 1

Introduction

1.1 Object recognition

Given an image of an object referred to as the model image, the task is to locate the object in other images referred to as search images. In a typical application a camera is mounted above a conveyor belt and the target is to identify various objects and extract their pose (position, rotation and scale). The information gained can be used for decision making to reach higher goals such as further inspections to detect flaws, picking up objects or object sorting. The majority of applications require the recognition to take place in real-time.

Objects are assumed to be rigid themselves but are allowed to be translated, rotated and also scaled to a small degree. Recognition should be robust against clutter, partial object occlusion and varying illumination. Clutter and occlusion, and their relation to each other, are further described in section 1.4.2. Examples of different situations where an object shall be recognized can be seen in figure 1.1. A complete description of requirements and constraints for the recognition can be found in section 1.5.

Figure 1.1. Example images of where an object should be recognized. The model image is to the left. The others are search images featuring illumination, occlusion and clutter combined with rotation.

1.1.1 Correlation

One of the easiest ways to compare two images is the use of cross correlation, of which several variations are described in [5]. The objects to be recognized by correlation are assumed to be of identical shape as the model and may be neither rotated nor scaled. The only manageable transformation is translation. Several modifications (mainly different normalizations) to cross correlation have been made for it to cope better with small variations in illumination. However, none of these modifications deal with partial occlusion in a straightforward way, as occluded parts of an object always contribute to the score. The amount is supposed to be less than what the object itself would have contributed (under certain normalizations), but there will always be an error on average. Therefore, given an amount of occlusion, it is difficult to tell how much the score is expected to degrade, making it unsuitable as a measure of similarity.

Shape-Based Matching (SBM) is a method suggested by Steger [11] for recognizing partially occluded objects which is very robust against illumination changes. It is closely related to correlation. The main difference is that SBM uses the inner product of normalized edge gradient vectors as similarity measure instead of raw image intensity. The response is normalized to the range [−1, 1], but we have often restricted our investigations to the subinterval [0, 1], which is motivated in section 2.2.1. The effect of partial occlusion appears directly in the response of the similarity measure. A perfectly matching object exposed to 17% occlusion (83% of the object visible) ideally results in the response value of 0.83. A more detailed description of SBM is given in chapter 2.

1.1.2 Search

To recognize an object, the search image is correlated with the model using the similarity measure of SBM. In a sense, the model image is fitted in all possible translations within the search image and a score is assigned to each position. As correlation based methods do not cope with any other object transformations, these must be handled by evaluating the results of multiple correlations. Typically a set of models is generated and used, one model for each combination of scale and rotation, as illustrated in figure 1.2. These tend to span a vast search space whose size is determined by the limits of the transformations and the required granularity, which often can be approximated from the size of the model image; see sections 3.2.2 and 3.2.3. If the chosen granularity is too coarse there is a risk of missing the object, as that particular part of the search space is not covered, and the recognition might fail.
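The brute-force traversal described above can be sketched as follows. This is an illustrative skeleton only, not the thesis implementation: score_fn, the angle list and the scale list are placeholder names, and score_fn(angle, scale) is assumed to return one full 2D score map over all translations.

```python
import numpy as np

def exhaustive_pose_search(score_fn, angles, scales):
    """Evaluate one correlation per (rotation, scale) combination and keep
    the overall best-scoring pose (x, y, angle, scale)."""
    best_score, best_pose = -np.inf, None
    for angle in angles:
        for scale in scales:
            response = score_fn(angle, scale)            # one full correlation
            y, x = np.unravel_index(np.argmax(response), response.shape)
            if response[y, x] > best_score:
                best_score, best_pose = response[y, x], (x, y, angle, scale)
    return best_score, best_pose

# Toy score function with its optimum at angle 30, scale 1.0, position (3, 2):
def toy_score(angle, scale):
    response = np.zeros((5, 5))
    response[2, 3] = np.exp(-(angle - 30) ** 2 / 100.0 - (scale - 1.0) ** 2 * 50)
    return response

score, pose = exhaustive_pose_search(toy_score,
                                     angles=range(0, 360, 10),
                                     scales=[0.9, 1.0, 1.1, 1.2])
```

Note how the cost is one full correlation per rotation/scale combination, which is exactly why the search space becomes so expensive to cover exhaustively.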

The results of certain choices of strategies to traverse the search space and related parameters are briefly discussed in section 1.6 and are investigated further in section 2.4.

1.2 Previous results

There are several publications available on SBM and how it copes with rotation; see for example Steger [11] and Ulrich [14]. An evaluation of the method when also dealing with scale is presented by Steger in [12]. The exact search procedure is not described in any of the articles mentioned. A more thorough description of a fully functioning search procedure will be presented in section 3.3. In the absence of a detailed description we note that the methods may be alike.

Figure 1.2. The vast search space for recognizing objects with correlation based methods. All rotations and scales must be evaluated explicitly.

A thorough evaluation of the SBM method can be found in [14], and much of the reasoning for the modified generalized Hough transform in [15] is also valid for SBM. However, none of them deal explicitly with scale, as it is assumed to be constant, which is true for many industrial applications. There are applications, however, where scale must be considered; for instance if a camera is mounted on the side of a conveyor and objects pass by sideways at various depths (relative to the camera) on the belt. If significant changes in scale are allowed, this has a huge impact on the search space. Therefore, rotation and scale are studied separately in this thesis, but situations where both are present are still considered.

The currently available performance evaluations of SBM were made on a system that is a bit out of date and does not meet modern standards and expectations. Using a modern system, more fit for the intended applications of this implementation, may very well motivate different choices of parameters and search strategies that also affect robustness.

HALCON is a software library for machine vision from MVTec and it contains several 2D matching methods. One of them is referred to as "Shape-Based" in the reference manual [7]. Its parameters and descriptions match well with what is described for SBM [11]. Additionally, MVTec owns a patent covering central parts of SBM [13]. We therefore assume that the "Shape-Based" matching method in HALCON is close to an implementation of SBM. It is used as a reference for what a good implementation of SBM can achieve.

Another commercial matching product is Cognex’s PatMax algorithm [4] which will also be used in the evaluation. It is claimed to use geometric features along with pattern brightness and shading to find objects. The object descriptions used are not limited by a discrete grid which makes it possible for PatMax to deliver sub-pixel accurate results [4].


1.3 Alternative methods

There are several methods available for 2D object recognition. An extensive review of different methods is given by Ulrich in [14]. SBM was chosen as the target of this thesis as it is a simple method with a number of potential improvements. The most interesting of the competing methods are mentioned here, along with short notes on their drawbacks and advantages compared to SBM.

1.3.1 Edge based methods

It is suggested that the edges of an object in particular contain a lot of information useful for recognition [1, 2]. Considering only the edges of an object is a considerable reduction of the data that has to be evaluated. Besides the SBM method mentioned above, there are several other such methods available. Two of them are described here. Neither of them can extract rotation or scale in a straightforward way; all rotations and scales must be evaluated explicitly.

The modified generalized Hough transform

The Hough transform (HT) can be used to recognize any analytically describable shape [6]. A generalization of the Hough transform (GHT) which handles any discrete shape is also available [1]. Ulrich presents a modification of the generalized Hough transform (MGHT) [14], gaining real-time performance by taking advantage of resolution hierarchies, which is not as straightforward as it might first seem. The principles of it are described below; a complete description can be found in [14].

Initially, a model of the object to be recognized is generated. A gradient map is generated from the model image and edges are extracted. Then a table, mapping edge gradient direction (represented as an angle) to offset vectors to the center of the model, is created.

In the search procedure a gradient map of the search image is used. For each gradient direction (represented as an angle) a number of displacement vectors are looked up using the table. A vote is placed on each position pointed at by a displacement vector. If the point belongs to an edge of the object, at least one of the vectors points at the center of the object that is to be found, and a vote will be placed for that position. After this is done, a peak has appeared in the voting space at the center of the object, as this is the single point that the most displacement vectors point at (at least one for each gradient belonging to the edge of the object in the search image).

To utilize a resolution pyramid, the model image is divided into tiles at all levels but the coarsest. A separate lookup table is generated for each tile. Searching then begins with a regular search at the coarsest pyramid level, which yields a coarse position of the object that is refined at each level of the pyramid. In the refinement, the entire gradient map is considered as before, but for each gradient only the mapping tables connected to tiles that have displacement vectors pointing into a neighborhood of the known position are evaluated. This way the number of votes to place, and the voting space itself, is much smaller than before.
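To make the table-and-voting idea concrete, here is a sketch of the base GHT machinery only (without Ulrich's tiling and pyramid extensions). All function and variable names are our own illustrative choices, not from [1] or [14].

```python
import numpy as np
from collections import defaultdict

def build_r_table(edge_points, gradient_angles, center, n_bins=36):
    """Map quantized gradient direction -> offset vectors to the model center."""
    table = defaultdict(list)
    for (x, y), theta in zip(edge_points, gradient_angles):
        b = int(theta / (2 * np.pi) * n_bins) % n_bins
        table[b].append((center[0] - x, center[1] - y))
    return table

def vote(edge_points, gradient_angles, table, shape, n_bins=36):
    """Place one vote per displacement vector; the object center collects at
    least one vote per object edge point and shows up as a peak."""
    acc = np.zeros(shape, dtype=int)
    for (x, y), theta in zip(edge_points, gradient_angles):
        b = int(theta / (2 * np.pi) * n_bins) % n_bins
        for dx, dy in table[b]:
            cx, cy = x + dx, y + dy
            if 0 <= cx < shape[1] and 0 <= cy < shape[0]:
                acc[cy, cx] += 1
    return acc

# Model: four edge points around center (2, 2); search: same shape shifted by (4, 3).
model_pts = [(1, 1), (2, 1), (3, 1), (3, 2)]
angles = [0.0, 0.0, np.pi / 2, np.pi / 2]
table = build_r_table(model_pts, angles, center=(2, 2))
search_pts = [(x + 4, y + 3) for x, y in model_pts]
acc = vote(search_pts, angles, table, shape=(10, 10))
peak = np.unravel_index(np.argmax(acc), acc.shape)   # accumulator peak at the shifted center
```

In the toy run, every edge point of the shifted object contributes one vote to the new center (6, 5), so the accumulator peaks there with one vote per edge point.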


The experimental results of this method are promising and seem to be on par with SBM [14]. However, to be able to utilize a resolution pyramid the method becomes far more complex, which makes it much more difficult to analyze, improve and evaluate.

Chamfer matching

The Chamfer method suggested by Borgefors [2] uses a chamfer distance transform of an edge image as a model. Another edge image is extracted from the search image. To find the position of the object a similarity measure is evaluated for each potential position (as for cross correlation). The similarity measure is the mean distance from each edge in the search image to the closest edge in the model for the current translation. This is obtained by accumulating samples from the distance map of the model for each edge in the search image. If there is a perfect match, the mean distance is 0.

The robustness can be improved by taking edge direction information into account [10], but this is not used in its original form. However, it is uncertain whether this is enough to reach the robustness of SBM, which makes extensive use of edge directions, and does so in a more straightforward manner.
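A minimal sketch of the chamfer measure described above, with illustrative names and a brute-force distance computation standing in for a proper chamfer distance transform:

```python
import numpy as np

def distance_map(edge_points, shape):
    """Distance from every pixel to the nearest model edge point (a slow,
    brute-force stand-in for a chamfer distance transform)."""
    h, w = shape
    dm = np.empty(shape)
    for y in range(h):
        for x in range(w):
            dm[y, x] = min(np.hypot(x - ex, y - ey) for ex, ey in edge_points)
    return dm

def chamfer_score(dm, search_edges, tx, ty):
    """Mean distance from the translated search edges to the nearest model
    edge; 0 means a perfect match. Samples falling outside the map are
    skipped for simplicity (a real implementation needs a border policy)."""
    h, w = dm.shape
    total, n = 0.0, 0
    for x, y in search_edges:
        u, v = x - tx, y - ty            # map back into the model frame
        if 0 <= u < w and 0 <= v < h:
            total += dm[v, u]
            n += 1
    return total / n if n else np.inf

model_edges = [(2, 2), (3, 2)]
dm = distance_map(model_edges, shape=(6, 6))
search_edges = [(3, 2), (4, 2)]          # the model edges shifted by (1, 0)
```

With the correct translation (tx, ty) = (1, 0) every search edge lands on a model edge and the score is exactly 0; a wrong translation accumulates positive distances instead.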

1.3.2 High level feature based methods

Another class of methods extracts a set of features of a higher level and uses them in a model. These features that are chosen to represent an object might be corners, edges, curves, angles or other geometrical relations between points and lines for instance. Another set of features are extracted from the search image. The problem of recognition is now transferred to finding a relation between the two sets of features. This is a nontrivial problem and under the conditions in this thesis, distortions may cause some features to be missing and new features to be present in the search image.

One advantage of these kinds of methods is that the search for objects usually does not depend on a discrete search space as SBM, MGHT and Chamfer matching do. This makes it easier to obtain higher accuracy in the result. Parameters such as rotation and scale can be extracted without having to try all combinations explicitly, increasing scalability regarding these transformations that are so difficult to handle for the other methods mentioned here. This representation of an object, compared to the model image, and even to object edge gradients, is a very large reduction of data. Of course this is good from a performance point of view, but it might not be good for the ability to recognize objects. The methods depend on enough instances of some feature being present in the objects to be recognized.

1.4 Definitions

In this section we make some definitions and clarifications of the various terms and expressions used in this thesis.


1.4.1 Pose

The result of extracting an object's pose is a transformation from the model image to the particular search image such that the object in both images overlaps. To define such a transformation, a coordinate system must be fixed for each image. This coordinate system, (x, y)^T, is a Euclidean orthonormal system with its origin at the center of the image, as illustrated in figure 1.3. A transformation to a new coordinate system (x', y')^T, connected to the object in the search image, can be expressed as

\begin{pmatrix} x' \\ y' \end{pmatrix} =
\begin{pmatrix} \cos\varphi & \sin\varphi \\ -\sin\varphi & \cos\varphi \end{pmatrix}
\begin{pmatrix} s & 0 \\ 0 & s \end{pmatrix}
\begin{pmatrix} x \\ y \end{pmatrix} +
\begin{pmatrix} t_x \\ t_y \end{pmatrix} , \qquad (1.1)

where \varphi is the rotation angle, s is the scale and (t_x, t_y)^T is the translation vector. This is illustrated in figure 1.3.


Figure 1.3. An illustration of the transformation from a model image to a search image using a known pose of an object.
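Equation (1.1) can be applied directly to model-image coordinates; the helper below is our own illustrative sketch (names and conventions are not from the thesis), following the equation term by term:

```python
import numpy as np

def model_to_search(points, phi, s, t):
    """Apply the pose transform of equation (1.1): scale by s, rotate by phi,
    then translate by t = (tx, ty). `points` is an (N, 2) array of (x, y)."""
    R = np.array([[np.cos(phi),  np.sin(phi)],
                  [-np.sin(phi), np.cos(phi)]])
    scaled = s * np.asarray(points, dtype=float)
    return (R @ scaled.T).T + np.asarray(t, dtype=float)

p = model_to_search([(1.0, 0.0)], phi=np.pi / 2, s=2.0, t=(5.0, 5.0))
# (1, 0) is scaled to (2, 0), rotated to (0, -2), then translated to (5, 3)
```

Note that with the sign convention of equation (1.1) (positive sin in the upper-right entry), a positive phi rotates (1, 0) toward negative y.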

1.4.2 Occlusion and clutter

Partial occlusion is where an object is partially occluded by something else in an image. We allow ourselves to refer to partial occlusion as just occlusion. Completely occluded objects (by fully opaque occluders) obviously cannot be recognized and hence there should be no confusion. Clutter is when things other than the object to be recognized are present in the image. Usually it is something textured in the background, but clutter is also introduced where there is occlusion, as new details are added. For examples, see figure 1.1.

1.4.3 Stages

Most object recognition methods, including SBM, can be separated into two stages, on-line and off-line. The off-line stage is where the models are created. It is run only once for each object to be recognized. Performance is usually not an issue in this stage and plenty of computational resources are assumed to be at disposal. The on-line stage is where the matching takes place. It runs each time an object is to be recognized. This is where the amount of resources required, both computational and storage, becomes important, as it has to meet the real-time requirements of the system.

1.5 Requirements and constraints

The goal of the development in this thesis is not to solve a particular industrial application but rather to investigate the abilities of the method in doing so in a more general sense. If the method is to be useful for real-world applications it has to meet the requirements in this section.

Here follows a detailed (but not necessarily complete) description of what is to be expected from the object recognition in this thesis and under what circumstances. The targeted applications that the algorithm is evaluated for are not as strict as those in the previous evaluations in [14] and [15]. Some of the requirements here have much in common with those related to matching of rigid objects in Ulrich's investigations and are thoroughly described and further motivated by him [14].

• Objects are assumed to be rigid. That is, the shape of the object should not change considerably between modeling and search occasions.

• The algorithm shall be developed with real-time applications in mind. It must be possible to choose parameters such that real-time requirements for the search procedure are met. However, the implementation for evaluation is experimental and need not actually run in real-time.

• Objects may be translated, rotated and scaled as described in section 1.4.1. An example application where scale is important is described in section 1.2. Limits on the amount of scale to be handled must be supplied, and we have chosen the interval [1/1.2, 1.2]. The investigation of scale effects is done with this interval in mind. Some results may not be accurate for significant deviations from it.

• The retrieved pose need not be very accurate for an object to be considered recognized, see section 5.3.1. If accurate measurements are required they can be done in a subsequent step.

• No particular restrictions on illumination conditions are set. This does not mean that the illumination may be chosen completely arbitrarily. The effect of various illumination conditions is highly dependent on the object and its material properties. Some objects may be more resilient to illumination changes than others. The background also plays a role, as reversed polarization can be caused; see section 2.4.4.

• The size of the images targeted in this thesis has been chosen to be 384 × 384. Sometimes, however, experiments have been made on images of size 640 × 480. The size has a significant impact on performance and accuracy. Whenever a conversion was necessary from 640 × 480 to 384 × 384, the image was first cropped to the proper aspect ratio and then resampled using at least bilinear interpolation.

• Objects must contain distinct edges that can be extracted using the method described in section 2.3.

• The shapes of the objects to be recognized are not restrained in any particular way. Some objects might be shaped in a way that all their pose parameters cannot be extracted, e.g. the rotation of a perfectly rotationally symmetric object, section 3.4.

• For the algorithm to be able to operate in real-time, the shape of the object must be recognizable at a reasonably high level of a resolution pyramid, even though the search image there is coarse. This puts a constraint on the size of the object but may vary for different objects and setups.

• The possibility of partial occlusion must be accepted. A limit here depends on many things, such as the object, the environment, the allowed transformations and the required robustness of the application. The more distortions that are present, the less occlusion can be accepted.

• For the evaluation here the algorithm only has to extract a good match of an object in a search image. If additional instances of an object are present in the image they may be neglected.
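The 640 × 480 to 384 × 384 conversion required above (crop to the proper aspect ratio, then bilinear resampling) can be sketched as follows. The thesis does not state its exact cropping or sampling-grid conventions, so this center crop with pixel-center sampling is only one reasonable choice:

```python
import numpy as np

def crop_and_resize(img, size=384):
    """Center-crop to a square aspect ratio, then resample to size x size
    with separable bilinear interpolation (assumes downsampling)."""
    h, w = img.shape
    side = min(h, w)
    y0, x0 = (h - side) // 2, (w - side) // 2
    sq = img[y0:y0 + side, x0:x0 + side].astype(float)

    # Pixel-center sample coordinates in the cropped image.
    c = (np.arange(size) + 0.5) * side / size - 0.5
    i0 = np.clip(np.floor(c).astype(int), 0, side - 2)
    f = c - i0

    rows = sq[i0] * (1 - f)[:, None] + sq[i0 + 1] * f[:, None]   # vertical pass
    out = rows[:, i0] * (1 - f) + rows[:, i0 + 1] * f            # horizontal pass
    return out

img = np.full((480, 640), 7.0)
out = crop_and_resize(img)      # a constant image stays constant, shape (384, 384)
```

A constant image surviving unchanged and monotone ramps staying monotone are two quick sanity checks that the interpolation weights sum to one.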

1.6 Purpose

The aim of this thesis is to implement the SBM method as it is described in [11, 12] and then improve it. To achieve this, a thorough analysis and evaluation of the algorithm were performed. A detailed description of the search procedure of SBM has not been acquired, and it was therefore redesigned, in a way that is likely to resemble the original SBM's search procedure, according to the available material. Finally, the implementation was compared to a commercial implementation of SBM available in the HALCON library [7]. Results of another matching method, PatMax provided by Cognex [4], introduced in section 1.2, were also used in the evaluation. The aim of the evaluation is to find weak points in the algorithms rather than verifying their usefulness.

The actual implementation of the original SBM is a simple task, as it is based on a few very basic image processing operations, such as Sobel filtering, correlation and thresholding. It is not crucial that the experimental implementation here shows real-time performance, and no significant effort has to be put on optimizations.

Some analytical work has already been presented by Ulrich [14], which is primarily aimed at the development of the MGHT but is also fully valid for SBM. Much focus in [11, 12], which describe SBM, is put on the possible optimizations and improvements available for the similarity measure. The analysis here will focus more on the actual search procedure, how it can be modified and, of course, the effects this has on both modeling and search.

As the search space of poses is typically very large with the approach used in the SBM algorithm, a complete exhaustive search is of academic interest only. Another basic issue this thesis investigates is whether such a method nevertheless could be more robust than other methods and, in such a case, under what circumstances. By different ways of pruning the search space the method, as we shall see, becomes of practical interest for real-time applications. Another track is adaptation of the algorithm to better benefit from modern hardware architectures, which will be investigated and evaluated.

A class of objects that currently are difficult to recognize using SBM (and other methods) are objects that are rotationally symmetric at coarse resolutions. The problem is fully defined and investigated in section 3.4.

An automated method for evaluating 2D object recognition will be a partial result. It will comprise a sufficient database of test images together with the material required for verification of the claims of this thesis.

Both SBM and PatMax are protected by patents. The (commercial) reason for SICK-IVP to supervise this thesis is therefore not to obtain a commercial implementation of SBM but to gain a deeper understanding of the weaknesses and strengths of these methods. This serves as input to the design of a new algorithm, not based on these methods.


Chapter 2

Shape-Based Matching

Shape-Based Matching (SBM) is a known method for object recognition that deals with partial occlusion in an intuitive way as presented by Steger [11]. In this chapter we describe the basics of SBM to lay the foundation for further analysis and improvements presented in the next chapter.

SBM has much in common with cross correlation, with the main difference being the similarity measure. As for cross correlation, the matching procedure itself can only extract the location of the object. The full pose of the object (position (x, y), rotation ϕ and scale s) spans a vast four-dimensional search space. In order to find the object at different rotations and scales, several correlations must be performed. A resolution pyramid is exploited to prune the search space and, together with various pre-termination criteria, the method gains real-time performance.

2.1 Overview

The first step is to teach the system to recognize a certain object. This is done off-line, before the actual matching, and performance is not an issue. During this procedure an image of the object in a clean environment (controlled illumination, no occlusion or clutter, etc.) is provided to the algorithm. The edges of the object are extracted from the image and are then used as a model for the object. The details are described in section 2.3.

In the matching procedure the task is to locate the pose of the object in the search image if it is present, using the previously extracted model. The actual matching is done using a similarity measure to associate each potential pose of the object with a score, much like cross correlation. The overall best score is the result of the search. The similarity measure is calculated using the edges extracted from the model and a gradient map of the search image. The details of the similarity measure and some of its properties are discussed thoroughly in section 2.2.

To gain real-time performance, the search procedure utilizes a resolution pyramid. First an exhaustive search is done on the coarsest level of the pyramid and a number of candidates are extracted. Then each candidate is tracked down the levels of the pyramid until it is found in the original search image. This is described in section 2.4.
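The coarse-to-fine tracking idea can be illustrated on precomputed score maps. This sketch is our own and greatly simplified (the real search also rescreens rotation and scale at each level): it extracts the best candidates at the coarsest level and refines each one in a small neighborhood at every finer level, where positions double between levels.

```python
import numpy as np

def coarse_to_fine(score_maps, n_candidates=3, radius=1):
    """score_maps[0] is the coarsest level; each finer level has twice the
    resolution. Returns the refined (y, x) position of each candidate."""
    coarse = score_maps[0]
    order = np.argsort(coarse, axis=None)[::-1][:n_candidates]
    candidates = [tuple(np.unravel_index(i, coarse.shape)) for i in order]
    for level in score_maps[1:]:
        h, w = level.shape
        refined = []
        for cy, cx in candidates:
            best, best_pos = -np.inf, None
            # A coarse pixel covers fine pixels 2c .. 2c+1; search that
            # footprint plus a margin of `radius` pixels around it.
            for y in range(2 * cy - radius, 2 * cy + radius + 2):
                for x in range(2 * cx - radius, 2 * cx + radius + 2):
                    if 0 <= y < h and 0 <= x < w and level[y, x] > best:
                        best, best_pos = level[y, x], (y, x)
            refined.append(best_pos)
        candidates = refined
    return candidates

coarse = np.zeros((4, 4)); coarse[1, 2] = 0.9    # candidate found at coarse (1, 2)
fine = np.zeros((8, 8)); fine[3, 5] = 0.95       # true peak near the doubled position
best = coarse_to_fine([coarse, fine])[0]
```

The point of the construction is that only a handful of small neighborhoods are evaluated at the expensive fine levels, instead of the whole translation space.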

2.2 Similarity measure

The object to be matched is described by a subset of points (mx, my) and their associated normalized direction vectors in a gradient map of the model image. All other points in the map may for simplicity (in this example) be assumed to be set to 0. How this subset is extracted is described in section 2.3.

2.2.1 Vector notation

The similarity measure is based on inner products of normalized gradient directions both in the model, m_d(x, y) = (m_dx(x, y), m_dy(x, y))^T, and the search image, s_d(x, y) = (s_dx(x, y), s_dy(x, y))^T, rather than raw pixel intensity as in regular cross correlation. m_N is the number of (non-zero) edges in the model. The following sum is evaluated for each position (x, y) in the search image and associates a score with it:

s(x, y) = (1/m_N) Σ_{(m_x, m_y)} ⟨ m_d(m_x, m_y) | s_d(x + m_x, y + m_y) ⟩.    (2.1)

The complete set of scores, s(x, y), is the response of a single matching, and corresponds to the response of a cross correlation, which can be expressed as

c(x, y) = Σ_{(m_x, m_y)} f(m_x, m_y) g(x + m_x, y + m_y),

where f(x, y) and g(x, y) are the model and search (intensity) images respectively. The inner product, as seen in equation (2.1), is considered equivalent to the dot product throughout this thesis.

Since all the direction vectors are normalized, the measure is highly robust against variations in illumination, which mostly affect gradient magnitude. The normalization with the total number of edge values, m_N, makes a perfect match result in a score of 1. When the search image is empty (except for noise), or an area without edges is matched, all the inner products ideally cancel and the score is 0. The deviation from this in practice is usually small. In the worst case, a perfect match with an object with completely reversed edge directions, the score may drop to −1. This kind of match is not very likely in practice though, as even emptiness in the image would yield a higher score (of 0). This is further discussed in section 2.4.4.

In most cases it is only interesting to consider scores in the range [0, 1]. The normalization of the result justifies interpreting the score as the relative amount of the object's edges present in the search image, and hence it is a measure of partial occlusion. This is useful when selecting a threshold for which scores are to be accepted as matches, which has a great impact on both robustness and performance.
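As an illustration, the similarity measure of equation (2.1) can be sketched in a few lines of NumPy. This is our own minimal sketch, not the thesis implementation; the function name `sbm_score` and the argument layout are assumptions:

```python
import numpy as np

def sbm_score(model_pts, model_dirs, grad_x, grad_y, x, y):
    """Evaluate the SBM similarity measure s(x, y) of equation (2.1)
    for a single position (x, y) in the search image.

    model_pts  -- list of edge offsets (m_x, m_y)
    model_dirs -- list of unit gradient directions m_d, one per point
    grad_x/y   -- gradient components of the search image (unnormalized)
    """
    total = 0.0
    for (mx, my), (dx, dy) in zip(model_pts, model_dirs):
        gx = grad_x[y + my, x + mx]
        gy = grad_y[y + my, x + mx]
        mag = np.hypot(gx, gy)
        if mag > 0:                          # normalize the search gradient
            total += (dx * gx + dy * gy) / mag
    return total / len(model_pts)            # 1.0 for a perfect match
```

A perfect match, where every search gradient points in the same direction as its model counterpart, yields exactly 1, in line with the normalization discussed above.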


2.2.2 Component notation

By formally replacing the inner product with the dot product in equation (2.1) and separating the gradient map into its x- and y-components, the sum may be written in the following way:

s(x, y) = (1/m_N) Σ_{(m_x, m_y)} m_d(m_x, m_y) · s_d(x + m_x, y + m_y)
        = (1/m_N) Σ_{(m_x, m_y)} m_dx(m_x, m_y) s_dx(x + m_x, y + m_y)
        + (1/m_N) Σ_{(m_x, m_y)} m_dy(m_x, m_y) s_dy(x + m_x, y + m_y),    (2.2)

which makes it clear that the SBM similarity measure is the sum of two cross correlations. This shows that the similarity measure is a linear operation in itself. Therefore any filtering might as well be applied to the response as to both channels of either of the normalized gradient images. This may simplify reasoning about the result of applying various linear filters. Also, any filtering may just as well be done for the model off-line, instead of in the response or search image in real-time.
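The decomposition in equation (2.2) can be demonstrated directly: the full response is the sum of two plain cross correlations, one per gradient channel. The sketch below is ours (a naive, unoptimized 'valid'-mode correlation written out to stay self-contained); a real implementation would use an FFT-based or otherwise optimized correlation:

```python
import numpy as np

def correlate_valid(image, kernel):
    """Plain 'valid'-mode cross correlation (no kernel flip)."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.empty((ih - kh + 1, iw - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return out

def sbm_response(mdx, mdy, sdx, sdy, m_n):
    """SBM response as the sum of two cross correlations, equation (2.2).

    mdx, mdy -- normalized model gradient channels (zero off the edges)
    sdx, sdy -- normalized search-image gradient channels
    m_n      -- number of model edge points
    """
    return (correlate_valid(sdx, mdx) + correlate_valid(sdy, mdy)) / m_n
```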

2.2.3 Angular notation

Another way to express the similarity measure is

s(x, y) = (1/m_N) Σ_{(m_x, m_y)} m_d(m_x, m_y) · s_d(x + m_x, y + m_y)
        = (1/m_N) Σ_{(m_x, m_y)} ‖m_d(m_x, m_y)‖ ‖s_d(x + m_x, y + m_y)‖ cos θ
        = (1/m_N) Σ_{(m_x, m_y)} cos θ,    (2.3)

where θ = θ(x, y, m_x, m_y) is the angle between the direction vectors of the model and search images, respectively. The components m_dx, m_dy, s_dx and s_dy of the direction vectors m_d and s_d were defined in equation (2.2). Now set m_dx + i m_dy = ‖m_d‖ e^(i m_ϕ), such that m_ϕ is the angle in the model image, and s_dx + i s_dy = ‖s_d‖ e^(i s_ϕ), such that s_ϕ is the angle in the search image. Then the angular difference θ can be expressed as θ(x, y, m_x, m_y) = m_ϕ(m_x, m_y) − s_ϕ(x + m_x, y + m_y).

It has not explicitly been described why it is preferred to weight the angular difference with a cosine function in the similarity measure. Examples of other ways to determine the value of the similarity measure can be found in figure 2.1.

A measure based on simply thresholding the absolute angle difference, the dashed curve in figure 2.1, could be used, but regardless of how the threshold is selected the deviation from the cosine is huge and this is probably not useful in practice. Another measure based on the absolute difference, the dotted curve in figure 2.1, punishes small deviations harder and large deviations less than the cosine, but the differences are small and it could probably be used if there was a reason.

[Figure: score versus angular difference in degrees for the cosine, absolute-based and thresholded weightings.]

Figure 2.1. Various approaches to weight angle difference in the similarity measure.

It is likely that the smoothness of the cosine yields a good tolerance of small deviations in direction. That, and the resulting efficiency and simplicity of the implementation, is reason enough not to consider some of the other methods. Unfortunately, this track reaches beyond the scope of this thesis and is not investigated further.
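The three curves of figure 2.1 can be written down as weighting functions of the angular difference θ. These exact formulas (the linear slope and the 45° threshold) are our guesses at what the plotted curves look like, not definitions taken from the thesis:

```python
import numpy as np

# Hypothetical weighting functions for the curves in figure 2.1;
# theta is the angular difference in radians.
def weight_cosine(theta):
    # the weighting actually used by the SBM similarity measure
    return np.cos(theta)

def weight_absolute(theta):
    # linear in |theta|: falls faster than the cosine near 0 and
    # slower near 180 degrees, like the dotted curve
    return 1.0 - 2.0 * np.abs(theta) / np.pi

def weight_threshold(theta, limit=np.pi / 4):
    # hard accept/reject on the absolute angular difference (dashed curve)
    return np.where(np.abs(theta) < limit, 1.0, 0.0)
```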

2.3 Model

Initially the algorithm must be taught the shape of an object to be recognized. To achieve this, a clean image of the object is provided to the algorithm. An example of a model image can be found in figure 2.2.a.

When describing the implementation, we prefer a slightly different notation than what was used for the similarity measure. The model used in SBM consists of a set of m_N selected edge pixels at positions (x_i, y_i) with normalized gradient directions m_d(x_i, y_i), where i = 1, …, m_N. An example model is illustrated in figure 2.2.f. The edge pixels can be efficiently represented as a list of edge points and their normalized direction vectors.

The first stage of the model creation is generating the gradient map of the model image f(x, y). This is done by convolution with the Sobel operator pair:

f(x, y) ∗ (1/8) [  1  0 −1 ;  2  0 −2 ;  1  0 −1 ] ≈ ∂f(x, y)/∂x = f_x(x, y),

f(x, y) ∗ (1/8) [ −1 −2 −1 ;  0  0  0 ;  1  2  1 ] ≈ ∂f(x, y)/∂y = f_y(x, y).

The results are illustrated in figures 2.2.b. and 2.2.c.

The Sobel operator is not an obvious choice here, as performance is not an issue off-line. However, Sobel is a good compromise between performance and accuracy for the on-line operations, and it is assumed that the results are good enough for the modeling too. The effects of using different gradient extraction operators on- and off-line are unknown (and mixing them is advised against).

In the next step the magnitude of the gradient map, √(f_x²(x, y) + f_y²(x, y)), is calculated, see figure 2.2.d. It is then thresholded (figure 2.2.e) and the gradient vectors whose magnitude is strong enough are extracted as the edges to use in the model. Finally the direction vectors of the selected edges are normalized, using the magnitude, as

m_d(x_i, y_i) = (f_x(x_i, y_i), f_y(x_i, y_i)) / √(f_x²(x_i, y_i) + f_y²(x_i, y_i)).

The resulting model can be seen in figure 2.2.f. The selected vectors can be stored in a list that then efficiently constitutes the model.
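The three steps above (Sobel gradients, magnitude thresholding, normalization) can be sketched as follows. This is a minimal illustration under our own naming; the plain convolution is written out so the example stays self-contained:

```python
import numpy as np

# Sobel pair from section 2.3 (including the 1/8 normalization)
SOBEL_X = np.array([[1, 0, -1],
                    [2, 0, -2],
                    [1, 0, -1]]) / 8.0
SOBEL_Y = np.array([[-1, -2, -1],
                    [ 0,  0,  0],
                    [ 1,  2,  1]]) / 8.0

def conv2(img, k):
    """Naive 3x3 'same'-mode convolution with zero padding."""
    pad = np.pad(img, 1)
    kf = k[::-1, ::-1]                   # convolution flips the kernel
    out = np.zeros_like(img, dtype=float)
    for y in range(img.shape[0]):
        for x in range(img.shape[1]):
            out[y, x] = np.sum(pad[y:y + 3, x:x + 3] * kf)
    return out

def extract_model(img, threshold):
    """Return edge positions (x, y) and normalized gradient directions."""
    fx, fy = conv2(img, SOBEL_X), conv2(img, SOBEL_Y)
    mag = np.hypot(fx, fy)
    ys, xs = np.nonzero(mag > threshold)
    dirs = np.stack([fx[ys, xs], fy[ys, xs]], axis=1) / mag[ys, xs, None]
    return list(zip(xs, ys)), dirs
```

For a vertical intensity step, the extracted interior edge pixels get the unit direction (1, 0), i.e. a pure ∂f/∂x response.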

One model has to be built for each combination of rotation and scale that is to be searched. How many models are required for specified angle and scale intervals at some granularity is described in sections 3.2.2 and 3.2.3.

All models are generated by transforming and resampling the original model image and then extracting the edges. This procedure avoids problems related to anisotropy of the Sobel operators [14]. This is done (for all angles and scales) on every level of the resolution pyramid.

2.4 Search

This section describes the search procedure as it is described in [11] and [12], but in greater detail in some aspects, influenced by our interpretations. A more thorough description of the search procedure of SBM has not been acquired.

The goal of the search is to recognize a previously modeled object in a provided image, the search image, if it is present at all. An example of a search image can be seen in figure 2.3.a. It is the same object that the modeling was demonstrated on earlier, in figure 2.2.a, but the illumination has changed. Searches are performed on the normalized gradient map of the search image, s_d(x, y), which is generated as an initial step of the search, exactly as for the model as described in section 2.3. Note however that all direction vectors are stored in the search image, i.e. no thresholding is performed. This makes the method strong against faint illumination conditions and varying illumination in general. The normalized gradient search image is illustrated in figures 2.3.b and 2.3.c.


Figure 2.2. a) Example of a model image, f(x, y). b) Gradient x-component, f_x(x, y). Gray is 0, white is positive and black is negative. c) Gradient y-component, f_y(x, y), using the same color axis as in b. d) Gradient magnitude image. e) Edges extracted by thresholding the gradient magnitude. f) Selected model edges with normalized gradient directions.

In figure 2.3.d the improved model in figure 3.4.f is positioned in a way that it matches perfectly (the object was fixed between the shots). The changed illumination has caused some of the gradients to mismatch and hence the similarity measure described in section 2.2 will yield a score less than 1. These edges are in the transitional area between the highly and regularly illuminated areas. The directions of the edges entirely in the highly illuminated areas are mostly intact thanks to the normalization.

Figure 2.3. a) Another image of the object modeled in figure 2.2.a, now to be found, with added illumination. b) Normalized gradient image. c) Close-up of the normalized gradient image. d) Close-up of the gradient image with matching model gradients inserted in white color. The vectors mismatch where the directions have changed due to the added illumination, at two o'clock. (Contrast of the gradient magnitude image intentionally reduced.)

Gradient images are generated for all levels of a resolution pyramid, see figure 2.4. The resolution is halved for each level in the pyramid. Downsampling is performed recursively, using bilinear interpolation.
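A recursive factor-two downsampling, as used to build the pyramid, can be sketched as a 2×2 bilinear average. This is our own simplified sketch (odd border rows/columns are simply dropped); function names are assumptions:

```python
import numpy as np

def downsample2(img):
    """Halve resolution with a 2x2 bilinear average (simple pyramid step)."""
    h, w = img.shape[0] // 2 * 2, img.shape[1] // 2 * 2
    img = img[:h, :w]
    return (img[0::2, 0::2] + img[1::2, 0::2] +
            img[0::2, 1::2] + img[1::2, 1::2]) / 4.0

def build_pyramid(img, levels):
    """Level 1 (index 0) is the original image, level `levels` the coarsest."""
    pyr = [img.astype(float)]
    for _ in range(levels - 1):
        pyr.append(downsample2(pyr[-1]))
    return pyr
```

Applying `build_pyramid` to a 4×4 image with three levels yields maps of sizes 4×4, 2×2 and 1×1, each pixel of a coarser level being the mean of the corresponding 2×2 block.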

The search space for the pose of an object in a search image is limited, as the pose is extracted from a set spanned by a certain number of positions, rotations and scales. This set has a magnitude (which equals the number of possible poses the search could result in) that is determined by the granularity of the pose (and the scale limits) specified as parameters to the algorithm. As this search space is limited, any search strategy that evaluates all possible poses would result in the pose with the highest score of all poses. Such a strategy is not of practical interest because of the sheer size of the search space. A strategy based on a heuristic to limit the search must be used.

Figure 2.4. An illustration of the resolution pyramid used for the search procedure. The scale step between the layers is two.

Multiple strategies have been investigated in [11], [12] and [14], and the winning candidate is breadth-first search, mostly because it is difficult to find a feasible heuristic for the other strategies. A breadth-first search means that the resolution pyramid is searched one level at a time, from the top to the bottom. Guided by the heuristic, this is organized so that an exhaustive search is performed at the coarsest level, as briefly described in section 2.4.1. Then the other levels of the pyramid are searched using a tracking procedure, as briefly described in section 2.4.2.

The steps of the search procedure are described only briefly, and things are kept general, as no more details on the search stage of the method have been acquired. A detailed description of the search implemented in this thesis is presented in section 3.3.

2.4.1 Exhaustive search

The search begins with an exhaustive search over all potential poses at the coarsest level of the resolution pyramid. All rotations and scales are matched for given bounding intervals and granularities. Promising matches are detected under the condition that they are local maximums and have a high enough score. These are stored as candidates for the final pose and are the result of the exhaustive search. Note that if the exhaustive search were to be done in the search image at its original resolution, a tremendous amount of computational resources would be required. Real-time performance would simply not be possible with a reasonable amount of time and resources.
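Candidate extraction from one response map can be sketched as picking strict 8-neighborhood maximums above the score threshold. The sketch is ours (border positions are skipped for brevity); it is not the thesis implementation:

```python
import numpy as np

def extract_candidates(response, s_min):
    """Find candidates in a response map: positions that are strict local
    maximums over their 8-neighborhood and whose score reaches s_min.
    Returns a list of (score, y, x) sorted by descending score."""
    h, w = response.shape
    cands = []
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            v = response[y, x]
            nbh = response[y - 1:y + 2, x - 1:x + 2]
            # strict maximum: equal to the patch max and unique in the patch
            if v >= s_min and v == nbh.max() and np.sum(nbh == v) == 1:
                cands.append((v, y, x))
    return sorted(cands, reverse=True)
```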


2.4.2 Tracking

Each candidate available from the exhaustive search is tracked down the finer levels of the resolution pyramid, refining its pose, until a match is found at the original resolution. As a coarse pose of the object is known for the candidate, the search in these steps is only done locally. For the coarser resolutions in the pyramid this requires much less computational resources compared to the exhaustive search.

Finally the candidate with the highest score is returned as the result of the search. The thought of discarding candidates at finer resolutions, where the effort increases, is interesting but is not investigated in this thesis.
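The tracking of a single candidate can be sketched as follows: at each finer level the coordinates from the previous level are doubled and a small neighborhood is re-scored. This is purely illustrative, with our own names and a fixed search radius; the real procedure also refines rotation and scale:

```python
def track_candidate(score_fns, start_level_pos, radius=2):
    """Track one candidate from the coarsest pyramid level to the finest.

    score_fns       -- list of callables score(x, y); index 0 = finest level
    start_level_pos -- (x, y) at the coarsest level
    """
    x, y = start_level_pos
    for level in range(len(score_fns) - 2, -1, -1):
        x, y = 2 * x, 2 * y                      # map down to the finer level
        score = score_fns[level]
        # local search in a (2*radius+1)^2 window around the projected pose
        best = max((score(x + dx, y + dy), x + dx, y + dy)
                   for dx in range(-radius, radius + 1)
                   for dy in range(-radius, radius + 1))
        _, x, y = best
    return x, y
```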

2.4.3 Optimization

Under the condition that the score must reach a given threshold s_min to be accepted, the sum may be preterminated under certain circumstances [12]. Using an alternative notation, the SBM evaluation sum can be rewritten and an expression for the partial sum, s_j, can be formed in the following way. Let

s_j = (1/m_N) Σ_{i=1}^{j} ⟨ m_d(m_{x,i}, m_{y,i}) | s_d(x + m_{x,i}, y + m_{y,i}) ⟩

be the partial sum of edge gradient dot products up to j. Note that if j = m_N, the entire sum is evaluated and the actual score is obtained as in equation (2.1). Suppose that at one point of the summation s_j has been reached. Then m_N − j dot products are left to evaluate. On the rare occasion that each and every one of them contributes with their maximum value of 1, the highest achievable total score is s_j + (m_N − j)/m_N. If s_j + (m_N − j)/m_N ≤ s_min, the sum can be terminated at stage j, as it will never reach an acceptable score. This can improve the performance of the search to some extent, as a lot of the scores usually do not reach values close to the threshold s_min. However, depending on the threshold, this termination criterion often requires most of the sum to be evaluated even in those cases. A harder criterion has been suggested by Steger [12]. It gains much performance but introduces insecurity: this extra factor may cause failure depending on the order in which the points in a model are checked. This cannot be neglected.
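The safe pretermination criterion above can be sketched in a few lines. The sketch and its names are ours; it operates on a precomputed sequence of dot products for clarity, whereas a real implementation would evaluate them lazily:

```python
def sbm_score_preterm(dots, s_min):
    """Evaluate the SBM sum with the safe pretermination criterion:
    stop as soon as even all-perfect remaining terms cannot lift the
    score to s_min.

    dots -- sequence of the m_N individual dot products, each in [-1, 1]
    Returns the score, or None if the evaluation was terminated early.
    """
    dots = list(dots)
    m_n = len(dots)
    partial = 0.0
    for j, d in enumerate(dots, start=1):
        partial += d / m_n                       # s_j so far
        if partial + (m_n - j) / m_n <= s_min:   # best case cannot reach s_min
            return None
    return partial
```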

Selecting the threshold s_min is a difficult task and several factors must be considered. First, one has to estimate how much the score is likely to degrade due to non-ideal matches caused by distortions. It also depends on the object searched for. Finally, one has to decide how much occlusion is to be accepted. Setting the threshold too high will result in the object not being recognized. Setting it too low will result in a larger computational effort.

Even though the final result of the investigation of this thesis project should be applicable to real-time applications, this kind of performance optimization is not of big concern. It is noted that it can be done with little or no impact on robustness, and it is not investigated further here.


2.4.4 Reversed polarization

One situation where SBM is doomed to fail is when the background of an object has changed in a way that the polarization in the silhouette of the object is completely reversed. Consider a gray colored object, placed on a black background in the model image, but on a white background in a search image. All the silhouette edges have reversed gradient directions in the search gradient map and the correct pose of the object would yield a score close to −1. An area in the image containing no variations at all, or noise, would result in a higher score (close to 0) and hence be a better match. Figure 2.5 shows two images with reversed polarization for the silhouette edges of the object.

Figure 2.5. The change of background has caused the polarization of the silhouette edges of the object (model image to the left, search image to the right) to be reversed.

Two modifications of the similarity measure in equation (2.1) to cope with this problem are suggested by Steger [11, 12]. In the first one,

s(x, y) = (1/m_N) Σ_{(m_x, m_y)} | ⟨ m_d(m_x, m_y) | s_d(x + m_x, y + m_y) ⟩ |,    (2.4)

the absolute value of the inner product is used instead of just the inner product itself. This solves the problem of completely reversed polarization, since each edge where the angle between the gradients is ±180° gets a score of 1. The other modification is to use the absolute value of the whole similarity measure,

s(x, y) = | (1/m_N) Σ_{(m_x, m_y)} ⟨ m_d(m_x, m_y) | s_d(x + m_x, y + m_y) ⟩ |,    (2.5)

which is of course equivalent to taking the absolute value of the complete response. Both modifications solve the problem at a cost of robustness in the general case, which is unacceptable. The procedure using equation (2.5) is especially dangerous, as negative values with almost the same magnitude as the maximum score often appear close to the top, see section 3.1. Several candidates would be wasted on these doomed poses. Similar reasoning can be used to see that the solution in equation (2.4) suffers the very same dangers. These modifications may be of interest in special applications where completely reversed polarization can be expected, but should not be used when not needed.
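The difference between the three score variants, equations (2.1), (2.4) and (2.5), is easy to demonstrate on a precomputed list of dot products. The function names below are ours:

```python
import numpy as np

def score_standard(dots, m_n):
    # equation (2.1): plain mean of the dot products
    return np.sum(dots) / m_n

def score_abs_terms(dots, m_n):
    # equation (2.4): absolute value of each dot product
    return np.sum(np.abs(dots)) / m_n

def score_abs_total(dots, m_n):
    # equation (2.5): absolute value of the whole sum
    return np.abs(np.sum(dots)) / m_n
```

For a completely reversed match (all dot products −1) the standard score is −1 while both modifications yield 1; for a half-reversed match the two modifications disagree, which illustrates why their robustness properties differ.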

Luckily, according to our observations, severe cases of completely reversed polarization seldom appear in real-world applications. When they do, they may be an effect of severe clutter but might also be caused by strong illumination. The latter sometimes also affects edges that come from the structure of the object. Edges related to the texture of the object may also change. A special case is when the texture of an object has various colors and the color of the light matches them differently (in the gray images used here, the intensities are varied differently). Different surface materials in an object may also reflect light differently.

Even though these situations almost never appear in reality, the fact that the method cannot handle them is a bit annoying. A person without knowledge of its inner workings will most likely be surprised when it fails in cases where the search image appears more or less undistorted from a human's point of view.


Chapter 3

Analysis and improvements

In this chapter the SBM method is first analyzed to acquire more details on how it works and what information lies behind some of the decisions in its design. Some things are vaguely described in [11, 12, 14] and have been reworked to suit this thesis. Differences and improvements are an important part of this chapter.

Some of the experiments already conducted [12, 14] are repeated. As noted in section 1.5 the requirements on the surroundings for the evaluation in this thesis are not as high as in previous studies. The images used are in general more varied and are exposed to even greater distortions, especially occlusion and clutter. The goal is to identify the weak points of the method that can be improved rather than verifying its usefulness.

To be able to fully evaluate various aspects of the method, it was reimplemented. Details specific to this implementation are presented throughout this chapter, along with some improvements. The goal of the improvements is to gain robustness rather than to decrease the need for computational resources.

3.1 Similarity response

The similarity measure of SBM, equation (2.1), is a very central part of the algorithm. In order to understand the effects of various distortions, and to be able to predetermine valid values for parameters to the algorithm when extracting rotation and scale, a small investigation of the response is presented in this section.

3.1.1 Characteristics

In general, the response of a correlation using the SBM similarity measure, as described in section 2.2, contains very high frequencies with a lot of local maximums and minimums. The spike that appears for a match is usually very distinct. A comparison between the SBM response and regular cross correlation can be seen in figure 3.1.

The rapid changes in the normalized gradient directions can be seen in figure 3.2. The high locality of gradient directions, aided by the normalization, seems to be behind these characteristics of the response.

Figure 3.1. The images in figures 2.2.a and 2.3.a were matched using cross correlation of intensity data and the SBM similarity measure (on a normalized gradient map). a) Cross correlation. b) Surface plot of the cross correlation response. c) SBM similarity measure response. d) Surface plot of the SBM response, clearly exposing the distinctiveness of the matching spike and the large number of local maximums typically present around it.

When the gradient map is normalized, all variations become equally significant, which makes the normalized gradient map appear noisy. In some situations, the number of local extremums and their very small width is an issue. Many local maximums cause the required number of candidates to extract in the exhaustive search to grow. If the local maximums to be found are not wide enough, a finer step length between responses must be used when searching for rotation. To resolve this, one might consider low-pass filtering the normalized gradient maps. As the similarity measure is linear, see section 2.2.2, this is equivalent to filtering the response, which could be hazardous. If two candidates are located close to each other, a new candidate would appear in between that might not be accurate at a finer resolution of the pyramid. Additionally, low-pass filtering (of the response) would not prevent a thin top from being missed.

Figure 3.2. Gradient components from figures 2.2.b and 2.2.c normalized (by division with their magnitude, seen in figure 2.2.d). Gray is 0, white is positive and black is negative.

3.1.2 Distortion

There are many kinds of distortion that can cause the matching score to degrade. The most important ones are listed here:

• Occlusion, whose relation to score degradation is briefly described in section 2.2.

• Clutter, which induces a risk of some erroneous pose having a higher similarity score than the correct one (at a coarse level) in the search image. It is also related to partially reversed polarization, see section 2.4.4.

• Transformations, such as sub-pixel translations, rotation and scale, which cause small variations in the score. For rotation and scale, this is investigated in sections 3.2.2 and 3.2.3.

• Varying illumination, which the method is robust against to some extent, as described in section 2.2.

• Changed shape of an object, due to for example non-rigidity or significant deviations from the object plane. This is not analyzed in this thesis, as it is insignificant by assumption. If present, it could be seen as a form of occlusion and thus be managed in some situations, depending on the severity.

• Distortion related to the image acquisition, such as noise and lens distortion. Not analyzed in this thesis.

It is very difficult to analyze the effects of all these distortions when they are combined (which they usually are), but they can be analyzed separately. When doing this it is favorable to have a bound on how much each individual distortion should be allowed to degrade the score. This bound has been chosen as 0.9, which is fairly high compared to the second-highest scores present in most responses observed (of course depending on the object). The motivation is that the total drop in score is very difficult to predict under general conditions and a sufficient margin should be present. It has also been observed, for distortion related to transformations in many situations (section 3.2.2 for instance), that once the score around the maximum has fallen below 0.9, the score decreases fast.

3.2 Model

An overview of the modeling procedure as implemented in this thesis can be seen in figure 3.3.

[Figure: per-level pipeline from the model image: Transformation → Gradient → Magnitude → Thresholding → DNMS → Normalization, with ↓2 downsampling between levels, producing the finest (level 1) through coarsest (level L) models.]

Figure 3.3. Overview of the modeling procedure, which is identical for all resolutions. L is the coarsest level and 1 is the finest.

Models are generated for all levels of the resolution pyramid, in all necessary combinations of rotation and scale. After the model image is transformed, a gradient image is estimated using the Sobel pair as described in section 2.3. Then the gradient magnitude is calculated and thresholded, which is illustrated in figures 3.4.a and 3.4.b. This time the threshold is much lower than before, and its purpose is to reduce noise rather than extract nice edges. Edges are then extracted using directional non-maximum suppression (DNMS), as described in section 3.2.1. The intermediate result of this step can be seen in figure 3.4.c and the final edges in figure 3.4.d. Finally, the edge gradients are normalized. The result can be seen in figures 3.4.e and 3.4.f.

3.2.1 Directional non-maximum suppression

Ulrich suggests [14] that the edges extracted for MGHT models contain redundant information and can be thinned without a significant loss of robustness. He also expresses that the same model points are used for SBM as for MGHT. Therefore it is likely that the model edges for SBM were also thinned.

Figure 3.4. The image used here is the same as in figure 2.2. a) Gradient magnitude image. b) Thresholded magnitude with reduced noise. c) Directional local maximums, extracted as described in section 3.2.1. d) The combined results from b and c are used to generate the edges for the model. e) The complete model with normalized gradient directions. f) Close-up of the model.

It is difficult to select a threshold that provides a satisfying result for every object in every scene. Either some edges that may be crucial in the search will disappear, or some other edges will get thick. As thick edges contain more edge points, they are weighted disproportionately high in the response. Occlusion or other kinds of distortion in such regions would be more severe than in other regions of the object. Thinning also improves performance, as fewer edge values have to be considered in the correlation sum.

Our experiments have shown that thinning should be used for SBM, as both robustness and performance are gained, though usually at the cost of a small loss in granularity. In a small number of images, the recognized object was approximately one pixel off in translation.

The thinning here is done using directional non-maximum suppression, which is a computation step of the Canny line detection algorithm [3] (which additionally contains a threshold with hysteresis). Given a gradient map and a gradient magnitude map, the magnitude of each element is checked against its neighbors in the positive and negative direction of the local gradient. If any of them is greater in magnitude, the element is suppressed and is not considered part of an edge. It is a very simple and fast method to extract edges. SBM is not highly dependent on accurate edge extraction, and the quality of the results is by far enough. An example where this method has been applied can be seen in figure 3.5.
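The neighbor comparison just described can be sketched as follows. This is our own simplified variant (the gradient direction is rounded to one of the 8 neighbors, and border pixels are suppressed for brevity):

```python
import numpy as np

def dnms(fx, fy, mag):
    """Directional non-maximum suppression: keep a pixel only if its
    gradient magnitude is not exceeded by either neighbor along the
    (rounded) gradient direction."""
    h, w = mag.shape
    keep = np.zeros_like(mag, dtype=bool)
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            if mag[y, x] == 0:
                continue
            ang = np.arctan2(fy[y, x], fx[y, x])
            dx = int(round(np.cos(ang)))     # step along the gradient
            dy = int(round(np.sin(ang)))
            if mag[y, x] >= mag[y + dy, x + dx] and \
               mag[y, x] >= mag[y - dy, x - dx]:
                keep[y, x] = True
    return keep
```

On a three-pixel-wide ridge with a horizontal gradient, only the center column survives, which is exactly the thinning effect wanted for the model edges.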

3.2.2 Rotation requirements

The number of models, N_ϕ, required to cover a certain angle interval is related to the angle step length, which can be estimated by reasoning about the worst-case scenario. That is where all the model points are located at the outer border of the model, such that all points suffer maximum relocation due to rotation. This is done by Ulrich in [14] for the MGHT method, but the results hold for SBM too. Consider a border pixel at distance N/2 from the center of an image of size N. If the image is rotated ∆ϕ/2 radians, the border pixel should not move more than half a pixel, in order to stay on the same line of pixels as before, see figure 3.6.

The figure gives

sin(∆ϕ/2) = (1/2) / (N/2).    (3.1)

N is large and ∆ϕ is small, so (3.1) can be approximated as

∆ϕ/2 ≈ 1/N  ⇒  ∆ϕ ≈ 2/N.    (3.2)

The estimate of the number of required angles for a full rotation is then

N_ϕ = 2π/∆ϕ ≈ πN.    (3.3)
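Equations (3.2) and (3.3) translate directly into code. The sketch below is ours; for the object used in the experiment (N ≈ 29 pixels) it reproduces the ∆ϕ ≈ 4.0° worst-case step quoted in the text:

```python
import math

def rotation_steps(n, full_angle=2 * math.pi):
    """Worst-case angle step (3.2) and number of models (3.3)
    for a model image of size n pixels."""
    delta_phi = 2.0 / n                          # radians, equation (3.2)
    n_phi = math.ceil(full_angle / delta_phi)    # models for a full turn
    return delta_phi, n_phi
```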

This result was tested by the following simple experiment. A model was extracted from an image of an object. The image was rotated a certain angle and was then matched with the model extracted from the non-rotated image. This was done for various rotation angles within an interval centered on 0,

Figure 3.5. An example where directional non-maximum suppression was used to extract edges (without noise threshold). The gradients of the extracted edges are illustrated as white arrows.

Figure 3.6. The dislocation of a border pixel is 1/2 pixel when an image of size N is rotated ∆ϕ/2 radians.

Figure 3.7. Matching score decrease when an object is rotated but the model is fixed. Left: score degradation overview over the full rotation range. Right: close-up of the peak. For this object the angle interval was calculated as ∆ϕ ≈ 4.0°, and consequently the score degradation is acceptable for a rotation within ±2.0° (dotted lines). If a score degradation to 0.9 is accepted, ∆ϕ can be chosen as ∆ϕ ≈ 10.1° (dashed lines).

and the maximum score of each match was recorded for each angle. The result of this, using the same object as in figure 2.2, can be seen in figure 3.7. As the object has a rectangular shape, a fairly good match appears at 180°, where the silhouette of the object matches but most of the edges related to the object texture do not. The object size N was roughly estimated to 29 pixels (at the present resolution), giving ∆ϕ ≈ 4.0° according to equation (3.2). Clearly the score degradations are acceptable for a rotation within ±2.0°, as illustrated by the dotted lines in figure 3.7.

Overall, once the score has dropped to 0.9 it degrades fast. The angle where the score had reached 0.9 was manually extracted from the plot and was selected as the practical ∆ϕ. This was done for several objects at various resolutions. The worst-case estimate constantly stayed below the practical value, usually with a great margin. This indicates that it is safe to use the estimate as a limit on how much the rotation may be off before the score degrades too severely. The theoretically derived interval is significantly smaller than what could be used in practice. If score degradations down to 0.9 are accepted, the practical angle interval is 10.1°, as illustrated by the dashed lines in figure 3.7, right.

3.2.3 Scale requirements

Estimating the number of models required for scaled objects is not as straightforward as for rotation. One reason is that scale is not limited by itself, but by an assumption that it stays within a given range. One should also notice that scale is multiplicative, in the sense that applying the scale s twice is the same as applying the scale s · s = s² once. (In the same sense rotation is additive: rotating θ degrees twice is the same as rotating 2θ once.) Some reasoning on the required scales to cover an interval for cross correlation can be found in [5].

Consider a pixel at the border of the model image of size N at a distance of N2 from the center. The distance seis the maximum allowed movement for the pixel

until the response become too degraded. It makes sense to set se=12, movements

larger than that would cause the pixel to end up on another column of pixels. Due to our experiments the value seems to stay close to 12 pixel. According to figure

(45)

3.2 Model 31

3.8, the maximum scale tolerance ∆scan be expressed

\Delta_s \frac{N}{2} = \frac{N}{2} + s_e \quad \Rightarrow \quad \Delta_s = 1 + \frac{2s_e}{N}. \qquad (3.4)

Figure 3.8. A greatly exaggerated illustration of the effect of scaling pixels on the image border. If a border pixel in an image of size N is scaled by ∆s, the dislocation is s_e pixels.

The expected coverage ∆s for various image sizes is plotted in figure 3.9. The difference in coverage between the low and high resolutions is significant. Fortunately, the total covered interval increases quickly with the number of models used, so this is not a problem, as we shall see.

Figure 3.9. The scale interval covered by a single model (∆s) for different image sizes, s_e = 1/2.
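The relation in equation (3.4) is simple enough to check numerically. A minimal sketch computing the single-model coverage ∆s for a few image sizes (the sizes themselves are arbitrary examples):

```python
def scale_tolerance(n_pixels, s_e=0.5):
    """Maximum scale tolerance of a single model, equation (3.4):
    delta_s = 1 + 2*s_e / N."""
    return 1.0 + 2.0 * s_e / n_pixels

for n in (29, 100, 400):
    print(f"N = {n:3d}: delta_s = {scale_tolerance(n):.4f}")
# For N = 29 this gives delta_s ~ 1.0345, matching the value 1.034
# quoted for the experiment in section 3.2.3.
```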

A scale model placed at ∆s^i on a scale axis covers the interval [∆s^(i−1), ∆s^(i+1)]. The upper limit of the total scale range [s_r^(−1), s_r] covered by N_s scale levels then satisfies

\Delta_s^{2\left(\frac{N_s}{2}\right)} = \Delta_s^{N_s} = s_r. \qquad (3.5)

Figure 3.10 shows an illustrative example for N_s = 3. Models are placed at ∆s^(−2), 1 and ∆s^2. The boundaries of their covered scale intervals join at ∆s^(−1) and ∆s^1. The complete coverage of the distributed models is [∆s^(−3), ∆s^3].

Figure 3.10. Three models placed at ∆s^(−2), 1 and ∆s^2 cover the total scale interval [∆s^(−3), ∆s^3].

Solving equation (3.5) for N_s and utilizing (3.4) results in

N_s = \frac{\log s_r}{\log \Delta_s} = \frac{\log s_r}{\log\left(1 + \frac{2s_e}{N}\right)}, \qquad (3.6)

where log signifies the logarithm to an arbitrary base. Choosing the natural logarithm (base e) and utilizing the fact that 2s_e/N is small, we can use the approximation ln(1 + 2s_e/N) ≈ 2s_e/N. Together with equation (3.6), the required number of scales can be approximated as

N_s \approx \frac{\ln s_r}{2s_e/N}. \qquad (3.7)
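To make the counting concrete, here is a sketch evaluating the exact N_s from equation (3.6), the approximation (3.7), and the placement of models at powers of ∆s as in figure 3.10. The example values s_r = 1.5 and N = 29 are our own, not taken from the text:

```python
import math

def num_scales(s_r, n_pixels, s_e=0.5):
    """Exact number of scale levels to cover [1/s_r, s_r], equation (3.6),
    with delta_s from equation (3.4)."""
    delta_s = 1.0 + 2.0 * s_e / n_pixels
    return math.log(s_r) / math.log(delta_s)

def num_scales_approx(s_r, n_pixels, s_e=0.5):
    """Approximation (3.7): ln(1 + 2*s_e/N) ~ 2*s_e/N for small arguments."""
    return math.log(s_r) / (2.0 * s_e / n_pixels)

s_r, n = 1.5, 29
print(round(num_scales(s_r, n), 1))         # -> 12.0
print(round(num_scales_approx(s_r, n), 1))  # -> 11.8, close to the exact value

# Models sit at every second power of delta_s (figure 3.10): N_s levels at
# exponents -(N_s-1), ..., N_s-1 cover [delta_s^-N_s, delta_s^N_s].
n_s = math.ceil(num_scales(s_r, n))
delta_s = 1.0 + 1.0 / n
levels = [delta_s ** e for e in range(-(n_s - 1), n_s, 2)]
assert levels[-1] * delta_s >= s_r  # topmost model's coverage reaches s_r
```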

The same experiment as for rotation was performed to verify that the score stays acceptable within the predicted scale interval. That is, a model was extracted and then matched against various scales of the same image, and the maximum score for each scale was recorded. An example of this is plotted in figure 3.11. The model size was 29, which by equation (3.4) gives a theoretical maximum scale of 1.034, dotted lines in figure 3.11. If a score degradation down to 0.9 is accepted, a practical maximum scale of 1.057 could be used, dashed lines in figure 3.11.

3.2.4 Match interpolation

If models for all possible rotations and scales are generated and stored, as suggested (and noticed) in [11], this causes high demands on storage capacity, especially when the bounds for rotation and scale are vast, since the number of models, N_m = N_ϕ N_s, grows fast.

To conquer this problem we would like to transform the model on-line instead, which is discussed and discarded in [14]. We have studied the effect of transforming model points, as opposed to pregenerating models for all transformations by resampling the model image, in some simple experiments. The difference between the two approaches is illustrated in figure 3.12. In the first variant the model data is extracted from the model image. The model data is then transformed, and interpolation must take place in the search image during matching. In the second
