KUNGL TEKNISKA HÖGSKOLAN
Department of Numerical Analysis and Computing Science
TRITA-NA-P98/11 • ISSN 1101-2250 • ISRN KTH/NA/P--98/11--SE • CVAP220

A Model-Free Voting Approach to Cue Integration

Carsten G. Bräutigam

Kungl Tekniska Högskolan
August 1998

© 1998 Carsten G. Bräutigam, NADA, KTH, 100 44 Stockholm
ISBN 91-7170-289-X • ISSN 1101-2250
ISRN KTH/NA/P--98/11--SE • TRITA-NA-P98/11
Printed by KTH Högskoletryckeriet, Stockholm 1998, on Nordic Swan ecolabelled paper

ABSTRACT

Vision systems, such as “seeing” robots, should be able to operate robustly in generic environments. In this thesis, we investigate how a systems approach to vision can meet these demands for robustness.

Firstly, we suggest that robustness can be improved by fusing the variety of information offered by the environment, and we therefore investigate the effectiveness of using the coincidence of multiple cues. Secondly, we are concerned with the use of coarse algorithms. Even though the environment provides much information, it is neither necessary nor possible to extract all the information available. We will therefore show that coarse algorithms suffice for certain problems.

To investigate the effectiveness of using the coincidence of multiple cues, we perform a series of experiments on detecting planar surfaces in binocular images. These experiments are based on two schemes of a somewhat different character.

The first one is a hypothesis-and-test scheme that incorporates the cues in a certain order and hence, by design, imposes a ranking of them. The general idea is to use arbitrary cues exploiting local image data to get an idea about whether the model (a planar surface) is seen in the image and at which location it is found. If one or more cues strongly indicate a certain instance of a model, then this observation serves as a hypothesis to be tested by other cues, which support or reject it. In comparison to the cues used for hypothesis generation, those used for hypothesis testing should be more reliable and may also have a higher computational complexity, since they are only employed when needed.
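As a sketch (not the thesis's actual implementation), the control flow of such a hypothesis-and-test scheme might look as follows; all cue functions, thresholds, and the toy "image" are invented placeholders:

```python
# Sketch of the hypothesis-and-test scheme: cheap cues propose a model
# instance, more reliable (and more expensive) cues then verify it.
# All cue functions and thresholds here are illustrative placeholders.

def hypothesis_and_test(image, generators, verifiers, accept=2):
    """Return a verified hypothesis, or None.

    generators: cheap cues, each mapping an image to (hypothesis, score).
    verifiers:  reliable cues, each mapping (image, hypothesis) to bool;
                they run only when a hypothesis is actually proposed.
    accept:     how many verifying cues must agree (coincidence).
    """
    # 1. Hypothesis generation: take the strongest cheap-cue proposal.
    proposals = [g(image) for g in generators]
    proposals = [p for p in proposals if p is not None]
    if not proposals:
        return None
    hypothesis, _ = max(proposals, key=lambda p: p[1])

    # 2. Hypothesis testing: count supporting verifier cues.
    votes = sum(1 for v in verifiers if v(image, hypothesis))
    return hypothesis if votes >= accept else None

# Toy usage: "image" is a dict, cues are trivial stand-ins.
image = {"textured_region": (10, 20, 40, 40)}
gen = [lambda im: (im.get("textured_region"), 0.9)]
ver = [lambda im, h: h is not None, lambda im, h: h[2] * h[3] > 100]
print(hypothesis_and_test(image, gen, ver))  # (10, 20, 40, 40)
```

The key property of the scheme is visible in the structure: the expensive verifiers are only invoked once a hypothesis exists.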

The general idea of the second scheme is to first use a simple and quick cue exploiting local image data to get an idea of where in the image the model (a planar surface) could be found. After this initial localization step, all cues […] hypothesis-forming step, similar to that of the hypothesis-and-test approach. This step, though, is much weaker because it only indicates a region in the images where to look. The approach allows direct fusion of incommensurable cues, such as intensity and surface orientation. Generally, it can be regarded as less restrictive than the hypothesis-and-test approach.

We propose that coarse algorithms can be motivated from a robustness and flexibility point of view. Our experiments demonstrate that there is support for this claim, at least for some relevant tasks, such as finding planar surfaces or similar simple models.

ACKNOWLEDGEMENTS

First of all, I want to thank my supervisor Jan-Olof Eklundh for his encouragement and enthusiastic support. I would like to express my deepest gratitude for that, being aware that this thesis would not have been possible without him. Many of the ideas for this work were developed together with him.

Particularly important for this work has been the cooperation with Henrik Christensen and Jonas Gårding. They provided much knowledge, assistance, and inspiration during all these years and co-authored well-received articles. Further thanks for their invaluable help go to Mengxiang Li, Magnus Andersson, Thomas Uhlin, Atsuto Maki, Peter Nordlund, Pär Fornland, Anders Lundquist, and Lars Bretzner.

I would like to thank Harald Winroth, Matti Rendahl, Jonas Gårding, and Tony Lindeberg for writing much of the basic image-related software, and for providing a good UNIX environment together with Systemgruppen at NADA. Kourosh Pahlavan also helped me take stereo images with his sophisticated head-eye system, and Mengxiang Li helped calibrate these images. I also want to thank Birgit Ekberg-Eriksson and Emma Gniuli for helping with administrative matters.

I have enjoyed the work at the Computational Vision and Active Perception Laboratory (CVAP), especially the stimulating discussions and friendship with Demetrios Betsis, Kourosh Pahlavan, Pascal Grostabussiat, Nader Attar, and Uwe Schneider, with whom I spent much time, also outside the lab.

Many thanks go to Ann Bengtsson, who was my very first contact at CVAP. She introduced me to everyone at the lab and the whole environment, and made sure I got settled in Sweden.

I am thankful to Davi Geiger and Hiroshi Ishikawa from New York University for providing the stereo reconstruction experiments in Chapter 8, which are used to compare and evaluate my own techniques.

[…] images used in this thesis. Jonas Gårding contributed the shape-from-texture algorithms, and Mengxiang Li provided the multi-point matching routines.

Finally, I would like to mention all the other members and former members of CVAP, who all had varying amounts of influence on my work. These people are Andres Almansa, Yannis Aloimonos, Arnon Amir, Antonis Argyros, Ron Arkin, Svante Barck-Holst, Fredrik Bergholm, Josef Bigün, Mårten Björkman, Jørgen Bjørnstrup, Mattias Bratt, Kjell Brunnström, Carla Capurro, Stefan Carlsson, Rosario Cretella, Jorge Dias, Martin Eriksson, Daniel Fagerström, Cornelia Fermüller, Robert Fisher, Antônio Fransisco, David Jacobs, Patric Jensfelt, Kenneth Johansson, Henrik Juul-Hansen, Danica Kragic, Jean-Marc Lavest, Mattias Lindström, Oliver Ludwig, Oliver Mertschat, Ambjörn Naeve, Mads Nielsen, Peter Nillius, Niklas Nordström, Kazutoshi Okamoto, Göran Olofsson, Anders Orebäck, Matteo Perrone, Lars Petersson, Federica Rebagliati, Danny Roobaert, Mikael Rosbacke, Hedvig Sidenbladh, Kristian Simsarian, Björn Sjödin, Stephan Spiess, Daniel Svedberg, Dennis Tell, Maria Ögren, Olle Wijk, Klaus Wiltschi, and Wei Zhang.

Excellent proofreading of this thesis was done by Jan-Olof Eklundh and Henrik Christensen. Peter Nordlund proofread Chapters 4, 5, and 9, Candace Klinghoffer proofread Chapters 1 and 10, and Paolo Pirjanian proofread Chapter 3. This work has been supported by a grant from the Swedish Research Council for Engineering Science, which is gratefully acknowledged.

CONTENTS

Abstract
Acknowledgements

PART ONE — CUE INTEGRATION

1 Introduction
  1.1 Outline of our approach
  1.2 Thesis outline

2 Cue integration framework
  2.1 Need for integration
  2.2 Strategies for integrating different modules
  2.3 Integration methods
    2.3.1 Probabilistic methods
    2.3.2 Rule-based integration methods
  2.4 Erroneous data and incompatible representation
    2.4.1 The strength and weakness of probabilistic approaches
  2.5 Cue integration using coincidence
    2.5.1 Basic form of cue integration
    2.5.2 Further refinements of the basic integration scheme
  2.6 Summary and discussion

3 Voting methods
  3.1 Components of a voting algorithm
    3.1.2 Voting sub-schemes
  3.2 Weighted Consensus Voting
  3.3 Applicability of voting schemes
  3.4 Summary

4 Planar surfaces
  4.1 Methods based on binocular stereo
    4.1.1 Algebraic 5-point invariances
    4.1.2 Segmentation of planar surfaces using plane projectivity
  4.2 Direct estimation from texture
    4.2.1 Statistical properties
    4.2.2 Homogeneous intensity distribution
  4.3 Estimation from image features
    4.3.1 Lines and junctions
    4.3.2 Right-angle pairs
  4.4 Vanishing points
  4.5 Motion
  4.6 Reconstruction
  4.7 Discussion

PART TWO — EXPERIMENTAL EVALUATION

5 Experimental setup
  5.1 Cue selection
  5.2 The output of the different cues

6 The hypothesis-and-test approach
  6.1 Hypothesis generation
  6.2 Hypothesis testing
    6.2.1 Finding a plane projectivity
    6.2.2 Warping
    6.2.3 Computing correlation scores
    6.2.4 Post-processing
  6.3 The segmentation problem
  6.4 Initialization of hypotheses
  6.5 Experiments

7 Voting-based cue integration
  7.1 The initialization step
    7.1.1 Noise level detection
  7.2 Different voting schemes
    7.2.1 Unanimity voting
    7.2.2 Plain majority voting
    7.2.4 'Enhanced' m-out-of-n voting
  7.3 Problems
    7.3.1 Missing texture information
    7.3.2 More than one planar surface
    7.3.3 Difficulties in the Color image
  7.4 Discussion and summary
    7.4.1 Statistics of the computed cues
    7.4.2 Summary and discussion

8 Comparisons
  8.1 Our two cue integration schemes
  8.2 Stereo reconstruction
  8.3 Discussion

9 Extensions
  9.1 Merging neighbouring regions
    9.1.1 Using relative orientations
    9.1.2 Using geometric information
    9.1.3 Summary
  9.2 Using confidence maps
  9.3 Using statistical information
  9.4 Summary and discussion

10 Concluding discussion

PART THREE — APPENDIX

A Images
B Experimental results
C Technical details
  C.1 Cue estimation methods
    C.1.1 Junction estimation
    C.1.2 Right-angle pairs
    C.1.3 Perspective distortion of texture patterns
    C.1.4 Disparity
    C.1.5 Finding planar surfaces using invariants
    C.1.6 Using ML estimators for texture estimation
  C.2 Clustering

Bibliography

PART ONE

CHAPTER 1
INTRODUCTION

Computer vision systems, such as “seeing” robots, aim at understanding the three-dimensional properties of objects or scenes imaged by one or more observing cameras. The goal of research on such systems should be to enable them to operate robustly in generic environments, engineered indoor environments as well as natural outdoor settings. Both types of environments provide visual information that is very rich; more specifically, they provide a whole set of cues to scene structure. Vision systems can benefit greatly from this richness by using multiple cues.

Traditional computer vision research applies a reductionist perspective. Approaches to derive image features, to compute shape from stereo, shading, and texture, to analyse motion and shape, to determine object contours, to use geometry for grouping and recognition, and so on, are all studied in isolation. No doubt, many methods exist that perform these tasks with great success. Nevertheless, little is known about how well these methods perform as modules in a vision system. A “seeing” robot must rely on processes that work continuously in generic environments without failure. Uhlin and Eklundh [1995] point out that, in order to operate in a rich environment, the system should be capable of using and integrating many cues and, in particular, be able to make use of whatever information is available at a given instant. System failures due to visual modules failing, because the actual environment does not meet the assumptions made by the module or does not conform to the implied constraints, are not acceptable from the systems perspective. Techniques to improve the reliability of vision systems have to be employed to guarantee correct continuous operation.

The insight that we obtain from studies of single modules only provides part of the solution to this problem. Each module is based on specific assumptions about the occurrence and computability of certain types of perceptual information. If these conditions are not satisfied, the module will not provide useful output. Hence, in a generic and variable environment, the system as a whole cannot rely on any single module. It must use a whole set of modules, matching the possible conditions in the environment. We conjecture that robustness can be improved by fusing the variety of information offered by the environment.

As a consequence, a new problem arises, namely that of selecting between, or fusing, these multiple sources of information. This forms one of the problems that we will investigate in this thesis. We also propose that a step in this direction can be taken by using simple and, in fact, even coarse algorithms, as long as they are applied in such an integrated manner.

A second issue concerns the type of modules which we are considering. Several authors, such as Fermüller and Aloimonos [1995], Uhlin et al. [1995], and Salgian and Ballard [1998], have stressed that the component algorithms of a “seeing system” should be simple for the system to be robust. The overall problem, then, becomes how and to what extent it is possible to derive perceptual information using multiple types of information and sometimes coarse algorithms.

1.1 Outline of our approach

As stated above, realistic environments offer a rich set of visual cues to scene structure and events. These cues can be both complementary and redundant, and various algorithms can be used to compute information about the scene from them. Any such algorithm is, of course, based on a specific set of assumptions, and it is, in general, difficult to determine whether these assumptions hold in the given environment or not. In computer vision, this lack of knowledge about the scene has been handled by computing uncertainty. Paradoxical though it may seem, these uncertainties provide no guaranteed solution. For instance, a shape-from-texture algorithm could be misled by surface markings depicting a surface orientation differing from the real surface orientation, and still be 100% certain of its results. In fact, every algorithm computing 3D structure from a flat image ‘hallucinates’ in this sense.¹

One way of overcoming this type of problem is based on exploiting the rich information of a natural world by fusing several types of information. A well-known and often applied fusion method is some sort of averaging, possibly including various mechanisms for dealing with outliers. Several methods rely on probabilistic techniques or regularization; see Chapter 2 for a more detailed discussion. These methods all require a computational model for the underlying cue, and their fusion requires some type of world model. Moreover, the averaging techniques generally require that the cues deliver information of the same sort, to be averaged. In this thesis, we will investigate how well a much weaker approach performs the same task. We will propose a scheme based on the coincidence of several cues, and specifically focus on a method which relaxes the requirement of computational models for each cue and the commensurability of these cues. An important aspect of this scheme is that it provides us with an explicit way of dealing with outliers.

More precisely, we will exploit the fact that the agreement of several cues is much more likely when the cues give correct output than when they give wrong output, and that such a coincidence, therefore, provides strong evidence for the cue output being correct. This fact is incorporated into a framework for finding simple qualitative models in images. The basis of the framework is a hypothesis-and-test approach, where one or more cues indicate an instance of a certain model, which then serves as a hypothesis to be tested by other cues. To obtain reasonable reliability in the testing phase, the output of multiple visual cues is fused by applying different voting schemes. It is in the voting schemes that the coincidence of cues is exploited.

This framework fits well in an architecture of a visual system which is able to exploit coarse information sources and simple mechanisms to provide robust data on the scene, without resorting to reconstruction or segmentation. A characteristic of this framework is also that it is likely to find the most conspicuous exemplar of the searched model first. This property seems quite relevant for an agent using vision to guide its behaviors, since the simplest solution becomes available early on. Such an agent can look for other exemplars of the model later, if needed, a process in the spirit of “any-time vision”, with possible use in real-time implementations for robot applications.

When voting is used to improve the reliability of integrated systems, it is not necessary to have exact models of the individual input cues, their error rates, or their confidence measures. Voting mechanisms can operate model-free with respect to the individual cues and exploit the non-accidental consensus of these cues.
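As an illustration of this model-free operation, a minimal m-out-of-n consensus over binary cue outputs can be sketched as follows; the cue names and votes are invented, and this is not the thesis's actual voting implementation:

```python
# Minimal sketch of model-free m-out-of-n voting over cue outputs.
# No error model or per-cue confidence measure is needed; only the
# (non-accidental) agreement of independent cues is exploited.
# The cue outputs below are invented for illustration.

def m_out_of_n(votes, m):
    """Accept when at least m of the cast votes agree on 'planar'."""
    return sum(votes) >= m

# Each cue reports 1 ("supports planarity") or 0 for a candidate region.
cue_votes = {
    "disparity":            1,
    "texture_gradient":     1,
    "five_point_invariant": 1,
    "junction_angles":      0,   # this cue happens to fail here
}

print(m_out_of_n(cue_votes.values(), m=3))  # True
```

Unanimity voting and plain majority voting, as discussed later, are the special cases m = n and m = ⌊n/2⌋ + 1 of this sketch.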

We apply voting to finding instances of planar surfaces in binocular images, a scene interpretation task that is of considerable importance in robotics. Having a hypothesis which tells us where to look in the image, we examine local depth and surface orientation cues, the number of image points supporting the hypothesis, and, above all, the number of different cues that support planarity. We then classify the shape of the image region, without actually reconstructing the surface or segmenting any image or surface. As cues for testing planarity, we use point-based disparities in a multi-point matching framework, perspective distortion of texture patterns assuming weak isotropy, algebraic invariants of five matched stereo points, grey-level homogeneity, and monocular junction angles.

[…] visual data, but it is the combination of these methods that we are interested in. The approach is applied to finding planar surfaces. By the very nature of our method, it will not give the best possible results for this task. However, we show that it will give results that are good enough for many robotics tasks. The methodology we use to show its performance and, in fact, to assess our techniques, is systematic experimental evaluation and comparison with the best known methods for stereo reconstruction.

1.2 Thesis outline

The current document is structured in the following manner. We first present approaches to cue integration as they have been described in the literature thus far. We explain how cue integration works, its advantages and disadvantages, as well as its applications. A new approach exploiting coincidence and applying a coarse voting strategy is proposed. The next chapter deals generally with voting mechanisms and particularly with their applicability to computer vision systems. The chapter primarily provides the formalism and notation of voting as it is used later in the document. In Chapter 4, we will describe how instances of planar surfaces can be found in monocular or binocular images, as a sample task for showing the efficiency of the proposed cue integration scheme.

In the second part of this thesis, we will present experiments with the proposed cue integration scheme, which exploits coinciding results of multiple cues to find instances of planar surfaces in monocular and binocular images, without resorting to scene reconstruction or segmentation. Comparisons between strong and weak integration methods are made, as well as comparisons to complete reconstruction. The results are discussed and put into a general perspective.

The appendix contains detailed information on the cues used for detecting planar surfaces in the experiments, as well as detailed images from the experiments.

CHAPTER 2
CUE INTEGRATION FRAMEWORK

“An important shared criterion [of physical systems] is that the behavioral description must be compositional, that is the description of a system’s behavior must be derivable from the structure of the system. The term ’structure’ refers to the components of the analysis, component behavior, and the connections between components. The term ’behavior’ refers to the time course of observable changes of state of the components and the system as a whole. Each component has some associated behavior, and the behavior of the system as a whole results from the interaction of the behaviors of the components through specified connections.”

[Bobrow, 1984]

As Bobrow [1984] describes in this context, physical systems consist of different components, and the overall behavior of a system depends on the behavior of the single components. The same holds for machine vision systems. In particular, they can benefit significantly from using information from multiple cues, as described in Chapter 1, and should do so to increase robustness and accuracy.

Early vision modules, based on shape from stereo, shading, texture, geometric structure, surface contours, motion, accommodation, and occlusion, are well studied, and many of them give good results when their underlying assumptions hold. Typically, there are constraints of one form or another that are imposed on the sensory task being addressed by the module. These constraints range from the surface smoothness constraint as in Grimson [1981], the rigidity constraint used in motion analysis (Tsai and Huang [1984], Ullman [1984], Liu and Huang [1986]), and the Lambertian assumption used in shape-from-shading (Horn [1975], Brown et al. [1983]), to the polyhedral object constraint of block-world object recognition systems (Roberts [1965]). Shape-from-texture algorithms rely on the non-universal assumption of directional isotropy (Witkin [1981]) or uniform density (Aloimonos [1988]), or weaker forms thereof (Gårding [1993b]). Most shape-from-shading methods require that the reflectance map is given, or at least that a parameterized reflectance map and the position of the light source are known (Pentland [1982], Brooks and Horn [1985]). If these constraints are not met from the beginning, the corresponding module will probably fail. If we consider a system such as a mobile robot using vision in a generic environment, be it outdoors or indoors, it will naturally be in situations, and direct its gaze to places, where the constraints for each particular module are not satisfied. When looking at uniformly painted walls and floors, a vision system cannot use texture-based methods for stereo, optic flow, or shape-from-texture. On the other hand, it may be easy to find vanishing points in such a case. For robustness, the system should, consequently, be able to use whichever module works in the particular case and thus not rely on a single cue or process.

In this chapter, we will present the concept of cue combination, give an overview of cue integration strategies, and compare the different strategies to each other. We will discuss how these existing schemes apply to the situation discussed above, where some cues are useful, but others are unavailable or highly erroneous, and propose a new approach to this problem by using voting.

2.1 Need for integration

An important motivation for cue combination is to overcome the weaknesses due to dependencies and constraints of individual cues.

Integration of different modules can also help solve “ill-posed” problems. Problems are called ill-posed, in the sense of Hadamard [1923], when their solutions do not exist, are not unique, or do not vary continuously with the input data. Aloimonos and Shulman [1989] showed how this could be done by applying regularization techniques. Integration may also help to reduce the uncertainty of the valid, but noisy, output of given modules by averaging out the different effects of noise of independent modules.
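The noise-averaging effect can be checked numerically. The following sketch, with arbitrary, invented noise levels, simulates several independent, unbiased depth modules and compares the squared error of a single module against that of their average:

```python
# Numerical check that averaging n independent, unbiased noisy estimates
# reduces the mean squared error by a factor of about n.
# All numbers (true depth, noise level) are illustrative only.
import random

random.seed(0)
true_depth = 2.0
sigma = 0.5          # per-module noise standard deviation (assumed)
n_modules = 4
trials = 20000

single_err2 = 0.0
fused_err2 = 0.0
for _ in range(trials):
    estimates = [random.gauss(true_depth, sigma) for _ in range(n_modules)]
    single_err2 += (estimates[0] - true_depth) ** 2
    fused = sum(estimates) / n_modules
    fused_err2 += (fused - true_depth) ** 2

print(single_err2 / fused_err2)  # close to n_modules (about 4)
```

This is the classical variance-reduction argument: the variance of the mean of n independent estimates is the single-module variance divided by n, which holds only when the modules' errors are genuinely independent.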

It is not guaranteed that cues are always available or observable in the environment (e.g. using shape-from-texture on an image of a scene without texture information). Natural environments provide a richness of different cues, but nonetheless, there is no guarantee that all cues are always observable or available. Using multiple cues can contribute to a robust system output, so that the lack of one cue does not cause the system output to fail.


A clear hint that cue integration can lead to robust perception is the robustness of human surface perception. Experiments on human visual perception (Bülthoff and Mallot [1987], Bülthoff and Mallot [1988], Yuille et al. [1990], Bülthoff [1991], Bülthoff and Yuille [1991], Johnston [1991], Tittle et al. [1995]) have shown that when only a single cue is used for the perception of depth, even humans can be fooled very easily. Natural scenes tend to have more cues available, and depth perception is much more accurate when exploiting this variety of cues.

In the following, we will investigate the different ways in which cue combination has been approached so far, and which results have been obtained. We will show the shortcomings of these results and, in view of them, develop our voting-based fusion approach, based on the coincidence of cues rather than their integration.

2.2 Strategies for integrating different modules

Clark and Yuille [1990] classify methods for the integration of different vision modules into two main groups, weak coupling and strong coupling. Weak coupling is when the outputs of two or more independent modules are combined, and strong coupling is when the output of one sensory module is affected by the output of another module, so that their outputs are no longer independent.

The way of integrating different modules into a common scheme depends on the modules to be integrated. For improving the reliability of the output of independent modules, a weighted combination of the outputs of the modules suffices. The weight of a module, with respect to the other modules, represents the relative reliability of the respective information source. Combining two or more modules which do not provide unique, robust solutions by themselves may yield such solutions with the help of a-priori constraints.
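A minimal sketch of such a weighted weak coupling is given below, using inverse-variance weights as the reliability measure (a common choice, not one mandated by the text); the depth values and variances are invented for illustration:

```python
# Sketch of weak coupling: fuse independent module outputs by a weighted
# average, where each weight encodes the module's relative reliability.
# Here the weights are the inverse variances of (assumed) module noise.

def fuse(outputs, variances):
    """Inverse-variance weighted average of commensurable module outputs."""
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    return sum(w * x for w, x in zip(weights, outputs)) / total

# Three modules estimate the same depth; stereo is trusted most.
depths    = [2.10, 1.90, 2.40]   # stereo, shading, texture (illustrative)
variances = [0.01, 0.04, 0.25]

print(round(fuse(depths, variances), 3))
```

Note that this kind of averaging presupposes exactly what the thesis later relaxes: the module outputs must be commensurable, and a noise model (here, a variance) must be available for each cue.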

The combination of independent or dependent sources of information can be found extensively in the literature. For example, Bülthoff and Mallot [1989] use weak coupling for shape-from-shading and shape-from-texture. Matthies et al. [1989] apply strong data fusion using Kalman filter techniques for depth-from-motion calculations. Clark and Yuille [1990] cite many examples where either weak or strong coupling was used, sometimes without any cue combination in mind.

Besides the distinction between weak and strong coupling, cue integration can also be classified according to the method of interaction of the different cues, as follows:

Veto: Information from one cue may override or challenge other cues. Bülthoff and Mallot [1987], for example, showed that in the human visual system, edge-based stereo matching overrides depth perception from shading.

Cooperation: Cooperation might help to reduce the uncertainty of the output by averaging out the different noise of independent modules.

Disambiguation: If information from one cue does not allow a unique solution, information derived from other cues might be used for disambiguation.

Accumulation: Information from different cues can be accumulated, for example, by summation of probabilities or by joint regularization, as described by Poggio et al. [1985] and Terzopoulos [1986].

Hierarchy: Information from one cue might be used as input data to another cue.

Only the last category would count as strong coupling, while the others are cases where the outputs of two or more independent modules are combined ‘weakly’. The categories are not necessarily mutually exclusive.

2.3 Integration methods

We will now give an overview of cue integration strategies. We will discuss probabilistic methods such as the Bayesian approach, Kalman filters, and Dempster-Shafer theory, as well as rule-based methods. Voting will be described in detail in Chapter 3.

2.3.1 Probabilistic methods

Theories to solve ill-posed problems are often based on the regularization paradigm, where the space of acceptable solutions of the problem is restricted by choosing functions that minimize certain functionals. The choice of functionals depends on mathematical considerations, as well as on a physical analysis of the constraints of the problem. Basically, the problem is regularized by imposing additional constraints which should be physically plausible.

Typical solutions to the regularization problem are based on relaxation methods with or without confidence measurements. The most widely used methods employ Bayesian probability theory: constraints are embedded based on a Bayesian interpretation of sensory information processing algorithms. In these approaches, different possible solutions are assigned probabilities of being the true solution, based on prior expectations and on models of the sensing process. The prior expectations can be based on previous measurements or on constraints that are imposed on the visual system. Approaches based on Markov Random Fields can be formulated as special cases of the Bayesian approach.

Relaxation processes have been successfully applied to a wide variety of labelling problems in computer vision; Davis and Rosenfeld [1981] present a survey of them. Bayesian image labelling is used by Chou and Brown [1990] for consistently combining several symbolic or numerical sources.

The power of Bayes’ theorem lies in the fact that it relates the quantity of interest (the probability that the hypothesis is true given the respective data) to the probability that we would have observed the measured data if the hypothesis were true.

Bayes’ theorem states:

$$\Pr(X \mid Y) = \frac{\Pr(Y \mid X)\,\Pr(X)}{\Pr(Y)}. \tag{2.1}$$

Using the standard formulation to express the a-priori knowledge about an event as a ratio of probabilities,

$$O[X] = \frac{\Pr[X]}{\Pr[\neg X]}, \tag{2.2a}$$

one gets

$$\Pr[X] = \frac{O[X]}{1 + O[X]} \quad\text{and}\quad \Pr[X \mid Y] = \frac{O[X \mid Y]}{1 + O[X \mid Y]}. \tag{2.2b}$$

The quantity $\Pr(X)$ is called the prior probability and represents the knowledge about the truth of the hypothesis before we have analysed the current data. $\Pr(X \mid Y)$ is the posterior probability, which represents the state of knowledge about the truth of the hypothesis given the current data. The marginalisation equation

$$\Pr(X) = \int_{-\infty}^{\infty} \Pr(X, Y)\,dY \tag{2.3}$$

is used to calculate a-priori probabilities, applying some probability density function (e.g. a binomial or Gaussian distribution) for $\Pr(X, Y)$ to reflect prior knowledge. This formalism has been used widely to fuse information from multiple cues.
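A small numeric illustration of Eqs. (2.1)-(2.2b), with invented probabilities for a hypothetical "planarity" cue:

```python
# Numeric illustration of Bayes' theorem (Eq. 2.1) and the odds form
# (Eqs. 2.2a-b) for updating a prior with one cue's report.
# All probabilities below are invented for illustration only.

def posterior(prior, p_data_given_h, p_data_given_not_h):
    """Pr(H|D) via Bayes' theorem, with Pr(D) expanded by marginalisation."""
    p_data = p_data_given_h * prior + p_data_given_not_h * (1.0 - prior)
    return p_data_given_h * prior / p_data

def odds(p):
    return p / (1.0 - p)

def prob_from_odds(o):
    return o / (1.0 + o)

prior = 0.5      # no a-priori preference for "planar" (assumed)
p_hit = 0.8      # cue fires given a planar surface (assumed)
p_false = 0.2    # cue fires given a non-planar surface (assumed)

post = posterior(prior, p_hit, p_false)

# The odds form gives the same answer: multiply the prior odds by the
# likelihood ratio, then convert back via Eq. (2.2b).
post_via_odds = prob_from_odds(odds(prior) * (p_hit / p_false))
assert abs(post - post_via_odds) < 1e-12
print(post)  # 0.8
```

The example also makes the model requirement of these methods concrete: both `p_hit` and `p_false` are exactly the kind of per-cue error model that the voting approach of this thesis avoids having to specify.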

Nielsen [1995] uses Bayesian depth estimation from binocular stereo images. He uses the assumption that physical space has no preferred direction, and that no further information on the surface orientation is available.

Das and Ahuja [1995] fuse stereo, vergence, and focus for depth perception. The performance of each cue is characterized by a derived expression for the standard deviation of the relative uncertainty in the presence of random errors in the imaging parameters. Das and Ahuja combine the computational and reliability aspects of the cues in a Bayesian cue selection scheme.

When Clark and Yuille [1990] fuse feature-based stereo for depth reconstruction, they assume that geometry and calibration information is available and use the epipolar constraint to find matching features. They solve the correspondence problem by choosing the disparity field which best fits the a-priori constraints, thus combining the stereo correspondence problem and the surface interpolation into one step (unlike Grimson [1981]). Different matching primitives are combined based on Bayesian theory, using weights according to their robustness. They also propose to use monocular cues, like eye movement or focusing, to disambiguate the correspondence problem. A prerequisite, nevertheless, is that the reliability of the monocular cue can be estimated.

Szeliski [1988b] uses the smooth-surface assumption in conjunction with Bayesian analysis. He obtains a dense depth estimate from sparse range data by using two-dimensional spline fitting and uses this dense estimate for matching new data points. The implicit model of the uncertainty of the interpolated surface makes it possible to calculate the uncertainty in the motion estimate from the shape of the energy surface in the vicinity of the optimal estimate. Detailed extensions of this work are described in [Szeliski, 1988a].

Bülthoff and Mallot [1987] carry out psychophysical experiments to investigate how people perceive 3D shape given different shape cues. A Bayesian framework solution of how to integrate shape cues for computing 3D shape is presented in [Bülthoff, 1991]. Particularly interesting are the theories for inhibition of conflicting cues. Bülthoff and Mallot [1987] classify depth cues into three categories:

1. primary depth cues that provide direct depth information, e.g. from binocular stereo,

2. secondary depth cues that may also be present in monocular images, e.g. shading, and

3. cues to flatness, inhibiting the perception of depth, for instance a frame surrounding a picture.

In case of conflict, the primary cues override the secondary cues.

Kalman filtering is used as a sensor data integration tool by Durrant-Whyte [1988], Wang and Wang [1994], Welch and Bishop [1997], and others. Wang and Wang [1994] use Kalman filters to obtain statistically optimal estimates of the imaged surface structure based on possibly noisy sensor measurements. The scheme is shown to work in an environment where surface depth, orientation, and curvature measurements are obtained from multiple sensors, and a description of the image object is calculated. Rao and Ballard [1996] describe a general framework for modeling invariant recognition, motion, and stereo. Their algorithms resemble the Kalman filter and are derived from the MDL principle for estimating the generative weights.

Abel and Dourandish [1986] use the Dempster-Shafer theory for data fusion, which has three major differences to the Bayesian theory.

1. Evidence is represented as a Shafer belief function instead of a probability density function.

2. Evidence is combined using Dempster's rule of combination. This rule assumes that the observations are independent and have a non-empty set of intersection.

3. The computation of the evidence for a proposition does not require prior odds. Without prior beliefs, Dempster-Shafer theory assumes ignorance.

For a comparison of Bayesian theory with Dempster-Shafer theory see Lee [1988] and Buede [1988]. Applications in sensor fusion are described by Guan et al. [1990] and Shafer et al. [1986].

Summarizing, probabilistic methods have widely been used in connection with fusion of multiple cues. The fusion has in most cases been used for computing scene reconstruction. Given a model that fits the input image data, and prior distributions of possible results, probabilistic methods are rather robust and of moderate complexity. Szeliski [1988a] writes that "Bayesian analysis can in general improve the performance of vision algorithms" and proposes his motion estimator from time-varying range data for real-time robot navigation. It may be asserted, however, that the goodness-of-fit of the model to the data is crucial in probabilistic methods, and the determination of an accurate and correct model is difficult and can be of high computational complexity.

2.3.2 Rule-based integration methods

The task of a rule-based integration system is to generate plausible analyses of the image data by building successively more specific interpretations based on data from initial modules. Rules to generate and evaluate such interpretations form the core of rule-based methods. The rules guide the image interpretation process by generating successively more specific interpretations based on the strength of the output of analysis modules, a-priori constraints, and further expectations. The uncertainty of the underlying data, as well as multiple and ambiguous interpretations of specific hypotheses, has to be encoded in the rules. Those interpretations with the highest confidence should have the most influence on which rule to apply.

An advantage of rule-based systems is the ease with which one may experiment with and thus adapt the rules. A major drawback of rule-based systems is that the rules for a problem must be fixed in advance. Similarly to the Bayesian framework, an accurate and correct model of the scene, the certainties of the used visual cues, and their interdependencies are important for a correctly working system.

Examples of rule-based systems in cue integration can be found in Pridmore et al. [1985] and Nazif and Levine [1984]. Pridmore et al. [1985] use a rule-based system combining local topological information with geometrical information for 3D grouping of disparities. Nazif and Levine [1984] use a rule-based system for segmenting images of natural scenes in order to understand their content. They present a solution to the image segmentation problem that is based on the design of a rule-based expert system. Knowledge rules encoding general knowledge about low-level properties of processes are employed to segment the image into uniform regions with connected lines. In addition to the knowledge rules, a set of control rules is employed. These control rules include meta rules that embody inference about the order in which the knowledge rules are matched. The control rules also incorporate focus-of-attention rules which determine the path of processing within the image. Furthermore, an additional set of higher-level rules dynamically alters the processing strategy.

Belknap et al. [1985] present a rule-based system for combining information from multiple sources of sensory data. Relational rules are responsible for creating complex aggregations of the data in order to obtain object hypotheses with associated confidences. The system is demonstrated using region and line data, and a set of relational measures defined over the two pixel-based representations. An extension to include motion, stereo, and range data is described.

2.4 Erroneous data and incompatible representation

The quality of the results of regularization methods, as of any other method, depends on how well the corresponding assumptions are satisfied. Regularization often assumes some kind of smoothness of the unknown function. This approach is bound to fail when the quantity to be computed is a discontinuous function. Usually, there are quite many discontinuities, and these are rather important for solving visual tasks.

Another problem, as pointed out by Aloimonos and Shulman [1989], is that standard regularization theory deals with linear problems and is based on quadratic stabilizers. For non-quadratic functionals, the search space may have many local minima, and in this case only stochastic algorithms, such as simulated annealing [Kirkpatrick, 1984], might have success.

2.4.1 The strength and weakness of probabilistic approaches

The strength of probabilistic methods in sensing applications is based on the availability of information useful for constructing the likelihood ratios, and on determining prior odds and constraints. But as soon as estimates of the necessary probability distributions cannot be attained due to unpredictable sensing environments, these approaches give less reliable results (see for example [Shafer, 1976]).

Bayesian models provide estimates of a scene based on image data and a-priori constraints. They constitute a very powerful approach and have been successfully applied in many computer vision fields, such as reconstruction (based on occluding contours, motion, shading), edge detection, texture segmentation, surface interpolation, stereo matching, and tracking.

Their success, however, depends on accurate modelling of the a-priori probabilities of these image features. Because these probabilities depend strongly on the assumptions made by the model, they only apply to images where these assumptions hold. The main problem with Bayesian models is to determine these probabilities. It is difficult, if not impossible, to determine them objectively, because the exact form of the constraints may require learning and trial-and-error experiments (if compared to the human visual system). Only if the domain of the application of the vision system can be modeled precisely can the probabilities be determined from that model. Bülthoff and Yuille [1991] state that

“The Bayesian approach uncritically used may lead to rather mindless theories of sensor fusion. It is important to carefully analyse the situation and estimate the dependencies and robustness of the different cues.”

2.5 Cue integration using coincidence

An alternative method of integrating cues is to exploit incidental agreement of multiple cues, since the agreement of several cues generally provides strong evidence for a given interpretation. Lowe [1987] presents viewpoint-invariant grouping of image features due to non-accidentalness, even in the absence of specific information regarding the object present in the image. Cho and Meer [1997] also use consensus information for image segmentation.

In this thesis, we will investigate two different strategies for exploiting coincident information. The first strategy is based on a hypothesis-and-test scheme, and can therefore be regarded as partly rule-based. The second strategy will use voting methods. Next we will discuss how these two strategies can be applied to visual cues. In Chapter 3, we will describe voting methods and their mathematical basis in more detail.

To perform cue integration using voting schemes, a common voting space must be set up. For image-based cue integration, it is convenient to use the image space as the voting space. That means that the different estimators vote for a particular characteristic at each pixel in the image. In addition to spatial coordinates, the space must also include dimensions corresponding to the possible outcomes. Here it is important to note that cues can be divided into two categories.

• Qualitative classes

• Quantitative descriptors

For the qualitative cues, the voting space is augmented by a dimension corresponding to the possible classes. For figure-ground segmentation, the axes would be (x, y, {figure, background}). Since different cues might provide


different types of output, it is here necessary to choose the lowest common denominator as the basis for the voting space. Each cue will then provide a binary vote specifying class membership. Subsequently, the voting space is thresholded and the class membership is determined. For the quantitative estimators, the voting space is augmented by dimensions corresponding to the cardinality of the parameter space. For motion segmentation, the dimension would be augmented by 1 (for the motion direction); for determining surface normals, it would be 2 (for slant and tilt).
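For the qualitative case, the accumulate-and-threshold step can be made concrete with a small sketch (our own illustration, not code from the thesis): each cue supplies a binary figure/background vote per pixel, and the votes are thresholded.

```python
def classify_pixels(cue_votes, threshold):
    """Qualitative voting space (x, y, {figure, background}): each cue
    supplies a binary vote (1 = figure, 0 = background) per pixel; a
    pixel is labelled 'figure' when the number of supporting votes
    reaches the threshold."""
    rows, cols = len(cue_votes[0]), len(cue_votes[0][0])
    labels = [["background"] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if sum(cue[r][c] for cue in cue_votes) >= threshold:
                labels[r][c] = "figure"
    return labels
```

Setting the threshold to the number of cues gives unanimity voting; setting it to more than half gives majority voting.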

2.5.1 Basic form of cue integration

In this thesis, we will investigate this basic form of cue integration using voting. Even though voting-based integration schemes will not be as accurate as complete scene reconstruction, this coarse type of weakly coupled integration of visual cues will be shown to give reasonable results in many cases. In Chapter 7 we will show how our voting scheme behaves when relying on qualitative vote classes only. We will use different voting sub-schemes to determine the outcome of the voting-based integration system; more precisely, we will use unanimity, majority voting, and m-out-of-n voting (see Chapter 3 for the definitions of voting sub-schemes). We will apply these schemes to finding instances of planar surfaces in images.

2.5.2 Further refinements of the basic integration scheme

Spatial distribution of the votes

One problem that must be observed is that it might not be sufficient to use point-based votes. Having, for example, determined that five points in the hypothesis region belong to a model, e.g. are lying on a plane, does not necessarily mean that all points in the region, or at least the majority of the points in the region, are lying in the same plane. Since different cue estimators might use different sets of points for computing the cue, votes at different locations in the region might be combined and thus lead to non-robust system output. Consequently, in some situations, it can be of interest to vote in a neighbourhood of the determined cue position to ensure robustness. This is illustrated in Figure 2.1. If an estimate of the reliability of different cues is available, for example from empirical evaluation, it is possible to estimate the reliability of the integrated representation. For binary cue estimators, each cue can be modeled as a Bernoulli random variable. The combination of different cues can then be estimated by the following sum

    Σ_{i≥m} Σ_Ω Pr(V(v_1, v_2, ...) = i),                       (2.4)

where Ω denotes the possible combinations of cues that achieve i-out-of-m votes, and Pr denotes the corresponding probability. If the cues can be considered as independent and of equal reliability, the sum turns into a cumulative binomial distribution. The formula above can be used to demonstrate that voting-based cue integration can improve reliability considerably, even if only a small number of cues are in agreement. The relationship between m-out-of-5 voting and cue reliability is shown in Figure 2.2. Here, it is immediately obvious that good reliability can be achieved if only 2 out of 5 cues are in agreement, even if they fail on average 60% of the time. Even though our cues are not really independent, we investigate a neighbourhood voting scheme in Chapter 9.

Figure 2.1: Use of a neighbourhood voting scheme to ensure that slight differences between different cues do not result in an unstable classification.

Figure 2.2: Reliability as a function of m, the number of cue estimators that are required to succeed. The reliability is plotted for the cue reliabilities 0.4, 0.5, 0.6, 0.7 and 0.8, respectively.
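Under the independence and equal-reliability assumption, the cumulative binomial behind Figure 2.2 can be evaluated directly (a sketch; the function name is ours):

```python
from math import comb

def voting_reliability(p, m, n):
    """Probability that at least m of n independent cues, each correct
    with probability p, agree on the correct output (the cumulative
    binomial form of Eq. (2.4))."""
    return sum(comb(n, i) * p**i * (1.0 - p)**(n - i) for i in range(m, n + 1))
```

For instance, voting_reliability(0.4, 2, 5) is about 0.66: even cues that individually fail 60% of the time yield a usable 2-out-of-5 consensus.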

Hierarchical voting

One disadvantage of the above approach is that the voting is carried out in a space that is common to all cue estimators. This implies that extra information that might be available from some of the estimators is discarded. The problem of disregarding possible extra information can be circumvented by using a hierarchical approach. An initial classification is carried out in the common voting space, as outlined above. Once the classification has been performed, an additional fusion is carried out in the full voting space. This full voting space is the super-set of the output of all estimators. Each estimator outputs a vote in a sub-set of the full voting space. The combination of different cues is performed using the fusion operator Λ defined in Definition 3.2.1 in Chapter 3.

(31)

Such an approach can, for example, be used for motion estimation. Some of the cue estimators might only specify motion for different regions in the image, while other estimators might specify the perpendicular or full optical flow. Finally, feature-based approaches might indicate motion for particular features. All of these cue estimators will vote for motion/no-motion, but some of them will, in addition, vote for a particular velocity and direction of the motion for each pixel or region. The initial voting-based classification is performed to locate regions that might contain moving objects. Subsequently, the parametric information is fused using the fusion operator. The fusion might for instance be simple averaging. If confidence information is available for each of the cue estimators, this might also be taken into account.
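A minimal sketch of this two-stage idea for the motion example (the dict-based cue interface is our own invention): stage one counts binary motion votes in the common space; stage two averages the velocity estimates of the supporting cues that provide one.

```python
def hierarchical_vote(cue_outputs, m):
    """Stage 1: m-out-of-n vote on motion/no-motion in the common
    voting space.  Stage 2: fuse the optional parametric estimates
    (here a scalar velocity) of the supporting cues by averaging."""
    if sum(1 for cue in cue_outputs if cue["moving"]) < m:
        return {"moving": False}
    velocities = [cue["velocity"] for cue in cue_outputs
                  if cue["moving"] and "velocity" in cue]
    fused = sum(velocities) / len(velocities) if velocities else None
    return {"moving": True, "velocity": fused}
```

Simple averaging here stands in for the fusion operator Λ; a confidence-weighted average would slot in the same way.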

In Chapter 9 we will apply this type of hierarchical voting technique. When searching for instances of planar surfaces in the image, we will first vote for planarity/non-planarity as the common denominator of all cues and, in a second step, use the surface normal, provided by some of the modules, for voting for an orientation of the hypothesized surface. We will also use additional information provided by the cues for improved segmentation of the surface.

Hypothesis-and-test scheme

A different approach is to use simple binary cues for hypothesis generation. Once regions of interest have been hypothesized, richer cues can be used for a more accurate estimation.

In [Bräutigam et al., 1996] it is demonstrated how such a hypothesis-and-test approach can be applied to finding planar surfaces in binocular images. Binocular disparities and monocularly determined L-junctions are used to estimate local surface normals at specific points in the images. Hypotheses of planar surfaces are issued when both cues support the same surface orientation. The hypothesis testing is done by calculating the plane projectivity mapping left image features to the corresponding right image features, and computing cross-correlations between the left image and the warped right image. Experiments with this method will appear in Chapter 6.
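The correlation test can be sketched with a generic normalised cross-correlation over flattened patches (our own illustration, not the thesis implementation):

```python
import math

def ncc(a, b):
    """Normalised cross-correlation between two equally sized image
    patches given as flat lists of intensities; a value near 1.0 means
    the warped right patch agrees with the left patch, supporting the
    planar-surface hypothesis."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    num = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    den_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    den_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    return num / (den_a * den_b)
```

The normalisation makes the score invariant to affine intensity changes between the two views.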

In Chapter 7 we will use an even simpler hypothesis generation method, where L-junctions serve as indicators for localizing images of planar surfaces in the scene.

The final proposal of a coarse fusion method

Putting everything together, our proposed fusion method has the following structure:

1. The hypothesis-and-test scheme is the underlying idea of our fusion method. We will be using simple visual cues to get coarse hypotheses of the location of instances of the model or event we are looking for. This avoids performing subsequent computations on the whole image, but restricts them to the hypothesis areas.



2. In the testing phase, we will use several qualitative cues providing binary votes supporting or rejecting the initial hypotheses. Voting schemes will be applied to determine the outcome.

3. Additional information, which might be provided by some cues, is available for hierarchical voting. Whether this step should be done depends on the demands of the individual application.

This method will likely never provide fusion of the same quality as that of more powerful probabilistic methods. However, as we will show in our experiments, it will, under rather weak assumptions, give useful results in the plane-finding task without requiring any additional segmentation process.
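The three steps above can be sketched as a pipeline (all cue interfaces here are hypothetical stand-ins, not the thesis code):

```python
def find_model_instances(hypothesis_cue, test_cues, m):
    """Step 1: a cheap cue proposes candidate regions.  Step 2: each
    candidate is kept only if at least m of the test cues cast a
    supporting binary vote.  Hierarchical voting (step 3) could then
    refine the survivors with any extra parametric cue output."""
    accepted = []
    for region in hypothesis_cue():
        votes = sum(1 for cue in test_cues if cue(region))
        if votes >= m:
            accepted.append(region)
    return accepted
```

Restricting the (possibly expensive) test cues to the hypothesized regions is what keeps the overall scheme coarse but cheap.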

2.6 Summary and discussion

In this chapter, the need for cue integration in the context of a vision system has been described. Even though early vision modules are rather successful, they are not applicable in all situations. Each module puts certain constraints on the sensory task being addressed, which are not met everywhere in a complex environment. Integrating different visual cues has the following advantages:

• Make the sensory task well-posed, i.e. provide unique solutions which depend continuously on the input data.

• Reduce dependencies on a-priori assumptions and constraints.

• Reduce uncertainties.

Different strategies for the integration process have been surveyed. After discussing the strengths and weaknesses of traditional probabilistic integration approaches, we propose a basic form of cue integration using voting schemes for weak coupling.

CHAPTER 3

VOTING METHODS

Many different methods for integration of information have been suggested in the literature, as for example described by Bloch [1996] and Lam and Suen [1997]. These methods have primarily been used for pattern recognition.

To produce reliable systems from more or less unreliable input data, voting (rules for making collective decisions) has been transferred from the social sciences. Voting, in general, deals with a set of equivalent input data objects and produces the output which is approved by most of the input data objects. The intuitive idea of voting is that the probability that a majority of modules votes for a wrong output is less than the probability that a single module votes wrong.

When voting is used to improve the reliability of integrated systems, it is not necessary to have exact models of the individual input data cues and their error rates and confidence measures. Voting mechanisms can operate model-free with respect to the individual cues and exploit the non-accidental consensus of these cues. This is the major advantage of the voting process, as it is often difficult to obtain explicit or probabilistic models of the individual cues. The advantage of not needing accurate models is, at the same time, a disadvantage. When accurate and correct models of the input cues are available, they can be used to produce better results than with simple consensus voting schemes. Leung [1995], for instance, proposed a maximum-likelihood voting strategy and showed that using the reliability of each cue to determine the most likely result gives better performance than consensus voting.

In this chapter, we will present the basic characteristics of voting schemes, define the m-out-of-n voting scheme as a special case of weighted consensus voting, and compare different voting sub-schemes with each other. This part is largely adapted from Parhami [1994] and Pirjanian et al. [1998]. This chapter serves as an introduction to the notation and formalism used later in the document rather than as a presentation of novel contributions.

            Input               Output
    Data    Exact / Inexact     Consensus / Compromise
    Vote    Preset / Adaptive   Threshold / Plurality

Table 3.1: Binary 4-cube classification for voting algorithms based on variations in input/output data and input/output votes (from Parhami [1994])

3.1 Components of a voting algorithm

A voting algorithm is, in general, dealing with n input data objects x_i having associated weights v_i, and producing the output data-weight pair (y, w). The voting algorithm can also produce a set of n ‘support bits’ s_i, one for each input, that indicates whether a given input supports, or agrees with, the output y. The main components of a voting algorithm are input data, input votes, output data, and output votes, as well as a support function and a voting scheme. The support function defines whether two input data objects belong to the same equivalence class, i.e. are equal in some respect, and therefore support the same output data object, while the voting scheme specifies the condition which must be satisfied for a set of equivalent input data objects to decisively vote for the output data.

These main components of a voting algorithm have been used by Parhami [1994] to impose a binary 4-cube classification scheme of voting algorithms, leading to 16 classes as shown in Table 3.1.

Definition 3.1.1 (Exact and inexact voting)

A voting scheme is called exact when the input data objects belong to a fixed pre-defined set of values; it is called inexact when the input data objects represent flexible neighbourhoods.

When voting on planarity, we will use an exact voting scheme, since there are only two possibilities: planar and not planar. The voting scheme for the orientation of a planar surface will be an inexact voting scheme, as the orientation is given by an approximate normal vector.

Definition 3.1.2 (Consensus and compromise voting)

A voting scheme is called consensus voting when the output y is an element of the input data set. It is called compromise voting when the output is a compromise object which is constructed based on the input data set.

Definition 3.1.3 (Threshold and plurality voting)

Threshold voting requires that the weight w of the output exceeds a certain preset threshold. This threshold is mostly given as a percentage of the maximum possible weight. The case where y is simply identified with the maximum support from the input data objects is called plurality voting.

A typical plurality voting scheme is applied in parliamentary elections. All input votes have the same weights and the candidate who gets the most votes wins the election.

Definition 3.1.4 (Adaptive and preset voting)

A voting scheme is called adaptive when the weights v_i of the input data are allowed to change, i.e. are adjustable. It is called preset otherwise.

3.1.1 The support function

The support function is an equivalence relation between input data objects. For exact voting, this equivalence relation is the equality between input data objects. When the voting algorithm is dealing with input data objects which represent flexible neighbourhoods, the support function approximates an equality relation using a distance function d : X² → IR on pairs of input data objects, and a threshold ε. Two input objects x_i and x_j are equivalent, i.e. support each other, if d(x_i, x_j) < ε. The distance function should, hereby, satisfy the following conditions:

1. d(x_i, x_j) ≥ 0.

2. d(x_i, x_j) = 0 iff x_i = x_j.

3. d(x_i, x_j) = d(x_j, x_i).

It should be noted that the distance measure need not be transitive.

Example 3.1.1

When voting for a certain surface orientation of image patches, input data objects are normal vectors in IR³. The distance measure might be the angle α between the normal vectors v, w ∈ IR³:

    d(v, w) = α = arccos( (v · w) / (|v| |w|) ).
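Example 3.1.1 translates directly into code (a sketch; the clamp on the cosine is our addition to guard against floating-point rounding):

```python
import math

def angle_distance(v, w):
    """Distance measure of Example 3.1.1: the angle alpha between two
    normal vectors in R^3."""
    dot = sum(a * b for a, b in zip(v, w))
    norms = math.sqrt(sum(a * a for a in v)) * math.sqrt(sum(b * b for b in w))
    # clamp the cosine to [-1, 1] to absorb rounding error
    return math.acos(max(-1.0, min(1.0, dot / norms)))

def supports(v, w, eps):
    """Inexact support function: v and w support each other iff
    d(v, w) < eps."""
    return angle_distance(v, w) < eps
```

Note that the conditions 1–3 hold (non-negativity, identity, symmetry), while transitivity does not: two vectors each within ε of a third need not be within ε of each other.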

3.1.2 Voting sub-schemes

For threshold voting, the output data object y “is supported” by input data objects x_i, with their associated non-negative real weights v_i, when the following conditions are satisfied:

Unanimity voting: all input data objects support y.

    w = Σ_{i=1}^{n} v_i

Majority voting: so many input data objects support y that the sum of their respective weights is more than half of the sum of the weights of all input data objects.

    w > (1/2) Σ_{i=1}^{n} v_i

m-out-of-n voting: at least m of the n input data objects support y.

    w ≥ m, with v_i ≡ 1.

If m ≤ n/2, then y can be non-unique.

t-out-of-V voting: so many input data objects support y that the sum of their respective weights is at least t.

    w ≥ t.

If t ≤ (1/2) Σ_{i=1}^{n} v_i, then y can be non-unique.

In our case, the input data objects are results from calculations on sensory information. We want to check whether they agree because their agreement provides strong evidence for a certain interpretation of the image scene.

The threshold used in the voting scheme is a direct expression of the maximum number of cues which we accept to disagree while still interpreting the image scene according to an agreement. Having, for example, a Byzantine voting scheme (majority voting with w > (2/3)V) means that the voting must arrive at a consistent conclusion in the presence of fewer than 1/3 faulty input data objects.

For plurality voting, “y is supported” is equivalent to “no other y′ is supported by inputs having more weight”.
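The four threshold sub-schemes reduce to simple predicates on the support weight w (a direct transcription of the conditions above, not code from the thesis):

```python
def unanimity(w, weights):
    """All inputs support y: w equals the total weight."""
    return w == sum(weights)

def majority(w, weights):
    """The supporting weight exceeds half of the total weight."""
    return w > sum(weights) / 2.0

def m_out_of_n(support_count, m):
    """At least m of the n unit-weight inputs support y."""
    return support_count >= m

def t_out_of_V(w, t):
    """The supporting weight reaches the absolute threshold t."""
    return w >= t
```

With unit weights, majority voting is t-out-of-V voting with t just above n/2, and unanimity is m-out-of-n voting with m = n.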

3.2 Weighted Consensus Voting

The following definition of weighted consensus voting from Parhami [1994] comprehensively includes all consensus voting schemes of practical interest.



Definition 3.2.1 (Weighted Consensus Voting)

Given n input data objects x_i, with n associated non-negative real weights v_i, Weighted Consensus Voting is the calculation of the output y and its weight w such that y is “supported” by several input data objects with weights totaling w, where w satisfies a condition associated with the desired threshold or plurality voting sub-scheme.

The meaning of the term “supported”, used in the above definition, depends on the voting scheme used. In exact voting, it would be defined as “an input object x_i supports y iff x_i = y”, while in inexact voting, it would be defined as “an input object x_i supports y iff x_i ≈ y” with some appropriately defined equivalence relation ≈. This equivalence relation usually incorporates a distance measure d and a threshold ε in the following relationship:

    x_i ≈ x_j iff d(x_i, x_j) < ε.

The distance measure d does not have to be transitive; that is, from d(x_i, x_j) < ε and d(x_j, x_k) < ε it does not necessarily follow that d(x_i, x_k) < ε.

A general class of consensus voting schemes is defined by Pirjanian et al. [1998] as follows:

Definition 3.2.1 (m-out-of-n voting)

An m-out-of-n voting scheme, V : Θ → [0, 1], where n is the number of cue estimators, is defined in the following way:

    V(θ) = Λ(c_1(θ), ..., c_n(θ))   if Σ_{i=1}^{n} v_i(θ) ≥ m;
    V(θ) = 0                        otherwise,                  (3.1a)

where

    v_i(θ) = 1   if c_i(θ) > 0;
    v_i(θ) = 0   otherwise,   for i = 1, ..., n,                (3.1b)

is the voting function and Λ : [0, 1]ⁿ → [0, 1] is a function for combining the confidences of the estimators. □

Using this m-out-of-n voting scheme, a cue estimator can give a vote for a given class θ when the output of the estimator is greater than 0. If m or more cues vote for a given class θ, the value is estimated using the fusion method Λ. As an example, multiple cues may be used for estimation of planarity. If more than m cues suggest the presence of a planar structure, the orientation is estimated using Λ. The motivation for not using simple averaging is that the different cues might have different levels of uncertainty associated with them. These different levels of uncertainty can be taken into account by the fusion operator Λ.
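Definition 3.2.1 can be transcribed as follows (a sketch; plain averaging stands in for an application-specific fusion operator Λ):

```python
def m_out_of_n_vote(confidences, m, fuse=None):
    """m-out-of-n voting of Definition 3.2.1: estimator i casts a
    binary vote v_i = 1 iff its confidence c_i(theta) > 0; when at
    least m votes are cast, the confidences are combined by the
    fusion operator Lambda (defaulting to averaging here),
    otherwise the result is 0."""
    if fuse is None:
        fuse = lambda cs: sum(cs) / len(cs)
    votes = sum(1 for c in confidences if c > 0)
    return fuse(confidences) if votes >= m else 0.0
```

Passing a different `fuse` function allows, for example, an uncertainty-weighted combination instead of the average.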


3.3 Applicability of voting schemes

Voting is a useful operation in the realization of reliable systems that are based on the multi-channel computation paradigm. There have been many investigations of how well voting can perform. Lam and Suen [1997] analyse the performance of majority voting in an application to optical character recognition. They assume a binomial distribution to determine the probabilities of consensus and show that adding one vote to an odd number of input data cues decreases the error and correct classification rates, whether the input cues are independent or not. Adding a vote to an even number of input data cues increases the recognition rate. They also derive conditions where the assumption of equal probabilities among the input modules is relaxed, and where the assumption of the independence of the modules may possibly be dropped.

Blough and Sullivan [1990] analyse several common voting strategies, such as majority, median, mean, and plurality voting. They show that plurality voting has the highest probability of choosing the correct output value, and that it is the optimal voting strategy when all output values have equal a-priori probability of occurrence, as well as equal probability of being produced by a faulty cue. For situations where these probabilities are unknown or impossible to estimate, they show that plurality voting is still superior to all other voting schemes.
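A plurality voter reduces to a maximum over vote counts (a sketch; in this version, ties fall to the value first reaching the maximum count):

```python
def plurality(values):
    """Plurality voting: return the output value produced by the
    largest number of input modules."""
    counts = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1
    # max over keys, compared by their accumulated vote counts
    return max(counts, key=counts.get)
```

Unlike majority voting, plurality voting always produces an output, even when no value is supported by more than half of the inputs.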

Applications of voting schemes are frequently found in fault-tolerant real-time systems and include software redundancy, Byzantine agreement, and clock synchronization (see Bräutigam et al. [1994] for an overview).

Lopresti and Zhou [1997] apply consensus sequence voting to optical character recognition. Consensus sequence voting is based on an idea from molecular biology, where it is not uncommon to be confronted with three or more related DNA sequences and the need to determine their most plausible shared ancestor. In consensus sequence voting, the output of multiple equivalent classifiers is voted on by defining a distance measure d in the output space and determining C such that the combined distance of C to each classifier's output ci is minimized: $D(c_1, c_2, \ldots, c_n) = \min_{C \in \Theta} \sum_{i=1}^{n} d(C, c_i)$. In a binary output space with a binary distance function, consensus sequence voting is equivalent to simple majority voting.
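Under the definitions above, a consensus sequence vote is a minimization over a candidate space Θ; the binary example below illustrates the stated equivalence to simple majority voting (names and the 0/1 encoding are illustrative):

```python
def consensus_vote(outputs, candidates, dist):
    """Choose C in the candidate space minimizing sum_i dist(C, c_i)."""
    return min(candidates, key=lambda C: sum(dist(C, c) for c in outputs))

# Binary output space with the binary (0/1) distance: reduces to majority voting.
binary = lambda a, b: 0 if a == b else 1
print(consensus_vote([1, 0, 1, 1, 0], [0, 1], binary))  # 1, the majority value
```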

Examples of non-majority voting are m-out-of-n voting schemes as described in Pirjanian et al. [1997], Pirjanian et al. [1998], and Bräutigam et al. [1998]. M-out-of-n voting is also similar to n-version programming as described in [Dugan and Lyu, 1994]. Pirjanian et al. [1998] show that the reliability of the voting system will be higher than the reliability of each input cue when basic design guidelines are followed. Pirjanian et al. [1998] use the approach in an active vision application of tracking objects. Their combined results using the m-out-of-n voting scheme improved tracking performance significantly.
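An m-out-of-n acceptance test can be sketched as follows; the hypothesis is accepted as soon as at least m of the n cues vote for it (the planarity-vote example and all names are illustrative, not taken from the cited papers):

```python
def m_out_of_n(votes, m, target):
    """Accept the hypothesis `target` iff at least m of the n cues voted for it."""
    return sum(1 for v in votes if v == target) >= m

# 3-out-of-5 voting on a hypothetical planarity hypothesis
votes = ['planar', 'planar', 'not-planar', 'planar', 'not-planar']
print(m_out_of_n(votes, 3, 'planar'))  # True: three of five cues agree
```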


A related scheme is the stepwise negotiating voting presented by Kanekawa et al. [1989], where the system chooses the output data according to a-priori information about the system state. A self-checking function determines whether majority voting or stand-by redundancy is chosen. Stepwise negotiating voting essentially amounts to a 2-out-of-n voting scheme.

Voting is implicitly used in conjunction with the Hough transform. The Hough transform is a method of detecting complex patterns of points in image data. It achieves this by determining specific values of the parameters which characterize these patterns. Spatially extended patterns in the image space are thus transformed into spatially compact features in the space of possible parameter values. The difficult global detection problem in image space is transformed into a more easily solvable problem in parameter space. Local peak detection or simple voting is applied to find solutions in the parameter space. See Illingworth and Kittler [1988] and Leavers [1993] for surveys of Hough transforms, and Ballard [1981] for a generalization of the Hough transform to detect arbitrary shapes. An example is the work of Palmer et al. [1993], who use a Hough transform with a 2D voting kernel for testing hypotheses in a line finding algorithm. Representing lines by their parametric equation ρ = x cos α + y sin α, the Hough transform uses a set of edge pixels as input and considers all possible lines given by the value pairs (ρ, α) with which each edge pixel may be associated. They apply a plurality voting scheme on the set of (ρ, α) pairs to find the best fitting line for the edge pixels.
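A minimal sketch of Hough line detection with a plurality vote on the (ρ, α) accumulator (the accumulator sizes and coordinate range are illustrative assumptions, not the choices of Palmer et al.):

```python
import numpy as np

def hough_lines(edge_points, n_rho=128, n_alpha=180, img_diag=200.0):
    """Accumulate votes over (rho, alpha) bins for rho = x*cos(a) + y*sin(a),
    then pick the plurality winner as the best-fitting line."""
    acc = np.zeros((n_rho, n_alpha), dtype=int)
    alphas = np.linspace(0.0, np.pi, n_alpha, endpoint=False)
    for x, y in edge_points:
        # each edge pixel votes for every (rho, alpha) it may belong to
        rhos = x * np.cos(alphas) + y * np.sin(alphas)
        # map rho in [-img_diag, img_diag] to an accumulator bin
        bins = np.clip(((rhos + img_diag) / (2 * img_diag) * n_rho).astype(int),
                       0, n_rho - 1)
        acc[bins, np.arange(n_alpha)] += 1
    r_bin, a_bin = np.unravel_index(np.argmax(acc), acc.shape)
    rho = r_bin / n_rho * 2 * img_diag - img_diag
    return rho, alphas[a_bin]

# points on the horizontal line y = 10: expect alpha near pi/2 and
# rho near 10, up to the coarse bin resolution
pts = [(x, 10.0) for x in range(0, 50, 2)]
rho, alpha = hough_lines(pts)
```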

3.4 Summary

In this chapter, we present the basic concept of voting methods. We present the different parts of a voting algorithm and explain their function within the cue integration context. We then describe weighted consensus voting, a rather general voting scheme which will be used in all further voting approaches in this document. It is shown how the different voting sub-schemes are related to each other, and where voting schemes have been used thus far in computer vision and in other fields. This chapter serves as an introduction to the notation and formalism used later in the document, rather than as a presentation of novel contributions.


CHAPTER 4

PLANAR SURFACES

The demonstration task that we have chosen for the fusion process is to find regions in the images corresponding to planar surfaces in the world. This is by itself an important task, especially in robot vision. By examining local depth and surface orientation cues in the images, the strength of the support from these local cues, the number of image points supporting it, and, above all, the number of different cues that support planarity, we rapidly generate hypotheses about planarity without actually reconstructing the surfaces or segmenting any image or surface. In our experiments, we use a number of cues to planarity: point based disparities in a multi-point matching framework, perspective distortion of texture patterns assuming weak isotropy, texture homogeneity, algebraic invariants of five matched points, grey level homogeneity, and monocular pairs of L-junctions of right angles. We will describe these techniques in detail in Appendix C.

In this chapter we give an overview of existing methods for detecting regions in images which correspond to planar surfaces in the scene. These methods comprise algorithms which only produce hypotheses of planarity, as well as algorithms which are based on reconstruction and surface orientations. We will describe in more detail methods based on point correspondences in stereo images and on direct surface estimates from texture, since these are the methods which we use in our experiments.

4.1 Methods based on binocular stereo

Planar regions can be segmented from two views of a scene by using point correspondences. The fundamental matrix relates points in one image to
