
Estimation of visual focus for control of a FOA-based image coder

Master's thesis at Image Coding Group, University of Linköping

by

Stefan Carlén

Reg nr: LiTH-ISY-EX-3395-2003
Linköping 2003


Estimation of visual focus for control of a FOA-based image coder

Master's thesis at Image Coding Group, University of Linköping

by

Stefan Carlén

Reg nr: LiTH-ISY-EX-3395-2003

Supervisor: Peter Bergström
Examiner: Robert Forchheimer

Linköping, 7th November 2003.


Avdelning, Institution (Division, Department): Institutionen för Systemteknik, 581 83 LINKÖPING

Datum (Date): 2003-10-15

Språk (Language): Engelska/English

Rapporttyp (Report category): Examensarbete

ISRN: LITH-ISY-EX-3395-2003

URL för elektronisk version: http://www.ep.liu.se/exjobb/isy/2003/3395/

Titel (Title): Estimering av visuellt fokus för kontroll av en FOA-baserad bildkodare / Estimation of visual focus for control of a FOA-based image coder

Författare (Author): Stefan Carlén


Abstract

A major feature of the human eye is the compressed sensitivity of the retina. An image coder which makes use of this can heavily compress the parts of the image which are not close to the focus of our eyes. Existing image coding schemes require that the gaze direction of the viewer is measured. However, a great advantage would be if an estimator could predict the focus of attention (FOA) regions in the image.

This report presents such an implementation, which is based on a model that mimics many of the biological features of the human visual system (HVS). For example, it uses a center-surround mechanism, which is a replica of the receptive fields of the neurons in the HVS.

An extra feature of the implementation is the extension to handle video sequences, and the expansion of the FOA:s. Tests of the system show good results on a large variety of images.

Keywords: image coding, focus of attention, saliency map, FOA expansion


Acknowledgments

First of all I would like to thank my supervisor Peter Bergström and my examiner Robert Forchheimer for giving me the opportunity to work on this project, and also for valuable advice and input.

Also, many thanks to Laurent Itti for letting me have access to his implementation of the model.

Last but not least, I would like to thank Sara Arvidsson, Andreas Axelsson, Andreas Bergström and Erik Björhäll for interesting and needed coffee breaks, and for participating as a test panel.


Notation

Symbols

x, y Global spatial coordinates.

ξ_1, ξ_2 Local spatial coordinates.

Operators and functions

∗ 1-D or 2-D convolution.

Abbreviations

DoG Difference of Gaussians
FOA Focus of Attention
GIN Global Inhibition Neuron
HVS Human Visual System
IOR Inhibition of Return
LIN Local Inhibition Neuron
WTA Winner-Take-All


Contents

1 Introduction
  1.1 Thesis motivation
  1.2 Aims of the project
  1.3 Outline

2 Biological Vision
  2.1 The anatomy of the human eye
  2.2 Different classes of the Ganglion cells
  2.3 The visual pathways
  2.4 Visual attention

3 The development of visual attention models
  3.1 Computation of early visual features
  3.2 Competition between locations to yield a single point
  3.3 Attentional selection
  3.4 Attention and recognition

4 The Model
  4.1 Description of the model
  4.2 Filterbanks
    4.2.1 Gabor filters
  4.3 The center-surround method
  4.4 Normalization techniques
  4.5 The saliency map
  4.6 The Winner-Take-All network
  4.7 The top-down method

5 Determining a variable FOA region
  5.1 Segmentation
  5.2 Dilation of the saliency map
  5.3 Number and size of the FOA regions
    5.3.1 An example of multiple FOA expansion
    5.3.2 A second example of the dilation process

6 Important variables
  6.1 Feature extraction
  6.2 Weighting of the different channels
  6.3 Normalization

7 Test results
  7.1 Tests on artificial images
  7.2 Tests on natural images
  7.3 The implementation

8 Discussion
  8.1 Video sequences
    8.1.1 The implementation of the dilation process
  8.2 Colour opponency

List of Figures

2.1 Downsampled version of the "Lenna" image
2.2 A hypothetical circuit of a small part of the retina
2.3 Overview of the visual pathway from the eyes to the visual cortex. From P. Kaiser at www.yorku.ca/eye/brain1.htm
4.1 Schematic overview of the model
4.2 Downsampling of a Gaussian function
4.3 Orientation filterbank of the "Lenna" image
4.4 Gabor filter kernel with Gaussian attenuation
4.5 Original image of an octagon with corresponding frequency components
4.6 The resulting octagon filtered with Itti's method
4.7 The resulting octagon filtered with the Gabor method
4.8 DoG kernel
4.9 Grey-scaled image of a tank and its corresponding histogram
4.10 The saliency map for the image and the corresponding histogram
4.11 Schematic overview of the WTA network
4.12 Local Inhibition Neuron and illustration of the IOR feature
5.1 An example of the dilation process
5.2 Grey-scaled image of a frog and the corresponding saliency map
5.3 The FOA image before and after one dilation
5.4 To the left, the FOA image after two dilations, and, to the right, the final FOA image after 15 dilations
5.5 Grey-scaled image of a plane and the output image containing 3 FOA:s after 2 iterations
5.6 The FOA image after 3 iterations
6.1 The conspicuity map for the intensity channel, normalized 1 time (left) and 5 times (right). The original image is the picture of "Lenna".
7.1 Test image 1 with corresponding FOA region
7.2 Test image 2 with corresponding FOA region
7.3 Test image 3 with corresponding FOA region
7.4 Test image 4 with corresponding FOA region
7.5 The well-known Lenna image and the generated FOA image
7.6 An outdoor image and the corresponding FOA image
7.7 An image of a tank and the corresponding FOA image
7.8 The original image of Barbara and the resulting compressed image
7.9 A snapshot of the graphical user interface


Chapter 1

Introduction

1.1 Thesis motivation

When we study an image, there are certain areas that automatically draw one's attention. These regions pop out effortlessly from the surroundings, and will be attended by our eyes while the brain is still unaware of this. Different areas will have different saliency due to, for example, frequency and size. This means that areas will be visited sequentially by the visual focus of the eyes, starting with the most salient area. Along with this, the brain connects these areas and tries to recognize and classify the image content.

The sensor cells in the retina of the eye are highly concentrated towards the centre, and therefore the sensitivity decreases rapidly with the distance from the focus. This means that an image coder which uses this feature could use a variable compression technique. Regions of the image which are far away from the focus of attention could be compressed heavily, with low quality. A viewer would, due to the low sensitivity, not notice any difference in image quality. Such image coding schemes have been proposed in [4, 11]. However, they are all dependent on measurements of the gaze direction of a viewer, which also limits the number of viewers of the image. Because there are individual variations in gaze behaviour, one would want to have optional focus of attention (FOA) regions and also allow these to have varying shapes and sizes.

The image coder developed by Peter Bergström [1] provides the possibility to have several FOA:s and is not dependent on measurements of gaze direction. However, this puts great demands on the pre-processing stage, i.e. the estimator of the FOA. If one could construct a model of the human visual system (HVS) and estimate where the focus of attention is, then hopefully these requirements would be possible to fulfil.


1.2 Aims of the project

This report describes the investigation of a method to estimate visual focus in an image. The final goal of this thesis is to implement a pre-processing stage of the image coder presented in [1].

The implementation consists of two parts. The first one is based on the model proposed by L. Itti [6, 7]. The model can be seen as a realization of the early pre-processing stage corresponding to the retina of the eye and the first parts of the visual cortex. An implementation of the model already exists, done by Itti, and is used with some adjustments. The output from this part consists of the most salient points in the input image.

The new part of the estimator is the extension of finding the FOA region surrounding these points, and also allowing these regions to expand.

1.3 Outline

• Chapter 2 describes the human visual system, especially the parts that are essential for the model.

• Chapter 3 describes the development of visual attention research throughout history.

• Chapter 4 presents Itti's model more deeply and draws parallels to the earlier discussion. Some adjustments of the model are also presented.

• The extension of the model to enable a variable FOA region is introduced in chapter 5.

• In chapter 6, important variables of the implementation are listed and discussed.

• The two last chapters consist of tests on different images and a following discussion.


Chapter 2

Biological Vision

The human eyes are constantly bombarded with input signals from the surroundings. If one looks at only a 32 × 32 pixel, grey-scaled (256 levels) image of the well known Lenna (figure 2.1), it is hard to even distinguish her face. However, despite the poor visual quality, the number of different images that can be generated out of these pixels is enormous. The amount of data is 32 × 32 × 8 = 8192 bits and thus, the possible number of images is 2^8192 ≈ 10^2466. This number exceeds by far even the total number of atoms in the whole universe, at least the one that we know of.


Figure 2.1. Downsampled version of the “Lenna” image

Of course, the number of images which the human eye can separate is less than this, but it gives us an understanding of the importance of data reduction. Luckily the eye performs a massive downsampling, which means that the human visual system just processes a fraction of the input data. Hence, this small amount of data should represent all the needed input, so that the brain can actually learn and understand the visual scene. This puts great demands on the early pre-processing stage performed already in the retina of the eye. The solution, acquired through evolution, is to provide animals with many powerful sensors, capable of providing detailed information in different modalities, and to discard all the other information. This is usually called "selective attention".

The content of this chapter is mostly based on information from the book written by G.H. Granlund and H. Knutsson [5], but also from the PhD thesis written by L. Itti [6].

2.1 The anatomy of the human eye

The retina of the eye consists of two major components: the cones and the rods. They are receptor neurons, used for daylight (colour) and night vision respectively. The cones are concentrated in a region near the optical axis of the eye called the fovea; thus the sensitivity of the receptors is compressed towards the centre of the image. Furthermore, the retina can be divided into three layers: one with the receptors, one consisting of connection neurons and one with the so called Ganglion cells, whose axons form the optic nerve (figure 2.2).

The receptive field of a neuron consists of the receptive area, in this case a part of the retina, whose input signals influence the output of the neuron. These fields could be seen as circular regions, where an on-centre response occurs when the stimulus is turned on within a well-defined circular disc. If the stimulus is moved a small distance away from this active region the response will be suppressed. A hypothetical circuit diagram for the generation of the receptive fields of a Ganglion cell is illustrated in figure 2.2. The first measurements of responses from these cells started around 1950, and it was found that by projecting a small spot of light onto the retina, it is possible to identify the group of Ganglion cells, or the single Ganglion cell, that is influenced.

The Ganglion cells receive input from a number of connecting neurons called bipolar cells. Both of these neuron types have center-surround receptive fields as described above. Hence, it is believed that the very small local region of the retina, which is occupied by a bipolar cell, corresponds to the centre of a receptive field.

Another type of connecting neurons, called the horizontal cells, synapse to a larger number of receptors and send their output to a single bipolar cell. Therefore they cover a larger region of the retina and can be seen to represent the surround of the receptive field.

Finally we have a third type of connecting neurons, the amacrine cells. The function of these cells is not fully understood; however, they are believed to generate some of the differences between Ganglion cell responses, which will be described below.


Figure 2.2. A hypothetical circuit of a small part of the retina

2.2 Different classes of the Ganglion cells

The responses from Ganglion cells differ in a number of characteristics, most importantly in speed and type of response signal. Most researchers believe that the basic analysis of the visual scene is mainly done by the two major types of Ganglion cells: the X and the Y cells. The X cells are both physically smaller and have a smaller receptive field than the Y cells. On the other hand, the Y cells are significantly fewer in number and are, like the rods, mostly distributed away from the fovea. However, the most important difference between these types is found when the stimulus of light is switched or quickly rotated. In this case the Y cell responds strongly, while the X cell shows no noticeable response. This indicates that there are two different subsystems of the visual system. One is aimed at detection and analysis of movements (Y) and the other one is specialized for detailed, high-resolution analysis of stationary patterns and for colour vision (X).

2.3 The visual pathways

Two optic nerves, consisting of axons from the Ganglion cells, lead from the eyes. They come together at the optic chiasm, where they are combined and divided so that the right half of the visual scene is separated from the left half (figure 2.3). This means that the two visual fields are analysed in different hemispheres of the brain. The two optic tracts reach each side of the thalamus, named the lateral geniculate nucleus. This region is divided into six layers, and each one of the layers receives its input from the optic tract. It has been found that the cells in the four upper layers only receive signals from the X cells, while the Y cells only provide signals to the two lowest layers. This indicates that the X signals, containing information of colour and form, are separated from the Y signals, which contain information about motion.

Big fibres of axons from the cells in the different layers lead from the thalamus and reach the visual cortex in the back of the brain. The separation of colour and motion is in fact sustained in the cortex, indicating that there are different pathways for these features throughout the entire visual system. Due to the different functions in the visual cortex, it is divided and named differently.

The first part, named V1 or primary visual cortex, is assumed to contain over 100 million neurons in each hemisphere. These cells have more complex receptive fields than the Ganglion cells. One class of these cells, the simple cells, responds only to a specific orientation, in addition to its position sensitivity. The complex cells, which are another class of these cortical cells, have larger receptive fields and respond not only to a certain orientation, but also to motion in that direction. However, these cells do not depend, as much as the simple cells, on the location of the stimulus.

Figure 2.3. Overview of the visual pathway from the eyes to the visual cortex.

From P. Kaiser at www.yorku.ca/eye/brain1.htm

Furthermore there are simple and complex cells which also respond to particular features such as length, width, angles etc. These cells are referred to as hypercomplex cells, and their complexity grows as one reaches deeper layers in V1.


On the one hand, neurons that are sensitive to similar orientations are stacked on top of each other; on the other hand, there is a continuous change in orientation within the same layer. Therefore, these layers are presumed to be the first visual maps of the HVS. Different features with different sizes and modalities correspond to at least one neuron in these layers. Hence, the visual cortex can be seen as corresponding to the feature maps in the model, which will be described in chapter 4.

2.4 Visual attention

William Jones suggested in the 1960s that selecting attention to salient objects is done using both an image-based and a task-dependent method. The first one, the bottom-up method, probably uses center-surround mechanisms to find salient objects in a visual scene. These mechanisms are enabled by the receptive fields of, for example, the Ganglion cells mentioned earlier. This is done in a rather feed-forward manner, and one can "follow the image" from the retina of the eye to the visual cortex, where the dependency of location is maintained. The speed of this pre-attentive computation is in the order of 25 to 50 ms per object. This is also called covert attention, meaning the eyes are not moved to a specific pre-determined location in the image.

In addition to this there is an overt attention (top-down method), which has a variable selection criterion and depends on the task at hand (for example, "look for the red car in the image"). This is most probably controlled from higher areas of the brain, like the frontal lobes, with connections from the visual cortex and the earlier visual areas. Because more or less consciousness is used to find attention points, this mechanism is much slower (about 200 ms or more). Although both methods can operate in parallel, it is more efficient to have some sort of serial strategy, because of the massive sensory input (10^7 to 10^8 bits per second at the optic nerve).

This allows one to break down the problem of processing a visual scene into a rapid series of less demanding "computations" and analyses. The top-down method then binds these visual features into a unitary percept. This enhances the cortical representation of objects at the attended locations. Hence, focal visual attention can be compared to a spotlight, which successively finds and highlights different actors as they enter the stage. The bottom-up method corresponds to the finding of the actors, and the top-down method to the highlighting.

Whether a given part of the scene yields a strong or poor response is thought to depend very much on the “context”, i.e. on what stimuli are present in other parts of the visual field. This motivates in some way, the center-surround method used in the model, which will be further described in chapter 4.


Chapter 3

The development of visual attention models

This chapter describes the development and differences of existing visual attention models. The information is mostly acquired from C. Koch [7] and L. Itti [6].

The basis of most computational models is the feature integration theory, proposed by Treisman and colleagues in the early 1980s. It was derived from experimental results and showed, for example, that selecting attention to salient objects in a visual scene is done using fairly simple visual features. Attention is then necessary to bind these features into a more sophisticated object representation. With this theory, a first attempt was made to divide visual attention into two distinct methods, one bottom-up, image-based part and one top-down, task dependent part. There are five essential parts of any model of bottom-up (image-based) attention:

1. Pre-attentive computation of early visual features
2. Competition to yield a single point/region
3. Generation of attention scan paths
4. Interaction between overt and covert attention (i.e. eye movements)
5. Interplay between attention and scene understanding

The biggest difference between models probably lies in the first and the second part.

3.1 Computation of early visual features

In biological vision, the first processing stage is represented by computation of visual features in, mainly, the retina of the eye and the early visual cortex. Neurons at the earliest stage are tuned to simple features such as intensity, orientation, motion and color opponency. Neuronal tuning becomes increasingly more specialized in the high-level visual areas. These neurons respond only to, for example, a certain pattern. These computations are done in a massively parallel manner across the entire visual scene.

Computer implementations of early visual processes are often motivated by an imitation of biological properties. For example, the responses of orientation-selective neurons are usually obtained through convolution by Gabor filters, which simulates biological impulse response functions.

Another interesting approach consists of implementing detectors which respond best to certain features. For instance Zetzsche [2] showed, using an eye-tracking device, how the eyes preferentially fixate regions with multiple superimposed orientations such as corners and line crossovers. He then derived and implemented non-linear operators that specifically detect those regions.

However, irrespective of the method used for early feature detection, there are several fundamental computational principles which have emerged from both experimental and modelling studies. First of all, different features contribute with different strengths to saliency. This relative feature weighting can also be influenced by top-down attention and training. Secondly, there is little evidence for strong interactions across different visual features, such as orientation and motion. However, within a given feature dimension, strong local interactions between filters sensitive to different properties of that feature (for example between different orientations) have been characterized, both in physiology and in psychophysics. These two principles motivate the use of the normalization technique and channel weighting, both described further in chapter 4.

Last and most importantly, what seems to matter in guiding bottom-up attention is feature contrast rather than local absolute feature strength. This means that a neuron's response is strongly influenced by context, in a manner that even extends far beyond the classical receptive field (RF). This motivates the use of a non-classical center-surround method, which will also be discussed further in chapter 4.

3.2 Competition between locations to yield a single point

The question which arises next is how to distinguish a single attentional focus, based on the large amount of feature maps. To solve this problem, most models of bottom-up attention follow the computational architecture proposed by Koch and Ullman in 1985. It is centered around a “saliency map”, or “master map”, which is an explicit, two-dimensional and topographical map that encodes saliency at every location in the visual scene. The map receives input from the early visual processing, and provides an efficient control strategy in which the focus of attention simply scans the saliency map in order of decreasing saliency.


With this architecture, selecting the focus of attention is reduced to finding the highest activity in the saliency map. However, there must be some amount of spatial competition during the pre-attentive feature detection. Otherwise, the saliency map would end up with a representation that is as complex and difficult to interpret as the original image. Therefore, a location is defined as salient if it wins the spatial competition in one or more feature dimensions, at one or more spatial scales.

As mentioned before, many successful models of the bottom-up attention are built around a saliency map. What differentiates them are the methods to reduce the incoming sensory input and to determine saliency from the map.

Wolfe suggested in 1994 [10] that saliency is computed as the likelihood that a target will be present at a given location. This is based on both bottom-up feature contrast and top-down feature weight, and has recently received experimental support from studies of attention based on a given search task.

Tsotsos et al. used a combination of a feed-forward bottom-up and a feedback selective tuning method [3]. The most salient location, determined from feed-forward feature extraction, is propagated back through the hierarchy through the activation of winner-takes-all networks. Thus, spatial competition is refined at each level of processing, because the feed-forward paths that are not contributing to the winning location are reduced.

Milanese had another approach to the feedback (top-down) propagation method. He first defined an energy measure consisting of four terms and then used a relaxation process to optimize it. The four terms were constructed to enforce spatial competition both between locations and between feature maps. Although the biological plausibility of this process remains to be tested, it is a rare example of a model that can be applied to natural colour images.

Laurent Itti et al. [8] proposed in 1998 a purely bottom-up model, in which spatial competition for saliency is directly modelled after non-classical surround modulation effects. It uses an iterative filtering scheme, where at each iteration a feature map receives input from the convolution between itself and a large DoG filter (Difference of Gaussians). After competition, all feature maps are simply summed up to yield the scalar saliency map. This model has also been widely applied to the analysis of natural colour scenes with good performance as a result.

It is important to note that centralized control based on a saliency map is not the only computational alternative for bottom-up attention. Desimone and Duncan argued in 1995 that attention is not explicitly represented by specific neurons and by a saliency map. Instead the attentional selection is performed on the basis of top-down enhancement of the feature maps relevant to the target, and extinction of those that are distracting. This is done without an explicit computation of saliency. At least one model successfully applied this strategy to synthetic stimuli. However, such top-down biasing requires that a specific search task is performed to yield useful predictions.


3.3 Attentional selection

A plausible neural architecture to detect the most salient location is the use of a winner-takes-all network, which simulates a neurally distributed maximum detector. To prevent permanently focusing on the most salient location, an inhibition strategy is used. This means that an already attended location is suppressed, which enables the network to attend to the second most salient location.

This inhibition feature has been widely observed in human psychophysics as a phenomenon called "inhibition of return" (IOR). Results from experiments indicate that visual processing at recently attended locations might be slower. The simplest implementation of IOR consists of triggering an inhibitory signal in the saliency map at the current location. However, this only represents a rough approximation of biological IOR, which has been shown to be object-bound. Therefore a better implementation should be able to track and follow moving objects, and compensate for a moving observer as well.

Although simple in principle, IOR is computationally a very important component of attention. It allows us to rapidly shift the attentional focus over different locations with decreasing saliency; otherwise we would be locked to attending to the same location at all times.

3.4 Attention and recognition

It is obvious that a complete model of attentional control must include top-down influences too. The computational challenge lies in the integration between bottom-up and top-down cues, and the interplay between attentional orientation and scene or object recognition.

One of the earliest models that combines these two parts was MORSEL, in which attentional selection was shown to be necessary for object recognition. This model is applied to the recognition of words processed through a recognition hierarchy. The addition of a top-down attentional selection process allowed the model to recognize the words, one at a time.

Schill's top-down process is built as a hierarchical knowledge tree. Its root represents all scene and object classes and, as one travels along the branches and reaches the leaves, more specialized object classes are represented. There are also links between the leaves, which are used as a discrimination function between the possible objects. During the iterative recognition of an object, the system directs its next fixation towards the location that will maximize the gain of information about the object.

Contrary to these combined models, Stark suggested in 1996 a theory in which the control of eye movements is almost exclusively under top-down control. The theory proposes that what we see is only remotely related to the activation in our retinas. Instead, it is what we expect to see that is the basis of our perception. This controls the sequence of eye movements, which in turn analyses the scene. Stark's "scanpath theory" has had several successful applications to robotics control, in which the incoming video sequence is reduced into a small number of regions important for a given task.

One important challenge for combined models of attention and recognition is finding suitable neuronal correspondences for the various components. The combined models mentioned above are all in some way biologically inspired, but they do not relate in detail to biological object recognition.


Chapter 4

The Model

4.1 Description of the model

Like many other image processing models, this one is based on the human visual system (HVS). For example, it tries to mimic the pre-processing of the input image in the retina, where important features like edges, motion etc. are extracted. However, this model differs somewhat from other classic computer vision models. One difference is that it uses an iterative normalization method to enhance isolated salient regions in the different channels. The result is that in images with only one salient area, this area is enhanced compared to the rest of the image. On the contrary, in images with several similar areas the normalization will not favour one over another.

The first part of the model (figure 4.1) consists of extraction of the relevant features from the input image. The different orientations are computed by filtering the image with Gabor filters. These have the advantage of a shape closely similar to the receptive fields of the simple cells in the primary visual cortex. As mentioned before, these cells respond to stimuli of different orientations and sizes, which is accomplished by using the Gabor filters as a sort of filterbank.

The intensity channel is created by simple lowpass filters, resulting in images where areas with a different intensity than their surroundings are promoted. If they do not have well-defined borders or shapes, these regions may have been neglected by the Gabor filters. For detecting motion in an image sequence, a flicker channel is used. It is created simply by lowpass filtering the differential images between three frames in the sequence.
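As a concrete illustration of the flicker channel described above, the following Python sketch lowpass filters the absolute frame differences of three consecutive frames. The function name and the filter width are assumptions chosen for illustration, not details taken from the thesis implementation.

import numpy as np
from scipy.ndimage import gaussian_filter

def flicker_channel(frame_prev, frame_curr, frame_next, sigma=2.0):
    # Differential images between three consecutive frames...
    d1 = np.abs(frame_curr.astype(float) - frame_prev.astype(float))
    d2 = np.abs(frame_next.astype(float) - frame_curr.astype(float))
    # ...lowpass filtered into one flicker (motion) map.
    return gaussian_filter(d1 + d2, sigma)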

Figure 4.1. Schematic overview of the model: input image, feature extraction (orientation, intensity, motion) with different filters for each feature, downsampling into pyramids, competition between locations and scales, conspicuity maps, saliency map, WTA network, expansion of the FOA, sequence of FOA:s, and the output image of the FOA.


The output from each channel is an image pyramid, containing a number of images at different scales. The next step is to let these scaled images compete against each other, resulting in one conspicuity map. For example, if motion in an image sequence is detected only at one scale, then the conspicuity map from the flicker pyramid should contain only responses from this level.

The comparison between the levels in every channel is maintained by the center-surround method. It operates between every scale in the pyramid. The different conspicuity maps are then normalized by the normalization technique mentioned earlier. By doing so, the saliency map is created simply by adding each channel’s conspicuity maps. This also enables one extra feature, namely the ability to weight the different channels.

Finally, the model determines the FOA. This is done by a neural network which uses the Winner-Take-All principle. It also has the extra feature of inhibiting an already fired neuron, which makes it possible to easily find a sequence of FOA:s.

4.2 Filterbanks

Filterbanks are used to divide a signal's frequency spectrum into sub-bands. This is often done in a logarithmic way, which means that the lowpass or highpass band is repeatedly divided. In every iteration the sub-bands are downsampled to represent a "higher" scale.

This can be seen in figure 4.2, where a Gaussian pulse e^{-t^2/200} is downsampled by a factor of two. To the left, one can see that the original signal (the solid curve) becomes, not surprisingly, narrower for each downsampling. The right figure shows the same modulation in the frequency domain, i.e. the spectrum becomes more and more high-frequency.

Figure 4.2. Downsampling of a Gaussian function


Many audio- and image-coding methods are based on filterbanks, where the different sub-bands are treated differently, but with the overall requirement that the signal must be (almost) fully reconstructed. In our case we do not have this constraint, so each sub-band can be filtered and modulated "without limitations".

A schematic overview of the filterbank is seen in figure 4.3. It starts with a highpass filtering of the input image (i.e. subtraction between the image and its lowpass part). Then, the high-frequency image is filtered with a feature specific filter, for example a Gabor filter in the orientation channel, which detects edges and lines in the present frequency interval. The following step is to downsample the low-frequency image and apply the same filters (highpass and Gabor) on this scale. The overall result is that edges and lines are detected in several frequency bands (at different scales of the image). Figure 4.3 shows an example where the Gabor filters are tuned to structures with an angle of 45° against the horizontal level.

What happens to the input image at each iteration? For example, imagine that the Gaussian in figure 4.2 is the cross section of a line in an image. At the first level the edge is a broad one and therefore quite low-frequency. If our Gabor filter is centered around 100 Hz, then this edge will be neglected. The downsampling between the levels smears the low-frequency image over the frequency domain. After several downsamplings of the Gaussian, the edge will have frequencies in the region of 100 Hz, and therefore this level leaves a response from the Gabor filtering.
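A minimal Python sketch of the filterbank scheme in figure 4.3 is given below. The function name, the choice of a Gaussian lowpass and the number of levels are assumptions for illustration; the Gabor kernel is simply passed in as a parameter.

import numpy as np
from scipy.ndimage import gaussian_filter, convolve

def orientation_pyramid(image, gabor_kernel, levels=4):
    # At each level: split into lowpass and highpass parts, filter the
    # highpass part with the feature specific (Gabor) kernel, then
    # downsample the lowpass part by two before the next iteration.
    responses = []
    current = image.astype(float)
    for _ in range(levels):
        lowpass = gaussian_filter(current, sigma=1.0)
        highpass = current - lowpass
        responses.append(convolve(highpass, gabor_kernel))
        current = lowpass[::2, ::2]
    return responses

Each entry of the returned list contains the edge and line responses for one frequency band, i.e. one level of the pyramid.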

The intensity filterbank differs only in the sense that it uses simple lowpass filters instead of Gabor filters. The feature maps in this channel are concerned with intensity contrast, which in mammals is detected by neurons sensitive either to dark centers on bright surroundings, or to bright centers on dark surroundings. Both cases are computed using the center-surround method described later in this chapter.

4.2.1 Gabor filters

The Gabor filters which are used in the implementation are principally simple sine/cosine filter kernels. However, it is preferable to use a Gaussian attenuation along with the sine/cosine structure, since the resulting kernels are more similar to the receptive fields of, for example, the Ganglion cells. Because these structures are not Cartesian or spherically separable, the computational effort is slightly higher, but that is largely compensated by the more biologically plausible filter structure. The sine (edge-detecting) and the cosine (line-detecting) filters are defined as:

s(ξ_1, ξ_2) = sin(k(ξ_1 cos θ + ξ_2 sin θ)) · e^{-(ξ_1^2/σ_1^2 + ξ_2^2/σ_2^2)}   (4.1)

c(ξ_1, ξ_2) = cos(k(ξ_1 cos θ + ξ_2 sin θ)) · e^{-(ξ_1^2/σ_1^2 + ξ_2^2/σ_2^2)}   (4.2)

where θ is the direction of the filters, k is a frequency variable and σ_i is the standard deviation of the Gaussian attenuation along ξ_i.

Figure 4.3. Orientation filterbank of the "Lenna" image: at each level the image is split into a highpass (HP) and a lowpass (LP) part, the highpass part is filtered with a Gabor kernel, and the lowpass part is downsampled by two before the next level.

The magnitude of the orientation response is then computed as

g(x, y) = sqrt( (f(x, y) ∗ s(ξ_1, ξ_2))^2 + (f(x, y) ∗ c(ξ_1, ξ_2))^2 )   (4.3)

where f(x, y) is the input image and ∗ is simple 2-D convolution.
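The following Python sketch builds the two kernels of equations (4.1) and (4.2) and combines them according to equation (4.3). The parameter values and the isotropic attenuation (σ_1 = σ_2 = σ) are assumptions chosen for illustration.

import numpy as np
from scipy.ndimage import convolve

def gabor_energy(image, k=0.5, theta=np.pi / 4, sigma=3.0, size=15):
    r = size // 2
    xi1, xi2 = np.meshgrid(np.arange(-r, r + 1), np.arange(-r, r + 1))
    phase = k * (xi1 * np.cos(theta) + xi2 * np.sin(theta))
    envelope = np.exp(-(xi1**2 + xi2**2) / sigma**2)   # Gaussian attenuation
    s = np.sin(phase) * envelope                       # edge-detecting kernel (4.1)
    c = np.cos(phase) * envelope                       # line-detecting kernel (4.2)
    # Orientation energy, equation (4.3): combine sine and cosine responses.
    return np.sqrt(convolve(image, s)**2 + convolve(image, c)**2)

Calling gabor_energy with, for example, four values of theta (0°, 45°, 90°, 135°) gives the orientation maps used at each level of the filterbank.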


Figure 4.4. Gabor filter kernel with Gaussian attenuation

The Gabor filter implementation proposed by Itti is slightly different. Instead of using the filters above he multiplies the input image with a sine/cosine mask and lowpass filters the resulting image:

g(x, y) = sqrt( (f_c(x, y) ∗ e(ξ_1, ξ_2))^2 + (f_s(x, y) ∗ e(ξ_1, ξ_2))^2 )

where

f_c(x, y) = f(x, y) · cos(x, y)   and   f_s(x, y) = f(x, y) · sin(x, y)

and with

e(ξ_1, ξ_2) = e^{-(ξ_1^2 + ξ_2^2)/σ^2}

Because of this, Itti's filters (almost) only detect lines and edges in the direction θ. Another drawback (for our application) is that they lowpass filter the input image more than the Gabor filters described earlier. This results in a more diffuse FOA, which will be discussed later. If one instead filters the image with the Gabor filter kernels, the result is a wider interval of directions. This means that a smaller number of directions covers more of the directions existing in the input image.


The difference between the two filters can be seen in the following images, where an octagon is filtered. Figure 4.5 shows the octagon both in the spatial and the frequency domain. The image contains edges in four different directions. In figure 4.6 one can see that the octagon filtered with Itti's method (almost) only contains edges in the horizontal direction. This can be compared to the resulting octagon in figure 4.7, which still contains all four directions, although they are weighted differently due to the Gaussian attenuation filter. Also, notice the difference in frequency content between the two methods. The Gabor filter kernels are more high-frequency, which in this case yields sharper and better defined borders of the octagon.

Figure 4.5. Original image of an octagon with corresponding frequency components

Figure 4.6. The resulting octagon filtered with Itti's method

Figure 4.7. The resulting octagon filtered with the Gabor method

4.3 The center-surround method

In each channel, the different image levels build up a pyramid, on which the center-surround method can now be used. This method makes it possible to detect salient objects of different sizes, by comparing the levels of the pyramid with each other. Basically, it sums up all the differential images between a given number of levels. In general, the resulting image is defined as:

consp(x, y) = Σ_{i=Min}^{Max} Σ_{j=i+δ_min}^{i+δ_max} d_{i,j}(x, y)

where Min and Max are the minimum and maximum centre levels and δ_min, δ_max define the levels they will be compared to. For example, if Min = δ_min = 1 and Max = δ_max = 3 then

consp(x, y) = d_{1,2}(x, y) + d_{1,3}(x, y) + d_{1,4}(x, y) + d_{2,3}(x, y) + d_{2,4}(x, y) + d_{2,5}(x, y) + d_{3,4}(x, y) + d_{3,5}(x, y) + d_{3,6}(x, y)   (4.4)

Because of this scheme, the only non-zero pixels in the resulting image are the ones where the image differs between the levels. A simpler description is: one level is added to the result only if it has new information about the input image. The resulting image is the so called conspicuity map, one for each pyramid, which shows the salient regions for that specific feature.
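A minimal Python sketch of equation (4.4) is shown below. It assumes that the pyramid levels have already been interpolated to a common size and that d_{i,j} is the absolute difference between levels i and j; both are simplifying assumptions rather than details taken from the thesis.

import numpy as np

def center_surround(pyramid, min_c=1, max_c=3, d_min=1, d_max=3):
    # pyramid: list of same-sized feature maps, index 0 = finest scale.
    consp = np.zeros_like(pyramid[0], dtype=float)
    for i in range(min_c, max_c + 1):                 # centre levels
        for j in range(i + d_min, i + d_max + 1):     # surround levels
            consp += np.abs(pyramid[i] - pyramid[j])  # d_{i,j}(x, y)
    return consp

With the example values Min = 1, Max = 3 and δ_min = 1, δ_max = 3, the loop produces exactly the nine terms of equation (4.4), so the pyramid needs at least seven levels.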

The modelling hypothesis is based on one unique saliency map, modelled as a neuronal network. At each spatial location, activity from the large number of feature maps (45 in the implementation) needs to be combined into a unique scalar measure of saliency. Noise is easily propagated through the hierarchy, and hence the system is faced with a severe signal-to-noise ratio problem. Also, if the different conspicuity maps are combined directly at this stage, there is a high possibility that none of the salient points or regions in these maps will be salient enough. However, the biggest problem is that features of different properties have different stimulus dimensions, and hence are not directly comparable. For example, how does one compare a 45° oriented line with a 5 % intensity difference?

To solve these problems, a normalization technique is used, which consists of repeated filtering of the feature maps with DoG (Difference of Gaussians) kernels.

4.4 Normalization techniques

What should the features of the normalization be? First of all, it should normalize all the different feature maps into the same dynamic range [0, 10]. This is needed because of the different amplitudes from the various feature extraction filters. However, this is not sufficient to achieve direct comparison between different features. In fact, it is the competition within each feature that maps them into the same stimulus dimensions. The motivation for this is the hypothesis that similar features, for example 45° orientation and 90° orientation, compete strongly for saliency, while different features, for example orientation and intensity, contribute independently to the saliency map. Therefore, the comparison between two different feature maps is reduced to comparing the normalized amplitudes of the maps' maximum values.

Secondly, the normalization should also promote peaks above their surroundings, and suppress noise and small peaks. One solution could be a simple threshold technique, where only peaks higher than the threshold value are passed. However, the saliency map should not consist of too many areas, otherwise the neural network cannot distinguish a limited number of FOA:s.

Another solution could be the more sophisticated max-normalization method. It is based on a multiplying factor, applied to every pixel, which is determined by the relation between the maximum value and the mean value of the image. The resulting, normalized image would in this case be:

f_N(x, y) = f(x, y) · ( Max{f(x, y)} − f̄(x, y) )^2

where f̄(x, y) is the mean value of the image.

If the image contains one peak, the multiplying factor will be large (the maximum value in the image will be much larger than the mean value). In the case of several peaks, the factor becomes smaller due to the increase of the mean value.
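A direct Python transcription of the max-normalization formula above (the function name is a hypothetical helper):

import numpy as np

def max_normalize(feature_map):
    # Multiply every pixel by the squared difference between the map's
    # maximum value and its mean value.
    f = feature_map.astype(float)
    return f * (f.max() - f.mean()) ** 2

A map with one isolated peak gets a large factor, while a map with many peaks (and hence a higher mean) is suppressed, exactly the behaviour discussed in the surrounding text.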

So far so good, but what happens to an image with, for example, two strong peaks and a high mean value? In this case the max-normalization will suppress these peaks, while a human probably would find both spots salient. The problem is that this normalization technique is based on global computations, such as finding the global maximum. First of all, this is not biologically plausible, because the neurons in the HVS are only locally connected. Operations over the entire visual field occur late in the visual cortex, where some early object recognition is made. Another drawback is that it is very sensitive to noise, especially in cases where the noise is stronger than the signal. The conclusion is that an ideal normalization method would be similar to the one mentioned above, but implemented locally.

This can be achieved by filtering the image with a DoG kernel, which is defined by the amplitude variables c_i and the standard deviations σ_i:

DoG(ξ_1, ξ_2) = g_1(ξ_1, ξ_2) − g_2(ξ_1, ξ_2) = (c_1^2 / (2πσ_1^2)) · e^{-(ξ_1^2+ξ_2^2)/(2σ_1^2)} − (c_2^2 / (2πσ_2^2)) · e^{-(ξ_1^2+ξ_2^2)/(2σ_2^2)}

Figure 4.8 shows a DoG kernel with σ_1 = 0.02 and σ_2 = 0.25 (viewed in 1-D).


Figure 4.8. DoG kernel

As the DoG sweeps over the feature image, it promotes isolated peaks and suppresses the surroundings to negative values. After rectification, the filtered image is added to the original image, resulting in the desired, locally normalized map. This normalization is also applied to each conspicuity map, which means that simply adding them together yields the desired saliency map:

S_map(ξ_1, ξ_2) = Σ_k C_k · consp_k(ξ_1, ξ_2)

where k stands for the different features and C_k for the corresponding weight.
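The iterative DoG normalization can be sketched in Python as below. The number of iterations, the Gaussian widths and the weights are assumptions; the DoG filtering is implemented as the difference of two Gaussian-blurred copies of the map, the result is added back and negative values are clipped to zero.

import numpy as np
from scipy.ndimage import gaussian_filter

def dog_normalize(feature_map, iterations=5, sigma_ex=2.0, sigma_inh=25.0,
                  c_ex=0.5, c_inh=1.5):
    m = feature_map.astype(float)
    # Map into a common dynamic range, here [0, 10].
    m = 10.0 * (m - m.min()) / (m.max() - m.min() + 1e-12)
    for _ in range(iterations):
        excitation = c_ex * gaussian_filter(m, sigma_ex)    # narrow Gaussian
        inhibition = c_inh * gaussian_filter(m, sigma_inh)  # broad Gaussian
        m = np.maximum(m + excitation - inhibition, 0.0)    # add back and rectify
    return m

Repeated application promotes an isolated peak, while a map full of similar peaks stays roughly flat; this is the behaviour illustrated for the intensity channel in figure 6.1.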

A small experiment is shown in chapter 7, where the importance of this normalization technique is highlighted. The test also shows that competition between different levels and features is essential for a good result.


4.5 The saliency map

All the different feature maps, 45 in the implementation, have now been reduced to one single map, the saliency map. In contrast to the grey-scaled, or colour, input image, this topographic map has far fewer nonzero pixels. Obviously, this makes it much easier to sort out a dominating region amongst these points. However, it also means that this map should contain all the necessary information to determine the most interesting FOA:s in the image.


Figure 4.9. Grey-scaled image of a tank and its corresponding histogram


Figure 4.10. The saliency map for the image and the corresponding histogram

By definition, the maximum value of the saliency map should correspond to the most salient point in the input image, and hence the focus of attention should be directed to this point. However, one must use a more sophisticated method than just taking the maximum value.

First of all, after a winning location has been determined, this pixel should be set to zero. Otherwise, it would remain the maximum value, and the focus of attention would constantly be directed to this position. This is biologically prevented by the IOR feature in the human visual system, as described in chapter 3. A neural network could be used to implement this feature. The neurons have an extra input, a local inhibitory signal, which if activated suppresses the output signal.

Secondly, if one uses a neural network, a control mechanism must be introduced. It should clear all the neurons after a winning location is determined. This means that only one location can win at any given time, and hence, it is called a Winner-Take-All neural network.

In the following section, the network will be further described. It is important to make a distinction between the saliency map and the network. The saliency map is an image describing the most salient parts of the input image, with the pixel values as a direct rating of how salient a location is. The Winner-Take-All neural network is the tool or method used to find the most salient points, sequentially and in order of decreasing saliency.

4.6 The Winner-Take-All network

This neural network uses the WTA (Winner-Take-All) principle, which means that only one neuron wins the spatial competition and fires off a signal. Each pixel in the saliency map is assigned to a neuron, which activates if the pixel has a value above a certain threshold. However, the activation function (equation 4.5) also depends on a global inhibition neuron (GIN), which is activated when a winning location has been determined.

V_new = V_old + k_w · (S − V_old · G_gin) = (1 − k_w · G_gin) · V_old + k_w · S   (4.5)

The GIN is defined by its scalar variable G_gin and is normally zero, which makes V_new = V_old + k_w · S, where S is the input from the saliency map. However, if V reaches a positive threshold value, the neuron not only fires off a signal, but also sets G_gin to a large positive integer. This variable is mutual for all the neurons in the network.


Figure 4.11. Schematic overview of the WTA network

To implement the IOR feature, as discussed in chapter 3, one would also want a unique inhibition factor connected to each neuron in the network. This locally operating signal (LIN) should be activated when a neuron in the nearest neighbourhood has fired off a signal earlier. In the implementation, the IOR feature is actually applied to the saliency map, where every pixel value depends on the variable G_lin in the same way as G_gin influences equation (4.5).

S_new = S_old + k_s · (f − G_lin · S_old) = (1 − k_s · G_lin) · S_old + k_s · f   (4.6)

where f is the initial pixel value.

The computation of G_lin is based on how close the pixel is to an earlier winning location, as shown in figure 4.12. Pixels far away from this location are not affected, which makes it possible to easily find the second most salient point. To illustrate the neural network, including the IOR feature, figure 4.12 also shows the result when the WTA is applied to the saliency map in figure 4.10.

The left plot shows how the influence of the local inhibition neuron decreases with increasing distance from a winning location. In this case, only pixels within a radius of about 10 are affected. The right image shows the saliency map after a winning location has been detected by the WTA neural network. Notice how the region surrounding the winning pixel has decreased in value.


Figure 4.12. Local Inhibition Neuron and illustration of the IOR feature
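The following Python sketch ties equations (4.5) and (4.6) together into a simple WTA loop with inhibition of return. The constants (integration gain, threshold, inhibition radius) are assumptions, not the thesis values, and the global inhibition is modelled by simply resetting the neuron potentials after each winner.

import numpy as np

def wta_with_ior(saliency, n_foci=3, k_w=0.25, threshold=5.0,
                 inhibition_radius=10):
    s = saliency.astype(float).copy()
    ys, xs = np.mgrid[0:s.shape[0], 0:s.shape[1]]
    winners = []
    for _ in range(n_foci):
        if s.max() <= 0:
            break
        v = np.zeros_like(s)
        while v.max() < threshold:                 # integrate: V_new = V_old + k_w * S
            v += k_w * s
        win = np.unravel_index(np.argmax(v), v.shape)
        winners.append(win)
        dist = np.hypot(ys - win[0], xs - win[1])  # distance to the winner
        g_lin = np.clip(1.0 - dist / inhibition_radius, 0.0, 1.0)
        s *= 1.0 - g_lin                           # local inhibition of return

    return winners

Each returned coordinate is a winning pixel; the local suppression around it plays the role of the LIN in figure 4.12, so the next iteration finds the second most salient point.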

4.7 The top-down method

As mentioned in the previous chapter, this model is a purely bottom-up method, and hence the saliency map should contain all the necessary information, so that a top-down method is not required. However, the next chapter describes some further processing of the data, which yields a final image with one or several FOA regions. Although this is not a top-down method in its purest sense, it is based on information from lower levels in the hierarchy, namely from the saliency map.


Chapter 5

Determining a variable FOA region

The model in the previous chapter finds, in a neuronally plausible and feed-forward manner, the most salient points in an image. However, for controlling an image coder it would be better if the FOA were a region surrounding the salient object, or at least parts of it. The benefit lies in the improved image quality, with an increased bitrate as a drawback.

Furthermore, the image coder is of extra interest when it is applied to video sequences. In the case of films there is normally just one interesting region shown in every scene. Therefore, the estimator will find a FOA region surrounding that area. If a new interesting region arises, for example motion in another part of the visual scene, the estimator will shift and instead attend to that area. However, if the scene is stationary, i.e. the FOA is directed at the same location for a longer period of time, the viewers of the movie will start to investigate the rest of the image. Therefore, an expanding FOA region would be a great advantage.

A first approach to fulfilling these two requirements would be to simply choose a circle or ellipse centered around the salient point. This would have the advantage of an extremely simple shape, and the increase of the radius would be simple to implement. However, if the object is a line, the shape of the FOA would be far from optimal in terms of bitrate. If the FOA region increases in size, one would want it to increase only in the direction of the line.

Another, better approach is to use a classic top-down method, which tries to recognize the object with the use of neural networks. The area around the salient point is fed backwards through a hierarchy of neuronally tuned filters. Besides being computationally hard, this method requires a certain amount of information about the input area to be able to classify and recognize the object.

A third approach, which will be described below, is based on the saliency map, and hence, in contrast to the first approach, only forms the FOA region according to the most salient parts of the image.


5.1 Segmentation

The first thing to do is to segment out a starting FOA from the saliency map. It should only cover the winning pixel and its nearest surrounding neighbourhood. The segmentation technique used in this implementation is basically a standard one, including a threshold value which is proportional to the winning pixel’s value. The algorithm scans the image for pixels that both have a value above the threshold, and of course are connected to the winner.
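A small Python sketch of this segmentation step is given below. The relative threshold factor and the function name are assumptions; the thesis only states that the threshold is proportional to the winning pixel's value.

import numpy as np
from scipy.ndimage import label

def segment_starting_foa(saliency, winner, rel_threshold=0.75):
    # Keep pixels above a fraction of the winning pixel's value...
    mask = saliency >= rel_threshold * saliency[winner]
    # ...and of these, only the connected component containing the winner.
    labels, _ = label(mask)
    return labels == labels[winner]

The returned boolean image is the starting FOA: the winning pixel and its nearest connected neighbourhood above the threshold.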

5.2 Dilation of the saliency map

The next thing to do is to dilate (expand) the FOA region iteratively, but only within the extent of the original saliency image. Dilation means that if a non-zero pixel in the image has zero-valued neighbours, then these pixels duplicate that value. A pixel's neighbourhood is defined by the structure element. For example, connectivity 4 is defined as:

0 0 0 0 0
0 0 1 0 0
0 1 1 1 0
0 0 1 0 0
0 0 0 0 0

In other words, the regions in the saliency map expand one pixel at a time. However, as mentioned above, they are only allowed to expand in directions where the original saliency map has non-zero pixels. Figure 5.1 illustrates the described process. Black pixels belong to the object and white pixels belong to the surroundings.

Figure 5.1. Illustration of the dilation process (panels: saliency map, original FOA, after one dilation, adjustment by the saliency map, after two dilations, adjustment by the saliency map)



After one dilation, the segmentation method is applied to the now expanded saliency map. Since the values of the winning pixel and its neighbours have been duplicated to their surroundings, the thresholding will now generate a larger FOA region.
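As a rough sketch of this constrained expansion, the binary FOA mask can be dilated with the connectivity-4 element while new pixels are accepted only where the original saliency map is non-zero. This is a binary approximation of the grey-level duplication and re-thresholding described above; the function and parameter names are hypothetical.

    import numpy as np
    from scipy import ndimage

    def expand_foa(foa_mask, saliency, n_dilations):
        cross = ndimage.generate_binary_structure(2, 1)   # the connectivity-4 element
        allowed = saliency > 0                            # support of the original map
        # scipy's mask argument restricts the dilation to the allowed pixels,
        # so the FOA only grows in directions where the saliency map is non-zero.
        return ndimage.binary_dilation(foa_mask, structure=cross,
                                       iterations=n_dilations, mask=allowed)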

5.3 Number and size of the FOA regions

This dilation process is not limited to only one FOA region. On the contrary, it can take several input pixels and expand the different FOA regions separately. The different winning pixels are simply generated by a number of iterations of the Winner-Take-All network, as described in the previous chapter. The following example shows the dilation process of two FOA regions.
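A minimal sketch of how several starting points could be obtained is given below. Instead of the full WTA network with inhibition of return described in chapter 4, the sketch simply picks the current maximum of the saliency map and suppresses a disc around it before picking the next one; the suppression radius is an illustrative assumption.

    import numpy as np

    def find_winners(saliency, n_regions, ior_radius=20):
        winners, work = [], saliency.copy()
        yy, xx = np.indices(saliency.shape)
        for _ in range(n_regions):
            w = np.unravel_index(np.argmax(work), work.shape)
            winners.append(w)
            # Crude inhibition of return: zero out a disc around the current winner.
            work[(yy - w[0])**2 + (xx - w[1])**2 <= ior_radius**2] = 0.0
        return winners

    # Each winner then seeds its own FOA, which is segmented and expanded as sketched above:
    # foas = [expand_foa(segment_foa(saliency, w), saliency, n_dilations) for w in winners]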

5.3.1 An example of multiple FOA expansion

Figure 5.2 shows an image of a frog and the corresponding saliency map. The latter is illustrated with the grey-scaled image on top of it and contains a number of different salient peaks. After being fed through the WTA network and the subsequent segmentation, the resulting image, containing the two most salient FOA:s, is shown to the left in figure 5.3. The left eye of the frog and also its colourful stomach are the two objects inside these regions. One dilation of these starting FOA:s and the following segmentation yields the FOA image shown to the right in figure 5.3.


Figure 5.2. Gray-scaled image of a frog and the corresponding saliency map

After two iterations (dilations), one can notice that the mouth of the frog is included in the FOA (figure 5.4). With this image, an interesting question arises: If one expanding FOA region reaches another, should this new region be shown completely or iteratively, one pixel at a time?

This is, first of all, a relatively subjective problem. Secondly, it should be noted that the new region is perhaps not as salient as other, unattended regions in the image. The problem should probably be solved by a top-down method, which mimics the higher levels of the visual cortex in the HVS. This is where the brain starts



Figure 5.3. The FOA image before and after one dilation

to connect the attended regions and to bind them into a unitary concept or object.

As mentioned before, such a complex (and little researched) method has not been investigated in this thesis. However, the basic principle can be used, i.e. adjacent regions should be combined into a unitary FOA region. This answers the question above: the frog's entire mouth should be shown as a combined region along with the eye, although only two iterations have occurred.
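A minimal sketch of such a merging rule, under the assumption that it operates on the binary FOA masks: the union of all expanded masks is re-labelled, so regions that have grown into contact are treated as one combined region from then on.

    import numpy as np
    from scipy import ndimage

    def merge_adjacent(foa_masks):
        # Union of all expanded FOA masks; touching regions end up sharing one label.
        union = np.any(np.stack(foa_masks), axis=0)
        labels, n = ndimage.label(union)
        return [labels == i for i in range(1, n + 1)]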


Figure 5.4. To the left, the FOA image after two dilations, and, to the right, the final FOA image after 15 dilations

5.3.2 A second example of the dilation process

The two major input variables of the estimator are the number of dilations and the number of FOA regions, but they cannot be seen as two uncorrelated variables. For example, if the number of FOA regions has been set to a certain number, and these areas are expanded iteratively, there is a possibility (depending on the saliency map) that some of them will be combined. The estimator will then find new FOA regions elsewhere in the image.



This is exemplified by the following images. Figure 5.5 shows the image which is fed to the estimator. The settings are initially 3 FOA regions and 2 iterations, and the system generates the FOA image shown to the right in figure 5.5.


Figure 5.5. Gray-scaled image of a plane and the output image containing 3 FOA:s after 2 iterations

After one further dilation the two parts of the aeroplane are combined and the estimator is forced to generate an extra FOA region.



Chapter 6

Important variables

This chapter lists and describes some of the important variables of the existing implementation (set in bold in the text). Most of them are tuned based on a subjective judgement of mainly the positions and forms of the FOA regions. To be able to manage a large variety of images, such as indoor and outdoor environments, faces and artificial images, the variables have been adjusted as a compromise.

6.1 Feature extraction

As mentioned before, the feature extraction has the double responsibility of heavily reducing the input data and of providing as much information as possible about the visual scene. These two factors contradict each other, and hence a clever compromise must be made.

Filter optimization can be done in several ways and is very specific to the desired application. The filters used in this implementation were designed more or less by trial and error. The Gaussian filter is defined by its peak amplitude and its standard deviation. The implementation uses these lowpass filters for subdividing the different frequency bands in the filterbank and for detecting intensity differences in the intensity channel.

Tests have been made with three Gaussian filters, which differ only in kernel size: 3, 5 or 9 pixels wide. The standard deviation of the different kernels is kept the same. A larger extent in the spatial domain gives a more strictly lowpass filter and thus, in the case of the wavelets, slightly different frequency bands. However, tests of all three cases on different kinds of images show no, or an insignificantly small, difference in the final generation of the FOA:s.
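For illustration, the sketch below constructs such kernels: three 2-D Gaussian kernels of width 3, 5 and 9 pixels with the same standard deviation. The sigma value is an illustrative assumption, and the kernels are normalized to unit sum.

    import numpy as np

    def gaussian_kernel(size, sigma=1.0):
        x = np.arange(size) - (size - 1) / 2.0
        g = np.exp(-x**2 / (2.0 * sigma**2))      # 1-D Gaussian profile
        k = np.outer(g, g)                        # separable 2-D kernel
        return k / k.sum()

    kernels = {size: gaussian_kernel(size) for size in (3, 5, 9)}
    # With the same sigma, a wider support mainly changes how much of the
    # Gaussian tail is truncated, i.e. how strictly lowpass the filter is.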

An interesting extra feature would be to weight the different sub-bands, either by adjusting them manually or by using a clever, criteria-based determination of the weights. This is hard to investigate, due to the severe difficulty of finding a reliable and biologically plausible criterion (or measure) of saliency in the different sub-bands.



The number of orientations to which the Gabor filters are sensitive does not influence the final result, as long as at least four directions are used. Fewer orientations heavily deteriorate the saliency map. The orientations used in the implementation are placed uniformly along the unit circle: 0, π/4, π/2 and 3π/4 rad.

The Gabor filters are defined partly by the Gaussian features, but mostly by the frequency of the sine/cosine structure. The selection of this frequency is directly linked to the subdivision of frequency bands in the filterbank method. As discussed earlier in chapter 4, the different choices of Gabor filters yield slightly different results. As can be seen in figure 4.6, the Gabor filters proposed by Itti are more concentrated on the orientation of the filter than the Gabor kernels are. This is, however, not the major factor when the estimator is applied to "real-life" images. In fact, there are surprisingly small differences between the saliency maps for the different Gabor filters. Nevertheless, Itti's filters generate slightly larger FOA regions, due to the fact that they are more lowpass in character than the Gabor kernels. This is a drawback from an image coding point of view, where as much as possible of the non-FOA areas should be compressed.
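The following sketch builds even (cosine) Gabor kernels at the four orientations used in the implementation. The envelope sigma, spatial frequency and kernel size are illustrative assumptions, not the thesis parameters.

    import numpy as np

    def gabor_kernel(size=9, sigma=2.0, freq=0.25, theta=0.0):
        half = (size - 1) / 2.0
        y, x = np.mgrid[-half:half + 1, -half:half + 1]
        xr = x * np.cos(theta) + y * np.sin(theta)             # rotated coordinate
        envelope = np.exp(-(x**2 + y**2) / (2.0 * sigma**2))   # Gaussian envelope
        carrier = np.cos(2.0 * np.pi * freq * xr)              # cosine structure
        k = envelope * carrier
        return k - k.mean()                                    # remove the DC component

    gabors = [gabor_kernel(theta=t) for t in (0.0, np.pi / 4, np.pi / 2, 3 * np.pi / 4)]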

6.2 Weighting of the different channels

Weighting of the different sub-bands in the wavelets has already been discussed, along with the difficulties that come with it. A more straightforward, though perhaps not more biologically inspired, method is to weight the different channels. Experiments have shown contradictory evidence regarding the brain's ability to handle this, especially when dealing with different kinds of images (for example, indoor and outdoor environments).

Tests with different weight constellations showed that, for a generally working estimator, it is recommended to weight the orientation channel more than the intensity channel. An explanation could be that, due to the larger number of feature maps (36 orientation maps against 9 intensity maps), a salient region in the orientation channel has to compete to a much greater extent to reach all the way to the saliency map.
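The channel weighting itself is just a linear combination of the conspicuity maps, as sketched below. The weight values are illustrative; the text only states that the orientation channel is given more weight than the intensity channel, and any further channels would enter the sum in the same way.

    def combine_channels(intensity_csp, orientation_csp, w_int=0.3, w_ori=0.7):
        # Weighted sum of the (normalized) conspicuity maps gives the saliency map.
        return w_int * intensity_csp + w_ori * orientation_csp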

6.3 Normalization

The selection of normalization technique is discussed in chapter 4, but some variables of the DoG normalization should be emphasized. The first is the actual DoG filter kernel, which determines how large the promoted peaks should be, both in 'height' and in extent. However, this can easily be adjusted by the dominating variable, the number of iterations. A few repeated sweeps with the DoG kernel generate not only a more diffuse conspicuity map, but also more information about the objects surrounding the salient peaks. A larger number of iterations enhances the competition between the different feature maps and hence provides a more reliable determination of the most salient area. On the other hand, the saliency map will then contain far fewer areas for the FOA to expand into.
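A rough sketch of one such iterative sweep is given below: the map is filtered with a narrow excitatory Gaussian and a wide inhibitory Gaussian (together forming the DoG), the result is added back, a small global inhibition is subtracted and negative values are clipped. All numerical parameters are illustrative assumptions, not the values used in the implementation.

    import numpy as np
    from scipy import ndimage

    def dog_normalize(feature_map, n_iterations=3, sigma_exc=2.0, sigma_inh=10.0,
                      c_exc=0.5, c_inh=1.5, global_inh=0.02):
        m = feature_map / (feature_map.max() + 1e-12)             # work on a [0, 1] map
        for _ in range(n_iterations):
            exc = c_exc * ndimage.gaussian_filter(m, sigma_exc)   # narrow excitation
            inh = c_inh * ndimage.gaussian_filter(m, sigma_inh)   # broad inhibition
            m = np.clip(m + exc - inh - global_inh, 0.0, None)    # clip negative values
        return m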



Figure 6.1. The conspicuity map for the intensity channel, normalized 1 time (the left


Chapter 7

Test results

This chapter briefly presents the implementation of the estimator, together with some test results, both from when the estimator was applied to artificial images and when it was applied to natural images.

7.1 Tests on artificial images

A small test was done with four test subjects, who were asked to study four different artificial images. The test images were all generated by a Matlab script, which spreads either rods in different orientations, or squares, randomly over the entire image. The test images are similar to the ones used in Itti's experiments, described in [6]. The purpose of this test was to compare the test subjects' FOA:s with the ones suggested by the estimator. Due to the lack of eye gaze measurements, the evidential significance can be questioned. However, the test showed some interesting results.
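For illustration, a sketch of this kind of stimulus is given below (in Python rather than Matlab): short rods at random positions, all with the same orientation except one odd rod, so that the salient element is defined purely by orientation contrast. Sizes and counts are illustrative assumptions.

    import numpy as np

    def rod_image(size=512, n_rods=40, rod_len=21, rng=None):
        rng = np.random.default_rng() if rng is None else rng
        img = np.zeros((size, size))
        angles = np.full(n_rods, np.pi / 4)
        angles[rng.integers(n_rods)] = 3 * np.pi / 4            # one odd orientation
        for a in angles:
            cy, cx = rng.integers(rod_len, size - rod_len, size=2)
            t = np.linspace(-rod_len / 2, rod_len / 2, rod_len)
            ys = np.round(cy + t * np.sin(a)).astype(int)
            xs = np.round(cx + t * np.cos(a)).astype(int)
            img[ys, xs] = 1.0                                   # draw one rod
        return img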


Figure 7.1. Test image 1 with corresponding FOA region

The first test image, figure 7.1, is an example of how the HVS is sensitive to
