
Institutionen för systemteknik
Department of Electrical Engineering

Master's Thesis

Image Segmentation and Target Tracking using Computer Vision

Master's thesis carried out in Image Processing at the Institute of Technology, Linköping University

by

Sebastian Möller

LiTH-ISY-EX--11/4424--SE

Linköping 2011

Department of Electrical Engineering, Linköpings universitet

Supervisors: Klas Nordberg, ISY, Linköpings universitet
             Thomas Svensson, FOI, Totalförsvarets forskningsinstitut

Examiner:    Klas Nordberg, ISY, Linköpings universitet

Computer Vision Laboratory
Department of Electrical Engineering
Linköpings universitet
SE-581 83 Linköping, Sweden

2011-05-09

ISRN: LiTH-ISY-EX--11/4424--SE

URL for electronic version: http://www.ep.liu.se

Titel / Title: Bildsegmentering samt målföljning med hjälp av datorseende / Image Segmentation and Target Tracking using Computer Vision

Författare / Author: Sebastian Möller

Keywords: Tracking, IR, computer vision, machine vision, segmentation, quadrature filters, background model, frame differences


Abstract

In this master's thesis the possibility of detecting and tracking objects in multispectral infrared video sequences is investigated. The current method, which uses fixed-size rectangles, has significant disadvantages; these are addressed using image segmentation to estimate the shape of the object. The result of the image segmentation is used to determine the infrared contrast of the object. Our results show that some objects give very good segmentation, tracking and shape detection. The objects that perform best are the flares and countermeasures, but helicopters seen from the side, with significant movement, are also detected better with our method. The motion of the object is very important, since movement is the main component in successful shape detection; this is because helicopters are much colder than flares and engines. Detecting the presence and position of moving objects is easier and can be done quite successfully even with helicopters, but using structure tensors we can also detect the presence and estimate the position of stationary objects.

Sammanfattning

In this master's thesis the possibilities of detecting and tracking objects of interest in multispectral infrared video sequences are investigated. The current method, which uses rectangles of fixed size, has its disadvantages. These disadvantages are addressed by using image segmentation to estimate the shape of the desired targets.

Beyond detection and tracking, we also try to find the shape and contour of interesting objects, so that the more exact fit can be used in the contrast calculations. This segmented contour replaces the old fixed rectangles previously used to calculate the intensity contrast of objects in the infrared wavelengths.

The results presented show that for some objects, such as countermeasures and flares, it is easier to obtain a good contour and tracking than it is for helicopters, which were another desired target type. The difficulties that arise with helicopters are largely due to them being much cooler, so that parts of the helicopter can be completely hidden in the noise from the image sensor. To compensate for this, methods are used that assume the object moves considerably in the video, so that the motion can be used as a detection cue. This gives good results for the video sequences where the target moves a lot relative to its size.


Acknowledgments

I would like to thank my supervisors Thomas Svensson and David Bergström for assisting me with my work, and also my supervisor and examiner Klas Nordberg and my opponent David Sandberg. Of course I also thank my sources for helping with the ongoing research in this field, without whom this would not have been possible. Thanks to Mikael Möller and Anneli Gottfridsson for their help with proofreading this report and improving its readability.


Contents

1 Introduction
  1.1 Problem description
  1.2 Data
  1.3 Glossary
  1.4 Notation
  1.5 Tensors
    1.5.1 Eigenvalues for Tensors

2 Preprocessing
  2.1 Spectral mixer
  2.2 Median filtering

3 Segmentation
  3.1 Frame differences
    3.1.1 Algorithm
    3.1.2 Results
  3.2 Background model
    3.2.1 Algorithm
    3.2.2 Results
  3.3 Structure tensor with image gradients
    3.3.1 Algorithm
    3.3.2 Results
  3.4 Structure tensor with Quadrature filters
    3.4.1 Algorithm
    3.4.2 Results

4 Post processing
  4.1 Threshold
  4.2 Area filter
  4.3 Point representation
    4.3.1 Algorithm
  4.4 Mean Shift Clustering
    4.4.1 Algorithm
    4.4.2 Results
  4.5 Convex hull
  4.6 Tracking
    4.6.1 Tracking to reduce false positives
    4.6.2 Method
  4.7 Quality measures
    4.7.1 Image variance
    4.7.2 Object count
    4.7.3 Mean motion
    4.7.4 Area

5 Results
  5.1 Segmentation
  5.2 Evaluation
    5.2.1 Comparison of methods I, III and IV
    5.2.2 Comparison of methods II, III and IV
    5.2.3 Comparison of methods I and II
    5.2.4 Comparison of methods III and IV

6 Further work

A Dual tensor, Ñ

Chapter 1

Introduction

This master thesis covers methods for segmentation and tracking in multispectral infrared images. The methods used are structure tensors based on either image gradients or quadrature filters; we also tested background modelling with incremental median estimation as well as frame differences. These methods were used to create a framework that, as robustly as possible, detects objects of interest with minimal operator intervention. Chapter 1 starts with a description of the data. Chapters 2 to 4 describe the methods mentioned above and their ability to solve the problems of object detection and tracking, along with some quality measures used to estimate the quality of the results achieved. In chapter 5 we report the combinations of methods that gave the best results for our data, together with their corresponding quality measures. The last chapter contains some topics for future research.

1.1 Problem description

The Swedish Defence Research Agency (FOI) wants to detect targets in multispectral infrared images. The purpose is to analyse military targets, and the primary interest is the infrared radiant intensity contrast, which is used to create an IR signature of the target. The IR signature in turn can be used to determine whether countermeasures look similar to the vehicle they are protecting; similar IR signatures improve the probability of successfully evading a threat.

The infrared contrast is calculated by taking the difference between the average intensity of the found object and the immediate background intensity, and then multiplying this difference by the area of the object. To do this we need to be able to automatically track areas such as vehicles and flares. The current method, known at FOI as the peak filter, is to place a fixed-size rectangle so that the sum of all pixel intensities inside its area is as large as possible. This method has some disadvantages, of which the fixed-size rectangle is the major issue. Objects generally are not rectangular, and they also change in size due to rotation, changing temperature or speed. The rectangle method also has problems with false positives and poor tracking results, where the rectangle may stick to a spurious pixel or not follow the target well. These disadvantages are currently fixed by operator intervention, which is a time-consuming solution.

This report describes methods for tracking and segmentation that perform better than the older method. These methods solve the problem of the fixed-size rectangles by replacing them with a flexible contour of the same size and shape as the object. For known objects there are preconfigured combinations of methods, and for new objects it is possible to create new combinations.

The aim of this report is to present how to estimate a mask, from data, that can be provided to the operator. An example of an estimated mask is seen in figure 1.2b, where white marks the sections of the image in figure 1.2a that are to be kept and black the sections to be removed; the result is seen in figure 1.2c.

To summarise: our aim is to separate the object(s) in an image from the background, and to estimate a mask (or masks) that can be used for further image processing.

The system design for completing the tasks set out in this introduction is summarized in figure 1.1. As seen there, the design is a pipes & filters design¹, which allows for parallel processing and makes each unit independent of the others. The filters can also be combined in a multitude of ways, and some intriguing possibilities show up when we allow feedback of image data.

Figure 1.1. A system summary showing the flow of data and the pipes & filters structure of the system design.

1.2 Data

With a multi-spectral infrared camera, sensitive to wavelengths between 1.5 µm and 5.5 µm, it is possible to detect objects that are not visible to the human eye (the human eye is sensitive to wavelengths between 0.39 µm and 0.75 µm [1]). Our camera uses a number of different spectral bands (n), usually 3 or 4, to be able to detect different objects. This means that we can, with the help of the camera, see the thermal radiation of objects colder than 525 °C, which is the lowest temperature that is visible to the human eye (in optimal conditions). This applies to "black" objects that do not reflect any light, but in reality there are reflections in the infrared that can both hide and reveal objects.

¹Pipes & filters, also known as a pipeline, is a system design where the data flows through pipes and encounters different filters on the path. This design pattern allows for concurrent processing and easy maintenance, since the filters are well separated.

Figure 1.2. Original image, estimated mask and result after applying the mask.

The acquired data is stored in a hypercube where each pixel is described by a function f(x, y, t, i). Here x and y are image coordinates, t is time and i ∈ {1, . . . , n} is one of the spectral bands, where n is the number of spectral bands available. The function f returns a brightness value for the specified (x, y, t, i); if i and t are held constant while x and y vary, we retrieve an image at time t for spectral band i.
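As an illustration of this layout, here is a minimal NumPy sketch (not from the thesis; array sizes and names are invented) of how such a hypercube can be stored and sliced:

    import numpy as np

    # Hypothetical hypercube: f[x, y, t, i] holds the brightness of pixel
    # (x, y) at time t in spectral band i; the sizes are invented here.
    n_bands = 4
    f = np.zeros((256, 320, 100, n_bands))

    t, i = 10, 2
    image = f[:, :, t, i]     # x and y vary: the frame at time t, band i
    value = f[40, 60, t, i]   # a single brightness value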

The datasets presented in figures 1.3 to 1.6 are of the types that are commonly analysed at FOI, and where automation could potentially save many man-hours of analysis. In these images we can also see some typical properties of IR image sensors. Due to their very high sensitivity they suffer much more from "veils" that distort the background brightness level in the images; this can be seen as a slow gradient across the image with a darker side and a brighter side. Related to this is also the much lower signal-to-noise level, which makes the Gaussian noise in the image more visible than in most visible-light cameras. It is also more difficult to create error-free sensors in the low volumes in which they are produced, which means that the chips are more likely to have dead or spurious pixels, seen in the image as salt & pepper noise.

Luckily we can see progress in image quality over time, since we have access to both new and old data. The older image sensor produced the data seen in figure 1.3, whereas the newer sensor produced all other images.


figure 1.3  This is the first dataset and is a test of how well any method can handle dead pixels and noisy data. As seen in the figure, the real signal is very similar to the noise from the image sensor.

figure 1.4  This dataset is more recent than that in figure 1.3 and, as seen, much of the noise has been removed thanks to an improved image sensor.

figure 1.5  The helicopter datasets show how similar a helicopter hull is to the sky in the infrared wavelength regions. This dataset is a difficult test of how well the methods handle weak but spatially large signals.

figure 1.6  The last dataset type is the chaff data, where many small signals are scattered in a cloud that has to be accurately tracked without too many false positives and without missing real signals.

1.3 Glossary

Dead pixels are pixels that give values unrelated to the image data. They can usually be seen as salt & pepper noise in unfiltered example pictures. They originate from damaged sectors on the image sensor chips.

Spurious pixels are pixels that, like dead pixels, do not contain data from the image, with one distinction: spurious pixels vary over time.

Mask is a binary image that contains the shape of the detected object, determining whether a portion of the image is part of the object or not. An example of a mask can be seen in figure 1.2.

False positives are targets that the algorithms have found but that do not actually exist. These usually originate from dead pixels or noisy data. Post processing and the quality measures are an important part of identifying and removing false positives in the processed data.

False negatives are targets that the algorithm did not detect, in this text simply referred to as missed targets.

IR signature is a parameter that describes how a target is registered in the IR wavelengths.

Chaff is a type of countermeasure consisting of many small pieces of conductive foil. An example can be seen in figure 1.6.

Flare is a hot burning piece of phosphorus with additives to make the IR signature as similar to the defended target as possible. An example of this data can be seen in figure 1.4.


Figure 1.3. A sample of a dataset using an older camera as well as a longer distance; the target is a flare.

Figure 1.4. A sample of a dataset from the better camera, taken of a flare dropped from an aircraft.

Figure 1.5. A sample of a dataset taken of a helicopter.


1.4 Notation

Due to the many different algorithms presented, sourced from different publications, there are some notational difficulties; however, I try to be as consistent as possible. In this paper the following notation is used:

    x, c          Scalars
    x, u          Vectors
    |u|, ‖x‖2     Magnitude of a vector, |x| = √(Σi xi²)
    ‖x‖1          1-norm of a vector, ‖x‖1 = Σi |xi|
    û             Directional vector of unit length
    F, M          Scalar-valued functions on R²
    F̄             The complement of image F, i.e. 1 − F for normalized images
    ∗             Convolution
    ∂x            Smooth partial differentiation in the x coordinate
    ∇             Smooth gradient, (∂x, ∂y)ᵀ
    T, N          Matrices and tensors
    ‖T‖           Norm of matrices and tensors
    Ñ, B̃          Dual tensors
    F, F⁻¹        Fourier transform and its inverse
    I             Input image to the segmentation step
    E             Segmentation output image
    O             Post-processing output image, mask

1.5 Tensors

In this paper the term tensor refers to a symmetric 2 × 2 matrix for each pixel, where the four elements represent the local region. There are many ways to compute the tensor for images; the most common example is the structure tensor, which is generated by calculating the gradient of the image and then taking the outer product of the gradient at each pixel, as seen in equation 1.1 and figure 1.8.

    Tg = (∇I ∇Iᵀ) ∗ g = ( (∂I/∂x)²          (∂I/∂x)(∂I/∂y) ) ∗ g        (1.1)
                        ( (∂I/∂x)(∂I/∂y)    (∂I/∂y)²       )

where g is a Gaussian window for local mean estimation.

When working with tensors, every pixel has four data values assigned to it. In this thesis the values have been separated so that the values in position (1, 1) for all pixels are stored in the upper left quadrant, position (1, 2) in the upper right quadrant, and so on. This can be seen in figures 1.8 and 1.9, where the structure tensor has been estimated for the image in figure 1.7.

1.5.1 Eigenvalues for Tensors

For tensors the eigenvalues λ1 and λ2 represent the intrinsic dimensionality of the region around the pixel in question. We arrange the eigenvalues so that λ1 ≥ λ2.


Figure 1.7. IR image of a helicopter seen from the side. This picture is used for the tensor calculation examples in figures 1.8 to 1.11.

Now if λ1 and λ2 are both small, the region around the pixel is a featureless flat region; this case is called I0D, or intrinsically zero dimensional. If however λ1 > λ2, the region contains a one-dimensional feature, i.e. a line. If both λ1 and λ2 are large, the region has a two-dimensional feature, such as a corner or two crossing lines. To summarize:

    I0D   λ1 = λ2 = 0    Planes, featureless areas
    I1D   λ1 > λ2 = 0    Lines and edges
    I2D   λ1 ≥ λ2 > 0    Crossing lines and corners

The orientation of these features is determined by the respective eigenvectors v1 and v2. It is however important to remember with which parameters the tensor was estimated, since they affect which feature sizes will be detected.

Algorithm

Recall:

    T = ( a  b )
        ( b  d )

The eigenvalues of T are straightforward to compute with the following equation:

    λ1,2 = (a + d)/2 ± √( ((a − d)/2)² + b² )

Figure 1.8. Example of the structure tensor Tg from image gradients. Note that salt & pepper noise gives strong signals as well as the true target, the helicopter.

Figure 1.9. Example of the tensor Tq from quadrature filters. Note that this is a low …

There are however two different cases for the eigenvectors: the first when b ≠ 0 and the second when b = 0, see the table below.

          b ≠ 0              b = 0

    v1    (λ1 − d, b)ᵀ       (1, 0)ᵀ
    v2    (λ2 − d, b)ᵀ       (0, 1)ᵀ

However, we do not need to explicitly calculate the eigenvalues to estimate the intrinsic dimensionality of the neighbourhood of every pixel. We can save the computation of a square root and still get good results by using the Harris operator [4],

    R(T) = det(T) − κ trace²(T) = ad − b² − κ(a + d)² = λ1 λ2 − κ(λ1 + λ2)²

where κ is a tunable parameter, usually in the interval 0.04 to 0.15.
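As a sketch of how these quantities could be evaluated per pixel, assuming NumPy and that the tensor is stored as three images a, b and d following the quadrant layout above (a minimal illustration, not the thesis implementation):

    import numpy as np

    def tensor_eigenvalues(a, b, d):
        # Closed-form eigenvalues of the symmetric 2x2 tensor ((a, b), (b, d)),
        # evaluated element-wise for per-pixel tensor images a, b and d.
        half_trace = (a + d) / 2.0
        root = np.sqrt(((a - d) / 2.0) ** 2 + b ** 2)
        return half_trace + root, half_trace - root   # lambda1 >= lambda2

    def harris(a, b, d, kappa=0.04):
        # R(T) = det(T) - kappa * trace(T)^2; avoids the square root above.
        return a * d - b ** 2 - kappa * (a + d) ** 2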


Chapter 2

Preprocessing

In this first step we change the input, a multi-spectral image f, into a grey-scale image I. The most important preprocessing step is to reduce the amount of data by combining the brightnesses, bi = f(x, y, t, i), of spectral bands 1, . . . , n to simplify the calculations. That is, we combine the brightness values b1, . . . , bn so that the pixels (x, y, t, b1), . . . , (x, y, t, bn) are mapped to a pixel (x, y, t, g(b1, . . . , bn)) for all (x, y, t).

2.1 Spectral mixer

The reason we need the transformation g is that the image segmentation methods we will use require a scalar brightness, not a vector-valued one. The transformations we compare are the following three:

    (b̄1 + · · · + b̄n)/n,    median(b̄1, . . . , b̄n)    and    b̄k̂

where k̂ is the index of the brightness value given by max(|b̄1|, . . . , |b̄n|). The bi's for spectral band i and time t are pre-transformed as

    b̄i = bi − µti,    where    µti = ( Σx Σy f(x, y, t, i) ) / ( Σx Σy 1 )        (2.1)

is the mean¹ of bi over x and y. This reduces the complexity of having several spectral bands with different brightnesses and allows the methods to work with the more relevant adjusted brightness b̄. The pictures in figure 2.1 show the adjustment visually; note that the left image is in a different scale than the right one.

¹Medians have also been evaluated, with similar results but much longer calculation times.

Figure 2.1. The left image shows µti for different times and for three spectral bands. The right image shows the brightness b̄i after adjusting for the mean value.


In figure 2.4 the result of different g's is shown. The g settled on was the absolute maximum, since it selects the b̄k̂ which has the greatest deviation from the mean, zero. If we assume that the brightness is roughly normally distributed (see figure 2.2), then values in the tails are more likely to be associated with an object.

Figure 2.2. Image histogram of a single frame showing roughly normally distributed brightness values. The following property of normal distributions still holds: it is more common to be close to the mean than further away.

This method, however, is not wholly satisfactory, and results might be improved if the method incorporated the actual wavelength of each band and the bands were combined in a more photometrically correct way. That is, if we could utilize the information in each spectral band to assign a temperature value to each pixel, determined from the black-body radiation assumption; this would require some data about our camera filters and knowledge of black-body radiation physics. It is however not necessary, since this simple and quick method gives satisfactory results.
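A minimal NumPy sketch of the absolute-maximum mixer described above (an illustration under the assumption that one frame is stored as a (rows, cols, n_bands) array; not the thesis code):

    import numpy as np

    def mix_bands(frame):
        # Collapse one multi-spectral frame of shape (rows, cols, n_bands) to
        # grey-scale with the absolute-maximum mixer. The mean and median
        # mixers would be adjusted.mean(axis=2) and np.median(adjusted, axis=2).
        adjusted = frame - frame.mean(axis=(0, 1), keepdims=True)  # eq. 2.1
        idx = np.abs(adjusted).argmax(axis=2)   # band with largest deviation
        return np.take_along_axis(adjusted, idx[..., None], axis=2)[..., 0]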

Figure 2.3. A small cut-out of the original data with four spectral bands.

Figure 2.4. The resulting images after mixing the adjusted bands with mean, median and absolute maximum respectively.


2.2 Median filtering

Further preprocessing using median filtering can be useful to reduce salt & pepper noise and spurious pixels, with variations in brightness, caused by camera and filter errors. Since these pixels can cause false positives it is advantageous to filter them out.

A disadvantage of median filtering is that it rounds sharp corners, as seen in figure 2.6, and may remove details of the detected shape. Furthermore, since median filtering removes spurious pixels, these cannot be detected later and may therefore falsely be included in detected object shapes. This can be seen in figure 2.8 when compared to figure 2.7, where the dead pixels have been removed from the mask. This biases the contrast calculations and is therefore a very significant drawback.

The following figures give an artificial example of this bias error in the mean brightness µ. Mean brightness calculation with a mask is very similar to equation 2.1:

    µ = ( Σx Σy F(x, y) · M(x, y) ) / ( Σx Σy M(x, y) )        (2.2)


A test image F with salt & pepper noise and a square object in the middle is created, as seen in figure 2.5. To the right, in figure 2.6, is the same image F′ after median filtering, where most noise is removed and the sharp corners of the square are smoothed.

Figure 2.5. Original image, F.    Figure 2.6. Filtered image, F′.

In figure 2.7 is the optimal segmentation result, M, and in figure 2.8 the median filtering segmentation result, M′. Note the black spots in the optimal segmentation, which indicate pixels for which we have no data since they were damaged.

Figure 2.7. Optimal mask, M.    Figure 2.8. Filtered mask, M′.


In figure 2.9 the foreground, F · M′, of the image F is seen, using the median filtering segmentation result. In figure 2.10 the background, F · M̄′, is shown. The object's estimated mean brightness µf, which will be used in contrast calculations, is calculated as

    µf = ( Σx Σy F · M′ ) / ( Σx Σy M′ )

and is in these images 1795, compared to the ground truth of 1000.

Figure 2.9. Foreground, F · M′.    Figure 2.10. Background, F · M̄′.

In figures 2.11 and 2.12 the foreground and background of the image are calculated using the optimal segmentation. With this segmentation the mean brightness µo is calculated as

    µo = ( Σx Σy F · M ) / ( Σx Σy M )

and is 1043, which is much closer to the ground truth of 1000.
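A small NumPy/SciPy sketch reproducing this kind of bias experiment (image size, brightness values and noise fraction are invented for illustration, so the numbers will not match 1795 and 1043 exactly):

    import numpy as np
    from scipy.ndimage import median_filter

    rng = np.random.default_rng(0)
    F = np.zeros((100, 100))
    F[40:60, 40:60] = 1000.0                    # object with true mean 1000
    dead = rng.random(F.shape) < 0.02           # 2 % dead pixels
    F[dead] = 8000.0

    M_opt = (F > 500) & ~dead                   # optimal mask M: no dead pixels
    M_med = median_filter(F, size=3) > 500      # mask M' from the filtered image

    mu_opt = (F * M_opt).sum() / M_opt.sum()    # close to the ground truth
    mu_med = (F * M_med).sum() / M_med.sum()    # biased up by hidden dead pixels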


Chapter 3

Segmentation

In chapter 3 we discuss two methods for background removal, background models and frame differences, and two methods for object detection, structure tensors from image gradients and structure tensors from quadrature filtering. Background removal is a way to split the image into two segments, background and foreground, where the foreground is expected to be the desired target. Object detection may detect the presence and position of a target but not its shape. These methods all take a grey-scale image I as input and produce an interest estimation image E. In this interest estimation image, which is also a grey-scale image, a brighter area is more likely to be of interest than a dark one.

Each method gets a section where its implementation, strengths and drawbacks are discussed. We need at least four methods because no single method is enough to solve all the different problems of background removal and object detection.

The goal of these methods is to detect and separate objects such as, but not limited to, helicopters, flares and chaff. This results in a foreground estimate describing which pixels are occupied by an object and which pixels are just background. The foreground estimation will however need some cleaning with post processing to reduce false positives and improve the results.

3.1 Frame differences

Frame differences utilize the fact that the properties of objects in motion differ from those of stationary objects, which are less likely to be interesting, and from noise. This works since the difference between two frames will only show the places where there has been a change. The method has significant drawbacks, since it cannot handle stationary objects or objects that move slowly in relation to their size. The algorithm is taken from [5].


Figure 3.1. Two IR images, originally multi-spectral but with the spectra mixed together; the objects are a helicopter and a flare.

Figure 3.2. Relevant parts of the output from figure 3.1 after frame difference calculations with k1 = 3. The flare is segmented very well, but the helicopter shows significant shadows from older frames.

3.1.1 Algorithm

The algorithm is quite simple but there is room for extensions. We start with the following definitions:

    Ik = image number k
    Fk = Ik − Ik−k1

where k1 is a tunable time offset and k belongs to the set of all frames {1, . . . , n}, where n is the total frame count in the dataset. The source [5] used a second frame difference step to get a temporal second derivative, which however is not used in this report. To remove spurious pixels, the algorithm removes those pixels whose absolute value is larger than the user-supplied parameter max:

    Ek = { Fk   if |Fk| < max
         { 0    otherwise
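A minimal NumPy sketch of this step (the max value is a placeholder, since the proper limit is data dependent):

    import numpy as np

    def frame_difference(frames, k, k1=3, max_diff=500.0):
        # E_k = I_k - I_{k-k1}, with pixels whose absolute difference exceeds
        # the user-supplied limit zeroed out as spurious.
        F = frames[k].astype(np.float64) - frames[k - k1].astype(np.float64)
        return np.where(np.abs(F) < max_diff, F, 0.0)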


3.1.2 Results

For objects moving fast relative to their size this method performs well and, as seen in the result image in figure 3.2, most of the noise from figure 3.1 is stationary and is filtered away quite nicely. The motion requirement makes this method, by itself, less useful for larger objects, because such objects will cast "shadows" on other parts. This disadvantage makes the segmentation results unreliable for large objects.

3.2 Background model

This method is taken from [6], with [7] as the original source, but with an added trick also suggested in [6] and sourced from [8]: incrementally updating a median estimation. This median estimate of the background is subtracted to estimate the foreground. The method works best for stationary cameras, or when we have a homogeneous background with a moving target. For good data the method is also very easy to tune, and it can be quite fast if a small number of images is used.

3.2.1 Algorithm

There are two tuning parameters: the first one, β, is the number of frames used for each estimate, and the second one, α, is how fast the background is updated. Let as before {Ik}, k = 1, . . . , n, denote our consecutive frames. The algorithm works as follows:

1. Initialize k = β.

2. Estimate the background:

       Bk = median(Ik−β+1, Ik−β+2, . . . , Ik)

3. Increment k by one.

4. Estimate the foreground:

       Ek = |Ik − Bk−1|

5. Update the background:

       Bk = (1 − α) Bk−1 + α median(Ik−β+1, Ik−β+2, . . . , Ik)

6. If k < n, return to item 3.

Another method is to incorporate the standard deviation of each pixel in the background. This method adds two steps to the algorithm above, similar to the method in [9] but with a single Gaussian distribution per pixel instead of several:

2. Estimate the background:

       µk = median(Ik−β+1, Ik−β+2, . . . , Ik)

3. Estimate the standard deviation:

       σk = std(Ik−β+1, Ik−β+2, . . . , Ik)

4. Increment k by one.

5. Estimate the foreground:

       Ek = (Ik − µk−1) / σk−1

6. Update the background:

       µk = (1 − α) µk−1 + α median(Ik−β+1, Ik−β+2, . . . , Ik)

7. Update the standard deviation:

       σk = (1 − α) σk−1 + α std(Ik−β+1, Ik−β+2, . . . , Ik)

8. If k < n, return to item 4.
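A sketch of the first variant in NumPy (assuming the mixed frames are stacked in one array; parameter values are illustrative only):

    import numpy as np

    def background_model(frames, beta=10, alpha=0.05):
        # Running-median background B, updated incrementally; yields the
        # foreground estimate E_k for each frame after the first beta frames.
        # frames: array of shape (n, rows, cols).
        B = np.median(frames[:beta], axis=0)             # steps 1-2, k = beta
        for k in range(beta, len(frames)):               # steps 3-6
            E = np.abs(frames[k] - B)                    # E_k = |I_k - B_{k-1}|
            window = frames[k - beta + 1:k + 1]
            B = (1 - alpha) * B + alpha * np.median(window, axis=0)
            yield k, E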

3.2.2 Results

Overall this background model supersedes the frame difference model in detecting objects, and it has fewer false positives. It is however slower, which, while not a concern for this implementation, could be of interest in future development. For sequences where the target is moving around in the frame the segmentation can be very good, as seen in figure 3.3. For sequences where the camera tracks the object well we get the side effect of swapping foreground and background, i.e. the helicopter is believed to be the background and the sky the foreground. In figure 3.4 there is an example of the resulting poor segmentation.

3.3 Structure tensor with image gradients

This is a method for estimating the structure tensor of the neighbourhood of each pixel. The method can be tuned for object size but, to save calculation time, scaling down the image is recommended for large objects.


Figure 3.3. Here are the original image, the background estimate, the difference and the estimated shape of the helicopter. As seen, the resulting shape is very close to the actual shape in the original image.

Figure 3.4. A video sequence where some pixels are poorly segmented because the helicopter occupies them very often. The artefacts that cause the problem are clearly seen in the background estimation.


Figure 3.5. On top, the two functions g and gx respectively, which when convolved produce the bottom convolution kernel, gᵀ ∗ gx, used when estimating a continuous derivative over an image.

3.3.1 Algorithm

We estimate the image derivatives by convolving the image with a 2D signed Gaussian kernel, created from two separable 1D kernels, which is cut off in the spatial domain at ±3 standard deviations (which is sufficiently large). The standard deviation is tunable, so that we can change how the filter reacts to objects of different sizes: a large variance is used to diminish the response from small objects and amplify the response from large objects, since a larger neighbourhood is taken into consideration.

First construct the separable filter kernels given the variance σ, and extract the values that are within 3 standard deviations to get a limited size, pictured on top in figure 3.5:

    g(x) = e^(−0.5 x²/σ²) / σ
    gx(x) = −g(x) · x / σ²

The derivatives are then estimated from the image data by convolutions:

    ∂I/∂x = I ∗ gᵀ ∗ gx
    ∂I/∂y = I ∗ g ∗ gxᵀ = I ∗ (gᵀ ∗ gx)ᵀ

There is an example of these operations in figures 3.6 and 3.7. The image derivatives are used to estimate the structure tensor Tg:

    Tg = (∇I ∇Iᵀ) ∗ g = ( (∂I/∂x)²          (∂I/∂x)(∂I/∂y) ) ∗ g
                        ( (∂I/∂x)(∂I/∂y)    (∂I/∂y)²       )

Here we use the structure tensor to estimate the local intrinsic dimensionality of the image, i.e. the local complexity of the image. A more complex part of the image is deemed more likely to be interesting to track. This is done by calculating the eigenvalues of the tensor or, more simply, by using the Harris operator. Both of these concepts are explained further in section 1.5.
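A compact sketch of this pipeline using SciPy's Gaussian-derivative filters in place of the explicit separable kernels above (a functionally similar shortcut, not the thesis implementation; the sigma values are illustrative):

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def harris_from_gradients(I, sigma_d=2.0, sigma_w=4.0, kappa=0.04):
        # Structure tensor Tg from Gaussian-derivative gradients, followed by
        # the Harris response. sigma_d tunes the derivative scale (object
        # size) and sigma_w the averaging window g.
        I = I.astype(np.float64)
        Ix = gaussian_filter(I, sigma_d, order=(0, 1))   # dI/dx
        Iy = gaussian_filter(I, sigma_d, order=(1, 0))   # dI/dy
        a = gaussian_filter(Ix * Ix, sigma_w)            # tensor element (1,1)
        b = gaussian_filter(Ix * Iy, sigma_w)            # tensor element (1,2)
        d = gaussian_filter(Iy * Iy, sigma_w)            # tensor element (2,2)
        return a * d - b ** 2 - kappa * (a + d) ** 2     # Harris operator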

3.3.2 Results

The method of structure tensors with image gradients gives good results in detecting objects that are spatially large, since the desired target size can be tuned to ignore the much smaller salt & pepper noise. However, it does not perform well if the desired target is of similar size as the noise. The method detects the position and presence of the object, which is especially useful when the proper shape cannot be estimated reliably. The reason the method cannot detect the shape of the object is that the edges around the object are of similar frequencies as the noise. This disadvantage makes the method a poor choice for image segmentation, but it is still useful when other image segmentation methods fail to acquire an estimate of position and presence. The conclusion is that the method is more robust concerning position and presence detection.

3.4 Structure tensor with Quadrature filters

Most of the quadrature filter implementation is sourced from [10], where many methods for estimating local orientation are described; the original source is [11]. The result of this method is an estimate of the structure tensor T of the local neighbourhood of each pixel. From this structure tensor we can calculate the eigenvalues, which describe the local intrinsic dimensionality, and the norm of the tensor. The quadrature filter has a tunable central frequency, which is used to filter out objects of undesired spatial sizes. This can be very useful when we want to find objects that are large in a noisy image where the noisy pixels are small, such as salt & pepper noise. The method is very similar to the image gradients with structure tensor method, but with a different way of estimating the tensor.


Figure 3.6. Image used for the image derivative examples; note the multiple engines of the aircraft as well as some salt & pepper noise, which all give strong signals.

Figure 3.7. Image derivatives from the test image in figure 3.6, I′x and I′y respectively. Note that the signal from the engines is stronger than that of the salt & pepper noise because of the large variance σ in g. Unfortunately there are some edge effects visible that can cause false detections if they are not filtered away.


3.4.1 Algorithm

The theory of quadrature filtering of images is best explained in [10]; a full account is beyond the scope of this text. However, the algorithm can be used and implemented from the following description.

The algorithm begins by obtaining Ri, the central frequency, and B, the bandwidth, from the user, where Ri determines the size of the desired objects and B the variance within that size. These parameters are problem specific; with different values the filter has different properties, for example with large Ri salt & pepper noise gives very strong responses and needs to be removed in preprocessing. After getting the parameters from the user, the filters Hn are created in the digital Fourier domain with the same size as the frame to be filtered. The method uses a tunable number of filter orientations n, where each filter orientation represents a direction of interest; lines orthogonal to this direction give stronger responses. Since −n̂ and n̂ result in the same filter response magnitude, only directions in [0, π) are considered, and as such the directions {0, π/4, π/2, 3π/4}² are selected and filter direction vectors n̂ of unit length are generated. The quadrature filter Hn is defined from two functions R and D, a radially symmetric function and a directional function respectively:

    Hn(u) = R(|u|) · D(û)

where u is the 2D frequency coordinate vector and R and D are defined as:

    R(u) = exp( −(4 / (B² ln 2)) · ln²(|u| / Ri) )

    D(û) = { (û · n̂)²   if û · n̂ > 0
           { 0          otherwise

R(u) = e −4 B2 ln 2ln 2(|u| Ri) D(u) =  (ˆu · ˆn)2 if ˆu · ˆn > 0 0 if ˆu · ˆn < 0

Then for all frames I we apply the following equation:

qn= |F−1(F (I) · Hn(u)| (3.1)

where qn is the filter response for filter direction n, this is then used to estimate

the tensor Tq. Tq= X n e Nn· qn e

N is a set of dual tensors, where [10] explains how they are determined in the general case, a recipe is provided in appendix A, and for the set of directions picked here eNn is:

e N1=  3/4 0 0 −1/4  e N2=  1/4 1/2 1/2 1/4  e N3=  −1/4 0 0 3/4  e N4=  1/4 −1/2 −1/2 1/4 

²The number of directions is tunable; more directions give slightly better results at a higher computational cost.
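A NumPy sketch of this algorithm (the frequency-grid scaling is an implementation assumption of mine; the parameter values are taken from the figure 3.8 helicopter example):

    import numpy as np

    def quadrature_tensor(I, Ri=2.2, B=2.2):
        # Tensor Tq from lognormal quadrature filters in the four directions
        # {0, pi/4, pi/2, 3pi/4} described in section 3.4.1.
        rows, cols = I.shape
        fy = np.fft.fftfreq(rows)[:, None] * 2 * np.pi
        fx = np.fft.fftfreq(cols)[None, :] * 2 * np.pi
        rho = np.hypot(fx, fy)
        with np.errstate(divide="ignore"):
            R = np.exp(-4.0 / (B ** 2 * np.log(2)) * np.log(rho / Ri) ** 2)
        R[rho == 0] = 0.0                       # no response at zero frequency

        N_dual = [np.array([[0.75, 0.0], [0.0, -0.25]]),
                  np.array([[0.25, 0.5], [0.5, 0.25]]),
                  np.array([[-0.25, 0.0], [0.0, 0.75]]),
                  np.array([[0.25, -0.5], [-0.5, 0.25]])]

        F_I = np.fft.fft2(I)
        T = np.zeros((rows, cols, 2, 2))
        for angle, Nd in zip((0, np.pi / 4, np.pi / 2, 3 * np.pi / 4), N_dual):
            n_hat = (np.cos(angle), np.sin(angle))
            proj = (fx * n_hat[0] + fy * n_hat[1]) / np.where(rho == 0, 1.0, rho)
            D = np.where(proj > 0, proj ** 2, 0.0)   # directional function D
            q = np.abs(np.fft.ifft2(F_I * R * D))    # filter response q_n
            T += q[..., None, None] * Nd             # Tq = sum_n Ntilde_n q_n
        return T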

3.4.2 Results

Salt & pepper noise gives very strong responses when B and Ri are tuned for small objects, much like the results for the gradient method. As seen in the result image in figure 3.8, large objects can be tracked satisfactorily, but as with the image gradients, the contour of the object must be extracted with another method. Another problem arises from the digital Fourier transforms: objects close to the edges give strong filter responses on the other side of the image, where there is no target, due to the cyclic nature of the transform. This problem can be solved by filtering in the spatial domain, i.e. replacing equation 3.1 with:

    qn = |I ∗ F⁻¹(Hn)|

Figure 3.8. After filtering the images on top with quadrature filters, we get the images below. The parameters were B = 2.2, Ri = 2.2 for the helicopter image and B = …

Chapter 4

Post processing

The segmentation algorithms presented in the previous chapter all end up with a grey-scale image E, where white signifies foreground and black signifies background. This is not good enough for obtaining good object shapes, since noise is present and the desired output O is a binary mask with only two values; therefore further processing of the shapes is necessary. First a threshold is applied to the foreground estimate, so that we get a boolean value that determines whether a pixel is part of an object or not. Then we use different filtering methods, depending on the data at hand, to further process the image. Finally we estimate the quality of our results.

4.1 Threshold

Setting a proper threshold is very important, since most algorithms in the end need a conclusive answer, and the only way to get one is to determine what does and does not pass. Several methods are implemented that are usable in different scenarios. The simplest method is to give an interval for the brightness Exy of pixel (x, y). Each pixel with a brightness inside the interval (tl, th) is turned on and the others are turned off:

    Oxy = { 1   if tl < Exy < th
          { 0   otherwise

This method has the advantage that if the target moves out of the frame we will not detect any new targets. The downside is that the user must supply good values for tl and th.

A more robust method for producing a binary image is the following threshold function:

    Oxy = { 1   if tl (maxx,y(E) − µ) ≤ Exy − µ ≤ th (maxx,y(E) − µ)        (4.1)
          { 0   otherwise

where the parameter µ = mean(E).


The choice of tl and th is not as sensitive in this latter method as in the previous one. The downside of this method is that it will usually find new false targets if the real target moves out of the frame, since the mean value drops when the target is outside the frame.

Another method that has also been attempted is to use the standard deviation for selecting the brightest target. In this case we have

    Oxy = { 1   if tl σ + µ < Exy < th σ + µ
          { 0   otherwise

where σ = std(E). This method has no advantages over the previous one in equation 4.1.
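For illustration, minimal NumPy versions of the first two threshold functions (sketches under the reconstruction of equation 4.1 above, not the thesis code):

    import numpy as np

    def threshold_interval(E, t_low, t_high):
        # Fixed interval: a pixel is on if t_low < E_xy < t_high.
        return ((E > t_low) & (E < t_high)).astype(np.uint8)

    def threshold_relative(E, t_low, t_high):
        # The relative threshold of equation 4.1 with mu = mean(E): keep
        # pixels whose deviation from the mean lies within the given
        # fractions of the maximum deviation.
        mu = E.mean()
        span = E.max() - mu
        dev = E - mu
        return ((dev >= t_low * span) & (dev <= t_high * span)).astype(np.uint8)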

4.2 Area filter

After the threshold filtering we may use area filtering to remove objects that have too large or too small an area, since such objects are either noise or, for example, a cloud. We first label the 8-connected objects in the image using the algorithm from [12]. With the labelling complete we have a number of separate objects, and we count the number of pixels belonging to each one. Then we remove any object which has too few or too many pixels.

This method is very efficient because it disposes of the most common false objects: those that are smaller than the desired target size as well as those that are larger.
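A short sketch with SciPy's connected-component labelling standing in for the algorithm from [12] (a substitution for illustration only):

    import numpy as np
    from scipy.ndimage import label

    def area_filter(mask, min_area, max_area):
        # Keep only 8-connected components whose pixel count lies in
        # [min_area, max_area].
        structure = np.ones((3, 3), dtype=int)        # 8-connectivity
        labels, n = label(mask, structure=structure)
        counts = np.bincount(labels.ravel())          # counts[0] = background
        keep = (counts >= min_area) & (counts <= max_area)
        keep[0] = False
        return keep[labels]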

4.3 Point representation

The detected objects should get a single position to assist tracking. Centre of mass is one method for this that works well and is fast both to implement and to calculate. Another method is binary shrinking of areas until only one single pixel of each object remains; see the example in the algorithm section.

The centre of mass method has the advantage of speed, whereas binary shrinking can separate two objects that might have been too close for the segmentation methods to separate. The prerequisite for successful separation is that the binary image is concave, since convex shapes, when shrunk, end up as only one point.

4.3.1 Algorithm

Both centre of mass and binary shrinking are quite straightforward to implement. For the centre of mass method we define the following variables for object Ok:

    mxy = value at pixel (x, y) ∈ Ok
    rxy = position of pixel (x, y) ∈ Ok


and then define the centre of object Ok as

    rCM = ( Σ mxy rxy ) / ( Σ mxy )

Observe that the object index k has been suppressed, and that mxy usually is, but does not have to be, one.

The binary shrinking method is the same as the morphological method “erode” applied over and over again until each object consists of only one single pixel. This algorithm also goes under the name ultimate erosion.

Below is a constructed example image with objects that have not been successfully separated. The left image shows the centre of mass for the detected objects and the right image the centre of mass after binary shrinking. The intermediary steps of the algorithm are shown below, with a few steps to the left and the final result in the image to the right.
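A minimal sketch of the centre of mass variant using SciPy (an illustration, not the thesis code; with a binary mask every mxy is one, so this reduces to the plain centroid):

    import numpy as np
    from scipy.ndimage import label, center_of_mass

    def object_centres(mask):
        # Centre of mass for every 8-connected object in a binary mask.
        labels, n = label(mask, structure=np.ones((3, 3), dtype=int))
        return center_of_mass(mask, labels, index=np.arange(1, n + 1))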


4.4 Mean Shift Clustering

Mean shift clustering uses the mean shift algorithm from [6], with primary sources [13][14], to determine which objects belong together. This can be very useful for clouds of objects that are detected individually but create a larger, more interesting object when all the sub-objects are taken into consideration together, while at the same time removing objects that do not belong to the larger cloud. This is done by assigning each small object to a basin of attraction; the basin which collects the most detections is deemed the most interesting one, all detections that fall into it are kept, and any detection that falls into another basin is rejected.

To assign a basin of attraction to a detection we iteratively apply a kernel function K, which determines the behaviour of the method. Here K(x) = exp(−0.5 |x|²) is used.

4.4.1 Algorithm

The first step is to get point representations for all objects; xi represents the coordinates of object i, and y_j^p is the origin of the kernel K for iteration j and object p. There are two user parameters, min and h: min determines when the algorithm stops (usually a small number around 1 is sufficient), and h determines the kernel bandwidth, which is more difficult to tune well; for small values of h the objects must be closer together than for larger h to end up in the same basin.

1. Initialize the position of the kernel: y_j^p = x_p.

2. Calculate a better position for the kernel by determining the centre of mass of the objects xi, where the mass is the kernel response to the distance between y_j^p and xi, weighted with the kernel bandwidth h:

       y_{j+1}^p = ( Σ_{i=1}^n xi K(|y_j^p − xi| / h) ) / ( Σ_{i=1}^n K(|y_j^p − xi| / h) )

3. Calculate the distance d = |y_{j+1}^p − y_j^p| that the kernel has moved and use it as a stop criterion: if d > min, continue refining the kernel position by returning to item 2.

4. When the kernel has converged to a position y^p, this becomes the first basin. We assign x_p to this basin, and any other points that converge to the same position will belong to the same basin. Other convergence points create new basins to collect objects in.

5. Finally, select the basin with the most objects in it as the foreground estimation; all other basins are discarded.
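A sketch of these steps in NumPy (the quantization used to merge nearby convergence points into one basin is an implementation convenience of mine, not from the thesis):

    import numpy as np

    def mean_shift_filter(points, h=10.0, min_step=1.0, max_iter=100):
        # Assign each detection to a basin of attraction with the Gaussian
        # kernel K(x) = exp(-0.5 x^2) and keep only the largest basin.
        # points: (n, 2) array of object centres.
        points = np.asarray(points, dtype=np.float64)
        modes = np.empty_like(points)
        for p, x in enumerate(points):
            y = x.copy()
            for _ in range(max_iter):
                w = np.exp(-0.5 * (np.linalg.norm(points - y, axis=1) / h) ** 2)
                y_new = (points * w[:, None]).sum(axis=0) / w.sum()
                done = np.linalg.norm(y_new - y) <= min_step
                y = y_new
                if done:
                    break
            # Quantize converged positions so nearby modes share one basin.
            modes[p] = np.round(y / min_step) * min_step
        _, inv, counts = np.unique(modes, axis=0, return_inverse=True,
                                   return_counts=True)
        return points[inv == counts.argmax()]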


4.4.2 Results

This method is very useful for data which has many separate parts where the constituent parts are very small, as seen in figure 4.1. This situation makes it hard to use area filtering to remove false positives, since they are of the same spatial size as the desired targets. But mean shift works well in detecting which targets belong together and which targets are false positives.

4.5 Convex hull

For some sequences there are many targets, and detecting them individually may be less useful than determining a region that spans all detections.

The method attempted here is to determine the convex hull of the detections that are close together. This is done by getting a distance threshold from the user, sorting all detections in a list and removing those which are too far from their neighbours. Unfortunately the results vary too much to be very useful, as seen in the two example images in figure 4.2.

4.6 Tracking

Tracking is used as an aid both in determining whether what the segmentation has found is indeed an interesting object, and in improving the segmentation results. Tracker-assisted segmentation could on a first attempt find a small part of a vehicle³; the tracker could then determine that this small part is of interest, and a second segmentation iteration could focus on this point.

4.6.1 Tracking to reduce false positives

The general idea of tracking is to determine from the behaviour of a detection whether it actually is an object. False positives are expected to either move erratically, be stationary or be short-lived, since sensor noise that is similar to the desired signal will not move across the frame in a smooth fashion; the short-lived targets and stationary ones are both attributed to flickering pixels. Good signals are expected to move smoothly and exist for a longer duration. Tracking can also be used to determine the identity of new objects, by asserting that the new object is actually an older one which has been temporarily occluded.

4.6.2 Method

The current method for tracking is quite crude and could use some improvements, but it performs well enough on the data at hand. The method determines which objects in a new frame are the same as in the previous one by comparing their positions.

³On powered vehicles there are usually exhaust and engine parts that are much hotter than the rest of the vehicle.

Figure 4.1. The result after filtering with the mean shift algorithm. False positives not situated close to the countermeasure cloud are removed, leaving a clearer picture as a result.


Figure 4.2. Two consecutive frames from a video where the convex hull method has been applied. As seen, the results are quite unreliable for this data: where the top frame has a good result, the next frame includes much area that is not related to the target. This happens too often for the method to be very useful.


I and J are the current and previous frames respectively, and i_n and j_n represent the positions of the objects in them. We then create the matrix M:

    M = ( ‖i1 − j1‖2   ‖i1 − j2‖2   · · · )
        ( ‖i2 − j1‖2   ‖i2 − j2‖2   · · · )
        (      ⋮            ⋮         ⋱   )

If the two frames are similar, M will be a square matrix with small values on the diagonal and larger values everywhere else. If there are different numbers of objects in J compared to I, M will not be square, and we will have to select the best matches first and create or remove objects that have no similar detection in the neighbouring frame.

If these steps are done on all frames in a dataset, we can follow any object through the video and calculate its infrared contrast in each frame.
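One way to realize the "select best matches first" step is a greedy pairing over the distance matrix M; a minimal sketch (an illustration consistent with the description above, not the thesis code):

    import numpy as np

    def match_objects(prev_pos, curr_pos, max_dist=np.inf):
        # Greedy pairing of detections in consecutive frames; leftover rows
        # and columns correspond to new or disappeared objects.
        prev_pos = np.asarray(prev_pos, dtype=np.float64)   # positions j_n
        curr_pos = np.asarray(curr_pos, dtype=np.float64)   # positions i_n
        M = np.linalg.norm(curr_pos[:, None, :] - prev_pos[None, :, :], axis=2)
        pairs = []
        while M.size and np.isfinite(M).any() and M.min() <= max_dist:
            i, j = np.unravel_index(np.argmin(M), M.shape)
            pairs.append((i, j))        # current object i continues previous j
            M[i, :] = np.inf            # remove the matched row and column
            M[:, j] = np.inf
        return pairs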

4.7 Quality measures

It is useful to estimate how well the methods have detected the objects in the scene, and therefore a few quality estimation measures are usually applied to the video sequences. There are two examples of quality measures, with datasets from flares and helicopters, in figures 4.4 and 4.3 respectively. There are also many more in the results chapter, with images from the datasets to accompany them.

4.7.1 Image variance

A decent measure of how easy it is to obtain a good and clear mask of the ground objects is to check the spatial standard deviation σ of the entire foreground estimation image:

    σ = √( (1/(n − 1)) Σ_{j=1}^n (bj − µ)² )

where n is the number of pixels in the image, µ is the mean value of all pixels in the image and bj is the intensity of pixel j. A large variation usually means that we have more leeway in choosing a threshold while still getting a decent result. This estimation method has been used a lot and has given good results for estimating how difficult it will be for the operator to set a proper threshold, and consequently the quality of the segmented data. It is however, much like most methods, not foolproof, and cannot easily be used alone to automatically determine whether the segmentation is likely to succeed or not. This is because the magnitude of the variance differs between targets and video sequences; some processing of this data might solve this problem.

4.7.2 Object count

A simple count of the number of distinct objects detected in each frame can tell whether the resulting segmentation has performed well or not. We often know in advance how many interesting objects there are in a video sequence, and can therefore determine which frames have bad data in them. However, there is always the risk that the detected number of objects is right while the actual objects detected are still wrong.

Algorithm

The method tested is to count the number of distinct 8-connected neighbourhoods in each frame and take this number as the number of objects. More advanced methods that merge close objects if they are similar, such as the smoke of a flare, which should be counted as part of the flare and not as a separate object, might give a better estimate, but this method is still very useful.

4.7.3 Mean motion

This method gives a simple measure of how much motion there is in the frame, where too much or too little motion can hint whether the detection is finding too much noise, or whether there are dead pixels that have been confused for objects. There was no sequence where this was a significant problem, and as such the quality of this quality measure has not been determined.

Algorithm

For this method a very rough estimation is used:

• If the number of objects in frame one is the same as in frame two:
• Calculate the centre of mass for all objects.
• Sort the coordinates, for each frame, in decreasing magnitude, e.g. (11, 13) and (15, 20) into a single sorted array 20, 15, 13, 11.
• Sum the absolute differences between each frame and its successor and divide by the number of objects.

Example with two objects p1 and p2 with positions (11, 13) and (15, 20) in the first image, f1, and (13, 15) and (15, 24) in the second, f2:

    f1    f2    |f1 − f2|
    20    24    4
    15    15    0
    13    15    2
    11    13    2

For this data the estimated mean motion will be ‖f1 − f2‖1 / n = (4 + 0 + 2 + 2) / 2 = 4 pixels. Note that this is not the Euclidean distance (2-norm) but the Manhattan distance (1-norm); it is still useful, since large changes in position give large values regardless.
2 or 4 pixels. However note that this is not Euclidean distance, 2-norm, but Manhattan dis-tances, 1-norm, but it is still useful since large changes in position will give large values regardless.


4.7.4 Area

The number of "on" pixels in a mask can be used to determine whether it is the same object that is detected in neighbouring frames, since a large change in area probably means that a new object is being tracked. The area measure can also be used to determine how well the shape segmentation has performed, since a very small area for a helicopter would imply that only the engine has been detected.


Figure 4.3. The binary image quality measures for one of the helicopter passage video sequences. Some missing data can be seen in the movement measure, due to some separation in the segmentation of the helicopter propeller. The area measure tells us that the first few frames are probably bad, and that in the middle of the video the helicopter passes partially out of frame.

Figure 4.4. All quality measures for one video sequence with a flare dropped from an aircraft. It may seem as if the measures indicate a bad segmentation, but much of the data comes from the four engines of the aircraft, which are strong heat sources clearly visible in the video sequence.


Chapter 5

Results

Previously we discussed image segmentation algorithms and found that they gave very different results. To summarise our results we introduce the following two concepts:

Good data is when the object is moving around against a stationary or homogeneous background, preferably with a large contrast (positive or negative makes no difference).

Bad data is when the brightness contrast is very low, such as for short wavelengths or large distances; at large distances much of the light is lost due to the inverse square law and atmospheric absorption. The segmentation is also problematic when objects are stationary or move slowly in the image.

The results for good data are very satisfactory and should help in the contrast calculations at FOI.

5.1 Segmentation

The segmentation results for flares are very good. The shape of the flare may be followed as accurately as the operator decides (see figure 5.1). In figure 5.1 all objects in the frame are detected, and we can see clear signals both from the engines of the aircraft that drops the flare and from the flare itself.

Helicopters can be quite difficult to track and, for bad data, good shape segmentation was next to impossible. For good data it was quite easy to get a good shape segmentation, and the results were satisfactory (see figure 5.2a). Here our quality measures from section 4.7 are useful in determining how good the results are.

For both helicopters and flares the same combination of methods was used, but with different parameter settings. It is possible to get better results if we are only interested in one type of object, but it is time consuming to find the specific combination of methods that is better than the one presented here.

The first combination method, I, used for flares and helicopters, is:


Figure 5.1. A standard flare video sequence result after segmentation. All flare data gives results similar to this if method I is used.

Figure 5.2. Pictures of both good and bad segmentations of helicopters. The bad segmentation is from a helicopter heading towards the camera; the front of the helicopter has a much lower temperature than the sides, and is also quite stationary in the image, which makes detection much more difficult for the motion-sensitive methods.


Figure 5.3. Two frames from the helicopter sequence with only short-wavelength image data used. As we can see, the segmentation is significantly worse than with several spectra of data.

Spectral mixing → Background models → Threshold → Area filter

The second combination method, II, used for chaff data is:

Spectral mixing → Background models → Threshold → Mean shift

The third combination method, III, is used for all data:

Spectral mixing → Threshold

The fourth combination method, IV, is used for all data:

Spectral mixing → Peak filter

The third method is introduced as a control, to evaluate how much better methods I and II are than a very simple method. The fourth method is used to assess how the new methods, I and II, perform compared to the rectangle method presented in the introduction. The settings at each filtering step are quickly configured with the aid of intermediary data. To evaluate the methods we use the quality measures introduced in section 4.7.

At FOI there is also some interest in how well the methods perform if the spectral bands are limited to only the short wavelengths, which give much less contrast from the intended target and are quite a hard spectrum to track in. This was tested by removing all other spectral bands in the spectral mixing step and continuing as usual after that. The results with such data are worse than when all spectra are used and, for helicopter data, only the helicopter engine can be tracked with any reasonable accuracy. When we lower the threshold value the hull may be seen in some images (see figure 5.3), but the images vary too much for the results to be reliable. The flare data, however, is always well segmented.

5.2 Evaluation

In the following we compare the different methods on the three data types: flares, helicopters and chaff. The quality measure "image variance" is omitted in these evaluations, since it did not provide any information beyond the quality measure "area". It is also important to remember that method IV has the disadvantage of the fixed-size rectangle, which does not represent the detected objects very well.

5.2.1 Comparison of methods I, III and IV

On the next three pages we compare method I with the two control methods, III and IV, on three different datasets.

Set 1

The first dataset is a flare dropped behind an aircraft, where the camera follows the flare. This leads to the aircraft disappearing and to quite jittery flare motion, since a human operator adjusts the camera intermittently. This jittery motion is seen in the quality estimation "Mean movement" for method I, and intermittently for method IV. Method III gives much lower values than methods I and IV, because most of its detections are stationary noise, which reduces the mean movement in the image. Method I is the only one which gives the correct object count during the entire video, whereas method III grossly overestimates due to false positives, and method IV, since it must have a fixed number of objects, has the wrong number during most of the sequence.

Set 2

The second dataset is from an older camera, further away from the object. The flare, marked in the images with an ellipse, is much dimmer and smaller than in the first dataset, which makes the segmentation more difficult. In the quality measures we can see that method I mostly detects the correct number of objects, where the errors are caused by hot smoke from the flares. Methods III and IV suffer from the more difficult conditions and cannot give any usable results.

Set 3

The third dataset is a helicopter that moves out of frame for a duration, taken with the new camera. Here we can see that both methods I and IV estimate the motion of the helicopter rather well, but with a distinction: method IV gives false motion and false objects during the time when the helicopter is not visible, whereas method I correctly determines that there is no object in the image at that time. For method III this dataset is very difficult, since the contrast between the helicopter and the background is so low; this gives many false positives that impact the results.

[Quality measure plots of area, mean movement and object count for methods I, III and IV on the three datasets.]

5.2.2 Comparison of methods II, III and IV

When comparing method II to methods III and IV we look at chaff data, which is the data for which method II was designed. As seen in the object count quality measure, methods II and III perform quite similarly, with the distinction that method II removes some false positives. The similarity is to be expected since, as we can see in the dataset images, the chaff is bright on a dark background, which is the ideal case for method III.

Note that method IV detects the motion of the chaff well when it is bright and close together, but has problems as soon as the chaff cools down and disperses in the air.

[Quality measure plots of area, mean movement and object count for methods II, III and IV on the chaff datasets.]

5.2.3 Comparison of methods I and II

The difference between methods I and II is relevant to emphasize, since they use similar algorithms and method II is significantly slower. On the next two pages are two datasets, each utilizing the strengths and weaknesses of each method, so that the differences are as clear as possible. We can see that each method performs best on the target type it was designed for.

On the first set, method II keeps small detections and correctly estimates the area of them all. On the second set, method I successfully removes all false detections.

[Quality measure plots of area, mean movement and object count for methods I and II on the two datasets.]

5.2.4 Comparison of methods III and IV

The two control methods both give adequate results on some specific data in some specific cases. Both methods, however, suffer greatly on the robustness criterion, since they cannot handle more than good data in specific cases. The next two pages show two datasets where one method performs well and the other does not.

[Quality measure plots of area, mean movement and object count for methods III and IV on the two datasets.]

References
