
Institutionen för Systemteknik
Department of Electrical Engineering

Examensarbete (degree thesis)

Foreground Segmentation of Moving Objects

Degree thesis carried out in image processing
by
Joel Molin

LiTH-ISY-EX–10/4299–SE

Linköping 2010

Department of Electrical Engineering, Linköping University
Linköpings Tekniska Högskola, Linköpings Universitet

Degree thesis carried out in image processing
at Linköpings tekniska högskola
by
Joel Molin

LiTH-ISY-EX–10/4299–SE

Linköping 2010

Supervisor (Handledare): Manne Anliot (Image Systems AB, Linköping)
Examiner (Examinator): Klas Nordberg (ISY, Linköpings Universitet)

Institution and division: Computer Vision Laboratory, Department of Electrical Engineering, Linköping University, SE-581 83 Linköping, Sweden
Type of publication: Examensarbete (degree thesis)
ISRN: LiTH-ISY-EX–10/4299–SE
Language: English (Engelska)
Date: 2010-01-03
URL to electronic version: http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-52544
Number of pages: 62
Title: Foreground Segmentation of Moving Objects
Author: Joel Molin

Abstract

Foreground segmentation is a common first step in tracking and surveillance applications. The purpose of foreground segmentation is to provide later stages of image processing with an indication of where interesting data can be found. This thesis is an investigation of how foreground segmentation can be performed in two contexts: as a pre-step to trajectory tracking and as a pre-step in indoor surveillance applications.

Three methods are selected and detailed: a single Gaussian method, a Gaussian mixture model method, and a codebook method. Experiments are then performed on typical input video using the methods. It is concluded that the Gaussian mixture model produces the output which yields the best trajectories when used as input to the trajectory tracker. An extension is proposed to the Gaussian mixture model which reduces shadow, improving the performance of foreground segmentation in the surveillance context.

Keywords: Foreground Segmentation, Background Subtraction, Gaussian Mixture Models, Codebook, Tracking, Shadow Detection, Auto Exposure


Acknowledgements

This thesis has been carried out at Image Systems AB in Linköping as part of a joint research project between Image Systems and FOI in Linköping (the Swedish Defence Research Agency). I would like to thank Manne Anliot and Christer Wigren at Image Systems for the opportunity and for all the feedback and support. I would also like to thank Niclas Wadströmer and Fredrik Hemström at FOI in Linköping for their additional feedback and support.

Contents

1 Introduction
1.1 Foreground Segmentation
1.2 Specifics for this Thesis
1.3 Existing Research
1.4 Terminology and Notations

2 Methods for Segmentation
2.1 A Single-Gaussian Method
2.1.1 Approximating the Gaussian Distribution
2.1.2 Classification
2.1.3 More on Initializing and Updating
2.2 Gaussian Mixture Models
2.2.1 The Background Model
2.2.2 Classification
2.2.3 Updating the Model
2.2.4 Initializing the Model
2.3 A Codebook Method
2.3.1 The Background Model
2.3.2 Initializing a Codebook
2.3.3 Classification
2.3.4 Learning New Backgrounds
2.4 Color Models
2.4.1 YCbCr
2.5 Pre- and Post-processing of Image Data
2.5.1 Low-pass Filters
2.5.2 Reducing Image Size
2.5.3 Blob Size Filtering
2.5.4 Morphological Operations

3 Segmentation-Guided Tracking
3.1 Introduction to the Tracking Algorithm
3.1.1 Problems
3.1.2 How Foreground Segmentation Helps
3.2 Interesting Input
3.2.1 Sequences Shot with High-Frequency Cameras
3.2.2 Laboratory Sequences
3.3 The Projectile Sequence
3.3.1 Method Comparison
3.3.2 Quality of Tracker Output
3.4 The Fish Larva Sequence
3.4.1 Method Comparison
3.5 The Mouse Sequence
3.5.1 Method Comparison
3.6 Time Performance
3.7 Discussion

4 Indoor Surveillance
4.1 Difficulties and Characteristics
4.2 Test Sequences
4.3 Initial Experiments
4.4 Working with Auto Exposure
4.4.1 The Type of Transformation
4.4.2 Choosing a Factor
4.4.3 Reference Frame
4.4.4 Results
4.5 Shadows
4.5.1 A Shadow-reducing Segmenter
4.5.2 Experiments
4.5.3 Discussion

5 Recommendations for Parameter Values
5.1 Single Gaussian Method
5.1.1 Gaussian Mixture Model Method
5.1.2 Codebook Method

6 Conclusions
6.1 Segmentation-Guided Tracking
6.2 Indoor Surveillance
6.2.1 Future Directions

A Output for Tracking Sequences

1 Introduction

This chapter introduces the concept of foreground segmentation and goes through the goals of this thesis. The structure of the report is also outlined.

1.1 Foreground Segmentation

Segmentation of images is the process of dividing the image into disjoint regions, with regions representing different objects or properties. Foreground segmentation is a specialization of segmentation where the goal is to divide the image into background and foreground. Background consists of the static parts of a scene; typical examples are walls, floor and furniture. Foreground is anything that isn't background, that isn't static, i.e. things that move in the scene. Another common term for foreground segmentation is background subtraction. In most cases, the input to the segmentation is a sequence of images, such as video recorded from a camera. Performing foreground segmentation on a single image requires detailed a priori information which is usually not available. This thesis is concerned only with sequences of images, and the outputs will always be sequences of images, one for each input image. Figure 1.1 shows what the segmentation output might look like. The black pixels represent background, and the white pixels thus represent foreground.

While segmentation is quite interesting in itself, it is usually only an intermediate step in a larger image analysis process. The binary output image, which is called the foreground mask, is used in later processing stages as a way of knowing where to start.

Figure 1.1: Ideal output of a foreground segmentation process. Left to right: empty scene, foreground present, and foreground mask.

1.2 Specifics for this Thesis

This thesis is based on the desire to use foreground segmentation as a pre-processing step in a couple of different situations. The overarching goal is to provide a suitable segmentation strategy for each of these situations. In doing so, existing research has been reviewed and used as the basis of the effort. Typical example input sequences have been used to identify the most severe problems for each case.

In short, two different problem classes can be defined:


1. high-frequency sequences and laboratory sequences,

2. indoor surveillance.

In all cases, the cameras are assumed to be stationary, i.e. they do not move.

High-Frequency and Laboratory Sequences

For these two problem classes the intended target of the foreground mask is a 2D trajectory tracking algorithm. The tracking algorithm is already implemented, but it is often confused by background and loses the object it is tracking. The tracker then needs manual correction and re-initialization. Foreground segmentation is intended to assist the tracker by providing a foreground mask in which it is easy to find objects.

The tracking is performed in an interactive application on a desktop PC, controlled by users who are not necessarily experts on image processing. It is therefore of high value that any initialization parameters are easy to assign and understand.

High-frequency sequences and laboratory sequences are different in that the laboratory sequences can be expected to be longer. In both cases, the tracking is performed off-line. The images are stored on disk, available for random access.

Surveillance Video

The surveillance video problem is quite different from the high-frequency and laboratory problems. Surveillance cameras are used where daily life occurs, so the environment can not be fully controlled. Changes in the environment cause a lot of problems that are not an issue in the other two input types. These problems include external lighting conditions (night/day), shadows cast by people, reflections, opening and closing of doors and more.

Since there are many things that can happen in surveillance video, this thesis will be restricted to indoor video, which is more interesting than outdoor video considering how the results of this thesis will be used. By ignoring outdoor video, weather phenomena such as wind, rain and clouds can be ignored.

There is no specific target for the output of the segmentation. However, the most prominent example would be human recognition—detecting if a human is in a frame and, if possible, which human it is. It is not the role of the segmentation to decide if a region shows a human or not; it must report as foreground all regions which are significantly different from their typical state. It is then the post-processor's role to examine the foreground mask and extract the regions it is interested in. Shadows are a troublesome and common element of surveillance video, and will be the main focus of attention apart from modeling the background. Shadowed pixels are different from non-shadowed pixels showing the same thing, but they should not be detected as foreground.

Time Requirements

This thesis is primarily an investigation of segmentation methods, and no hard requirements were set in regard to the processing time. That said, very slow methods are not practical for either surveillance data or the tracker. The tracker is an interactive program, and even though segmentation can be done unsupervised as a pre-processing step, speed is still desirable. For the surveillance data, even though there is no explicit target, the sheer volume of data indicates that speed is desirable.

Goals

To summarize, this thesis aims to

1. Find a method to perform foreground segmentation on high-frequency and laboratory sequences so that the foreground mask can be used as input to a specific tracking algorithm (which is described in section 3.1).

2. Find a method to perform foreground segmentation on surveillance video so that the foreground mask can be used as input to different applications, such as recognizing people. In particular, the method should detect shadow and not report it as foreground.

1.3 Existing Research

Foreground segmentation is an active field of research. Although the idea of foreground segmentation is certainly not new, the applicability has been limited by the required computing power. Now that computing power has become significantly cheaper (and cameras are better and cheaper as well), the effort to find efficient algorithms for foreground segmentation has become more concentrated.

The simplest methods of segmentation look only at the difference between two frames. In order for this to be successful, a frame must be available which accurately describes the background without any foreground present. However, such a frame is not always readily available. Foreground may be present at all times. Noise and variations are not considered, and the background can not adapt well over time.

The background has also been statistically modeled by Gaussian distributions ([1]). Every seen frame is incorporated into the distribution, over time constructing a background model free of foreground. These models still work under the assumption that the background does not change suddenly, but are adaptive so that changes are slowly incorporated into the model. With the model, new observations can be classified as either background (if they agree with the model) or foreground (if they disagree).

The staple work for a lot of research is a publication by Stauffer and Grimson from 1999 [2]. Power and Schoonees published a more detailed paper [3] including discussion of the Stauffer-Grimson method along with clarification and some modifications. The method itself is centered around adapting a set of Gaussian distributions to the observed input. In this way, more than one kind of background can be described and handled.

The Stauffer-Grimson method is generally referred to as the Gaussian Mixture Model, GMM for short, although the abbreviation MOG for Mixture of Gaussians is also used.

Javed and Shah [4] describe a method which also uses edge information in addition to a GMM, using the idea that foreground objects will have sharp edges that are not present in the background model. Wang et al. [5] also combine a GMM with edge information, performing inference using a Bayesian network. Wang et al. also model shadow in their model, and impose spatial constraints using a Markov random field. The spatial constraints make it more likely that neighboring pixels share the same classification, thus yielding smoother foreground masks.

Another way to model the background, put forward by Kim et al. [6], is to use something they call a codebook. A codebook is a variable-sized collection of surface descriptions. The codebook avoids probability distributions altogether, using only fixed color distances to match pixel values against each other.

Different color representations can yield different segmentation quality in some cases. Kumar [7] compares a number of different color spaces in the context of foreground segmentation.

1.4 Terminology and Notations

Most terminology is defined throughout the report where it is used. However, a few things are not defined in the text, and will be handled here.

Foreground: Pixel regions that are not stationary in the scene, for example people.

Background: Pixel regions that are stationary in the scene, for example walls and floor.

Surface: Used to signify the surface of a real-world object. A wall or a shirt are examples of surfaces. A surface is mainly characterized by its color, but more than one surface can have the same color.

Camouflage: Happens when a foreground object has the same color as the background. Often causes a false background classification.

For mathematical formulas, a summary of notations is found in figure 1.2:

Notation        Description
⟨a, b, ...⟩     a color or a tuple of values
argmin_k f(k)   the k that minimizes f(k)
argmax_k f(k)   the k that maximizes f(k)
x               a vector
|x|             vector norm
‖S‖             set size
x ∘ y           element-wise multiplication

Figure 1.2: Summary of notations.

2 Methods for Segmentation

This chapter describes in detail a number of segmentation methods as well as some pre- and post-processing techniques that will be used throughout the report.

2.1 A Single-Gaussian Method

This method constructs a background model by treating every pixel separately. The idea is that the background can be modeled by a Gaussian distribution, where the distribution's mean represents the color of the background surface, and the variance represents noise. The actual distribution of the background is not necessarily Gaussian, but the bell curve of the Gaussian distribution can usually be fitted so that it encloses most of the noise.

2.1.1 Approximating the Gaussian Distribution

To set up some terminology, let X = {x_i : 1 ≤ i ≤ N} be the set of observed pixel values for a single pixel location in an image sequence of length N. The dimension of the observations is not important, but is typically 1 (monochrome, e.g. grayscale) or 3 (for color images).

The full sequence is not always available, and the size N is not always known, for example if the sequence is live-fed from a camera.

If the full sequence is available, the Gaussian distribution can be approximated by computing the mean and variance of X. Otherwise the distribution can be approximated by some value and improved over time using a learning rate factor α. That is, for µ_t, the mean at time t, we can compute a new approximation given the observation x_t:

µ_{t+1} = µ_t · (1 − α) + x_t · α

For non-monochrome images, the assumption that color components are independent is made, which makes the covariance matrix Σ a diagonal matrix. The notation σ², representing the diagonal of Σ—the variances—will be used when this assumption is made. This is an approximation that has previously been done in [3] and makes evaluating the probability density function of the distribution faster, avoiding the cost of matrix inversion and of computing the determinant.

The variance is updated as the mean, using the same learning rate:

d = x_t − µ_{t+1}
σ²_{t+1} = σ²_t · (1 − α) + (d ∘ d) · α

The mean can be initialized to the observation of the first frame, but the variance must be set to some predetermined guess. That is,

µ_1 = x_1
σ²_1 = σ²_init · 1

The mean and variance will improve over time as they are updated from more frames. Note also that ⟨µ_t, σ²_t⟩ is biased towards observations that are close in time. As time passes, old observations decay as (1 − α) is applied to them over and over. This bias towards recent values can make this dynamic mean better than the traditional fixed mean even for fully available sequences, resulting in a time-dependent mean which is often closer to X than the mean taken over all of X. This is because in actual sequences the background changes slightly over time. The initial color value and a guessed variance might work, but are not optimal. Input from more frames is required to improve the model, especially the variance. Further ways of improving initialization follow in section 2.1.3, after a description of how the distribution is used to classify foreground and background.
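As an illustration, the update rules above are straightforward to implement with NumPy. The following is a minimal sketch, assuming frames arrive as arrays of shape (height, width, channels); the function names and the default values are illustrative, not taken from the thesis.

import numpy as np

def init_model(first_frame, sigma2_init=15.0):
    # mu_1 = x_1; the variance starts at a predetermined guess.
    mu = first_frame.astype(np.float64)
    sigma2 = np.full_like(mu, sigma2_init)
    return mu, sigma2

def update_model(mu, sigma2, frame, alpha=0.001):
    # mu_{t+1} = mu_t * (1 - alpha) + x_t * alpha
    x = frame.astype(np.float64)
    mu = mu * (1.0 - alpha) + x * alpha
    # d = x_t - mu_{t+1}; sigma2_{t+1} = sigma2_t * (1 - alpha) + (d o d) * alpha
    d = x - mu
    sigma2 = sigma2 * (1.0 - alpha) + d * d * alpha
    return mu, sigma2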

2.1.2 Classification

The approximated Gaussian distribution ⟨µ, σ²⟩ describes the probability density function of the background manifested as a given color, or f(x|bg), where bg is the event that the observation is of a background pixel. Likewise, fg will represent the event that the observation is made on foreground. The events bg and fg are disjoint and exhaustive. We can formulate the foreground probability, P(fg|x), using Bayes' theorem:

P(fg|x) = f(x|fg) P(fg) / f(x)
        = f(x|fg) P(fg) / ( f(x|fg) P(fg) + f(x|bg) P(bg) )

However, only the background likelihood f(x|bg) has been treated by the method so far. The foreground distribution is impossible to model in the general case, because foreground can take any values. If foreground were to be modeled similarly to the background, by adapting to past foreground observations, the method would eventually become bad at detecting foreground which it has not previously seen. When not having any prior knowledge, the foreground distribution is best described by a uniform distribution. For d-dimensional observations where the value range is integers between 0 and 255, this becomes:

f(x|fg) = 1 / 256^d

The prior probabilities P(fg) and P(bg) should be each other's complements. The simplest method is to fix them at some constant values. There are other alternatives; the probability can be stored and adapted just as the mean and variance. To do this, one of P(fg) and P(bg) is stored for each pixel. Since they are complements, the stored value can be used to compute the other value, e.g. P(fg) can be computed as 1 − P(bg) if P(bg) is stored. Let P_t be the stored probability; then the value of P_t is updated every frame by

P_{t+1} = P_t · (1 − α) + { α if background; 0 if foreground }

A foreground mask is attained by thresholding the foreground probability using a pre-determined threshold δ_fg.
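To make the classification step concrete, here is a sketch under the same assumptions (independent channels, 8-bit data). For brevity, P(fg) is treated as a fixed constant instead of being stored and adapted per pixel; all names and default values are illustrative.

import numpy as np

def foreground_probability(frame, mu, sigma2, p_fg=0.1):
    # Background likelihood f(x|bg): product of per-channel Gaussian
    # densities, valid because the channels are assumed independent.
    x = frame.astype(np.float64)
    f_bg = np.prod(np.exp(-(x - mu) ** 2 / (2.0 * sigma2))
                   / np.sqrt(2.0 * np.pi * sigma2), axis=-1)
    # Uniform foreground likelihood f(x|fg) = 1 / 256^d.
    d = x.shape[-1]
    f_fg = 1.0 / 256.0 ** d
    # Bayes' theorem.
    num = f_fg * p_fg
    return num / (num + f_bg * (1.0 - p_fg))

# mask = foreground_probability(frame, mu, sigma2) > 0.9   # threshold delta_fg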

2.1.3 More on Initializing and Updating

In section 2.1.1 updating the Gaussian distribution was discussed, but it was (silently) assumed that the observation was always of a background pixel. This is not always true; foreground surfaces will occlude background pixels at times.

Ghosting

It is especially detrimental if foreground is present in the frame from which the initial model's means are taken. Ideally, it is desirable to train the background model using input with no foreground present, but this is not always feasible. If a foreground object is present in the initial frame, the region it occupies will start out being classified as background, and when the object moves, both the foreground object and the true background will be classified as foreground. This effect is called ghosting and is visualized in figure 2.1.

Figure 2.1: From left to right: first frame, segmentation of first frame, 30th frame, segmentation of 30th frame.

To avoid ghosting, additional frames must be used when choosing the initial set of means. A straightforward approach is to just run the algorithm on a number of frames—discarding the output—and trust that if we run it enough, all pixels will eventually get the right mean.

Initial Means from Multiple Frames

Ghosts can take a long time to train out of the model. To address this issue, a remedy is proposed here that works for sequences where the full sequence can be accessed randomly or semi-randomly.

In sequences with mostly static background and an object moving around, it is likely that if we look at the same pixel location in two frames far apart, at least one of the frames will show background. We can use this to combine several frames into one, which would become a better estimation of the background means than using a single frame.

One way of combination could be to take the mean of the frames, but such a combination will still be influenced by the foreground frames, and many frames must be used. A better option would be the median, which would pick one of the real background values and discard the foreground (assuming there are more background observations than foreground observations). Since the median is defined only for one dimension, another measure is suggested. The suggestion is to first compute the mean, and then select the sample that is closest to the mean in Euclidean distance. The motivation is that the mean will be biased towards the background values, and thus a background value will be selected.

This improvement is not a replacement for training, but it can shorten the required number of training frames drastically.
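A possible NumPy implementation of this closest-to-the-mean combination is sketched below (the function name is illustrative):

import numpy as np

def combine_frames(frames):
    # frames: array-like of shape (n, height, width, channels).
    stack = np.asarray(frames, dtype=np.float64)
    mean = stack.mean(axis=0)
    # Per pixel, pick the sample closest to the mean in Euclidean distance.
    dist = np.linalg.norm(stack - mean, axis=-1)          # (n, h, w)
    best = np.argmin(dist, axis=0)                        # (h, w)
    return np.take_along_axis(stack, best[None, ..., None], axis=0)[0]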

Dynamic Learning Rate

To shorten the training phase, the learning rate can also be adjusted so that it is initially higher, and gradually decreases to a stable minimum. Higher initial learning rates lead to faster convergence of the model, and decreasing the rate to a low minimum leads to stability in the long term.

It can be useful to use the dynamic learning rate for the variance only, using a fixed low rate for the mean at all times. This works well if there are no ghosts in the initial means, in which case there is no need to drastically change the means. This would be suitable if using a frame which is free from foreground, or if several frames are combined as described in the previous section. If there are ghosts, an increased learning rate would help reduce them.

There are many ways to produce a dynamic learning rate. This method has been used for this thesis:

α′ = max( α₀ / (1 + n), α )

where α is the lower limit, α₀ is the first frame's learning rate, α′ is the dynamic learning rate and n is the number of previously processed frames, n = 0, 1, ...

Foreground and Variance

Training the model is also performed to improve the variances. However, pixels that are partly trained with foreground will generally have a higher variance than they would ideally have. This is because foreground observations yield large deviations, and even if they are only seen for a few frames, the deviation is often large enough to contribute a lot to the variance. Figure 2.2 shows the effects of foreground in the training sequence for variance.

Figure 2.2: The effects of training with foreground. The picture shows the red, green, and blue variances of the background model after the whole sequence has been processed as training. The trajectory of the object is quite clear through high variance in all channels even though the object never stays in one place for long.

One way of avoiding such effects is to update the distribution only when the observation can be classified as background, but doing so effectively removes the model's ability to change background surface. This means that ghosts would stay in the model forever. A compromise is to perform initial training during which the model is always updated. When the initial training is complete, further foreground classifications are not included into the model.

2.2 Gaussian Mixture Models

In section 2.1, the background was modeled by a single distribution in each pixel. As was seen, this leads to problems when the background is not static and when there is foreground in the training data. These problems can be at least partly addressed by introducing more distributions. The idea is to model each surface by its own distribution. This—using a number of Gaussian distributions—is called a Mixture of Gaussians model or a Gaussian Mixture Model. Throughout this thesis, Gaussian Mixture Model will often be abbreviated to GMM.

2.2.1 The Background Model

The background is stored as a collection of Gaussian distributions, each of which is coupled with a weight value, forming a surface descriptor ⟨ω, µ, σ²⟩. Every pixel location is modeled separately with its own set of K distributions. K is typically the same for each pixel location. The sum of the weights is kept at one at all times:

Σ_{i=1}^{K} ω_i = 1

2.2.2 Classification

Assuming that a collection of surface descriptors ⟨ω_k, µ_k, σ²_k⟩ for k ∈ [1, K] is initialized, the observation x can be classified as follows.

Finding the Matching Descriptor

First, the descriptors are searched for the distribution that best matches x. If i is the index of the best matching distribution, we can use the likelihoods of the distributions to find i by

i = argmax_k ( ω_k · P(x | µ_k, σ²_k) )

Sometimes the observation is of a surface that is not modeled by any of the descriptors. To incorporate this, we only accept descriptors with a value above a certain level. Power and Schoonees [3] write about this, formally introducing a special surface descriptor with a uniform probability distribution instead of a Gaussian distribution. Whenever that surface is chosen, the observation is treated as a previously unseen surface. The discussion of what to do with such surfaces is deferred to section 2.2.3.

Stauffer and Grimson used a different approach in their paper [2]. They sacrificed some precision for less computational load, and used the fact that most of the Gaussian density function is concentrated within a few standard deviations to formulate the following match condition:

|x − µ_i| < λ · σ

That is, as long as the observation does not deviate more than λ standard deviations, it can be considered a match. If several descriptors pass this test, the least deviating one can be chosen. They suggested using λ = 2.5, but the parameter can be left open for experiments.

The above was given as it was in the original paper, where a single scalar was used to represent variance. Adjusting the test so that it uses a per-component variance can be done by using the normalized Euclidean distance for N dimensions:

sqrt( Σ_{i=1}^{N} (x_i − µ_i)² / σ²_i ) < λ

If a full covariance matrix were used, this would turn into the proper Mahalanobis distance, which is more complex to compute and thus not as interesting for an approximation.
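As a sketch, the match test for a single pixel could be written as follows, assuming the K descriptors of that pixel are stored as arrays of shape (K, channels); the function name is illustrative:

import numpy as np

def find_matching_descriptor(x, mu, sigma2, lam=2.5):
    # Normalized Euclidean distance from x to each of the K distributions.
    dist = np.sqrt(np.sum((x - mu) ** 2 / sigma2, axis=-1))
    k = int(np.argmin(dist))
    # Accept only if the least deviating descriptor is within
    # lambda standard deviations; otherwise the surface is unseen.
    return k if dist[k] < lam else None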

Classifying Surface Descriptors

When a matching surface has been found, we still have to decide if the surface is really a background surface. Not only background surfaces are kept in the model. In fact, non-background surfaces are stored so that they, given enough evidence, can be promoted to background. The classical Stauffer and Grimson background test is as follows.

1. Sort the surface descriptors by ω_i / |σ²_i|, in descending order.

2. Begin summing the weights of the descriptors, in the sorted order.

3. When the sum is greater than or equal to B, stop the summation.

4. All the surfaces that contributed to the sum are background surfaces.

B is an input parameter and must be set manually. Higher values mean that more surfaces can be classified as background simultaneously and that new surfaces can become background faster.

Any surfaces that are not found to be background by the previous test, and any surfaces that were not found amongst the descriptors at all, are classified as foreground.
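The background test can be sketched as follows, assuming per-pixel arrays omega of shape (K,) and sigma2 of shape (K, channels); interpreting |σ²_i| as the norm of the variance vector is an assumption on the notation:

import numpy as np

def background_descriptors(omega, sigma2, B=0.7):
    # 1-2. Sort descriptors by omega / |sigma^2| in descending order
    #      and sum the weights in that order.
    key = omega / np.linalg.norm(sigma2, axis=-1)
    order = np.argsort(-key)
    csum = np.cumsum(omega[order])
    # 3-4. Stop when the sum reaches B; everything that contributed
    #      is considered background.
    n = int(np.searchsorted(csum, B)) + 1
    return order[:n]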

2.2.3 Updating the Model

When there is a Match

For every observation, the matching surface descriptor is updated similarly to how the single Gaussian model was updated (compare section 2.1.1). That is, the matching surface k is updated with a learning rate α as follows:

µ_{k,t+1} = µ_{k,t} · (1 − α) + x · α
d = x − µ_{k,t+1}
σ²_{k,t+1} = σ²_{k,t} · (1 − α) + (d ∘ d) · α

The weight is updated for all descriptors, not just the matching one. The matching one is updated towards one and the others towards zero. After being updated, the weights are also normalized so that their sum remains one. That is, for each distribution i, compute the intermediate weight ω′_i by

ω′_{i,t+1} = ω_{i,t} · (1 − α) + α · { 1 if i = k; 0 if i ≠ k }

The intermediate weights are then normalized to form the new weights by

ω_i = ω′_i / Σ_{j=1}^{K} ω′_j
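In code, the weight update and renormalization might look like the sketch below; the mean and variance of the matching descriptor k are updated exactly as in the single Gaussian sketch earlier, and the function name is illustrative.

import numpy as np

def update_weights(omega, k, alpha=0.001):
    # omega'_i = omega_i * (1 - alpha) + alpha * [i == k]
    target = np.zeros_like(omega)
    target[k] = 1.0
    omega = omega * (1.0 - alpha) + alpha * target
    # Renormalize so that the weights again sum to one.
    return omega / omega.sum()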

New Surfaces

If there was no match, the observation represents a new surface, and a potential background (given enough time). The surface is injected into our model by replacing the surface in the model with the least weight:

k = argmin_k ω_k
µ_k = x
σ²_k = σ²_init · 1
ω_k = ω_init

The weights must also be normalized.

2.2.4 Initializing the Model

The model is simply initialized from a single frame, where the color value goes into one descriptor. The variances are set to the input parameter σ²_init. The other descriptors are disabled by setting their weight to zero and will only be used after a new color has been observed.

Ghosting is less of a problem for the GMM model than it is for the single Gaussian model, because the real background will just occupy a new descriptor and not suffer the variance deterioration problem that the single Gaussian method has. Still, since the ghost surface will have a very high weight and the new surface will initially have a very low weight, ghosting can still be an issue.

If the frames are available for random access, it is a very good idea to reduce ghosting by combining several frames for the initialization frame as discussed in section 2.1.3.

Training is also an important step for the Gaussian Mixture Model. Many things that were discussed in section 2.1.3 apply here as well, such as using a dynamic learning rate.

2.3 A Codebook Method

2.3.1 The Background Model

The codebook method models each pixel separately, and the pixel-level background model is called a codebook. Each codebook contains zero or more codewords. A codeword is a tuple of seven values. The contents of a codeword are described in figure 2.3.

Each codeword describes a single background surface. Thus, the codebook method is multi-modal—it can describe more than one kind of background.

Color Description

Of the values in the codeword, only µ, I_lo and I_hi are used to describe which observations would match this codeword. An observation x is said to match a codeword if

colordist(x, µ) < ε ∧ brightness(|x|, I_lo, I_hi)

where ε is specified as a parameter,

colordist(x, µ) = sqrt( |x|² − (x · µ)² / |µ|² )

brightness(I, I_lo, I_hi) = { true if I_min ≤ I ≤ I_max; false otherwise }

I_max and I_min are computed from the previously observed smallest and largest intensities, where α and β are pre-defined parameters:

I_min = α · I_hi
I_max = min( β · I_hi, I_lo / α )

The color distance test checks whether the codeword and the observation have roughly the same color. The squared inner product measures how well the colors align, and dividing it by the squared length of the codeword vector puts both terms in the root expression on the same intensity level.
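A sketch of the match test for one observation against one codeword; the parameter names mirror the symbols above and the default values are illustrative:

import numpy as np

def matches(x, mu, i_lo, i_hi, eps=15.0, a=0.6, b=1.2):
    # colordist(x, mu) = sqrt(|x|^2 - (x . mu)^2 / |mu|^2)
    x = np.asarray(x, dtype=np.float64)
    mu = np.asarray(mu, dtype=np.float64)
    proj2 = np.dot(x, mu) ** 2 / np.dot(mu, mu)
    colordist = np.sqrt(max(np.dot(x, x) - proj2, 0.0))
    # Brightness test on the intensity I = |x|.
    intensity = np.linalg.norm(x)
    i_min = a * i_hi
    i_max = min(b * i_hi, i_lo / a)
    return colordist < eps and i_min <= intensity <= i_max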

Other Codeword Content

The other components of a codeword are all concerned with how often and when the surface has been seen (used as the matching codeword). The frequency f is simply the number of times the codeword has been seen. λ is the maximum negative run-length (MNRL), a measure of the longest period during which the codeword was not seen. t_first and t_last are simply the first and last times the surface has been seen. Time is conveniently expressed in frames. From this follows that a codebook segmenter must keep track of the frame count. These time values are used during training and updating of the model.

Symbol         Description
µ              The color value mean.
I_lo and I_hi  The lowest and highest observed brightness.
f              The number of times this codeword has been observed.
λ              The maximum negative run-length, abbreviated MNRL.
t_first        The first access time.
t_last         The last access time.

Figure 2.3: Contents of a codeword.

2.3.2 Initializing a Codebook

A codebook is always initialized by a sequence of input frames—the training sequence. This sequence is used to construct what is called the "fat codebook". The fat codebook contains every surface which can be observed in the training sequence. After the training sequence has been traversed, the codebook is filtered ("trimmed") in an attempt to remove any foreground surfaces present during training.

More concretely, the codebook is initialized to contain no codewords. Then, for every frame in the training sequence, the following is performed:

(i) Look for a matching codeword (use the first if several match).

(ii) If a match was found, update it (see below).

(iii) If a match could not be found, create a new codeword and insert it into the codebook.

Updating a Codeword

The color is updated as a running average with help from the frequency; the other values are trivially updated:

µ ← (f · µ + x) / (f + 1)
I_hi ← max(I_hi, I)
I_lo ← min(I_lo, I)
f ← f + 1
λ ← max(λ, t − t_last)
t_last ← t

where t is the current time.

Creating a New Codeword

The created codeword is initialized using the current observation x and the current time t

µ = x
I_hi = I
I_lo = I
f = 1
λ = 0
t_first = t
t_last = t
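The codeword operations translate directly into code. The following sketch uses a small dataclass; the field and function names are illustrative, not taken from the thesis.

from dataclasses import dataclass

@dataclass
class Codeword:
    mu: tuple       # color value mean
    i_lo: float     # lowest observed brightness
    i_hi: float     # highest observed brightness
    f: int          # match frequency
    lam: int        # maximum negative run-length (MNRL)
    t_first: int    # first access time, in frames
    t_last: int     # last access time, in frames

def new_codeword(x, intensity, t):
    return Codeword(mu=tuple(x), i_lo=intensity, i_hi=intensity,
                    f=1, lam=0, t_first=t, t_last=t)

def update_codeword(cw, x, intensity, t):
    # mu <- (f * mu + x) / (f + 1): running average of the color.
    cw.mu = tuple((cw.f * m + xi) / (cw.f + 1) for m, xi in zip(cw.mu, x))
    cw.i_hi = max(cw.i_hi, intensity)
    cw.i_lo = min(cw.i_lo, intensity)
    cw.f += 1
    cw.lam = max(cw.lam, t - cw.t_last)
    cw.t_last = t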

Trimming the Fat Codebook

At the end of the training phase, the codebook is trimmed to remove any codewords that can be believed to be foreground. The two tools available for making this decision are the frequency and the maximum negative run-length.

Before any processing occurs, the codebook should be updated so that the MNRL takes the entire training sequence into account. The lengths of the periods before t_first and after t_last are added, and replace the existing MNRL if greater. That is,

λ ← max( λ, t_first + (t − t_last) )

If this is not done, surfaces might get an unrepresentatively low MNRL. If a codeword is matched only twice and in a short period of time, the training procedure described earlier will update the codeword only on the matches, setting the MNRL to the (small) difference in time between the matches. It is clear that the start plus end period is more significant in this case.

The original paper only provided suggestions on how to use the frequency and the MNRL. In general, foreground will have a high maximum negative run-length and a small frequency. Different approaches are appropriate in different cases. Often, if foreground is known to be rare and fast moving, it is sufficient to look only at the frequency. Of course, in these cases it is also true that the MNRL will be very high. It should not, for sequences with a relatively high percentage of background, be difficult to specify f₁ and λ₁ such that keeping only the codewords satisfying

f > f₁ ∧ λ < λ₁

provides a good enough trimming of the codebook.

provides a good enough trimming of the codebook. This test is used throughout this thesis, but it is by no means the only working test.

2.3.3 Classification

Classification using a codebook is similar to the processing of the codebook during training. It comes down to:

(i) Look for a matching codeword (use the first if several match).

(ii) If a match could be found, classify the pixel as background. Also update the model the same way as when training.

(iii) If a match could not be found, classify the pixel as foreground.

2.3.4 Learning New Backgrounds

During the classification phase described above, there is no way for the model to learn new backgrounds. This is acceptable in many cases, but in order to address it, a second codebook can be introduced. This codebook is called the cache by Kim et al., and it stores any background candidates that are found during classification. To differentiate, the old codebook will simply be called the background. Three more parameters are also added:

T_add is the time that a codeword must spend in the cache before it is considered background.

T_delete specifies how long it takes before a codeword is removed from the background if it is not matched.

T_cache specifies how long it takes before a codeword is removed from the cache if it is not matched.

The classification algorithm is updated to maintain the cache:

(i) Look for a matching codeword in the background (use the first if several match).

(ii) If a background match could be found, update the codeword in the same way as when training.

(iii) If a background match could not be found, look for a match in the cache. If a match is found, update it; otherwise create a new codeword in the cache.

(iv) Remove any codewords in the cache that satisfy t − t_last > T_cache.

(v) Move any codewords in the cache that satisfy t − t_first > T_add to the background.

(vi) Remove any codewords from the background that were added during classification and that satisfy t − t_last > T_delete.

(vii) Only when there was a match in the background codebook should the classification be background.

Codewords attained from training are never removed from the background. This layered codebook scheme for background updates was designed to accommodate parked cars and left bags. The training background is held as the true background.
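Steps (i)-(vii) can be put together per pixel as in the sketch below, building on the codeword helpers sketched above. The find_match helper and the learned_online marker used to distinguish codewords added during classification are illustrative devices, not part of the original method description.

def find_match(codewords, x, intensity):
    # First codeword passing the match test, or None.
    for c in codewords:
        if matches(x, c.mu, c.i_lo, c.i_hi):
            return c
    return None

def classify_pixel(x, intensity, t, background, cache,
                   t_add, t_delete, t_cache):
    cw = find_match(background, x, intensity)             # (i)
    if cw is not None:
        update_codeword(cw, x, intensity, t)              # (ii)
    else:
        cached = find_match(cache, x, intensity)          # (iii)
        if cached is not None:
            update_codeword(cached, x, intensity, t)
        else:
            cache.append(new_codeword(x, intensity, t))
    # (iv) Drop stale cache entries.
    cache[:] = [c for c in cache if t - c.t_last <= t_cache]
    # (v) Promote long-lived cache entries to the background.
    promoted = [c for c in cache if t - c.t_first > t_add]
    for c in promoted:
        c.learned_online = True
        cache.remove(c)
        background.append(c)
    # (vi) Expire only codewords that were added during classification.
    background[:] = [c for c in background
                     if not getattr(c, "learned_online", False)
                     or t - c.t_last <= t_delete]
    # (vii) Background only if the background codebook matched.
    return cw is not None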

Discussion

The layered approach to background model maintenance is systematic and straightforward. However, it does add a bit of extra processing for every frame. Care must also be taken to have efficient memory allocation if foreground is frequent, since any foreground will result in codewords being added to the cache. Using a cache such as this also has the drawback of increased memory requirements.

2.4 Color Models

There are many ways to represent color. Color is usually described by a few separate components. RGB is a color model that has become very widespread in personal computing. In RGB there are three components: red, green and blue. The red, green and blue values when combined can represent all the colors of modern computer monitors.

There are other color models which one is less likely to run into, but which are still fairly common. Examples are YUV, HSV, HSL, and CMYK.

The methods described in sections 2.1 and 2.2 are not bound to any specific color model. In fact, they can be used to model any real-valued sequence of any dimension.

2.4.1 YCbCr

YCbCr is a color representation with three components: luma (Y), blue difference (Cb) and red difference (Cr). Luma contains the brightness information and the other components contain the chromaticity. In other words, the Y channel of an image is a rather good grayscale version of a full color image.

It is a digital version of YUV, and the name YUV is commonly used to denote YCbCr as well. Kumar [7] compares RGB and YCbCr as well as a few other color spaces, concluding that YCbCr is the best choice in general for foreground segmentation and shadow detection.

YCbCr is often transformed from RGB so that the Y component is in the range 16–255 (for example, this is used in [7]). However, for the scope of this thesis, the following transformation has been used to convert colors from RGB to YCbCr using the full 0–255 range (also used by the JFIF standard [8]):

T = |  0.299      0.587      0.114     |
    | −0.168736  −0.331264   0.5       |
    |  0.5       −0.418688  −0.081312  |

(Y, Cb, Cr)ᵀ = (0, 128, 128)ᵀ + T · (R, G, B)ᵀ

Note how the sum of the top row of T is 1, whereas the other two rows sum to 0. For a gray color where R = G = B, this means that all the information will end up in the Y component, as expected.
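Applied to a whole image, the transformation is a single matrix product; a minimal sketch, assuming the RGB values lie in the last dimension of the array:

import numpy as np

T = np.array([[ 0.299,      0.587,      0.114    ],
              [-0.168736,  -0.331264,   0.5      ],
              [ 0.5,       -0.418688,  -0.081312 ]])

def rgb_to_ycbcr(rgb):
    # (Y, Cb, Cr) = (0, 128, 128) + T . (R, G, B)
    ycbcr = rgb.astype(np.float64) @ T.T
    ycbcr[..., 1:] += 128.0
    return ycbcr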

It is practical to use YCbCr in image processing when we need to consider brightness. With a color representation such as RGB, the brightness is spread out over all three channels, requiring tedious operations to work with brightness, such as taking the length of the vector ⟨R, G, B⟩ or similar. In YCbCr, the Y component is simply used.

Moreover, brightness variations are common in video data, and thus RGB data generally has greater covariance between channels than the equivalent YCbCr data. Since the color distributions in the single Gaussian and GMM methods have been approximated as having independent color channels, YCbCr should be a better choice than RGB, since the generally lower covariance of YCbCr is closer to the approximation.

YCbCr is often used by video equipment, and video data may be stored in some YCbCr variant. However, all the YCbCr video in this report has been transformed from RGB using the above transformation.


Figure 2.4: A color image and the three channels: Y, Cb and Cr.

2.5 Pre- and Post-processing of Image Data

2.5.1 Low-pass Filters

Applying a low-pass filter to the input sequence before processing can be very helpful in reducing noise. Figure 2.5 shows the difference between running a single Gaussian segmenter on a noisy sequence with and without a low-pass filter. The smoothing effect of the filter lowers the deviation of observations, meaning that we can achieve much lower variances in our model. For example, the images in figure 2.5 were generated with an initial variance of 30, taking the 50th output frame. When the initial variance is reduced to 15, the average-filtered segmentation is still virtually free of noise, whereas the non-preprocessed segmentation is very noisy.

The low-pass filtering was very successful in the example, and the segmentation results without filtering were noisy. This is not always the case. The background here is actually trees and bushes, which wave and reflect light differently over the sequence, so that the majority of the noise comes from the background. Camera noise is usually less of an issue, and the gain from noise reduction through low-pass filtering is then not as big, but can still prove useful.

Low-pass filters are not limited to reducing false positive noise; they can also help to reduce holes in foreground objects which come from only one or a few pixels being similar to the background. Because of the smoothing effect, any pixel will have heavy influence from its neighbors. Average filters, for example, use a relatively small amount of the source pixel; a small 3-by-3 kernel takes 8/9 of its computed value from adjacent pixels.

Figure 2.5: Top: segmentation on original video; bottom: using the same video filtered with a 3x3 average kernel.

There is also a possible downside to low-pass filtering with a 3-by-3 kernel, since foreground pixels could potentially become background pixels. For this to happen, though, the object would have to be similar in color to the background, and the pixel in question must be adjacent to background pixels.

2.5.2 Reducing Image Size

Often, performance can be improved by resampling the input video to a smaller size. By resampling the input to half the width and half the height, the number of pixels is reduced by a factor of four. This would mainly be done because it improves execution time, while also saving memory. When resampling with a lower sampling frequency, it is also customary to low-pass filter the source before actually resampling in order to avoid aliasing. This yields a noise-reducing effect, as was already discussed in section 2.5.1.

However, it also means that every output pixel represents four input pixels. Not all four-pixel patches share the same classification, and if the foreground mask is scaled up again it will become slightly pixelated. This is not acceptable in all situations, and reducing the image can not always be used. It deserves a mention because, used correctly at the right times, it can yield a substantial performance boost.

2.5.3 Blob Size Filtering

Another way to get rid of noise, and to some extent unwanted foreground, is to perform blob size filtering on the segmentation output.

What we do is find all foreground regions—or blobs—in the foreground mask and count their sizes. A new foreground mask is then generated containing only those blobs whose sizes are greater than a specified threshold. Figure 2.6 contains the result of noise reduction through blob size filtering. The noise is effectively removed; however, the noise which is connected to the object remains. Compared to the low-pass filtered result, the connected noise is the only practical difference.

When generating the images in this report, it took 8.6 ms on average to blob size filter a frame; in contrast, applying the 3-by-3 average filter took 78.9 ms. Apart from noise reduction, blob size filtering can also be used with greater size thresholds in order to remove small foreground regions if interesting objects are known to be of some minimum size. For example, if the next step in the processing is interested in humans and the scene is set up so that humans will be captured close up, the threshold can be quite high.

Figure 2.6: Left: segmentation on original video; right: blob size filtered to remove blobs with size less than four pixels.
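Blob size filtering is conveniently expressed with a connected-component labeling routine. A sketch using scipy.ndimage follows; the choice of 4-connectivity (the scipy default) is an assumption, since the thesis does not state which connectivity was used.

import numpy as np
from scipy import ndimage

def blob_size_filter(mask, min_size=4):
    # Label connected foreground regions in the binary mask.
    labels, n = ndimage.label(mask)
    if n == 0:
        return mask.copy()
    # Count the pixels of each blob; label 0 is the background.
    sizes = np.bincount(labels.ravel())
    keep = sizes >= min_size
    keep[0] = False
    # New mask containing only the blobs that are large enough.
    return keep[labels]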

2.5.4 Morphological Operations

Morphological operations, mostly closing, can also be used to improve the foreground mask. For example, it is often known that the objects we are interested in are convex, but since many segmentation methods treat each pixel separately, mistakes can be made that leave holes in the foreground region. Performing a closing operation can help rid the foreground mask of small holes (although closing is mostly powerless against large holes). Figure 2.7 shows a frame of a projectile with two problems. Firstly, it has some kind of mark which assumes a dark green color similar to the background at that point. Secondly, it passes right over a white pole. In the middle picture, these two problems manifest as holes in the foreground mask. Executing a simple closing operation on the mask helps with both these problems.

Closing can potentially be dangerous, since it can connect several foreground regions into one.


Figure 2.7: Top: input; middle: single Gaussian segmentation; bottom: after a closing operation.

3 Segmentation-Guided Tracking

This chapter evaluates the previously detailed segmentation methods in the context of tracking. First, the pre-existing tracking algorithm is introduced. After that, the methods are compared on a number of typical input sequences, which is followed by a discussion of the results.

Appendix A contains some output frames of the segmentations made in this chapter. It can be useful to look there when reading the chapter.

3.1 Introduction to the Tracking Algorithm

The target tracker is a center of gravity tracker operated from within an interactive environment. The tracker operates frame by frame by first deciding the set of pixels that the tracked object occupies and then computing a single point from that set to decide the object's location. The output of the tracker is a sequence of such points.

The tracker is initialized manually by selecting a pixel location within the object in some frame. Then, a grayscale version of the frame is thresholded with a low and a high threshold. The object is made up of all pixels in the frame whose grayscale value is within the given thresholds and which form a connected region around the selected pixel location.

Subsequent frames follow the same pattern, except for how the pixel location is selected. Instead of requiring manual input for each frame, the tracker uses its earlier results to predict a value for the new frame. The prediction is then used as a starting point for a search for a connected region of pixels within the thresholds, which—if it can be found within a user-set search area—is designated as being the tracked object.

In case the prediction is not good enough, so that no connected region is found nearby, the tracker is suspended until manually corrected and continued. The connected region must lie entirely within the given search area, so there is a possibility that the tracker fails if the object passes the boundaries of the search area, for example if the object grows.

The point location is found by computing the center of gravity of the found object. The center of gravity is found by

⟨x, y⟩ = (1 / ‖S‖) · Σ_{s∈S} ⟨x_s, y_s⟩

where S is the set of all pixels s in the connected region, and ⟨x_s, y_s⟩ is the point location of pixel s. ‖S‖ denotes the number of pixels in the region.
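In code, the center of gravity is simply the mean of the region's pixel coordinates; a minimal sketch (the function name is illustrative):

import numpy as np

def center_of_gravity(region_mask):
    # region_mask is a binary image, True for the pixels s in S.
    ys, xs = np.nonzero(region_mask)
    # <x, y> = (1 / ||S||) * sum of <x_s, y_s> over S
    return xs.mean(), ys.mean()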

3.1.1 Problems

The problem with the tracker is that it is potentially hard to find good lasting values for the thresholding operation. If the background is similar to the object, it can be impossible to choose values that both cover the object well and do not cover the background. Figure 3.1 shows a thresholding attempt where the interesting object is a projectile, but there is also a similarly colored vertical pole present. Covering the entire projectile would require a lower threshold, which would lead to too much of the pole being covered. In this particular sequence, there are also problems later due to changes in the background. Thresholds that are good for one frame might be wrong for another, since the object might move to a location where the background is within the thresholds, or worse, the object itself can change appearance.

Figure 3.1: Attempt at thresholding the first frame. The red outline marks the boundaries of the extracted object.

3.1.2 How Foreground Segmentation Helps

The idea for improvement is simple. The problematic part of tracking with this method is object extraction through thresholding. A foreground mask however—being only black and white—is trivial to threshold. Since, ideally, the foreground mask is white in the pixels occupied by the object and black in all other pixels, using the foreground mask as input to the unaltered tracker yields the correct output with the center of gravity technique, without the need for specifying thresholds at all. Figure 3.2 shows an example of a foreground mask and how it can be used by the tracker to correctly compute a point location for the same frame that was so problematic in figure 3.1.

Figure 3.2: Top: foreground mask; bottom: outline and location (cross) of the projectile retrieved with the foreground mask.

3.2 Interesting Input

3.2.1 Sequences Shot with High-Frequency Cameras

One class of sequences that we would like to perform tracking in is sequences of fast-moving objects taken with high-frequency cameras. Typical for these sequences is that one or more objects move relatively fast against a static background. Since high-frequency cameras are used, sequences are likely to have a low frame count while the frame rate is high; the actual wall-clock time that passes during a sequence will be small.

3.2.2 Laboratory Sequences

These are sequences of video filmed from above, facing down. They are called laboratory sequences because they are typically shot in a controlled laboratory environment. Examples of typical input are shown in figure 3.3. The targets are marked by red squares, and are small animals. The desired output is the animals' trajectories.

Figure 3.3: Top: fish larva swimming around in a tank; bottom: a mouse exploring a circular room.

Typical for these sequences is that they have lots of background that is mostly static, since they are filmed in such controlled environments. This is similar to the conditions in the projectile sequence, except the backgrounds tend to be more uniform.

The biggest difference from the high-frequency sequence class is that these sequences most likely are longer.

3.3 The Projectile Sequence

The projectile sequence depicts a projectile flying over a forest background where people, cars and structures are also present. The sequence is 370 frames long, during which the projectile travels in a somewhat straight line from the right side to the left. The first and last frames are shown in figure 3.4. The sequence was shot with a high-frequency camera.

The sequence includes some noteworthy difficulties. There is foreground present in all frames, which makes it hard to create a good background model. The projectile also passes over some areas with a similar color to itself. The people also make some movements, which are not interesting to detect as foreground.

The most difficult problem present in the sequence is the fact that the projectile passes over people that move.

3.3.1 Method Comparison

Four different methods are compared: the three methods described in chapter 2 along with frame difference—a simple frame subtraction followed by a thresholding.

Every method except for the codebook was evaluated for both RGB and YCbCr input. The codebook method's match criterion is coupled with RGB because of its notion of intensity and color distance; a new set of match criteria would be needed for YCbCr data to be useful as input to a codebook. The input was also averaged with a 3-by-3 kernel in all cases, since the background in the sequence is quite noisy.

Frame Difference

Frame difference was performed by subtracting the first frame of the sequence from the current one, and then thresholding the Euclidean distance in the difference image. That is, if g_{t,x,y} represents the pixel value at time t and location ⟨x, y⟩ and f_{t,x,y} represents the foreground mask at the same time and location, we have

f_{t,x,y} = { foreground if |g_{t,x,y} − g_{0,x,y}| > δ; background otherwise }

For these experiments, δ = 40 was used.

Figure 3.4: The first and last frames of the projectile sequence, with the projectile's trajectory roughly marked as a straight line.

Applying this procedure to the projectile sequence produces visually decent results, but several problems exist. Some problem highlights are depicted in figure 3.5. Since foreground is present in all frames, ghosting immediately disqualifies this method as a good method for segmentation. The method was included as a reference, as it represents a very simple segmentation method.

In this case, ghosting could be avoided by clever switching of the background frame. For example, always subtracting a frame 10 frames forwards or backwards in the sequence would probably yield a trackable sequence. There would still be a ghost, but it would not interfere with the tracker. However, this would rely a lot on the user, and is a sequence-specific solution. It would become increasingly hard to predict a good background frame in this way for sequences where more objects are in the scene or if the same object would return to the same location multiple times. For this study, the first few frames were simply skipped so that a trajectory could be produced.

Figure 3.5: Three problems with the output of the frame difference method. It gives an unstable shape, cannot handle changes in the scene, and exhibits ghosting.

Single-Gaussian

The single Gaussian method was used together with a dynamic learning rate. For each frame, the effective learning rate $\alpha'$ was computed by

$$\alpha' = \max\left(\alpha, \frac{0.5}{t}\right), \quad t = 1, 2, \ldots$$

The initial means were constructed by combining multiple frames as described in section 2.1.3. The parameters were

$$\sigma^2_{\text{init}} = 15, \quad \alpha = 0.001, \quad \delta_{fg} = 0.9$$

As training, the whole sequence was first processed from start to end before the actual classification began. After training, the background model was only updated if the observation was classified as background. For RGB input, the output looked good throughout the sequence. For YCbCr, however, the variance from the training phase was so high that the projectile looked deformed, especially in the early frames, where the dynamic learning rate is at its highest.
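A minimal sketch of the per-pixel update with this dynamic learning rate, assuming the running mean and variance are kept as float arrays (the exact update equations are given in chapter 2; updating the variance against the new mean, as done here, is an assumption):

```python
import numpy as np

def update_single_gaussian(mean, var, frame, t, alpha=0.001):
    """One update step of the running Gaussian background model.
    mean, var, frame: H x W x C float arrays; t: frame index (1-based)."""
    lr = max(alpha, 0.5 / t)        # dynamic learning rate, large early on
    mean = (1.0 - lr) * mean + lr * frame
    var = (1.0 - lr) * var + lr * (frame - mean) ** 2
    return mean, var
```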

Gaussian Mixture Model

The Gaussian mixture model method was run with the following parameters for the projectile sequence:

$$K = 3, \quad \sigma^2_{\text{init}} = 40, \quad B = 0.7, \quad \alpha = 0.001, \quad \lambda = 7$$

The model was trained by processing the full input sequence prior to classification. The initial means were taken from a combination of frames to reduce ghosting, and the learning rate was dynamic, computed by

$$\alpha' = \max\left(\alpha, \frac{0.5}{t}\right), \quad t = 1, 2, \ldots$$
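For illustration, the parameter $B$ controls how many mixture components are treated as background. The sketch below selects background components in the Stauffer–Grimson style, ranking components by weight over standard deviation; whether chapter 2 uses exactly this ranking is an assumption:

```python
import numpy as np

def background_components(weights, variances, B=0.7):
    """Return indices of the mixture components treated as background.
    Components are ranked by weight / standard deviation, and the
    smallest prefix whose cumulative weight exceeds B is selected.
    weights: normalized 1-D array; variances: matching 1-D array."""
    order = np.argsort(-weights / np.sqrt(variances))  # best-ranked first
    cum = np.cumsum(weights[order])
    b = np.searchsorted(cum, B, side="right") + 1      # smallest b with sum > B
    return order[:b]
```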

The output of this method tended to look more stable than that of the other methods. The shape of the projectile is well preserved, although there is a tendency to add extra pixels at the border of the object in the foreground mask. This is a result of the average filter, however, and not an error in the segmentation. If the border is stable and surrounds the whole object, the center of gravity is preserved, and we will not have an error.

One unexpected issue with the sequence was discovered when applying the GMM segmenter: some frames had noise on either side of the projectile. This noise comes and goes as the projectile passes, but the noise itself does not move. Looking at the difference between a noisy frame and the first frame revealed that there are actually ghost-like spots around the projectile, one dark and one bright. See figure 3.6. Thus, it is not really an error of the segmenter; it does detect a deviation from the background model, but it is a nuisance nevertheless. It is absent from the single Gaussian method's output because the variance there is high due to the projectile.

Figure 3.6: Top: sub-region of the foreground mask; bottom: the difference between the first and the segmented frame, scaled and shifted for clarification (gray is zero).


Codebook

This is simply the method described in section 2.3. The chosen parameters were

$$\epsilon = 15, \quad \alpha = 0.6, \quad \beta = 1.2$$

The fat codebook was constructed from the entire input sequence. The codebook was then trimmed by keeping only codewords with $\lambda < \frac{3 \cdot 370}{4}$ and $f > \frac{370}{6}$ (remember that the input sequence was 370 frames long).

It was difficult to get a nice segmentation with the codebook. Selecting a low $\epsilon$ preserves the shape of the projectile well, but produces more noise; selecting a high $\epsilon$ degrades the shape of the projectile, but produces less noise. Since the intensity does not vary much in this sequence, $\alpha$ and $\beta$ were not as important.
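For reference, here is a sketch of a codeword match test in the style of Kim et al.'s codebook model, separating color distortion from a brightness band. The variable names, and whether section 2.3 uses exactly these bounds, are assumptions; `v`, `I_min` and `I_max` denote the codeword's mean color vector and its learned brightness extremes:

```python
import numpy as np

def codebook_match(pixel, v, I_min, I_max, eps=15.0, alpha=0.6, beta=1.2):
    """True if the pixel matches the codeword: its color distortion from
    v is at most eps and its brightness lies within the allowed band."""
    x = np.asarray(pixel, dtype=np.float64)
    I = np.linalg.norm(x)                                # observed brightness
    # Color distortion: distance from x to the line spanned by v.
    proj2 = float(np.dot(x, v)) ** 2 / max(float(np.dot(v, v)), 1e-12)
    colordist = np.sqrt(max(float(np.dot(x, x)) - proj2, 0.0))
    # Brightness band derived from the learned extremes.
    in_band = alpha * I_max <= I <= min(beta * I_max, I_min / alpha)
    return colordist <= eps and in_band
```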

3.3.2 Quality of Tracker Output

Apart from visual inspection of the foreground mask, the relatively simple trajectory of the projectile lends itself to closer inspection of the trajectory quality. The projectile makes no turns, but continues along a near-straight path. Unfortunately, there is no ground truth to compare with, so the exact size of the error cannot be known. However, everything indicates that the trajectory is a smooth curve; that is, the projectile does not make small irregular movements up and down as it flies through the air. This section describes an experiment that measures which method yields the smoothest output.

Coordinates

The coordinates that the tracker produced were first transformed to new coordinate systems by rotation and translation, in order to simplify and allow for more detailed plots. Each new coordinate system is laid out so that the origin is positioned at the last point in the trajectory, and the horizontal axis goes from the last to the first point. The vertical axis is perpendicular to the horizontal axis. Figure 3.7 shows a rough illustration of how the new bases are positioned in the input sequence.
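A sketch of this change of basis, assuming the tracker output is an N x 2 array of (x, y) points (the helper name is illustrative):

```python
import numpy as np

def to_trajectory_frame(points):
    """Rotate and translate tracker points into the per-trajectory
    coordinate system: origin at the last point, horizontal axis
    pointing from the last point towards the first point."""
    p = np.asarray(points, dtype=np.float64)
    origin = p[-1]
    d = p[0] - origin                  # direction of the new horizontal axis
    u = d / np.linalg.norm(d)
    v = np.array([-u[1], u[0]])        # perpendicular (vertical) axis
    rel = p - origin
    return np.stack([rel @ u, rel @ v], axis=1)
```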

Figure 3.7: How the bases of figure 3.8 are positioned (yellow lines). The red reference line is parallel to the image's horizontal axis.

Seven experiments were performed, and each resulting trajectory was transformed based on its own start and end points. The absolute values do not matter much because, again, the interesting property is the smoothness of the trajectory. The trajectories are shown in figure 3.8.

Apparently, the projectile's trajectory is not a straight line; it has some parabolic shape to it. It can also be seen that the frame difference and codebook methods are worse than the single Gaussian and GMM methods, since their trajectories contain larger transients. The single Gaussian and GMM results look similarly good.

Because of the parabolic shape, a quadratic polynomial was fitted to each trajectory. The estimated polynomials are also plotted in figure 3.8 as red lines. The polynomials can be used to put some numbers to the trajectory plots: by computing the standard deviation of the residuals between the tracker output and the estimated curve, a numeric smoothness measure is obtained. The standard deviations of the residuals, sorted by method, were:

Method               Standard deviation
GMM (RGB)            0.127
GMM (YCbCr)          0.136
SG (RGB)             0.143
SG (YCbCr)           0.176
Frame diff. (RGB)    0.204
Frame diff. (YCbCr)  0.241
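The smoothness measure is straightforward to reproduce; a sketch using numpy's least-squares polynomial fit, assuming x and y hold the transformed tracker coordinates:

```python
import numpy as np

def trajectory_smoothness(x, y):
    """Fit a quadratic polynomial y(x) and return the standard deviation
    of the residuals, i.e. the deviation from the smooth curve."""
    coeffs = np.polyfit(x, y, deg=2)          # least-squares quadratic fit
    residuals = y - np.polyval(coeffs, x)     # tracker output minus fit
    return residuals.std()
```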


Again, the GMM method and the single Gaussian seem to be the best choices. The GMM trajectories are a bit less varying than the single Gaussian trajectories for this sequence. The best quality was achieved with the GMM method taking RGB input.

Object Size

The sizes of the extracted objects during tracking were also collected and plotted in figure 3.9.

Once again, the smoother the better. The projectile does actually grow in pixel count during the sequence. Figure 3.10 shows manual segmentations of the projectile in the first and last frames of the sequence: in the last frame it occupies 394 pixels, while in the first frame it occupies only 263 pixels. The dashed lines in figure 3.9 are straight lines between these two values, included for reference.

Of the plots in figure 3.9, the GMM plot looks best, followed by the single Gaussian, while frame difference and codebook are a lot worse. Frame difference and codebook both have large fluctuations and do not grow as they should. This hints at what was seen earlier: those methods do not give nice shapes, and parts of the object disappear at times.

The single Gaussian grows, but it differs from the hand-segmented values at both the start and the end. It is a bit more varying than the GMM plot and not as linear, and it produces objects that are too small in the beginning of the sequence. Both color models behave roughly the same, with YCbCr producing slightly smaller objects in general. Because of the average filter, we should expect slightly larger object sizes from the single Gaussian compared with the hand-segmented values. However, that is not true until the end of the sequence.

The GMM plot has a clear trend, but it is worse than the single Gaussian in that it has a few spikes in the beginning. This can be explained by the pseudo-periodic noise that was observed earlier (see figure 3.6). The GMM curves are also translated up by around 50–100 pixels. This is due to the fact that the average filter adds some false foreground on the borders of the object. Again, the average filter acts in all directions, so the foreground mask should grow in all directions, leaving the center of gravity theoretically untouched. RGB and YCbCr input yield similar object sizes, but YCbCr suffers less from the spike phenomenon.

3.4 The Fish Larva Sequence

The fish larva sequence shows a small creature swimming around in a tank. Foreground is present in all frames, which makes training a bit more difficult. There is also some slow-moving “boring” foreground in the form of reflections around the borders of the tank.

In addition, the sequence is quite noisy; figure 3.11 depicts a background pixel from the middle of the tank. Apart from some high-frequency noise, there is also a clear sinusoidal pattern in the blue and green color channels.

The main object of interest is a larva consisting of a head and a tail. The head contrasts quite well against the background and covers a reasonable area. The tail, however, is very thin, and its tip fades almost seamlessly into the background.

A final technical complication is that the sequence is interlaced, while the larva has a habit of making very quick turns, moving its tail very swiftly. Initially, attempts were made to segment the interlaced sequence, but the larva moved too quickly and the tail would end up disconnected between the fields (see figure 3.12). To solve the interlacing problem, every second row was discarded from the input.
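In code, this crude deinterlacing amounts to keeping every second row, at the cost of halving the vertical resolution (whether the even or odd field was kept is not stated):

```python
def keep_one_field(frame):
    """Discard every second row so that only one field remains.
    A proper deinterlacer would interpolate; decimation is simpler."""
    return frame[::2]
```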

3.4.1 Method Comparison

Single-Gaussian

The single Gaussian method runs into a few problems when applied to this sequence. First, there is foreground present in all frames, meaning that the ghosting problem must be addressed. Second, the background changes quite a bit due to the sinusoidal pattern. Third, the object of interest has two body parts that deviate from the background by different amounts, which is potentially bad.

Figure 3.8: Trajectories for the various methods in blue. The red curve is a quadratic polynomial that approximates the trajectory. The left column shows the result of an RGB segmentation, while the right column shows the same method with YCbCr input. Panels: (a) frame difference and threshold; (b) single Gaussian method; (c) GMM method; (d) codebook method.

Figure 3.9: Area of the extracted object as a function of time.

Thanks to the static background, the ghost can be removed from the initial means by combining frames. The foreground still causes deteriorated variance, however, which is the actual cause of the third issue.

The issue is that the head of the larva deviates more from the background than the tail does. Since we train the model with foreground present, the variances in the locations that contain foreground during training will become high. The magnitude of the increased variance depends on the size of the foreground's deviation. It is therefore possible that one object, such as the larva's head, will increase the variance so much that another object, such as the larva's tail, is classified as background. Figure 3.13 illustrates this, and figure 3.14 shows a potential outcome of such a scenario. Objects broken in two are very detrimental to the tracker, since it always works by finding a single connected region.

The whole sequence was processed to train the model, with the same dynamic learning rate as for the projectile sequence. The parameters used were

$$\sigma^2_{\text{init}} = 15, \quad \delta_{fg} = 0.9, \quad \alpha = 0.001$$

After the initial training phase, the model was locked so that observations classified as foreground did not update the model. Figure 3.11 shows that the larva sequence changes considerably over just the 1860 frames used for testing, so we still need to update the background somewhat. Adapting only to background prevents the model from learning new background surfaces. This is not a problem for the fish sequence, however, and should not be for any laboratory sequence, because the scene is usually set up so that there is only one background.
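A sketch of such a selective update, reusing the running-Gaussian update from before but zeroing the learning rate wherever the observation was classified as foreground (array layout is an assumption):

```python
import numpy as np

def update_where_background(mean, var, frame, bg_mask, alpha=0.001):
    """Update the locked background model only at pixels classified as
    background. bg_mask: boolean H x W; mean, var, frame: H x W x C."""
    lr = alpha * bg_mask[..., None].astype(np.float64)  # 0 on foreground
    mean = (1.0 - lr) * mean + lr * frame
    var = (1.0 - lr) * var + lr * (frame - mean) ** 2
    return mean, var
```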

The output from running this method was bad in the beginning of the sequence, where the variance was high from the training phase. The problem is that the
