
Privacy Protecting Surveillance: A Proof-of-Concept Demonstrator


Institutionen för systemteknik

Department of Electrical Engineering

Master's thesis (Examensarbete)

Privacy Protecting Surveillance: A proof-of-concept demonstrator

Master's thesis carried out in Informationskodning (Information Coding) at the Institute of Technology, Linköping University

by

Fredrik Hemström

LiTH-ISY-EX–07/3877–SE

Linköping 2015

Department of Electrical Engineering, Linköpings universitet, SE-581 83 Linköping, Sweden


Supervisors: Jörgen Ahlberg, Thomas Chevalier, Hedvig Sidenbladh
Examiner: Robert Forchheimer (ISY), Linköpings universitet

Linköping, 12 May 2015


Division, Department: Informationskodning, Department of Electrical Engineering, SE-581 83 Linköping
Date: 2015-05-12
ISRN: LiTH-ISY-EX–07/3877–SE
URL for electronic version: http://www.ep.liu.se

Title: Privacy Protecting Surveillance: A proof-of-concept demonstrator
Swedish title: Demonstrator för integritetsskyddad övervakning
Author: Fredrik Hemström


Abstract

Visual surveillance systems are increasingly common in our society today. There is a conflict between the demands for public security and the demands to preserve personal integrity. This thesis suggests a solution in which parts of the surveillance images are covered in order to conceal the identities of persons appearing in video, but not their actions or activities. The covered parts could be encrypted and unlocked only by the police or another legal authority in case of a crime.

This thesis implements a proof-of-concept demonstrator using a combination of image processing techniques such as foreground segmentation, mathematical morphology, geometric camera calibration and region tracking.

The demonstrator is capable of tracking a moderate number of moving objects and concealing their identities by replacing them with a mask or a blurred image. Functionality for replaying recorded data and unlocking individual persons is included.

The concept demonstrator shows the chain from concealing the identities of persons to unlocking only a single person in recorded data. Evaluation on a publicly available dataset shows overall good performance.

(8)
(9)

Acknowledgments

This master's thesis summarizes work carried out at FOI during 2006-2007. I would like to thank FOI for letting me do my thesis while working for them. I would like to express my gratitude to Jörgen Ahlberg for insightful comments, guidance and the opportunity to work at FOI. Another notable mention is Thomas Chevalier, for his help during the demonstration.

A big thank you to the people who have proofread the thesis.

Finally, I would like to thank the employees at FOI, with whom I have shared many interesting discussions during the coffee breaks.

Linköping, April 2015 Fredrik Hemström

(10)
(11)

Contents

1 Introduction
1.1 Structure of the thesis
1.2 Background
1.3 Project goal
1.4 Method
1.5 Prior work
1.6 Design
1.7 Alternative design
1.8 System overview
1.9 Limitations

2 Foreground Segmentation
2.1 Background
2.2 Challenges
2.3 Prior work
2.4 Algorithm and parameter selection
2.5 Description of Mixture of Gaussians
2.6 Description of Codebook algorithm
2.6.1 Codebook and codewords
2.6.2 Color space representation
2.6.3 Training of the model
2.6.4 Classification
2.6.5 Layering
2.7 Dataset for parameter selection
2.8 Results
2.8.1 Evaluation on dataset
2.8.2 Discussion

3 Mathematical Morphology Filtering
3.1 Background
3.2 Prior work
3.3 Introduction to morphology
3.3.1 Dilation
3.3.2 Erosion
3.3.3 Opening
3.3.4 Closing
3.3.5 Examples
3.4 Morphology with different scales
3.5 Results

4 Region Tracker
4.1 Background
4.2 Prior work
4.3 Method
4.4 Region tracker
4.5 Improved region tracker
4.6 Elementary classification
4.6.1 Speed classification
4.6.2 Size classification
4.7 Results
4.8 Further improvements

5 Geometric Camera Calibration and Scale Approximation
5.1 Background
5.2 Prior work
5.3 Method
5.4 Description of a camera model
5.4.1 Extrinsic model
5.4.2 Intrinsic model
5.5 Calculation of a scale map
5.6 Approximation of a scale map
5.7 Results
5.7.1 Experiment
5.7.2 Discussion
5.8 Further work

6 Privacy Protection
6.1 Prior work
6.2 Method
6.3 Identity extraction algorithm
6.4 Blurring
6.5 Mask coverage
6.6 Estimation of data-overhead
6.7 Results
6.8 Further improvements

7 Complete Demonstrator
7.1 Developed software
7.1.1 Player
7.1.2 Live demonstrator
7.2 Results
7.3 Discussion
7.4 Further improvements

8 Conclusions

Bibliography


1 Introduction

Visual surveillance systems are increasingly common in our society today. You can hardly take a walk in the center of a modern city without being recorded by several surveillance cameras. There is a conflict between the demands for public security and the demands to preserve the personal integrity of the individuals in the public. The rising number of surveillance sensors introduces a problem: how can the personal integrity of the people being watched by the sensors be preserved?

Currently there is a growing interest in these issues at FOI. Research is conducted with applications to anti-terrorism, urban crisis management and airport security in mind.

FOI has suggested a solution to this problem in the report New Systems for Urban Surveillance (H. Sidenbladh, 2005), in which parts of the surveillance images are covered in order to conceal people's identities, but not their actions or activities. The covered parts could be encrypted and unlocked only by the police or another legal authority in case of a crime. The proposed system is called Integrity Preserving Surveillance (IPS). In the literature this is also known as privacy protection. The vision of the IPS system is shown in figure 1.1.

1.1 Structure of the thesis

The thesis is divided into three parts.

• In the first part, Chapter 1 - Introduction, the background to the problem is presented along with prior work, method, limitations, goal and the proposed system design.


Figure 1.1: Illustration of the vision of the Integrity Preserving Surveillance.

• In the second part, chapters 2 to 6, each module in the system is described. In each chapter the theory, prior work, results and further work are presented.

• In the third part, chapters 7 and 8, the results of the previous chapters are summarized and final conclusions are presented.

1.2 Background

Consider a typical surveillance camera placed in a public space, figure 1.2. Its normal operation is to continuously capture images and send them to a server for storage. The video stream may be monitored by a camera operator. The images are stored in a long-term storage area, such as a hard drive or optical disc. The stored images contain a lot of personal data: where and when a person has been, what he or she was wearing and so on. In case of an incident the data can be replayed and viewed by an operator or legal authority. During this process people not part of the incident may have their identities revealed. There is also a concern that the video footage may leak, for example if the storage media is stolen or the storage server is hacked.

Now consider a smart camera with a built-in algorithm that extracts the identity of a person and replaces it with a blurred image or a mask, figure 1.3. Depending on the purpose of the camera, it could be optional whether the extracted identity is encrypted or deleted. If the video is stored on long-term media, the identity of a person will remain concealed until it is decrypted by a legal authority. The system will require different keys to unlock each individual. The concept of unlocking only a single individual and keeping the rest concealed is illustrated in figure 1.1. This could enable cameras to be placed in environments where cameras normally are not allowed, for example to prevent pick-pocketing in a locker room.

Given a correctly detected and tracked object, further analysis can be performed without the need for camera images. Such analysis could include people counting, object classification and speed estimation. A set of rules could be determined and, if triggered, warn an operator when something abnormal occurs, such as people walking on a railroad track.

Figure 1.2: Operation of a typical surveillance system.

Figure 1.3: A smart camera with the IPS concept running inside. The identity is removed before the image leaves the camera.

1.3 Project goal

The main goal of the project was to implement a proof-of-concept to demonstrate the IPS system. Part of the work towards the goal is to identify, implement and evaluate suitable image processing algorithms to achieve the vision of the IPS system. The demonstrator should run in real time, demonstrating the capability to preserve the integrity of a person. The final goal is to have a demonstrator that can be shown at an exhibition.

1.4 Method

A literature study was made to identify suitable algorithms. The focus was on algorithms with real-time performance. A selection of algorithms was implemented and evaluated on a publicly available dataset. Improvements to the algorithms were made to suit the IPS system.

1.5 Prior work

A concept similar to the IPS system has been proposed by IBM (Senior et al., 2003); however, no specific implementation details are presented. Masking of faces is proposed by (Martínez-Ponte et al., 2005), applying a face detector and tracking faces over time.

The concept of detecting and tracking objects in video is a wide research area. Many algorithms are based on foreground segmentation in order to detect objects; the detected objects are then tracked over time. The most commonly used foreground segmentation algorithm is the Mixture of Gaussians model as proposed by (Stauffer and Grimson, 1999). Alternative detection methods include specifically trained detectors, as proposed by (Sidenbladh, 2004) and (Viola and Jones, 2001), among others.

Notable proposed real-time systems include (Haritaoglu et al., 1998), where silhouettes are used for tracking. (Zhai et al., 2006) proposed the KNIGHT system, which features a region tracker with a pixel-voting approach to associate foreground segments with objects over time. Both systems use a foreground segmentation approach.

Each chapter in this thesis presents a more detailed description of prior work.

1.6 Design

Since the system shall run in real time, computationally cheap methods are favored where possible. A foreground segmentation algorithm is selected as the basis for the object detection. The use of foreground segmentation has several advantages:

• The system is not constrained to detect only one or a few types of objects.

• The foreground segmentation can be used for the integrity preservation when covering an object.

• It is a proven technique in several surveillance systems.

To keep objects separated, a region tracker similar to (Zhai et al., 2006) is used, and its pixel voting scheme is reused in the integrity preservation.

1.7 Alternative design

Other design options include the use of a trained detector for human motion as proposed in (Sidenbladh, 2004). The disadvantage of a trained detector is that it can detect only the types of objects it was trained to detect. For instance, a walking pedestrian may look different from a human sitting down. To detect different appearances of objects, several detectors may need to be trained. There is also the issue of supplying relevant data for the training. For instance the face detector trained by (Viola and Jones, 2001) uses approximately 20,000 examples. The process of manually labeling training data is time consuming.


Figure 1.4: System overview.

1.8 System overview

The input to the system consists of camera images; the output is the same images but with the identity extracted and replaced with a mask or a blurred image. The output images can be viewed by an operator or stored on disk. The system is divided into a number of modules, each responsible for a certain task. An illustration of the system can be seen in figure 1.4. The information in the system flows as follows:

• The Input module delivers images as input to the system. Images come from a network camera or prerecorded data on disk. No chapter is assigned to this module.

• The Foreground segmentation module creates a binary image and is responsible for detecting segments that could potentially belong to objects. In the system two different foreground segmentation algorithms were implemented.

• The Morphology module is used to clean noise from the foreground segmentation and add an extra level of protection to the binary mask should the segmentation fail.

• The Region tracker module tracks the foreground segments. It is responsible for assigning segments to objects. If no corresponding object exists, a new object (track) is created. The module uses scale information to set better-tuned thresholds when assigning regions to tracks.

• The Scale approximation module contains functionality for approximating a scale map, where the scale for each pixel in the camera image is approximated.


• The Privacy protection module uses data from both the tracker and the foreground segmentation to separate privacy-sensitive data from the images and replace it with colored regions or a blurred mask. The separated data are encrypted and stored inside the images.

• The Output module is responsible for writing data to disk or displaying it. The output data can also be replayed using the developed player. No chapter is assigned to this module.

1.9 Limitations

The focus is on the selection of algorithms used in the demonstrator. Questions regarding encryption, data storage, key handling and legal issues are out of scope. Specific implementation details are also left out.


2 Foreground Segmentation

The foreground segmentation is responsible for segmenting moving objects. The output is used both by the region tracker and as the basis for the privacy protection. In this chapter two foreground segmentation methods are described, implemented and evaluated. A dataset is developed to determine a good set of parameters for each algorithm. This chapter is outlined as follows: background, challenges with foreground segmentation, prior work, description of two foreground segmentation algorithms and finally evaluation on datasets.

2.1 Background

Foreground segmentation is the process of extracting foreground objects from video sequences. It is also called background subtraction because the main goal is to eliminate the background, leaving only the segmented objects. This is usually the first step in any surveillance system and also an important task in computer vision.

The result of foreground segmentation is a binary image where white pixels (ones) represent objects and black pixels (zeros) are classified as background. Figure 2.1 shows an example of a successful foreground segmentation.

2.2 Challenges

Surveillance cameras are usually mounted on buildings or posts without the ability to pan and tilt. Video from these cameras has a background that is mostly stationary. Even with stationary scenes there are several challenges; the most common are listed below:


Figure 2.1: Example of output from a foreground segmentation. (a) Input image. (b) Output segmentation.

• For outdoor scenes, illumination changes are difficult to deal with. The global lighting conditions may change during the course of the day, with sudden periods of clouds or sunlight.

• Illumination also causes other problems: an object may cast a shadow that we do not want to detect as part of the object.

• Non-stationary background objects may cause problems. For example, leaves on trees move when the wind blows.

• A new stationary object may remain in the scene for a long time. Consider a broken-down car left on the side of the road. The algorithm needs to determine when the object is considered static and part of the scene.

• Camouflage: when an object has a color similar to the background, it is difficult to distinguish between object and background.

2.3 Prior work

There is a lot of prior research in this area. According to the survey by (Yilmaz et al., 2006), the field of foreground segmentation dates back to 1979. A common approach in foreground segmentation is to estimate a model of the background scene. The background model is updated based on previous data to adapt to changes in the environment. When a new image arrives it is matched against the background model; this classification is usually done per pixel. A pixel that is not part of the model is considered an object.

A popular approach is to model each pixel as a mixture of Gaussian distributions, as proposed by (Stauffer and Grimson, 1999). They used an online algorithm to update the parameters of the Gaussian distributions. This paper is usually included as a base reference when new foreground segmentation algorithms are proposed. The technique is further described in section 2.5.


Figure 2.2: Output from Mixture of Gaussians with different parameter settings. (a) High threshold for foreground. (b) Low threshold for foreground.

A Codebook technique was proposed by (Kim et al., 2005), where pixels are clustered into codewords without the assumption that they originate from Gaussian noise. The Codebook technique is described in section 2.6.

Other approaches include estimating the background probabilities at each pixel from many recent samples over time using kernel density estimation (Elgammal et al., 2000).

2.4 Algorithm and parameter selection

Two different foreground segmentation algorithms were implemented and evaluated: the popular Mixture of Gaussians by (Stauffer and Grimson, 1999) and the Codebook algorithm by (Kim et al., 2005). Both algorithms claim real-time performance. The Codebook algorithm also claims the ability to handle shadows. To determine a good set of parameters, a small benchmark dataset with hand-annotated ground truth was developed.

The choice of foreground segmentation algorithm and parameters depends on both the tracker module and the integrity preserving module. The region tracker and the privacy protecting module perform best under the same ideal conditions: the objects are well separated, not connected to each other, and only pixels inside the objects are classified as foreground.

The visual appearance is also important for a demonstrator, because it gives the first impression of a new technique. The parameters should be selected so as to eliminate shadows, because they are not pleasant to look at. Figure 2.2 shows the output of a foreground segmentation algorithm with different parameters. In the first image the whole person is segmented and the identity is well hidden, but underneath the person the shadow is also included. Furthermore, the two pedestrians furthest to the left in figure 2.2 are segmented into one large segment.

By using different parameters, better separation and fewer artifacts from shadows can be achieved, at the expense of less coverage of the pedestrian. The second image is more visually pleasing because most viewers do not consider shadows to be objects.

2.5 Description of Mixture of Gaussians

A Mixture of Gaussians (MOG) algorithm was proposed by (Stauffer and Grimson, 1999). The algorithm models the background at each pixel with a number of Gaussian distributions. Each distribution is assigned a weight w. The mean μ and variance σ² of the distribution, along with the weight, are updated online based on the recent pixel history.

For simplicity, the following equations are based on a 1-dimensional signal, i.e. typical grayscale video. Extending to color is straightforward; more details can be found in (Stauffer and Grimson, 1999). The probability of observing a pixel with intensity x_t at time t is given by the probability density function (PDF)

P(x_t) = \frac{1}{K} \sum_{i=1}^{K} w_{i,t} \, \mathcal{N}(x_t; \mu_{t,i}, \sigma_{t,i}^2)    (2.1)

where

\mathcal{N}(x; \mu, \sigma^2) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}.    (2.2)

K is the number of Gaussian distributions used at each pixel; typical values for K range from 1 to 5. The function N is the PDF of a Gaussian distribution with mean μ and variance σ².

A pixel x_t is a match, and is considered to belong to distribution i, if it fulfills the following equation:

\text{Match}(i) = \begin{cases} 1 & \text{if } \frac{|x_t - \mu_{t,i}|}{\sigma_{t,i}} < Z \\ 0 & \text{otherwise,} \end{cases}    (2.3)

that is, if the pixel is within Z standard deviations. Typical values of Z range from 2 to 3; (Stauffer and Grimson, 1999) use Z = 2.5.

At each pixel there are K distributions, ordered in descending order by the key w_i/σ_i. A new pixel x_t is matched against all distributions. If a match is found, μ_i and σ_i are updated with

\mu_{t,i} = (1 - p)\,\mu_{t-1,i} + p\,x_t    (2.4a)

\sigma_{t,i}^2 = (1 - p)\,\sigma_{t-1,i}^2 + p\,(x_t - \mu_{t,i})^2    (2.4b)

where

p = \alpha\,\mathcal{N}(x_t; \mu_{t,i}, \sigma_{t,i}^2).    (2.4c)

If no match is found, the distribution with the lowest weight is replaced with one whose mean corresponds to x_t and whose initial weight is low. The weights for all distributions are updated as follows:

w_{i,t} = (1 - \alpha)\,w_{i,t-1} + \alpha \cdot \text{Match}(i),    (2.5)

where Match(i) is one for the matched distribution and zero otherwise. The parameter α is the learning rate. All weights are normalized. The tuning of α is important since it dictates how fast the model adapts to new changes. If the value is large the model adapts quickly, but on the downside, if an object, e.g. a person, is standing still, it will soon be blended into the background model. Given that the K distributions are sorted by w_i/σ_i, the first B distributions are considered to be the model of the background, where

B = \arg\min_b \left( \sum_{i=1}^{b} w_i > T \right).    (2.6)

The parameter T determines the minimum portion of distributions that model the background. A pixel x_t is considered background if it matches any of the first B distributions.
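To make the update concrete, here is a minimal per-pixel sketch of the MOG model in Python with NumPy. It is an illustration of equations 2.1 to 2.6 rather than the thesis implementation; it simplifies equation 2.4c by using the constant learning rate α for p, and all parameter defaults are illustrative.

```python
import numpy as np

class MixtureOfGaussians:
    """Minimal per-pixel MOG background model for grayscale video,
    a sketch of (Stauffer and Grimson, 1999), equations 2.1-2.6."""

    def __init__(self, shape, k=3, alpha=0.01, z=2.5, t=0.7):
        h, w = shape
        self.k, self.alpha, self.z, self.t = k, alpha, z, t
        self.mu = np.random.uniform(0, 255, (k, h, w))  # means
        self.var = np.full((k, h, w), 225.0)            # variances
        self.w = np.full((k, h, w), 1.0 / k)            # weights

    def apply(self, frame):
        """Update the model with one frame and return the foreground mask."""
        x = frame.astype(np.float64)
        match = np.abs(x - self.mu) / np.sqrt(self.var) < self.z      # eq. 2.3
        first = match & (np.cumsum(match, axis=0) == 1)  # first match per pixel
        p = self.alpha  # simplification of eq. 2.4c
        self.mu = np.where(first, (1 - p) * self.mu + p * x, self.mu)        # 2.4a
        self.var = np.where(first, (1 - p) * self.var + p * (x - self.mu) ** 2,
                            self.var)                                        # 2.4b
        self.w = (1 - self.alpha) * self.w + self.alpha * first              # 2.5
        self.w /= self.w.sum(axis=0, keepdims=True)
        # Where nothing matched, replace the weakest distribution with x.
        none = ~match.any(axis=0)
        weakest = np.argmin(self.w, axis=0)
        for i in range(self.k):
            sel = none & (weakest == i)
            self.mu[i][sel], self.var[i][sel], self.w[i][sel] = x[sel], 225.0, 0.05
        # The first B distributions sorted by w/sigma model the background (2.6).
        order = np.argsort(-self.w / np.sqrt(self.var), axis=0)
        w_sorted = np.take_along_axis(self.w, order, axis=0)
        in_bg = (np.cumsum(w_sorted, axis=0) - w_sorted) < self.t
        first_sorted = np.take_along_axis(first, order, axis=0)
        return ~(first_sorted & in_bg).any(axis=0)  # True = foreground
```

A typical use is to construct mog = MixtureOfGaussians(frame.shape) once, then call mask = mog.apply(frame) for every incoming grayscale frame.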

2.6 Description of Codebook algorithm

The Codebook (CB) algorithm proposed by (Kim et al., 2005) adopts a clustering technique where pixels are grouped into codewords. The idea is to model each image pixel as a codebook containing one or more codewords. A single codeword contains information about color, intensity, and temporal information such as how often it is updated. The number of codewords for each pixel depends on the background complexity. A static scene may be modeled with only one codeword per pixel. More complex backgrounds, like waving trees, might need more codewords to describe the variations.

2.6.1 Codebook and codewords

Let each pixel x_i have a codebook C_i = {c_1, c_2, ..., c_L} containing L codewords. Each codeword c contains information about color, intensity and temporal behavior. The elements of a codeword are described in table 2.1. For easier notation a codeword will be described as a 9-tuple.

Table 2.1: Parameters of the 9-tuple.

v = {R, G, B}: the mean R, G, B color assigned to the codeword.
I_min: the minimum intensity assigned to the codeword.
I_max: the maximum intensity assigned to the codeword.
f: the frequency with which the codeword has occurred.
λ: the maximum negative run-length (MNRL), defined as the longest interval during which the codeword has not been updated.
t_last: the last time the codeword was accessed.
t_first: the first time the codeword was accessed.

2.6.2 Color space representation

In experiments performed by (Stauffer and Grimson, 1999), it was concluded that illumination changes from global illumination and shadows cause the pixels to be distributed along an axis l going toward the origin (0, 0, 0) in RGB space. The principle is that objects will change in intensity but their color will remain the same. The matching of a pixel x against a codeword c is separated into a color distance criterion and an intensity criterion. Both criteria need to be true in order for a pixel to match a codeword.

Color distance

The color distance between the mean color of a codeword v = {R_j, G_j, B_j} and a pixel x = {R, G, B} is denoted δ, see figure 2.3. It is defined as the orthogonal distance between the line l and the pixel x:

\text{colordist}(x, v) = \delta = \sqrt{\|x\|^2 - p^2}    (2.8a)

\|x\|^2 = R^2 + G^2 + B^2    (2.8b)

p^2 = \|x\|^2 \cos^2\theta = \frac{(R_j R + G_j G + B_j B)^2}{R_j^2 + G_j^2 + B_j^2}    (2.8c)

where p is the length of the projection of x onto v; for more information refer to (Kim et al., 2005).

Brightness distance

The brightness of a pixel x = {R, G, B} is computed as

I = \|x\| = \sqrt{R^2 + G^2 + B^2}.    (2.9)

To allow for brightness changes in the detection, statistics for the maximum I_max and minimum I_min brightness of the codeword are stored in the 9-tuple.


Figure 2.3: Illustration of the color distance δ between the mean color v of a codeword and a new pixel x in RGB space.

The intensity range for each codeword is defined as

I_{\text{lower}} = \alpha I_{\max}    (2.10)

I_{\text{upper}} = \min\left(\beta I_{\max}, \frac{I_{\min}}{\alpha}\right)    (2.11)

where α and β are global parameters. Typical values suggested by (Kim et al., 2005) are α = 0.7 and β = 1.3.

A pixel is considered to be part of the codeword if its intensity is between the lower and upper bound. The logical brightness function is defined as

\text{brightness}(I) = \begin{cases} \text{true} & \text{if } I_{\text{lower}} < I < I_{\text{upper}} \\ \text{false} & \text{otherwise} \end{cases}    (2.12)

Updating of a codeword

When a pixel x_t = {R, G, B} is found to be a match for codeword c_j, the 9-tuple c_j = ⟨v = {R_j, G_j, B_j}, I_min, I_max, f, λ, t_last, t_first⟩ is updated by the following equations:

\{R_j, G_j, B_j\} \leftarrow \left(\frac{f R_j + R}{f+1}, \frac{f G_j + G}{f+1}, \frac{f B_j + B}{f+1}\right)    (2.13)

\{I_{\min}, I_{\max}\} \leftarrow \{\min(I, I_{\min}), \max(I, I_{\max})\}    (2.14)

\{f, \lambda, t_{\text{last}}, t_{\text{first}}\} \leftarrow \{f + 1, \max(\lambda, t - t_{\text{last}}), t, t_{\text{first}}\}    (2.15)
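Taken together, the color criterion (2.8), the brightness criterion (2.10-2.12) and the update rules (2.13-2.15) form one match-and-update step. The following Python sketch illustrates that step for a single pixel; the class and parameter names (Codeword, eps) are ours, and the code is an illustration, not the thesis implementation.

```python
import math

class Codeword:
    """One codeword of the 9-tuple in table 2.1 (illustrative sketch)."""
    def __init__(self, rgb, intensity, t):
        self.v = list(rgb)                    # mean color {R, G, B}
        self.i_min = self.i_max = intensity   # intensity bounds
        self.f = 1                            # frequency
        self.lam = t - 1                      # maximum negative run-length
        self.t_last = self.t_first = t        # temporal information

def colordist(x, v):
    """Orthogonal distance delta between pixel x and codeword axis v (eq. 2.8)."""
    xx = sum(c * c for c in x)
    vv = sum(c * c for c in v)
    xv = sum(a * b for a, b in zip(x, v))
    p2 = (xv * xv) / vv if vv > 0 else 0.0
    return math.sqrt(max(xx - p2, 0.0))

def brightness_ok(i, cw, alpha=0.7, beta=1.3):
    """Intensity criterion (eqs. 2.10-2.12)."""
    lower = alpha * cw.i_max
    upper = min(beta * cw.i_max, cw.i_min / alpha)
    return lower < i < upper

def match_and_update(codebook, x, t, eps=20.0):
    """Match pixel x = (R, G, B) at time t against a list of codewords.
    Returns True if x is classified as background."""
    i = math.sqrt(sum(c * c for c in x))
    for cw in codebook:
        if brightness_ok(i, cw) and colordist(x, cw.v) < eps:
            f = cw.f
            cw.v = [(f * vc + xc) / (f + 1) for vc, xc in zip(cw.v, x)]  # 2.13
            cw.i_min, cw.i_max = min(i, cw.i_min), max(i, cw.i_max)      # 2.14
            cw.f, cw.lam = f + 1, max(cw.lam, t - cw.t_last)             # 2.15
            cw.t_last = t
            return True
    codebook.append(Codeword(x, i, t))   # no match: new codeword (foreground)
    return False
```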

2.6.3 Training of the model

The training process over the first N frames is outlined in the algorithm below, where ε is a global color distance threshold and N the number of training frames.

(I) For t = 1 to N:
  (i) x_t = {R, G, B}, I = sqrt(R² + G² + B²)
  (ii) Find a codeword c_j that meets conditions (a) and (b):
    (a) brightness(I) = true
    (b) colordist(x_t, v_j) < ε
  (iii) If a match is found, update the codeword.
  (iv) Otherwise, create a new codeword by setting c_{L+1} = ⟨v = {R, G, B}, I_min = I, I_max = I, f = 1, λ = t - 1, t_last = t, t_first = t⟩.
(II) End

After the training process, a codebook may contain codewords not only from background pixels but also from a foreground object that was present during the training phase. A codeword is considered a foreground object if it is only occasionally observed, thus having a higher λ (maximum negative run-length) than a codeword belonging to the background. To remove such codewords, a filter is applied where codewords with λ > N/2 are removed.

2.6.4 Classification

After the initial training is done, classification is performed in a process similar to the training. A new pixel x_t is matched against all codewords until a match is found. The matched codeword is updated and the pixel x_t is considered background. Likewise, if no match is found, the pixel is considered foreground. Note that there is no need to evaluate both the color distance and brightness criteria if one of them returns false. The brightness criterion is evaluated first because it is less expensive.

2.6.5 Layering

The background can change after the initial training. Consider an outdoor scenario where a billboard sign is changed or a car is parked on the street. If the algorithm does not adapt to the environment, it will eventually start to detect false foreground and background pixels.

To solve this problem, (Kim et al., 2005) suggest using a cache system. Given the original background model M of codebooks, a second layer H of codebooks, called the cache, is added with the same structure as described above. When a new pixel arrives it is matched against the first layer M. If a match is found, the corresponding codeword is updated and the pixel is considered background. An unmatched pixel in the first layer is always considered foreground, but it is propagated into the second layer H, where it is matched and updated just as in the first layer M. If no match is found, a new codeword is created and inserted in the layer H.

Codewords in the layer H that stay for a certain amount of time H_add are considered part of the background and are moved into the background layer M. Similarly, codewords that have not been updated for a certain amount of time H_remove are removed from the background layer M. Codewords in the cache layer H that are only occasionally observed are removed from the layer if the maximum negative run-length λ is bigger than the threshold H_λ.
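The promotion and pruning bookkeeping described above can be sketched as follows, reusing the Codeword fields from the earlier sketch; the threshold argument names (h_add, h_remove, h_lam) are ours.

```python
def update_layers(background, cache, t, h_add, h_remove, h_lam):
    """Two-layer bookkeeping for the Codebook model (section 2.6.5).
    `background` is layer M, `cache` is layer H; both are lists of
    Codeword objects as in the earlier sketch."""
    # Promote cache codewords that have persisted for at least h_add frames.
    promoted = [cw for cw in cache if t - cw.t_first >= h_add]
    background.extend(promoted)
    # Drop background codewords not updated for h_remove frames.
    background[:] = [cw for cw in background if t - cw.t_last <= h_remove]
    # Drop cache codewords that were promoted or are only occasionally
    # observed (maximum negative run-length above h_lam).
    cache[:] = [cw for cw in cache if cw not in promoted and cw.lam <= h_lam]
```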

2.7 Dataset for parameter selection

The foreground segmentation algorithms presented above have several parameters. Depending on the choice of parameters, different performance is reached. A set of parameters may give very few false foreground pixels, but the result may not contain enough pixels to use as a mask. Conversely, if too many pixels are classified as foreground, the pedestrians may look like one big blob, with foreground regions spilling into the shadows.

To test the performance of the foreground segmentation algorithms, a dataset was constructed containing manually labeled pedestrians. The aim of the dataset is to find a good set of parameters that corresponds to the criteria discussed in section 2.4.

A total of 60 pedestrians were labeled. These pedestrians were randomly chosen from 3 different cameras, during a recording period of 3 hours. The cameras were set up as typical surveillance cameras. Figure 2.4 shows examples of 3 labeled pedestrians. The yellow region is pixels belonging to the pedestrian, the red region is pixels that belong to the background, and the black region is pixels not considered during the evaluation. For each pedestrian a patch is extracted. These patches contain both a foreground mask and a background mask. The area around the pedestrian is labeled as background. The foreground mask is annotated so that it completely includes the object. The region in between annotations may contain both the object and background due to frame interlacing. This region is annotated as "don't care" and is not used for the evaluation.


Figure 2.4: Example of manually labeled pedestrians. The yellow region is pixels that completely belong to the pedestrian; the red region is background pixels in the region near the object.

2.8 Results

The two implemented algorithms were first tested on the manually labeled dataset in order to find suitable parameters, and then evaluated on the (PETS, 2001) and (PETS, 2006) datasets.

2.8.1 Evaluation on dataset

Each algorithm was given multiple sets of parameters. Each set was applied on the dataset. The evaluated video was approximately 3 hours long. During the evaluation, labeled frames were extracted and the output mask from the foreground segmentation algorithm was compared with the manually labeled mask.

Parameter selection

The total space of all parameter permutations is huge. Evaluating all parameters would take an unfeasible amount of time. For the MOG algorithm, σ was varied. The update rate w was fixed to a reasonable value based on experiments. Since all sequences in the dataset have a static background, the number of Gaussians was set to K = 1. For the CB algorithm, α, ε and β were discretized into steps of 0.1 and tested in the range suggested by (Kim et al., 2005).
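A parameter sweep of this kind is straightforward to script. The sketch below uses a hypothetical evaluation hook, evaluate_codebook, and illustrative grid bounds, since the exact grid is not listed in the text.

```python
from itertools import product

def evaluate_codebook(eps, alpha, beta):
    """Hypothetical hook: run CB segmentation on the labeled dataset
    with these parameters and return (TPR, FPR)."""
    raise NotImplementedError

grid = product(
    [10, 20],                                     # epsilon (color distance)
    [round(0.4 + 0.1 * i, 1) for i in range(4)],  # alpha: 0.4 .. 0.7
    [round(1.1 + 0.1 * i, 1) for i in range(7)],  # beta: 1.1 .. 1.7
)
results = [(eps, a, b, *evaluate_codebook(eps, a, b)) for eps, a, b in grid]
```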

Performance measurement

Each labeled pedestrian is compared to the foreground segmentation output of each evaluated parameter set. Pixels within the labeled regions are classified as either true positive (TP), true negative (TN), false positive (FP) or false negative (FN). The ratio of correctly classified pixels within the object, also called the true positive rate (TPR), is formulated as

TPR = \frac{TP}{TP + FN}    (2.16)

The ratio of wrongly classified background pixels, also called the false positive rate (FPR), is formulated as

FPR = \frac{FP}{TN + FP}    (2.17)

Each set of parameters will give a different TPR and FPR. Ideally the TPR would be 1 and the FPR would be 0. Figure 2.5 shows the relation between TPR and FPR, also called a ROC curve (receiver operating characteristic), for a representative set of parameters for each algorithm. In total, 120 sets of parameters were evaluated on the dataset.

Figure 2.5: ROC curve for a representative set of parameters for both the Codebook algorithm (blue) and the Mixture of Gaussians algorithm (red).
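Given a binary prediction and the labeled masks from section 2.7, the two rates can be computed in a few lines; this is a minimal sketch, and the mask names are ours.

```python
import numpy as np

def tpr_fpr(pred, gt_fg, gt_bg):
    """TPR and FPR (eqs. 2.16-2.17) for a binary prediction `pred`, given the
    labeled foreground (yellow) and background (red) masks from section 2.7.
    Pixels outside both masks are "don't care" and are ignored."""
    tp = np.sum(pred & gt_fg)
    fn = np.sum(~pred & gt_fg)
    fp = np.sum(pred & gt_bg)
    tn = np.sum(~pred & gt_bg)
    return tp / (tp + fn), fp / (tn + fp)
```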

Figure 2.6 shows 5 samples from one of the evaluated sequences. The values from the corresponding ROC curves are also shown in figure 2.6 along with the ground truth.

Method        TPR (%)   FPR (%)   Parameters
Ground truth  100       0
CB(1)         96.1      5.8       ε = 10, α = 0.6, β = 1.7
CB(20)        94.8      2.9       ε = 20, α = 0.7, β = 1.1
CB(14)        92.2      1.6       ε = 20, α = 0.6, β = 1.1
CB(8)         88.0      0.9       ε = 20, α = 0.5, β = 1.1
MOG(6)        84.6      6.9       K = 1, w = 0.99, σ = 10
MOG(9)        68.4      1.5       K = 1, w = 0.99, σ = 20
MOG(10)       60.0      0.86      K = 1, w = 0.99, σ = 25

Figure 2.6: Examples of segmentation output for a selection of different parameter setups on the dataset.


Observation

As seen in figure 2.5, the Codebook algorithm (CB) performs better than the Mixture of Gaussians (MOG) on this dataset. This is in line with the observation in (Kim et al., 2005), where CB is also found to have better performance. At roughly 6% FPR, it can be seen in figure 2.6 that CB(1) and MOG(6) give different types of false positives (FP). For MOG most of the false positives come from shadows, whereas for CB they originate from artifacts in the image background. The color distance threshold associated with this is set fairly low at 10. It is reasonable to assume that the image compression creates the artifacts, since JPEG compresses color more heavily than intensity, and noise in the compression is picked up when using a low threshold.

At 1.5% FPR, the head of sample 2 is classified as background by MOG(9). At a similar FPR, the CB(14) algorithm classifies the head correctly. Furthermore, the TPR for the CB algorithm is much higher at the same FPR.

2.8.2 Discussion

The ROC curve in figure 2.5 shows that the Codebook algorithm is better on this dataset. The main advantage of CB is the ability to handle soft shadows. As seen in figure 2.8, neither of the two methods can handle darker shadows, for example underneath the car.

Given the results on the dataset, a reasonable parameter setup is CB(8), with TPR 88.0% and FPR 0.9%, where soft shadows are handled nicely. At the corresponding FPR of 0.86%, the MOG algorithm, MOG(10), has a lower TPR of 60.0%; it is noticeable in figure 2.6 that parts of the pedestrian are missing. Figures 2.7 and 2.8 show the CB(8) and MOG(9) parameter setups evaluated on two publicly available datasets, (PETS, 2001) and (PETS, 2006). Similar to the manually labeled dataset, the Codebook algorithm produces slightly more filled objects; however, it contains more random noise.

Neither algorithm gives a perfect segmentation. The largest problems include the camouflage problem, where the background is similar to the foreground. This can be seen for both algorithms in figures 2.7b and 2.7c, where part of the cap is missing. Failed foreground detection could lead to failure in the privacy protection. In chapter 3 a morphological operation is described that in some cases can preserve the privacy protection.

Sudden illumination changes have not been studied, since no segmentation problems related to illumination changes were found in the evaluated datasets. Non-static objects are handled differently by the two algorithms. The MOG algorithm slowly blends the object into the background based on the parameter α. The CB algorithm has a more distinct approach, where new pixels are stored in a cache and added to the background model after a certain time. An object will blend into the background after a considerable amount of time regardless of the choice of algorithm. The effect of this has not been studied.


Figure 2.7: Output from the PETS 2006 dataset given the CB(8) and MOG(9) parameter sets. (a) Input frames. (b) Output from the CB(8) parameter set. (c) Output from the MOG(9) parameter set.

Figure 2.8: Output from the PETS 2001 dataset given the CB(8) and MOG(9) parameter sets. (a) Input frames. (b) Output from the CB(8) parameter set. (c) Output from the MOG(9) parameter set.


3 Mathematical Morphology Filtering

This chapter describes the post-processing filtering applied to the foreground segmentation. The chapter is outlined as follows: background, prior work, an introduction to mathematical morphology, a discussion of how scale affects morphology, and results from experiments.

3.1 Background

Mathematical morphology, or just morphology, is a set of fundamental operations on binary images. These operations are typically used as a post-processing step to clean and reduce noise from the output of a foreground segmentation algorithm. In the IPS system it is used to remove noise and add an extra level of protection if the foreground segmentation fails.

3.2 Prior work

The introduction of mathematical morphology dates back to the work of Serra in 1964 (Ronse et al., 2005).

3.3 Introduction to morphology

Consider a binary image X, where the pixels x ∈ X are either one or zero, and a structuring element B, which itself can be seen as a binary image. Two typical structuring elements are shown in figure 3.1. The basic idea is to apply a set of rules, called a morphological operator, at every pixel in the image. The inputs to the morphological operator are the binary image X and the structuring element B. The output is a binary image O; the resulting image may be smaller than the input image X, since the structuring element B needs to be contained inside the image. Typical morphological operators include dilation, erosion, opening and closing.

(a) 3x3 square:

1 1 1
1 1 1
1 1 1

(b) 7x7 disk:

0 0 1 1 1 0 0
0 1 1 1 1 1 0
1 1 1 1 1 1 1
1 1 1 1 1 1 1
1 1 1 1 1 1 1
0 1 1 1 1 1 0
0 0 1 1 1 0 0

Figure 3.1: Two commonly used structuring elements. (a) 3x3 square. (b) 7x7 disk.

3.3.1 Dilation

Dilation is a morphological operator that expands the segments in a binary image. It is often used to fill holes. The structuring element is applied at each pixel x: if the center of the structuring element B is over a true pixel, then all pixels under the structuring element B are set to true in the output image O.

3.3.2 Erosion

Erosion can be seen as the opposite of dilation; it shrinks the segments in the binary image. The structuring element is applied at each pixel x: if all pixels covered by the structuring element are true, then the output pixel in O is set to true, otherwise false.

3.3.3 Opening

The two basic operations, erosion and dilation, can be combined. The opening operator consists of an erosion operator followed by a dilation operator. This will filter out any element smaller than the structuring element. Different structuring elements can be used for the erosion and the dilation.

3.3.4 Closing

Closing consists of a dilation operator followed by an erosion operator. This is useful for filling small holes.
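For reference, the four operators can be written directly from the definitions above. The sketch below operates on binary NumPy images, uses zero padding at the borders as a simplification, and assumes symmetric structuring elements like those in figure 3.1.

```python
import numpy as np

def dilate(img, se):
    """Dilation: output is true wherever the structuring element,
    centered at a true pixel, covers the position (section 3.3.1)."""
    h, w = se.shape
    py, px = h // 2, w // 2
    padded = np.pad(img.astype(bool), ((py, py), (px, px)))
    out = np.zeros(img.shape, dtype=bool)
    for dy in range(h):
        for dx in range(w):
            if se[dy, dx]:
                out |= padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out

def erode(img, se):
    """Erosion: output is true only where the structuring element
    fits entirely inside the segment (section 3.3.2)."""
    h, w = se.shape
    py, px = h // 2, w // 2
    padded = np.pad(img.astype(bool), ((py, py), (px, px)))
    out = np.ones(img.shape, dtype=bool)
    for dy in range(h):
        for dx in range(w):
            if se[dy, dx]:
                out &= padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out

def opening(img, se):   # erosion followed by dilation (section 3.3.3)
    return dilate(erode(img, se), se)

def closing(img, se):   # dilation followed by erosion (section 3.3.4)
    return erode(dilate(img, se), se)
```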

3.3.5 Examples

Figure 3.2 shows different morphology operators applied to a binary image X. Structuring elements with smoother edges, like a disk, will give smoother results.

Figure 3.2: Different structuring elements applied to the foreground output in figure 3.2a. (a) Foreground segmentation output, image X. (b) Erosion operator with a 3x3 square element. (c) Dilation operator with a disk-shaped 7x7 element. (d) Closing operator with a 3x3 square element. (e) Closing operator with a 7x7 disk-shaped element. (f) Opening operator with a 3x3 square element.

3.4 Morphology with different scales

The morphological operators are per-pixel filter operations. An object close to the camera will thus be filtered differently than an object further away. Consider figure 3.3b, where the foreground segmentation has failed on the backpack. A closing operation is applied with a structuring element of size 7x7 in figure 3.3c. If the same object is closer to the camera, say 50% bigger, then applying a 7x7 element would still leave a hole in the backpack, figure 3.3d. Scaling the structuring element by roughly 50%, to 10x10, removes the hole, figure 3.3e.


Figure 3.3: A closing operator applied at different scales gives different results. (a) Input image, 90 pixels. (b) Segmentation, 90 pixels. (c) 7x7 closing operator applied to a pedestrian 90 pixels high. (d) 7x7 closing operator applied to a pedestrian 120 pixels high. (e) 10x10 closing operator applied to a pedestrian 120 pixels high.

Consider a scale map S(x), which gives the relative scale in the image for every pixel x, see chapter 5. Using the scale map, a single base size for the structuring element can be chosen and then applied at different scales in the image.

3.5 Results

The output of the foreground segmentation is not always perfect. Morphological operations can be applied to connect split objects into a single region and to cover pixels where the segmentation has failed.

In experiments, an erosion operation with a 3x3 square structuring element to remove noise, followed by a dilation operation with a disk-shaped structuring element of size 7x7, showed overall good results. Given the foreground segmentation output of the Codebook algorithm with parameter set CB(8), the results on the (PETS, 2006) and (PETS, 2001) datasets are shown in figures 3.4 to 3.6.
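The cleaning step just described can be expressed compactly. This sketch uses scipy.ndimage, which is not necessarily the tooling used in the thesis; the 7x7 disk is taken from figure 3.1.

```python
import numpy as np
from scipy import ndimage

# The 7x7 disk-shaped structuring element from figure 3.1.
DISK7 = np.array([[0, 0, 1, 1, 1, 0, 0],
                  [0, 1, 1, 1, 1, 1, 0],
                  [1, 1, 1, 1, 1, 1, 1],
                  [1, 1, 1, 1, 1, 1, 1],
                  [1, 1, 1, 1, 1, 1, 1],
                  [0, 1, 1, 1, 1, 1, 0],
                  [0, 0, 1, 1, 1, 0, 0]], dtype=bool)

def clean_mask(mask):
    """Erosion with a 3x3 square to remove noise, followed by dilation
    with the 7x7 disk to fill holes and reconnect split segments."""
    eroded = ndimage.binary_erosion(mask, structure=np.ones((3, 3), dtype=bool))
    return ndimage.binary_dilation(eroded, structure=DISK7)
```

In a pipeline, clean_mask would be applied to each binary frame produced by the foreground segmentation before it reaches the region tracker.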

Figure 3.6 shows a difficult case for the foreground segmentation, where parts of the coat blend into the background. The morphological operations are able to almost cover the pedestrian in the front. The objects in the back are similar to the background; in this case the morphology was not able to cover the entire pedestrian.

Applying a morphological operator at different scales was omitted, since only a small performance improvement was noticed during experiments and the added computational cost does not justify it. Instead, setting a single threshold for each morphological operation, as described above, showed good performance.

In most observed situations the morphology adds to the privacy protection, at the cost of a less distinctive mask. A less distinctive mask may add to the privacy protection, but it can hinder an operator from identifying an activity.

Figure 3.4: Evaluation with an erosion 3x3 square operator followed by a 7x7 dilation operator on (PETS, 2006). (a) Image. (b) Foreground. (c) Erosion. (d) Dilation.


Figure 3.5: Evaluation with an erosion 3x3 square operator followed by a 7x7 dilation operator on (PETS, 2001). (a) Image. (b) Foreground. (c) Erosion. (d) Dilation.

Figure 3.6: Evaluation with an erosion 3x3 square operator followed by a 7x7 dilation operator on (PETS, 2006). (a) Image. (b) Foreground. (c) Erosion. (d) Dilation.


4 Region Tracker

The region tracker uses input from the foreground segmentation and associates each detected region with an object. The output from the tracker is a set of estimated objects with size, velocity and position. In this chapter an improved region tracker is proposed, based on a region tracker from the literature. The chapter is outlined as follows: background, prior work, method, description of a region tracker, improvements made to the region tracker, and finally evaluation on a publicly available dataset.

4.1 Background

Tracking an object throughout the scene is essential in most surveillance applications. In the IPS system the output from the tracker is used to separate individuals, which allows for the individual encryption of objects. Tracking information is also used to visualize colored masks for the operator.

4.2 Prior work

There is a lot of previous research on tracking algorithms with different approaches. The survey by (Yilmaz et al., 2006) divides the approaches into three main groups.

• Point Tracking, where detected objects are represented by a point. A detected point could for example be the output of a trained classifier. An external tracking framework could then be used to associate successive points into a track by using a model.


• Kernel Tracking, where the shape and appearance of an object are tracked from frame to frame. The object is usually related between frames by a transformation such as translation, rotation or affine. Common approaches include template matching, where a region is selected and tracked over time.

• Silhouette Tracking, where the goal is to track the silhouette or contour of an object.

There are also several proposed systems for detection and tracking. (Haritaoglu et al., 1998) present a system where people's activities are tracked and classified by shape analysis and contour tracking. (Zhai et al., 2006) describe a surveillance system where foreground regions are tracked by applying a voting scheme to associate regions with objects.

4.3 Method

There are many interesting tracking approaches. A straightforward approach would be to use a human detector such as (Sidenbladh, 2004) and apply a point tracker. However, a human detector can only detect human objects and is also limited by the pose; a human sitting and a human standing could require two different detectors. A more general approach was needed for the IPS system. Tracking objects by contour would be suitable, but it was not evaluated due to time constraints.

We build on the region tracker idea from (Zhai et al., 2006), where each pixel in each region votes for the best object match. This voting scheme is later reused for creating the privacy protection mask.

4.4 Region tracker

This section describes the region tracker proposed by (Zhai et al., 2006). A detected foreground region does not always correspond to an object. A region could contain part of an object, two or more merged objects, or false foreground segments. It is assumed that a region bigger than a certain size in pixels often corresponds to an object. The region tracker proposed by (Zhai et al., 2006) assigns each region to an object (track). If two regions merge or a region splits, the tracker handles this by applying the set of rules described below.

An object P_k is modeled by its color, shape, size and motion. The color is modeled by a normalized histogram h_k. The shape is modeled by a two-dimensional Gaussian distribution N_k, with the variance equal to the width and height of the object. A linear prediction model is used to model the motion.

A new pixel x_t inside a region R_i (x_t ∈ R_i) votes for the object P_k it most likely belongs to, i.e. the object for which the joint probability of shape and color is maximal:

\arg\max_k \; N_k(x_t)\, h_k(x_t)    (4.1)


Then, the following tracking mechanisms, with occlusion, entry, merging, splitting and exiting reasoning, are used:

Let V_{i,k} be the number of votes from a region R_i for an object P_k. Let n_k be the number of pixels represented by the object P_k.

• If (V_{i,k}/n_k) > T and (V_{i,q}/n_q) < T for all q ≠ k, that is, if a significant percentage T of the pixels in region R_i vote for a single object P_k, then all pixels in the region are used to update the object. If more than one region conforms to this condition, all such regions are used to update the object model. In this case the object has been split into multiple regions.

• If (V_{i,k}/n_k) > T and (V_{i,q}/n_q) > T, that is, if two objects merge into a single region, then only those pixels in R_i that voted for P_k are used to update the model of P_k.

• If (V_{i,k}/n_k) < T for all i, that is, if no observation matches model k. This may happen if the object is occluded or has exited the scene. If the predicted position of the object is near the image border, the object is considered to be out of the scene and is deleted. Otherwise the linear velocity prediction is updated and the rest of the parameters are kept constant.

• If (V_{i,k}/n_k) < T for all k, that is, if a region R_i does not match any model, this means a new entry. An object is created for this region and is updated by the pixels in the region.
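A sketch of the per-pixel vote in equation 4.1: each foreground pixel scores every object by the product of the object's spatial Gaussian and its color histogram, then votes for the best one. The object fields assumed here (pos, width, height, hist) mirror the object model detailed in section 4.5, width and height are treated as standard deviations, and the code is illustrative rather than the thesis implementation.

```python
import numpy as np

def pixel_votes(pixels, colors, objects):
    """Each pixel (x, y) with an RGB color votes for the object maximizing
    the joint shape/color probability (eq. 4.1). Returns votes[i] = k."""
    votes = np.empty(len(pixels), dtype=int)
    for i, ((x, y), rgb) in enumerate(zip(pixels, colors)):
        best, best_p = -1, -np.inf
        for k, obj in enumerate(objects):
            # Two-dimensional Gaussian shape prior N_k (diagonal covariance,
            # width/height treated as standard deviations).
            dx, dy = x - obj.pos[0], y - obj.pos[1]
            shape_p = np.exp(-0.5 * (dx ** 2 / obj.width ** 2
                                     + dy ** 2 / obj.height ** 2))
            # Color likelihood h_k from the 32-bin per-channel histogram.
            bins = [min(int(c) // 8, 31) for c in rgb]
            color_p = np.prod([obj.hist[ch][b] for ch, b in enumerate(bins)])
            p = shape_p * color_p
            if p > best_p:
                best, best_p = k, p
        votes[i] = best
    return votes
```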

4.5 Improved region tracker

The tracker algorithm outlined in (Zhai et al., 2006) is vague on details, and in this section a number of improvements are proposed.

New objects typically introduce themselves by slowly entering the scene. In the first couple of frames only parts of the object are visible, and the region is typically smaller than the whole object. If this initial region is used to create a new object, there is a risk that it is not representative of the rest of the object. For instance, in figure 6.4 a person is entering the scene and the only visible part is the hand. The appearance of the hand differs from the majority of the object, creating an initial histogram that is not representative of the object.

A solution is to only create a new object when the size of a region is bigger than a certain threshold T_min. However, in scenes where objects close to the camera are considerably larger than objects further back, finding a good global threshold could prove difficult. A scale map S(p) is proposed in chapter 5, where each pixel position p in the image is mapped onto a scale proportional to a reference object. A scale equal to 1 is usually found in the center of the image. The threshold for creating a new track is

T_{\min} = S(R_{i,\text{center}})\, T_{\text{object}}    (4.2)

where T_object is the number of pixels for an object at scale equal to 1. T_object is typically set a little lower to compensate for smaller objects.

An acceptance gate was added to remove unreasonable region-to-object associations. Consider a new region R_i and only one object, at the other end of the frame. Even if the region and the object are unrelated, the object will be the best match according to equation 4.1. To avoid this, we require that a region must be within a certain distance D of an object to take part in the voting.

Only deleting objects that are near the image border is not enough for real-world applications. Objects might disappear in the middle of the scene due to failed foreground segmentation or occlusion. To avoid objects floating around in the frame, we add a threshold T_missed on the number of consecutive frames for which an object is not assigned to a region. If the threshold is reached, the object is deleted.
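The three improvements reduce to small predicates; the following is a sketch with hypothetical field names (n_pixels, center, pos, frames_missed), not the thesis implementation.

```python
def should_create_track(region, scale_map, t_object):
    """Create a new track only if the region exceeds the scale-adjusted
    minimum size T_min = S(center) * T_object (eq. 4.2)."""
    return region.n_pixels >= scale_map[region.center] * t_object

def within_gate(region, obj, d_max):
    """Acceptance gate: a region may take part in the voting for an object
    only if it lies within distance D of the object's predicted position."""
    dx = region.center[0] - obj.pos[0]
    dy = region.center[1] - obj.pos[1]
    return (dx * dx + dy * dy) ** 0.5 <= d_max

def prune_objects(objects, t_missed):
    """Delete objects that have gone unmatched for more than T_missed
    consecutive frames."""
    return [o for o in objects if o.frames_missed <= t_missed]
```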

An object P_k is modeled with the following parameters:

• pos_k is the position of the object. It is modeled using a Kalman filter (Kalman, 1960) with 4 states: positions x, y and velocities ẋ, ẏ. The position is updated by the center of the matched pixels.

• n_k is the size of the object in number of pixels. It is modeled as a one-dimensional Kalman filter.

• width_k, height_k is the bounding box around the object in pixels. They are modeled as two one-dimensional Kalman filters.

• N_k is a two-dimensional Gaussian PDF, with the variance equal to width_k and height_k. The mean is equal to pos_k.

• h_k = {h_red, h_green, h_blue} is a 32-bin histogram for each RGB channel. The histogram is updated by h_k = (1 - w)h_k + w·h_assigned, where h_assigned is the histogram of the pixels assigned to the object and w is a fixed blending factor. The histograms are normalized. During experiments, a size of 32 bins per channel was found to work well.

4.6 Elementary classification

The tracked objects contain information about speed and size. Elementary classification can be done using these parameters. This section describes two simple means of classification based on object speed and size.

4.6.1 Speed classification

Consider a scene where pedestrians are either walking or running. The speed |ṗos_k| is measured in pixels per frame. A threshold T_running could be set to determine whether an object is running or walking.

However, objects moving at the same real-world speed will have different speeds measured in pixels per frame depending on where they are; objects closer to the camera will move faster.

The scale map S can be used to compensate for this effect. Let v be the compensated speed using the scale map S. A simple classification of running and walking is formulated as

\text{action} = \begin{cases} \text{running} & \text{if } v \geq T_{\text{running}} \\ \text{walking} & \text{if } v < T_{\text{running}} \end{cases}, \qquad v = \frac{|\dot{pos}_k|}{S(pos_k)}    (4.3)

4.6.2 Size classification

Similar to section 4.6.1, object classification can be done using the size of the object. Consider a scene where both pedestrians and cars are present. Since cars in general are a lot larger than pedestrians, a simple decision threshold T_car on the estimated number of pixels n_k can be set. Using a scale map S, the decision of object type can be formulated as

\text{type} = \begin{cases} \text{car} & \text{if } c \geq T_{\text{car}} \\ \text{pedestrian} & \text{if } c < T_{\text{car}} \end{cases}, \qquad c = \frac{n_k}{S(pos_k)}    (4.4)

where c is the compensated number of pixels for object k.
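Both classifiers reduce to a scale-compensated threshold. A minimal sketch of equations 4.3 and 4.4 follows; the function and argument names are ours.

```python
def classify_action(speed_px, pos, scale_map, t_running):
    """Speed classification (eq. 4.3): compensate the pixels-per-frame
    speed by the local scale before thresholding."""
    v = speed_px / scale_map[pos]
    return "running" if v >= t_running else "walking"

def classify_type(n_pixels, pos, scale_map, t_car):
    """Size classification (eq. 4.4): compensate the object size in
    pixels by the local scale before thresholding."""
    c = n_pixels / scale_map[pos]
    return "car" if c >= t_car else "pedestrian"
```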

4.7 Results

The region tracker has been tested on data from (PETS, 2006). In figure 4.1 the input binary image, the tracked objects and the color-coded maximum pixel vote are shown for different situations.

The tracker performs well in most situations when there is a moderate number of people in the scene, as seen in figure 4.1a. Problems arise when objects overlap or when several objects enter the scene at the same time.

In figure 4.1b two objects with similar colors overlap. It is difficult to separate them since they have roughly the same histogram; in this case only the Gaussian PDF contributes to the pixel vote. A similar situation occurs 60 frames later when two objects overlap, figure 4.1c. This time the two histograms differ and it is possible to maintain the contours of the objects.

A situation where two or more objects enter the scene as part of the same foreground region is shown in figure 4.1d. The tracker will consider the region as one large object. If the region later is split, the tracker can create new tracks. When two objects with similar colors merge into a single region for an extended period of time, the tendency is for one of the tracked objects to take over the entire region, and the other track is removed.


4.8 Further improvements

The matching of individual pixels is based on a voting scheme on the joint probability of a Gaussian distribution and a color histogram. Objects with similar colors will be difficult to match. Adding a texture description for each pixel, such as a histogram of local binary patterns (LBP) as proposed by (Ojala et al., 2002), would help to separate the objects, given that they can be discriminated by texture. Other approaches include the addition of an affine pixel tracker, a trained detector or a contour tracker.
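To illustrate the suggested texture cue, the following is a minimal sketch of the basic 8-neighbor LBP code from (Ojala et al., 2002) for a grayscale image; it is an illustration, not part of the thesis implementation:

```python
import numpy as np

def lbp_8(gray):
    """Basic 8-neighbor LBP: each interior pixel gets an 8-bit code where
    bit i is set if neighbor i is at least as bright as the center."""
    g = np.asarray(gray, dtype=np.int32)
    center = g[1:-1, 1:-1]
    # neighbor offsets, clockwise from the top-left pixel
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = np.zeros_like(center)
    for bit, (dy, dx) in enumerate(offsets):
        nb = g[1 + dy:g.shape[0] - 1 + dy, 1 + dx:g.shape[1] - 1 + dx]
        code |= (nb >= center).astype(np.int32) << bit
    return code  # per-pixel codes; histogram these per object for matching
```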


(a) The region tracker can handle a moderate number of separated objects.

(b) Objects in a single region with similar colors are difficult to separate.

(c) Objects with dissimilar colors are possible to separate even if they are merged into the same region.

(d) Objects entering as connected regions will be considered as a single large object.

Figure 4.1: Examples of different tracking scenarios. From left to right: foreground segmentation after applied morphology, tracked objects and pixel voting.


5 Geometric Camera Calibration and Scale Approximation

The region tracker uses the scale information to set more effective thresholds. In this chapter a method for scale map approximation is presented, based on measuring the pixel height of a small number of objects in the image. The chapter is outlined as follows: background, prior work, description of a camera calibration model, creation of a scale map, approximation of a scale map and, finally, a comparison of the approximated result with the result from the camera calibration.

5.1 Background

In many surveillance applications it is important to know how the camera relates to the world. Objects closer to the camera will appear larger than objects further away. Using this relation, an object can be positioned in a real-world coordinate system, or its real-world size can be measured. In the IPS system, knowledge about the scale at different parts of the image is useful for applying better thresholds, especially in the region tracker. The scale for each pixel in an image is stored in a scale map.

5.2 Prior work

The relation between the camera image and the world can be modeled by a camera model. The parameters of the model can be determined using a camera calibration. This usually involves measuring coordinates in the real world and finding the corresponding points in the image. Several approaches using different camera models and calibration techniques have been proposed.

A typical approach is using a homography projection (Chum et al., 2005). Given a set of image points and corresponding world coordinates, a linear transformation can be created, mapping image points to world coordinates.

The popular camera calibration technique by (Tsai, 1987) estimates the full camera matrix, rotation matrix, translation vector, focal length and lens distortion in one step, using a single image with markers corresponding to world coordinates. A flexible technique for calibrating the lens distortion was developed by (Zhang, 2000), based on a model by (Brown, 1966), where the lens distortion is estimated separately using a set of images with a checkerboard pattern. Separating the lens calibration from the camera calibration is utilized in the calibration toolbox developed by (Bouguet, 2004), based on (Zhang, 2000) and (Heikkilä and Silvén, 1997). This camera model is described in section 5.4.

5.3 Method

Given a camera calibrated using the techniques described in (Bouguet, 2004), the construction of a scale map $S$ is straightforward, as described in section 5.5. In reality, the placement of surveillance cameras is often optimized for viewing: no calibration is performed, and no parameters of any kind are recorded or measured. In this chapter a simple method is described for estimating scales by measuring the pixel height of a small number of pedestrians. This method does not require any measurement of markers in the image, as most other calibration methods do. The method is compared to the use of a fully calibrated camera.
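As an illustration of this idea, the following is a minimal sketch that fits a scale map from a few measured pedestrian heights, under the simplifying assumption that apparent height varies linearly with the image row; the linear model and the example numbers are assumptions, not the thesis method:

```python
import numpy as np

def fit_scale_map(rows, heights_px, image_shape, ref_row=None):
    """Fit pedestrian pixel height as a linear function of image row,
    then normalize so that scale equals 1 at ref_row."""
    a, b = np.polyfit(rows, heights_px, deg=1)   # height ~ a * row + b
    all_rows = np.arange(image_shape[0])
    if ref_row is None:
        ref_row = image_shape[0] // 2            # scale 1 near image center
    scale_per_row = (a * all_rows + b) / (a * ref_row + b)
    # constant along each row; broadcast to a full per-pixel map
    return np.tile(scale_per_row[:, None], (1, image_shape[1]))

# Example: three pedestrians measured at image rows 100, 300 and 500
S = fit_scale_map(rows=[100, 300, 500], heights_px=[40, 90, 140],
                  image_shape=(576, 720))
```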

5.4 Description of a camera model

This section describes the camera model used in the toolbox by (Bouguet, 2004). The camera model is split in two parts: an intrinsic and an extrinsic model. The intrinsic model describes the internals of each camera, such as focal length and lens distortion. The extrinsic model describes the relation to the world. Figure 5.1 shows the flow chart for transforming a world coordinate $X_w$ to an image coordinate $x_p$.

The transformation from image coordinates to world coordinates is the reverse.

5.4.1 Extrinsic model

Consider a camera coordinate system where a coordinate is denoted $C = (X_c, Y_c, Z_c)$. The camera coordinate system is aligned such that the image plane $n$ of the camera spans the $X_c$ and $Y_c$ axes with $Z_c = 1$. A world coordinate in 3D space is denoted $W = (X_w, Y_w, Z_w)$ and can be transformed to the camera coordinate system using

$$C = \begin{bmatrix} X_c \\ Y_c \\ Z_c \end{bmatrix} = R_c \begin{bmatrix} X_w \\ Y_w \\ Z_w \end{bmatrix} + T_c \qquad (5.1)$$


Figure 5.1: Transforming world coordinates to pixel coordinates in an image. The chart shows the stages world coordinate → camera coordinate (extrinsic model) → normalized image coordinate → distorted normalized image coordinates (lens model) → pixel coordinate (intrinsic model).

The image plane $n$ has normalized coordinates $x_n = (u_n, v_n)$, where the center $(0, 0)$ corresponds to $(0, 0, 1) \in C$. A transformation from camera coordinates $C$ into normalized coordinates $x_n$ is given by

$$x_n = \begin{bmatrix} u_n \\ v_n \end{bmatrix} = \begin{bmatrix} X_c/Z_c \\ Y_c/Z_c \end{bmatrix} \qquad (5.2)$$

The estimation of $R_c$ and $T_c$ is done using measured known markers in the image with corresponding world coordinates. For more information, see (Bouguet, 2004).
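To make equations 5.1 and 5.2 concrete, the following is a minimal sketch of the world-to-normalized-image transformation, assuming $R_c$ and $T_c$ are already known from a calibration:

```python
import numpy as np

def world_to_normalized(X_w, R_c, T_c):
    """Equations 5.1 and 5.2: world coordinate -> camera coordinate
    -> normalized image coordinate (pinhole projection)."""
    C = R_c @ np.asarray(X_w, float) + T_c   # (X_c, Y_c, Z_c)
    return C[:2] / C[2]                      # (u_n, v_n)

# Example: identity rotation, camera 5 m from the world origin along Z
R_c = np.eye(3)
T_c = np.array([0.0, 0.0, 5.0])
x_n = world_to_normalized([1.0, 2.0, 0.0], R_c, T_c)  # -> [0.2, 0.4]
```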

5.4.2 Intrinsic model

The intrinsic parameters model the internals of the camera. They consist of the lens distortion coefficients $k_c$, the focal length $f$ and the principal point $(u_c, v_c)$. The principal point is also known as the optical center. A camera matrix $K_k$ is formulated as

$$K_k = \begin{bmatrix} f & 0 & u_c \\ 0 & f & v_c \\ 0 & 0 & 1 \end{bmatrix} \qquad (5.3)$$

Given normalized coordinates in the image plane $x_n = (u_n, v_n)$, the distorted coordinates $x_d = (u_d, v_d)$ are modeled by

$$x_d = (1 + k_c(1)r^2 + k_c(2)r^4 + k_c(5)r^6)\,x_n + dx \qquad (5.4a)$$
$$r^2 = u_n^2 + v_n^2 \qquad (5.4b)$$
$$dx = \begin{bmatrix} 2k_c(3)u_nv_n + k_c(4)(r^2 + 2u_n^2) \\ k_c(3)(r^2 + 2v_n^2) + 2k_c(4)u_nv_n \end{bmatrix} \qquad (5.4c)$$

where $k_c$ is the 5-vector of distortion coefficients, containing both radial and tangential distortion.
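The following is a minimal sketch of equations 5.4a to 5.4c, applying the radial and tangential distortion to a normalized coordinate; the 0-indexed Python vector corresponds to the 1-indexed 5-vector $k_c$ above:

```python
import numpy as np

def distort(x_n, k_c):
    """Apply the lens model of equations 5.4a-5.4c; k_c holds the five
    distortion coefficients (kc(1)..kc(5) above -> k_c[0]..k_c[4])."""
    u, v = x_n
    r2 = u * u + v * v                                   # equation 5.4b
    radial = 1 + k_c[0] * r2 + k_c[1] * r2**2 + k_c[4] * r2**3
    dx = np.array([2 * k_c[2] * u * v + k_c[3] * (r2 + 2 * u * u),
                   k_c[2] * (r2 + 2 * v * v) + 2 * k_c[3] * u * v])  # 5.4c
    return radial * np.asarray(x_n, float) + dx          # equation 5.4a
```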
