
IT 10 040

Degree project 30 credits, August 2010

Gait-based reidentification

of people in urban surveillance video

Daniel Skog

Department of Information Technology


Faculty of Science and Technology, UTH unit

Visiting address:
Ångströmlaboratoriet, Lägerhyddsvägen 1, Hus 4, Plan 0

Postal address:
Box 536, 751 21 Uppsala

Phone: 018 – 471 30 03
Fax: 018 – 471 30 00

Website: http://www.teknat.uu.se/student

Abstract

Gait-based reidentification of people in urban surveillance video

Daniel Skog

Video surveillance of large urban areas demands the use of multiple cameras. Consider tracking a person moving between cameras in such a system. When the person disappears from the view of one camera and then reappears in another, the surveillance system should be able to determine that the person has been seen before and continue tracking. The process of determining this connection is known as reidentification.

Gait is a biometric that has been shown to be useful in determining the identities of people. It is also useful for reidentification as it is not affected by varying lighting conditions between cameras. Also, it is hard for people to alter the way they are walking without it looking unnatural.

This project explores how gait can be used for reidentification. To investigate this, a number of different gait-based methods used for identification of people were applied to reidentification. The methods are based on the active energy image, gait energy image, frame difference energy image, contours of silhouettes, and the self-similarity plot. The Fourier transform of the gait silhouette volume was also tested. These methods are appearance based, and the common theme is that a sequence of silhouettes of the subject is transformed into a representation of the gait. The representations are then used for reidentification by comparing them to other gaits in a pool using a simple classification method based on the nearest neighbor classifier.

Two datasets were used to test the methods. The first dataset was captured with live surveillance cameras in an urban scene and the second using a home video camera.

The lower quality of the footage in the first dataset affected the results: only about 34% of the reidentifications were correct. This can be compared with the higher quality dataset, which gave about 80% correct reidentifications.

Printed by: Reprocentralen ITC, IT 10 040

Examiner: Anders Jansson
Subject reviewer: Robin Strand
Supervisor: Cris Luengo


Popular science summary (Populärvetenskaplig sammanfattning)

This report deals with reidentification of people in urban environments by observing the way they walk. As camera technology becomes cheaper and the threats against society grow, the number of cameras surveilling us increases. It is reported [24] that there are just under 100 people per camera in Stockholm County.

This figure does not include cameras in places closed to the public. The cameras are mainly used for recording, and when something happens the recording is reviewed to see what took place. It would be better to use the cameras to prevent crime. Doing so requires that the cameras be monitored by people. Unfortunately, it is impractical and expensive to have people monitoring such a large number of cameras. A better solution would be to have computers monitor the cameras and alert the guards if something happens. Such so-called intelligent surveillance systems are under development and are a hot research area today.

The fundamental step in many such systems is background segmentation. In this step, the interesting parts in the foreground of the camera image are separated from the uninteresting background. The interesting parts are, in this case, the people who are moving. To do this, a mathematical model of the image is created. One example of such a model is a still image taken when there were no people in the camera view. Each camera frame is then compared with the still image and the differences are marked. The problem with such a model is that it does not change when something in the scene changes. Someone may, for example, leave something on a table in the image or move something, and these uninteresting objects are then marked as foreground. One way to solve this problem is to have a statistical model for each pixel, for example a normal distribution. The model is allowed to change a little with every new camera frame. If the pixel does not have the values predicted by the model, the pixel is marked as foreground. Now suppose one films a tree swaying in the wind and wants to model the background statistically. Then, for each pixel, one model is needed for the objects visible behind the tree when the leaves sway and one model for the leaves themselves. One way to do this, and the one used in this report, is to model each surface that can be seen in the pixel statistically.

The pixel is then compared with both models, and if it is neither leaf nor background it is marked as foreground.

The next fundamental part of an intelligent surveillance system is tracking of the people in the camera view. Tracking means that the system follows the tracked person and records their position. Now suppose a large area, such as an airport, is being surveilled. This requires a large number of cameras. Suppose a camera detects a suspicious person and starts tracking them. When the person leaves the camera view and enters another, it is important that the system understands that this person has been seen before and continues the tracking. This is called reidentification.

A common reidentification technique is based on using the color of the person's clothes. The problem with this is that the color changes if the lighting changes. A person's gait, on the other hand, is the same regardless of lighting.

The gait changes with the person's speed, clothing and mood, but these factors are relatively constant between the cameras. This report therefore examines various reidentification techniques that exploit a person's way of walking. The aim is to find a method that can be used together with other methods in a tracking system.

The techniques tested are based on recognition techniques developed in recent years.

These are adapted for reidentification and tested on two different video sequences. The first sequence was recorded with three surveillance cameras. In this sequence there are a large number of people who can be difficult to distinguish from the background, which makes reidentification harder. The second sequence was recorded with a home video camera, and in this sequence the people are easy to distinguish from the background. There is a significant difference in results between these video sequences. In the first, the best of the tested reidentification techniques reidentify only about 34% of the people. In the second, as many as 80% of the people are reidentified. This suggests that the methods have some potential, but that they depend on the type of video to which they are applied.


Contents

1 Introduction
  1.1 Solution overview
  1.2 Report layout
2 Background
  2.1 Digital video
    2.1.1 Projection
    2.1.2 Sampling and quantization
    2.1.3 Sampling of video
    2.1.4 Color
  2.2 Video surveillance
  2.3 Segmentation of moving objects
    2.3.1 Connected components
  2.4 Tracking
  2.5 Reidentification
    2.5.1 Reidentification using gait
3 Material
  3.1 Dataset 1
  3.2 Dataset 2
4 Methods
  4.1 Mixture of Gaussians
    4.1.1 Mathematical background of the Mixture of Gaussians algorithm
    4.1.2 Update
    4.1.3 Segmentation
    4.1.4 Shadow detection
    4.1.5 Post Processing
    4.1.6 Further possible extensions
    4.1.7 Parameter values
  4.2 Tracking
  4.3 Gait reidentification methods
    4.3.1 Silhouettes
    4.3.2 Bounding boxes
    4.3.3 Cadence
    4.3.4 Energy images
    4.3.5 Fourier transform of gait silhouette volumes
    4.3.6 Contour based method
    4.3.7 Frame Difference Energy Image
    4.3.8 Self-Similarity Plot
    4.3.9 Comparing gaits using normalized cross-correlation
    4.3.10 Choosing templates
  4.4 Classification
    4.4.1 Nearest-neighbor classifier
    4.4.2 Extended nearest-neighbor classifier
5 Results
  5.1 Tracking
  5.2 Cadence
  5.3 Reidentification
6 Discussion
  6.1 Background segmentation using Mixture of Gaussians
  6.2 Reidentification
    6.2.1 The effect of view points
    6.2.2 The effect of video quality
    6.2.3 The effect of background segmentation
  6.3 Gait representations and cadence
    6.3.1 Cadence
    6.3.2 Discriminative power
    6.3.3 AEI and GEI
    6.3.4 Fourier transform of gait silhouette volumes
    6.3.5 Contour based method
    6.3.6 Frame difference energy image
    6.3.7 Self-similarity plot
  6.4 Summary and conclusions


1 Introduction

This report is about reidentification of people in digital surveillance video. Consider automatically surveilling a large area such as an airport. In order to see the majority of the airport one has to use several cameras. When a camera detects a suspicious person, the surveillance system should keep track of where the person is at all times. As long as the person stays within a single camera's view, all is fine: the position of the person is known to the system. But when the person moves out of the camera's view and into another, problems occur. How is the system supposed to know that the person seen in that camera was seen earlier in another camera? This is known as the reidentification problem.

The purpose of this report is to describe and evaluate methods for reidentification that are based on measuring the gait¹ of the subject (see section 2.5.1 for an introduction to reidentification using gait).

The idea is that such a method can be used to enhance and complement current reidentification methods.

In order to do this, a literature survey is done and some of the most promising methods found are implemented and tested on two sets of data. The methods examined are the active energy image and gait energy image representations of gait (section 4.3.4), the 3D Fourier transform of the gait silhouette volume (section 4.3.5), the frame difference energy image (section 4.3.7), the self similarity plot (section 4.3.8), and a method using the distance curves of a sequence of contours (section 4.3.6).

The main result of the report is that the best of the reidentification methods have a reidentification rate of up to 80% on high quality data, but less than 35% on low quality data (the results are presented in section 5; for more information about the datasets used see section 3). These results are achieved by the active energy image and the frame difference energy image on the high quality dataset, and by the gait energy image and active energy image on the low quality dataset. The results mean that there are methods that give good results under the right circumstances (see section 6 for a discussion of the results).

Many of the reidentification methods use the cadence² of the person. Therefore, methods for measuring a person's cadence are also examined. Methods using the area and the ratio between width and height of the bounding box are tested, see section 4.3.3. As it turns out, these methods give good results compared to the manual estimation.

1.1 Solution overview

The solution method is based on a literature search of the subject. The best methods found during the search are implemented mainly in MATLAB but partly in C (the background segmentation method, section 4.1). The methods are then tested on two datasets and evaluated.

The reidentification procedure is a pipelined process that starts with distinguishing the interesting parts of the video, i.e. the people, from the uninteresting stationary background (see section 2.3). This is done using the Mixture of Gaussians algorithm (see section 4.1). Using the segmented video, the different persons' positions are tracked as they move (see sections 2.4 and 4.2). The position data are then used together with the segmented video frames to create a representation of the different persons' gaits (see sections 2.5 and 4.3). Reidentification is then performed by comparing the different gaits, using a simple classification procedure (see section 4.4).

1.2 Report layout

The important concepts of the project will be presented in section 2. The different datasets used to evaluate the methods will be described in section 3. The actual methods used to solve the problem will be theoretically described in section 4. The results of these experiments are presented in section 5 and discussed in section 6.

2 Background

Short general backgrounds to the different methods used in the project are presented in this section.

First the basic concepts of digital video and video surveillance are introduced in sections 2.1 and 2.2.

Then background segmentation is introduced in section 2.3 followed by tracking in section 2.4 and finally reidentification in section 2.5.

¹ Gait refers to the pattern of movement of the limbs as a person is walking.

² Cadence refers to the time it takes to take two steps, i.e. one gait cycle.


2.1 Digital video

The most basic step of any video-based analysis of the world is to actually acquire the video. How an image is modeled mathematically and the process of creating digital video will be briefly described in this section. The steps of creating an image are projection, sampling and quantization, and they will be described in the following sections.

2.1.1 Projection

The first step in making an image of a part of the 3D world (a scene) is to project the scene onto a two-dimensional surface called a projection surface, see figure 1. The image can then be seen as a two-dimensional continuous function that maps every point of the projection surface to a value. If the value is a scalar it usually corresponds to the brightness at that point in the image. Vectors can be used to represent color. RGB, for example, is represented by a three-dimensional vector in which the elements represent the amount of red, green and blue light respectively.

Figure 1: The two triangles P1P2P3 and Q1Q2Q3 are projected onto the picture plane B. Projection can be done simply by drawing a line from the focal point O to every point in the scene. The point where the line meets the picture plane gets the same color as the first surface the line meets in the scene. Notice that in this case the corners of the triangles are pairwise projected to the same points in the picture plane; this may change if the focal point is moved.

This representation of an image is not possible to store on a computer. To store an image on a computer the points of the image must first be sampled and then the values of the image must be quantized.

2.1.2 Sampling and quantization

Sampling means that the values at a finite set of points in the continuous image are stored on the computer. The stored values are called picture elements, or pixels for short. The values at the sampling points are then separated into a finite number of intervals, where every interval is referred to by an integer. This is called quantization, and every interval is called a quantization level or a brightness level. The quantized image can then be stored on a computer as a regular matrix.

In practice one uses a digital camera to capture images. In that case the scene is projected onto a 2D grid of sensor cells in the camera. The sensor grid is called an image sensor and measures the intensity of the light entering the camera from the scene. The digital image is sampled and quantized using this sensor.

Resolution refers to the amount of detail an image can contain. A high resolution image can contain many small details while a low resolution image cannot. This is determined not only by the number of sensor cells in the image sensor, but also by how dense the grid is, among other things. To benefit from a high resolution image sensor one needs to store the image in a larger matrix in order to be able to see the fine details of the image.

A common problem is noise in the image, i.e. errors in the values sensed by the image sensor compared to those of the real world. Noise in image sensors is a very technical subject which we will not look deeper into here. The noise will be assumed to be additive and approximately described by a Gaussian distribution with mean zero. See section 4.1 for more details.

2.1.3 Sampling of video

Video can be seen as an image that changes continuously in time. In order to store the video on a computer, an image is sampled and quantized from the video stream every δt seconds in the same fashion as above. The digitized video images are referred to as frames. The frame rate 1/δt is the number of frames that are sampled each second, abbreviated fps for frames per second. Video with a low frame rate looks choppy. In order to do live video processing, every frame must be processed before the next frame arrives. This means that for a 25 fps video, every frame must be processed in 0.04 seconds, which can be very demanding for high resolution video.

In the report the digital (discrete) representation of an image will be used. That means that an image of size M × N will be seen as a discrete function from {1, 2, · · · , M} × {1, 2, · · · , N} to the quantized pixel values. For video, the frame at time t ∈ {1, 2, · · ·} will be referred to as I_t. The pixel at (x, y) at time t will be referred to as I_t(x, y).

2.1.4 Color

The standard way of representing color is to use RGB. RGB is basically a vector with three elements where the value of each element corresponds to the amount of red, green and blue color contained in the pixel. The values have been quantized to integers and usually range from 0 to 255. The set of possible RGB vectors defines a set in space which can be visualized as a cube in three dimensions. This set of possible colors is called a color space. By applying a transformation to the RGB values one can define other color spaces. A common color space is normalized RGB (rgb). It is computed simply by normalizing the RGB vector to unit length and can therefore be visualized as a 2D sphere in space.

Another common color space is the HSI (a.k.a. HSV or HSL) color space, which is created from the RGB cube by transforming it into a solid bicone. The elements of the color vector are called components or channels.

The assumption about Gaussian, channel-independent noise made earlier may not apply to the new color spaces. In the normalized RGB case, the distribution is still Gaussian, but the noise is not independent between the color channels. In the HSV case, one cannot even expect Gaussian noise.
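As a small illustration, the unit-length normalization described above can be computed per pixel in MATLAB. This is a minimal sketch, not code from the project; the image file name is an illustrative assumption:

    % Normalize every RGB vector to unit length, giving the normalized rgb values.
    I   = double(imread('frame.png'));              % M x N x 3 RGB image
    len = sqrt(sum(I.^2, 3)) + eps;                 % per-pixel vector length (eps avoids 0/0)
    rgbNorm = I ./ repmat(len, [1 1 3]);            % unit-length color vectors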

2.2 Video surveillance

Digital video is commonly used in surveillance systems. As video technology gets cheaper and the threats to society increase, the number of surveillance cameras increases. The most common usage of these cameras is still to record the video captured and then review it in case of an event. Obviously this is not an optimal usage of the cameras, which should instead be used for preventive purposes. Having people monitoring the cameras is expensive, especially as the number of cameras grows. For example, the United Kingdom is reported to have the largest number of cameras per person in the world, with more than 4.2 million cameras (2006) [1], which is too many to be monitored by a reasonable number of people. A more efficient solution would be to have computers supervise the cameras, alerting the guards in case of an event.

Historically, video footage has in general been of bad quality with low frame rate, low resolution and often only gray–scale footage. Now the quality of the video data has increased as technology is getting more advanced with color, megapixel resolutions and with frame rates of over 30 fps. With the increasing computational power of modern computers and the advances in video technology, researchers have started to look at what is called intelligent surveillance. The objective is to create a digital surveillance system that can replace the people monitoring the surveillance cameras. But despite the advances in technology, intelligent surveillance still has a long way to go [28].

In order to do high-level analysis (such as behaviour analysis) of surveillance video recorded from multiple cameras, a number of different sub-tasks must be performed. Some common basic steps are segmentation of moving objects and tracking of moving objects. If the system uses multiple cameras, reidentification needs to be performed in order to tie together the objects tracked in the different cameras.

In the coming sections short introductions to these three tasks will be presented.


2.3 Segmentation of moving objects

To make analysis of the video material easier, the interesting parts (the people) must be distinguished from the stationary background. This is a process known as background segmentation or background subtraction. The result of the background segmentation is an image where the pixels belonging to the foreground have the value 1 and the pixels belonging to the background the value 0. Some of the difficulties in doing this are noisy video data, changes in lighting conditions and shadows cast by people.

The background is usually segmented by creating and maintaining a model of the background. If a pixel is significantly different from the model it is classified as foreground. This is the reason why shadows are a problem. They are often different enough from the background model to be classified as foreground, but since they do not provide any useful information about the object that has given rise to the shadow, they should instead be classified as background.

A picture of the empty scene can be used as a simple background model. The background picture is subtracted pixel by pixel from the current frame and the result is thresholded. Thresholding basically means that all pixels that have a value larger than a certain threshold value are set to 1 [9]. This creates an image where the foreground pixels have the value 1 and the background pixels the value 0. This model has many flaws, of which the most serious are the inability to adapt to changes in the scene and the need for manual initialization. The background model should be able to adapt to changes in the scene such as new objects coming into view, lighting changes and so on. A common adaptive background model is to use the pixel-wise mean or median of a set of frames. These values converge to the true background value as the number of frames grows. The model image is then subtracted from the current frame and thresholded in the same fashion as above. After thresholding, the oldest frame is removed from the set and the current frame is added. Finally the model is recalculated. A way to avoid the expensive recalculation of the model is to simply add or subtract one from the model pixels depending on whether the incoming pixel is brighter or darker [25].
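A minimal MATLAB sketch of this simple model, with a static reference image, a global threshold and the cheap ±1 running update; the file names and the threshold value are illustrative assumptions, and rgb2gray assumes the Image Processing Toolbox:

    % Simple background subtraction against a slowly adapting reference image.
    bg  = double(rgb2gray(imread('empty_scene.png')));    % model: picture of the empty scene
    thr = 30;                                              % global threshold (illustrative value)

    frame  = double(rgb2gray(imread('frame_0001.png')));
    fgMask = abs(frame - bg) > thr;                        % 1 = foreground, 0 = background

    % Cheap adaptive update: nudge every model pixel one step towards the new frame.
    bg = bg + sign(frame - bg);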

The mean/median models still have some flaws. They depend on a threshold value and do not respond well to bimodalities in the scene, such as waving leaves and branches. Sudden global illumination changes, such as turning on the light in a room or variations in cloud cover in outdoor scenes, also cause problems, as the model may not adapt fast enough. The threshold can be made dynamic by modelling each pixel in the background as a Gaussian distribution and then using the variance as a sort of threshold for each individual pixel. This method still does not respond well to bimodalities in the scene. To solve this problem, Stauffer and Grimson [26] present a method using a mixture of Gaussian distributions (MoG) for modelling each background pixel and an effective estimation of the parameters based on an EM (Expectation Maximization) approach (see Bilmes' gentle tutorial on the EM algorithm [3] for more about the EM method). This method is used in this project and is more closely described in section 4.1.

2.3.1 Connected components

Using the segmented image one can identify a number of connected components which will be referred to as objects. A connected component is a set of pixels in which all pixels are connected to each other.

Two pixels x_0 and x_n are connected if it is possible to find a sequence of pixels {x_i}_{i=0}^{n} such that all pairs x_i and x_{i+1} are neighbors. Two pixels are neighbors only if they share a common edge or vertex [9]. The connected components are found using a connected component labeling algorithm, which is presented in any standard textbook on image analysis [25]. The result of the connected component labeling is an image where every pixel has a number that identifies which connected component it belongs to. This image can be used to gather information about the objects, such as their area and position, but also information about bounding boxes (section 4.3.2), silhouettes (section 4.3.1) and contours (section 4.3.6).
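For example, with MATLAB's Image Processing Toolbox the labeling and the per-object measurements described here could be obtained as follows, using the foreground mask from the earlier sketch; the area limit is an illustrative choice:

    % Label 8-connected foreground components and collect per-object measurements.
    L     = bwlabel(fgMask, 8);                                % 0 = background, 1..n = objects
    stats = regionprops(L, 'Area', 'Centroid', 'BoundingBox'); % one struct per component

    % Keep only reasonably large components, i.e. likely people rather than noise.
    people = stats([stats.Area] > 100);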

2.4 Tracking

In a given video frame there are a number of moving objects (possibly zero) detected in the background segmentation step. In the next frame these objects have moved slightly, or new objects may have come into view. The purpose of tracking is to establish a correspondence between the objects in the current frame and the ones previously seen in the video. The result of tracking is a sequence of pixel coordinates for every observed object, where each coordinate corresponds to the position of the object in a video frame. This sequence is here called a track or a trajectory.

Tracking is a broad subject with many different techniques available as shown by a survey from 2006 [33]. In this section an introduction to the subject will be given.


If a foreground region in one frame overlaps a foreground region of the previous frame, it belongs to the same object if its movement speed is small enough. The centre of mass of the object can be used as position coordinates. As long as the objects are far apart the method works well. But if they occlude each other they will belong to the same foreground region and the tracker will think that they are one object.

When the occlusion ends the tracker finds two objects again but it does not know the original identities of the objects due to the occlusion. To solve this occlusion problem the objects are usually modelled in some way. That way the objects can be compared to their models and their identities kept.

The basis of every model is the actual features of the object, such as its position and appearance. A common approach is to model the position and movement of the object directly. A way of doing this is to use a Kalman filter, invented in 1960 by R.E. Kalman [15]. An introduction to the filter is given by Welch and Bishop [31]. The Kalman filter has been used with success for tracking of people and cars by Stauffer and Grimson [26]. Another way to track the position of the object is to use the CONDENSATION algorithm (particle filtering) [12].

Another feature is what is called the kernel of the object. The kernel refers to the shape and appearance of the object. The mean shift tracking procedure uses a histogram of an elliptical region of the object for tracking [6].

2.5 Reidentification

The purpose of reidentification is to recognize an object that has left the scene and then reappeared in another part of the scene [16]. For example, consider a surveillance system using multiple cameras. If a person who is tracked in one camera leaves the camera’s view and reappears in another, it is important to be able to reidentify the person and continue tracking. But reidentification is not an easy task in general. The most fundamental problem is that the subject may appear different in different cameras due to change in camera angles or changes in lighting conditions between the cameras. A simple change in walking direction may also affect the reidentification method.

The simplest approach to reidentification is through position and velocity³. If the subject leaves the camera's view at a certain position with a certain velocity, the subject can be reidentified simply by knowing which camera the subject reappears in. This requires that the camera views overlap so that a temporal correspondence between the different cameras can be kept. As the scene grows, one may lose this correspondence and the method does not work, unless one adds additional cameras [10].

A more efficient solution to this problem is to use the actual characteristics of the person, such as color, for reidentification. A problem common to all techniques that use color for recognition is the so-called color-constancy problem. Color constancy is the ability to assign the same color to the same object under different lighting conditions [18]. To illustrate, a common technique for recognition of objects using color is to create RGB histograms of the objects to be recognized. Two objects are then compared by comparing their histograms. This can be done in a number of ways, for example using the city-block distance between the histograms [27]. But if the images of the objects are captured under different lighting conditions, the color histograms look different even though they represent the same object. The histogram-matching methods can be extended to reidentification [16], but the problems with color constancy remain.
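To make the histogram matching idea concrete, the sketch below compares the joint RGB histograms of two images with the city-block (L1) distance. The 16-bin quantization per channel and the function names are illustrative assumptions, not taken from [27]:

    function d = histogramDistance(I1, I2)
    % City-block (L1) distance between joint RGB histograms of two images.
    nBins = 16;                                   % bins per color channel (illustrative)
    d = sum(abs(rgbHistogram(I1, nBins) - rgbHistogram(I2, nBins)));
    end

    function h = rgbHistogram(I, nBins)
    % Joint nBins^3-bin histogram of an RGB image, normalized to sum to one.
    q   = min(floor(double(I) / (256 / nBins)), nBins - 1);   % quantize each channel to 0..nBins-1
    idx = q(:,:,1) * nBins^2 + q(:,:,2) * nBins + q(:,:,3) + 1;
    h   = accumarray(idx(:), 1, [nBins^3, 1]);
    h   = h / sum(h);
    end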

2.5.1 Reidentification using gait

To avoid the problem of color constancy, characteristics of humans other than color can be used. An early idea was to use biometrics such as gait for reidentification. During the 1970s it was shown that humans can recognize other humans based on their way of walking and that only a very limited amount of data is needed for recognition. One of the earliest studies, from 1973 [13], used a technique called Moving Light Display (MLD). Lights were fastened to the major joints of the human body; as the person moved in complete darkness, only the lights were seen. The movement sequence was recorded and analyzed, showing that humans can recognize different types of gaits, such as running and walking, using only the MLD data. Another study, from 1977 [7], shows that humans can recognize each other using only the MLD data.

Compared to using color for reidentification, a person’s gait is invariant to different lighting conditions.

A person's gait changes with walking speed, clothing and even with the person's mood, but these factors are more or less constant during the short time of the reidentification process. One of the major drawbacks is that gait is a dynamic process, which means that the subject must be observed for at least one or two steps before analysis can be done. At normal walking speed this means that it may take more than a second to gather enough data to make analysis possible.

³ Velocity refers to the speed and direction of movement.

There are in general two different types of methods for gait recognition: model-based and appearance-based methods. The model-based methods try to fit a model to the video data and use the parameters of the model for identification. Unfortunately these kinds of methods are often computationally expensive due to the large number of parameters that need to be fitted [2], problems with self-occlusion, and other difficulties in determining the position of joints in arms and legs [32]. In contrast, the appearance-based methods use the actual images of the subject during the walking sequence and extract features from those. Some features that are used are the silhouette and contour of the person. It is known that the view point affects the appearance-based methods [5]. Also, the high dimensionality of the feature vectors may cause problems (the curse of dimensionality). On the positive side, the appearance-based methods are cheaper and easier to calculate and are also less technical to implement; therefore the focus here will be on appearance-based methods.

The gait of a person appears different from different viewing angles, even though the gait itself is the same. This means that the performance of appearance based recognition methods gets worse if the gaits that are to be recognized are viewed from different angles. This is generally the case when doing reidentification. Finding a method that is invariant to the viewing angle would solve many problems, but finding such a method is still an open problem.

Gait is usually used for identification of people and not reidentification. Reidentification is similar to identification in that one tries to identify someone; the difference lies in the conditions that are associated with a reidentification problem. Most importantly, in identification sampling is often done under heavily controlled circumstances, with control over viewpoints, background, etc. In reidentification, on the other hand, one does not in general have control over any of these conditions. Furthermore, in identification one often has a large number of samples, which means one can use, for example, dimensionality reduction techniques such as Principal Component Analysis to make the identification task easier [22]. This is in general not possible in a reidentification problem, as one cannot expect to have seen the unknown person earlier. Another difference is that in identification there may be a long time between sampling and actual identification: one may sample a person one day and identify him the next. In reidentification, on the other hand, there is often a very short time between sampling and reidentification, often a matter of seconds. This means that identification is a harder problem in this respect, since the methods need to be robust to, for example, the person changing clothes.

3 Material

Two sets of recorded video are used, which will be denoted dataset 1 and dataset 2. Dataset 1 was captured from three live surveillance cameras while dataset 2 was captured using a home video camera.

Dataset 1 is of lower quality but has many people moving around and a lot of test subjects. The second dataset is of higher quality but has fewer test subjects. In dataset 2, only one camera angle is used and the different subjects walk one at a time in front of the camera in different directions. This section describes the video material more closely.

3.1 Dataset 1

This dataset comes from an urban outdoor environment where people are moving around. Three cameras in different positions are used to capture the scene. Dataset 1 has the following properties:

• 3 channel RGB, 8 bits per channel,

• low frame rate, about 10 fps,

• low resolution, 288 × 384 pixels,

Furthermore, the environment in which the video is captured adds the following properties to the video material:

• changes in global lighting conditions,

• shadows from objects disturbing the background segmentation,

• very dark or black pixels,


• saturated pixels,

• matte colors due to shadows and low light,

• occlusion of people due to other people or fixed objects in the scene.

Again, these are common properties which one can expect in an urban outdoor scene. Sample images from the three different cameras can be seen in figure 2.

(a) Camera 1 (b) Camera 2 (c) Camera 3

Figure 2: Sample images from dataset 1, taken from each of the three cameras.

3.2 Dataset 2

This dataset is also from an outdoor environment but it is recorded using a home video–camera and is of higher quality than dataset 1. The video material has the following properties:

• 3 channel RGB, 8 bits per channel,

• high frame rate, about 25 fps,

• high resolution, 576 × 704 pixels,

Due to the controlled environment of dataset 2, the only restrictions added by the environment are a small shadow beneath the people in the video and a small global change in lighting. Sample frames from dataset 2 can be seen in figure 3.

Figure 3: Sample frames from dataset 2.

4 Methods

In this section the methods used for reidentification are presented. The first step of these methods, after image acquisition, is to separate the moving subjects from the static background. To do this, a method known as the Mixture of Gaussians background segmentation method is used, see section 4.1. The moving objects are then tracked, see section 4.2. Using the tracking data together with the segmented frames, the subjects are analyzed. A number of different methods for doing this analysis are used, see section 4.3. Finally, reidentification is done using a simple classification method, see section 4.4.

4.1 Mixture of Gaussians

In this section the Mixture of Gaussians algorithm will be presented. Basically it models every pixel in the background individually as a mixture of Gaussian distributions. The model will be more closely described in section 4.1.1. For every new frame, the model is updated using efficient online estimations of the parameters, see section 4.1.2. The final step is to determine which pixels match the model. These pixels are then classified as background. Those that do not match are classified as foreground. See section 4.1.3 for more about this step.

Shadowing is a major problem in background segmentation. Without a way to distinguish shadows from true foreground, shadows risk being classified as foreground and thereby destroying the segmentation.

A way of distinguishing shadows from true foreground is presented in section 4.1.4.

Finally the segmented image needs to be cleaned from small objects, and small holes in the foreground need to be filled. This is done using morphological operations, see section 4.1.5.

4.1.1 Mathematical background of the Mixture of Gaussians algorithm

The Mixture of Gaussians algorithm was invented by Stauffer and Grimson [26]. A mathematical description of the mixture of Gaussians algorithm is presented by Power and Schoonees [21], and their description will be used as a foundation for the description of the MoG presented here.

Formally, a mixture with K components corresponds to a set of K states where each state represents a surface that may come into the view of the pixel. The state is determined by a random process. The pixel values are samples of a random variable which is affected by the state. The pixel process is modelled by one Gaussian probability density function for each state, i.e. as a mixture of K Gaussians. A set of parameters θk = {µk, Σk} is associated with each state. µk is the mean and Σk is the covariance of the pixel values generated by state k. The set of all parameters of the entire mix is denoted Φ = {ωk, θk} where ωk is the probability that the pixel value was caused by state k.

Given the state k, the probability f_k(X | k, θ_k) of the modelled pixel having the value X is given by

f_k(X | k, θ_k) = 1 / ( (2π)^{n/2} |Σ_k|^{1/2} ) · exp( −(1/2) (X − µ_k)^T Σ_k^{−1} (X − µ_k) ),   (1)

i.e. the normal probability distribution, also known as a Gaussian. The parameter n is the number of dimensions of X, e.g. RGB-valued pixels give n = 3. If the states are independent, the probability F of the modelled pixel having value X is

F(X) = Σ_{k=1}^{K} P(k) f_k(X | k, θ_k) = Σ_{k=1}^{K} ω_k f_k(X | k, θ_k).   (2)

The probability P(k | X, Φ) that pixel value X was caused by state k is given by Bayes' theorem as

P(k | X, Φ) = ω_k f_k(X | k, θ_k) / F(X).   (3)

Given N pixel samples X_1, . . . , X_N, the parameters are estimated as

ω_k = (1/N) Σ_{t=1}^{N} P(k | X_t, Φ),   (4)

µ_k = ( Σ_{t=1}^{N} X_t P(k | X_t, Φ) ) / ( Σ_{t=1}^{N} P(k | X_t, Φ) ),   (5)

Σ_k = ( Σ_{t=1}^{N} P(k | X_t, Φ) (X_t − µ_k)(X_t − µ_k)^T ) / ( Σ_{t=1}^{N} P(k | X_t, Φ) ).   (6)

The full derivation of the above equations can be found in Bilmes' tutorial [3].


4.1.2 Update

As the number of observations increases, a method is needed for estimating the parameters of the model without having to keep recalculating the above equations. Preferably only the current pixel should be used to update the parameters; this is called an online estimate. Let α ∈ (0, 1) be a parameter set by the user, called the learning rate. An online estimate of ω_k is the running average of P(k | X_t, Φ), i.e.

ω_{k,t} = (1 − α) ω_{k,t−1} + α P(k | X_t, Φ),   (7)

where t denotes the time step. Using N ω_{k,t} = Σ_{t=1}^{N} P(k | X_t, Φ) we get

µ_k = ( Σ_{t=1}^{N} X_t P(k | X_t, Φ) ) / ( N ω_{k,t} ),   (8)

which is a weighted average. Making it an online average in the same fashion as (7) one gets

µ_{k,t} = (1 − ρ_{k,t}) µ_{k,t−1} + ρ_{k,t} X_t,   (9)

where

ρ_{k,t} = α P(k | X_t, Φ) / ω_{k,t}.   (10)

Σ_k can be approximated by Σ_k = σ_k² I, where I is the identity matrix. This approximation is valid as long as the color channels are linear and independent, so care needs to be taken when using non-linear color spaces such as HSV (see section 2.1.4 for more about color spaces). The online update equation for σ_k is then derived in the same fashion as those above:

σ²_{k,t} = (1 − ρ_{k,t}) σ²_{k,t−1} + ρ_{k,t} (X_t − µ_{k,t−1})^T (X_t − µ_{k,t−1}).   (11)

The parameters ω_{k,t}, µ_{k,t} and σ_{k,t} of every Gaussian in every mixture are updated once for every new frame using equations (7), (9) and (11). One can simplify the calculation of ρ_{k,t} by noting that P(k | X_t, θ_k) is close to one for only one Gaussian at a time and close to zero for the others [21]. Therefore P(k | X_t, θ_k) can be approximated by

M_{k,t} = 1 if the pixel matches Gaussian k, and 0 otherwise.   (12)

Whether a pixel matches a Gaussian is determined by the matching criterion

‖X_t − µ_{k,t−1}‖ < 2.5 σ_{k,t−1}.   (13)

If the matching criterion is fulfilled, one says that the pixel matches the Gaussian.

Using the approximation (12) one gets ρ_{k,t} = α/ω_{k,t} for the matching Gaussian and 0 otherwise. Notice that α/ω_{k,t} may be greater than 1 for some values of ω and should therefore be limited to 1. The approximation means that one only has to update µ and σ for the matching Gaussian. If none of the Gaussians match the pixel X_t, the least reliable Gaussian k is replaced by setting µ_{k,t} = X_t and giving it a low initial weight and a high variance. The reliability of the Gaussians is determined by calculating the ratio ω_{k,t}/σ_{k,t} for all Gaussians. This ratio increases as ω_{k,t} increases and σ_{k,t} decreases. The least reliable Gaussian is the one with the lowest value.
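As a sketch, the per-pixel update of equations (7) and (9)-(13) for a grayscale model could be written as the MATLAB function below. This is not the project's actual implementation; the initial weight and variance used when a Gaussian is replaced are illustrative values:

    function [w, mu, sig2] = mogUpdate(x, w, mu, sig2, alpha)
    % One online update step for a single grayscale pixel modelled by K Gaussians.
    % x is the new pixel value; w, mu and sig2 are K-vectors of weights, means and
    % variances; alpha is the learning rate.
    match = abs(x - mu) < 2.5 * sqrt(sig2);        % matching criterion, eq. (13)
    k     = find(match, 1, 'first');               % index of the matching Gaussian, if any
    if isempty(k)
        % No Gaussian matches: replace the least reliable one (lowest w/sigma).
        [~, k]  = min(w ./ sqrt(sig2));
        mu(k)   = x;
        sig2(k) = 30^2;                            % high initial variance (illustrative)
        w(k)    = 0.05;                            % low initial weight (illustrative)
    else
        M = zeros(size(w));  M(k) = 1;             % approximation of P(k|x), eq. (12)
        rho     = min(alpha / w(k), 1);            % eq. (10) with the approximation, capped at 1
        muOld   = mu(k);
        mu(k)   = (1 - rho) * muOld   + rho * x;               % eq. (9)
        sig2(k) = (1 - rho) * sig2(k) + rho * (x - muOld)^2;   % eq. (11)
        w = (1 - alpha) * w + alpha * M;                       % eq. (7)
    end
    w = w / sum(w);                                % keep the mixture weights normalized
    end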

4.1.3 Segmentation

After updating the model, one determines if the pixel belongs to the background or not. To determine this, the states which correspond to background need to be identified. This is done by first sorting the Gaussians by the ratio ω_{k,t}/σ_{k,t} in decreasing order. Then, in the sorted sequence of Gaussians, calculate

κ_t = argmin_b ( Σ_{k=1}^{b} ω_{k,t} > T ),   (14)

where T is a threshold parameter set by the user. The κ_t Gaussians with the highest reliability are chosen as the background model. The pixel is classified as background if it matches any of these Gaussians. Whether the pixel matches is determined by the matching criterion described above in equation (13), with the difference that the updated values µ_{k,t} and σ_{k,t} are used.
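Continuing the sketch from the previous section, the selection in equation (14) and the final background test could look like this, with T the user-set threshold:

    % Classify pixel value x as background or foreground, given the updated model.
    [~, order] = sort(w ./ sqrt(sig2), 'descend');        % sort Gaussians by reliability w/sigma
    kappa      = find(cumsum(w(order)) > T, 1, 'first');  % eq. (14)
    bgIdx      = order(1:kappa);                          % the Gaussians forming the background

    isBackground = any(abs(x - mu(bgIdx)) < 2.5 * sqrt(sig2(bgIdx)));  % matching test, eq. (13)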


4.1.4 Shadow detection

Many shadow detection methods for background segmentation have been developed. The method used here is based on the color and brightness distortion defined by Horprasert et al. [11] and adapted to the MoG framework by KaewTraKulPong and Bowden [14].

Central in the method is the calculation of color distortion and brightness distortion between the measured pixel value and the value expected from the Gaussians in the background model. If these values fall within certain thresholds for any of the Gaussians, the pixel is classified as a shadow. The brightness distortion a_k of Gaussian k is defined as

a_k = argmin_z ‖X_t − z µ_{k,t}‖².   (15)

This is the factor that would scale µ_{k,t} to the same length as X_t. The length of an RGB vector is a measure of the brightness of the pixel, and therefore a_k works as a measure of the difference in brightness between µ_{k,t} and X_t. The color distortion c_k of Gaussian k is defined as

c_k = ‖X_t − a_k µ_{k,t}‖,   (16)

i.e. the length of the chromaticity difference X_t − a_k µ_{k,t}.

In practice the RGB values need to be normalized by division with the component variances, due to camera noise and unequal variation between the color components. Denote the normalized values by µ̄_{k,t} and X̄_t. As for the actual calculations, a_k is calculated in a standard fashion by differentiating with respect to z and solving for zero, which results in

a_k = ( µ̄_{k,t}^T X̄_t ) / ‖µ̄_{k,t}‖².   (17)

Differentiating again reveals that the calculated extreme point really is a minimum. Using the simplification that the variance is equal to σ_{k,t} for all color bands gives

a_k = ( µ_{k,t}^T X_t ) / ‖µ_{k,t}‖².   (18)

The color distortion is straightforwardly calculated. Using the mentioned simplification of the variances we get

c_k = ‖X_t − a_k µ_{k,t}‖ / σ_{k,t}.   (19)

KaewTraKulPong and Bowden [14] only used a lower threshold τ ∈ (0, 1) and classified a pixel as a shadow if c_k < 2.5 and τ < a_k < 1, i.e. if the color distortion is within 2.5 standard deviations of the expected value and the brightness distortion is less than one but not too small.

By using a single threshold one cannot handle lighting changes where the scene is getting lighter. The lighter pixels risk being classified as foreground in the same way as shadows. Therefore a dual threshold is used here: a pixel is a shadow or highlight if c_k < 2.5 and β < a_k < γ, where β ∈ (0, 1) and γ ∈ (1, ∞) (the values used for β and γ are listed in section 4.1.7).

As for the practical implementation, only pixels classified as foreground need to be tested if they are shadow or not, which makes this method very efficient. Some sample segmented frames from dataset 1 can be found in figure 4.
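A sketch of the distortion test of equations (18) and (19) for one foreground RGB pixel against one background Gaussian; the single-variance simplification from above is assumed, and the threshold values are the dataset 1 values from section 4.1.7:

    % Shadow/highlight test for a pixel X (3x1 RGB, double) against a background
    % Gaussian with mean muK (3x1) and standard deviation sigK (scalar).
    beta = 0.6;  gamma = 1.4;                   % dual thresholds (dataset 1 values, section 4.1.7)

    a = (muK' * X) / (muK' * muK);              % brightness distortion, eq. (18)
    c = norm(X - a * muK) / sigK;               % color distortion, eq. (19)

    isShadowOrHighlight = (c < 2.5) && (a > beta) && (a < gamma);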

4.1.5 Post Processing

The foreground/background image extracted using the MoG is full of single foreground pixels caused by camera noise. These are removed by simply removing all foreground objects smaller than 10 pixels. This can be done using a connected components labeling method, see section 2.3.1.

Small holes in the segmented image appear due to noise in the image, see figure 4. These are removed using an operation called closing of the image, which fills small holes without significantly changing the larger components. See a standard textbook on image analysis for more details [9].

The closing is performed using a 3 × 3 square structuring element. The result of the MoG segmentation with in-between steps can be seen in figure 4.
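With the Image Processing Toolbox, the size filtering and the closing described above can be expressed in two lines; the 10-pixel limit and the 3 × 3 square follow the text:

    % Remove foreground objects smaller than 10 pixels, then close small holes.
    clean = bwareaopen(fgMask, 10);                % drop tiny noise components
    clean = imclose(clean, strel('square', 3));    % closing with a 3 x 3 square element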


Figure 4: Sample segmented frames from dataset 1. The first row displays the original RGB frames and the second row the segmentation result of the image above. Black pixels correspond to background, gray to shadow and white to foreground. The third and fourth rows display the frames before and after post-processing, respectively.


4.1.6 Further possible extensions

KaewTraKulPong and Bowden [14] also achieved faster adaptation to the scene by using a higher learning rate at the start of the video sequence. This was not deemed necessary here and the method was not implemented. The number of Gaussians in the mixture can be made dynamic as described by Zivkovic [35], eliminating the need for the user to decide the number of Gaussian components.

A color space that reportedly has been successful is the rgI color space [29]. It is defined as r = R/(R + G + B), g = G/(R + G + B) and I = (R + G + B)/3. This color space gives rise to problems as it is singular at (R, G, B) = (0, 0, 0), which means that dark pixel values will be unstable and seem to have a higher variance than they really have. Also one cannot make the assumption that the r,g and I channels are independent, which introduces the need of a more complex covariance matrix and thus a more complex algorithm. On the other hand, by using this color space, shadow detection can be done more easily.

Using a Markov Random Field based on the MoG, one can add a smoothness condition to the classification step as described by Schindler and Wang [23]. This helps fill holes in foreground regions and remove noise, thus making the post-processing step superfluous.

Other possibilities for improving the MoG are fuzzy techniques [8] and particle filtering. A survey from 2008 [4] presents recent techniques used in MoG background segmentation.

4.1.7 Parameter values

For dataset 1, the parameters α = 0.005, T = 0.7, β = 0.6, γ = 1.4 and K = 5 were used. For dataset 2 the parameters were α = 0.005, T = 0.7, β = 0.9, γ = 1.0 and K = 5. Some examples of segmentations can be seen in figure 4.

4.2 Tracking

Two attempts were made to create a tracking procedure, one using a Kalman filter [26, 31] and one using a mean shift tracker [6]. Neither of these attempts was successful. As tracking is not the focus of this report, the tracking was done "by hand", simply by recording the positions of the tracked objects in every frame. A track of a person is here defined as a sequence of positions in one camera following the person.

A person must have several tracks to be useful for reidentification purposes. Occasionally a track may be broken by another person due to occlusion; in these events a new track was created when the occlusion ended. Furthermore, a new track was created if a person changed walking direction. This is because gait recognition requires that the viewing angle and walking pace of the subject are somewhat constant. This is heavily used in dataset 2, where only one camera is used.

4.3 Gait reidentification methods

In this section the central methods of the project are presented, namely the reidentification methods. As mentioned in section 2.5.1, the focus will be on appearance based methods. The reidentification methods will use the data from the background segmentation and tracking steps.

The reidentification methods are divided into three steps: first, image data is extracted from the video frames using the tracking data and the bounding boxes around the tracked subjects. In some methods the cadence is calculated during this step. In the second step the data gathered in step one is transformed into some useful representation. Two representations are then compared in the third step, producing a number which represents the similarity of the different gaits. The description of the gait reidentification methods in this section follows this division. First, methods to find the bounding boxes and cadence of the tracked subjects are presented in sections 4.3.2 and 4.3.3. Then the different reidentification methods follow in sections 4.3.4 to 4.3.8. The classification method is then presented in section 4.4.

A number of different methods will be described, starting with two 2D–representations, Gait Energy Image (GEI) and Active Energy Image (AEI) in section 4.3.4. Then a method using the 3D Fourier transform of a volume consisting of silhouettes is explored in section 4.3.5. In section 4.3.6 sequences of distance signals created from the contours of the subjects will be used. The Frame Difference Energy Image (FDEI) uses temporal derivatives and a clustering of the silhouettes belonging to the subject.

Errors from the background segmentation step are then repaired by calculating the GEI of every cluster and adding the appropriate GEI to the derivative. This method is presented in section 4.3.7. The final method is the Self-Similarity Plot, which uses the periodically occurring similarities in a walking person's stance to represent gait. This method is presented in section 4.3.8.


It is worth noting that none of the methods used here are invariant to the viewing angle of the subject. Even though finding a view-point invariant method is one of the most important problems to solve in gait recognition, no method having this property has been found and the problem is still open.

4.3.1 Silhouettes

The silhouette of an object is the set of pixels belonging to the object. The easiest way to find the silhouette of a moving object is to use the segmented image data together with the tracking data. Problems arise when there are errors in the segmented data, which may separate the silhouette into many parts. No good way to repair the silhouette was found; instead, the largest silhouette component within a five-pixel radius of the position reported by the tracker is used.

(a) Original image (b) Gray silhouette (c) White silhouette (d) Contour

Figure 5: Examples of silhouettes and contours.

4.3.2 Bounding boxes

The width and height of a subject's silhouette together with its position define a rectangle around the subject. The rectangle is called a bounding box. The term bounding box may also refer to the set of pixels contained in the bounding box. The bounding box can easily be determined using the result of the connected component labeling algorithm, see section 2.3.1. For an example of a bounding box, see figure 6.

A common theme in all representations of gait is the extraction of bounding boxes around the subjects in each frame. The bounding box should be big enough to fit the subject but not bigger. To correct for tracking and segmentation errors, a second-degree polynomial is fitted to the tracked x and y positions. This approximated curve is then used instead of the raw tracking data for positioning of the bounding boxes.

Correction of the width and height of the bounding boxes is done in a similar way. A linear curve is fitted to the measurements of width and height, and the fitted line is then subtracted from the raw data. This is done to compensate for the subject moving closer to or further away from the camera, which gives rise to a smaller or larger bounding box. The largest value needed to fit the subject is the maximum of the de-trended data; this value is finally added back to the trend.

Using the corrected position, width and height, the bounding boxes can be extracted. The sequence of bounding boxes contains all information needed to analyze the gait of the subject, such as contour and silhouette.
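A sketch of the correction described above, assuming column vectors t (frame indices), x and y (tracked positions) and wRaw and hRaw (measured box widths and heights); all variable names are illustrative:

    % Smooth the tracked positions with a second-degree polynomial.
    px = polyfit(t, x, 2);   xSmooth = polyval(px, t);
    py = polyfit(t, y, 2);   ySmooth = polyval(py, t);

    % De-trend width and height with a linear fit, take the largest residual and
    % add it back to the trend so that every box is large enough to fit the subject.
    pw = polyfit(t, wRaw, 1);   wTrend = polyval(pw, t);   wBox = wTrend + max(wRaw - wTrend);
    ph = polyfit(t, hRaw, 1);   hTrend = polyval(ph, t);   hBox = hTrend + max(hRaw - hTrend);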

4.3.3 Cadence

A basic and simple characterization of gait is the cadence, i.e. the time needed to take one step. As the person is walking, the area of the silhouette is largest when the person has a wide stance and smallest when the feet are together, assuming a view from the side of the subject. The area can be measured using the result of the connected component labeling algorithm, see section 2.3.1. The measured values are periodic, and the number of frames in one period can be determined; this is the cadence.


Figure 6: Sample bounding box. The border is added for clarity.

Specifically, the area A(t) of the subject at time t can be described by the function

A(t) = a(t) + p(t) + ε(t), (20)

where a(t) is the area of the subject, which changes periodically with time. The trend p(t) is due to the person moving closer to or further away from the camera, affecting the area of the silhouette. The error ε(t) is due to background segmentation errors. The trend is approximated by least-squares fitting a second-degree polynomial to the data. When the trend is subtracted, what remains is the periodic area and a measurement error. This step can be repeated for better results, but here the trend is only subtracted once. The de-trended area function is denoted Ā. Due to the error term it may not be possible to use the peaks of Ā directly to estimate the period. A standard way of finding the period of noisy data is to use the autocorrelation [30].

The autocorrelation c(u) of the sequence Ā(t) is defined as

c(u) = Σ_x Ā(x) Ā(x − u),   (21)

where u = 0, . . . , n and n is the length of the sequence. In order to sum for u > 0 without going out of bounds, the sequence must be padded with zeros; that is, n zeros are added to the end of the sequence. Worth noting is that calculating the autocorrelation of a sequence is equivalent to convolving it with a time-reversed copy of itself; using this observation it is easy to extend the method to two or more dimensions using multidimensional convolution.

For a periodic sequence the autocorrelation has the same frequency as the original sequence. The period can be calculated by finding the second peak of the correlation curve and measuring the distance between it and the y-axis. An approximate position can be found by looking for sign changes of the derivative of the curve, but this approximation has low precision. The measurement can be improved by interpolation: a second-degree polynomial is fitted to the values closest to the approximate maximum. Sample results can be seen in figure 7.

Instead of using the area of the silhouette one can use the ratio between height and width of the bounding box [30] in the same way.

The walking direction of the subject with respect to the camera affects the results of the cadence measuring method described above. Viewing the subject from the side gives half the cadence compared to viewing it from behind or from the front [30]. To solve this problem one notices that a step takes on average a little more than half a second at normal walking speed (here an average step time of 0.56 ± 0.02 seconds was calculated by manually estimating the cadence of the subjects in dataset 2). Thus, if the detected cadence exceeds a threshold, the result is halved. Here the threshold is set to 0.75 seconds per step. This can, however, be risky: if a person is walking slowly, his cadence may be incorrectly halved. As only regular walking is of interest in this project, this risk is acceptable.
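The complete cadence estimation can be sketched as below, again in Python/NumPy and with illustrative names: area is the per-frame silhouette area and fps the frame rate of the camera. The sketch follows the steps above (de-trending, zero-padded autocorrelation, peak detection via the sign change of the derivative, parabolic refinement, and halving above the 0.75 s threshold), but it is not the exact implementation used in the project.

import numpy as np

def estimate_cadence(area, fps, max_step_time=0.75):
    """Estimate seconds per step from a per-frame silhouette area signal."""
    t = np.arange(len(area))

    # Remove the trend caused by the subject changing distance to the camera.
    trend = np.polyval(np.polyfit(t, area, 2), t)
    a = area - trend

    # Autocorrelation; zero-padding keeps the shifted sum inside bounds.
    padded = np.concatenate([a, np.zeros_like(a)])
    c = np.array([np.dot(padded[:len(a)], padded[u:u + len(a)])
                  for u in range(len(a))])

    # First local maximum after lag 0, found via a sign change of the derivative.
    d = np.diff(c)
    peaks = np.where((d[:-1] > 0) & (d[1:] <= 0))[0] + 1
    if len(peaks) == 0:
        return None
    p = peaks[0]

    # Parabolic interpolation around the approximate maximum for sub-frame precision.
    y0, y1, y2 = c[p - 1], c[p], c[p + 1]
    offset = 0.5 * (y0 - y2) / (y0 - 2 * y1 + y2)
    period_frames = p + offset

    step_time = period_frames / fps
    # Front- or back-view measurements give about twice the side-view period;
    # halve implausibly long step times.
    if step_time > max_step_time:
        step_time /= 2.0
    return step_time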

4.3.4 Energy images

The Gait Energy Image [19] and Active Energy Image [34] are two of the most basic and efficient gait representations. They use pixel-wise means of a sequence of silhouettes or means of differences between silhouettes.


Figure 7: The graphs in the left column show the data gathered from the sequence of bounding boxes of the subject. The right column displays the correlation curves of the de-trended data. The cadence is the position of the maximum, marked with red rings in the graphs in the right column. In the second row, detection of the cadence fails because the trend is not approximately a second-degree polynomial. Rows one and three show that the method handles noisy measurements well.


They can be combined with dimensionality reduction methods such as principal component analysis and its relatives. Zhang et al. [34] present a comparison of the performance of the different energy representations when solving identification problems on a standard data set. A problem with all of these representations is that they are sensitive to changes in viewing angle. On the other hand, they are very lightweight memory-wise and computationally fast.

The most basic of the energy representations is the Gait Energy Image (GEI) [19]. Let B_t(x, y), t = 1, …, N, be a sequence of bounding boxes containing the silhouette of the subject. The bounding boxes have been resized to a standard size (here 50 × 30 pixels). The GEI G is then defined as

G(x, y) = (1/N) Σ_{t=1}^{N} B_t(x, y). (22)

This is a gray–level image where bright regions correspond to regions where the silhouette frequently appears, such as the torso of the subject. The representation also reduces the effect of segmentation errors of the silhouette.
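A minimal sketch of the GEI computation, assuming boxes is a NumPy array of N binary silhouettes already resized to 50 × 30 pixels:

import numpy as np

def gait_energy_image(boxes):
    """GEI: pixel-wise mean of N resized binary silhouettes (eq. 22).

    boxes: array of shape (N, 50, 30) with values in {0, 1}.
    """
    return boxes.mean(axis=0)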

The Active Energy Image (AEI) [34] is defined in the following way: let B_t and N be as above and let

D_t(x, y) = B_t(x, y)                       if t = 1,
D_t(x, y) = |B_{t−1}(x, y) − B_t(x, y)|     if t > 1. (23)

The AEI A is then defined as

A(x, y) = (1/N) Σ_{t=1}^{N} D_t(x, y). (24)

The AEI is the average of the active regions of the silhouette sequence. The areas where most of the movement occurs contain the most information about the gait and are highlighted by this representation.
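The corresponding AEI sketch, under the same assumptions on boxes as for the GEI above:

import numpy as np

def active_energy_image(boxes):
    """AEI: mean of absolute frame differences (eqs. 23-24).

    boxes: array of shape (N, 50, 30); the first term is the first
    silhouette itself, as in the definition above.
    """
    diffs = np.abs(np.diff(boxes.astype(float), axis=0))   # |B_{t-1} - B_t| for t > 1
    return (boxes[0].astype(float) + diffs.sum(axis=0)) / len(boxes)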

Examples of GEI and AEI images can be found in figure 8. The figure also illustrates the problem with viewpoints. Consider the GEI in the left column: this person walks from the right part of the scene to the left. If this person instead walked towards the camera, his GEI would look more like the one in the middle column, and it would be very hard to match those two gait energy images.

Figure 8: The first row displays GEIs and the second AEIs. The columns correspond to different subjects. Notice the large differences between the columns caused by differences in viewpoint.

4.3.5 Fourier transform of gait silhouette volumes

A volume of silhouettes can be created by simply stacking the elements B_t(x, y), t = 1, …, N, on top of each other: let V be the volume, then the point V(x, y, t) corresponds to pixel (x, y) of frame t, see figure 9.


From this volume one can extract frequency features that are otherwise unavailable. To extract these features the discrete 3D Fourier transform can be applied [20]. Let Ṽ be the Fourier transformed volume. The gait has low frequency compared to the noise, so the amount of noise is reduced by removing the high frequencies. Here the lowest 25% were kept and the higher frequencies were removed. Finally, the power spectrum |Ṽ| is calculated and used to represent the gait of the subject.
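A possible NumPy sketch of this step is shown below. The exact meaning of keeping the lowest 25% is not spelled out above, so the mask below, which keeps frequencies up to 25% of the Nyquist frequency along each axis, is only one interpretation; volume is the stacked silhouette array.

import numpy as np

def gait_volume_spectrum(volume, keep=0.25):
    """Low-frequency magnitude spectrum of a gait silhouette volume.

    volume: array built by stacking the bounding boxes, e.g. shape (H, W, N).
    keep: fraction of the lowest frequencies retained along each axis.
    """
    spectrum = np.fft.fftn(volume)

    # Build a low-pass mask: along each axis keep frequencies below keep * Nyquist.
    mask = np.ones(volume.shape, dtype=bool)
    for axis, size in enumerate(volume.shape):
        freqs = np.abs(np.fft.fftfreq(size))          # cycles per sample, in [0, 0.5]
        axis_mask = freqs <= keep * 0.5
        shape = [1] * volume.ndim
        shape[axis] = size
        mask &= axis_mask.reshape(shape)

    return np.abs(spectrum) * mask                    # magnitude spectrum, high freqs zeroed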

Figure 9: Sample gait silhouette volume.

4.3.6 Contour based method

The contour of a silhouette is the border of the silhouette. It can be found using any standard boundary tracing algorithm [9]. Due to segmentation errors the silhouette may be split into several components.

In such cases the contour of the largest component of the silhouette is used. See figure 5 for an example of a contour.

The distance from the center of mass of the silhouette to the contour can be used to represent the silhouette [30]. This representation is called a distance signal. The distance signal is defined as a sequence S(n) = {d_1, d_2, …, d_n, …, d_N}, where d_n is the Euclidean distance between border pixel n, counted clockwise from a starting pixel, and the center. The border pixel directly above the center is chosen as the starting pixel. The signal magnitude is then normalized using the L∞-norm (maximum norm) and resampled to a standard size, here 360 points. See figure 10 for an example of a distance signal of a contour. The sequence of distance signals is then used as a feature simply by putting the signals next to each other. See figure 11 for sample gait representations using contour data.
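A sketch of the distance signal computation, using scipy.ndimage for the connected components and skimage.measure.find_contours as a stand-in for the boundary tracing algorithm; the exact start point and traversal direction therefore follow the library's convention rather than the description above.

import numpy as np
from scipy import ndimage
from skimage import measure

def distance_signal(silhouette, n_points=360):
    """Centroid-to-contour distances, max-normalized and resampled.

    silhouette: 2D binary array containing one (possibly fragmented) silhouette.
    """
    # Keep only the largest connected component.
    labels, n = ndimage.label(silhouette)
    if n == 0:
        return np.zeros(n_points)
    sizes = ndimage.sum(silhouette, labels, range(1, n + 1))
    largest = labels == (np.argmax(sizes) + 1)

    # Ordered contour points (row, col) and the centroid of the component.
    contour = measure.find_contours(largest.astype(float), 0.5)[0]
    cy, cx = ndimage.center_of_mass(largest)

    # Euclidean distance from the centroid to each contour point.
    d = np.hypot(contour[:, 0] - cy, contour[:, 1] - cx)

    # Start the signal near the contour point directly above the centroid.
    above = np.where(contour[:, 0] < cy)[0]
    start = above[np.argmin(np.abs(contour[above, 1] - cx))]
    d = np.roll(d, -start)

    # Normalize with the maximum norm and resample to n_points samples.
    d = d / d.max()
    return np.interp(np.linspace(0, 1, n_points), np.linspace(0, 1, len(d)), d)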

4.3.7 Frame Difference Energy Image

The Frame Difference Energy Image (FDEI) [5] is calculated in four steps. In the first step the sequence of silhouettes is clustered such that each cluster represents a certain leg stance of the subject.

In the second step, the GEI G_c(x, y) of the c-th cluster is calculated as

G_c(x, y) = (1/N_c) Σ_{t ∈ A_c} B_t(x, y), (25)

where A_c is the set of time indices of the silhouettes contained in the c-th cluster and N_c is the number of silhouettes in that cluster. The GEI is then denoised as

D_c(x, y) = G_c(x, y)     if G_c(x, y) > T,
D_c(x, y) = 0             otherwise, (26)

where T is a threshold; here T = 0.8 is used. In the third step the positive part of the frame difference image is calculated as

FD_t(x, y) = 0                              if B_t(x, y) ≥ B_{t−1}(x, y),
FD_t(x, y) = B_{t−1}(x, y) − B_t(x, y)      otherwise, (27)


Figure 10: Contours and corresponding normalized distance signals.


Figure 11: Sample contour representations. Every column corresponds to a contour. The gray values correspond to the normalized distance from the center of the silhouette to its contour.

where B_0 is chosen as the last silhouette in the sequence. Finally, the FDEI is calculated as

FDEI_t(x, y) = FD_t(x, y) + D_c(x, y), (28)

where t ∈ A_c and D_c corresponds to the GEI obtained from the c-th cluster. The frame differencing helps recover missing parts of silhouettes, and adding the GEI compensates for cases where the same part of the silhouette is missing in both B_t and B_{t−1}. The number of clusters is the cadence rounded to the nearest integer, which gives one cluster for each frame that makes up a step. See figure 12 for examples of FDEI images.
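The FDEI computation could be sketched as follows. The clustering criterion is not fully specified above, so the sketch assumes that frames are assigned to clusters by their phase within a step (cluster index t mod k), which is consistent with having one cluster per frame of a step; boxes and cadence_frames are illustrative names.

import numpy as np

def frame_difference_energy_images(boxes, cadence_frames, threshold=0.8):
    """FDEI sequence (eqs. 25-28) for a silhouette sequence.

    boxes: array of shape (N, H, W) with values in [0, 1].
    cadence_frames: cadence expressed in frames per step.
    """
    boxes = boxes.astype(float)
    n_frames = len(boxes)
    k = max(1, int(round(cadence_frames)))
    clusters = np.arange(n_frames) % k          # assumed phase-based clustering

    # Denoised GEI per cluster (eqs. 25-26).
    denoised = np.zeros((k,) + boxes.shape[1:])
    for c in range(k):
        gei = boxes[clusters == c].mean(axis=0)
        denoised[c] = np.where(gei > threshold, gei, 0.0)

    # Positive part of the frame difference (eq. 27); B_0 is the last silhouette.
    previous = np.concatenate([boxes[-1:], boxes[:-1]])
    fd = np.clip(previous - boxes, 0.0, None)

    # FDEI_t = FD_t + D_c, where c is the cluster of frame t (eq. 28).
    return fd + denoised[clusters]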

As a representation of the FDEI, its sequence of Frieze patterns is used. The Frieze pattern F(y, t) is defined as

F(y, t) = Σ_x FDEI_t(x, y), (29)

i.e. it is the projection of the FDEI on the y-axis. Examples of Frieze patterns can be seen in figure 13.

The Frieze patterns are finally put next to each other to form a surface. This surface can then be treated as a regular image, see figure 14. This image is used as a representation of gait.
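A sketch of the Frieze surface construction, reusing the FDEI sequence from the sketch above:

import numpy as np

def frieze_surface(fdeis):
    """Frieze pattern per FDEI (projection on the y-axis, eq. 29),
    stacked into a surface with one column per frame.

    fdeis: array of shape (N, H, W) as returned by the FDEI step above.
    """
    patterns = fdeis.sum(axis=2)      # sum over x, giving one curve per frame
    return patterns.T                 # columns correspond to frames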

4.3.8 Self–Similarity Plot

When a person is walking, a set of key stances is repeated over and over. Using this observation, gait can be represented by the similarity between these stances. One way to do this is the Self-Similarity Plot (SSP) [2] of the sequence. The SSP S is defined in the following way: let B_t and N be as above, then

S(i, j) = Σ_x Σ_y |B_i(x, y) − B_j(x, y)|, (30)

where the summation is over the width and height of the image. The SSP is then normalized using the maximum–norm. This procedure produces a periodic pattern where dark pixels correspond to frames where the stances are similar. See figure 15 for sample SSP images.
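A compact NumPy sketch of the SSP; for long sequences the pairwise difference tensor can be large, in which case a loop over frame pairs is preferable.

import numpy as np

def self_similarity_plot(boxes):
    """SSP (eq. 30): sum of absolute differences between every pair of
    silhouettes, normalized with the maximum norm.

    boxes: array of shape (N, H, W).
    """
    flat = boxes.reshape(len(boxes), -1).astype(float)
    ssp = np.abs(flat[:, None, :] - flat[None, :, :]).sum(axis=2)
    return ssp / ssp.max()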

4.3.9 Comparing gaits using normalized cross–correlation

All the representations of gait described above except the Fourier transform of the gait silhouette volume are gray–scale images, i.e. every pixel represents a single value. A standard way of comparing two images is to use the normalized cross–correlation γ(u, v). It is defined as

γ(u, v) = Σ_{x,y} [I(x, y) − Ī(u, v)] [T(x − u, y − v) − T̄]
          / √( Σ_{x,y} [I(x, y) − Ī(u, v)]² · Σ_{x,y} [T(x − u, y − v) − T̄]² ) (31)
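A direct sketch of equation (31) for a single displacement (u, v); in practice the maximum of γ over all displacements can serve as the similarity between two gait representations. The names and the row/column convention below are illustrative, not taken from the project's implementation.

import numpy as np

def ncc(image, template, u, v):
    """Normalized cross-correlation gamma(u, v) of eq. (31) for one displacement.

    image, template: 2D arrays (two gait representations);
    u, v: column and row offset of the template inside the image.
    """
    th, tw = template.shape
    window = image[v:v + th, u:u + tw].astype(float)    # region of I under the template
    t = template.astype(float)

    dw = window - window.mean()                         # I(x, y) minus the local mean of I
    dt = t - t.mean()                                    # T minus its mean

    denom = np.sqrt((dw ** 2).sum() * (dt ** 2).sum())
    return (dw * dt).sum() / denom if denom > 0 else 0.0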


Figure 12: Sample FDEIs. The first row shows sample denoised GEIs from different clusters, the second row shows sample frame difference images, and the third row shows the final FDEIs, i.e. the sum of the two images above.

Figure 13: Frieze curves corresponding to the FDEIs in figure 12.


Figure 14: Sample Frieze surfaces. Every column corresponds to a Frieze curve and the gray values correspond to the values of the curve.

Figure 15: Sample SSPs. These are from dataset 2. No pattern is distinguishable in SSPs from dataset 1 due to the low frame rate and poor segmentation.
