Recognition of Targets in Camera Networks


Department of Science and Technology (Institutionen för teknik och naturvetenskap)

LiU-ITN-TEK-A--08/120--SE

Recognition of Targets in Camera Networks

Fredrik Johansson

2008-11-21


Thesis work in Media Technology carried out at the Institute of Technology, Linköping University

Supervisor: Manne Anliot

Examiner: Björn Kruse


Upphovsrätt (Copyright)

This document is made available on the Internet – or its possible replacement – for a considerable time from the date of publication, provided that no extraordinary circumstances arise.

Access to the document implies permission for anyone to read, download, print single copies for personal use, and to use it unchanged for non-commercial research and for teaching. A later transfer of the copyright cannot revoke this permission. All other use of the document requires the consent of the author. To guarantee authenticity, security and accessibility, there are solutions of a technical and administrative nature.

The author's moral rights include the right to be mentioned as the author, to the extent required by good practice, when the document is used as described above, as well as protection against the document being altered or presented in a form or context that is offensive to the author's literary or artistic reputation or character.

For additional information about Linköping University Electronic Press, see the publisher's home page: http://www.ep.liu.se/

Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for a considerable time from the date of publication barring exceptional circumstances.

The online availability of the document implies a permanent permission for anyone to read, to download, to print out single copies for his/her own use and to use it unchanged for any non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional on the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its WWW home page: http://www.ep.liu.se/


Abstract

This thesis presents a re-recognition model for use in area camera network surveillance systems. The method relies on a mix of covariance matrix feature descriptions and Bayesian networks for topological information. The system consists of an object recognition model and a re-recognition model. The object recognition model is responsible for separating people from the background and generating the position and description of each person in each frame. This is done by using a foreground-background segmentation model to separate the background from a person. The segmented image is then processed by a tracking algorithm that produces the coordinates of each person. It is also responsible for creating a silhouette that is used to create a feature vector consisting of a covariance matrix that describes the person's appearance. A hypothesis engine is then responsible for connecting the coordinates into a continuous track that describes the trajectory a person has followed.

Every trajectory is stored and available to the re-recognition model. It compares two covariance matrices using a sophisticated distance method to generate a probabilistic score value. The score is then combined with a likelihood value of the topological match, generated with a Bayesian network structure containing gathered statistical data. The topological information is mainly intended to filter out the most unlikely matches.


Contents

1 Introduction
  1.1 Background
  1.2 Problem Formulation
  1.3 Objective
2 Problem Analysis
  2.1 System design
  2.2 Re-recognition and features
  2.3 Topology
    2.3.1 Local
    2.3.2 Global
  2.4 Data management
    2.4.1 Single data bank
    2.4.2 Multiple data banks
  2.5 Conclusion
3 Model
  3.1 Overview
  3.2 TrackEye Interface
  3.3 The hypothesis engine
    3.3.1 Measurement
  3.4 Object recognition model
    3.4.1 Load Image
    3.4.2 Segmentation
    3.4.3 Tracking
    3.4.4 Merge
    3.4.5 Feature Extraction
    3.4.6 Association Node
  3.5 Re-recognition Model
    3.5.1 Memory management
    3.5.2 Topological structure
  3.6 Event Node
4 Implementation
  4.1 Overview
  4.2 The re-recognition node
    4.2.1 Input data
    4.2.2 Appearance
    4.2.3 Topological information
5 Results & Conclusion
  5.1 Results
    5.1.1 Simulation Set-up
    5.1.2 Object recognition model
    5.1.3 Re-recognition model
  5.2 Conclusion & Future work
A Bayesian Networks
  A.1 Probability and Random Variables
    A.1.1 Frequentistic versus Bayesian
    A.1.2 Probability
  A.2 Bayesian Networks
    A.2.1 Bayes' Theorem
    A.2.2 Likelihood and Inference
    A.2.3 Bayesian Networks


Chapter 1

Introduction

This chapter presents the background of the thesis, along with a problem formulation and an objective.

1.1 Background

Image Systems AB is a high-tech company specializing in motion analysis, advanced software development and high-resolution film digitizing. One of its software packages, TrackEye, is primarily aimed at the flight/military market. TrackEye is the world-leading system for advanced motion analysis on military test ranges. TrackEye was selected as the primary software for demonstration of a use-case scenario in the research project Intelligent Surveillance.

Intelligent Surveillance is governmentally funded via FOCUS and is conducted in collaboration with FOI, Saab Bofors Dynamics, Image Systems AB and Linköping University.

The aim of project Intelligent Surveillance is to meet the demand for increased security in public places such as airports and train stations. Current surveillance systems are not pro-active, in the sense that they almost exclusively collect data without analyzing it.

Therefore the aim of the system is to alert the operator to suspicious behavior and, if possible, predict likely events. With this aid the operator will have better situational awareness and will be able to act upon discovered threats or aid persons in need of help. Suspicious behavior can include a person dropping a bag, a person in a restricted area or crowds uniformly moving away from a spot.


1.2 Problem Formulation

The initial problem was formulated as:

Develop, within the project Intelligent Surveillance, algorithms and data structures to support, in a reliable and efficient way, re-recognition of people moving inside a network of non-overlapping cameras.

The result should be presented graphically.

The result should also be implemented using the TrackEye API C++ environment.

1.3 Objective

The objectives of the thesis are:

- A computer model representing the topological relationship amongst different ports in different camera views.

- An option for long-term storage of personal signatures for every invisible section between a number of ports.

- A simple method, possibly a text-file, to define the topological model for a certain surveillance system.

- An API towards the FOCUS system to add and retrieve personal signatures at ports in different camera views.


Chapter 2

Problem Analysis

2.1 System design

The problem of tracking a person in a network of cameras inherits many of the unsolved problems in the complex field of object recognition. One of the greater issues with camera networks is the amount of data processed. The huge amount of data is one of the reasons why there is a need to automate the process of surveillance as much as possible. In [2] the authors propose a framework for analyzing video surveillance feeds and alerting the user whenever something suspicious occurs. The authors divide the problem into two different stages, data fusion and event recognition. The last step is the part where the system identifies any abnormal behaviors and alerts the user. Data fusion is the step where they try to solve a problem similar to this report's: analyzing the data and presenting motion trajectories of people from different camera sources. They promote a module-based set-up in a hierarchical structure with multiple slave systems and a single master system. The advantage of such a set-up is that there is no practical limitation on the number of cameras used, since some

computationally intensive tasks can be carried out in the slave systems. A similar approach is suitable in the module-based structure of TrackEye, where the slave and master systems can be mimicked by using different modules. It is also preferred since some functionality is already implemented.

2.2 Re-recognition and features

Since the aim is to re-recognize a person, we must assume that the recognition of a person has already been accomplished and limit the focus to the re-recognition problem. Therefore we can break the problem down into a comparison between two object recognition samples, producing a score of how well the objects match.

The appearance of a person is the only thing that can tell two persons apart. Unfortunately, since the problem of object recognition hasn’t been completely solved we can only hope to have a fairly good measurement of the appearance. The limitations will for example make it very difficult to distinguish two persons dressed in the same color scheme.

A way to approach this problem is to gather as much information as possible; this information represents a person's features. The features can be used to determine if there is a match or not. The position and time are usually available and are very useful in determining if a person has been previously recognized in a surveillance system. They allow a relatively easy way of filtering out the less plausible matches even when the appearance matches perfectly. A good example can be found in [8], where the authors have implemented a traffic surveillance system. Cars are captured in a video stream at two locations. The cars are filmed in one camera and re-recognized in the other with the help of the travel time and originating lane. A car arriving in the same lane and at a plausible time yields a better match. To combine appearance, time and position, a Bayesian model has been developed and implemented. The model allows for parameter updates and restrictions. The restrictions are particularly useful in their scenario, where the cars only move in one direction and are guaranteed to end up in the second camera.
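The fusion of an appearance likelihood with a spatio-temporal likelihood, as in the Bayesian model discussed above, can be sketched as follows. This is a minimal illustration, not the implementation from [8]; the Gaussian travel-time model, the function names and all parameter values are assumptions.

```cpp
#include <cmath>

// Likelihood of observing travel time t given an expected mean and
// standard deviation (unnormalized Gaussian). The Gaussian model is an
// illustrative assumption.
double travelTimeLikelihood(double t, double mean, double stddev) {
    double z = (t - mean) / stddev;
    return std::exp(-0.5 * z * z);
}

// Naive-Bayes-style fusion: the joint match score is the product of the
// (assumed independent) appearance and travel-time likelihoods.
double matchScore(double appearanceLikelihood, double t,
                  double meanTravelTime, double stddev) {
    return appearanceLikelihood * travelTimeLikelihood(t, meanTravelTime, stddev);
}
```

A car with a perfect appearance match but an implausible travel time thus receives a low combined score, which is exactly the filtering effect described above.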

The same kind of restrictions are not possible in a network of non-overlapping cameras. In the worst-case scenario a person could move from one camera to all other available cameras.

A scenario without restrictions quickly becomes unmanageable when there are many people; see figure 2.1 for a particularly bad scenario. The reason is that a person entering a camera would have to be compared to every person in the system. Some sort of data reduction is needed.


Figure 2.1: Worst case scenario

2.3 Topology

The general approach is to introduce a topological description allowing the system to know the relationship between the cameras. Two cameras are related if it is possible to get from one camera to the other without being detected in another camera. Two cameras watching the two exits of a hallway would, for example, be related. With the help of the topological information, a newly detected person need only be compared with those persons who disappeared in a neighboring camera. In TrackEye we can introduce the topology in principally two different ways: with a local or a global model.
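The candidate filtering described above can be sketched as a lookup against a neighbor relation. The data structures and names below are illustrative assumptions, not the actual TrackEye interfaces.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// A person who left some camera's field of view; the struct is a
// stand-in for the stored track data.
struct Disappearance {
    std::string personId;
    std::string camera;  // camera where the person was last seen
};

// Return only those disappeared persons whose last camera is a
// topological neighbor of the camera where the new detection occurred.
std::vector<std::string> candidates(
    const std::map<std::string, std::set<std::string>>& neighbors,
    const std::vector<Disappearance>& disappeared,
    const std::string& newCamera) {
    std::vector<std::string> out;
    auto it = neighbors.find(newCamera);
    if (it == neighbors.end()) return out;
    for (const auto& d : disappeared)
        if (it->second.count(d.camera)) out.push_back(d.personId);
    return out;
}
```

With such a filter, the number of appearance comparisons grows with the number of related cameras rather than with the total number of persons in the system.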

2.3.1 Local

The local model uses the advantages of the hierarchical structure by allowing modules to solve one step for local recognition and one step for re-recognition. The local recognition module works in a similar manner to a single-camera recognition system. The module would only be connected to a single camera feed and be focused on following persons in the camera's field of view. Whenever a person leaves, the local module would alert the global module and allow access to the person's collected features. And whenever a person enters a camera and is not identified by the local module, the global module would search among all the stored persons for a match.

The main advantage of this model is its streamlined data flow. Every module could be treated as a separate system where some functionality could be implemented in hardware. This would make it possible to construct cameras that take care of the local object recognition, with a central server for global re-recognition, similar to the design in [2].

2.3.2 Global

The global model is quite similar to the local model; the main difference is the local step. Instead of reducing the data available at the global module, all of the tracked persons are available. Even though the amount of computation would increase drastically, the benefit could be a much more accurate re-recognition result. The reason is that the topological information would not have to be known initially; instead it could be trained from the available data using Bayesian networks or similar methods, also known as belief networks [3]. This also means that, when all data is available, it is possible to scale the complexity of the implementation, from a very simple system relying on a few factors to a very complex system relying on all gathered data with extensive parameter training. The main concern, though, is the huge amount of data that needs to be processed if the number of people surveyed is large.

2.4 Data management

Whichever of the local and global models is preferred, the amount of data available to the system will be great. Careful design of the data flow is therefore beneficial. Whenever a person leaves a camera, its features and trajectory need to be stored in memory.

2.4.1 Single data bank

This method is the most direct and simple. Every person's information is stored in a globally available data bank. Whenever the system identifies an object, a search is made for a possible match. Any topological restriction needs to be explicitly defined. This simply means a check to see whether it is likely for a person to leave one camera and appear in the particular location.

The greatest advantage lies in its simplicity; implementation should be straightforward. The method is also more robust, since an exhaustive search is made for the best possible match. However, the disadvantage is that performance will decrease drastically when the number of cameras is large.
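The exhaustive search over a single data bank can be sketched as a linear scan for the best-scoring candidate. The `Person` struct, the toy similarity function and the threshold below are illustrative assumptions, not the actual signature representation.

```cpp
#include <string>
#include <vector>

// Stand-in for a stored person: a real system would store a covariance
// descriptor, not a single number.
struct Person {
    std::string id;
    double signature;
};

// Toy similarity: closer signatures give higher scores in (0, 1].
double score(const Person& a, const Person& b) {
    double d = a.signature - b.signature;
    return 1.0 / (1.0 + d * d);
}

// Exhaustive search over the global data bank; returns the index of the
// best match, or -1 if no candidate scores above the threshold.
int bestMatch(const std::vector<Person>& bank, const Person& query,
              double threshold) {
    int best = -1;
    double bestScore = threshold;
    for (std::size_t i = 0; i < bank.size(); ++i) {
        double s = score(bank[i], query);
        if (s > bestScore) { bestScore = s; best = static_cast<int>(i); }
    }
    return best;
}
```

The linear cost of this scan over all stored persons is precisely why performance degrades as the camera network grows.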

2.4.2 Multiple data banks

The multiple data banks approach arranges the data according to the topological structure. For every intersection where people can move between cameras there can exist a data bank where the person's information is stored. Whenever a person leaves a camera, the topological information reveals the camera's location, and the stored information is only available at these connected cameras. This means that for every newly detected object, a search is only made amongst the persons who can have arrived at that camera. The advantage of this approach is its effectiveness. More cameras will not have the same dramatic effect on performance as in the more naive method of a single data bank. Another advantage is the inherent filtering of the more unlikely matches, which also could have an effect on performance.

The effectiveness does come at the cost of added complexity and a stronger reliance on the quality of the object recognition process. A mislabelled person would, for example, be very hard to re-recognize once stored in the wrong data bank. The method also relies on an effective camera layout: if it is possible to move undetected between any camera and any other camera, the method effectively uses only a single data bank and the advantages are gone.

2.5 Conclusion

The decision was made to focus on the local topological model. The main reason was the appealing set-up of a master-slave system, which is closer to a genuine surveillance system. It would also result in a less complex system, which is almost a requirement given the limited time span of the thesis. A choice was also made to focus on the single data bank data management method, again based on the appeal of simplicity. The data management method is mostly a choice dependent on the performance criteria, and in this case it would always be possible to switch methods if one performed inadequately.


Chapter 3

Model

This chapter introduces the proposed model used for re-recognition in camera networks. This includes the complete chain from analyzing the video up to a successful re-recognition.

3.1 Overview

The problem analysis highlighted some of the major aspects of designing a complex surveillance system. The result indicated that a module-based approach was preferable, largely due to the internal structure of TrackEye. The flow-like structure makes the system more easily maintainable and allows groups of people to work together. However, much like a chain, the flow-like structure is no stronger than its weakest link, and a poorly performing component will affect the end result drastically.

An object recognition model was partially available, with the aim of providing adequate data for re-recognition. Given the capabilities of the object recognition model, a re-recognition model was developed; it is presented after an introduction to the TrackEye interface and a description of the object recognition model.

3.2 TrackEye Interface

The most central term when using TrackEye is the session. A session can be described as an empty sheet of paper where the user can add functionality by adding icons. Each icon can be designed to produce, consume or manipulate data. The icons can be connected by drawing an arrow from one icon to another; a set of connected icons becomes a network where each icon is a node. Icons and nodes have basically the same meaning; the term icon refers to the visual representation of a node in the network.

The network used by the re-recognition model is referred to as the pipeline, to emphasize that data only moves in one direction without support for any feedback loops.

3.3 The hypothesis engine

The hypothesis engine was designed at Saab Bofors Dynamics and was developed to track fighter jets and missiles in a partially cloudy sky.

3.3.1 Measurement

The interface to the hypothesis engine is a measurement class; it is basically a container used to send data into the engine. The measurement initially contained some bookkeeping variables used purely by the engine itself. To make the hypothesis engine support re-recognition capabilities, the measurement class needed to be extended with the data used in the re-recognition model: 2D and 3D coordinates, the covariance matrix and the camera name were added.

Not all tracks in the re-recognition node's input data will generate a measurement object. The main reason is that the first couple of observed frames will usually show only a small part of a person's body and therefore generate poor covariance samples. It takes a little while before the whole body enters the camera's field of view.
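The extended measurement container described above can be sketched as a plain struct. The actual TrackEye class layout is not public; the field names and the bookkeeping members below are illustrative assumptions based on the text.

```cpp
#include <array>
#include <string>

// Sketch of the extended measurement container. "trackId" and
// "timestamp" stand in for the engine's original bookkeeping variables;
// the remaining fields are the re-recognition extensions named in the
// text.
struct Measurement {
    // Original bookkeeping used by the hypothesis engine itself.
    int trackId = -1;
    double timestamp = 0.0;

    // Extensions added for re-recognition support.
    std::array<double, 2> position2d{};                 // pixel coordinates
    std::array<double, 3> position3d{};                 // 2.5D world coordinates
    std::array<std::array<double, 9>, 9> covariance{};  // appearance descriptor
    std::string cameraName;
};
```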

3.4 Object recognition model

The task of the object recognition model is to analyze video data and track people wherever they move in the video. Whenever a person is tracked, the model should produce their position and appearance. At the heart of the object recognition model is the hypothesis engine, capable of producing the most likely scenario of how people have moved across the camera's field of view. The object recognition pipeline is defined by adding a series of nodes, ordered from left to right, representing the flow of the data. Every icon represents a node with different functionality. Some existing TrackEye components are used, but most of them are developed to solve complex recognition problems. Each pipeline can only operate on one video feed; multiple video feeds are supported by duplicating the pipeline. The following nodes are used in the object recognition model:

Figure 3.1: The object recognition model in a TrackEye session

3.4.1 Load Image

The load image node is a producing node responsible for loading the surveillance video. Several formats are supported, such as MPEG-2.

3.4.2 Segmentation

The segmentation node is responsible for separating people from the background. This segmentation is one of the most difficult tasks in the re-recognition pipeline; if it fails or is of poor quality, it is very difficult to produce good data for a successful object re-recognition.

The segmentation node takes the loaded image as input and produces a silhouette image of all moving objects in the camera's field of view. Objects are represented as white areas against a black background.

A codebook model was chosen due to its capabilities in handling scenes with moving background objects and varying illumination [7]. The algorithm was implemented at FOI and tailored at Image Systems to work in the TrackEye environment.


3.4.3 Tracking

The tracker node is responsible for tracking every segmented object's position. The tracker existed previously, where it was used for tracking airbag deployment, and it is well suited to track the silhouettes of the objects. The tracker in this set-up is rather "dumb" in the sense that it does not combine the measured positions into a trajectory; it only produces a 2D position for each silhouette.

3.4.4 Merge

A merge node combines its inputs into a single data feed and is needed when multiple data feeds are used as input to a node. In this context it is used when 2D and 2.5D data are combined, and before the re-recognition node to combine the different cameras in the network. The node can be used in arbitrary situations where different data sets need to be fused.

3.4.5 Feature Extraction

This node extracts features in the area bounded by the silhouettes. Every pixel is used to calculate a feature vector containing spatial and color information. The feature vectors are then combined to form a covariance matrix describing the appearance of the silhouette at each frame.

3.4.6 Association Node

The association node is the node housing the hypothesis engine. The hypothesis engine is used to determine the most likely scenario. For each new position generated by the tracker node a measurement is generated containing the position and the time. The measurement is added to the engine to match it with measurements previously added.

The matching is done using a Kalman filter. The Kalman filter was introduced in 1960 by Rudolph E. Kalman [6] and has since been used to estimate states from noisy sensor measurements. The filter is a set of equations using a state-space model to minimize an estimated error covariance. The Kalman filter is extensively used in tracking and motion-prediction applications. For more information and an introduction to the Kalman filter, see appendix B or [9].
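The predict/update cycle of the Kalman filter can be illustrated with a minimal one-dimensional sketch. A real tracker would use a multi-dimensional state (position and velocity) and a full covariance matrix; the scalar state and all noise values here are illustrative assumptions.

```cpp
// Minimal 1D Kalman filter sketch: a constant-state model where the
// predict step only grows the uncertainty, and the update step blends
// prediction and measurement via the Kalman gain.
struct Kalman1D {
    double x = 0.0;  // state estimate
    double p = 1.0;  // estimated error covariance
    double q;        // process noise
    double r;        // measurement noise

    Kalman1D(double processNoise, double measurementNoise)
        : q(processNoise), r(measurementNoise) {}

    // Predict: the state is assumed constant, so only p changes.
    void predict() { p += q; }

    // Update: the gain k weighs the measurement z against the prediction.
    void update(double z) {
        double k = p / (p + r);  // Kalman gain
        x += k * (z - x);
        p *= (1.0 - k);
    }
};
```

Feeding the filter repeated measurements of the same value drives the estimate toward that value while the error covariance shrinks, which is the behavior the hypothesis engine relies on when matching new positions to existing tracks.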

Whenever a successful match is made, the measurements are combined to form a trajectory describing the track of a person. If the match is poor enough, the measurement is considered a new one and becomes the seed of a new track; that is how new tracks are introduced to the system.

Figure 3.2: The re-recognition model using 2 cameras in a TrackEye session

3.5 Re-recognition Model

The re-recognition model is an extension of the object recognition model. The entire re-recognition functionality is housed in the re-recognition node, from now on referred to as the R node. The R node is directly connected to the object recognition model via the association node. Every association node represents the output of one object recognition sequence. In other words, there is one association node for every camera or video feed.

3.5.1 Memory management

The task of the R node is to pair the trajectories generated at the association nodes. A trajectory contains the position and appearance of a person for each tracked frame. This information is added to the same, but slightly modified, hypothesis engine used in the association node. Since all trajectories are added to the hypothesis engine's memory, it performs the task of the single data bank discussed in the problem analysis in chapter 2. This means that whenever a new person enters, the R node will try to find a match among all trajectories currently in the hypothesis engine's memory.


3.5.2 Topological structure

Apart from containing the trajectories of each person, the re-recognition node also contains the topological information. The topological information describes the relationship among the cameras in the surveillance network. The relationship is described with the term gate, where a gate is defined as an area where people can enter or leave the camera's view. These gates can be doors, the edge of the camera's field of view, or even the edges of big objects that people can move behind. A gate is defined by its area and a list of other gates that a person can move to without being detected by any other installed camera. With this set-up it is possible to tell, with some degree of certainty, where a person should appear. The topological information is stored globally and is externally accessible as a text file. The text file follows a hierarchical, tree-like structure.
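The exact file format is not reproduced here; a hypothetical gate definition in the hierarchical, tree-like style described above might look like the following. All camera names, gate names, coordinates and keywords are invented for illustration.

```
# Hypothetical gate topology file (format assumed, not the actual one)
camera cam1
  gate door_north  area 0 0 120 40
    connects cam2/door_south
  gate edge_right  area 600 0 640 480
    connects cam2/edge_left
camera cam2
  gate door_south  area 0 440 120 480
    connects cam1/door_north
  gate edge_left   area 0 0 40 480
    connects cam1/edge_right
```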

3.6 Event Node

The event node is a simple interface that makes it possible to read the data from the re-recognition node and export it to various formats using existing TrackEye components. It is for example needed when visualizing tracks in an image diagram or exporting the data to a format convenient for MatLab.²

²MATLAB is a high-level language and interactive environment that enables computationally intensive tasks to be performed faster than with traditional programming languages such as C, C++, and Fortran. www.mathworks.com


Chapter 4

Implementation

This chapter presents the implementation of the re-recognition model.

4.1 Overview

All components were eventually integrated in TrackEye at Image Systems. The segmentation component was initially developed at FOI and the association component at Saab Bofors Dynamics. All other components were developed by Image Systems.

Both the association and re-recognition nodes share the same structure, due to the fact that they both use an instantiation of the hypothesis engine. The engine is the same for both nodes, apart from the fact that the re-recognition node has different input data and comparison functionality from what the engine was initially designed for.

4.2 The re-recognition node

The re-recognition node, or R node, is connected to the association node directly or, in the case of several video feeds, via the merge node. Either way the input is the same and consists of the trajectories generated at each association node. Since not all tracks qualify for re-recognition, some analysis of the input data is needed.

4.2.1 Input data

Apart from a video feed, some initial data is required to set up a session. To define where people can enter and leave, a set of coordinates for the gates is needed. To be able to transform the local coordinates into 2.5D world coordinates, the system needs at least 4 visible reference coordinates. The 2.5D coordinates are necessary to plot the trajectories on e.g. a map.

For each video feed in a session there will be an association node producing a track for each person. The track contains the 2D pixel coordinates, 2.5D world coordinates and a covariance matrix for each tracked frame; for performance reasons only every fifth frame is tracked. All the tracks from all video feeds are merged and sent to the re-recognition node, which sorts out which tracks belong to the same person. Tracks belonging to the same person are then merged to form a new track.

Each TrackEye node can have an iport and/or an oport, depending on the node's functionality. The ports represent the node's input and output, respectively. To connect two nodes, one connects an oport to an iport, where both ports point to the same data. The re-recognition node is either directly connected to an association node or indirectly connected via a merge node representing several connected association nodes. The structure is still the same and can be analyzed with the same method.

Even though the association node can produce several hypotheses, only one is used; therefore only one will be produced and available at the node's oport. Since only one hypothesis is available, the analysis of the iport is quite straightforward.

The data is ordered according to each sensor and track. Under the root node the sensors are connected; a sensor represents the camera or the video feed. There is only one sensor for each association node. Under each sensor the tracks are connected. The tracks are ordered according to the time of the first observation. Every track consists of three components: 2D and 3D coordinates and a covariance matrix.

(insert picture of the association mds)

2D

The 2D coordinates represent the pixel coordinates at the centre of each person, where the origin is defined in the bottom left corner. The 2D coordinates are one of the outputs from the auto tracker node. The auto tracker will track every silhouette of at least a user-defined size; the default is 60. The size parameter is needed to filter some of the noise from the segmented image. The filtering process is very effective against the salt-and-pepper type of noise produced by the segmentation of the video. The value should be set somewhere in the range close to the smallest size of a detected person in the video feed.

3D

The 3D coordinates are the 2D coordinates transformed into 3D world coordinates. To be able to transform the coordinates, at least 4 fixed measured points are required. Theoretically only 3 points are needed to build a plane used for the transformation, but 4 is the minimum for a relatively reliable result. The transformed 2D coordinates will not be the 'true' 3D coordinates; instead they represent the 2D coordinates in the world coordinate space with a third dimension representing a fixed height. This is not a big problem, since most of the surveillance is indoors, overlooking a level floor. The 3D coordinates are never used in any recognition or re-recognition operations; they are used purely for presentation purposes. They also simplify the interpretation of the merged tracks produced by the re-recognition node.

The 2.5D node, previously available in the TrackEye system, does the calculations. The node uses camera characteristics and a plane to transform 2D into 3D.

Covariance Matrix

[ x  y  U  Y  V  dx  dy  dx²  dy² ]

The covariance matrix is the representation of a person's visual features, the way they look. The feature extraction node connects to the auto tracker node and receives the position of every pixel inside every segmented silhouette of every person. For every pixel inside each silhouette the feature extraction node calculates a feature vector containing the pixel coordinates, the color in UYV color space and the first and second order derivatives.

Every pixel's feature vector is then used to calculate the expected values and covariances used to create the covariance matrix. The covariance matrix is a 9x9 symmetric matrix describing the spatial and color correlation for each person.
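As an illustration of the construction just described, the following pure-Python sketch builds the 9x9 covariance matrix from per-pixel feature vectors; the pixel values below are invented stand-ins for the output of the feature extraction node:

```python
# Sketch of building the 9x9 covariance matrix from per-pixel feature
# vectors [x, y, U, Y, V, dx, dy, dx2, dy2]. The pixel data here is
# fabricated for illustration; the real features come from the tracker.

def covariance_matrix(features):
    """features: list of 9-element feature vectors, one per pixel."""
    n = len(features)
    d = len(features[0])
    # Expected value (mean) of each feature over the silhouette.
    mu = [sum(f[k] for f in features) / n for k in range(d)]
    # Covariance C[i][j] = E[(f_i - mu_i)(f_j - mu_j)].
    C = [[sum((f[i] - mu[i]) * (f[j] - mu[j]) for f in features) / n
          for j in range(d)] for i in range(d)]
    return C

# Toy silhouette of three pixels (values invented).
pixels = [
    [10, 20, 0.1, 0.5, 0.2, 1.0, 0.0, 0.1, 0.0],
    [11, 21, 0.2, 0.6, 0.1, 0.9, 0.1, 0.0, 0.1],
    [12, 19, 0.1, 0.4, 0.3, 1.1, -0.1, 0.1, 0.0],
]
C = covariance_matrix(pixels)
```

The result is symmetric by construction, with the per-feature variances on the diagonal.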


4.2.2 Appearance

Conditions

The appearance is the single most important information used to compare two tracks and decide whether they match or not. The R node relies solely on the covariance matrix as the only information about the appearance. Ideally we would like to compare the new measurement's first sample of the covariance matrix with the last sample of the measurements stored in the hypothesis engine. The best match would then be considered the same person and a successful re-recognition would be made. Such a method would only work if the segmentation component generated a perfect silhouette of a person and the person were captured under the same conditions, e.g. the same lighting, the same angle, the same side, etc.

Every track in the hypothesis engine's memory is a person that somehow left a camera's field of view; these measurements are simply called tracks. Each of these tracks consists of a set of covariance matrix samples, one for each tracked frame. The first and last few samples will most likely be of poor quality since a part of the person would be obscured by a wall when exiting the scene. The remaining samples would also be of varying quality due to the fact that the segmentation won't be perfect.

It is even more problematic for the tracks generated by a person entering the scene, called observations. Only a handful of samples are available before the system needs to decide if the observation matches a track in the memory. Of that handful of samples, a significant number of the covariance matrices will be generated from obscured silhouettes. The method used by the R node is aimed at tackling these problems.

The matching algorithm

The algorithm presented in this section is used to compare two sets of covariance matrices. The result is a value of how likely it is that they are of the same person.

The process of comparing an observation with a track can be separated into these steps:

1. Calculate the mean of both the track and the observation. The mean of a set of covariance matrices is calculated using the following functions, presented in MatLab source code:

function U = cvmean2(X)
U = X(:,:,1);
for j = 1:10
    tmp = logxy(U, X(:,:,1));
    for i = 2:size(X,3)
        tmp = tmp + logxy(U, X(:,:,i));
    end
    U = expxy(U, tmp/size(X,3));
end

function z = logxy(x,y)
sqrtx = sqrtm(x);
invsqrtx = inv(sqrtx);
z = sqrtx*logm(invsqrtx*y*invsqrtx)*sqrtx;

function z = expxy(x,y)
sqrtx = sqrtm(x);
invsqrtx = inv(sqrtx);
z = sqrtx*expm(invsqrtx*y*invsqrtx)*sqrtx;

2. For each covariance matrix in the track, calculate the distance to the mean of the track. The following distance function is used, presented in C++ source code using the LAPACK package:

#include <math.h>
#include <rw/dsymmat.h>
#include <rw/dsymeig.h>
#include <rw/dsymfct.h>

double distance(DoubleSymMat A, DoubleSymMat B)
{
    DoubleSymFact C(B);
    if (C.fail())
        return -1;

    DoubleGenMat D(A.rows(), A.cols(), 0);
    D = inverse(C) % A;

    DoubleEigDecomp eig(D);
    double sum = 0;
    for (unsigned int i = 0; i < A.rows(); i++)
        sum = sum + log(abs(eig.eigenValue(i)))
                  * log(abs(eig.eigenValue(i)));
    return sqrt(sum);
}

3. Calculate both the expected value and the variance of the distances to the mean of the track calculated in step 1. The expected value, or mean value, is calculated by

\mu = \frac{1}{n} \sum_{i=1}^{n} x_i    (4.1)

and the variance is calculated by

\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2    (4.2)

4. Construct a Gaussian PDF using the calculated mean and variance from step 3. The Gaussian or normal distribution is defined as

f_X(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}    (4.3)

where \mu is the expected value and \sigma is the standard deviation.

5. Calculate the distance from the mean of the observation to the mean of the track.

6. Use the PDF with the distance from step 5 to calculate the score. The distance between the two mean covariance matrices is used as the input parameter in the PDF defined in step 4.
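Steps 1–6 can be sketched in Python if the matrix distance is abstracted into a list of precomputed scalar distances (the real distance is the generalized-eigenvalue metric in the C++ listing above); all numbers below are invented for illustration:

```python
import math

def gaussian_pdf(x, mu, sigma2):
    """Normal density with mean mu and variance sigma2 (eq. 4.3)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def match_score(track_distances, observation_distance):
    """Score how well an observation matches a stored track.

    track_distances: distance from each track sample to the track mean
    (steps 1-2). observation_distance: distance from the observation
    mean to the track mean (step 5).
    """
    n = len(track_distances)
    mu = sum(track_distances) / n                             # eq. 4.1
    sigma2 = sum((d - mu) ** 2 for d in track_distances) / n  # eq. 4.2
    return gaussian_pdf(observation_distance, mu, sigma2)     # step 6

# Invented sample distances for a stored track.
track = [0.9, 1.1, 1.0, 1.2, 0.8]
close_score = match_score(track, 1.0)  # observation near the track mean
far_score = match_score(track, 5.0)    # observation far from the track
```

An observation whose distance to the track mean is typical for that track scores high, while an outlying observation scores close to zero.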

4.2.3 Topological information

The topology in the context of camera networks describes how the cameras are related to each other in the network. This information is used to effectively enhance the precision of the re-recognition matching process. The topological information is used because it is difficult to rely solely on the appearance, due to the insufficient quality of the appearance matching. Whenever a person enters the scene we can not only compare the appearance against all persons in the hypothesis engine's memory, but also conclude from where the person most likely departed. With this information, only persons departing from that location would be good candidates for a match.

The topology is a hierarchical structure consisting of a set of sensors, hallways and gates. They can be considered nodes in an inter-connecting network representing the set-up of the surveillance system. Observe that the nodes in the topological network are not the same as the nodes in a TrackEye session.

Every camera in the camera network is described by a sensor node. The sensor node contains a set of at least one gate node. The gate nodes describe where people can appear and disappear in the camera's field of view. A gate typically describes areas where doors are located, or the edges of the screen where people can enter from the side of the camera's viewing plane.

The relationship between the sensors and gates is described by hallways. The idea behind the structure (and the choice of name) is to mimic how real hallways relate to doors. The doors in this analogy are represented by the gate nodes. A hallway can only be entered through a set of doors, and it is only possible to leave the hallway through the same set of doors. The topological information is kept in a text file where the properties of each node can be declared.
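The report does not specify the file format, but as a purely hypothetical illustration such a topology file could declare sensors, gates and hallways along these lines (all names and syntax invented):

```text
# Hypothetical topology declaration -- names and syntax are invented.
sensor camera_1
    gate door_office      # area around the office door
    gate edge_left        # people entering from the image border

sensor camera_2
    gate door_coffee

hallway corridor_A
    connects door_office, door_coffee, edge_left
```

The hallway entry lists exactly the gates through which it can be entered and left, mirroring the door analogy above.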

Bayesian network

The key question that the topological information is used to answer is: given an entry location, which location did the person most likely come from? This is expressed with the help of Bayes' law as

L(A|B = b) = \frac{p(B = b|A)\, p(A)}{p(B = b)}    (4.4)

where b is the observed location of entrance and A ranges over the possible departure locations. It is also possible to express the same relationship with Bayesian networks.


The departure node describes the p(A) of Bayes' law and the connected entry node represents the relationship p(B|A) of Bayes' law. The term p(B) can be omitted as it only acts as a scaling factor in this context.

• p(A) describes the distribution of where people are leaving and is estimated by keeping statistics of where people are leaving.

• p(B|A) describes the distribution of people leaving location A and then entering location B. This can be estimated by updating the statistics whenever a person moves from one location to another.

It is not necessary to use Bayesian networks to describe such a simple relationship. It does however help a lot when describing more complex relationships.

A more complex model can be achieved by loosening the definition of a gate as an area a person can appear inside. Instead it can be convenient to describe the gate as a single point. It is then possible to associate every appearing person with a gate, even those who end up a bit away from normal. Instead of relying only on the distance from the entry position to each gate to decide which gate the person arrived through, we can use a similar approach as used for comparing appearances.

For every gate we can describe a Gaussian PDF that represents the distribution of the distance from the entry position to the gate. The PDF is constructed by estimating the expected value as the mean distance from each entrance position to the gate position, over every person entering through the gate. The variance is estimated using the mean value. By building a PDF for each gate we can calculate how likely it is that a person entered through any gate by testing the PDF with the entrance position. In a sensor with well-defined gates there should be quite a high peak towards the most likely candidate. The biggest gain of this approach is that any arriving person will be associated with a gate, even if they end up a bit further away from a gate than usual.


Chapter 5

Results & Conclusion

This chapter presents the results from testing the implementation of the software. The video material used in the testing was produced to accurately simulate a real-world hallway surveillance scenario. Later in this chapter, conclusions and future work are presented.

5.1 Results

5.1.1 Simulation Set-up

To benefit the most from the topological information, the scenario needs to be fairly enclosed. An open environment would mean that the topological information would be unreliable due to the many possible entrance and exit locations. A hallway at FOI was chosen to be surveyed during an 8-hour work day. The hallway followed a T-shaped layout with offices along the "roof" of the T. Each end of the hallway leads to another part of the building. 5 standard video cameras were used. Unfortunately one camera was malfunctioning and its video material was rendered unusable because it was severely out of focus.

Most of the people entering the scene appeared and then later re-appeared in the same location. Most probably they came to get something or meet another person. The movement patterns were ideal for re-recognition testing.


5.1.2 Object recognition model

Segmentation

The test scenario was not designed or aimed at testing the performance of the segmentation module. The module does however affect the overall system performance to a much higher degree than any other module in the current setup.

It quickly became apparent that the segmentation algorithm couldn't cope with any reflections or shadows in the scene. This was no surprise and was to be expected of the algorithm. The module also produced a lot of noise. This was also to be expected, and it can also be found in the original paper [7].

To cope with the reflections and shadows, a more sophisticated algorithm or an improvement of the existing algorithm would have been necessary. The best workaround for this problem was to choose locations where reflections and shadows were kept to a minimum, for example an indoor hallway. The fine-grained noise was not a particularly big problem and could be handled in multiple ways. The most suitable solution was to use an available tool in the tracking node making it possible to track only objects within a certain range of sizes. This solution effectively filtered out the small noise particles.

The noise was a bigger problem when the illumination was not quite sufficient. This led to moving objects only being partially segmented; areas with the legs and arms would be disconnected from the rest of the segmented body. Another, more severe, problem discovered was that the algorithm couldn't handle gray areas particularly well and performed poorly in situations where people were wearing clothes in gray shades, see figure 5.1. This could lead to blind spots where there would be no segmentation at all. The reason for this problem was never found, and it was treated as a software bug.

The most severe problem with the segmentation module was its tendency to blow up in particular scenes. The segmentation algorithm could perform well initially, but as time progressed more and more of the scene would be interpreted as foreground and included in the segmented picture. This could result in an almost completely white output, where white represents the objects of interest, as can be seen in figure 5.2.

No solution to the problem was found; it could be the result of a software bug. It could also be a result of the algorithm itself having problems with certain situations.

Figure 5.1: Example of poor segmentation

Tracking and Association

The tracking node performed very well and accurately tracked every segmented object. This was expected due to the module's extensive testing and use in both commercial and military applications. The tracking module had a very easy job since it only needed to track disconnected objects in a black and white picture.

The A-node with the hypothesis engine was also one of the modules performing according to expectations. It performed very well given the input, and it had no problems constructing the objects' trajectories from the positional data from the tracking node.

The performance of both the tracking and association nodes was pleasing. The output from the association node was, however, in most cases not very pleasing. It generally produced a lot of trajectories, long and short, and seemingly random. This was a result of the segmentation node not being able to produce good data for the tracking node. The tracking only tracks the areas produced from the movements of the objects generated by the segmentation node. The tracking has no mechanism for determining the quality of the segmentation and therefore tracks every object in the specified range of sizes.

One particularly bad scenario is when a segmented object is split in two. This could be the result of poor illumination, reflections or shadows. It results in the tracking node producing coordinates and silhouettes for both parts. This gets very confusing for the association node and usually results in an incorrect result.

5.1.3 Re-recognition model

The test scenario was designed to fit the re-recognition model as much as possible. Unfortunately the results didn't reflect that at all. The reason is, as previously, the poor performance of the segmentation node. Of the 5 cameras, only one could be used to produce adequate segmentation. Even though the input to the R-node was of poor quality, some results were achieved. Firstly, the pre-filtering of the in-data worked surprisingly well. The filter was initially intended to only select the new measurements, but it was also very effective at disqualifying all the false trajectories generated by noise from the segmentation node.


One of the tested cases involves a person who first enters the scene, then leaves, returns again, and then moves in another direction. The matching algorithm gives both signatures quite high scores up until a shadow is generated when the person moves away from the camera. The testing of the matching algorithm is not near conclusive, but the initial results are promising.

The use of topological information with the use of Bayesian networks has not been tested at all. The method presented in this report relies heavily on the ability to gather statistics on the movements in the camera network.

5.2 Conclusion & Future work

The overlying system design with the pipeline structure is preferred and often used to break down a problem. One must however be careful with which modules to use; it is easy to break down the structure into too many parts, or vice versa. It is also important to clearly specify the I/O specifications.

The current design works quite well with the implementations made. It clearly shows which modules perform well and which perform poorly. The design choices made for the topology and data management were unfortunately not tested, and it would have been interesting to see how scalable they were. The overall system performance was quite poor, with the segmentation module as the most demanding module. The performance was, however, not a priority.

One design choice that should be revised is the use of the hypothesis engine in the association node. It performs well in the object recognition model, but it is quite a problem to connect the re-recognition model on top of it. The problem arises whenever the A-node changes its hypothesis ordering; when that happens, some or all of the trajectories are changed. This means that all of the re-recognition calculations need to be re-done. This is not something that can be done in a future real-time system. It is a very unnecessary performance bottleneck, since the hypotheses aren't really useful in this system layout.

Most of the poor results in the simulation are due to poor performance of the segmentation node. The system would benefit greatly from improved performance in solving the problem of separating people from the background. It is a very difficult problem, but some of the problems related to the segmentation node could be the result of software bugs. There could also be much to be gained from the use of different morphological operations […] from the body in the segmented result.

The re-recognition model presented in this paper is too untested to give a verdict on its performance. The idea of using topological information to find the best match is appealing and could very well work in practice. The question is not if it should be used, but rather how it should be used. The model in this paper would be best suited to an indoor environment where the entrances and exits are monitored. The topological information works best when used to eliminate candidates rather than to nominate the best candidate. The only way to really decide whether a person is a good match with an already detected person is to use some kind of appearance comparison. The method using covariance matrices to describe a person's appearance could very well be a good choice, but it relies a lot on the quality of the segmented image. It would also have been interesting to test the performance under different illumination and camera angles.

Did the thesis meet the objectives? In short, no. The initial objectives were set with the precondition of a working object recognition model. This was only true on paper. The actual object recognition performed very poorly, and the re-recognition model was more or less untested. This led to much of the effort intended for the re-recognition model instead going into developing the local recognition model.

The thesis does however present a re-recognition model that could be used if the input is of fairly good quality. The thesis also presents an evaluation of the system and a description of which components work and which do not.


Bibliography

[1] Gunnar Blom, Jan Enger, Gunnar Englund, Jan Grandell, and Lars Holst. Sannolikhetsteori och statistikteori med tillämpningar. Studentlitteratur, 2005.

[2] Edward Y. Chang and Yuan-Fang Wang. Toward building a robust and intelligent video surveillance system: A case study. IEEE International Conference on Multimedia, 28, 2004.

[3] Nir Friedman. Learning belief networks in the presence of missing values and hidden variables. Proceedings of the Fourteenth International Conference on Machine Learning, 1997.

[4] Alan Hájek. Stanford Encyclopedia of Philosophy: Interpretations of probability. Accessed 9 November 2008 at http://plato.stanford.edu/entries/probability-interpret, 2007.

[5] Finn V. Jensen. Bayesian Networks and Decision Graphs. Springer, 2001.

[6] Rudolph E. Kalman. A new approach to linear filtering and prediction problems. Transactions of the ASME–Journal of Basic Engineering, 82(Series D):35–45, 1960.

[7] K. Kim, T. H. Chalidabhongse, D. Harwood, and L. Davis. Real-time foreground-background segmentation using codebook model. Real-Time Imaging, 11(3):167–256, 2005.

[8] Jitendra Malik and Stuart Russell, editors. Traffic surveillance and detection technology development: New sensor technology final report. California Partners for Advanced Transit and Highways (PATH), 1997.

[9] Greg Welch and Gary Bishop. An Introduction to the Kalman Filter. In Computer Graphics, Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH), 2001.


Appendix A

Bayesian Networks

This chapter intends to give the reader an introduction to Bayesian networks. Some basic knowledge of statistics and probability theory is needed so the chapter begins by explaining some terminology commonly used in conjunction with Bayesian networks. More information about probability theory can be found in [1]. More information about Bayesian networks can be found in [5].

A.1 Probability and Random Variables

A.1.1 Frequentist versus Bayesian

There are commonly two interpretations of probability:

1. Frequentists talk about probabilities only when dealing with well defined random experiments. The probability of a random event denotes the relative frequency of occurrence of an experiment’s outcome, when repeating the experiment.

2. Bayesians, however, assign probabilities to any statement whatsoever, even when no random process is involved. Probability, for a Bayesian, is a way to represent an individual's degree of belief in a statement, given the evidence.

[4]

A classic example of a frequentist probability is the outcome of a dice throw. If one were to make a large number of rolls with a die, one could estimate the probability of a 'one' as the number of ones divided by the total number of rolls. As the total number of rolls approaches infinity, we get a true approximation of the probability. This kind of interpretation is also referred to as objective, since the result is not dependent on the person carrying out the experiment.
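The dice example can be sketched as a small simulation that estimates the probability as a relative frequency:

```python
import random

# Frequentist estimation: roll a fair die many times and take the
# relative frequency of ones as the estimate of p = 1/6.
random.seed(1)  # fixed seed so the run is repeatable
rolls = 60000
ones = sum(1 for _ in range(rolls) if random.randint(1, 6) == 1)
estimate = ones / rolls
```

With 60000 rolls the estimate lands very close to 1/6, and the agreement improves as the number of rolls grows.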

Bayesian probabilities, commonly known as subjective probabilities, are something we regularly encounter in our daily life. For example, one could say "I have a 50% chance of passing this exam". This estimation is difficult to carry out in a frequentist manner; instead, experience of similar events often adds up to a fairly reasonable estimation.

In the area of image re-recognition it is preferable to base our assumptions on measured statistics. Therefore most of the probabilities presented in this report are of a frequentist nature. An example of this is to register where people leave and enter a camera. With such statistics it is possible to get an approximation of which camera a person will end up in if they left a certain camera; more on this later.

There are, however, some situations where a subjective input is useful. In a Bayesian network the initial parameters need to be approximated. This can be done by simply giving a qualified guess. The end result will still converge to a "true" approximation; for more detail, see the section about Bayesian networks.

One of the advantages of Bayesian networks is that it allows for seamlessly mixing between objective and subjective measurements. It is important to remember though, that as soon as subjective measurements are used the end result is subjective.

A.1.2 Probability

Probability is denoted by p. The probability of an outcome of a discrete event, A, is formally defined as

p(A) = \frac{\text{number of outcomes favoring event } A}{\text{total number of possible outcomes}}    (A.1)

The probability of an event favoring either A or B, when A and B are mutually exclusive, is given by

p(A \cup B) = p(A) + p(B)    (A.2)

If two events are independent, i.e. they don't affect one another, the probability of an event favoring both event A and event B is given by

p(A \cap B) = p(A)\, p(B)    (A.3)


The above equations are best illustrated with an example: the probability of throwing a 'six' with a die is 1/6, so the probability of throwing two sixes with two dice is 1/36. Similarly, the probability of getting 'tails' with a coin toss is 1/2, so the probability of getting both a 'six' and 'tails' is 1/12.

The fundamentals of Bayesian networks involve conditional probabilities: the probability of an event occurring given the outcome of another event. Given the event B, we write the probability of event A as

p(A|B) = x    (A.4)

Random variables

The name variable is somewhat misleading. It is not really a variable; rather, it is a function mapping outcomes to a real-valued domain. We can for instance define a random variable in a coin toss as

X = \begin{cases} 1 & \text{if heads} \\ 0 & \text{if tails} \end{cases}

In a game we could interpret it as heads being the winning side and tails the losing side.

In a tracking scenario we deal with continuous random events, and we can't use random variables as above. Instead we can only evaluate events in an interval. A representation often used is the cumulative distribution function:

F_X(x) = p(X \le x)    (A.5)

This equation represents the cumulative probability up to and including the random event x. This function has some useful applications, but first some important properties:

F_X(x) \to 0 as x \to -\infty
F_X(x) \to 1 as x \to +\infty
F_X(x) is a non-decreasing function of x

The cumulative distribution function gets really useful when we use its derivative, known as the probability density function.


Mean and Variance

Two important quantities are associated with random variables: the expected value and the variance. The expected value in the continuous case is defined as

E(X) = \int_{-\infty}^{\infty} x f(x)\, dx    (A.6)

A way of interpreting the expected value is as the average outcome one could expect per trial over a long series of trials.

The variance is defined as

V(X) = E[(X - \mu)^2] \quad \text{where } \mu = E(X)    (A.7)

The variance is a measurement of the statistical spread or dispersion. The square root of the variance is the standard deviation, commonly denoted σ.

Probability density functions

Every continuous random variable has a probability density function, PDF, associated with it. If the random variable is discrete it is instead called a probability distribution function; the acronym PDF is used for both continuous and discrete functions. The PDF tells us how the probabilities are distributed over a defined interval. The PDF is defined as the derivative of the cumulative distribution function:

f_X(x) = \frac{d}{dx} F_X(x)    (A.8)

The properties are:

f_X(x) is a non-negative function of x

\int_{-\infty}^{\infty} f_X(x)\, dx = 1

As mentioned above, the probability can be given for a defined interval; for the interval [a, b] the probability is given by

p_X[a, b] = \int_a^b f_X(x)\, dx    (A.9)
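For the normal distribution used later in this appendix, this interval probability has a closed form through the error function, since F_X(x) = ½(1 + erf((x − µ)/(σ√2))); a small sketch:

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    """Cumulative distribution F_X(x) of a normal random variable."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def interval_probability(a, b, mu=0.0, sigma=1.0):
    """p_X[a, b] as the difference of the CDF at the two endpoints."""
    return normal_cdf(b, mu, sigma) - normal_cdf(a, mu, sigma)

# Probability of falling within one standard deviation of the mean.
p_one_sigma = interval_probability(-1.0, 1.0)  # about 0.6827
```

The one-sigma interval carries roughly 68% of the probability mass, and a very wide interval approaches 1, matching the normalization property above.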


Many stochastic events can be modelled with the help of PDFs. The most commonly used distribution is the normal distribution.

Normal Distribution

The reason for the popularity of the normal distribution is that it can be shown that the sum of many independent, identically distributed random variables is approximately normally distributed; see [1] for the proof. The re-recognition model presented in this report relies on the normal distribution.

The normal distribution is defined as

f_X(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}    (A.10)

where \mu is the expected value and \sigma is the standard deviation.

A.2 Bayesian Networks

A.2.1 Bayes’ Theorem

The name Bayesian networks comes from the 18th century priest Thomas Bayes and his famous theorem. Bayes' theorem relates the conditional probabilities between two random events:

p(A|B) = \frac{p(B|A)\, p(A)}{p(B)}    (A.11)

As an example, assume that 90% of all blond people have blue eyes. What is the probability of a blue-eyed person having blond hair? The answer is given by:

p(\text{blond}|\text{blue-eyed}) = \frac{p(\text{blue-eyed}|\text{blond})\, p(\text{blond})}{p(\text{blue-eyed})}    (A.12)

Bayes’ law is a way of reversing the statement. It is important to note the two additional probabilities in the equation. We need to know both the probability of both blond hair and blue eyes in the population. They can often be difficult to measure and is often approximated and sometimes even trivialized to a scalar.

Bayes’ law is particularly useful when trying to solve re-recognition problems. It is relatively easy to build statistics with information about how people are moving through a camera network, e.g. a person left the hallway and entered the office


A.2.2 Likelihood and Inference

Informally, the term likelihood is a synonym for probability. The technical definition differs somewhat, and likelihood is used to describe a way to estimate an outcome depending on various known variables. It is easiest to think of likelihood as the reverse of probability. Much like the statement p(A|B = b) can reason about the outcome of A given B, the likelihood function L(B|A = a) can reason about B given the outcome of A. The likelihood function is just another way to express the use of Bayes' law:

L(B|A = a) = p(A = a|B) = \frac{p(B|A = a)\, p(A = a)}{p(B)}    (A.13)

This way of reasoning is also referred to as inference and is a big part of statistical models, such as Bayesian networks.

A.2.3 Bayesian Networks

A Bayesian network is a graphical model that represents a set of probabilistic variables and their dependencies. A variable is represented by a node and a dependency is represented by a directed link, or path. Bayesian networks belong to the family of directed acyclic graphs (DAGs), which means that it is impossible to start at a node and travel back to the same node.

Figure A.1: A simple Bayesian network with two nodes, "Color of A" and "Observed color of A", connected by a directed link.


The simplest possible Bayesian network is a set of two nodes and a dependency link between them. Figure A.1 illustrates a simplified re-recognition case. The left node represents the average color of person A. The right node represents the observed color of A. "Observed" points out that the value could differ from the true value. For example, a green car could look black in poor lighting conditions. The same problem occurs in computer vision scenarios, where the observations often are noisy.

The link is directed from the parent node to the child node; this means that the parent node causes the child node, or that the child node depends on the parent node. In the example above we can interpret this as the observed color being dependent on the real color of A, and we can easily describe their relationship mathematically with:

p(\text{Observed color of } A \mid \text{Color of } A)    (A.14)

Every node in a Bayesian network is a variable with a finite set of states. The nodes in the example above could for example have the states green and black. The chance of a car being green when we perceive it as black is the same as:

p(\text{Observed color of } A = \text{black} \mid \text{Color of } A = \text{green})    (A.15)

When dealing with state variables it is important to note some properties. If A is a variable with states a_1, \ldots, a_n, then

p(A) = (x_1, \ldots, x_n)    (A.16)
x_i \ge 0    (A.17)
\sum_{i=1}^{n} x_i = 1    (A.18)

where x_i is the probability of A being in state a_i.

Example

One fundamental task in a re-recognition scenario is determining how people move between sensors. It is therefore important to gather statistics on how people behave in the camera network. Whenever a person enters a camera we want to know where they most likely came from. We basically want the answer to:


Table A.1: Total number of entrances and departures

                 Office   Toilet   Coffee room
    Visits       12       4        8
    Percentage   50%      17%      33%

Table A.2: Number of entrances and departures at the coffee room

                 Office   Toilet
    Visits       5        3
    Percentage   63%      37%

p(L_e = e_i \mid L_d)    (A.20)

where L_e denotes the location of entrance, L_d the location of departure and e_i the current state of L_e. The states represent the different locations where the cameras are installed. We can therefore interpret the expression as: if a person entered at e_i, where did he come from?

The expression is the same as seeking the likelihood of L_d given L_e and is expressed by

L(L_d|e_i) = \frac{p(L_d|e_i)\, p(e_i)}{p(L_d)}    (A.21)

Notice the use of Bayes’ law as described in section A.2.1 on page 36. Consider a part of an office complex where a hallway is connected to an office, a copying room and a small coffee room. The entrance to each of the room is monitored but the hallway itself is not monitored. During a typical day the number of people entering and departing each room is presented in table A.1. Each room is only accessed through one door only and therefore the number of entrances and departures is the same.

We also have statistics, for each room, of which room people walk to next, in tables A.2, A.3 and A.4.


Table A.3: Number of entrances and departures at the office

                 Coffee room   Toilet
    Visits       8             4
    Percentage   67%           33%

Table A.4: Number of entrances and departures at the toilet

                 Office   Coffee room
    Visits       3        1
    Percentage   75%      25%

Given all these statistics we can in fact answer some questions. If a person is in the coffee room, there is a 63% chance he will leave for the office. But given that a person enters the office, how good a chance is there that he came from the coffee room? Bayes' theorem gives the answer.

p(Ld= di|Le= ei) = (p(ei|di)p(di))/(p(ei)) (A.22)

where di is the coffee room and ei is the office. The equation might look a

bit complicated but each probability is given from the tables with the statistical data.

p(ei|di) - is given by table2 and reads 67%

p(di) - is the probability of departing from the toilet. The probability is

given in table2 after excluding the office column. The probability is therefore 3%

17%+33% = 66%.

p(ei) - is the chances of entering ei and is calculated by adding the chances

of entering the office from both the toilet and the coffee room. It is given by (67%)(66%) + (33%)(34%).

If we instead were interested in finding the chances of the person entering the office after visiting the toilet we get

p(toilet|of f ice) = (33%)(34%)

((67%)(66%) + (33%)(34%) ≈ 20% (A.23) which we would expect considering there is no other possible location to arrive from in this simplified example.
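A quick way to sanity-check the worked example is to run Bayes' theorem directly on the raw counts from the tables. This is only an illustrative sketch (the variable names are our own), assuming Python:

```python
# p(enter office | departed from room), from the counts in Tables A.2 and A.4.
p_office_given = {"coffee room": 5 / 8, "toilet": 3 / 4}

# Prior over the departure room: Table A.1 with the office column excluded,
# since a person entering the office cannot have departed from it.
prior = {"coffee room": 8 / 12, "toilet": 4 / 12}

# Total probability of entering the office, the normalizer p(e_i) in (A.22).
p_office = sum(p_office_given[r] * prior[r] for r in prior)

# Bayes' theorem (A.22): posterior over where the person came from.
posterior = {r: p_office_given[r] * prior[r] / p_office for r in prior}

for room, prob in posterior.items():
    print(f"{room}: {prob:.1%}")
```

Run on the counts above, the posterior puts roughly 62% on the coffee room and 38% on the toilet, matching the hand calculation.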


The conclusion of the example is that a person entering the office most likely came from the coffee room. In a re-recognition scenario with two possible choices, e.g. two persons with similar appearance, such transition statistics can be used to favour the more likely candidate.


Appendix B

The Kalman Filter

This appendix presents an introduction to the Kalman filter. It is intended to give the reader a basic knowledge and understanding of how it can be used in a tracking scenario. A Kalman filter is basically a tool for making estimates from noisy measurements. The filter's estimates are both stochastic and recursive, meaning that each new estimate is at least as accurate as the previous one.

The Kalman filter was introduced in 1960 in [6], a famous publication describing a recursive solution to the discrete-data linear filtering problem. Even though that description seems quite specific, the applications of the filter are indeed wide. This is also one of the reasons why descriptions of how the filter works differ so much: most are, naturally, focused on their particular field of interest. Even this introduction is somewhat biased towards the tracking scenario. Historically, the filter was aimed at solving problems in automatic control and system analysis, but it has since found its way into radar installations, medical systems and computer games.

There are two reasons for the success of the Kalman filter: it is a relatively simple model, and it is optimal in almost every sense of the word. The latter might sound a bit strange at first, but it should become clear to the reader by the end of this appendix. The reason why the filter is optimal is that it incorporates all information available at the time to make a prediction.

The need for the filter becomes clear when we look at a normal black-box system where the state of the system is unknown. To be able to control the system we need to somehow measure the state, but that is itself a problem, since no measurement device is perfect and some measurement error is bound to exist. The errors can be measurement noise, bias and inaccuracies in the measuring device. To make the best possible estimate of the state of the system we can use the Kalman filter.

Up until now the filter itself has not been introduced, because it is best introduced by an example rather than a series of equations. Imagine you are lost and need to figure out your position at the time t. You have a map, and you estimate to the best of your ability that you are at position z_1. Due to measurement errors, such as human error and/or the limited resolution of the map, the measured position carries some uncertainty. This uncertainty is often referred to as the deviation or variance in stochastic terminology.

So far the best estimate of your position is your only estimate,

x̂(t_1) = z_1                                                    (B.1)

with variance

σ²_x(t_1) = σ²_z1                                                (B.2)

Now let's say you are lucky enough to also carry a GPS device and make another measurement at roughly the same time, t_2 ≈ t_1. Since the times are almost identical we can safely assume the position has remained the same. The new measurement gives the position z_2 with variance σ²_z2. The GPS device is of good quality and the measurement is precise; this is reflected in a much lower variance than we could expect from σ²_z1. We can therefore say that we trust the GPS measurement more than our own reading of the map. However, we do not trust it completely, and we therefore want to combine both measurements in the best possible way. We can do this by calculating their joint Gaussian density, with expected value

µ = ( σ²_z2 / (σ²_z1 + σ²_z2) ) z_1 + ( σ²_z1 / (σ²_z1 + σ²_z2) ) z_2     (B.3)

and variance

1/σ² = 1/σ²_z1 + 1/σ²_z2                                                   (B.4)

The best estimate of the position would then be x̂(t_2) = µ, with variance σ² given by (B.4).
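Equations (B.3) and (B.4) are straightforward to verify numerically. The sketch below is our own illustration (assuming Python; the example numbers are made up) and fuses an imprecise map reading with a precise GPS reading:

```python
def fuse(z1, var1, z2, var2):
    """Combine two noisy measurements of the same quantity into one
    estimate, using equations (B.3) (mean) and (B.4) (variance)."""
    mu = (var2 / (var1 + var2)) * z1 + (var1 / (var1 + var2)) * z2
    var = 1.0 / (1.0 / var1 + 1.0 / var2)
    return mu, var

# Map reading at 10.0 with large variance, GPS reading at 12.0 with small variance.
mu, var = fuse(10.0, 4.0, 12.0, 1.0)
print(mu, var)  # mu ≈ 11.6: the estimate leans towards the precise GPS reading
```

Note that by (B.4) the fused variance is always smaller than either input variance: combining measurements never makes the estimate worse, which is exactly the recursive improvement property mentioned at the beginning of this appendix.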

References
