
Institutionen för systemteknik

Department of Electrical Engineering

Master's thesis (Examensarbete)

A feature based face tracker using extended

Kalman filtering

Master's thesis carried out in Image Coding at Linköping Institute of Technology

by

Nils Ingemars

LiTH-ISY-EX--07/4015--SE Linköping 2007

Department of Electrical Engineering
Linköpings tekniska högskola
Linköpings universitet


A feature based face tracker using extended

Kalman filtering

Master's thesis carried out in Image Coding at Linköping Institute of Technology

by

Nils Ingemars

LiTH-ISY-EX--07/4015--SE

Supervisor: Jörgen Ahlberg

Examiner: Robert Forchheimer


Division, Department: Image Coding Group, Department of Electrical Engineering, Linköpings universitet, SE-581 83 Linköping, Sweden

Date: 2007-02-22

Language: English

Report category: Master's thesis (Examensarbete)

URL for electronic version: http://www.ep.liu.se

ISRN: LiTH-ISY-EX--07/4015--SE

Title (Swedish): En feature-baserad ansiktsföljare med hjälp av utökad Kalman-filtrering

Title (English): A feature based face tracker using extended Kalman filtering

Author: Nils Ingemars


Abstract

A face tracker is exactly what it sounds like. It tracks a face in a video sequence. Depending on the complexity of the tracker, it could track the face as a rigid object or as a complete deformable face model with face expressions.

This report describes the development of a real time, feature based face tracker. Feature based means that certain features in the face are tracked, i.e. points with special characteristics. It might be a mouth or eye corner, but theoretically it could be any point. For this tracker, the latter is of interest. Its task is to extract global parameters, i.e. rotation and translation, as well as dynamic facial parameters (expressions) for each frame. It tracks feature points using motion between frames and a textured face model (Candide). It then uses an extended Kalman filter to estimate the parameters from the tracked feature points.


Contents

1 Introduction
  1.1 Background
  1.2 Purpose
  1.3 The task
  1.4 Outline

2 Theory
  2.1 Image processing - pattern template matching
  2.2 Computer graphics - 3D objects and projection
  2.3 The Candide face model
  2.4 Kalman filtering and extended Kalman filtering

3 The face tracker
  3.1 The algorithm
  3.2 The coordinate system and video sequence
  3.3 Face parameters
  3.4 Initializing feature points
  3.5 Tracking feature points
  3.6 Two stage tracking
  3.7 The extended Kalman filter

4 Implementation
  4.1 C/C++
  4.2 User interface
  4.3 Feature points and matching
  4.4 The face model
  4.5 Other issues

5 Results
  5.1 Testing
  5.2 Test results

6 Conclusions
  6.1 Discussion
  6.2 Variants of the tracker
  6.3 Future improvements

Chapter 1

Introduction

The level of this report is aimed for people who are familiar with signal processing and its underlying mathematics, especially linear algebra. The reader is also assumed to have some basic knowledge of digital images.

1.1 Background

A face tracker is a program that tracks a face in a video sequence. It can be used as a tool for model based coding of face video sequences. Model based means that the frames are synthesized using models. By using a face tracker to extract a few face parameters per video frame, the video can be synthesized using a greatly reduced amount of data compared to normal frames.

At the Image Coding Group (ICG) within the Department of Electrical Engineering, there are several so-called appearance based face trackers that use the whole face to estimate parameters. There have been few feature based face trackers, which use several smaller features of the face for parameter estimation.

The previous such feature based tracker was developed by Jacob Ström [1, 15]. It is actually only a head tracker that tracks the global parameters, rotation and translation, of a rigid head. It uses an extended Kalman filter to estimate face parameters from tracked feature points. Feature points are points in the face with some special visual properties that should make them easy to track, for example corner-like points. Apart from the global parameters, it estimates the depth of the feature points, which means that an estimate of their 3D coordinates is obtained. When it was developed it ran in real time, but special, expensive hardware was used.

Earlier work on similar face tracking methods has been done by Pertti Roivainen [2] and Haibo Li [3]. Roivainen used point displacement fields between frames and a least squares method to obtain the 3D motion of the face. One of his methods also includes feedback tracking using synthesized face models, very much like in the method presented here. Li had a slightly different approach. Instead of calculating displacement fields, he extracted information directly from gray scale values in the frames. For the facial 3D motion estimation he first used the least squares method and then an M-estimator to gain control over the 2D point measurements. He also used 3D motion prediction to get the 3D model approximated to the next frame.

1.2 Purpose

The purpose of this work is to see if Jacob Ström's idea can be altered to do more than just rigid head tracking. To be more specific, we want to be able to track a deformable face, which means that rotation and translation as well as facial expressions will be estimated. We also want the face tracker to run in real time, but this time on general commercial hardware.

1.3 The task

The task is to build a face tracker, based on Jacob Ström's idea, that meets the following requirements:

• The tracker must be able to estimate rotation, translation and facial expressions.

• The tracker must perform in real time (15+ frames per second) on general consumer hardware.

To fit within the time frame, we have set a few limitations:

• The tracker only needs to handle limited speed of movement of the tracked face.

• The tracker does not need to be able to track the face under too broad rotation angles (approximately 45 degrees relative to the camera).

• The tracker does not need to be able to track the face if it disappears/gets occluded and then reappears.

• The tracker will only be tested in real time, by using a camera.

The requirements and limitations on the face tracker come from an application in model based coding, namely low bitrate video conferencing.


1.4 Outline

This report is organized as follows. Chapter 2 contains some background theory that will be used. Chapter 3 describes the face tracker itself. Chapter 4 brings up implementation issues. Chapter 5 contains the test setup and the results from testing the face tracker, and Chapter 6 is where we put our thoughts and possible improvements.


Chapter 2

Theory

2.1 Image processing - pattern template matching

A way to find a pattern, a pixel with its local surroundings, in a grayscale image is to match it with a pattern template. This method is called matching [4] and can be used, for example, if one wants to find a point from one image inside another image. A pattern template is a small image containing the pattern. It is normally constructed as an NxN image, where N is odd, so that the center falls onto a pixel location. The center of the pattern will be located at this template center. To match the template with a pixel (and its local surroundings) in the image, the template is placed on the image with its center (the pattern center) at the position of the pixel. This makes up a subimage within the image, with the same size as the template. A function, such as the sum of the absolute pixel value differences, is then used to compare the template and the subimage, resulting in a scalar value. The matching is usually done at many locations in the image, resulting in a scalar value for each such location. Depending on which function is used, the location with the highest or lowest resulting value is the best match. For the sum example above the lowest value is the best, since it is a measure of the error.

When matching the edge pixels of the image, the template might stretch over the edge. In this situation the missing pixels in the subimage can be replaced with zeros, with replications of the edge pixel values, or with pixels wrapped around from the opposite edge of the image.

If the image is a multi-channel image (more than one value per pixel), for example a color image (three channels), the function will operate separately on each channel, and will thus return several scalar values per location. These values can then in some way be converted to one scalar value, for example by simply adding them together.

Figure 2.1 shows an illustration. To the left and right are two small images of 16x16 pixels. In the left image is a single dark pixel with local surroundings that we want to find in the right image. We extract a 5x5 pixel template around the pixel, the middle image. The template is then matched at all locations in the right image. The best match will be found at the pixel with surroundings inside the dashed square.

Figure 2.1. Matching
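The matching procedure described above can be summarized in a short sketch. This is a minimal illustration, assuming a grayscale image stored row-major in a std::vector<unsigned char> and using the sum of absolute differences as the comparison function; the function names are chosen here for illustration and do not correspond to the tracker's actual routines.

#include <vector>
#include <cstdlib>
#include <climits>

// Sum of absolute differences between an NxN template (N odd) and the
// subimage centered at (cx, cy). Pixels outside the image are treated as zero.
int sadScore(const std::vector<unsigned char>& img, int w, int h,
             const std::vector<unsigned char>& tmpl, int N, int cx, int cy)
{
    int half = N / 2, score = 0;
    for (int ty = 0; ty < N; ++ty)
        for (int tx = 0; tx < N; ++tx) {
            int ix = cx + tx - half, iy = cy + ty - half;
            int pix = (ix >= 0 && ix < w && iy >= 0 && iy < h) ? img[iy * w + ix] : 0;
            score += std::abs(pix - tmpl[ty * N + tx]);
        }
    return score;  // lower value means better match for this error measure
}

// Exhaustive search over the whole image: returns the location with the lowest score.
void bestMatch(const std::vector<unsigned char>& img, int w, int h,
               const std::vector<unsigned char>& tmpl, int N,
               int& bestX, int& bestY)
{
    int best = INT_MAX;
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) {
            int s = sadScore(img, w, h, tmpl, N, x, y);
            if (s < best) { best = s; bestX = x; bestY = y; }
        }
}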

2.2 Computer graphics - 3D objects and projection

In common computer graphics a 3D object consists of a set of vertices, sample points on the model's surface. A vertex is simply a point with coordinates (x y z)^T in a 3D Cartesian coordinate system. The surface of the object is made up of triangles, where each triangle corner is one of the model vertices. A texture can be mapped onto the surface, for example from an image, to make it look more realistic. The mapping works by simply setting texture coordinates for all vertices in the model. Texture coordinates for points inside the triangles are interpolated from the triangle corners. The texture coordinates determine which pixel values in the texture image to use. The texture coordinate system is basically the normalized dimensions of the texture itself, viewed as an image.

There are three global transforms that can be applied to a 3D object, modifying its vertices. These are scaling, rotation and translation.

Operating these on a vertex relative to the coordinate system origin is done in the following way.

Scaling

\begin{pmatrix} x_1 \\ y_1 \\ z_1 \end{pmatrix} = c \begin{pmatrix} x \\ y \\ z \end{pmatrix} \qquad (2.1)

Rotation around the x-axis

\begin{pmatrix} x_1 \\ y_1 \\ z_1 \end{pmatrix} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & \cos r_x & -\sin r_x \\ 0 & \sin r_x & \cos r_x \end{pmatrix} \begin{pmatrix} x \\ y \\ z \end{pmatrix} = R_x \begin{pmatrix} x \\ y \\ z \end{pmatrix} \qquad (2.2)

Rotation around the y-axis

\begin{pmatrix} x_1 \\ y_1 \\ z_1 \end{pmatrix} = \begin{pmatrix} \cos r_y & 0 & \sin r_y \\ 0 & 1 & 0 \\ -\sin r_y & 0 & \cos r_y \end{pmatrix} \begin{pmatrix} x \\ y \\ z \end{pmatrix} = R_y \begin{pmatrix} x \\ y \\ z \end{pmatrix} \qquad (2.3)

Rotation around the z-axis

\begin{pmatrix} x_1 \\ y_1 \\ z_1 \end{pmatrix} = \begin{pmatrix} \cos r_z & -\sin r_z & 0 \\ \sin r_z & \cos r_z & 0 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} x \\ y \\ z \end{pmatrix} = R_z \begin{pmatrix} x \\ y \\ z \end{pmatrix} \qquad (2.4)

Translation

\begin{pmatrix} x_1 \\ y_1 \\ z_1 \end{pmatrix} = \begin{pmatrix} x \\ y \\ z \end{pmatrix} + \begin{pmatrix} t_x \\ t_y \\ t_z \end{pmatrix} \qquad (2.5)

In the equations above, (x_1 y_1 z_1)^T are the new coordinates for the vertex, c is the scaling, r_x, r_y and r_z are rotation angles in radians for rotation around the x-, y- and z-axes, and t_x, t_y and t_z are translations in the x-, y- and z-directions.

To view 3D objects on a 2D screen some kind of 2D projection has to be done. One way to project is to use perspective projection, which is based on the ideal (pinhole) camera model. Take a look at figure 2.2. The coordinate system origin models the camera lens (pinhole). A 3D object is projected through the origin and onto an image plane, the left one in the figure, although rotated 180 degrees. For simplicity, we project the image plane itself back through the origin, to the right in the figure. The rotation is now inverted and the projection should be easier to understand.

Now assume that the image plane, with coordinate system (u v)^T, is located at distance f along the positive z-axis as in the figure.

Figure 2.2. Perspective projection

It can easily be seen from the figure that a vertex with coordinates (x y z)^T and projection coordinates (u v)^T obeys the following equation

\frac{1}{f} \begin{pmatrix} u \\ v \end{pmatrix} = \frac{1}{z} \begin{pmatrix} x \\ y \end{pmatrix} \qquad (2.6)

From this we get the projection equation

\begin{pmatrix} u \\ v \end{pmatrix} = \frac{f}{z} \begin{pmatrix} x \\ y \end{pmatrix} \qquad (2.7)

For more on computer graphics, see [5].
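As a concrete illustration of equations 2.1-2.5 and 2.7, the sketch below scales, rotates (around the x-axis only, for brevity) and translates a vertex, and then projects it onto the image plane. The Vec3 type and the function names are introduced here purely for illustration.

#include <cmath>

struct Vec3 { double x, y, z; };

// Rotation around the x-axis (equation 2.2).
Vec3 rotateX(const Vec3& p, double rx)
{
    return { p.x,
             std::cos(rx) * p.y - std::sin(rx) * p.z,
             std::sin(rx) * p.y + std::cos(rx) * p.z };
}

// Scale, rotate and translate a vertex (equations 2.1-2.5), then project it
// with focal distance f (equation 2.7) to get the coordinates (u, v).
void transformAndProject(const Vec3& vert, double c, double rx, const Vec3& t,
                         double f, double& u, double& v)
{
    Vec3 p = rotateX({ c * vert.x, c * vert.y, c * vert.z }, rx);  // scaling + rotation
    p = { p.x + t.x, p.y + t.y, p.z + t.z };                       // translation
    u = f * p.x / p.z;
    v = f * p.y / p.z;
}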

2.3 The Candide face model

The Candide face model [6, 15, 14] is a 3D vertex model which basically consists of a fixed model, a group of shape and action units, and a list of surface triangles. The shape units define the static changes of the fixed model, so that it resembles the actual face to be tracked. A shape unit could control, for example, head height or nose length, and must thus not change over time. The action units, on the other hand, define the dynamic changes in the face, like eye blinking or lip stretching, and may therefore change over time.

A more technical definition of the model is as follows.

\begin{pmatrix} b_{x1} \\ b_{y1} \\ b_{z1} \\ \vdots \\ b_{xN_v} \\ b_{yN_v} \\ b_{zN_v} \end{pmatrix} = \begin{pmatrix} g_{x1} \\ g_{y1} \\ g_{z1} \\ \vdots \\ g_{xN_v} \\ g_{yN_v} \\ g_{zN_v} \end{pmatrix} + \begin{pmatrix} S_{x1,1} & \cdots & S_{x1,N_s} \\ S_{y1,1} & \cdots & S_{y1,N_s} \\ S_{z1,1} & \cdots & S_{z1,N_s} \\ \vdots & \ddots & \vdots \\ S_{xN_v,1} & \cdots & S_{xN_v,N_s} \\ S_{yN_v,1} & \cdots & S_{yN_v,N_s} \\ S_{zN_v,1} & \cdots & S_{zN_v,N_s} \end{pmatrix} \begin{pmatrix} s_1 \\ \vdots \\ s_{N_s} \end{pmatrix} + \begin{pmatrix} A_{x1,1} & \cdots & A_{x1,N_a} \\ A_{y1,1} & \cdots & A_{y1,N_a} \\ A_{z1,1} & \cdots & A_{z1,N_a} \\ \vdots & \ddots & \vdots \\ A_{xN_v,1} & \cdots & A_{xN_v,N_a} \\ A_{yN_v,1} & \cdots & A_{yN_v,N_a} \\ A_{zN_v,1} & \cdots & A_{zN_v,N_a} \end{pmatrix} \begin{pmatrix} a_1 \\ \vdots \\ a_{N_a} \end{pmatrix} \qquad (2.8)

or in short form

\bar{b} = \bar{g} + S\bar{s} + A\bar{a} \qquad (2.9)

b̄ is the vector of face model vertices after shapes and actions have been added. ḡ is the fixed model. The columns of S contain information on how much the different vertices move in each direction (x, y and z) for the values in the shape unit vector s̄. The same applies for A and the action unit vector ā. N_v is the number of vertices in the model, N_s the number of shape units and N_a the number of action units.

Figure 2.3 shows (from left to right) the fixed model, a shape and action modified model, and the second model rotated.
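Equation 2.9 amounts to adding weighted columns of S and A to the fixed model. A minimal sketch, assuming flat std::vector storage (3 values per vertex, row-major matrices) and an illustrative function name:

#include <vector>
#include <cstddef>

// b = g + S*s + A*a (equation 2.9). g has 3*Nv elements; S has Ns columns
// and A has Na columns, both stored row-major with one row per element of g.
std::vector<double> deformModel(const std::vector<double>& g,
                                const std::vector<double>& S, const std::vector<double>& s,
                                const std::vector<double>& A, const std::vector<double>& a)
{
    const std::size_t n = g.size(), Ns = s.size(), Na = a.size();
    std::vector<double> b(g);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < Ns; ++j) b[i] += S[i * Ns + j] * s[j];  // shape units (static)
        for (std::size_t j = 0; j < Na; ++j) b[i] += A[i * Na + j] * a[j];  // action units (dynamic)
    }
    return b;
}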


Figure 2.3. An example of the Candide model

2.4 Kalman filtering and extended Kalman filtering

Assume that you have a linear system in the general state form as below (without control input)

x(t+1) = A_t x(t) + w(t) \qquad (2.10)

y(t) = H_t x(t) + v(t) \qquad (2.11)

where t is time, x is the vector of dynamic system parameters, y is the vector of measured signals, A_t and H_t are relation matrices, and w and v are vectors of process and measurement noise respectively. The subscript t indicates possible time dependency.

Now, if you want to estimate the system parameters from measurements while minimizing the parameter error variance, the Kalman filter [13, 1] is the optimal linear estimator. In addition, if the noises are normally distributed, the Kalman filter is the overall optimal estimator, even if nonlinear filter solutions are considered. The Kalman filter equations and update sequence follow next.

The first step is prediction:

\hat{x}_p(t+1) = A_t \hat{x}(t) \qquad (2.12)

P_p(t+1) = A_t P(t) A_t^T + Q_t \qquad (2.13)

where x̂ is the vector of estimated system parameters, P = E[(x̂ − x)(x̂ − x)^T] is the estimated parameter error covariance matrix and Q_t = E[w w^T] is the process noise covariance matrix. The subscript p stands for prediction.

The second step is correction:

\hat{x}(t+1) = \hat{x}_p(t+1) + K(t+1)\,(y(t+1) - H_t \hat{x}_p(t+1)) \qquad (2.14)

K(t+1) = P_p(t+1) H_t^T (H_t P_p(t+1) H_t^T + R_t)^{-1} \qquad (2.15)

P(t+1) = P_p(t+1) - K(t+1) H_t P_p(t+1) \qquad (2.16)

where R_t = E[v v^T] is the measurement noise covariance matrix. K is called the Kalman gain and is the correction gain for the innovation (y(t+1) − H_t x̂_p(t+1)), which is basically the difference between the measurements and the predicted measurements.

Before use of the filter, x̂ and P are initialized with appropriate values, depending on the application.

The process and measurement noise covariance matrices Q_t and R_t are set by the user. These matrices define the sensitivity of the Kalman filter. The Kalman filter gets faster but more sensitive to measurement noise as R_t gets smaller relative to Q_t. When R_t instead gets larger, the filter gets slower and less sensitive. The easiest way to control this is by multiplying one of the matrices by a scalar. Multiplying both matrices by the same scalar will not make any difference, as this evens out in the equations.
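As a complement to equations 2.12-2.16, the sketch below performs one full filter cycle using OpenCV's cv::Mat (the library later used for the implementation, see Chapter 4). It illustrates the standard equations only; the function name and calling convention are not those of the tracker's actual filter code.

#include <opencv2/core.hpp>

// One Kalman filter cycle (equations 2.12-2.16).
// x_hat: estimated state, P: estimated error covariance,
// A, H: relation matrices, Q, R: noise covariances, y: new measurement.
void kalmanStep(cv::Mat& x_hat, cv::Mat& P,
                const cv::Mat& A, const cv::Mat& H,
                const cv::Mat& Q, const cv::Mat& R,
                const cv::Mat& y)
{
    // Prediction (2.12, 2.13)
    cv::Mat x_p = A * x_hat;
    cv::Mat P_p = A * P * A.t() + Q;

    // Correction (2.14-2.16)
    cv::Mat S = H * P_p * H.t() + R;     // innovation covariance
    cv::Mat K = P_p * H.t() * S.inv();   // Kalman gain
    x_hat = x_p + K * (y - H * x_p);     // corrected state estimate
    P = P_p - K * H * P_p;               // corrected error covariance
}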

If equations 2.10 and/or 2.11 are exchanged for nonlinear equations, we get a nonlinear system and cannot use this exact approach. An extended Kalman filter [13, 1] must be used. Assume the system looks like

x(t+1) = a(x(t), w(t)) \qquad (2.17)

y(t) = h(x(t), v(t)) \qquad (2.18)

where a and h are nonlinear functions. We now linearize

A_t = \left. \frac{\partial a}{\partial x} \right|_{\hat{x}(t)} \qquad (2.19)

H_t = \left. \frac{\partial h}{\partial x} \right|_{\hat{x}_p(t+1)} \qquad (2.20)

and use these in the previous update equations 2.12-2.16, but exchange equation 2.12 for

\hat{x}_p(t+1) = a(\hat{x}(t), 0) \qquad (2.21)

and equation 2.14 for

\hat{x}(t+1) = \hat{x}_p(t+1) + K(t+1)\,(y(t+1) - h(\hat{x}_p(t+1), 0)) \qquad (2.22)

Chapter 3

The face tracker

3.1 The algorithm

The main idea of how the face tracker works is pretty simple. It can be described in steps/pseudocode as follows.

1. Adapt the face model to the first frame and choose feature points. Initiate an extended Kalman filter with the face parameters rotation, translation and action units.

2. Use the current frame to find new projection coordinates for the feature points in the next frame, using matching.

3. Update the extended Kalman filter using these new projection coor-dinates.

4. Use the textured face model to refine the feature point projection coordinates in the next frame (the same next frame as in step 2), using matching.

5. Update the extended Kalman filter using the refined projection coor-dinates.

6. Repeat from step 2.

A diagram will be included in section 3.6.
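The steps can also be written as a per-frame loop. The skeleton below is only meant to show the control flow; the types and function names are placeholders and do not correspond to the class interface described in Chapter 4.

#include <vector>
#include <cstddef>

struct Frame {};
struct Coords { std::vector<double> uv; };  // feature point projection coordinates

// Placeholder stubs for the two matching stages and the filter update.
Coords matchAgainstPreviousFrame(const Frame&, const Frame&) { return {}; }  // step 2
Coords matchAgainstTexturedModel(const Frame&)               { return {}; }  // step 4
void   updateExtendedKalmanFilter(const Coords&)             {}              // steps 3 and 5

void trackSequence(const std::vector<Frame>& frames)
{
    // Step 1 (model adaptation and feature point selection) is assumed done
    // on frames[0] before this loop starts.
    for (std::size_t i = 1; i < frames.size(); ++i) {
        Coords rough = matchAgainstPreviousFrame(frames[i - 1], frames[i]);
        updateExtendedKalmanFilter(rough);                        // rough estimate
        Coords refined = matchAgainstTexturedModel(frames[i]);
        updateExtendedKalmanFilter(refined);                      // refined estimate
    }
}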

3.2 The coordinate system and video sequence

The coordinate system and projection model used for the camera is the one described in the above section on computer graphics. The camera is placed at the origin of the coordinate system, "looking" along the positive z-axis. "Up" for the camera is along the positive y-axis.

The image plane placed at distance f along the positive z-axis represents the images in the video sequence. The video sequence consists of frames with a specified resolution, for example 640x480 units (pixels), where 640 is the width and 480 is the height. An image is positioned with its center on the z-axis and scaled down so that the smallest of width and height fits within [-1, 1] of the projection coordinate (u for width and v for height). The ratio between width and height must stay the same, which yields the other coordinate. For example, with a resolution of 640x480 pixels the top of the image gets coordinate v = 1, the bottom v = -1, the left side u = 640/480 = 1.333... and the right side u = -640/480 = -1.333.
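A short sketch of this coordinate convention, mapping a pixel position to (u, v) so that the smaller image dimension spans [-1, 1], with v growing upwards and u growing towards the left as in the 640x480 example above. The function name is chosen here for illustration.

#include <algorithm>

// Convert a pixel position (px, py), with (0, 0) in the top-left corner,
// to the normalized image plane coordinates (u, v) described above.
void pixelToImagePlane(double px, double py, int width, int height,
                       double& u, double& v)
{
    double halfMin = 0.5 * std::min(width, height);  // the smaller dimension spans [-1, 1]
    u = (0.5 * width - px) / halfMin;   // left edge -> +1.333, right edge -> -1.333 for 640x480
    v = (0.5 * height - py) / halfMin;  // top edge -> +1, bottom edge -> -1
}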

3.3 Face parameters

The face model used is Candide, as described in section 2.3. In the model itself, only action units will be estimated; the shape units (s̄) have to be manually adapted to the tracked face in the first frame. Scaling of the model is fixed during the tracking and is thus also manually set (normally to 1). Rotation and translation will be estimated. Rotation, translation and action units have to be initialized for the model to fit the correct position on the face once projected onto the image plane.

3.4 Initializing feature points

In the first frame, after face parameters have been set/initialized, feature points must be chosen. The number of feature points may vary. The more feature points that are used, the more data is available for estimating parameters and therefore the estimation should be more accurate. But more points also mean more calculations, for example more feature points to search for in new frames and larger matrices to calculate in the extended Kalman filter. This results in slower tracking.

The feature points should be spread out over the face. This is for two reasons. First, there should be enough feature points near vertices that are affected by the action units one wants to estimate, to be able to obtain a good estimation. Second, the smaller the distance between feature points, the larger the relative distance error gets (feature point searching is not perfect) and parameter estimation gets harder. The method used for finding good feature points is not specifically defined and thus not part of this work.

When the feature points are found as projected points in the first frame, each of them is checked with respect to which projected model surface triangle it falls into. They are then linearly interpolated from the triangle corners into the face model as new vertices with their own values in ḡ, S and A. These values are interpolated from the corresponding values belonging to the corner vertices by using a weight for each corner. The weights sum to 1 and define the linear combination of the corner vertex coordinates that gives the feature point vertex coordinates. Figure 3.1 might help to clear things up.

Figure 3.1. Feature point interpolation

The feature points are not actually part of the model, but more like virtual vertices that are different from one tracking session to another.
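The interpolation can be sketched as follows: given the three barycentric weights (summing to 1) of the feature point inside its triangle, the new virtual vertex gets the same weighted combination of the corner vertices' rows in ḡ, S and A. The flat storage layout and the function name are assumptions made for illustration.

#include <vector>

// Interpolate one (x, y, z) triple of model data for a feature point from the
// corner vertices i0, i1, i2 with barycentric weights w0, w1, w2 (summing to 1).
// 'data' holds 3 values per vertex, e.g. the fixed model g or one column of S or A.
void interpolateFeatureVertex(const std::vector<double>& data,
                              int i0, int i1, int i2,
                              double w0, double w1, double w2,
                              double out[3])
{
    for (int k = 0; k < 3; ++k)
        out[k] = w0 * data[3 * i0 + k] + w1 * data[3 * i1 + k] + w2 * data[3 * i2 + k];
}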

3.5 Tracking feature points

To track feature points from the current frame to a new frame, pattern template matching is used. The points are seen as totally independent of each other, so the search is done separately for each point. Therefore we consider a single feature point from here on.

First of all the feature point vertex is updated using the current face parameters, and then projected onto the image plane to get the current projection coordinates. A square pattern template is extracted from the current frame with its center at these coordinates. The size of the template (size times size pixels) is odd and variable. It is variable in the sense of being dependent on the z-translation parameter of the current model state. Making it variable causes the template to cover about the same area of the tracked face independent of how far away the face is, which is reasonable. The size calculation is size_used = f/t_z * size_fixed, rounded to the closest integer, although the minimum size is 3 pixels.

Because the image plane uses floating point coordinates and the frames are sampled (pixels), the feature point projection coordinates might not fall exactly on a pixel center. Values for the pixels in the template are therefore linearly interpolated from the four nearest pixels in the frame.

Now, a search for the best matching pixel is done in the new frame. The search is limited to a certain range, see figure 3.2, around the center pixel (the pixel closest to the feature point coordinates). The search is limited for two reasons. First, the new coordinates for the feature point should be located close to the current coordinates. Limiting the search avoids false best matches that could otherwise be found outside the range. The other reason is calculation speed. The shorter the range, the fewer calculations have to be done, and therefore the faster the tracking. The size of the search range is up to the user to decide, with respect to video resolution, speed, and how fast the face will move. Also the comparison function can be chosen by the user. The frames are converted to grayscale (if needed) before matching, which means the search is done on only one channel.

Figure 3.2. Limited search range

Since the search compares whole pixels, the new feature point coordinates will have the accuracy of one pixel. To achieve subpixel accuracy, the output coordinates are the average pixel coordinates over a 3x3 pixel block around the best match pixel. The weights used are the normalized (over the block) scalar results from the matching. Figure 3.3 shows a 1D (one dimensional) illustration of the pixel coordinate averaging.

Figure 3.3. 1D pixel coordinate averaging
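The subpixel step can be sketched as a weighted average of pixel coordinates over the 3x3 block around the best match, with the normalized match scores as weights. Larger scores are assumed to mean better matches here, and the function name is illustrative only.

// Weighted 3x3 averaging around the best match pixel (bx, by). score(x, y)
// must return a non-negative match quality where larger means better.
template <typename ScoreFn>
void subpixelRefine(ScoreFn score, int bx, int by, double& outX, double& outY)
{
    double sum = 0.0, sx = 0.0, sy = 0.0;
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx) {
            double s = score(bx + dx, by + dy);
            sum += s;
            sx += s * (bx + dx);
            sy += s * (by + dy);
        }
    outX = (sum > 0.0) ? sx / sum : bx;  // fall back to the integer best match
    outY = (sum > 0.0) ? sy / sum : by;
}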

3.6 Two stage tracking

To adapt the face model (estimating the face parameters) to the face in a new frame, one must first find the projection coordinates for the feature points in the new frame. The new coordinates are then sent to an extended Kalman filter (see the next section) which estimates the new face parameters.

Finding the new feature point coordinates is done as described in the previous section. The problem now is that we have used estimated face parameters to extract pattern templates out of the face in the current frame. This means that the correct feature point coordinates in the frame are probably not at the center of the template images. Then, when the new feature point coordinates are found in the new frame there will be a small displacement error for each point, which will lead to an error in the newly estimated face parameters. When adapting to another new frame using this even worse face parameter estimate, the errors in the new feature point coordinates get even larger. As we continue adapting to new frames, the errors will propagate and grow, and thus we will have a definitely unstable tracker. Notice that matching, even though subpixel accuracy is achieved, is not perfect and will generate small errors. But these errors are independent between updates and will therefore not propagate.

To solve this error propagation problem, another search round for the same new frame is made. But instead of using the previous frame, we use the newly estimated face parameters to draw a textured model on top of the new frame. The texture is taken from the very first frame, in which the face is supposed to look straight into the camera. By doing this, we know the exact locations of the feature point coordinates in the frame, and the pattern template will be correctly extracted. In this search round, the search range is only one pixel (refinement), since the coordinate errors have not propagated and are still small. The new refined coordinates are sent to the extended Kalman filter and new refined face parameters are estimated. So why not use a textured model directly? This can of course be done and would also reduce computation time. The reason for using the combination is that a textured model is not perfect, especially not a low vertex model such as the Candide model. When the face moves a lot between frames, it is better if the pattern templates are extracted from a perfect model (the face in the frame). This makes use of the motion information between frames for a rough face parameter estimation. The textured model is just used for stabilization.

Figure 3.4 shows the face parameter estimation update between two frames in a diagram.


3.7 The extended Kalman filter

The remaining problem is to estimate the face parameters from feature point coordinates. The face (model) can be seen as a system, where the face parameters are the internal parameters and the feature point coordinates are the output data (measurements). We use a Kalman filter to solve this problem.

The unknown internal parameters to estimate are r_x, r_y, r_z, t_x, t_y, t_z and ā. The number of parameters to estimate is thus 6 + N_a. Changes of these in time can be modeled as independent noise with zero mean, for simplicity. This means that the matrix A_t will be an identity matrix I. The measurements are the projection coordinates u and v for all feature points, which means that the number of input data is 2N_f, where N_f is the number of feature points.

The process noise covariance matrix Q_t is a diagonal matrix, as a result of the noises being independent. We set all the values to 1, making it an identity matrix. This is to keep it simple, and after a bit of thinking it also seems somewhat reasonable that all parameters should have about the same variance.

When it comes to the measurement noise covariance matrix R_t, a diagonal matrix is used. Since the measurements are feature point coordinates, the measurement noise is simply the error after matching. The errors are assumed to be independent between feature points as well as between the coordinates u and v. There is no reason that these error variances should differ between the coordinates u and v. The values in R_t are set depending on the result from the matching. The feature point with the largest matching result scalar D_max gets u and v values of 1. Feature point number i gets values of D_max/D_i. In the case of the best match having the smallest value, all result scalars are first transformed as D_i = 1/(1 + D_i). Doing this makes the Kalman filter consider feature points with good match results more than those with bad results, for example to handle occlusion. But there is also a small drawback. The match results should be independent of the absolute values of the pixels (light and dark areas), which means that the matching function needs to do some kind of normalization over the pixels.

Setting the values in Q_t and R_t this way is not actually the whole story. To be able to vary the sensitivity of the filter, Q_t or R_t is scaled with a single scalar. This way we can get a fast or slow tracker by only changing one parameter.
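A sketch of how the diagonal of R_t could be filled from the match results according to the rule above: the best match gets variance 1 and feature point i gets D_max/D_i, after inverting the scores when a smaller raw value means a better match. Names and storage are illustrative only.

#include <vector>
#include <algorithm>

// Build the diagonal of R_t (one value per u and one per v measurement) from
// the per-feature-point match scores D. If smallerIsBetter, the scores are
// first transformed as D_i = 1 / (1 + D_i) so that larger always means better.
std::vector<double> measurementNoiseDiagonal(std::vector<double> D, bool smallerIsBetter)
{
    if (smallerIsBetter)
        for (double& d : D) d = 1.0 / (1.0 + d);
    double Dmax = *std::max_element(D.begin(), D.end());

    std::vector<double> diag;
    for (double d : D) {
        double var = (d > 0.0) ? Dmax / d : 1e6;  // best match -> 1, worse matches -> larger variance
        diag.push_back(var);  // value for the u coordinate
        diag.push_back(var);  // value for the v coordinate
    }
    return diag;
}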

The last matrix is the parameter-measurement relation matrix H_t. To construct it, we need the full mapping from face parameters to feature point projection coordinates; the following has to be done.

1. Add shape and action units to the fixed model, using equation 2.8.

2. Scale, rotate and translate the vertices, using equations 2.1-2.5.

3. Project the modified vertices, using equation 2.7.

Scaling and rotation must be done before translation, or otherwise the translation will be scaled and rotated as well. As the scaling is the same in all directions, it doesn't matter if it is done before or after rotation. The result will be the same, because scaling is just multiplication of all values by a scalar. Notice that scaling is not necessary. It is included here so that there is a possibility to change the global size of the face (head).

Rotation will be done around all three axes. The order of the rotations doesn't matter; the difference is just that the rotation angles for each axis will differ between cases. All combinations will work. We choose to rotate first around the z-axis, then the x-axis and last the y-axis, as this order feels somewhat natural and easy to handle. Using the rotation matrices in equations 2.2-2.4, this order yields the complete rotation matrix

\begin{pmatrix} R_{xx} & R_{xy} & R_{xz} \\ R_{yx} & R_{yy} & R_{yz} \\ R_{zx} & R_{zy} & R_{zz} \end{pmatrix} = R_y R_x R_z \qquad (3.1)

with

R_{xx} = \cos r_y \cos r_z + \sin r_x \sin r_y \sin r_z \qquad (3.2)

R_{xy} = -\cos r_y \sin r_z + \sin r_x \sin r_y \cos r_z \qquad (3.3)

R_{xz} = \cos r_x \sin r_y \qquad (3.4)

R_{yx} = \cos r_x \sin r_z \qquad (3.5)

R_{yy} = \cos r_x \cos r_z \qquad (3.6)

R_{yz} = -\sin r_x \qquad (3.7)

R_{zx} = -\sin r_y \cos r_z + \sin r_x \cos r_y \sin r_z \qquad (3.8)

R_{zy} = \sin r_y \sin r_z + \sin r_x \cos r_y \cos r_z \qquad (3.9)

R_{zz} = \cos r_x \cos r_y \qquad (3.10)

Using steps 1 and 2 for one feature point, we get its vertex coordinates in the fully modified model

\begin{pmatrix} x_k \\ y_k \\ z_k \end{pmatrix} = c \begin{pmatrix} R_{xx} & R_{xy} & R_{xz} \\ R_{yx} & R_{yy} & R_{yz} \\ R_{zx} & R_{zy} & R_{zz} \end{pmatrix} \begin{pmatrix} b_{xk} \\ b_{yk} \\ b_{zk} \end{pmatrix} + \begin{pmatrix} t_x \\ t_y \\ t_z \end{pmatrix} \qquad (3.11)

and by adding step 3, we get the feature point projection coordinates

\begin{pmatrix} u_k \\ v_k \end{pmatrix} = \frac{f}{z_k} \begin{pmatrix} x_k \\ y_k \end{pmatrix} \qquad (3.12)

where k is the feature point's vertex number.

Expanding the formulas might make it easier to follow the next step, but it would take up a lot of space and is not strictly necessary, so it will not be done here.

From these equations, we can see that the function from face parameters to projection coordinates is nonlinear, and we cannot construct H_t directly. We have to use an extended Kalman filter. The function h(x̂(t), 0) in equation 2.22 is already given by equations 3.11-3.12. A_t is still linear (an identity matrix), so that is not a problem. But H_t has to be constructed using equation 2.20. From equations 3.12, 3.11 and 2.8 we get

\frac{\partial}{\partial r_x}\begin{pmatrix} u \\ v \end{pmatrix} = \frac{fc}{z^2}\begin{pmatrix} \left(\frac{\partial R_{xx}}{\partial r_x}b_x + \frac{\partial R_{xy}}{\partial r_x}b_y + \frac{\partial R_{xz}}{\partial r_x}b_z\right)z - \left(\frac{\partial R_{zx}}{\partial r_x}b_x + \frac{\partial R_{zy}}{\partial r_x}b_y + \frac{\partial R_{zz}}{\partial r_x}b_z\right)x \\ \left(\frac{\partial R_{yx}}{\partial r_x}b_x + \frac{\partial R_{yy}}{\partial r_x}b_y + \frac{\partial R_{yz}}{\partial r_x}b_z\right)z - \left(\frac{\partial R_{zx}}{\partial r_x}b_x + \frac{\partial R_{zy}}{\partial r_x}b_y + \frac{\partial R_{zz}}{\partial r_x}b_z\right)y \end{pmatrix} \qquad (3.13)

\frac{\partial}{\partial r_y}\begin{pmatrix} u \\ v \end{pmatrix} = \frac{fc}{z^2}\begin{pmatrix} \left(\frac{\partial R_{xx}}{\partial r_y}b_x + \frac{\partial R_{xy}}{\partial r_y}b_y + \frac{\partial R_{xz}}{\partial r_y}b_z\right)z - \left(\frac{\partial R_{zx}}{\partial r_y}b_x + \frac{\partial R_{zy}}{\partial r_y}b_y + \frac{\partial R_{zz}}{\partial r_y}b_z\right)x \\ \left(\frac{\partial R_{yx}}{\partial r_y}b_x + \frac{\partial R_{yy}}{\partial r_y}b_y + \frac{\partial R_{yz}}{\partial r_y}b_z\right)z - \left(\frac{\partial R_{zx}}{\partial r_y}b_x + \frac{\partial R_{zy}}{\partial r_y}b_y + \frac{\partial R_{zz}}{\partial r_y}b_z\right)y \end{pmatrix} \qquad (3.14)

\frac{\partial}{\partial r_z}\begin{pmatrix} u \\ v \end{pmatrix} = \frac{fc}{z^2}\begin{pmatrix} \left(\frac{\partial R_{xx}}{\partial r_z}b_x + \frac{\partial R_{xy}}{\partial r_z}b_y + \frac{\partial R_{xz}}{\partial r_z}b_z\right)z - \left(\frac{\partial R_{zx}}{\partial r_z}b_x + \frac{\partial R_{zy}}{\partial r_z}b_y + \frac{\partial R_{zz}}{\partial r_z}b_z\right)x \\ \left(\frac{\partial R_{yx}}{\partial r_z}b_x + \frac{\partial R_{yy}}{\partial r_z}b_y + \frac{\partial R_{yz}}{\partial r_z}b_z\right)z - \left(\frac{\partial R_{zx}}{\partial r_z}b_x + \frac{\partial R_{zy}}{\partial r_z}b_y + \frac{\partial R_{zz}}{\partial r_z}b_z\right)y \end{pmatrix} \qquad (3.15)

\frac{\partial}{\partial t_x}\begin{pmatrix} u \\ v \end{pmatrix} = \frac{f}{z}\begin{pmatrix} 1 \\ 0 \end{pmatrix} \qquad (3.16)

\frac{\partial}{\partial t_y}\begin{pmatrix} u \\ v \end{pmatrix} = \frac{f}{z}\begin{pmatrix} 0 \\ 1 \end{pmatrix} \qquad (3.17)

\frac{\partial}{\partial t_z}\begin{pmatrix} u \\ v \end{pmatrix} = -\frac{f}{z^2}\begin{pmatrix} x \\ y \end{pmatrix} \qquad (3.18)

\frac{\partial}{\partial a_i}\begin{pmatrix} u \\ v \end{pmatrix} = \frac{fc}{z^2}\begin{pmatrix} (R_{xx}A_{x,i} + R_{xy}A_{y,i} + R_{xz}A_{z,i})z - (R_{zx}A_{x,i} + R_{zy}A_{y,i} + R_{zz}A_{z,i})x \\ (R_{yx}A_{x,i} + R_{yy}A_{y,i} + R_{yz}A_{z,i})z - (R_{zx}A_{x,i} + R_{zy}A_{y,i} + R_{zz}A_{z,i})y \end{pmatrix} \qquad (3.19)

For every time step we now use equations 3.13-3.19 to calculate H_t as

H_t = \begin{pmatrix} \frac{\partial u_1}{\partial r_x} & \frac{\partial u_1}{\partial r_y} & \frac{\partial u_1}{\partial r_z} & \frac{\partial u_1}{\partial t_x} & \frac{\partial u_1}{\partial t_y} & \frac{\partial u_1}{\partial t_z} & \frac{\partial u_1}{\partial a_1} & \cdots & \frac{\partial u_1}{\partial a_{N_a}} \\ \frac{\partial v_1}{\partial r_x} & \frac{\partial v_1}{\partial r_y} & \frac{\partial v_1}{\partial r_z} & \frac{\partial v_1}{\partial t_x} & \frac{\partial v_1}{\partial t_y} & \frac{\partial v_1}{\partial t_z} & \frac{\partial v_1}{\partial a_1} & \cdots & \frac{\partial v_1}{\partial a_{N_a}} \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\ \frac{\partial u_{N_f}}{\partial r_x} & \frac{\partial u_{N_f}}{\partial r_y} & \frac{\partial u_{N_f}}{\partial r_z} & \frac{\partial u_{N_f}}{\partial t_x} & \frac{\partial u_{N_f}}{\partial t_y} & \frac{\partial u_{N_f}}{\partial t_z} & \frac{\partial u_{N_f}}{\partial a_1} & \cdots & \frac{\partial u_{N_f}}{\partial a_{N_a}} \\ \frac{\partial v_{N_f}}{\partial r_x} & \frac{\partial v_{N_f}}{\partial r_y} & \frac{\partial v_{N_f}}{\partial r_z} & \frac{\partial v_{N_f}}{\partial t_x} & \frac{\partial v_{N_f}}{\partial t_y} & \frac{\partial v_{N_f}}{\partial t_z} & \frac{\partial v_{N_f}}{\partial a_1} & \cdots & \frac{\partial v_{N_f}}{\partial a_{N_a}} \end{pmatrix} \qquad (3.20)

We now have everything we need, except the initial values for the parameters and the estimated parameter error covariance matrix P. The parameters are set as described in the face parameter section 3.3. P is set to an identity matrix, again for simplicity.
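As a small concrete example of filling H_t, the sketch below computes the translation columns for one feature point, which follow directly from equations 3.16-3.18; the rotation and action unit columns would be filled in the same way from equations 3.13-3.15 and 3.19. The column ordering and the use of a double-precision cv::Mat are assumptions for illustration.

#include <opencv2/core.hpp>

// Fill the t_x, t_y, t_z derivative entries of H (type CV_64F) for feature
// point k, following equations 3.16-3.18. (x, y, z) are the transformed vertex
// coordinates from equation 3.11 and f is the focal distance. Row 2k holds the
// u derivatives, row 2k+1 the v derivatives; columns 3-5 are assumed to be the
// translation columns.
void fillTranslationColumns(cv::Mat& H, int k, double x, double y, double z, double f)
{
    H.at<double>(2 * k,     3) = f / z;             // du/dtx
    H.at<double>(2 * k,     4) = 0.0;               // du/dty
    H.at<double>(2 * k,     5) = -f * x / (z * z);  // du/dtz
    H.at<double>(2 * k + 1, 3) = 0.0;               // dv/dtx
    H.at<double>(2 * k + 1, 4) = f / z;             // dv/dty
    H.at<double>(2 * k + 1, 5) = -f * y / (z * z);  // dv/dtz
}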


Chapter 4

Implementation

4.1 C/C++

The face tracker is implemented in C/C++. To manage the image processing part (matching), camera input, matrix/vector computation and the extended Kalman filter, we use the open source library OpenCV (Open Computer Vision) [7]. OpenGL (Open Graphics Library) [8] is used for face model rendering, and the GLUT (OpenGL Utility Toolkit) library [9] is used to handle window management and control of the tracker.

For image processing we first used a C++ library called CImg (CoolImage) [10], which is very simple and easy to use. But later on it was realized that we wanted more advanced matching features than it provided, so this was the reason to use OpenCV, which seemed to be faster but also a bit more complex to use.

The extended Kalman filter and the matrix/vector computation were originally implemented using the template based C++ library Bayes++ [11], which uses uBLAS from the C++ library Boost [12] to handle matrices and vectors. But Bayes++ was slow, and since it was noticed that OpenCV included an implementation of the basic Kalman filter as well as mathematics for matrices, we converted the code to use that library instead. The result was an estimated performance increase of 5-10 times compared to Bayes++. While a filter update using Bayes++ took 10-20 ms, it took only 1-4 ms with OpenCV. With OpenCV we also gained more control over the computations done.

OpenGL was an obvious choice for graphics rendering, since it is the main graphics API (Application Programming Interface) for a wide range of operating systems, and has hardware support in most common graphics cards.


The tracker is implemented as a single class, but many of its functions are stand-alone. It assumes that there exists a double buffered OpenGL window, since it uses it to draw textured models. The main program, using GLUT, simply handles control and frame input to the tracker as well as display functions. Tracking is not stopped when there are no new frames. Instead the current frame is reused as a new frame. This makes it possible to test real time performance higher than the actual framerate.

The code is written and compiled using Microsoft Visual C++ 2005 Express Edition on Microsoft Windows XP Home Edition. It is written in a general fashion, so essentially all of the code should be portable to other operating systems that support the various libraries used.

4.2 User interface

A GUI (Graphical User Interface) is not present, but basic control and output feedback is displayed in a console window. Initialization of the tracker is controlled by using the mouse and a number of keys on the keyboard. Settings are managed using simple text files.

4.3 Feature points and matching

Finding good feature points in the first frame, which as already said is not part of this work, could be done manually. However, we have chosen to use an automatic function available in the OpenCV library, called "cvGoodFeaturesToTrack", which basically finds corners with large eigenvalues. The user has some control over the placement of the feature points, by defining in which model surface triangles they may be placed as well as by setting a minimum distance between them.

OpenCV currently has six built in comparison functions that can be used in the matching: two variants of summed squared differences, and four variants of correlation (multiplication of pixel values). Which function to use is up to the user. The search range for both tracking stages can be set. The pattern template size as well as the work resolution (in pixels) for the frames are also set by the user, but they are the same for both tracking stages. The work resolution is the resolution of the frames in which we search for the feature points. The higher the resolution, the finer the quality of the search, but the computation time will increase.
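For reference, this is roughly how the corresponding calls look in OpenCV's newer C++ API (the implementation used the older C function cvGoodFeaturesToTrack); the parameter values are placeholders, not the settings used in the tests.

#include <opencv2/imgproc.hpp>
#include <vector>

// Find corner-like feature point candidates and evaluate one template match,
// corresponding to the OpenCV functionality described above.
void featureAndMatchExample(const cv::Mat& grayFrame, const cv::Mat& patch)
{
    // Corner candidates with large eigenvalues (cf. cvGoodFeaturesToTrack).
    std::vector<cv::Point2f> corners;
    cv::goodFeaturesToTrack(grayFrame, corners,
                            50,     // maximum number of corners (placeholder)
                            0.01,   // quality level (placeholder)
                            10.0);  // minimum distance between corners (placeholder)

    // Mean subtracted and normed correlation (cf. section 5.1).
    cv::Mat result;
    cv::matchTemplate(grayFrame, patch, result, cv::TM_CCOEFF_NORMED);

    double maxVal;
    cv::Point maxLoc;
    cv::minMaxLoc(result, nullptr, &maxVal, nullptr, &maxLoc);  // location of the best match
}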


4.4 The face model

We have added/changed several shape and action units in the Candide model to make it more flexible. These are not meant to constitute a new version (although they could). The reason is to better suit the test face (the author). The changes can be viewed in the Candide definition file "candide3.wfm" that we use. There is also a setting for which action units will be used in the tracking.

To handle the case when the tracking becomes unstable, we have implemented limits on all estimated face parameters. The limits are set by the user, and when any parameter exceeds its limits, the tracking is stalled and the model is reset. The model can be picked up again by placing the face in the model and pressing a button. The feature points do not have to be reinitialized.

4.5 Other issues

Note that there are in fact more settings implemented than those described above. Some of them affect performance as well as quality while others affect quality only. For these we refer to a readme file that is included with the code.

The program has almost no error checking, which means that, for example, segmentation faults might occur if the model gets outside the image for some reason.


Chapter 5

Results

5.1 Testing

The best way to get a picture of how well the tracker actually performs is of course to use it and test different settings. To make the reader somewhat aware of its real time performance and quality, some example tests will be performed and included here. Screenshots of the tracker both succeeding and failing in different situations will also be displayed.

There will be two test targets. Number one is the real time performance, measured in frames per second, that the tracker can handle. Number two is how broad angles, relative to the camera, the tracker can handle while still estimating rotation correctly. The second test will be measured visually, in the sense that approximate horizontal and vertical angles will be read just before the tracker goes unstable or the model no longer fits the face. All action units will be disabled in these tests. Only rotation and translation will be estimated.

The test results below use three different numbers of feature points and three different template sizes. The work resolution will be 240x180 pixels and the search area for the first tracking stage will be 19x19 pixels. The matching function used will be the most complex function that OpenCV has, namely mean subtracted and normed correlation. Other settings may be viewed in the face tracker's configuration file (see the readme file that is included with the code).

Tested facial animations are nose wrinkling, eyebrow movement and mouth movement. These will show up in the screenshots.

The computer used for testing has a 2 GHz AMD Athlon 64 processor and 1 GB of system memory shared with integrated ATI Radeon XPRESS 200 graphics. The tracker uses maximum CPU utilization when tests are performed. The camera is a QuickCam Pro 5000 from Logitech.

5.2 Test results

In the table below, '#fp' is the number of feature points used, 'size' is the template size in pixels, 'fps' is the rounded frames per second, and 'h-rot' and 'v-rot' are the horizontal and vertical rotation in rounded degrees relative to the camera. There are two values for each rotation. For h-rot, the first is rotation to the left (for the viewer) and the second is rotation to the right. For v-rot, the first is rotation downwards and the second is rotation upwards.

        Test1  Test2  Test3  Test4  Test5  Test6  Test7  Test8  Test9
#fp     25     25     25     50     50     50     75     75     75
size    11x11  19x19  29x29  11x11  19x19  29x29  11x11  19x19  29x29
fps     40     38     35     23     21     19     8      8      7
h-rot   35,40  25,40  20,22  36,41  25,35  23,35  38,32  21,20  23,24
v-rot   24,19  24,14  24,14  30,12  23,13  34,10  31,15  25,13  28,16

Table 5.1. Results

The camera runs at 25 frames per second during all tests and the input resolution is 320x240 pixels. The z-translation is about 5 in all the tests, and the distance f is 3. This means that the actual template sizes are about 40% less than given above. In figure 5.1 you can see the mask used to initialize feature points. Figures 5.2-5.10 show the feature points used in the different tests as small black squares.

Figures 5.11-5.13 show the tracker succeeding. The settings of Test4 are used, but during a different tracking session, so the feature points are different.

Figure 5.1. Feature point mask
Figure 5.2. Test1
Figure 5.5. Test4
Figure 5.6. Test5
Figure 5.9. Test8
Figure 5.10. Test9
Figure 5.13. Angry
Figure 5.14. Failure1

Chapter 6

Conclusions

6.1 Discussion

It works! There are some problems, but in general it works. Real time performance is achieved on a normal computer, although not when using too many feature points. The tracker can also estimate rotation, translation and face animations, as can be seen in figures 5.11-5.13, so the requirements are met. A note on the framerate: when measuring frames per second, the video stream display was disabled due to a severe performance decrease when writing pixel data to the window framebuffer using OpenGL. But that is not part of the tracker, so it is acceptable. By changing and optimizing the code, we believe that a higher framerate can be gained.

The worst problem seen in the results is large angles, especially when rotating the head upwards. The tracker usually does not get unstable directly, but the face model is rotated too much, see figures 5.14 and 5.15. The reason for this is probably a combination of a badly fit model and some feature points disappearing ("behind the face"). If the model does not fit, the texture and the real face will not match in the correct positions, and thus not the feature points either. If the feature points disappear, the match results will give the wrong information to the extended Kalman filter, and the estimation will have larger errors. See figures 6.1 and 6.2. Figure 6.1 shows a frame and figure 6.2 shows the same frame with the textured estimated model on top of it. One can easily see how angular the model is at the mouth, nose and eyebrows. Horizontal rotation to the left seems to be harder to estimate than to the right, even though the face is symmetric. The reason for this is not known, but could be interesting to investigate.

Figure 6.1. Frame
Figure 6.2. Frame with textured estimated model on top

In figure 5.16 above it can be seen that the tracker also cannot handle occlusion. This could be a sign that setting R_t the way we did is rather useless, unless some kind of outlier checking is done. Lighting changes, though, do not seem to be a big problem (from tests not shown here), as long as the frames do not get too bright or too dark.

When estimating only rotation and translation, the number of feature points does not seem to matter a lot. This seems reasonable, since only 6 parameters are estimated from at least 50 values (25 feature points), which should be more than enough. The template size, on the other hand, seems to have some impact. The template is a 2D image. The face is not 2D. When the face moves, the projected image gets deformed, and the larger the subimage to match, the more deformation is matched against. Therefore it should be better to use a smaller template. But the template must not be too small. The smaller it is, the less data (fewer pixels) there is to describe the pattern (lower resolution).

A very important thing is the initialization. If the model is not placed well enough on the face and/or the shape units are wrong, it will fit badly and hence it will be badly tracked. Perhaps the results would have been better if we had tweaked the shape units to make the model fit the face better.

The feature point placement is another important issue. This matters more when face animations are involved. If there are no feature points at the eyebrows, for example, you will not be able to track their movement. This is where the number of feature points comes in for real. More feature points help to cover the different animations. More points are also necessary since, with animations, the number of estimated parameters grows.


Another issue is tracking of eye motion. This did not work well at all, but could possibly be improved with some tweaking of the shape units.

6.2 Variants of the tracker

As said in section 3.6, the current frame is used to roughly estimate the face parameters in the new frame. A textured model is then used to stabilize them. We also tested using the textured model directly; however, the result compared to the current two stage algorithm was very "shaky". The face parameters were poorly estimated, which made the tracker unstable. By doing the search only one time, the extended Kalman filter will be updated only once, so some computation time will be saved. The feature point search (matching) is done with a full search area and there will be no refinement. As the refinement search area contains only a few percent of the number of pixels in the full search area, there should be almost no performance increase from the skipped refinement in the matching part of the algorithm. A simple test showed that the real time performance increase for the whole algorithm was about 30-40% for the settings used in Test4 in the results in this report.

A variant of the two stage algorithm that could be interesting to try, is to switch the order of the two stages. By using a textured model to do a very rough search for feature points, and then refine directly using the current frame, only one extended Kalman filter update will be needed. A drawback here is that the rough search might be so rough that a refinement search range of 1 pixel is not enough. The refinement search area will have to be increased, and with this follows a higher computational cost.

Another variant with switched orders is to map the current frame as a texture on the model and use that for refinement. We start doing a rough estimation as in the previous paragraph, but we also update the extended Kalman filter. After that, the current frame textured model, with updated parameters, is used to search for better feature point coordinates. Then a second update with the extended Kalman filter has to be done. In essence this means that we update the texture data to correspond to the actual view.

6.3 Future improvements

We have a few future improvements in mind, both implementation performance improvements and parameter estimation quality improvements. First, since most of the computations are done on floating point units, many of them can be done simultaneously. For example, matching of the feature points could theoretically be done at the same time. Any computations that can be done in parallel could be implemented using more than one thread on the upcoming CPUs with several cores. One can also use the computation power of a good graphics card, since graphics cards are built with pipelines to perform many floating point calculations in parallel. To improve parameter estimation quality, it would be a great idea to use a better model with a lot more vertices, to make the surface smooth. More shape units would probably help the model fit better. To solve the large angle problem, feature points should also be disabled when they are hidden. New points could be added as the face is rotated. This is actually one of the features Jacob Ström included in his head tracker. Other, more sophisticated methods to choose feature points should also be used, to gain more control of where in the face the points will be chosen.


Bibliography

[1] Jacob Ström, Model-Based Head Tracking and Coding, Ph.D. thesis, No. 733, Linköping Studies in Science and Technology, 2002.

[2] Pertti Roivainen, Motion Estimation in Model Based Coding of Human Faces, Ph.D. thesis, No. 225, Linköping Studies in Science and Technology, 1990.

[3] Haibo Li, Low Bitrate Image Sequence Coding, Ph.D. thesis, No. 318, Linköping Studies in Science and Technology, 1993.

[4] Per-Erik Danielsson, Olle Seger, Maria Magnusson Seger, Ingemar Ragnemalm, Bildanalys-TSBB52-2005, course compendium, Department of Electrical Engineering, Linköping University, 2005. Chapter 4.

[5] Donald Hearn and M. Pauline Baker, Computer Graphics with OpenGL, Third Edition, Pearson Prentice Hall, Upper Saddle River, USA, 2004.

[6] Jörgen Ahlberg, Model-Based Coding, Ph.D. thesis, No. 761, Linköping Studies in Science and Technology, 2002. Chapter 5.

[7] The OpenCV website: http://www.intel.com/technology/computing/opencv/

[8] The OpenGL website: http://www.opengl.org/

[9] The GLUT for Win32 website: http://www.xmission.com/~nate/glut.html

[10] The CImg website: http://cimg.sourceforge.net/

[11] The Bayes++ website: http://bayesclasses.sourceforge.net/Bayes++.html

[12] The Boost website: http://www.boost.org/

[13] Fredrik Gustafsson, Lennart Ljung and Mille Millnert, Signalbehandling, Studentlitteratur, Lund, Sweden, 2001. Chapter 8.

[14] The Candide website (within ICG): http://www.bk.isy.liu.se/candide/

[15] Stan Z. Li, Anil K. Jain, Handbook of Face Recognition, Springer Science+Business Media, New York, USA, 2005. Chapter 4.



Copyright

The publishers will keep this document online on the Internet — or its possible replacement — for a period of 25 years from the date of publication barring exceptional circumstances.

The online availability of the document implies a permanent permission for anyone to read, to download, to print out single copies for his/her own use and to use it unchanged for any non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional on the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/

© Nils Ingemars
