
Face Detection and Pose Estimation

using Triplet Invariants

Marcus Isaksson LiTH-ISY-EX-3223-2002

Face Detection and Pose Estimation
using Triplet Invariants

Master's thesis carried out in Image Processing
at Linköping Institute of Technology

by

Marcus Isaksson
Reg nr: LiTH-ISY-EX-3223-2002

Supervisor: Gösta Granlund
Examiner: Klas Nordberg

Linköping, February 22, 2002.

Division, Department: Institutionen för Systemteknik, 581 83 LINKÖPING
Date: 2002-02-27
Language: English
Report category: Examensarbete (Master's thesis)
ISRN: LITH-ISY-EX-3223-2002
URL for electronic version: http://www.ep.liu.se/exjobb/isy/2002/3223/
Title (Swedish): Ansiktsdetektering med hjälp av triplet-invarianter
Title: Face Detection and Pose Estimation using Triplet Invariants
Author: Marcus Isaksson

Abstract

Face detection and pose estimation are two widely studied problems - mainly because of their use as subcomponents in important applications, e.g. face recognition. In this thesis I investigate a new approach to the general problem of object detection and pose estimation and apply it to faces. Face detection can be considered a special case of this general problem, but is complicated by the fact that faces are non-rigid objects. The basis of the new approach is the use of scale and orientation invariant feature structures – feature triplets – extracted from the image, as well as a biologically inspired associative structure which maps from feature triplets to desired responses (position, pose, etc.). The feature triplets are constructed from curvature features in the image and coded in a way to represent distances between major facial features (eyes, nose and mouth). The final system has been evaluated on different sets of face images.

Key Words: Face Detection, Pose Estimation, Neural Networks, HiperLearn, Triplet Invariants.


Acknowledgments

First of all I would like to thank my supervisor Gösta Granlund for introducing me to the ideas on which this thesis is based and for helping me throughout the project. I would also like to thank Joakim Jaldén for valuable discussions on optimization theory, Jörgen Ahlberg for letting me use images collected at the Image Coding Group, Anders Moe for sharing related code and ideas, and of course all the people at the Computer Vision Lab for a pleasant environment to work in. Finally I would like to thank my friends and family for their constant support.


Notation

Symbols

x, p    Lowercase boldface letters are used for vectors.
A       Uppercase boldface letters are used for matrices.
z, s    Lowercase letters are used for complex numbers.
X       Uppercase letters are used for sets.

Operators and functions

∗           Convolution operator.
z∗          Complex conjugate of z.
A^T         Matrix transpose of A.
x • y       Scalar product of the vectors x and y.
x ⊗ y       Kronecker product of the vectors x and y, i.e. the vector consisting of all possible products of an element of the first vector and an element of the second vector, e.g. [1 2]^T ⊗ [7 9]^T = [7 9 14 18]^T.
g_σ(x, y)   Gaussian probability distribution function with zero mean and standard deviation σ, i.e. g_σ(x, y) = (1/(2πσ²)) e^{−(x²+y²)/(2σ²)}.
g_σ(x)      Gaussian probability distribution function with zero mean and standard deviation Σ = diag(σ).
δ(x)        Dirac impulse, i.e. δ(x) = 1 if x = 0, and δ(x) = 0 otherwise.


Contents

1 Introduction
  1.1 Problem Specification
    1.1.1 Frontal face detection
    1.1.2 Pose estimation
  1.2 Objectives
2 Previous Work
  2.1 Window Based Approaches
  2.2 SNoW
3 Feature Generation
  3.1 Corner Features
    3.1.1 Local orientation
    3.1.2 Rotational symmetries
    3.1.3 Defining corner features
  3.2 Convex Feature Pairs
  3.3 Triplet Invariants
    3.3.1 Triplet invariants of convex feature pairs
4 Associative Structure
  4.1 Structure Definition
  4.2 Channel Representation
  4.3 Feature Vectors
  4.4 Response Vectors
5 Post-processing
  5.1 Response Clustering
  5.2 Connection between Triplets and Responses
6 Results
  6.1 Data Sets
    6.1.1 Training set I
    6.1.2 Training set II
    6.1.3 Test set I - the Yale Face Database
    6.1.4 Test set II - The MIT Face Database
    6.1.5 Test set III - Images from the Image Coding Group at Linköping University
  6.2 Evaluation of the Face Detection System
    6.2.1 Different feature vectors and response vectors
    6.2.2 Size of training set
    6.2.3 Required resolution
    6.2.4 Robustness to scale changes
    6.2.5 Handling of occlusion
  6.3 Evaluation of the Pose Estimation System
7 Summary and Conclusions
  7.1 Future Work

Chapter 1

Introduction

Face detection and pose estimation are two widely studied problems - mainly because of their use as subcomponents in important applications, e.g. face recognition. So far no satisfactory solution has been proposed. Most solutions rely on certain simplifying properties of the image, e.g. simple background, normal lighting, no occlusion, known scale and known orientation of the image.

In this thesis I will investigate a new approach to the general problem of object detection and pose estimation which might solve some of these problems. Face detection can be considered a special case of this general problem, but is complicated by the fact that faces are non-rigid objects. The basis of this approach is the use of triplet invariants. These are scale and orientation invariant triplets of localized low level features, which means that the resulting procedure will be independent of the scale and orientation of the input image, provided that the low level features are also invariant to these properties. The use of rotational symmetries as low level features will provide orientation invariance, and to a certain degree, scale invariance. It will also decrease the sensitivity to changes in lighting conditions. Occlusion of the face and the presence of background is expected to be handled by the redundancy imposed by selecting a large number of independent feature triplets in the image.

Further, a biologically inspired associative structure will be used to map from feature triplets to responses (position, orientation and pose of the head). This structure will also generate a confidence measure which can be used to suppress false responses.

Figure 1.1. System overview: Image → Feature Generation → Associative Structure → Post-processing → Face position etc.

The work presented in this thesis is mainly focused on the design and implementation of the three main components pictured in figure 1.1. The input should

be a high resolution black and white image. If a head is present it should be within some known range of size (for instance 200 ± 40 pixels from bottom of chin to top of forehead). The first component generates feature triplets from this image, which are fed to an associative structure generating responses, which are finally passed through a post-processing component that computes a final estimate of the position, orientation and pose.

1.1 Problem Specification

I will consider two versions of the problem – one slightly easier than the other.

1.1.1 Frontal face detection

In this version of the problem, the faces are assumed to be facing the camera. Only rotation and translation in the image plane is allowed. Hence, there are three parameters to estimate – the horizontal and vertical position of the face (px and py) and the orientation (θ) (rotation in the image plane from upright position). See figure 1.2.

Figure 1.2. Parameters used in the two problems. Here, θ = 30° and ψ = 0°. Note that in order to rotate a face from its normalized position (corresponding to θ = 0° and ψ = 0°) it should first be rotated ψ around the head to toe axis, and thereafter be rotated θ in the image plane. Thus, ψ is invariant to rotation of the image in the image plane, which will turn out to be a useful property.


1.1.2 Pose estimation

In this more general setting the faces are also allowed to be rotated in 3D. For practical reasons I limited this problem to rotations around a vertical axis (head to toe axis) and only allowed the rotation parameter (ψ) to vary from −π/2 (corresponding to right hand side profile view) to +π/2 (left profile view). See figure 1.2.

1.2 Objectives

The goal is that the system should be able to detect faces with low error rates and estimate parameters corresponding to position, orientation and, when relevant, pose, of faces with a reasonable precision. To be more specific, the systems considered generate a set of predictions consisting of the position, orientation and pose of a possible face. Each such prediction is associated with a certainty measure and only predictions with a certainty measure above a certain threshold are considered. A face is considered to be detected when at least one prediction is within a reasonable precision from the true parameters. There are two interesting error rates that can be measured over a set of test images:

• Detection rate: the percentage of faces that were detected, i.e. having a prediction close to the true parameters.

• Average number of false detections: the average number of false detections (i.e. predictions far from the true parameters) per image.

The goal is to maximize the detection rate while keeping the average number of false detections low.

The system should also be totally invariant to rotations in the image plane, it should be able to handle small scale changes (like ±20%) from the current scale considered, and it should be insensitive to background which has no resemblance to faces. Finally, it should be able to handle large changes in lighting conditions as well as occlusion of parts of the face.


Chapter 2

Previous Work

2.1 Window Based Approaches

Many algorithms have been presented that are based on classifying small images of fixed size (about 20×20 pixels) into faces or non-faces. Each image is first pre-processed in various ways to minimize the effects of different lighting and camera conditions, e.g. color histogram equalization and illumination gradient correction (where a best-fit brightness plane is subtracted from the image) reduces the effects of heavy shadows caused by extreme lighting angles. To search for faces in larger images, these algorithms are applied to windows at all possible locations in the larger image. Also, to detect faces of different sizes, the larger image is repeatedly subsampled (usually by a factor 1.2 each time) and searched again at all possible locations. To make the algorithms rotation invariant it would probably also be possible to try windows of all possible rotations, but this is usually not done due to the computational complexity it would imply.

Sung and Poggio suggested [9] the use of elliptical windows with an area of 283 pixels. A suitable training set in this 283 dimensional space was then clustered into six positive and six negative clusters. Novel examples were then classified by computing a distance to each cluster center and feeding these twelve distances into a multi layer perceptron.

Rowley, et al., presented [8] a pure neural network based approach using two levels of neural networks. The first level consisted of a set of specially designed neural networks that took a 20×20 pixels sized window as input and tried to classify it by outputting a number in the range -1 (non-face) to 1 (face). The networks were designed to pick up certain features in the image, e.g. some hidden units were connected to horizontal stripes in the image which hopefully would match features such as the mouth or the eyes. The networks in the first level were all of the same structure but were trained on different training sets. Hence, they would not behave in the same way after training. They would all have some weaknesses, but hopefully different weaknesses. A second level neural network was therefore used to combine the outputs of these first level networks. The authors reported a


running time of 10 minutes in order to search for a face in a 320×240 pixels image (197737 windows), which is far too slow for practical purposes.

Osuna, et al., took [7] a different, but interesting approach. They used support vector machines (SVM), a quite recently developed pattern classification algorithm, which can be viewed as a different way of training polynomial or neural network based classifiers. Instead of minimizing the training error as learning in neural networks does, it tries to minimize the generalization error. In the case of a linear classifier this means that the SVM tries to place the separating hyperplane as far away from any example as possible, i.e. maximizing the margin between the separating plane and the closest example. In [7], the decision surface was a second degree polynomial surface, operating on windows of size 19×19 pixels. The result of the SVM algorithm is a set of support vectors (a subset of the training examples) that are used later in the classification process. In this case the classifier was trained on 50000 training examples which resulted in approximately 1000 support faces. One problem with SVMs is the long training times, but the method seems to give good results. Better results than in the two previously mentioned papers were reported.

2.2 SNoW

SNoW (Sparse Network of Winnows) [11] is a learning architecture that in many ways resembles the associative structure (chapter 4) which will be used in this thesis. It works basically as a (sparse) linear classifier, but the simplicity of the linear model is expected to be compensated by the use of a very large set of features. In its discrete version it tries to map from large binary feature vectors a_s ∈ {0, 1}^{n_f} to binary responses u_s ∈ {0, 1}. More specifically, a response node u_s is active (u_s = 1) if w^T a > θ and inactive (u_s = 0) otherwise. Here, w is a weight vector and θ is a threshold. Training of this network is performed using an online and mistake-driven algorithm based on the Winnow update rule; randomly selected examples from the training set are presented to the network and when a mistake is made in the prediction of a response, the weights are updated. If the algorithm predicts an inactive (w^T a ≤ θ) response when it should have been active, the currently active weights (A = {i | a_i = 1}) are promoted in a multiplicative fashion: ∀i ∈ A : w_i ← α · w_i, where α is a parameter slightly larger than 1. Similarly, if the algorithm predicts an active response when it should have been inactive, the currently active weights are demoted: ∀i ∈ A : w_i ← β · w_i, where β is a parameter slightly less than 1. In [11] two response signals were used, one to detect face patterns and one to detect non-face patterns. The final prediction is simply the strongest of these signals. As feature vectors, very large and sparse vectors were used. Each component in the feature vector corresponded to a specific position within a 20×20 pixel window and a specific discretization level of the intensity at that pixel (using 256 levels). Thus 20 · 20 · 256 = 102400 feature components were used, but only 400 of those were active (non-zero) for each window.
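To make the update rule concrete, here is a minimal sketch of one Winnow step in Python (the values of α and β are illustrative assumptions only; [11] does not fix them here):

```python
import numpy as np

def winnow_step(w, a, target, theta, alpha=1.5, beta=0.8):
    """One mistake-driven Winnow update: promote or demote the currently
    active weights multiplicatively when the prediction is wrong."""
    active = a > 0                        # A = {i | a_i = 1} for binary features
    predicted = w[active].sum() > theta   # w^T a, inactive components contribute 0
    if predicted != target:
        w = w.copy()
        w[active] *= alpha if target else beta
    return w
```

Because the feature vectors are extremely sparse, each update only touches the roughly 400 active components of a window.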

The results were promising – better than most other window based algorithms (including the ones mentioned in section 2.1) – but it also seems reasonable to

expect the use of more high level features (e.g. orientation) to further improve the learning capabilities of such a structure.


Chapter 3

Feature Generation

The purpose of the feature generation component is to extract some kind of properties (which we from here on will call features) from the image. There are two main reasons for doing this filtering before feeding the information to the associative structure:

• to reduce the amount of information by removing information which is redundant for the task considered (face detection)

• to transform the remaining information into a representation that is better suited for the task

Examples of useful features are: lines, edges, textures, pixels with RGB values that are usually present in faces, elliptic structures (approximating the contour of head) etc. In this thesis I will only consider corner-like features which for instance can be found at important locations in the face, like the nose, the mouth or the eyes. These corner features will be connected into convex pairs that might represent an object (for instance an eye). Connecting these pairs into triplets will then give the final feature structure that will be used for association.

3.1 Corner Features

The complete process of detecting corner features is pictured in figure 3.2. It is basically divided into two steps:

• computing a local orientation image
• finding local symmetry peaks

3.1.1 Local orientation

The local orientation of an image [3] can be described by a complex field:

$$z(x, y) = c(x, y)\, e^{i 2\varphi(x, y)} \tag{3.1}$$


where c(x, y) is a certainty measure and ϕ(x, y) is the angle of the direction of maximal local variation (measured clockwise from the positive x-axis).

The double angle representation used here resolves the inherent ambiguity of local orientation. For instance, the angle of maximal variation of a vertical line could be both ϕ₁ = +π/2 or ϕ₂ = −π/2. With the double angle representation this does not matter, since e^{i2ϕ₁} = e^{i2ϕ₂} = e^{iπ}. Figure 3.1 illustrates how the argument of the complex field z on a simple line varies as the line is rotated.

Figure 3.1. Illustration of the double angle representation. Vertical edges or lines (or other objects where the main direction of change is horizontal) have ϕ = 0. Horizontal edges or lines (or other objects where the main direction of change is vertical) have ϕ = π.

For a more interesting example consider figure 3.2, where two orientation fields (z1 and z2) at different scales of a face (and an artificial box) are pictured. The intensity corresponds to the certainty measure c = |z|, and the hue to the direction 2ϕ = arg z. Horizontal structures like the mouth become green and vertical structures like the nose become red. The complete color code used to map complex numbers onto colors is shown in the top right of figure 3.2.

Now let's consider how to compute an orientation field. First a single angle orientation field can be computed using a differentiated Gaussian filter:

$$z_s(x, y) = c(x, y)\, e^{i\varphi} = I(x, y) * \left[ \left( \frac{\partial}{\partial x} + i \frac{\partial}{\partial y} \right) g_\sigma(x, y) \right] \tag{3.2}$$

Doubling the angle gives us the double angle representation:

$$z(x, y) = c(x, y)\, e^{i 2\varphi} \tag{3.3}$$

Finally, to get a more selective (less noise sensitive) response one can combine two orientation fields computed on two different scales. The idea is only to keep orientation values that are stable over a change of scale. One way of doing that is to weight the high resolution orientation field with the certainty of the low resolution field |z2| and a factor which is monotonically decreasing as the difference in angle increases (see also figure 3.2):

$$z_c = |z_2| \max(0, \cos(\arg z_1 - \arg z_2))\, z_1 \tag{3.4}$$

Figure 3.2. Corner generation overview. From the original image I, we first construct two orientation images at different scales according to (3.3). The standard deviations are σ = 1.2 for z1 and σ = 2.4 for z2. These are combined into zc according to (3.4). zc is then used to detect corner features by computing three first order symmetry maps on different scales (using the filter (3.7)). The standard deviations are σ = 1.6, 3.2, 6.4 for s1, s2, s3 respectively. Combining them into sc using (3.8) and locating local peaks gives the final corner features.
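As a rough illustration of (3.2)-(3.4), a minimal numpy/scipy sketch (not the thesis implementation; it uses the identity cos(arg z1 − arg z2) = Re(z1 z2*)/(|z1||z2|) to avoid explicit angles):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def orientation_field(img, sigma):
    """Double angle orientation field z = c * e^{i*2*phi}, cf. (3.2)-(3.3)."""
    dx = gaussian_filter(img, sigma, order=(0, 1))   # dI/dx (axis 1 is x)
    dy = gaussian_filter(img, sigma, order=(1, 0))   # dI/dy (axis 0 is y)
    zs = dx + 1j * dy                                # single angle field
    return zs**2 / np.maximum(np.abs(zs), 1e-12)     # doubles the angle, keeps |z| = c

def combine_scales(z1, z2):
    """Keep orientation values that are stable over a change of scale, cf. (3.4)."""
    cos_diff = np.real(z1 * np.conj(z2)) / np.maximum(np.abs(z1) * np.abs(z2), 1e-12)
    return np.abs(z2) * np.maximum(0.0, cos_diff) * z1

# z_c = combine_scales(orientation_field(I, 1.2), orientation_field(I, 2.4))
```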

3.1.2 Rotational symmetries

Definition 3.1 An n:th order rotational symmetry [5], [3], is a pattern I(r, ϕ) whose local orientation z(r, ϕ) can be written on the form:

$$z(r, \varphi) = c(r, \varphi)\, e^{i(n\varphi + \alpha)}, \quad c \in \mathbb{R},\ \alpha \in [0, 2\pi) \tag{3.5}$$

That is, arg z is independent of the radius r and can be written as arg z = nϕ + α for some α.

The most useful rotational symmetries are the following:

• 0:th order: Describes edges with a constant double angle orientation α.
• 1:st order: Describes hyperbolic patterns (like line endings and corners of various objects). Here α indicates the direction of the corner (single angle representation).
• 2:nd order: Describes circles, stars and spiral patterns. Here α indicates which of these types of patterns are present.

In this work rotational symmetries of the first order were used for detecting corner features, hence only these will be described in detail. For a more intuitive view of these symmetries, consider figure 3.3, where a few examples of intensity images I(r, ϕ) are shown together with the corresponding orientation fields z(r, ϕ). There are many different patterns that satisfy (3.5), but they can all be considered "corners" and it is easy to see that they should match end points of convex parts of the face (like eyes and mouth).

Figure 3.3. Some examples of first order rotational symmetries. The first row shows four double angle orientation fields z(r, ϕ) represented with vector fields. From left to right we have α = 0, α = 0, α = π/2, α = 3π/4 respectively. Note that the magnitude function c(r, ϕ) also varies. The second row shows the corresponding intensity images I(r, ϕ) (which are not unique).

One way to detect these symmetries is to convolve the orientation field z with the complex filter b(r, ϕ) = e^{iϕ}. Consider a perfect first order symmetry with direction α: z(r, ϕ) = c(r, ϕ) e^{i(ϕ+α)}. Then the result of the convolution (at the origin) is:

$$s = [b * z](0, 0) = \iint_{\mathbb{R}^2} b^*(r, \varphi)\, z(r, \varphi)\, dr\, d\varphi = \iint_{\mathbb{R}^2} c(r, \varphi)\, e^{i\alpha}\, dr\, d\varphi = c_0 e^{i\alpha}, \quad \text{for some } c_0 \in \mathbb{R} \tag{3.6}$$

(where b∗ denotes the complex conjugate of b). Thus, arg s = α corresponds directly to the direction of the corner. Since we want to find local symmetries we have to weight the filter with a localized function. One such function is r · g_σ(r, ϕ), which also makes it possible to use an efficient algorithm for rotational symmetry detection based on local polynomial expansion of the image [5]. The final filter is described by the following equation (but the actual convolution is only performed approximately by means of polynomial expansion):

$$b_\sigma(r, \varphi) = r \cdot g_\sigma(r, \varphi) \cdot b(r, \varphi) \tag{3.7}$$

For the face detection application I chose to compute three different rotational symmetry fields on different scales (see figure 3.2). The idea is that important parts of the face have corners of different sharpness at almost the same positions (for instance, at the eye you can find a sharp corner at the ends of the eyelids, and smoother corners at the inside and outside of the eye socket). By only accepting corners that appear on many scales we get a more selective response which reduces complexity. The combination is performed similarly to (3.4):

$$s_c = |s_2||s_3| \max(0, \cos(\arg s_1 - \arg s_2)) \max(0, \cos(\arg s_1 - \arg s_3))\, s_1 \tag{3.8}$$
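A sketch of how the symmetry response could be computed by direct complex filtering (the thesis instead approximates the convolution by polynomial expansion; note that r·e^{iϕ}·g_σ has the simple Cartesian form (x + iy)·g_σ(x, y)):

```python
import numpy as np
from scipy.signal import fftconvolve

def first_order_symmetry(z, sigma):
    """Correlate the orientation field z with b_sigma(r, phi) = r * g_sigma * e^{i*phi},
    cf. (3.6)-(3.7), so that arg(s) = alpha at a perfect corner."""
    rad = int(np.ceil(3 * sigma))
    y, x = np.mgrid[-rad:rad + 1, -rad:rad + 1]
    b = (x + 1j * y) * np.exp(-(x**2 + y**2) / (2 * sigma**2))
    # correlation with b == convolution with the flipped, conjugated kernel
    return fftconvolve(z, np.conj(b[::-1, ::-1]), mode="same")
```

The three scales s1, s2, s3 and the combination (3.8) then follow the same max(0, cos(...)) weighting pattern as combine_scales above.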

3.1.3 Defining corner features

Finally a post-processing step is performed to find the local maxima in sc. These maxima are the final corner features as pictured in figure 3.2. The following notation will be used to represent a corner feature:

Definition 3.2 A corner feature C = (c, d) is a pair where the vector c represents the position of the feature and another vector d represents the direction and strength of the corner. The direction used here is the opposite of the direction in sc, which means that d is directed towards the inside of the corner and not towards the outside (as α would be). The vector d can be computed as:

$$\mathbf{d} = \begin{bmatrix} -\operatorname{Re} s_c(\mathbf{c}) \\ -\operatorname{Im} s_c(\mathbf{c}) \end{bmatrix} \tag{3.9}$$

3.2 Convex Feature Pairs

Figure 3.4. A convex feature pair P = {C1, C2} = {(c1, d1), (c2, d2)}. The two corner features must not deviate from the interconnecting line more than the angle υ.

By a convex feature pair we mean a pair of features directed towards each other. A perfect example of this are the two end points of a straight line, which are directed precisely towards each other. Another example would be the left and right "corners" of an eye, which could be expected to be approximately directed towards each other. The convexity requirement can be formalized by the requirement that the direction of the corners should not deviate more than a certain angle υ from the straight line connecting the corners (cf. fig 3.4):

Definition 3.3 A convex feature pair P = {C1, C2} = {(c1, d1), (c2, d2)} is an unordered pair of corner features (section 3.1) satisfying the following requirement:

$$\begin{cases} \dfrac{\mathbf{d}_1 \bullet (\mathbf{c}_2 - \mathbf{c}_1)}{|\mathbf{d}_1||\mathbf{c}_2 - \mathbf{c}_1|} > \cos(\upsilon) \\[2ex] \dfrac{\mathbf{d}_2 \bullet (\mathbf{c}_1 - \mathbf{c}_2)}{|\mathbf{d}_2||\mathbf{c}_1 - \mathbf{c}_2|} > \cos(\upsilon) \end{cases} \quad \text{where } \upsilon < \pi/2 \tag{3.10}$$

In the face detection problem I chose υ = 60° and I also added a distance requirement to reduce the number of pairs generated:

$$m \le |\mathbf{c}_2 - \mathbf{c}_1| \le M \tag{3.11}$$

where m and M, the minimal and maximal allowed distances, were chosen to correspond to approximately 2 cm and 6 cm respectively in the scale at which we are currently looking for faces.
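Definition 3.3 translates almost directly into code; a sketch (the pixel values of m and M are placeholders that must be matched to the current search scale):

```python
import numpy as np

def is_convex_pair(c1, d1, c2, d2, upsilon=np.deg2rad(60), m=20.0, M=60.0):
    """Test whether two corner features form a convex pair, cf. (3.10)-(3.11).
    m and M are in pixels, corresponding to roughly 2 cm and 6 cm at face scale."""
    v = c2 - c1
    dist = np.linalg.norm(v)
    if not (m <= dist <= M):
        return False
    cos_u = np.cos(upsilon)
    return (np.dot(d1, v) / (np.linalg.norm(d1) * dist) > cos_u and
            np.dot(d2, -v) / (np.linalg.norm(d2) * dist) > cos_u)
```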

(29)

Figure 3.5. The 80 most significant convex feature pairs found in the example image. Note the good pairs covering the eyes horizontally, the nose vertically and the pair connecting the nostrils. These are stable in the sense that we could expect to find corresponding pairs in other faces.

3.3 Triplet Invariants

Definition 3.4 A triplet invariant T = (f1, f2, f3) is a three-tuple of local features (e.g. convex pairs or corner features) associated with a description function (feature vector), D = D(T), which satisfies the three following invariance requirements:

• Orientation invariance: D should be invariant to rotations of the object (and thereby the local features associated with the object) around arbitrary points in the image plane.
• Scale invariance: D should be invariant to scale changes of the object.
• Order invariance: D should be invariant to the ordering of f1, f2, and f3, i.e. D should be the same independent of the order in which these three local features are selected (this reduces the number of valid triplets, and thus speeds up both training and testing of the system).

There are many ways of achieving these invariance requirements and exactly how they are achieved shouldn't be critical as long as all the requirements are met. I will now describe the specific criteria I used for the face detection problem. Consider the three local features in figure 3.6. To achieve order invariance we need a set of rules that uniquely labels these features with f1, f2, and f3. The first rule uniquely selects f2 by requiring that it should be the feature point opposite to the shortest side of the triangle:

$$\begin{cases} l_2 < \kappa \cdot l_1 \\ l_2 < \kappa \cdot l_3 \end{cases} \tag{3.12}$$

Figure 3.6. A triplet

Triplets where all sides are of almost the same length have no labeling that satisfies (3.12). These triplets will be discarded. Therefore the choice of κ is a trade-off between the noise sensitivity of the ordering (a small change in position of one of the features due to noise should not change the ordering of the triplet) and how many triplets we can afford to discard. κ = 0.9 seems to be a reasonable choice. We could of course equally well have chosen the longest side of the triangle as f2. The only difference would have been that a different set of triplets would have had to be discarded.

The selection of f1 and f3 can then be performed by requiring that the curve f1 → f2 → f3 is left-oriented (f1 denotes the position in the image of the feature f1). This can be expressed using the following determinant:

$$\det\left[\hat{\mathbf{f}}_{21}\ \hat{\mathbf{f}}_{32}\right] > \tau, \quad \text{where } \tau \ge 0 \tag{3.13}$$

This time triplets with very small angles β have to be discarded and a reasonable choice of τ could be τ = sin(10°). Thus only left-oriented triplets with β > 10° are accepted.

Given an ordered triplet we can now define a local coordinate system (u, v) such that if all measurements used to compute D(T) are performed in this coordinate system, D(T) will automatically be both rotation and scale invariant. First we define the scale s as the average length of the sides of the triplet (we could have used only one side, but taking all sides into account should reduce noise sensitivity):

$$s = \frac{l_1 + l_2 + l_3}{3} \tag{3.14}$$

The orientation ϕ of the triplet is defined as the angle between the positive x axis and the median from f2 onto the opposite side (again taking all points into account to reduce noise sensitivity). See figure 3.7. Finally the position p of the triplet is defined as the center of mass of the three feature locations, i.e.

$$\mathbf{p} = \frac{\mathbf{f}_1 + \mathbf{f}_2 + \mathbf{f}_3}{3} \tag{3.15}$$

Now we can define the local coordinate system (u, v) by:

$$\begin{bmatrix} x \\ y \end{bmatrix} = s \begin{bmatrix} \cos(\varphi) & -\sin(\varphi) \\ \sin(\varphi) & \cos(\varphi) \end{bmatrix} \begin{bmatrix} u \\ v \end{bmatrix} + \mathbf{p} \tag{3.16}$$

Since we have a unique accepted ordering of any triplet of features, we only need to make sure that all measurements used in D(T) are performed in the (u, v) coordinate system, and all our triplets will be invariant as defined in definition 3.4.

Figure 3.7. Demonstration of the local coordinate system of a triplet
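A sketch of the ordering rules (3.12)-(3.13) and the local frame (3.14)-(3.16); the side labeling (l_i opposite f_i) and the directions of the unit vectors f̂21, f̂32 are my reading of figure 3.6:

```python
import itertools
import numpy as np

def order_triplet(points, kappa=0.9, tau=np.sin(np.deg2rad(10))):
    """Label three feature positions (2D numpy arrays) as (f1, f2, f3) so that
    (3.12) and (3.13) hold, or return None if the triplet must be discarded."""
    for f1, f2, f3 in itertools.permutations(points, 3):
        l1 = np.linalg.norm(f3 - f2)          # side opposite f1
        l2 = np.linalg.norm(f3 - f1)          # side opposite f2: must be shortest
        l3 = np.linalg.norm(f2 - f1)          # side opposite f3
        if not (l2 < kappa * l1 and l2 < kappa * l3):
            continue                           # (3.12) fails for this labeling
        f21 = (f1 - f2) / l3                   # unit vectors along two sides
        f32 = (f2 - f3) / l1
        if np.linalg.det(np.column_stack([f21, f32])) > tau:   # left-oriented, (3.13)
            return f1, f2, f3
    return None

def triplet_frame(f1, f2, f3):
    """Scale, orientation and position of an ordered triplet, cf. (3.14)-(3.16)."""
    s = (np.linalg.norm(f3 - f2) + np.linalg.norm(f3 - f1) + np.linalg.norm(f2 - f1)) / 3
    p = (f1 + f2 + f3) / 3
    median = 0.5 * (f1 + f3) - f2              # median from f2 onto the opposite side
    phi = np.arctan2(median[1], median[0])
    return s, phi, p
```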

3.3.1 Triplet invariants of convex feature pairs

In this application convex feature pairs (section 3.2) were used as the localized features to build up a triplet. Thus fi in definition 3.4 corresponds to a convex

feature pair P and the position fi was chosen to be the midpoint on the straight

line connecting the two corner features (C1 and C2). Figure 3.8 shows two such

triplets on the example image previously used.


Figure 3.8. Two examples of triplet invariants. Corner features that make up a convex pair are connected with a red line. The position of the pair is the center of that line and this is where the green lines representing triplets are attached.


Chapter 4

Associative Structure

Figure 4.1. Associative Structure: a feature vector a ∈ R^{n_t} is mapped to a response vector u ∈ R^{n_r}.

An associative structure could be viewed as a black box function that maps feature vectors to response vectors (fig. 4.1). This structure should associate novel feature vectors with feature vectors that it has previously seen (usually during a training phase) and output a response vector according to some hypothesis about the optimal mapping from feature vectors to response vectors. There are many different ways of implementing such a structure, for instance:

• Feed Forward Neural Networks [4]

• Support Vector Machines [7, 4], where the response components are linear combinations of the scalar products between the given feature vector and a set of support vectors (see also section 2.1).

• SNoW (Sparse Network of Winnows) [11], which is also partially described in section 2.2.

In the structure I have used (described more in detail in [1]) each response component is basically a simple linear combination of the feature components. The power and speed of the structure is instead expected to come from the complexity and sparsity of the feature vectors.

4.1 Structure Definition

Let's introduce some notation:

n_r                          number of components in a response vector
n_s                          number of samples in the training set
n_t                          number of components in a feature vector
a_s ∈ R₊^{n_t}               a feature vector
A = [a_1 a_2 ... a_{n_s}]    a feature matrix where each column is a feature vector
u_s ∈ R₊^{n_r}               a response vector
U = [u_1 u_2 ... u_{n_s}]    a response matrix where each column is a response vector
c_r ∈ R^{n_t}                a weight vector
C = [c_1 c_2 ... c_{n_r}]    a link matrix where each column is a weight vector

The link matrix C should ideally be chosen to satisfy:

$$U = CA \tag{4.1}$$

This equation usually has no solution and has to be approximated, for example, in a least squares sense:

$$\mathbf{c}_r = \operatorname*{argmin}_{\mathbf{c}} |\mathbf{u}_r - \mathbf{c}A| \tag{4.2}$$

One important requirement is that only positive coefficients are allowed in the feature vectors a_s and response vectors u_s. Another requirement is that at least the feature vectors, and preferably also the response vectors, should be very sparse, i.e. most of their coefficients should be zero. There are three main reasons for this:

• Interpretation - a coefficient should indicate the degree of presence of some feature and not the value of a certain feature. Thus a zero coefficient should be interpreted as no information and a maximum value of, say, 1.0 should be interpreted as a perfect detection of that feature.
• Speed of computation - only nonzero components contribute to a matrix multiplication. Thus we can handle a large number of feature components, as long as only a small fraction of them are nonzero in each feature vector.
• Memory requirements - only nonzero components need to be stored.

The interpretation might require some more explanation. Consider a feature based on the intensity value of a pixel. Using the intensity directly as a feature component would not be a good idea. There is no reason to expect a response component to be a linear combination of intensity values. This is due to the behavior of the ordinary distance metric on intensity values - the statement white is twice as much as gray simply does not make any sense for most purposes. If we instead choose features indicating the presence of white, the presence of gray and so on, then there would be no such connection between white and gray. A white pixel would use the link coefficient for the white feature component, a gray pixel would

use the link coefficient for the gray feature component and a light gray pixel would use some mix of these coefficients. This is the idea behind the usage of channel code representation (section 4.2) in the feature vectors.

We can decrease the error by noticing that all responses are required to be positive; hence we can also map all negative values of CA onto 0 and, instead of (4.2), use the following expression in the optimization:

$$\mathbf{c}_r = \operatorname*{argmin}_{\mathbf{c}} |\mathbf{u}_r - \max(0, \mathbf{c}A)| \tag{4.3}$$

I will not describe in detail how this optimization is performed, but it is basically a gradient descent based method.
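Since the thesis only states that the optimization is gradient descent based, the following is just one plausible minimal version of (4.3) (batch gradient descent on the squared residual; learning rate and iteration count are arbitrary):

```python
import numpy as np

def train_link_matrix(A, U, n_iter=1000, lr=1e-3):
    """Fit C so that max(0, C A) approximates U, cf. (4.3).
    A: n_t x n_s feature matrix, U: n_r x n_s response matrix."""
    C = np.zeros((U.shape[0], A.shape[0]))
    for _ in range(n_iter):
        P = C @ A                               # raw predictions
        E = np.maximum(0.0, P) - U              # residual after rectification
        G = (E * (P > 0)) @ A.T                 # gradient of 0.5 * ||E||_F^2
        C -= lr * G
    return C
```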

4.2 Channel Representation

As was described in section 4.1 there is a need to partition scalar measurements into comparable entities before they can be used in a feature vector of the associative structure. One-dimensional channel representation [6, 2] is one way of doing that:

Definition 4.1 Let x be a scalar in the interval [1, K], where K ∈ Z⁺. Then the channel representation x = [x_0 x_1 x_2 ... x_{K+1}] of x with channel distance π/m, m ∈ {2, 3, 4, ...}, is given by:

$$x_k = \begin{cases} \cos^2\left(\frac{\pi}{m}(x - k)\right) & \text{if } |x - k| \le \frac{m}{2} \\ 0 & \text{otherwise} \end{cases} \tag{4.4}$$

Each channel x_k can be viewed as a band pass filter (where band pass should be interpreted in the definition domain and not in the Fourier domain) centered at the integer k and having a response that peaks at x = k and then falls off as cos² around x = k. The choice of cos² is somewhat arbitrary. We could have chosen any other function that decreases monotonically as |x − k| increases. However, cos² has some nice properties, like for instance being both continuous and differentiable and yielding the property of equation (4.6).

For an example, let's consider K = 6 and channel distance π/3, i.e. m = 3. In this case each channel has a support of width 3/2 and each x ∈ [1, 6] is covered by three channels. See figure 4.2.

By looking up the responses for each channel in this diagram (or applying (4.4)), we can find the channel representation for a given value of x. Some examples:

x = 1.0  ⇒  x = [0.25 1.0 0.25 0 0 0 0 0]^T
x = 3.13 ⇒  x = [0 0 0.14 0.98 0.38 0 0 0]^T
x = 4.0  ⇒  x = [0 0 0 0.25 1.0 0.25 0 0]^T
x = 6.0  ⇒  x = [0 0 0 0 0 0.25 1.0 0.25]^T

Figure 4.2. Response functions for the example channel set.

As can be seen in the examples and from the definition, each scalar is represented as a sparse vector with m = 3 nonzero elements. It can also be shown that, for x in the interval [1, K], the sum of the elements of x is constant:

$$\sum_k x_k(x) = \frac{3}{2} \tag{4.6}$$

This is an important property in the setting of our associative structure, since it means that a response can be interpreted as a weighting of link coefficients.
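Definition 4.1 is easy to check numerically; a direct implementation of (4.4) for the example above:

```python
import numpy as np

def channel_code(x, K=6, m=3):
    """Channel representation x_0 ... x_{K+1} of a scalar x in [1, K], cf. (4.4)."""
    k = np.arange(K + 2)
    d = x - k
    return np.where(np.abs(d) <= m / 2, np.cos(np.pi / m * d)**2, 0.0)

# channel_code(3.13) -> [0, 0, 0.14, 0.98, 0.38, 0, 0, 0] (rounded)
# channel_code(x).sum() == 1.5 for every x in [1, 6], the property (4.6)
```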

4.3 Feature Vectors

The measured variables that are used to construct a feature vector a_s = D(T_s) (with the notation of section 3.3) for a triplet are shown in table 4.1 (see also figure 4.3).

Before channel coding, all measurements are transformed by the application of a strictly increasing and smooth transformation function (tanh turned out to be useful), chosen so that the resulting transformed measurements become approximately uniformly distributed on the interval which is covered by the channels. This is done to make sure that all measurements don't fall into only a few channels. The channel coding is done using the number of channels specified in table 4.1 and a channel distance of π/2 (a small channel distance is used to reduce the number of nonzero components in each vector to a minimum). The resulting vector will be denoted with boldface.

I have chosen to evaluate four different descriptors D_i(T). To simplify the description of these let's introduce the following unifying variable names (and the corresponding boldface variables for the channel coded values):

$$[\alpha_1\ \alpha_2\ \alpha_3\ \alpha_4\ \alpha_5\ \alpha_6\ \alpha_7\ \alpha_8] = [l_1\ l_2\ \gamma_1\ \gamma_2\ \gamma_3\ \delta_1\ \delta_2\ \delta_3] \tag{4.7}$$

The descriptors are shown in table 4.2. D1 consists of all channel coded variables stacked above each other (first order products). D2 adds second order products, that is all products of elements from different channel vectors. D3 includes all first, second and third order products, and D4 adds fourth order products as well.

Figure 4.3. Measurements used in feature and response vectors. n and ρ represent the position and orientation of the face in the local (u, v) coordinate system. δ_i is the length of pair i and l_i is the length of triplet side i.

Table 4.1. Variables used to construct feature vectors.

Variable   Number of Channels   Channel Coded Vector   Description
l_i        6                    l_i (boldface)         relative leg length (i = 1, 2)
γ_i        5                    γ_i (boldface)         relative pair orientation (i = 1, 2, 3)
δ_i        9                    δ_i (boldface)         relative pair length (i = 1, 2, 3)

The reason for not using l_3 is that it is uniquely determined by l_1 and l_2 through (3.14) - in the local coordinate system l_1 + l_2 + l_3 = 3. Also, the γ_i are actually coded with a modular channel representation [6].

Using all fifth order products, however, would yield too large a feature vector. Instead I also tried to use all eighth order products (but no lower order products) in D5.
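To illustrate how the lower order descriptors are assembled, a sketch using the channel_code function from section 4.2 (the example values are placeholders, the modular coding of the γ_i is ignored, and taking all second order pairs is an assumption - table 4.2 abbreviates which products are included):

```python
import numpy as np

# Hypothetical measurements, already tanh-transformed into the channel ranges.
values = [2.3, 1.7, 2.1, 2.8, 1.2, 4.4, 2.9, 3.6]     # l1, l2, g1-g3, d1-d3
n_chan = [6, 6, 5, 5, 5, 9, 9, 9]                     # channel counts, table 4.1
alphas = [channel_code(x, K=n - 2, m=2) for x, n in zip(values, n_chan)]

D1 = np.concatenate(alphas)                           # first order products
pairs = [np.kron(alphas[i], alphas[j])                # second order products
         for i in range(8) for j in range(i + 1, 8)]
D2 = np.concatenate([D1] + pairs)
```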

4.4 Response Vectors

The response vectors are also channel coded in order to separate the computation of linkage coefficients for different responses. The idea is that the ideal mapping from features to responses is locally linear, i.e. it could be well approximated by a linear mapping for small variations in the response space. Of course a fine granular division of the response space would increase the requirement of training data, so there is a trade off between granularity and available training data. In the face detection problem, we assume frontal faces of approximately known scale, and only use the position and orientation of the face as response variables. The position n is


Table 4.2. Five different descriptors. ⊗ denotes the Kronecker product of two vectors, which is the vector consisting of all possible products of an element of the first vector and an element of the other vector, e.g. [1 2]^T ⊗ [7 9]^T = [7 9 14 18]^T.

Descriptor   Feature Vector                                              Vector Length   Nonzero components
D1           [α1; α2; ...; α8]                                           54              16
D2           [D1; α1⊗α2; α1⊗α3; ...; α7⊗α8]                              570             68
D3           [D2; α1⊗α2⊗α3; α1⊗α2⊗α4; ...; α6⊗α7⊗α8]                     4754            176
D4           [D3; α1⊗α2⊗α3⊗α4; α1⊗α2⊗α4⊗α5; ...; α5⊗α6⊗α7⊗α8]            23039           293
D5           α1⊗α2⊗...⊗α8                                                3280500         9

3. This should hopefully increase the stability of the

system, compared to if π

2 had been used.

For both the detection and the pose problem two different response descriptors were evaluated. In the first all channel vectors were simply stacked above each other (first order products). In the second, a full Kronecker product of all channel vectors were used (third or fourth order products). See tables 4.4 and 4.5.

Table 4.3. Variables used to construct response vectors. All variables are measured in the local coordinate system (u, v), except the pose variable ψ which, since it is an out of plane rotation, has to be measured in a global frame.

Variable   Range          Number of Channels   Channel Coded Vector   Description
n_x        [−2, 2]        15                   n_x (boldface)         relative horizontal position
n_y        [−2, 2]        15                   n_y (boldface)         relative vertical position
ρ          [0, 2π]        20                   ρ (boldface)           relative in plane rotation
ψ          [−π/2, π/2]    15                   ψ (boldface)           rotation around vertical axis

Table 4.4. Two different response descriptors for the detection problem.

Descriptor           B1               B2
Response Vector      [n_x; n_y; ρ]    n_x ⊗ n_y ⊗ ρ
Vector Length        50               4500
Nonzero components   12               81

Table 4.5. Two different response descriptors for the pose estimation problem.

Descriptor           C1                  C2
Response Vector      [n_x; n_y; ρ; ψ]    n_x ⊗ n_y ⊗ ρ ⊗ ψ
Vector Length        65                  67500
Nonzero components   16                  243

Chapter 5

Post-processing

Each feature vector which is fed through the associative structure generates a response vector, which has to be appropriately decoded. Since the response vectors represent channel coded scalars (cf. section 4.2) this involves the inversion of the channel representation equation (4.4). Unless the vector is a valid channel representation (i.e. it could have been generated from a scalar using (4.4)) there is no unique way of doing this. Instead it is possible to compute (although I will not describe in detail how this is done) a set of possible scalars, each associated with a degree of certainty, that could possibly have been the cause of a channel representation vector similar to the one found in the response vector. Thus, each response vector results in a set of responses {r_i}, where

$$\mathbf{r}_i = \begin{bmatrix} n_{x,i} \\ n_{y,i} \\ \rho_i \end{bmatrix} \tag{5.1}$$

(in the case of pose estimation, a fourth component ψ is added to the response vector). Each response is associated with a degree of certainty c_i. A certainty of c_i = 1 corresponds to a valid channel representation vector, and an arbitrary certainty c_i < 1 to such a vector scaled component-wise by c_i. To limit the number of responses we ignore responses with a low degree of certainty - which are likely to be nothing but noise - by requiring the certainty to be at least, say, 0.1.

From an image containing mainly a face, we typically extract about 100 triplets, i.e. 100 feature vectors, which after processing typically results in about 50 to 200 responses of the form (5.1). An example of extracted responses can be seen in figure 5.1. In order to remove responses generated by noise, we wish to keep only responses that are consistent with a number of other responses. This is done using a clustering method described in the next section.

5.1 Response Clustering

Consider n responses {r_1, ..., r_n} and associated degrees of certainty {c_1, ..., c_n} that we wish to search for consistent clusters. A simple way of doing that is to search for local maxima in the following total response function:

$$f(\mathbf{r}) = g_{\boldsymbol{\sigma}}(\mathbf{r}) * \sum_{i=1}^{n} c_i\, \delta(\mathbf{r} - \mathbf{r}_i) \tag{5.2}$$

where g_σ(r) is the three (four) dimensional Gaussian with zero mean and standard deviation Σ = diag(σ). The selection of the smoothing parameters σ depends on the resolution of the response vector coding. I chose a spatial smoothing (σ_x and σ_y) corresponding to 1-2 cm in natural scale, and an angular smoothing (σ_θ and σ_ψ) of 15°. Local maxima of the total response function (5.2) with a cluster certainty f(r) greater than a threshold t are selected as the final predictions of where the face is located. The choice of t is a trade-off between the face detection rate and the number of false predictions made. Choosing t = 1.0 seems reasonable, since the associative structure can be expected to output responses with confidence c_i = 1.0 when it has detected a face with very high probability. See also figure 5.1.

Figure 5.1. Left: 250 responses generated by a trained system. Right: Two response clusters with a certainty above the threshold (one of which is correct, and one which is a false prediction). Each response is plotted as a + extended in the head to toe direction of the predicted face.
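Evaluating the total response function (5.2) at a candidate point amounts to a certainty-weighted Gaussian kernel sum; a sketch (locating the local maxima, and the proper wrap-around for the angular dimensions, are omitted):

```python
import numpy as np

def total_response(r, responses, certainties, sigma):
    """Evaluate f(r) of (5.2): Gaussian-smoothed, certainty-weighted impulses.
    responses: (n, d) array of responses, sigma: length-d smoothing widths."""
    d = (responses - r) / sigma
    g = np.exp(-0.5 * np.sum(d**2, axis=1)) / np.prod(np.sqrt(2 * np.pi) * sigma)
    return np.sum(certainties * g)

# the final predictions are the local maxima of total_response above t = 1.0
```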

5.2 Connection between Triplets and Responses

In order to get some insight into what the system is actually doing, figure 5.2 illustrates which triplets and pairs were involved in the generation of two particular response clusters (using a system with descriptors D4 and B2 trained on training set I). As the figure shows, the triplets connecting the eyes, nose and mouth are actually those that contribute most to the correct response cluster at the nose tip – as expected.

Figure 5.2. Left: The triplets (thin lines) and pairs (thick lines) that were responsible for the generation of the good responses at the nose tip in figure 5.1 (left). Right: Some triplets and pairs that were responsible for the generation of the responses at the subject's forehead which contributed to the false prediction in figure 5.1 (right).


Chapter 6

Results

6.1 Data Sets

In this section I will describe the various data sets I have used for training and testing of the system.

6.1.1 Training set I

The training set used for the face detection problem consisted of 76 frontal faces with no background. This set was chosen to represent a wide variety of faces and thus contained both males and females, different ethnic groups, and people with a variety of facial attributes.

6.1.2 Training set II

The training set used for the pose estimation problem consisted of 200 images from approximately 50 different subjects with no background. The pose parameter ψ varied from −3π/4 to +3π/4.

6.1.3 Test set I - the Yale Face Database

The Yale Face Database consists of 11 images each taken of 15 different subjects. The 11 images of each subject are all frontal and correspond to different configurations or facial expressions (different light directions, with or without glasses, happy, sad, sleepy, surprised, etc.). These images do not contain any background but are still useful in order to measure the detection rate and generalization capabilities of the system. Adding background would only increase the number of false positives (assuming that a constant number of triplets per area unit are considered). The example image used throughout this thesis (see for instance figure 1.2) is an example from this database. For a few other examples see figure 6.1.


Figure 6.1. Examples of faces from the Yale Face Database

6.1.4 Test set II - The MIT Face Database

Figure 6.2. Examples of faces from the MIT Face Database

The MIT Face Database [10] contains 27 images each of 16 different subjects (all males). The 27 images of each subject correspond to different head orientations, lighting and scale. These images do contain a non-trivial background. See figure 6.2 for a few examples.

6.1.5 Test set III - Images from the Image Coding Group at Linköping University

The database from the Image Coding Group at Linköping University contains approximately 60 images each from 7 different subjects, with varying pose ψ and varying facial expressions. It does contain some background, but this background is the same for all images in the set. However, since I do not use these images when training the system, we know that the background will not be used by the system to infer the pose of the subject. See figure 6.3 for a few examples.


Figure 6.3. Examples of faces from Test set III

6.2 Evaluation of the Face Detection System

6.2.1 Different feature vectors and response vectors

In this section I will evaluate the systems using the different feature vectors and response vectors described in sections 4.3 and 4.4. A prediction made by a system is considered to be correct if the deviation between the spatial prediction and the true value is no more than 1/4 of the distance between the center of the eyes and the center of the mouth, and the orientation prediction is within ±30° from the true value.
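Expressed in code, this correctness criterion looks as follows (parameter names are mine; angles in degrees):

```python
import numpy as np

def prediction_is_correct(p_pred, theta_pred, p_true, theta_true, eye_mouth_dist):
    """Section 6.2.1 criterion: spatial error at most a quarter of the
    eye-to-mouth distance and orientation error within +/- 30 degrees."""
    spatial_ok = np.linalg.norm(p_pred - p_true) <= eye_mouth_dist / 4
    dtheta = (theta_pred - theta_true + 180.0) % 360.0 - 180.0   # wrap to [-180, 180)
    return spatial_ok and abs(dtheta) <= 30.0
```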

Table 6.1 shows the results after evaluating the different face detection systems on the Yale Face Database. The results clearly improve as higher order products are added to the feature vectors, but when the threshold is adjusted to yield a comparable average number of false detections per image, which could be considered a fairer comparison, the gain of using higher order products is drastically reduced. We can also see a clear advantage of using third order products in the response vectors, but we should remember that this comes at a quite high increase of training time.

The results on the MIT test set (table 6.2) are less satisfying. The number of false predictions increases, which should be expected since these images contain some non-simple background. However, there is also an unexpected decrease in the detection rate, which is probably just due to the different kinds of faces appearing in the two sets.

6.2.2 Size of training set

All systems used above were trained on 76 random face images. In order to test how the size of the training set influenced the results I retrained the system with descriptors D3 and B1 on training sets of different sizes. As figure 6.4 shows, there

is no obvious benefit of using more than about ten images for training. This is surprisingly low and might indicate that the feature vectors have been generated


Table 6.1. Face detection rates on the Yale test set. Numbers inside parentheses indicate the average number of false detections per image. The first two columns show the result when a fixed certainty threshold of 1.0 was used for all descriptors. In the last two columns the threshold was adjusted for each descriptor combination to yield an average number of false detections of approximately 2.0, to make it easier to compare the results.

Descriptors   B1             B2              B1*            B2*
D1            6.1 % (0.1)    46.1 % (3.6)    46.7 % (2.5)   40.6 % (2.0)
D2            44.2 % (1.0)   68.5 % (6.5)    53.9 % (2.2)   52.7 % (2.0)
D3            58.8 % (2.0)   78.8 % (10.0)   58.8 % (2.0)   62.4 % (2.0)
D4            65.5 % (3.0)   81.8 % (12.8)   60.6 % (2.0)   65.5 % (2.1)
D5            64.2 % (4.4)   73.9 % (9.6)    55.2 % (2.0)   63.6 % (2.2)

Table 6.2. Face detection rates on the MIT test set. The unexpected behavior (values that drop with increased size of feature vector) is probably due to the limited size of this test set. The two last columns correspond to the same thresholds as the two last columns of table 6.1.

Descriptors   B1             B2              B1*            B2*
D1            0.0 % (0.1)    37.5 % (1.2)    6.2 % (2.2)    43.8 % (2.3)
D2            37.5 % (0.9)   56.2 % (9.5)    43.8 % (2.4)   43.8 % (3.1)
D3            50.0 % (1.4)   50.0 % (16.1)   50.0 % (1.4)   43.8 % (2.4)
D4            50.0 % (4.1)   56.2 % (21.4)   50.0 % (2.6)   43.8 % (2.0)
D5            68.8 % (6.6)   62.5 % (13.6)   56.2 % (3.6)   31.2 % (0.7)

in a way that removes most of the subject specific properties.

6.2.3 Required resolution

The systems were trained and evaluated on images with a rather high resolution (approximately 180 pixels from bottom of chin to top of forehead). In order to estimate the minimum resolution required for these systems to be useful I first down-sampled the Yale test set to a lower resolution and thereafter performed an up-sampling back to the original resolution again. Thus I avoided the need to make the feature generation module completely scale invariant in order to do this comparison. The same results should however be achievable without the last up-sampling. Figure 6.5 shows that we can achieve almost the same detection rates with a head size as low as 60 pixels.

Figure 6.4. Detection rates and average number of false detections versus the size of the training set.

Figure 6.5. Detection rates and average number of false detections versus the resolution of the head (using the system with descriptors D4 and B2).

6.2.4 Robustness to scale changes

We have so far assumed the size (in pixels) of the face to be in a known range. In the test sets this has been achieved by scaling all faces to a predetermined size, but since the original size was measured by hand we have to expect some variation in the scaled sizes as well (maybe a standard deviation of something like 10-20 %). To evaluate how the system behaves when the faces deviate more from the desired size I rescaled the Yale data set by different factors and applied the system to it. The results (figure 6.6) show that we can handle size deviations of up to ±20 % (plus the varying deviation caused by measurements by hand as mentioned above) rather well.

Figure 6.6. Detection rates and average number of false detections versus the size of the head (using the system with descriptors D4 and B2). The steady increase in false detections is caused by the number of triplets selected being proportional to the area of the image considered.

6.2.5 Handling of occlusion

To get some insight into how the system handles occlusion I tested the system with descriptors D4 and B2 on the Yale test set again, but with different parts of the faces occluded (see figure 6.7).

Figure 6.7. Types of occlusion tested. Left: Eyes occluded; Middle: Nose occluded; Right: Mouth occluded

The results (table 6.3) clearly indicate that the eyes are much more important than the mouth or the nose. This result could be expected since the most useful triplets connect the four main facial parts (two eyes, nose and mouth) and occluding the eyes removes two of these parts, while occlusion of nose or mouth only removes one such part. For some reason the result with no occlusion is actually worse than the result when only the mouth or the nose is occluded. I have no good explanation for this, but it might have something to do with the triplet selection; more triplets will be selected from the non-occluded area when some part is occluded.


Type of occlusion   Detection Rate   Average number of false detections
1                   14.5 %           11.0
2                   93.3 %           12.6
3                   92.1 %           10.4
None                81.8 %           12.8

Table 6.3. Results when evaluating the system with descriptors D4 and B2 on the Yale Data Set with the three different kinds of occlusion pictured in figure 6.7.

6.3 Evaluation of the Pose Estimation System

The pose estimation system was trained on training set II, containing 200 images, and tested on test set III. Here, a prediction was considered correct if the spatial deviation was less than 1/4 of the distance between the center of the eyes and the center of the mouth, the predicted orientation θ was within ±30° of the true value, and the rotation ψ was within ±20° of the true value. Table 6.4 shows the results. This problem is clearly a lot more difficult and seems to require a much higher complexity of both feature vectors and response vectors.
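Written out explicitly, the correctness criterion amounts to the following check (a sketch; pred and truth are assumed to hold the predicted and true parameters, and eyeCenter and mouthCenter the hand-labelled reference points):

    % A prediction is correct if the position error is below a quarter of
    % the eye-to-mouth distance and both angles are within their tolerances.
    refDist = norm(eyeCenter - mouthCenter);
    angDiff = @(a, b) abs(mod(a - b + 180, 360) - 180);  % wrap to [0, 180]
    correct = norm(pred.pos - truth.pos)       <  refDist / 4 && ...
              angDiff(pred.theta, truth.theta) <= 30 && ...
              angDiff(pred.psi,   truth.psi)   <= 20;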

Descriptors   C1             C2             C1*            C2*
D1            0.0 % (0.2)    0.0 % (0.0)    12.3 % (2.1)   0.0 % (0.0)
D2            21.5 % (3.7)   21.5 % (1.0)   18.5 % (2.0)   23.1 % (1.9)
D3            21.5 % (5.5)   33.8 % (3.8)   18.5 % (2.1)   26.2 % (2.0)
D4            36.9 % (8.6)   50.8 % (10.2)  20.0 % (2.1)   27.7 % (2.0)
D5            30.8 % (20.0)  40.0 % (22.8)  10.8 % (1.9)   9.2 % (2.0)

Table 6.4. Face pose estimation results on test set III. The percentages correspond to the fraction of images where a correct prediction was made. Numbers in parentheses indicate the average number of false detections per image. The first two columns (C1, C2) show the results when a fixed certainty threshold of 1.0 was used for all descriptors. In the last two columns (C1*, C2*) the threshold was adjusted for each descriptor combination to yield an average number of false detections of approximately 2.0, to make the results easier to compare.
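How the threshold was adjusted is not spelled out here; one simple possibility, sketched below, is a bisection search on the certainty threshold until the average number of false detections reaches the target (avg_false_detections is a hypothetical helper that runs the system with a given threshold and averages the false detections per image):

    % Bisection on the certainty threshold: the number of false detections
    % decreases monotonically as the threshold is raised.
    target = 2.0;
    lo = 0.0;  hi = 10.0;             % assumed bracketing interval
    for iter = 1:20
        thr = (lo + hi) / 2;
        if avg_false_detections(thr) > target
            lo = thr;                 % too many false detections: raise
        else
            hi = thr;                 % too few: lower
        end
    end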


7 Summary and Conclusions

In this thesis I have described a new approach to the general problem of object detection and pose estimation and evaluated it on the special case of detection and pose estimation of faces. The basis of the approach was the use of rotation and scale invariant triplets of convex feature pairs which were combined into large, but sparse, feature vectors. These feature vectors were then processed by an associative structure to generate responses which might indicate that a face with certain parameters had been detected.

The results in chapter 6 are however not satisfactory. It is possible to reach a quite high detection rate on simple images containing only a face and no background, but this comes at the cost of an unacceptably high number of false detections. Reducing the number of false detections by raising the cluster threshold hurts the detection rate too much, and hence does not solve the problem.

The system is truly rotation invariant – by construction – which has also been verified experimentally. It handles scale changes within ±15% without any major performance degradation. The presence of background clearly hurts the performance – either because the feature vectors are too simple to capture the difference between a face and a non-face, or because essentially no non-faces are actually used in the training process (although it can be argued that triplets corresponding to different parts of the response space are negative examples for each other). Small changes in lighting conditions are not a problem, but the evaluation on the Yale Face Database clearly showed that the system had big problems handling strong light coming from the side (mainly since the shadows displace the corner features too far from their positions under normal lighting).

The results on the pose estimation problem are worse – probably because of the increased size of the active response space. A larger training set might improve these results, but would also increase the training time of the system.

The total time required to train the system ranges from an hour to a day depending on the complexity of the feature vectors and the response vectors, as well as the size of the training set. Considering the size of the feature and response vectors, this is quite fast, especially when compared to what would have been expected from an ordinary neural network trained with back-propagation. Testing on an image of a size similar to the ones in the test sets requires approximately 5–10 seconds, which is too slow for most applications. It should however be noted that the current code is non-optimized Matlab code, so at least a ten-fold improvement should be expected in an optimized version.

7.1 Future Work

The measurements used to construct the feature vectors are probably not complex enough to differentiate between faces and non-faces with the desired error rate. Thus it might be worthwhile to investigate the effect of incorporating more, and different kinds of, measurements (color, texture etc.) into the feature vectors.

Another way to proceed would be to use a complementary system to verify the predictions made by our primary system. This complementary system need not be rotation and scale invariant, since it is only supposed to verify the correctness of a given set of those parameters. Hence it could be constructed and trained in an orthogonal way to the primary system in order to minimize the probability of a false prediction being accepted by both systems.
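In code, such a two-stage scheme could look like the sketch below, where verify_face is a hypothetical stand-in for the complementary classifier, applied at the position, scale and rotation predicted by the primary system:

    % The triplet-based system proposes candidates; a complementary,
    % non-invariant classifier verifies each candidate before acceptance.
    candidates = detect_faces(im);    % primary, invariant system
    accepted = {};
    for k = 1:length(candidates)
        c = candidates{k};
        if verify_face(im, c.pos, c.scale, c.theta)
            accepted{end+1} = c;      % both systems agree
        end
    end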

It might also be interesting to combine this fast associative structure with a more conventional foveal retina composed of different kinds of orientation and color receptors. Using only second-order products of the receptors in the feature vector would allow up to a few thousand receptors to be handled efficiently. Using such a retina usually implies that the image has to be examined at all positions on a regular grid of locations and scales. It does not, however, seem unreasonable to expect that this search pattern could be optimized by learning methods such as reinforcement learning.
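To make the second-order idea concrete: with the receptor responses collected in a vector r, the feature vector would contain all pairwise products r(i)*r(j). A minimal sketch (the sampling of the retina itself is left out):

    % Build a second-order feature vector from N receptor responses
    % (r is an N-by-1 column vector): all products r(i)*r(j) with i <= j,
    % giving N*(N+1)/2 features.
    N = length(r);
    P = r * r.';              % outer product: all pairwise products
    f = P(triu(true(N)));     % keep each unordered pair once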



