Face Recognition for Mobile Phone Applications


LiU-ITN-TEK-A--08/064--SE
Face recognition for mobile phone applications
Erik Olausson
2008-05-13
Department of Science and Technology (Institutionen för teknik och naturvetenskap), Linköping University, SE-601 74 Norrköping, Sweden

LiU-ITN-TEK-A--08/064--SE
Face recognition for mobile phone applications
Master's thesis (examensarbete) in media technology, carried out at the Institute of Technology, Linköping University
Erik Olausson
Supervisor: Rikard Berthilsson
Examiner: Björn Kruse
Norrköping, 2008-05-13

Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for a considerable time from the date of publication barring exceptional circumstances. The online availability of the document implies a permanent permission for anyone to read, to download, to print out single copies for your own use and to use it unchanged for any non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional on the consent of the copyright owner.

The publisher has taken technical and administrative measures to assure authenticity, security and accessibility. According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement. For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its WWW home page: http://www.ep.liu.se/

© Erik Olausson

Face Recognition for Mobile Phone Applications
Erik Olausson
Department of Science and Technology, Linköping University, Sweden
In collaboration with Cognimatics AB, Lund, Sweden
Supervisor: Rickard Berthilsson, Cognimatics
Examiner: Björn Kruse, Linköping University
May 19, 2008

Contents

1 Introduction

2 Related work
  2.1 Eigenfaces
  2.2 Local DCT
  2.3 SVM
  2.4 Warping
    2.4.1 2D
    2.4.2 3D
  2.5 Neural networks
  2.6 Face Recognition Vendor Test

3 Method
  3.1 Preprocessing
  3.2 Feature extraction
  3.3 Training phase
  3.4 Building the gallery
  3.5 Handling pose variation
    3.5.1 Using neural networks to find pose
    3.5.2 Warping
    3.5.3 Local feature extraction
  3.6 Recognition phase

4 Experimental results
  4.1 Measure of performance
  4.2 Recognition results
    4.2.1 Base case
    4.2.2 Warping
    4.2.3 Adding reference images
    4.2.4 Implementation on mobile phone

5 Discussion
  5.1 System setup
  5.2 Recognition rate
  5.3 Data structure and resource usage
  5.4 Future work and possible improvements

Abstract

Applying face recognition directly on a mobile phone is a challenging proposal due to the unrestrained nature of the input images and the limitations in memory and processor capabilities. A robust, fully automatic recognition system for this environment is still a long way off. However, the results show that by using local feature extraction and a warping scheme to reduce pose variation problems, it is possible to capitalize on the high error tolerance and reach reasonable recognition rates, especially for a semi-automatic classification system where the user has the final say. With a gallery of 85 individuals and only one gallery image per individual available, the system is able to recognize close to 60% of the faces in a very challenging test set, while the correct individual is among the top four guesses 73% of the time. Adding extra reference images boosts performance to nearly 75% correct recognition and 83.5% in the top four guesses. This suggests a strategy where extra reference images are added one by one after correct classification, mimicking an online learning strategy.

Sammanfattning (Swedish summary, in English translation)

Applying face recognition directly on a mobile phone is a challenging task, not least because of the limited memory and processor capacity and the large variation in facial expression, pose and lighting conditions in the input images. A finished, robust and fully automatic face recognition system for this environment is still a long way off. But the results of this work show that by extracting feature values from local regions, and by applying a well-designed warping strategy to reduce the problems caused by variation in the position and rotation of the head, it is possible to reach reasonable and useful recognition rates, especially for a semi-automatic system where the user has the final say about who the person in the picture actually is. With a gallery of 85 people and only one reference image per person, the system reached a recognition rate of 60% on a hard-to-classify series of test images. In total, the correct individual was among the top four guesses 73% of the time. Adding extra reference images to the gallery raises the recognition rate considerably, to nearly 75% for fully correct guesses and to 83.5% for top four. This shows that a strategy where input images are added to the gallery as reference images once they have been identified would pay off and make the system better over time, much like a learning process.

Chapter 1. Introduction

During the last few decades, face recognition has become an increasingly active research area within image analysis and computer vision. A growing demand for effective and reliable security and surveillance systems has put the spotlight on face recognition along with other biometric identification schemes. Even though other methods such as fingerprint or iris scans may yield better results, face recognition is attractive because of its unintrusive nature. It is also 'natural' in the sense that it is what humans use for identification in a social environment. The human ability to quickly and accurately spot a known face in photos or video sequences, regardless of lighting conditions, pose or facial expression, also suggests that facial recognition could be a promising identification method if properly developed.

However, up until now, most researchers have focused on facial recognition in very specialized environments or situations. Such situations include authorities who want to spot a fugitive on the subway, or a pharmaceutical company that wants to restrict access to a laboratory. The consequences of a mismatch, be it a false rejection or a false acceptance, are typically very severe in these circumstances, and the tolerance for errors is therefore very low. The price for this low error tolerance is restrictions on the input and/or reference images in the gallery that the system uses for comparison. For example, it's no coincidence that criminals are photographed a number of times from different angles, increasing the number of reference images of that individual in the gallery (in this case, the database with criminal records) to ease future identification.

In this thesis a new possible area of application for facial recognition is explored: mobile phone cameras. Here the intended user and operator of the system is a normal mobile phone owner. The environment is practically unrestricted with respect to pose, facial expression and lighting conditions, and in the worst-case scenario we cannot expect more than one photo per person in the gallery or as input image. On the other hand, the consequence of a mismatch is much less severe and the tolerance for errors therefore greater. Although this area of application may seem frivolous compared to e.g. stopping a plane hijacker at a security checkpoint or identifying a murderer on a surveillance video, the research is motivated by the fast hardware development in mobile phones.

New phones are equipped with digital cameras of increasingly better quality and resolution, as well as greater memory and processor capabilities. This prompts the owners to use their phone rather than a separate digital or analogue camera to document their life and activities. Software development in image processing and analysis for mobile phone cameras has to keep up. Using face recognition, friends and family can be automatically tagged and the information about who's in a photo can be saved along with the photo. This will facilitate organization of and access to the photos stored on the mobile phone. It may not be worth the effort if you have a few dozen photos stored on your phone and add a few more each month, but it's a different story if you have a few thousand photos stored and add a few more each day.

Chapter 2. Related work

The following is a quick summary of previous research related to this work. For a more comprehensive survey of the methods and strategies applied to face recognition historically and today, readers are referred to Abate et al. [1].

2.1 Eigenfaces

A classic approach to face recognition is the Eigenface method, also referred to as Karhunen-Loève expansion or Principal Component Analysis. It was first developed by Matthew Turk and Alex Pentland in 1991 [14]. In this method, singular value decomposition is used to represent a gallery of face images. The first image, or component, represents a 'mean' or average face, while subsequent components, the eigenvectors of the covariance matrix of the set of face images, add more and more detail. The faces can then be synthesized as linear combinations of these components, called "eigenfaces". This is a very efficient strategy for storing a large set of faces, since approximately 100 components are enough to represent a very large face database, thus reducing the dimensionality. Recognition is performed by finding an optimal linear combination of eigenfaces to represent the input face. This linear combination is then compared to the ones used to represent the faces in the database. The more similar the linear combinations are, the greater the chance that the two faces come from the same individual.

Over the years, this method has been used in a series of experiments, by itself or in combination with other methods, to reach very high recognition rates. For controlled pose and lighting conditions, recognition rates as high as 95 percent on a database with over 200 individuals have been reported (see for example [8]). However, the eigenface method is holistic, i.e. it always considers the whole face during recognition and relies heavily on faces being accurately normalized in placement and scale. This makes it very sensitive to pose variation. For instance, in an experiment by Sanderson et al. [12] a PCA-based system was trained on frontal images of 90 individuals. While the system was able to recognize these 90 individuals with almost 100 percent accuracy on frontal faces using 30 eigenfaces to represent the gallery, accuracy declined dramatically for faces with different poses. For face images taken at a 60 degree angle, error rates rose to as much as 45 percent if more than a few eigenfaces were used. As Sanderson et al. note: "There is somewhat of a trade-off between the error rates on frontal and non-frontal faces, controlled by dimensionality" [12]. Furthermore, matching a linear combination of eigenfaces to the pixel intensities of the input image also makes the method sensitive to lighting conditions [16].
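For concreteness, the projection and matching step sketched above can be written out as follows. The notation is ours, not taken from [14], and is only a compact restatement of the idea, not a description of this thesis's method.

```latex
% Mean face and eigenfaces from a training set of M vectorized face images x_1,...,x_M:
\bar{x} = \frac{1}{M}\sum_{i=1}^{M} x_i, \qquad
C = \frac{1}{M}\sum_{i=1}^{M} (x_i-\bar{x})(x_i-\bar{x})^{T}, \qquad
C\,u_k = \lambda_k u_k .

% A face x is represented by its coefficients along the first K eigenfaces:
\omega_k = u_k^{T}(x-\bar{x}), \quad k = 1,\dots,K .

% Recognition: choose the gallery identity j whose coefficient vector is closest,
% \hat{j} = \arg\min_j \lVert \Omega_{\mathrm{probe}} - \Omega_j \rVert .
```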

2.2 Local DCT

In contrast to holistic methods such as the Eigenface method, local DCT, or Discrete Cosine Transform, divides the image into small local regions that are handled separately. The idea is to make the recognition system less sensitive to pose variation, since the overall geometry of the face is ignored and only local geometry is considered. This means that eyes, nose etc. can "move" independently between images without affecting the recognition process to any great extent. Also, in a local appearance model, the values of the local regions can be weighted differently to emphasize the regions which are found to be most discriminant between faces of different individuals.

Typically, face images are divided into blocks that can be overlapping or non-overlapping. The DCT is performed on each block independently. The coefficients resulting from the DCT are then used as features, possibly weighted according to some regional weighting model and/or reduced in dimensionality to lower memory usage and speed up processing.

Ekenel et al. [5] used this approach in a door monitoring system. Here, the input is surveillance photos showing subjects entering a lecture hall. Lighting conditions are somewhat similar from image to image while pose varies greatly. Ekenel et al. report over 80% recognition rate on still images and over 90% on video streams on a database of 39 individuals, using varying extensions to improve performance. However, it's important to note that this performance was reached by using the same individuals for training and testing, i.e. the system was trained on a short video sequence of these 39 individuals. On individuals that are not known a priori, the system performance would probably drop.

Local DCT feature extraction is also studied by Sanderson et al. [12]. They split up a 56 x 64 pixel face image into 8 x 8 pixel overlapping patches. Each patch is in turn transformed using the DCT, resulting in an 18-dimensional feature vector for each patch. These feature values are fed into a Bayesian classifier system based on Gaussian Mixture Models. For a system trained solely on one frontal image per person, Sanderson et al. reached their best result on non-frontal test images when the overlap between patches was as large as possible (i.e. 7 pixels). They note: "The larger the overlap, (and hence the smaller the spatial area for each feature vector), the more the system is robust to view changes". Sanderson et al. attribute this effect to the decreased dependency on global face geometry [12].
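For reference, the transform behind these local features is the standard 2D DCT-II of a B x B image block f(x, y); this is the textbook definition, not notation taken from [5] or [12], and the block size B and the subset of coefficients kept vary between the systems described above.

```latex
% 2D DCT-II of a B x B block f(x,y); the coefficients C(u,v) are used as features.
C(u,v) = \alpha(u)\,\alpha(v)\sum_{x=0}^{B-1}\sum_{y=0}^{B-1}
         f(x,y)\,
         \cos\!\left[\frac{(2x+1)\,u\,\pi}{2B}\right]
         \cos\!\left[\frac{(2y+1)\,v\,\pi}{2B}\right],
\qquad
\alpha(0)=\sqrt{\tfrac{1}{B}},\quad \alpha(u>0)=\sqrt{\tfrac{2}{B}} .
```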

2.3 SVM

SVMs, or Support Vector Machines, are a set of training and classification methods applicable to many areas in computer vision as well as artificial intelligence in general. Input data is projected into a higher-dimensional space. During training, three parallel hyperplanes are constructed. The two outlying hyperplanes are fitted so that all positive training examples fall on one side of one of the planes, all negative training examples fall on the opposite side of the other hyperplane, and no training examples fall between them. The optimal central hyperplane is taken to be the one that maximizes the distance between the outlying hyperplanes, thus reducing the risk of future input examples being misclassified.

Face recognition is usually considered a k-class classification problem, where k > 0 is the number of individuals in the database we are checking against. In order to use the SVM strategy, Phillips [9] redefines face recognition as a 2-class problem, using an image pair as input rather than a single image. Hence, two images of the same individual are considered to be a positive example, while two images of different individuals are a negative example.

A principal problem when using SVMs is that the positive and negative examples may not be linearly separable in the given hyperspace. If the variation within a class is greater in some aspect than the variation between classes, it may prove impossible to find a hyperplane that correctly classifies all positive and negative training examples. This is usually solved using a non-linear kernel function K() to make the decision surface non-linear. This is the approach taken by Phillips [9]. His training set is composed of 100 frontal images where the faces have been manually normalized with respect to eye positions. Histogram equalization is also performed to reduce the impact of illumination change. The resulting hyperplane was then used to classify 100 images of 100 other individuals not present in the training set. Despite the fact that the individuals in the training and test sets were not the same, Phillips achieved an identification rate of approximately 78%, compared to 54% for the Eigenface method tested on the same dataset. However, once again the pose of the input faces is greatly restrained: Phillips only tests his algorithm on frontal face images.
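For reference, the maximum-margin idea described above is usually written as the following optimization problem. This is the textbook formulation, not notation taken from [9]; labels y_i = ±1 mark positive and negative example pairs x_i, and α_i are the dual coefficients obtained during training.

```latex
% Hard-margin SVM: maximizing the margin 2/||w|| between the two outer hyperplanes
\min_{w,\,b}\;\tfrac{1}{2}\lVert w\rVert^{2}
\quad\text{subject to}\quad
y_i\,(w\cdot x_i + b)\;\ge\;1 \quad \forall i .

% With a kernel K(.,.), the decision surface becomes non-linear in the input space:
f(x) = \operatorname{sign}\Big(\sum_i \alpha_i\, y_i\, K(x_i, x) + b\Big) .
```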

2.4 Warping

One of the hardest obstacles to overcome in face recognition is pose variation, see for example Phillips et al. [10]. Common normalization using, for example, eye positions can easily compensate for scaling and in-plane rotation of faces, but it has no effect on out-of-plane rotation. Despite this normalization step, an image of a face looking to the right appears dramatically different from the same face looking straight ahead, and even more so from the face looking up, down, or to the left. Many different strategies have been tried to compensate for this effect, among them 2D and 3D warping.

2.4.1 2D

The idea behind 2D warping of input and/or gallery images is fairly simple. Assuming that we know which way the face is looking, and thus which rotation we should compensate for, the pixels of the image are moved differently for different parts of the image, creating the illusion that the face is rotating. The crucial issue is how to determine which pixels move and by how much.

Cheng et al. [4] addressed this problem by simply assuming that the head is a cylinder. Images of faces are assumed to be orthogonal projections of a cylinder with face texture onto the image plane. Hence, rotating the face is just a matter of virtually rotating the cylinder and performing a new orthogonal projection of the cylinder in this new position. This approximation is of course only valid for head rotation in the horizontal plane (i.e. looking to the right or to the left), but the idea could be extended by using a sphere rather than a cylinder to approximate the head. In many ways, this is a very crude approximation of head rotation. Protrusions of face parts such as the nose, chin and ears are ignored. Moreover, facial images are usually normalized with respect to eye position. Fictively rotating the head as if it were a cylinder will move the eyes as well, making comparison with other normalized images difficult. Even so, Cheng et al. report an increase in recognition rate with a number of standard recognition methods. For instance, recognition rates using the Line Edge Map rose from 48% to 70% on their test set when rotated faces were warped into a front-looking position in this way [4].

A more elaborate warping scheme is proposed by Beymer and Poggio [2]. They try several techniques, some manual and some automatic, to calculate the actual optical flow between images where the same prototype head is rotated. Assuming there are no mismatches between pixel correspondences, this will in fact give the actual movement of pixels in the image during rotation of the prototype head. For each image we want to warp, interperson correspondences between the input face and the prototype are computed using manual or automatic methods. Finally, the movement found using the prototype is mapped to the input image according to the interperson correspondences, creating a novel pose for the input image. Beymer and Poggio refer to this procedure as parallel deformation. This scheme is not restricted to head rotation in the horizontal plane. Beymer and Poggio [2] use it to create 9 virtual views, starting with one where the face is looking slightly to the right. Assuming the left and right sides of the face are symmetrical, mirroring the images will create a total of 16 different poses. Ten different poses were tested for each person on a 62 person database. The results show that generating virtual views does in fact increase the recognition rate, in this case from 70% to 85%, compared to just using a single reference image of the face looking slightly to the right and mirroring it.

2.4.2 3D

A more complicated, but also more accurate, approach is doing the warping in 3D. Depending on the accuracy of the 3D model used for approximating the head, this strategy can produce very realistic results for any desired head rotation. But it is not only more demanding computationally, it also suffers from the principal geometric problem that a 3D model can never be totally determined from a single view. One way around this problem is using prior knowledge of the general shape of the head.

Assuming there are no anatomic anomalies in the set of faces used for training and/or testing, all head shapes will be approximately the same. Of course, the sizes and positions of anatomic traits will differ from person to person, so the 3D model of a general head shape will have to be modified or fitted to the input data in some way. This is the approach taken by Blanz et al. [3]. Using a database of 200 laser-scanned faces, they build a so-called morphable model with vectors representing the shape and texture of the faces. Given an input image, an analysis-by-synthesis loop is applied to fit the morphable model to the image. In each iteration, a 3D face is rendered from the morphable model using the current parameters. An error measure is calculated between the input image and the rendered image, and the rendering parameters (including head orientation, lighting conditions etc.) are modified accordingly. Once the correct 3D shape has been approximated in this way, Blanz et al. render a new image where the person in the input image is looking straight at the camera. On a database of 87 individuals not part of the set used for the laser scans, Blanz et al. reported a significant increase in recognition rates on images of rotated faces. For instance, on the set of images where the person is looking 45 degrees to the left, recognition rose from 16 to 70% [3]. However, the iterative fitting algorithm has significant drawbacks. In order to work properly it has to be manually initialized with seven feature points, such as the tip of the nose. Also, this type of strategy is very computationally expensive. Blanz et al. report that the optimization loop takes 4.5 minutes on a 2 GHz Pentium 4 processor to converge, far from the kind of processing power we can expect in a mobile phone.

2.5 Neural networks

Neural networks have been used for many years in artificial intelligence and machine learning research. For situations where the input data are extensive and the relationship between input and output is complicated and hard to get an overview of, neural networks are an efficient model. Since the system is automatically constructed during the learning phase, the operator does not need to know how certain parts of the input will affect the output. Another advantage is that although the backpropagation algorithm used for training may take some time, the run-time classification is very fast.

An example of a neural network used for image analysis is ALVINN, which uses a neural network to steer an autonomous car on a highway [7]. The input in this case is a 30 x 32 pixel image of the road ahead. The pixel intensities are fed into the neural network's 960 input nodes. Via the four hidden nodes, output is obtained to turn the steering wheel. In another project, Rowley et al. [11] used a neural network to automatically detect faces in an image. In this case the input to the network was a series of 20 x 20 pixel histogram-equalized cutouts of the input image, sampled at different resolutions. The output is either 1 if a face is found in the cutout window in question, or -1 if no face was found.
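For concreteness, a minimal sketch in C of the forward pass of a small fully connected network of the kind described above: pixel intensities in, one hidden layer, one output node per class. The layer sizes, sigmoid activation and bias handling are illustrative assumptions; the trained weights would come from off-line backpropagation, which is not shown, and none of this is code from the cited systems.

```c
#include <math.h>

#define N_IN  784   /* e.g. a 28x28 pixel input image */
#define N_HID   6
#define N_OUT   3

static double sigmoid(double z) { return 1.0 / (1.0 + exp(-z)); }

/* w_hid: N_HID x (N_IN + 1) weights (last column is the bias),
 * w_out: N_OUT x (N_HID + 1) weights. Returns the index of the strongest output. */
int mlp_classify(const double in[N_IN],
                 double w_hid[N_HID][N_IN + 1],
                 double w_out[N_OUT][N_HID + 1])
{
    double hid[N_HID], best = -1.0;
    int best_i = 0;

    for (int h = 0; h < N_HID; ++h) {            /* input -> hidden layer */
        double z = w_hid[h][N_IN];               /* bias term */
        for (int i = 0; i < N_IN; ++i) z += w_hid[h][i] * in[i];
        hid[h] = sigmoid(z);
    }
    for (int o = 0; o < N_OUT; ++o) {            /* hidden -> output layer */
        double z = w_out[o][N_HID];              /* bias term */
        for (int h = 0; h < N_HID; ++h) z += w_out[o][h] * hid[h];
        double out = sigmoid(z);
        if (out > best) { best = out; best_i = o; }
    }
    return best_i;                               /* winning class */
}
```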

2.6 Face Recognition Vendor Test

One of the most thorough tests of commercial face recognition systems conducted so far is the Face Recognition Vendor Test 2002, performed by Phillips et al. [10]. Here, ten commercial systems were tested on an extensive data set containing 121,589 images of 37,473 individuals. The images were taken from the Visa Services Directorate of the U.S. Department of State. They were therefore of good quality, with uniform lighting, pose and background. Another, smaller data set was also used, where lighting changed from indoor to outdoor and poses and facial expressions varied.

The results show that the best commercial software today reaches around 90% verification rate at a 1% false acceptance rate in the verification test. In the identification test, the top systems were able to spot the correct individual around 70% of the time, while the correct individual was among the top ten guesses around 76% of the time. All this is on a database of an astounding 37,473 individuals. If the database was reduced to 800 individuals, the top recognition rate rose to about 85%. However, these results apply to controlled, front-facing images. When the systems were tried on the smaller dataset, the huge effect of pose variation could be seen. Even when the size of the database was reduced from 37,473 individuals to 87, the rate of correct identification dropped to about 50% on faces looking up or down, and to about 30% on faces looking left or right.

Chapter 3. Method

3.1 Preprocessing

Although the training is done off-line on a Linux PC and the recognition is done on the mobile phone, the preprocessing for the images used in training and to build the gallery was the same as for the input images used to test the recognition system. Five different sets of images were used:

Caltech image set: 143 images, 896 x 592 pixels, of 26 individuals looking straight at the camera. Lighting conditions vary while pose and facial expressions are practically constant. Courtesy of the California Institute of Technology.

Cognimatics image set: 177 images, ranging from 640 x 480 to 1280 x 960 pixels. Lighting conditions and facial expressions vary while pose is practically constant. Courtesy of Cognimatics.

Georgia Tech image set: 390 images, 640 x 480 pixels, of 50 individuals. Lighting conditions vary somewhat while facial expressions and pose vary greatly. Courtesy of the Georgia Institute of Technology.

Massachusetts image set: 154 images, 250 x 250 pixels, of 20 individuals. These images were collected from the web using the Viola-Jones face detector [15]. Apart from this, they are totally unrestrained and vary greatly in pose, illumination and image quality. Since the photos in this image set were very small to start with, the poorest quality ones had to be discarded. Courtesy of the University of Massachusetts.

Facebook image set: 155 images, ranging from 178 x 434 to 604 x 543 pixels, of 18 individuals. These images are "live" photos taken from the internet community Facebook. Hence, there are no restrictions at all, and lighting conditions, facial expressions, pose and image quality all vary greatly. This is the most challenging image set from a recognition point of view, but also the most realistic with respect to future use of the application.

Figure 3.1: Example of an original (here shown downscaled) and a normalized 112 x 112 pixel face image, in this case from the Caltech image set.

All photos were color JPEG images containing a maximum of one face per image. In each image, the face and eye positions were found automatically with functionality previously developed by the Cognimatics crew. Using the eye positions, the images were scaled, cropped and rotated to a 176 x 176 pixel image with fixed eye positions. This image is then cropped further to a 112 x 112 pixel image during feature extraction, to avoid differences between images due to background changes, hairstyles etc., see figure 3.1.

Note that when the faces are looking to the left or right, the distance between the eyes is reduced as a geometric side effect. This causes unwanted scaling in the normalization process that can become quite severe when the angle of rotation is large. See figure 3.2.

Figure 3.2: Faces looking left or right are inadvertently upscaled when normalizing with respect to the distance between the eyes.
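The thesis does not reproduce the normalization code, but the step implies a similarity transform (rotation, uniform scale and translation) that maps the two detected eye centres onto fixed positions in the 176 x 176 output. The following C sketch illustrates one way to do this; the target eye coordinates and the nearest-neighbour resampling are illustrative assumptions, not values taken from the Cognimatics implementation.

```c
#include <math.h>

/* Illustrative sketch: map detected eye centres onto fixed positions in a
 * 176x176 output image via a similarity transform, sampling the source
 * image with nearest-neighbour lookup (inverse mapping, so no holes). */
typedef struct { double x, y; } Point;

static unsigned char sample(const unsigned char *img, int w, int h, int x, int y)
{
    if (x < 0 || y < 0 || x >= w || y >= h) return 0;   /* pad with black */
    return img[y * w + x];
}

void normalize_face(const unsigned char *src, int sw, int sh,
                    Point left_eye, Point right_eye,
                    unsigned char *dst /* 176*176, row-major */)
{
    const Point dst_left  = { 56.0, 64.0 };    /* assumed fixed eye positions */
    const Point dst_right = { 120.0, 64.0 };

    /* Eye-to-eye vectors in the source and destination images. */
    double sx = right_eye.x - left_eye.x, sy = right_eye.y - left_eye.y;
    double dx = dst_right.x - dst_left.x, dy = dst_right.y - dst_left.y;

    double scale = sqrt(sx * sx + sy * sy) / sqrt(dx * dx + dy * dy);
    double angle = atan2(sy, sx) - atan2(dy, dx);   /* residual in-plane rotation */
    double ca = cos(angle), sa = sin(angle);

    /* For every output pixel, rotate/scale its offset from the target left
     * eye back into source coordinates. */
    for (int v = 0; v < 176; ++v) {
        for (int u = 0; u < 176; ++u) {
            double ox = u - dst_left.x, oy = v - dst_left.y;
            double xs = left_eye.x + scale * (ca * ox - sa * oy);
            double ys = left_eye.y + scale * (sa * ox + ca * oy);
            dst[v * 176 + u] = sample(src, sw, sh, (int)(xs + 0.5), (int)(ys + 0.5));
        }
    }
}
```

The further crop to 112 x 112 pixels mentioned above is then just a fixed-offset sub-window of this output.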

Figure 3.3: The three different face patches used for feature extraction: whole face (112 x 112 pixels), right eye (80 x 80 pixels) and left eye (80 x 80 pixels).

3.2 Feature extraction

As with the preprocessing step, the feature extraction is done in the same manner during training, when building the gallery, and when testing probe images against the gallery. Each 112 x 112 face image is divided up into three overlapping images. The first covers the whole 112 x 112 image, the second an 80 x 80 pixel image centered around the right eye, and the third an 80 x 80 pixel image centered around the left eye, see fig. 3.3. From now on, these three image patches are treated independently.

Each image patch is divided into 599 overlapping rectangular feature patches of different sizes and proportions. In each feature patch Ω, three different gradient measures from the image Φ(x, y) are calculated:

\frac{\Phi_x^2}{\Phi_x + \Phi_y}   (3.1)

\frac{\Phi_y^2}{\Phi_x + \Phi_y}   (3.2)

\frac{\Phi_x \Phi_y}{\Phi_x + \Phi_y}   (3.3)

for all (x, y) ∈ Ω, where Φ_x and Φ_y are found using a Sobel convolution kernel on each pixel. For each patch, this gives us three measures of the "activity" in the patch in three directions, i.e. a measure of the amount of structure along the x-axis, the y-axis and the xy-direction in each feature patch.

To speed up computation, three-channel integral images were used to store the values of Φ_x^2, Φ_y^2 and Φ_x Φ_y respectively. An integral image I(u, v) stores the cumulative values along the rows and columns of a gradient image G(x, y) and is defined as:

I(u, v) = \sum_{x \le u,\, y \le v} G(x, y)   (3.4)

Calculating the total sum of gradients in the rectangle delimited by the corner pixels (x_left, y_top), (x_right, y_top), (x_left, y_bottom), (x_right, y_bottom) is then a simple summation:

\sum_{u,v \in \mathrm{rectangle}} G(u, v) = I(x_{left}, y_{top}) - I(x_{right}, y_{top}) - I(x_{left}, y_{bottom}) + I(x_{right}, y_{bottom})   (3.5)

Note that the integral image can be constructed efficiently using an iterative algorithm:

I(u, v) = G(u, v) + I(u - 1, v) + I(u, v - 1) - I(u - 1, v - 1)   (3.6)

In total, 599 rectangles in each gradient image are summed in this way, resulting in a total of 599 * 3 = 1797 feature values for each image patch (whole face, right eye, left eye). These values are henceforth treated as components of a vector in the 1797-dimensional space where training as well as testing occurs.
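As a concrete illustration of eqs. 3.4-3.6, here is a minimal single-channel sketch in C of building an integral image and summing an arbitrary rectangle in constant time. The thesis itself keeps three channels (Φ_x^2, Φ_y^2, Φ_x Φ_y) and evaluates 599 rectangles per image patch; that bookkeeping, and the exact border convention, are not reproduced here.

```c
/* I and G are w*h arrays in row-major order; I(u,v) = sum of G over x<=u, y<=v. */
void build_integral(const double *G, double *I, int w, int h)
{
    for (int v = 0; v < h; ++v)
        for (int u = 0; u < w; ++u) {
            double left   = (u > 0) ? I[v * w + (u - 1)] : 0.0;
            double up     = (v > 0) ? I[(v - 1) * w + u] : 0.0;
            double upleft = (u > 0 && v > 0) ? I[(v - 1) * w + (u - 1)] : 0.0;
            I[v * w + u] = G[v * w + u] + left + up - upleft;   /* eq. 3.6 */
        }
}

/* Sum of G over the rectangle [x0..x1] x [y0..y1], inclusive: four lookups
 * (eq. 3.5, written with the usual -1 offsets so the left/top border is kept). */
double rect_sum(const double *I, int w, int x0, int y0, int x1, int y1)
{
    double A = (x0 > 0 && y0 > 0) ? I[(y0 - 1) * w + (x0 - 1)] : 0.0;
    double B = (y0 > 0) ? I[(y0 - 1) * w + x1] : 0.0;
    double C = (x0 > 0) ? I[y1 * w + (x0 - 1)] : 0.0;
    double D = I[y1 * w + x1];
    return D - B - C + A;
}
```

With the three channels built once per image patch, each of the 1797 feature values costs only four additions per channel, which is what makes this extraction feasible on a phone.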

3.3 Training phase

In contrast to, for example, a restricted access system, we have no prior knowledge of the individuals we want to recognize. So, for realistic testing of the algorithm, the images in the training set and the gallery/probe set have to be of different individuals. In this case, the Cognimatics image set was used for training along with parts of the Georgia Tech, Massachusetts and Facebook image sets, while the Caltech image set was entirely reserved for testing. The reason for this is that while the images in the Cognimatics and Caltech image sets are quite similar, the Georgia Tech, Massachusetts and Facebook images differ greatly in lighting, facial expressions and pose. While the training and testing sets should not contain the same individuals, they should be similar in their composition in order to reach maximum recognition rates. That is, the closer the training set resembles the images we want to classify in the finished application, the better our chances of making a successful classification. In total, 383 images of 65 individuals were used for training, while 598 images of 85 individuals were reserved for testing purposes. Out of these, 252 images were used to build a gallery with three reference images per person, while 346 images were used as input or probe images.

While all the code used at run-time is written in C, the training procedure is carried out using Octave scripts, which are similar to Matlab scripts. The reduced computational efficiency is no problem here since all training is done offline. The first step of the training procedure is to extract feature values for each image in the training set, as described before. All input photos are paired up in every possible combination. The absolute value of the difference between the 1797 feature values in each feature space is calculated:

d_{mn} = |f_m - f_n|   (3.7)

where f_m and f_n are the feature values for the m:th and n:th image in the training set, m ≠ n and f_m, f_n, d_{mn} ∈ ℝ^{1797}.

This yields a new position vector d_{mn} per feature space for each image pair. Note that this operation is necessary to turn the k-class problem of classifying an input photo showing one of k individuals into a two-class problem: positive match or negative match. As mentioned before, the training set contained a total of 383 images of 65 individuals, resulting in a total of 1,439 positive matches (i.e. image pairs of the same individual) and 71,322 negative matches (i.e. image pairs of different individuals).

Once the difference values have been calculated for all combinations of image pairs, the positive and negative matches were loaded into an Octave script. Here, the dimensionality is limited to a maximum of 400 using singular value decomposition. That is, during training only the 400 dimensions with the largest amount of variance, and therefore the biggest impact on the recognition, will be used to construct the hyperplanes. Later, when the final hyperplanes are saved in a C file, the same projection matrix that was created with singular value decomposition is used in reverse to project the hyperplanes back to 1797 dimensions.

The goal of the training procedure is to find 1797-dimensional hyperplanes w_j such that all (or as many as possible) of the points representing positive matches are on one side, and all points representing negative matches are on the other. To simplify calculations, we negate all negative matches, thus creating one large point group in the same "part" of the 1797-dimensional space. We then try to define each plane in turn so that as few points as possible (both positive and negated negative) are outliers, i.e. fall on the wrong side of the plane.

The remaining training procedure is in essence a modified version of the Gauss-Newton algorithm for finding a minimum sum of squared values, and has many similarities with the support vector machine described in section 2.3. Given a set of points x_j, we want to find a (hyper)plane w passing through the origin so that the error Q(w) is minimized. Q(w) is set to:

Q(w) = \sum_j (f(x_j, w))^2 \approx \sum_j \{ f(x_j, w_0) + f'(x_j, w_0)(w - w_0) \}^2   (3.8)

where the error term is expanded using a Taylor expansion and w_0 refers to the approximated plane from the previous iteration (or random values if it is the first iteration). In our case, f(x_j, w) is a function referring to the side of the plane that point x_j is on:

f(x_j, w) = \begin{cases} x_j \cdot w - 1 & \text{if } x_j \cdot w < 1 \\ 0 & \text{otherwise} \end{cases}   (3.9)

This is equivalent to only considering the outliers: positive matches and negated negative matches on the wrong side of the plane. Since the derivative f'(x_j, w) with respect to w is x_j, the Taylor expansion in eq. 3.8 simplifies to:

Q(w) = \sum_j \{ (x_j \cdot w_0 - 1) + x_j \cdot \Delta \}^2   (3.10)

where Δ = (w - w_0). Since we cannot control x_j and w_0, minimizing Q(w) is equivalent to minimizing Δ. So, we set:

\hat{Q}(\Delta) = \sum_j \{ (x_j \cdot w_0 - 1) + x_j \cdot \Delta \}^2   (3.11)

Expanding:

\hat{Q}(\Delta) = \sum_j \{ (x_j \cdot w_0 - 1)^2 + \Delta^T x_j x_j^T \Delta + 2 (x_j \cdot w_0 - 1)\, x_j \cdot \Delta \}   (3.12)

and taking the derivative with respect to Δ (note that the first term does not depend on Δ and therefore vanishes):

\frac{\delta \hat{Q}(\Delta)}{\delta \Delta} = 2 \sum_j x_j x_j^T \Delta + 2 \sum_j (x_j \cdot w_0 - 1)\, x_j   (3.13)

Setting \frac{\delta \hat{Q}(\Delta)}{\delta \Delta} = 0 and solving gives us the final value of Δ:

\Delta = -\Big( \sum_j x_j x_j^T \Big)^{-1} \cdot \Big( \sum_j (x_j \cdot w_0 - 1)\, x_j \Big)   (3.14)

As Δ = w - w_0, the new approximation of w is found: w = w_0 + Δ. Following the Gauss-Newton method, the algorithm iterates, each iteration producing a new approximation of w, eventually reducing the mean absolute value of Δ and the number of outliers. In our case, the training procedure was set to iterate until μ(|Δ|) < 10^{-18} or for at most 30 iterations.

Each set of iterations produces one plane, reducing the number of outliers. The number of planes produced depends on the amount and disparity of the input training data and on the stopping criteria. Depending on the size of the training data and the number of planes desired, the algorithm was either asked to stop when the total number of outliers fell below a preset threshold, or when a maximum number of planes was reached, regardless of the number of outliers left. This training procedure was performed independently in the three different feature spaces, resulting in 4-38 planes per feature space, depending on the stopping criteria.
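Purely as an illustration of eqs. 3.9 and 3.14, here is a C sketch of one such update step: accumulate the normal equations over the current outliers, solve for Δ, and move the plane. The thesis performs this in Octave on SVD-reduced data; the generic dimension handling, the plain Gaussian-elimination solver and the treatment of (near-)singular systems below are our own simplifications, not the author's implementation.

```c
#include <math.h>
#include <stdlib.h>

static double dot(const double *a, const double *b, int d)
{
    double s = 0.0;
    for (int i = 0; i < d; ++i) s += a[i] * b[i];
    return s;
}

/* Solve A * x = rhs by Gaussian elimination with partial pivoting.
 * A (d x d, row-major) and rhs are destroyed. Singular A is not handled. */
static void solve(double *A, double *rhs, double *x, int d)
{
    for (int k = 0; k < d; ++k) {
        int p = k;
        for (int i = k + 1; i < d; ++i)
            if (fabs(A[i * d + k]) > fabs(A[p * d + k])) p = i;
        for (int j = 0; j < d; ++j) {
            double t = A[k * d + j]; A[k * d + j] = A[p * d + j]; A[p * d + j] = t;
        }
        double t = rhs[k]; rhs[k] = rhs[p]; rhs[p] = t;
        for (int i = k + 1; i < d; ++i) {
            double f = A[i * d + k] / A[k * d + k];
            for (int j = k; j < d; ++j) A[i * d + j] -= f * A[k * d + j];
            rhs[i] -= f * rhs[k];
        }
    }
    for (int i = d - 1; i >= 0; --i) {
        double s = rhs[i];
        for (int j = i + 1; j < d; ++j) s -= A[i * d + j] * x[j];
        x[i] = s / A[i * d + i];
    }
}

/* X: n points of dimension d (negative matches already negated), w: current
 * plane normal, updated in place. Returns the mean |delta| component, which
 * plays the role of mu(|Delta|) in the stopping criterion. */
double gauss_newton_step(const double *X, int n, int d, double *w)
{
    double *A = calloc((size_t)d * d, sizeof *A);
    double *b = calloc(d, sizeof *b);
    double *delta = calloc(d, sizeof *delta);

    for (int j = 0; j < n; ++j) {
        const double *x = X + (size_t)j * d;
        double m = dot(x, w, d);
        if (m >= 1.0) continue;                  /* not an outlier, cf. eq. 3.9 */
        for (int r = 0; r < d; ++r) {
            b[r] += (m - 1.0) * x[r];            /* sum (x.w0 - 1) x  */
            for (int c = 0; c < d; ++c) A[r * d + c] += x[r] * x[c];  /* sum x x^T */
        }
    }
    for (int r = 0; r < d; ++r) b[r] = -b[r];    /* right-hand side of eq. 3.14 */
    solve(A, b, delta, d);

    double mean = 0.0;
    for (int r = 0; r < d; ++r) { w[r] += delta[r]; mean += fabs(delta[r]); }
    free(A); free(b); free(delta);
    return mean / d;
}
```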

3.4 Building the gallery

Time restrictions made it impossible to fully implement this step on the mobile phone. Instead, it was performed on the PC and the result was then fed into the phone to test the final matching algorithm. In the finished application, this step is intended to be performed directly on the mobile phone, using photos taken with the mobile phone camera as gallery images.

As during the training phase, the first step in building the gallery is to crop and normalize the images to 176 x 176 pixels with fixed eye positions. This image is further cropped to 112 x 112 pixels before feature extraction proceeds as described above (see section 3.1).

A crucial assumption was made with respect to gallery images. While probe photos are unrestricted with respect to pose and lighting conditions, gallery images are assumed to be front-facing images where the subject is looking more or less straight at the camera. The reason for this is that gallery images are taken under much more controlled circumstances, i.e. the user takes the photo with the explicit purpose of adding it to his or her gallery. Gallery images are already used in today's mobile phones to connect a number and/or name with a photo of that person, to see who you are calling or who's calling you. Extending the use of the existing phone book gallery to include recognition of known contacts in arbitrary input photos is therefore a natural step.

3.5 Handling pose variation

As the basis for this recognition scheme is image gradients rather than intensity values directly, changing lighting conditions did not pose a great problem. With the exception of extreme cases and changing shadows, gradients are practically insensitive to changes in lighting from day to night, indoor to outdoor, etc. However, one of the most difficult hurdles to overcome remains: pose variation. While gallery photos may be assumed to be taken under somewhat controlled circumstances, photos from vacations, outings, parties etc. tend to vary to the extremes with respect to pose. See fig. 3.4 for an example.

Figure 3.4: Example of extreme pose variation. Here an individual from the Facebook image set.

The face- and eye-extraction algorithm developed by the Cognimatics team and used in this thesis work will fail in the most extreme cases, e.g. when faces are greatly rotated with respect to the horizontal. But even among the faces that are correctly found and normalized, varying head rotations are quite common, especially in images of faces looking left or right.

3.5.1 Using neural networks to find pose

Early on in the project, building a simple neural network was tried as a strategy for automatically detecting the pose of probe images. The idea was that once the correct pose could be identified or approximated, it could be compensated for in some way, rectifying or warping the input image of a rotated face into a front-looking face. Since in-plane rotation is compensated for in the normalization step, two types of head rotation remain: tilting the head to look up or down, and rotating it to look left or right. Theoretically speaking, an optimal neural network output given a face image as input would therefore be two continuous angle values describing the head rotation in these two directions.

Figure 3.5: Example of the 28 x 28 pixel gradient images used as input to the neural network. Here the same individual looking right, straight, and left (shown upscaled).

However, this kind of output would require a very large set of images where head rotation was minutely controlled to train the net. Even then, it's hard to predict the accuracy of the outcome. A quick survey of the different image sets revealed that while faces looking up or down are present in the sets, faces looking left or right are much more common. This is only natural in social situations, where we tend to keep our gaze more or less horizontal, where the faces of the people around us can be found. Hence, for testing purposes a simplified neural network was built where the output is discrete rather than continuous and the network only analyzes head rotation in the horizontal direction.

A neural network was built using an open source framework provided by Tom Mitchell, see [18] and [7]. A 28 x 28 pixel gradient image of the 112 x 112 cutout images of faces looking right, straight and left was extracted and fed into the net as input, see fig. 3.5. This gave a total of 28 * 28 = 784 input nodes. The net was constructed with 6 hidden nodes and 3 output nodes corresponding to "left", "straight" or "right". The net was trained on 183 images of 63 individuals with different head poses. Another 83 images of the same individuals were used to test the performance of the net. The result can be seen in table 3.1.

Using the neural network to classify pose:
Training set (183 images): 98.4 %
Test set (83 images): 86.7 %

Table 3.1: Percentage of correctly classified images in the training and test sets using the neural network. This result was reached after 12 epochs of training.

While the preliminary tests of the neural network for finding head pose were promising, making it a useful part of the face recognition scheme was still a long way off. Neural networks are, generally speaking, fast at run-time and not overly greedy with respect to memory usage. However, operating on a mobile phone with its limited memory resources and processing power, it could prove burdensome, especially considering that running the neural network is only the first step in the recognition process. Even if the correct pose is found, the input image has to be rectified or warped into a front-facing image before the usual recognition procedure takes over. A far from unthinkable worst-case scenario would be one where four or five faces are found in an image, all of them looking in one direction or the other. Once the faces are found, cut out and normalized, they have to be run through the neural network, be warped, have their features extracted and then be matched against, say, 100 images in the gallery for recognition. And all this has to be done in a few seconds, since the user supposedly wants to take more photos with his camera, not wait around five minutes for the recognition algorithm to finish up.

Hence, the neural network strategy was abandoned. Instead, the assumption that gallery images are front-facing yields a possibility to greatly reduce the execution time during recognition. By warping the front-facing gallery images to left- and right-looking images, instead of warping input images looking right or left to front-facing images, the whole pose-finding step that the neural network was supposed to handle can be omitted.

Also, the warping will be done during gallery building and not during the recognition procedure, saving processing power during run-time (at the price of increased memory usage, as will be discussed in section 5.3).

3.5.2 Warping

Assuming gallery images are front-facing gives us the opportunity to warp them preemptively when building the gallery. With this approach, three images per individual will automatically be stored in the gallery: one being the original (assumed to be) front-facing image, and two being synthetic images where the original photo has been warped so that the individual appears to look to the left and to the right. This accommodates the most frequent types of pose variation present in both the training and the test set, as previously discussed (see section 3.5.1).

Striving after the "appearance" of synthesized pose variation (i.e. making a front-facing person look right or left) might appear frivolous. But since the recognition algorithm is based on geometric properties such as the shape and placement of the nose, mouth, cheek lines etc., a head that appears rotated to the human eye will also "appear" rotated to the algorithm. On the other hand, warped faces may or may not look natural to the human eye. The skewing of parts of the face is an inherent side effect of warping when trying to emulate 3D rotation by simple 2D pixel movement. But the performance of the recognition algorithm is not based on aesthetic criteria; as long as prominent facial traits are as uniform as possible in shape, size, orientation and position, it matters little whether the face is natural-looking or not.

While 3D warping, as described in [3] for example, would probably yield the best results for recognition, this strategy was considered too computationally costly and was therefore abandoned. The Blanz et al. strategy required not only manually set input points, but also 4.5 minutes of execution time on a 2 GHz Pentium 4 processor. While simplifications and optimization may cut the computational cost somewhat, a 3D approach is inherently more expensive than a 2D approach.

Cylindrical assumption

The first warping strategy that was tried was inspired by the work of Cheng et al. [4]. As in their work, the input photos were assumed to be orthogonal projections of a cylinder with the face as texture. That is, the shape of the head was approximated by a cylinder with the facial features such as the eyes, nose, mouth etc. "painted" on it. This assumption leads to very efficient warping in the horizontal direction. Since the cylinder is uniform in the vertical direction, only the columns of the image have to be warped, and the rows can be left untouched.

Figure 3.6: Result of warping and then re-normalizing using the cylindrical assumption and α = π/8. Original image in the middle. Note how the re-normalization step, needed to relocate the eyes to their original positions, causes an increase in the unwanted scaling of facial features.

Simple geometric deduction gives the equation:

x_{new} = r \cdot \cos\!\left( \cos^{-1}\!\left( \frac{x_{old}}{r} \right) + \alpha \right)   (3.15)

where x_new and x_old define the horizontal distance to the center of the cylinder for each column of the image before and after warping, r is the radius of the cylinder and α is the desired angle of rotation used in the warping. Assuming the virtual cylinder covers the whole input image, which is equivalent to the radius being half the image width, the input value can be scaled so that 0 < x_old < 1, which gives us r = 1 and 0 < x_new < 1. Eq. 3.15 then reduces to:

x_{new} = \cos( \cos^{-1}(x_{old}) + \alpha )   (3.16)

As input for the warping, the 176 x 176 pixel images were used. Note that these images are normalized with respect to eye positions. However, this normalization is lost in the warping process. The eyes are treated just like any other part of the face and rotated along with it. To remedy this, a re-normalization step had to be added, where the 176 x 176 pixel images are rescaled with respect to the new positions of the eyes, so that the eye positions of the warped images match the eye positions of the original images. Figure 3.6 shows the final result of this warp strategy when re-normalization has been performed, with α = π/8. Note that the re-normalization step will cause even more unwanted scaling.
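The column remapping of eq. 3.16 is cheap enough to sketch in full. The C routine below uses inverse mapping (every destination column looks up its source column), so no holes appear; the sign convention for α, the nearest-neighbour sampling and the blanking of columns that rotate out of view are our own simplifications, and the re-normalization step described above is not included.

```c
#include <math.h>

/* Cylindrical warp of eq. 3.16: the head is treated as a cylinder spanning
 * the full image width, so a horizontal head rotation by alpha only remaps
 * image columns; rows are untouched. */
void cylinder_warp(const unsigned char *src, unsigned char *dst,
                   int w, int h, double alpha)
{
    const double PI = 3.14159265358979323846;
    double r = 0.5 * w;                      /* cylinder radius = half the width */

    for (int u = 0; u < w; ++u) {
        double x_dst = (u - r + 0.5) / r;    /* scaled column position, -1..1 */
        if (x_dst < -1.0) x_dst = -1.0;
        if (x_dst >  1.0) x_dst =  1.0;

        /* Undo the rotation to find where this column came from. */
        double theta = acos(x_dst) - alpha;
        int visible = (theta >= 0.0 && theta <= PI);
        double x_src = cos(theta);

        int u_src = (int)(x_src * r + r);    /* back to a pixel index */
        if (u_src < 0) u_src = 0;
        if (u_src > w - 1) u_src = w - 1;

        for (int v = 0; v < h; ++v)
            dst[v * w + u] = visible ? src[v * w + u_src] : 0;
    }
}
```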

Manual optical flow approximation

As can be seen in table 4.2, the cylindrical assumption was unable to increase the recognition rate. Instead, a manual approximation of the optical flow between two images of the same individual with different head rotation was tried. One motivation for this approach was that it could, in theory, handle a problem the cylindrical assumption did not: unwanted scaling. As discussed in section 3.1, facial features in images of rotated heads are often inadvertently upscaled, see fig. 3.2. This causes movement of facial features in the vertical as well as the horizontal direction between images of different head orientation. For example, in fig. 3.2, the mouth and nose appear closer to the bottom in the left-facing image than in the front-facing image. Furthermore, an adequate optical flow approximation could also address the problem of protruding parts of the face, especially the nose, moving further in the horizontal direction than e.g. the cheek, forehead or other parts closer to the center of rotation. Under the cylindrical assumption, this difference in movement was ignored. Lastly, using an optical flow-based warp would make the re-normalization step redundant. The eyes, which are fixed at the same locations in both the rotated and the front-facing images, do not move between the images. Hence, if extracted correctly, the optical flow will be zero around the eyes and this part of the images will in fact not be warped at all.

Finding the optical flow between images where rotation (and sometimes facial expression) varies is no trivial task. The risk of mismatches is severe, especially in areas of even intensity and little gradient, such as the cheek. Shadows and occlusion are other sources of error. To overcome this, eleven points on the face that are easily identified and located, regardless of variation in pose or expression, were selected. These were: the inside and outside corners of the eyes (four points), the forehead right between the eyebrows and above the nose, the tip of the nose, the two points where the sides of the nose meet the cheek, the two corners of the mouth and the middle of the upper lip. See fig. 3.7.

Figure 3.7: The eleven points in a 112 x 112 pixel image that were manually positioned in front-facing, left-looking and right-looking images. This example is from the Georgia Tech image set.

These eleven points were located in four image triplets of four individuals looking straight at the camera, facing left and facing right. The mean positions and optical flow from these four image triplets were then used to represent general pixel movement between any normalized faces looking straight and left, and straight and right, respectively. Of course, this only gives us the mean optical flow over the four sets of images for these specified eleven points. Subsequently, the optical flow of all other pixels in the images was calculated automatically using a weighting function:

Figure 3.8: Sparse vector field showing the flow calculated by manually locating eleven points on prominent features (here overlaid on a straight-looking image). Red dots signify zero flow. (a) shows the flow from straight to left-looking, while (b) shows the flow from straight to right-looking.

Flow(x, y) = \sum_n G(n) \cdot \left( 1 - \frac{c_n^2}{c^2} \right)   (3.17)

for all points n with known flow G(n) where c_n, the checkerboard distance from point n to point (x, y), is below a pre-determined distance cutoff c. Thanks to the second term, the impact of the known flow points decreases with the square of the distance to the point in question. This gives us the known flow at the known points (where c_n = 0 for some n) and a mixture of the known flows between these points, weighted according to this distance measure. Experiments showed that squaring the checkerboard distance gave a smoother and therefore better flow map. At all points further than c pixels away from any known point, the flow is set to zero. An illustration of the flow calculated this way can be seen in fig. 3.8(a), while the result of warping two images of straight-looking faces can be seen in fig. 3.9(b).

Lucas-Kanade optical flow approximation

Despite numerous tests with varying parameters, weighting functions etc., the manual optical flow approximation failed to produce any improvement in recognition rates (see table 4.3). While it gave a decent approximation of the optical flow for some individuals, ugly defects appeared for others. Inspired by the work of Beymer and Poggio [2], a more sophisticated approach was tried, using the well-known Lucas-Kanade optical flow algorithm [6]. As this algorithm is generally best suited for following objects in video sequences, where pixel movements between two images are very small, a short video sequence was recorded of a subject moving his head from left to right.

Figure 3.9: The result of applying the manually based optical flow approximation to two straight-looking images, here from the Caltech and Massachusetts image sets. As expected, the largest pixel movements occur in the nose region. Since eye positions were fixed in all images used for the flow approximation, the calculated flow should be close to zero around the eyes.

Figure 3.10: Lucas-Kanade-based optical flow. (a) shows three images from the video sequence. Tracked feature points can be seen as small white squares. (b) shows the final calculated optical flow of the points in the front-facing images. Note that a few mismatches remain despite a threshold filtering where extreme flow values were thrown away.

The Shi-Tomasi algorithm [13] was used to find 187 good feature points to track throughout the video sequence. Intel's open source library OpenCV [17] provided a platform for easy implementation of this as well as of the Lucas-Kanade algorithm. Six images of varying head rotation were extracted from the video sequence and fed into the algorithm. After throwing away feature points that were lost, occluded, moved outside the image or were deemed to be mismatches, 160 feature points remained. Hence the actual optical flow between the images in the video sequence was known in 160 places of the 176 x 176 pixel image, see fig. 3.12.

Once the actual flow for these 160 points was found, the flow for the remaining pixels was calculated in a manner similar to the one used in the manual optical flow approximation. The difference is that the sheer number of known feature points mandates a better weighting function to handle them. A parameter c is set as the checkerboard distance cutoff. Assuming that for a pixel (x, y) we have n points with known optical flow within the cutoff distance c, we set the flow for pixel (x, y) to:

Figure 3.11: Resulting flow calculated using the Lucas-Kanade optical flow algorithm and a weighting function for the remaining pixels. (a) shows horizontal flow, while (b) shows vertical flow. Dark values represent negative flow, light values positive.

Once the actual flow for these 160 points was found, the flow for the remaining pixels was calculated in a manner similar to the one used in the manual optical flow approximation. The difference is that the sheer number of known feature points mandates a better weighting function to handle them. A parameter c is set as the checkerboard distance cutoff. Assuming that for a pixel (x, y) there are n points with known optical flow within the cutoff distance c, the flow for that pixel is set to

Flow(x, y) = \sum_{j=1}^{n} G(j) \cdot \frac{1}{a c_j}    (3.18)

where c_j \le c is the checkerboard distance to known flow point j and a is deduced below. Since the weighting function should only determine the relative weights of the nearby known-flow points, we want it to sum to 1:

\frac{1}{a c_1} + \frac{1}{a c_2} + \frac{1}{a c_3} + \ldots + \frac{1}{a c_n} = 1    (3.19)

Solving for a yields

a = \frac{1}{c_1} + \frac{1}{c_2} + \frac{1}{c_3} + \ldots + \frac{1}{c_n}    (3.20)

However, this weighting function has a very sharp limit between pixels just inside and just outside the affecting range of the nearest point with known optical flow. To remedy this, a general distance factor is also used, giving the final flow at pixel (x, y):

Flow(x, y) = \sum_{j=1}^{n} G(j) \cdot \frac{1}{a c_j} \cdot \left(1 - \frac{c_j}{c}\right)    (3.21)

for all points 1...n with known optical flow G(j) within range. Figure 3.11 shows an illustration of the flow calculated using this approach, with the cutoff value set at 70 pixels, while fig. 3.12 shows the result of applying this warp to some front-facing images.
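A sketch of the weighting scheme in Eqs. 3.18-3.21, assuming the tracked points and their flow vectors are available as NumPy arrays; the function and variable names are illustrative.

import numpy as np

def dense_flow(known_pts, known_flow, shape, cutoff=70):
    # known_pts  : (N, 2) array of (x, y) positions with known flow
    # known_flow : (N, 2) array of flow vectors G(j) at those positions
    h, w = shape
    flow = np.zeros((h, w, 2))
    for y in range(h):
        for x in range(w):
            # Checkerboard (Chebyshev) distances to all known flow points.
            d = np.maximum(np.abs(known_pts[:, 0] - x), np.abs(known_pts[:, 1] - y))
            near = d <= cutoff
            if not near.any():
                continue                      # outside range of every known point
            c_j = np.maximum(d[near], 1e-6)   # avoid division by zero at known points
            a = np.sum(1.0 / c_j)             # normalization constant, Eq. 3.20
            weights = (1.0 / (a * c_j)) * (1.0 - c_j / cutoff)   # Eq. 3.21
            flow[y, x] = weights @ known_flow[near]
    return flow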

Figure 3.12: Result of applying the flow in fig. 3.11 to two 176x176 pixel images in the Georgia Tech and Facebook image sets and then cropping them to 112x112 pixels. In both cases, the middle image is the original while the images to the left and right are the result of the warp. Notice how different images respond differently to the warp. In (a), the result is visually pleasing, while in (b) several artifacts can be seen around the edges.
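Applying the interpolated flow to an image amounts to resampling every pixel from its displaced position. Below is a minimal sketch using OpenCV's remap; the interpolation mode and the sign convention of the flow are assumptions.

import cv2
import numpy as np

def warp_with_flow(image, flow):
    # flow[y, x] = (dx, dy): dense displacement field, e.g. from dense_flow() above.
    h, w = image.shape[:2]
    xs, ys = np.meshgrid(np.arange(w, dtype=np.float32),
                         np.arange(h, dtype=np.float32))
    # Each output pixel is sampled from its flow-displaced position in the source.
    # Depending on whether the flow is defined forward or backward, the sign of
    # the displacement may need to be flipped.
    map_x = xs + flow[..., 0].astype(np.float32)
    map_y = ys + flow[..., 1].astype(np.float32)
    return cv2.remap(image, map_x, map_y, interpolation=cv2.INTER_LINEAR)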

Figure 3.13: An example where eye-centered cutouts are very similar despite comparatively large differences in facial expression around the mouth. Example from the Facebook image set.

3.5.3 Local feature extraction

As described in section 3.2, we operate on the input, gallery and probe images alike in three separate feature spaces, each 1797-dimensional. As previous work with local DCT showed, extracting features from local patches reduces the effect of pose variation (see for example [12]). In our case, the eyes are the only facial features whose exact location is known. Other parts of the face, such as the nose, will be located at very different positions in the image depending on whether the face is looking straight, to the left or to the right. While the brows and the eye sockets will have a different angle towards the camera, the actual location of the eyes will be the same. Hence, using a small cutout centered around each eye should make the algorithm more robust to view changes. Additionally, a quick survey of the image sets shows that the areas around the eyes are the least affected by changes in facial expression. The geometry around the mouth changes dramatically if a person is smiling, moping, laughing or shouting, whereas the eyes (and nose) change very little in position and appearance, see fig. 3.13.

3.6 Recognition phase

The preprocessing and feature extraction steps are identical to those of the training phase and the building of the gallery. Once the 1797 feature values in all three feature spaces have been extracted, the actual recognition is performed by matching the probe image against each gallery image. That is, the absolute differences between the feature values are calculated between the probe image and each gallery image, in the same way as in the training phase (see section 3.3 and figure 3.14). The difference values are treated as the components of a position vector in 1797-dimensional space. The summed geometric distance D to all the planes constructed during training is calculated using Eq. 3.22:

D = \sum_{n} \frac{a_0^n x_0 + a_1^n x_1 + \ldots + a_{1797}^n x_{1797}}{\sqrt{(a_0^n)^2 + (a_1^n)^2 + \ldots + (a_{1797}^n)^2}}    (3.22)
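A sketch of the distance computation in Eq. 3.22, assuming the trained plane coefficients are stored with one row per plane, ordered as in the equation, and that the difference vector carries a leading 1 if the planes include an offset coefficient a_0. The function and variable names are illustrative.

import numpy as np

def summed_plane_distance(x, planes):
    # x      : feature-difference vector for one probe/gallery pair
    # planes : (num_planes, len(x)) array; row n holds the plane coefficients a^n
    numerators = planes @ x                   # a^n . x for every plane n
    norms = np.linalg.norm(planes, axis=1)    # ||a^n||, the denominator of Eq. 3.22
    return float(np.sum(numerators / norms))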


The algorithm might in extreme cases use the distance value for the whole face patch from one reference image, the distance value for the right eye patch from another and the distance value for the left eye patch from a third. Eq. 3.23 is then modified to

D_{tot} = w \cdot \max_{nRef}(D_{wf}) + \frac{1-w}{2} \cdot \max_{nRef}(D_{re}) + \frac{1-w}{2} \cdot \max_{nRef}(D_{le})    (3.24)

where each maximum is taken over all nRef reference images of this individual in the gallery. The best value of w turned out to differ depending on the number of reference images used per individual, as will be covered in chapter 4.
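The combination rule in Eq. 3.24 is straightforward to express in code. A sketch, assuming the per-patch distances for each reference image have already been computed (for example with the plane-distance function above); the names are illustrative.

def combined_distance(d_wholeface, d_righteye, d_lefteye, w):
    # Each argument is a list with one distance value per reference image
    # of the individual (whole face, right eye and left eye patches).
    return (w * max(d_wholeface)
            + (1.0 - w) / 2.0 * max(d_righteye)
            + (1.0 - w) / 2.0 * max(d_lefteye))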

Chapter 4

Experimental results

4.1 Measure of performance

The measure of performance for a face recognition system is somewhat dependent on the area of application. In verification systems two types of errors can occur: false acceptance, when the system accepts an unknown face, and false rejection, when the system rejects a known face. The trade-off between these two types of errors can be adjusted with thresholds and parameters. In many cases (see for example [9]) they are plotted against each other. Sanderson et al. instead use the mean of these two errors as an overall error measure [12]. For restricted-access systems, the false acceptance rate has to be set very low, since the consequences of letting in unauthorized personnel can be severe. One way to measure the performance of a verification system is therefore to fix the false acceptance rate at a low percentage and then try to maximize the true acceptance rate. This is done by Blanz et al. [3]. A recognition system is in principle an extension of the standard verification system, as input images are matched against each and every gallery image. It is still possible to have a verification threshold built into the system, rejecting faces which are deemed to differ from all gallery images. However, in most cases the system is only tested on individuals known to be present in the gallery, and performance is simply measured as the percentage of input images the system is able to correctly identify. See for example [2], [5] and [16]. In our case, the recognition is only semi-automatic; the user always confirms or corrects the system's classification guess. All faces in input images are assumed to be present in the database, and the system simply supplies a shortlist of its four best guesses that the user can choose from. False acceptances are of no importance here. If the system is unable to place the individual in an input image among the top four guesses, or if the individual isn't present in the gallery, the user is simply asked to enter the name of the individual manually. Instead of measuring the percentage of false acceptances, two other performance measures are used: C1 for the number of times the correct individual is the top guess and C4 for the number of times he or she is among the top four guesses. Note that 0% ≤ C1 ≤ C4 ≤ 100%.
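A sketch of how the C1 and C4 measures can be computed from ranked guesses; the data layout (one ranked list of candidate identities per probe image) is an assumption.

def rank_statistics(ranked_guesses, true_labels):
    # ranked_guesses : list of ranked candidate lists, one per probe image
    # true_labels    : list of the true identity for each probe image
    n = len(true_labels)
    c1 = sum(1 for guesses, truth in zip(ranked_guesses, true_labels)
             if guesses and guesses[0] == truth)
    c4 = sum(1 for guesses, truth in zip(ranked_guesses, true_labels)
             if truth in guesses[:4])
    return 100.0 * c1 / n, 100.0 * c4 / n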

Base case recognition results
                     Small plane set    Large plane set
Totally correct:          57.5 %             58.4 %
Second correct:            5.9 %              7.0 %
Third correct:             4.5 %              2.4 %
Fourth correct:            2.6 %              3.8 %
Totally incorrect:        29.6 %             28.4 %
Summary
C1:                       57.5 %             58.4 %
C4:                       70.4 %             71.6 %

Table 4.1: Base case recognition results. Only one image per individual in the gallery. w = 0.45 for the small plane set and w = 0.42 for the large plane set.

4.2 Recognition results

For testing purposes a database of 85 individuals was constructed. 346 images of these individuals, ranging from 2 to 6 images per individual, were fed independently into the recognition algorithm. A simple optimization loop was used to determine the approximate optimal value of the weighting parameter w in each case (see Eq. 3.24). Although the optimal value of w depended on the system setup (the number of planes, the composition of the gallery etc.), the best result was always obtained for 0.33 ≤ w ≤ 0.46. That is, the best results were generally achieved when the values of the whole face feature space were given a larger weight than those of the right eye and left eye feature spaces. Two different sets of planes were trained and tried on the test data. The first set, referred to as the Small plane set, used 12, 8 and 8 planes for the whole face, right eye and left eye feature spaces respectively. Given the input training data, when the training procedure was stopped, 12,500, 16,160 and 19,890 outliers (positive matches and negated negative matches) remained on the wrong side of all planes in the whole face, right eye and left eye feature spaces. The second set, referred to as the Large plane set, used 38, 30 and 30 planes for the whole face, right eye and left eye feature spaces. When the training procedure was terminated, 580, 2500 and 2500 outliers remained in each feature space respectively.

4.2.1 Base case

Here we have only one, unwarped gallery image per individual. For w-values of 0.45 and 0.42, which turned out to be optimal in this case, the system achieved a C1 recognition rate of 57.5% for the small plane set and 58.4% for the large plane set. See table 4.1.
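The optimization loop for w is not spelled out in the thesis; one plausible form is a coarse grid search that maximizes C1 on the test set. In the sketch below, evaluate_c1 is a hypothetical helper that runs the full recognition pass for a given w and returns the resulting C1 rate.

def tune_weight(evaluate_c1, step=0.01):
    # Grid search for the whole-face weight w in Eq. 3.24.
    best_w, best_c1 = 0.0, -1.0
    w = 0.0
    while w <= 1.0:
        c1 = evaluate_c1(w)
        if c1 > best_c1:
            best_w, best_c1 = w, c1
        w = round(w + step, 10)   # avoid floating-point drift in the loop variable
    return best_w, best_c1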

Cylindrical assumption warp
                     Small plane set    Large plane set
Totally correct:          56.9 %             58.0 %
Second correct:            5.9 %              6.6 %
Third correct:             4.4 %              5.1 %
Fourth correct:            2.6 %              3.3 %
Totally incorrect:        30.2 %             28.6 %
Summary
C1:                       56.9 %             58.0 %
C4:                       69.8 %             71.4 %

Table 4.2: Recognition results on a gallery with images warped using the cylindrical assumption. w = 0.42 for the small plane set and 0.34 for the large plane set.

Manual approximation warp
                     Small plane set    Large plane set
Totally correct:          56.0 %             58.6 %
Second correct:            6.7 %              6.9 %
Third correct:             2.6 %              2.8 %
Fourth correct:            4.4 %              3.9 %
Totally incorrect:        30.2 %             29.8 %
Summary
C1:                       56.0 %             58.6 %
C4:                       69.8 %             70.2 %

Table 4.3: Recognition results on a gallery with images warped using the manual flow approximation. w = 0.37 for the small plane set and 0.35 for the large plane set.

Lucas-Kanade based warp
                     Small plane set    Large plane set
Totally correct:          58.1 %             59.5 %
Second correct:            7.3 %              7.9 %
Third correct:             3.2 %              3.5 %
Fourth correct:            3.5 %              2.1 %
Totally incorrect:        27.9 %             27.0 %
Summary
C1:                       58.1 %             59.5 %
C4:                       72.1 %             73.0 %

Table 4.4: Recognition results on a gallery with images warped using the Lucas-Kanade based flow. w = 0.45 for the small plane set and 0.37 for the large plane set.

Adding reference images
                        nRef = 2                           nRef = 3
              Small plane set  Large plane set   Small plane set  Large plane set
                  w = 0.42         w = 0.33          w = 0.37         w = 0.33
1st correct:       63.1 %           66.0 %            67.2 %           71.2 %
2nd correct:        7.0 %            7.0 %             7.6 %            5.5 %
3rd correct:        5.3 %            5.3 %             2.9 %            2.9 %
4th correct:        3.0 %            2.4 %             4.1 %            1.8 %
Incorrect:         21.7 %           19.4 %            18.2 %           18.5 %
Summary
C1:                63.1 %           66.0 %            67.2 %           71.2 %
C4:                78.3 %           80.6 %            81.8 %           81.5 %

Table 4.5: Recognition results with one and two added reference images per individual. Note that no warped images were used in this gallery.

Warping and adding reference images
                        nRef = 4                           nRef = 5
              Small plane set  Large plane set   Small plane set  Large plane set
                  w = 0.46         w = 0.38          w = 0.38         w = 0.39
1st correct:       64.5 %           69.2 %            69.2 %           74.8 %
2nd correct:        8.2 %            7.0 %             6.7 %            4.7 %
3rd correct:        4.5 %            2.7 %             3.8 %            2.7 %
4th correct:        2.4 %            2.7 %             1.8 %            1.5 %
Incorrect:         20.5 %           18.5 %            18.5 %           16.5 %
Summary
C1:                64.5 %           69.2 %            69.2 %           74.8 %
C4:                79.5 %           81.5 %            81.5 %           83.5 %

Table 4.6: Result of combining added reference images per individual with the best warping strategy found. For each individual, the first three of the nRef images are the original image and the two warped images. After that, one and two extra images were added to the gallery, resulting in a total of 4 and 5 reference images per individual.

Processing time
One face:       2.76 s
Two faces:      3.75 s
Three faces:    4.64 s

Table 4.7: Mean processing time on a Nokia N73 after ten trials each with one, two and three faces present in the input image. Note that the processing time does not increase linearly with the number of faces in an image: while the recognition process is repeated for every face, the face detection is only performed once per image.

4.2.2 Warping

The results of warping the gallery images according to the three different strategies described in section 3.5.2 can be seen in tables 4.2 to 4.4. Note that only the Lucas-Kanade based optical flow approximation was able to increase the recognition rates compared to the base case.

4.2.3 Adding reference images

Extra, randomly chosen reference images were added to the gallery and the recognition rate was measured, see table 4.5. This operation simulates the situation where the gallery on the mobile phone is extended as the number of correctly classified images (with or without input from the user) increases. Finally, the Lucas-Kanade based warping scheme was combined with extra reference images in table 4.6. Not surprisingly, this yielded the best recognition rate overall.

4.2.4 Implementation on mobile phone

Once all the computer-based tests had been run, the recognition system was implemented on a Nokia N73 mobile phone with Symbian OS, a 220 MHz processor and 64 MB of SDRAM. Table 4.7 shows the processing time on a gallery containing 72 individuals with 2 (unwarped) reference images per person, making a total of 114 images used in the matching. For demonstration purposes, the number of planes was restricted to 6 for the whole face feature space and 4 each for the right eye and left eye feature spaces. While it is hard to measure any recognition rate statistics on the mobile phone (differences in how the phone is held when acquiring the photo, facial expressions etc. would make any such measurements subjective and volatile), the system seems to be operating reasonably well, albeit somewhat erratically. Fig. 4.3 shows two photos of the author taken with the N73 camera that are an extreme example of this erratic behavior. Although one seems substantially harder to identify than the other, the opposite is true: the difficult image was correctly classified while the easy one was not.

Figure 4.1: Impact on the C1 recognition rate of adding images warped using the Lucas-Kanade based optical flow to the gallery. Although adding extra reference images has the largest positive impact, adding warped images does improve recognition rates, especially in the cases where extra reference images are added as well. As can be seen in tables 4.1 - 4.6, C4 performance follows a similar pattern: adding Lucas-Kanade warped images reduces the number of totally incorrect matches by about 1.5-2 percentage points.

Figure 4.2: Best case performance. Added reference images combined with the Lucas-Kanade based warping strategy.

Figure 4.3: Example of photos taken with the N73 camera (shown downscaled). Found faces and eyes are marked with red squares. In most cases the system performs as expected on the mobile phone, but there are striking exceptions. In this case, (a) should be a great deal harder for the system to identify than (b). Even so, (a) was correctly classified while (b) did not even reach the top four guesses.

Chapter 5

Discussion

5.1 System setup

From the results shown in tables 4.1 to 4.6, a number of conclusions can be drawn. While some are to be expected, others are harder to explain:

• The system performed better with the large plane set than with the small one. This makes sense, since the number of remaining outliers in the training set was substantially smaller when forming the large plane set. Assuming the training data are sufficiently similar to the test data, the large plane set should produce fewer outliers when confronted with the test set as well, translating into fewer mismatches and lower error rates. It is more difficult to explain why the difference between the large and small plane sets seems to increase with the number of added reference images. In the base case, there is about one percentage point of difference in C1 as well as C4 performance between the small and large plane sets. With two added reference images, the difference is only 0.3 percentage points for C4 performance but 4.0 percentage points for C1. This effect is even more apparent in the case where warped images are added to the gallery. With only original and warped images in the gallery, the difference in C1 performance is 1.4 percentage points between the small and large plane sets. With two added reference images, the difference has increased to a staggering 5.6 percentage points. In summary, with a larger plane set the system seems able to capitalize better on added reference images and on using the optimal combination of whole face, right eye and left eye images as the match value.

• Only the Lucas-Kanade based warp improved recognition rates. It might look like a paradox that adding warped images to the gallery can decrease the recognition rate, since the original images used in the base case are still there. But adding warped images increases the size of the gallery by a factor of three, increasing the risk of mismatches. A bad warping strategy will therefore increase the number of mismatches more than it increases the number of correct matches.

References
