
Face Recognition for Annotation in Media Asset Management

Systems

Sebastian Gröhn

Master's Thesis in Computing Science, 30 ECTS credits

June 1, 2013

Supervisor: Frank Drewes
Examiner: Fredrik Georgsson

Umeå University, Dept. of Computing Science
SE-901 87 Umeå, Sweden


Abstract

The goal of this thesis was to evaluate alternatives to the Wawo face recognition (FR) library, used by the company CodeMill AB in an application for video-based FR, implemented as a plugin to the media asset management system Vidispine. The aim was to improve the FR performance of the application, and the report compares the performance of recent versions of Open Source Biometrics Recognition (OpenBR) and Open Source Computer Vision (OpenCV) to Wawo.

For the comparison of the FR systems, ROC curves and area-under-ROC-curve (AUC) metrics were used. Two different test videos were used: one simpler video shot with a webcam and one excerpt from a TV music show. The results are somewhat inconclusive; the Wawo system had technical difficulties with the biggest test case. However, it performed better than OpenBR in the two other cases (comparing AUC values), which suggests that Wawo would have outperformed the other systems in all test cases had it worked. Finally, the comparison shows that OpenBR is better than OpenCV for two of the three test cases.


Contents

1 Introduction
  1.1 Background
  1.2 Project Aim and Goal
  1.3 Working Method
  1.4 Related Work
  1.5 Thesis Structure

2 Face Recognition in Video
  2.1 Face Recognition Workflow
  2.2 Techniques for Face Recognition
    2.2.1 Eigenfaces and Principal Component Analysis
    2.2.2 Fisherfaces and Linear Discriminant Analysis
    2.2.3 Local Binary Patterns
    2.2.4 Joint Multiple Hidden Markov Models
  2.3 Evaluated Face Recognition Systems
    2.3.1 Wawo
    2.3.2 OpenBR
    2.3.3 OpenCV's Face Recognition Library

3 Comparison
  3.1 Criteria for Evaluation
  3.2 Test Material
  3.3 Introduction to ROC
    3.3.1 Tables of Confusion
    3.3.2 ROC Plots
    3.3.3 Confusion Matrices
    3.3.4 Area under ROC Curves
  3.4 Method

4 Results
  4.1 Problems
  4.2 Interpretation

5 Conclusions
  5.1 Limitations
  5.2 Future Work

Acknowledgements

References

A Detailed Results


Figures

2.1 Local Binary Patterns Illustration
2.2 HMM Illustration
3.1 Test Video Sample Frames
3.2 Table of Confusion Illustration
3.3 ROC Plot Example
3.4 Confusion Matrix Illustration
4.1 Sample Frames after FR
4.2 ROC Results for codemill
4.3 ROC Results for SA_SKA_DET_LATA-long
4.4 ROC Results for SA_SKA_DET_LATA-short
A.1 Detailed ROC Results for codemill
A.2 Detailed ROC Results for SA_SKA_DET_LATA-long (part 1)
A.3 Detailed ROC Results for SA_SKA_DET_LATA-long (part 2)
A.4 Detailed ROC Results for SA_SKA_DET_LATA-short (part 1)
A.5 Detailed ROC Results for SA_SKA_DET_LATA-short (part 2)


Tables

2.1 Face Recognition Algorithms
2.2 Evaluated Face Recognition Systems
3.1 Test Videos Used in the Comparison
4.1 AUC Results
A.1 Test Video Details
A.2 Detailed AUC Results


Chapter 1 Introduction

In the media industry, media asset management systems (MAM systems) are increasingly important, as companies start publishing video and audio material on multiple platforms in multiple formats. The responsibility of MAM systems is to catalog, store, and distribute media assets (digital files and their accompanying usage rights).

There are many attributes that are interesting to store about a media file, and emerging among them is more hard-to-extract time-coded information, such as "Who is present at different times in a video?" or "What do people say in a video, and when do they say it?"

Only recently has functionality for (semi-)automatic extraction of such information been introduced into MAM systems, making this a field with potential for big improvements.

Of interest for this thesis project is the area of automated face recognition (FR) in video material. One tries to answer which people are present in a video, at which times they are present, and how they move in the picture. Below are two reasons for MAM system users showing interest in FR:

1. An organization handles recordings of board meetings with a MAM system and needs to know which meetings a given board member has attended.

2. The editorial staff of a public service broadcasting corporation wants to compose a montage of a newly deceased celebrity using archive footage. The video archive is managed by a MAM system.

This chapter gives an introduction to the project, its aim and goal, and how and where it was conducted.


1.1 Background

The thesis project was conducted at CodeMill AB¹, a Umeå-based information and communications technology (ICT) consulting and application development company. It focuses on medical-resource and media-production applications and currently has an application for face recognition (FR).

CodeMill works closely with Vidispine AB², another Swedish company, which has been developing a MAM framework since 2009. The Vidispine product is composed of several modules, of which the transcoder is the most interesting for this work. It processes video streams in a multitude of ways: scaling resolution, changing frame rate, extracting parts of streams, merging several streams, etc. It also has support for extracting time-coded metadata via a plugin architecture.

CodeMill's FR application hence works as a plugin to the Vidispine transcoder, leveraging its broad support for different video decoders and demuxers. It uses the Wawo library (presented in more detail in Chapter 2) for the actual face recognition.

1.2 Project Aim and Goal

The aim of the thesis project was to improve the performance of CodeMill's FR application. Earlier thesis projects at the company had looked at the aspects of face detection and video object tracking; this project was to focus on face recognition. More specifically, the goal was to evaluate alternatives to Wawo for doing FR.

The original problem statement was to evaluate alternatives, both proprietary and free/open-source software (FOSS) systems, comparing them with the currently used system for face recognition (Wawo). The evaluation was to be based on the reliability of the different systems in matching faces, the amount of training required, and the amount of user interaction needed.

As a next step, the implementation of an existing open-source system for face recognition would be analyzed and improved with available algorithms. It would then be compared to the original implementation and the other systems to measure the impact of the improvements. However, this step was removed from the project plan because of time constraints.

¹ http://www.codemill.se/

² http://vidispine.com/


1.3 Working Method

The thesis project was divided into three broad phases:

1. Preparation: Apart from concretizing the project idea into a thesis proposal and plan, a theoretical study of the scientific area of face detection and recognition was conducted, followed by a field study of available FR systems. Comparison criteria were also investigated.

2. Evaluation: A test framework for evaluating the chosen FR systems was developed, as well as plugins to use them with the Vidispine transcoder. Then, test material to use for the evaluation was prepared, semi-manually marking up faces in video streams to use as a reference in the comparison. Finally, the comparison was made.

3. Documentation: The work was documented in this thesis.

1.4 Related Work

This thesis is in some ways a continuation of previous thesis projects conducted at CodeMill, all with the aim of improving the face recognition application. Among those are the works of Thomas Rohndal [9], comparing two systems for face detection, and Linus Nilsson [7], adding support for face tracking. Another project, related to the overall aim of extracting time-coded metadata, is Tobias Nilsson's [8], looking into speech-to-text in audio/video streams.

1.5 Thesis Structure

This thesis is divided into the following chapters:

Chapter 2: A theoretical background to face recognition and common methods and algorithms is given, as well as an introduction to the compared FR systems.

Chapter 3: How the comparison was conducted is explained: First, the comparison criteria and the test material are presented. Then, an introduction to the statistical tools of interest is given.

Chapter 4: The results of the comparison are presented and explained.


Chapter 5: The conclusions drawn from the comparison results are given. Furthermore, limitations of the presented work are discussed and future work is proposed.

Appendix A: For interested readers, details of the test videos and detailed comparison results are included here.


Chapter 2

Face Recognition in Video

This chapter describes the process of face recognition from a theoretical perspective. First, the whole workflow of face detection, object tracking, and face recognition is outlined. Then, several techniques for doing the actual face recognition are described. Finally, the evaluated FR systems are presented.

2.1 Face Recognition Workflow

Face recognition in video streams can be divided into three steps:

1. Face detection (FD) is the act of discovering potential faces in an image frame from the stream. The image is scanned for areas that contain features of a human face: skin-toned color, eyes, nose, mouth, etc.

2. After that, video object tracking is used to follow the identified face objects frame by frame in the video stream. This is done throughout the whole scene, until the video is cut. Then, the detection process must be restarted from the beginning of the next scene.

3. Finally, the actual face recognition (FR) step is performed. The images of the identified faces are matched against a database of known ones. Depending on the size of the database and the number of faces in the scene, this process can range from simple (a single person in the scene, where every person occurring in the video is known) to very complex (hundreds of people, choosing among everyone figuring somewhere in the video archive of a company).

In the Vidispine transcoder, Steps 1 and 3 are performed by one of the FR system plugins, and Step 2 is done by an object tracker via another plugin structure.

As explained in Chapter 1, this thesis focuses on the face recognition step.


Algorithm    Description                                              Sect.
Eigenfaces   Holistic approach, using PCA for dimension reduction.    2.2.1
Fisherfaces  Holistic approach, using (PCA and) LDA for data
             analysis and dimension reduction.                        2.2.2
LBP          Local-feature-based approach, developed from the
             field of texture analysis.                               2.2.3
JM-HMM       Approach based on multiple hidden Markov models.         2.2.4

Table 2.1: Face recognition algorithms that are used by the evaluated FR systems. Each algorithm is described in more detail in its respective subsection.

2.2 Techniques for Face Recognition

The compared FR systems use algorithms developed specifically for, or adapted to, face recognition, some common and some specific to only one product. This section describes the ones used in the evaluated systems, namely

1. the Eigenfaces approach, using Principal Component Analysis (PCA);

2. the Fisherfaces approach, using Linear Discriminant Analysis (LDA);

3. Local Binary Patterns (Histograms) (LBP); and

4. Joint Multiple Hidden Markov Models (JM-HMM).

Each algorithm is described in its respective subsection. Table 2.1 contains a summary. (Below, (· · ·) denotes lists or sequences, and [· · ·] denotes vectors or matrices. Also, boldface variables, such as X and x, represent lists, sequences, vectors, or matrices, in contrast to scalar variables such as X and x.)

2.2.1 Eigenfaces and Principal Component Analysis

The Eigenfaces algorithm, proposed by MIT researchers Matthew Turk and Alex Pentland in 1991 [10], takes a holistic approach to face recognition: holistic in the sense that each face image is looked at as a whole, rather than trying to extract its local features.

Each face image (converted to gray-scale) of resolution p × q pixels corresponds to a vector in a pq-dimensional vector space, the space of all potential faces. If all vector components are uncorrelated, i.e., contain no overlapping information, it is a huge task to store and compare these vectors; e.g., a set of 100 × 100-pixel training images yields a 10 000-dimensional vector space.

The goal of Principal Component Analysis (PCA) is to find the components with the greatest variance, called the principal components, so the original space can be reduced to one of lower dimensionality by converting correlated vector components into uncorrelated ones. This is done with the following steps:

1. Treat the sample images (a set of representative face images, or the face images of the people to train for) as a list of column vectors

   $$X = (\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N),$$

   each vector $\mathbf{x}_i$ of length $K = pq$ corresponding to the pixel values of image $i$.

2. Compute the sample mean vector $\bar{\mathbf{x}}$ of the images:

   $$\bar{\mathbf{x}} = [\bar{x}_j] = \frac{1}{N} \sum_{i \le N} \mathbf{x}_i. \tag{2.1}$$

   Each component $\bar{x}_j$ ($j \le K$) is the mean of the $j$th components of every $\mathbf{x}_i \in X$.

3. Compute the $K \times K$ (sample) covariance matrix $Q$ from each image vector $\mathbf{x}_i$, first subtracting the mean $\bar{\mathbf{x}}$:

   $$Q = [q_{jk}] = \frac{1}{N-1} \sum_{i \le N} (\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})^\top, \tag{2.2}$$

   where each entry $q_{jk}$ is (an estimate of) the covariance between the $j$th and $k$th components of every $\mathbf{x}_i \in X$.

4. Now the task is to find a projection $V^*$ that maximizes the variance of the data:

   $$V^* = \arg\max_V \left| V^\top Q V \right|.$$

   This is equivalent to solving the eigenvalue problem for $Q$:

   $$Q\mathbf{v}_j = \lambda_j \mathbf{v}_j, \quad j \le K,$$

   with eigenvectors $V = (\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_K)$, called eigenfaces as they have the same dimension as the sample images, and their respective eigenvalues $\Lambda = (\lambda_1, \lambda_2, \ldots, \lambda_K)$. $V^*$, then, is the matrix composed of the $L \le K$ eigenfaces with the largest eigenvalues, ordered descending: $V^* = [\mathbf{v}_j]$ ($j \le L$), where $\mathbf{v}_1 \in V^*$ has the highest corresponding eigenvalue, $\mathbf{v}_2 \in V^*$ the next highest, etc.

5. Solving this equation can, however, be infeasible unless very low-resolution face images are used. Suppose, again, that we have a set of $N = 300$ training images, each of size $100 \times 100$ pixels. This corresponds to vectors in $X$ of length $K = 100 \cdot 100 = 10\,000$, in turn meaning that the covariance matrix $Q$ is of size $K \times K = 10\,000 \times 10\,000$.

   If, as in this case, $N < K$, we can use a trick to compute the eigenvectors. First, note that $Q$ can be computed in a different way:

   $$Q = \frac{1}{N-1} Y Y^\top,$$

   where $Y$ is the $K \times N$ matrix composed of column vectors $[\mathbf{y}_i]$, each $\mathbf{y}_i = \mathbf{x}_i - \bar{\mathbf{x}}$ ($i \le N$). The idea is to solve the eigenvalue problem for $R = \frac{1}{N-1} Y^\top Y$ instead of for $Q$:

   $$R\mathbf{u}_j = \lambda_j \mathbf{u}_j, \qquad \frac{1}{N-1} Y^\top Y \mathbf{u}_j = \lambda_j \mathbf{u}_j, \quad j \le N,$$

   a much easier task, as $R$ is only $N \times N = 300 \times 300$ in size. By left-multiplying both sides with $Y$,

   $$\frac{1}{N-1} Y Y^\top Y \mathbf{u}_j = \lambda_j Y \mathbf{u}_j, \qquad Q Y \mathbf{u}_j = \lambda_j Y \mathbf{u}_j,$$

   we note that if $\mathbf{u}_j$ is an eigenvector of $R$, then $\mathbf{v}_j = Y\mathbf{u}_j$ is an eigenvector of $Q$.

6. Finally, the $L$ principal components $\mathbf{x}'$ ($L \le K$) of an image vector $\mathbf{x}$ are given by

   $$\mathbf{x}' = V^{*\top} (\mathbf{x} - \bar{\mathbf{x}}), \tag{2.3}$$

   where $\bar{\mathbf{x}}$ is defined as in Equation (2.1). As expected, the vector $\mathbf{x}'$ is a reduced version of $\mathbf{x}$, with only $L$ components instead of the original $K$.


Now $V^*$ combined with $\bar{\mathbf{x}}$ acts as a basis for the PCA subspace where the comparisons will take place.¹ To initiate a face gallery, all training samples are projected into the subspace using Equation (2.3) (if not already done when computing $V^*$). For face recognition, the query image is also projected into the same space and compared to all samples in the gallery.

Depending on the application, $V^*$ can be computed in advance on a representative set of face images, without knowledge of whom to recognize, or be computed on the actual training set (or a superset of it) when training the FR system.
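The six steps above can be condensed into a short NumPy sketch. The function names and the random toy data below are illustrative, not part of the thesis; a real system would use aligned gray-scale face images as columns of X.

```python
import numpy as np

def train_eigenfaces(X, L):
    """Compute an Eigenfaces basis from training images.

    X: K x N matrix, one flattened gray-scale face per column.
    L: number of principal components to keep (L <= N).
    Returns (V, mean): K x L projection matrix and the K x 1 mean face.
    """
    K, N = X.shape
    mean = X.mean(axis=1, keepdims=True)   # sample mean face, Eq. (2.1)
    Y = X - mean                           # centered data
    # Small-sample trick of Step 5: eigendecompose the small N x N matrix
    # R = Y^T Y / (N - 1) instead of the K x K covariance Q = Y Y^T / (N - 1).
    R = Y.T @ Y / (N - 1)
    lam, U = np.linalg.eigh(R)             # eigenvalues in ascending order
    order = np.argsort(lam)[::-1][:L]      # keep the L largest
    V = Y @ U[:, order]                    # lift back: v_j = Y u_j
    V /= np.linalg.norm(V, axis=0)         # normalize the eigenfaces
    return V, mean

def project(V, mean, x):
    """Principal components of a face vector x, Eq. (2.3)."""
    return V.T @ (x - mean.ravel())

# Tiny demo: 6 random "images" of 64 pixels, reduced to 3 components.
rng = np.random.default_rng(0)
X = rng.random((64, 6))
V, mean = train_eigenfaces(X, 3)
q = project(V, mean, X[:, 0])
```

Gallery enrollment and recognition then reduce to projecting every image with `project` and comparing the resulting low-dimensional vectors, e.g., by Euclidean distance.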

2.2.2 Fisherfaces and Linear Discriminant Analysis

Even though the Eigenfaces approach seems reasonable, it has deficits: as it does not take into account which class (i.e., which person) a given face belongs to, some information is lost.

The Fisherfaces algorithm was proposed by Peter Belhumeur, João Hespanha, and David Kriegman in 1996 [2]. Instead of using PCA for dimension reduction, it uses the related Linear Discriminant Analysis (LDA), which tries to find a projection of the face data that both maximizes the between-class variance and minimizes the within-class variance.

The algorithm outline is as follows:

1. As in Section 2.2.1, let $X$ be a list of $p \times q$-pixel sample images, each image represented as a column vector of length $K = pq$. However, this time $X$ is partitioned into $C$ classes, each class representing a person:

   $$X_j = (\mathbf{x}_{j1}, \mathbf{x}_{j2}, \ldots, \mathbf{x}_{jN_j}), \quad N_j = |X_j|, \quad j \le C,$$
   $$X = X_1 \,\|\, X_2 \,\|\, \cdots \,\|\, X_C.$$

   Here, $\cdot\|\cdot$ denotes list concatenation, e.g., $(a_1, a_2) \| (b_1) = (a_1, a_2, b_1)$.

2. Compute the per-class sample mean vectors $(\bar{\mathbf{x}}_1, \bar{\mathbf{x}}_2, \ldots, \bar{\mathbf{x}}_C)$ and the total sample mean vector $\bar{\mathbf{x}}$:

   $$\bar{\mathbf{x}}_j = \frac{1}{N_j} \sum_{i \le N_j} \mathbf{x}_{ji}, \quad j \le C,$$
   $$\bar{\mathbf{x}} = \frac{1}{N} \sum_{j \le C,\, i \le N_j} \mathbf{x}_{ji}, \quad N = |X|.$$

¹ For completeness' sake, the formula for reconstructing a projected face image $\mathbf{x}'$ from the PCA subspace into its original form $\mathbf{x}$ is $\mathbf{x} = V^* \mathbf{x}' + \bar{\mathbf{x}}$, where $\bar{\mathbf{x}}$ is defined as in Equation (2.1). Note the symmetry with Equation (2.3).


3. Compute the between-class scatter matrix $S_B$ and the within-class scatter matrix $S_W$ (both $K \times K$):

   $$S_B = \sum_{j \le C} N_j (\bar{\mathbf{x}}_j - \bar{\mathbf{x}})(\bar{\mathbf{x}}_j - \bar{\mathbf{x}})^\top,$$
   $$S_W = \sum_{j \le C,\, i \le N_j} (\mathbf{x}_{ji} - \bar{\mathbf{x}}_j)(\mathbf{x}_{ji} - \bar{\mathbf{x}}_j)^\top,$$

   measuring (estimates of) the non-normalized covariance between and within classes, respectively. I.e., they are not, as in Equation (2.2), normalized by the number of classes $C$ or samples $N$, respectively.

4. The final task is to find a projection $W^*$ that separates the classes as much as possible while keeping the within-class variance as low as possible, formally

   $$W^* = \arg\max_W \frac{\left| W^\top S_B W \right|}{\left| W^\top S_W W \right|}.$$

   The solution is found by solving the following generalized eigenvalue problem for eigenvectors $(\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_K)$ with respective eigenvalues $(\lambda_1, \lambda_2, \ldots, \lambda_K)$:

   $$S_B \mathbf{v}_k = \lambda_k S_W \mathbf{v}_k, \quad k \le K,$$
   $$S_W^{-1} S_B \mathbf{v}_k = \lambda_k \mathbf{v}_k.$$
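The scatter matrices and the eigenvalue problem above can be sketched in a few lines of NumPy. The `eps` regularization of S_W is an addition of this sketch, needed to keep the toy example invertible; the actual Fisherfaces method instead first reduces dimensionality with PCA so that S_W becomes non-singular. The toy data is made up.

```python
import numpy as np

def fisherfaces(X, labels, eps=1e-6):
    """Fisher discriminant directions for column-vector samples X.

    X: K x N data matrix; labels: length-N array of class labels.
    Returns (W, lam): eigenvectors of S_W^{-1} S_B as columns of W,
    sorted by descending eigenvalue lam.
    """
    K, N = X.shape
    mean = X.mean(axis=1, keepdims=True)
    SB = np.zeros((K, K))
    SW = np.zeros((K, K))
    for c in np.unique(labels):
        Xc = X[:, labels == c]
        mc = Xc.mean(axis=1, keepdims=True)
        SB += Xc.shape[1] * (mc - mean) @ (mc - mean).T  # between-class scatter
        SW += (Xc - mc) @ (Xc - mc).T                    # within-class scatter
    lam, W = np.linalg.eig(np.linalg.solve(SW + eps * np.eye(K), SB))
    order = np.argsort(lam.real)[::-1]
    return W[:, order].real, lam[order].real

# Two well-separated 2-D classes; the top direction should separate them.
X = np.array([[0., 1., 0., 5., 6., 5.],
              [0., 0., 1., 5., 5., 6.]])
labels = np.array([0, 0, 0, 1, 1, 1])
W, lam = fisherfaces(X, labels)
```

With C classes, S_B has rank at most C - 1, so only the first C - 1 eigenvalues are meaningfully non-zero; in this two-class demo a single direction carries all the discriminative information.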

2.2.3 Local Binary Patterns

Instead of the holistic approach of Eigenfaces and Fisherfaces, more recent FR research has moved away from such global reasoning, focusing instead on the local features in face images. One such algorithm, Local Binary Patterns (Histograms) (LBP), was proposed by Timo Ahonen, Abdenour Hadid, and Matti Pietikäinen of the University of Oulu in 2004 [1].

It has its roots in 2D texture analysis, a subfield of computer vision. Instead of trying to make sense of the whole image at once, we only encode how each pixel differentiates itself from its neighbors.

The algorithm follows these steps:

1. Each $p \times q$-pixel face image is represented as a matrix $\mathbf{x}$ of the same size, and for each center pixel position $c = (x_c, y_c)$ with neighboring positions $(p_1, p_2, \ldots, p_P)$, its local binary pattern is computed as

   $$\mathrm{LBP}(c) = \sum_{0 \le i < P} 2^i \cdot \mathrm{sgn}(I(p_i) - I(c)),$$

   where $P$ is the number of neighbors of $c$, $I(p) = I(x, y)$ is the intensity at image position $p = (x, y)$, and

   $$\mathrm{sgn}(x) = \begin{cases} 1 & \text{if } x \ge 0, \\ 0 & \text{if } x < 0. \end{cases}$$

2. An LBP operator is used to compute which positions constitute the neighborhood of a given center pixel. The original (and simplest) one is exemplified in Figure 2.1, but there are newer, more refined operators. One can use, e.g., the extended LBP operator:

   $$p_i = (x_c, y_c) + R \cdot \left( \cos\frac{2\pi i}{P},\; -\sin\frac{2\pi i}{P} \right), \quad 0 \le i < P,\; R \ge 1, \tag{2.4}$$

   where each position $p_i$ is a neighbor of the center position $c = (x_c, y_c)$, $R$ is the radius of the neighborhood circle, and $P$ the number of sample points on the circle. The idea, then, is to vary the value of $R$ to encode features of varying scale. As $p_i$ in general will not be an integer pixel position, its intensity must be interpolated from surrounding pixels using, e.g., bilinear or nearest-neighbor interpolation.²

3. The final feature vector for the image is computed by

   a) dividing the image matrix into fixed-size cells, e.g., 16 × 16 pixels;

   b) computing a local histogram over each cell; and

   c) concatenating the local histogram data.

Feature vectors can then be compared, matching a query image to a gallery of faces.
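The per-pixel code and the per-cell histograms can be sketched as follows. Note that the bit ordering and the neighbor traversal order are conventions, chosen here for illustration, so exact code values differ between implementations; only the original 3 × 3 operator is shown, not the extended circular one of Equation (2.4).

```python
import numpy as np

def lbp_image(img):
    """Original 3x3 LBP code for each interior pixel of a gray-scale image."""
    img = np.asarray(img, dtype=int)
    h, w = img.shape
    out = np.zeros((h - 2, w - 2), dtype=int)
    # Clockwise neighbor offsets starting at the top-left corner (a convention).
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            c = img[y, x]
            code = 0
            for bit, (dy, dx) in enumerate(offsets):
                if img[y + dy, x + dx] >= c:   # sgn(I(p_i) - I(c))
                    code |= 1 << bit
            out[y - 1, x - 1] = code
    return out

def lbp_histogram(img, cell=16):
    """Concatenated per-cell histograms: the final LBP feature vector."""
    codes = lbp_image(img)
    feats = []
    for y in range(0, codes.shape[0], cell):
        for x in range(0, codes.shape[1], cell):
            block = codes[y:y + cell, x:x + cell]
            feats.append(np.bincount(block.ravel(), minlength=256))
    return np.concatenate(feats)
```

Matching then compares two such histogram vectors, e.g., with a chi-squared or histogram-intersection distance.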

2.2.4 Joint Multiple Hidden Markov Models

Hidden Markov Models (HMMs) have been used in pattern recognition for, e.g., speech and gesture recognition. A variant for face recognition was proposed by Professor Hung-Son Le [5, 6], called Joint Multiple Hidden Markov Models (JM-HMM).

[Figure 2.1: Illustration of the original LBP operator. $\mathbf{x}$ is a face image, represented as a two-dimensional matrix; each 3 × 3 neighborhood is thresholded against the intensity of its center pixel ($I(p_i) \ge I(c)$?), and the resulting bits form the code, here $\mathrm{LBP}(c) = 0011\,0001_2$. The operator can be expressed formally as $p_i = (x_c, y_c) + \lfloor (\cos\frac{2\pi i}{8},\, -\sin\frac{2\pi i}{8}) \rceil$, $0 \le i < 8$, where each position $p_i$ is a pixel neighboring the center pixel $c = (x_c, y_c)$ and $\lfloor\cdot\rceil$ denotes rounding to the nearest integer. Adapted from Equation (2.4).]

² Using matrix notation, bilinear interpolation is defined as

$$I(x, y) \approx \begin{bmatrix} x_2 - x & x - x_1 \end{bmatrix} \begin{bmatrix} I(x_1, y_1) & I(x_1, y_2) \\ I(x_2, y_1) & I(x_2, y_2) \end{bmatrix} \begin{bmatrix} y_2 - y \\ y - y_1 \end{bmatrix},$$

or more compactly as

$$I(x, y) \approx \begin{bmatrix} 1 - x & x \end{bmatrix} \begin{bmatrix} I(0, 0) & I(0, 1) \\ I(1, 0) & I(1, 1) \end{bmatrix} \begin{bmatrix} 1 - y \\ y \end{bmatrix}$$

if we use the unit-square coordinate system where every point is translated by $(-x_1, -y_1)$.

An HMM models the observable outcomes of a hidden stochastic process; i.e., a random process whose state is not directly observable is observed indirectly through another process conditioned on the first. As illustrated in Figure 2.2, an HMM $\Lambda = (\Pi, A, B)$ is defined by

1. the initial state probabilities $\Pi = [\pi_i]$, each element $\pi_i$ the probability of starting in state $s_i$;

2. its states $S = \{s_1, s_2, \ldots, s_N\}$ with transition probability matrix $A = [a_{ij}]$, each entry $a_{ij}$ the probability of transitioning to state $s_j$ when in state $s_i$; and

3. the set of observation symbols $V = \{v_1, v_2, \ldots, v_M\}$, with emission probability matrix $B = [b_{jk}]$, each entry $b_{jk}$ the probability of emitting the symbol $v_k$ when in state $s_j$.

A corresponding HMM process, then, is a sequence of emitted observation symbols $O = (o_1, o_2, \ldots, o_T)$ and a corresponding sequence of hidden states $Q = (q_1, q_2, \ldots, q_T)$, each observation $o_t$ emitted in state $q_t$.

In the JM-HMM algorithm, two problems are to be solved: a) computing the parameters $\Pi$, $A$, and $B$ of a model $\Lambda$ that maximizes the probability of emitting a set of training observations $(O_1, O_2, \ldots, O_N)$, and b) evaluating the probability that a given observation sequence $O$ was produced by a given model $\Lambda$.

[Figure 2.2: Illustration of an HMM process, where the hidden state sequence $Q = (q_1, q_2, \ldots, q_T)$, $q_t \in S = \{s_1, \ldots, s_N\}$, is observed as a sequence of emitted symbols $O = (o_1, o_2, \ldots, o_T)$, $o_t \in V = \{v_1, \ldots, v_M\}$, each transition $q_t \to q_{t+1}$ and observation $q_t \to o_t$ occurring with probabilities defined by the underlying model.]

Identifier     Name              Comments                         Sect.
wawo           Wawo              A research spin-off company,     2.3.1
                                 based on research from Umeå
                                 University.
openbr         OpenBR            Collaborative research project   2.3.2
                                 supported by the MITRE
                                 Corporation.
opencv{1,2,3}  OpenCV's facerec  Experimental contribution to     2.3.3
                                 OpenCV's contrib package.

Table 2.2: Face recognition systems that are part of the comparison. The identifier is used when presenting the results. Each system is described in more detail in its respective subsection.
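Of the two HMM problems, the evaluation problem (b) is classically solved with the forward algorithm. The thesis does not spell out JM-HMM's internals, so the following NumPy sketch shows only the standard single-HMM recursion, with a made-up toy model; the training problem (a) is typically handled with Baum-Welch and is not shown.

```python
import numpy as np

def forward_probability(pi, A, B, obs):
    """P(O | Lambda) for a discrete HMM, computed with the forward algorithm.

    pi:  initial state probabilities, shape (N,)
    A:   transition matrix, A[i, j] = P(next state s_j | state s_i)
    B:   emission matrix, B[j, k] = P(symbol v_k | state s_j)
    obs: observation symbol indices o_1 .. o_T
    """
    alpha = pi * B[:, obs[0]]            # alpha_1(i) = pi_i * b_i(o_1)
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]    # sum over previous states, then emit
    return float(alpha.sum())            # P(O | Lambda) = sum_i alpha_T(i)

# Toy two-state model with two observation symbols.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],
              [0.2, 0.8]])
p = forward_probability(pi, A, B, [0, 1, 0])
```

As a sanity check, summing the result over all possible observation sequences of a fixed length gives exactly 1.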

2.3 Evaluated Face Recognition Systems

This thesis aims to compare the three FR systems summarized in Table 2.2. Their theoretical foundations, design, and public APIs are described here.


2.3.1 Wawo

The Wawo FR library and application is developed by the Umeå-based company Wawo Technology AB³. The company was co-founded by Prof. Le in 2008 as a way to turn his HMM-based approach [6] to FR into a product.

2.3.2 OpenBR

The OpenBR project⁴ (Open Source Biometrics Recognition) is an effort to develop a library of open algorithms in the area of biometrics, as well as a framework to utilize and evaluate them. The evaluated version is 0.2, released February 23, 2013.

OpenBR's API is based on the pipes-and-filters architectural pattern, where each algorithm, called a transform, is either composed of a pipeline of sub-transforms or is an actual implementation. The base objects that are manipulated in a pipeline are (lists of) matrices, called templates; each transform step takes a matrix or list of matrices and modifies it in some way, giving its output as input to the next step. A template also stores a map of parameter values that can be used for one-off data related to, but not part of, the matrix or matrices.

For example, the FaceRecognition transform is expanded into

FRRegistration → FRExtraction → FREmbedding → FRQuantization,

which further expands into

[ASEFEyes → Affine → DFFS] →
[Mask → {DenseSIFT, DenseLBP} → ProjectPCA → Normalize] →
[RndSubspace → ProjectLDA → ProjectPCA] →
[Normalize → Quantize].

(A → B means that the result from A is piped to B as input; {A, B} that A and B are performed independently on the same input, their respective results concatenated as output. Some details are removed for clarity.) Worth noting is that the transforms are not static; while manipulating templates, they at the same time learn internal parameters.
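The pipes-and-filters idea can be illustrated with a small sketch. This is not OpenBR's actual C++ API, just the pattern: transforms compose into pipelines, and a fork runs transforms independently on the same input and concatenates the results. All names and the toy data are made up.

```python
class Transform:
    """A pipeline step that transforms a template (here, a list of values)."""
    def __init__(self, fn, name=""):
        self.fn, self.name = fn, name

    def __call__(self, template):
        return self.fn(template)

    def __rshift__(self, other):
        # A >> B pipes A's output into B, mirroring "A -> B".
        return Transform(lambda t: other(self(t)),
                         f"{self.name} -> {other.name}")

def fork(*transforms):
    """{A, B}: run transforms independently on the same input and
    concatenate their results."""
    return Transform(lambda t: sum((tr(t) for tr in transforms), []),
                     "{" + ", ".join(tr.name for tr in transforms) + "}")

# Toy steps standing in for e.g. Mask, DenseSIFT/DenseLBP, Normalize.
mask = Transform(lambda t: [v for v in t if v >= 0], "Mask")
double = Transform(lambda t: [2 * v for v in t], "Double")
negate = Transform(lambda t: [-v for v in t], "Negate")

pipeline = mask >> fork(double, negate)
result = pipeline([3, -1, 2])   # Mask drops -1, then Double and Negate fork
```

The "learning" aspect mentioned above would correspond to each `Transform` updating internal state as templates flow through it, which this sketch omits.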

As hinted by the transform names, OpenBR uses a combination of three of the described FR techniques:

³ http://www.wawo.com/

⁴ http://openbiometrics.org/


1. Eigenfaces, with the use of PCA;

2. Fisherfaces, with the use of LDA; and

3. LBP.

These are then weighted together when doing the face recognition.

2.3.3 OpenCV's Face Recognition Library

OpenCV⁵ (Open Source Computer Vision) is a software library for computer vision and machine learning, as well as a community surrounding its development. Recent versions of OpenCV contain an experimental library for face recognition in the contrib module. The version evaluated is 2.4.3, released November 2, 2012.

The facerec library contains three separate FR implementations:

1. EigenFaceRecognizer, implementing Eigenfaces and PCA;

2. FisherFaceRecognizer, implementing Fisherfaces and LDA; and

3. LBPHFaceRecognizer, implementing LBP Histograms.

The user selects which one to use through a factory pattern. All recognizers need gray-scale face images for training and classification, and the first two recognizers additionally require that images are normalized to a fixed pixel resolution.

In the comparison, opencv1 identifies the FisherFaceRecognizer classifier, opencv2 identifies LBPHFaceRecognizer, and opencv3 identifies EigenFaceRecognizer.

⁵ http://opencv.org/


Chapter 3 Comparison

This chapter describes how the comparison of the FR systems was done; more specifically, which aspects were evaluated, what test data was used, and how the systems were compared.

3.1 Criteria for Evaluation

When evaluating the different FR systems, comparison criteria must be chosen. To make the comparison fair, all systems were given the same set of detected faces in the frames of each test video. Instead of each system using its own face detection (if available at all), they were fed the actual recorded faces from the truth data of the test videos. Likewise, all tracking of face objects was disabled, so that it would not interfere with the results.

The evaluated aspect of the FR systems was the number of correct vs. incorrect classifications of identified faces, summed over all video frames. In discussion with CodeMill, aspects such as training and classification time were deemed less important. The statistical methods for weighing correct and incorrect classifications against each other are presented in Section 3.3.

3.2 Test Material

The comparison used several different test videos, listed in Table 3.1 and exemplified in Figure 3.1. They were prepared by extracting face rectangles from each video frame automatically with OpenCV's face detector, then manually assigning the correct identity (person) to each detected face. False positives (incorrect faces) were removed and false negatives (missing faces) inserted, so that the test material would correspond with reality.

The music quiz show Så ska det låta (b), a long-running entertainment series on Swedish television, was chosen for its complexity: there are many scene changes and camera rolls, and several people occur in different poses and constellations, which makes FR difficult. On the other hand, the locally recorded


Identifier       Description                         Length       #P
codemill         Video recorded at CodeMill using    0:38 @ 7.5   3
                 a webcam. 1-3 people in picture
                 at a time; camera still.
SA_SKA_DET_LATA  First part of an episode from       8:02 / 0:30  7 / 5
                 this year's season of the             @ 25
                 Swedish-TV music quiz Så ska det
                 låta. Multiple people in picture;
                 a lot of camera movements.

Table 3.1: The test videos used in the comparison. The Length column gives the length and frame rate of the video in the format m:s @ fps; #P is the number of identifiable people appearing in it. Table A.1(a) in the appendix contains more details.

[Figure 3.1: Sample frames from the test videos: (a) codemill frames (© CodeMill AB, used with permission); (b) SA_SKA_DET_LATA frames (Så ska det låta, © Sveriges Television, used with permission).]


webcam video (a) constitutes a simpler case, while not being trivial: the camera is still, but people are constantly moving in and out of the picture, 1-3 at a time.

Due to technical difficulties with one of the FR systems, a shorter version of the Så ska det låta video was also prepared.

Table A.1(b), (c) in the appendix contains detailed information on how the identities are distributed, i.e., the class skew.

3.3 Introduction to ROC

This section gives an introduction to tables and matrices of confusion, true- and false-positive rates, and Receiver Operating Characteristic (ROC) plots. It can be skipped if the reader already has an understanding of these concepts. It is based on the excellent introductory article by Fawcett [3].

3.3.1 Tables of Confusion

Tables of confusion, as illustrated in Figure 3.2, are used as a statistical tool when visualizing data from binary classification experiments. E.g., in testing whether a patient has a given disease, the person is classified positive if he or she is judged to have the disease, negative otherwise. This may or may not correspond with the truth: the actual condition is positive if the patient really has the disease, negative if not.

A table of confusion summarizes the test results into four categories, depending on actuality and classification:

TP: the number of true-positive samples, correctly classified as positive;

FP: the number of false-positive ones, incorrectly classified as positive;

FN: the number of false-negative ones, incorrectly classified as negative;

TN: the number of true-negative ones, correctly classified as negative.

These values are in turn summed up row- and column-wise:

$$P = TP + FN, \quad P' = TP + FP, \quad N = FP + TN, \quad N' = FN + TN.$$

Hence, P is the total number of positive samples and N the total number of negative ones, whereas P' and N' are the total numbers of positively and negatively classified samples, respectively.


Chapter 3 Comparison

                 classified         total
    actual        TP     FN           P
                  FP     TN           N
    total         P′     N′

Figure 3.2: Illustration of a table of confusion, summarizing the result of a binary classification experiment into four categories, depending on actuality and classification (rows vs. columns).

3.3.2 ROC Plots

Tables of confusion can be visualized in Receiver Operating Characteristic plots (roc plots), as seen in Figure 3.3(a), using the metrics true-positive rate TPR and false-positive rate FPR.¹ They are defined as

   TPR = TP / (TP + FN) = TP / P ,   FPR = FP / (FP + TN) = FP / N ,   (3.1)

and the (FPR, TPR) pair of each table corresponds to a point in the plot.

When interpreting a roc plot, points closer to the corner point (0, 1) and farther away from the line of no discrimination y = x (the dashed diagonal line) are better, as they indicate a higher TPR (more correct classifications) and/or a lower FPR (fewer incorrect classifications).

By varying the threshold parameter of a binary classifier we get multiple tables (one for each threshold value); plotting these as a curve shows the trade-off between high TPR and low FPR.

3.3.3 Confusion Matrices

In fr, the interest lies in classifying a given face into one of multiple identities, or classes. Therefore, the table of confusion concept must be generalized to more than the two classes positive and negative.

A confusion matrix contains the same information for N classes as a table of confusion does for the binary case. As shown in Figure 3.4(a), each entry cjk (j, k ≤ N) is the number of samples of actual class j that are classified as belonging to class k. Hence, the N diagonal entries indicate correct classifications, while all other N² − N entries indicate errors.

¹TPR is equivalent to sensitivity, and FPR is equivalent to (1 − specificity).


(a) roc curve. (b) Area under the roc curve in (a).

Figure 3.3: Example of roc plots. The horizontal axis shows the false-positive rate FPR and the vertical axis the true-positive rate TPR. The curve is the roc curve of an example classifier, each point corresponding to a different threshold value. The filled area is the auc metric of the given example curve.

                      classified                total
    actual     c11   c12   · · ·   c1N           C1
               c21   c22   · · ·   c2N           C2
                ⋮     ⋮              ⋮             ⋮
               cN1   cN2   · · ·   cNN           CN
    total      C′1   C′2   · · ·   C′N

(a) Confusion matrix for an N-class classification experiment.

                 classified         total
    actual       TPi    FNi           Pi
                 FPi    TNi           Ni
    total        P′i    N′i

(b) Class reference table of confusion corresponding to (a).

Figure 3.4: Illustration of a confusion matrix, summarizing the result of an N-class classification experiment into a square matrix, and the corresponding class reference table of confusion (pairwise comparison between each class i and all others).


The row sums C1, . . . , CN, then, are the actual number of samples of each class, and the column sums C′1, . . . , C′N are the number of samples classified as each class. Formally:

   Ci = ∑_{k≤N} cik ,   C′i = ∑_{j≤N} cji ,   i ≤ N .

The confusion matrix in itself is hard to visualize because of its high dimensionality. One method to handle this is called class reference formulation, where each class i is compared to all other classes (i ≤ N) [3, s. 9]. As illustrated in Figure 3.4(b), for each class, the matrix is reduced to a table of confusion where the original class i corresponds to the positive class and all other classes form the negative class, formally

   TPi = cii ,   FNi = ∑_{k≤N, k≠i} cik ,   FPi = ∑_{j≤N, j≠i} cji ,   TNi = ∑_{j,k≤N, j,k≠i} cjk ,   (3.2)

and

   Pi = Ci ,   P′i = C′i ,   Ni = ∑_{j≤N, j≠i} Cj = ∑_{j,k≤N, j≠i} cjk ,   N′i = ∑_{k≤N, k≠i} C′k = ∑_{j,k≤N, k≠i} cjk .   (3.3)

3.3.4 Area under ROC Curves

When comparing roc curves, it is advantageous to reduce the two-dimensional plot to a single scalar value. One common method is to compute the area under the curve (auc), illustrated in Figure 3.3(b) [3, s. 7]. The auc is always between 0 and 1, as the roc curve is contained in the unit square. What is more, a usable classifier should have an area greater than 0.5, as anything less would indicate worse performance than pure luck.² The auc value is computed from the (FPR, TPR) pairs of the roc curve, using the trapezoidal rule or any other method for numerical integration.

²Such a badly performing classifier could simply be inverted to produce better-than-random results.


Reasonably, one would like to have a single metric like the auc also for the multiclass case. A straightforward way is to compute the auc for each class independently (using class reference formulation). Combining these using a weighted mean then gives a simple yet reasonable global auc metric:

   AUC = ∑_{i≤N} Pi · AUCi ,   (3.4)

where AUCi is the area under the roc curve with reference class i and Pi ∈ [0, 1] is the distribution of class i (∑_{i≤N} Pi = 1).

3.4 Method

This section describes the comparison method for one test case; the method is then repeated for each test case.

Each test case, one video file, contains a set of identified faces F, each with an actual identity from the set of identities P, defined as

   P = {ID1, ID2, . . . , IDN} ∪ {ID} ,

where N is the number of (identifiable) people occurring in the video. The identity ID is special in that it has two meanings: either the face does not belong to any of the people 1, . . . , N, or the system cannot tell whether it does or not.

We also define the following subsets of P for convenience later on:

   Pi = P \ {i} ,   P = P \ {ID} .

For each fr system and test case, the method is as follows:

1. Each detected face in F is classified by the fr system as belonging to an identity in P. The classification data is summarized in a confusion matrix A = [γjk], as defined in Section 3.3.3: each γjk (j, k ∈ P) is the number of times the fr system classifies a face of actual identity j as having identity k.

2. For each identity i ∈ P, we then construct a table of confusion Ai using the class reference method described in the same section: the positive class corresponds to identity i, while the negative class corresponds to the identities Pi. By Equation (3.2) we have

      Ai = [ TPi  FNi ; FPi  TNi ] = [ γii  ∑_{k∈Pi} γik ; ∑_{j∈Pi} γji  ∑_{j,k∈Pi} γjk ] ,

   and, by Equation (3.3), the corresponding row sums are

      Pi = ∑_{k∈P} γik ,   Ni = ∑_{j∈Pi, k∈P} γjk .

   Thus, by Equation (3.1), the true- and false-positive rates for each identity i are

      TPRi = γii / ∑_{k∈P} γik ,   FPRi = ∑_{j∈Pi} γji / ∑_{j∈Pi, k∈P} γjk .

3. The fr system has a classification threshold value, and varying it gives multiple (FPRi, TPRi) pairs. Plotting each TPR value (y-axis) against its FPR value (x-axis) produces the roc curve for reference class i.

4. Finally, for each identity i, the auc value AUCi of the roc curve is computed, as described in Section 3.3.4, as well as the mean auc value AUC following Equation (3.4).
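The per-identity reduction in step 2 can be sketched as follows; the helper name and the example confusion matrices are my own, not the thesis code:

```python
# Reduce each per-threshold confusion matrix [γjk] to one
# (FPRi, TPRi) point for a chosen identity i, as in step 2 above.

def roc_for_identity(matrices, i):
    """One (FPRi, TPRi) point per confusion matrix, for identity i."""
    points = []
    for m in matrices:
        n = len(m)
        tp = m[i][i]
        pos = sum(m[i])                                   # Pi: row sum
        fp = sum(m[j][i] for j in range(n) if j != i)
        neg = sum(sum(m[j]) for j in range(n) if j != i)  # Ni
        points.append((fp / neg if neg else 0.0,
                       tp / pos if pos else 0.0))
    return points

# Two thresholds, three identities (the last acting as None/Unknown):
m_strict  = [[3, 0, 2], [0, 2, 1], [0, 0, 4]]
m_lenient = [[5, 0, 0], [1, 3, 0], [2, 1, 1]]
print(roc_for_identity([m_strict, m_lenient], 0))  # -> [(0.0, 0.6), (0.375, 1.0)]
```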


Chapter 4 Results

This chapter shows the results from the comparison of the fr systems. For each test video and fr system, two items are included:

1. roc plots for one of the identities vs. all other,

2. the mean auc value, computed from the individual auc values.

Figure 4.2 and Table 4.1(a) show the results for the test video codemill, while Figures 4.3 and 4.4 and Table 4.1(b) show the results for SA_SKA_DET_LATA. Also, Figure 4.1 shows examples of the rendering of classified faces in the original video.

More detailed results can be found in Appendix A: see each figure or table caption for specific references.

4.1 Problems

A big setback was the fact that the wawo system kept crashing when running the long version of the SA_SKA_DET_LATA video. To remedy this, a shorter version of the same video was also used in the comparison; this way, at least some results are available for wawo in this setting.

Also, the openbr system refused to accept multiple training images, rendering the computed metrics a bit misleading. The auc values marked with * are computed by classifying those identities without any accepted training images as None/Unknown. I.e., from the perspective of the fr system, these identities do not exist and should not be counted when computing auc values (and roc curves).

4.2 Interpretation

Surprisingly, for the codemill test case, OpenCV's Eigenfaces implementation (opencv3) performed best. Compared with the results from the other test video, this seems like a fluke.


   fr system   Mean auc
   wawo        0.65
   openbr      0.62
   opencv1     0.61
   opencv2     0.49
   opencv3     0.67

(a) codemill

                Mean auc
   fr system    long           short
   wawo         –              0.85
   openbr       0.64 / 0.88*   0.82 / 0.94*
   opencv1      0.54           0.70
   opencv2      0.52           0.57
   opencv3      0.53           0.71

(b) SA_SKA_DET_LATA

Table 4.1: auc results. A detailed version is found in Table A.2 in the appendix, containing the auc results also for each identity compared to all others. *Disregarding identities with all face images failed.

For the long version of SA_SKA_DET_LATA, openbr performed best, even though the missing result from wawo makes the conclusion uncertain. However, looking at the results for the shorter version, wawo has the highest mean auc value. As the results look quite similar to those for the longer version, a reasonable assumption could be that wawo would have performed best for both test cases.


© CodeMill AB, used with permission.

(a) Successful (correct id.) classification. (b) Failed (None/Unknown id.) classification.

Figure 4.1: The codemill sample frames from Figure 3.1(a), after doing fr.

(a) Identity 0 vs. all other. (b) Identity 1 vs. all other.

Figure 4.2: A selection of roc plots for codemill. The rest of the plots are found in Figure A.1 in the appendix.


(a) Identity 0 vs. all other. (b) Identity 1 vs. all other.

Figure 4.3: A selection of roc plots for SA_SKA_DET_LATA-long. The rest of the plots are found in Figures A.2 and A.3 in the appendix.

(a) Identity 0 vs. all other. (b) Identity 1 vs. all other.

Figure 4.4: A selection of roc plots for SA_SKA_DET_LATA-short. The rest of the plots are found in Figures A.4 and A.5 in the appendix.


Chapter 5 Conclusions

The project work was not without problems. One third into the project, I discovered that I had not received all relevant code for CodeMill's fr application, making some of the work redundant or misfit. Also, when doing the test runs to generate comparison data, I noticed that the original fr system (wawo) kept crashing on the longer test video. Thus, I had to rethink my test cases, also using a shorter version of that video.

Because of this, my initial time plan was revised for the later half of the project. Among the changes, the task of improving one of the evaluated fr systems was removed from the plan. Apart from that, my initial goal was met: I evaluated and compared alternatives to wawo. However, the performance of CodeMill's application was not improved, as the evaluated systems did not necessarily do a better job.

5.1 Limitations

The work with and the results of the comparison have some limitations:

– As can be seen in some of the roc plots (e.g., in Figure A.4(d) in the appendix), the curves do not always reach in an arc from (0, 0) to (1, 1), as an ideal roc curve should. Or they are simply missing.

As a consequence, an assumption has been made when computing the auc values: implicit endpoints (0, 0) and (1, 1) have been inserted when missing. The effect of this is that, e.g., completely empty curves will receive an auc value of 0.5.

One probable explanation for these incomplete curves is related to the effect of the fr system threshold parameters: varying them does not change the ranking among the individual identities, only whether the highest-ranked one should be output as classification or not. This means that for some of the roc curves we might never receive a pair (FPRi, TPRi) = (1, 1), as at least some face of identity i is never classified correctly, regardless of threshold value.

– This approach to calculating roc curves and auc values is not the most efficient one. E.g., Fawcett [3] presents algorithms for roc and auc computations. However, they require that the classifier is able to output some kind of probability for the classifications, rather than just a binary decision. The fr systems openbr and opencv had this possibility but not wawo, making this approach impossible.

Given more accessible source code, one could modify most classifiers to report the probability in addition to the classification. However, the available code for the wawo system was not behaving correctly, so it was not used (or modified).

Solving the first point would require a more in-depth study of multi-class Receiver Operating Characteristics. Maybe more parameters than a single threshold must be introduced.
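The endpoint assumption described in the first limitation can be sketched as follows (my own code, not the comparison scripts): implicit (0, 0) and (1, 1) points are inserted before trapezoidal integration, so an empty roc curve receives an auc of exactly 0.5.

```python
# Insert implicit endpoints (0, 0) and (1, 1) when missing, then
# integrate the (FPR, TPR) points with the trapezoidal rule.

def with_endpoints(points):
    """Sorted copy of points with (0, 0) and (1, 1) added if missing."""
    pts = sorted(points)
    if not pts or pts[0] != (0.0, 0.0):
        pts.insert(0, (0.0, 0.0))
    if pts[-1] != (1.0, 1.0):
        pts.append((1.0, 1.0))
    return pts

def auc(points):
    pts = with_endpoints(points)
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

print(auc([]))                      # empty curve -> 0.5
print(round(auc([(0.2, 0.9)]), 2))  # single point -> 0.85
```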

5.2 Future work

The following points are areas of improvement for fr in video in general and CodeMill's fr application specifically:

– In combination with face tracking, use a profile face detector in combination with the frontal detector, and aggregate the results. That way, the end fr result would hopefully improve, as the object tracker could follow faces even when people turn their heads sideways and back.

– Also related to face tracking, do face recognition on the tracked face objects continuously, not only on the statically detected faces. This should yield more information to integrate into the end result.

– Evaluate true video-based face recognition, such as the work of Gorodnichy [4]. In this method, the classifier directly uses the temporal information inherent in video streams, both for training and classification, instead of looking at each video frame independently.

– Explore continuous learning, where a face classifier receives ongoing corrections to its classifications from a user. As soon as a classified face has been cleared as correct, it can be used for training. A face that is mis-classified could also be used as a counter-example, if the fr algorithm supports it.


The first two points are just minor adjustments to the core workflow described in Section 2.1. The third one, however, constitutes a different approach to fr in video and would require major changes to the Vidispine architecture and CodeMill's application. The fourth point would require another approach to storing and handling training data (face galleries).


Acknowledgements

It has been a great opportunity to write my Master's thesis at CodeMill this semester. Thanks go to Petter Edblom for giving me this project, and to Tomas Härdin, my supervisor there, for answering my questions about the gory system details and for being a tech wizard in general.

Also, I would like to thank Frank Drewes, my supervisor at the Department of Computing Science, for being supportive and positive when I have hit difficulties, as well as Linus Nilsson, who did his thesis project before me at CodeMill and whose comparison scripts provided inspiration.

Last but not least, a heartfelt thank you to Linnéa Carlberg for her love and encouragement, and for believing in me.


References

[1] Timo Ahonen, Abdenour Hadid, and Matti Pietikäinen. Face recognition with local binary patterns. In Computer Vision – ECCV 2004, pages 469–481. Springer, 2004. URL http://link.springer.com/chapter/10.1007/978-3-540-24670-1_36.

[2] Peter N. Belhumeur, João P. Hespanha, and David J. Kriegman. Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection. In Computer Vision – ECCV 1996, pages 43–58. Springer, 1996. URL http://link.springer.com/chapter/10.1007/BFb0015522.

[3] Tom Fawcett. An introduction to ROC analysis. Pattern Recognition Letters, 27(8):861–874, 2006. URL http://people.inf.elte.hu/kiss/13dwhdm/roc.pdf.

[4] Dimitry Gorodnichy. Video-based framework for face recognition in video. In Second Workshop on Face Processing in Video (FPiV 2005), Proceedings of the Second Canadian Conference on Computer and Robot Vision (CRV 2005), pages 330–338, 2005. URL http://nparc.cisti-icist.nrc-cnrc.gc.ca/npsi/ctrl?action=rtdoc&an=5764009.

[5] Hung-Son Le. Face Recognition Using Hidden Markov Models. Lic. thesis, Umeå University, 2005. URL http://libris.kb.se/bib/9941301.

[6] Hung-Son Le. Face Recognition: A Single View Based HMM Approach. PhD thesis, Umeå University, Applied Physics and Electronics, 2008. URL http://umu.diva-portal.org/smash/record.jsf?pid=diva2:141208.

[7] Linus Nilsson. Object Tracking and Face Recognition in Video Streams. Bachelor's thesis, Umeå University, Department of Computing Science, 2012. URL http://umu.diva-portal.org/smash/record.jsf?searchId=4&pid=diva2:546732.

[8] Tobias Nilsson. Speech Recognition Software and Vidispine. Master's thesis, Umeå University, Department of Computing Science, 2013.

[9] Thomas Rondahl. Face Detection in Digital Imagery Using Computer Vision and Image Processing. Bachelor's thesis, Umeå University, Department of Computing Science, 2011. URL http://umu.diva-portal.org/smash/record.jsf?searchId=4&pid=diva2:480900.

[10] Matthew Turk and Alex Pentland. Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3(1):71–86, 1991. URL http://www.mitpressjournals.org/doi/abs/10.1162/jocn.1991.3.1.71.


Appendix A

Detailed Results

This appendix contains more information about the test videos (Table A.1) and the full details of the results of the comparison (Figures A.1–A.5 and Table A.2).


   Identifier         Length  fps   Res.        #P      #F      #D  #D/#F
   codemill           0:38    7.5   640 × 480    3     283     352   1.24
   SA_SKA_DET_LATA            25.0  512 × 288
     – long           8:02                       7  12 062   9 435   0.78
     – short          0:30                       5     750     602   0.80

(a) Details of the test videos used in the comparison.

   Identity      Gender    #D    %
   0             Male      81   23
   1             Male      52   15
   2             Female    71   20
   None/Unkn.             148   42
   Total                  352  100

(b) Class skew for codemill.

                            long           short
   Identity    Gender     #D     %       #D    %
   0           Male      1 316   14      84   14
   1           Male      2 703   29     388   65
   2           Female      578    6      17    3
   3           Female      782    8     110   18
   4           Female      558    6
   5           Female    1 223   13
   6           Male      1 838   19       1    0
   None/Unkn.              437    5       2    0
   Total                 9 435  100     602  100

(c) Class skew for SA_SKA_DET_LATA.

Table A.1: Details (a) and class skew (b), (c) for the test videos:

In (a), for each video the columns show (left to right): length, frame rate (frames per second), frame resolution (width times height in pixels), the number of identifiable people, the total number of frames, the number of detected faces, and the number of detected faces per video frame.

In (b) and (c), the #D columns show the number of face detections belonging to the given identity (i.e., class) and the % columns the percentage of total detections. Note that there is no correlation of identities between test videos.


   Identity        wawo   openbr         opencv1  opencv2  opencv3
   0               0.58   0.68           0.48     0.50     0.59
   1               0.65   0.52           0.66     0.46     0.67
   2               0.80   0.68           0.76     0.53     0.81
   None/Unknown    0.60   0.59           0.60     0.47     0.65
   Mean            0.65   0.62           0.61     0.49     0.67

(a) codemill

   Identity        wawo   openbr         opencv1  opencv2  opencv3
   0               –      0.71 / 0.76*   0.50     0.54     0.50
   1               –      0.77 / 0.87*   0.62     0.49     0.48
   2               –      0.82 / 0.85*   0.50     0.51     0.50
   3               –      0.50           0.57     0.50     0.67
   4               –      0.50           0.68     0.50     0.59
   5               –      0.50           0.50     0.50     0.50
   6               –      0.50           0.50     0.50     0.50
   None/Unknown    –      0.75 / 0.91*   0.36     0.91     0.81
   Mean            –      0.64 / 0.88*   0.54     0.52     0.53

(b) SA_SKA_DET_LATA-long

   Identity        wawo   openbr         opencv1  opencv2  opencv3
   0               0.95   0.95 / 0.95*   0.50     0.50     0.50
   1               0.83   0.88 / 0.94*   0.80     0.62     0.82
   2               0.86   0.90 / 0.91*   0.87     0.38     0.50
   3               0.88   0.50           0.50     0.50     0.50
   6               0.50   0.50           0.50     0.50     0.50
   None/Unknown    0.91   0.88 / 0.92*   0.85     0.89     0.42
   Mean            0.85   0.82 / 0.94*   0.70     0.57     0.71

(c) SA_SKA_DET_LATA-short

Table A.2: auc values for each fr system and each identity compared to all others. *Disregarding identities with all face images failed.


(a) Identity 0 vs. all other. (b) Identity 1 vs. all other. (c) Identity 2 vs. all other. (d) None/Unknown identity vs. all other.

Figure A.1: roc plots for codemill, comparing each identity to all others.


(a) Identity 0 vs. all other. (b) Identity 1 vs. all other. (c) Identity 2 vs. all other. (d) Identity 3 vs. all other.

Figure A.2: roc plots for SA_SKA_DET_LATA-long (part 1), comparing each identity to all others.


(a) Identity 4 vs. all other. (b) Identity 5 vs. all other. (c) Identity 6 vs. all other. (d) None/Unknown identity vs. all other.

Figure A.3: roc plots for SA_SKA_DET_LATA-long (part 2), comparing each identity to all others.


(a) Identity 0 vs. all other. (b) Identity 1 vs. all other. (c) Identity 2 vs. all other. (d) Identity 3 vs. all other.

Figure A.4: roc plots for SA_SKA_DET_LATA-short (part 1), comparing each identity to all others.


(a) Identity 6 vs. all other. (b) None/Unknown identity vs. all other.

Figure A.5: roc plots for SA_SKA_DET_LATA-short (part 2), comparing each identity to all others.
