
Degree project in Computer Science, Second cycle

Stockholm, Sweden 2013

Real-time Head Pose Estimation in Low-resolution Football Footage Using Random Forests

André Dübbel


Real-time Head Pose Estimation in Low-resolution Football Footage Using Random Forests

DD221X - Degree Project in Computer Science, Second Level

ANDRÉ DÜBBEL, DUBBEL@KTH.SE

Master's Thesis at CSC, CVAP
Subject: Computer Science
Supervisor at Tracab: Dan Song
Supervisor at CVAP: Josephine Sullivan

Examiner: Stefan Carlsson


Abstract

This report presents a method for real-time head pose estimation in low-resolution football footage. The presented method uses a random forest trained on synthetically generated head images. The use of synthetic training images is shown to be a good substitute for the use of manually labelled images. The presented method compares favourably to support vector machines trained for the same task, both in terms of accuracy and speed. It is noted that the method relies on a good head detection to perform well. The report also examines ways of combining the image-based head pose estimation with contextual features such as ball position and player position. It is shown that the relative direction to the ball can improve the accuracy of the pose estimate in certain situations. Furthermore, it is found that the random forest method can easily be extended to incorporate images from multiple cameras to improve accuracy.


Real-time Estimation of Head Orientation in Low-resolution Football Video

Sammanfattning (translated from Swedish)

This report presents a method for real-time estimation of head orientation in low-resolution football video. The presented method uses a random forest trained on synthetically created head images. The use of synthetic data turns out to be a good substitute for the use of hand-labelled images. The presented method compares favourably to support vector machines trained for the same purpose, both with respect to precision and speed. It is shown that the method relies on good head detection to perform well. The report also investigates ways of combining the image-based estimation with contextual factors such as ball position and player position. It is shown that the relative direction to the ball can improve precision in certain situations. Furthermore, it is shown that the random forest method can easily be extended to use images from several cameras to improve precision.


Acknowledgements

Many thanks to all the helpful people at the computer vision department at Tracab; thank you for always being approachable and for making me feel like part of the team. Thanks to Josephine Sullivan for her supervision and useful comments on the report. This project was conducted as part of the EU research project FINE (Grant agreement 248020, FP7-ICT-2009-4), whose support is gratefully acknowledged. Finally, a special thank you to Dan Song for her invaluable support and guidance throughout this project.


Contents

1 Introduction
  1.1 Resources and Requirements
    1.1.1 Synthetic Dataset
  1.2 Thesis Outline

2 Theory and Related Work
  2.1 Random Forests
    2.1.1 Bagging
    2.1.2 Split Nodes
    2.1.3 Leaf Nodes
    2.1.4 Previous Work
  2.2 Head Pose Estimation

3 Method
  3.1 Random Forest Design
    3.1.1 Bagging
    3.1.2 Split Nodes
    3.1.3 Leaf Nodes
    3.1.4 Forest Evaluation
  3.2 Comparison To Support Vector Machines
  3.3 Integration with Player Tracking
    3.3.1 Multiple Cameras
    3.3.2 Ball Position

4 Experiments and Results
  4.1 Datasets
  4.2 Performance Evaluation
    4.2.1 Forest Size
    4.2.2 Stopping Criteria
    4.2.3 Mean Shift Clustering
    4.2.4 Target Function
  4.3 Comparison With SVM
    4.3.1 Results On Different Datasets
  4.4 Effects Of Translation Errors
  4.5 Integration With Tracking System
    4.5.1 Multiple Cameras
    4.5.2 Ball Position

5 Discussion
  5.1 General System Performance
  5.2 Training And Test Data
  5.3 Integration With Tracking System
    5.3.1 Multiple Cameras
    5.3.2 Ball Position

6 Conclusions And Future Work

Bibliography

A Other Algorithms
  A.1 Mean Shift Clustering

B Auxiliary Results
  B.1 Support Vector Machines


Chapter 1

Introduction

A simple thing like the movements of a person's head can potentially be used to garner a lot of information. In surveillance videos, areas of interest can be found by following the direction of a pedestrian's head. Rapid head movements may indicate alarm or fear, and a system that automatically identifies such a situation could help a security guard prevent crimes from happening.

One could also make product and advertisement placements more effective by studying the gaze statistics of customers in shops and malls.

Babies as young as six months display the ability to follow another person's gaze towards an object of interest [1]. However, tasks that we humans find exceedingly simple can be very hard for a computer to replicate, and head pose estimation remains an active area of research in computer vision. The many variations in human appearance, changing lighting conditions and cluttered images are factors that make this such a challenging problem. Add to this that the only information available is previously unseen low-resolution footage, and you get a daunting task indeed. This is precisely what this thesis aims to do: to design a method that estimates the head orientation of football players from in-game footage, in real-time.

The work of this thesis has been sponsored by Tracab and most of the work has been done on location at their office in Stockholm. Tracab is a world leader in sports tracking. Using image processing technology their system is able to identify the position of all moving objects in a sports arena in real-time.

With this technology they can both gather a multitude of game statistics and add graphics that follow players around in live broadcasts.

Tracab currently has an active collaboration with the Computer Vision and Active Perception lab (CVAP) at KTH on the 3-year EU project FINE (Free-viewpoint Immersive Networked Experience) [2]. The project is currently in its final year, with the focus for Tracab and CVAP on image-based 3D human pose extraction. The aim is to gather richer information from sports in addition to the player tracking. Such information could be used to generate 3D computer-game-style animations for football games, which can be used in virtual arena applications. This thesis is in the context of that project, with the focus on estimating players' head pose. The idea is that an accurate head pose will make game play animations look more realistic. In addition, the head pose can perhaps be used to reflect team strategy, to classify different game situations or to sketch players' attention maps, all of which could be useful information for post-game analysis or as a coaching aid.

We propose to solve the problem using random forests, a method recognized for its ability to handle a diverse variety of tasks and for its fast evaluation speed. One of the novelties of our approach is the use of synthetically created images to train our system; we show that these can be made to generalise quite well to real images. In addition to our appearance-based method we also explore ways of improving performance by combining our results with data from Tracab's tracking system.

1.1 Resources and Requirements

The goal of this thesis is to design a system that can estimate the orientation, or pose, of a football player's head using a single image, in real-time. We propose to do this using random forests, see section 2.1. A version of random forests, mostly based on the model introduced by Breiman [3], has been implemented in C++ specifically for our purposes. We have created several datasets, containing manually labelled head images from 3 different football games, that are used to evaluate the system's performance. In the final system, head images are supplied by a head detection and tracking system designed by Dan Song. Depending on camera position and image resolution the extracted head images will vary in size, but in general they will be square images in the range of 5×5 to 20×20 pixels. The heads are fairly centred and the images include a small border of background around the head, see figure 1.1b. Due to the low resolution of the head images we limit the estimation of the head's pose to the rotation around the vertical axis.

1.1.1 Synthetic Dataset

A common challenge in machine learning is the gathering of sufficient amounts of labelled training data. This becomes especially hard when the labelling of head orientation needs to be done accurately on low-resolution images. To address this problem we propose the use of automatically labelled synthetic head images. The synthetic images have been generated in a software called Poser [4]. Starting from 10 different head models we create a total of 27 different head models using different permutations of hair style, hair colour and skin tone. These are further varied using 5 different lighting conditions. The models have two different base poses, either looking straight forward or slightly downwards. To model other poses the base pose is rotated with random yaw and pitch in the ranges of 0-359 and 0-40 degrees respectively, see figure 1.1c.

In this way approximately 36,000 head images with transparent backgrounds have been generated. The final dataset is created by superimposing these head images on randomly selected image patches from football footage. The images are scaled to 18×18 pixels to be in approximately the same resolution as the real images, while still not losing too much information. To model possible translation errors in the real images, the head position is shifted both horizontally and vertically in the image according to a normal distribution with mean 0 and standard deviation 1. Some sample images from the dataset can be seen in figure 1.1a.

Figure 1.1: (a) Synthetic head images. (b) Real head images. (c) Head rotation: yaw, pitch and roll.
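As an illustration, a minimal C++ sketch of the translation jitter described above; the crop-offset representation is a hypothetical stand-in for however the compositing code positions the head:

```cpp
#include <cmath>
#include <random>

// Hypothetical sketch: shift the crop position of a synthetic head image
// horizontally and vertically by offsets drawn from a normal distribution
// with mean 0 and standard deviation 1 pixel, as described in the text.
struct CropOffset { int x, y; };

CropOffset jitterHeadPosition(int cx, int cy, std::mt19937& rng) {
    std::normal_distribution<double> shift(0.0, 1.0); // N(0, 1), in pixels
    return { cx + static_cast<int>(std::lround(shift(rng))),
             cy + static_cast<int>(std::lround(shift(rng))) };
}
```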

1.2 Thesis Outline

The rest of the report is organized as follows: in chapter 2 we go through some theory relevant to our work, and make a short review of previous work done in the area; chapter 3 gives a detailed description of how our system has been designed; in chapter 4 we present the experiments we have done to evaluate the system and present the results they have yielded; in chapter 5 we discuss the results presented in the previous chapter; in the final chapter we summarize our work and give some suggestions for possible future work.


Chapter 2

Theory and Related Work

Here we summarize prior work related to ours in the field of head pose estimation. To do so properly, we first need to introduce some terminology, and we therefore begin this chapter with an introduction to random forests and some of their variations.

2.1 Random Forests

Random forests are a family of ensemble classifiers introduced by Breiman in 2001 [3]. They have gained a lot in popularity in recent years, much thanks to the impressive results achieved by Shotton et al. In connection with the development of Microsoft Kinect, they have demonstrated random forests' ability to handle large training datasets efficiently and to solve complex tasks [5-8].

A random forest consists of an ensemble of decision trees, see figure 2.1, and can be used either for multi-class classification [5, 9], regression [7, 10, 11], or even both at the same time [12]. As the name implies the method incorporates certain elements of randomness; each tree is trained using a random subset of the training data, and the test at each split node is constructed from a random selection of descriptive features. See figure 2.2 for a step by step summary of the random forest training algorithm.

2.1.1 Bagging

The concept of using a subset of the data for training each tree was one of the novelties of Breiman's method and is referred to as bootstrap aggregating, or bagging for short [13]. Let $S$ be the full set of training data $\{(y_n, x_n),\, n = 1, \ldots, N\}$, where $y_n$ is the label or numerical response of the data vector $x_n$. Let $S_t$ denote a subset of $S$ containing $N$ samples drawn from $S$ with replacement. By then training tree $t$ using the corresponding subset $S_t$, bagging aims at reducing the risk of overfitting to the training data by decorrelating the trees. In addition, bagging also has the benefit of giving us a convenient measure of generalization error: the out-of-bag (OOB) error is estimated by testing each tree on the unseen samples $S \setminus S_t$, and this can be done continuously during training [3].

Figure 2.1: (a) A random forest is made up of a set of individual trees. Each tree consists of split nodes and leaf nodes. The split nodes contain tests that send data either to the left or right child, and the leaf nodes contain some kind of prediction. (b) At training time a set of data is presented to the root node; each node then selects the test that best separates the data into classes. When the data is pure enough, growing stops and the node is turned into a leaf.

2.1.2 Split Nodes

Every split node in the tree contains a test meant to separate the data as well as possible with regard to some objective. In this work we limit the discussion to binary tests, which leaves us with functions of the general form:

\[ h(x_n, \theta) \in \{0, 1\}, \tag{2.1} \]

where $\theta$ is the vector of descriptive features used in this particular node. The tests used in practice are generally quite simple: a comparison of a single feature to a threshold, or a comparison between two different features, are two commonly used tests.

The tests evaluate the samples in $S = \{(y_n, x_n)\}$ using the descriptors in the feature vector $\theta$ and pass the samples to the left or right child of the node:

\[ S_l(\theta) = \{(y_n, x_n) \in S \mid h(x_n, \theta) = 0\} \tag{2.2} \]
\[ S_r(\theta) = \{(y_n, x_n) \in S \mid h(x_n, \theta) = 1\} \tag{2.3} \]

When constructing a tree, each new node is presented with a set of randomly selected features $\Theta = \{\theta_k\}$. The node picks the feature vector that best partitions the data according to:

\[ \theta^{*} = \arg\max_{\theta_k \in \Theta} G(\theta_k), \tag{2.4} \]

where $G(\theta)$ is the information gain defined as:

\[ G(\theta) = H(S) - \frac{1}{|S|}\Bigl(|S_l(\theta)|\,H(S_l(\theta)) + |S_r(\theta)|\,H(S_r(\theta))\Bigr) \tag{2.5} \]

The target function $H(S)$ is chosen differently depending on the application, but the objective is to minimize the error while balancing the sizes of the left and right distributions. A commonly used target function is the Shannon entropy [5, 7, 8, 14], defined for classification as:

\[ H(S) = -\sum_{c} p(c|S) \log p(c|S), \tag{2.6} \]

where $c$ ranges over the class labels. In the case of regression one can instead try to minimize the sum of squared errors [7]:

\[ \epsilon(S) = \sum_{y \in S} |y - \bar{y}|^2, \tag{2.7} \]

where $\bar{y}$ represents the mean value of the labels in the distribution $S$.

Starting at the root node with a set $S = \{(y_n, x_n),\, n = 1, \ldots, N\}$ of training data, a tree is grown recursively as:

1. Generate a random set of feature vectors $\Theta = \{\theta_k\}$.
2. Split the dataset $S$ into a left and a right subset for each $\theta \in \Theta$ according to some test function $h(x_n, \theta) \in \{0, 1\}$, as in equations 2.2 and 2.3.
3. Select the feature vector $\theta^{*}$ that maximizes the information gain $G(\theta)$ of equation 2.5.
4. Continue growing the child nodes with the subsets $S_l$ and $S_r$ until some stopping criterion is met, in which case the split node is turned into a leaf node and stores the distribution of the samples that reached the node.

Figure 2.2: A general summary of the random forest training algorithm.
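To make the classification objective concrete, here is a minimal C++ sketch of the Shannon entropy and information gain of equations 2.5 and 2.6; the integer class labels are illustrative:

```cpp
#include <cmath>
#include <map>
#include <vector>

// H(S): Shannon entropy of the class labels in S (equation 2.6).
double entropy(const std::vector<int>& labels) {
    if (labels.empty()) return 0.0;
    std::map<int, int> counts;
    for (int c : labels) ++counts[c];
    double H = 0.0;
    for (const auto& kv : counts) {
        double p = static_cast<double>(kv.second) / labels.size();
        H -= p * std::log(p);
    }
    return H;
}

// G = H(S) - (|Sl| H(Sl) + |Sr| H(Sr)) / |S|, for one candidate split
// (equation 2.5); the node keeps the candidate with the largest gain.
double informationGain(const std::vector<int>& S,
                       const std::vector<int>& Sl,
                       const std::vector<int>& Sr) {
    return entropy(S) -
           (Sl.size() * entropy(Sl) + Sr.size() * entropy(Sr)) / S.size();
}
```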

2.1.3 Leaf Nodes

During training, each leaf node tries to find a good representation of the data that reached it using some kind of prediction model. Depending on the application and the nature of the data this can be done in a number of ways. In the case of classification the most straightforward way is to let the probability distribution of tree $t$, $p_t(c|x_n)$, be proportional to the number of samples from each class $c$. The estimated distribution of the entire forest is then simply the average of the individual distributions:

\[ p(c|x_n) = \frac{1}{T} \sum_{t=1}^{T} p_t(c|x_n) \tag{2.8} \]

In a regression forest, the leaf node outputs could be either point estimates or probability distributions. In the latter case the outputs can be weighed together as the average of all the trees, in the same way as with a classification forest.
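As a concrete illustration of equation 2.8, a minimal sketch assuming each tree has already produced a normalized class distribution:

```cpp
#include <cstddef>
#include <vector>

// Forest posterior as the average of the per-tree leaf distributions
// p_t(c|x) (equation 2.8). Each inner vector is one tree's distribution
// over the C classes.
std::vector<double> forestPosterior(
        const std::vector<std::vector<double>>& perTree) {
    std::vector<double> p(perTree.front().size(), 0.0);
    for (const auto& pt : perTree)
        for (std::size_t c = 0; c < p.size(); ++c)
            p[c] += pt[c];
    for (double& v : p) v /= perTree.size();
    return p;
}
```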


2.1.4 Previous Work

As mentioned above, much of the recent attention given to random forests in computer vision is due to Microsoft's research team at Cambridge. In their first paper [5] Shotton et al. use random forests for human pose recognition by classifying individual pixels as belonging to different body parts. The forest is trained using hundreds of thousands of depth images, synthetically generated from 100,000 human poses collected using a motion capture device.

Their results give insights into the effects of several training parameters. They indicate that depth has the most significant effect on performance, and show that overfitting can be avoided by using a larger dataset even when training very deep trees. Also, in their case, the number of trees in the forest is shown to be of less importance, and their final classifier achieves state-of-the-art performance even when consisting of no more than 3 trees.

In their next paper [7] the human pose estimation problem is tackled using a regression forest. The leaf distributions are modelled using mean shift clustering. They show that there is no benefit in storing more than the 2 biggest clusters, thus reducing both memory requirements and evaluation time. Another interesting discovery is that the use of a classification-style target function in the split nodes gives better results than a regression-style function. They also investigate different sub-sampling methods with the purpose of making the forest more efficient, with some promising results.

Dantone et al. [11] use a conditional regression forest to improve facial feature detection. The idea is to use prior knowledge of some global variable to constrain the output. In this case the global variable is the orientation of the head, divided into 5 states. By training a full forest for each possible state they get forests that are specialized to identify facial features on faces with different orientations. However, if the number of states of the global variable grows large, this method quickly becomes infeasible, since the training time increases linearly with the number of states. Instead of conditioning the entire tree structure on the global variable, Sun et al. [6] suggest that only the regression models in the leaf nodes should be modified. They store a regression model for each state of the global variable and show that this not only speeds up training but also gives the best results when used with their human pose estimator.

Schulter et al. [15] suggest an on-line training scheme for decision forests where training data arrives sequentially: a tree starts out as a single leaf node, and as data arrives the node can choose either to remain a leaf node or, if enough data has arrived, to turn into a split node and pass on the data to its child nodes. With this way of training, split nodes will base their decisions on considerably smaller subsets of data than in the traditional off-line case. Intuitively this may seem like a bad thing, but the results presented by Schulter et al. suggest that it in fact helps with generalisation, the reason being that it decorrelates the trees in the same way that bagging does. The on-line training method could also be used to further grow an already built tree, thus possibly adapting a general tree to a more specific situation. Inspired by the good performance of the on-line forests, they also introduce a modified off-line training method where each split node bases its choice of split function on a small subset of the available data, showing that this reduces training time while improving performance.

2.2 Head Pose Estimation

There has been a fair amount of work done in the area of head pose estimation, some of which uses appearance-based methods and some of which relies on the identification of facial features. The latter, requiring that the resolution of the images be sufficiently high for individual features to be distinguishable, are not applicable to our situation and will hence not be discussed at any greater length here. The same applies to work done using depth images [12]. However, the interested reader can find an extensive and relatively recent survey of existing methods for head pose estimation in [1].

Benfold and Reid have done extensive work in the area of head pose estimation in low-resolution surveillance videos [16, 17]. They classify head images as small as 10 pixels square into 8 classes spanning a full 360° rotation using a forest of random ferns. (Random ferns are a special kind of decision tree where every split node at a given level uses the same test. This makes ferns easier to implement and more efficient to train, but at the cost of some flexibility.)

In [16] the split nodes base their decisions on a segmentation of the head images into hair, skin and background regions obtained from k-means clustering, requiring that the segmented training images are manually assigned correct region labels. During evaluation of unseen data, the most likely labelling hypothesis needs to be found. This is done by testing all possible combinations of labels, thus removing the random ferns' inherent advantage in evaluation speed.

In [17] Colour Triplet Comparison (CTC) is introduced as an alternative image descriptor. CTC compares the colour values of three pixels, and the split node makes a decision depending on whether the first and second pixels are more similar than the second and third, where similarity is measured as the L1 norm in RGB-space, see figure 2.3a. The intuition is that the comparison will be able to find colour differences in the image while still being robust to different lighting conditions, skin colour etc. They also explore the usefulness of HOG descriptors, where the tests work by comparing the counts in two bins of the histograms, see figure 2.3b. They find that HOG descriptors perform better than CTC, but that a combination of the two outperforms both of them separately.

Figure 2.3: (a) Colour Triplet Comparison (CTC): the test compares the colour values of three pixels and makes its decision depending on whether the first and second pixels are more similar than the second and third. (b) Histogram of Oriented Gradients (HOG): divides the image into cells and counts the occurrences of gradient orientations in each cell.

Benfold et al. also consider ways of improving accuracy when there are translation errors in the estimate of the head location. They show that injecting similar errors into the training data can slightly improve results [17].

Siriteerakul et al. use image descriptors similar to the ones used by Benfold to train a Support Vector Machine (SVM) to estimate head pose: Pairwise Non-Local Intensity Differences (iDF) and Pairwise Non-Local Colour Differences (cDF). iDF simply compares the intensity of two pixels, while cDF checks whether two pixels of an image segmented with k-means clustering belong to the same segment [18].

Launila did work very similar to ours in his thesis, which was performed jointly at Tracab and KTH [19]. He replicates the random ferns classifier used in [16] and compares its performance on football footage to an SVM trained on grayscale images. The SVM is found to be both more efficient and more accurate. To improve performance he explores ways of combining the appearance-based classifier with contextual features, such as player and football position, with some success.


Dantone et al. use a random forest to estimate the head pose of faces viewed from the front. The head pose estimation is not the focus of their paper, but they indicate that better results are obtained by first estimating a continuous value for the pose and then converting it into a discrete label [11].

A random forest based on Gabor features is used for head pose estimation in [9]. In an attempt to strengthen the individual trees, Linear Discriminant Analysis (LDA) is used in the split nodes.

In summary, we can say that Benfold's and Launila's test data are the most similar to ours. The image representations used in Benfold's later work [17] (CTC and HOG) seem promising, and it would be interesting to see how they fare when used in combination with a random forest utilizing some of the newer ideas discussed in the previous section. Launila's successful use of contextual features should also be explored further. These ideas and more are investigated in the following chapters.


Chapter 3

Method

Based on the discussion in the previous chapter we decided to try to solve the head pose estimation problem using a random regression forest. In this chapter we give a detailed description of the random forest that we use and how we combine it with Tracab’s tracking data into a complete system.

3.1 Random Forest Design

The basis of our random forest is very similar to the original design suggested by Breiman [3]. The framework can be used for a multitude of tasks without too many modifications.

3.1.1 Bagging

Our bagging process is completely analogous to the original one: out of $N$ training samples, $T$ subsets of $N_t$ samples are drawn randomly with replacement to be used for training each tree $t \in \{1, \ldots, T\}$. In practice this means that, in the standard case where $N_t = N$, each tree will be trained on roughly two thirds of all the samples [13].
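A minimal sketch of this sampling step; drawing $N$ indices with replacement leaves roughly a third of the samples undrawn, which form that tree's out-of-bag set:

```cpp
#include <random>
#include <vector>

// Bagging: draw N_t = N training-sample indices with replacement for one
// tree. Indices that are never drawn form the tree's out-of-bag (OOB) set;
// with replacement, an expected 1 - 1/e (about 63%) of samples are drawn.
std::vector<int> drawBootstrapSample(int N, std::mt19937& rng) {
    std::uniform_int_distribution<int> pick(0, N - 1);
    std::vector<int> indices(N);
    for (int& i : indices) i = pick(rng);
    return indices;
}
```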

3.1.2 Split Nodes

One thing to consider when adapting a forest to a specific task is the choice of a suitable set of tests to be used in the split nodes. In our case we have decided to work with the same tests that Benfold et al. used in their work, since they have been shown to work well on data very similar to ours [17]. As mentioned in the previous chapter, they introduced Colour Triplet Comparison (CTC) and used it in combination with HOG descriptors. However, our implementation has one minor modification: instead of RGB-space, our colour triplets are represented in CIELUV-space. CIELUV is a colour space created with the aim of being perceptually uniform, which means that the distance between two colours should be proportional to the perceptual difference between the colours [20]. Since the CTC works by comparing distances in colour space, this should be a useful property.

Figure 3.1: Discrete division of head poses into 16 bins. By starting the binning at different angles φ for all the trees we avoid biasing the forest towards a certain division.
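As an illustration, a minimal sketch of such a CTC split test; the Luv pixel type and L1 distance are illustrative stand-ins for however the CIELUV values are stored:

```cpp
#include <array>
#include <cmath>

// Colour Triplet Comparison (CTC) on CIELUV pixel values. The three probe
// positions belong to the node's randomly generated feature vector; here
// we take the already-sampled pixel values directly.
using Luv = std::array<double, 3>; // (L*, u*, v*)

double l1Distance(const Luv& a, const Luv& b) {
    return std::abs(a[0] - b[0]) + std::abs(a[1] - b[1]) +
           std::abs(a[2] - b[2]);
}

// Binary split decision: are pixels 1 and 2 more similar than 2 and 3?
int ctcTest(const Luv& p1, const Luv& p2, const Luv& p3) {
    return l1Distance(p1, p2) < l1Distance(p2, p3) ? 0 : 1;
}
```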

Since we have access to synthetic training data, perfectly labelled with continuous angle values, it is natural to use this to our advantage and try to estimate the head rotation as accurately as possible with a regression output. Intuitively, the target function used to choose the best test at each split node should match the desired output of the forest. In other words, a regression target should be used with a regression output and a classification target with a classification output. However, as shown by Girshick et al. in [7], a classification target can sometimes outperform a regression target even when striving to do regression. We therefore test both methods to see which performs best in our case.

For the classification target, the head poses are divided into 16 evenly spaced bins around the full rotation, and the trees aim to separate these bins using the Shannon entropy described in equation 2.6. However, some initial testing showed a problem with this discretization: we noticed that trees rarely voted for angles close to the edges of the bins, making the forest perform considerably worse on test samples in those regions. We solve this by introducing a variable starting angle φ for the binning, as shown in figure 3.1. Each tree is assigned a random angle to use as the starting point of the first bin, meaning that individual trees will still be weaker in certain regions, but the output from the entire forest will no longer be biased towards a certain division.

Figure 3.2: Mean of circular quantities. The angles are projected onto the unit circle and the mean is calculated as a point in 2D, $(\bar{x}, \bar{y})$, before being projected back onto the unit circle and converted to an angle, $\bar{\theta}$.
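A minimal sketch of this discretization, assuming angles are given in degrees in [0, 360):

```cpp
#include <cmath>

// Per-tree discretization: 16 bins over the full rotation, offset by the
// tree's random starting angle phi (both in degrees).
int angleToBin(double angleDeg, double phiDeg, int numBins = 16) {
    double shifted = std::fmod(angleDeg - phiDeg + 360.0, 360.0); // [0, 360)
    return static_cast<int>(shifted / (360.0 / numBins)); // 0 .. numBins-1
}
```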

As the regression target we would like to use the sum of squared errors described in equation 2.7. However, since we are working with a circular quantity, i.e. head rotation, we need our mean to be able to model the proximity of 0 degrees and 359 degrees. The linear mean does not have this property, so instead we use the following formula:

\[ \bar{\theta} = \operatorname{atan2}\!\left(\frac{1}{n}\sum_{i=1}^{n}\sin\theta_i,\ \frac{1}{n}\sum_{i=1}^{n}\cos\theta_i\right), \tag{3.1} \]

where atan2 is a version of the arctangent that takes two arguments in order to discern, from the signs of the arguments, in which quadrant the resulting angle lies. The workings of equation 3.1 are illustrated in figure 3.2. From the mean angle we can proceed to calculate the error sum. However, because of the circular property of the angle we need to modify the error function as well. We propose the following:

\[ \epsilon(S) = \sum_{\theta \in S} d(\theta, \bar{\theta}), \qquad d(\theta, \bar{\theta}) = \begin{cases} |\theta - \bar{\theta}| & \text{if } |\theta - \bar{\theta}| \le 180^{\circ} \\ 360^{\circ} - |\theta - \bar{\theta}| & \text{if } |\theta - \bar{\theta}| > 180^{\circ} \end{cases} \tag{3.2} \]

where $\bar{\theta}$ is the mean from equation 3.1. The best split is then simply the one that minimizes the value of:

\[ E(S) = \epsilon(S_l) + \epsilon(S_r) \tag{3.3} \]

During training, each split node randomly generates a set of both CTC and HOG tests and picks the one that best splits the data according to the chosen target function. The number of tests that are evaluated at each split node also needs to be taken into consideration. A large number of tests will result in individually stronger trees, making it possible to have a forest with fewer trees and thus faster evaluation speed. At the same time, more tests will increase correlation between trees, thus possibly reducing the generalization properties of the forest. A very large number of tests will also result in infeasibly long training times. On the other hand, too few tests will result in weak trees with poor performance. Therefore it is important to find a suitable number of tests that is neither too small nor too large.

3.1.3 Leaf Nodes

The splitting process is repeated for both child nodes, and the tree keeps growing until at least one of the following stopping criteria is met:

1. The node is at a depth greater than or equal to $d_{max}$.
2. The node meets a purity criterion:
   a) Classification target: at least a fraction $\rho_{min}$ of the training samples belong to the same class.
   b) Regression target: the average angular error is smaller than $\epsilon_{min}$.
3. The node contains fewer than $|S|_{min}$ training samples.


Figure 3.3: Mean shift clustering run on a set of angles, here yielding three clusters with weights 1/2, 1/3 and 1/6.

Once a stopping criterion is met, the node is turned into a leaf node. The last of the stopping criteria, $|S|_{min}$, is merely meant to stop branches with inconclusive results from growing more than necessary and should not affect performance very much. It is set to $|S|_{min} = 10$ and we do not consider any other values for it here.

Once turned into a leaf, the node needs to find a good representation of the data that it contains. The most straightforward approach would be to represent the data by its mean angle. However, we notice that leaf distributions are sometimes multi-modal, in which case the mean would be a rather poor representation of the true distribution. To handle this problem we decide to cluster the data using the mean shift algorithm (the reader unfamiliar with mean shift can find a more thorough description of it in appendix A.1). To tackle the circularity of the data, the clustering is run on the data points projected onto the unit circle, see figure 3.3. The main advantage of mean shift is that it makes no assumptions on the number of clusters in the dataset, making it very useful in our case. We use an Epanechnikov kernel, and the only parameter that needs to be specified is the cluster bandwidth. Each leaf stores the 2 biggest clusters in the form of their respective means $\bar{\theta}_m$ and weights $w_m$, where the weights are simply the portion of the data that belongs to the respective clusters.

3.1.4 Forest Evaluation

At test time, the forest is presented with an image of an unknown head pose. The image is passed down each tree until it reaches a leaf node. Each tree then votes on up to 2 angles corresponding to the centres of the biggest data clusters in the leaf node. Each vote carries a weight $w_m$, and the votes accumulated from the entire forest are clustered once again using a weighted mean shift. The estimated head orientation is then simply the mean of the biggest vote cluster.
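A minimal sketch of a single weighted mean shift mode seek of the kind used here; the full clustering procedure (seeding and merging of modes) follows appendix A.1, so this is illustrative only:

```cpp
#include <cmath>
#include <vector>

// One weighted mean shift mode seek on angles projected onto the unit
// circle. With an Epanechnikov kernel the mean shift update is the
// weighted mean of the points inside the bandwidth, re-projected onto
// the circle.
struct Vote { double angleDeg; double weight; };

double meanShiftModeDeg(const std::vector<Vote>& votes, double startDeg,
                        double bandwidth, int maxIter = 50) {
    const double kPi = 3.14159265358979323846;
    double x = std::cos(startDeg * kPi / 180.0);
    double y = std::sin(startDeg * kPi / 180.0);
    for (int it = 0; it < maxIter; ++it) {
        double sx = 0.0, sy = 0.0, sw = 0.0;
        for (const Vote& v : votes) {
            double px = std::cos(v.angleDeg * kPi / 180.0);
            double py = std::sin(v.angleDeg * kPi / 180.0);
            double d2 = (px - x) * (px - x) + (py - y) * (py - y);
            if (d2 < bandwidth * bandwidth) { // inside the kernel window
                sx += v.weight * px;
                sy += v.weight * py;
                sw += v.weight;
            }
        }
        double norm = std::hypot(sx, sy);
        if (sw == 0.0 || norm == 0.0) break; // no support: stop iterating
        x = sx / norm;                       // weighted mean, re-projected
        y = sy / norm;
    }
    return std::fmod(std::atan2(y, x) * 180.0 / kPi + 360.0, 360.0);
}
```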

3.2 Comparison To Support Vector Machines

A popular and versatile machine learning method is the support vector machine (SVM). It can be used for binary and multi-class classification as well as regression [21]. As a comparison to our random forest based method, we decide to train a set of multi-class SVMs on the same synthetic dataset that we used to train the random forest. There is no readily available implementation of an SVM that does regression on a circular quantity, and since optimizing the SVM is not the focus of our thesis, we decide to compare the performance of the random forest with SVM classifiers. We use the open-source implementation libSVM [22] and design classifiers similar to the ones used by Launila in [19]. The 360 degrees are divided into 45-degree-wide bins, resulting in an 8-class classification. We choose to use a radial basis function (RBF) kernel and C-style soft margins. As for image descriptors, we choose to investigate grayscale pixel intensities, HOG descriptors and a combination of the two.

We also train an SVM using a combination of color pixel values and HOG descriptors, which is the same information that the random forest has to work with. The resulting feature vector is quite high dimensional and we thus train this SVM using a linear kernel.

3.3 Integration with Player Tracking

Our access to Tracab's tracking system affords us several unique opportunities to improve our head pose estimation. We have access to player position, ball position and multiple camera angles from several football games. Since our synthetic training data does not contain these extra features, we are not able to train our random forest on them directly, as has been done by Launila [19]. However, we suggest some simple heuristics that could possibly be used to improve the pose estimate.

3.3.1 Multiple Cameras

In a football match the players spend most of the time separated from each other by at least a few meters, and occlusion and background clutter are not much of a problem, see figure 3.4. But in certain situations, such as free kicks and corner kicks, players will crowd together and it can be hard to get a clean picture of all players. As can be seen from figure 3.5, multiple camera views can offer an advantage in these crowded situations.

Figure 3.4: Typical game footage. Players are well separated and there is very little occlusion and background clutter.

There could also be situations where one camera fails in some way, giving a blurred or unclear image or no image at all. In these situations having a system that handles input from several cameras of varying quality could be very useful.

The tracking system supplies all head detections with a player ID so it is a simple thing to match corresponding head detections from different cameras.

To get an estimate of a single player's head orientation we run all available detections of this player through the forest. The votes from each tree are then transformed from angles relative to the respective images into angles relative to the football pitch. The votes from all views are then clustered using mean shift in the same way as when evaluating a single image, and the output is the centre of the largest cluster. Our hypothesis is that a bad image will give rise to scattered votes, while a clear image will give a distinct cluster of votes. If so, the mean shift algorithm should still be able to find the clear image's distinct cluster within the noisy votes from the bad image.

Figure 3.6 shows how the pitch-relative pose, ϕ, can be derived from the image-relative pose, θ, the player position and the camera position. Note that the geometry will vary a bit for different player positions, but the general idea remains the same. The transformation can be summarized in the following equation:

\[ \varphi = (1+n)\pi + \alpha + \theta, \quad n \in \mathbb{Z} \text{ chosen so that } 0 \le \varphi < 2\pi, \tag{3.4} \]

where α is the relative angle between the two positions as defined in figure 3.6.


Figure 3.5: Two different views of the same free kick situation, from (a) the left camera and (b) the right camera. A different viewing angle can make a huge difference in player separation.

Figure 3.6: Diagram explaining how the image-relative head pose, θ, is related to the pitch-relative pose, ϕ, where α is the relative angle between the player and the camera, positioned at $(x_p, y_p)$ and $(x_c, y_c)$ respectively.
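A minimal sketch of the transformation in equation 3.4; note that the atan2 expression for α below is our assumed reading of the geometry in figure 3.6, not something stated explicitly in the text:

```cpp
#include <cmath>

// Lift an image-relative pose theta (radians) to a pitch-relative pose phi
// per equation 3.4, normalized to [0, 2*pi).
double pitchRelativePose(double theta, double xp, double yp,
                         double xc, double yc) {
    const double kPi = 3.14159265358979323846;
    double alpha = std::atan2(yc - yp, xc - xp); // assumed bearing to camera
    double phi = std::fmod(kPi + alpha + theta, 2.0 * kPi);
    if (phi < 0.0) phi += 2.0 * kPi;
    return phi;
}
```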


3.3.2 Ball Position

Naturally, players are often looking at the ball, and it has been shown by Launila that this can be used to improve the accuracy of the head pose estimation [19, 23]. We know the positions of both the players and the ball, so it is a simple matter to calculate the relative direction from a player to the ball.

The output from the forest often contains more than one big cluster of votes, and we have noted that in the case of erroneous estimates, the second biggest cluster is often much closer to the correct angle. Our hypothesis is that this second cluster may often correspond to the angle towards the ball. We therefore suggest that the estimate should be subject to change in the case where the second biggest cluster coincides with the ball direction.

We consider the angles to coincide whenever the cluster centre is within 45 degrees of the angle towards the ball. We do not want to be confused by small clusters from noisy votes, and therefore only consider clusters that are at least half as big as the largest cluster.
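A minimal sketch of this heuristic, assuming the forest's vote clusters are available sorted by weight with the largest first:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

struct Cluster { double angleDeg; double weight; };

// Smallest wrapped difference between two angles in degrees, in [0, 180].
double wrappedDiffDeg(double a, double b) {
    return std::fabs(std::fmod(a - b + 540.0, 360.0) - 180.0);
}

double refineWithBall(const std::vector<Cluster>& clusters,
                      double ballDirDeg) {
    for (std::size_t i = 1; i < clusters.size(); ++i) {
        // Only clusters at least half as big as the winner are trusted.
        if (clusters[i].weight < 0.5 * clusters.front().weight) break;
        // Switch if a runner-up points towards the ball (within 45 degrees).
        if (wrappedDiffDeg(clusters[i].angleDeg, ballDirDeg) <= 45.0)
            return clusters[i].angleDeg;
    }
    return clusters.front().angleDeg; // keep the original estimate
}
```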


Chapter 4

Experiments and Results

In this chapter we describe the experiments performed to evaluate the performance of the system. We start off by introducing the datasets used to evaluate the performance. We then present the results of the experiments and offer observations regarding those results.

4.1 Datasets

Our main objective is to investigate how well our random forest trained with synthetic data performs on real footage. We therefore create a few sets of head images with manually assigned angles:

1. Dataset 1 consists of head images from a video shot in 1920×1080 pixels, see figure 4.1, and head sizes range from approximately 5 to 10 pixels square, see figure 4.2a. Video frames are randomly sampled from 2 cameras, both covering the center of the pitch but placed 11 meters apart. Not all players are present in every frame, but in total 20 different players are included in the dataset. Head images are manually extracted so that it is possible to study the performance of the head pose estimation without it being affected by possible translation errors in the automatic detections. Because of the low resolution it is sometimes impossible even for a human to tell the head pose, and in those cases the head images are excluded from the set. In total 35 frames are sampled from both cameras, giving a total of 820 labelled head images. When matching players over frames we get 292 matching image pairs that can be used to evaluate the effect of using input from multiple cameras.

Figure 4.1: Dataset 1 (original resolution 1920×1080)

2. Dataset 2 consists of head images from two different football games, both shot in 1920×1080 pixels, with head sizes ranging from 9 to 16 pixels square, see figure 4.2b. Head images are manually extracted for 31 different players and the set contains a total of 520 labelled head images.

3. Dataset 3 covers the same video frames as dataset 1 except that the head images have been extracted using the automatic head detector. After some false detections and detections where a large part of the head is outside the image have been discarded we are left with 602 labelled head images. Some examples of images that are in the dataset as well as some images that were discarded can be seen in figure 4.3.

4. Dataset 4 is simply all the head images from datasets 1 and 2 combined, giving a total of 1340 images of 51 different players. This is the most varied dataset and will thus be used for most of our testing.

5. Dataset 5 is extracted from the free kick situation depicted in figure 3.5, using the same two cameras as in the figure. The cameras are placed 30 meters apart and capture the situation from two very different viewing angles. The situation is rather messy and contains many occluded heads, so it should give a good indication of whether we can benefit from multiple camera views. In total we label 363 matching images from each camera, containing images of 18 players.


Figure 4.2: Head images from (a) dataset 1 and (b) dataset 2. As can be seen, the images from dataset 1 have considerably lower resolution.

Figure 4.3: Automatically detected head images from dataset 3. (a) shows the nature of the kept detections and (b) shows some of the detections that were discarded because of too large errors.


4.2 Performance Evaluation

Performance of the random forest is measured as the mean absolute error (MAE) of the estimated head angle. The forest is trained using our full 36,000-image synthetic dataset, and bagging is used with each tree being trained on $N_t = N$ samples, where N is the total number of available training samples.

The performance of the forest is affected by a number of different parameters. Parameters that need to be considered are:

1. The number of trees, T, in the forest.
2. The number of tests, τ, to be considered in each split node.
3. The stopping criteria described in section 3.1.3:
   a) the maximum tree depth $d_{max}$;
   b) the purity criterion: $\rho_{min}$ for a classification target or $\epsilon_{min}$ for a regression target.
4. The mean shift bandwidth, $\beta_{leaf}$, to be used when clustering the data in a leaf node.
5. The mean shift bandwidth, $\beta_{forest}$, to be used when clustering the output from the entire forest.

We tune these parameters using the synthetic data, by means of the out-of-bag error (OOB error). We also study the effects that changing these parameters has on the performance on the real data in dataset 4.
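For concreteness, a minimal sketch of the MAE computation; we assume the absolute error wraps around the circle (so 359° versus 1° counts as an error of 2°, not 358°), consistent with the angular error of equation 3.2:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// MAE over wrapped angular differences in degrees; assumes both inputs
// hold angles in [0, 360).
double meanAbsoluteErrorDeg(const std::vector<double>& predicted,
                            const std::vector<double>& truth) {
    double sum = 0.0;
    for (std::size_t i = 0; i < predicted.size(); ++i) {
        double d = std::fabs(predicted[i] - truth[i]);
        sum += d <= 180.0 ? d : 360.0 - d; // wrap around the circle
    }
    return sum / predicted.size();
}
```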

4.2.1 Forest Size

There is likely to be more than one optimal configuration of the training parameters, and an extensive grid search spanning all the parameters is infeasible. Therefore we begin by finding suitable values for T and τ, and then study the effects of the other parameters given that choice. The number of tests, τ, is chosen because it is the parameter that has the biggest impact on training time, and it would be useful to settle on a value early on. Also, since the trees are trained completely independently from one another, the effect of the number of trees in the forest should not be affected by the other parameters. We explore the effects of these two parameters on a random forest trained with the classification target described in section 3.1.2 and the other parameters, set using intuition, as shown in table 4.1.


Figure 4.4: MAE on the synthetic OOB samples as a function of the number of trees, with curves for τ ranging from 50 to 10,000 tests.

d_max   ρ_min   β_leaf   β_forest
20      0.85    0.5      0.5

Table 4.1: Initial choice of training parameters


Figure 4.5: MAE on dataset 4 as a function of the number of trees, with curves for τ ranging from 50 to 10,000 tests.

Figure 4.4 shows how performance is affected by the number of trees for different values of τ. The decrease in error saturates at around 5000 tests, and increasing the number of tests further will likely just result in increased training time. It can also be noted that after a point very little is gained by increasing the number of trees; since evaluation time and memory requirements increase linearly with the number of trees, and our goal is to design a real-time system, we decide to limit ourselves to forests of 100 trees. Thus, if nothing is said to the contrary, for the rest of this report it will be understood that τ = 5000 and T = 100.

Figure 4.5 shows the results on the real data. We see the same qualitative behaviour as on the synthetic data, which should be a positive indication of the synthetic dataset's ability to model real data.


Figure 4.6: Performance for varying tree depths on (a) the synthetic OOB samples and (b) the real dataset 4. On the real data we see signs of overfitting when trees grow deeper. Note that the range of the y-axis is not the same in the two plots.

4.2.2 Stopping Criteria

The stopping criteria are meant to stop the tree from growing more than necessary and to prevent overfitting. First we study the effects of varying tree depth. In figure 4.6a we see how the performance varies with depth. Minimum error is achieved with a maximum depth of 14, and this depth will be used for the remainder of the report.

When comparing to the performance on the real data in figure 4.6b we see that while the performance on the OOB-samples saturates, the performance on the real dataset starts to deteriorate with deeper trees. This is a clear sign of overfitting, and the reason that this does not show up when tuning on the synthetic data is most likely that the OOB-samples share too many similarities with the other training samples.

It should be noted that the trained trees are generally not balanced; that is, trees will not necessarily be uniform in depth, and some branches may go much deeper than others. For example, a tree trained to a maximum depth of 20 will have an average depth of around 12. Thus only a small portion of the branches are affected by the maximum depth criterion, but as we have seen, pruning these branches can sometimes give an improvement in performance.

In figure 4.7 we see how the performance is affected by changing the minimum purity criterion in the leaf nodes. Higher values are obviously preferable, but one might expect there to be some overfitting if $\rho_{min}$ is set too high. The reason that we see no such tendencies here is likely the depth limitation set by $d_{max}$. Minimum error is achieved with $\rho_{min} = 0.85$, and that is the value that will be used for the rest of the report.

Figure 4.7: Performance for different values of $\rho_{min}$ on (a) the synthetic OOB samples and (b) the real dataset 4. Note that the range of the y-axis is not the same in the two plots.

4.2.3 Mean Shift Clustering

The last two parameters to investigate are the bandwidths used in the two mean shift steps. The bandwidths can be changed after training, so it is not too expensive to perform a grid search over these two parameters. In figure 4.8 we illustrate the results of such a search using a contour plot of the performance for a set of different values of the bandwidths. We see that the value of the bandwidth in the first step, $\beta_{leaf}$, has very little impact. This is likely due to less multi-modality in the leaf distributions than we first thought, and the addition of the mean shift clustering may have been unnecessary in the leaf nodes. However, in the uni-modal case mean shift should give more or less the same result as the circular mean, so we are no worse off for using it.

Figure 4.8: Performance for different values of the two bandwidths on (a) the synthetic OOB samples and (b) the real dataset 4. The values on the contours represent the MAE in degrees. The minimum errors are found at: (a) {$\beta_{leaf}$ = 0.3, $\beta_{forest}$ = 0.6}, and (b) {$\beta_{leaf}$ = 0.7, $\beta_{forest}$ = 0.7}.

T     τ      d_max   ρ_min   β_leaf   β_forest
100   5000   14      0.85    0.3      0.6

Table 4.2: Final choice of training parameters

The value of the second bandwidth, $\beta_{forest}$, used when clustering the votes from all the trees does, however, seem to matter. In particular, smaller bandwidths perform worse. We also note that the synthetic dataset seems to prefer slightly smaller bandwidths than the real dataset. This could be an indication of some minor overfitting to the synthetic data when using tighter clusters.

4.2.4 Target Function

In the previous chapter we introduced two different target functions. As Girshick et al. showed in [7], a classification target can outperform a regression target on a regression task. Here we explore whether this is true in our case as well. We train a random forest using the regression target function of equation 3.3, with the other parameters set as in table 4.2, replacing $\rho_{min}$ with $\epsilon_{min}$ = 25 degrees. In table 4.3 we compare the random forests trained using the two different target functions on dataset 4.


                                 MAE (degrees)
Regression Target Function       25.6
Classification Target Function   24.6

Table 4.3: MAE on dataset 4 for the random forests trained using different target functions.

                      Accuracy   # bins off   # bins off when wrong
Random Forest         57.9%      0.534        1.27
SVM: Grayscale        51.4%      0.630        1.30
SVM: HOG              49.9%      0.766        1.53
SVM: Grayscale+HOG    53.6%      0.729        1.57

Table 4.4: Classification accuracy on dataset 4. The third column represents the average number of bins between the predicted and true bin for all samples, and the last column represents the same measure for the misclassified samples only.

We see that the forest trained with the classification target function performs better, and we decide to continue using this target in the remainder of the report. It could be argued that we should explore more values of $\epsilon_{min}$, and perhaps even a different configuration of the other parameters, for this to be a fair comparison. But, given that the range of MAE values in figure 4.7b, where $\rho_{min}$ (the parameter in the classification target corresponding to $\epsilon_{min}$) is varied, is of the same magnitude as the difference in performance, we judge it unlikely that such experiments would yield any drastically different results.

4.3 Comparison With SVM

The SVMs are trained on the same synthetic dataset that we use to train the random forest. We perform grid searches to find optimal values for the parameters C and γ for each of the three image descriptors (the interested reader can find the results illustrated as contour plots in appendix B.1). The performance measure of the grid search is the mean classification accuracy on a 4-fold cross validation.

To compare our regression forest with the SVM classifiers, we let the forest vote for the class corresponding to the one of the 8 bins at which the inferred angle points. The forest used for the comparison is trained using the parameters in table 4.2. In table 4.4 we compare classification accuracy on dataset 4. Accuracy is measured as the percentage of samples that are classified correctly. We also calculate the average distance between the predicted bin and the true bin.


Figure 4.9: Confusion matrices for (a) the random forest, (b) SVM: Grayscale, (c) SVM: HOG and (d) SVM: Grayscale + HOG. Correct predictions are plotted on the diagonal, meaning that a clear diagonal indicates an accurate classifier. Note that because of the circularity, bins 1 and 8 are also adjacent, meaning that the lower left and upper right corners are only off by one bin.


                      Time per sample (ms)   Evaluations per second (Hz)
Random Forest         6.75                   148
SVM: Grayscale        20.7                   48.4
SVM: HOG              26.0                   38.4
SVM: Grayscale+HOG    46.8                   21.3

Table 4.5: Evaluation times for all the compared methods.

The distance is measured as the minimum number of bins between the predicted and true bin, where a distance of 1 means that the bins are adjacent. We calculate this distance based on the results on the entire dataset and also on only the misclassified samples.

We see that the random forest outperforms the SVMs with regard to both measures. The SVM combining HOG and grayscale features achieves the highest classification accuracy of the three SVMs. However, the distance measure indicates that the SVM using only grayscale features is slightly more stable, making smaller mistakes. Still, all in all, each method performs fairly well. The distance measure also suggests that no method makes many predictions that are wrong by a large amount. In figure 4.9 we display the results as confusion matrices. Here we see even more clearly what was indicated by the distance measure: when misclassifying, the prediction is more often than not in an adjacent class. Since a significant portion of the test samples will very likely lie close to class borders, some classifications of this nature are to be expected even for a rather accurate classifier. We can also note that the HOG+grayscale SVM predicts classes 1 and 8 more often than the other classifiers, which is likely the cause of the instability mentioned above.

We also compare the random forest and the SVMs in terms of evaluation speed. The evaluations are run using a single thread on an Intel Core i7-3720QM processor with a clock speed of 2.6 GHz. The evaluation time per sample is calculated by timing the evaluation of dataset 4 and then dividing the elapsed time by the number of samples. The results can be seen in table 4.5. We see that the random forest is considerably faster than all types of SVMs.

4.3.1 Results On Different Datasets

Based on the above results, the final version of the random forest is trained using the parameters shown in table 4.2. We test this forest on datasets 1 and 2; the results are shown in table 4.6. We see that the random forest performs considerably better on dataset 2, which is not unexpected considering that the samples in that dataset contain about twice as many pixels as those in dataset 1. Also, the size of the images in the training data (18×18 pixels) is more similar to the size of the images in dataset 2, which might make the training data better suited for that dataset.

            MAE (degrees)
Dataset 1   26.6
Dataset 2   21.4

Table 4.6: Random forest performance on the different datasets.

Figure 4.10: Histogram of the prediction errors when testing the random forest on dataset 4.

To test the importance of training image size we scale our training data to 10×10 pixels and train a second random forest. In table 4.7 we see the results from testing this forest on datasets 1 and 2. We see that the MAE is decreased on the smaller images of dataset 1, but increased on the larger images of dataset 2.
