Degree project in Computer Science Second cycle
Stockholm, Sweden 2013
Real-time Head Pose Estimation in Low-resolution Football Footage Using Random Forests
André Dübbel
DD221X - Degree Project in Computer Science, Second Level
ANDRÉ DÜBBEL DUBBEL@KTH.SE
Master’s Thesis at CSC, CVAP
Subject: Computer Science
Supervisor at Tracab: Dan Song
Supervisor at CVAP: Josephine Sullivan
Examiner: Stefan Carlsson
Abstract
This report presents a method for real-time head pose estimation in low-resolution football footage. The presented method uses a random forest trained on synthetically generated head images. The use of synthetic training images is shown to be a good substitute for manually labelled images. The presented method compares favourably to support vector machines trained for the same task, both in terms of accuracy and speed. It is noted that the method relies on good head detection to perform well. The report also examines ways of combining the image-based head pose estimation with contextual features such as ball position and player position. It is shown that the relative direction to the ball can improve the accuracy of the pose estimate in certain situations. Furthermore, it is found that the random forest method can easily be extended to incorporate images from multiple cameras to improve accuracy.
Realtidsestimering av huvudets vridning i lågupplöst fotbollsvideo
Sammanfattning
This report presents a method for real-time estimation of head orientation in low-resolution football video. The presented method uses a random forest trained on synthetically generated head images. The use of synthetic data proves to be a good substitute for manually labelled images. The presented method compares favourably to support vector machines trained for the same purpose, in terms of both accuracy and speed. It is shown that the method relies on good head detection to perform well. The report also examines ways of combining the image-based estimation with contextual factors such as ball position and player position. It is shown that the relative direction to the ball can improve accuracy in certain situations. Furthermore, it is shown that the random forest method can easily be extended to use images from multiple cameras to improve accuracy.
Acknowledgements
Many thanks to all the helpful people at the computer vision department at Tracab; thank you for always being approachable and for making me feel like part of the team. Thanks to Josephine Sullivan for her supervision and useful comments on the report. This project was conducted as part of the EU research project FINE (Grant agreement 248020, FP7-ICT-2009-4), whose support is gratefully acknowledged. Finally, a special thank you to Dan Song for her invaluable support and guidance throughout this project.
Contents
1 Introduction
1.1 Resources and Requirements
1.1.1 Synthetic Dataset
1.2 Thesis Outline
2 Theory and Related Work
2.1 Random Forests
2.1.1 Bagging
2.1.2 Split Nodes
2.1.3 Leaf Nodes
2.1.4 Previous Work
2.2 Head Pose Estimation
3 Method
3.1 Random Forest Design
3.1.1 Bagging
3.1.2 Split Nodes
3.1.3 Leaf Nodes
3.1.4 Forest Evaluation
3.2 Comparison To Support Vector Machines
3.3 Integration with Player Tracking
3.3.1 Multiple Cameras
3.3.2 Ball Position
4 Experiments and Results
4.1 Datasets
4.2 Performance Evaluation
4.2.1 Forest Size
4.2.2 Stopping Criteria
4.2.3 Mean Shift Clustering
4.2.4 Target Function
4.3 Comparison With SVM
4.3.1 Results On Different Datasets
4.4 Effects Of Translation Errors
4.5 Integration With Tracking System
4.5.1 Multiple Cameras
4.5.2 Ball Position
5 Discussion
5.1 General System Performance
5.2 Training And Test Data
5.3 Integration With Tracking System
5.3.1 Multiple Cameras
5.3.2 Ball Position
6 Conclusions And Future Work
Bibliography
A Other Algorithms
A.1 Mean Shift Clustering
B Auxiliary Results
B.1 Support Vector Machines
Chapter 1 Introduction
A simple thing like the movements of a person’s head can potentially be used to garner a lot of information. In surveillance videos, areas of interest can be found by following the direction of a pedestrian’s head. Rapid head movements may indicate alarm or fear, and a system that automatically identifies such situations could help a security guard prevent crimes from happening.
One could also make product and advertisement placements more effective by studying the gaze statistics of customers in shops and malls.
Babies as young as six months display the ability to follow another person’s gaze towards an object of interest [1]. However, tasks that we humans find exceedingly simple can be very hard for a computer to replicate, and head pose estimation remains an active area of research in computer vision. The many variations in human appearance, changing lighting conditions and cluttered images are factors that make this such a challenging problem. Add to this that the only information available is previously unseen low-resolution footage, and you get a daunting task indeed. This is precisely what this thesis aims to do: to design a method that estimates the head orientation of football players from in-game footage, in real-time.
The work of this thesis has been sponsored by Tracab and most of the work has been done on location at their office in Stockholm. Tracab is a world leader in sports tracking. Using image processing technology their system is able to identify the position of all moving objects in a sports arena in real-time.
With this technology they can both gather a multitude of game statistics and add graphics that follow players around in live broadcasts.
Tracab currently has an active collaboration with the Computer Vision and Active Perception lab (CVAP) at KTH on a 3-year EU project FINE (Free-viewpoint Immersive Networked Experience) [2]. Currently, the project is in its final year with the focus for Tracab and CVAP on image-based 3D human pose extraction. The aim is to gather richer information from sports
in addition to the player tracking. Such information could be used to generate 3D computer-game-style animations for football games which can be used in virtual arena applications. The thesis is in the context of this project, with the focus on estimating players’ head pose. The idea is that an accurate head pose will make game play animations look more realistic. In addition, the head pose can perhaps be used to reflect team strategy, to classify different game situations or to sketch players’ attention maps, all of which could be useful information for post game analysis or as a coaching aid.
We propose to solve the problem using random forests, a method recognized for its ability to handle a diverse variety of tasks and for its fast evaluation speed. One of the novelties of our approach is the use of synthetically created images to train our system. We show that these can be made to generalise quite well to real images. In addition to our appearance-based method, we also explore ways of improving performance by combining our results with data from Tracab’s tracking system.
1.1 Resources and Requirements
The goal of this thesis is to design a system that can estimate the orientation, or pose, of a football player’s head using a single image, in real-time. We propose to do this using random forests, see section 2.1. A version of random forests mostly based on the model introduced by Breiman [3] has been implemented in C++ specifically for our purposes. We have created several datasets, containing manually labelled head images from 3 different football games, that are used to evaluate the system’s performance. In the final system, head images are supplied by a head detection and tracking system designed by Dan Song. Depending on camera position and image resolution the extracted head images will vary in size, but in general they will be square images in the range of 5×5 to 20×20 pixels. The heads are fairly centred and the images include a small border of background around the head, see figure 1.1b. Due to the low resolution of the head images we limit the estimation of the head’s pose to the rotation around the vertical axis.
1.1.1 Synthetic Dataset
A common challenge in machine learning is the gathering of sufficient amounts
of labelled training data. This becomes especially hard when the labelling of
head orientation needs to be done accurately on low resolution images. To
address this problem we propose the use of automatically labelled synthetic
head images. The synthetic images have been generated in a software called
Figure 1.1: (a) Synthetic head images. (b) Real head images. (c) The three head rotation axes: yaw, pitch and roll.
Poser [4]. Starting from 10 different head models we create a total of 27 different head models using different permutations of hair style, hair colour and skin tone. These are further varied using 5 different lighting conditions. The models have two different base poses, either looking straight forward or slightly downwards. To model other poses the base pose is rotated with random yaw and pitch in the ranges of 0-359 and 0-40 degrees respectively, see figure 1.1c.
In this way approximately 36,000 head images with transparent background have been generated. The final dataset is created by superimposing these head images on randomly selected image patches from football footage. The images are scaled to 18×18 pixels to be in approximately the same resolution as the real images, while still not losing too much information. To model possible translation errors in the real images, the head position is shifted both horizontally and vertically according to a normal distribution with mean 0 and standard deviation 1. Some sample images from the dataset can be seen in figure 1.1a.
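The compositing and jitter step can be sketched as follows; a minimal NumPy illustration with our own helper name `composite_with_jitter` (the actual images were generated in Poser, so this is only a conceptual sketch of the superimposition and translation-noise step):

```python
import numpy as np

rng = np.random.default_rng(0)

def composite_with_jitter(head_rgba, background, std=1.0):
    """Superimpose a transparent-background head crop onto a background
    patch, shifting it by a N(0, std^2) pixel offset in x and y to mimic
    head-detection translation errors."""
    h, w = head_rgba.shape[:2]
    dx, dy = np.rint(rng.normal(0.0, std, size=2)).astype(int)
    out = background.copy()
    for y in range(h):
        for x in range(w):
            ty, tx = y + dy, x + dx
            if 0 <= ty < out.shape[0] and 0 <= tx < out.shape[1]:
                alpha = head_rgba[y, x, 3] / 255.0
                out[ty, tx] = (alpha * head_rgba[y, x, :3]
                               + (1 - alpha) * out[ty, tx]).astype(np.uint8)
    return out

# Example: an opaque red 18x18 "head" over a black background patch.
head = np.zeros((18, 18, 4), dtype=np.uint8)
head[..., 0] = 255
head[..., 3] = 255
patch = np.zeros((18, 18, 3), dtype=np.uint8)
jittered = composite_with_jitter(head, patch)
```

The per-pixel loop is slow but harmless at 18×18; a production version would use vectorized alpha blending.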
1.2 Thesis Outline
The rest of the report is organized as follows: in chapter 2 we go through
some theory relevant to our work, and make a short review of previous work
done in the area; chapter 3 gives a detailed description of how our system
has been designed; in chapter 4 we present the experiments we have done to
evaluate the system and present the results they have yielded; in chapter 5 we
discuss the results presented in the previous chapter; in the final chapter we
summarize our work and give some suggestions for possible future work.
Chapter 2
Theory and Related Work
Here we summarize prior work related to ours in the field of head pose estimation. To properly do so, we first need to introduce some new terminology, and we begin this chapter with an introduction to random forests and some of their variations.
2.1 Random Forests
Random forests are a family of ensemble classifiers introduced by Breiman in 2001 [3]. They have gained a lot of popularity in recent years, much thanks to the impressive results achieved by Shotton et al.: in connection with the development of Microsoft Kinect, they have demonstrated random forests’ ability to handle large training datasets efficiently and to solve complex tasks [5–8].
A random forest consists of an ensemble of decision trees, see figure 2.1, and can be used either for multi-class classification [5, 9], regression [7, 10, 11], or even both at the same time [12]. As the name implies the method incorporates certain elements of randomness; each tree is trained using a random subset of the training data, and the test at each split node is constructed from a random selection of descriptive features. See figure 2.2 for a step by step summary of the random forest training algorithm.
2.1.1 Bagging
The concept of using a subset of the data for training each tree was one of the novelties of Breiman’s method and is referred to as bootstrap aggregating, or bagging for short [13]. Let S be the full set of training data {(y_n, x_n), n = 1, …, N}, where y_n is the label or numerical response of the data vector x_n. Let S_t denote a subset of S containing N samples drawn from S with replacement. By then training tree t using the corresponding subset S_t, bagging aims at reducing the risk of overfitting to the training data by decorrelating the trees. In addition, bagging also has the benefit of giving us a convenient measure of generalization error: the out-of-bag (OOB) error is estimated by testing each tree on the unseen samples S \ S_t and can be done continuously during training [3].

Figure 2.1: (a) A random forest is made up of a set of individual trees. Each tree consists of split nodes and leaf nodes. The split nodes contain tests that send data either to the left or right child, and the leaf nodes contain some kind of prediction. (b) At training time a set of data is presented to the root node; each node then selects the test that best separates the data into classes. When the data is pure enough, growing stops and the node is turned into a leaf.
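The bagging split can be sketched in a few lines (an illustrative helper of ours, not Breiman's code): each tree gets a bootstrap sample S_t of indices, and the indices left out form its out-of-bag set S \ S_t.

```python
import random

def bootstrap_split(n_samples, seed=0):
    """Draw n_samples indices with replacement (the bag S_t) and return
    them together with the out-of-bag indices S \\ S_t."""
    rng = random.Random(seed)
    bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    oob = sorted(set(range(n_samples)) - set(bag))
    return bag, oob

bag, oob = bootstrap_split(1000)
```

On average roughly a third of the samples end up out-of-bag, which is what makes the OOB error a usable generalization estimate.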
2.1.2 Split Nodes
Every split node in the tree contains a test meant to separate the data as well as possible with regard to some objective. In this work we limit the discussion to binary tests, which leaves us with functions of the general form:

h(x_n, θ) ∈ {0, 1},    (2.1)

where θ is the vector of descriptive features used in this particular node. The tests used in practice are generally quite simple: a comparison of a single feature to a threshold, or a comparison between two different features, are two commonly used tests.

The test evaluates the samples in S = {(y_n, x_n)} using the descriptors in the feature vector θ and passes each sample to the left or right child of the node:

S_l(θ) = {(y_n, x_n) ∈ S | h(x_n, θ) = 0}    (2.2)
S_r(θ) = {(y_n, x_n) ∈ S | h(x_n, θ) = 1}    (2.3)

Figure 2.2: A general summary of the random forest training algorithm. Starting at the root node with a set S = {(y_n, x_n), n = 1, …, N} of training data, a tree is grown recursively as:
1. Generate a random set of feature vectors Θ = {θ_k}.
2. Split the dataset S into a left and right subset for each θ ∈ Θ according to some test function h(x_n, θ) ∈ {0, 1}:
   S_l(θ) = {(y_n, x_n) ∈ S | h(x_n, θ) = 0}
   S_r(θ) = {(y_n, x_n) ∈ S | h(x_n, θ) = 1}
3. Select the feature vector θ* that maximizes the information gain G(θ): θ* = argmax_θ G(θ), where G(θ) = H(S) − (1/|S|)(|S_l(θ)| H(S_l(θ)) + |S_r(θ)| H(S_r(θ))).
4. Continue growing the child nodes with the subsets S_l and S_r until some stopping criterion is met, in which case the split node is turned into a leaf node that stores the distribution of the samples that reached it.

When constructing a tree, each new node is presented with a set of randomly selected features Θ = {θ_k}. The node picks the feature vector that best partitions the data according to:

θ* = argmax_{θ_k ∈ Θ} G(θ_k),    (2.4)

where G(θ) is the information gain defined as:

G(θ) = H(S) − (1/|S|) (|S_l(θ)| H(S_l(θ)) + |S_r(θ)| H(S_r(θ)))    (2.5)

The target function H(S) is chosen differently depending on the application, but the objective is to minimize the error while balancing the sizes of the left and right distributions. A commonly used target function is the Shannon entropy [5, 7, 8, 14], defined for classification as:

H(S) = − Σ_c p(c|S) log p(c|S),    (2.6)

where the c are the class labels. In the case of regression one can instead try to minimize the sum of squared errors [7]:

ε(S) = Σ_{y∈S} |y − ȳ|²,    (2.7)

where ȳ represents the mean value of the labels in the distribution S.
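For concreteness, the information gain of equation 2.5 with the entropy of equation 2.6 can be computed as follows (a sketch with our own helper names):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy H(S) = -sum_c p(c|S) log p(c|S), in bits."""
    n = len(labels)
    return -sum((k / n) * log2(k / n) for k in Counter(labels).values())

def information_gain(labels, left, right):
    """G = H(S) - (|S_l| H(S_l) + |S_r| H(S_r)) / |S| (eq. 2.5)."""
    n = len(labels)
    return entropy(labels) - (len(left) * entropy(left)
                              + len(right) * entropy(right)) / n

# A perfect two-class split gains one full bit; a useless split gains nothing.
g_perfect = information_gain(['a', 'a', 'b', 'b'], ['a', 'a'], ['b', 'b'])
g_useless = information_gain(['a', 'b', 'a', 'b'], ['a', 'b'], ['a', 'b'])
```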
2.1.3 Leaf Nodes
During training, each leaf node tries to find a good representation of the data that reached it using some kind of prediction model. Depending on the application and the nature of the data this can be done in a number of ways.
In the case of classification, the most straightforward way is to let the probability distribution of tree t, p_t(c|x_n), be proportional to the number of samples from each class c. The estimated distribution of the entire forest is then simply the average of the individual distributions:

p(c|x_n) = (1/T) Σ_{t=1}^{T} p_t(c|x_n)    (2.8)

In a regression forest, the leaf node outputs could be either point estimates or probability distributions. In the latter case the outputs can be weighed together as the average of all the trees in the same way as with a classification forest.
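Equation 2.8 amounts to a simple average over the per-tree distributions; a sketch (the class labels and probabilities below are made up for illustration):

```python
def forest_posterior(tree_distributions):
    """Average the per-tree class distributions p_t(c|x) into the forest
    estimate p(c|x) = (1/T) sum_t p_t(c|x) (eq. 2.8)."""
    T = len(tree_distributions)
    classes = tree_distributions[0].keys()
    return {c: sum(d[c] for d in tree_distributions) / T for c in classes}

# Two trees voting over two hypothetical pose classes.
p = forest_posterior([{'front': 0.8, 'back': 0.2},
                      {'front': 0.6, 'back': 0.4}])
```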
2.1.4 Previous Work
As mentioned above, much of the recent attention given to random forests in computer vision is due to Microsoft’s research team at Cambridge. In their first paper [5] Shotton et al. use random forests for human pose recognition by classifying individual pixels as belonging to different body parts. The forest is trained using hundreds of thousands of depth images synthetically generated from 100,000 human poses collected using a motion capture device.
Their results give insight into the effects of several training parameters. They indicate that tree depth has the most significant effect on performance, and show that overfitting can be avoided by using a larger dataset even when training very deep trees. Also, in their case, the number of trees in the forest is shown to be of less importance, and their final classifier achieves state-of-the-art performance even with no more than 3 trees.
In their next paper [7] the human pose estimation problem is tackled using a regression forest. The leaf distributions are modelled using mean shift clustering. They show that there is no benefit in storing more than the 2 biggest clusters, thus reducing both memory requirements and evaluation time. Another interesting discovery is that the use of a classification-style target function in the split nodes gives better results than a regression-style function.
They also investigate different sub-sampling methods with the purpose of making the forest more efficient, with some promising results.
Dantone et al. [11] use a conditional regression forest to improve facial feature detection. The idea is to use prior knowledge of some global variable to constrain the output. In this case the global variable is the orientation of the head, divided into 5 states. By training a full forest for each possible state they get forests that are specialized to identify facial features on faces with different orientations. However, if the number of states of the global variable grows larger this method quickly becomes infeasible, since the training time increases linearly with the number of states. Instead of conditioning the entire tree structure on the global variable, Sun et al. [6] suggest that only the regression models in the leaf nodes should be modified. They store a regression model for each state of the global variable and show that this not only speeds up training but also gives the best results when used with their human pose estimator.
Schulter et al. [15] suggest an on-line training scheme for decision forests where training data arrives sequentially: a tree starts out as a single leaf node, and as data arrives the node can choose either to remain a leaf node or, if enough data has arrived, to turn into a split node and pass on the data to its child nodes. With this way of training, split nodes will base their decisions on considerably smaller subsets of data than in the traditional off-line case. Intuitively this may seem like a bad thing, but the results presented by Schulter et al. suggest that it in fact helps with generalisation, the reason being that it decorrelates the trees in the same way that bagging does. The on-line training method could also be used to further grow an already built tree, thus possibly adapting a general tree to a more specific situation. Inspired by the good performance of the on-line forests, they also introduce a modified off-line training method where each split node bases its choice of split function on a small subset of the available data, showing that this reduces training time while improving performance.
2.2 Head Pose Estimation
There has been a fair amount of work done in the area of head pose estimation, some of which uses appearance-based methods and some of which relies on the identification of facial features. The latter, requiring that the resolution of the images be sufficiently high for individual features to be distinguishable, are not applicable to our situation and will hence not be discussed to any greater extent here. The same applies to work done using depth images [12].
However, the interested reader can find an extensive and relatively recent survey of existing methods for head pose estimation in [1].
Benfold and Reid have done extensive work in the area of head pose estimation in low-resolution surveillance videos [16, 17]. They classify head images as small as 10 pixels square into 8 classes spanning a full 360° rotation using a forest of random ferns¹.
In [16] the split nodes base their decisions on a segmentation of the head images into hair, skin and background regions from k-means clustering, requiring that the segmented training images are manually assigned correct region labels. During evaluation of unseen data, the most likely labelling hypothesis needs to be found. This is done by testing all possible combinations of labels, thus removing the random ferns’ inherent advantage in evaluation speed.
In [17] Colour Triplet Comparison (CTC) is introduced as an alternative image descriptor. CTC compares the colour values of three pixels, and the split node makes a decision depending on whether the first and second pixels are more similar than the second and third, where similarity is measured as the L1 norm in RGB-space, see figure 2.3a. The intuition is that the comparison will be able to find colour differences in the image while still being robust to different lighting conditions, skin colours etc. They also explore the
¹Random ferns are a special kind of decision tree where every split node at a given level uses the same test. This makes ferns easier to implement and more efficient to train, but at the cost of some flexibility.
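The weight-sharing that defines a fern can be made concrete: with one shared binary test per level, a depth-d fern maps an input directly to one of 2^d leaf bins. A minimal sketch (illustrative only, not Benfold's implementation):

```python
def fern_leaf_index(x, tests):
    """Evaluate one shared binary test per level and pack the outcomes
    into a leaf index in [0, 2**len(tests))."""
    index = 0
    for test in tests:
        index = (index << 1) | (1 if test(x) else 0)
    return index

# Two threshold tests on a 2D feature vector -> 4 possible leaves.
tests = [lambda x: x[0] > 0.5, lambda x: x[1] > 0.5]
```

Because the whole tree is just a bit-pattern lookup, evaluation is a handful of comparisons, which is the speed advantage the footnote refers to.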
Figure 2.3: (a) Colour Triplet Comparison (CTC): the test compares the colour values of three pixels and makes its decision depending on whether the first and second pixels are more similar than the second and third. (b) Histogram of Oriented Gradients (HOG): divides the image into cells and counts the occurrences of gradient orientations in each cell.
usefulness of HOG-descriptors, where the tests work by comparing the count in two bins of the histograms, see figure 2.3b. They find that HOG-descriptors perform better than CTC but that a combination of the two outperforms both of them separately.
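The CTC test translates almost directly to code; a sketch with our own function name, using the L1 norm in RGB as in [17] (in a forest, the three pixel coordinates would be the randomly chosen test parameters θ):

```python
def ctc_test(image, p1, p2, p3):
    """Colour Triplet Comparison: return True iff the pixels at p1 and p2
    are more similar than those at p2 and p3 under the L1 norm in colour
    space. Each p is a (row, col) coordinate into a 2D image of RGB tuples."""
    def l1(a, b):
        return sum(abs(int(u) - int(v)) for u, v in zip(a, b))
    c1, c2, c3 = (image[y][x] for (y, x) in (p1, p2, p3))
    return l1(c1, c2) < l1(c2, c3)

# Tiny 1x3 image: red, near-red, blue -> pixels 1 and 2 are the similar pair.
img = [[(255, 0, 0), (250, 5, 0), (0, 0, 255)]]
result = ctc_test(img, (0, 0), (0, 1), (0, 2))
```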
Benfold et. al. also consider ways of improving accuracy when there are translation errors in the estimate of the head location. They show that injection of similar errors in the training data can slightly improve results [17].
Siriteerakul et al. use image descriptors similar to the ones used by Benfold to train a Support Vector Machine (SVM) to estimate head pose: Pairwise Non-Local Intensity Differences (iDF) and Pairwise Non-Local Colour Differences (cDF). iDF simply compares the intensity of two pixels, while cDF checks whether two pixels of an image segmented with k-means clustering belong to the same segment [18].
Launila did work very similar to ours in his thesis, which was performed jointly at Tracab and KTH [19]. He replicates the random ferns classifier used in [16] and compares its performance on football footage to an SVM trained on grayscale images. The SVM is found to be both more efficient and more accurate. To improve performance he explores ways of combining the appearance-based classifier with contextual features, such as player and football position, with some success.
Dantone et al. use a random forest to estimate the head pose of faces viewed from the front. Head pose estimation is not the focus of their paper, but they indicate that better results are obtained by first estimating a continuous value for the pose and then converting it into a discrete label [11].
A random forest based on Gabor features is used for head pose estimation in [9]. In an attempt to strengthen the individual trees, Linear Discriminant Analysis (LDA) is used in the split nodes.
In summary, Benfold’s and Launila’s test data are the most similar to ours. The image representations used in Benfold’s later work [17] (CTC and HOG) seem promising, and it would be interesting to see how they fare when used in combination with a random forest utilizing some of the newer ideas discussed in the previous section. Launila’s successful use of contextual features should also be explored further. These ideas and more are investigated in the following chapters.
Chapter 3 Method
Based on the discussion in the previous chapter we decided to try to solve the head pose estimation problem using a random regression forest. In this chapter we give a detailed description of the random forest that we use and how we combine it with Tracab’s tracking data into a complete system.
3.1 Random Forest Design
The basis of our random forest is very similar to the original design suggested by Breiman [3]. The framework can be used for a multitude of tasks without too many modifications.
3.1.1 Bagging
Our bagging process is completely analogous to the original one: out of N training samples, T subsets of N_t samples are drawn randomly with replacement to be used for training each tree t ∈ {1, …, T}. In practice this means that, in the standard case where N_t = N, each tree will be trained on roughly two thirds of all the samples [13].
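The roughly-two-thirds figure follows from the fact that a given sample is left out of a bootstrap draw with probability (1 − 1/N)^N ≈ 1/e ≈ 0.368; a quick empirical check (not part of the thesis code):

```python
import random

def unique_fraction(n, seed=0):
    """Fraction of distinct samples appearing in a bootstrap draw of size n."""
    rng = random.Random(seed)
    return len({rng.randrange(n) for _ in range(n)}) / n

# For large n this converges to 1 - 1/e, about 0.632.
frac = unique_fraction(100_000)
```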
3.1.2 Split Nodes
One thing to consider when adapting a forest to a specific task is the choice of a suitable set of tests to be used in the split nodes. In our case we have decided to work with the same tests that Benfold et al. used in their work, since they have been shown to work well on data very similar to ours [17]. As mentioned in the previous chapter, they introduced Colour Triplet Comparison (CTC) and used it in combination with HOG-descriptors. However, our
Figure 3.1: Discrete division of head poses into 16 bins. By starting the binning at different angles φ for all the trees we avoid biasing the forest towards a certain division.
implementation will have one minor modification: instead of RGB-space our colour triplets will be represented in CIELUV-space. CIELUV is a color-space created with the aim of being perceptually uniform, which means that the dis- tance between two colours should be proportional to the perceptual difference between the colours [20]. Since the CTC works by comparing distances in color space this should be a useful property.
Since we have access to synthetic training data, perfectly labelled with continuous angle values, it is natural that we should use this to our advantage and try to estimate the head rotation as accurately as possible with a regression output. Intuitively the target function, used to choose the best test function at each split node, should be made to match the desired output of the forest; in other words, a regression target should be paired with a regression output and a classification target with a classification output. However, as shown by Girshick et al. in [7], a classification target can sometimes outperform a regression target even when striving to do regression. We therefore test both methods to see which performs best in our case.
Figure 3.2: Mean of circular quantities. The angles are projected onto the unit circle and the mean is calculated as a point in 2D, (x̄, ȳ), before being projected back onto the unit circle and converted to an angle, θ̄.

For the classification target, the head poses are divided into 16 evenly spaced bins around the full rotation and the trees aim to separate these bins using the Shannon entropy described in equation 2.6. However, some initial testing showed a problem with this discretization. We noticed that trees rarely voted for angles close to the edges of the bins, making the forest perform considerably worse on test samples in those regions. We solve this by introducing a variable starting angle φ for the binning as shown in figure 3.1. Each tree is assigned a random angle to use as the starting point of the first bin, meaning that individual trees will still be weaker in certain regions, but the output from the entire forest will no longer be biased towards a certain division.
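The per-tree binning with a random starting angle φ can be sketched as follows (our own helper; 16 bins of 22.5° each):

```python
def angle_to_bin(angle, phi, n_bins=16):
    """Map an angle in degrees to one of n_bins equal bins whose first
    bin starts at the tree-specific offset phi (also in degrees)."""
    width = 360.0 / n_bins
    return int(((angle - phi) % 360.0) // width)
```

With φ = 10, an angle of 5° lands in the last bin rather than the first, so bin edges fall in different places for each tree and no angle sits on an edge for the whole forest.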
As the regression target we would like to use the sum of squared errors described in equation 2.7. However, since we are working with a circular quantity, i.e. head rotation, we need our mean to be able to model the proximity of 0 degrees and 359 degrees. The linear mean does not have this property, so instead we use the following formula:

θ̄ = atan2( (1/n) Σ_{i=1}^{n} sin θ_i , (1/n) Σ_{i=1}^{n} cos θ_i ),    (3.1)

where atan2 is a version of the arctangent that takes two arguments in order to discern, from the signs of the arguments, in which quadrant the resulting angle should be. The workings of equation 3.1 are illustrated in figure 3.2. From the mean angle we can proceed to calculate the error sum. However, because of the circular property of the angle we need to modify the error function as well. We propose the following:

ε(S) = { Σ_{θ∈S} |θ − θ̄|          if |θ − θ̄| ≤ 180
       { Σ_{θ∈S} (360 − |θ − θ̄|)  if |θ − θ̄| > 180,    (3.2)

where θ̄ is the mean from equation 3.1. The best split is then simply the one that minimizes the value of:

E(S) = ε(S_L) + ε(S_R)    (3.3)
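Equations 3.1 and 3.2 translate directly to code; a sketch in degrees with our own function names:

```python
from math import atan2, cos, sin, radians, degrees

def circular_mean(angles):
    """Mean of circular quantities (eq. 3.1); angles in degrees."""
    n = len(angles)
    s = sum(sin(radians(a)) for a in angles) / n
    c = sum(cos(radians(a)) for a in angles) / n
    return degrees(atan2(s, c)) % 360.0

def circular_error(angles):
    """Sum of angular deviations from the circular mean (eq. 3.2)."""
    m = circular_mean(angles)
    total = 0.0
    for a in angles:
        d = abs(a - m)
        total += d if d <= 180.0 else 360.0 - d
    return total
```

Note that `circular_mean([350, 10])` gives 0°, where a linear mean would wrongly give 180°, which is exactly the wrap-around behaviour the text requires.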
During training each split node randomly generates a set of both CTC and HOG tests and picks the one that best splits the data according to the chosen target function. The number of tests evaluated at each split node also needs to be taken into consideration. A large number of tests will result in individually stronger trees, making it possible to have a forest with fewer trees and thus faster evaluation. At the same time, more tests will increase the correlation between trees, possibly reducing the generalization properties of the forest, and a very large number of tests will result in infeasibly long training times. On the other hand, too few tests will result in weak trees with poor performance. It is therefore important to find a number of tests that is neither too small nor too large.
3.1.3 Leaf Nodes
The splitting process is repeated for both child nodes, and the tree keeps growing until at least one of the following stopping criteria is met:

1. The node is at a depth greater than or equal to d_max.
2. The node meets a purity criterion:
   a) Classification target: at least a fraction ρ_min of the training samples belong to the same class.
   b) Regression target: the average angular error is smaller than ε_min.
3. The node contains fewer than |S|_min training samples.
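The three criteria can be collected into a single predicate; a sketch for the classification case, where the default thresholds are illustrative placeholders rather than the values tuned in the thesis:

```python
def should_stop(depth, labels, d_max=15, s_min=10, rho_min=0.9):
    """Return True if a node should become a leaf: maximum depth reached,
    too few samples, or purity above rho_min (classification case)."""
    if depth >= d_max:
        return True
    if len(labels) < s_min:
        return True
    most_common = max(labels.count(c) for c in set(labels))
    return most_common / len(labels) >= rho_min
```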
Figure 3.3: Mean shift clustering run on a set of angles.

Once a stopping criterion is met the node is turned into a leaf node. The last of the stopping criteria, |S|_min, is merely meant to stop branches with inconclusive results from growing more than necessary and should not affect performance very much. It is set to |S|_min = 10 and we do not consider any other values for it here.
Once turned into a leaf, the node needs to find a good representation of the data that it contains. The most straightforward approach would be to represent the data by its mean angle. However, we notice that leaf distributions are sometimes multimodal, in which case the mean would be a rather poor representation of the true distribution. To handle this problem we cluster the data using the mean shift algorithm¹. To tackle the circularity of the data the clustering is run on the data points projected on the unit circle, see figure 3.3. The main advantage of mean shift is that it makes no assumptions about the number of clusters in the dataset, making it very useful in our case. We use an Epanechnikov kernel and the only parameter that needs to be specified is the cluster bandwidth. Each leaf stores the 2 biggest clusters in the form of their respective means θ̄_m and weights w_m, where the weights are simply the portion of data that belongs to the respective clusters.
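A compact sketch of this leaf clustering, under the simplifying assumption that the Epanechnikov profile reduces each mean shift update to averaging the points inside a flat window of radius equal to the bandwidth (measured as chord length on the unit circle); all names are ours:

```python
import numpy as np

def mean_shift_circular(angles, bandwidth=0.5, n_iter=50):
    """Mean shift on angles (radians) via their unit-circle projections.

    Returns cluster mean angles and weights (fraction of points per
    cluster), largest cluster first."""
    pts = np.column_stack([np.cos(angles), np.sin(angles)])
    modes = pts.copy()
    for _ in range(n_iter):
        for i in range(len(modes)):
            d = np.linalg.norm(pts - modes[i], axis=1)
            window = pts[d < bandwidth]          # flat Epanechnikov window
            if len(window):
                m = window.mean(axis=0)
                modes[i] = m / np.linalg.norm(m)  # project back onto circle
    # Group converged modes that landed close together into clusters.
    centres, counts = [], []
    for m in modes:
        for j, c in enumerate(centres):
            if np.linalg.norm(m - c) < bandwidth / 2:
                counts[j] += 1
                break
        else:
            centres.append(m)
            counts.append(1)
    order = np.argsort(counts)[::-1]
    means = [float(np.arctan2(centres[j][1], centres[j][0])) for j in order]
    weights = [counts[j] / len(angles) for j in order]
    return means, weights
```

A leaf would then keep only the first two entries of `means` and `weights`.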
3.1.4 Forest Evaluation
At test time, the forest is presented with an image of an unknown head pose. The image is passed down each tree until it reaches a leaf node. Each tree then votes on up to 2 angles corresponding to the centres of the biggest data clusters in the leaf node. Each vote carries a weight w_m, and the votes accumulated from the entire forest are clustered once again using a weighted mean shift. The estimated head orientation is then simply the mean of the biggest vote cluster.
¹ The reader unfamiliar with the mean shift algorithm can find a more thorough description of it in appendix A.1.
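The vote aggregation could be sketched as follows; to keep the example short we replace the full weighted mean shift by a single pass that finds the heaviest window of votes and returns its weighted circular mean, which gives the same answer for well-separated clusters (names are ours):

```python
import numpy as np

def aggregate_votes(votes, weights, bandwidth=0.6):
    """Fuse per-tree angle votes (radians) with weights into one estimate.

    Seeds a window at every vote, keeps the window with the largest total
    weight, and returns the weighted circular mean of the votes inside it."""
    votes = np.asarray(votes, dtype=float)
    weights = np.asarray(weights, dtype=float)
    pts = np.column_stack([np.cos(votes), np.sin(votes)])
    best_mass, best_in = -1.0, None
    for i in range(len(votes)):
        inside = np.linalg.norm(pts - pts[i], axis=1) < bandwidth
        mass = weights[inside].sum()
        if mass > best_mass:
            best_mass, best_in = mass, inside
    m = (weights[best_in, None] * pts[best_in]).sum(axis=0)
    return float(np.arctan2(m[1], m[0]))
```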
3.2 Comparison To Support Vector Machines
A popular and versatile machine learning method is the support vector machine (SVM). It can be used for binary and multi-class classification as well as regression [21]. As a comparison to our random forest based method, we decide to train a set of multi-class SVMs on the same synthetic dataset that we used to train the random forest. There is no readily available implementation of an SVM that does regression on a circular quantity, and since optimizing the SVM is not the focus of our thesis, we decide to compare the performance of the random forest with SVM classifiers. We use the open-source implementation of SVM, libSVM [22], and design classifiers similar to the ones used by Launila in [19]. The 360 degrees are divided into 45 degree wide bins, resulting in an 8-class classification problem. We choose to use a radial basis function (RBF) kernel and C-style soft margins. As image descriptors we choose to investigate grayscale pixel intensities, HOG descriptors and a combination of the two.
We also train an SVM using a combination of color pixel values and HOG descriptors, which is the same information that the random forest has to work with. The resulting feature vector is quite high dimensional and we thus train this SVM using a linear kernel.
3.3 Integration with Player Tracking
Our access to Tracab's tracking system affords us several unique opportunities to improve our head pose estimation. We have access to player position, ball position and multiple camera angles from several football games. Since our synthetic training data does not contain these extra features, we are not able to train our random forest on them directly, as has been done by Launila [19]. However, we suggest some simple heuristics that could possibly be used to improve the pose estimate.
3.3.1 Multiple Cameras
In a football match the players spend most of the time separated from each other by at least a few meters, and occlusion and background clutter are not much of a problem, see figure 3.4. But in certain situations, such as free kicks and corner kicks, players crowd together and it can be hard to get a clean picture of all players. As can be seen from figure 3.5, multiple camera views can offer an advantage in these crowded situations.
Figure 3.4: Typical game footage. Players are well separated and there is very little occlusion and background clutter.
There could also be situations where one camera fails in some way, giving a blurred or unclear image or no image at all. In these situations having a system that handles input from several cameras of varying quality could be very useful.
The tracking system supplies all head detections with a player ID so it is a simple thing to match corresponding head detections from different cameras.
To get an estimate of a single player's head orientation we run all available detections of this player through the forest. The votes from each tree are then transformed from angles relative to the respective images into angles relative to the football pitch. The votes from all views are then clustered using mean shift in the same way as when evaluating a single image, and the output is the centre of the largest cluster. Our hypothesis is that a bad image will give rise to scattered votes, while a clear image will give a distinct cluster of votes. If so, the mean shift algorithm should still be able to find the clear image's distinct cluster among the noisy votes from the bad image.
Figure 3.6 shows how the pitch relative pose, ϕ, can be derived from the image relative pose, θ, the player position and the camera position. Note that the geometry will vary a bit for different player positions, but the general idea remains the same. The transformation can be summarized in the following equation:
ϕ = (1 + 2n)π + α + θ, where n ∈ Z is chosen so that 0 ≤ ϕ < 2π (3.4)
and α is the relative angle between the two positions as defined in figure 3.6.
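Assuming α is measured from the player towards the camera and θ relative to the camera's viewing direction, equation 3.4 amounts to wrapping π + α + θ into [0, 2π); a sketch with our own naming and sign conventions:

```python
import math

def pitch_relative_pose(theta, player_xy, camera_xy):
    """Convert an image relative head pose, theta, to a pitch relative
    pose, phi (equation 3.4). All angles in radians.

    alpha is the angle of the camera as seen from the player; the added
    pi accounts for the camera viewing the player head-on."""
    alpha = math.atan2(camera_xy[1] - player_xy[1],
                       camera_xy[0] - player_xy[0])
    return (math.pi + alpha + theta) % (2 * math.pi)
```

Transforming every tree's votes this way puts the votes from all cameras into one common frame, so they can be clustered together.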
(a) Left Camera (b) Right Camera
Figure 3.5: The figure shows two different views of the same free kick situation.
We see that a different viewing angle can make a huge difference in player separation.
Figure 3.6: Diagram explaining how the image relative head pose, θ, is related to the pitch relative pose, ϕ, where α is the relative angle between player and camera, positioned at (x_p, y_p) and (x_c, y_c) respectively.
3.3.2 Ball Position
Naturally, players are often looking at the ball, and it has been shown by Launila that this can be used to improve the accuracy of the head pose estimation [19] [23]. We know the positions of both the players and the ball, so it is a simple matter to calculate the relative direction from the player to the ball.
The output from the forest often contains more than one big cluster of votes, and we have noted that in the case of erroneous estimates the second biggest cluster is often much closer to the correct angle. Our hypothesis is that this second cluster may often correspond to the angle towards the ball. We therefore suggest that the estimate should be changed when the second biggest cluster coincides with the ball direction.
We consider the angles to coincide whenever the cluster centre is within 45 degrees of the angle towards the ball. We do not want to be confused by small clusters from noisy votes, and therefore only consider clusters that are at least half as big as the largest cluster.
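These two rules can be sketched as follows, with angles in radians and `clusters` given as (angle, weight) pairs sorted by weight, largest first; the function names are ours:

```python
import math

def circ_diff(a, b):
    """Smallest absolute difference between two angles (radians)."""
    d = (a - b) % (2 * math.pi)
    return min(d, 2 * math.pi - d)

def refine_with_ball(clusters, ball_angle,
                     max_dev=math.radians(45), min_rel_size=0.5):
    """Swap to the runner-up vote cluster when it points toward the ball.

    The second cluster is trusted only if it is at least half as big as
    the largest cluster and within 45 degrees of the ball direction."""
    if len(clusters) < 2:
        return clusters[0][0]
    (a1, w1), (a2, w2) = clusters[0], clusters[1]
    if w2 >= min_rel_size * w1 and circ_diff(a2, ball_angle) <= max_dev:
        return a2
    return a1
```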
Chapter 4
Experiments and Results
In this chapter we describe the experiments performed to evaluate the performance of the system. We start off by introducing the datasets used in the evaluation. We then present the results of the experiments and offer observations regarding those results.
4.1 Datasets
Our main objective is to investigate how well our random forest trained with synthetic data performs on real footage. We therefore create a few sets of head images with manually assigned angles:
1. Dataset 1 consists of head images from a video shot in 1920×1080 pixels, see figure 4.1, with head sizes ranging from approximately 5 to 10 pixels square, see figure 4.2a. Video frames are randomly sampled from 2 cameras, both covering the center of the pitch but placed 11 meters apart.
Figure 4.1: Dataset 1 (original resolution 1920×1080)
Not all players are present in every frame, but in total 20 different players are included in the dataset. Head images are manually extracted so that it is possible to study the performance of the head pose estimation without it being affected by possible translation errors in the automatic detections. Because of the low resolution it is sometimes impossible even for a human to tell the head pose, and in those cases the head images are excluded from the set. In total 35 frames are sampled from the two cameras, giving 820 labelled head images. When matching players over frames we get 292 matching image pairs that can be used to evaluate the effect of using input from multiple cameras.
2. Dataset 2 consists of head images from two different football games, both shot in 1920×1080 pixels, with head sizes ranging from 9 to 16 pixels square, see figure 4.2b. Head images are manually extracted for 31 different players and the set contains a total of 520 labelled head images.
3. Dataset 3 covers the same video frames as dataset 1 except that the head images have been extracted using the automatic head detector. After some false detections and detections where a large part of the head is outside the image have been discarded we are left with 602 labelled head images. Some examples of images that are in the dataset as well as some images that were discarded can be seen in figure 4.3.
4. Dataset 4 is simply all the head images from datasets 1 and 2 combined, giving a total of 1340 images of 51 different players. This is the most varied dataset and will thus be used for most of our testing.
5. Dataset 5 is extracted from the free kick situation depicted in figure 3.5, using the same two cameras as in the figure. The cameras are placed 30 meters apart and capture the situation from two very different viewing angles. The situation is rather messy and contains many occluded heads, so it should give a good indication of whether we can benefit from multiple camera views. In total we label 363 matching images from each camera, containing images of 18 players.
(a) Dataset 1 (b) Dataset 2
Figure 4.2: Head images from datasets 1 and 2. As can be seen, the images from dataset 1 have considerably lower resolution.
(a) Kept detections (b) Discarded detections
Figure 4.3: Automatically detected head images from dataset 3. The left image
shows the nature of the kept images and the right shows some of the images
that were discarded because of too large errors.
4.2 Performance Evaluation
Performance of the random forest is measured as the mean absolute error (MAE) of the estimated head angle. The forest is trained using our full 36,000 image synthetic dataset, and bagging is used with each tree being trained on N_t = N samples, where N is the total number of available training samples.
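Since the quantity is circular, the per-sample absolute error is the smallest rotation between the estimated and true angle; a sketch of the MAE computation with our own naming:

```python
import numpy as np

def mean_absolute_angular_error(pred_deg, true_deg):
    """MAE between predicted and true head angles in degrees,
    respecting circularity, so each error lies in [0, 180]."""
    d = np.abs(np.asarray(pred_deg, float)
               - np.asarray(true_deg, float)) % 360.0
    return float(np.minimum(d, 360.0 - d).mean())
```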
The performance of the forest is affected by a number of different parameters. Parameters that need to be considered are:
1. The number of trees, T, in the forest.
2. The number of tests, τ, to be considered in each split node.
3. The stopping criteria described in section 3.1.3:
   a) The maximum tree depth, d_max.
   b) The purity criterion: ρ_min for a classification target or ε_min for a regression target.
4. The mean shift bandwidth, β_leaf, to be used when clustering the data in a leaf node.
5. The mean shift bandwidth, β_forest, to be used when clustering the output from the entire forest.
We tune these parameters using the synthetic data, by means of the out-of-bag- error (OOB-error). We also study the effects that changing these parameters has on the performance on the real data in dataset 4.
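A sketch of how an out-of-bag error of this kind can be computed; to keep the example short we aggregate a sample's out-of-bag votes with a plain circular mean rather than the mean shift step, and the data layout and names are ours:

```python
import numpy as np

def oob_mae(per_tree_preds, bootstrap_sets, y_true):
    """Out-of-bag MAE for a forest, in degrees.

    per_tree_preds[t][i] is tree t's angle prediction for sample i and
    bootstrap_sets[t] is the set of sample indices tree t was trained on.
    Each sample is scored only by the trees that did not see it."""
    errs = []
    for i in range(len(y_true)):
        votes = [per_tree_preds[t][i]
                 for t in range(len(bootstrap_sets))
                 if i not in bootstrap_sets[t]]
        if not votes:
            continue  # sample appeared in every bootstrap set
        v = np.radians(votes)
        est = np.degrees(np.arctan2(np.sin(v).mean(),
                                    np.cos(v).mean())) % 360.0
        d = abs(est - y_true[i]) % 360.0
        errs.append(min(d, 360.0 - d))
    return float(np.mean(errs))
```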
4.2.1 Forest Size
There is likely more than one optimal configuration of the training parameters, and an extensive grid search spanning all the parameters is infeasible. Therefore we begin by finding suitable values for T and τ and then study the effects of the other parameters given that choice. The number of tests, τ, is chosen because it is the parameter that has the biggest impact on training time, so it is useful to settle on a value early. Also, since the trees are trained completely independently of one another, the effect of the number of trees in the forest should not depend on the other parameters. We explore the effects of these two parameters on a random forest trained with the classification target described in section 3.1.2, with the other parameters set using intuition, as shown in table 4.1.
Figure 4.4: MAE on the synthetic OOB samples as a function of the number of trees, for τ = 50, 100, 250, 500, 1000, 5000 and 10000 tests.
d_max  ρ_min  β_leaf  β_forest
20     0.85   0.5     0.5
Table 4.1: Initial choice of training parameters
Figure 4.5: MAE on dataset 4 as a function of the number of trees, for τ = 50, 100, 250, 500, 1000, 5000 and 10000 tests.
Figure 4.4 shows how performance is affected by the number of trees for different values of τ. The decrease in error saturates at around 5000 tests, and increasing the number of tests further would likely just increase training time. It can also be noted that after a point very little is gained by increasing the number of trees. Since evaluation time and memory requirements increase linearly with the number of trees, and our goal is to design a real-time system, we decide to limit ourselves to forests of 100 trees. Thus, if nothing is said to the contrary, for the rest of this report it will be understood that τ = 5000 and T = 100.
Figure 4.5 shows the results on the real data. We see the same qualitative
behaviour as on the synthetic data which should be a positive indication of
the synthetic dataset’s ability to model real data.
Figure 4.6: Performance for varying tree depths on (a) the synthetic OOB-samples and (b) the real dataset 4. On the real data we see signs of overfitting when trees grow deeper. Note that the range of the y-axis is not the same in the two plots.
4.2.2 Stopping Criteria
The stopping criteria are meant to stop the tree from growing more than necessary and to prevent overfitting. First we study the effects of varying tree depth. In figure 4.6a we see how the performance varies with depth. Minimum error is achieved with a maximum depth of 14, and this depth will be used for the remainder of the report.
When comparing to the performance on the real data in figure 4.6b we see that while the performance on the OOB-samples saturates, the performance on the real dataset starts to deteriorate with deeper trees. This is a clear sign of overfitting, and the reason that this does not show up when tuning on the synthetic data is most likely that the OOB-samples share too many similarities with the other training samples.
It should be noted that the trained trees are generally not balanced. That is, trees will not necessarily be uniform in depth and some branches may go much deeper than others. For example, a tree trained to a maximum depth of 20 will have an average depth of around 12. Thus only a small portion of the branches are affected by the maximum depth criterion, but as we have seen, pruning these branches can sometimes improve performance.
In figure 4.7 we see how the performance is affected by changing the minimum purity criterion in the leaf nodes. Higher values are obviously preferable, but one might expect some overfitting if ρ_min is set too high. The reason that we see no such tendencies here is likely the depth limitation set by d_max. Minimum error is achieved with ρ_min = 0.85, and this is the value that will be used for the rest of the report.
Figure 4.7: Performance for different values of ρ_min on (a) the synthetic OOB-samples and (b) the real dataset 4. Note that the range of the y-axis is not the same in the two plots.
4.2.3 Mean Shift Clustering
The last two parameters to investigate are the bandwidths used in the two mean shift steps. The bandwidths can be changed after training, so it is not too expensive to perform a grid search over these two parameters. In figure 4.8 we illustrate the results of such a search using contour plots of the performance for a set of different bandwidth values. We see that the value of the first bandwidth, β_leaf, has very little impact. This is likely because there is less multi-modality in the leaf distributions than we first thought, and the mean shift clustering in the leaf nodes may have been unnecessary. However, in the uni-modal case mean shift should give more or less the same result as the circular mean, so we are no worse off for using it.
Figure 4.8: Performance for different values of the two bandwidths on (a) the synthetic OOB-samples and (b) the real dataset 4. The values on the contours represent the MAE in degrees. The minimum errors are found at: (a) {β_leaf = 0.3, β_forest = 0.6}, and (b) {β_leaf = 0.7, β_forest = 0.7}.

T    τ     d_max  ρ_min  β_leaf  β_forest
100  5000  14     0.85   0.3     0.6
Table 4.2: Final choice of training parameters

The value of the second bandwidth, β_forest, used when clustering the votes from all the trees, does however seem to matter. In particular, smaller bandwidths perform worse. We also note that the synthetic dataset seems to prefer slightly smaller bandwidths than the real dataset. This could be an indication of some minor overfitting to the synthetic data when using tighter clusters.
4.2.4 Target Function
In the previous chapter we introduced two different target functions. As Girshick et al. showed in [7], a classification target can outperform a regression target on a regression task. Here we explore whether this is true in our case as well. We train a random forest using the regression target function (3.3) with the other parameters set as in table 4.2, replacing ρ_min with ε_min = 25 degrees. In table 4.3 we compare the random forests trained using the two different target functions on dataset 4.
                                MAE (degrees)
Regression Target Function      25.6
Classification Target Function  24.6
Table 4.3: MAE on dataset 4 for the random forests trained using different target functions.
                    Accuracy  # bins off  # bins off when wrong
Random Forest       57.9%     0.534       1.27
SVM: Grayscale      51.4%     0.630       1.30
SVM: HOG            49.9%     0.766       1.53
SVM: Grayscale+HOG  53.6%     0.729       1.57
Table 4.4: Classification accuracy on dataset 4. The third column represents the average number of bins between predicted and true bin for all samples, and the last column represents the same measure for only the misclassified samples.
We see that the forest trained with the classification target function performs better, and we decide to continue using this target for the remainder of the report. It could be argued that we should explore more values of ε_min, and perhaps even a different configuration of the other parameters, for this to be a fair comparison. However, the range of MAE values in figure 4.7b, where ρ_min (the parameter in the classification target corresponding to ε_min) is varied, is of the same magnitude as the difference in performance, so we judge it unlikely that such experiments would yield drastically different results.
4.3 Comparison With SVM
The SVMs are trained on the same synthetic dataset that we use to train the random forest. We perform grid searches to find optimal values of the parameters C and γ for each of the three image descriptors¹. The performance measure of the grid search is the mean classification accuracy on a 4-fold cross validation.
To compare our regression forest with the SVM classifiers we let the forest vote for the class corresponding to the one of the 8 bins that the inferred angle points at. The forest used for the comparison is trained using the parameters in table 4.2. In table 4.4 we compare classification accuracy on dataset 4. Accuracy is measured as the percentage of samples that are classified correctly. We also calculate the average distance between the predicted bin and the true bin.
¹ The interested reader can find the results illustrated as contour plots in appendix B.1.
[Figure 4.9 panels: (a) Random Forest, (b) SVM: Grayscale, (c) SVM: HOG, (d) SVM: Grayscale + HOG]
Figure 4.9: Confusion matrices for the random forest and the SVMs. Correct predictions are plotted on the diagonal, meaning that a clear diagonal indicates an accurate classifier. Note that because of the circularity, bins 1 and 8 are also adjacent, meaning that the lower left and upper right corners are only off by one bin.
                    Time per sample (ms)  Evaluations per second (Hz)
Random Forest       6.75                  148
SVM: Grayscale      20.7                  48.4
SVM: HOG            26.0                  38.4
SVM: Grayscale+HOG  46.8                  21.3
Table 4.5: Evaluation times for all the compared methods.
The distance is measured as the minimum number of bins between the predicted and true bin, where a distance of 1 means that the bins are adjacent. We calculate this distance based on the results on the entire dataset and also on only the misclassified samples.
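The binning of an inferred angle into one of the 8 classes and the circular bin distance can be computed as, for example (function names are ours):

```python
def angle_to_bin(angle_deg, n_bins=8):
    """Map an angle in degrees to one of n_bins equal bins
    (45 degrees wide for 8 bins)."""
    return int((angle_deg % 360.0) // (360.0 / n_bins))

def bin_distance(pred_bin, true_bin, n_bins=8):
    """Minimum number of bins between two bins on a circle of n_bins.

    Adjacent bins have distance 1; because the angle wraps around, the
    first and last bins are also adjacent."""
    d = abs(pred_bin - true_bin) % n_bins
    return min(d, n_bins - d)
```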
We see that the random forest outperforms the SVMs with regard to both measures. The SVM combining HOG and grayscale features achieves the highest classification accuracy of the three SVMs. However, the distance measure indicates that the SVM using only grayscale features is slightly more stable, making smaller mistakes. Still, all in all, each method performs fairly well. The distance measure also suggests that no method makes many predictions that are wrong by a large amount. In figure 4.9 we display the results as confusion matrices. Here we see even more clearly what was indicated by the distance measure: when misclassifying, the prediction is more often than not in an adjacent class. Since a significant portion of the test samples will very likely lie close to class borders, some misclassifications of this nature are to be expected even for a rather accurate classifier. We can also note that the HOG+grayscale SVM predicts classes 1 and 8 more often than the other classifiers, which is likely the cause of the instability that we mentioned before.
We also compare the random forest and the SVMs in terms of evaluation speed. The evaluations are run using a single thread on an Intel Core i7-3720QM processor with a clock speed of 2.6 GHz. The evaluation time per sample is calculated by timing the evaluation of dataset 4 and then dividing the elapsed time by the number of samples. The results can be seen in table 4.5. We see that the random forest is considerably faster than all types of SVMs.
4.3.1 Results On Different Datasets
Based on the above results the final version of the random forest is trained
using the parameters shown in table 4.2. We test this forest on dataset 1 and
2. The results are shown in table 4.6. We see that the random forest performs
considerably better on dataset 2, which is not unexpected considering that
            MAE (degrees)
Dataset 1   26.6
Dataset 2   21.4
Table 4.6: Random forest performance on the different datasets.