Degree project in Computer Science Second cycle
Mihai Damaschin
A real-time hand pose recognition system
MIHAI DAMASCHIN
Master’s Thesis at NADA Supervisor: Hedvig Kjellström
Examiner: Danica Kragic
TRITA xxx yyyy-nn
Abstract
This thesis work aimed to reimplement and improve an existing system for hand pose recognition from monocular video data. The resulting system is light, multi-platform and easily extensible because of its modularity. It relies on treating the problem of hand pose estimation as a nearest neighbour look-up in a database of synthetically generated hand images. Its main characteristics are the use of HOGs (Histograms of Oriented Gradients) as features and the use of temporal consistency for greater reliability and robustness.
The paper also reviews current hand pose recognition research and gives arguments for our implementation choices, both in terms of design and of the actual technology used.
The work on this thesis aimed to rebuild and improve an existing system for hand pose estimation. The resulting system is lightweight and platform independent, and it is easy to extend thanks to its modularity. The problem of estimating hand poses is treated as a nearest neighbour search in a database of synthetically generated images of hands. The main characteristics of the system are the use of HOGs (Histograms of Oriented Gradients) and of temporal consistency for increased reliability and stability.
The thesis also contains a study of current research in the field and presents arguments for our implementation, regarding both the design and the technology used.
Contents
1 Introduction
2 Hand tracking and pose estimation
2.1 Related work
2.2 The hand tracking module
2.3 Image descriptors for hand pose estimation
2.3.1 Image gradients
2.3.2 Histogram of Oriented Gradients
2.3.3 Hu moments
2.3.4 SIFT features
2.3.5 Shape contexts
2.3.6 Evaluating image descriptors
2.4 Inferring hand pose
2.4.1 k Nearest Neighbours (kNN)
2.4.2 Support vector regression
2.4.3 Particle swarm optimization (PSO)
3 Overview of the system
3.1 Kinematic hand model
3.2 Probabilistic framework
3.3 The application
3.4 Performance
3.4.1 Resource usage
3.4.2 Detection performance
3.5 The code
4 Generating the image database
4.1 Rendering the images
4.2 Random poses versus grasps
5 Conclusions
5.1 Summary
5.2 Future work
5.2.1 Extending the database
Appendices
Bibliography
Chapter 1
Introduction
In this paper we discuss the problem of hand pose estimation in the context of an existing application, developed in [19] and more recently in [18], for close to real-time hand tracking when hands are grasping objects. The method and code developed in the original paper are analysed and extended. Throughout the paper we refer to the implementation of the method described in these two papers as "our system" and "our method" interchangeably. Hand pose recognition is an important focus of current research in computer vision, with many papers being written on the topic [5]. Estimating hand pose can be seen as a particular type of pose estimation that makes use of the hand’s unique structure. In full body pose recognition one faces the problem of clothing and thus extra occlusions [5]. In the case of hand pose estimation we face a similarly tough problem: the high dimensionality of the space of movements (expressed as angles) that a hand’s joints can make [18], [12] or, more simply, the fact that the hand is a deformable object with many degrees of freedom [10].
Estimating hand pose has wide applicability. Fields where it is potentially useful range from human computer interaction [5] to automatic sign-language recognition [3] to virtual object manipulation [10] and robot learning from demonstration [25]. In recent years, especially, due to the increased demand for human computer interaction, hand gesture recognition has been explored as an intuitive way of interacting with computers [10].
Methods that attempt to determine hand pose make use of either gloves or markers, or rely just on the video frames and perform an analysis on the image [5].
Among the techniques that do not make the subject wear extraneous elements there are those that rely on depth data either from a Kinect sensor or multiple cameras [14], [12], [8], [21], [16], and those that use just a monocular image [24], [1], [10].
Heat data from an infrared camera has also been suggested as a useful source of information [5].
We are particularly interested in estimating hand poses from video, so it is important to note that while our method makes use of the estimation from the previous frame, as does [14], there are several systems that do not employ it. Some, however, do admit that their methods could be extended by employing temporal data [21].
In a sense systems try to replace the information from markers or gloves with that from depth data and/or temporal data.
Another way of classifying existing hand pose recognition systems is by whether or not they use a model of the 3D hand. Thus they can be divided into: 1) model based tracking and 2) single frame pose detection. Model based tracking means that the system has to maintain a simulated model of the human hand, which it tracks from frame to frame [5]. The second method can be thought of as being purely appearance based [16]. The former mostly suffers from the high dimensionality of the state space while the latter is more sensitive to noise [19]. Our system can be considered to be of the second type.
According to [15] a priori knowledge of human motion capture can be broken down into: kinematic structure, 3D shape, color appearance, pose and motion type.
In our system we make indirect use of the kinematic structure and direct use of the color appearance (in the tracker), the pose and the motion type.
One way in which our method stands out is the fact that objects in the scene that occlude the hand do not impede the estimation, but rather aid it. The majority of applications deal with the free hand [8]. The systems that consider the hand in isolation have compromised results when the hand is grasping an object [16]. [8] solves the problem of occlusion using depth data and by having a separate occlusion model. [16] solves the problem with data from multiple cameras.
The topic of this paper is a real-time, modular, cross-platform implementation of the system proposed in [19] and [18]. Apart from giving details on the actual implementation we will also try to go somewhat in depth regarding the theoretical motivation for the design chosen. The main contributions are:
• Implementation. We have modified the existing system by replacing out-of-date components and bringing it to a fully working state.
• Analysis. We have described the current functioning of the system and discussed the possibility of its expansion and its limitations.
The main output of our work is a hand pose estimator application. The executable file, compiled from code written in C++, makes use of a database of approximately 100000 synthetic images of grasping hands. It reads in approximately 20 frames per second either from a camera or from a stored video. For each frame it extracts the hand (if any) and finds the approximate nearest neighbours in the database. A temporal model, built in the previous time steps, is combined with the current guesses to weigh the nearest neighbours and thus employ temporal consistency.
The paper is organized into 4 main chapters. Chapter 2 presents the general
problem of hand tracking and hand pose estimation describing image features and
inference methods that are used in the literature. Chapter 3 gives an overview of
the system after the changes we made. Chapter 4 goes into more details about the
image database - a fundamental part of the system. Finally in Chapter 5 we make
a summary of the previous discussion and consider future work.
Chapter 2
Hand tracking and pose estimation
One of the fundamental problems in computer vision is tracking an object in motion [24]. The main application of tracking has been surveillance [15], other possible applications being traffic control and video editing. In our case, the tracker is a separate module. It is important to note that by tracking we mean obtaining a bounding box around the hand in the image. The image region inside the box is then processed and the pose is determined. The focus of our system is on determining the kinematic structure of the hand, taking as input the grayscale hand segmented from the image, and not on the tracking itself. For that reason we use a somewhat simple hand tracker.
However there are systems that put the focus on the tracking itself [28], [24].
We define pose estimation in our case as the process of estimating the configuration of the underlying kinematic or skeletal articulation structure of the hand.
2.1 Related work
Real-time object detection became possible after Viola and Jones’s paper on robust real-time object detection [26]. Their method of using a cascade of classifiers based on AdaBoost allows background regions to be quickly discarded, and the features they use are extremely quick to calculate. Unfortunately, applying cascade detection, which was so efficient in other applications [15], to hands is very difficult because of the many different poses that the human hand can take and the resulting variation in its appearance [24].
[15] defines tracking as the process of segmentation of the object of interest from the background. This is the way we use the term as well. They define different types of segmentation:
• Motion based segmentation - relies on the fact that what is moving in the image is the object of interest.
• Appearance based segmentation - one example of this is training a classifier
to recognize the object of interest. This is what happens in the case of face
detection. This approach can be extended to use temporal data.
• Shape based segmentation - this is useful if the shape of the object is very different from other objects in the background. Shape can be deduced from a variety of descriptors. One example would be extracting edges as templates and then using Chamfer matching.
• Depth based segmentation - this has been used more and more since the introduction of the Kinect sensor. This method makes background subtraction much easier [21].
In [24] they use a tree of templates of the hand ranging from extremely coarse to detailed ones to do the detection. They do not determine the actual angles of the joints but they manage to segment the hand almost perfectly from the background.
[16] uses a model of the hand with 26 degrees of freedom along with a model of the object being grasped. The visual cues used are the edge image obtained with a Canny edge detector and a skin color map. They then pose the problem of detection as a minimization problem which they solve through particle swarm optimization. The authors faced a problem that also affects our system: skin coloured pixels on the object surface affect the results.
[8] also uses depth data in conjunction with color data for tracking, but uses a model of the hand formed out of 6 parts - the palm and the 5 fingers. 6 individual trackers are used and then the entire hand is rebuilt.
A 17-bone skeletal model of the hand is fitted to a point cloud in [14]. The reason they cite for using point clouds instead of blobs is to make it less susceptible to errors from holes. Even in their case, where they use depth data, occlusions, pose ambiguity and camera limitations make it necessary to use temporal coherence.
An interesting approach, this time using markers, is presented in [28]. Their system is quite similar to ours - the hand is segmented out and then looked up in a database.
The difference is that a glove with a unique pattern is used to simplify the nearest neighbour search. Methods using markers focus on accuracy as opposed to ease of deployment. An advantage of using the glove is it reduces the ambiguity of the bare hand. For example the front and the back of the hand will now look very different because they have different patterns on them.
2.2 The hand tracking module
As mentioned earlier we use a comparatively simple hand tracker based on skin color segmentation. However, given that the hand tracker is a separate module, one could presumably replace it with something more sophisticated. Skin color segmentation is generally used as a pre-processing step for more complicated algorithms given that it is computationally efficient - linear in the number of image pixels. Generally skin pixels occupy a certain area of a color space and the aim is to use a color space where all skin colors lie in a compact region. While RGB and CMY color models are ideally suited for hardware implementation, humans think of colors in terms of hue, saturation and brightness. The HSI (hue, saturation and intensity) model decouples the intensity component from the color information [7].
After having converted the input image from RGB to HSI/HSV we create a binary image mask that is 1 where the color value lies inside a pre-specified cube and 0 where it does not. Afterwards we apply morphological closing to this mask. This ensures that any blobs that might have gotten separated will be in the same connected component. The resulting mask looks like figure 2.1. Notice that there are a large number of false positives in the image. We solve this problem by eliminating all connected components that have a small number of pixels.
Figure 2.1. Result of HSV thresholding and morphological closing.
This tracker performs well enough for our needs, though it made testing the system with a low-quality integrated laptop webcam quite difficult. We had false positives with skin coloured objects such as wood or leather and with shiny objects or surfaces.
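The thresholding and small-component removal described above can be sketched in a few lines. This is a minimal illustration in pure Python and not the code of our application: the function names, the HSV box limits and the minimum component size are assumptions chosen for the example, and the morphological closing step is omitted for brevity.

```python
import colorsys
from collections import deque


def skin_mask(rgb_image, lo, hi):
    """Return a binary mask that is 1 where the HSV value lies in the box [lo, hi].

    rgb_image is a list of rows of (r, g, b) tuples with values in 0..255.
    lo and hi are (h, s, v) triples in 0..1 (hypothetical thresholds).
    """
    mask = []
    for row in rgb_image:
        mask_row = []
        for r, g, b in row:
            hsv = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
            inside = all(l <= v <= h for v, l, h in zip(hsv, lo, hi))
            mask_row.append(1 if inside else 0)
        mask.append(mask_row)
    return mask


def remove_small_components(mask, min_pixels):
    """Zero out 4-connected components that have fewer than min_pixels pixels."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    out = [[0] * cols for _ in range(rows)]
    for r0 in range(rows):
        for c0 in range(cols):
            if mask[r0][c0] == 0 or seen[r0][c0]:
                continue
            # Breadth-first search collects one connected component.
            component, queue = [], deque([(r0, c0)])
            seen[r0][c0] = True
            while queue:
                r, c = queue.popleft()
                component.append((r, c))
                for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                    if (0 <= nr < rows and 0 <= nc < cols
                            and mask[nr][nc] and not seen[nr][nc]):
                        seen[nr][nc] = True
                        queue.append((nr, nc))
            if len(component) >= min_pixels:
                for r, c in component:
                    out[r][c] = 1
    return out
```

Both passes visit every pixel a constant number of times, which matches the linear cost mentioned above. In practice the box limits would have to be tuned to the camera and the lighting.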
2.3 Image descriptors for hand pose estimation
To recap, the high dimensionality of the hand representation and the self occlusion it is subject to are the main problems hand pose estimation faces. To counter this, a number of shape features that extract the relevant information from the image have been developed. At the same time these features discard the variations that are not correlated with the hand pose [25]. We first take a look at image gradients and then describe Histograms of Oriented Gradients, Hu moments, SIFT features and shape contexts. Finally we describe the reasons for choosing HOG features. Note that our system is modular so the features being used can be changed relatively easily.
2.3.1 Image gradients
Most image features are based on the image gradient. The gray scale image is
a function f (x, y), where x and y are the pixel coordinates and the value of the
function is the intensity at that pixel. Common values for f are either discrete values in the interval [0, 255] or continuous - limited by computer precision - in the interval [0, 1].
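Mathematically the gradient of a 2D function is a vector with two components: $\nabla f = \left(\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}\right)$. In the continuous case

$$\frac{\partial f(x,y)}{\partial x} = \lim_{\Delta x \to 0} \frac{f(x+\Delta x, y) - f(x,y)}{\Delta x}$$

and similarly

$$\frac{\partial f(x,y)}{\partial y} = \lim_{\Delta y \to 0} \frac{f(x, y+\Delta y) - f(x,y)}{\Delta y}.$$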
However, since the image is a discrete sampling we can only take differences at a one pixel distance. In most cases a symmetrical version is used by taking one pixel before and one pixel after the one at which we are calculating the gradient. Thus the 2 components of the discrete image gradient are

$$\frac{\partial f(x,y)}{\partial x} = \frac{f(x+1,y) - f(x-1,y)}{2}$$

and

$$\frac{\partial f(x,y)}{\partial y} = \frac{f(x,y+1) - f(x,y-1)}{2}.$$

The direction of the image gradient at a particular point is the direction of the greatest change in the image. At the same time the magnitude of the gradient represents the intensity of the change. Since the gradient is calculated using differences, a constant offset in illumination cancels out. One of the main uses of gradients is edge detection. Image gradients are the basis for the Canny edge detector - one of the state of the art algorithms.
Figure 2.2. From left to right: Original image; Gradient in horizontal direction ∆y; Gradient in vertical direction ∆x; Sum of the 2; Gradient image using OpenCV Sobel operator.
Usually the vertical and horizontal gradients are combined into an operator that is then convolved with the original image. Typical operators are the Sobel operator or the Laplacian operator. Usually such an operator is applied after the image has been blurred using a Gaussian operator to remove noise and insignificant edges.
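The symmetric difference described in this section can be illustrated with a short sketch. It is written in pure Python for readability and is not the OpenCV-based code used in our system; the function names and the choice to leave border pixels at zero are assumptions made for the example.

```python
import math


def central_gradient(img):
    """Return (gx, gy) computed with the symmetric difference (f(x+1) - f(x-1)) / 2.

    img is a list of rows of gray values. Border pixels are left at 0,
    which is an arbitrary choice for this sketch.
    """
    rows, cols = len(img), len(img[0])
    gx = [[0.0] * cols for _ in range(rows)]
    gy = [[0.0] * cols for _ in range(rows)]
    for y in range(1, rows - 1):
        for x in range(1, cols - 1):
            gx[y][x] = (img[y][x + 1] - img[y][x - 1]) / 2.0
            gy[y][x] = (img[y + 1][x] - img[y - 1][x]) / 2.0
    return gx, gy


def magnitude_and_direction(gx, gy):
    """Return per-pixel gradient magnitude and direction in degrees."""
    mag = [[math.hypot(a, b) for a, b in zip(rx, ry)] for rx, ry in zip(gx, gy)]
    ang = [[math.degrees(math.atan2(b, a)) for a, b in zip(rx, ry)]
           for rx, ry in zip(gx, gy)]
    return mag, ang
```

Adding the same constant to every pixel leaves both components unchanged, because the constant cancels in each difference.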
2.3.2 Histogram of Oriented Gradients
HOGs were first introduced in [4] where they were used for human detection. To compute the HOG for an image you must divide the image into cells. For each of the cells the gradient direction in each of the pixels is calculated. These values are then binned and a histogram of gradient directions is obtained. One can use either the gradient direction [0°, 360°] or the absolute direction [0°, 180°]. In their original paper the authors mention that the best performance was obtained using simple gradient calculation ($[1, 0, -1]$ and $[1, 0, -1]^T$) with no previous smoothing.
They also claim that this feature performs one order of magnitude better than other features in terms of false positives, but is slower than Viola’s integral image [26].
In our case the best performance was obtained for a division of the image into an 8x8 grid and 8 bins for each of the individual cells. We used the normalized gradient angle [0°, 180°] so each of the bins corresponds to a breadth of 22.5°. Thus our final feature vector is 8 x 8 x 8 = 512 dimensional.
Figure 2.3. Image of the hand and its HOG. The HOG is calculated for the rectangle containing the palm.
Figure 2.3 gives a visual representation of the HOG for the image of a hand. Notice there are 64 cells. In each of the cells the dominant bin is represented by a line in the corresponding direction. The length of the line is proportional to the number of pixels in the bin. Notice the visual similarity between the original image and the HOG image. Particularly notice the 45° line in the area corresponding to the thumb and the vertical lines corresponding to the extended fingers. In general the size of the cells and the histogram granularity affect the generalization capabilities of the feature. The finer the granularity the more descriptive the feature is, but the less it is able to generalize [19].
The authors of [4] use HOGs in conjunction with SVMs for human detection.
They find that HOGs significantly outperformed other features they had tried.
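To make the construction of the 512-dimensional vector concrete, here is a minimal sketch of an 8x8-cell, 8-bin HOG on unsigned angles. It is an illustration only and not the implementation used in our system: weighting each vote by the gradient magnitude and normalizing the whole vector to unit length are assumptions of this sketch, as is the function name.

```python
import math


def hog(img, grid=8, bins=8):
    """Return a grid*grid*bins histogram of unsigned gradient orientations.

    img is a list of rows of gray values. Votes are weighted by gradient
    magnitude and the final vector is L2-normalized (both are assumptions).
    """
    rows, cols = len(img), len(img[0])
    hist = [[[0.0] * bins for _ in range(grid)] for _ in range(grid)]
    bin_width = 180.0 / bins  # 22.5 degrees for 8 bins
    for y in range(1, rows - 1):
        for x in range(1, cols - 1):
            gx = img[y][x + 1] - img[y][x - 1]
            gy = img[y + 1][x] - img[y - 1][x]
            mag = math.hypot(gx, gy)
            if mag == 0:
                continue
            # Fold the direction into [0, 180) so opposite directions share a bin.
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            b = min(int(angle / bin_width), bins - 1)
            cy, cx = y * grid // rows, x * grid // cols
            hist[cy][cx][b] += mag
    features = [v for cell_row in hist for cell in cell_row for v in cell]
    norm = math.sqrt(sum(v * v for v in features))
    return [v / norm for v in features] if norm > 0 else features
```

For a 16x16 image containing a single vertical edge, every non-zero entry falls into the first orientation bin of the cells along the edge.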
2.3.3 Hu moments
In [9] Hu introduces 7 invariant descriptors for an image based on image moments.
These are independent of position, size and orientation and parallel projection [11].
This image descriptor was used for hand silhouette representation before 2005 [25].
As mentioned before we can consider the image as a 2-dimensional function. For a continuous 2D function $f$ the moments $m_{pq}$ are defined as

$$m_{pq} = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} x^p y^q f(x, y)\, dx\, dy.$$

The moment sequence $m_{pq}$ completely defines the function $f$. These moments are not invariant to translation, rotation and scaling. The central moments $\mu_{pq}$ are invariant to translation:

$$\mu_{pq} = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} (x - \bar{x})^p (y - \bar{y})^q f(x, y)\, dx\, dy$$

where $\bar{x} = \frac{m_{10}}{m_{00}}$ and $\bar{y} = \frac{m_{01}}{m_{00}}$. Invariance to scaling is obtained through normalization:

$$\eta_{pq} = \frac{\mu_{pq}}{\mu_{00}^{\gamma}}, \quad \text{where } \gamma = \frac{p + q + 2}{2}.$$

Finally the 7 Hu moments are obtained as algebraic expressions of the normalized central moments $\eta_{pq}$ with $2 \le p + q \le 3$. The exact expressions can be found in [9], [11] or through any search engine. As mentioned before these moments are calculated for continuous functions. Since images are discrete functions, a discrete approximation of the moments needs to be calculated. Thus while rotation, scaling and translational invariance are guaranteed for the continuous case, they are not guaranteed for the discrete case. However, using a high enough resolution image largely preserves these invariance properties [11].
Coming back to our problem, this representation only describes the contour of the hand and not its internal shape. According to [25] it might be susceptible to errors in segmenting the hand from the background.
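The chain from raw moments to normalized central moments can be illustrated with a discrete approximation in which the integrals become sums over pixels. The sketch below is only an illustration with hypothetical function names. It computes the first two Hu invariants, $\phi_1 = \eta_{20} + \eta_{02}$ and $\phi_2 = (\eta_{20} - \eta_{02})^2 + 4\eta_{11}^2$, and leaves out the remaining five.

```python
def raw_moment(img, p, q):
    """Discrete raw moment m_pq: the sum of x^p * y^q * f(x, y) over all pixels."""
    return sum((x ** p) * (y ** q) * v
               for y, row in enumerate(img) for x, v in enumerate(row))


def normalized_central_moment(img, p, q):
    """Discrete eta_pq = mu_pq / mu_00^gamma with gamma = (p + q + 2) / 2."""
    m00 = raw_moment(img, 0, 0)
    xbar = raw_moment(img, 1, 0) / m00
    ybar = raw_moment(img, 0, 1) / m00
    mu = sum(((x - xbar) ** p) * ((y - ybar) ** q) * v
             for y, row in enumerate(img) for x, v in enumerate(row))
    gamma = (p + q + 2) / 2.0
    return mu / (m00 ** gamma)  # mu_00 is equal to m_00


def hu_first_two(img):
    """Return the first two Hu invariants of a non-empty gray scale image."""
    n20 = normalized_central_moment(img, 2, 0)
    n02 = normalized_central_moment(img, 0, 2)
    n11 = normalized_central_moment(img, 1, 1)
    phi1 = n20 + n02
    phi2 = (n20 - n02) ** 2 + 4 * n11 ** 2
    return phi1, phi2
```

Shifting a blob inside the image by a whole number of pixels leaves both values unchanged up to floating point rounding, which is the translation invariance provided by the central moments.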
2.3.4 SIFT features
Scale invariant feature transform (SIFT) is an image feature descriptor developed by David Lowe in 1999 [13]. Broadly speaking it is based on finding keypoints, or points of interest, in training images. Each of the keypoints has a corresponding feature vector associated with it. The feature vectors for these points are then stored in a database. For a test image the keypoints and feature vectors are similarly calculated and the nearest neighbours, based on Euclidean distance, are retrieved from the database. A Hough transform is used such that different matching features from the database can vote for object pose. Usually an object is described by dozens of keypoints, but a match of as few as 3 such points is a strong signal of correct detection.
Figure 2.4 shows what the points of interest would be for one of the images in our database.
The points of interest are defined as the extreme points of the difference of Gaussians at multiple scales - the image is blurred with Gaussian kernels at different scales and the difference between images blurred at adjacent scales is taken.
The keypoint set obtained after this first step is further pruned such that only the stable points remain. Each keypoint location is assigned a gradient orientation.
After this step a 128-dimensional feature vector is computed. The values are based
on gradient orientation and magnitude in a 16x16 neighbourhood of the point of
interest. It is important to note that for the database lookup a version of K-D trees
is usually used instead of exact nearest neighbours, similar to what we are using.
Figure 2.4. Gray scale version of an image in the database with the SIFT keypoints superimposed.
[27] successfully uses SIFT features combined with AdaBoost to recognize the hand pose class, albeit with a small number of classes.
2.3.5 Shape contexts
Shape contexts were first suggested for detecting silhouette similarity in [2]. Their matching algorithm relies on taking a sample of pixel locations (in the original paper they give a rule of thumb of choosing approximately 100 points) chosen from the output of an edge detector. For each of these chosen points a shape context is then computed. A distance measure is constructed between two shape contexts.
Using this distance measure finding point correspondences between two silhouettes is reduced to finding an optimal bipartite graph matching.
Once the points have been chosen the shape context is calculated for each of the points. For one point, all the vectors that connect this point to the other $n - 1$ points are considered. These vectors are then binned (60 bins in the original paper). The bins are uniform in log-polar space. The resulting histogram is the shape context for that particular point. A slightly modified version of this representation has successfully been used and found to be discriminative of articulated pose [25].
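The binning step can be sketched as follows. This is a rough illustration and not the algorithm of [2] in full: splitting the 60 bins into 5 radial by 12 angular bins, normalizing distances by the mean distance to the reference point, and the radial limits `r_min` and `r_max` are all assumptions of this sketch.

```python
import math


def shape_context(points, index, r_bins=5, theta_bins=12, r_min=0.125, r_max=2.0):
    """Return the log-polar histogram of the vectors from points[index] to the others.

    points is a list of (x, y) tuples with at least two distinct locations.
    Distances are divided by their mean and clamped to [r_min, r_max].
    """
    px, py = points[index]
    offsets = [(x - px, y - py) for i, (x, y) in enumerate(points) if i != index]
    radii = [math.hypot(dx, dy) for dx, dy in offsets]
    mean_r = sum(radii) / len(radii)
    log_span = math.log(r_max / r_min)
    hist = [0] * (r_bins * theta_bins)
    for (dx, dy), r in zip(offsets, radii):
        r_norm = min(max(r / mean_r, r_min), r_max)
        # Radial bins are uniform in log(r); angular bins are uniform in theta.
        rb = min(int(math.log(r_norm / r_min) / log_span * r_bins), r_bins - 1)
        theta = math.atan2(dy, dx) % (2 * math.pi)
        tb = min(int(theta / (2 * math.pi) * theta_bins), theta_bins - 1)
        hist[rb * theta_bins + tb] += 1
    return hist
```

Because only differences between point locations enter the histogram, translating the whole silhouette does not change the descriptor.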
2.3.6 Evaluating image descriptors
[25] offers a framework for evaluating hand pose image descriptors. The 3 characteristics they look for in image descriptors are smoothness, discriminability and generativity. By smoothness they mean that small changes in the pose space lead to small shifts in the feature space. Similarly, small changes in the feature space should correspond to a small movement in the pose space. They use discriminability in the sense that a particular feature should map to an easily discriminable uni-modal distribution of poses. In the same way they use generativity in the sense that a particular pose should map to an easily discriminable uni-modal distribution of features. They find HOGs to be the best in terms of smoothness and generativity, but HOGs are surpassed by shape context features in terms of discriminability.
[18] suggests that desirable qualities of an image feature would be robustness to segmentation errors, sensitivity to non-textured areas and speed of computation.
Tests are run on different types of HOG features with different number of cells and bins and the 8x8x8 version proves to be best for our database.
2.4 Inferring hand pose
The feature vectors we now have are then input into statistical models from which the pose is inferred. In this section we will briefly present some of the models that can be used to infer hand pose. A more detailed presentation of these can be found in most machine learning books. We are going to describe k nearest neighbours (a type of non-parametric regression), SVM regression, and particle swarm optimization, since most of the authors dealing with hand pose estimation use one of these methods.
Aside from extracting shape information, developing the image descriptors from the previous section had the desirable side-effect of reducing the very high-dimensional image (a 400x400 pixel image is a point in a 160000-dimensional space) to a more manageable feature vector. The vectors from the previous section had in the order of hundreds of elements and most of the papers cited in the related work section used feature vectors of at most tens of thousands of elements. This reduction in size makes statistical/machine learning algorithms easily applicable using little computer power.
2.4.1 k Nearest Neighbours (kNN)
kNN is the simplest instance based learning method. It first finds the k closest known points to the test instance. Euclidean or Mahalanobis distance is customarily used. It then classifies the test instance by having the nearest neighbours vote on the class of the test instance. Thus, if among the top 5 nearest neighbours 1 is a negative example and 4 are positive then the new instance is classified as being positive. There are, of course, cases when you do not need to classify the new instance but the nearest neighbours are interesting in themselves.
Care must be taken when applying kNN. First of all the dimensions need to be normalized. If one of the dimensions spans the interval [0,1] while another spans the interval [0,100] then this second dimension dominates the distance measure, particularly when Euclidean distance is used. Another problem is the fact that this measure suffers from the curse of dimensionality. One way of expressing this is that in a high dimensional space almost all points are outliers. Another way of thinking about this is that out of 100 dimensions only 2 may be relevant, but all are used in the computation of the distance. One approach to alleviate this problem is to pre-process the data and reduce dimensionality via e.g. principal component analysis.
To obtain a continuous output value from this algorithm one usually obtains the result as an average of the top nearest neighbours. The top neighbours are generally weighted with their inverse distance to the new data point. The number of nearest neighbours to choose is an important parameter. It is usually chosen via cross-validation.
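The inverse-distance weighted average described above can be written down directly. The sketch below uses exact (brute force) search with Euclidean distance and hypothetical names; the small constant `eps`, which avoids a division by zero when the query coincides with a stored point, is an assumption of the example.

```python
import math


def knn_regress(train_x, train_y, query, k=3, eps=1e-12):
    """Return the inverse-distance weighted mean of the labels of the k nearest points.

    train_x is a list of feature vectors, train_y a list of label vectors.
    """
    # Brute force search: one distance per stored instance, O(N * D) in total.
    nearest = sorted((math.dist(x, query), i) for i, x in enumerate(train_x))[:k]
    weights = [1.0 / (d + eps) for d, _ in nearest]
    total = sum(weights)
    dim = len(train_y[0])
    return [sum(w * train_y[i][j] for w, (_, i) in zip(weights, nearest)) / total
            for j in range(dim)]
```

With two equidistant neighbours labelled 0 and 1 the estimate is 0.5, and a query that coincides with a stored point returns practically that point's label.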
Finding the nearest neighbours has a complexity of O(N ∗ D) where N is the number of the labelled instances and D is the number of dimensions. This makes the algorithm hard to apply on very large datasets with high dimensionality. In this case one has to do an approximate nearest neighbour search - that is find a neighbour that is a good guess instead of the actual nearest neighbour. Through a relatively lengthy pre-processing step query times can be significantly reduced.
One of the methods that are used for NN search is locality sensitive hashing. This is the same method that we are using. The decrease in accuracy is usually not too big and the improvement in speed is significant.
kNN has successfully been used for hand pose estimation in [1] and [28].
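To give an idea of how locality sensitive hashing trades accuracy for speed, here is a small sketch of one common variant, random hyperplane hashing. It is not the scheme used in our system: the hash family (which groups vectors by direction rather than by Euclidean distance), the single hash table, and all names and parameters are assumptions made for the illustration.

```python
import math
import random


def make_hasher(dim, n_bits, seed=0):
    """Return a function mapping a vector to a tuple of n_bits sign bits."""
    rng = random.Random(seed)
    planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]

    def hash_vector(v):
        # One bit per random hyperplane: which side of the plane v lies on.
        return tuple(1 if sum(p * x for p, x in zip(plane, v)) >= 0 else 0
                     for plane in planes)

    return hash_vector


def build_index(vectors, hash_vector):
    """Pre-processing step: group the indices of the vectors by hash key."""
    table = {}
    for i, v in enumerate(vectors):
        table.setdefault(hash_vector(v), []).append(i)
    return table


def lsh_query(table, vectors, hash_vector, q):
    """Return the index of the closest vector in q's bucket, or None if it is empty."""
    candidates = table.get(hash_vector(q), [])
    if not candidates:
        return None
    return min(candidates, key=lambda i: math.dist(vectors[i], q))
```

Only the vectors in the query's bucket are compared exactly, so the true nearest neighbour can be missed when it falls into a different bucket. Practical systems usually reduce this risk by using several hash tables.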
2.4.2 Support vector regression
Support vector machines are binary classifiers. If 2 classes are linearly separable the SVM finds the decision boundary as the hyperplane that maximizes the distance to each of the 2 classes. Expressed differently, SVMs are maximum margin classifiers, where the margin is defined as the distance to the closest positive and negative examples. The key to doing this is the set of support vectors. These are the training examples corresponding to the non-zero Lagrange multipliers when solving the maximization problem of the margin. In general there is a small number of support vectors. This proves advantageous because only these are used to label new instances, so that, after training, SVMs are very fast.
Several methods have been used for modifying the support vector algorithm to perform regression. One such method is described in [22] and follows the same steps as described above. Support vector regression represents a good baseline for comparing other algorithms. [19] uses it in this manner. When only a finite subset of the pose space is to be found the SVM classifier can be used. This is the case in [4] and [10].
2.4.3 Particle swarm optimization (PSO)
If the problem at hand can be expressed as a function optimization problem then
PSO is a good choice of heuristic optimization method. Unlike gradient based
optimization methods PSO can be applied to non-differentiable, non-continuous
functions. The method is inspired by the so-called swarm intelligence. It uses a set
of particles in the parameter space called a population. A particle is described by
a current position and a velocity along with the location of an individual minimal
value and the location of a population minimal value. At each time step the new velocity is calculated based on the previous velocity and the 2 locations mentioned before, along with random weightings. Through this method PSO manages to strike a balance between exploration and exploitation.
If $x_i^t$ is the $i$-th particle’s position at time $t$, $p_i^t$ is the position of its individual minimum point up to time $t$, $p_g$ is the population minimum up to time $t$ and $v_i^t$ is the particle’s current velocity, then

$$v_i^{t+1} = K\left(v_i^t + c_1 r_1 (p_i^t - x_i^t) + c_2 r_2 (p_g - x_i^t)\right)$$

$$x_i^{t+1} = x_i^t + v_i^{t+1}$$

$c_1$ is called the cognitive component, $c_2$ is called the social component and they are both parameters that are chosen at the start. $K$ is a decaying rate called a constriction factor. $r_1$ and $r_2$ are random variables uniformly distributed in the interval [0,1].
[16] manages to express the problem of hand pose recognition as an optimization problem. They try to minimize the difference between what the hand model predicts and what the input image shows. Thus they can apply PSO in a straightforward manner.
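The update rules of this section translate almost line by line into code. The sketch below is a generic illustration and not the optimizer of [16]: the default values K = 0.729 and c1 = c2 = 2.05 are commonly used constriction settings that we assume here, random numbers are drawn per dimension, and the test function and all names are hypothetical.

```python
import random


def pso(f, bounds, n_particles=20, n_steps=100, K=0.729, c1=2.05, c2=2.05, seed=0):
    """Minimize f over a box using particle swarm optimization with constriction.

    bounds is a list of (low, high) pairs, one per dimension.
    Returns the best position found and its function value.
    """
    rng = random.Random(seed)
    dim = len(bounds)
    xs = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vs = [[0.0] * dim for _ in range(n_particles)]
    pbest = [x[:] for x in xs]           # individual minimum positions
    pbest_val = [f(x) for x in xs]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]  # population minimum
    for _ in range(n_steps):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vs[i][d] = K * (vs[i][d]
                                + c1 * r1 * (pbest[i][d] - xs[i][d])
                                + c2 * r2 * (gbest[d] - xs[i][d]))
                xs[i][d] += vs[i][d]
            val = f(xs[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = xs[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = xs[i][:], val
    return gbest, gbest_val
```

A call such as `pso(lambda x: sum(v * v for v in x), [(-5.0, 5.0)] * 2)` searches for the minimum of a simple quadratic bowl. Note that f is only ever evaluated and never differentiated, which is why the method also applies to non-differentiable objectives.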
Chapter 3
Overview of the system
In this chapter we explain how the hand pose estimation is implemented. We start off by presenting the kinematic model, then we describe the probabilistic framework for hand pose estimation proposed in [19]. This is followed by a description of the implementation, a modular and extendible one which is the main contribution of this thesis, a quick overview of the performance and finally the organization of the code.
3.1 Kinematic hand model
27 bones make up the human hand. Of these 19 are in the actual palm and fingers, whereas the other 8 are located in the wrist. Figure 3.1 shows the bones that make up the human hand and the actual degrees of freedom for each of the joints. We can see that using this model of the hand would lead to a total of 27 degrees of freedom.
We use a simplified version of this kinematic model where we ignore the wrist and treat each of the fingers as having 5 degrees of freedom - 3 for the joint closest to the palm and 1 for each of the other 2 joints. This leads to a total of 25 degrees of freedom. As mentioned in the introduction this is a very high dimensional space to search through. In [24] they mention that by using principal component analysis they were able to reduce the dimensionality of the joint space to 7 dimensions with only a 5% loss of information. This is because of the existence of a large dependency between the joint angles in the positions that a human hand usually takes.
3.2 Probabilistic framework
This section describes the method presented in [19] and [18]. Throughout the section we will be talking about the representation of a hand in two spaces. The first space is the joint space. This is a 34-dimensional space: the first 9 dimensions represent the rotation matrix of the hand relative to the camera, and each of the other 25 dimensions corresponds to the sine of the angle of one of the joints. Thus each of the dimensions is in the domain [−1, 1], so no scaling is needed. The second space we are interested in is the HOG space. The image of a hand viewed from a certain angle gives rise to a 512-dimensional feature vector. What we wish to find is a mapping between the two spaces. We should mention early on that there is no one-to-one mapping from one space to the other. The mapping is in fact many-to-many. That is, a joint space representation corresponds to multiple points in the HOG space; this issue mostly arises from background noise. Conversely, a HOG can correspond to multiple points in the joint space. One reason is the projection from 3D to 2D, and the other is the ambiguous nature of the hand: for example, the back of the hand looks very much like the front when the fingers are extended, and when occlusions occur it is hard to figure out which finger is which. We try to overcome this problem by employing temporal consistency, that is, we use the results from previous time steps as a basis for the current time step.
Figure 3.1. The human hand skeleton and its kinematic model. Taken from [5]
To formalize the previous paragraph, at a time step $t$ let $x_t$ be the point in the joint space corresponding to the ground truth for the current image. Similarly, let $y_t$ be the point in the HOG space for the current image. We will assume that $p(x_t)$ is uniform and that the process is Markovian, meaning $x_t$ depends only on $x_{t-1}$. What we want to determine is $p(x_t \mid y_t, x_{t-1})$, the probability that we are in state $x_t$ given the observation $y_t$ and the previous state $x_{t-1}$. We, of course, do not know $x_{t-1}$, so we use our estimate of it instead. Up to a normalizing constant, $p(x_t \mid y_t, x_{t-1})$ can be expressed as $p(x_t \mid y_t) \, p(x_t \mid x_{t-1})$.
Figure 3.2 is a summary of the internal structure of the application. We can see how the image of the hand is segmented out using skin color and then converted to gray scale. The HOG of this image is then calculated, and its nearest neighbours in the HOG space are retrieved. The inverse distances from these neighbours to our point form the basis of the probability measure $p(x_t \mid y_t)$. For each $y_t^i$ returned we have a corresponding $x_t^i$ in the database. Similarly, we have a weight $w_t^i = p(x_t^i \mid y_t^i)$. In the final step these weights are updated, $w_t^{i*} = w_t^i \, p(x_t \mid x_{t-1})$ with the prior evaluated at $x_t = x_t^i$, and we choose as our "winning" pose the one with the highest weight. We do not discard the other candidates from this step, though. They are used to recompute $p(x_t \mid x_{t-1})$ for the next step.
Figure 3.2. Pose estimation flow. Taken from [19].
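The weighting and selection step can be sketched as follows. This is an illustration only, not the code from the repository: the Candidate struct, the function names and the eps constant are assumptions, and the temporal prior is passed in as an arbitrary callable.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// One retrieved database entry (hypothetical layout for this sketch).
struct Candidate {
    std::vector<double> pose;  // joint-space point of the database entry
    double hogDistance;        // distance to the query in HOG space
    double weight;             // filled in by the functions below
};

// Observation weights from inverse HOG distance, normalised to sum to 1.
// eps is an assumed constant that avoids division by zero on exact matches.
void setObservationWeights(std::vector<Candidate>& cands, double eps = 1e-9) {
    double sum = 0.0;
    for (Candidate& c : cands) {
        c.weight = 1.0 / (c.hogDistance + eps);
        sum += c.weight;
    }
    for (Candidate& c : cands) c.weight /= sum;
}

// Multiply each weight by the temporal prior evaluated at the candidate's
// pose and return the index of the candidate with the highest weight.
std::size_t applyPriorAndPickBest(
        std::vector<Candidate>& cands,
        const std::function<double(const std::vector<double>&)>& prior) {
    assert(!cands.empty());
    std::size_t best = 0;
    for (std::size_t i = 0; i < cands.size(); ++i) {
        cands[i].weight *= prior(cands[i].pose);
        if (cands[i].weight > cands[best].weight) best = i;
    }
    return best;
}
```

All reweighted candidates stay in the vector after the call, so they remain available for building the prior of the next time step.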
Two versions of the temporal filter have been tried. The first, simpler, version only considers the best candidate from the previous step. This corresponds to having a uni-modal Gaussian distribution centred at $x_{t-1}$ as the prior $p(x_t \mid x_{t-1})$.
The second version, which is depicted in the lower center part of figure 3.2, considers all the guesses from the previous time step. Thus we use kernel density estimation on the weighted pairs $(x_{t-1}^i, w_{t-1}^{i*})$. This method allows the application to recover when $x_{t-1}$ was an erroneous guess [19]. Even though using all previous guesses performs better than using just the best guess, both methods are susceptible to rapid hand movement, because the implicit assumption is that the current hand pose is similar to the one from the previous time step.
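As an illustration of the second variant, the sketch below evaluates an unnormalised kernel density estimate over the weighted candidates of the previous step. The isotropic Gaussian kernel, the bandwidth sigma and the names WeightedPose and kdePrior are assumptions made for the example; the thesis code may use a different kernel or metric. With a single previous candidate the function reduces to the uni-modal Gaussian prior of the first variant.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// A candidate kept from the previous time step (hypothetical layout).
struct WeightedPose {
    std::vector<double> pose;  // joint-space point from the previous step
    double weight;             // its updated weight from the previous step
};

// Unnormalised kernel density estimate of the temporal prior at pose x,
// using an isotropic Gaussian kernel of assumed bandwidth sigma.
double kdePrior(const std::vector<double>& x,
                const std::vector<WeightedPose>& previous,
                double sigma) {
    assert(sigma > 0.0);
    double density = 0.0;
    for (const WeightedPose& p : previous) {
        assert(p.pose.size() == x.size());
        double sq = 0.0;  // squared Euclidean distance in joint space
        for (std::size_t d = 0; d < x.size(); ++d) {
            const double diff = x[d] - p.pose[d];
            sq += diff * diff;
        }
        density += p.weight * std::exp(-sq / (2.0 * sigma * sigma));
    }
    return density;
}
```

Because every previous candidate contributes a bump to the density, a current pose near a lower-ranked previous guess still receives some prior mass, which is what allows recovery from a wrong best guess.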
Different versions of the nearest neighbours algorithm were also used. Using exact nearest neighbours would have been too computationally expensive, so locality sensitive hashing (LSH) was employed. LSH divides the high-dimensional feature vector into bands and calculates a hash value for each band. Each item in the database is then placed into a bucket per band according to these hash values. Locality sensitive hash functions are used for the hashing. This means that if two corresponding bands from two different feature vectors are close to each other, they have a higher probability of being put into the same bin. Thus, when querying the nearest neighbour with LSH, not all values are checked; instead, only the distances to the instances in the bins that the query item would be inserted into are computed [17].
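The banding idea can be sketched in a few lines of C++. The example below is a simplified illustration, not the LSH implementation used in the application or the scheme of [17]: as the per-band hash it quantises each component to a grid cell of width cellWidth, which is just one possible locality sensitive choice, and the class name BandedLsh and its parameters are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <set>
#include <unordered_map>
#include <vector>

class BandedLsh {
public:
    BandedLsh(std::size_t dim, std::size_t numBands, double cellWidth)
        : dim_(dim), bandLen_(numBands ? dim / numBands : 0),
          cellWidth_(cellWidth), tables_(numBands) {
        assert(numBands > 0 && dim % numBands == 0 && cellWidth > 0.0);
    }

    // Store the item id in one bucket per band.
    void insert(std::size_t id, const std::vector<float>& v) {
        for (std::size_t b = 0; b < tables_.size(); ++b)
            tables_[b][bandKey(v, b)].push_back(id);
    }

    // Ids sharing at least one band bucket with the query. Only these
    // need an exact distance computation afterwards.
    std::set<std::size_t> candidates(const std::vector<float>& q) const {
        std::set<std::size_t> out;
        for (std::size_t b = 0; b < tables_.size(); ++b) {
            auto it = tables_[b].find(bandKey(q, b));
            if (it != tables_[b].end())
                out.insert(it->second.begin(), it->second.end());
        }
        return out;
    }

private:
    // Quantise each component of band b to a grid cell and mix the cell
    // indices into a single 64-bit key (FNV-1a style mixing).
    std::uint64_t bandKey(const std::vector<float>& v, std::size_t b) const {
        assert(v.size() == dim_);
        std::uint64_t key = 0xcbf29ce484222325ULL;
        for (std::size_t d = b * bandLen_; d < (b + 1) * bandLen_; ++d) {
            const auto cell =
                static_cast<std::int64_t>(std::floor(v[d] / cellWidth_));
            key ^= static_cast<std::uint64_t>(cell);
            key *= 0x100000001b3ULL;
        }
        return key;
    }

    std::size_t dim_, bandLen_;
    double cellWidth_;
    std::vector<std::unordered_map<std::uint64_t, std::vector<std::size_t>>>
        tables_;
};
```

A query then computes exact distances only for the ids returned by candidates, instead of for the whole database.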
3.3 The application
Our application takes as input a video stream, either from a camera or from a file, and for each frame outputs the 4 best candidates from the database that match the detected hand. Although only the top 4 weighted nearest neighbours are shown, more (or fewer, but this will affect performance) can be used for the computation of the temporal filter. By default the top 16 nearest neighbours are retrieved from the database. Figure 3.3 shows a snapshot of the system. Snapshots can easily be taken because the application has stop/pause functionality.
Figure 3.3. Snapshot of the application.
The upper left portion of the application is the current frame being analysed. Notice that there is a yellow rectangle around the hand. That is the output of the hand tracker, and that sub-rectangle is the input to our actual detection code. The upper right portion of the application is a visual representation of the HOG (Histogram of Oriented Gradients) of the sub-rectangle. We will go into more detail about HOGs in a later chapter. For the moment it suffices to know that it is a 512-dimensional feature vector, and that this vector is what we use as the lookup key in the database. Finally, the bottom row represents the result: the 4 images from the database that correspond to the input image. Notice that all 4 are visually similar to the input. Also notice that some contain objects. The objects being grasped can, in fact, change in consecutive frames because they are not part of the image, only the hand is. The objects are actually "cut-outs" from the image.
Aside from the number of nearest neighbours being retrieved there are a few other parameters that can be tuned.
• The nearest neighbour algorithm being used. The options are lsh, flann or exact. As you might expect, they perform locality sensitive hashing, approximate nearest neighbours using the Fast Library for Approximate Nearest Neighbours, and exact nearest neighbours respectively.
• The number of elements in the database. This option is useful if not all 33 grasps (see next chapter) are of interest. One can then focus on fewer grasps.
• The image files being used and the corresponding HOGs. This is useful when testing different versions of the database.
3.4 Performance
3.4.1 Resource usage
On a standard Intel i5 2.5 GHz CPU our application runs at approximately 20 fps for a 640 x 480 image. If exact nearest neighbours are used, the frame rate drops to 15 fps. Retrieving the nearest neighbours with LSH is actually 5 times faster than with exact nearest neighbours. However, the per-frame times are dominated by loading the images from the file system and retrieving the ground truth from the SQLite database. Loading 106920 HOGs into memory takes up about 218 MB (512 dimensions x 4 bytes x 106920 HOGs). When using LSH the index takes up another 8 MB. The 106920 images take up 50.5 GB on disk because the PNG format is used. If a more space-friendly format were used, the size could drop significantly. Archiving the images leads to a comparatively meagre 2.8 GB file. The SQLite database takes up 159 MB.
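As a quick check of the HOG memory figure, assuming 4-byte single-precision values and decimal megabytes:

$$512 \times 4\,\mathrm{B} \times 106920 = 218\,972\,160\,\mathrm{B} \approx 219\,\mathrm{MB} \approx 208.8\,\mathrm{MiB}$$

This agrees with the roughly 218 MB quoted above.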
[18] reported a frame rate of 10 fps while using a more powerful processor. We believe the reason for the difference is that our test machine had an SSD: since the largest amount of time was spent retrieving the images from the file system, improving the storage hardware had a much larger impact than any algorithmic change could have had.
3.4.2 Detection performance
In the initial paper [19] the system is shown to have good detection and generalization properties for different types of hand and for different objects, whether they are in the database or not. Similarly good detection results can be seen in figures 3.4, 3.5, 3.6 and 3.7.
The application manages to extract the hand pose and generalizes the hand pose over different objects being grasped. Visually, the application manages to correctly detect the pose even of an empty hand. Note that the videos from which the snapshots were taken were filmed with a laptop's integrated camera. Better results could probably be obtained with a more professional filming device.
Figure 3.4. Hand holding scissors. Best 3 weighted nearest neighbours are visually similar. 4th is not.
Figure 3.5. Hand holding mobile phone. No small rectangle is present in the database. The objects being grasped differ in the result because they are not part of the HOG calculation.
Many of the errors in detection stem from the fact that the distance between two poses is not an exact measure of visual similarity. The temporal filter favours poses that are close in pose space. However, a pose that differs in every dimension by a small amount can be further away from a default pose than one that differs by a large amount in only one dimension.
As mentioned before, the hand tracker is fairly simple, so we sometimes have problems detecting the hand. What usually happens is that with reflective or skin-coloured objects the tracker outputs a larger area than it should, thus making the HOGs meaningless. This has the added negative effect of throwing off the temporal consistency model. An example of this is presented in figure 3.8.
Figure 3.6. Hand holding a bottle of water.
Figure 3.7. Empty hand is mapped to 4 visually similar grasping poses. There are no empty hands in the database, but even if there were we could expect objects to appear in the result unless the absence of the object made the HOG more similar.
3.5 The code
The code is organized into 4 main folders: scene, src, include and apps. The scene folder contains the code for generating the image database, discussed in the next chapter. It also contains binary files containing the HOGs of all the images in the database: hog.bin, flann.bin and lsh.bin. Depending on the nearest neighbour algorithm chosen, one of these gets loaded into memory when the program starts. If exact nearest neighbour is not used, the next step is to index these files for faster lookup. This loading into memory and indexing is what makes our system real time. The folder also contains a hands.db SQLite database, which holds all information except the HOG for each of the images: ground truth joints, rotation, type of grasp and next poses. Arguably this could also have been loaded into memory on start up, but we decided to leave it separate for future tests with a larger database that could not fit into memory. Finally, the folder also contains an images sub-folder which contains the actual images. These are used only for display purposes when showing the results. The include and src folders contain the .h and .cpp library files respectively. The apps folder contains several executables.
Figure 3.8. Because the metallic part of the scissors is shiny the hand tracker module outputs a larger bounding box. Thus the returned images are totally different from the actual hand pose.
The main executable is handTracker. See the code for a list of parameters that can be passed to it. The most important one is offline, which determines whether the program takes its input from the camera or from a video. The default video, if no other one is specified, is scene/video1.
Below, in figure 3.9 is part of the main function of the application.
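...
// Temporal model: holds the multi-modal pose distribution of the previous step.
CPoselistMulti pList(
lDBPath,lTemporalSmoothing,lDimMetric,lMetricPath);
// Nearest neighbour back end, initialized from user input parameters.
nn<float> *nnptr;
...
// Couples the nearest neighbour search with the temporal model.
ProcessFeat<CPoselistMulti,float> procFeat(nnptr,&pList);
// HOG descriptor, handed on through the generic Feature interface.
Hog<float> hog;
Feature<float> *feat=&hog;
// Pose estimator: the actual entry point of the application.
CPoseEstimator<CPoselistMulti,float> tracker(
feat,&procFeat,lOffline,lVideo, lGUI);
tracker.Run();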
Figure 3.9. Excerpt from main function.
pList contains the current model of the multi-modal distribution from the previous time step. It needs the database path to retrieve the ground truth about the poses. If the second parameter is set to false then temporal smoothing is not used at all. The 3rd and 4th arguments are for debug purposes.
nnptr is a pointer to an object that retrieves the nearest neighbours. It is initialized based on user input parameters.
procFeat coordinates the exchange of information between the nearest neighbour algorithm and the current temporal model.
hog is an object that calculates the HOG for an image. Notice that feat, a general feature calculating object, is what is passed to the actual pose estimator. This means that any class implementing the Feature interface could easily substitute the histogram of oriented gradients.
Finally, the actual entry point to the application is through a CPoseEstimator object. This object takes in the feature object, the feature processor, a parameter that tells it whether to use the attached camera or a video, and a parameter that tells it whether to use a GUI (running without the GUI is mostly useful for debugging).
You can pick up a fresh copy of the code at https://github.com/damamihai/cvap-thesis and you can download the files described in the previous paragraph from https://drive.google.com/#folders/0BxFGx9ZPHuPlUlhkLTlXZ0Jlczg.
The dependencies needed to set up the code are:
• Boost library. We currently use boost 1.49, but the code should be compatible with future versions. It is mostly used for the simplicity of dealing with the file system.
• Ogre rendering engine. This item is actually optional if all you want to do is use the system without generating your own database. Used version is 1.85.
• GNU Scientific Library. Provides tested implementations of many mathematical operations.
• OpenCV. Used for image capturing as well as operations on images such as thresholding and segmentation. The version used was 2.3.1.
• SQLite. We use this single file based transactional database for storing ground truth about hand poses.
The needed components have readily available versions for Mac OS and Ubuntu. Installing on Windows is a little less straightforward, as the GNU Scientific Library needs to be compiled from source. Installing the code and all its dependencies on a fresh installation of Ubuntu took under 10 minutes.
The new code relies significantly on the initial implementation of Javier Romero
and would best be described as a re-factored and up-to-date version of his code.
Chapter 4
Generating the image database
The first question that comes to mind when thinking about what images our database should contain is whether we want real images of hands or synthetic images. Generating synthetic images with software such as Poser [23] is much easier than collecting real images. Another important reason is that generating the images in this way gives us the ground truth in terms of the angles of the joints. However, we do not use the joints directly and we do not enforce the angles in our detection algorithm. [15] also confirms that motion capture databases are being used to provide a mapping between image sequences and the 3D poses to be used in pose reconstruction.
Since the set of possible outputs is governed by the composition of the database, the performance of the application can be no better than the poses represented in it.
In this sense we can consider the image database to be the most important part of the system. We will go more in depth into the matter in section 4.2. Finally another important parameter is the size of the database. When the database is large we have a large representational power, but we are more prone to errors. Conversely, if the database is small we can’t represent all hand grasps, but we have a smaller error when we recognize a pose that is in the database. A compromise has to be made between robustness and accuracy.
4.1 Rendering the images
In the introduction we said that the method described in [19] does not employ a
model of the hand for pose estimation. This is true when we refer to the actual
estimation code. However, since we decided on synthetic training data we need a
3D model to generate it. The actual generation went through a few changes, but we
finally decided to use a slightly modified version of LibHand [20]. According to the
home page’s description LibHand is an open-source permissively licensed portable
library for rendering and recognizing articulations of human hand. We made two
minor changes to the library. The first was removing the static dependency on OGRE, which had made it necessary to have multiple versions of OGRE installed on one computer if other rendering was also done. The second change was introducing the ability to add meshes of objects from files. The need for this will become apparent in the next section.
The rendering of the images is quite a lengthy process. On our test computer it took approximately 18 hours. To avoid repeating this when installing on other computers, we also offer an archive of the pre-rendered images.
4.2 Random poses versus grasps
As mentioned before, though the hand pose space would normally have 26 DOF, most combinations of angles rarely, if ever, occur in the day-to-day use of the hand. Rather than deal with the problem of pose recognition for a free hand, we decided to solve a better-constrained problem, that of estimating the pose of a grasping hand. This is the argument for our database containing a large sample of hand grasps. It is important to note that, although we rendered the hand images in a different way, they are the same as the ones in [19], apart from a different hand mesh being used.
According to [6], a grasp is every static hand posture with which an object can be held securely in one hand. This definition thus excludes intrinsic movements, gravity-dependent grasps and two-handed grasps. The final taxonomy that they decide on is the same one we are using and can be seen in figure 4.1.
Figure 4.1. Comprehensive grasp taxonomy with 33 grasps. Taken from [6].
There are a total of 33 grasps. On the one hand they are divided into Precision, Intermediate and Power tasks. In an orthogonal dimension they are divided into Adduction and Abduction. Adduction means that the thumb is not touching the object, while abduction means it is. Finally, Palm, Pad and Side are the different types of opposition, that is, what is on the opposite side of the thumb.
Despite the fact that in the use-cases we mentioned earlier, apart from automatic sign language recognition, the hand is interacting with an object, many of the studies until now have focused on the free hand [24], [28]. Our approach is to use the contextual information that an object provides to aid in recognition. The hand and the object are modelled together encoding the correlations between object shape and hand pose in a non-parametric way [18].
As mentioned in the introduction one way to classify current hand estimation systems was model based (generative) versus appearance based (discriminative).
While the generative model has the potential to be more accurate it is also far more computationally expensive. On top of that when using a model of the hand one must also keep a model of the objects the hand is interacting with thus adding further complexity. Figure 4.2 summarizes the argument for a database of images of grasps.
The top part of the figure suggests that a model of both the hand and the object can be maintained. The middle and bottom parts suggest that instead of this explicit model we keep a mapping between the angle space and the image space. We will see when we talk about the actual image descriptor that this mapping is non-linear and non-unique. The middle part shows that when the database does not contain objects and we try to estimate the pose from the image, we get a wrong result corresponding to the image in the database that is most visually similar to the input image. Finally, the bottom example shows that if a similar grasp is present in the database then the discriminative approach should pick it up.
For each of the 33 grasps we generate 5 steps of the grasp. The first one is a hand in a neutral position with an object next to it, while the last one is the image of the actual grasp. The 3 intermediary steps are obtained as a linear combination between the first and final pose (linear combination for each of the angles in the joints). Finally each of these poses is rendered from 648 angles. Thus the total number of images in the database is 33 * 5 * 648 = 106920.
Figure 4.3 shows the 5 steps we generate for one of the 33 grasps from one particular angle. Each of the 5 3D poses of the hand is also viewed from another 647 angles chosen randomly on a sphere around the hand. The same is true for all the 33 grasps.
Figure 4.2. Examples of recognition performance in the case of interaction with objects. Top: Generative approach. Middle: Discriminative approach with no objects in database. Bottom: Discriminative approach with objects in database. Taken from [18].
Figure 4.3. The 5 steps of a grasp captured from one particular angle.
Chapter 5
Conclusions
5.1 Summary
In this work we analysed and re-implemented the hand pose estimation application described in [19] and [18]. The problem was formulated as a retrieval problem from a database of synthetically generated images, using HOGs as image descriptors and employing a multi-modal temporal model.
The resulting application achieves real-time performance on a standard computer and generalizes over a range of hand poses. We make use of common, easily available libraries, which makes the system easy to install. The modular nature of the application makes it extensible.
5.2 Future work
While the performance of the current system is reasonable both in terms of detection and run time a few things could be done to improve it.
5.2.1 Extending the database
One thing that could be done is to use a larger database. This might not have a significant effect on the number of grasps that can be detected since the taxonomy is already comprehensive. It could instead produce more precise results.
Each of the poses is already in the database from 648 angles, which we consider to be a sufficient amount. Extending the database would benefit more from a more granular evolution of the grasp - recall that we only use 5 steps now - and from more diverse angles in the grasps - currently we have default values for each of the 33 fully contracted grasps.
To expand the second idea above, recall that the current poses in the database are a linear combination of the neutral pose and the fully contracted grasp. For grasp $g$, with $x_f$ the pose of the neutral hand and $x_g$ the pose of the fully contracted grasp:
$$x_{g_i} = \frac{(4 - i)\, x_f + i\, x_g}{4}, \quad i = 0, \ldots, 4$$
Adding noise of up to $10^\circ$ to each of the angles in the joints of either $x_f$, $x_g$ or both would make a more diverse database. This way of increasing the database maintains the focus on detecting grasps. One can also consider adding random poses to expand the applicability of the system.
A 2x increase in the size of the database would be fairly straightforward and the frame rate should only drop slightly. However, a 10x increase in database size would pose some challenges. First of all, the images would take 10x more time to generate, which is approximately one week. However, this is a one-time process, so it is not a big impediment. Another effect of the 10x increase in database size would be the total size of the images. Keeping the PNG format would lead to a 500 GB images folder. However, 10:1 compression ratios with JPEG or GIF are not uncommon, so we could expect this problem to be solved.
A more pressing problem would be the memory usage of the HOGs of the images.
These would take up approximately 2.2 GB of RAM. This is likely to be a problem on computers that have a small amount of RAM. If the memory is not enough and swap space is used instead this will severely affect the speed of the retrieval of nearest neighbours whose quick computation is the basis for the real time frame rate of our application.
Two things could be done to reduce the gravity of this problem. First of all, there are images in the database that can never be detected in real life. The clearest example is those taken as though the camera was at the elbow pointing towards the hand, as in figure 5.1. However, removing these will probably not lead to a significant reduction in database size. Another idea would be to load only one type of grasp in its entirety into memory at a time, together with only some representative images of the other grasps. When the grasp changes, most of the images representing the previous grasp are removed from memory and the ones corresponding to the current grasp are loaded.
Increasing the database by a factor of 100 or more would be infeasible with current hardware.
5.2.2 Other possible improvements
Currently the returned estimated hand poses are images from the database. Since we already use a rendering engine to generate these images one could imagine making use of it for the result. Thus the returned image could be a weighted average of the top best weighted nearest neighbours. The problem with this idea is that there are cases when one of the top results returned is not visually similar to the input pose.
In such a case the resulting average pose might be far off from the actual pose.
As mentioned before, we use a relatively simple hand tracker. One way of improving the system could be replacing this module with something more advanced. Finally, as mentioned in the related work section, others have made use of depth or infra-red data to improve segmentation.
Figure 5.1. An angle of viewing the hand that is unlikely to occur in real life.