
DEGREE PROJECT IN ELECTRICAL ENGINEERING,
SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2017

Color-based Human Hand Segmentation Based on Smart Classification of Dynamic Environments

Color-based Human Hand Segmentation Based on Smart Classification of Dynamic Environments

Qihui Wang

qihui@kth.se

Supervisor: Shahrouz Yousefi
Examiner: Markus Flierl


Acronyms and Abbreviations

AR Augmented Reality
C-H Calinski-Harabasz
D-B Davies-Bouldin
FN False Negative
FP False Positive
GMM Gaussian Mixture Model
HSV Hue, Saturation, Value color space
LAB CIE 1976 L*a*b* color space
RGB Red, Green, Blue color space
SDK Software Development Kit
SKN A hybrid color space for skin detection
SVM Support Vector Machine
TN True Negative
TP True Positive
VR Virtual Reality


Abstract

Color is an effective and widely used feature for hand detection. To deal with problematic situations such as the diversity of hand colors and variations in background and lighting conditions, a multi-classifier supervised learning approach is proposed that uses the color information of each pixel. Training images are first clustered into dozens of groups based on their global color histograms, and a linear SVM classifier is trained independently within each group. In this way, each group can represent a specific chrominance or luminance situation, and each classifier is optimized for a specific segmentation task.


Sammanfattning

Color information is an effective method for hand detection. To deal with varying hand colors, background variations and lighting conditions, a supervised learning algorithm is used to classify hands using the color information of each pixel. Training images are clustered into dozens of groups based on their global color histograms, and an SVM is then trained separately for each cluster. In this way, each group can represent a specific chrominance or luminance situation, and each classifier is optimized for a specific segmentation task.


Acknowledgment

I would like to thank my supervisor Shahrouz Yousefi and co-supervisor Bo Li at ManoMotion, who offered great help and support with patience from beginning to end. It would have been difficult for me to enter this new field of gesture recognition without their guidance and a clear plan. I also want to thank Manu, Jean-Paul and Thang for the discussions and their valuable suggestions, and all the other ManoMotion colleagues for the great time I spent there.

Finally, I want to thank my examiner Markus Flierl for his kind help and patience.


Contents

1 Introduction
  1.1 Motivation
  1.2 Problem statement
  1.3 Research Questions
  1.4 Contribution
    1.4.1 Multiple classifiers
    1.4.2 Accurate hand segmentation dataset
    1.4.3 Multiple color spaces
    1.4.4 Pixel-level SVM
  1.5 Organization

2 Background
  2.1 Hand gesture recognition techniques
  2.2 Algorithm based methods
    2.2.1 Global appearance based recognition
    2.2.2 Motion based detection
    2.2.3 Local appearance based segmentation
  2.3 Color Spaces
    2.3.1 RGB
    2.3.2 YCbCr
    2.3.3 HSV
    2.3.4 LAB
  2.4 Hybrid color space
    2.4.1 SKN
  2.5 Summary

3 Methodology
  3.1 Methodology overview
  3.2 Clustering
  3.3 Supervised learning

4 Experiments and results
  4.1 Dataset
    4.1.1 Dataset collection
    4.1.2 Mask accuracy
    4.1.3 Manual labeling
    4.1.4 Synthetic image dataset
    4.1.5 Natural image dataset
  4.2 Clustering
  4.3 Training feature quantization
  4.4 Experiments on the synthetic dataset
    4.4.1 One color space for clustering and training
    4.4.2 Multiple color spaces for training
    4.4.3 Clustering number k
    4.4.4 Group classification results
    4.4.5 Multiple color spaces for clustering
    4.4.6 nRnG HS AB CbCr
    4.4.7 Comparison
    4.4.8 Conclusion

5 Evaluation
  5.1 8-fold cross validation
  5.2 Comparison
    5.2.1 SKN
    5.2.2 32-bin HSV
  5.3 Clustering measure color map
  5.4 F score color map
    5.4.1 RGB LAB YCbCr
    5.4.2 LAB
    5.4.3 HSV
    5.4.4 LAB & RGB
    5.4.5 Best solution
    5.4.6 Classification results

6 Discussion
  6.1 Drawbacks of the synthetic dataset
  6.2 Clustering
  6.3 Supervised learning

7 Conclusion and future work
  7.1 Conclusion
  7.2 Future work
    7.2.1 Skin dataset

Chapter 1

Introduction

1.1 Motivation

This thesis project was carried out at ManoMotion, a hand gesture recognition company. Hand gesture recognition is nowadays applied in a wide range of AR and VR fields for natural interaction, especially in gaming and entertainment. It can replace traditional physical media such as the mouse, keyboard, controller and touchscreen, and offers an innovative way of human-machine interaction. This improvement in contactless control can bring a revolutionary update to a broad range of scenarios, from large industrial equipment to small personal electronic products.

Although accurate hand segmentation is not mandatory for every recognition task, a good segmentation result is an important step towards successful gesture recognition in many existing methods. There is currently a variety of algorithm based methods for gesture recognition, but they are either not accurate enough in segmentation and recognition, or too complicated for real-time applications. In this thesis project, a multi-classifier approach based on color information is explored for real-time pixel-level hand segmentation.

1.2 Problem statement

Color is the most widely used feature for skin detection because of its effectiveness and efficiency. Its simple implementation makes it attractive and competitive for real-time detection. Moreover, it is robust to rotation, partial occlusion and pose change. However, there are three main challenging situations for color based skin segmentation:

1. Large variation in hand skin color.

2. Background colors that are very similar to the hand skin color.

3. Diverse lighting conditions or shadows.


Generally, color information is used for segmentation through a pre-trained color distribution model of hand and background. Test data is evaluated by the model, and the resulting probability provides a clue for classification. Hence, the more variety there is in the sample data, the more difficult it is to build an accurate model that works for all of it. In practice, however, such variation is inevitable. Not only does skin color differ between people; even the color of the same hand can look very different in images taken under different lighting conditions and background surroundings. Besides, regardless of this variation, classification can be very challenging for the model whenever the background color is very close to the hand color.

1.3 Research Questions

To deal with the problems above, a method based on multiple classifiers is proposed in this project to realize hand segmentation at pixel-level accuracy. This thesis mainly addresses two questions:

1. Given the dataset, how can an optimal clustering solution be found so that all the images within the same cluster are similar to each other?

2. Within each clustered group, what features can be used to train a linear model so that hand and background pixels can be clearly separated?

The training images are first clustered into several groups based on their global histogram similarity in a certain color space. In this way, each group can represent one type of lighting condition or color distribution. Assuming that feature distributions are similar for similar images clustered together, hand and background can be easily separated by a simple model within each group. Thus a linear SVM model is trained individually in each group using the color features of each pixel. When testing, the test image is first indexed to its closest group based on the same clustering histogram, and the pre-trained SVM model of this group is then used for segmentation. Since each group only represents one specific type of color characteristic or lighting condition, the model can be more accurate for segmentation.

1.4 Contribution

Compared to existing work, the contributions of this project are as follows:

1.4.1 Multiple classifiers


In this project, different linear SVM classifiers are trained independently not only for different illuminations, but also for images with different hand and background colors. Each customized classifier is used directly for classification only on a specific type of image. Hence, the multiple-classifier approach can provide more accurate models than a uniform model. Besides, the optimal clustering solution for the dataset is explored in different color spaces with different parameter settings and clustering algorithms.

1.4.2 Accurate hand segmentation dataset

Since no publicly available dataset was found to be suitable for the purpose of this thesis, 200 hand images with varying lighting conditions and diverse types of background were manually labeled using Photoshop to test the proposed method. The hand and background masks are at pixel-level accuracy. This dataset can be used for further research on pixel-level hand segmentation in diverse settings.

1.4.3 Multiple color spaces

Most existing clustering or classifying methods are based on a single color space, or on a hybrid color space made up of several specific elements or combinations of elements from different color spaces. Few methods have used multiple color spaces to construct features. In this project, different combinations of the four popular color spaces RGB, HSV, YCbCr and LAB are tested for feature construction, and most of them achieve better performance than a single color space. Test results show that constructing features from all four color spaces RGB, HSV, YCbCr and LAB together leads to the best segmentation results among all the tested combinations.

1.4.4 Pixel-level SVM

Most previous work uses SVM for gesture recognition rather than segmentation [5], so each feature vector for the SVM is usually made up of information from the whole image. In this project, each feature vector is constructed from the color information of a single pixel. Such a construction is very effective for pixel-level hand segmentation. The computational cost is greatly reduced by using the efficient linear SVM library LIBLINEAR. Besides, a group of similar training images also makes the classifier easier and faster to train.

1.5 Organization

The rest of the thesis is organized as follows.

Chapter 2 provides an overview of gesture recognition techniques and reviews related work on algorithm based hand segmentation methods.

Chapter 3 describes the methodology of the proposed method in detail. It also explains the considerations behind the clustering solution and motivates the choice of the supervised learning method.

Chapter 4 compares the performance of different clustering and classifying solutions and presents the experimental results on the synthetic dataset.

Chapter 5 evaluates the classification results of the proposed methods on the natural image dataset using F score color maps.


Chapter 2

Background

2.1 Hand gesture recognition techniques

There are mainly two approaches to hand gesture recognition: contact based and vision based methods [18]. Contact based methods usually recognize gestures with the help of wearable devices like gloves or wrist bands embedded with sensors to detect and capture hand gestures, so they rely on physical interaction between the users and the devices. Although contact based methods are currently more accurate and stable for recognition tasks, the costly devices and the potential health issues brought up by wearables [23] mean that the user-friendly vision based methods have more potential in the near future.

Based on differences in feature extraction, vision based approaches can be divided into appearance based methods and 3D model based methods [10]. The former usually models the 2D visual appearance of hands using features extracted from training images with computer vision algorithms and machine learning techniques, and compares the trained model parameters with test image features to classify whether an input test sample is a hand and what type of gesture it is. 3D model based methods instead perform the classification based on 3D kinematic models of hands; this is the most popular approach now. See Fig. 2.1 for the relationship between the methods. In practice, companies using 3D model based methods for hand gesture recognition generally provide a hardware based solution, using a small gadget equipped with a binocular camera or depth camera to capture the 3D model of hands, like LeapMotion (https://www.leapmotion.com) and Microsoft HoloLens.

However, the computationally expensive 3D models are quite limited in real-time applications [21], especially on mobile devices with restricted computing capacity. Besides, requiring an extra device largely reduces the user experience and brings inconvenience. Hence, in contrast to hardware based solutions like wearable devices or 3D models, the relatively simple and low-cost algorithm based solutions are gradually getting more attention and becoming an important approach to hand gesture recognition.



Figure 2.1: Hand gesture recognition methods overview

2.2 Algorithm based methods

There is a large variety of scenarios in hand gesture recognition applications, and thus a corresponding diversity in recognition methods.

2.2.1 Global appearance based recognition

Since the ultimate goal is gesture recognition, some methods take the global information into account and leave out the hand segmentation step, or roughly segment hands based on simple thresholding, and then directly build models for gesture recognition. This is especially common when recognizing dynamic gestures, which are made up of a series of image frames. A recent work has shown remarkable results in gesture recognition against complex backgrounds [22]: the hand is first detected using color features in the HSI and YCbCr color spaces, then the gesture is recognized using shape and texture features classified by an SVM.

Two other major approaches use statistical models like HMM (Hidden Markov Model) or deep learning based methods like CNN (Convolutional Neural Networks) [21]. Since these methods perform recognition by modeling the hand with 2D or 3D templates and comparing the templates with test samples, they are usually limited to a small number of gestures due to the variation of gestures and the limits of templates.

2.2.2 Motion based detection


suitable when the camera is fixed, such as when mounted in an automobile to detect the driver's gestures. Some other adaptive segmentation methods exploit the correlation among frames and use the decisions from previous frames as cues for the segmentation in the current frame [28].

2.2.3 Local appearance based segmentation

In general, good hand segmentation is essential to the recognition task, especially when recognizing static gestures in a single frame.

Color information is frequently used for feature construction in local appearance based methods [13]. It has proved to be a simple but effective cue for skin detection and segmentation [26], and it is robust to rotation and shape variation. Color spaces other than RGB, such as HSV, YCbCr or LAB, have proved helpful in different situations for skin detection.

Sometimes new color spaces are constructed especially for a specific recognition or segmentation task. For example, Cheddad et al. [4] proposed a new color space for skin detection based on a nonlinear transform from the RGB color space, which outperforms both YCbCr and nRGB alone, especially under varying lighting conditions. In a recent work on recognizing an eye disease from medical images, Holly et al. [32] obtained better results using a hybrid color space built by taking only the best performing channel consecutively from the LAB, RGB and I1I2I3 color spaces.

In practice, thresholding is a classical technique for segmentation based on color information, especially when the system prefers simple implementation and fast computation [1]. However, such models can easily overfit their specific training dataset and do not generalize. For example, a thresholding model for skin segmentation was built on 200 faces with different colors and lighting conditions in a hybrid YUV & RGB color space [1], but its good performance was limited to that test dataset.

Some traditional approaches are based on statistical appearance models such as GMM [13] or Bayesian networks [25][17]. However, a Gaussian mixture model generally requires a consistent hand color distribution within an image, so it does not work well when the hand color changes due to shadow or varying lighting conditions in one image [34]. Besides, too many Gaussian components increase the computational complexity, which is not optimal for real-time applications, especially on mobile devices.

Non-parametric models such as histogram based modeling are also popular [33]. Although the 1D histogram has been widely used for feature extraction, Tan et al. [29] experimented with 2D color histograms in different color spaces and obtained better results than previous methods using 1D histograms. Li et al. [16] used a 3D HSV histogram together with HOG gradient information to form their virtual probe feature and achieved the best performance on the majority of their test datasets.

2.3 Color Spaces


analysis is one of the easiest and most popular ways to detect hands in images, because for a given color space the distribution of hand pixels generally falls into certain regions. If the cluster is compact and a clear separation between hand and non-hand pixel clusters exists in the given color space, the hand can easily be segmented from the whole image [14]. The following is a brief introduction to the four most popular color spaces used for skin detection.

2.3.1 RGB

The RGB color space is one of the most widely used color spaces for displaying and processing digital images. Using RGB for segmentation saves the time of format conversion. Despite its convenience and advantages, it has disadvantages such as mixing chrominance and intensity information, device dependence and perceptual non-uniformity [31].

2.3.2 YCbCr

The YCbCr color space was originally designed to handle video information in color television transmission systems. YCbCr is a luminance based color space and has a clear separation of chrominance and luminance components. The transformation from RGB to YCbCr is linear and efficient.

2.3.3 HSV

HSV is a hue based color space. It describes colors with intuitive values based on human vision, and there is a clear separation of chrominance and luminance information, represented by H and V respectively. The transformation between RGB and HSV is nonlinear.

2.3.4 LAB

LAB also separates luminance and chrominance information: L stands for lightness, and A and B represent the color components. It includes all perceivable colors. In contrast to RGB, LAB is a perceptually uniform color space, which means that a given amount of change in any direction of a color channel always produces roughly the same amount of visually perceptible change. It is also device independent, meaning that colors are defined independently of the device they are displayed on, so it is often used as an interchange color space between different devices. The transformation from RGB to LAB is nonlinear and somewhat computationally expensive.

2.4 Hybrid color space

2.4.1 SKN


Each of the three components S, K and N is a fixed rational transform of R, G and B, combining a linear combination of the channels with a normalization by the sum R + G + B and an affine offset; the published coefficients are 0.088, 58.89, 30.014, 11.952 and 7.24 for S, 2.122, 14.859, 6.921, 0.62 and 1.744 for K, and 0.342, 3.698, 2.25, 0.103 and 0.464 for N.

It outperforms 7 existing popular color spaces on skin detection when trained with 3 different classifiers, including a polynomial kernel SVM. The SKN color space is used as a comparison for constructing the clustering histogram and the training features of the proposed solution in this report.

2.5 Summary


Chapter 3

Methodology

3.1 Methodology overview

One of the main challenges for hand segmentation is the large variety in background color and lighting conditions, which makes it difficult to find a uniform criterion that separates hand and background across diverse images. Hence, the proposed approach is to first cluster the images into several groups and then train an optimized model separately within each group. In this way, a more accurate segmentation model can be built within each group than without clustering.

The training process consists of two parts, clustering and supervised learning, as shown in Fig. 3.1. For clustering, images are clustered into k groups based on their global color histograms. The global histogram is the histogram of the whole image rather than of a part of it, so images with similar color information are clustered into the same group. Then linear SVM (Support Vector Machine) supervised learning is applied individually within each cluster to train a linear model for hand pixel classification in that specific chrominance or luminance situation. The histogram center of each cluster and its corresponding linear SVM model are saved for the testing part.

Figure 3.1: Training process of the proposed method

The testing process is shown in Fig. 3.2. The global histogram of the test image is extracted in the same color space as in training. The histogram is compared with the center of each cluster using the cosine distance to find the closest cluster. Then the image is classified by the pre-trained linear SVM model of that cluster. Since this classifier was trained on images very similar to the test image, it is likely to perform well on this specific type of image. Indexing is thus a crucial step of the proposed method: bad indexing leads the test image to a wrong model, which in most cases is unable to classify a test image from another group.

Figure 3.2: Testing process of the proposed method
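To make the two processes concrete, the sketch below outlines them in Python, with scikit-learn standing in for the MATLAB k-means and LIBLINEAR tools used in this project; the function names, the bin and cluster counts, and the data layout are illustrative assumptions rather than the exact implementation.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.svm import LinearSVC

    def global_histogram(img, bins=4):
        # 1D concatenated histogram: `bins` counts per channel, values in [0, 1].
        return np.concatenate([np.histogram(img[..., c], bins=bins, range=(0, 1))[0]
                               for c in range(img.shape[-1])]).astype(float)

    def train(images, masks, k=14, bins=4):
        hists = np.array([global_histogram(im, bins) for im in images])
        hists /= np.linalg.norm(hists, axis=1, keepdims=True)  # unit norm: Euclidean ~ cosine
        km = KMeans(n_clusters=k, n_init=20).fit(hists)
        models = []
        for g in range(k):
            members = np.flatnonzero(km.labels_ == g)
            X = np.vstack([images[i].reshape(-1, 3) for i in members])   # per-pixel colors
            y = np.concatenate([masks[i].reshape(-1) for i in members])  # 1 = hand, 0 = background
            models.append(LinearSVC(C=1.0, tol=1e-4, dual=False).fit(X, y))
        return km, models

    def segment(img, km, models, bins=4):
        h = global_histogram(img, bins)
        h /= np.linalg.norm(h)
        g = km.predict(h[None, :])[0]                   # index to the closest cluster
        labels = models[g].predict(img.reshape(-1, 3))  # pre-trained model of that cluster
        return labels.reshape(img.shape[:2])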

3.2 Clustering

3.2.1 Background classification

The purpose of clustering is to put images with similar backgrounds into the same group, so that within each group an optimal model can be trained to separate hand and background pixels as well as possible. This relies on the assumption that images in the same group, having similar backgrounds, also tend to have similar hand and background distributions in the feature space used for segmentation. The histogram used for clustering and the features used for supervised learning are not necessarily in the same color spaces.

A global histogram, which includes the information of all pixels in the image, is used for clustering and indexing, because it is unknown beforehand which part of a test image is background or hand. Since the hand usually only takes up 10%-20% of the whole image, the clustering result is mainly determined by the background information.

To get a satisfying clustering result, there are three main tasks:

1. Find a suitable clustering algorithm.

2. Find the optimal parameters for the model.

3. Find a suitable measure to evaluate the clustering results.

Previous studies show that no clustering algorithm performs well in all situations, so several algorithms should be selected and tested to find the one that best fits the data [2]. Besides, clustering algorithms like k-means or GMM (Gaussian Mixture Model) cannot determine the optimal number of clusters or the dimension of the feature representation, so these parameters should also be tested to find the optimal combination with the selected algorithm. When the global histogram is used as the feature, the clustering result is determined by the choice of feature dimension, the color space, the number of clusters k and the number of bins b of the histogram. Moreover, when clustering many images with diverse backgrounds into different groups, there are no known class labels with which to evaluate the clustering accuracy, so an internal validity measure representing the intrinsic properties of the images should be used to evaluate the compactness and separation of the clusters.

3.2.2 Clustering algorithm

3.2.2.1 k-means

k-means is a popular clustering method because of its efficient implementation and good clustering results. The clustering starts from k randomly selected centroids, and a locally optimal solution is reached by iteration. As a result, each data point is assigned to the group of its closest centroid. k-means can be applied with several different distance measures, such as squared Euclidean distance, absolute difference and cosine distance.

The performance of k-means relies largely on its starting points: a local minimum may be reached even when a better clustering solution exists. In order to avoid bad clustering results caused by the random initialization, the initial clusters are randomly selected multiple times to find the optimal clustering solution when applying k-means, with the number of replicates proportional to the number of clusters. Due to the random initialization, the optimal solution and the evaluation measure values may not be exactly the same every time k-means is applied, even on the same dataset. However, when the number of replicates is large enough, the variation of the optimal solution is limited to a range small enough to be neglected.
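In scikit-learn terms this restart scheme corresponds to the n_init parameter of KMeans, which internally keeps the run with the lowest within-cluster sum of distances; a minimal sketch, assuming hists holds the global histograms and that the number of replicates scales with k:

    from sklearn.cluster import KMeans

    k = 14
    # Repeat k-means from many random initializations; only the best run is kept.
    km = KMeans(n_clusters=k, init="random", n_init=10 * k).fit(hists)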

3.2.2.2 GMM

GMM is also a popular method for clustering. It assumes that the overall data distribution follows a Gaussian mixture model in which each cluster can be modeled by a Gaussian distribution, and the model parameters are estimated from the sample data. Each image is assigned to the cluster whose Gaussian distribution it fits with the highest probability.


Distance measure

Euclidean distance

Squared Euclidean distance is a common measure of the distance between two feature points. The squared Euclidean distance between a feature point (row vector) x and its cluster centroid (row vector) c is

d(x, c) = (x - c)(x - c)^T

Cosine distance

The cosine distance between a feature point x and its cluster centroid c is

d(x, c) = 1 - xc^T / sqrt((xx^T)(cc^T))

It measures the cosine of the angle between the two feature vectors: the distance is 0 when the two vectors have the same orientation, and it grows as the vectors become more dissimilar. Cosine distance is most commonly used in high-dimensional positive spaces, where the outcome ranges in [0, 1]. It is very efficient to calculate, especially for sparse vectors, so it is a good choice for evaluating the similarity between the clustering histograms.
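As a quick check of the definition (the toy histogram values below are made up):

    import numpy as np

    def cosine_distance(x, c):
        # 1 minus the cosine of the angle between x and c; 0 for identical orientation.
        return 1.0 - (x @ c) / np.sqrt((x @ x) * (c @ c))

    h1 = np.array([8.0, 2.0, 1.0, 0.0])   # toy 4-bin histograms
    h2 = np.array([7.0, 3.0, 0.0, 1.0])
    print(cosine_distance(h1, h2))        # small value: similar orientation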

3.2.3 Model parameters

3.2.3.1 Histogram

The 1D histogram is commonly used for clustering. It is made up by concatenating the histogram counts of each color channel; for example, a 1D 4-bin histogram of an RGB image is a 1 x 12 vector. The number of bins per color channel is an important parameter that influences both the clustering results and the processing speed. In practice, all color values are first normalized to [0, 1] before the histogram is calculated.

2D or 3D histograms are also used for clustering. The size of the feature vector increases exponentially: a 2D 4-bin histogram is a 1 x 16 vector and a 3D 4-bin histogram a 1 x 64 vector, so the vector can become very long but sparse as the number of bins or histogram dimensions increases. Multi-dimensional histograms can be an alternative option when 1D histograms are difficult to cluster.
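As a sketch of how the 1D and 3D variants differ in size (assuming a float image whose channel values are already normalized to [0, 1]):

    import numpy as np

    def hist_1d(img, b=4):
        # Concatenated per-channel counts: 3*b values (a 1 x 12 vector for b = 4).
        return np.concatenate([np.histogram(img[..., c], bins=b, range=(0, 1))[0]
                               for c in range(3)])

    def hist_3d(img, b=4):
        # Joint histogram over all three channels: b**3 values (1 x 64 for b = 4).
        h, _ = np.histogramdd(img.reshape(-1, 3), bins=(b, b, b), range=[(0, 1)] * 3)
        return h.ravel()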

3.2.3.2 Color space

The choice of color space might have an impact on the clustering results, depending on the dataset, because a change in some color or luminance component can be pronounced in one color space while barely visible in others. This thesis focuses on experimenting with the color spaces discussed in Chapter 2.

3.2.3.3 Number of clusters k


within each group. Over-clustering can lead to overfitting in the subsequent training, while a too coarse clustering is not good enough for building an effective segmentation model within each group. The optimal k largely depends on the characteristics of the clustering dataset.

3.2.4 Evaluation measure

An intuitive way to evaluate a clustering result is the within-cluster sum of distances, i.e., the summed distance of every data point to its center within each cluster. However, this value alone cannot effectively measure the clustering result, because it does not describe the separation among the clusters, and its optimum is reached when every data point forms its own cluster, which defeats the purpose of clustering [30].

Many cluster validity indices are available [2], and as with clustering algorithms, there is no single evaluation index that performs well in all situations. Hence, for this project, the best way to find a suitable clustering is to choose several algorithms, parameters and evaluation methods, try different combinations of them, and visually inspect the clustering results. The following three popular measures are tested.

3.2.4.1 Silhouette value

The silhouette value [24] measures the similarity of each data point to the points in its own group compared to the points in other groups. The silhouette value of a feature point p is

s = (b - a) / max(a, b)

where a is the average distance between point p and the other points in its cluster, and b is the minimum, taken over the other clusters, of the average distance between point p and the points in that cluster. The silhouette value hence ranges in [-1, 1], where 1 represents an optimal clustering solution and -1 the opposite. An overall high silhouette value, i.e., the average silhouette value over all data points, generally suggests a good clustering solution.

3.2.4.2 Calinski-Harabasz value

The C-H (Calinski-Harabasz) criterion [3] is the ratio of the overall between-cluster variance to the overall within-cluster variance. For k clusters and N observations, the C-H value is defined as

CH = [ Σ_{i=1}^{k} n_i ||m_i - m||² / (k - 1) ] / [ Σ_{i=1}^{k} Σ_{x ∈ c_i} ||x - m_i||² / (N - k) ]

where m_i is the centroid of cluster i, m is the overall mean, n_i is the number of observations in cluster i, and c_i is the set of points in cluster i. A larger C-H value hence indicates a better clustering result.

3.2.4.3 Davies-Bouldin value

The D-B (Davies-Bouldin) criterion [6] also represents a within-cluster to between-cluster ratio, but it is measured in distances rather than variances, as the C-H criterion is. For k clusters, the Davies-Bouldin criterion is defined as

DB = (1/k) Σ_{i=1}^{k} max_{j≠i} (d_i + d_j) / d_{i,j}

where d_i is the average distance between the points in cluster i and its centroid, and d_{i,j} is the Euclidean distance between the centroids of clusters i and j. Hence a smaller D-B value represents a better clustering result.
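All three indices have scikit-learn counterparts, which makes it cheap to sweep over candidate cluster counts; a minimal sketch, assuming hists holds the clustering histograms:

    from sklearn.cluster import KMeans
    from sklearn.metrics import (silhouette_score, calinski_harabasz_score,
                                 davies_bouldin_score)

    for k in range(5, 26):
        labels = KMeans(n_clusters=k, n_init=20).fit_predict(hists)
        print(k,
              silhouette_score(hists, labels),         # higher is better
              calinski_harabasz_score(hists, labels),  # higher is better
              davies_bouldin_score(hists, labels))     # lower is better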

3.3 Supervised learning

3.3.1 LIBLINEAR

After the training images are clustered into groups, an SVM model is trained within each group using the color information of each pixel. Since the proposed method assumes a clear separation of hand and background within each group, a simple classifier like a binary linear SVM is a suitable choice for the task.

SVM is a popular supervised learning method that classifies by finding the line, hyperplane or nonlinear kernel surface with the largest margin between the classes. Since the ultimate goal is binary classification at pixel level, the input test data for the pre-trained SVM model is the feature of each pixel. However, taking the color values of every pixel from all the training images leads to a huge training set even for a small number of images: there are 76,800 pixels in a training image with resolution 240 x 320, so for 10 training images the training matrix would be a 768,000 x 3 array. This is extremely time consuming for a non-linear kernel SVM. Since the distribution of hand and background pixels within each group should not be difficult to separate, a linear SVM is used for training in this project.

LIBLINEAR [7] is an open source linear SVM library optimized for large-scale linear classification. It is very efficient for large-scale problems: it only takes several seconds to finish a training run that can take a few hours with the MATLAB SVM or other linear SVM libraries.

3.3.1.1 Dual L2-loss solver

The default solver in LIBLINEAR is the dual L2-loss support vector classification solver [12]. It has a simpler implementation than, but performance comparable to, many other linear solvers. For a sample of labeled data (x_i, y_i), i = 1, ..., l, x_i ∈ R^n, y_i ∈ {-1, +1}, the dual L2-loss SVM optimizes the following dual problem:

min_α (1/2) α^T Q̄ α - e^T α,   subject to 0 ≤ α_i, i = 1, ..., l

where x_i is a column vector, e is an all-one vector, and Q̄ = Q + D with Q_{ij} = y_i y_j x_i^T x_j and D a diagonal matrix with D_{ii} = 1/(2C).

3.3.1.2 Primal L2-loss solver

A classical linear SVM solver is the primal L2-loss support vector classification based on Newton-type methods. It optimizes the following unconstrained problem:

min_w (1/2) w^T w + C Σ_{i=1}^{l} L(w; x_i, y_i)

where w is a weight column vector and L(w; x_i, y_i) = max(1 - y_i w^T x_i, 0)² is the L2 loss function of the optimization.

3.3.1.3 Parameter setting

For a linear SVM, there is only one parameter to tune, C, which represents the trade-off cost between margin and misclassification error. A larger C imposes a stricter separation between the classes and therefore requires a longer training time, while a C close to 0 gives little penalty to misclassification. Since the performance of the LIBLINEAR solvers is not very sensitive to the value of C, the default value 1 is used in the experiments. The tolerance of the iteration stopping criterion is set to 0.0001, instead of the default value 0.1, to obtain a more accurate optimization.

The dual solver can be extremely fast in some situations, and it is especially suitable for data with many more features than instances. The training data in this project is the opposite case, with many more instances than features, and during testing the dual solver takes several times longer than the primal solver with similar performance. Hence, the primal L2-loss support vector solver is used in this project.
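In scikit-learn, whose LinearSVC wraps LIBLINEAR, the corresponding configuration would look roughly as follows; dual=False selects the primal squared-hinge (L2-loss) solver, and X, y stand for the per-pixel features and labels:

    from sklearn.svm import LinearSVC

    # Primal L2-loss solver, C = 1, tighter stopping tolerance than the default.
    clf = LinearSVC(loss="squared_hinge", dual=False, C=1.0, tol=1e-4)
    clf.fit(X, y)  # X: n_pixels x n_features, y: +1 (hand) / -1 (background)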

3.3.2 Training feature

3.3.2.1 Quantization

After normalizing the pixel color values to the range [0, 1], the normalized values can be used directly to construct the features for training. Alternatively, the normalized values can be further quantized to reduce the variety in the sample data when constructing the features, so that a clearer pattern can be observed by the model.
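A minimal sketch of this quantization (the step sizes 0.1 and 0.01 are the ones tested later, in Section 4.3):

    import numpy as np

    def quantize(values, step=0.1):
        # Snap normalized values in [0, 1] to the nearest multiple of `step`.
        return np.round(values / step) * step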

3.3.2.2 Multiple color spaces


in the corresponding bin; the same holds for the cyan part, which represents the background pixels.

The first row shows the simple cases while the second row shows the difficult cases. The slight intersection of the cyan and magenta parts in the first row mainly comes from noise in the image and only involves a small number of pixels, which is negligible. The concentrated parts (red and blue) of the two distributions are clearly separable in the LAB color space alone. In the second row, however, the two distributions mostly overlap and mix together, and their concentrated parts are very close to each other, which makes them difficult to separate with a linear model in the LAB color space.

Figure 3.3: Example images of easy classification (first two) and difficult classification (latter two)

Figure 3.4: 3D histogram of hand (magenta) and background (cyan) distribution in LAB color space of the example images


while difficult for the others. An intuitive solution is to add more color spaces to the classifying feature, so that the bad performance of one color space can be complemented by the good performance of another. Thus, different combinations of the RGB, YCbCr, HSV and LAB color spaces are used to construct the training features in the following experiments.
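A sketch of such a per-pixel, multi-color-space feature with OpenCV; the fixed per-channel scales assume 8-bit input (OpenCV stores hue in [0, 179]):

    import cv2
    import numpy as np

    def pixel_features(bgr):
        # Stack per-pixel values from four color spaces into an (H*W) x 12 matrix.
        planes = [bgr,
                  cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV),
                  cv2.cvtColor(bgr, cv2.COLOR_BGR2YCrCb),
                  cv2.cvtColor(bgr, cv2.COLOR_BGR2LAB)]
        X = np.hstack([p.reshape(-1, 3).astype(np.float32) for p in planes])
        scale = np.array([255, 255, 255,    # B, G, R
                          179, 255, 255,    # H, S, V
                          255, 255, 255,    # Y, Cr, Cb
                          255, 255, 255],   # L, A, B
                         dtype=np.float32)
        return X / scale                    # every column normalized to [0, 1]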

3.3.2.3 nRnG HS AB CbCr

In order to see the effect of the luminance component on the clustering and classification results, the clustering histograms and training features can also be constructed from normalized R, normalized G, HS from HSV, CbCr from YCbCr, and AB from LAB, to test whether removing the luminance component improves the clustering performance. RGB is replaced by normalized R and normalized G, which is said to remove the variation produced by different surface orientations under the same light source [11].
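Normalized R and G are taken here to be the usual chromaticity coordinates; a sketch under that assumption:

    import numpy as np

    def normalized_rg(rgb):
        # Chromaticity coordinates: nR = R/(R+G+B), nG = G/(R+G+B).
        rgb = rgb.reshape(-1, 3).astype(np.float32)
        s = rgb.sum(axis=1, keepdims=True) + 1e-6  # guard against division by zero
        return rgb[:, :2] / s                      # columns: nR, nG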

3.3.3 Evaluation measure

The performance of each SVM model is evaluated by four values: error rate, precision, recall and F score. The error rate is the percentage of pixels that are misclassified, either as hand or as background, so it is an overall evaluation of the proportion of misclassifications:

error rate = (FP + FN) / (TP + TN + FP + FN)

Precision is the percentage of correctly classified hand pixels out of all pixels classified as hand; it reveals how many of the returned classifications are relevant:

precision = TP / (TP + FP)

Recall is the percentage of hand pixels that are correctly classified as hand; it reveals how many of the relevant pixels are returned:

recall = TP / (TP + FN)

The F score is the harmonic mean of precision and recall, defined as

F = 2 · precision · recall / (precision + recall)
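Computed directly from binary masks, the four measures might look as follows (pred and truth are assumed to be boolean arrays in which True means hand):

    import numpy as np

    def evaluate(pred, truth):
        tp = np.sum(pred & truth)        # hand pixels classified as hand
        tn = np.sum(~pred & ~truth)      # background classified as background
        fp = np.sum(pred & ~truth)       # background classified as hand
        fn = np.sum(~pred & truth)       # hand classified as background
        error_rate = (fp + fn) / (tp + tn + fp + fn)
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f_score = 2 * precision * recall / (precision + recall)
        return error_rate, precision, recall, f_score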

Chapter 4

Experiments and results

4.1 Dataset

The quality of the dataset is important for the training model and the testing results. More than 25 publicly available hand gesture datasets [21] were examined, but none of them was suitable for this project, which requires images of different hands against a variety of backgrounds, with hand and background labels at pixel-level accuracy. Most of the datasets were collected for gesture recognition; either the provided masks are not accurate enough, or there is a lack of diversity in the backgrounds. Hence it was necessary to collect a dataset for this project.

4.1.1 Dataset collection

The initial idea was to obtain the masks of hand and background with the help of a depth camera. Images were collected using a Lenovo Phab 2 Pro with Tango, shown in Fig. 4.1. From top to bottom, it has a 16 megapixel RGB camera, a depth camera and a motion tracking camera.

Figure 4.1: Lenovo Phab 2 Pro With Tango

Source: Lenovo. http://www3.lenovo.com/gb/en/tango/

An example RGB image is shown in Fig. 4.2a and its binary depth image in Fig. 4.2b. The binary image generated by the depth camera is used for labeling hand and background pixels: white pixels represent hand and black pixels represent background. All the RGB images were downscaled to size 240 x 320 and aligned to their corresponding depth images due to the resolution limit of the depth camera.

(a) RGB image from the RGB camera (b) Binary image from the depth camera (c) Comparison of the two images

Figure 4.2: An example of the collected images

4.1.2 Mask accuracy

4.1.2.1 Mismatching

Since the RGB camera and the depth camera have different resolutions, frame rates and optical centers (due to the distance between the two cameras, as shown in Fig. 4.1), some mismatches are inevitable when aligning the binary and the RGB images, as shown in Fig. 4.2c. Besides, the cut between the wrist (which should be classified as hand) and the sleeve (which should be classified as background) in the binary image is randomly generated using Particle Swarm Optimization, because they are always at the same depth and impossible to distinguish using depth information alone. This means that in most cases there is a mismatch between the cut in a binary image and the cut in its RGB image, as shown in Fig. 4.2b and 4.2a.

4.1.2.2 Preprocessing

In practice, some noise is produced when generating the binary images using the depth sensor. In order to reduce the influence of this noise, the binary images are preprocessed before features are extracted from the hand or background region. The hand part is eroded when used for selecting hand pixels, to avoid the noise near the hand border. For the same reason, the hand part is dilated when used for selecting background pixels (see Fig. 4.3).


(a) Eroded hand for extracting hand pixels (b) Dilated hand for extracting background pixels

Figure 4.3: Preprocessing the binary images to improve segmentation accuracy
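With OpenCV this preprocessing is a pair of morphological operations; a minimal sketch, where binary is the depth-camera mask and the kernel size is an assumption:

    import cv2
    import numpy as np

    kernel = np.ones((5, 5), np.uint8)          # structuring element (assumed size)
    hand_area = cv2.erode(binary, kernel)       # shrunk hand: safe hand pixels
    bg_area = cv2.dilate(binary, kernel) == 0   # grown hand, inverted: safe background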

In order to avoid the error introduced by mismatching near the wrist, when taking background samples, the random cut between wrist and sleeve is extended across the dilated binary image and only the pixels above the line are taken as background; the rest are discarded even if they might be background pixels, as shown in Fig. 4.4a. Likewise, when taking hand samples, a line parallel to the cut is drawn through the center point of the palm region in the eroded binary image, and only the pixels above the line are taken as hand, as shown in Fig. 4.4b.

(a) Mask for selecting background (b) Mask for selecting hand

Figure 4.4: Mask (white part) for selecting hand and background

4.1.2.3 Outcome


(a) Selected hand (b) Selected background

Figure 4.5: Selected hand and background region

However, this is not an isolated case. In some worse cases, shown in Fig. 4.6, part of the hand is missing due to bad alignment between the RGB image and its binary image (Fig. 4.6a). Sometimes the sleeve is taken as hand pixels due to a wrong cut (Fig. 4.6b). The huge mismatch in Fig. 4.6c is due to the different frame rates of the two cameras; this effect is obvious when the hand does not remain static long enough while the images are taken.

(a) Bad selection of hand region (b) Bad selection of hand region (with sleeve) (c) Bad selection of background

Figure 4.6: Examples of bad hand or background selection

Unfortunately, only a small number of images have satisfactory error rates in both hand and background pixel selection. Even in the few cases without serious errors, shown in Fig. 4.7, pixels at the boundary of the hand are always excluded by the erosion, yet these pixels are also important for training the segmentation model. In conclusion, a more accurately labeled database of hands with diverse backgrounds had to be built for this project.

(a) Original RGB image (b) Selected hand region (c) Selected background


4.1.3 Manual labeling

In order to get more accurate masks, the ground truth is manually labeled using the software Pixelmator and Photoshop. 200 images with diverse backgrounds and lighting conditions were selected out of the 1600 images, and the dataset includes the hands of 4 people. Due to the limited brush accuracy in the software, the original RGB images are first upsampled to 5 times their original size using the nearest neighbor method before the contour is drawn. In this way, drawing is easier and the error is smaller. The process is as follows:

1. Upsample the RGB image to 5 times its size using the nearest neighbor algorithm.

2. Draw the boundary along the hand.

3. Fill the hand part with red color.

4. Downsample the red channel of the image to the original scale based on the mode pixel value (a sketch of this step follows the list).

5. Generate a binary mask from the downsampled image, with 1 representing hand and 0 representing background.
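Step 4, the mode-based downsampling, can be sketched as follows for the 5x-upsampled channel (scipy.stats.mode is one way to take the most frequent value per block):

    import numpy as np
    from scipy import stats

    def downsample_by_mode(channel, factor=5):
        # Reduce each factor x factor block to its most frequent pixel value.
        h, w = channel.shape
        c = channel[:h - h % factor, :w - w % factor]   # crop to block multiples
        blocks = (c.reshape(h // factor, factor, w // factor, factor)
                   .transpose(0, 2, 1, 3)
                   .reshape(-1, factor * factor))
        modes = stats.mode(blocks, axis=1).mode         # per-block mode
        return np.asarray(modes).reshape(h // factor, w // factor)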

An example labeled image and extraction results are shown in Fig. 4.8.

(a) Original RGB image (b) Binary mask

(c) Selected hand (d) Selected background

Figure 4.8: An example of manually labeled image

4.1.4 Synthetic image dataset

Since it takes approximately 10 minutes to manually label one image, it is not practical to collect a large dataset this way for this thesis project. An alternative solution is to generate synthetic images to obtain a larger dataset with accurate masks, which means taking hands and backgrounds from separate images and synthesizing them into new images. In order to get more diversity in hands, 72 more hand images of 9 people (8 images per person) were taken under different lighting conditions against a green screen, as shown in Fig. 4.9. With a green screen it is easy to separate hand and background and to synthesize the hands with other background images.

Figure 4.9: Examples of different hands against a green screen under different lighting conditions

To start with, simple-background images are used to explore an optimal solution for the proposed method, because it is relatively easy to visually evaluate clustering results on simple background images. Hence, the extracted hands are synthesized with images of pure colors, simple patterns and varying lighting conditions. Some example simple background images are shown in Fig. 4.10.

Figure 4.10: Examples of different pure backgrounds for synthesizing

In order to get a larger dataset, 40 hand images with pure or simple color backgrounds are taken from the 200 manually labeled images to generate synthetic test images with 21 simple background images. Besides, these 40 images are also used as a natural simple-background test set for comparison, to evaluate the performance of the SVM models trained only on synthetic images.

60 hand images from 12 different people (3-7 images per person) are taken from the rest of the 272 natural images to generate the synthetic images for training. Another 56 simple background images are used for synthesizing, and for each background, 20 hand images are randomly selected from the 60 images.

These images constitute the experiment dataset, used to explore an optimal choice of clustering solution and training features for the proposed multi-classifier approach.

Three example synthetic images are shown in Fig. 4.11; they are the synthesis results of Fig. 4.9 and Fig. 4.10.

Figure 4.11: Examples of synthetic images of simple background

4.1.5 Natural image dataset

In contrast to the synthetic images, the natural image dataset contains 272 hand images from 12 people, including the 200 manually labeled images and the 72 green-screen images. The natural image dataset is used for the final verification and evaluation of the proposed solution.

To sum up, the generation of both the experiment dataset and the evaluation dataset is shown in Fig. 4.12.

Figure 4.12: Experiment dataset and evaluation dataset generation process

4.2 Clustering


All possible combinations of 11 b values [4 5 6 7 10 15 20 49 100 199 256] and 21 k values from 5 to 25 are tested on the synthetic training dataset of 1120 images, using 1D histograms in the RGB, HSV, YCbCr or LAB color space. In this way, both small and large b values as well as odd and even numbers are covered. Both k-means and GMM are tested for clustering, and all three evaluation measures, in both Euclidean and cosine distance, are used to evaluate the clustering results.

Figure 4.13: Silhouette value color map using 1D histogram in LAB clustered by k-means (Euclidean distance)

The performance of the different clustering solutions in each color space is visualized by a color map, with blue to yellow representing evaluation criterion values from small to large, clustered using different k and b. For example, Fig. 4.13 shows the silhouette criterion values clustered using k-means with Euclidean distance in the LAB color space for different k and b. The highest score is at b = 4, k = 14.

Different clustering algorithms like k-means and GMM can produce clearly different patterns in the silhouette value color maps, as shown in Fig. 4.14, which means that the optimal k can be very different when clustering with different algorithms, even using the same clustering histogram.


To sum up, evaluated by silhouette values, the optimal clustering solution for the synthetic training dataset in LAB color space is to use k-means and 1D 4-bin LAB histogram in cosine distance with 14 clusters.

             k-means (Euc.)  k-means (cos)  GMM (Euc.)  GMM (cos)
Sil. values  0.8680          0.8720         0.6027      0.6587
k            14              14             14          11
b            4               4              4           4

Table 4.1: Highest Silhouette values clustered using LAB

(a) Silhouette value color map using 1D histogram in LAB clustered by k-means (cosine distance) (b) Silhouette value color map using 1D histogram in LAB clustered by GMM (cosine distance)

Figure 4.14: Silhouette value color map using 1D histogram in LAB clustered by k-means and GMM (cosine distance)

The clustering performance in different color spaces can be compared using the same criterion measure, such as the silhouette value.

The highest silhouette values and their corresponding b and k for each color space, in cosine and Euclidean distance, are shown in Tab. 4.2 and Tab. 4.3. The 1D 32-bin histogram in HSV proposed for clustering in [17] is also evaluated as a comparison. LAB and YCbCr show equally good performance in terms of silhouette value. The optimal number of clusters is 14, with a 4-bin histogram in cosine distance.

Color space        RGB     HSV     LAB     YCbCr   HSV (Li's)
Silhouette values  0.7650  0.7832  0.8720  0.8973  0.5870
k                  21      25      14      14      23
b                  4       4       4       4       32

Table 4.2: Highest Silhouette values (cosine distance) of each color space

Color space        RGB     HSV     LAB     YCbCr   HSV (Li's)
Silhouette values  0.7691  0.7800  0.8680  0.8626  0.6201
k                  25      25      14      12      25
b                  4       4       4       4       32

Table 4.3: Highest Silhouette values (Euclidean distance) of each color space

In the same way, the highest C-H values and their corresponding b and k for each color space in cosine and Euclidean distance are shown in Tab. 4.4 and Tab. 4.5. In this case, YCbCr has a clear advantage over the other color spaces, with an optimal number of clusters of 24.

Color space  RGB    HSV    LAB   YCbCr  HSV (Li's)
CH values    761.8  767.2  1860  3297   145.5
k            25     25     24    24     25
b            4      4      4     4      32

Table 4.4: Highest C-H values (cosine distance) of each color space

Color space  RGB    HSV    LAB   YCbCr  HSV (Li's)
CH values    802.5  764.5  1917  3271   152.5
k            24     25     24    24     25
b            4      4      4     4      32

Table 4.5: Highest C-H values (Euclidean distance) of each color space

The clustering results for the D-B values are shown in Tab. 4.6 and Tab. 4.7. YCbCr again gets the best score of all the color spaces, with a clustering solution similar to the one selected by the silhouette value.

Color space  RGB     HSV     LAB     YCbCr   HSV (Li's)
DB values    0.6183  0.5828  0.4372  0.3752  1.0074
k            22      22      15      14      24
b            4       4       4       4       32

Table 4.6: Best (lowest) D-B values (cosine distance) of each color space

Color space  RGB     HSV     LAB     YCbCr   HSV (Li's)
DB values    0.6445  0.5954  0.4372  0.3806  0.8828
k            25      24      15      13      23
b            4       5       4       4       32

Table 4.7: Best (lowest) D-B values (Euclidean distance) of each color space

As a result, simply changing between the Euclidean and cosine distance measures does not change the pattern of the color map much, as shown in the tables above and by comparing Fig. 4.13 with Fig. 4.14a, but different criterion measures can lead to different optimal solutions.


4.3 Training feature quantization

In order to examine whether quantization helps the classification, the normalized values are quantized with steps of 0.1 and 0.01 on [0, 1] to construct the training features, and both linear and Gaussian SVM models are tested for evaluation.

9 images similar to the one in Fig. 4.15 are used for training, and the image in Fig. 4.16 is used for testing. The evaluation measures are listed in Tab. 4.8 and Tab. 4.9.

Figure 4.15: Example training images for quantization

(a) Test image (b) Mask

Figure 4.16: Test image for quantization

(a) Result using normalized pixel values (b) Result using 0.01 step quantization (c) Result using 0.1 step quantization

Figure 4.17: Classification results using linear SVM

Evaluations (%)           Error rate  Precision  Recall  F score
Normalized pixel values   4.08        74.12      98.64   84.64
Quantized with 0.01 step  4.11        74.0       98.62   84.55
Quantized with 0.1 step   4.68        71.0       99.66   82.92

Table 4.8: Classification evaluation of different SVM models using linear kernel

(a) Result using normalized values (b) Result using 0.01 step quantization (c) Result using 0.1 step quantization

Figure 4.18: Classification results using Gaussian SVM

Evaluations (%)           Error rate  Precision  Recall  F score
Normalized pixel values   0.7         95.54      98.43   96.96
Quantized with 0.01 step  0.72        95.91      97.85   96.87
Quantized with 0.1 step   1.48        88.93      99.38   93.87

Table 4.9: Classification evaluation of different SVM models for group 1 using Gaussian kernel

As a result, there is no improvement in the classification results when using quantized features with either the linear or the Gaussian SVM model. Instead, the performance gets slightly worse as the quantization step grows. Hence, the normalized color values are used directly to construct the training features in the experiments of this project. The training feature of each pixel is thus a multi-dimensional vector of the pixel's values in different color spaces, with each value normalized to the range [0, 1].

4.4 Experiments on the synthetic dataset

To start with, simple-background images are used to explore a feasible clustering solution and training features, because it is easier to visually evaluate clustering results on simple backgrounds than on complex ones. The training feature is made up of different color values of each pixel. Superpixel or texture features are not used because they usually require much more computational capacity. Since all the training images within a group are similar to each other after clustering, there is no need to complicate the feature if a clear separation between hand and background can be found using single-pixel color information alone.

The training dataset is the 1120 synthetic images with simple backgrounds, and the testing dataset is the 40 natural images and 840 synthetic images, as described above. Some of the test results are shown in Tab. 4.10. All the experiments in this chapter use 4-bin histograms unless stated otherwise.


can be unnatural for the synthetic groups. Hence the following classification performance is mainly discussed based on the results from the synthetic test set, and the results from the natural test set are presented for reference.

                                                      840 synthetic images (%)      40 natural images (%)
test  clustering histogram  training feature      k   precision  recall  F score    precision  recall  F score
1     LAB                   LAB                   14  89.01      68.46   77.39      56.87      65.27   60.78
2     LAB                   YCbCr                 14  93.18      77.84   84.82      82.10      83.16   82.63
3     LAB                   RGB                   14  88.44      78.61   83.24      83.02      81.53   82.27
4     LAB                   HSV                   14  86.16      56.21   68.03      58.36      50.22   53.98
5     LAB                   RGB YCbCr             14  93.01      83.72   88.12      85.55      87.68   86.60
6     LAB                   RGB HSV YCbCr         14  67.11      75.04   70.85      55.51      95.29   70.15
7     LAB                   RGB HSV LAB YCbCr     14  88.07      88.61   88.34      50.16      93.29   65.24
8     YCbCr                 YCbCr                 14  92.38      71.98   80.91      56.77      82.00   67.09
9     YCbCr                 HSV                   14  51.19      45.29   48.06      44.57      63.16   52.26
10    YCbCr                 RGB                   14  68.79      54.53   60.84      39.74      73.69   51.63
11    YCbCr                 LAB                   14  93.83      61.27   74.13      65.30      77.22   70.76
12    YCbCr                 LAB                   24  80.33      67.87   73.58      49.79      77.14   60.52
13    YCbCr                 RGB HSV LAB YCbCr     14  91.00      76.46   83.10      52.33      86.21   65.13
14    YCbCr                 RGB HSV LAB YCbCr     24  88.18      82.94   85.48      46.89      87.34   61.02
15    RGB LAB YCbCr         RGB HSV LAB YCbCr     24  92.89      91.01   91.94      57.96      98.77   73.06
16    RGB LAB YCbCr         RGB HSV LAB YCbCr     35  88.88      92.45   90.63      59.86      98.82   74.56
17    RGB HSV LAB YCbCr     RGB HSV LAB YCbCr     24  91.43      90.38   90.90      56.66      96.06   71.27
18    RGB LAB YCbCr         RGB LAB YCbCr         24  93.08      85.89   89.34      66.67      97.14   79.07
19    RGB LAB YCbCr         RGB LAB YCbCr         14  95.32      83.28   88.89      68.70      98.73   81.02
20    nRnG HS AB CbCr       RGB HSV LAB YCbCr     25  63.28      97.44   76.73      28.76      99.40   44.61
21    nRnG HS AB CbCr       nRnG HS AB CbCr       25  71.73      92.36   80.75      46.25      95.91   62.41
22    RGB LAB YCbCr         nRnG HS AB CbCr       24  88.44      84.74   86.55      60.06      92.24   72.75
23    RGB LAB YCbCr         nRnG HS AB CbCr       35  83.01      87.44   85.17      58.47      93.16   71.84

Table 4.10: Overall classification results of different solutions

4.4.1 One color space for clustering and training

In the simplest case, the training feature for classification is made up of the color values of each pixel in either RGB, HSV, LAB or YCbCr, which are most commonly used for skin detection. The optimal clustering solutions for YCbCr and LAB obtained in Section 4.2 are used for clustering, so all the histograms have 4 bins per color component, with k = 14 or k = 24, measured in cosine distance.

Tests 1-4 in Tab. 4.10 show the results of clustering with LAB and classifying with other color spaces. These four tests are run on the same clustering result, using a 1D 4-bin histogram in LAB with 14 groups, which means that the classification results depend only on the SVM model trained with a certain color space. In the same way, the results clustered by a 1D 4-bin histogram in YCbCr with 14 groups are shown in tests 8-12.


The performance of classification using HSV is clearly bad in either case. There is a problematic group with dark brownish backgrounds in which none of the hand pixels can be classified. This means that the HSV color space is not helpful for separating hand and background in such a case, which might be due to the characteristics of the synthetic images.

Four more tests are implemented to further explore the classification performance when clustering by LAB, as shown in Tab. 4.11. Comparing tests 1 and 2, there is no obvious improvement from simply adding LAB to the training feature. However, the performance improves somewhat after adding both HSV and LAB, as shown in test 7 in Tab. 4.10. There is also a small improvement when using a larger number of clusters (k = 22), as in test 3. By comparison, a larger bin number is used in test 4 but the performance is worse.

                                                      840 synthetic images (%)      40 natural images (%)
test  clustering histogram  training feature    k    precision  recall  F score   precision  recall  F score
1     LAB                   RGB YCbCr           14   93.01      83.72   88.12     85.55      87.68   86.60
2     LAB                   RGB YCbCr LAB       14   89.51      85.87   87.65     79.71      95.66   86.96
3     LAB                   RGB YCbCr           22   93.47      84.94   89.00     83.86      81.66   82.75
4     LAB (6-bin)           RGB YCbCr           22   83.48      87.95   85.66     74.13      83.96   78.74
5     LAB                   nRnG CbCr           14   92.99      81.66   86.96     90.94      88.00   89.45

Table 4.11: Some classification results clustered by LAB

4.4.2 Multiple color spaces for training

Generally there is some improvement from adding more color spaces to the training feature, comparing the results of tests 5-7 with tests 1-4, or tests 13 and 14 with tests 8-12. Interestingly, when HSV is introduced without LAB, the performance drops significantly, as the comparison of tests 5-7 shows.

4.4.3 Clustering number k

From the experiments it is observed that the cluster number k does not make much difference to the classification results as long as it lies within an optimal range. A small k (such as below 10) can lead to coarse clustering, so the classification performance drops noticeably, while a large k does not bring substantial improvement, as the comparisons of tests 11 and 12, tests 13 and 14, or tests 18 and 19 show.

4.4.4 Group classification results

The classification result of each group in test 5 of Tab. 4.10 is shown in Tab. 4.12. Images are clustered by the 4-bin LAB histogram into 14 groups and trained with RGB and YCbCr color features. The last column indicates the number of test images indexed to each group during testing.


Group  error rate (%)  precision (%)  recall (%)  F score (%)  nr of images
1      2.8              100            84.01       91.31        83
2      0.69             99.06          96.69       97.86        77
3      0                100            100         100          40
4      1.44             91.65          99.67       95.49        80
5      7.57             89.69          59.6        71.61        200
6      0                0              0           0            0
7      4.36             99.91          69.06       81.67        12
8      32.94            67.06          100         80.28        65
9      1.29             92.3           99.78       95.9         120
10     0.57             100            95.66       97.78        40
11     0                0              0           0            0
12     6.53             100            59.23       74.4         83
13     0                0              0           0            0
14     0.06             100            99.28       99.64        40

Table 4.12: Classification results of each group with the best solution

Figure 4.19: Example training images and classification results of group 8

The biggest issue comes from group 8, where 25 images with orange background are wrongly indexed into a group trained only on images with blue background, as shown in Fig. 4.20. The first two are example training images, and the following four images are test images and their results. As a consequence, all the pixels in each orange image are classified as hand (all white), while the classification result of the blue test image is almost perfect. The wrong indexing might be due to the abnormal light distribution in the synthetic training images.

Figure 4.20: Example training images and classification results of group 8


Retraining group 5 with nRnG CbCr features gives an F score of 72.53%, comparable with that in Tab. 4.12. Test results on the same images are shown in Fig. 4.23. In some cases the refinement is obvious, but in other cases the performance is worse, depending on the hand color as well.

Figure 4.21: Example training images of group 5

Figure 4.22: Example test results of group 5

Figure 4.23: Test results of group 5 trained by nRnG CbCr

Some training images and test results of group 12, with dark red and brownish backgrounds, are shown in Fig. 4.24 and Fig. 4.25. In this case the test results depend largely on the skin color: as the third test image in Fig. 4.25 shows, when the skin color is very close to the background color, it is almost impossible to classify.

Figure 4.24: Example training images of group 12


Another observation is that similar images are indexed to different groups with different classification results, comparing Fig. 4.26c and Fig. 4.26d. This is probably caused by using synthetic images for training, which ignores the color balance within a natural image and leads to problematic clustering and indexing.

Figure 4.25: Example test results of group 12

Figure 4.26: (a) Some example good results; (b) some example bad results; (c) example test images indexed to the right group; (d) similar test images indexed to the wrong group


4.4.5 Multiple color spaces for clustering

Sometimes poor classification results come from wrong indexing, meaning that the test image is not indexed to the right model for classification even when the optimal clustering solution is used. In order to improve the clustering and indexing performance, multiple color spaces are used for constructing the clustering histogram.

4.4.5.1 Optimal clustering solution

As described earlier in this chapter, the silhouette value is used to measure the clustering performance of the 1 × 36 4-bin histogram concatenated over the RGB, LAB, and YCbCr color spaces. k-means is applied with 100 replicates. Over all k from 5 to 25, the highest score is obtained when k = 24 and b = 4.

Considering the variety in background color, hand color and lighting conditions, 24 might still be a small number of clusters for the training dataset. Hence, k is further explored from 23 to 40 with three bin numbers b = 4, 5, 6 to find the best clustering solution. In this case, k-means is applied with 2000 replicates to get a more accurate result, because the more groups there are, the more replicates are required to reach a global optimum due to the random initialization. The best result is obtained when k = 35 and b = 4, as shown in Fig. 4.27.
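This model selection could be reproduced along the following lines with scikit-learn (a sketch; L2-normalizing the histogram rows lets Euclidean k-means approximate the cosine-distance k-means used here, and the parallel loop over the bin number b is omitted):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    # H: (n_images, 36) matrix of concatenated 4-bin RGB/LAB/YCbCr histograms.
    Hn = H / np.linalg.norm(H, axis=1, keepdims=True)

    scores = {}
    for k in range(23, 41):
        labels = KMeans(n_clusters=k, n_init=2000, random_state=0).fit_predict(Hn)
        scores[k] = silhouette_score(Hn, labels, metric="cosine")
    best_k = max(scores, key=scores.get)  # k = 35 in these experiments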

Figure 4.27: Silhouette value color map using 1D histogram in RGB, LAB, YCbCr clustered by k-means (cosine distance)

4.4.5.2 Test results

Training images are clustered using the 1D 4-bin histogram in the RGB, LAB and YCbCr color spaces with 24 or 35 groups. The training feature is the pixel-level color information from RGB, HSV, LAB and YCbCr, or from RGB, LAB and YCbCr. The results are shown in tests 15-19. For comparison, training images are also clustered using the 1D 4-bin histogram in RGB, HSV, LAB and YCbCr with 24 groups, in test 17 of Tab. 4.10.


More groups do not necessarily provide better models for classification, especially when the dimension of the feature is low. Too many groups can lead to over-clustering and decrease the variety and number of training samples within each group, which is likely to cause overfitting for each trained model. In this case the classification becomes very sensitive to the clustering and indexing results.

As the test results show, the optimal clustering solutions obtained earlier in this chapter only provide an estimate of a suitable clustering solution for classification, and do not necessarily lead to better classification results than other lower-scored clustering solutions, because the optimal k depends on the choice of classifying feature, and there is also a trade-off between the variety of training image backgrounds and the number of training samples within each group.

4.4.6 nRnG HS AB CbCr

Evaluated in the same way, the highest score for the nRnG HS AB CbCr histogram is obtained when k = 25 and b = 4. The results are shown in tests 20-23 in Tab. 4.10. However, these results have no advantage over using the full color spaces: the performance decreases when some color components are excluded. This is similar to the observations in [15] [20].
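For reference, the chrominance-only feature could be assembled as below; the exact construction of the hybrid space is an assumption based on the component names (normalized rg, HS from HSV, AB from LAB, CbCr from YCbCr):

    import cv2
    import numpy as np

    def chrominance_features(img_bgr):
        # Per-pixel nRnG HS AB CbCr feature, luminance components dropped.
        b, g, r = cv2.split(img_bgr.astype(np.float32))
        s = r + g + b + 1e-6                      # avoid division by zero
        nrng = np.dstack([r / s, g / s])          # normalized rg chromaticity
        hs = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2HSV)[:, :, :2]
        ab = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2LAB)[:, :, 1:]
        crcb = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2YCrCb)[:, :, 1:]
        return np.hstack([x.reshape(-1, 2).astype(np.float32)
                          for x in (nrng, hs, ab, crcb)])  # (H*W, 8)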

4.4.7 Comparison

4.4.7.1 SKN

A hybrid color space SKN [19], which is optimized for skin detection, is tested for both clustering and classification as a comparison with the traditional color spaces. Results are shown in tests 1-4 of Tab. 4.13. It turns out that SKN performs clearly worse in both clustering and classifying. Like the traditional color spaces, SKN also has problems classifying dark red and brownish backgrounds as well as white yellowish backgrounds.

4.4.7.2 32-bin HSV


                                                      840 synthetic images (%)      40 natural images (%)
test  clustering histogram  training feature    k    precision  recall  F score   precision  recall  F score
1     LAB                   SKN                 14   92.31      73.84   82.05     84.62      73.62   78.74
2     SKN                   SKN                 14   81.00      52.13   63.43     60.03      59.36   59.69
3     YCbCr                 SKN                 14   90.98      65.67   76.28     58.66      78.69   67.21
4     YCbCr                 SKN                 24   86.79      73.74   79.73     52.47      78.78   62.99
5     HSV (32-bin)          LAB                 25   19.52      71.61   30.68     43.94      58.61   50.23
6     HSV (32-bin)          YCbCr               25   85.03      76.24   80.40     66.49      93.57   77.74
7     HSV (32-bin)          YCbCr               14   86.12      69.61   76.99     57.07      74.04   64.46
8     HSV (32-bin)          RGB YCbCr           14   90.10      80.16   84.84     71.16      84.60   77.30

Table 4.13: Classification results of solutions from other works

4.4.8 Conclusion

Although the performance of the proposed idea is limited by the issue with the synthetic dataset, which neglects the global color balance within an image and can cause problems when clustering and indexing natural images, the test results on the synthetic dataset can still provide some insights into the choice of clustering histogram and training features. The best solution among all the tests on the synthetic dataset is to cluster with the 4-bin RGB LAB YCbCr histogram into 24 groups and to train with color information from RGB HSV LAB YCbCr, as shown in test 15 in Tab. 4.10. However, comparable performance can be obtained by a simpler solution: clustering with the 4-bin LAB histogram and training with color features in RGB and YCbCr. To sum up, several assumptions can be made from the test results on the synthetic dataset.

1. Clustering by RGB LAB YCbCr, or by LAB alone, and classifying with the multiple color spaces RGB HSV LAB YCbCr performs better than the other tested solutions for the proposed multi-classifier approach.

2. The clustering histogram and the training feature mutually constrain each other with respect to the classification results.

3. The clustering number k does not have a big impact on the classification results as long as it is within an optimal range.

4. A higher-scored clustering solution does not necessarily provide a better k for the classification task than a lower-scored one; this depends on the evaluation measure and the dataset.

5. A small bin number, such as 4 bins per component, can lead to good clustering results.

6. Excluding the luminance component does not help to improve the clustering or classification results.


Chapter 5

Evaluation

Due to the issues raised by the synthetic images, the natural dataset is used to verify the assumptions drawn from the experiments on the synthetic dataset and to test the performance of the proposed idea. 8-fold cross validation is used for evaluation, and F score color maps are generated to evaluate the effect of the clustering solution on the final classification results.

5.1 8-fold cross validation

The 272 manually labeled natural images are used to evaluate the performance of the solution proposed in Sec. 4.4.8 via 8-fold cross validation. The dataset includes both simple- and complex-background images. It is randomly partitioned into 8 subsamples of 34 images each. One subsample is used as the validation set and the remaining subsamples are used for training. The process is repeated 8 times so that each subsample is used as the validation set exactly once. The average performance over the 8 repetitions evaluates the proposed method on the natural dataset. In this way, all the data can be used for both training and validation, which is a big advantage when the amount of available data is limited.
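The cross validation could be organized as in this sketch, where train_and_score is a hypothetical helper standing in for the full cluster-train-classify pipeline of Chapter 4:

    import numpy as np
    from sklearn.model_selection import KFold

    image_ids = np.arange(272)  # the manually labeled natural images
    kf = KFold(n_splits=8, shuffle=True, random_state=0)

    fold_scores = []
    for train_idx, val_idx in kf.split(image_ids):
        # 238 images for clustering and per-group SVM training,
        # 34 images indexed and classified for validation.
        fold_scores.append(train_and_score(image_ids[train_idx],
                                           image_ids[val_idx]))
    print("mean F score:", np.mean(fold_scores))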

Different cluster numbers k from 6 to 45 are evaluated with the silhouette criterion using the 4-bin LAB histogram in cosine distance. The highest silhouette value occurs at k = 8, and other k are tested for comparison. The results are shown in Tab. 5.1. They further confirm the conclusion from Sec. 4.4.8 that the best clustering solution does not necessarily lead to the best classification results, as in tests 1-4. Besides, adding the RGB and YCbCr color spaces to the clustering histogram improves the classification results, though only slightly, comparing tests 5 and 8, and tests 6 and 9. But using 4 color spaces can decrease the performance, as in test 11.

The results also show that using multiple color spaces as the training feature has an obvious advantage over using a single color space, comparing the results of tests 1-6 and 12-18.

References
