
DEGREE PROJECT IN INFORMATION AND COMMUNICATION TECHNOLOGY,
SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2018

A Novel System for Deep Analysis of Large-Scale Hand Pose Datasets

MARIA TOURANAKOU

KTH ROYAL INSTITUTE OF TECHNOLOGY

Abstract

This degree project proposes the design and the implementation of a novel system for deep analysis on large-scale datasets of hand poses. The system consists of a set of modules for automatic redundancy removal, classification, statistical analysis and visualization of large-scale datasets based on their content characteristics. In this project, work is performed on the specific use case of images of hand movements in front of smartphone cameras. The characteristics of the images are investigated, and the images are pre-processed to reduce repetitive content and noise in the data. Two different design paradigms for content analysis and image classification are employed, a computer vision pipeline and a deep learning pipeline. The computer vision pipeline incorporates several stages of image processing including image segmentation, hand detection as well as feature extraction followed by a classification stage. The deep learning pipeline utilizes a convolutional neural network for classification. For industrial applications with high diversity on data content, deep learning is suggested for image classification and computer vision is recommended for feature analysis. Finally, statistical analysis is performed to visually extract required information about hand features and diversity of the classified data. The main contribution of this work lies in the customization of computer vision and deep learning tools for the design and the implementation of a hybrid system for deep data analysis.

Keywords

Abstract (Swedish)

This degree project proposes the design and implementation of a new system for deep analysis of large-scale datasets of hand poses. The system consists of a set of modules for automatic redundancy removal, classification, statistical analysis and visualization of large-scale datasets based on their characteristics. In this project, work is carried out on the specific use case of images of hand movements in front of smartphone cameras. The characteristics of the images are examined, and the images are pre-processed to reduce repetitive content and noise in the data. Two different design paradigms for content analysis and image classification are used, a computer vision pipeline and a deep learning pipeline. The computer vision pipeline contains several stages of image processing, including image segmentation, hand detection and feature extraction followed by a classification stage. The deep learning pipeline uses a convolutional neural network for classification. For industrial applications with great diversity in data content, deep learning is suggested for image classification and computer vision is recommended for feature analysis. Finally, statistical analysis is performed to visually extract the required information about hand features and the diversity of the classified data. The main contribution of this work lies in the adaptation of computer vision and deep learning tools for the design and implementation of a hybrid system for deep data analysis.

Keywords (Swedish)


ACKNOWLEDGEMENT

I would like to warmly thank my supervisor, Shahrouz Yousefi, for his supervision and guidance during my thesis project. He has been a real mentor to me and his support has been key to my research work. I would also like to thank my professor and examiner, Markus Flierl, for providing the necessary knowledge background for my thesis project and for his feedback that helped me further improve my thesis report.

I would like to express my gratitude to my parents, Dimitris, Eirini and my sister, Katerina for their continuous support and encouragement throughout my studies and their sacrifices to help me get to this point and pursue my dreams.

Table of Contents

1 Introduction
1.1 Problem Definition and Motivation of the Project
1.2 Research Questions
1.3 Purpose
1.4 Contribution
1.5 Limitation
1.6 Outline
2 Theory
2.1 Data Redundancy
2.2 Feature Extraction
2.3 Object Detection and Object Recognition
2.4 Image Segmentation
2.5 Colour Image Processing
2.6 Colour Histograms
2.7 Similarity Analysis
2.8 Shape Descriptors
2.9 Deep Learning and Convolutional Neural Networks
2.10 AlexNet – CNN
3 Method
3.1 Dataset
3.2 Process
3.2.1 Module 1: Pre-processing for near-duplicate removal
3.2.1.1 Image to image Difference - Image subtraction
3.2.1.2 Colour matching as a similarity metric
3.2.1.3 From RGB to HSV space: Illumination matching as a similarity metric
3.2.2 Module 2: Image Classification - Computer Vision Pipeline
3.2.2.1 Image Segmentation
3.2.2.2 Post-processing of Segmented Images
3.2.2.3 Shape Features Extraction: Region of Interest Detection and Normalization
3.2.2.4 Features Selection and Feature Extraction
3.2.2.5 Data Clustering and Classification
3.2.3 Module 2: Image Classification - Deep Learning Pipeline
3.2.3.1 AlexNet and Deep Learning Classification
3.2.3.2 Ground Truth based on Deep Learning
4 Results and Evaluation
4.1 Results and Evaluation of Computer Vision Pipeline
4.1.2 Principal Component Analysis and Visualization
4.2 Results and Evaluation of Deep Learning Pipeline
4.2.1 Qualitative Evaluation on Background Class
4.2.2 Quantitative Evaluation
5 Discussion on System Novelty: Design and Implementation
6 Conclusion and Future Work


1. Introduction

1.1. Problem Definition and Motivation of the Project

Modern computer vision and machine learning solutions require deep analysis and processing of large-scale image datasets. Although the collection and generation of large-scale data is an essential part of the process, the performance of the systems developed on top of the data relies heavily on the quality of the data itself. In the special case of hand poses collected from users of smart devices (user-generated data), the complexity of the process increases as the input data can be highly diverse; the environment conditions (e.g. background, lighting conditions, etc.) and the characteristics of the hand poses (e.g. left vs right hand, skin tones, variations in the hand poses, etc.) may vary greatly. At the same time, input data can also be very similar to each other (e.g. specific hand poses like pointer, grab or pinch gestures). Data that end up being repeated in the database create data redundancy, which increases the complexity of manual annotation and pre-processing and leads to a much longer training process without any major improvement in the quality of the detection/classification results. Currently, redundancy removal and data annotation are principally done manually. Labeling images is not only a time-consuming process; it is also not a scalable approach for large-scale image datasets.

By removing redundant data that does not add new information, large-scale datasets can be reduced in size while a) retaining the same quality in data analysis and b) increasing performance. This greatly affects the manual pre-processing and the training process. Better quality data, and less of it, yields a more representative interpretation of the relationship between the features of the data and the respective classes, and it improves the classification results. However, when classifying large-scale datasets, the diversity of the input data is as important to take into consideration as the similarity within it. Highly different/sparse data can prevent classifiers from converging to significant classes/classification results. Therefore, redundancy removal is a significant research challenge that needs to preserve the balance between diversity and similarity of the content.

These requirements converge to a hybrid solution: a novel system for automatic redundancy removal, classification and statistical analysis of large-scale image datasets. The initiative for this thesis has been driven by the real-world challenges that a computer vision company may encounter when dealing with large-scale image datasets, such as in Internet of Things, robotics and other smart applications, as well as by challenges related to user-generated content such as images captured by regular smartphone cameras.

1.2. Research Questions

The research questions of the project lie in two principal directions:

RQ1: How can existing computer vision and machine learning tools be used to develop an effective system that automatically removes redundant data from the collected images in the dataset and retains the most representative data, covering the diversity of hand poses, skin tones, backgrounds, lighting conditions, etc., for image classification?

RQ2: How can the quality of the large-scale dataset of hand poses be evaluated using data mining techniques and statistical analysis tools?

1.3. Purpose

The purpose of the degree project is to design a novel system that can automatically:

- process and remove redundant data from a database, so that content that introduces great similarity without adding new information is not repeated and storage space is optimally utilized (the size of the database is significantly reduced) while the quality and the diversity of the content are maintained,

- analyze a database’s content a) by classifying data with high accuracy and b) by extracting and visualizing data analytics for the database in near real-time, providing an automatic statistical overview of the quality and the diversity of the database content.

The system operates on large-scale datasets based on their content characteristics. The objective of each module is slightly different: in pre-processing, redundancy removal is performed by removing near-duplicates from the data in order to speed up the post-processing; in image classification, feature extraction based on shape measurements and deep learning are tested to find the approach with the best accuracy given the underlying characteristics of the data; and in automatic statistical analysis, visualizations of the database are created (at the hit of a button) to automatically extract information about the data and its distribution.

1.4. Contribution

The contribution of this thesis project includes:

- The design and implementation of a hybrid system for retrieval, similarity comparison, classification and statistical analysis of images from a large-scale database based on their content (color and shape) characteristics.

- The automation of redundancy removal as an underlying innovative principle that speeds up image processing of large-scale datasets, and the quantitative definition of redundancy removal metrics for image pre-processing (parameters can be fine-tuned to perform analysis on different datasets).

- Showing that deep learning and convolutional neural networks perform better in the process of image classification of hand gestures in terms of accuracy than image classification based on traditional feature extraction in computer vision.

- Dynamic statistical analysis (information extraction and visualization) of large-scale datasets.


1.5. Limitation

This thesis project is an empirical research study that proposes the design and implementation of a new hybrid system for redundancy removal through similarity analysis, image classification, and evaluation of the quality of the data through data mining and visualization.

A literature review of the state-of-the-art methods for redundancy removal and image classification has influenced the project definition and has driven the vision for the design of the proposed system. Multiple concepts introduced in the literature set the background of the project; however, the motivation behind the methods and techniques used throughout the project lies in the empirical study and experimentation of the researcher and the suggestions of the supervising team. The overall goal of this research is to identify, test and adapt a subset of relevant computer vision and machine learning techniques for the implementation of a novel, customized system for deep data analysis that can reveal powerful insights about the data. In that respect, due to the hybrid nature of the system that incorporates different scientific principles, time limitations and resource constraints, the current project serves as a limited study of a subset of techniques and does not qualify as a holistic review of all the potential alternatives. At the same time, the project is neither an extension of an existing system reported by the scientific community nor a pure application of scientific research findings. The results of the image classification experiments are qualitatively and quantitatively evaluated and compared using the ground truth as a reference point. The selected techniques have been tested on a special case of data: image datasets containing human hands. However, the system has been designed such that it can be extended to other use cases without requiring fundamental changes; system parameters can simply be fine-tuned to perform in-depth analysis on different datasets.


1.6. Outline


2. Theory

2.1. Data Redundancy

In data-driven solutions based on large-scale image datasets, there is an emerging need for efficient data processing. Data may be redundant when similar information is repeatedly present in the database, significantly reducing the performance of processing and wasting system resources in storage space and computational power. Redundant data can also affect the workload of manual work, the lifecycle of a learning process, as well as the quality of classification by creating noise and deviation. In principle, the larger the volume of input data gets, the more variation should be added so that detection and classification quality can improve. In practice though, especially in the case of user-generated content from users of mobile devices, redundant and noisy data may be present in large volume in the database, so that the larger the dataset gets, the more difficult it becomes to maintain the quality of the detection/classification results.

Defining redundancy is an essential and challenging part of the system design of a computer vision application, and one that needs to be tailored around the specific nature of the data itself. Redundancy can be measured with respect to different variations of the input. For example, for a colour image with red (R), green (G), and blue (B) components (RGB image from now on) as input, near duplicates are mainly redundant, whereas for binaries of hand poses after background subtraction, similarity of the hand poses (repetition of the same hand pose over and over) can be considered redundancy.


Figure 1: Colour distribution is mostly the same in images a and b, and in images c and d, respectively.

Figure 2: Light distribution is almost the same.


Figure 4: Near-duplicate RGB images.

Redundancy removal is often performed manually, by selecting data that are unrepresentative and removing them from the dataset, which is time consuming, not scalable, and biased by human judgement. From a computer vision and/or machine learning perspective, redundancy removal can be achieved automatically through image similarity analysis (comparison), accurate image classification and faster data/statistical analysis. For instance, if with effective redundancy removal a 1M dataset can represent the initial 10M dataset, the manual work is reduced enormously.

The purpose of automatic redundancy removal of this project lies in two directions:

a) similarity analysis in image databases to decide on specific image descriptors that can compare characteristics (features) of the images such as the colour or the shape in order to keep only the most representative data to be processed. For example, RGB images that are near duplicates are not useful data, and

b) image classification in the most representative and accurate way so that further redundancy can be removed. For example, intraclass redundancy such as hand pose variations.

2.2. Feature Extraction

Feature extraction aims to produce a relevant representation of the images. The process of investigating and selecting the most appropriate features to be extracted from the images is of vital importance to the accuracy of post-processes such as image classification. Depending on the specific application, there are two main types of features to be extracted from the images: global and local features [5]. Although the term features is used interchangeably in scientific research with the term image descriptors, in this paper we define features as the properties of the image, and image descriptors as the feature vectors that can be used to distinguish one image from another [2], [5]. Descriptors can be created from components of the same type or of different types to measure statistical, geometric, algebraic, differential, or spatial properties of an image [2].

Global descriptors represent information in the whole image; they regard a global property of an image representing all the pixels, while local descriptors mark out key areas in the image, highlighting local properties for a subset of image pixels. Global descriptors can measure colour, shape or texture characteristics of an image and thus are usually appropriate for image retrieval, object detection and classification, whereas local descriptors are often used for object recognition [4], [5], [6]. The key difference between object detection and object recognition is discussed in the section on Object Detection and Object Recognition. Local descriptors are more distinctive and robust since they interpret local properties in the image. However, local descriptors require a significant amount of memory because the image may have hundreds of local features [4]. In global feature extraction, a single vector is produced by default to represent an image. One or more features, a set of features, can be extracted from the images to create a multidimensional feature vector, and different vectors can then be compared to compare images. In local feature extraction, usually more than a single feature vector is created, which is the reason this approach is more costly in processing. When a set of features is used to construct an image descriptor, appropriate weights can be applied according to the effect or the importance of each feature [2], [5], [6].

2.3. Object Detection and Object Recognition

In most cases, the output of an object detection algorithm is the coordinates of a bounding box within which the object of interest lies. On the other hand, object recognition is the process of investigating a region of interest to identify the type of object present in that area. Hence, object detection gives a binary answer to the search for an object, along with the coordinates of its bounding box if object localization is incorporated, while object recognition analyses the characteristics of the object in the region of interest and provides an estimation of the object’s identity.

Localizing and identifying objects of certain characteristics is one of the most challenging tasks in computer vision and machine learning [7]. In most cases, images or sequences of images (videos) are acquired under real conditions, and that presents a series of challenges. For example, most images are not object-centered [9], which can affect the detection and recognition process. Usually, objects that are captured in the margin of the image are not easily detected. Images can also contain complex backgrounds, diverse lighting/illumination conditions and a variety of other components/objects close to the objects of interest, presenting a high degree of noisy detail which may confuse the detection and recognition processes.

At the same time, even objects of the same type, namely of the same class, may visually vary in an image due to differences in the orientation or angle of the object, variation in the distance of the object from the camera at capture time (from now on, scale variation), as well as variations in the object's appearance such as shape, colour or pose.

This intraclass variation may add complexity to the identification/recognition process and, as a result, complicate image classification.

In this project, one of the objectives is to identify and localize an object of certain properties and its possible variations for image classification. More specifically, the goal is to identify whether a hand exists in a frame or not, to find the bounding box that highlights the region of interest within which the object of interest lies, and to analyse the variation of hand poses in order to perform accurate hand pose classification.

2.4. Image Segmentation


Segmentation partitions an image into meaningful non-intersecting regions that consist of sets of pixels “according to some objective criterion, homogeneity in some feature space or separability in some other one” [11]. The criterion according to which this subdivision is carried out highly depends on the problem being solved, but it is principally decided upon discontinuity or similarity of intensity values [12]. In the first case, sudden changes in pixel values are the basis of the partition, whereas in the second case, partitioning is based on regions that share similar values [12]. The end goal of segmentation is to isolate the objects of interest, and thus the segmentation criterion is application-specific.

Image segmentation is a relatively difficult task to perform in image processing, especially in cases of user-generated content where the environment may vary greatly from image to image. The major challenge is to design the process in such a way that the objects of interest can still be detected while irrelevant image details can be diminished [12].

Some typical examples of segmentation are seen in Figure 5.

Figure 5: Illustration of various image segmentation applications. (Source: from Wang, 2015 [11])

Image segmentation typically precedes object recognition. In that respect, the efficiency of object segmentation will highly affect the accuracy of the recognition process.

In this project, the specific case of hand gesture recognition and classification is studied. Segmentation aims at identifying and localizing the hand, which is the object of interest, and further analysis aims at identifying the gestures the hand performs. More specifically, a module for hand segmentation has been used in which the hand is segmented based on the skin colour of the hand [13], [14]. The algorithm looks for a skin-coloured object in each frame in order to locate the hand, then partitions the frame into background and hand by subtracting the background from the identified hand object, and segments the hand into a binary image for further analysis.
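The hand segmentation module referenced above is proprietary, but a minimal sketch of skin-colour-based segmentation of this kind can be written with OpenCV. The HSV bounds, kernel size and file name below are illustrative assumptions, not values from the thesis.

```python
import cv2
import numpy as np

def segment_hand(bgr_image):
    """Rough skin-colour segmentation: returns a binary mask of skin-coloured pixels."""
    hsv = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2HSV)
    # Illustrative skin-tone bounds in HSV; real systems tune these per dataset.
    lower = np.array([0, 40, 60], dtype=np.uint8)
    upper = np.array([25, 255, 255], dtype=np.uint8)
    mask = cv2.inRange(hsv, lower, upper)
    # Morphological opening/closing to suppress small noisy regions.
    kernel = np.ones((5, 5), np.uint8)
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
    return mask  # 255 where skin-coloured (hand candidate), 0 for background

binary = segment_hand(cv2.imread("frame.jpg"))
```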

Figure 6: Hand segmentation examples based on skin colour. We can observe that when image segmentation is accurate, the binary images reflect the hand gestures performed.


Figure 7: Hand segmentation examples based on skin colour. We can observe failing cases of hand segmentation, e.g. in cases of low contrast between the hand and its background.

2.5. Colour Image Processing

Colour is a powerful descriptor that can simplify object identification and extraction from a scene [12]. The colour that the human eye can perceive highly depends on the portion of light reflected from an object [12]. Usually a colour is represented in a colour space as a 3D vector with real values (e.g. RGB, CIE XYZ, HSV, CIELUV) [1].

In the human eye, colours are perceived as a combination of the primary colours red (R), green (G) and blue (B) [12], forming the widely known RGB colour space [15]. A colour space is a means of uniquely specifying, creating, and visualizing colours [16], [17], [18]. In the RGB colour space, each image is the composite of three images, one for each primary colour [12]. When working in the RGB space, operations are performed on each individual channel.

An alternative representation is the HSV (hue, saturation, value) colour space. Brightness can also be referred to as the value component (V). Hue (H) and saturation (S) are the components related to the perception of colour by the human eye, while the value is the intensity of a colour, namely how dark or light the colour is [15]. Hue is simply the most dominant colour in an image and it is the component that affects human perception when judging the colour of an object as blue, red or yellow [12]. Saturation is the amount of white light mixed with the hue, producing a relative purity of the hue. Value is the brightness of the colour, as indicated before. When the value is zero, the colour is black as there is no brightness, regardless of the hue or saturation values [19].

Hue and saturation are together called chromaticity. A colour can then be characterized by two major components, its chromaticity and its brightness [12]. In other words, hue and saturation define an absolute colour space, meaning that the colour information should not vary, while value introduces varying lighting conditions. For example, the same object may seem to have a different colour under different lighting conditions, although its colour is not varying.

The implication here is that different lighting conditions can affect colour detection, which is a major challenge for image segmentation that is usually based on identifying the object of interest by colour.

The principal difference between the RGB and the HSV colour space is that in the HSV space the intensity information is separated from the colour information [15]. That separation may be useful in case we need to isolate colour information or to extract lighting information. For example, the value component can be excluded from the analysis if we wish to eliminate illumination changes on the textures [15], or it can be exclusively analysed to make a judgment based on the lighting conditions of a scene.

Conversion from one colour space to another is possible through a mathematical transformation that translates the representation of a colour to another basis. Colour transformations are modelled with the expression:

g(x,y) = T[f(x,y)], (1.1)
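As a concrete, hedged example of such a transformation T, OpenCV can convert between colour spaces directly; note that OpenCV loads images in BGR channel order, and the file name is a placeholder.

```python
import cv2

bgr = cv2.imread("image.jpg")               # OpenCV reads images as BGR
hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)  # g(x,y) = T[f(x,y)] with T = BGR -> HSV
h, s, v = cv2.split(hsv)                    # hue, saturation and value planes
```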


2.6. Colour Histograms

A typical example of a global descriptor widely used in academia to characterize a digital image is its histogram. Colour is a vital low-level feature of an image. Histograms are simple, versatile and quick to compute [22], and they are a powerful tool to characterize the colour distribution of an image. The histogram of an image is a graph that maps the frequency of the different pixel intensity values present in the image. The histogram consists of a set of bins, where each bin corresponds to a particular intensity value. Each pixel in an image is assigned to a bin of the histogram, so that the value of each bin is the number of pixels with the corresponding intensity value [54]. In the simple case of an 8-bit grayscale image, the minimum gray level, namely its intensity value, is zero and the maximum is 255, so that 256 different gray-level intensity values can exist in that image. The histogram displays the 256 values and the number of pixels (frequency) with which each intensity value occurs [12], [23].

In a colour image, the colour histogram is a set of bins that represents the distribution of colours such that each histogram bin corresponds to a particular colour in the colour space. The number of bins depends on the number of colours that are existent in the image [54]. A colour histogram for a given image is defined as a vector:

H = { H[0], H[1], H[2], H[3]...H[i],...H[n]}, (1.2)

where i represents the colour bin in the colour histogram, H[i] represents the number of pixels of colour i in the image, and n is the total number of bins used in the colour histogram [54]. An RGB colour histogram can be computed for a colour image either by calculating the individual histograms of the red, green and blue channels, or as one 3D histogram whose three axes represent the red, green and blue channels and whose value at each point is the pixel count [23].

In order to compare images of different sizes, histograms should be normalized, usually by the total number of pixels in the image, more commonly known as the size of the image [54]:

Hnorm = H/N, where N is the total number of pixels in the image, (1.3)
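A minimal sketch of equations (1.2) and (1.3) for a colour image, assuming per-channel histograms normalized by the number of pixels; the bin count and file name are illustrative choices, not taken from the thesis.

```python
import cv2
import numpy as np

def normalized_rgb_histogram(bgr_image, bins=32):
    """Concatenated per-channel histogram, normalized by the image size (eq. 1.2-1.3)."""
    num_pixels = bgr_image.shape[0] * bgr_image.shape[1]
    channels = []
    for c in range(3):  # B, G, R channels
        h = cv2.calcHist([bgr_image], [c], None, [bins], [0, 256])
        channels.append(h.flatten() / num_pixels)  # H_norm = H / N
    return np.concatenate(channels)

hist = normalized_rgb_histogram(cv2.imread("image.jpg"))
```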


Although histograms can provide a powerful representation of an image, they also present a major drawback. Histograms only represent statistical information, with no indication of the location (spatial information) of the pixel intensities in the image. In that context, similar images can produce different colour histograms, while images with different colours can produce quite similar histograms [20], as illustrated in Figures 8 and 9.

Figure 8: Binary images a, b, c and d all produce the same histogram. Although the spatial distribution of intensities varies, the statistical information is the same, resulting in the same histogram.

Figure 9: RGB images a and b will also produce the same histogram. Although the distribution of intensities varies, the statistical information is the same, resulting in the same histogram.


Image indexing is a process of image retrieval from an image or video database based on their contents [1].

2.7. Similarity Analysis

In most image processing and computer vision tasks, a notion of similarity is introduced to determine the distance between images. The decision over a similarity metric or a combination of metrics highly depends on the research challenge being considered and/or the nature of the data.

To be able to compare images for similarity, a set of features needs to be extracted from the images in a numerical form. This information can then be used to distinguish one image from another using a distance metric.

Popular distance functions have been used as distance metrics to measure similarity among data. The most popular distance function, widely used in academia due to its simplicity, is the Euclidean distance, mathematically also known as the L2 norm; in practice its squared form (the sum of squared differences) is often used.

Definition: For a given pair of points p = (x, y) and q = (s, t), the squared Euclidean distance is defined as the sum of the squared differences of the components:

D(p, q) = (x - s)^2 + (y - t)^2, (1.4)

However, it is not always enough to use a single feature for image retrieval as it may not retrieve accurate results. Depending on the application, a combination of different features used for similarity analysis can increase the recognition power of image indexing [2], [27].

2.8. Shape Descriptors

The shape of an object can provide an overview of its geometrical information, which may be less sensitive to scale, orientation and location changes [32], and can play a vital role in recognizing the object itself. In data mining-based systems for large-scale image datasets, such as the one we propose in this project, the automatic characterization of image content based on shape similarity is a major concern [30] and the key enabler of efficient image classification.

Generic shape descriptors such as Fourier descriptors and moment invariants [31], can provide a high dimensionality feature vector to accurately describe specific shapes. There are also descriptors that provide information about a single characteristic “over a variety of shapes such as circularity, ellipticity, rectangularity, triangularity, rectilinearity, complexity, mean curvature, symmetry, etc.” [31]. Despite the variety of shape descriptors, there is not a particular one that works efficiently in all possible applications [31].

There are two main methods for shape analysis: shape descriptors based on the boundary of the object, and area descriptors that perform analysis on the points enclosed by the boundary of the object [31], [34]. Boundary-based descriptors are also called line descriptors, and they are used to calculate “the length of the irregular boundary or curvature of an irregular object in a digital image in terms of pixels” [34]. Area descriptors analyze certain characteristics of the object such as the area, the centre of gravity (centroid) or the orientation of the object [34].

Figure 10: A foreground region R in a binary image; white - 1 (foreground pixels), black - 0 (background pixels).


An interesting method often incorporated when dealing with shape analysis is the normalization of the images containing the objects of interest. The normalization factor may depend on the application. Normalization is principally done to eliminate scale variations in the images, so that regardless of the object being zoomed in or zoomed out, the object remains the same [34]. Of course, significant information is lost with that method, which may also be a drawback for specific applications. Within the scope of this project, simple descriptors are used to extract shape characteristics from the objects for image classification.

2.9. Deep Learning and Convolutional Neural Networks

In recent years, artificial neural networks (NNs) and deep learning have attracted a lot of attention, principally in applications of computer vision and pattern recognition [35]. The development of artificial neural networks has been inspired by our understanding of the structure and the function of biological neural networks such as that of the human brain [36]. In their simplest version, neural networks are structured into multiple layers of nodes incorporating a “feedforward” approach of distributing data. Feedforward means that data flow through the network in only one direction and neurons in one layer are only connected to neurons in the next layer, while there is no feedback loop where the output of the model is fed back to the system [37], [38].

Data fed to the input layer passes through the successive layers until it arrives, radically transformed, at the output layer. “During training, the weights and thresholds are continually adjusted until training data with the same labels consistently yield similar outputs” [39].

Feedforward networks formed the basis for the development of recurrent neural networks, where feedback connections are added to the model [40]. Convolutional neural networks (CNNs) are a specific type of neural network, either feedforward or recurrent, with multiple layers, that is widely popular for image processing. CNNs can infer information from the raw values of the image pixels with low-cost processing, as indicated by LeCun et al. [40] and Krizhevsky et al. [39], enabling highly representative, layered hierarchical feature extraction from image training data [41]. During the last decade, CNNs were revived in the form of deep neural networks (DNNs), as they proved to be significantly successful in large-scale image classification tasks [43], [44], [45]. Deep neural networks employ deep learning to discover intricate structures in high-dimensional data [50]. Traditional machine learning techniques require extensive engineering to design a feature extractor that can transform raw data into a useful representation from which classifiers can detect patterns [50]. Unlike traditional machine learning, deep learning uses representation learning. Representation learning is a set of methods that provide the capability to automatically infer representations for classification or detection from raw data.


2.10. AlexNet – CNN

Alex Krizhevsky, Ilya Sutskever and Geoffrey Hinton introduced in their research [44] a novel architecture of a large, deep convolutional network that achieved considerably better performance compared to the state-of-the-art neural networks of that time [45]. AlexNet, as the network was named after Alex Krizhevsky, succeeded in classifying 1.2 million high-resolution images into 1000 different classes in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) contest in 2012. The depth of the model was the key factor in its significant performance. Although CNNs had been previously studied as an alternative to traditional neural networks, the success of AlexNet launched the official discussion on deep CNNs in computer vision in the research community [44], [41]. Krizhevsky et al. realized the need for a network with larger learning capacity and a lot of prior knowledge about data that was not in the database, more powerful than traditional feedforward neural networks [44]. AlexNet is a convolutional neural network (CNN) that is eight layers deep: five convolutional layers, some followed by max-pooling layers, and three fully-connected layers, with 60 million parameters and 650,000 neurons [44]. Certain approaches in AlexNet's architecture, such as Rectified Linear Units (ReLUs) and dropout [47], max-pooling [44], [51], data augmentation [48] and Stochastic Gradient Descent (SGD) with momentum [49], sped up the training process, reduced the effect of overfitting and optimized image classification. A detailed definition of these terms is outside the scope of this paper.
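As an illustration only (not the implementation used in this thesis), the torchvision packaging of AlexNet can be loaded and its last fully-connected layer replaced for a small number of hand-pose classes; the class count, the input size and the torchvision >= 0.13 weights API are assumptions.

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 4  # assumption: e.g. open palm, closed palm, pointer, pinch

# model.features holds the five convolutional (and pooling) layers,
# model.classifier the three fully-connected layers described above.
model = models.alexnet(weights="IMAGENET1K_V1")      # ImageNet weights for transfer learning
model.classifier[6] = nn.Linear(4096, NUM_CLASSES)   # replace the final 1000-way layer

dummy = torch.randn(1, 3, 224, 224)                  # AlexNet-sized RGB input
logits = model(dummy)                                # shape: (1, NUM_CLASSES)
```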

3. Method

3.1. Dataset


Figure 11: Examples of different hand poses where differences in orientation are observed.

The dataset did not include data from front-facing cameras, and no human faces were included in the input data, addressing ethics and privacy concerns.

The users were only guided on how to navigate/transition from state to state in the hand gestures; no guidance was provided on the orientation of the hand when the gesture was performed, the distance of the hand from the camera when capturing, or the environment in which the recordings were taken.

Data has been captured in many environments and under varying lighting conditions, including variation of the image backgrounds and of the hand/skin tones in the dataset. Synthesized images have also been used in the dataset (different hand poses and various skin tones on the same background). Hand gestures have been captured at different distances from the camera and with varying orientations.

Figure 13: Images with background and lighting condition variation. We can also observe that the hand is not necessarily located in the center of the image.

This dataset of hand poses was sufficient to include a representative amount of variance in the data while reducing bias in the dataset.

A pre-built image segmentation module was used throughout our experiments, in which the orientation of the images was a defined parameter. The orientation of the images used as input in our experiments was the same; portrait images were selected by default.

A subset of the original dataset of 1000 images was used for the pre-processing part.


3.2. Process

3.2.1. Module 1: Pre-processing for near-duplicate removal

As discussed earlier in this paper, large-scale image datasets can present redundancy and noise. We have already defined redundancy as repetitive content that does not add new information to the system, and irrelevant content is considered noise. Under that scope, removing redundancy from the data improves performance characteristics of computer vision applications, such as faster processing as the volume of data is decreased while its diversity and quality are maintained, and improves accuracy in processes such as image classification, as noise is removed and does not confuse the classifiers. Hence, redundancy removal is a major enabler of efficient and accurate image processing.

Module 1 attempts to reduce redundancy in a large-scale database by discovering near duplicates: it removes already acquired data that does not add new information, so it is redundant, and it judges whether any new image is useful to store at the time of its acquisition. Near duplicates are defined as all images that duplicate a significant region [53]. For a given distance metric, two images are defined as near duplicates if their distance is smaller than a previously selected threshold T:

dist(.) < T, where T is a selected threshold, (1.5)

For images already stored, the pre-processing module ensures that near-duplicate (mostly similar) images are removed from the database. At the same time, any new candidate image entering the system passes through the pre-processing module in order to decide whether a near-duplicate image already exists in the database, and thus the image is redundant, or whether no near duplicate can be found in the database, so the image contains useful information and needs to be stored. In that way, post-processing is only performed on images that add content diversity.

Fine-tuned thresholds judge whether a pair of images is similar or not in terms of colour, light and image difference. For a given threshold T, two images are said to be similar if their distance is less than or equal to T for each of the three similarity metrics. To decide on the threshold values, manual evaluations of alternative T values were performed on a large-scale near-duplicate image set, and the values were set to empirically good thresholds for detecting near-duplicate images in the dataset. The distance function used is the Euclidean distance (or L2 norm).

For a reference image, the user can specify the number of similar images they wish to see on screen to validate redundancy against existing data. Weights are assigned to each of the three parameters according to their importance in similarity detection, and a combined metric characterizing the overall similarity of the input image decides whether the image needs to be added to the data or deleted.
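A hedged sketch of how such a decision rule could look; the weight and threshold values, the metric names and the combination rule are illustrative assumptions rather than the values tuned in this work.

```python
# Illustrative weights and thresholds for the three pre-processing metrics;
# the values tuned empirically in the thesis are not reproduced here.
WEIGHTS = {"rgb_diff": 0.4, "colour_diff": 0.4, "value_diff": 0.2}
THRESHOLDS = {"rgb_diff": 0.10, "colour_diff": 0.15, "value_diff": 0.20}
COMBINED_THRESHOLD = 0.12

def is_near_duplicate(distances):
    """distances: dict with normalized (0..1) values for the three metrics of a
    candidate/reference pair. The pair is flagged as a near duplicate if every
    metric falls under its own threshold, or if the weighted combination does."""
    under_each = all(distances[k] <= THRESHOLDS[k] for k in THRESHOLDS)
    combined = sum(WEIGHTS[k] * distances[k] for k in WEIGHTS)
    return under_each or combined <= COMBINED_THRESHOLD

# Example: a candidate that is close to a stored reference on all three metrics.
print(is_near_duplicate({"rgb_diff": 0.05, "colour_diff": 0.08, "value_diff": 0.11}))
```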


Figure 14: For the input image, we retrieve the most similar images in the database. We can observe that there is no similar image in the database and thus, the new image is useful to be stored.

Figure 15: More examples of similarity analysis based on the pre-processing metrics. We can observe that the input image is redundant in terms of illumination conditions and almost redundant in terms of RGB difference and colour difference.


Figure 16: The input image is almost redundant.

3.2.1.1. Image to image Difference - Image subtraction

Image subtraction, or image differencing, is the first similarity metric we use to discover redundant data. The difference between two images f(x,y) and h(x,y) is defined as

g(x,y) = f(x,y) - h(x,y), (1.6)
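A minimal sketch of equation (1.6) with OpenCV; using the absolute difference so the result stays displayable is an implementation choice for illustration, not necessarily the thesis code, and the file names are placeholders.

```python
import cv2

a = cv2.imread("image_a.jpg")
b = cv2.imread("image_b.jpg")

# g(x,y) = |f(x,y) - h(x,y)| per channel; near duplicates give a mostly black result.
diff = cv2.absdiff(a, b)
score = float(diff.mean())      # one scalar summarizing the image-to-image difference
cv2.imwrite("difference.png", diff)
```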

One could argue that the images in Figure 19 are near duplicates, since they contain the same hand gesture and half of the background information is similar.

Figure 17: Image a, Image b, Difference between image a and b

Figure 18: Image a, Image b, Difference between image a and b. The difference is relatively small and that is why the difference image appears relatively black.

Figure 19: Image a, Image b, Difference between image a and b. The difference is not small and that is why the difference image is only approximately half black when displayed on an 8-bit display.

Figure 20: Image a, Image b, Difference between image a and b.

Image subtraction is sensitive to the location of the hand in the image, as spatial information is taken into consideration. We can also see that lighting conditions vary per image, which may also influence the image comparison.

Figure 21: Image a, Image b, Difference between image a and b in brightness. We can better observe the influence of lighting conditions if we convert the RGB images to the HSV space and keep the V component, which corresponds to the brightness value.

Using image subtraction to compare each pair of RGB images yields two main observations: a) the spatial information is as important for similarity analysis as the colour distribution in the images itself, and b) the lighting conditions highly affect the images' appearance, either by diminishing distinct intensity values (low brightness) or by presenting mainly distinct values (high brightness).


Figure 22: Image differences in RGB pixel values.

Figure 23: Image differences in RGB pixel values.

3.2.1.2. Colour matching as a similarity metric

Image subtraction alone is not sufficient for pre-processing near duplicates: the spatial information is as important for similarity analysis as the colour distribution in the images itself. Colour is one of the most prominent features of an image, and the most widely used approach to compare images based on their colour composition is colour histograms. Hence, the second similarity metric we use to compare images is their colour difference.

For each pair of RGB images, the individual histograms of the red, green and blue channels are calculated. The histograms of different images are compared using the Euclidean distance, and their distance is defined as the colour difference. The comparison is conducted under the hypothesis that the individual RGB channels are independent. The computed histograms are normalized, scaling each element of the histogram so that each histogram represents an image without regard to the image size. For a given threshold T, two images are similar in colour if their colour histogram distance is equal to or smaller than the threshold value.
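A sketch of the colour-matching metric under the stated assumptions (independent R, G, B channels, size-normalized histograms, Euclidean distance); the bin count, threshold and file names are illustrative.

```python
import cv2
import numpy as np

def channel_histograms(bgr_image, bins=64):
    """One size-normalized histogram per colour channel (channels treated as independent)."""
    n = bgr_image.shape[0] * bgr_image.shape[1]
    return [cv2.calcHist([bgr_image], [c], None, [bins], [0, 256]).flatten() / n
            for c in range(3)]

def colour_distance(img_a, img_b):
    """Euclidean distance between the concatenated per-channel histograms."""
    ha = np.concatenate(channel_histograms(img_a))
    hb = np.concatenate(channel_histograms(img_b))
    return float(np.linalg.norm(ha - hb))

T_COLOUR = 0.1   # illustrative threshold; the empirical value is dataset-specific
similar_in_colour = colour_distance(cv2.imread("a.jpg"), cv2.imread("b.jpg")) <= T_COLOUR
```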


Figure 25: Example of two dissimilar images in colour.

We can argue that images showing a large colour difference may differ in skin colour or in the lighting conditions at the moment of capture. That also indicates that for simple backgrounds with similar lighting and colours, and/or for hand gestures of similar skin tone, the colour difference will be smaller.

Although colour histograms contain important information about colour images, they are bound to retrieve false positives in similarity comparison, as images with completely different content can have a similar colour composition. As a result, detecting similarity based on colour information alone is not enough, and some analysis of the brightness of the images may prove helpful.

3.2.1.3. From RGB to HSV space: Illumination matching as a similarity metric

Observations from manual testing lead to the conclusion that variations in lighting conditions highly increase the colour diversity in images. Although the RGB colour space is a useful starting point for representing colour features of images, it is not perceptually uniform, as equal distances in different intensity ranges and along different dimensions of the 3D RGB colour space do not correspond to equal perception of colour dissimilarity [54].

The value component can be derived from an RGB image as follows:

V = (R + G + B)/3, (1.7)

where R, G, B are the individual values in each channel of the original image [54]. For each pair of images, we compute the V (brightness) difference between them, and we set an empirical threshold T to separate similar from dissimilar images, with the actual purpose of roughly flagging images as dark or light.
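A small sketch of the illumination metric following equation (1.7); the threshold and file names are illustrative assumptions, and the two images are assumed to have equal size.

```python
import cv2
import numpy as np

def value_channel(bgr_image):
    """V = (R + G + B) / 3 per pixel, following equation (1.7)."""
    return bgr_image.astype(np.float32).mean(axis=2)

def brightness_distance(img_a, img_b):
    """Mean absolute difference of the V components of two equally sized images."""
    return float(np.abs(value_channel(img_a) - value_channel(img_b)).mean())

T_VALUE = 20.0   # illustrative threshold on a 0-255 scale, used to flag images as dark or light
similar_in_brightness = brightness_distance(cv2.imread("a.jpg"), cv2.imread("b.jpg")) <= T_VALUE
```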

Figure 25: For a reference image, the most similar or dissimilar images are retrieved in terms of the Value component.

3.2.2. Module 2: Image Classification - Computer Vision Pipeline

After module one, we have a pre-processed dataset of raw images with reduced redundancy. In this module, we design a computer vision pipeline that incorporates several stages of image processing, with the final one being a classification task. More specifically, we analyze the content of the images and we investigate similarity based on the object in the image by incorporating image segmentation, region of interest detection and largest object detection, feature extraction, and image classification, which gives a first rough estimate of the distribution of hand gestures in the data.

The goal is to detect three basic families of hand gestures: the palm family of open and closed palm, the pointer family and the pinch family. Anything else, if present, will be considered noise.

The motivation behind this pipeline lies in the hypothesis that, after successful recognition of the hand object in the images, feature extraction based on the shape characteristics of the hand will provide adequate information for classifying images into classes that share the same characteristics. Most probably, these classes will reflect the hand gesture families that we visually observe in the dataset.

The computer vision pipeline consists of the following steps:

- Image segmentation

- Post-processing of binary images including largest object and region of interest detection and possible normalization of images

- Feature selection and extraction based on shape characteristics of the object of interest (hand gestures)

- Data clustering and classification

3.2.2.1. Image Segmentation

For image segmentation, two different functions are tested to select the one that gives better segmentation results for our dataset. The first, publicly available, function is called “generate skinmap” [57], and some examples of its segmentation results are showcased in Figure 26. The second function is a segmentation module developed by the company, from now on called the “hand segmentation app”. The segmentation results from the hand segmentation app are presented in Figure 27.

A manual evaluation of the segmentation results was performed, in which 53% of the binary images include a clearly visible hand gesture that does not require further denoising pre-processing. Therefore, we proceed with the second approach tested.

Figure 26: On the left, the original image; on the right, the binary image resulting from the segmentation process using the “generate skinmap” function.


Figure 27: On the left, the original image; on the right, the binary image resulting from the segmentation process using the “hand segmentation app” function.

3.2.2.2. Post-processing of Segmented Images

The binary images resulting from image segmentation include various components of different sizes, as shown in Figure 27. After manual investigation, we assume that the largest component present in a binary image is the hand gesture being performed, whereas the smaller components are noise around the hand gesture.

To be able to analyze the hand gesture information, we detect the largest object in each binary image and keep only that component, discarding the rest of the components, as shown in Figure 28. In this way, we can compare the original images with the binary images containing the largest detected object, as shown in Figure 29.

Figure 28: In the first row, three original segmented images in their binary form are shown. In the second row, we see the binary images containing only the largest detected component.

Figure 29: After the detection of the largest component in the binary images and the noise removal, we can easily compare the original images with the segmented ones. We can observe that the white area in the segmented images corresponds to the hand gestures performed in the original images.

To be able to proceed with the analysis, attention will be focused only on the region of interest in the binary images and not on the full image. The region of interest (ROI) is the smallest rectangle that can be drawn around the object of interest, as showcased in Figure 30. We referred to this notion earlier in this report as the Bounding Box (B.B.) around the object of interest.

Figure 30: In the first row, the original segmented images containing only the largest detected object are shown, whereas in the second row the images containing only the ROI are shown. It is obvious that the ROIs are not uniform in size.
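A sketch of the post-processing chain described above (largest connected component, ROI cropping, optional normalization to a uniform size), written against OpenCV as an assumed toolkit rather than the module used in the thesis; the file name and the 128x128 target size are placeholders.

```python
import cv2
import numpy as np

def largest_component(binary):
    """Keep only the largest white connected component of a binary (0/255) image."""
    num, labels, stats, _ = cv2.connectedComponentsWithStats(binary, connectivity=8)
    if num <= 1:                                   # only background present
        return binary
    areas = stats[1:, cv2.CC_STAT_AREA]            # skip label 0, the background
    largest = 1 + int(np.argmax(areas))
    return np.where(labels == largest, 255, 0).astype(np.uint8)

def crop_roi(binary, normalize_to=None):
    """Crop the bounding box (ROI) around the foreground; optionally resize to a uniform size."""
    ys, xs = np.nonzero(binary)
    roi = binary[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
    if normalize_to is not None:
        roi = cv2.resize(roi, normalize_to, interpolation=cv2.INTER_NEAREST)
    return roi

mask = largest_component(cv2.imread("segmented.png", cv2.IMREAD_GRAYSCALE))
roi = crop_roi(mask)                                  # original-size ROI
roi_norm = crop_roi(mask, normalize_to=(128, 128))    # normalized ROI; 128x128 is an assumption
```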

3.2.2.3. Shape Features Extraction: Region of Interest Detection and Normalization

In order to perform shape analysis in the binary images containing the ROIs, an important decision needs to be taken: whether the analysis should focus on the original ROIs, meaning that we have to deal with images of different sizes, or whether we need to normalize the ROIs in order to perform our analysis on images of uniform size.

After careful investigation and manual experimentation, the conclusion is that there is no approach that fits all of our requirements. On the one hand, normalization of the ROIs eliminates scale variation in the images, but at the same time it deforms the original shape characteristics of the hand gesture. On the other hand, working on the original ROIs may bring more accurate results for certain features, reflecting the original shape characteristics of the hand gestures, while also incorporating noise due to scale variation.

For example, in principle, the ROI (or alternatively the frame) of an open hand gesture is more square than the ROI of a pointer hand gesture, which is much more rectangular, as seen in Figure 31. That highlights the importance of working on the original ROIs, as the analysis of shape characteristics may bring more accurate results for certain features.

Figure 32: Normalized ROIs. Scale variation is eliminated; however, the shape characteristics are deformed, as is easily observed in the first pointer gesture.

As illustrated by the figures above, both approaches, normalization and non-normalization, have their own drawbacks. However, each approach has its own advantages for the extraction of certain shape characteristics. Hence, we decide to proceed with a hybrid approach: extracting shape features from normalized and from non-normalized ROI images, as appropriate for each individual shape feature of interest.

3.2.2.4. Features Selection and Feature Extraction

The goal of feature selection regarding the shape characteristics of the hand gestures mapped in the ROIs is to find the most representative shape properties of the objects, which will later help the classifier separate images into meaningful hand gesture classes.

The end product of feature selection will be the extraction of relevant and significant shape properties from the objects of interest, such that each binary image is transformed into a feature vector representing the image, as shown in Figure 33. This is the process of feature extraction.

Regions of interest in the binary images include numerous shape properties of the hand gestures, resulting in a high-dimensional space of variables. Of course, some of these properties may be irrelevant to the classification problem we wish to tackle. In this step, it is highly important to investigate which shape properties (features) are relevant, as well as the correlation between different properties, as features that are strongly correlated may end up creating redundancy and increasing the dimensionality of our feature vectors unnecessarily.

The process of feature selection and extraction has been a tedious, repetitive, manual process of experimentation: plugging different shape features of the ROIs in and out and checking which combination of features would yield better classification results.

Experiments start by including 22 shape properties in the feature vector representing each binary ROI image. This vector is reduced to 14 parameters by excluding the variables that are not feasible to handle. These 14 parameters include shape properties extracted from the objects of interest such as the area of the object, the orientation, the center of mass, the perimeter, the major and minor axes, the convex hull, etc. Statistical analysis is performed on the distribution of each variable category, using the standard deviation and the mean to evaluate the significance of each variable. The original values of the variables are normalized by scaling them to the range between zero and one, and the mean value of each category of variables (each column) is subtracted from the respective variables to scale the elements in each category (column). The experiments highlight that ratios of variables (combinations of variables) perform better than single variables in detecting objects of similar shapes.
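A minimal sketch of the normalization step described above (min-max scaling of each variable column to [0, 1] followed by mean-centering); the feature matrix here is random placeholder data, not the thesis features.

```python
import numpy as np

rng = np.random.default_rng(0)
features = rng.random((100, 14))        # placeholder: 100 images x 14 shape properties

# Min-max scale every column (variable category) to the range [0, 1] ...
col_min = features.min(axis=0)
col_max = features.max(axis=0)
scaled = (features - col_min) / (col_max - col_min)

# ... then subtract each column's mean so that every category is centred.
centred = scaled - scaled.mean(axis=0)
```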

After multiple iterations, we conclude that the feature vectors that closely approximate the objects’ representation consist of 4 specific features:

- Elongation on original ROIs (X/Y)

- Diameter of the largest inscribed circle in the object on normalized ROIs (D)

- Area of the object over the frame size on normalized ROIs

- Number of fingertips present in the hand gesture

Elongation is defined as the ratio of the major axis over the minor axis, as showcased in Figure 34. Area is defined as the number of pixels enclosed by the boundary of the object, indicating the size of the object.

Figure 34: The area corresponds to the size of the object, i.e. the white region; the center of mass is the center of the ellipse fitted to the shape of the object; the major and minor axes are the largest and smallest diameters passing through the center, respectively.

We calculate the elongation on the original ROIs in an effort to distinguish the hand gestures performed, as shown in Figure 35. Of course, this approach is strongly biased by the scale of the hand gesture.

Figure 35: a) Elongation on the pointer hand gesture, b) elongation on the close hand gesture. We observe that the major axis is longer in the pointer gesture and, as a result, the ratio of the axes (the elongation) might separate close hand from open hand gestures.


With the ratio of the object area over the frame size, we wish to take advantage of the variation of hand gestures mapped into the uniform space of the normalized frame. In that way, we assume that hand gestures of the same class occupy a similar portion of the frame, such that, for example, close hand gestures approximately fit the full frame whereas pointer hand gestures fit in approximately half of the frame, as visualized in Figure 36.

Figure 36: Area over the frame size ratio on normalized ROIs.
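
A minimal OpenCV sketch of how these two ratios could be computed, assuming OpenCV 4 and a single-object binary ROI; elongation would be computed on the original ROI and the area ratio on the normalized ROI, and the names below are illustrative rather than taken from the thesis code:

    import cv2
    import numpy as np

    def elongation_and_area_ratio(binary_roi: np.ndarray):
        # Keep the largest white region as the hand object
        contours, _ = cv2.findContours(binary_roi, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
        contour = max(contours, key=cv2.contourArea)
        # Fit an ellipse to obtain the major and minor axes
        (_, _), (ax1, ax2), _ = cv2.fitEllipse(contour)
        elongation = max(ax1, ax2) / max(min(ax1, ax2), 1e-6)   # X/Y ratio
        # Object area over the full frame size (A/B ratio)
        area_ratio = cv2.contourArea(contour) / float(binary_roi.size)
        return elongation, area_ratio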

To approximate the hand gesture further, we also use the distance transform of the binary image to extract the largest inscribed circle of the hand object on normalized ROIs, as shown in Figure 40. The distance transform outputs a gray-level image that follows the same pattern as the original binary image, but the intensity values of the foreground region (the white region) are modified to reflect the distance of each point to the closest boundary.

A higher intensity value is assigned to the more distant points and a lower value to the points closest to the boundary, such that the higher the intensity values are, the brighter they appear in the plot of the distance map (Figure 37).

Figure 37: The original binary ROI image with the centroid location plotted on the hand gesture and, on the left, the distance map of the hand gesture. We can observe that the most central points are whiter, which corresponds to higher intensity values.


The point with the highest intensity value in the distance map is the furthest from any point on the edges of the smallest convex polygon drawn around the object of interest (the contour), namely the convex hull (Figure 38). In principle, the highest intensity value of the distance map corresponds to the center of the maximum inscribed circle inside the object of interest. Our third variable uses the diameter of that circle on normalized ROIs as a shape property of the hand gestures. However, there might be cases where the point of maximum intensity is not the center of the maximum inscribed circle; that deviation (bias) is accepted as part of our analysis.
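
A minimal sketch of this step with OpenCV's distance transform, assuming an 8-bit binary ROI where the hand is the non-zero foreground (names are illustrative):

    import cv2
    import numpy as np

    def max_inscribed_circle(binary_roi: np.ndarray):
        # Each foreground pixel gets the Euclidean distance to the closest boundary pixel
        dist_map = cv2.distanceTransform(binary_roi, cv2.DIST_L2, 5)
        # The brightest point of the distance map approximates the circle centre,
        # and its value is the radius of the maximum inscribed circle
        _, max_val, _, max_loc = cv2.minMaxLoc(dist_map)
        return max_loc, 2.0 * max_val   # (centre, diameter D)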

Figure 38: a) The contour, or alternatively the object of interest; b) the convex hull, or alternatively the polygon drawn around the contour.

Figure 39: The red dot represents the center of the circle, or alternatively the point with the highest intensity value in the convex hull. As indicated, there are cases where the center of the circle lies at the boundary of the object; these cases contribute the deviation (bias) mentioned above.


Figure 40: The circle corresponds to the maximum inscribed circle in the hand gestures on normalized ROIs.

With these variables (elongation, diameter of the inscribed circle, and size of the object over the frame size) we aspire to detect classes of similar gestures based on similar shape characteristics. However, such classes may not necessarily reflect the observable classes of hand gestures, where the hand pose variation is the main factor of separation. For that reason, a fourth variable is developed for shape analysis, based on the number of fingertips present in a hand gesture. In that context, it is assumed that gestures of the same family (back palm, open palm, pointer and pinch gestures) can be clustered together. To find the number of fingertips present in a hand gesture, we use the convexity defects [58]. The convex hull has already been defined as the polygon drawn around the contour of the object. A convexity defect is any deviation of the object of interest from its convex hull, such that if we compute the difference between the original binary image and its (filled) convex hull, the convexity defects appear as white regions, as shown in Figure 41.


To approximate the number of fingertips using the convexity defects, the connected (white) regions are found, the smallest regions are discarded, and the number of fingertips is estimated from the largest connected components that remain. Results from the estimation are shown in Figures 42 and 43.
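
A minimal sketch of this estimation, assuming OpenCV 4 and an 8-bit binary ROI; the minimum-area threshold and the way large defect regions are mapped to a fingertip count are illustrative assumptions that would still need tuning:

    import cv2
    import numpy as np

    def estimate_fingertips(binary_roi: np.ndarray, min_area: int = 200) -> int:
        contours, _ = cv2.findContours(binary_roi, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
        contour = max(contours, key=cv2.contourArea)
        # Draw the filled convex hull and subtract the hand to expose the defect regions
        hull = cv2.convexHull(contour)
        hull_img = np.zeros_like(binary_roi)
        cv2.drawContours(hull_img, [hull], -1, 255, thickness=cv2.FILLED)
        defects = cv2.subtract(hull_img, binary_roi)
        # Count the large connected defect regions and use them as the fingertip estimate
        n_labels, _, stats, _ = cv2.connectedComponentsWithStats(defects)
        large = [i for i in range(1, n_labels) if stats[i, cv2.CC_STAT_AREA] >= min_area]
        return len(large)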


Figure 43: Examples of the computed number of fingertips, including noisier images and inaccurate results.

3.2.2.5. Data Clustering and Classification

In order to decide on the subset of relevant features for image classification, both supervised and unsupervised techniques are used.

Supervised classification is used with a predefined number of classes to check whether each of our variables (extracted features) performs well in separating the hand gestures according to that specific feature. Limits on the variables' values are set, and the images are classified according to the computed values. The results of the supervised classification are used as an indicator to select the features that perform best in grouping similar hand gestures together.


For example, gestures are grouped by the number of detected fingertips into classes such as close pinch (0 fingertips), open pinch (2 fingertips) and pointer (1 fingertip), with anything else classified as noise. The number of expected classes in supervised classification is set based on these assumptions, which may introduce bias into the analysis. The results of supervised classification not only help discover features that classify the data into meaningful classes, but also assign a weight to each variable according to its performance.
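
As a simplified illustration of such a rule on the fingertip-count feature (a sketch only, not the exact class limits used in the thesis):

    def label_by_fingertips(n_fingertips: int) -> str:
        # Simplified class limits on the fingertip-count feature
        if n_fingertips == 0:
            return "close pinch"
        if n_fingertips == 1:
            return "pointer"
        if n_fingertips == 2:
            return "open pinch"
        return "noise"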

To explore the data distribution, unsupervised clustering is performed using k-means, plugging different sets of shape features extracted from the ROIs of the binary images in and out, as discussed in the previous section. It is assumed that, through experimentation, the most appropriate distance metrics (shape features) for clustering the data into meaningful classes of hand gestures can be selected. K-means is used repeatedly as a criterion to judge the effect of the shape features on the “natural” grouping of hand gestures.
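
A minimal scikit-learn sketch of this step, assuming the scaled four-dimensional feature vectors are stacked in a matrix; the number of clusters is an assumption that would be varied between experiments:

    import numpy as np
    from sklearn.cluster import KMeans

    feature_matrix = np.random.rand(1940, 4)          # placeholder for the extracted feature vectors
    kmeans = KMeans(n_clusters=10, n_init=10, random_state=0).fit(feature_matrix)
    cluster_labels = kmeans.labels_                   # one cluster index per image, inspected against the gestures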

After these iterations, it is concluded that a set of four shape features (elongation, diameter of the maximum inscribed circle in the gesture, number of fingertips detected in the segmented image, and the A/B ratio, where A is the size of the object and B the frame size) might be a good approach for characterizing the images. This approach is a hybrid solution using both normalized and non-normalized images, because for some of the features we need to equalize the scale of the hand gesture, whereas for other features the original image is more relevant for measuring the differences between gestures.

3.2.3. Module 2: Image Classification - Deep Learning Pipeline


Figure 44: Back palm family (BP), a) BP1, b) BP2, and c) BP3 respectively according to the varying BP states 1 - open, 2 - half closed, and 3 - closed.

Figure 45: Open palm family (OP), a) OP1, b) OP2, and c) OP3 respectively according to the varying OP states 1 - open, 2 - half closed, and 3 - closed.


Figure 47: Pointer family (PO), a) PO1, and b) PO2 respectively according to the varying PO states 1 and 2.

3.2.3.1. AlexNet and Deep Learning Classification

Deep learning is employed for image classification using AlexNet, one of the most studied, award-winning CNNs. For each input image, AlexNet outputs a label for the object in the image along with the probabilities for each object category [56].


Figure 48: The figure illustrates the use of a pre-trained network to speed up training. The last layers are replaced by task-specific layers so that task-specific features can be learned by the network. The parameters of the network can be fine-tuned to experiment with accuracy.

In this case, the network has been trained with a dataset of 30K images of hand gestures of both the right and the left hand, covering the 10 classes described in the previous section, such that 20 classes of hand gestures are possible. Images in which the hand object cannot be located or identified are classified as background information (a 21st class) and are considered noise in the classification task. The test dataset consists of 1940 images of right hand gestures only. As the test (validation) dataset, we use the same dataset used in the computer vision pipeline, in order to compare both approaches and draw conclusions from the results. It is expected that deep learning will produce more accurate results than the traditional computer vision techniques.

As indicated already, in this thesis project a new network is not created, nor are the layers of the network changed. The analysis focuses on transfer learning and the network parameters: a) the initial learning rate, b) the number of epochs and c) the number of iterations.


epochs are set to 4. Usually, the more epochs we have, the fewer iterations we need to perform during training.
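
The report does not tie this training setup to a specific framework; the following is a minimal PyTorch sketch of the transfer-learning configuration described above (21 output classes, a small initial learning rate and 4 epochs), where the data loader, optimiser and batch sizes are illustrative assumptions:

    import torch
    import torch.nn as nn
    from torch.utils.data import DataLoader, TensorDataset
    from torchvision import models

    # Load AlexNet pre-trained on ImageNet and replace the last layer with a 21-class output
    model = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
    model.classifier[6] = nn.Linear(4096, 21)   # 20 gesture classes + 1 background class

    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)  # small initial learning rate

    # Dummy batches standing in for the training images (224x224 RGB crops)
    train_loader = DataLoader(TensorDataset(torch.randn(8, 3, 224, 224),
                                            torch.randint(0, 21, (8,))), batch_size=4)

    for epoch in range(4):                      # epochs set to 4, as in the experiments above
        for images, labels in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()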

3.2.3.2. Ground Truth based on Deep Learning

One of the preceding steps in this approach is the use of deep learning to create the ground truth for our dataset, in order to have a common basis of understanding in our analysis. As ground truth we define the semi-automatic, semi-manual annotation of our (test) dataset into the respective hand gesture classes (20 classes for hand gestures, 1 class for background images). We aspire to use the effectiveness of deep learning to classify most of the hand gestures into appropriate classes, and then to manually move the outliers of each class into the right one where needed, such that the entire dataset is classified as it would be if we could achieve 100% accuracy.
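
A small sketch of this prediction-assisted annotation workflow (the folder layout and the predict function are hypothetical, used only to illustrate the idea of pre-sorting images for manual correction):

    import shutil
    from pathlib import Path

    def pre_sort_by_prediction(image_dir: str, out_dir: str, predict) -> None:
        # Copy every image into a folder named after its predicted class,
        # so that only the misclassified outliers have to be moved by hand
        for image_path in Path(image_dir).glob("*.png"):
            label = predict(image_path)          # e.g. "OP1", "PO2" or "background"
            target = Path(out_dir) / label
            target.mkdir(parents=True, exist_ok=True)
            shutil.copy(image_path, target / image_path.name)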

The ground truth is created for many reasons:

- It may help us evaluate the quality of the classification of the binary segmented images with the shape measurements (features) extracted in the computer vision pipeline. We can use the ground truth to assign different colours to images that belong to different classes, so that we can observe patterns in the data distribution and evaluate the computer vision pipeline.

References
