
DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2019

Real-time hand pose estimation on a smart-phone using Deep Learning

VALENTIN GOURMET

Master in Computer Science
Date: June 15, 2019
Supervisor: Mårten Björkman
Examiner: Danica Kragic Jensfelt
School of Electrical Engineering and Computer Science
Host company: Manomotion AB
Company supervisor: Jean-Paul Kouma


Abstract

Hand pose estimation is a computer vision challenge that consists of detecting the coordinates of a hand’s key points in an image. This research investigates several deep learning-based solutions to determine whether or not it is possible to improve current state-of-the-art detectors for smart-phone applications. Several models are tested and compared based on accuracy, processing speed and memory size. A final network is selected and detailed to compare it to the state-of-the-art. The proposed solution is obtained by combining the Differentiable Spatial to Numerical Transform layer to predict numerical coordinates together with the Fire module presented in the SqueezeNet architecture. This deep neural network contains around 1 million parameters and is able to outperform the current best documented model in all the metrics described above. A qualitative analysis is also performed to examine the predictions of the final solution on test images.

Keywords

Hand joints, Deep Learning, Convolutional neural networks, Artificial intelligence, Embedded devices.


Sammanfattning

Determining a hand’s orientation is a challenge in image analysis that consists of detecting the coordinates of the hand’s key points in an image. This study investigates a number of deep learning-based methods to determine whether it is possible to improve existing detectors for smart-phone applications. Several different models are tested and compared based on accuracy, computational speed and memory requirements. A final network is selected, analysed and compared with current state-of-the-art technology. The proposed solution is obtained by combining a so-called Differentiable Spatial to Numerical Transform layer, which predicts numerical coordinates, with a so-called Fire module, previously presented as part of the SqueezeNet architecture. This deep neural network contains around one million parameters and outperforms the currently best documented model in all the respects described above. A qualitative analysis is also performed to examine the final solution’s predictions on test images.


Acknowledgements

This thesis was conducted at Manomotion AB, a computer vision start-up based in Stockholm. I would first like to thank my supervisor at Manomotion, Jean-Paul Kouma, for he has been a great mentor during my stay there. He showed me that with enough passion and dedication, anything is achievable, and helped me to better myself in countless ways.

I would also like to thank Mårten Björkman, my computer vision teacher and supervisor at KTH, who consistently managed to provide very useful feedback while supervising two full thesis student groups. I am very grateful to Danica Kragic, who always found the time to promptly answer any of my questions. A huge thank you to Professor Alessandro Astolfi for granting me the opportunity to study abroad in Stockholm; this year abroad has been an amazing experience. He was an excellent teacher when I was in London and a great coordinator during my Erasmus exchange. Imperial College has truly been an incredibly rewarding journey, and I could not be more grateful for all the opportunities it has offered and will offer me. In Sweden I met wonderful people, including two of my fellow Imperial students, George Punter and Zoe Slattery. A special mention goes to them, as we shared endless hours working together and supported each other when we needed it the most. Finally, I would like to thank my whole family, for they always believed in me and have been a great source of encouragement even in the toughest moments.


Contents

1 Introduction
1.1 Thesis description
1.2 Report structure
1.3 Background
1.4 Problem definition
1.5 Purpose
1.6 Delimitations
1.7 Societal impact and ethics
1.8 Contribution
2 Literature review
3 Background Theory
3.1 Machine Learning
3.2 Deep Learning
3.3 The vanishing gradient problem
3.4 Hyperparameter tuning
3.5 Convolutional Neural Networks
3.6 Training deep convolutional neural networks
3.7 Convolutional neural networks in embedded devices
3.8 Pose estimation
4 Method
4.1 Software
4.2 Configuration
4.3 Model selection
4.4 Final architecture
5 Results
5.1 Quantitative results
5.2 Qualitative results
6 Conclusion
6.1 Future work
7 Bibliography


1 Introduction

1.1 Thesis description

This thesis investigates several deep learning models with different properties that aim to solve hand pose estimation and can run in real-time on a smart-phone. The Principal, Manomotion AB, offered this project as they are currently creating the next-generation technology in hand gesture recognition for smart-phones, and believe it could heavily rely on hand pose information. Because similar research has already been conducted in this area, the goal of this work is to determine if it is possible to outperform already existing solutions. The research question can then be formulated as follows:

Is it possible to improve the state-of-the-art algorithms on real-time hand pose estimation on a smart-phone, and if so to what extent?

Improvement will be evaluated based on three metrics: accuracy, speed and memory size.

1.2 Report structure

The report is organised into 6 major chapters. In the first one, a general introduction to the subject is presented, outlining why and how this problem is relevant in today’s society and what it consists of.

Then, Section 2 explores the research and results that already exist in this field. Section 3 is used to describe the theory behind the work that has been done, together with the motivation behind choosing specific methods or algorithms. In Section 4, the method followed to tackle the problem investigated is detailed, as well as the difficulties that were encountered. Section 5 presents both the quantitative and qualitative results and compares them to the state-of-the-art to determine if an improvement was possible and to what extent. Finally, Section 6 presents the final conclusions of this research and multiple options for future work.

1.3 Background

Despite the multiple impressive results that deep learning has achieved in computer vision over recent years, joints detection remains an ongoing challenge that is receiving a great deal of attention. This problem, also known as pose estimation, concerns the ability to locate many key points in one image (see Figure 1.1).

With solutions based on artificial neural networks now dominating classical computer vision approaches, another popular research area has been improving current network architectures for several purposes. These have included enhanced accuracy as well as processing speed and memory size, making them suitable for embedded platforms such as smart-phones. This research, proposed by Manomotion AB, was supervised by Jean-Paul Kouma (Manomotion) and Mårten Björkman (KTH), and examined by Danica Kragic (KTH).

Figure 1.1: Hand joints model used in this research. (Mueller et al. 2017)

1.4 Problem definition

Accurate hand joints detection is a task many scientists have tried to solve and struggled with, mostly due to two major factors: the lack of labelled data and the very nature of the problem. Depending on the viewing angle, the hand can have many self-occlusions, meaning it is not even always possible to see all of the hand’s key points in the image. Moreover, even though deep learning has proven to be the best solution for several computer vision tasks, many challenges arise when it comes to implementing an accurate and fast neural network. These involve:

• Finding enough labelled and clean training data.

• Overcoming the vanishing gradient problem.

• Identifying the key factors responsible for successfully tackling the problem at hand.

• Obtaining a model small and fast enough to run on an embedded platform such as a smart-phone.

Each of these issues will be developed in more detail in Section 3. This research specifically focuses on solving hand pose estimation from an egocentric (first person) viewpoint.

1.5 Purpose

Fast and robust hand joints detection could be used in numerous applications. Current trends include hand gesture recognition in smart-phone applications and augmented reality (AR) or virtual reality (VR) games. In the near future, other technologies will probably rely on it as well. This could be the case for smart-glasses, which have the potential to make smart-phones obsolete, or for health applications such as medical operations conducted remotely across the globe using 5G technology.

1.6 Delimitations

The integration of the hand joints detector into a smart-phone is not a part of this work. Manomotion offered the necessary software to successfully test the final solution on a smart-phone.

1.7 Societal impact and ethics

Because hands are humans’ primary tool for interacting with the world, being able to interpret their gestures using low-computational algorithms could profoundly change our society. With the current significant development of smart-glasses, this technology could help shift people’s ways of interacting with the world. The need for devices that connect us to the digital world via physical contact would disappear, which would completely transform our habits. From an economic point of view, this would have an enormous impact. There were 4.57 billion mobile phone users in 2018 (Statista 2019), and this number is forecast to keep increasing in the coming years. The market behind the replacement of phones with smart-glasses would therefore be huge and very profitable. This technology would also influence the lucrative video-game industry ($43.8 billion generated worldwide (Techcrunch 2019)), which is progressively turning towards augmented reality and virtual reality products, as was shown with Microsoft’s Kinect device.

It is important to underline, however, the ethical issues this technology raises. Collecting a large and realistic database could require filming users’ hands when they are not paying attention, which would nonetheless require their prior approval, something not necessarily easily given. Additionally, the data collected should be diversified enough to accurately represent people of different ages and skin colours. Indeed, if only white people’s hands were collected, the resulting algorithm would likely not work as well on black hands and could therefore be qualified as racist. For these reasons, it is crucial to have some ethical awareness when implementing such a product.

1.8 Contribution

This thesis aims to demonstrate that by combining multiple recent deep learning techniques, it is possible to improve existing solutions in the area of real-time hand pose estimation. This is done by first designing and testing several deep learning architectures, selecting the best one based on key metrics, and then comparing it to the state-of-the-art to determine whether it can be outperformed. The end deliverable is a network architecture saved in the ONNX format, together with the algorithm necessary to locate hand keypoints in an image. This format was chosen as it can easily be integrated into any smart-phone-based application that requires hand pose estimation.


2 Literature review

To date, joints detection from single RGB images has mostly been applied to hand and human body keypoints detection. While human pose estimation is not what is being targeted, it is similar enough to study in order to get insights into what could work best for hand pose estimation.

Pose estimation can also be divided into two categories: computer-based and smart-phone-based solutions. Both should be taken into account as it is possible to adapt computer-based solutions for embedded devices.

Currently, the most successful solutions rely on deep learning. One early successful attempt was to combine a deep convolutional network with a Markov Random Field (Tompson et al. 2014). This method jointly trains a deep CNN, using a multi-resolution input with overlapping receptive fields for feature extraction and heatmap prediction, and a Markov Random Field to enforce global pose consistency. Another successful landmark in pose estimation was the Convolutional Pose Machines presented in 2016 for human joints detection (Wei et al. 2016). It consists of successive convolutional stages, with each stage outputting one heatmap per joint and taking as input the original image added to the heatmaps from the previous stage. The loss is then computed pointwise on the heatmaps of the last stage and backpropagated through all the layers.

This heatmap matching technique inspired many subsequent solutions for both human and hand pose estimation. More specifically, the recent trend has been to use an encoder-decoder architecture for predicting heatmaps only once (Gouidis et al. 2018) (Mehta et al. 2017) (Nibali et al. 2018) (Simon et al. 2017) (Zimmermann and Brox 2017). These architectures first gradually reduce the input image’s size (encoder) with convolution and pooling layers for feature extraction, then expand these low dimensional features (decoder) to approximate heatmaps which aim to locate the position of the joints.

Such a method has already been successfully implemented on a smart-phone (Gouidis et al. 2018). In this paper, the authors managed to create a network that could predict joints in real-time, at 10 fps on a Google Pixel 2. To obtain a lightweight but powerful model, they combined a MobileNetV2-like (Sandler et al. 2018) architecture for feature extraction and dimensionality reduction, with a VNect-like (Mehta et al. 2017) network for heatmap prediction. The specificity of MobileNetV2 is that it introduced a new type of convolutional layer for computationally restricted devices, the inverted residual with linear bottleneck module.


As for Convolutional Pose Machines, the loss they used during training was computed pointwise on the predicted heatmaps, with target heatmaps being 2D Gaussians centered around the true key points. Since this is the only documented smart-phone-based solution, it will be used for comparison with the final results of this research. In 2018, a novel layer for predicting coordinates from heatmaps was introduced (Nibali et al. 2018). It aimed at tackling the most common issues faced when trying to solve pose estimation with either direct regression or heatmap matching. The authors showed that by using their new layer in adapted ResNet architectures, they could outperform most state-of-the-art methods for human pose estimation, especially in the case of low heatmap resolution, which is particularly relevant for embedded devices. This layer is further detailed in Section 3.8.


3 Background Theory

This section provides the background knowledge required to conduct this research and details the common challenges faced when implementing deep neural networks.

3.1 Machine Learning

Machine learning is an area of artificial intelligence that has proven to be extremely successful at solving different complicated tasks. It consists of algorithms that learn key features from the data they are fed and make predictions based on these features. The data provided is often annotated, meaning the algorithm has one or multiple target values that it should predict based on its input. This type of training, called supervised learning, is illustrated in Figure 3.1. It is used throughout this research, as the aim is to have an algorithm that predicts the pixel coordinates of a hand’s joints based on an input image.

Figure 3.1: Supervised training in machine learning. (Leonel 2018)


Two concepts are crucial to consider when it comes to making a successful machine learning algorithm: generalisation and overfitting.

When an algorithm is given some data to train with, it learns a mapping from an input space (in this case, hand images) to an output space (coordinates of the joints). However, it learns this mapping based only on the training data. Consider the example in Figure 3.2 where the aim is to predict some continuous output value y based on an input x, with a restricted amount of data samples. These data samples come from an underlying true function, which the algorithm tries to approximate. Even though the sampling process might have introduced some noise and the number of samples is restricted, the goal of the algorithm is to learn the true function.

Figure 3.2: The overfitting phenomenon. (Scikit-learn 2019)

Figure 3.2 underlines the three major cases that can occur:

• The algorithm cannot learn well enough (left), in which case it underfits the training data.

• The algorithm learns the training data “too well” (right), which leads to a low error on the specific data points used for training. While this could indicate the algorithm learned well, it has actually overfitted the dataset. This means that given new inputs that the algorithm has not seen during training, the predicted values might be far away from the true target values, which is undesirable.

• If the algorithm manages to learn the true underlying function rather than overfitting the training samples (middle), it is said to generalise well: given new inputs drawn from the same true function, the predicted outputs will be very close to the target outputs, which is the optimal solution.

Several methods to reduce overfitting and improve generalisation will be explored in Section 3.6.

3.2 Deep Learning

Artificial neural networks are a specific type of machine learning algorithm. Inspired by the neural connections in the human brain, neural networks are made of computational layers stacked on top of each other, as shown in Figure 3.3. Each layer is connected to the previous one via weights which are continually updated during training so as to improve the network’s predictions.

Figure 3.3: A basic neural network. (Jay 2017)

The output of each neuron is computed in a two-stage process: first, a linear combination of its inputs and its weights is added to a bias term. Then, an activation function (such as sigmoid, ReLU or softmax) is applied on top of that value, as summarised in Figure 3.4. This provides neural networks the ability to have non-linear outputs, and therefore learn a highly non-linear target function. Deep neural networks, as opposed to shallow ones, are simply networks with many stacked layers. Deep learning therefore consists of training deep networks.
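As a minimal illustration of this two-stage computation (see also Figure 3.4 below), the following Python sketch implements a single neuron with a sigmoid activation; the values used are arbitrary examples, not taken from the thesis:

```python
import numpy as np

def neuron_output(x, w, b):
    """One neuron: a linear combination of inputs and weights plus a bias,
    followed by a non-linear activation (here a sigmoid)."""
    z = np.dot(w, x) + b             # stage 1: weighted sum plus bias
    return 1.0 / (1.0 + np.exp(-z))  # stage 2: sigmoid activation

# Example with three inputs (arbitrary illustrative values).
x = np.array([0.5, -1.2, 3.0])
w = np.array([0.1, 0.4, -0.2])
print(neuron_output(x, w, b=0.05))
```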

Figure 3.4: Mathematical operations in one neuron. (Jordan 2017)

It has been observed that deeper networks tend to be more accurate than shallow ones or any other machine learning algorithm if provided with enough data, therefore the tendency has been to make networks deeper instead of wider. However, a major restriction arose that led to deep network solutions being abandoned for more than a decade: vanishing gradients.

3.3 The vanishing gradient problem

To update its weights during training, a neural network computes an error based on its current prediction(s) and its target value(s), and backpropagates this error through its layers to modify them in order to improve its future predictions. At each iteration, the chain rule is used to take the derivative of a user-chosen error function with respect to each parameter, and the current values are modified following the formula in Figure 3.6, with lr the learning rate used to scale the strength of the update.


Figure 3.5: Update of 1 parameter (neuron) during backpropagation (Jay 2017).

Figure 3.6: Weight update formula. (Jay 2017)

This is done following the Gradient Descent algorithm and has the objective of moving the weights towards values that minimise the loss function (Figure 3.7).


Figure 3.7: Using Gradient Descent to find a minimum. (Suryansh 2018)

For deep neural networks, the derivative term from the chain rule (Figure 3.6) for parameters in the early layers becomes a product of numerous quantities, most of which are smaller than 1. The total product will therefore be very close or equal to 0, meaning that the weight does not get updated. This is called the vanishing gradient problem, as the gradient used to update the parameter vanishes to 0. While this has been the biggest challenge to resolve in order to efficiently train deep neural networks, it has not been the only one.
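A small numerical sketch of why this product vanishes, assuming sigmoid activations (whose derivative never exceeds 0.25):

```python
# Chain-rule product across layers: one derivative factor per layer.
# With sigmoid activations each factor is at most 0.25 (reached at z = 0),
# so even in the best case the product shrinks geometrically with depth.
for depth in [2, 5, 10, 20]:
    print(depth, 0.25 ** depth)
# depth 20 -> ~9.1e-13: the gradient reaching the early layers has vanished.
```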

3.4 Hyperparameter tuning

Another difficulty that arises when trying to train deep networks is choosing the right hyperparameters. These include the learning rate, the number of layers, the number of neurons in each layer, and the input and output format. It is still very hard to analytically understand what features deep networks are extracting, and therefore complicated to efficiently design network architectures for a specific task. So far, most state-of-the-art solutions have been developed using empirical rules, and trial-and-error remains the most widely used technique for designing new deep learning models. While several methods, which will be developed in Section 3.6, have been shown to improve the training of deep networks, hyperparameter tuning is still another challenge to overcome, as no rigorous solution exists for choosing them.


3.5 Convolutional Neural Networks

As explained in Section 3.2, neural networks are composed of stacked layers, where each neuron in a layer is connected to all the output neurons of the previous layer. When the input is an image, the initial way to adapt such an input format for neural networks was to flatten the image data from a W x H matrix into a (W*H) x 1 vector, and then use this representation as input to the network, as shown in Figure 3.8.

Figure 3.8: Flattening of a 3x3 image matrix into a 9x1 vector. (Saha 2018)

While this method worked for small grayscale images, it became problematic for larger colour images. If the input is instead a 1920x1080 RGB image that is flattened and fed to a neural network with 100 neurons in the first processing layer, this already requires 1920 x 1080 x 3 x 100 = 622,080,000 parameters, which is obviously undesirable in an embedded device.

Convolutional neural networks (CNN) have tackled this issue by using two new types of processing layers: Convolution layers and pooling layers.

A convolution layer is made of a certain number of filters of specific dimensions (Figures 3.9 and 3.11), both of which are specified by the user.

As described in Equation 1, its output g(x,y) is the convolution between its filters ω and the input data f(x,y), followed by an activation function applied pointwise. The convolutional kernels are made of parameters that are shared across the input nodes; therefore, unlike in regular neural networks, each weight in a filter does not have to be connected to all the nodes in the previous layer. Instead, each filter is slid (convolved) across its input layer in order to compute its output.

$g(x, y) = \omega \star f(x, y) = \sum_{s=-a}^{a} \sum_{t=-b}^{b} \omega(s, t)\, f(x - s, y - t)$  (1)

Figure 3.9: The convolution operation. (Choulwar 2019)

Two additional parameters must be taken into account when using convolutional layers: padding and stride. As Figure 3.10 illustrates, padding consists of adding additional zeros around the input images so as to preserve their dimensionality throughout the convolutional layer, whereas the stride determines by how much the convolution kernel is slid over its input channels and can be used for dimensionality reduction.


Figure 3.10: Padding and stride in a convolution layer with a 3x3 filter. (Géron 2017)

Figure 3.11: One full convolutional layer. (Prijonor 2018)
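As a hedged illustration of these two parameters, the following PyTorch sketch shows how padding preserves the spatial dimensions while a stride of 2 halves them (the layer sizes are arbitrary examples):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 224, 224)  # (batch, channels, height, width)

# A 3x3 kernel with padding=1 preserves the spatial size ("same" padding).
same = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
print(same(x).shape)     # torch.Size([1, 16, 224, 224])

# A stride of 2 slides the kernel two pixels at a time, halving the resolution.
strided = nn.Conv2d(3, 16, kernel_size=3, padding=1, stride=2)
print(strided(x).shape)  # torch.Size([1, 16, 112, 112])
```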

A pooling layer is used to reduce the input dimension as well as to extract the most important features. The two most common pooling layers are the average pooling layer and the max pooling layer, the operations of which are shown in Figure 3.12. It has been shown that the max pooling layer, which simply extracts the maximum value in an input window and acts like a noise suppressor, works better in practice.


Figure 3.12: Pooling layers. (Choulwar 2019)
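The two pooling variants in PyTorch, shown on an arbitrary feature map for illustration:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 16, 112, 112)
print(nn.MaxPool2d(kernel_size=2)(x).shape)  # [1, 16, 56, 56]: keeps the max of each 2x2 window
print(nn.AvgPool2d(kernel_size=2)(x).shape)  # [1, 16, 56, 56]: averages each 2x2 window
```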

Convolutional Neural Networks are therefore deep neural networks commonly made of stacked convolution and pooling layers with a few fully connected layers at the very end (Figure 3.13). The role of the convolutional layers is to extract high-level features from the raw input data, whereas the fully connected layers provide an easy way of learning non-linear combinations of those extracted features for the final predictions.

Figure 3.13: Basic Convolutional Neural Network. (Saha 2018)

For computer vision tasks which require a high-dimensional output (such as image segmentation or heatmap matching), the common process is to map the learned low-dimensional features back to a higher dimension using upsampling methods. These normally include nearest-neighbour interpolation or bilinear interpolation. However, these algorithms always apply the same fixed formula instead of learning how to upsample optimally. This motivated the introduction of transposed convolution layers: like normal convolution layers, they are made of filters with learnable parameters convolved over an input layer, and they can produce a higher-dimensional output than their input.

3.6 Training deep convolutional neural networks

A lot of research has been done on how to optimise and improve the training of deep convolutional neural networks as they showed they could outperform most regular algorithms. This section aims to detail the methods that have been used in this research.

3.6.1 Reducing overfitting

As discussed in Section 3.1, overfitting occurs when the algorithm learns the training samples “too well” and is not able to generalise. Multiple methods can be used to reduce this problem.

• Data augmentation

Having a sufficient amount of labelled data helps a lot when it comes to generalising well. While it is relatively easy for a deep network to learn 1000 training examples by heart, it becomes a lot harder when there are more than 100,000, which forces it to learn the underlying important features. However, labelling data by hand can be very hard and time-consuming, whereas doing so artificially is extremely fast. Data augmentation consists of several techniques for creating additional labelled samples from already existing ones.

• Regularization

Regularization is a method used to control the complexity of a machine learning model. It can be implemented in several ways and aims to reduce overfitting. In this work, a special type of distribution regularization is used and will be developed in Section 3.8. It additionally constrains the network to learn a specific distribution for the predicted heatmaps.

• Dropout

Dropout is a specific regularization technique for deep neural networks, popularised in a paper published by Nitish Srivastava et al. in 2014 and depicted in Figure 3.14. When dropout is applied to a layer with a parameter value p (between 0 and 1), during training each neuron’s output in that layer is forced to zero (it is dropped out) with probability p at each epoch. The researchers showed that using dropout reduced overfitting in all the situations tested, compared to regular networks (a minimal usage sketch follows Figure 3.14).

Figure 3.14: Dropout. (Budhiraja 2016)
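A minimal sketch of dropout in PyTorch; note that the layer is only active in training mode:

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)   # each activation is zeroed with probability 0.5

drop.train()               # training mode: outputs are dropped (and rescaled by 1/(1-p))
print(drop(torch.ones(8)))

drop.eval()                # inference mode: dropout is a no-op
print(drop(torch.ones(8)))
```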

3.6.2 Reducing the vanishing gradient problem

As discussed in Section 3.3, the vanishing gradient problem has been one of the biggest obstacles to the development of deep neural networks. In recent years, several methods have been shown to reduce this phenomenon.

• Xavier and He Initialization and Non-saturating activation functions

In 2010, Xavier Glorot and Yoshua Bengio identified two major suspects explaining the difficulty of training neural networks (Glorot and Bengio 2010): weight initialization and the activation function. Until then, initial parameter values were drawn from a normal distribution with zero mean and a standard deviation of 1, and the most popular activation function was the sigmoid.

The scientists showed in this paper that these two factors combined made the variance of each layer’s output much bigger than its input’s variance. This caused deeper layers to saturate, as the output of the sigmoid gets pushed towards 1 or 0, where its gradient is almost null (see Figure 3.15). This means that during backpropagation, the layer’s weights do not get updated and therefore the network cannot learn.

Figure 3.15: Sigmoid activation function. (Géron 2017)

The first issue was resolved using a new formula for weight initialization, called Xavier Initialization. Their argument was that in order for the signal to flow properly both forwards and backwards, the output variance of each layer should be equal to the input variance, and the gradients should have equal variance before and after flowing through a layer. While both cannot be guaranteed if the number of input nodes is not equal to the number of output nodes, they showed that a specific initialization strategy could offer a good compromise for approximating such goals. The idea is that for each layer, depending on its activation function, the initial parameters should be drawn from a distribution whose parameters depend on the layer’s number of inputs and outputs.

Regarding the saturation of the sigmoid function, this was tackled by replacing it with the Rectified Linear Unit (ReLU), whose input response is shown in Figure 3.16. This was argued to be a much better choice as it does not saturate for large input values and is also very fast to compute.


Figure 3.16: ReLU activation function. (Sarkar 2018)

Unfortunately, this function suffers from the dying ReLU problem: if the weighted input sum of a neuron is negative, then after applying the ReLU rectification its output will be zero, meaning such a neuron is unlikely to come back to life. This was later resolved with two novel activation functions, namely the Parametric Rectified Linear Unit (PReLU) (He et al. 2015b) and the Exponential Linear Unit (ELU) (Clevert, Unterthiner, and Hochreiter 2016), whose responses are described in Figures 3.17 and 3.18. While ELU outperformed all ReLU variants in Clevert’s experiments, it is slower to compute than PReLU, which is an important factor when running a network on an embedded device.


Figure 3.17: Parametric Rectified Linear Unit (PReLU) (He et al. 2015b)

Figure 3.18: Exponential Linear Unit (ELU) (Géron 2017)

In addition to coming up with the PReLU activation function, Kaiming He et al. created a new initialization method for ReLU and its variants, which was shown to improve the training of deep networks compared to previous strategies. The different formulas for choosing initial weights are summarised in Figure 3.19.


Figure 3.19: Xavier and He initialization. (Géron 2017)
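In PyTorch, both strategies are available as initialisers that can be applied directly to a layer's weight tensor, as in this sketch (the layer sizes are arbitrary):

```python
import torch.nn as nn

layer = nn.Conv2d(64, 128, kernel_size=3, padding=1)

# Xavier (Glorot) initialization, suited to sigmoid/tanh activations.
nn.init.xavier_uniform_(layer.weight)

# He (Kaiming) initialization, suited to ReLU and its variants.
nn.init.kaiming_uniform_(layer.weight, nonlinearity='relu')
nn.init.zeros_(layer.bias)
```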

• Batch normalization

This technique was developed by Sergey Ioffe and Christian Szegedy in 2015 and consists of adding an extra operation in each layer (Ioffe and Szegedy 2015): before applying the activation function, it zero-centers and normalizes the layer’s output, then scales and shifts the result using two new parameters, as described by the equations in Figure 3.20. To do so, it needs a mean and standard deviation. When the training dataset is too big, the network’s parameters are often updated by processing mini-batches of training samples; the mean and standard deviation used are therefore those of the mini-batch currently being processed.

This technique has been shown to drastically reduce vanishing gradients.


Figure 3.20: Batch Normalization. (Géron 2017)
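A minimal usage sketch: BatchNorm2d learns one scale and one shift parameter per channel and normalises over the current mini-batch (shapes are illustrative):

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm2d(num_features=16)  # one (scale, shift) pair per channel
x = torch.randn(32, 16, 56, 56)       # a mini-batch of 32 feature maps
y = bn(x)  # zero-centred and normalised per channel, then scaled and shifted
print(round(y.mean().item(), 3), round(y.std().item(), 3))  # close to 0 and 1
```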

• Residual learning

Going deeper with neural networks has been one of the major goals in the machine learning community, as deeper networks tend to show much better results than smaller ones. In a paper published in 2015, Kaiming He et al. observed that going too deep could actually hurt the network’s performance, ending up with a worse training error than a smaller network. To tackle this problem, they introduced the Residual block (He et al. 2015a). It consists of two processing layers, with the specification that the output of the second layer is added pointwise to the input of the residual block before the activation function is applied (see Figure 3.21).

Figure 3.21: Residual Block. (He et al. 2015a)

One intuition behind such a block is that when going deeper, the network should perform at least as well as its shallow version; it should therefore be able to easily learn the identity function between its input and its output. Since the layers are initialized with small weight values, it is possible that at the beginning of training, the layers inside a residual block output values close to zero. Adding the original input on top of these values helps the residual block learn the identity mapping, as its output will then be very similar to its input. This technique led the researchers to create a 152-layer-deep network which outperformed all state-of-the-art architectures at the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) and was even able to outperform humans (Figure 3.22).

Figure 3.22: ILSVRC competition winners. (Das 2017)
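A sketch of such a residual block in PyTorch, following the description above (the exact layer sizes are illustrative):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two 3x3 convolutions whose output is added pointwise to the block's
    input before the final activation, as in He et al. (2015a)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)  # skip connection: add the input before activating

print(ResidualBlock(16)(torch.randn(1, 16, 32, 32)).shape)  # [1, 16, 32, 32]
```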


• Transfer learning

One of the reasons why deep CNNs are extremely good at solving visual tasks is that they can learn hierarchical feature representations through their multiple layers. Similarly to the visual cortex in the human brain, they decompose an image into features at different levels of representation. Figure 3.23 illustrates that while early layers capture low-level features such as edges or blobs, the last layers capture higher-level features.

Figure 3.23: Feature extraction in CNNs. (Vision 2019)

Training a very deep neural network from scratch is very complicated, especially when the amount of labelled data is limited. One method is therefore to find a network that was trained to solve a similar task, and reuse its lower layers for solving a new specific task. This is called transfer learning. The intuition behind it is that low-level features are very similar between related tasks, so reusing the corresponding layers puts the new network in better initial conditions for learning. As a consequence, it considerably speeds up training, since low-level features are already learned, and provides much better results. It is common to freeze the lower layers such that their parameters stay fixed, and only retrain or modify the upper layers to solve the new task at hand, as depicted in Figure 3.24.

Figure 3.24: Transfer learning. (Géron 2017)
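A hedged PyTorch sketch of this freezing procedure, using the pretrained SqueezeNet backbone that Section 4.4 later builds on (the torchvision API of the time is assumed, and the 21-joint head is an illustrative assumption, not the thesis's exact head):

```python
import torch.nn as nn
from torchvision import models

backbone = models.squeezenet1_1(pretrained=True)  # weights learned on ImageNet

# Freeze the pretrained feature extractor: its low-level features transfer.
for param in backbone.features.parameters():
    param.requires_grad = False

# Only newly added, task-specific layers are then trained, e.g. a head
# producing one heatmap per joint (21 joints assumed here for illustration).
head = nn.Conv2d(512, 21, kernel_size=1)
```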

3.7 Convolutional neural networks in embedded devices

Although deep neural networks have shown impressive results at solving very different tasks, a key issue when it comes to using them in embedded devices is the computational power and memory they require. For example, one of the current best CNN architectures for image classification, ResNet-50, has 25.6 million parameters. Several attempts have been made to reduce networks’ size, number of parameters and processing time, so that convolutional neural networks can fit in embedded devices. Such methods are described in the following subsections.

3.7.1 Fully convolutional network (FCN)

Most common convolutional network architectures consist of a series of convolutional layers followed by pooling layers, with one or a few fully connected layers at the very end (Figure 3.25).


Figure 3.25: Typical CNN architecture. (Das 2017)

However, the fully connected layers are responsible for most parameters: In AlexNet (Krizhevsky, Sutskever, and Hinton 2017), another powerful CNN architecture, 96.2% of its 61 million parameters come from the fully connected layers. Therefore, one approach has been to simply remove the fully connected layers and find other types of connection to solve computer vision tasks (Long, Shelhamer, and Darrell 2014). Section 3.8 will detail how this can be applied to coordinate prediction.

3.7.2 SqueezeNet

Convolutional layers in deep CNNs can still account for millions of parameters. The SqueezeNet architecture developed by Forrest N. Iandola et al. unveiled a new type of convolutional block, called the Fire module, which enabled the model to match AlexNet’s accuracy on an image classification dataset while using 50x fewer parameters. A Fire module is made of two sub-layers: a squeeze layer followed by an expand layer (Figure 3.26). The squeeze layer is made of 1x1 filters and is used to reduce the number of input channels, thereby reducing the number of channels convolved with 3x3 filters in the expand layer. The expand layer consists of both 1x1 and 3x3 filters, whose outputs are concatenated. This both reduces the number of filter parameters, by using some 1x1 filters instead of only 3x3 kernels, and enables the capture of different types of information from the input.


Figure 3.26: Fire module from the SqueezeNet architecture. (Iandola et al. 2016)

Additionally, for Fire blocks whose number of outputs equals their number of inputs, bypass connections were used, as in residual blocks from the ResNet architecture (He et al. 2015a): the input is added pointwise to the output before applying the activation function.
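A sketch of a Fire module with the optional bypass, using the 1/8 squeeze and 1/2 expand ratios that Section 4.4 adopts; this is an illustrative reconstruction, not Iandola et al.'s reference implementation:

```python
import torch
import torch.nn as nn

class Fire(nn.Module):
    """Squeeze (1x1) -> expand (parallel 1x1 and 3x3, concatenated),
    with an optional residual-style bypass when the shapes allow it."""
    def __init__(self, in_ch, out_ch, bypass=False):
        super().__init__()
        squeezed = max(1, out_ch // 8)                 # compression ratio 1/8
        self.squeeze = nn.Conv2d(in_ch, squeezed, kernel_size=1)
        self.expand1x1 = nn.Conv2d(squeezed, out_ch // 2, kernel_size=1)
        self.expand3x3 = nn.Conv2d(squeezed, out_ch // 2, kernel_size=3, padding=1)
        self.relu = nn.ReLU()
        self.bypass = bypass and in_ch == out_ch       # only when shapes match

    def forward(self, x):
        s = self.relu(self.squeeze(x))
        out = torch.cat([self.expand1x1(s), self.expand3x3(s)], dim=1)
        if self.bypass:
            out = out + x  # add the input before the activation, as in ResNet
        return self.relu(out)

print(Fire(128, 128, bypass=True)(torch.randn(1, 128, 28, 28)).shape)  # [1, 128, 28, 28]
```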

3.8 Pose estimation

Pose estimation is the task of determining the pose of an object in an image. It is often achieved by localising the key points of such an object, for example the body’s or the hand’s joints. Three main methods exist for predicting joint coordinates using neural networks.

3.8.1 Coordinate regression with a fully connected output layer

The first approach for pose estimation has been to use a convolutional neural network with a fully connected layer at the end (Toshev and Szegedy 2013), whose role is to directly output the (x,y) coordinates of each desired joint in the input image. However, “fully connected layers are prone to overfitting, thus hampering the generalisation ability of the overall network” (Lin, Chen, and Yan 2013). Combined with the lack of a huge amount of labelled data available for this type of problem, such a method was therefore not considered for this thesis.

3.8.2 Heatmap matching

Heatmap matching is nowadays the most used technique for joints detection (Tompson et al. 2014) (Newell, K. Yang, and Deng 2016) (W. Yang et al. 2017). For each joint in the image, a target image (heatmap) is obtained by creating a Gaussian blob around the true joint location (see Figure 3.27).

Figure 3.27: Heatmaps for human pose estimation. (Pfister, Charles, and Zisserman 2015)

During training, the network takes an image as input and outputs k heatmaps, one for each joint; it therefore learns the locations of all the joints in parallel. The objective is to make the network learn a Gaussian distribution instead of a finite coordinate value. The loss between the predicted and target heatmaps is computed pixel-wise. The absence of fully connected layers makes such networks much better at generalisation.

During inference, however, the final desired outputs are (x,y) coordinates for all the joints, not heatmaps. The final predictions are therefore taken as the argmax of each heatmap, i.e. the x and y coordinates of the brightest pixel in each heatmap.
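A minimal sketch of both steps, creating a Gaussian target heatmap for training and decoding a predicted heatmap with argmax at inference (the sizes are arbitrary examples):

```python
import numpy as np

def gaussian_heatmap(h, w, cx, cy, sigma=2.0):
    """Target heatmap: a 2D Gaussian blob centred on the true joint (cx, cy)."""
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))

def decode(heatmap):
    """Inference-time decoding: pixel coordinates of the brightest pixel."""
    y, x = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    return x, y

target = gaussian_heatmap(64, 64, cx=20, cy=35)
print(decode(target))  # (20, 35)
```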


3.8.3 Differentiable Spatial to Numerical Transform (DSNT)

Even though heatmap matching has been widely used compared to direct coordinate regression, it has two major flaws. The first and most important one is that the loss being optimised is not the actual metric of interest: the gradient starts flowing backwards at the heatmaps and not at the final coordinates, because the gradient of the argmax function cannot be computed. So even though only the brightest pixel is used for the final predictions at test time, all heatmap pixels are used during training. As can be seen in Figure 3.28, there are cases where the final coordinates are correct but receive a high loss, and similar cases where the final coordinates are wrong but the loss is lower (MSE stands for Mean Square Error and is a common loss metric).

Figure 3.28: One example where heatmap matching does not work well. (Nibali et al. 2018)

The second flaw is that using the argmax function for coordinate prediction introduces quantization issues: it is very common for the output heatmaps to have a lower resolution than the input image, in order to lower the computational complexity. In that case, the precision of the output heatmap cannot match the input image’s when using the argmax function. This is especially limiting when low-resolution heatmaps are needed to keep computations tractable in a power-restricted device such as a smart-phone. Some researchers identified these flaws and tackled them by creating a new type of convolutional layer (Nibali et al. 2018), entitled the Differentiable Spatial to Numerical Transform (DSNT) layer. This layer is fully differentiable and adds no trainable parameters. It is used to adapt fully convolutional networks (FCN) to coordinate regression. The output of the FCN represents k unnormalized heatmaps, where k is the number of joints to predict. Two transforms are then applied: first, the heatmaps are normalized using a heatmap activation function, such that all values in each heatmap are positive and sum up to 1, so as to mimic a probability distribution. Then, each heatmap is transformed into a pair of (x,y) coordinates normalized between -1 and 1, by computing the Frobenius inner product between the heatmap and user-defined X and Y matrices generated using the formula in Equation 2, as explained in Figure 3.29.

$X_{i,j} = \frac{2j - (n+1)}{n} \qquad Y_{i,j} = \frac{2i - (m+1)}{m}$  (2)

Figure 3.29: Normalized heatmap ($\hat{Z}$) to coordinates using the DSNT transform. (Nibali et al. 2018)
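A compact sketch of the DSNT transform under these definitions; softmax is assumed as the heatmap activation, one of the options discussed by Nibali et al.:

```python
import torch

def dsnt(heatmaps):
    """heatmaps: (batch, k, m, n) unnormalized heatmaps, one per joint.
    Returns (batch, k, 2) coordinates normalized to [-1, 1]."""
    b, k, m, n = heatmaps.shape
    # 1) Heatmap activation: softmax makes each map positive and sum to 1.
    z = torch.softmax(heatmaps.reshape(b, k, -1), dim=-1).reshape(b, k, m, n)
    # 2) Coordinate matrices X and Y from Equation 2 (indices start at 1).
    j = torch.arange(1, n + 1, dtype=z.dtype)
    i = torch.arange(1, m + 1, dtype=z.dtype)
    X = ((2 * j - (n + 1)) / n).expand(m, n)
    Y = ((2 * i - (m + 1)) / m).unsqueeze(1).expand(m, n)
    # 3) Frobenius inner products <Z, X> and <Z, Y>: the expected coordinates.
    x = (z * X).sum(dim=(-2, -1))
    y = (z * Y).sum(dim=(-2, -1))
    return torch.stack([x, y], dim=-1)

print(dsnt(torch.randn(2, 21, 28, 28)).shape)  # torch.Size([2, 21, 2])
```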


The loss can then be computed as the Euclidean distance between the predicted coordinates μ and the ground truth p as described in Equation 3, thereby enabling the optimisation of the most relevant metric.

$\mathcal{L}_{euc}(\mu, p) = \lVert p - \mu \rVert_2$  (3)

Using such a method provides multiple benefits, which the paper summarises in Figure 3.30.

Figure 3.30: Advantages of using the DSNT layer. (Nibali et al. 2018)

The researchers also provide regularization techniques: they argue that since many possible outputs of the FCN could yield the same final coordinates, there is no guarantee that the model will “have strongly supervised pixel-wise gradients through the heatmap during training”, which is desirable. They show experimentally that some regularization methods help to improve the final results. In this case, regularization is implemented by adding a penalty term to the loss function, as described in Equation 4, with λ the regularization coefficient measuring how much importance is given to the penalty term.

$\mathcal{L}(\hat{Z}, p) = \mathcal{L}_{euc}(DSNT(\hat{Z}), p) + \lambda \mathcal{L}_{reg}(\hat{Z})$  (4)

Therefore, the network does not learn to strictly minimise the Euclidean distance between its predictions and the ground truth, but has to take the additional penalty into account. Specifically, they use regularization as a means to force the normalized heatmaps to look like a spherical Gaussian centered around the desired joint. To do so, a divergence measure between each heatmap and the target normal distribution is added to the original Euclidean loss. In addition to minimising the Euclidean distance, this forces the output heatmaps to look like Gaussian distributions; the best results were obtained when using the Jensen-Shannon divergence between two distributions P and Q as the regularization penalty (Equation 5).

$JSD(P \| Q) = \frac{1}{2} D(P \| M) + \frac{1}{2} D(Q \| M), \quad \text{where } M = \frac{1}{2}(P + Q)$
$\text{and } D(P \| Q) = -\sum_{x} P(x) \log\left(\frac{Q(x)}{P(x)}\right)$  (5)

In their research, it was shown that this layer was able to produce better results for human pose estimation than a fully connected layer and most heatmap matching methods, especially at low heatmap resolutions.
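Putting Equations 3 to 5 together, a hedged sketch of the regularized DSNT loss (the tensor shapes follow the DSNT sketch above; the clamping constant is an implementation detail, not from the paper):

```python
import torch

def js_divergence(p, q, eps=1e-24):
    """Jensen-Shannon divergence between two heatmap distributions (Equation 5)."""
    m = 0.5 * (p + q)
    kl = lambda a, b: (a * (a.clamp(min=eps) / b.clamp(min=eps)).log()).sum(dim=(-2, -1))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def dsnt_loss(z_hat, pred_coords, true_coords, target_blob, lam=1.0):
    """Euclidean loss (Equation 3) plus the distribution penalty (Equation 4).
    z_hat: normalized heatmaps; target_blob: Gaussians centred on the true joints."""
    euc = (pred_coords - true_coords).norm(dim=-1)  # per-joint Euclidean distance
    reg = js_divergence(z_hat, target_blob)         # pushes each heatmap towards a Gaussian
    return (euc + lam * reg).mean()
```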


4 Method

4.1 Software

Once the theory to be used was identified, the remaining work was to create networks and train them from scratch for this specific challenge. Initially, a lot of time and resources were devoted to the Dlib framework, a popular C++ toolkit for machine learning algorithms.

The motivation behind choosing this library was that it is optimised for embedded applications, an important factor when real-time performance is a major property being sought. Numerous network architectures were tried, but a recurrent issue appeared: while most joint locations could be accurately predicted after training, the keypoint supposed to represent the wrist was always predicted in the middle of the image, regardless of where the hand actually was. One possible explanation is that, because the DSNT layer is recent and therefore not standard, it had to be implemented manually in the Dlib framework. Creating a new custom layer in a specific library requires defining its forward pass (how to compute its output based on its input) and its backward pass (how to compute the gradients used to update the network’s weights during training). The latter was a potential source of errors, as it meant modifying the source code of CUDA, NVIDIA’s GPU-based technology for speeding up the training of neural networks, in order to properly backpropagate data through this layer. It is believed that this was badly implemented, making it impossible for any network to learn how to predict the wrist joint.

One additional argument supporting this belief was that, to ensure the problem was not simply due to the wrist being a very hard point to find, a different network using fully connected layers as described in Section 3.8 was tested to see if it could overfit the training set and correctly locate the wrist keypoint. Since it turned out it could, the inventors of the DSNT layer were contacted to get more insight into how they tackled this backpropagation problem. They explained that they actually did not have to: they used the Pytorch deep learning library, which is based in Python, together with Autograd. Autograd can automatically compute the gradient of a user-defined function, such as the forward pass of the DSNT layer, so there is no need to define the layer’s backward pass. Nonetheless, because the end goal was to embed the final network into a smart-phone, it was necessary to make sure a model trained in Pytorch could then be easily interpreted in C++.

It was found that the ONNX (Open Neural Network Exchange) ecosystem can be used to convert the trained Pytorch model to the ONNX format, which can be easily read by the OpenCV library in C++ and then be put in a smart-phone application. For these reasons, Pytorch and Python were chosen instead of Dlib and C++ for network training. To drastically speed up training, CUDA was used with an NVIDIA Quadro P5000 GPU.
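A minimal sketch of this export step; the stand-in model and file name are illustrative, and only the fully convolutional part is exported (the parameter-free DSNT layer is re-created on the C++ side, see Section 5):

```python
import torch
import torch.nn as nn

# Illustrative stand-in for the trained fully convolutional network.
model = nn.Sequential(nn.Conv2d(3, 21, kernel_size=3, padding=1)).eval()

dummy = torch.randn(1, 3, 224, 224)  # an example input fixes the graph shapes
torch.onnx.export(model, dummy, "hand_pose_fcn.onnx")
# In C++, the file can then be loaded with OpenCV's cv::dnn::readNetFromONNX.
```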

4.2 Configuration

4.2.1 Dataset

One of the main limitations today when solving hand pose estimation with deep learning is the lack of labelled data. While deep CNNs tend to work extremely well on huge datasets, they can also severely overfit small ones. Labelling hand joints manually is not trivial, so scientists came up with the idea of training a Generative Adversarial Network (GAN) (Mueller et al. 2017), another type of artificial intelligence algorithm, to automatically create synthetic hand images and their corresponding labels (Figure 4.1). The GANerated dataset has the advantage of providing a huge amount of labelled data, at the expense of being made of artificial hand images, which therefore do not truly represent the data that would be used at test time. Only the sub-dataset with no objects in front of the hands was used. Data augmentation was performed by mirroring each image about the y-axis; since the original generated images are all left hands, this ensures the network learns the key points irrespective of whether a left or right hand is used as input (a sketch of this step follows Figure 4.1).


Figure 4.1: Synthetic image. (Mueller et al. 2017)
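A sketch of the mirroring step; flipping the image about the y-axis also requires flipping the x-coordinates of the labels (the array layouts are assumptions):

```python
import numpy as np

def mirror_sample(image, keypoints):
    """Mirror an image about the y-axis and flip the labels accordingly,
    turning a labelled left hand into a labelled right hand.
    image: (H, W, 3) array; keypoints: (k, 2) array of (x, y) pixel coords."""
    h, w = image.shape[:2]
    flipped = image[:, ::-1].copy()  # reverse the column (x) axis
    kps = keypoints.copy()
    kps[:, 0] = w - 1 - kps[:, 0]    # x' = W - 1 - x; y is unchanged
    return flipped, kps
```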

4.2.2 Optimiser and learning rate

Many optimisation algorithms have been invented to speed up gradient descent during training without affecting its convergence properties too much. A famous one is the Adam optimiser, which stands for adaptive moment estimation. The weight update process is detailed in Figure 4.2.

Figure 4.2: The Adam algorithm. (Géron 2017)

Essentially, it works by keeping track of previously computed gradients using a moment vector m to adapt by how much the current weights θ should be updated. During training, the learning rate can also be scheduled: a common strategy is to start with a high learning rate to speed up training, and then to decrease it to help the algorithm converge.

4.2.3 Data preprocessing

The images had to be pre-processed before being fed into the network. They were first resized to 224x224 using bilinear interpolation, to ensure that the output heatmaps all have the same dimensions. Then each colour channel was divided by 255, mapping the original range [0,255] to [0,1]. This helps training, as the input values lie in a smaller range.
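The corresponding preprocessing sketch (OpenCV is assumed for the resize; any bilinear resize would do):

```python
import cv2
import numpy as np

def preprocess(image):
    """Resize to the network's 224x224 input using bilinear interpolation
    and map pixel values from [0, 255] to [0, 1]."""
    resized = cv2.resize(image, (224, 224), interpolation=cv2.INTER_LINEAR)
    return resized.astype(np.float32) / 255.0
```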

4.3 Model selection

As mentioned in Section 3.4, one of the biggest challenges in finding a successful deep learning model is hyperparameter tuning. Indeed, while there exist methods that have been shown to improve the training of very deep architectures or to reduce the vanishing gradient problem, none exist for choosing the right hyperparameters. This includes finding the right number of layers and filters in each layer, as well as choosing the correct learning rate or deciding when to downsample or upsample in the network. In most deep learning applications, finding the right parameters has therefore been based on trial and error. However, some empirical rules often help in choosing where to start. The main one is simply to imitate famous network architectures that have won computer vision competitions, such as ResNet (He et al. 2015a) or AlexNet (Krizhevsky, Sutskever, and Hinton 2017). While this is a good starting point, it offers no guarantee of robustness, so tweaking the various parameters to find the best model remains a trial-and-error problem.

In order to obtain the final solution, the following guidelines were used. The first challenge was to determine whether or not the network could learn the problem at hand, even if it meant overfitting. This is due to the fact that in the corresponding research paper (Nibali et al. 2018), the DSNT layer was tested on body pose estimation only, which, while similar, is a different challenge. It was therefore necessary to check whether this new technique could be used in powerful networks to solve hand pose estimation without additional constraints. Stronger and heavier models without regularization were therefore tried first. Once it was confirmed that they could indeed learn, regularization methods were added: dropout and distribution regularization (Jensen-Shannon divergence). The original dataset was split into 3 subsets: a training set (80% of the original dataset), a validation set (10%) and a test set (10%). The networks were trained on the training set only. As long as the results seemed quantitatively and qualitatively satisfactory on the validation set, additional efforts were made to prune and compress the neural network. In total, 62 models were trained and compared (within the Pytorch framework alone). The network that performed best on the validation set was selected and evaluated on the test set to get a completely unbiased estimate of its performance. The following principles were used to make the network faster and lighter:

1. Use Fire modules instead of normal residual blocks in convolutional layers.

2. Reduce the total number of filters in each layer, especially in layers with high dimensional inputs.

3. Target computationally cheap activation functions: Instead of using ELU, try PReLU or even ReLU.

4. Use an input image of smaller size.

4.4 Final architecture

The model offering the best trade-off between accuracy and processing speed was retained. It was obtained by adapting version 1.1 of the pretrained SqueezeNet to coordinate regression. This version has 2.4x fewer computations and slightly fewer parameters than version 1.0, for the same accuracy.

The Fire modules were implemented using He initialization for a uniform distribution, a compression ratio of 1/8 and an expand ratio of 1/2. This means that if the number of output channels of a Fire block is x, then the number of filters in the squeeze layer is x/8, and the number of 1x1 and 3x3 filters in the expand layer is x/2 each. Additionally, in layers where the number of input channels equals the number of output channels, a bypass connection is implemented as described in the ResNet architecture to improve the learning procedure. The original SqueezeNet 1.1 architecture, which contains 1,235,496 total parameters, is presented in Figure 4.3.

Figure 4.3: SqueezeNet v1.1.

While it was originally trained for image classification, as explained in Section 3.6.2, its earlier layers capture features which are still relevant for other image processing tasks such as hand pose estimation. To create the new network, the last layers of the pretrained SqueezeNet 1.1 were removed and new ones were added on top of them.


5 Results

The final model architecture is shown in Figure 5.1. The pretrained SqueezeNet layers were frozen during training so as to speed up training and reduce overfitting. Distribution regularization was not used for this model; two factors might explain why this helped to obtain better results. The first is that by combining dropout and regularization on a very small model, the network has too many additional restrictions to learn well enough. The second is that for low heatmap resolutions, forcing the network to approximate a 2D blob becomes tricky, as the blob becomes too big with respect to the heatmap and can therefore spread across the whole heatmap, which is undesirable. This solution was trained with the Adam optimizer and an initial learning rate of 1e-3 for 15,000 iterations, 1 iteration corresponding to the processing of 1 batch of 64 samples taken randomly from the training set (the randomness ensures diversity in the samples chosen). The learning rate was decreased by a factor of 10 at iterations 12,000 and 13,500.
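A skeleton of this training schedule in PyTorch; the model, data and loss below are stand-ins, and only the optimizer and learning-rate milestones follow the description above:

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 21, 3, padding=1)   # stand-in for the real network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[12000, 13500], gamma=0.1)  # divide lr by 10

for iteration in range(15000):
    images = torch.randn(64, 3, 32, 32)  # stand-in for a random batch of 64 samples
    loss = model(images).pow(2).mean()   # stand-in for the DSNT loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()                     # stepped once per iteration here, not per epoch
```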

Manomotion provided the necessary code to embed the network in an Android smart-phone. As described in Section 4.1, the network was first converted to the ONNX format in order to be easily used in OpenCV.

It is important to note that because the DSNT layer is not standard, the ONNX framework could not convert it from Pytorch to ONNX. Therefore, ONNX was only used to convert the fully convolutional network that outputs unnormalised heatmaps, and the DSNT layer was re-created and stacked on top of it in OpenCV. This is relatively simple, as this layer does not contain any trainable parameters and is therefore independent of the network used.


Figure 5.1: Architecture of the final solution.

5.1 Quantitative results

The network contains a total of 1,062,204 parameters and weighs 4.05 MB in memory. Compared to the solution presented by Gouidis et al., which contained 7.98 million parameters (Gouidis et al. 2018), this is a huge decrease in size and computational complexity, with a compression ratio bigger than 7.5. In their research, Gouidis et al. had managed to get 10 fps on a Google Pixel 2 with an input image size of 112x112. Using Manomotion’s measurement tools, the proposed solution was shown to run at 16 frames per second on a Google Pixel 2 with the input image resized to 224x224 (meaning one frame is processed in 62.5 ms), and at 38 fps on a Huawei P30. This is 1.6 times faster than Gouidis et al.’s model on the same smart-phone. While a huge network compression was shown to be possible, the processing speed did not improve as much, for two major reasons. The first is that the input size is much larger, so more computations have to be done for the same convolution layer compared to a smaller input image. The second is that in their work, Gouidis et al. used the TensorFlow Lite framework to put their model into a smart-phone. TensorFlow Lite was specifically designed to optimise neural networks created in TensorFlow for embedding in mobile devices. The network presented in this thesis, on the other hand, did not go through the same procedure, and it is probable that TensorFlow Lite is better suited for optimisation than ONNX.

The average Euclidean distance as well as the percentage of correct keypoints (PCK) on the test set are reported to compare the results with other existing solutions. The PCK is a common error metric in pose estimation: it is an accuracy measure that counts a predicted key point as correct if its distance to the true joint is below a certain threshold. The average Euclidean distance, measured on the test set of 36,000 samples (roughly 10% of the original dataset), was evaluated to be 0.0943. While this does not provide much information by itself, it is an important point of comparison should similar work be done on this topic. Figure 5.2 shows the percentage of correct key points as a function of the error threshold, expressed in pixels. The results are compared with those from Gouidis et al., who had in turn compared themselves to an older computer-based solution (Simon et al. 2017).
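
For reference, the PCK computation amounts to the following small function (a sketch with assumed array shapes, not the exact evaluation code used in this work):

```python
import numpy as np

def pck(pred, gt, threshold):
    """Percentage of correct keypoints: the fraction of predicted joints whose
    Euclidean distance to the ground truth is below `threshold` (in pixels).
    pred, gt: arrays of shape (num_samples, num_joints, 2)."""
    distances = np.linalg.norm(pred - gt, axis=-1)
    return float((distances < threshold).mean())
```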

Figure 5.2: Comparison of PCK between the chosen model (left) and existing solutions (right). (Gouidis et al. 2018)

This PCK curve matches or exceeds the one reported by Gouidis et al. for all threshold values, showing that better accuracy was achieved with a much smaller network. It is important to underline, however, that the accuracy was not measured on the same data, as Gouidis et al. evaluated their network on the Tzionas dataset (Tzionas et al. 2016). It can therefore not be guaranteed that the proposed solution would have outperformed the existing state-of-the-art network if they were tested on the same data.

5.2 Qualitative results

The predictions on test images are compared with the ground-truth joints for a qualitative analysis. Inference is also performed on real-life images to evaluate how much of a handicap the synthetic training images are compared to training on real hand images. The predictions are shown below; Table 5.1 and Figure 5.3 give the colour code used for drawing the different joints.

Figure 5.3: Fingers of the hand. (FreeDownloads n.d.)

Table 5.1: Colour code table.

Joint    Colour
Wrist    Red
Thumb    Blue
Index    Purple
Middle   Green
Ring     Yellow
Pinkie   White
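
As a purely hypothetical illustration of how this colour code could be applied when drawing predictions with OpenCV (the helper below, including its names, is not taken from the actual codebase; note that OpenCV uses BGR channel order):

```python
import cv2

# Colour code of Table 5.1, in OpenCV's BGR channel order.
COLOURS = {
    "wrist": (0, 0, 255),       # red
    "thumb": (255, 0, 0),       # blue
    "index": (255, 0, 255),     # purple
    "middle": (0, 255, 0),      # green
    "ring": (0, 255, 255),      # yellow
    "pinkie": (255, 255, 255),  # white
}

def draw_joints(image, joints):
    """joints: mapping from joint name (e.g. "index_tip") to (x, y) pixels."""
    for name, (x, y) in joints.items():
        colour = COLOURS.get(name.split("_")[0], (0, 0, 255))
        cv2.circle(image, (int(x), int(y)), 3, colour, -1)
    return image
```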


5.2.1 Test images

It was observed that the predictions worsen when self-occlusion between the fingers occurs (Figure 5.4). This is tolerable, as even for humans it is challenging to label the key points of a hand when not all the joints are visible. For pictures where all the joints are visible, the network was able to generalise well.

Figure 5.4: Network’s predictions (left) compared to ground truth (right).


5.2.2 Real hand images

The results on real hand images, though very satisfactory, are not always as good as on the synthetic images. Figure 5.5 shows that the results are close to perfect when the entire hand and all the joints are clearly visible; once occlusions occur, however, the predictions degrade.

Figure 5.5: Inference on real hand images.

One obvious factor behind this degradation is the nature of the data the algorithm was trained on: the synthetic images clearly offer less texture and detail than real images. To work around this problem, the same experiment was conducted with the images blurred by a Gaussian filter of various sizes before being fed into the network. The idea was to reduce the amount of detail in the image so as to mimic the image distribution of the training dataset.

This drastically improved the results, especially with a 3x3 filter, which was therefore chosen (see Figure 5.6). Another probable factor explaining the good generalisation to real images is that the frozen SqueezeNet layers were trained on real images and can therefore still detect relevant low-level features in real hand images. The network is then able to fit open left and right hands perfectly, either from the front or the back side. It also exhibits translational invariance and tolerates rotation of the hand to some extent. Nonetheless, the network does not generalise as well to real hand images when occlusion occurs.
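
The preprocessing step amounts to a single OpenCV call. Below is a minimal sketch, assuming a hypothetical input file and the 224x224 network input resolution used here:

```python
import cv2

# Blur the real input image with a small Gaussian kernel before inference,
# to mimic the smoother appearance of the synthetic training images.
image = cv2.imread("hand.jpg")                  # hypothetical input path
image = cv2.resize(image, (224, 224))           # network input resolution
blurred = cv2.GaussianBlur(image, (3, 3), 0)    # the 3x3 kernel worked best
```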


Figure 5.6: Predictions on original image (left) and on blurred image (right).

Finally, it is important to note that with the blurring, the network performs extremely well in environments with good lighting (bright illumination), independently of the background, as illustrated in Figure 5.7. The network is, however, not able to generalise its predictions independently of the lighting, which is a common issue for this type of problem.

Figure 5.7: Results improve for specific backgrounds.

Predictions on a smart-phone are presented in Figure 5.8 to demonstrate that the network was successfully embedded in a mobile device and is still able to localise hand joints effectively. Even in the case of self-occlusions (closed hand), the algorithm predicts the keypoints reasonably well.


6 Conclusion

This thesis aimed to investigate whether the current state-of-the-art hand joint detectors for smart-phones could be improved on three different metrics: accuracy, speed and memory size. First, a literature review was conducted to gather information about the hand pose estimation challenge, including the various methods that had been used to address it and what made the problem so hard to solve. The state-of-the-art solutions were all found to be deep learning-based, with most of them relying on heatmap matching. The main difficulties identified in implementing a smart-phone based solution were the need for a large amount of labelled data, a heatmap-based method that could still work well at low heatmap resolutions, and a network small and fast enough to run in real time on a smart-phone.

The first was tackled by using the GANerated hand dataset (Mueller et al. 2017) for training, together with the mirroring data augmentation technique, which provided a huge labelled dataset at the expense of containing only synthetic images. Regarding the type of network used for predicting the joints' locations, it was decided that a fully convolutional network combined with the Differentiable Spatial to Numerical Transform layer offered the best trade-off for a solution that is both fast and reliable (Nibali et al. 2018). To implement an even faster yet effective deep convolutional neural network, the Fire module presented in the SqueezeNet architecture (Iandola et al. 2016) was adopted to replace standard convolutional layers.

Once these decisions were made, the first task was to determine whether an architecture using the DSNT layer, which had only been shown to work for body pose estimation, could be adapted to hand pose estimation. When this was attempted using the Dlib framework, the wrist keypoint proved impossible to predict correctly; this was attributed to a faulty implementation of the DSNT layer in that library. Switching to PyTorch resolved the issue. After this, the main goal was to reduce the processing time and memory size of the tested model using the techniques described in Section 4.3. Then, because the end goal was to predict hand joints on real hand images, additional attempts were made to improve the qualitative results on such images. It was found that blurring the input image before feeding it into the network could significantly enhance the model's predictions, as it made the image more similar to the data used for training. Using transfer learning with the SqueezeNet 1.1 model was also an important factor in helping the network generalise well to real hand images. Using the ONNX framework together with the OpenCV library made embedding the solution in any smart-phone fast and easy without affecting the results, as shown in Section 5.2.2. Finally, quantitative and qualitative results were presented and showed that the proposed architecture was able to run in real time on a smart-phone, exceeding the state-of-the-art (Gouidis et al. 2018) in all the metrics considered. This research has therefore demonstrated that it is possible to improve on the state-of-the-art in real-time hand pose estimation on a smart-phone using deep learning. Some limitations have, however, been identified and are discussed in the next section.

6.1 Future work

Throughout this work, several factors were identified that could help improve future solutions. The first and most important one is the data used for training. The GANerated dataset has several flaws, such as containing noisy data (hand images that could not correspond to real hand poses, as in Figure 6.1) and, more generally, not reflecting the true distribution of the data that would be used during inference (real hand images). For this reason, a new large annotated dataset with real hand images could prove crucial in improving the final results.

Figure 6.1: Noisy data example from the dataset used. (Mueller et al. 2017)


In addition to improving predictions, another avenue to explore would be reducing the prediction time even further using different optimisation techniques, such as the TensorFlow Lite framework, or other compact CNN architectures, such as SqueezeNext (Gholami et al. 2018) and EffNet (Freeman, Roese-Koerner, and Kummert 2018). Finally, the current model could potentially be improved by further hyperparameter tuning. This is, however, extremely time consuming, as it requires changing one or a few parameters at a time, retraining the entire model, and comparing the results with the original model. Overall, this research has shown very promising results for real-time hand pose estimation, and there remains considerable potential for further improvement.


7 Bibliography

Books

Géron, Aurélien (2017). Hands-On Machine Learning with Scikit-Learn and TensorFlow. O'Reilly.

Journal Articles

Cheng, Jian et al. (2018). “Recent Advances in Efficient Computation of Deep Convolutional Neural Networks”. In: CoRR abs/1802.00939. arXiv: 1802.00939. URL: http://arxiv.org/abs/1802.00939.

Clevert, Djork-Arné, Unterthiner, Thomas, and Hochreiter, Sepp (2016). “Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)”. In: CoRR abs/1511.07289.

Freeman, Ido, Roese-Koerner, Lutz, and Kummert, Anton (2018). “EffNet: An Efficient Structure for Convolutional Neural Networks”. In: CoRR abs/1801.06434. arXiv: 1801.06434. URL: http://arxiv.org/abs/1801.06434.

Gholami, Amir et al. (2018). “SqueezeNext: Hardware-Aware Neural Network Design”. In: CoRR abs/1803.10615. arXiv: 1803.10615. URL: http://arxiv.org/abs/1803.10615.

Gouidis, Filippos et al. (2018). “Accurate Hand Keypoint Localization on Mobile Devices”. In: CoRR abs/1812.08028. arXiv: 1812.08028. URL: http://arxiv.org/abs/1812.08028.

He, Kaiming et al. (2015a). “Deep Residual Learning for Image Recognition”. In: CoRR abs/1512.03385. arXiv: 1512.03385. URL: http://arxiv.org/abs/1512.03385.
