
Linköpings universitet SE–581 83 Linköping

Master’s thesis, 30 ECTS | Statistics and Machine Learning

2021 | LIU-IDA/STAT-A--21/025--SE

Deep Learning for estimation of fingertip location in 3-dimensional point clouds

An investigation of deep learning models for estimating fingertips in a 3D point cloud and its predictive uncertainty

Phillip Hölscher

Supervisor: Josef Wilzén
Examiner: Jose M Pena



Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for a period of 25 years starting from the date of publication barring exceptional circumstances.

The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purposes. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.

Abstract

Sensor technology is rapidly developing and, consequently, the generation of point cloud data is constantly increasing. Since the recent release of PointNet, it is possible to process this unordered 3-dimensional data directly in a neural network. The company TLT Screen AB, which develops cutting-edge tracking technology, seeks to optimize the localization of the fingertips of a hand in a point cloud. To do so, the identification of relevant 3D neural network models for modeling hands and detecting fingertips in various hand orientations is essential. The Hand PointNet processes point clouds of hands directly and generates estimates of fixed points (joints) of the hands, including the fingertips. Therefore, this model was selected to optimize the localization of fingertips for TLT Screen AB and forms the subject of this research. The model has advantages over conventional convolutional neural networks (CNN). First of all, in contrast to the 2D CNN, the Hand PointNet can use the full 3-dimensional spatial information. Compared to the 3D CNN, moreover, it avoids unnecessarily voluminous data and enables more efficient learning. The model was trained and evaluated on the public data set MSRA Hand. In contrast to previously published work, the main object of this investigation is the estimation of only 5 joints, the fingertips. The behavior of the model with a reduction from the usual 21 joints to 11 and to only 5 joints is examined. It is found that the reduction of joints contributes to an increase in the mean error of the estimated joints. Furthermore, the distribution of the residuals of the fingertip estimates is found to be less dense. Using MC dropout to study the predictive uncertainty for the fingertips has shown that the uncertainty increases when the number of joints is decreased. Finally, the results show that the uncertainty is greatest for the prediction of the thumb tip. Starting from the tip of the thumb, it is observed that the uncertainty of the estimates decreases with each additional fingertip.

Keywords: Deep Learning, Point Cloud, Hand Pose Estimation, Fingertip Localization, Model Uncertainty


Acknowledgments

First of all, I would like to thank Josef Wilzén, my supervisor at Linköping University. With a lot of input, you always enriched our meetings and of course my work. Special thanks also go to my external supervisor Tohid Ardeshiri, who made this project happen in the first place. I sincerely thank you for the trust you have placed in me with this demanding work. And for the openness in the meetings and the development of the project. I would also like to thank my examiner Jose M Pena and my opponent Vasileia Kampouraki for their constructive contributions throughout this thesis. A further thank you to all the other teachers involved at Linköping University for a unique experience during my education. I am especially happy to be able to thank the people I can now call friends, thanks to Andreas Christopoulos Charitos, Zijie Feng (Bill), Erik Anders, Lennart Schilling (Big L). You have made my time in Linköping easier and more enjoyable. Very big thanks go to Rosalie Waelen. Thank you for always being there for me. Finally, I would like to thank my entire family, you have always supported me in all my decisions, for that, I am infinitely grateful.

Contents

Abstract
Acknowledgments
Contents
List of Figures
List of Tables

1 Introduction
  1.1 Motivation
  1.2 Aim
  1.3 Outline

2 Theory
  2.1 Related work
  2.2 Introduction to relevant data type
    2.2.1 Point cloud data
  2.3 Deep Learning
    2.3.1 Loss function
    2.3.2 Training the Model
    2.3.3 Optimizer
    2.3.4 Dropout
      2.3.4.1 Model uncertainty by Monte Carlo Dropout
  2.4 PointNet
    2.4.1 Architecture
    2.4.2 Permutation Invariance
    2.4.3 Transformation Invariance
  2.5 PointNet++
    2.5.1 Architecture

3 Data
  3.1 Hand pose data
  3.2 Utilised data
    3.2.1 Data preprocessing
    3.2.2 Data transformation

4 Method
  4.1 Model
    4.1.1 Hand PointNet
    4.1.2 OBB-based Point Cloud Normalization
    4.1.3 Hand Pose Regression Network
    4.1.4 Fingertip Refinement Network
  4.2 Training
    4.2.1 Hyperparameter Tuning
  4.3 Evaluation
    4.3.1 Standard metrics
    4.3.2 Predictive variance
  4.4 Implementation

5 Results
  5.1 Evaluation experiments
    5.1.1 Model comparison
    5.1.2 Predictive uncertainty

6 Discussion
  6.1 Results
  6.2 Method
  6.3 The work in a wider context
    6.3.1 Future Research
    6.3.2 Ethical considerations

7 Conclusion

Bibliography

List of Figures

2.1 Point cloud represented in 3 dimensional coordinate system
2.2 Neuron
2.3 Activation functions
2.4 Artificial neural network
2.5 Neural network applying dropout
2.6 Comparison of standard and dropout network
2.7 PointNet architecture [33]
2.8 Example of permutations represented by a D dimensional vector and N sample of points
2.9 Spatial transformer [18]
2.10 Input transformer of PointNet
2.11 PointNet++ architecture [34]
3.1 Hand poses
3.2 Hand pose rotation
3.3 Hand pose 4 - Point cloud
3.4 Point cloud with varying number of joints
4.1 Hand PointNet architecture [8]
4.2 Hand point cloud with ground truth and estimation
5.1 Model comparison 1
5.2 Model comparison 2
5.3 Residuals
5.4 Predictive uncertainty by fingertips
A.1 Overview of representations of 3D data
A.2 Point cloud intaking models for feature learning
A.3 MSRA - Hand pose - Gesture 5 - Examples
A.4 MSRA - All hand pose (1-I)
A.5 MSRA - All hand pose (IP-Y)


List of Tables

3.1 Public data sets - Hand Pose Estimation
4.1 Selected hyperparameters
5.1 Mean error (mm) model comparison over all joints for various hyperparameter settings
5.2 Mean error (mm) model comparison on all joints
5.3 Residuals
5.4 Predictive uncertainty

1 Introduction

1.1 Motivation

The recent growth of Deep Learning (DL) applications in a multitude of academic and industrial domains has, amongst others, had an impact on the domain of hand pose estimation. For example, DL models are increasingly used for computer-based classification and segmentation of images, videos, and comparable data types generated by cameras and sensors. The main category of these methods is supervised learning. In this context, a DL model is expected to perform numerical or categorical predictions on new, previously unseen data. Thus, to make the model generalize and avoid problems such as overfitting, a DL model needs to be trained on a very large amount of data.

The domain of hand pose estimation is a widely researched area. The aim of this domain is to extract the positioning of hand skeleton parameters, also called joints. By doing so, the poses of hands can be localized, identified and classified. To advance hand pose estimation, a wide variety of modeling approaches have been investigated, but only since the release of the PointNet architecture [33] can point clouds, i.e. sets of points in space (e.g. a three-dimensional space), be processed directly. The advantage over a 2D convolutional neural network (CNN) [30, 32] is the full utilization of 3D spatial information. Other approaches such as 3D CNNs [9, 28] convert point clouds to volumetric representations, which creates unnecessary volume in the data and therefore increases the complexity.

The company TLT Screen AB develops cutting-edge tracking systems. Data augmentation of the joints for hand pose data is a complicated matter, therefore the company is interested in investigating deep learning methods for 3D point clouds that localize only the fingertips. The prediction of fingertips, and only fingertips, from a point cloud has not been done before. Generally, when estimating hand positions, models are trained and evaluated on a variety of joints. However, before a state-of-the-art technology can be trained and evaluated on the enterprise's data, it is beneficial to first apply deep learning methods that consume 3D point clouds and study their performance with a reduced number of 5 joints, representing the fingertips of one hand. Furthermore, it is of great relevance to investigate how the quality of the predictions is affected by the information reduction that results from the decrease in the number of joints.

1.2 Aim

Motivated by the challenge of achieving high performance in fingertip localization in 3D point cloud data, the aim of this work is to use a state-of-the-art method to run a novel investigation of the prediction quality when using only the 5 fingertip joints. In recent years, a variety of methods for hand pose estimation have been presented, such as PointNets, Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), and Autoencoders. This scientific development in the field of 3D point cloud data processing allows a choice of methods.

The two main research questions of this study are:

Can the performance of the selected model be maintained by the use and reduction to just 5 fingertip joints?

Does the reduction of joints cause a significant increase in the uncertainty of predictions of fingertips?

To answer these research questions, the following steps are conducted:

• Identify a relevant 3D neural network model for modeling hands and detecting fingertips

• Evaluate the selected model's performance for a varying number of trained joints

• Apply and implement a well-chosen method to estimate the predictive uncertainty of the models

1.3 Outline

The structure of this thesis is as follows:

• Chapter 2 discusses the theoretical background of the underlying methods used in the next chapter.

• Chapter 3 presents the data sets used to train and evaluate the models.

• Chapter 4 explains the applied method in detail and provides information on the implementation.

• Chapter 5 presents the results of the conducted experiments.

• Chapter 6 discusses the applied and implemented methods in a wider context, as well as the results obtained.

2 Theory

This chapter elaborates on the relevant theory underlying the topics of this thesis. First the data types will be explained, with a focus on point clouds; then the theoretically relevant aspects of Deep Learning are outlined. Finally, the prior models PointNet and PointNet++ are discussed briefly, as they form the basis of the Hand PointNet, the method used in this thesis, which is explained in chapter 4.

2.1 Related work

This section presents existing literature that is of relevance to the topic of this thesis. First, different 3D data representations are presented. Second, relevant surveys on the topic of 3D point cloud processing with deep learning are introduced, as well as the PointNet model, the key publication in this field. Next, it will be determined which survey on the topic of hand pose estimation is of greatest use and a few examples of the progress in methods concerning the problem of hand pose estimation are provided. Finally, cutting-edge methods in 3D hand pose estimation are pointed out.

There are different types of representations for 3D data. The data can be broadly divided into the categories Euclidean and non-Euclidean. Point clouds can belong to both categories. Decisive for the classification are the properties of the data. For example, if the point cloud lies in a common coordinate system, it follows a Euclidean structure [1]. Other types of Euclidean data can be RGB-D, depth images, and volumetric representations (voxels, octrees). The main data used in this thesis are point clouds in a 3-dimensional coordinate system. Thus, the Euclidean distance between the points can be determined. The point cloud data type will be discussed in greater detail in section 2.2.1 and chapter 3.

Due to the large number of publications, several surveys providing an overview of deep learning and 3D point cloud data have been created. The first survey discussed here, paper [12], extensively reviews the progress of deep learning methods for solving problems related to point cloud data. It presents a detailed overview of methods for solving tasks such as 3D shape classification, 3D object detection, and tracking on the 3D point cloud, as well as a list of publicly accessible data sets and comparisons of existing methods.


Article [27] also looks at the development of deep learning methods for 3D point cloud data. This research differs from paper [12] in that task fields such as registration, augmentation and completion have been included in the overview.

Point cloud feature learning methods are classified into two sub-categories by [26]:

• Point-based methods: These are deep learning models which process unstructured and unordered point cloud data directly.

• Tree-based methods: These use transformations for a regular representation of the point clouds, which is then used by the deep learning model.

A figure to illustrate the following allocation can be found in the appendix A.2.

Point-based deep learning: PointNets: PointNet [33] is the first published paper with a neural network architecture that directly uses point clouds as input. The further development of the architecture, PointNet++ [34], extracts features in hierarchical iterations and can capture not only global but also local features. These methods were used for 3D object detection and semantic segmentation.

ConvNets: Convolutional neural networks provide not only excellent performance in image analysis, but also in several tasks in the domain of point clouds. In recent years, various types of architectures have been proposed, see for example: [25, 44, 14, 3, 47, 45, 22, 21].

RNN based deep learning: The recurrent neural network (RNN) is a special class of artificial neural networks. The key is that they are networks with loops in them. The loop allows the network to transfer information from one step to the next. The connection process of a long short-term memory (LSTM, a type of RNN) cell is referred to as a sequential forward connection, and that of a bidirectional LSTM as a sequential return (backward) connection. This connection allows contextual processing of information and is applied in the field of point cloud data for segmentation tasks to identify local and global features [49, 15].

Autoencoder based deep learning: Autoencoders use unsupervised learning to learn representations of the data provided. Currently, they are used in particular for generative models. They can encode the irregularity of point clouds and account for sparseness in the upsampling phase. A variety of models using this approach have been introduced in the past few years [48, 5, 4, 35, 38, 13, 53].

Tree-based deep learning: Since only deep learning methods that process point clouds directly are of interest for this work, this category will not be explained further.

Due to their high numbers of citations, some works stand out in this area. The article PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation [33] deserves to be mentioned in particular. Until the publication of this work, it was common to perform data transformations to 3D voxel grids or images. However, these transformations led to a loss of information. Since the release of the PointNet network, it is for the first time possible to process point clouds directly as input. Although it is a simple architecture, the network is highly efficient and effective. Moreover, it can handle applications such as object classification and part segmentation. The disadvantage of the PointNet architecture is that it does not cover local structure understanding. Therefore, a hierarchical neural network has been proposed that recursively applies PointNet to a nested partitioning of the data. This method is called PointNet++ [34] and makes contextual learning of local features possible.

SO-Net [23] is a deep learning architecture that models the spatial distribution of an unordered point cloud by building a self-organizing map (SOM). The model can be tuned using systematic point-to-node k-nearest-neighbor selection. Due to its parallelizability, it is possible to train the model faster and more efficiently than regular models. Tasks such as point cloud reconstruction, classification, object part segmentation and shape retrieval can be undertaken with the model.

In the paper Dynamic Graph CNN for Learning on Point Clouds [44], a convolutional neural network method called EdgeConv is introduced. This model can be used to perform


tasks such as classification and segmentation for 3D point clouds. The unique feature of this model is that each layer of the network is determined dynamically via a graph. A special aspect of this method is taking into account local neighborhood information in order to learn global shape properties.

Next, a suitable DL method for hand pose estimation will be presented, evaluated and chosen. The evolution of sensors and computing power allows for rapid development in the field of hand pose estimation. Surveys give a good overview of the subject. In the paper [39], state-of-the-art methods for hand position estimation in a single depth frame are investigated. The survey [50] mainly investigates which methods are best suited for hand pose estimation on depth images in 3-dimensional (3D) space. However, in this work, DL methods which process point clouds can be found. Given that the development in the field of DL methods for processing point clouds directly is very recent, the number of available methods is limited.

The work [8] proposes a method called Hand PointNet, which is a further development of the above presented PointNet++. This method can process 3D point clouds directly as network input. Robustness is generated by an oriented bounding box, which normalizes the global hand orientation. A hand pose regression network is used to understand the structures of the hand. Furthermore, a simple PointNet was used to refine the fingertip position prediction. This method was trained, evaluated and tested on three publicly available data sets. The result was evaluated using the mean error distance and the error threshold. Finally, the performance when using the fingertip refinement was compared with 16 state-of-the-art methods and showed outstanding performance for all data sets in the generation of the 3D hand pose estimation.

A method called Point-to-Point Regression PointNet is presented in the paper [10]. Here the 3D point cloud of a hand is taken directly as input to produce heat maps and unit vector fields on the point cloud. To better handle 3D spatial information, a stacked network architecture is applied to PointNet. The rest of the paper (including the experiment, the presentation of the results and the comparison with competitive methods) is similar to the previous paper on Hand PointNet. In this paper, too, the results delivered better performance compared to the comparative methods.

In paper [24], a new deep learning method, called Point-to-Point voting, is proposed, which can also process unordered point clouds for 3D hand pose estimation. In other words, no pre-processing methods such as the nearest neighbor estimation of other approaches are required with this method. The base element is a permutation equivariant layer (PEL) and a residual network version to perform the hand pose estimation. Experiments have been done on the NYU and Hand Challenge 2015 data sets. Furthermore, a point-to-pose scheme is used to combine information from point-wise local features. The method showed superior performance on two data sets and outperforms state-of-the-art methods on both.

In the paper [2], the So-HandNet (Self-Organised Network) method is introduced, which is a further development of the above presented SO-Net. Here, too, a 3D point cloud is used as input to quantify the 3D hand pose. However, this method works with semi-supervised learning, meaning only a few annotated data are needed to train the model. A new type of encoder-decoder enables features to be extracted and evaluated from the point cloud. The methods were then tested on four different data sets. The performance is assessed by the mean error and the error threshold. Thereby, the results have been measured with fully stacked and differently stacked annotated frames. In both cases, excellent performances were achieved on the ICVL and NYU data sets.

Having presented different methods for processing point cloud data, a model to be used in the remainder of this thesis will now be selected. Several methods have successfully been applied to perform 3D hand pose estimation, however, only a few have been successful for point clouds. Therefore, promising methods such as AWR [16], V2V-PoseNet [28], A2J [46], Pixel-wise regression [52], Dense Pixelwise Estimation [43], and DeepPrior++ [31] cannot be


considered in the present context. Due to their inaccessible implementations, the methods Point-to-Point Regression PointNet and Point-to-Point voting presented above cannot be taken into consideration either. Finally, the methods Hand PointNet and So-HandNet focus on different problems. Due to the interest in the estimation of fingertips in this thesis, Hand PointNet is a particularly relevant model. Thus, Hand PointNet is chosen as the method and object of inquiry in this research. The method is studied in greater depth and elaborated in the remaining part of this chapter and in chapter 4.

2.2 Introduction to relevant data type

In the field of machine learning, 3D data belongs to the subfield of computer vision. This data type allows a complete geometric sense of objects. Due to rapid developments in 3D sensing technologies, the amount of 3D data has increased enormously. The representations of 3-dimensional data can be classified into two categories, Euclidean and non-Euclidean. Euclidean data are subject to the properties of grid-structured data in a coordinate system and have a global parametrization. In contrast, 3D non-Euclidean data ordinarily has neither a global parametrization nor a common coordinate system [1]. Examples are social networks, networks of sensors, computer graphics, and others. This kind of 3-dimensional data representation is also called geometric data. Since 2017, DL methods applied in this field have been called geometric deep learning. A particularly significant data type for this work is RGB-D data, categorized as Euclidean-structured. The development of RGB-D sensors, such as Microsoft's Kinect, has led to this data type becoming increasingly popular. These sensors are capable of generating information about a 3D object in terms of 2D color information RGB (red, green, blue) as well as D (depth map). This data is of great importance in this work, as all public hand pose estimation data sets, presented in table 3.1, are provided as depth images.

3D point clouds can be understood as unordered sets of points in 3-dimensional space in which a 3D object is rendered as a geometric approximation. The data types point cloud, mesh, and graph are usually allocated to the non-Euclidean group [1]. It is worth mentioning that point clouds have a dual nature and can be categorized as Euclidean or non-Euclidean data, depending on the scope of the preprocessing (e.g. global or local). Nevertheless, in this thesis, the point clouds are processed in Euclidean space.

The work 3D data representation [1] presents different data types for 3D data. It also reports on various DL models and their various processing of 3D data types. Applications for each data type are described as well as an analysis of challenges in this scientific field. An overview of a range of different representations of 3D data can be found in the appendix A.1. In the further discussion, however, only the distinction between Euclidean and non-Euclidean data is dealt with, as well as the brief description of RGB-D data and point clouds.

2.2.1 Point cloud data

Point clouds are not a new but an increasingly common data format. A point cloud is a collection of N unordered points in a three-dimensional space, {x, y, z} ∈ R^3. As an example, let P and Q represent two point clouds, with points p_i = (p_x, p_y, p_z) ∈ P for i = 1, ..., n and q_i = (q_x, q_y, q_z) ∈ Q for i = 1, ..., m.

Figure 2.1 serves as an example of a point cloud (point cloud from Pytorch3d¹). On the left is a person in a position called "bridge". On the right is the representation of a 3-dimensional coordinate system and two points p (blue) and q (orange). These two points in the coordinate system represent two points from the point cloud on the left. In the context of this thesis, only point clouds located in Euclidean space are used. Therefore, the distances between the respective points can be calculated with the Euclidean distance. Thus the distance between the points p and q can be calculated by

d(p, q) = \sqrt{(p_x - q_x)^2 + (p_y - q_y)^2 + (p_z - q_z)^2}    (2.1)

¹ Pytorch3d: https://colab.research.google.com/github/facebookresearch/pytorch3d/

Figure 2.1: Point cloud represented in 3 dimensional coordinate system.

Different types of sensors allow the generation of point cloud data; other data forms like depth images, meshes, etc. can also be converted into point clouds. Common uses of this data stream are in the fields of computer vision, robotics, virtual reality and augmented reality.
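To make the distance computation of equation 2.1 concrete, the following minimal sketch (Python with NumPy, using randomly generated toy clouds; the variable names are illustrative and not taken from the thesis code) computes the distance between two points and all pairwise distances between two small clouds.

import numpy as np

# Two toy point clouds P (n x 3) and Q (m x 3); the values are arbitrary examples.
rng = np.random.default_rng(0)
P = rng.random((1024, 3))   # e.g. a hand point cloud with 1024 points
Q = rng.random((512, 3))

def euclidean_distance(p, q):
    """Distance between two 3D points, equation (2.1)."""
    return np.sqrt(np.sum((p - q) ** 2))

# Distance between the first point of each cloud.
d = euclidean_distance(P[0], Q[0])

# Pairwise distances between all points of P and Q (n x m matrix), via broadcasting.
pairwise = np.sqrt(((P[:, None, :] - Q[None, :, :]) ** 2).sum(axis=-1))
print(d, pairwise.shape)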

2.3 Deep Learning

Neurons represent the simplest structure and computational unit of the human brain. The human brain has about 10^11 neurons [20]. Not every neuron is connected to every other neuron, and certain neurons have stronger connections to some than to others. Interconnections (or synapses) among neurons are used to transmit information from the outputs of certain neurons to the inputs of other neurons. A single neuron has around 5000 synapses. This comparably high number of neurons and connections enables extensive cognitive abilities such as speaking, abstracting, generating, and transporting knowledge, as well as the development of social systems and technologies [20]. In this process, each neuron computes a basic function. In simplified terms, this consists of integrating and firing. A neuron executes a computation with its inputs and then fires when a certain threshold value is exceeded. An illustration of a neuron in the human brain can be seen in figure 2.2a. The dendrite is the connection (synapse) to other, preceding neurons and thus transmits incoming stimuli. These stimuli are processed (computed) in the nucleus. In case sufficient stimuli are received, the neuron fires and the stimulus is transmitted along the myelin-sheathed axon to the dendrites of the neurons it connects to.

The artificial neural network (ANN) has been computationally modeled after the human brain. The computational neuron is illustrated in figure 2.2b. The process it models can be expressed by the simple function y = f(x; θ). The input of the computational neuron is defined as a vector x = (x_1, x_2, ..., x_n).


(a) Biological neuron (b) Computational neuron

Figure 2.2: Neuron

w = 1, 2, ..., n. The parameter θ = (b, 1, 2, ..., n) vector contains the weight

vector and the associated bias b. The output y is computed through a linear combination (or weighted sum) given by

ƒ(x; θ) =ϕ ‚ n ÿ =1 () +b Œ =ϕ(wTx+b), (2.2)

where b is the bias parameter, w the vector of wights and ϕ(¨)the activation function. An activation function is used to transform the integration (weighted sum) into a single output which determines whether or not the neuron would fire. To represent non-linear relation-ships of the input vector, the activation function is needed to be non-linear. Frequently used activation functions [29] can be viewed in figure 2.3.

Figure 2.3: Activation functions
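As a small illustration of equation 2.2, the sketch below (NumPy, with made-up weights, bias and input) computes the output of a single artificial neuron using a ReLU activation; it is only meant to mirror the formula, not any network used later in the thesis.

import numpy as np

def relu(z):
    """ReLU activation function: phi(z) = max(0, z)."""
    return np.maximum(0.0, z)

def neuron(x, w, b, phi=relu):
    """Single neuron output phi(w^T x + b), equation (2.2)."""
    return phi(np.dot(w, x) + b)

x = np.array([0.5, -1.2, 3.0])   # input vector x = (x_1, ..., x_n)
w = np.array([0.1, 0.4, -0.3])   # weight vector w
b = 0.2                          # bias b
print(neuron(x, w, b))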

A feedforward neural network, often referred to as a multi-layer perceptron (MLP), is the central model of DL. A model with N layers can be represented as a chain of functions forming a directed acyclic graph. For example, we consider a chain connection of 5 functions f(x) = f^(5)(f^(4)(...f^(1)(x))). The length of the chain is understood as the depth of the model. The last layer, in this case f^(5), is interpreted as the output layer. The other layers, f^(4) to f^(1), are called hidden layers. The input vector x is known as the input layer. Each layer of the network has a certain number of nodes. The activation or output of node i in layer k of the MLP can be modelled as follows:

a_i^k = \phi\left( b_i^k + \sum_{j=1}^{r_{k-1}} w_{ij}^k o_j^{k-1} \right) = \phi\left( \sum_{j=0}^{r_{k-1}} w_{ij}^k o_j^{k-1} \right),    (2.3)

where b_i^k = w_{i0}^k, a_i^k indicates the activation of node i in layer k, φ represents the activation function, b_i^k the bias of node i in layer k, w_{ij}^k the incoming weight from node j, and o_j^{k-1} the output of node j in layer k-1.


(a) Artificial neural network architecture (b) Single layer of an artificial neural network

Figure 2.4: Artificial neural network

2.3.1 Loss function

Objective functions are an essential component when it comes to minimizing or maximizing algorithms in machine learning. The group of functions to be minimized is also called loss functions. A loss function evaluates the goodness of the prediction. The gradient descent method is widely applied to minimize the function. There exists a multitude of loss functions whose behavior differs and is thus adapted according to the problem. The main criteria for determining the loss function include the number of outliers, the type of machine learning algorithm, and the efficiency of gradient descent, as well as whether the prediction is of numerical (regression loss) or categorical (classification loss) nature. Conventional loss functions for regression problems are introduced next.

The Mean Square Error (MSE), equation 2.4, also called quadratic loss or L2 loss, is a commonly applied loss function in regression models. The distances between the target values and the predicted values are squared, summed, and averaged. The Mean Square Error is given by

L(y, \hat{y}) = \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n}    (2.4)

The Mean Absolute Error (MAE), equation 2.5, or L1 loss, is another type of loss function, which sums the absolute differences between the target and predicted values and averages them. The Mean Absolute Error is given by

L(y, \hat{y}) = \frac{\sum_{i=1}^{n} |y_i - \hat{y}_i|}{n}    (2.5)

These two types of loss functions entail different approaches to problem-solving. The MSE is easier to optimize, whereas the MAE is more robust to outliers. Consequently, training with MAE may prove more appropriate for data with outliers. However, since the gradient of the MAE is constant, it can benefit from a dynamic learning rate that decreases as we move closer to the minimum. The gradient of the MSE loss, on the other hand, naturally decreases near the minimum, which acts like a dynamic learning rate and makes the end of the training more precise.
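Both loss functions can be written in a few lines. The sketch below (NumPy, with toy targets and predictions where the last value acts as an outlier) mirrors equations 2.4 and 2.5 and illustrates the different sensitivity to outliers discussed above.

import numpy as np

def mse(y, y_hat):
    """Mean Square Error (L2 loss), equation (2.4)."""
    return np.mean((y - y_hat) ** 2)

def mae(y, y_hat):
    """Mean Absolute Error (L1 loss), equation (2.5)."""
    return np.mean(np.abs(y - y_hat))

y = np.array([1.0, 2.0, 3.0, 10.0])      # targets; the last value acts as an outlier
y_hat = np.array([1.1, 1.9, 3.2, 4.0])   # predictions

print(mse(y, y_hat))   # strongly influenced by the outlier
print(mae(y, y_hat))   # more robust to the outlier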

2.3.2 Training the Model

Algorithm 1 shows the process of forward propagation. Required are the layer depth l of the network, the weight vectors W^(k) and bias vectors b^(k) of the model, as well as the input x to the process and the target output y. The main goal is to obtain a computation of the cost function J. This consists of the calculation of a loss value L and a regularisation term Ω(θ), which is used to prevent overfitting. The loss is calculated from the prediction ŷ and the target y. The regularisation term contains all parameters for weights and biases [11].

Algorithm 1: Forward propagation
Input: l, W^(k), b^(k), x, y
1: h^(0) = x                                /* initialize */
2: for k = 1, ..., l do                     /* for each layer */
3:     a^(k) = b^(k) + W^(k) h^(k-1)
4:     h^(k) = f(a^(k))
5: ŷ = h^(l)
6: J = L(ŷ, y) + λ Ω(θ)
Result: ŷ, J
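A minimal NumPy sketch of Algorithm 1 is shown below. It assumes ReLU hidden activations, a linear output layer, the MSE loss from equation 2.4 and an L2 penalty on the weights as Ω(θ); the layer sizes and random weights are made up for illustration only.

import numpy as np

def forward(x, y, weights, biases, lam=1e-4):
    """Forward propagation (Algorithm 1): returns the prediction y_hat and cost J."""
    h = x                                                      # h^(0) = x
    for k, (W, b) in enumerate(zip(weights, biases)):
        a = b + W @ h                                          # a^(k) = b^(k) + W^(k) h^(k-1)
        h = np.maximum(0.0, a) if k < len(weights) - 1 else a  # ReLU hidden layers, linear output
    y_hat = h                                                  # y_hat = h^(l)
    loss = np.mean((y - y_hat) ** 2)                           # L(y_hat, y), here MSE
    omega = sum(np.sum(W ** 2) for W in weights)               # Omega(theta): L2 penalty on the weights
    return y_hat, loss + lam * omega                           # J = L + lambda * Omega

rng = np.random.default_rng(0)
weights = [rng.normal(size=(16, 3)), rng.normal(size=(5, 16))]   # toy 2-layer network
biases = [np.zeros(16), np.zeros(5)]
x, y = rng.normal(size=3), rng.normal(size=5)
print(forward(x, y, weights, biases))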

Algorithm 2 shows the process of backward propagation.

In comparison to forward propagation, backward propagation also makes use of the target y alongside the input x. Starting from the output layer, the computation proceeds through the gradients on the pre-activations a^(k) for each layer k, going backward to the first hidden layer. The gradients are used to adjust each layer; thereby weights are changed to reduce the error. Different gradient-based optimisation methods can be used.

Algorithm 2: Backward propagation
1: g ← ∇_ŷ J = ∇_ŷ L(ŷ, y)
2: for k = l, l-1, ..., 1 do
3:     g ← ∇_{a^(k)} J = g ⊙ f'(a^(k))                        /* step 1 */
4:     ∇_{b^(k)} J = g + λ ∇_{b^(k)} Ω(θ)                      /* step 2 */
5:     ∇_{W^(k)} J = g h^{(k-1)T} + λ ∇_{W^(k)} Ω(θ)
6:     g ← ∇_{h^(k-1)} J = W^{(k)T} g                          /* step 3 */

Backward propagation starts at the end of the forward computation, where the gradient on the output layer is calculated. All calculations proceed backwards from layer l to layer 1. In step 1, the gradient on the layer's output is converted into a gradient on the pre-nonlinearity activation. In step 2, the gradients on the weights and biases are computed. In the third and last step, the gradient with respect to the activations of the next lower hidden layer is propagated.
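The sketch below mirrors Algorithm 2 for the same kind of toy two-layer network as above (NumPy, ReLU hidden layer, linear output, MSE loss, L2 penalty on the weights only). It is an illustrative implementation of the three steps, not the training code used in this thesis.

import numpy as np

def backward(x, y, weights, biases, lam=1e-4):
    """Backward propagation (Algorithm 2) for a ReLU network with a linear output and MSE loss."""
    # Forward pass, storing pre-activations a^(k) and layer outputs h^(k).
    hs, pre = [x], []
    for k, (W, b) in enumerate(zip(weights, biases)):
        a = b + W @ hs[-1]
        pre.append(a)
        hs.append(np.maximum(0.0, a) if k < len(weights) - 1 else a)
    y_hat = hs[-1]

    grads_W, grads_b = [None] * len(weights), [None] * len(weights)
    g = 2.0 * (y_hat - y) / y.size                       # g <- gradient of the MSE loss w.r.t. y_hat
    for k in reversed(range(len(weights))):
        if k < len(weights) - 1:                         # step 1: gradient on the pre-activation
            g = g * (pre[k] > 0)                         # (the output layer is linear, so unchanged there)
        grads_b[k] = g                                   # step 2: bias gradient (Omega has no bias term here)
        grads_W[k] = np.outer(g, hs[k]) + lam * 2.0 * weights[k]   # weight gradient incl. L2 term
        g = weights[k].T @ g                             # step 3: propagate to the layer below
    return grads_W, grads_b

rng = np.random.default_rng(0)
weights = [rng.normal(size=(16, 3)), rng.normal(size=(5, 16))]
biases = [np.zeros(16), np.zeros(5)]
gW, gb = backward(rng.normal(size=3), rng.normal(size=5), weights, biases)
print([g.shape for g in gW])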

2.3.3 Optimizer

The Adaptive Moment Estimation (Adam) [19] is an often employed optimization algorithm within machine learning. This technique, which only requires first-order gradients, is applicable to the optimization of stochastic objective functions. The Adam optimizer combines the benefits of well-known methods such as AdaGrad [6] and RMSProp [41] and can be applied to non-stationary objectives. For stochastic optimization, individual adaptive learning rates for the different parameters are employed. Estimates of the first moment (mean) and second moment (uncentered variance) of the gradients are thereby generated. The pseudo-code (Algorithm 3) shows the procedure for minimizing the objective function f(θ) by updating its model parameters θ.

Algorithm 3: Adam optimizer, stochastic gradient descent algorithm
Input: α, η, β_1, β_2, θ_0, f(θ)
1: m_0 ← 0                                    /* first moment vector */
2: v_0 ← 0                                    /* second moment vector */
3: t ← 0                                      /* timestep */
4: while θ_t not converged do
5:     t ← t + 1
6:     g_t ← ∇_θ f_t(θ_{t-1})
7:     m_t ← β_1 · m_{t-1} + (1 - β_1) · g_t
8:     v_t ← β_2 · v_{t-1} + (1 - β_2) · g_t^2
9:     m̂_t ← m_t / (1 - β_1^t)
10:    v̂_t ← v_t / (1 - β_2^t)
11:    θ_t ← θ_{t-1} - α · m̂_t / (√v̂_t + η)
Output: θ_t

The required inputs of the algorithm are the initial parameter vector θ_0, the stochastic objective function f(θ) to be minimized, the exponential decay rates β_1 ∈ (0, 1] and β_2 ∈ (0, 1] for the first- and second-moment estimates, as well as the learning rate α. The default setting of η is η = 10^{-8}. The first- and second-moment vectors and the timestep t are initialized to zeros. In the first step of each iteration t = 1, ..., T, the first-order gradient g_t is evaluated as the vector of partial derivatives of f_t with respect to θ at θ_{t-1}. The algorithm uses the exponential decay rates β_1 and β_2 with the previously computed gradient g_t to update the exponential moving averages m_t of the gradient and v_t of the squared gradient. The initialization of the moment estimates with vectors of zeros, especially during the initial timesteps and when the decay rates are small, leads to estimates that are biased towards zero. However, this initialisation bias is counteracted, which generates the bias-corrected estimates m̂_t and v̂_t.
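The update rule of Algorithm 3 fits in a few lines. The sketch below (NumPy) uses the default hyperparameters from [19] and a toy quadratic objective whose gradient is known in closed form; it is a simplified illustration, not the optimizer implementation used for the experiments.

import numpy as np

def adam(grad_fn, theta0, alpha=0.001, beta1=0.9, beta2=0.999, eta=1e-8, steps=5000):
    """Adam (Algorithm 3): minimize f by following bias-corrected moment estimates of its gradient."""
    theta = theta0.astype(float)
    m = np.zeros_like(theta)                 # first moment vector m_0 <- 0
    v = np.zeros_like(theta)                 # second moment vector v_0 <- 0
    for t in range(1, steps + 1):
        g = grad_fn(theta)                   # g_t <- gradient of f_t at theta_{t-1}
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)         # bias-corrected first moment
        v_hat = v / (1 - beta2 ** t)         # bias-corrected second moment
        theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eta)
    return theta

# Toy objective f(theta) = ||theta - c||^2 with gradient 2 * (theta - c).
c = np.array([3.0, -1.0])
print(adam(lambda th: 2 * (th - c), np.zeros(2)))   # approaches c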

2.3.4 Dropout

Deep Learning methods can handle problems of greater complexity than classical Machine Learning methods. However, a large amount of data is needed to train a deep learning model. Training a neural network with many layers using insufficient data can lead to overfitting and non-generalizability of the model. Overfitting is a condition in which the model identifies complex relationships in the training data but performs poorly on previously unseen data. In Deep Learning, regularisation techniques are applied to improve poor generalisability. Methods for regularization are batch normalization, data augmentation and dropout. Dropout is a method which combines various neural network architectures in an efficient way and avoids overfitting [36]. In this process, units of the hidden layers are temporarily removed from the neural network along with all their incoming and outgoing connections. An example of this procedure can be seen in figure 2.5. The selection of the units to be dropped is random. The probability parameter p, which acts independently on each unit, can be adjusted as desired. Input units of the neural network work best with p close to 1.

The usage of dropout can be considered as sampling from a larger neural network. A network with the remaining units is then also called a thinned network. When training a neural network with dropout, a collection of 2^n thinned networks with extensive weight sharing is trained.


Figure 2.5: Neural network applying dropout

The comparison of standard feed-forward neural network modeling (2.6) and the corresponding modeling with dropout (2.7) can be seen in the equations below. The hidden layers of the neural network are indexed with l, where l ∈ {1, .., L}. The vectors z^(l) and y^(l) denote the inputs and outputs of layer l, whereby y^(0) = x. The weights and biases at layer l are represented by the vectors W^(l) and b^(l). Any activation function is represented by f.

Standard network:

z_i^(l+1) = w_i^(l+1) y^(l) + b_i^(l+1),
y_i^(l+1) = f(z_i^(l+1))    (2.6)

The * in 2.7 indicates the element-wise product. The output y^(l) of each layer l is multiplied element-wise by a vector of independent Bernoulli random variables r^(l). Thereby the thinned output ỹ^(l) is computed and used as input to the next layer. By running this operation for each layer, a sub-network is sampled from the large overall network.

Dropout network:

r_j^(l) ~ Bernoulli(p),
ỹ^(l) = r^(l) * y^(l),
z_i^(l+1) = w_i^(l+1) ỹ^(l) + b_i^(l+1),
y_i^(l+1) = f(z_i^(l+1))    (2.7)

Visualised, this can be depicted as follows in figure 2.6.

(a) Standard network (b) Dropout network

Figure 2.6: Comparison of standard and dropout network
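A minimal sketch of the dropout forward pass of equation 2.7 is given below (NumPy, a single layer with made-up sizes). At training time each unit of the previous layer is kept with probability p; at test time the mask is replaced by a scaling with p, as described in [36].

import numpy as np

def dropout_layer(y_prev, W, b, p=0.8, train=True, rng=None):
    """One layer with dropout, equation (2.7): r ~ Bernoulli(p), y_tilde = r * y."""
    if rng is None:
        rng = np.random.default_rng()
    if train:
        r = rng.binomial(1, p, size=y_prev.shape)   # keep each unit with probability p
        y_tilde = r * y_prev                        # thinned output of the previous layer
    else:
        y_tilde = p * y_prev                        # test time: rescale instead of masking
    z = W @ y_tilde + b                             # z^(l+1) = w^(l+1) y_tilde^(l) + b^(l+1)
    return np.maximum(0.0, z)                       # y^(l+1) = f(z^(l+1)), here ReLU

rng = np.random.default_rng(1)
y_prev = rng.random(8)                              # output of the previous layer
W, b = rng.normal(size=(4, 8)), np.zeros(4)
print(dropout_layer(y_prev, W, b, rng=rng))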

2.3.4.1 Model uncertainty by Monte Carlo Dropout

Depending on the application of the machine learning model, the uncertainty of the models can be of great importance. This uncertainty can be divided into at least two categories.


In [17], uncertainty is divided into aleatoric and epistemic. The aleatoric uncertainty, also called irreducible uncertainty, describes the state of randomness in the outcome of an experiment. Epistemic uncertainty, also called reducible uncertainty, indicates the unawareness of the best model; it is reducible because this state of knowledge can change. One way to retrieve the (epistemic) uncertainty of a deep learning model is Monte Carlo (MC) dropout [7].

Monte Carlo Dropout is a method that uses regular dropout in a NN, applied before every weight layer, so that the network can be interpreted as a Bayesian approximation of a probabilistic deep Gaussian process (GP), a valuable tool to model distributions over functions. The posterior can be evaluated analytically by modelling the distribution over functions through

F | X ~ N(0, K(X, X))
Y | F ~ N(F, τ^{-1} I_N)    (2.8)

where τ is the precision hyperparameter and I_N the identity matrix of size N × N. The covariance function of the GP is given by

K(x, y) = \int p(w) p(b) \, \sigma(w^T x + b) \, \sigma(w^T y + b) \, dw \, db    (2.9)

as before, with a non-linear function σ(·) and prior distributions p(w), p(b).

The predictive probability of the deep GP is given by

p(y | x, X, Y) = \int p(y | x, ω) p(ω | X, Y) \, dω
p(y | x, ω) = N(y; ŷ(x, ω), τ^{-1} I_D)
ω = {W_1, ..., W_L}    (2.10)

whereby the posterior distribution p(ω | X, Y) is intractable. Therefore a simple variational distribution q(ω) is used to approximate the posterior distribution. The Kullback-Leibler (KL) divergence is used to minimize the distance between the posterior of the deep GP and the approximating distribution:

KL(q(ω) || p(ω | X, Y))    (2.11)

This can also be expressed as the minimisation objective

- \int q(ω) \log p(Y | X, ω) \, dω + KL(q(ω) || p(ω))    (2.12)

whereby the approximate predictive distribution can be expressed as follows:

q(y* | x*) = \int p(y* | x*, ω) q(ω) \, dω    (2.13)

The set of random variables of a model is described by ω = {W_i}_{i=1}^{L} with L layers. Now, to obtain the model uncertainty, MC dropout is performed by taking repeated random samples from the predictive distribution and subsequently averaging the numerical results. The Monte Carlo estimate, also referred to as MC dropout, is obtained by sampling T sets of vectors {W_1^t, ..., W_L^t}_{t=1}^{T} of realizations from the Bernoulli distribution {z_1^t, ..., z_L^t}_{t=1}^{T} with z_i^t = [z_{i,j}^t]_{j=1}^{K_i}, giving

E_{q(y*|x*)}(y*) ≈ \frac{1}{T} \sum_{t=1}^{T} \hat{y}^*(x*, W_1^t, ..., W_L^t)    (2.14)

In simplified terms, model averaging can be generated with T stochastic forward passes of the NN and a following averaging of the results.
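In code, the MC dropout estimate of equation 2.14 amounts to keeping the dropout masks active at prediction time and averaging T stochastic forward passes. The sketch below (NumPy, a toy two-layer regression network with random weights) also reports the per-output sample variance as a simple uncertainty measure; it follows the spirit of [7] but is not the implementation used for the experiments in this thesis.

import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(32, 3)), np.zeros(32)   # toy 2-layer regression network
W2, b2 = rng.normal(size=(5, 32)), np.zeros(5)    # 5 outputs, e.g. one value per fingertip

def stochastic_forward(x, p=0.8):
    """One forward pass with dropout kept ON (Bernoulli mask with keep probability p)."""
    h = np.maximum(0.0, W1 @ x + b1)
    h = h * rng.binomial(1, p, size=h.shape)      # dropout before the next weight layer
    return W2 @ h + b2

def mc_dropout_predict(x, T=100):
    """MC dropout: mean and variance over T stochastic forward passes, equation (2.14)."""
    samples = np.stack([stochastic_forward(x) for _ in range(T)])
    return samples.mean(axis=0), samples.var(axis=0)

mean, var = mc_dropout_predict(rng.normal(size=3))
print(mean, var)   # predictive mean and a per-output uncertainty estimate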

2.4 PointNet

In this section, the specifics of PointNet and its architecture are assessed step by step. PointNet provides the basis on which the more advanced PointNet++ and Hand PointNet models are built. PointNet is groundbreaking work: the model architecture is the first deep learning model capable of processing point cloud data directly as input. Furthermore, this novel type of neural network is suitable for applications such as object classification and part and semantic segmentation.

2.4.1 Architecture

The architecture of the model is shown in figure 2.7. In order for PointNet to provide direct handling of raw point cloud data, the architecture of the model has to conform to the unique properties of a point set. These unique properties are:

• Permutation Invariance: Point cloud data has an unstructured character, meaning that none of the points in a point cloud follows a particular ranking or hierarchy. Therefore, the processing must ensure invariability to various representations of the data.

• Transformation Invariance: Even after transformations such as rotation and translation, the outcome of the classification or segmentation should be consistent.

• Point Interactions: Neighbouring points typically provide important information and therefore should be considered (e.g. individual points should not be treated in isolation).

The architecture of PointNet is divided into a classification and a segmentation network. The main structure of the classification network consists of a shared multilayer perceptron (MLP). Thereby n points and their 3 dimensions (x, y, z) are mapped to 64 dimensions. It is important to mention that a single multilayer perceptron is used for each of the n points (i.e., the mapping is identical and independent for the n points). After repeating multiple instances, the n points of 64 dimensions are mapped to 1024 dimensions. A global feature vector in R^1024 is then created from the points in the higher-dimensional embedding space using max-pooling. Lastly, a vector of k output classification values is generated. This is achieved by passing the global feature vector through a fully connected network with 3 layers. The input and feature transformers are covered in more detail in the Transformation Invariance section.

In the segmentation network, each of the n input points (with dimensions x, y, z) is allocated to one of m segment classes. The first step is to create a per-point vector in R^1088. This results from the concatenation of the points in the 64-dimensional embedding space (local point features) and the global feature vector (global point features). As in the classification network, an MLP is used. This reduces the dimensionality to 128 and then to m. The result is an n × m array.

2.4.2 Permutation Invariance

As already mentioned, point cloud data are unstructured and represented as numerical sets. With N data points, there are N! permutations, an example of which is shown in figure 2.8.

In order to make PointNet invariant to input permutations, different strategies can be followed. The authors have used symmetric functions to aggregate the information from each of the n points. The input of the symmetric function is n vectors. The output is a new vector that is invariant to the input order.


Figure 2.7: PointNet architecture [33]

Figure 2.8: Example of permutations represented by a D dimensional vector and N sample of points

PointNet approximates such a set function by

f({x_1, ..., x_n}) ≈ g(h(x_1), ..., h(x_n)),    (2.15)

where f : 2^{R^N} → R, h : R^N → R^K and g : R^K × ··· × R^K → R is a symmetric function. The approximation is done by a multi-layer perceptron network h and a composition of a single-variable function g and a max-pooling function [8]. The authors empirically tested alternatives, where max-pooling performed best compared to summing and averaging.

The symmetric function is applied after the n input points are mapped to a higher-dimensional space. As a result, a global feature vector is produced that aims to capture an aggregate significance of the n input points. This vector is then used directly for the classification as well as for the local point features of the segmentation.
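The permutation-invariant core of this design can be sketched in a few lines: the same small point-wise network h is applied to every point independently, and max pooling acts as the symmetric function g that aggregates the per-point features into one global feature vector. The layer sizes and random weights below are illustrative only and much smaller than in the real PointNet.

import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(64, 3)), np.zeros(64)     # shared per-point MLP h (toy sizes)
W2, b2 = rng.normal(size=(128, 64)), np.zeros(128)

def per_point_features(points):
    """Apply the same MLP h to each of the n points independently (n x 3 -> n x 128)."""
    h = np.maximum(0.0, points @ W1.T + b1)
    return np.maximum(0.0, h @ W2.T + b2)

def global_feature(points):
    """Symmetric aggregation g: max pooling over the point dimension."""
    return per_point_features(points).max(axis=0)

cloud = rng.random((1024, 3))
shuffled = cloud[rng.permutation(1024)]
# The global feature does not change when the input points are reordered.
print(np.allclose(global_feature(cloud), global_feature(shuffled)))   # True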

2.4.3 Transformation Invariance

Both networks, classification and segmentation, are designed to be invariant to global orientation (and transformation). This kind of transformation for pose normalization is performed by Spatial Transformer Networks (STN) [18]. The input and feature transformers are sub-networks that perform this kind of pose normalization for the given input.

An example of how the spatial transformation (ST) generates a pose normalization of a rotated digit can be viewed in figure 2.9. This transformation for digit classifiers limits the need for data augmentation. The method can also be applied to point cloud data, where it reduces the effect of the global orientation of an object.


Such a pose normalization is performed on the input of PointNet. Each of the n points is represented as a vector (x, y, z dimensions) and is mapped to the embedding spaces independently. Thus, a geometric transformation can be performed for each point by simple matrix multiplication. Figure 2.10a represents the input transformer subsection of PointNet. What the localization net is to the ST, the T-net is to PointNet in the input transformer. This regression network processes the n-point input and generates a 3×3 transformation matrix, which is multiplied with the n×3 input matrix. The T-net consists mainly of MLP and FC layers. The input is mapped to a higher-dimensional space and, through max pooling, a global feature vector is generated, similar to what was explained for the structure of the classification network. The dimension of the global feature vector is then reduced to R^256 with the FC layers and combined with trainable weights and biases. The result is a transformation matrix of 3×3 dimensions.

(a) Input transformer [33]

(b) T-Network

Figure 2.10: Input transformer of PointNet

A similar transformation is performed by the feature transformer. However, the input in this case is the normalized pose extended to the 64-dimensional embedding space, and the output is a 64×64 transformation matrix that is multiplied with the n×64 input. The dimensions of the trainable weights and biases are now 256×4096 and 1×4096. This high number of parameters can cause overfitting, thus a regularization term was added to the softmax training loss:

L_reg = || I - A A^T ||_F^2    (2.16)

where A is the feature alignment matrix predicted by a mini-network.
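Both operations described here, applying a predicted alignment by matrix multiplication and penalizing the feature alignment matrix for deviating from an orthogonal matrix (equation 2.16), are sketched below. The matrices A3 and A64 are random stand-ins for the outputs of the input and feature T-nets, which are not implemented in this sketch.

import numpy as np

rng = np.random.default_rng(0)

points = rng.random((1024, 3))                        # n x 3 input point cloud
A3 = np.eye(3) + 0.01 * rng.normal(size=(3, 3))       # stand-in for a predicted 3x3 input transform
aligned = points @ A3                                 # apply the transform by matrix multiplication

def orthogonality_loss(A):
    """Regularization term L_reg = ||I - A A^T||_F^2, equation (2.16)."""
    I = np.eye(A.shape[0])
    return np.sum((I - A @ A.T) ** 2)

A64 = np.eye(64) + 0.01 * rng.normal(size=(64, 64))   # stand-in for the 64x64 feature transform
print(aligned.shape, orthogonality_loss(A64))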

2.5 PointNet++

PointNet++, also called hierarchical PointNet, is of particular importance in this work, as it provides the framework for the Hand Pose Regression Network in the Hand PointNet. Since PointNet cannot capture local features of the point clouds, PointNet++ was proposed. This model uses PointNet in a recursive manner on subsets of the input point set. Thus, the PointNet++ model can be considered a continuation of PointNet [34].


The idea of the processing is similar to that of a CNN. First, fine geometric structures are extracted as local features. These minor local features are grouped into larger units and processed, generating higher-level features. By repeating this process, features are generated for the entire point cloud.

2.5.1 Architecture

The hierarchical structure of the model is generated by a sequence of set abstraction levels. At each abstraction level, features from a cluster of points are processed and abstracted to create a reduced number of elements. Each set abstraction consists of three essential layers:

• Sampling layer: defines the centroids of local regions by selecting a subset of the input points.

• Grouping layer: finds neighboring points around the centroids and thereby structures subsets into local regions.

• PointNet layer: applies the PointNet model to create feature vectors of local region patterns.

An example of hierarchical point set feature learning, as well as the set abstraction levels and the associated sampling, grouping and PointNet layers, can be seen in figure 2.11. The input matrix of an abstraction level is of size N × (d + C), where N represents the number of points, d the dimension of the coordinates and C the dimension of the point features. The output matrix of the abstraction level is N′ × (d + C′), where N′ is the number of subsampled points and C′ the dimension of the feature vectors which aggregate the local context.

Figure 2.11: PointNet++ architecture [34]

Sampling layer: applies iterative Farthest Point Sampling (FPS) to the input points {x_1, x_2, ..., x_n}. Thus a subset of points {x_{i_1}, x_{i_2}, ..., x_{i_m}} is selected, where x_{i_j} is the farthest point (in metric distance) from the set {x_{i_1}, x_{i_2}, ..., x_{i_{j-1}}}.

Grouping layer: takes sets of points as input and generates groups of sets of points as output. The size of the input is defined by the point set of size N × (d + C) and the coordinates of a set of centroids of size N′ × d. The size of the output is N′ × K × (d + C), where K denotes the number of neighbouring points of a centroid point; thus each group represents a local region.

PointNet layer: the output of the previous grouping layer is taken as input to the PointNet layer. The centroid and local features are representative of each local region. The size of the output data is N′ × (d + C′).
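The farthest point sampling used in the sampling layer can be written as a short greedy loop: each new centroid is the point with the largest distance to the set selected so far. The sketch below (NumPy, random toy cloud) is a plain reference implementation, not the optimized sampling used in PointNet++ or Hand PointNet.

import numpy as np

def farthest_point_sampling(points, m, seed=0):
    """Greedy FPS: return the indices of m points, each chosen as the point
    farthest (in Euclidean distance) from the points selected so far."""
    rng = np.random.default_rng(seed)
    selected = [int(rng.integers(points.shape[0]))]            # start from a random point
    dist = np.linalg.norm(points - points[selected[0]], axis=1)
    for _ in range(1, m):
        idx = int(np.argmax(dist))                             # farthest point from the selected set
        selected.append(idx)
        dist = np.minimum(dist, np.linalg.norm(points - points[idx], axis=1))
    return np.array(selected)

cloud = np.random.default_rng(1).random((1024, 3))
centroids = farthest_point_sampling(cloud, m=128)
print(centroids.shape)   # indices of 128 well-spread centroid points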

3 Data

This chapter is about the data worked with during the thesis project. First, an overall introduction to the 3-dimensional data type is given. Secondly, the concept of point cloud data is briefly explained. Thirdly, the utilised data are listed and explained. The fourth and last section is about how the data used has been pre-processed.

3.1 Hand pose data

Hand pose estimation is a sub-task in the field of computer vision. Mainly images or videos are processed to extract patterns from the data. Hand pose estimation aims to recognize the pose of the hand in different forms and variations. The pose of a hand in this context is defined by M joints and their locations. In the depth images in figure 3.1, four different hand poses are visible. Each finger is marked by 4 joints. Additionally, the hand includes another joint for the overall center of the hand. Thus, in this specific example, each hand and its position is represented by 21 joints. The number of joints may vary depending on the data set.

The depth image is a subcategory of RGB-D data, which in 3D data representation falls under the category of Euclidean structure [1]. Each hand pose can be rotated differently, as can be seen in figure 3.2. Each image is referred to as a frame. In figures 3.1 and 3.2, three different frames (rotations) of hand pose 4 are illustrated.

3.2 Utilised data

The number of publicly available data sets on the topic of hand pose estimation is limited. The main data set for this thesis is called MSRA Hand (Microsoft Research Asia). The key information about the MSRA data set¹ [37] is that it contains more than 76 thousand depth images and ground truth for 21 joints. The total number of frames is grouped into 9 subjects (test persons). A total of 17 gestures were captured, all of which can be seen in the appendix figures A.4 & A.5. Each gesture is represented in 500 frames; examples can be seen in figure A.3.

¹ MSRA data set: https://www.dropbox.com/s/c91xvevra867m6t/cvpr15_MSRAHandGestureDB.


(a) Hand pose - 1 (b) Hand pose - 2

(c) Hand pose - 3 (d) Hand pose - 4 - Rotation 1

Figure 3.1: Hand poses

(a) Hand pose 4 - Rotation 2 (b) Hand pose 4 - Rotation 3

Figure 3.2: Hand pose rotation

Further public data sets for hand pose estimation are ICVL² [40], NYU³ [42], and Hands 2017⁴ [51]. Characteristics such as the number of frames and their split into training and test frames, as well as the number of joints per frame, can be seen in table 3.1.

Similar to the MSRA data set, all listed data sets are published in depth-image format. However, the experimental setup of the data generation differs. This difference can be divided into manual and automatic annotation. Manual annotation (ICVL) can also include data gloves or similar tools. The automatic annotation (NYU, MSRA [42]) utilizes a tracking algorithm with a complex camera setup [39]. Due to the complexity of the problem and the different approaches to data generation and annotation, variations in the number of joints in the data sets arise.

There is a gap in the table for training and test frames for the MSRA data set, the data set used in this work. As already mentioned, this data set consists of 9 subjects, each of which has performed the same 17 gestures.

² ICVL data set: https://labicvl.github.io/hand.html

³ NYU data set: https://jonathantompson.github.io/NYU_Hand_Pose_Dataset.htm
⁴ Hands 2017 data set: http://icvl.ee.ic.ac.uk/hands17/challenge/



Table 3.1: Public data sets - Hand Pose Estimation

Name          #Frames    #Training frames    #Test frames    #Joints
ICVL            23655               22059            1569         16
MSRA            76500                   -               -         21
NYU             81009               72757            8252         36
Hands 2017    1252000              957000          295000         21
Hands 2019     300000              175000          125000         21

Each subject performed the same 17 gestures. Any subject or subjects can be selected for training and testing. Later in this work, subject 0 was used for validation and subject 1 for evaluation. Thus, subjects 2 to 8 serve as the training data set.
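As a minimal sketch of this subject-wise split (the folder layout P0-P8 with one sub-folder per gesture and .bin depth frames is an assumption about the extracted MSRA archive, not a description verified here):

import os

MSRA_ROOT = "cvpr15_MSRAHandGestureDB"   # assumed extraction path
subjects = [f"P{i}" for i in range(9)]

val_subjects   = ["P0"]           # subject 0: validation
test_subjects  = ["P1"]           # subject 1: evaluation
train_subjects = [s for s in subjects if s not in val_subjects + test_subjects]

def frames_of(subject):
    """Collect all depth frames (.bin files) of one subject, over all 17 gestures."""
    frames = []
    subject_dir = os.path.join(MSRA_ROOT, subject)
    for gesture in sorted(os.listdir(subject_dir)):
        gesture_dir = os.path.join(subject_dir, gesture)
        frames += [os.path.join(gesture_dir, f)
                   for f in sorted(os.listdir(gesture_dir)) if f.endswith(".bin")]
    return frames

train_frames = [f for s in train_subjects for f in frames_of(s)]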

3.2.1

Data preprocessing

Depth images serve as a starting point for various approaches to hand pose estimation. Conversion allows depth images to be transformed into volumetric structures such as voxels or into point clouds with Euclidean structure. To perform the transformation from depth image to point cloud, the code of Yancheng Wang⁵ (co-author of Hand PointNet) was used. Hand pose 4 from the previous figures 3.1 & 3.2 can be seen as a point cloud in figure 3.3 from three different perspectives. The 1024 blue points together correspond to the point cloud of the hand, while red represents the M = 21 joints.
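The conversion itself is a standard back-projection of each depth pixel through the camera intrinsics. The sketch below is not the code of Yancheng Wang; the focal length and principal point are values commonly reported for the MSRA sensor (320 x 240 depth images) and should be treated as assumptions:

import numpy as np

def depth_to_point_cloud(depth, fx=241.42, fy=241.42, cx=160.0, cy=120.0):
    """Back-project a depth image (H x W, in mm) into a point cloud (N x 3).

    Pixels with zero depth (no measurement) are discarded.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth.astype(np.float32)
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    cloud = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return cloud[cloud[:, 2] > 0]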

Figure 3.3: Hand pose 4 - Point cloud. (a) Perspective 1; (b) Perspective 2; (c) Perspective 3.


3.2.2

Data transformation

As a reminder, the subject of this work is to test the performance of a deep learning model for hand pose estimation on the fingertips only. In this context, the last joint of each finger is defined as the fingertip. Accordingly, a transformation from M = 21 to M = 5 must be performed. With changes to the published preprocessing code of the Hand PointNet software⁶, it was possible to select joints as desired. Figure 3.4 shows the same hand point cloud with three different values of M = {21, 11, 5}.

Figure 3.4: Point cloud with varying number of joints

As an example, the indexing of the joints is set as M21 = {1, 2, ..., 21}. The index is to be understood as follows, starting with the joint for the palm. All fingers have 4 joints; they are counted from the lowest part of the finger upwards, starting from the thumb and ending at the little finger. Thus the fingertips, the last joint of each finger, can be indexed by M5 = {5, 9, 13, 17, 21}. Since a reduction from a maximum of 21 joints to only one joint per finger seems extreme, it was decided to examine a further value of M. M = 11 was chosen because it lies between the extremes. The indexing of the 11 joints was selected as follows: M11 = {1, 4, 5, 8, 9, 12, 13, 16, 17, 20, 21}.
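With the ground truth stored as an M x 3 array per frame, this reduction is a plain index selection. A minimal sketch using the 1-based indexing defined above (converted to 0-based array indices; the placeholder joint array is only for illustration):

import numpy as np

# 1-based joint indices as defined above
M21 = list(range(1, 22))
M11 = [1, 4, 5, 8, 9, 12, 13, 16, 17, 20, 21]
M5  = [5, 9, 13, 17, 21]

def select_joints(joints_21x3, selection):
    """Reduce a (21, 3) ground-truth joint array to the selected subset."""
    idx = np.array(selection) - 1          # convert to 0-based indexing
    return joints_21x3[idx]                # (len(selection), 3)

joints = np.random.randn(21, 3)            # placeholder ground truth
fingertips = select_joints(joints, M5)     # (5, 3)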


4

Method

This chapter explains the applied methods in detail. The Hand PointNet and its steps for data normalization and the regression network are explained in further depth. Subsequently, the data preprocessing is highlighted. The training process is then explained in more detail, as well as the evaluation. Finally, additional implementation steps are outlined.

4.1

Model

The model can be divided into 3 phases. In the first phase, the point cloud is sampled and its rotation is normalized. In the second phase, the normalized point cloud is processed by the Hand Pose Regression Network, which estimates the joint locations in 3-dimensional space. In the third and final phase, a method called the Fingertip Refinement Network is applied. This method should enable a new, improved estimation of the fingertips. In particular, the PointNet and hierarchical PointNet models explained in chapter 2 are applied in the second and third phases.

4.1.1

Hand PointNet

A point cloud of a hand, converted from a depth image, serves as input to the model. The output of the model is a 3D hand pose $\Phi = \{\phi_m\}_{m=1}^{M} \in \Lambda$ in the camera coordinate system (C.S.). M is the total number of joints in each hand and $\Lambda$ represents the $3 \times M$ hand joint space. An oriented bounding box (OBB) is applied to normalize the sampled set of N 3-dimensional points $p_n \in \mathbb{R}^3$ $(n = 1, ..., N)$. Applying the hierarchical PointNet [34] explained in chapter 2, the model extracts hierarchical hand features and produces a 3-dimensional hand pose with fully connected layers. In order to increase the prediction accuracy of the fingertip joints (the last joint of each finger), a fingertip refinement network was developed and applied. The architectural structure of the 3D hand pose estimation method can be seen in figure 4.1.

4.1.2

OBB-based Point Cloud Normalization

The normalization of the point cloud is an essential part of the 3D hand pose estimation. The OBB-based transformation unifies the input data with respect to its rotation.


Figure 4.1: Hand PointNet architecture [8]

A hand point cloud aligned to a homogeneous canonical coordinate system creates robustness to global orientations of the hand. Thus, the OBB serves the same purpose as the spatial transformer network [18] in PointNet [33]. The OBB is literally a box that tightly wraps the input data. A principal component analysis (PCA) on the 3D coordinates of the input points is conducted to identify the orientation of the OBB. The transformation from camera C.S. to OBB C.S. can be expressed as follows:

$$ p_n^{obb} = \left(R_{obb}^{cm}\right)^{T} \cdot p_n^{cm}, \qquad p_n^{nor} = \left(p_n^{obb} - \bar{p}^{obb}\right) / L_{obb} \qquad (4.1) $$

Here $p_n^{cm}$ represents the 3D coordinates of point $p_n$ in the camera C.S. and $p_n^{obb}$ those in the OBB C.S., the centroid of the point cloud $\{p_n^{obb}\}_{n=1}^{N}$ is denoted by $\bar{p}^{obb}$, the point $p_n$ in the normalized 3D coordinates is represented by $p_n^{nor}$, $R_{obb}^{cm}$ is the rotation matrix of the OBB in the camera C.S., and $L_{obb}$ represents the maximum edge length of the OBB. In the course of training and testing, the transformations 4.1 and 4.2 are performed. For training, the points and joint locations are transformed from the camera C.S. to the normalized OBB C.S. In testing, the estimations in the OBB C.S. $\hat{\phi}_m^{nor}$ are transformed back into the camera C.S. $\hat{\phi}_m^{cm}$ $(m = 1, ..., M)$:

$$ \hat{\phi}_m^{cm} = R_{obb}^{cm} \cdot \left( L_{obb} \cdot \hat{\phi}_m^{nor} + \bar{p}^{obb} \right) \qquad (4.2) $$
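A simplified NumPy illustration of equations 4.1 and 4.2 follows; it is an assumption-laden sketch, not the Hand PointNet implementation, and in particular it ignores the sign ambiguity of the PCA axes and any reordering of the OBB edges:

import numpy as np

def obb_normalize(points_cm):
    """Normalize a point cloud (N, 3) from camera C.S. into OBB C.S. (eq. 4.1)."""
    # PCA on the 3D coordinates gives the OBB orientation R (its columns are the OBB axes)
    centered = points_cm - points_cm.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    R = vt.T
    p_obb = points_cm @ R                                 # row-wise (R^T . p^cm)
    p_bar = p_obb.mean(axis=0)                            # centroid of the OBB point cloud
    L = (p_obb.max(axis=0) - p_obb.min(axis=0)).max()     # longest OBB edge
    p_nor = (p_obb - p_bar) / L
    return p_nor, R, p_bar, L

def joints_to_camera(phi_nor, R, p_bar, L):
    """Back-transform estimated joints (M, 3) from OBB C.S. to camera C.S. (eq. 4.2)."""
    return (L * phi_nor + p_bar) @ R.T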

4.1.3

Hand Pose Regression Network

The hand pose regression network uses a normalized point cloud as input, $X^{nor} = \{x_n^{nor}\}_{n=1}^{N} = \{p_n^{nor}, n_n^{nor}\}_{n=1}^{N}$. The 3D coordinate of the normalized point is represented by $p_n^{nor}$ and the corresponding 3D surface normal by $n_n^{nor}$. The hierarchical PointNet employed in this thesis is structured as follows: it consists of 3 set abstraction levels, with $N = \{1024, 512, 128\}$ local regions and varying k for the grouping. The feature dimension C for each local region is set to $C = \{128, 256\}$. As in PointNet and PointNet++, a 1024-dimensional global feature vector is generated, which is processed by the three FC layers.

Using the normalized point clouds and the matching ground-truth 3D joint positions $\{(X_t^{nor}, \Phi_t^{nor})\}_{t=1}^{T}$, the following objective function is minimized during training:

$$ w^{*} = \arg\min_{w} \sum_{t=1}^{T} \left\| \alpha_t - F\left(X_t^{nor}, w\right) \right\|^{2} + \lambda \left\| w \right\|^{2}, \qquad (4.3) $$

where w indicates the parameters of the network, F denotes the hand pose regression PointNet, the strength of regularization is controlled by $\lambda$, and $\alpha_t$ is an F-dim
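A hedged PyTorch sketch of the objective in equation 4.3, where the L2 penalty $\lambda \|w\|^2$ is realized through the optimizer's weight decay; the tiny fully connected model below is only a stand-in for the hand pose regression PointNet F, and every layer size is an assumption:

import torch
import torch.nn as nn

# Stand-in for F(X, w); the real model is the hierarchical PointNet with
# 3 set abstraction levels described above.
class TinyRegressor(nn.Module):
    def __init__(self, n_points=1024, n_joints=5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),                       # (B, N, 6) -> (B, N * 6)
            nn.Linear(n_points * 6, 256), nn.ReLU(),
            nn.Linear(256, 3 * n_joints),       # 3 coordinates per joint
        )

    def forward(self, x):
        return self.net(x)

model = TinyRegressor()
lam = 1e-4                                       # regularization strength lambda
optimizer = torch.optim.Adam(model.parameters(), weight_decay=lam)

x = torch.randn(8, 1024, 6)                      # normalized points + surface normals
target = torch.randn(8, 15)                      # flattened ground-truth joints (5 joints x 3)

optimizer.zero_grad()
pred = model(x)
loss = ((pred - target) ** 2).sum(dim=1).mean()  # squared L2 error, averaged over the batch
loss.backward()
optimizer.step()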
