
DEGREE PROJECT IN THE FIELD OF TECHNOLOGY TECHNOLOGY AND LEARNING
AND THE MAIN FIELD OF STUDY INFORMATION AND COMMUNICATION TECHNOLOGY,
SECOND CYCLE, 30 CREDITS

STOCKHOLM, SWEDEN 2020

Real-time hand segmentation using deep learning

Hand-segmentering i realtid som använder djupinlärning

FEDERICO FAVIA

KTH ROYAL INSTITUTE OF TECHNOLOGY

SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE


Real-time hand segmentation using deep learning

FEDERICO FAVIA

Master in ICT Innovation in EIT Visual Computing Communication - TIVNM

Date: January 19, 2021
Supervisor: Hanwei Wu
Examiner: Markus Flierl

School of Electrical Engineering and Computer Science
Host company: ManoMotion AB

Host supervisors: Thang Nguyen and Shahrouz Yousefi
Swedish title: Hand-segmentering i realtid som använder djupinlärning



“I’ve missed more than 9,000 shots in my career.

I’ve lost almost 300 games.

26 times I’ve been trusted to take the game winning shot and missed.

I’ve failed over and over and over again in my life.

And that is why I succeed.”

Michael Jeffrey Jordan



Abstract

Hand segmentation is a fundamental part of many computer vision systems aimed at gesture recognition or hand tracking. In particular, augmented reality solutions need a very accurate gesture analysis system in order to satisfy end consumers in an appropriate manner; the hand segmentation step is therefore critical. Segmentation is a well-known problem in image processing: the process of dividing a digital image into multiple regions of pixels with similar qualities. Classifying which pixels belong to the hand and which to the background must be performed in real time and with reasonable computational complexity. While in the past mainly light-weight probabilistic and machine learning approaches were used, this work investigates the challenges of real-time hand segmentation achieved through several deep learning techniques. Is it possible to improve current state-of-the-art segmentation systems for smartphone applications? Several models are tested and compared based on accuracy and processing speed. A transfer learning-like approach guides the method of this work, since many architectures were built only for generic semantic segmentation or for particular applications such as autonomous driving. Great effort is spent on organizing a solid and generalized dataset of hands, exploiting existing ones and data collected by ManoMotion AB. Since the first aim was to obtain a very accurate hand segmentation, the RefineNet architecture is ultimately selected, and both quantitative and qualitative evaluations are performed, considering its advantages and analysing the problems related to its computational time, which could be improved in the future.

Keywords — Hand Segmentation, Semantic Segmentation, Deep Learning, Convolutional Neural Networks, Real-time, Augmented Reality, Embedded Devices, Dataset, Transfer Learning


Sammanfattning

Handsegmentering är en grundläggande del av många datorvisionssystem som syftar till gestigenkänning eller handspårning. I synnerhet behöver förstärkta verklighetslösningar ett mycket exakt gestanalyssystem för att tillfredsställa slutkonsumenterna på ett lämpligt sätt. Därför är handsegmenteringssteget kritiskt. Segmentering är ett välkänt problem vid bildbehandling, det vill säga processen att dela en digital bild i flera regioner med pixlar av liknande kvaliteter. Klassificera vilka pixlar som tillhör handen och vilka som hör till bakgrunden måste utföras i realtidsprestanda och rimlig beräkningskomplexitet. Medan tidigare använts huvudsakligen lättviktiga probabilistiska metoder och maskininlärningsmetoder, undersöker detta arbete utmaningarna med realtidshandsegmentering uppnådd genom flera djupinlärningstekniker. Är det möjligt eller inte att förbättra nuvarande toppmoderna segmenteringssystem för smartphone-applikationer? Flera modeller testas och jämförs baserat på noggrannhet och processhastighet. Transfer learning-liknande metoden leder metoden för detta arbete eftersom många arkitekturer byggdes bara för generisk semantisk segmentering eller för specifika applikationer som autonom körning. Stora ansträngningar läggs på att organisera en gedigen och generaliserad uppsättning händer, utnyttja befintliga och data som samlats in av ManoMotion AB. Eftersom det första syftet var att få en riktigt exakt handsegmentering, väljs i slutändan RefineNet-arkitekturen och både kvantitativa och kvalitativa utvärderingar utförs med beaktande av fördelarna med det och analys av problemen relaterade till beräkningstiden som kan förbättras i framtiden.

Nyckelord — Handsegmentering, Semantisk Segmentering, Djupinlärning, Konvolutionsneurala Nätverk, Realtid, Förstärkt Verklighet, Inbäddade Enheter, Datauppsättning, Transferlärning



Acknowledgments

I would like to thank my supervisors at ManoMotion, Dr. Thang Nguyen and CTO Dr. Shahrouz Yousefi, for their help and guidance throughout this work. I have learned a lot from both of them in the past six months and our interaction has brought drastic changes in my perspectives on research, technology and teamwork. I would also like to thank my internship mate Yasmin and the other colleagues from ManoMotion AB for sharing nice moments as well as their know-how. A proper thank you goes to the EIT Digital Academy for granting me the opportunity to study abroad and experience the vibrant entrepreneurial world, as well as to KTH and the University of Trento and their respective head professors for my path, Dr. Markus Flierl and Dr. Nicola Conci, for their supervision of the thesis.

In particular, I enjoyed an amazing experience in Sweden with people from many cultures and backgrounds: the happy days - and the tough ones too - spent with my VCC colleagues and friends Martin and Kamil really helped me throughout my journey, as did the other students, especially from the Italian community, met on the KTH campus.

My best friends of a lifetime, Ghiro, Lollo, Leo and Capo, were instead far away, either in Italy or some other European country, living through the hard period of lockdown, but we always helped and supported each other beyond the distance, even through our shared passion for football: you are the very best! Also Cava and Lorenzo, my dorm friends back in Trento, happily shared with me their Arci tortures.

Finally, I would like to thank my whole family, Martina, Nicolangelo, my mom and my dad, and my cousin Guido, because even though we were far away they always believed in me and supported my spirit.

But in particular, unexpectedly, a special person, my everything, a bright Sun, shined into my life and made it wonderful and complete:

Λepa, I dedicate this work to you with all my love and mind.

I won’t forget to say thank you to myself and my confidence, hoping I have grown mentally even further in this year of ups and downs, satisfaction and moments of void, because I always gave my everything, trying at the same time to look at the glass as half-full and to reach the future, improved version of “me”. Now it is time to sail for new adventures, which will make wise use of what I have learnt during my academic career. “Take your chances or someone will take them!”


Contents

List of Figures

List of Tables

Acronyms

1 Introduction
1.1 Background
1.1.1 Hand Segmentation
1.2 Motivation
1.3 Problem statement and research questions
1.4 Methodology and contributions
1.5 Challenges and delimitations
1.6 Ethics and sustainability
1.7 Outline

2 Literature Review
2.1 Overview
2.2 Non Deep Learning-based methods
2.3 Deep Learning-based methods
2.4 Summary

3 Background Theory
3.1 Overview
3.2 Segmentation
3.2.1 Semantic Segmentation
3.3 Machine Learning (ML)
3.4 Deep Learning (DL)
3.4.1 Gradient descent
3.4.2 Avoiding Vanishing Gradients
3.5 Convolutional NN (CNN)
3.6 Optimise deep CNNs
3.6.1 Diminishing overfitting phenomenon
3.7 Important CNN models for segmentation
3.7.1 ResNet: Residual learning
3.7.2 MobileNet: Depth-wise separable convolution
3.7.3 FCN, SegNet, DeconvNet, U-Net
3.7.4 RefineNet, PSPNet, ICNet and DeepLabv3+
3.8 Metrics and losses for segmentation
3.8.1 Important concepts
3.8.2 Segmentation Metrics: Pixel Accuracy, Jaccard and Dice
3.8.3 Segmentation Losses
3.9 Datasets for semantic segmentation

4 Methods
4.1 Overview of Proposed Method
4.2 Software
4.3 Data
4.3.1 Selected Datasets
4.3.2 Final Dataset
4.3.3 Data Pre-processing and Augmentation
4.4 Model Selection
4.4.1 Hyperparameters Tuning
4.5 Loss and Evaluation Metrics
4.5.1 Processing time and model size

5 Results
5.1 U-Net simple model results
5.1.1 Advanced U-Net
5.2 RefineNet
5.3 PSPNet, ICNet and DeepLabV3+
5.4 Discussion and comparison

6 Conclusions
6.1 Future Work
6.1.1 Final words

Bibliography


List of Figures

1.1 Demo of ManoMotion’s hand tracking technology [4].
1.2 Hand segmentation and 2D joint estimation examples.
2.1 Example segmentations of [32] from the Intel egocentric object dataset.
2.2 Results comparison between [36] and baseline [35].
2.3 Sample results of the hand segmentation and GR algorithm of [37].
2.4 EgoHands dataset [40] with ground truth annotations.
2.5 Multi-scale split proposed structure [42] and visual results on their dataset.
2.6 Samples from four hand datasets including EgoHands, EYTH, GTEA and HOF used in [47].
2.7 Qualitative results on the EgoHands dataset using baseline [40], RefineNet [47] and RefineNet+CRF.
2.8 HGR-Net proposed structure [53], with the first stage dedicated to segmentation.
2.9 HandSeg proposed structure [56], compared with DeconvNet [57] and SegNet [44].
3.1 Active contour segmentation on Magnetic Resonance Imaging (MRI).
3.2 Graph cuts and energy minimization.
3.3 K-means segmentation.
3.4 Mean-shift segmentation.
3.5 Difference between semantic and instance segmentation, courtesy of [10].
3.6 Supervised ML and difference between classification and regression problems.
3.7 Underfitting and overfitting phenomenon in ML.
3.8 Feed-Forward Artificial NN (ANN).
3.9 Analogy between biological and artificial neuron.
3.10 Most common non-linear activation functions, including the recently most used ReLU and its modifications.
3.11 Example of Gradient Descent for a loss function.
3.12 Feature representations in CNNs.
3.13 Understanding convolutional layers.
3.14 Max and Average pooling layers.
3.15 Usual structure of a Convolutional NN (CNN).
3.16 Examples of image augmentations.
3.17 Examples of Dropout in a Neural Network (NN).
3.18 Key observations from ResNet, starting from the relation between error and number of layers in plain networks (top), to the residual module (bottom left) and the final structure (bottom right) [52].
3.19 Consequent steps (a) depth-wise convolutional layer and (b) point-wise convolutional layer to achieve the depth-wise separable convolution of MobileNet [62].
3.20 Difference between a normal residual block [52] (a) and an inverted residual block introduced in MobileNetV2 [92] (b).
3.21 FCN for semantic segmentation [50].
3.22 Improvement of FCN and transposed convolutions.
3.23 Main traditional segmentation networks.
3.24 Key elements and structure of RefineNet [48].
3.25 Light-weight RCU, CRP and fusion blocks. For simplicity, only 2 convolutional layers are visualized instead of 4 for the CRP block. Finally, RCU blocks are discarded in the final architecture [101].
3.26 Comparison of semantic segmentation frameworks: (a) FCN [50]. (b) Encoder-decoder of SegNet [44], DeconvNet [57], U-Net [46], and Refinement in RefineNet [48]. (c) Multi-scale of DeepLab [61] and PSPNet [51]. (d) ICNet [102].
3.27 Top row: Atrous convolution example. Middle row: ASPP of DeepLabV3. Bottom row: Atrous Separable Convolution. [61] [54] [100]
3.28 Confusion matrix of a binary classification problem.
3.29 Visualization of Jaccard Index, or IoU.
3.30 Visualization of Dice coefficient, or F1 score.
4.1 Overview of the proposed solution for hand segmentation.
4.2 Samples from selected datasets.
5.1 Impact on loss and metrics when training the U-Net model on a bigger and more diverse dataset: the Dice coefficient on the validation set is better for the model trained on the bigger dataset.
5.2 Some results, good and failures, of the simple U-Net model trained on the bigger dataset, tested on the ManoMotion AB hand dataset.
5.3 Complex U-Net model [46].
5.4 Test results of RefineNet-101 (III) with respective frame and ground truth: good examples and failings.
5.5 Visual results of the TensorFlow Lite version of RefineNet-101 (III) on video from a computer with a smaller GPU (12.5 FPS).
5.6 Visual results of RefineNet-50 on video from a computer with GPU (12.5 FPS): good frames and failings.
5.7 Training behaviour of four different architectures trained on the “hand-arm” dataset. It can be noticed that (a) RefineNet-101 has the most unstable pattern for validation, probably due to its high complexity.
5.8 Comparison between the different architectures.
5.9 Training behaviour of the three different architectures trained on the “hand-arm” dataset. It can be noticed that (a) PSPNet has the most unstable pattern.
5.10 Plots comparing the relations between Dice accuracy on average, mIoU and Dice on the MM dataset.
5.11 Plots comparing the relations between Dice accuracy on average, computational complexity and GPU inference speed.
5.12 3D scatter plot comparing AVG Dice cross-dataset, model size and quality of visual results.
5.13 Example of unpooling with saved indexes.


List of Tables

2.1 Quantitative results on different datasets of [47] method.
3.1 Different ResNet architectures for image classification.
3.2 Comparison between MobileNet architectures and one of the best ResNet architectures.
3.3 Some common semantic segmentation datasets, used especially for autonomous car applications.
4.1 Overview and key details of the used datasets.
5.1 Cross-dataset evaluation metrics and comparison between the simple U-Net model and the complex one [46], with the latter achieving on average better performance, even though not trained on the same dataset.
5.2 RefineNet variants implemented.
5.3 Cross-dataset evaluation metrics and comparison between (I) RefineNet-101 trained on the final “hand” dataset, (II) RefineNet-101 trained on the cropped MM dataset and (III) RefineNet-101 trained on the “hand-arm” dataset. The columns always show the Dice coefficient; where two values are given, the lower one is mIoU.
5.4 Cross-dataset evaluation metrics and comparison between (III) RefineNet-101 and (i) RefineNet-50, both trained on the “hand-arm” dataset. Only Dice is shown.
5.5 Cross-dataset evaluation metrics and comparison between the lighter versions of RefineNet, all trained on the “hand-arm” dataset. Only Dice is shown.
5.6 Cross-dataset evaluation metrics and comparison between (a) PSPNet [51], (b) ICNet [102] and (c) DeepLabV3+ [100], all trained on the “hand-arm” dataset. Only the Dice coefficient is shown, for some of the most representative datasets.


Acronyms

ADAS Advanced Driver-Assistance Systems
AI Artificial Intelligence
ANN Artificial NN
AR Augmented Reality
ASL American Sign Language
ASPP Atrous Spatial Pyramid Pooling
AVG Average
CEO Chief Executive Officer
CFF Cascade Feature Fusion
CGI Computer-Generated Imagery
CIELAB International Commission on Illumination color space
CNN Convolutional NN
CPU Central Process Unit
CRF Conditional Random Field
CRP Chained Residual Pooling
CV Computer Vision
DaM Depth-adaptive Multi-scale
DL Deep Learning
DNN Deep NN
DoF Degree of Freedom
DSP Digital Signal Processing
DTD Describable Texture Dataset
ELU Exponential Linear Unit
EYTH EgoYouTubeHands
FCN Fully Convolutional Network
FN False Negative
FoV Field of View
FP False Positive
FPS Frames per Second
GAN Generative Adversarial Network
GMM Gaussian Mixture Model
GPU Graphics Process Unit
GR Gesture Recognition
GTEA Georgia Tech Egocentric Activity
HCI Human Computer Interaction
HGR1 Hand GR-1
HMD Head-mounted Display
HOF Hand Over Face
HOG Histogram of Oriented Gradients
HOI Human Object Interaction
HSV Hue-Saturation-Value color space
HVS Human Visual System
ICT Information and Communication Technology
IDE Integrated Development Environment
ILSVRC ImageNet Large Scale Visual Recognition Challenge
IoU Intersection Over Union
LiDAR Light Detection and Ranging
MAP Maximum a Posteriori Probability
MIL Multiple-instance Learning
mIoU Mean IoU
ML Machine Learning
MM ManoMotion AB
MR Mixed Reality
MRF Multi-Resolution Fusion
MRI Magnetic Resonance Imaging
MSE Mean Squared Error
NN Neural Network
pdf Probability Density Function
PReLU Parametric Rectified Linear Unit
R-CNN Regional CNN
RCU Residual Convolution Unit
RDF Randomized Decision Forest
ReLU Rectified Linear Unit
RGB Red-Green-Blue color space
RGB-D Red-Green-Blue-Depth
RMS Root Mean Squared
RN RefineNet
RNN Recurrent NN
SDK Software Development Kit
SGD Stochastic Gradient Descent
SIFT Scale-invariant Feature Transform
SLAM Simultaneous Localization and Mapping
SLIC Simple Linear Iterative Clustering
SSD Single Shot MultiBox Detector
SVM Support Vector Machine
TL Transfer Learning
TN True Negative
TP True Positive
UI User Interface
VOC Visual Object Classes
VR Virtual Reality
XR Extended Reality
YCbCr Luminance-Chroma Blue-Chroma Red color space
YOLO You Only Look Once Detector


Chapter 1

Introduction

Currently, images and videos have a key role in our lives and have increasingly become the focus of technological improvement in many fields of application. Advances in data science and processors have enabled computing machines to observe and infer semantic meaning from visual data in the same manner as the Human Visual System (HVS) does. The latter is remarkably robust at understanding the surrounding visual world along with its context.

Hence, current state-of-the-art Computer Vision (CV) [1], the science and engineering of extracting visual information from raw sensor data, seeks to achieve the same efficiency. Especially through the relentless and astonishing development of Graphics Process Units (GPU), considerable progress has been enabled by Deep Learning (DL). In particular, this sub-field of Machine Learning (ML) has raised the performance and speed of many CV tasks with respect to traditional algorithms.

In addition, we are experiencing an era in which Extended Reality (XR) is a rapidly growing field of applications. XR is a superset which includes the full spectrum from the completely real to the completely virtual in the concept of the reality–virtuality continuum introduced by Milgram et al. [2].

Therefore, this extension of human experiences ranges from the senses of existence, represented by Virtual Reality (VR), to the acquisition of cognition, represented by Augmented Reality (AR), not excluding the intermediate stage of Mixed Reality (MR), where physical and digital objects co-exist and interact in real-time.



1.1 Background

The strong excitement and development around XR technology is due to a major shift in the User Interface (UI) of computers over the years: from keyboard and mouse for PCs, touchpads for laptops and touchscreens for smartphones, to new interactions such as hand tracking (Figure 1.1) and eye tracking. The CEOs of all the giant technology companies, as in [3], have stated that AR will pervade our future, perhaps merging even more with VR to become part of everyday life and to make the concept of a device fade away.

Figure 1.1: Demo of ManoMotion’s hand tracking technology [4].

The work has been conducted at ManoMotion AB [4], who offered this project as they are currently creating next-generation technology for real-time 3D hand Gesture Recognition (GR) and analysis for smartphones: they believe accurate hand segmentation will improve their algorithms. Hand gestures are an important type of natural language used in many research areas such as hand tracking, hand GR and Human Computer Interaction (HCI). Recognizing gestures requires the prior estimation and tracking of the hand position, for instance by means of 2D visual information such as joints, colour and the detailed shape of the hand (Figure 1.2). According to KTH Royal Institute of Technology Professor Haibo Li [5], advisor of ManoMotion AB [4], understanding HCI [6] requires grasping the concept of affordance: all actions are made physically possible by the properties of an object or the environment, and therefore they are related to humans and their intention. Indeed, James J. Gibson’s theory [7] always takes the physical environment into consideration when evaluating actions. These considerations can help reduce the 27 Degrees of Freedom (DoF) of the mathematical hand to a phenomenological hand with 7 DoF, lighter for hardware implementation purposes in GR systems. Video tracking is transforming the way humans send control to machines, and this natural interaction modality is being used in interactive games. For example, the user’s action of pressing a button on the controller can be replaced by a collection of more intuitive gestures in front of the camera. The unpredictability of the hand movements and the colour resemblance with other skin parts add a supplementary degree of complexity on top of the usual video-tracking challenges. According to Maggio and Cavallaro [8], this important process is called hand tracking, a vision task most commonly referred to as 3D hand pose estimation, which aims to estimate the location of human hand joints from images and videos.

Figure 1.2: Hand segmentation and 2D joint estimation examples.

1.1.1 Hand Segmentation

The problem the host company wants to explore is the implementation of different DL techniques to improve the current solution for hand segmentation, a critical step in their gesture analysis system. Segmentation is a very well-known problem in image processing [9] and CV [1]: it is the process of dividing a digital image into multiple regions of pixels with similar qualities. Image segmentation is typically used to locate objects in images and a large number of techniques are used to achieve this goal, but in this work mainly Deep NNs (DNN) [10] for semantic segmentation will be investigated.

The hand segmentation process, categorizing pixels into hand or background, is useful to achieve a precise understanding of the hand in order to improve the hand tracking pipeline. Methods to detect hand regions in visual data have several applications as well, such as research on eye-hand coordination or Human Object Interaction (HOI), practices for which accurate pixel-level hand segmentation is fundamental. In particular cases, segmentation needs to be performed in smartphone environments, with less capacity and fewer available resources. Furthermore, proper speed is necessary to avoid user delays in real-time software. Hand segmentation is thus a challenging problem in CV, also because hands are subject to several conditions and appear differently in different images. Semantic segmentation for autonomous cars or Advanced Driver-Assistance Systems (ADAS) is in some respects easier due to the less diverse road-structure scenario. Texture, shape and colour of hands change very rapidly even within the same sequence because of skin colour, lighting conditions, shadows, hand posture, rapid finger movements, acquisition angle, etc. In addition, all these characteristics differ depending on the age, gender or ethnicity of users. In this thesis, Convolutional NNs (CNN) are employed for segmenting hands from Red-Green-Blue color space (RGB) images captured in unrestricted scenarios, and a new varied hand dataset is created to evaluate the results. The effects of the architectures’ parameters and inputs on performance are discussed through experiments.

1.2 Motivation

As introduced earlier, this work has been conducted under the supervision of Dr. Thang Nguyen, during an internship at ManoMotion AB [4], a Stockholm-based startup providing a core technology framework to achieve precise, real-time hand tracking and GR in 3D space simply using a 2D camera. In particular, they are focused on the AR market, from smartphones to Head-mounted Displays (HMD) and headsets. Indeed, the company is building its Software Development Kit (SDK) for these devices because it strongly believes that a very precise system on these constrained machines will unlock in the future the full potential of XR and hand tracking for embedded devices such as wearables and smart glasses. Moreover, the recent and continual improvements in DL solutions are boosting XR applications, including hand GR, despite their initially larger computational burden.

In this very competitive market, a tool for comparing a novel segmentation method against the brand-new Apple Segmentation for ARKit 4 [11] or Google Segmentation for Android [12] and ARCore [13] can be crucial. In particular, Apple’s latest release of their CV framework, mostly known for its Simultaneous Localization and Mapping (SLAM) capabilities, has introduced people segmentation and occlusion in a given frame: it is also able to provide texture information and depth estimation from an ML approach running on Apple’s neural engine, together with more detailed, novel depth information gathered by the Light Detection and Ranging (LiDAR) scanner. Hence, the motivation behind this DL-based industry research is that it could lead to the development of a more robust hand segmentation method, from the model itself to a richer and more specific dataset, helpful for more accurate GR solutions.

1.3 Problem statement and research questions

Since similar research has already been conducted in this area, the goal of this thesis is to determine whether it is possible to outperform existing architectures. It is important to highlight that they are constructed for different segmentation problems and not for the challenging case of hands. Thus the main objective is to investigate and tune the segmentation network in order to perform well on the hand problem and, at the same time, assess its suitability for mobile devices. The problem, from an Information and Communication Technology (ICT) perspective, is how to increase the accuracy and efficiency of the application, not by improving the hardware but by remodelling the existing algorithms. Based on the gaps in the past literature, the following research questions have been formulated:

• Is it possible to improve the state-of-the-art hand segmentation algorithms relying completely on DL, in particular in terms of accuracy?

• What are the approaches and features of different DL architectures which benefit the hand segmentation problem the most?

• Are these techniques feasible to implement in real-time, taking later into consideration the computational power of smartphones?

Therefore, the purpose of this thesis is to find answers to these questions providing evidence with suitable experiments.

1.4 Methodology and contributions

This thesis aims to prove that by carefully analyzing different recent DL techniques, it is possible to improve existing solutions in the area of real-time hand segmentation, or at least understand how to better design the feature extractors of faster solutions. This is done by first creating and testing several DL architectures, and finally picking the best one based on chosen key metrics. The end deliverable is a trained network model saved in the TensorFlow Lite [14] format, which could later be easily integrated into a smartphone-based application.
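As a hedged illustration of this deliverable format, the sketch below shows how a trained Keras model could be exported to TensorFlow Lite; it is not the thesis’ actual export code, and the model object, output file name and the optional quantization flag are illustrative assumptions.

```python
# Minimal sketch (not the thesis code): converting a trained Keras segmentation
# model to the TensorFlow Lite format mentioned above.
import tensorflow as tf

def export_to_tflite(model: tf.keras.Model, path: str = "hand_seg.tflite") -> None:
    converter = tf.lite.TFLiteConverter.from_keras_model(model)
    # Optional: default optimizations (e.g. weight quantization) to shrink the
    # model for smartphone deployment.
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    tflite_model = converter.convert()
    with open(path, "wb") as f:
        f.write(tflite_model)
```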

1.5 Challenges and delimitations

Current hand estimation systems based on DL treat hands and environment separately, from a third-person perspective. This is exactly where the figure-ground segmentation problem falls, which is partially addressed by this work. It implies the strong assumption that background and environment are not important, which is a challenge that could be faced in a better way. The difficulties of fast hand movements, occlusions, lighting and skin colour add further challenges to the problem too.

Regarding the delimitations of this research, the integration of the segmentation network into the smartphone pipeline is not a part of this work, since ManoMotion AB (MM) itself took care of the necessary software to test the final solution on a smartphone.

1.6 Ethics and sustainability

This work aims to improve the accuracy of hand segmentation in hand tracking systems in order to improve their final output. It is important to point out that these systems, especially if used in VR and AR headsets or smart glasses, can have a positive societal impact on our world, for example in the education and learning field, or they can also improve the living conditions of people with disabilities, for example by automatically interpreting American Sign Language (ASL) [15]. In addition, considering the huge transformation of physical contact habits due to the recent Covid-19 pandemic, hand tracking will also have great potential [16]. At the same time, since the solution might be based on DNNs and the training of such complex architectures is well known to be heavily dependent on datasets, it is ethically important to ensure that these datasets are balanced and unbiased, for a fair representation of ethnicity, age and culture. Although the goal of this work is not to provide the solution to these issues, it is still significant to know about their existence so that a fruitful discussion can be developed around them.

1.7 Outline

The report is structured as follows. Chapter 2 gives a deep review of the academic research related to both traditional and DL-based approaches to the hand segmentation problem. It helps the reader understand the shift from probabilistic approaches to DL ones. Chapter 3 and Chapter 4 provide details about the theoretical framework and the methods used in this work, including the dataset creation. Chapter 5 shows the results obtained with different networks, with a discussion and the final model design. In the end, Chapter 6 provides a final summary of the work, as well as possible opportunities for future contributions.


Chapter 2

Literature Review

2.1 Overview

This chapter, after a short introduction, describes extensively the related literature on non DL-based techniques in Section 2.2 and DL-based ones in Section 2.3. Regarding work related to the problem of hand segmentation, extensive effort has been put into very different techniques. Frequently, it has been presented just as a sub-problem of works about hand GR systems, as in [17], [18], [19] and [20]. Most of the works have found that the most efficient feature to segment hands is colour, especially that of human skin [21] [22] [23] [24] [25]. Colour estimation for skin detection does not require complex calculations and it is invariant to occlusion and rotation.

Unfortunately, colour can be heavily affected by the illumination of the scene. Therefore other features, such as texture, have been added, resulting in heavier computations, or particular techniques such as Gaussian modelling [26], Randomized Decision Forests (RDF) [27] or optimum colour spaces have been studied to increase robustness [28] [29]. For instance, in 2016 Kolkur et al. [30] studied a new human skin detection algorithm that does not rely on one single colour space. They found that a combination of three different colour spaces:

• RGB;

• Hue-Saturation-Value color space (HSV);

• Luminance-Chroma Blue-Chroma Red color space (YCbCr)


together with proposed thresholds, works better to detect skin than using just one (a minimal thresholding sketch is given below). Even though this method is heavily problem dependent, they reached a precision of 89.33% and an accuracy of 94.43% on a subset of six images from the Pratheepan dataset [31]. The aim of this thesis is however to find hand segmentation techniques which can work under any indoor scenario (background, lighting, hand types) and that can possibly be implemented on a constrained computing device.
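The sketch below illustrates the idea of combining rules from several colour spaces; it is only an illustration in the spirit of [30], written with OpenCV in mind, and the threshold values are placeholders rather than the published ones.

```python
# Illustrative multi-colour-space skin thresholding; threshold values are
# placeholders, not the thresholds proposed in [30].
import cv2
import numpy as np

def skin_mask(bgr: np.ndarray) -> np.ndarray:
    """bgr: 8-bit BGR image. Returns a binary skin mask (0 or 255)."""
    hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)
    ycrcb = cv2.cvtColor(bgr, cv2.COLOR_BGR2YCrCb)
    r = bgr[..., 2].astype(np.int32)
    g = bgr[..., 1].astype(np.int32)
    b = bgr[..., 0].astype(np.int32)

    # RGB rule: skin tends to be red-dominant and sufficiently bright.
    rgb_rule = (r > 95) & (g > 40) & (b > 20) & (r > g) & (r > b)
    # HSV rule: only low (reddish) hue values.
    hsv_rule = hsv[..., 0] < 25
    # YCbCr rule: chroma components within a typical skin range
    # (OpenCV stores the channels in Y, Cr, Cb order).
    cr = ycrcb[..., 1].astype(np.int32)
    cb = ycrcb[..., 2].astype(np.int32)
    ycbcr_rule = (cr > 135) & (cr < 180) & (cb > 85) & (cb < 135)

    return ((rgb_rule & hsv_rule & ycbcr_rule) * 255).astype(np.uint8)
```

Only pixels that satisfy all three rules are kept, which is the intuition behind combining colour spaces instead of relying on a single one.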

2.2 Non Deep Learning-based methods

The following illustrates the most important research works of the 2010s on hand segmentation without relying on DL, both in ordinary images or videos for GR purposes and in the more challenging and realistic egocentric videos. Egocentric vision, or first-person vision, is a sub-field of CV that entails analyzing images and videos captured by a wearable camera, which is typically worn on the head or on the chest and naturally approximates the visual field of the camera wearer. Consequently, the visual data capture the part of the scene on which the user focuses to carry out the task at hand and offer a valuable perspective for understanding the user’s activities and their context in a naturalistic setting.

Although, as explained, segmentation can be approached with traditional CV algorithms, or with probabilistic models of skin colour followed by ML techniques, it is also studied as a figure-ground segmentation problem, as in most of the works discussed below.

Even though Ren and Gu [32] did not focus solely on hands, they developed a bottom-up motion-based approach to robustly segment out foreground objects in egocentric videos. Their solution to the problem is to utilize domain prior cues combined with layered optical flow. Indeed, their basic assumption is that the foreground moves, however arbitrarily, against a background that is static in the world frame. While this is not an easy problem in the egocentric setting, they observe there are many regularities, two of which are particularly useful for figure-ground separation: hands and objects tend to appear near the center of the view, and body motion tends to be small and horizontal. They first normalize the optical flow using the estimated average background motion, then fit the normalized flow into affine layers, and feed the combined cues into GraphCut [33]. The resulting segmentation, weighted by a scoring scheme for recognition, is used to update appearance models and is transferred to the next frame using the flow. Before GraphCut, a max-margin linear classifier is used together with ground truth segmentation to choose from the set of figure-ground features. They show that they can robustly segment out foreground objects and hands on a large dataset of 100K frames, robust to many challenges such as an uncontrolled camera, large motion, rapid scale and appearance changes and low video quality. Moreover, they show that figure-ground segmentation significantly improves accuracy in handled object recognition, especially when combined with temporal integration, reaching 86% on the challenging Intel [34] egocentric object recognition benchmark (Figure 2.1). Their system is, however, still too slow to deploy on portable devices.

Figure 2.1: Example segmentations of [32] from the Intel egocentric object dataset.

Another important work based on the one above is [35]: Fathi et al. delivered an unsupervised bottom-up segmentation method, which exploits the structure of the egocentric domain to partition each frame into hand, object and background categories, by using Multiple-instance Learning (MIL). According to them, egocentric vision provides many advantages: there is no need to instrument the environment by installing multiple fixed cameras, the object being manipulated is less likely to be occluded by the user’s body, and discriminative object features are often available since manipulated objects tend to occur at the center of the image and at an approximately constant size. They first segment foreground regions containing hands and active objects from the background, and then learn a model to separate hands from objects, with a prior assuming the dominance of the hand in the foreground (similar colour histogram of the superpixel) with respect to objects. Finally they refine into left and right hand, with both of the last two steps making use of GraphCut [33]. Their segmentation model is based mainly on four hypotheses to differentiate foreground from background:

1. The background is assumed to be static, and it is initially estimated by fitting the fundamental matrix to dense optical flow vectors.

2. The foreground is every moving entity.

3. Background objects are assumed farther away from camera.

4. A panorama of background scenes can be built by stitching background images together using an affine transformation, assuming the background is roughly on a plane. The temporally local panoramas of the background are based on colour and texture histograms and a model of region boundaries.

Their method, being completely unsupervised (not using initial ground truth segmentation to learn priors of hands), achieves a 48% error rate, outperforming the 67% error by Ren and Gu [32] on 1000 annotated frames of the Intel dataset.

Later, in 2013, Li and Kitani [36] addressed the task of pixel-level hand detection in the context of egocentric cameras as well, presenting new datasets (EDSH1, EDSH2 and EDSH-kitchen) which contain 200 million labeled pixels of hand images under various illuminations. These kinds of videos exhibit extreme transitions in lighting, making them difficult for traditional background subtraction. At the same time, the intrinsic colour of hands does not change dramatically over time. The three past approaches were roughly:

• Local appearance-based detection, such as skin-colour regions, which needs to be combined with trackers to account for static and dynamic appearance.

• Global appearance-based detection, such as templates from 3D models, which, however, requires searching over a large space.

• Motion-based detection, not applicable in the case of no hand motion or camera motion.

They modeled hand appearance with local features through three experiments: different patch sizes for colour features, feature selection over feature modality and feature selection over sparse descriptor elements.

Doing so, the CIELAB colour space was found to be the best, alongside the use of a small patch: this increased performance over a single-pixel classifier by 5%, confirming the intuition that observing the local context should help to disambiguate hand regions. In addition, the performance plateaus after 20 to 30 feature elements because only a few sparse features combined, HSV, CIELAB, Gabor, HOG and perhaps Scale-invariant Feature Transform (SIFT), are needed to achieve near-optimal performance.

To achieve more robust performance they modeled global appearance as well, training a collection of regressors indexed by a global colour histogram to model illumination changes, a prevalent phenomenon in wearable cameras. In conclusion, they outperformed the baseline of Fathi, Ren, and Rehg [35], reaching an F1 score of 0.835 on the EDSH2 dataset and observing on average a 15% performance increase (Figure 2.2). Failure cases happen under extreme conditions such as complete saturation (part of the scene and the hands are purely white), insufficient lighting (very dark) and high-contrast shadows. Therefore the combination of local appearance, a global shape prior and more expressive global illumination models can enable better higher-level tasks such as GR and hand tracking.

Figure 2.2: Results comparison between [36] and baseline [35].
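As a rough illustration of the colour-feature classifiers discussed above (e.g. [36]), the sketch below trains a per-pixel hand/background classifier on CIELAB values plus some blurred local context; it is my own simplified example rather than the authors’ code, and the feature choice, blur size and forest parameters are assumptions.

```python
# Minimal sketch of per-pixel hand/background classification from local colour
# features in CIELAB; ground-truth masks are assumed to be binary images.
import cv2
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def lab_features(bgr: np.ndarray) -> np.ndarray:
    lab = cv2.cvtColor(bgr, cv2.COLOR_BGR2LAB).astype(np.float32)
    # A blurred copy gives each pixel a little local context, mimicking the
    # benefit of small patches over single-pixel classification.
    context = cv2.GaussianBlur(lab, (7, 7), 0)
    return np.concatenate([lab, context], axis=-1).reshape(-1, 6)

def train_pixel_classifier(images, masks) -> RandomForestClassifier:
    X = np.concatenate([lab_features(img) for img in images])
    y = np.concatenate([(m > 0).astype(np.uint8).reshape(-1) for m in masks])
    clf = RandomForestClassifier(n_estimators=20, max_depth=12, n_jobs=-1)
    clf.fit(X, y)
    return clf

def predict_mask(clf, bgr: np.ndarray) -> np.ndarray:
    proba = clf.predict_proba(lab_features(bgr))[:, 1]
    return (proba.reshape(bgr.shape[:2]) > 0.5).astype(np.uint8) * 255
```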

Indeed, in 2013, the main finding by Serra et al. [37] is that hand segmentation on egovision scenes improves GR, which is a crucial reason for the importance of this work for the company. They segment the hands through RDF classifiers on superpixels instead of single pixels for memory and complexity concerns. Each superpixel is computed through Simple Linear Iterative Clustering (SLIC), a K-means-based local clustering in a 5D space: the International Commission on Illumination color space (CIELAB), also called LAB, plus the two spatial coordinates. Features proven to be robust are CIELAB and HSV for colour, as well as Gabor features for texture and Histogram of Oriented Gradients (HOG). Since hands viewed under a similar global appearance share a similar distribution in feature space, images are clustered with K-means and the RDF classifiers are trained. The optimal number of classifiers depends on the characteristics of the dataset, increasing when the latter is more varied. This method is similar to the one by Li and Kitani [36], but the superpixels also consider semantic coherence in time and space. The temporal smoothing to delete blinking regions is obtained with the weighted combination of the previous frames (where h stands for a pixel classified as hand):

P(h) = \sum_{k=0}^{d} w_k \left( P(h \mid h)\, P(h) + P(h \mid \bar{h})\, P(\bar{h}) \right)    (2.1)

On the other hand, the final goal of spatial consistency, performed through the GrabCut algorithm [38] on the posterior probability of every pixel, is to prune away small and isolated pixel groups that are unlikely to be part of hand regions and to aggregate bigger connected pixel groups. Results are shown in Figure 2.3.

Figure 2.3: Sample results of the hand segmentation and GR algorithm of [37].
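A minimal sketch of temporal smoothing of per-pixel hand probabilities over the previous d frames, as one plausible reading of Eq. (2.1), is given below; the exponential weights and window size are illustrative choices of mine, not the ones used in [37].

```python
# Weighted combination of hand-probability maps over the last d frames.
import numpy as np

def smooth_probability(prob_history, d=5, decay=0.7):
    """prob_history: list of HxW probability maps in [0, 1], most recent last."""
    frames = prob_history[-(d + 1):]
    # Older frames get exponentially smaller (illustrative) weights w_k.
    weights = np.array([decay ** k for k in range(len(frames) - 1, -1, -1)])
    weights /= weights.sum()
    # Weighted sum over the temporal axis -> smoothed HxW map.
    return np.tensordot(weights, np.stack(frames), axes=1)
```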

The last fundamental non-DL-based work was done in 2016 by Betancourt et al. [39], who present a hierarchical strategy to segment and identify the left and right hands of the user in egocentric videos. They highlight how existing first-person vision methods handle hand segmentation as a background-foreground problem, ignoring some important factors:

• Hands are not a single “skin-like” moving element, but a pair of interacting cooperative entities, affected by light changes.

• Close hand interactions may lead to hand-to-hand occlusions and, as a consequence, create a single hand-like segment.


• The user has at most one left and one right hand.

In their method it is important to note that the three levels, hand segmentation, occlusion detection and hand identification, are mutually independent, which makes it possible to improve them separately. The first level of the proposed method is a multi-model structure that finds the hand pixels in each egocentric frame, using CIELAB as the best colour space, an HSV histogram as a global feature, K trained RDF classifiers and a final post-processing step. Experimental results show that the final level, hand identification, relying on a Maxwell function of angle and horizontal position to decide whether a hand-like segment is left or right, has 99% confidence, improving over the state-of-the-art Support Vector Machine (SVM).

Finally, combining these two levels, they are able to achieve an F1 score of around 0.92 and a throughput of 30 FPS, thanks to the GPU and an image resampler. The second level, executed if required, is the hand-to-hand occlusion disambiguation, addressed by exploiting temporal superpixels, which can lead to improvements of around 10% in L/R hand segmentation.

Before discussing the fully DL-based hand segmentation works, it is worth mentioning a really interesting hybrid approach from 2015 by Bambach et al. [40], which also has the great contribution of creating a new large egocentric dataset called EgoHands (Figure 2.4). Although their pipeline heavily relies on DL, the segmentation step is performed without it, in a more traditional manner. Hand detection and tracking are fundamental problems of egocentric vision, both for computers and people; in fact, neuroscientists have discovered specific parts of the brain that respond to identifying our own hands, since “feeling of ownership of our limbs is a fundamental aspect of self-consciousness”. They developed a hand detection method through a CNN, introducing a simple candidate region generation approach combining spatial biases and appearance models in a unified probability framework, which outperforms existing techniques at a fraction of the computational cost (1500 windows per frame in just 78 ms). For the pixel-wise hand segmentation step, they assume most pixels inside a box produced by the detector correspond to a hand: a simple skin colour model is used to estimate an initial foreground mask and an aggressive threshold marks all pixels within the box as foreground except those having a very low probability of being skin. This is the well-known semi-supervised segmentation algorithm GrabCut [38], which is carefully not applied on the entire image because arms, faces and other hands would confuse the background colour model. Instead, they use a padded region around the bounding box, ensuring that only local background content is modeled. The union of the output masks for all detected boxes is taken as the final segmentation. Their technique achieves significantly better accuracy, 55.6%, than the baseline of [36] (47.8%). Another CNN is fine-tuned for hand-based GR, classifying whole frames as one of four different activities, resulting in 50.9% classification accuracy, still roughly twice random chance. In conclusion, their results suggest that increased hand segmentation accuracy could deliver high activity recognition accuracy without the need to recognize objects or backgrounds.

Figure 2.4: EgoHands dataset [40] with ground truth annotations.
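The following sketch illustrates GrabCut applied inside a padded detection box, in the spirit of [40]; the padding factor, iteration count and function name are illustrative assumptions, not the authors’ implementation.

```python
# GrabCut on a padded region around a detected hand box (illustrative sketch).
import cv2
import numpy as np

def segment_hand_in_box(bgr, box, pad=0.25, iters=5):
    """bgr: 8-bit BGR frame; box: (x, y, w, h) detection. Returns a binary mask."""
    x, y, w, h = box
    px, py = int(w * pad), int(h * pad)
    x0, y0 = max(0, x - px), max(0, y - py)
    x1, y1 = min(bgr.shape[1], x + w + px), min(bgr.shape[0], y + h + py)
    roi = bgr[y0:y1, x0:x1]

    # Initialise GrabCut with the (unpadded) box as probable foreground; the
    # padded border supplies only local background content.
    mask = np.zeros(roi.shape[:2], np.uint8)
    bgd, fgd = np.zeros((1, 65), np.float64), np.zeros((1, 65), np.float64)
    rect = (x - x0, y - y0, w, h)
    cv2.grabCut(roi, mask, rect, bgd, fgd, iters, cv2.GC_INIT_WITH_RECT)

    hand = np.zeros(bgr.shape[:2], np.uint8)
    hand[y0:y1, x0:x1] = np.where(
        (mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD), 255, 0)
    return hand
```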

Some researchers instead believe hand segmentation is not needed for CNN-based hand GR [41].

2.3 Deep Learning-based methods

It is important to notice that the main focus of this thesis is to apply DL methods to the hand segmentation problem, which needs to achieve both high accuracy and real-time performance. Hence, the following illustrates some important research works in this field, which are considerably fewer than the figure-ground segmentation approaches explained above. However, it has been necessary to explain how the traditional algorithms addressed the problem in order to infer key observations about the challenges to face.

In 2016 Vodopivec, Lepetit, and Peer [42] developed a method for extracting accurate hand masks in egocentric views based on a novel DL architecture, since skin colour-based methods are prone to fail if other parts of the image have similar colours. Their work was inspired by Tu and Bai [43], who developed a segmentation method in which the segmenter is iterated and the segmentation result of the previous step is used in the next iteration together with the original image. Differently, their initial segmentation is performed on a 16 times down-scaled version of the input, which is then up-scaled before passing to the second iteration, allowing context to be taken into account efficiently along with precise localization of segmentation boundaries, while avoiding pooling. In this manner they achieved two goals:

• Avoiding oversmoothed segments in comparison with usual DL architectures like SegNet [44].

• Speeding up the system which was too computationally intensive for real-time.

Their architecture (Figure 2.5), based on a multi-scale analysis of the input, is then split in two to obtain efficiency. Initially, three chains of three convolutional layers are applied to the input image, the input down-scaled by 2 and the input down-scaled by 4. Pooling layers are not used in order not to lose fine details. The output of the three chains is concatenated into a fully connected logistic regression layer whose output is the estimated probability map of the segmentation. Splitting the network provides advantages, since the first part runs on a small version of the original image, making it possible to considerably reduce the number of feature maps with smaller filters in the second part without losing accuracy. Training settings are a Leaky ReLU activation function, a boosted cross-entropy loss function and augmentation with simple geometric transformations such as scaling, rotation, shear and translation to avoid overfitting. In the end they achieved 99.3% accuracy and 39 ms (∼26 FPS) on their egocentric dataset of 248 images, with all layers using 3×3 filters. Up-scaling without the original image leads to less processing time but a decrease of the accuracy from 99.3% to 98.6%. In conclusion, occlusions are crucial and the authors showed that starting with low-resolution processing of the image helps capture the context, while using the input image a second time helps capture the fine details of the foreground hand.

Figure 2.5: Multi-scale split proposed structure [42] and visual results on their dataset.
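A much-simplified sketch of such a multi-scale design is given below (not the authors’ exact network): three convolutional chains process the input at full, half and quarter resolution, the downscaling is applied to the input itself rather than via pooling inside the chains, and the outputs are fused into a per-pixel probability map. The filter counts and input size are illustrative.

```python
# Simplified multi-scale segmentation model in Keras (illustrative only).
import tensorflow as tf
from tensorflow.keras import layers

def conv_chain(x, filters=16):
    # Three convolutional layers with Leaky ReLU, no pooling inside the chain.
    for _ in range(3):
        x = layers.Conv2D(filters, 3, padding="same")(x)
        x = layers.LeakyReLU(0.1)(x)
    return x

def multi_scale_segmenter(input_shape=(256, 256, 3)):
    inp = layers.Input(shape=input_shape)
    full = conv_chain(inp)
    half = conv_chain(layers.AveragePooling2D(2)(inp))      # input down-scaled by 2
    quarter = conv_chain(layers.AveragePooling2D(4)(inp))   # input down-scaled by 4

    merged = layers.Concatenate()([
        full,
        layers.UpSampling2D(2, interpolation="bilinear")(half),
        layers.UpSampling2D(4, interpolation="bilinear")(quarter),
    ])
    # Per-pixel probability of "hand".
    out = layers.Conv2D(1, 1, activation="sigmoid")(merged)
    return tf.keras.Model(inp, out)
```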

DL is also applied to segment hands in [45] on two relatively simple datasets of grey-level palm-print images with cluttered backgrounds, with a total of 12980 images. Images are resized to a lower resolution of 135×180 because of DL complexity, then thresholded to remove the background, which is replaced with 100 possible fictitious textures from the Describable Texture Dataset (DTD) (basically an augmentation tool). They used two different DNNs with the same number of parameters to segment hands: a U-shaped network [46] made of 13 layers and a basic SegNet [44], an autoencoder-based model made of 4 layers to encode and 4 to decode; “mean binary cross-entropy” is used as the loss function. In conclusion, the U-shaped network achieved slightly better results on the test dataset than the basic SegNet (accuracy 99.7% versus 99.51%, F1 score 99.72% versus 99.55%).

The main work to discuss is the one from 2018 by Urooj and Borji [47], based partially on Bambach et al. [40]: they fine-tuned a leading semantic segmentation method called RefineNet [48], improving state-of-the-art pixel-level hand detection (a 26% improvement with respect to [40] on the EgoHands dataset). Another important contribution is the creation of two new datasets: EgoYouTubeHands (EYTH), with hand videos from YouTube recorded in wild conditions, and Hand Over Face (HOF), with 300 images to manage similar-appearance occlusions (Figure 2.6).

Figure 2.6: Samples from four hand datasets including EgoHands, EYTH, GTEA and HOF used in [47].

Furthermore, they demonstrated the benefit of training a CNN (AlexNet [49]) for hand-based activity recognition on hand segmentation maps obtained by their network, achieving higher accuracy (59.2% versus 58.6%).

They treated hand segmentation as a dense prediction problem, also called a semantic segmentation task (Section 3.2.1), in contrast to the object detection approach of [40]. Although in the past Fully Convolutional Networks (FCN) [50] were used for segmentation, or more recently PSPNet [51], they adopted RefineNet [48]. As explained in Section 3.7.4, it is a cascade architecture of multiple blocks composed mainly of RCUs, computing features from ResNet-101 [52] at different levels and fusing them to produce a high-resolution prediction map. Multi-scale evaluation with three scales increased performance.

In Tab. 2.1, results in terms of pixel-level mIoU, which is positively correlated with the F1 score, can be observed for each dataset when the architecture is trained on it (meanPrecision and meanRecall were evaluated too).

Dataset EgoHands EYTH GTEA HOF

Tr. Epochs 80 85 92 61

mIoU 81.4% 68.8% 82.1% 76.6%

Table 2.1: Quantitative results on different datasets of [47] method.

Alongside a robust segmentation method, appropriate hand segmentation datasets are also important to achieve the goal. Hence they performed a cross-dataset evaluation, highlighting how RefineNet-101 trained on their dataset, “wilder” and more varied, generalizes better.

Failure cases, where mIoU < 0.6, intuitively involve motion blur, occlusions, skin-appearance occlusions, small hands, and lighting that is either too bright or insufficient. Although CRFs, well known for their utility in refining pixel-level predictions for CV problems such as saliency detection, help to visually improve fine-level details like fingers (Figure 2.7), in particular in the case of hand-to-hand occlusion, overall they slightly hurt in terms of mIoU. Small hands in particular are more challenging for segmentation. For the future, using higher-level information such as superpixels needs to be considered.

Figure 2.7: Qualitative results on the EgoHands dataset using baseline [40], RefineNet [47] and RefineNet+CRF.
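For reference, the sketch below shows how the pixel-level mIoU used above can be computed for a binary hand/background problem; it is a generic illustration, not the evaluation code of [47].

```python
# Mean IoU over the background and hand classes (binary masks assumed).
import numpy as np

def mean_iou(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-7) -> float:
    ious = []
    for cls in (0, 1):                      # background, hand
        p, g = (pred == cls), (gt == cls)
        intersection = np.logical_and(p, g).sum()
        union = np.logical_or(p, g).sum()
        ious.append((intersection + eps) / (union + eps))
    return float(np.mean(ious))
```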

HGR-Net [53] is a two-stage CNN architecture for robust recognition of hand gestures, first performing accurate semantic segmentation to determine hand regions and then identifying gestures. The segmentation stage architecture (Figure 2.8) is based on two main parts, combining a deep fully convolutional residual network [52] for learning useful representations and an Atrous Spatial Pyramid Pooling (ASPP) module (successfully employed in DeepLabv3 [54], see Section 3.7.4) to encode multi-scale context by adopting multiple dilation rates, finally up-sampling the output with bi-linear interpolation. Although the segmentation sub-network is trained without depth information, it is particularly robust against challenges such as illumination variations and complex backgrounds. Depth-sensing devices might not be suitable for all environments, especially in outdoor scenarios, and add to the computational burden. Both online and offline data augmentation is performed to prevent overfitting. The results, evaluated with the F1 score, outperform state-of-the-art segmentation methods on the HGR1 dataset, achieving 23 ms per frame (43 FPS) with a 320x320 input image.
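A hedged sketch of an ASPP-style block with multiple dilation rates, similar in spirit to the first stage of HGR-Net, is given below; the dilation rates and filter counts are illustrative, not the ones from [53].

```python
# ASPP-like block: parallel dilated convolutions capture multi-scale context.
import tensorflow as tf
from tensorflow.keras import layers

def aspp_block(x, filters=64, rates=(1, 6, 12, 18)):
    branches = [
        layers.Conv2D(filters, 3, padding="same", dilation_rate=r,
                      activation="relu")(x)
        for r in rates
    ]
    x = layers.Concatenate()(branches)
    # 1x1 projection back to a compact feature map.
    return layers.Conv2D(filters, 1, activation="relu")(x)
```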

Figure 2.8: HGR-Net proposed structure [53], with the first stage dedicated to segmentation.


Works applying DL to the segmentation of hands while exploiting depth information [55] also deserve a mention. For instance, Bojja et al. [56] created a large-scale Red-Green-Blue-Depth (RGB-D) hand segmentation dataset using an Intel RealSense SR300 sensor and coloured gloves, and then segmented it appropriately for real-time hand tracking applications. They proposed a new CNN leveraging strided (transposed) convolutions in place of unpooling layers, achieving good generalization capabilities, an mIoU of 97.2% and a 5 ms (200 FPS) test time on GPU. While SegNet [44] memorizes max-pooling indexes and DeconvNet [57] uses more computationally intensive transposed convolutions, their method is a hybrid encoder-decoder (Figure 2.9) consisting of a hierarchy of deconvolutional layers (à la DeconvNet) and skip-connections (à la U-Net [46]) to improve sharpness.

Figure 2.9: HandSeg proposed structure [56], compared with DeconvNet [57] and SegNet [44].
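The design choice between memorizing pooling indices (SegNet-style unpooling) and learning strided transposed convolutions can be illustrated with the minimal PyTorch sketch below; tensor and layer sizes are arbitrary assumptions and do not reproduce any of the cited architectures.

# Minimal sketch contrasting two common decoder up-sampling strategies.
import torch
import torch.nn as nn

x = torch.randn(1, 16, 32, 32)

# SegNet-style: max-pooling stores argmax indices that the decoder reuses for unpooling.
pool = nn.MaxPool2d(kernel_size=2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(kernel_size=2, stride=2)
y, idx = pool(x)             # y: (1, 16, 16, 16), plus the pooling indices
x_unpooled = unpool(y, idx)  # back to (1, 16, 32, 32), sparse non-zero entries

# HandSeg-style: a learned strided (transposed) convolution doubles the resolution
# without needing to memorize pooling indices.
deconv = nn.ConvTranspose2d(16, 16, kernel_size=2, stride=2)
x_upsampled = deconv(y)      # (1, 16, 32, 32), dense learned up-sampling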

Kang et al. [58] proposed a hand segmentation method for HOI using only a depth map, consisting of a two-stage RDF method that first detects hands and then segments them. Despite the challenge of the small depth difference between a hand and objects during an interaction, the proposed method achieves high accuracy in a short processing time compared to other state-of-the-art methods. Although Kang, Lee, and Nguyen [59] did not focus on hand segmentation, they developed a depth-adaptive DNN using a depth map for semantic segmentation on an RGB-D dataset.

They adapted the receptive field not only for each layer but also for each neuron at its spatial location, proposing a Depth-adaptive Multi-scale (DaM) convolutional layer consisting of an adaptive perception neuron and an in-layer multi-scale neuron, reaching a mIoU of 83.8 and an F1 score of 81.1 on a new HOI dataset created with the Microsoft Kinect V2 sensor. The interesting work by Cruz and Chan [60] uses DL as well, in particular zoomed NNs, but only for hand detection and disambiguation, focusing on detecting left and right hands as different objects.

All the aforementioned methods heavily rely on a few fundamental DL segmentation networks, such as SegNet [44], DeconvNet [57], U-Net [46], DeepLab [61] and RefineNet [48], and on backbone networks such as ResNet [52] and MobileNet [62], which will be partially discussed in Chapter 3.

2.4 Summary

In conclusion, extensive research in the field of hand segmentation has mainly been centered around skin-colour modeling, figure-ground segmentation in egocentric videos and, more recently, CNNs. The most attractive solutions in the field utilise DNNs to directly segment hands and have been proven to provide accurate results, possibly in real-time, being able to capture hidden patterns useful for classifying hand pixels which traditional methods were not able to find. Hence, they could be suitable for use in AR smartphone applications. Furthermore, it is noteworthy that several works reach different accuracies and computational times on different datasets, each claiming to be the most promising. To bridge this gap in research, the proposed solution will be trained on a more varied, purposely built dataset, including valuable data collected by MM [4], and an analysis regarding the effects of different DL structures on the segmented hands' appearance will be provided.


Chapter 3

Background Theory

3.1 Overview

This chapter summarizes the key theoretical principles necessary to grasp the proposed DL solution for hand segmentation. Section 3.2 reviews the segmentation problem in CV, while the later sections describe CNN theory and its use to solve the segmentation problem.

3.2 Segmentation

In CV and image processing, a subfield of Digital Signal Processing (DSP), image segmentation is the process of dividing an image into similar regions, which may or may not be semantically meaningful. From a pattern recognition perspective, it means assigning each pixel to a certain class and hence it can be thought of as a per-pixel classification problem.

Reliable segmentation is possible with prior information, but such information is typically not available. It is a useful step because, according to Gestalt theory [63], the human brain does it automatically, looking for patterns such as proximity, similarity, continuity, closure, common regions and connectedness. There are mainly four types of image segmentation:

1. Luminance-based segmentation for grayscale images

• Optimal supervised thresholding: achieved by minimizing the probability of error when modeling the two classes, foreground and background, each with a Probability Density Function (pdf).

• Maximum a Posteriori Probability (MAP) detector: choose the class with the maximum a posteriori probability $p(\Omega_i|z)$, based on Bayes' rule $p(\Omega_i|z) = \frac{p(z|\Omega_i)\,p(\Omega_i)}{p(z)}$. It can also be applied to colour, in a higher-dimensional space.

• Unsupervised thresholding: minimize the within-class variance of foreground and background, or equivalently maximize the between-class variance, as in Otsu's method [64].

2. Figure-ground segmentation

• Thresholding, for instance on the image histogram: it can be based on P-tile method, mode method, iterative method or adaptive method.

• Level-set methods: such as active contour models (“snakes”).

The shape is represented by a curve which is found by minimizing an energy $E = E_{\text{internal}} + E_{\text{image}}$, meaning making the curve as short and straight as possible (minimizing $E_{\text{internal}}$) and aligning it with strong image gradients along the curve (minimizing $E_{\text{image}} = E_{\text{edge}}$) (Figure 3.1). A strength is that it achieves very good results on medical images; a weakness is that it can easily get stuck on wrong sharp edges.

Figure 3.1: Active contour segmentation on Magnetic Resonance Imaging (MRI).

• Energy minimization with graphs: in algorithms such as GraphCut [33] or GrabCut [38], images are treated as graphs whose edge weights are based on the similarity between pixels (Figure 3.2), with the goal of finding the lowest-cost split of the graph into two pieces. Prior probabilities are used to model the two classes (a minimal GrabCut usage sketch is given after this list).


Figure 3.2: Graph cuts and energy minimization.

3. Colour-based segmentation

• Chroma keying: colour is more powerful than luminance alone for pixel-wise segmentation. It is used, for instance, in the green-screen technique for visual effects and post-production.

• Linear discriminant function: $\sum_{i=1}^{n} w_i f_i + w_0 \geq 0$, to segment images with $n$ features $f_i$ into two categories.

4. Generic image segmentation

• Split-and-merge: hierarchically splitting image regions until their boundaries are weak and the variations within each region are small enough, and then merging neighbours as long as the variations remain small. It is considered an outdated method.

• Watershedding: creating a topological map over the image domain, for example using the gradient magnitude. Pixels with the lowest values form the basins for the initial watersheds, which are filled from the deepest points. When two basins meet, an edge between two segments is created, and the method ends when all pixels are either filled or edge pixels. Despite usually leading to over-segmentation, it is an efficient way to create superpixels that can then be grouped, for example by merging (useful for interactive segmentation).

• K-means clustering: grouping pixels based on similarity in colour (or any other measure), associating them to different clusters. It is an iterative algorithm which chooses K initial mean values (randomly or starting from the centroids), assigns each pixel to the closest mean, updates the means and continues until convergence (Figure 3.3); a minimal usage sketch is given after this list. Some drawbacks are the splitting of some segments and the difficulty of initializing and finding the right K. If we assume a pixel contains a combination of colours from multiple clusters, a Gaussian Mixture Model (GMM) updated with expectation-maximization can be used to find the maximum-likelihood estimate of the model parameters, allowing clusters to be elliptic.

Figure 3.3: K-means segmentation.

• Mean-shift: a non-parametric method which efficiently finds peaks in the data distribution (usually 5D: colour and spatial coordinates) without computing the complete density function explicitly (Figure 3.4). Methods based on histograms neglect the dependency between neighbouring pixels, sometimes causing segments to be split into different pieces, whereas mean-shift takes spatial coherence into account. Small kernels, usually Gaussian, $K(x - x_i)$ are placed around each sample $x_i$ with maximum density in the centre. Finding the maxima of the total density function $f(x) = \frac{1}{n}\sum_{i=1}^{n} K(x - x_i)$ leads to the following weighted sum:

$$x_{\text{new}} = \frac{\sum_{i=1}^{n} x_i\, k'(\|x - x_i\|^2)}{\sum_{i=1}^{n} k'(\|x - x_i\|^2)} \tag{3.1}$$

The density will have peaks, also called modes; starting from a particular pixel and performing gradient ascent, the algorithm converges to one of these modes, clustering the image based on the modes the points converge to. It is slower than K-means but usually obtains better results.

Figure 3.4: Mean-shift segmentation.

• Normalized cuts [65]: also an energy-minimization, graph-based method. It maximizes within-cluster similarities while minimizing across-cluster similarities by minimizing Eq. 3.2, where A and B are two disjoint sets of vertices with $V = A \cup B$, $cut(A, B)$ is the sum of the edge weights between vertices in A and B, and $assoc(A, V)$ is the sum of the edge weights connected to any vertex in A:

$$Ncut(A, B) = \frac{cut(A, B)}{assoc(A, V)} + \frac{cut(A, B)}{assoc(B, V)} \tag{3.2}$$

It is computed by solving a generalized eigenvalue problem and usually prefers cuts of approximately equal size. Differently from GraphCut [33], it does not need prior information about the pixels.

• Semantic segmentation: mainly performed either through an energy-minimization method such as CRFs, with one node per pixel, links between all neighbouring nodes and a set of class models (for example represented by colour histograms), or through DL techniques [66] (Section 3.2.1). CRFs have also been formulated in a DL fashion, as a Recurrent NN (RNN), to perform semantic segmentation [67].
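To close this overview with concrete examples, the following is a minimal sketch of OpenCV's GrabCut, the graph-based figure-ground method mentioned above, initialized from a rough bounding box around a hand; the image path and rectangle coordinates are placeholder assumptions.

# Minimal sketch of figure-ground segmentation with GrabCut (OpenCV).
import cv2
import numpy as np

img = cv2.imread("hand.jpg")                    # BGR image (placeholder path)
mask = np.zeros(img.shape[:2], dtype=np.uint8)  # per-pixel GrabCut labels
rect = (50, 50, 200, 200)                       # rough (x, y, w, h) box around the hand

# Internal GMM models for background and foreground, required by the API.
bgd_model = np.zeros((1, 65), dtype=np.float64)
fgd_model = np.zeros((1, 65), dtype=np.float64)

cv2.grabCut(img, mask, rect, bgd_model, fgd_model, 5, cv2.GC_INIT_WITH_RECT)

# Pixels marked as sure or probable foreground form the hand mask.
hand_mask = np.where((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD), 255, 0).astype(np.uint8)
segmented = cv2.bitwise_and(img, img, mask=hand_mask)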

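Similarly, the colour-clustering idea behind K-means segmentation can be sketched with OpenCV's built-in kmeans; the number of clusters K and the image path are illustrative assumptions.

# Minimal sketch of K-means colour clustering for segmentation (OpenCV).
import cv2
import numpy as np

img = cv2.imread("hand.jpg")                    # placeholder path
pixels = img.reshape(-1, 3).astype(np.float32)  # one row per pixel, colour features only

K = 3
criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 20, 1.0)
_, labels, centers = cv2.kmeans(pixels, K, None, criteria, 10, cv2.KMEANS_RANDOM_CENTERS)

# Replace each pixel with its cluster centre to visualize the segmentation.
segmented = centers.astype(np.uint8)[labels.flatten()].reshape(img.shape)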

3.2.1 Semantic Segmentation

One of the keys when the HVS describes a scene is to decompose it into separate entities. In CV, object detection methods can draw bounding boxes around certain objects, but true human-level understanding requires labeling each entity with pixel-level precision too, a crucially important step in fields such as autonomous cars and intelligent robots [68]. Although semantic segmentation has been an important problem in the CV community since 2007, the major breakthrough came when fully convolutional networks were first used in 2014 [50], and performance has kept improving with better DL architectures. It is therefore fundamental to understand the definition of semantic segmentation, especially when performed through DL, and how it differs from some concepts with which it is often confused:

Figure 3.5: Difference between semantic and instance segmentation, courtesy of [10].

1. Semantic segmentation [69] assigns a category label to each pixel, for known object classes only, achieving fine-grained inference by making dense predictions. In recent years, Recurrent NNs (RNNs) [70] and even more complex DL architectures such as Generative Adversarial Networks (GANs) [71] have been proposed to solve this task [72] [73]. Different parts of a visual input are classified into semantically interpretable classes. Unsupervised methods such as clustering can also be used for segmentation, but their results are not necessarily semantic because they are not trained on classes and more generally only find region boundaries [74] [75]. Semantic segmentation achieves a more detailed understanding of imagery than image classification or object detection, being critical in a wide range of applications such as autonomous driving, medical imaging or, as in the case of this thesis, hand GR systems.
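As a minimal illustration of DL-based semantic segmentation, and explicitly not the hand segmentation model developed in this thesis, a pretrained DeepLabV3 model from torchvision can be run on a single image as sketched below; the image path is a placeholder and the pretrained argument follows older torchvision releases (newer ones use weights= instead).

# Minimal sketch of off-the-shelf DL semantic segmentation with torchvision.
# This is only an illustration; it is NOT the hand-segmentation model of this thesis.
import torch
import torchvision.transforms as T
from torchvision.models.segmentation import deeplabv3_resnet50
from PIL import Image

model = deeplabv3_resnet50(pretrained=True).eval()  # 21 Pascal VOC classes

preprocess = T.Compose([
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

img = Image.open("frame.jpg").convert("RGB")          # placeholder path
with torch.no_grad():
    out = model(preprocess(img).unsqueeze(0))["out"]  # (1, 21, H, W) per-pixel logits
labels = out.argmax(dim=1).squeeze(0)                 # dense class-label map of shape (H, W)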
