DEGREE PROJECT IN INFORMATION AND COMMUNICATION TECHNOLOGY,
SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2018

3D Hand Pose Tracking from Depth Images using Deep Reinforcement Learning

SNEHA SAHA


Abstract

Low-cost consumer depth cameras have enabled reasonable 3D hand pose tracking from single depth images. Such 3D hand pose tracking can be an integral part of many computer vision applications such as gesture recognition and human activity tracking. However, 3D hand pose tracking remains an open research problem: tracking the hand must deal with non-rigidity due to finger articulation, complex background scenes, and occlusion, all of which make tracking a challenging task. In this work, we propose a new approach to track the 3D hand pose that captures both rigid and non-rigid hand gestures.

The common approach to hand pose tracking involves a dataset of hand images with corresponding ground-truth 3D hand poses, to which machine learning techniques such as randomized forests or deep learning with convolutional neural networks are applied to learn the mapping from the appearance of a hand in an image to its pose. These methods focus on improving the ability to distinguish the target hand pose from the background, but overlook the problem of inefficient search algorithms that explore the region of interest matched against the tracking model.

Recently, following the rapid success of AlphaGo and AlphaZero, there has been progress towards using deep neural networks trained by reinforcement learning.

Human-level performance has been achieved by such trackers, which pursue changes in the target through repetitive actions controlled by a neural network model. There has also been much research on learning policies from raw video data in complex RL environments. In this work, we propose a new methodology for modeling hand pose tracking, in which rigid and non-rigid hand movements are estimated and tracked with state-action value pairs using Reinforcement Learning (RL). Hand pose tracking is performed with a bounding box that localizes the gesture. We further propose that this model can be extended to estimate a skeleton in order to track the non-rigidity of the finger articulation of the hand.

Overall, our proposed approach opens a new way to address the hand pose tracking problem using deep RL as a self-learning procedure. To the best of our knowledge, our tracker is the first neural-network tracker that combines convolutional neural networks with RL algorithms to track hand gestures.


Sammanfattning

Low-cost depth cameras have made it possible to track hands effectively in 3D. Such 3D hand tracking can be an integral part of many computer-vision-based applications, such as gesture recognition and tracking of human activity. However, 3D tracking of humans is still an open research problem: tracking the hand requires treating it as a non-rigid object, owing to finger movements in complex background scenes and to occlusion. In this work we propose a new approach to tracking the hand in 3D space that captures both the rigid and the non-rigid parts of the hand.

A common way of solving the hand tracking problem is with machine learning algorithms such as Random Forests or convolutional neural networks. These algorithms are, however, better suited to object classification than to tracking.

Recently, the success of AlphaGo and later AlphaZero has led to relatively rapid progress in reinforcement learning combined with convolutional neural networks, achieving performance comparable to that of humans. There has also been much research aimed at learning policy models for reinforcement learning where the input data are raw video sequences. In this thesis we developed a new method for modeling hand tracking with reinforcement learning. The hand is localized and tracked with a bounding box. We also showed that this work can be extended to estimate, model, and track hand skeleton models.

This is essentially a new way of solving the hand tracking problem with reinforcement learning as a self-learning procedure. To the best of our knowledge, our method is the first tracking model that combines convolutional neural networks with reinforcement learning to solve the hand tracking problem.


Acknowledgment

I would like to thank my supervisors at ManoMotion¹, Dr. Thang Nguyen, Dr. Jean-Paul Kouma, and CTO Dr. Shahrouz Yousefi, for their help and guidance throughout this work. I have learned a lot from all of them in the past six months, and my interactions with them have brought profound changes in my perspective on research, technology, and teamwork. I would also like to thank my other colleagues at ManoMotion for making my work at the company so much fun and for all their support.

I would also like to extend my thanks to Prof. Markus Flierl for the counsel provided during the course of the thesis. Last but not least, I thank my parents and my sister for encouraging me to chase my dreams. They made me who I am.

¹ This thesis work was done in cooperation with ManoMotion AB.


Contents

1 Introduction
  1.1 Motivation
  1.2 Research Question
  1.3 Outline

2 Literature Review

3 Background
  3.1 Reinforcement Learning
  3.2 Markov Decision Process
  3.3 Model-free Methods
    3.3.1 Value Function Based Methods
  3.4 Q-Learning
  3.5 Exploration and Exploitation
  3.6 Deep Neural Network
    3.6.1 Basic Idea
    3.6.2 Network Layers
    3.6.3 Common Activation Function
    3.6.4 Batch Normalization
  3.7 Deep Q Learning

4 Framework for Hand Pose Tracking
  4.1 Data Collection and Data Set
  4.2 Hand Model
    4.2.1 Hand Pose Estimation - Rigid Transformation
    4.2.2 Hand Gesture Estimation - Non-Rigid Transformation
    4.2.3 Transformation Model
    4.2.4 Overall Procedure
    4.2.5 Architecture for DRL agent

5 Experiment and Result
  5.1 Test Environment
  5.2 Evaluation
    5.2.1 Hand Pose Estimation - Bounding box
    5.2.2 Optimal search path of RL
    5.2.3 Evaluation Metric

6 Conclusion and Discussion
  6.0.1 Future Research Direction


List of Abbreviations and Acronyms

DoF   Degree of Freedom

3D    Three-Dimensional

2D    Two-Dimensional

RGB   Red-Green-Blue color space

RL    Reinforcement Learning

CNN   Convolutional Neural Network

ReLU  Rectified Linear Unit

DRL   Deep Reinforcement Learning

DQN   Deep Q Network

HMM   Hidden Markov Model

PCA   Principal Component Analysis

PDM   Point Distribution Model

SVD   Singular Value Decomposition

MDP   Markov Decision Process

TD    Temporal Difference

IoU   Intersection over Union

OPE   One Pass Evaluation

AUC   Area Under Curve

SDK   Software Development Kit


Chapter 1

Introduction

1.1 Motivation

A human hand is an example of a complex articulated object that exhibits many degrees of freedom (DoF), self-similarities, self-occlusion, and constrained parameters. Hand gestures are an important type of natural language used in many research areas such as hand tracking, hand gesture recognition, and human-computer interfaces. Hand gesture estimation requires prior determination of the hand position through estimation and tracking. One of the most effective strategies for hand tracking is to use 2D visual information such as the color and shape of the hand. However, visual-sensor-based hand tracking systems based on color and shape are very sensitive when tracking is performed under variable lighting conditions. One of the most widely used strategies for visual object detection is based on exhaustive spatial hypothesis search. While methods like sliding windows have been successful and effective for many years, they are still brute-force, independent of the image content and the visual category being searched. With the arrival of depth cameras and notable progress in machine learning in the past few years, research on human hand tracking and pose inference from 3D data has gained popularity and become an active area of research. Moreover, as hand movements are made in 3D space, the recognition performance of hand gestures using 2D information is inherently limited.

Although there has been much work on hand tracking over the past decades using random forest and CNN techniques [11], human hand motion exhibits high degrees of freedom with large viewpoint variations and partial occlusion, which still make the hand pose estimation problem very challenging. All these techniques focus on improving the ability to distinguish the target from the background using an appearance model, and may thus overlook the following problems: (1) inefficient search algorithms that explore the region of interest and select the best candidate by matching against the tracking model, and (2) the need for a large number of labeled tracking sequences for training. In many real-world applications, it is expensive or impossible to collect a huge amount of training data. The idea is therefore to interactively train a tracker with far more limited supervision and allow it to explore the region of interest.


There are also sequential models used for tracking; these models work on evidence collected from a series of small sequential images in order to detect objects effectively. In this work, we therefore address hand gesture tracking as a sequential search problem. The sequential search can be formulated in a reinforcement learning framework to design a search policy (including the stopping condition). Reinforcement learning (RL) is a general paradigm in which an agent learns to control a dynamic system (its environment) through examples of real interactions, without any model of the physics ruling this system. The agent learns directly from the video input, the reward, the terminal signal, and the set of possible actions, just as a human player would. A feedback signal is observed by the agent after each interaction as reward information, which is a local hint about the quality of the control. When addressing a reinforcement learning problem, one considers the system as made up of states and accepting actions that move the agent in the environment. A neural network agent can learn a mapping from states to actions (a policy) that maximizes the expected cumulative reward over the long term, which it locally models as a so-called value or Q-function.

RL also involves non-stationarity at several levels. First, as in many real-world machine learning applications, adaptation to non-stationary environments is a desired feature of a learning method, yet most existing machine learning algorithms assume stationarity of the problem and aim at converging to a fixed solution. Few attempts to handle non-stationarity of the environment in RL can be found in the literature. In recent years, Deep Reinforcement Learning (DRL) has achieved superior performance on complex sequential tasks by using a deep neural network as its function approximator and by learning directly from raw images.

However, using raw images as input to deep RL, so that the state feature representation must be learned from the raw images, requires a prohibitively large amount of training time and data to reach reasonable performance. This makes it difficult to use deep RL in real-world applications, especially when data is expensive. In this work, we model the tracker as an active agent with limited training data that must make decisions to maximize its reward, which is the correctness of a track of the hand gesture. Decisions ultimately specify where to devote finite computational resources at any point in time: should the agent process only a limited region around the currently predicted location ("track"), or should it globally search over the entire frame ("reinitialize")? Should the agent use the predicted image region to update its appearance model for the object being tracked ("update"), or should that region be ignored? Such decisions are notoriously complicated when image evidence is ambiguous (due to partial occlusions): the agent may continue tracking an object but perhaps decide not to update its model of the object's appearance. The agent is designed to track rigid and non-rigid hand gestures, initially by bounding box and later by skeleton estimation, and to model the hand appearance through exploration (sampling more image regions for better accuracy) and exploitation (stopping the search efficiently when sufficiently confident about the target's location). The use of RL enables even partially labeled ground-truth data to be utilized for semi-supervised learning.

Although there has been initial work applying deep reinforcement learning to estimate continuous parameters such as camera poses or object locations from a sequence of images [40], the corresponding task for higher-dimensional data such as the 26 degrees of freedom (DoF) of hand poses has not been clearly addressed. This work formalizes hand pose estimation in a new way by using Deep Reinforcement Learning to learn the parameters for hand gesture tracking. Initially, we propose to track the hand gesture with bounding box parameters, but the same approach can be used to estimate a skeleton for the hand gestures using a deep RL model for tracking non-rigid hand gestures. This work opens a new dimension for tackling the problem of 3D hand pose tracking from high-dimensional 3D data. The proposed approach is tried with 3D data provided by a depth camera; in principle, however, it can be generalized to RGB images acquired by a normal camera. Complex scenes and scenes with significant occlusions in a single view always pose problems for hand gesture tracking. In these situations, simultaneous tracking of articulated hand poses is challenging and also crucial for real-world applications of gesture recognition. With this aim in mind, we explicitly decompose the articulated hand motion into rigid motion and non-rigid dynamical motion. Rigid motion is approximated as the motion of a planar region and approached using a particle filter, while non-rigid dynamical motion is analyzed with a Hidden Markov Model (HMM) filter. Since all existing methods have difficulties in tracking non-rigid motion, the idea is to design an agent that interacts by selecting actions in such a way that it is able to estimate the hand gesture. In a similar way, we propose to design an agent that estimates the skeleton from the hand gesture, so the agent needs to learn the reduced hand joint parameters required to estimate the skeleton for each non-rigid hand pose gesture.

Another difficulty compared to 3D pose estimation at the level of the human body is the restricted availability of data. While human body pose estimation can leverage several motion capture databases, there is hardly any such data for hands, which makes 3D hand pose tracking an interesting field of research. In order to show the validity of our approach, a hand pose data set with large variation in hand shapes and sizes is necessary. Several real hand pose data sets are publicly available, but individually these data sets lack variation in hand gestures and subject size, number of original depth images, and complexity of hand poses. Therefore, we create our own data set with several types of defined and freehand gestures of different hand shapes. Since hand pose shapes have huge variation, for this work we use rigid hand movement poses and a grab posture for non-rigid movement. The depth data are collected using an Intel depth sensor, and the skeleton information used for verifying our results is collected using a Leap Motion sensor.

1.2 Research Question

The purpose of this thesis is to find answers to the following research questions:

• How can Deep Reinforcement Learning techniques be used to learn and solve hand pose estimation using evaluative feedback as rewards?

• What factors influence the agent to learn efficiently?


1.3 Outline

Chapter 2 provides a review of the research related to hand pose estimation and deep RL applications. Chapter 3 gives an elaborate background on Reinforcement Learning and neural network models. Chapter 4 provides the details of the method used in this thesis work. Chapter 5 shows the results obtained under various test scenarios, and Chapter 6 contains a discussion of the results along with conclusions and scope for future work.


Chapter 2

Literature Review

Vision-based hand pose estimation and tracking have been extensively studied in the literature over many years. The increased performance of tracking in the real world can be attributed to two dominating trends: depth imaging and deep learning. The problem of 3D hand pose tracking has long attracted attention in the computer vision community, as it plays a significant role in human-computer interaction such as virtual and augmented reality applications. Over the past few years, hand pose estimation techniques have shifted almost entirely to using only depth images, since depth sensors such as the MS Kinect and Intel RealSense have become widely available. As a 2.5D source of information, depth resolves many ambiguities present in monocular RGB input.

Secondly, deep learning has transformed the way vision problems are solved; the use of deep neural networks has made hand pose estimation much easier. But despite recent progress in this field [13, 20, 21, 22, 27], robust and accurate hand pose estimation remains a challenging task. Due to large pose variations and the high dimensionality of hand motion, it is generally difficult to build an efficient mapping from image features to articulated hand pose parameters in a depth image.

Hand pose tracking from 2D images suffers from self-occlusion of the hand, which creates ambiguity that is difficult to resolve and thus hurts accuracy in a way that is less pronounced in body pose estimation. By treating depth images as 2D images, we can avoid converting depth information to a volumetric representation, with its computational overhead and associated ambiguities. From a given depth map, a 2D CNN is used to capture local surface patterns, but the depth map can also be treated as a set of 3D points to arrive at a final pose estimate in 3D. In the standard hand pose tracking pipeline, the depth map is always treated as an image, and 3D hand pose estimation is treated as holistic regression aiming to directly map the depth image to 3D pose parameters such as joint angles or 3D coordinates. Dibra et al. proposed 3D hand pose estimation from a single depth image using the AlexNet [11] architecture and pre-training the model on synthetic data; however, this method is limited to a single hand shape. Oikonomidis et al. [23] formulated 3D tracking of hand joints as an optimization problem that minimizes the discrepancy between the 3D structure and appearance of hypothesized 3D hand model instances. Qian et al. [3] modeled a hand simply by a number of spheres and then proposed a hybrid method that combines gradient-based and stochastic optimization to estimate the 3D hand model with fast convergence and good accuracy. Chin Yun et al. [18] proposed a 3D hand skeleton model estimation algorithm from depth images using an Active Shape Model (ASM). Principal Component Analysis (PCA) appearance models have the advantage of being able to generate new appearances from a small training set, but their linear correlations limit their applications, and complex scenes, occlusion, and clutter generate serious distractions for these representations. This leads to a need for dimensionality reduction in all hand posture estimation approaches, also to reduce the associated computational complexity. Santello et al. [24] revealed that 90 percent of the variance in grasp data directed towards household objects could be described by as few as three principal components (PCs). Many other studies have since supported this view of dimensionality reduction for hand pose estimation and tracking [25, 26].

The most common hand pose estimation techniques can be classified into model-driven approaches and data-driven approaches. Model-based methods synthesize the image observation from hand geometry, define an energy function to quantify the discrepancy between the synthesized and observed images, and optimize the function to obtain the hand pose. Data-driven, or learning-based, methods learn a direct regression function that maps the image appearance to the hand pose, using for example isometric self-organizing maps [7], random forests [9, 21], or convolutional neural networks (CNNs) [27] to map image features to hand pose parameters. Several algorithms [31, 8] utilize CNNs pre-trained on large-scale classification data sets such as ImageNet [33]. Evaluating the regression function is usually much more efficient than model-based optimization, but most learning-based algorithms do not consider the hand geometry and thus treat the hand pose as a number of independent joints, so the estimated hand pose can be physically invalid. Recently, the deep prior approach by Oberweger et al. [13] and Zhou et al. [14] exploits a PCA-based model prior within a CNN that fully exploits the hand model geometry. However, due to the gap between classification and tracking problems, a trained CNN is not sufficient to solve the difficult tracking issues; even with a CNN, there remain inefficient search algorithms that explore the region of interest and select the best candidate by matching against the tracking model. For tracking, the lack of data arises from the difficulty of annotating videos as opposed to images. In this thesis, we address this challenge in a distinct manner. In terms of data, rather than requiring videos to be labeled with detailed bounding boxes at each frame, we interactively train trackers with far more limited supervision (specifying rewards and penalties only when a tracker fails). Interestingly, RL also naturally lends itself to streaming "open-world" evaluation: when running a tracker on a never-before-seen video, the video can be used both to evaluate the current tracker and to train (or refine) the tracker for future use. In order to track the hand pose across a sequence, we formalize this tracking as a self-learning procedure.

Over the past years, there has also been increased interest in devising learning techniques that combine unlabeled data with labeled data, i.e. in semi-supervised learning. The availability of vast amounts of data has made it imperative to combine unsupervised and supervised learning, because the cost of assigning labels to all the data can be high, and some of the data might have no labels due to a selection bias. The underlying challenge is to formulate a learning task that uses both labeled and unlabeled data such that the generalization of the learned model can be improved. Although the amount of image data available on the Internet is increasing constantly, it is nontrivial to build a densely supervised data set because of the high cost of manual labeling. Alternative methods for annotating data sets are proposed by Deng et al. and Russakovsky et al. [30, 33]; however, manual labeling is still a bottleneck. Gabriel et al. [17] suggested feature learning by pre-training a deep RL network's hidden layers via supervised learning. Sangdoo et al. [40] proposed an algorithm that combines supervised and reinforcement learning to train a network: in the supervised learning stage, the network is trained to select an action based on the position of the target, and in the RL stage the previously trained network is refined with training sequences of states, actions, and rewards to track the position of the object.

The goal of reinforcement learning (RL) is to learn a policy that decides sequential actions by maximizing the cumulative future reward. In RL, an agent learns to control a dynamic system (its environment) through examples of real interactions, without any model of the physics ruling this system. Deep learning with neural networks enables RL to scale to decision-making problems that were previously intractable, i.e., settings with high-dimensional state and action spaces. A recent trend [39, 41] in the RL field is to combine deep neural networks with RL algorithms by representing RL models such as the value function or the policy with networks. The first use of DRL was the development of an algorithm that could learn to play a range of Atari 2600 video games at a superhuman level directly from image pixels [39]. The second standout success was the development of a hybrid DRL system, AlphaGo, which defeated a human world champion in Go [41], paralleling the historic achievements of IBM's Deep Blue in chess two decades earlier and IBM's Watson DeepQA system that beat the best human Jeopardy players. Besides deep RL's state-of-the-art results, one of its most impressive accomplishments is its ability to learn directly from raw images. However, in order to bring the success of deep RL in virtual environments into real-world applications, we must address the lengthy training time required to learn a policy.

Deep RL suffers from poor initial performance, like classic RL algorithms that learn in tabular form [1]. In addition, deep RL inherently takes longer to learn because, besides learning a policy, it also learns directly from raw images: instead of using hand-engineered features, deep RL needs to learn to construct relevant high-level features from the raw images. These problems are consequential in real-world applications with expensive data, as in robotics, finance, or medicine. Training of DRL can be sped up by addressing the two problems it is trying to accomplish: (1) feature learning and (2) policy learning. The works in [17, 40] solve the feature learning problem, and thereby speed up learning in deep RL, by using a pre-trained CNN.


Chapter 3

Background

The essence of RL is learning through interaction. An RL agent interacts with its environment and, upon observing the consequences of its actions, can learn to alter its own behavior in response to the rewards received. The paradigm of trial-and-error learning has its roots in behaviorist psychology and is one of the main foundations of RL [1]. The other key influence on RL is optimal control, which has lent RL the mathematical formalism of dynamic programming [5].

Reinforcement learning lies somewhere between supervised and unsupervised learning. Whereas in supervised learning one has a target label for each training example, and in unsupervised learning one has no labels at all, in RL one has sparse and time-delayed labels: the rewards. Based only on those rewards, the agent has to learn to behave in the environment. We take advantage of deep neural networks to solve this problem through regression and choose the action with the highest predicted Q-value. To train our agent, we let it explore randomly for a few thousand steps and record each state, action, and reward in a memory called experience replay. We then train our agent on batches randomly chosen from these experience replays.

3.1 Reinforcement Learning

Reinforcement learning [1] is a branch of machine learning which describes the way an agent is able to learn its behavior through trial-and-error interactions with an environment. Figure 3.1 visualizes the general architecture of a reinforcement learning agent. At each discrete time step t the agent is able to perceive the current state of the environment s_t ∈ S, where S is the set of possible states of the environment. Based on the current state s_t the agent selects and performs an action a_t ∈ A(s_t), where A(s_t) is the set of actions available in state s_t. After completing an action, the agent receives the reward r_{t+1} for its action in the next time step and can observe the updated state of the environment s_{t+1}.


Figure 3.1: Reinforcement Learning

The reward can be delayed, so it is not directly clear how beneficial each action was. The reward is defined as a function which maps the previous state of the environment to a numeric value. Overall, the goal of the agent is to maximize the expected return R_t, the total discounted reward the agent can expect from time step t onward.

R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + … = Σ_{k=0}^{∞} γ^k r_{t+k+1}    (3.1)

The discount factor γ (0 < γ < 1) is a parameter which determines how highly future rewards are valued. If the discount factor γ = 0, the agent is only concerned with maximizing the immediate reward, i.e. choosing the action a_t so as to maximize R_{t+1}. As γ approaches 1, the return objective takes future rewards into account more strongly.
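To make Eq. (3.1) concrete for a finite reward sequence, the return can be computed by folding the rewards backwards. The following is a minimal illustrative Python sketch, not code from this thesis:

def discounted_return(rewards, gamma=0.99):
    """Compute R_t = sum_k gamma^k * r_{t+k+1} for a finite reward sequence."""
    ret = 0.0
    for r in reversed(rewards):      # fold from the last reward backwards
        ret = r + gamma * ret
    return ret

# Example: three rewards observed after time step t
print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))   # 1.0 + 0.9*0.0 + 0.81*2.0 = 2.62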

In order to maximize its return, the agent follows a policy π_t, which maps its perceived state to probabilities of selecting each possible action and defines the agent's behavior. The policy can be either deterministic, π(s), or stochastic, π(a|s).

3.2 Markov Decision Process

One of the important aspects of the states in a reinforcement learning problem is that they are often assumed to have the Markov property. The Markov property means that the environment's response at time t + 1 depends only on the current state s_t and action a_t, and not on the previous history of states, actions, and rewards.

In other words, the future is independent of the past given the present, which can be expressed as:

P(s_{t+1} = s', r_{t+1} = r' | s_t, a_t) = P(s_{t+1} = s', r_{t+1} = r' | s_t, a_t, r_t, s_{t-1}, a_{t-1}, …, r_1, s_0, a_0)    (3.2)

A reinforcement learning task which satisfies the Markov property is called a Markov Decision Process (MDP). MDPs describe an environment for reinforcement learning where the environment is fully observable. An MDP is defined by a tuple ⟨S, A, P, R, γ⟩:

• S : set of states s

• A : set of actions a

• P : state transition probability matrix

• R : reward function

• γ : discount factor
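For reference, the tuple ⟨S, A, P, R, γ⟩ maps directly onto a small data structure; the sketch below is a hypothetical tabular container used only to make later examples concrete, not the representation used in this work.

from dataclasses import dataclass
from typing import Dict, List, Tuple

State, Action = int, int

@dataclass
class MDP:
    states: List[State]
    actions: List[Action]
    # transitions[(s, a)] -> list of (next_state, probability) pairs
    transitions: Dict[Tuple[State, Action], List[Tuple[State, float]]]
    # rewards[(s, a, s_next)] -> expected immediate reward R(s, a, s')
    rewards: Dict[Tuple[State, Action, State], float]
    gamma: float = 0.9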

Given a state s and an action a, the transition probability P^a_{ss'} of an MDP defines the probability of each possible follow-up state s':

P^a_{ss'} = P(s_{t+1} = s' | s_t = s, a_t = a)    (3.3)

Similarly, the reward function R^a_{ss'} defines the expected value of the next reward for a state s, an action a, and a follow-up state s':

R^a_{ss'} = E[r_{t+1} | s_t = s, a_t = a, s_{t+1} = s']    (3.4)

Value Function

The state-value function V_π(s) of an MDP is the expected return when starting from state s and following policy π thereafter:

V_π(s) = E_π[R_t | s_t = s]    (3.5)

It corresponds to the long-term value of state s. In the same way, the action-value function Q_π(s, a) defines the value of taking action a in state s under policy π:

Q_π(s, a) = E_π[R_t | s_t = s, a_t = a]    (3.6)

The Bellman expectation equations describe how the state-value and action-value functions can be decomposed into an immediate reward plus the discounted value of the successor state:

V_π(s) = E_π[r_{t+1} + γ V_π(s_{t+1}) | s_t = s]    (3.7)

Q_π(s, a) = E_π[r_{t+1} + γ Q_π(s_{t+1}, a_{t+1}) | s_t = s, a_t = a]    (3.8)

One problem in RL is to find the policy which achieves the greatest return. A policy which yields an expected return greater than or equal to that of all other policies π' for all states is called an optimal policy π*. There always exists at least one optimal policy which is better than or equal to all others. All optimal policies share the same optimal state-value function V*(s), the maximum value function over all policies for all states s ∈ S:

V*(s) = max_π V_π(s)    (3.9)

Similarly, all optimal policies also achieve the optimal action-value function Q*(s, a), the maximum action-value function over all policies for all states s ∈ S and all actions a ∈ A(s):

Q*(s, a) = max_π Q_π(s, a)    (3.10)


The optimal state-value function and the optimal action-value function are related by the Bellman optimality equations:

V*(s) = max_a Q*(s, a)    (3.11)

Q*(s, a) = E[r_{t+1} + γ V*(s_{t+1}) | s_t = s, a_t = a]    (3.12)

Q*(s, a) = E[r_{t+1} + γ max_{a'} Q*(s_{t+1}, a') | s_t = s, a_t = a]    (3.13)

If Q*(s, a) is known, the optimal policy π*(a|s) selects the action which maximizes Q*(s, a):

π*(a|s) = 1 if a = argmax_{a ∈ A} Q*(s, a), and 0 otherwise    (3.14)

This optimal policy is said to be greedy with respect to the optimal action-value function. The optimal policy defines the agent's optimal actions without requiring knowledge of the environment's dynamics R^a_{ss'}, P^a_{ss'}.

Bellman Equation

The Bellman equations formulate the problem of maximizing the expected sum of rewards as a recursive relationship of a value function. A policy π is considered better than another policy π' if the expected return of π is greater than that of π' for all s ∈ S, which implies V_π(s) ≥ V_π'(s) for all s ∈ S. Thus the optimal value function V*(s) can be defined as:

V*(s) = max_π V_π(s),  ∀s ∈ S    (3.15)

Similarly, the optimal action-value function Q*(s, a) can be defined as:

Q*(s, a) = max_π Q_π(s, a),  ∀s ∈ S, a ∈ A    (3.16)

Also, for an optimal policy the following holds:

V*(s) = max_{a ∈ A(s)} Q_{π*}(s, a)    (3.17)

Expanding Eq. (3.17) with (3.14):

V*(s) = max_a E[R_t | s_t = s, a_t = a]    (3.18)

V*(s) = max_a E[ Σ_{k=0}^{∞} γ^k r_{t+k} | s_t = s, a_t = a ]    (3.19)

V*(s) = max_a Σ_{s'} p(s'|s, a) [R(s, a, s') + γ V*(s')]    (3.20)


Equation (3.20) is known as the Bellman optimality equation for V*(s). The Bellman optimality equation for Q* is:

Q*(s, a) = E[r_{t+1} + γ max_{a'} Q*(s_{t+1}, a') | s_t = s, a_t = a]
         = Σ_{s'} p(s'|s, a) [R(s, a, s') + γ max_{a'} Q*(s', a')]    (3.21)

If the transition probabilities and the reward functions are known, the Bellman optimality equations can be solved in an iterative fashion. This approach is known as dynamic programming. The algorithms which assume these probabilities to be known, or estimate them online, are collectively known as model-based approaches (in contrast to the model-free methods of Section 3.3).
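As an illustration of this iterative scheme, the Bellman optimality backup of Eq. (3.20) can be turned directly into value iteration. The sketch below assumes the hypothetical tabular MDP container sketched in Section 3.2 and is not part of the thesis implementation.

def value_iteration(mdp, tol=1e-6):
    """Repeatedly apply the Bellman optimality backup until the values stop changing."""
    V = {s: 0.0 for s in mdp.states}
    while True:
        delta = 0.0
        for s in mdp.states:
            # V(s) = max_a sum_s' p(s'|s,a) [R(s,a,s') + gamma * V(s')]
            best = max(
                sum(p * (mdp.rewards[(s, a, s2)] + mdp.gamma * V[s2])
                    for s2, p in mdp.transitions[(s, a)])
                for a in mdp.actions
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V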

3.3 Model-free Methods

Model-free methods can be applied to any reinforcement learning problem since they do not require a model of the environment. Most model-free approaches either try to learn a value function and infer an optimal policy from it (value-function-based methods) or directly search the space of policy parameters to find an optimal policy (policy search methods). Model-free approaches can also be classified as either on-policy or off-policy. On-policy methods use the current policy to generate actions and use them to update that same policy, while off-policy methods use a different exploratory policy to generate actions than the policy being updated. The following subsections look at the model-free algorithms used, both value-function-based and policy search methods.

3.3.1 Value Function Based Methods

Monte Carlo Methods

Monte Carlo methods work on the idea of generalized policy iteration (GPI). GPI is an iterative scheme composed of two steps. The first step builds an approximation of the value function based on the current policy, known as the policy evaluation step. In the second step, the policy is improved with respect to the current value function, known as the policy improvement step.

In Monte Carlo methods, to estimate the value function, rollouts are performed by executing the current policy on the system. The accumulated reward over the entire episode and the distribution of states encountered are used to form an estimate of the value function. The policy is then improved by making it directly greedy with respect to the current value function. Using these two steps iteratively, it can be shown that the algorithm converges to the optimal value function and policy. Though Monte Carlo methods are straightforward to implement, they require a large number of iterations to converge and suffer from a large variance in their value function estimates.


Temporal Difference Method

Temporal difference (TD) methods build on the idea of GPI but differ from Monte Carlo methods in the policy evaluation step. Instead of using the total accumulated reward, these methods calculate a temporal error, the difference between the new and the old estimate of the value function, by considering the reward received at the current time step, and use it to update the value function. This kind of update reduces the variance but increases the bias in the estimate of the value function. The update equation for the value function is given by:

V(s) ← V(s) + α [ r + γ V(s') − V(s) ]    (3.22)

where α is the learning rate, r is the reward received at the current time instant, s' is the new state, and s is the old state.

Thus, temporal difference methods update the value function at each time step, unlike Monte Carlo methods, which wait until the episode has ended to update the value function. Two TD algorithms that have been widely used to solve RL problems are SARSA (State-Action-Reward-State-Action) and Q-learning. We use the Q-learning approach in this work.

3.4 Q-Learning

Watkins [15] introduced an off-policy temporal difference algorithm known as Q-learning. Q-learning is a model-free reinforcement learning method in which a table of Q-values is maintained for each state and action taken, together with the resulting reward. This approach does not require knowledge of R^a_{ss'} or P^a_{ss'}. Q-learning does not wait for the final return R_t to update its estimate of Q(s_t, a_t). Instead, it updates the current estimate of Q(s_t, a_t) at each time step using the difference between the current estimate and the TD target. The TD target is expressed in terms of the reward r_{t+1}, observed after performing a_t, plus the discounted action-value Q(s_{t+1}, a) of the next state s_{t+1}, where a is the action that maximizes Q in s_{t+1}.

Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t) ]    (3.23)

The algorithm is summarized below.


Algorithm 1 Q-learning
  Initialize Q(s, a) randomly
  repeat
    Observe initial state s_1
    for t = 1 : T do
      Select an action a_t using a policy derived from Q (e.g. ε-greedy)
      Carry out action a_t
      Observe reward r_t and new state s_{t+1}
      Update Q using Eq. (3.23)
    end for
  until terminated
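A minimal tabular version of Algorithm 1 might look as follows. The environment interface (reset, step, actions) is hypothetical and follows a common Gym-style convention; it is a sketch rather than code from this thesis.

import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning with an epsilon-greedy behaviour policy."""
    Q = defaultdict(float)                              # Q[(state, action)] -> value
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection
            if random.random() < epsilon:
                a = random.choice(env.actions)
            else:
                a = max(env.actions, key=lambda a_: Q[(s, a_)])
            s_next, r, done = env.step(a)
            # TD target uses the greedy action in the next state (off-policy update, Eq. 3.23)
            td_target = r + gamma * max(Q[(s_next, a_)] for a_ in env.actions)
            Q[(s, a)] += alpha * (td_target - Q[(s, a)])
            s = s_next
    return Q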

3.5 Exploration and Exploitation

The exploration vs. exploitation problem arises when the model tends to stick to the same actions while learning; in our case the model might learn to move in one direction along an axis rather than exploring the other directions, and in turn apply the same policy every time. The agent is therefore allowed to try random actions while learning, which can yield a better reward. A probability ε is introduced, which decides the randomness of actions. The value of ε is gradually decreased to reduce the randomness as training progresses, after which the rewarding actions are exploited. Q-learning attempts to solve the credit assignment problem, as it propagates rewards back in time until it reaches the crucial decision point which was the actual cause of the obtained reward.

Exploration vs. Exploitation

When a Q-table or Q-network is initialized randomly, its predictions are initially random as well. The actions are then random and the agent performs crude "exploration". As the Q-function converges, it returns more consistent Q-values and the amount of exploration decreases. Q-learning thus incorporates exploration as part of the algorithm, but this exploration is "greedy".

The need for exploration can be framed as one of obtaining representative training data. In order for an agent to learn how to deal optimally with all possible states in an environment, it must be exposed to as many of those states as possible. Unlike in traditional supervised learning settings, however, the agent in a reinforcement learning problem only has access to the environment through its own actions. There are two ways for the agent to explore:

• Random method - One approach to exploring the state space is to generate actions randomly with uniform probability. If the task to be learned is divided into a learning phase and a performance phase and the cost during learning is ignored, this method may be applicable. However, in many situations the agent's performance during learning is an important facet of the problem formulation. This method is not well suited to such problems; purely random exploration is the least efficient exploration method in terms of cost [36].


• Epsilon-greedy method - A simple combination of the greedy and random approaches yields one of the most used exploration strategies: ε-greedy. In this approach the agent chooses what it believes to be the optimal action most of the time, but occasionally acts randomly. This way the agent takes actions which it may not estimate to be ideal, but which may provide new information. The ε in ε-greedy is an adjustable parameter which determines the probability of taking a random, rather than principled, action. Due to its simplicity and surprising power, this approach has become the standard technique for most recent RL algorithms, including DQN and its variants. If the Q-value function returns a vector of action values instead of a scalar, the max operator is applied over that vector rather than looping over the actions individually. Once ε decreases below a certain threshold, it remains constant over time.

Algorithm 2 ε-Greedy Strategy
  procedure action = ε-Greedy(ε, s = state)
    Initialize a_aux ∈ A_s
    if ε ≥ rand(0, 1) then
      Select a random action a_i ∈ A_s from the action space
      a_aux = a_i
    else
      Initialize Q_aux = 0
      for each action a_i ∈ A_s do
        Compute Q(s, a_i) based on s = state
        if Q(s, a_i) ≥ Q_aux then
          Q_aux = Q(s, a_i)
          a_aux = a_i
        end if
      end for
    end if
    Return a_aux
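In code, the ε-greedy strategy of Algorithm 2, together with the gradual decay of ε down to a fixed floor described above, can be sketched as follows; the Q-value interface is hypothetical and the snippet is only illustrative.

import random

def epsilon_greedy(q_values, epsilon):
    """q_values: a sequence of Q(s, a_i), one entry per action a_i in the current state."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                    # explore: random action
    return max(range(len(q_values)), key=q_values.__getitem__)    # exploit: argmax action

def decay_epsilon(epsilon, decay=0.995, floor=0.05):
    """Gradually reduce exploration, keeping epsilon constant once the floor is reached."""
    return max(floor, epsilon * decay)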

Credit Assignment

The credit assignment problem can confuse the model when judging which past action was responsible for the current reward. A negative reward in our case can be due to a low intersection over union (IoU) of the bounding box for the hand pose caused by a previously taken action rather than the current one. A discount factor γ is therefore used, which decides how far into the future the model looks when deciding on an action; thus, γ indirectly addresses the credit assignment problem. In our case the model was trained with γ = 0.99.


3.6 Deep Neural Network

3.6.1 Basic Idea

The simplest neural network consists of a linear transformation of some data points X:

X_1 = f(W X + B)    (3.24)

Since all values of a column of X influence all values in the same column of X_1, the input layer X and the layer X_1 are said to be fully connected. The convention of calling this a neural network comes from the inspiration of physical neurons arranged in layers, although the analogy is less apparent when the network is described in this fashion. The transformation above is obviously restricted to learning linear mappings, so to learn non-linear functions a non-linear activation function f is added. To learn more complex functions, transformations can be recursively stacked:

X_2 = f(W_1 X_1 + B_1)    (3.25)

X_{k+1} = f(W_k X_k + B_k)    (3.26)

A loss function is a function l : f(R^d) → R, where f is the neural network and R^d is the data. The data are input to the network and, since all parts of the network are differentiable, we can specify the loss function and calculate its derivatives with respect to all parameters W, W_1, ... and B, B_1, .... Using this, we can minimize the loss with gradient descent. A common and effective way to do this is called back-propagation. The first layer X is commonly referred to as the input layer and the last layer as the output layer; the intermediate layers are referred to as hidden layers.
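The stacked transformation X_{k+1} = f(W_k X_k + B_k) corresponds to a few lines of NumPy; the sketch below is purely illustrative.

import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def mlp_forward(X, weights, biases, activation=relu):
    """Apply X_{k+1} = f(W_k X_k + B_k) for each layer in turn."""
    for W, B in zip(weights, biases):
        X = activation(W @ X + B)
    return X

# Example: a 2-layer network mapping 4-dimensional inputs to 3 outputs
W1, B1 = np.random.randn(8, 4), np.zeros((8, 1))
W2, B2 = np.random.randn(3, 8), np.zeros((3, 1))
X = np.random.randn(4, 1)
print(mlp_forward(X, [W1, W2], [B1, B2]).shape)   # (3, 1)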

3.6.2 Network Layers

The network exhibits a pyramidal architecture consisting of several planes, where the resolution of the image is reduced from plane to plane by a given factor. The important feature is that the input of several lower-level pixels is reduced to a single output when fed to a single upper-level neuron. This is the basis of the convolution and sub-sampling operations essential in a convolutional neural network. In addition to the particular sparse connectivity discussed above, convolutional neural networks utilize several kinds of layers, which are discussed below.

Convolution Layers

Convolution as a mathematical term means applying a function repeatedly over the output of another function. In the context of image processing, this corresponds to applying a filter over the image. During this process, the value of a central pixel is determined by adding the weighted values of its neighbours. Common filter effects are sharpen, blur, intensify, etc. The filter can be applied over the image at certain strides, or offsets. The larger the stride, the bigger the size reduction (compression) of the original image; if the stride is 1, the resulting image is the same size as the original. One convolution layer may apply several filters, or feature maps. In essence, a filter is represented by a set of weights connected to a small patch of the original image, and it produces a single output. The resulting network structure mimics a series of overlapping receptive fields, which produce a series of "filter" outputs. Since all receptive fields share the same weights, we only have to compute the weight updates for a single instance of the filter during back-propagation.

Pooling (sub-sampling) layers

In general, pooling (or sub-sampling) refers to reducing the overall size of a signal; in the context of image processing, it refers to reducing the size of the image. In AlexNet, used for image recognition, it increases the invariance of the filters to geometric shifts of the input patterns. Pooling can use the average, L1 norm, L2 norm, or maximum of the signal data in a local patch. In effect, it promotes dimensionality reduction and smoothing. Several other neural network models use max pooling: the matrix of filter outputs is split into small non-overlapping grids (patches), and the maximum (or average) value of each grid becomes the output, as shown in the figure below. Applying max-pooling layers between convolution layers increases spatial and feature abstractness.

Figure 3.2: Example of 2D max pool operation
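The 2D max-pool operation over non-overlapping patches illustrated in Figure 3.2 can be expressed compactly; a minimal NumPy sketch for illustration:

import numpy as np

def max_pool_2d(x, k=2):
    """Maximum over non-overlapping k x k patches of a 2D array (sides must divide by k)."""
    h, w = x.shape
    return x.reshape(h // k, k, w // k, k).max(axis=(1, 3))

x = np.array([[1, 3, 2, 1],
              [4, 6, 5, 2],
              [7, 2, 9, 1],
              [3, 1, 4, 8]])
print(max_pool_2d(x))   # [[6 5]
                        #  [7 9]]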

Normalization layers

The intention of this type of layer is to perform a kind of "lateral inhibition". It is useful in combination with ReLU units because of their unbounded activation. It allows the detection of features with a spike in response value by normalizing over local input values, while at the same time inhibiting regions with uniformly large response values. The normalization can be performed across or within channels.


3.6.3 Common Activation Function

Two common activation functions are tanh and the Rectified Linear Unit (ReLU) [20]. The tanh function is defined as:

tanh(x) = 2 / (1 + e^{-2x}) − 1    (3.27)

The ReLU function is defined as:

ReLU(x) = max(0, x)    (3.28)

In terms of training time with gradient descent, the saturating tanh non-linearity is much slower than non-saturating non-linearities like ReLU. According to Nair and Hinton [28], deep convolutional neural networks with ReLU train several times faster than their equivalents with tanh units. Faster learning has a great influence on the performance of large models trained on large data sets.

3.6.4 Batch Normalization

As the weights in one layer change during training, the following layers have to adapt to this change. In fact, later layers constantly have to adapt to changes in any of the previous layers during training, a phenomenon called internal covariate shift. It was shown that this problem can be alleviated by adding intermediate normalization layers, called batch normalization [35]. These layers whiten the activations of the previous layer, i.e. element-wise subtract the mini-batch mean and divide by the square root of the variance. Since the statistics are calculated per mini-batch, the authors argue that this acts as a regularizer, and they empirically show that dropout is in some cases no longer needed; they attribute this to the fact that the representation of one sample will shift differently depending on the other samples in the mini-batch. In some cases, whitening the outputs of the previous layer decreases what the next layer can represent, e.g. by no longer saturating the sigmoid in the subsequent layer. To alleviate this problem, the authors propose learnable parameters that ensure that the normalization layer can represent the identity function if needed. During inference, normalization is done with population estimates of the mean and variance, inferred from running mean and variance estimates obtained during training. Using batch normalization was shown to decrease the number of training steps by a factor of 14 in some cases, and to improve the test errors of previous state-of-the-art networks.
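At training time, the whitening-plus-affine step described above amounts to the following per-mini-batch computation; this is a simplified sketch that ignores the running statistics used at inference.

import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """x: (batch, features). Normalise per feature, then apply the learnable scale and shift."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)   # whiten the activations
    return gamma * x_hat + beta             # gamma, beta let the layer recover the identity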

3.7 Deep Q Learning

The recent resurgence of neural networks in RL can be attributed to the widespread success of Deep Reinforcement Learning. For small problems, it is sufficient to maintain the estimated values of Q(s, a) in a look-up table with one entry for each state-action pair. However, for problems with large or continuous state and action spaces, using a table is not feasible due to time and memory constraints. Instead, it is desirable to produce a good estimate of the action-value function from a limited subset of the state-action space. In other words, generalization is needed from experienced state-action pairs to unseen ones. To do so, the value function can be estimated with a function approximator. One popular approach to function approximation is to use artificial neural networks, due to their ability to approximate non-linear functions.

Deep-Q Network

The recent development of deep RL has gained great attention due to its ability to generalize and solve problems in different domains. The first such method, the Deep Q-Network (DQN) [39], learns to solve 49 Atari games directly from screen pixels by combining Q-learning with a deep neural network. When using neural networks to approximate the action-value function, it is possible to represent the value function for a large state space. The use of multilayer perceptrons as value functions has the advantage of good generalization, as neural networks perform global approximation. A simple architecture for neural-network-based value function approximation is shown in Figure 3.3. The current state s_t and an action a_t are used as inputs to the neural network, and the output of the network corresponds to the approximated action-value Q. To choose the action a_t with the highest action value in a state s_t, Q(s_t, a_t) needs to be computed with a forward pass through the network for each possible action.

Figure 3.3: Neural network for action-value function approximation

To train a neural network to approximate the Q-function, standard gradient descent techniques can be used to learn the weights. However, [37] demonstrates that when using non-linear models, such as neural networks, as function approximators, there is a risk of the learning process becoming unstable or diverging.

For this work, the DQN is designed to learn directly from the visual input, taking 80 × 80 pixel, 7-bit depth images as the state s. Combined with 6 possible actions (movement along the X, Y and Z axes), a Q-table over all possible state-action pairs would be astronomically large. Even if it were feasible to create such a table, it would be sparsely populated, and information gained from one state-action pair could not be propagated to other state-action pairs. We therefore model the state-action value with a neural network, known as a Deep Q-Network (DQN).

The strength of the DQN lies in its ability to compactly represent both high-dimensional observations and the Q-function using deep neural networks. The DQN addresses the fundamental instability of using function approximation in RL through the experience replay technique [42].
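To make the DQN concrete, a network mapping an 80 × 80 single-channel depth image to 6 action values could look roughly as follows. This is a hedged PyTorch sketch with arbitrarily chosen layer sizes; the architecture actually used in this work is described in Section 4.2.5 and may differ.

import torch
import torch.nn as nn

class DQN(nn.Module):
    """Maps an 80 x 80 depth image to one Q-value per action."""
    def __init__(self, n_actions=6):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=8, stride=4), nn.ReLU(),   # 80 -> 19
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),  # 19 -> 8
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 8 * 8, 256), nn.ReLU(),
            nn.Linear(256, n_actions),
        )

    def forward(self, x):                      # x: (batch, 1, 80, 80)
        return self.head(self.features(x))     # (batch, n_actions)

q_net = DQN()
q_values = q_net(torch.zeros(1, 1, 80, 80))    # greedy action = q_values.argmax(dim=1)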


Experience Replay

One reason for this instability is that, unlike local schemes such as a table, neural networks have a global approximation property. This property enables a weight update in one part of the state space to influence the values in other parts of the network; an update can thereby make the network 'forget' knowledge it has learned from an earlier sample. Another reason is that it is very inefficient to use experiences only once and then throw them away, in particular if an experience occurs only rarely. Therefore, experience replay stores past experiences in a replay memory, where an experience is defined as a quadruple ⟨s, a, s', r⟩ consisting of a state s, an action a, a new state s', and a reward r. The memorized experiences are then presented to the learning algorithm more than once during training, preventing the network from forgetting previous knowledge. This approach is very easy to implement and speeds up the learning process, which can otherwise be slow.

Additionally, high correlation between sequential training samples can cause the training to become unstable. In RL, experiences are often generated sequentially, causing consecutive experiences to be highly correlated. However, optimization algorithms such as stochastic gradient descent generally assume that the training data are independently and identically distributed. Therefore, training a neural network online on sequential experiences can cause training to oscillate. Experience replay once again offers a solution, as sampling from the replay memory decorrelates experiences and thereby stabilizes the training.

Not only does this massively reduce the number of interactions needed with the environment, but batches of experience can be sampled, reducing the variance of the learning updates. Furthermore, by sampling uniformly from a large memory, the temporal correlations that can adversely affect RL algorithms are broken. Finally, from a practical perspective, batches of data can be efficiently processed in parallel by modern hardware, increasing the throughput of the system.
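A replay memory of ⟨s, a, s', r⟩ tuples (plus a terminal flag) is typically implemented as a bounded FIFO buffer with uniform sampling. The following is a minimal illustrative sketch, not the implementation used in this work.

import random
from collections import deque

class ReplayMemory:
    def __init__(self, capacity=50_000):
        self.buffer = deque(maxlen=capacity)   # oldest experiences are discarded first

    def push(self, state, action, next_state, reward, done):
        self.buffer.append((state, action, next_state, reward, done))

    def sample(self, batch_size=32):
        """Uniform sampling breaks the temporal correlation between consecutive experiences."""
        batch = random.sample(self.buffer, batch_size)
        states, actions, next_states, rewards, dones = zip(*batch)
        return states, actions, next_states, rewards, dones

    def __len__(self):
        return len(self.buffer)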


Chapter 4

Framework for Hand Pose Tracking

In this chapter, the major components and techniques involved in the thesis work are explained in detail. The methodology for hand pose tracking is formalized in a reinforcement learning framework in which a DRL agent receives state information, decides on an action based on it, and computes the reward function. This framework is used for hand pose tracking, initially with a bounding box, and can also be extended to skeleton estimation from the 3D hand gesture images; it is thus useful for tracking both rigid and non-rigid gestures. Figure 4.1 shows the general framework of reinforcement learning. For estimating and tracking the hand pose, we initially configure the framework to track the hand pose with a bounding box. Figure 4.2 shows the RL framework for our methodology for hand pose estimation and tracking using bounding box parameters.

Figure 4.1: Framework for Reinforcement Learning


Figure 4.2: Hand Pose Estimation Framework using bounding box

Once our DRL agent successfully tracks hand gestures using bounding box parameters, we can extend the model and further train it with the skeleton information of each hand gesture. The skeleton allows the system to precisely detect the positions of all the interest points of the hand (namely the fingers and the hand center), so skeleton estimation can be used to track the non-rigidity of hand gestures. We therefore propose that these methods be extended to skeleton estimation from the 3D point cloud data of the hand gestures, as shown in Figure 4.3.

Figure 4.3: Hand Pose Estimation using skeleton structure


The major components involved in this framework are the DRL model, the hand model, the rigid and non-rigid movement parameters, and the dimension reduction technique for skeleton estimation from the hand pose. The Deep RL model is a regression model trained to predict the location of rigid and non-rigid hand gestures. The hand tracking model has two phases, a training phase and a testing phase. During the training phase, a collection of rigid and non-rigid hand gestures is input into the Deep RL neural network model to train it. Each training object is a depth image with the associated bounding box coordinates, which are used by the regression to learn. In the testing phase, objects for which the predicted bounding box location is unknown are input to the trained DRL model one by one, and the trained model estimates an action to move the bounding box for each input object. Figure 4.4 below shows the working of the DRL model.

(a) Training Phase

(b) Testing Phase

Figure 4.4: Working of a DRL for tracking with bounding box

Once the DRL agent is trained to track hand gestures by bounding box estimation, we propose to extend the model to estimate the skeleton parameters from the depth images using a Deep RL model. The skeleton estimation framework also has a training phase and a testing phase. During the training phase, a collection of hand poses involving rigid and non-rigid movements, tracked by the previous model, is fed into the neural network DeepRL model to train it. Each training object is a depth image with the associated reduced joint parameters, which the regression model learns from. In the testing phase, depth images of hand gestures without skeleton information are fed to the trained DRL model one by one, and the model estimates an action to move the hand joint coordinates for each input gesture. Figure 4.5 below shows the block diagram of the DRL model for skeleton estimation.


(a) Training Phase

(b) Testing Phase

Figure 4.5: Proposed DRL model for skeleton estimation
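
The reduced joint parameters referred to above can be obtained with a standard dimensionality-reduction technique. The sketch below uses PCA from scikit-learn purely as an illustration; the array shapes, the number of components, and the choice of PCA itself are assumptions made for this example and do not fix the method used in this work.

import numpy as np
from sklearn.decomposition import PCA

# Hypothetical training matrix: one row per hand pose, 26 joint parameters
# per pose (matching the 26-DoF hand model described in Section 4.2).
joint_params = np.random.rand(1000, 26)   # placeholder data for illustration

# Reduce the 26-D joint vector to a lower-dimensional representation that
# the DRL model can predict more easily.
pca = PCA(n_components=10)                # 10 components is an assumed choice
reduced = pca.fit_transform(joint_params)

# A predicted low-dimensional pose can be mapped back to the full joint space.
reconstructed = pca.inverse_transform(reduced[:1])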

4.1 Data Collection and Data Set

The lack of sufficient 3D hand gesture data led us to collect our own data of different hand gestures and create our own data set. Training data collection is always a nontrivial task. To collect the training data in a semi-automatic way, we need to combine the depth information with ground truth locations for labeling. We therefore built the experimental setup shown in Figure 4.6, consisting of a depth camera, a Leap Motion sensor and a mobile camera, with the aim of collecting depth, RGB and skeleton information for the hand gestures. The first task is to calibrate the depth camera. There are many techniques for camera calibration; in our experiments, however, we obtain the pixel mapping directly from the Intel depth sensor SDK. The Leap Motion sensor is set up to collect the ground truth hand skeleton information for the gestures. The data is collected in binary format and, after processing, converted into depth images.

Figure 4.6: Data collection setup


Unlike the Kinect and other similar devices, the Leap Motion does not return a complete depth map but only a set of relevant hand points and some hand pose features. Figure 4.7 shows the orientation setup of the Leap Motion sensor used in the proposed gesture recognition system, namely:

• Palm Centre C - roughly corresponds to the centre of the palm region in 3D space.

• Hand Orientation h - points from the palm centre towards the fingers.

Figure 4.7: Leap motion orientation

The data is collected by the two sensors while different hand poses are performed in front of the sensor setup. The rigid hand gestures involve moving the hand along the X, Y and Z axes without finger articulation, while the non-rigid dynamic hand gesture is a free grab movement that involves articulation of all the fingers.

Hand movement, including finger articulation, involves 26 degrees of freedom (DoF). The data set contains 5 subjects and 2 gestures, and each gesture is repeated by each subject a predefined number of times. The data set is structured into sequences, and each sequence is composed of a set of frames; for each gesture, 10 frames are captured. Subjects are asked to start with a frontal, open hand pose, and each gesture is performed with both the left and the right hand. The data collection setup is shown in Figure 4.6.

The hand gestures are divided into the following parts (a sketch of the resulting data-set layout is given after the list):

• Dynamic Hand Gesture

1. Hand Grab - front and back view

• Static Hand Gesture

1. Moving the hand along the X, Y and Z axes
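
The snippet below is only a hypothetical illustration of how such a data set can be enumerated, based on the description above (5 subjects, 2 gestures, both hands, 10 frames per sequence). The directory names, file extension and the omission of the repetition index are assumptions made for this sketch.

import itertools

subjects = [f"subject_{i:02d}" for i in range(1, 6)]   # 5 subjects
gestures = ["grab", "xyz_motion"]                      # 2 gestures (assumed names)
hands = ["left", "right"]
frames_per_sequence = 10

# Repetitions of each gesture are not enumerated here, since their exact
# count is predefined separately.
paths = [
    f"data/{subject}/{gesture}/{hand}/frame_{frame:02d}.bin"
    for subject, gesture, hand in itertools.product(subjects, gestures, hands)
    for frame in range(frames_per_sequence)
]
print(len(paths))   # 5 subjects x 2 gestures x 2 hands x 10 frames = 200 frames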

4.2 Hand Model

The hand model is defined by a skeleton model. The skeleton of the human hand has a hierarchical structure consisting of rigid links and joints, where each joint has one or two rotational degrees of freedom. This hierarchical structure can be represented by a tree whose root is the wrist. Figure 4.8 shows a hand skeleton model with the degrees of freedom of each joint.

Figure 4.8: An example of hand model with 26 DoF

Degrees of Freedom of the Hand Joints

The degrees of freedom (DoF) are the minimum number of independent variables required to completely describe a body. There are four types of transformation: translation, rotation, reflection and dilation. These fall into two categories: rigid transformations, which change neither the shape nor the size of the object, and non-rigid transformations, which change the size but not the shape of the object. For a rigid transformation of the hand, the DoF is the number of independent coordinates needed to define the position of the body.

Hand pose estimation means estimating the rigid global hand pose as well as the non-rigid finger articulation. The complexity induced by the high number of degrees of freedom of the articulated hand challenges many visual tracking techniques.

Capturing hand and finger motions in video sequences is highly challenging due to the large number of degrees of freedom of the hand kinematic structure. To model hand kinematics, we adopt the commonly used 26-DoF hand motion model: 6 DoF for the global hand pose and 4 DoF for each finger, as shown in Figure 4.8.
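
For clarity, the DoF count of this model can be written out as follows; the rigid-transformation formula is the standard statement of the 6 global DoF (3 for rotation, 3 for translation), included here as a restatement rather than a formula from a specific reference:

\[
\text{DoF}_{\text{hand}} = \underbrace{6}_{\text{global pose}} + \underbrace{5 \times 4}_{\text{finger joints}} = 26,
\qquad
\mathbf{p}' = R\,\mathbf{p} + \mathbf{t}, \quad R \in SO(3),\ \mathbf{t} \in \mathbb{R}^3 .
\]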

The high dimensionality of this problem makes the estimation of these motion parameters from images difficult. A rigid transformation preserves all distances and angles and has no associated non-rigidity, whereas an articulated transformation is a piecewise rigid transformation that involves the movement of the fingers.

4.2.1 Hand Pose Estimation - Rigid Transformation

In a planar mechanism, all relative motions of the object lie in one plane or in parallel planes. An object (here, the hand) can translate along the X or Y axis or rotate about the Z axis in the XY plane. Hand gestures involving rigid transformation parameters include rotation and translation along the X, Y and Z axes. The palm centre of the hand has 6 DoF, so we apply the rigid transformation parameters to the palm centre and track it along the sequence to recognize the hand gesture. Shape and size correspondence of the hand is an important aspect of imaging, so it needs to be normalized.

The hand gesture is first tracked with a bounding box. To track hand gestures undergoing a rigid transformation, the bounding box needs to move along the X and Y axes and to scale up and down as the hand moves along the Z axis.

To design an agent that learns a policy with RL, a few important components need to be defined first: the state space, the actions and the reward function.

Action

There are two types of actions: movement actions, which change the currently observed region, and the terminal action, which indicates that the object has been found and the search has ended.

For hand gesture tracking we define seven predefined actions in a discrete action space: four translation moves (left, right, up, down), two scale changes (zoom in, zoom out) that maintain the aspect ratio of the bounding box, and finally the terminating action, which the agent selects once it determines that the target has been located. The actions used to track the rigid transformation of the hand gesture are encoded as 7-dimensional one-hot vectors, as shown in Figure 4.9.

Figure 4.9: Actions for rigid object movement

Each local translation action moves the window by a step proportional to the current window size, and the next state is obtained deterministically after taking the action. The scaling actions facilitate the localization of objects over a wide range of scales, while the four translation actions perform successive changes in visual focus, playing an important role both in refining the currently attended object and in searching for uncovered new objects.
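
As an illustration of how these seven actions act on a bounding box [x, y, w, h], the sketch below applies one action per step. The step fraction alpha is an assumed illustrative value and does not correspond to a parameter reported in this work.

ACTIONS = ["left", "right", "up", "down", "zoom_in", "zoom_out", "stop"]

def apply_action(box, action, alpha=0.1):
    # Apply one discrete action; the step is a fraction of the window size.
    x, y, w, h = box
    dx, dy = alpha * w, alpha * h
    if action == "left":
        x -= dx
    elif action == "right":
        x += dx
    elif action == "up":
        y -= dy
    elif action == "down":
        y += dy
    elif action == "zoom_in":              # shrink around the centre, keeping the aspect ratio
        x, y, w, h = x + dx / 2, y + dy / 2, w - dx, h - dy
    elif action == "zoom_out":             # enlarge around the centre, keeping the aspect ratio
        x, y, w, h = x - dx / 2, y - dy / 2, w + dx, h + dy
    # "stop" leaves the box unchanged and terminates the search.
    return [x, y, w, h]

one_hot = [0, 0, 0, 0, 1, 0, 0]            # 7-dimensional one-hot encoding of "zoom_in"
box = apply_action([100, 100, 80, 80], ACTIONS[one_hot.index(1)])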

States

The state is composed of the feature matrix within the region of interest, and the feature vectors are normalized before they are fed into the neural network. Within the frames of the video F_1, ..., F_L, each image is resized to 400 × 400 to match the input size of the CNN model. Each input frame is paired with the corresponding bounding box representation b_t = [x_t, y_t, w_t, h_t], which defines the ground truth location of the hand pose in that frame. The input frame at iteration t, together with the corresponding bounding box, defines the state information for the DRL agent. Here x_t and y_t denote the upper left corner of the bounding box, and w_t and h_t denote its width and height.
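
The preprocessing of a single frame into a network input can be sketched as below. The use of OpenCV, the crop-then-resize order and the normalization to [0, 1] are assumptions made for this illustration, not a specification of the exact pipeline used in this work.

import cv2
import numpy as np

def make_state(depth_frame, box):
    """Build the state from a depth frame and a bounding box b_t = [x, y, w, h]."""
    x, y, w, h = [int(v) for v in box]
    roi = depth_frame[y:y + h, x:x + w]            # region of interest inside the box
    roi = cv2.resize(roi, (400, 400))              # match the CNN input size (400 x 400)
    roi = roi.astype(np.float32)
    roi /= (roi.max() + 1e-6)                      # normalize the feature values
    return roi, np.asarray(box, dtype=np.float32)  # image features plus box parameters

# Example with a synthetic depth frame (placeholder data).
frame = np.random.randint(0, 2**16, size=(480, 640), dtype=np.uint16)
state_img, state_box = make_state(frame, [100, 100, 200, 200])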
