3D POSE ESTIMATION IN THE CONTEXT OF GRIP POSITION FOR PHRI


Västerås, Sweden

Thesis for the Degree of Master of Science in Engineering - Robotics

30.0 credits


Jacob Norman

jnn13008@student.mdh.se

Examiner: Martin Ekström

Mälardalen University, Västerås, Sweden

Supervisor: Fredrik Ekstrand

Mälardalen University, Västerås, Sweden

Supervisor: Joaquín Ballesteros

University of Málaga, Málaga, Spain

Supervisor: Jesus Manuel Gómez de Gabriel

University of Málaga, Málaga, Spain

June 27, 2021


Abstract

For human-robot interaction with the intent to grip a human arm, the ideal gripping location must be identified. In this work, the gripping location is situated on the arm, so it can be derived from the positions of the wrist and elbow joints. To achieve this, human pose estimation is proposed, as there exist robust methods that work both in and outside of lab environments. One such example is OpenPose, which thanks to the COCO and MPII datasets has recorded impressive results in a variety of scenarios in real time. However, most of the images in these datasets are taken from a camera mounted at chest height and show people who, in the majority of the images, are oriented upright. This presents the potential problem that prone humans, who are the primary focus of this project, cannot be detected, especially if seen from an angle that makes the human appear upside down in the camera frame. To remedy this, two different approaches were tested, both aimed at creating a rotation-invariant 2D pose estimation method. The first method rotates the COCO training data in an attempt to create a model that can find humans regardless of their orientation in the image. The second approach adds a RotationNet as a preprocessing step to correctly orient the images so that OpenPose can be used to estimate the 2D pose, before rotating the resulting skeletons back.


List of Figures

1 Different perspectives created by approaching a prone human from different angles.

2 Different perspectives created by approaching a prone human from different heights.

3 Different perspectives created by viewing a prone human from three different distances.

4 The search and rescue/assistive robot Valkyrie at UMA consisting of a robot manipulator with six degrees of freedom mounted on a mobile platform. A gripper is mounted on the end effector with a camera just above.

5 Gantt chart depicting the initial timeline for the project week by week.

6 Flowchart of the complete system read left to right where the two-dimensional (2D) pose estimation block represents the models presented in section 6.2.

7 One of the perspectives of the modified CMU Panoptic dataset where the first row from left to right is the original image and the obscured image. On the second row from left to right are the images that are rotated 90, 180 and 270 degrees respectively. All images have the same number of pixels; the 90 and 270 degree images have been cropped for this figure to reduce its size.

8 Flowchart of the 2D pose estimation using RotationNet and OpenPose, in the flowchart of the whole system in figure 6,

9 Four different frames from the CMU trainable dataset with the vectors showing the offset rotation plotted to the left and the correctly oriented image (desired before OpenPose) to the right. On the left images the blue vector is the line between the pelvis and neck while the orange line is the vertical vector starting at the pelvis.

10 Architecture of the RotationNet with the MobileNetV2 model summarized into one block. The global average pooling layer and Dense (fully connected) layer following the MobileNet interpret the features extracted from the MobileNet to classify the ImageNet dataset; the remaining layers are implemented to adapt the structure to RotationNet.

11 The first six elements in the shuffled and preprocessed dataset used for training; on top of each image is the ground truth rotation of each image, with positive values being counterclockwise oriented and negative values being clockwise oriented.

12 Results from OpenPose on the modified CMU Panoptic dataset which has been cropped or rotated counterclockwise, expressed as MPJPE between all different camera pairs in cm.

13 Results from OpenPose on the modified CMU Panoptic dataset which has been cropped or rotated counterclockwise, expressed as percentage of times the wrist or elbow was not found, sorted by camera.

14 Training graph of the OpenPose & MobileNet thin respectively using rotated COCO images where the x-axis represents every 500th batch size and the y-axis represents the loss value.

15 Histogram showing the error distribution of the RotationNet where the Y-axis represents the frequency expressed in percent and the X-axis the error.

16 Training data of RotationNet with the mean squared error & mean absolute error of the validation data expressed in the y-axis and the epoch expressed in the x-axis.

17 MPJPE of OpenPose & RotationNet respectively tested on CMU Panoptic trainable testing data where the Y-axis represents MPJPE and the X-axis represents the rotation of the image.

18 MPJPE of OpenPose and RotationNet respectively tested on CMU Panoptic trainable testing data where the Y-axis represents MPJPE and the X-axis represents the viewpoint expressed in panel index.

19 Frequency of misses of OpenPose & RotationNet respectively tested on CMU Panoptic trainable testing data where the Y-axis represents the percentage of misses and the X-axis represents the rotation of the image.

20 Frequency of misses of OpenPose & RotationNet respectively tested on CMU Panoptic trainable testing data where the Y-axis represents the percentage of misses and the X-axis represents the viewpoint expressed in panel index.

21 This image shows the self occlusion present in the CMU Panoptic modified dataset from HD cameras 7, 9, 11 and 15.

List of Tables


Acronyms

ROS Robot Operating System

2D two-dimensional

3D three-dimensional

UMA University of Malaga

MDH Mälardalen University

CNN Convolutional Neural Network

DCNN Deep Convolutional Neural Network

MPJPE Mean Per Joint Position Error

HRI Human-Robot Interaction

pHRI Physical Human-Robot Interaction

ReLU Rectified Linear Unit


1 Introduction

Human-Robot Interaction (HRI) is a field of study which focuses on developing robots that are able to interact with humans in various everyday situations. The core of the field is physically embodied social robots, which bring design challenges around the complex social structures that are inherent to human interaction. For example, robots with a human-inspired design encourage people to interact with them much as they would in human-human interaction. This can be used to plan the interaction; however, if the human’s expectations of the interaction are not met it can result in frustration. For social robots to interact with humans, sensors that interpret the surroundings are necessary, the most common of which correspond to the primary senses used in human-human interaction: vision, audition, and touch [1]. Some application areas for HRI are education, mental and physical health, industry, domestic chores, and search and rescue [2]. In areas such as search and rescue, physical human-robot contact is necessary, and these interactions require the utmost care so as not to harm the patient further. Therefore safety should remain the top priority and the system should behave in a predictable manner, since there is generally no way to anticipate how the human will react [3,4].

At the department of “Ingeniería de Sistemas y Automática” at the University of Malaga (UMA), Spain (where this thesis was partly conducted), a search and rescue/assistive robot named Valkyrie is being developed. It is a robot manipulator mounted on a mobile platform, which allows the robot to move to the location of a human in need and initiate contact by grasping their wrist. From this position, it will be possible to monitor the vital signs of the human, help the human up and eventually lead the human to safety in the event of a disaster. This method of HRI is preferable because conveying information to panic-stricken humans is challenging. It also allows Valkyrie to be used in elder care, where it can assist people to stand up after falling in their homes or in other environments where no other humans are around to help.

The aim of this thesis is to investigate the feasibility of detecting and reconstructing three-dimensional (3D) poses of humans lying on the ground with a monocular camera, so that the result can be used in the future to grasp a human arm with a robot manipulator. To achieve this, three factors will be considered. Firstly, the angle from which Valkyrie approaches the human. This is necessary to investigate because, depending on the angle of approach, it can appear as if the human is rotated relative to the camera. This is a scenario not found in all applications of 3D pose estimation and could therefore present an area that has to see improvement to realize Valkyrie. Figure 1 shows the different perspectives that this issue creates. Secondly, the elevation of the approach is interesting because if a prone human is approached on a flat surface the elevation of the camera will be somewhere around chest height. This would result in an angle relative to the prone human which deviates from the normal circumstance where both camera and subject are at the same elevation. If the elevation is not high enough, the soles of the feet could be the most prominent feature of the image, as opposed to a picture taken from straight above, which would never occur when approaching a prone human. Figure 2 shows the different perspectives that this issue creates. Lastly, the occlusion of the camera frame has to be investigated. When the mobile manipulator approaches a human arm, the other features will be left out of the frame of the camera. Therefore it is important that the proposed method can still identify the arm when no other features are visible.

Figure 1: Different perspectives created by approaching a prone human from different angles.

Figure 2: Different perspectives created by approaching a prone human from different heights.

Figure 3: Different perspectives created by viewing a prone human from three different distances.

The motivation behind this project is to create a research platform on which HRI can be tested. The researchers at the Departamento de Ingeniería de Sistemas y Automática have a particular interest in HRI where the robot initiates the contact, as this is a largely underrepresented field compared to human-initiated contact. Furthermore, there are several projects focused on search and rescue robots at UMA; however, none of the robots are small enough to be used for research on human assistive robots. The need for search and rescue robots in general stems from two main sources. Firstly, human-robot collaboration has been shown to be effective in assembly lines when adopting robots [5]. Secondly, adopting robots for search and rescue scenarios will reduce the risk rescue workers expose themselves to when they enter hazardous areas to search for survivors. After finding a human, Valkyrie can assess the condition of the human and respond appropriately, for example by leading them to safety or by sending a signal that immediate medical assistance is necessary while monitoring the vital signs. This way rescue workers would only expose themselves to danger when necessary, as opposed to constantly while searching for survivors. Valkyrie can also be used as a research platform for assistive robotics: many elderly people live alone and it is not feasible to have help around at all times. With an assistive robot in the home, help would arrive instantly if someone were to fall, whether or not the patient was able to signal for help. From that point the assistive robot would be able to help the human up or signal for help, as well as assess the condition of the human.


background information that would be useful in order to understand the concepts and methods presented in this paper. After the background, the related work is presented, which consists of more up-to-date methods in select fields that influenced the decision making in this project. Further on in the report the problem is broken down and the hypothesis as well as the research questions are stated. The methodology section presents the way in which the problem will be approached, followed by the methods used to solve the problem in the method section. Details of the implementation and a description of how each step was performed are presented next, which will allow one to replicate this work should the need arise. The results are then presented in the results section, after which they are discussed in the discussion section. The last section also addresses the hypothesis, research questions and future work.

Figure 4: The search and rescue/assistive robot Valkyrie at UMA consisting of a robot manipulator with six degrees of freedom mounted on a mobile platform. A gripper is mounted on the end effector with a camera just above.


2 Background

This section introduces some of the methods and concepts that were deemed necessary to understand in order to fully comprehend this thesis. The first is computer vision and how an image on the computer screen relates to the objects captured by the camera. The second is stereo vision, which can recover the depth otherwise lacking in an image. Thirdly, Convolutional Neural Networks (CNNs) are explained, which have revolutionized image processing and are being incorporated with the previously mentioned techniques.

2.1 Computer Vision

In order to find a human and grasp their arm, the robot first needs to detect the human it intends to rescue; this will be done with a camera. The simplest model of a camera is the pinhole camera, which consists of a small hole and a camera housing, on the back wall of which the image is projected upside down [6]. In a modern camera, the image is projected onto sensors that convert the image into a digital format. This results in four different coordinate systems. The first two describe the object relative to the camera and relative to the base of the robot manipulator. The third is a 2D coordinate system for the image sensor and lastly, there is a coordinate system expressed in pixels for the digital image [7]. Representing an object in pixel coordinates requires a transformation from the camera coordinates. This is done with the intrinsic camera matrix, which describes the internal geometry of the camera. By taking pictures of a checkerboard with a known pattern it is possible to estimate the intrinsic matrix as well as rectify the lens distortion; this process is called camera calibration. The translational and rotational difference between the world and camera coordinates is described by the extrinsic camera matrix, and the whole transformation from world coordinates to pixel coordinates is performed with the camera matrix [7]. This relation is expressed by equation 1, where the leftmost vector represents homogeneous pixel coordinates. The matrices from left to right are the affine matrix, the projection matrix, and finally the extrinsic matrix, which includes rotation and translation. The vector on the right side is the world coordinates in homogeneous coordinates.

\begin{equation}
\begin{bmatrix} u \\ v \\ 1 \end{bmatrix}
\sim
\begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix}
\begin{bmatrix} f & 0 & 0 & 0 \\ 0 & f & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}
\begin{bmatrix} R_{3 \times 3} & T_{3 \times 1} \\ 0 & 1 \end{bmatrix}
\begin{bmatrix} U \\ V \\ W \\ 1 \end{bmatrix}
\tag{1}
\end{equation}
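The chain of matrices in equation 1 can be illustrated with a minimal NumPy sketch. The affine matrix, focal length, rotation and translation below are hypothetical values chosen only for illustration; they are not calibration results from this work.

```python
import numpy as np

def project_point(world_point, affine, f, R, T):
    """Map a 3D world point to pixel coordinates following equation 1."""
    # Projection matrix (3x4) with focal length f.
    projection = np.array([[f, 0.0, 0.0, 0.0],
                           [0.0, f, 0.0, 0.0],
                           [0.0, 0.0, 1.0, 0.0]])
    # Extrinsic matrix (4x4) built from rotation R and translation T.
    extrinsic = np.eye(4)
    extrinsic[:3, :3] = R
    extrinsic[:3, 3] = T
    # Homogeneous world coordinates (U, V, W, 1).
    world_h = np.append(world_point, 1.0)
    uvw = affine @ projection @ extrinsic @ world_h
    # Divide by the last component to leave homogeneous coordinates.
    return uvw[:2] / uvw[2]

# Hypothetical camera: unit pixel scale, principal point at (320, 240).
affine = np.array([[1.0, 0.0, 320.0],
                   [0.0, 1.0, 240.0],
                   [0.0, 0.0, 1.0]])
f = 800.0                      # assumed focal length in pixels
R = np.eye(3)                  # camera aligned with the world frame
T = np.zeros(3)                # camera placed at the world origin
pixel = project_point(np.array([0.1, -0.05, 2.0]), affine, f, R, T)
print(pixel)
```

Because the result is only defined up to scale (the ∼ in equation 1), the final division by the third component is what turns the homogeneous vector into actual pixel coordinates.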

In this transformation the depth data is lost; however, there are several ways to recreate it, one of which is stereo vision. By placing two cameras next to each other, the same object can be seen in both image planes. If the translation and rotation between the cameras and the difference in pixel coordinates between the two images are known, it is possible to determine how far away the object is [6]. Another way of recreating the depth of the object is through perspective projection: if the distance from the principal point in the image, the focal length and the size of the object are known, the distance to the object can be determined. This relationship is expressed by equation 2, where x represents the horizontal distance from the principal point in pixels and X represents the horizontal distance from the principal axis to the object in camera coordinates. Finally, f represents the focal length and Z the distance to the object [7]. The same is true for the Y coordinates.

\begin{equation}
Z = \frac{fX}{x}
\tag{2}
\end{equation}
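As a small worked example of equation 2, the sketch below computes the depth from a known metric offset and its observed pixel offset, and then checks the result by projecting back. The function name and all numeric values are assumptions made for illustration.

```python
def depth_from_perspective(focal_length_px, offset_m, offset_px):
    """Equation 2: Z = f * X / x, with f and x in pixels and X in metres."""
    if offset_px == 0:
        raise ValueError("the pixel offset must be non-zero to recover depth")
    return focal_length_px * offset_m / offset_px

focal = 800.0   # assumed focal length in pixels
X = 0.25        # assumed horizontal offset from the principal axis in metres
x = 100.0       # observed horizontal offset from the principal point in pixels

z = depth_from_perspective(focal, X, x)
# Projecting the point back with x = f * X / Z should give the observed offset.
x_back = focal * X / z
print(z, x_back)
```

The division by the pixel offset shows why a point on the principal axis carries no depth information in this formulation.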

Initially it seems like both of these methods are inapplicable since weak perspective projection requires the size of the object and stereo vision requires two different perspectives.

However, by mounting the camera on the robot manipulator it is possible to move the manipulator to capture multiple perspectives with the camera, while the robot configuration provides the position of the camera for each frame. Additionally, the right-hand side of equation 2 can represent the focal length times the scaling of the object, which can be estimated by comparing the 2D and 3D pose [8].


2.2 Stereo vision

The most researched area within stereo vision is stereo matching, the practice of finding the same pixel or feature in two or more perspectives. This is necessary to estimate the depth of that feature and may seem like an easy task since humans do it constantly; however, it is a challenge within computer vision [9]. In early works, finding and matching features was done using the Harris corner detector [6], which is still used and serves as the foundation for several approaches today. A downside to this algorithm, however, is that it responds poorly to changes in scale, which are common in real-world applications when changing camera focus or zooming [6]. This problem was remedied by Lowe when SIFT was published [10]. SIFT is a feature descriptor that is invariant to translation, rotation and scaling, and robust to moderate perspective transformations and illumination changes. It has proven useful for a variety of applications such as object recognition and image matching. Another solution to the correspondence problem is non-parametric local transforms [11]. By transforming the images before computing the correspondence, the approach becomes robust against factionalism, which is when subsections of the image have their own distinct parameters. This is possible since the correspondence is no longer calculated on the data values but rather on the relative order of the data values. For this approach to work, there has to be significant variation between the local transforms, and the results in the corresponding area must be similar.
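To illustrate what it means to match on the relative order of the data values, the sketch below implements a transform of this kind, often referred to as the rank transform: each pixel is replaced by the number of neighbours in its window that are darker than it. The window radius, the random test image and the simulated gain and bias are assumptions for illustration, not settings taken from [11].

```python
import numpy as np

def rank_transform(image, radius=1):
    """Count, for each pixel, the neighbours in its window that are darker."""
    image = np.asarray(image, dtype=float)
    height, width = image.shape
    # Repeat the border pixels so every pixel has a full window.
    padded = np.pad(image, radius, mode="edge")
    rank = np.zeros((height, width), dtype=int)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            if dy == 0 and dx == 0:
                continue
            neighbour = padded[radius + dy: radius + dy + height,
                               radius + dx: radius + dx + width]
            rank += (neighbour < image).astype(int)
    return rank

rng = np.random.default_rng(0)
left = rng.integers(0, 256, size=(8, 8)).astype(float)
# Simulate a second camera that sees the same scene with a different
# gain and bias (hypothetical values).
right = 0.5 * left + 20.0

same = np.array_equal(rank_transform(left), rank_transform(right))
print(same)
```

Since a change in gain and bias preserves the ordering of intensities, both images yield the same transformed image, so a matching cost computed on the transformed values is unaffected by the difference between the cameras.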

Once the correspondence problem has been solved and the same point has been found in both images, it is possible to calculate the depth by projecting a ray from the focal point of each camera through the corresponding point in its image plane. This process is called triangulation and makes it possible to calculate the coordinates of a point in space if the translation, rotation, and camera calibration are known [6]. Initially, this problem seems trivial since finding the intersection between two lines is straightforward; however, due to noise, the two rays rarely intersect. Hartley [12] presented an optimal solution to this problem that is invariant to projective transformations and finds the best point of intersection non-iteratively.
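The following sketch shows a simple linear triangulation from two views. It is not the optimal method of [12]; it is the common linear least-squares formulation, shown here only to make the geometry concrete. The camera matrices and the 3D point are hypothetical.

```python
import numpy as np

def triangulate_linear(P1, P2, x1, x2):
    """Linear triangulation of one point from two 3x4 camera matrices."""
    # Each view contributes two linear constraints on the homogeneous point.
    A = np.stack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # The solution is the right singular vector of the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    point_h = vt[-1]
    return point_h[:3] / point_h[3]

def project(P, point):
    """Project a 3D point with camera matrix P and return pixel coordinates."""
    x = P @ np.append(point, 1.0)
    return x[:2] / x[2]

# Hypothetical intrinsics and two cameras 0.2 m apart along the x-axis.
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-0.2], [0.0], [0.0]])])

X_true = np.array([0.1, -0.05, 2.0])
X_est = triangulate_linear(P1, P2, project(P1, X_true), project(P2, X_true))
print(X_est)
```

With noise-free observations the rays intersect and the point is recovered exactly; with noisy observations the same code returns a least-squares compromise, which is the situation the optimal method addresses more rigorously.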

Scharstein and Szeliski [13] present a taxonomy for stereo vision in an attempt to categorize the different components of dense stereo matching approaches, that is, approaches that estimate the depth for all pixels in the image. Furthermore, Scharstein and Szeliski devised evaluation metrics for each of the components together with a framework that allows researchers to analyze their algorithms. This serves as a standardized test in the field of dense stereo matching, with researchers submitting their algorithms to add them to Middlebury’s database of state-of-the-art algorithms. The database is presented on their website [14] and is still maintained. The state of the art in stereo vision has progressed to the point where all the top methods on Middlebury utilize a Convolutional Neural Network (CNN).

2.3 CNN

A CNN is a machine learning algorithm that is most commonly used to extract information from images by applying a series of filters. The size of the filters varies and their values are tuned to extract the information that is relevant to achieve the desired result. In the more complex CNN architectures, there can be several hundred layers of filters, pooling layers, and activation functions that extract and condense information [15]. These architectures are referred to as Deep Convolutional Neural Networks (DCNNs) because of their many layers, which increase the complexity. DCNNs can have several million variables, which makes training time-consuming. A downside to CNNs is that they are a black box, in the sense that one cannot take the result from one layer and determine whether it is working or not. Within the field of computer vision, deep learning has revolutionized several topics, and DCNNs in particular have had an impact on the field [16].
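The three building blocks mentioned above (filter, activation function and pooling layer) can be sketched in a few lines of NumPy. The filter values here are hand-picked to respond to a vertical edge; in a trained CNN they would be learned from data. The image and filter are assumptions made for illustration.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2D cross-correlation, the operation CNN layers call convolution."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    """Rectified Linear Unit: negative responses are set to zero."""
    return np.maximum(x, 0.0)

def max_pool2x2(x):
    """Keep the strongest response in each non-overlapping 2x2 block."""
    h, w = x.shape[0] // 2 * 2, x.shape[1] // 2 * 2
    return x[:h, :w].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

# Toy 6x6 image: dark left half, bright right half.
image = np.zeros((6, 6))
image[:, 3:] = 1.0
# Hand-picked vertical edge filter (a trained network would learn its values).
kernel = np.array([[-1.0, 0.0, 1.0]] * 3)

features = max_pool2x2(relu(conv2d(image, kernel)))
print(features)
```

The 6x6 input is condensed into a 2x2 feature map that still signals the presence of the edge, which is the extract-and-condense behaviour that deeper architectures repeat over many layers.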

One major downside to CNNs is that they are dependent on the right data for training. In order to implement a CNN for a specific purpose, a representative dataset is necessary. However, this also allows them to be versatile, and consequently CNNs are used for a variety of applications such as object detection, object classification, segmentation, and pose estimation, to name a few.

A CNN is only as good as the dataset used to train it, and in 2009 Deng et al. [17] published a dataset titled ImageNet intended for object recognition, image classification, and automatic object clustering. This led to the creation of several DCNNs that, because of the diversity of ImageNet, are versatile and thus have been used for a number of different applications. This is in part because of a method called transfer learning, which circumvents the long training time by taking the existing weights learned from ImageNet and fine-tuning them with a different dataset. This takes advantage of the features the model has already learned: instead of re-learning everything, the known features are adapted for the new purpose [18].

2.4 Human pose estimation

Human pose estimation is a field of study which attempts to extract the skeleton of a human in either 2D or 3D. This is achieved with a CNN that has fifteen to twenty-five outputs that represent the coordinates of different body parts. A wireframe is then constructed between adjacent parts, creating a skeletal frame. Human pose estimation has also been applied to hands, feet, and the face to include fingers, toes, eyes, ears, etc. in the wireframe. Mean Per Joint Position Error (MPJPE) is used to evaluate models; it is the distance between each estimated joint and its ground truth position, averaged over all joints. This requires a dataset with ground truth, of which there are several. However, not all datasets use the same skeleton structure, so the MPJPE of different models is not always directly comparable. Furthermore, there are many ambiguities in the torso and, as a result, getting an accurate estimate of the hips is harder than of the arms.
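A minimal sketch of the MPJPE metric is given below, assuming that predictions and ground truth are arrays with one row per joint and use the same skeleton layout. The three-joint pose and its errors are made-up values in centimetres.

```python
import numpy as np

def mpjpe(predicted, ground_truth):
    """Mean Euclidean distance between corresponding joints."""
    predicted = np.asarray(predicted, dtype=float)
    ground_truth = np.asarray(ground_truth, dtype=float)
    if predicted.shape != ground_truth.shape:
        raise ValueError("both poses must use the same skeleton layout")
    # Distance per joint, then the mean over all joints.
    return np.linalg.norm(predicted - ground_truth, axis=-1).mean()

# Hypothetical shoulder, elbow and wrist positions in cm.
gt = np.array([[0.0, 0.0, 0.0],
               [0.0, 30.0, 0.0],
               [0.0, 30.0, 25.0]])
# Prediction with a 5 cm error on the shoulder and on the wrist.
pred = gt + np.array([[3.0, 4.0, 0.0],
                      [0.0, 0.0, 0.0],
                      [0.0, 0.0, 5.0]])

error_cm = mpjpe(pred, gt)
print(error_cm)
```

The shape check reflects the comparability issue noted above: the metric is only meaningful when both poses contain the same joints in the same order.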


3 Related work

This section addresses the related work in HRI. First, 3D pose estimation will be investigated to find a human and locate an appropriate position to grip, then ways of controlling the robot manipulator and trajectory generation will be discussed. Lastly, evaluation methods will be investigated together with metrics to compare the systems to related work.

3.1 Pose estimation

Unlike the majority of the research in this area [19], this project will mount a monocular camera on a robot manipulator. This allows the camera to be moved to obtain images from different perspectives or a sequence of images. With this additional information, methods such as stereo vision can be considered, assuming the subject is static. In this section, the methods used in related work are sorted by input to the model, after which different datasets and evaluation methods are discussed.

3.1.1 Single view RGB image

Human pose estimation from a single RGB image is a topic that has seen great progress thanks to deep learning [19,20]. Challenges in this topic include estimating the poses of multiple people in the same image, inferring the locations of occluded limbs and training a robust model that works outside of lab environments [20,21]. Since machine learning is the primary method for this problem, it is important that a good dataset is used for training. Currently, there exists no 3D pose dataset that has ground truth data outside lab environments [22,8,23,24]. As a result, most research has divided the problem into two separate tasks: 2D pose estimation and 2D to 3D pose conversion. This allows the 2D pose estimator to train on a diverse dataset with ground truth, while the 2D to 3D pose converter can be trained to infer the depth of the joints from a motion capture dataset. While this does not solve the problem, it allows the use of several methods that alleviate it [22,8,23,24].

2D pose estimation: This problem is largely considered solved; however, when it is used as a step in 3D pose estimation the margin for error is smaller. This is because minor errors in the locations of the 2D body joints are amplified when moving to 3D and can have large consequences in the 3D reconstruction [19]. A 2D pose estimation method that has seen widespread use in 3D pose estimation from a single RGB image is the Stacked Hourglass Networks [22] proposed by Newell et al. [25]. By repeatedly pooling and then up-sampling the image, features are captured at many different resolutions and thus at every scale. This allows the network to understand local features such as faces or hands while simultaneously being able to interpret them together with the rest of the image and identify the pose. Another 2D pose estimation algorithm, OpenPose, developed by Cao et al. [26], has been used before estimating the 3D pose from images of multiple views [27, 28]. OpenPose uses a CNN to predict confidence maps for body parts and affinity fields that encode the association between them; a greedy parser is then used to obtain the resulting 2D pose estimate. This method has been proven to work in real time and can provide the locations of fingers as well as facial features.
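To give an intuition for how an affinity field can encode the association between two body parts, the sketch below scores a candidate limb by sampling the field along the segment between two joint candidates and measuring how well the field vectors align with the segment direction. This is a simplified illustration under assumed array shapes and synthetic data; it is not the OpenPose implementation, and the function name is hypothetical.

```python
import numpy as np

def limb_score(field, joint_a, joint_b, num_samples=10):
    """Average alignment between an affinity field and the segment a -> b.

    field has shape (height, width, 2) and stores an (x, y) vector per pixel;
    joints are given as (x, y) pixel coordinates.
    """
    joint_a = np.asarray(joint_a, dtype=float)
    joint_b = np.asarray(joint_b, dtype=float)
    direction = joint_b - joint_a
    length = np.linalg.norm(direction)
    if length == 0:
        return 0.0
    unit = direction / length
    score = 0.0
    for t in np.linspace(0.0, 1.0, num_samples):
        # Nearest pixel to the sample point on the segment.
        x, y = np.rint(joint_a + t * direction).astype(int)
        score += field[y, x] @ unit
    return score / num_samples

# Synthetic field: a horizontal "forearm" along row 10 pointing in +x.
field = np.zeros((20, 20, 2))
field[10, 2:16] = [1.0, 0.0]

elbow = (2, 10)
wrist_candidates = [(15, 10), (15, 3)]   # the second candidate is a distractor
scores = [limb_score(field, elbow, wrist) for wrist in wrist_candidates]
best = wrist_candidates[int(np.argmax(scores))]
print(scores, best)
```

A greedy parser in this spirit would pick the highest scoring pairings first, which is what lets the elbow be connected to the wrist of the same arm even when other candidates are present in the image.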

2D to 3D pose conversion: 3D pose estimation from a single image is an ill-posed problem because of the 2D nature of the source data. This imposes challenges where the 3D pose estimator has to resolve depth ambiguity while also trying to infer the position of occluded limbs [19]. Among related work, a common approach to this problem is regression [8, 24, 22]. By using a CNN to calculate a heat map of the human, it is possible to create a bounding box around the human which normalises the subject’s size and position, thus freeing the regressor from having to localise the person and estimate the scale. The downside to this approach is that global position information is lost [8]. The evaluation methods for this problem also use a pelvis-centred coordinate system in which there is no transformation between the subject and the camera. Dabral et al. [21] realised this was a problem for applications such as action recognition and proposed a weak-perspective projection assumption. This assumes that all points on a 3D object are at roughly the same depth (the depth of the root joint) and requires a scaling factor for the object, which is estimated by comparing the 2D and 3D poses. A limitation of this method is that it does not work when the human is aligned with the optical axis. Furthermore, it is not intended to be highly accurate, but rather to make spatial ordering apparent. Similarly, Mehta et al. [8] proposed weak perspective projection; however, their approach does not require iterative optimisation, which makes it less time-consuming. By using a generalised form of Procrustes analysis to align the projection of the 2D and 3D pose, the translation relative to the camera can be described by a linear least squares equation.
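The idea of recovering the translation by comparing the 2D and 3D poses can be sketched as a small linear least-squares problem. The formulation below is a simplified illustration of the weak-perspective assumption, not the exact method of [8] or [21]: it assumes 2D joints measured relative to the principal point, a root-relative 3D pose in metres and a known focal length, all of which are hypothetical here.

```python
import numpy as np

def fit_weak_perspective(pose_2d, pose_3d, focal_length):
    """Estimate the camera-relative translation of a root-relative 3D pose.

    Weak perspective model: pose_2d = s * (pose_3d[:, :2] + T_xy), where the
    single scale s = focal_length / Z applies to every joint.
    """
    pose_2d = np.asarray(pose_2d, dtype=float)
    pose_3d = np.asarray(pose_3d, dtype=float)
    n = len(pose_2d)
    # Unknowns: s, cx = s * T_x, cy = s * T_y (the model is linear in these).
    A = np.zeros((2 * n, 3))
    A[0::2, 0] = pose_3d[:, 0]
    A[0::2, 1] = 1.0
    A[1::2, 0] = pose_3d[:, 1]
    A[1::2, 2] = 1.0
    b = pose_2d.reshape(-1)          # ordered as u0, v0, u1, v1, ...
    (s, cx, cy), *_ = np.linalg.lstsq(A, b, rcond=None)
    return np.array([cx / s, cy / s, focal_length / s])

focal = 800.0                        # assumed focal length in pixels
pose_3d = np.array([[0.0, 0.0, 0.0],
                    [0.3, 0.0, 0.05],
                    [0.3, 0.25, -0.05],
                    [-0.2, 0.1, 0.0]])
T_true = np.array([0.4, -0.1, 2.5])  # hypothetical translation in metres
# Synthetic 2D observations generated with the weak perspective model.
pose_2d = (focal / T_true[2]) * (pose_3d[:, :2] + T_true[:2])

T_est = fit_weak_perspective(pose_2d, pose_3d, focal)
print(T_est)
```

The estimated scale becomes unreliable when the pose has little extent in the image plane, which is consistent with the limitation noted above for a human aligned with the optical axis.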

To improve the robustness of 3D pose estimation outside of lab environments, Yasin et al. [23] proposed two separate sources of training data for 2D and 3D pose estimation. The 3D data was gathered from a motion capture database and projected as normalised 2D poses. A 2D pose estimator then estimates the pictorial structure model and retrieves the nearest normalised 3D pose, which serves as an estimate of the final pose. By projecting the 3D pose back into 2D, the final 3D pose is found by minimising the projection error.

Similarly, Wang et al. [24] also suggested that the final 3D pose should be projected back into 2D and used to improve the result. The major distinction between the two methods is that Yasin et al. utilise a k-nearest neighbour search to improve the estimate, while Wang et al. feed the projection error into a CNN.

Another approach to solving the lack of a sufficient dataset is adversarial learning [22], which employs two networks: a generator that creates training samples and a discriminator that tries to distinguish them from real samples. The objective of the generator is to create 3D poses good enough to fool the discriminator into thinking the samples are real. Its architecture is based on the popular stacked hourglass with input from both 3D and 2D annotated data. The 2D to 3D converter in a generative adversarial network can only become as good as the discriminator. Therefore a lot of emphasis is placed on the discriminator, which is based on a multi-source architecture that combines CNNs with input from the image used to generate the data, geometric descriptors, a 2D heat map and a depth map.

3.1.2 Multi view RGB image

According to Sarafianos et al., resolving depth ambiguities in 3D pose estimation would be a much simpler task if depth information could be obtained from a sensor [19]. Additionally, Amin et al. [29] argue that the search complexity can be reduced significantly by treating this problem as a joint inference problem between two 2D poses as opposed to a single 3D pose. With two different viewpoints available, stereo matching can be used to calculate the depth, which unlike the methods used for single-view depth inference does not rely on estimation. Therefore, the methods presented in this section will be more robust than the ones presented for single view.

2D to 3D pose conversion: Both Garcia et al. [28] and Schwartz et al. [27] utilised OpenPose, explained in section 3.1.1, for 2D pose estimation. Garcia et al. used the joint locations from OpenPose to rectify the image and then as features for triangulation. Schwartz et al. also used the joint locations to remove joints which were only visible from one camera. Pixel coordinates were then randomly sampled from a heat map generated by OpenPose and back-projected as rays, and a 3D joint hypothesis was constructed from the point closest to all the rays. The 2D pose confidence was then calculated by projecting the 3D position onto the 2D heat maps. On top of this, belief propagation was used for posterior estimation and temporal smoothness was used to reduce the jitter between frames. Hofman et al. [30] suggested a reversed approach in which the 2D pose is used to find a set of similar poses in a 3D pose library. The 3D pose is then evaluated by projecting it to the other cameras and comparing it with the 2D poses for each camera. If the error is too large the 2D poses are discarded; otherwise the triangulation and projection error is minimised by trying more 3D poses and calculating the error. The best-ranked results are then optimised with gradient descent.
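Constructing a joint hypothesis from the point closest to all the rays can be sketched as a small least-squares problem. The sketch below is a generic formulation under assumed inputs (camera centres and ray directions in a common world frame); it is not taken from the implementation in [27], and the camera positions are hypothetical.

```python
import numpy as np

def closest_point_to_rays(origins, directions):
    """Least-squares point minimising the squared distance to a set of rays."""
    origins = np.asarray(origins, dtype=float)
    directions = np.asarray(directions, dtype=float)
    directions = directions / np.linalg.norm(directions, axis=1, keepdims=True)
    A = np.zeros((3, 3))
    b = np.zeros(3)
    for origin, d in zip(origins, directions):
        # Projector onto the plane orthogonal to the ray direction.
        M = np.eye(3) - np.outer(d, d)
        A += M
        b += M @ origin
    # Solvable as long as the rays are not all parallel.
    return np.linalg.solve(A, b)

# Hypothetical joint position and three camera centres in metres.
joint = np.array([0.2, 0.1, 1.5])
origins = np.array([[0.0, 0.0, 0.0],
                    [1.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0]])
# Noise-free rays from each camera centre through the joint.
directions = joint - origins

estimate = closest_point_to_rays(origins, directions)
print(estimate)
```

With noisy back-projected rays the same system returns the point with the smallest summed squared distance to all rays, so additional cameras simply add further terms to the sum.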

Direct 3D pose estimation One of the problems with 2D to 3D pose reconstruction is that 3D information has to be inferred before the depth can be calculated [27,31]. Gai et al. [31] propose a solution to this by first finding the relation between the different views and then estimating the pose. This is done with a ResNet that takes each image from the different views as input and then merges the information in the pooling layer. Regression is then used to estimate the pose and shape of the human, after which an adversarial network is trained to estimate the mesh, whose error is propagated through the entire pipeline. This solution runs in real-time and is comparable to similar implementations done on single view RGB images, with the distinction that it calculates the global coordinates of the 3D pose. Gai et al. also discovered that the joint error decreased when the number of viewpoints increased.

3.1.3 Data sets and evaluation

Pose estimation is often implemented with an AI approach that requires large datasets. 3D pose estimation from a single image suffers heavily from poor data sets, which are either diverse or contain ground truth for joint depth. This is detrimental for machine learning approaches; however, approaches to mitigate this issue exist [22,23,24]. 2D pose estimation does not suffer from this issue, since 2D pose estimators are trained on 2D pose estimation data which is extensive, diverse, and has ground truth. The depth is then calculated through geometry and therefore does not suffer from bad training data. Another downside to the lack of a good data set is that every researcher has to decide which data set to use, which results in several methods that are not directly comparable to each other. This can be a problem when every paper claims to improve on the state-of-the-art by only comparing to the papers that used the same data set.

One data set that is recurrent among articles in 3D pose reconstruction is Human3.6M by Ionescu et al. [32], which has a standard evaluation protocol. The metric used to determine the quality of the match is the mean per joint position error (MPJPE), which represents the error between the estimate and ground truth using Procrustes alignment. Another dataset of interest is CMU Panoptic, which captures humans in a lab environment using a dome mounted with cameras. The dataset includes a total of 31 HD cameras, 480 VGA cameras, and 10 RGB-Depth sensors. Full 3D poses are captured of humans socializing, dancing, playing musical instruments, or showing off a range of motion [33].
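As a minimal illustration of the metric, the MPJPE can be computed as the mean Euclidean distance between corresponding estimated and ground-truth joints. The sketch below deliberately omits the Procrustes alignment step and uses hypothetical joint coordinates; it is not the Human3.6M evaluation code.

```python
import math


def mpjpe(predicted, ground_truth):
    """Mean Euclidean distance between corresponding joints."""
    if not predicted or len(predicted) != len(ground_truth):
        raise ValueError("joint lists must be non-empty and of equal length")
    distances = [math.dist(p, g) for p, g in zip(predicted, ground_truth)]
    return sum(distances) / len(distances)


# Hypothetical elbow and wrist positions in cm.
ground_truth = [[0.0, 0.0, 0.0], [10.0, 0.0, 0.0]]
estimate = [[0.0, 3.0, 4.0], [10.0, 0.0, 1.0]]

# Per-joint errors are 5 cm and 1 cm, so the mean is 3 cm.
error = mpjpe(estimate, ground_truth)
```

The same function works for 2D image coordinates, in which case the result is expressed in pixels.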


4 Problem formulation

There are several challenges associated with human pose estimation, the first of which is related to the 3D annotated datasets. Since 3D annotated datasets are both expensive and difficult to create outside lab environments it can be difficult to adapt the model to the desired scenario. In order to build a model that can accurately find the wrist and elbow in the perspectives presented in figures 1, 2, and 3, it is necessary to have a dataset that represents these situations.

4.1 Limitations

To make this thesis manageable in the set time frame, several limitations and constraints have been put on the work. Originally, this endeavour started in Spain, where access to the physical robot was possible; therefore, testing the solution directly on the robot was a big focus. Due to the impact of Covid-19, the project was moved to Sweden and access to the robot was no longer possible. Consequently, the project took on a more theoretical approach. In practice, this meant a lot more focus was put into identifying a good gripping location and the robustness of the solution, as opposed to the interaction between the camera and robot manipulator.

L1 There will be one stationary human lying down in the camera frame.

L2 The exact gripping location will not be identified; instead, the wrist and elbow joints will be identified, as it is assumed the ideal gripping location is on a vector between the joints.

L3 Movement of the mobile manipulator will not be considered.

4.2 Constraints

C1 Images to test the solution will not be taken from a camera mounted on a robot manipulator.

C2 Navigation of the robot manipulator will be considered out of scope for this thesis.

C3 Only solutions which use a monocular camera will be considered.

C4 The images used for this thesis are collected with a monocular camera; therefore, the approach has to take this into consideration.

4.3 Hypothesis

A monocular camera mounted on a robot manipulator provides

sufficient information to detect a prone human and identify a suitable

gripping location.

4.4 Research questions

RQ1 Can a gripping position be identified regardless of the direction from which Valkyrie approaches a prone human?

RQ2 Is a data set designed to represent a prone human from a multitude of angles

necessary to achieve an acceptable estimation of the arm?


5 Methodology

This project will follow Agile guidelines [34], as they are prevalent within the industry, with employers inquiring whether recent graduates are familiar with them. There are several agile methodologies to choose from; however, since there is only one participant in this project, a modified model of SCRUM and feature-driven development has been devised and is explained below.

The project starts with a research phase, the goal of which is to develop a better understanding of the problem and to find state-of-the-art solutions. This phase ends with a review of the information, after which a solution is decided upon. The next stage is implementation-specific planning, which consists of creating a backlog of features where every feature has a development and design plan, as well as a priority list that determines the order in which the features should be completed. The features are also divided into several stages that represent core functionality and future expansions. Next, an iterative design process begins where, similarly to SCRUM, a feature (sprint) is selected and then implemented. The feature is considered completed after each item in the definition of done (see table 1) has been fulfilled. When a feature is complete, the next feature in the list is selected; however, if a feature fails to meet all the criteria in the definition of done, it is skipped and placed back into step one or two depending on the issue. Similar to SCRUM, this implementation phase lasts for the number of days decided upon during the implementation planning. After the implementation is complete, the system is evaluated as a whole, and once that is done, finalization of the report and presentation are the last steps in this project. The Gantt chart can be seen in figure 5.

Definition of done

Feature

Functional test passed O

Feature evaluated and results recorded O

Acceptance criteria met O

Feature documented in project report O

Table 1: All 4 criteria required to fulfil the definition of done


6 Method

The proposed method to find the ideal gripping location on a prone human

consists of first taking a picture, then moving the robot manipulator on which the

camera is mounted, and taking another picture. Each of the pictures is then used

for 2D pose estimation, and the results are later triangulated. The flowchart of

this system can be seen in figure 6.

Figure 6: Flowchart of the complete system read left to right where the 2D pose estimation block represents the models presented in section 6.2

6.1 Evaluation of State of the Art

To simulate a prone human being approached from a multitude of angles, several different datasets were considered. A common theme among them was a focus on upright humans with a camera at chest height, most often in social or sports-related scenarios. These datasets are most commonly intended to represent humans for the purpose of action understanding, surveillance, HRI, motion capture, and CGI [35]. As a result, an existing dataset could not be used for the purpose of this report; instead, a dataset had to be modified in an attempt to represent the scenario. This limits the number of available datasets, since not all datasets are under a license that allows modifications. One dataset that does is CMU Panoptic [36]. Furthermore, the dataset is constructed using a dome mounted with cameras to capture different views, which allows the simulation of approaching the human from different directions.

6.1.1 Modified CMU Panoptic

The CMU Panoptic dataset has been used extensively in research and consists of segments of social situations, range of motion, and dancing. CMU Panoptic is the largest 3D annotated dataset in terms of the number of camera views [37]; unfortunately, there are no segments where the focus is on a human in a prone position. To remedy this, the dataset is modified by rotating images taken from a range of motion segment by 90, 180, and 270 degrees to represent the pose a prone human would have when approached from the head or sides; this can be seen more clearly in figure 1. Furthermore, a zoomed-in view of the right arm is added to test whether the arm is identifiable when the rest of the human is obscured. Figure 7 shows example images taken from the modified dataset. This dataset is referred to in the report as "Modified CMU Panoptic".


Figure 7: One of the perspectives of the modified CMU Panoptic dataset, where the first row from left to right shows the original image and the obscured image. The second row shows, from left to right, the images rotated 90, 180, and 270 degrees respectively. All images have the same number of pixels; the 90 and 270 degree images have been cropped for this figure to reduce its size.

The state-of-the-art method was evaluated based on the MPJPE of the wrist and elbow joints. This is the same metric that is used by related work, except that for the purpose of this report only the wrist and elbow joints of one of the arms are considered. Since the cameras cover a 360 degree perspective around the human, only one arm is considered, because using the closest arm would result in mirrored results from cameras facing each other. The results are segmented by rotation and crop, and placed in a grid to display the triangulation between individual camera pairs. A graph showing how often at least one of the required joints was not detected is also presented in section 8.

6.1.2 Choice of state-of-the-art method

Among the state-of-the-art, there are several interesting methods, some of which are already implemented and free to use for research purposes and some of which are not implemented. When choosing which method to evaluate, several factors were considered. Firstly, how difficult would the model be to implement on Valkyrie, including training and eventual porting to Robot Operating System (ROS)? Secondly, is the method proven to work in the literature? Thirdly, is the method available with the author's implementation, or does it require implementation and training from scratch?

The chosen state-of-the-art method to evaluate is OpenPose because it already has a ROS implementation that can integrate with Valkyrie, a free-to-use Caffe model, as well as several TensorFlow ports which make transfer learning easier. Furthermore, OpenPose is well established in the literature, and this choice also coincides with the wishes of UMA.


6.2 Adaption of State-of-the-art

In an attempt to create a rotation-invariant model, two different approaches were tested. The first attempts to make OpenPose rotation-invariant by rotating the training data that is used to train the model end to end. The intention is that if differently oriented humans are present in the training data, the CNN can adapt to identify human limbs in all orientations. The second approach adds a DCNN as a preprocessing step that extracts the orientation of the human; this orientation is used to rotate the image before 2D pose estimation, after which the skeleton is rotated back. This has been done previously by Kong et al. [38] to make a hand pose estimation model rotation-invariant.

6.2.1 Training OpenPose

The creators of OpenPose have provided the code to train OpenPose end to end; for the original model this was done using the Common Objects in Context (COCO) dataset [39]. In an effort to achieve a rotation-invariant 2D pose estimation model, the COCO images were rotated randomly in the preprocessing step and used to train both OpenPose and a MobileNet 2D pose estimation solution.

6.2.2 RotationNet

As an alternative to rotation invariant 2D pose estimation, a preprocessing step

that attempts to extract the angle of the upper body was created. By taking the

MobileNetV2 architecture and adding two fully connected layers separated by an

activation layer with the Rectified Linear Unit (ReLU) function, it is possible to

create a system that takes an input image and treats it as a regression problem.

The desired output of this architecture is the angle at which the image has to be

rotated to get the human aligned with the vertical axis of the image. The 2D

skeleton can then be extracted using 2D pose estimation after which the skeleton

will be rotated back the same amount so that it aligns with the original image.

The flowchart of this system is presented in figure 8.
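The "rotate, estimate, rotate back" step can be sketched with a plain 2D rotation of the keypoints about the image centre. The example below is only an illustration: the keypoints, the image centre, and the predicted angle are hypothetical, the sign convention is an assumption, and the pose estimator itself is left out.

```python
import math


def rotate_points(points, angle_deg, center):
    """Rotate 2D points by angle_deg about center with a rotation matrix."""
    a = math.radians(angle_deg)
    cos_a, sin_a = math.cos(a), math.sin(a)
    cx, cy = center
    rotated = []
    for x, y in points:
        dx, dy = x - cx, y - cy
        rotated.append((cx + cos_a * dx - sin_a * dy,
                        cy + sin_a * dx + cos_a * dy))
    return rotated


center = (112.0, 112.0)                      # centre of a 224x224 image
original = [(150.0, 100.0), (170.0, 60.0)]   # hypothetical elbow and wrist
predicted_angle = 37.0                       # stand-in for the network output

# Coordinates as they would appear in the re-oriented image.
aligned = rotate_points(original, predicted_angle, center)

# Rotating by the negative angle maps the skeleton back to the original image.
restored = rotate_points(aligned, -predicted_angle, center)
```

Because the second rotation is the exact inverse of the first, any error in the predicted angle only affects how upright the human appears to the pose estimator, not the mapping back to the original frame.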

Figure 8: Flowchart of the 2D pose estimation using RotationNet and OpenPose, in the flowchart of the whole system in figure 6,

6.2.3 CMU panoptic trainable

Figure 9: Four different frames from the CMU trainable dataset with the vectors showing the offset rotation plotted to the left and the correctly oriented image (desired before OpenPose) to the right. On the left images the blue vector is the line between the pelvis and neck while the orange line is the vertical vector starting at the pelvis.

The "CMU panoptic trainable" dataset was created using 120 different views of six people doing the same movements. The movements were selected from a sequence where first a series of arm motions is enacted, after which the whole upper body is moved. There is only one human in the frame at a time, and each picture has a corresponding 3D skeleton and offset rotation. The offset rotation is extracted with the formula presented in equation 3, where $n_1$ represents a unit vector originating in the pelvis oriented towards the neck and $n_2$ represents a unit vector originating in the pelvis oriented vertically with respect to the image.

$$\theta = \arccos\left(n_1 \cdot n_2\right) \tag{3}$$

These vectors are further demonstrated in figure 9, where the left image shows the offset and the right image shows the corrected image RotationNet is supposed to feed to OpenPose.
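Equation 3 can be sketched in a few lines. The example assumes that the pelvis and neck are given as 2D image coordinates with the y-axis pointing down, so that "vertical" is the vector (0, -1). Since the arccosine only yields an unsigned angle, the sketch also returns a signed angle obtained from the 2D cross product; that sign convention is an assumption made for illustration and is not taken from the thesis.

```python
import math


def offset_rotation(pelvis, neck):
    """Return (unsigned, signed) angle in degrees between pelvis->neck and vertical."""
    dx, dy = neck[0] - pelvis[0], neck[1] - pelvis[1]
    norm = math.hypot(dx, dy)
    if norm == 0.0:
        raise ValueError("pelvis and neck coincide")
    n1 = (dx / norm, dy / norm)   # unit vector pelvis -> neck
    n2 = (0.0, -1.0)              # unit vector pointing up in image coordinates

    dot = n1[0] * n2[0] + n1[1] * n2[1]
    cross = n1[0] * n2[1] - n1[1] * n2[0]

    # Clamp to guard against rounding slightly outside [-1, 1].
    unsigned = math.degrees(math.acos(max(-1.0, min(1.0, dot))))
    signed = math.degrees(math.atan2(cross, dot))
    return unsigned, signed


# Upright human: neck directly above the pelvis.
upright = offset_rotation((100.0, 200.0), (100.0, 100.0))

# Human lying on the side: neck to the right of the pelvis.
sideways = offset_rotation((100.0, 200.0), (200.0, 200.0))
```

The unsigned value matches equation 3 directly, while the signed value distinguishes a human leaning to the left from one leaning to the right, which a regression target for image rotation needs.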

6.3 Ethical considerations

According to the license of CMU Panoptic [40], the modified datasets used in this thesis are not allowed to be distributed. No additional sources of data were collected during this project; therefore, there are no ethical considerations.


7 Implementation

This section explains how each of the methods was realized and provides more in-depth descriptions.

7.1 Modified CMU panoptic

The modified dataset was created using 25 HD views from the CMU Panoptic pose1 sample [41], which depicts one human moving his arms for 101 frames. These images were then modified so that for each frame there exists one normal image, one cropped around the right arm, and three rotated 90, 180, and 270 degrees. The rotated images were not cropped; instead, the 90 and 270 degree images have the resolution 1080x1920, as opposed to 1920x1080.
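Rotating by multiples of 90 degrees needs no interpolation and no cropping, which is why only the width and height swap. A minimal sketch on a toy grid (standing in for an image, with hypothetical values) shows the effect:

```python
def rotate_ccw_90(image):
    """Rotate a row-major 2D grid 90 degrees counterclockwise."""
    # Transpose, then reverse the row order so the last column becomes the first row.
    return [list(row) for row in zip(*image)][::-1]


def rotate_ccw(image, quarter_turns):
    """Apply a number of counterclockwise quarter turns."""
    for _ in range(quarter_turns % 4):
        image = rotate_ccw_90(image)
    return image


# Toy "image" with 2 rows and 3 columns.
image = [[1, 2, 3],
         [4, 5, 6]]

rot90 = rotate_ccw(image, 1)    # 3 rows x 2 columns
rot180 = rotate_ccw(image, 2)   # 2 rows x 3 columns, same shape as the input
rot270 = rotate_ccw(image, 3)   # 3 rows x 2 columns
```

The 180 degree result keeps the original shape, while the 90 and 270 degree results swap rows and columns, in line with the 1080x1920 resolution mentioned above.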

7.2 Training openpose

In the preprocessing step used for the end-to-end training of OpenPose, several image augmentations are made. These include random scaling, rotation, flip, X, and crop. In addition to creating more robust models that can handle imperfect images, this also reduces the risk of overfitting by artificially increasing the size of the dataset. With random variables in the preprocessing, the dataset can be used for several epochs without the model seeing the same image twice. This was taken advantage of when training OpenPose to be rotation-invariant, because all that was necessary to feed rotated training images was to increase the maximum and minimum allowed rotation in the preprocessing function. The hardware used for this project was a Jetson AGX Xavier; unfortunately, the machine learning framework Caffe, on which OpenPose is built, does not support cuDNN 8.0 [42]. As a substitute, a TensorFlow port which recreated all the original preprocessing [43] was used, which was allowed to train for 10 days on the Jetson Xavier platform. After the OpenPose training was interrupted, an ImageNet model that was implemented by the same git repository was trained for 7 days.

7.3 Trainable CMU

The CMU trainable dataset is also created from the CMU Panoptic range of motion pose 1; however, CMU trainable consists of 1851 frames as opposed to the 101 in the modified dataset. These frames were hand-selected from six different subjects performing a series of range of motion movements including moving the arms and upper body. The images are captured at a resolution of 640x480 from 120 different camera views. In total there are around 220000 images split 60-20-20 for training, validation, and testing. Each of the images has a ground truth 3D skeleton, as well as an offset rotation which was calculated using the cross product of the vector between the pelvis and neck and a vertical vector. The testing dataset was then rotated randomly between -179.99 and 180.00 degrees, and this was added to the offset rotation to create a ground truth rotation. The rotation cropped the images so that the resulting resolution stayed the same after the rotation, and the empty spaces were filled with black pixels. Bilinear interpolation was used to avoid artifacts created by the rotation. The testing data was also copied and rescaled to 224x224 before it was rotated; this was done so that the RotationNet could be tested on data that was preprocessed the same way as the training and validation data. Both of these models were trained with a batch size of 16 because it was the highest possible value without running out of memory.

7.4 RotationNet

The RotationNet was implemented in TensorFlow 1.15 [44] using the Keras [45] implementation of MobileNetV2, which ends with 1000 outputs that represent the different classes in ImageNet. On top of this, a dense layer reduces the number of outputs to 32, followed by an activation layer with the ReLU function, followed by another dense layer that reduces the total number of outputs to one. The architecture can be seen in figure 10.

7.4.1 Architecture

RotationNet is based on MobileNetV2 and only adds three layers to its architecture: a fully connected layer which reduces the number of variables from the 1000-class output of ImageNet models down to 32, an activation layer with a ReLU, and a second fully connected layer that brings the total number of variables down to one. MobileNet is 157 layers deep and is considered a DCNN; when using the MobileNetV2 architecture with an input of 224x224, it has a total of 3.4 million parameters, which puts it at the lower end of ImageNet models. The goal of using MobileNetV2 is to use transfer learning to adapt the features already learned when training on ImageNet and fine-tune them to find the correct orientation.
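To give a feeling for how small the added regression head is compared to the backbone, the number of trainable parameters in the two fully connected layers can be worked out directly. The sketch assumes standard dense layers with one bias per output; the ReLU layer has no parameters, and the 3.4 million figure is the one quoted above.

```python
def dense_params(inputs, outputs):
    """Weights plus biases of a fully connected layer."""
    return inputs * outputs + outputs


first_dense = dense_params(1000, 32)   # 1000 class scores -> 32 features
second_dense = dense_params(32, 1)     # 32 features -> one rotation angle
added = first_dense + second_dense

backbone = 3_400_000                   # MobileNetV2 size quoted in the text
share = added / backbone               # fraction of the backbone size
```

Under these assumptions the head adds roughly 32 thousand parameters, which is under one percent of the backbone, so almost all of the capacity comes from the pre-trained MobileNetV2 features.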

7.4.2 Training

The training images were first resized to 224x224, normalized, and then rotated randomly between -179.99 and 180 degrees; the random rotation was then added to the offset rotation native to the image. This was done to artificially increase the size of the dataset in an effort to prevent overfitting. The dataset was then randomly indexed into a buffer the size of the dataset in order to shuffle the images. Figure 11 shows a grid of six images after the preprocessing step.

The input resolution of 224x224 was used since that is the native resolution on which MobileNetV2 was trained, so this input resolution was necessary to take advantage of the pre-trained weights. A batch size of 128 was used during training; this means inference was run on 128 images and the weights were then updated to minimize the error over all of them. This value was used to include as many rotations as possible in each batch so that the model would converge. The loss function used was mean squared error; this is a common loss function for regression problems and was chosen because it punishes bad estimates with a higher loss value compared to mean absolute error.
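The difference between the two loss functions is easiest to see on a toy example. The two hypothetical error lists below have the same mean absolute error, but the one containing a single large miss receives a much higher mean squared error.

```python
def mean_squared_error(errors):
    """Average of the squared errors."""
    return sum(e * e for e in errors) / len(errors)


def mean_absolute_error(errors):
    """Average of the absolute errors."""
    return sum(abs(e) for e in errors) / len(errors)


# Hypothetical angle errors in degrees.
evenly_spread = [2.0, -2.0, 2.0, -2.0]   # four small misses
single_outlier = [0.0, 0.0, 0.0, 8.0]    # one large miss

mae_spread = mean_absolute_error(evenly_spread)     # 2.0
mae_outlier = mean_absolute_error(single_outlier)   # 2.0
mse_spread = mean_squared_error(evenly_spread)      # 4.0
mse_outlier = mean_squared_error(single_outlier)    # 16.0
```

For an orientation estimate this matters because one image rotated far from the correct angle is likely to hurt the downstream pose estimation more than several images that are each slightly off.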

Figure 10: Architecture of the RotationNet with the MobileNetV2 model summarized into one block. The global average pooling layer and Dense (fully connected) layer following the MobileNet interpret the features extracted from the MobileNet to classify the ImageNet dataset; the remaining layers are implemented to adapt the structure to RotationNet.

Figure 11: The first six elements in the shuffled and preprocessed dataset used for training. On top of each image is its ground truth rotation, with positive values being counterclockwise oriented and negative values being clockwise oriented.

To prevent the model from unlearning all previous knowledge while the new weights are tuned from their random initialization, the model is trained in stages. During the first pass, only the weights of the last three layers are updated; after that, all weights are updated. The batch size of 128 was the largest power of two possible with the hardware of this project. Throughout the training, the validation loss was monitored, and when three epochs passed without an improvement, the training was interrupted and the weights from the best-performing epoch were saved.
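The termination condition described above corresponds to early stopping with a patience of three epochs. The following sketch shows the bookkeeping on a hypothetical list of validation losses; it only tracks which epoch would be kept and does not train or save a model.

```python
def early_stopping(validation_losses, patience=3):
    """Return (best_epoch, best_loss, stopped_epoch) for a loss history."""
    best_loss = float("inf")
    best_epoch = None
    stopped_epoch = None
    epochs_without_improvement = 0

    for epoch, loss in enumerate(validation_losses):
        stopped_epoch = epoch
        if loss < best_loss:
            # New best epoch: this is where the weights would be saved.
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # three epochs in a row without improvement

    return best_epoch, best_loss, stopped_epoch


# Hypothetical validation losses per epoch.
losses = [9.0, 7.0, 6.5, 6.6, 6.7, 6.9, 5.0]
result = early_stopping(losses, patience=3)
```

In this example training stops at epoch 5 and the weights from epoch 2 are kept, so the later improvement at epoch 6 is never reached. That trade-off is inherent to any fixed patience value.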

7.4.3 Evaluation

The RotationNet was individually evaluated based on the mean absolute error, standard deviation, and variance in order to get an understanding of how successful the training had been and what results one can expect from the RotationNet. The entire subsystem shown in figure 8 was then compared to OpenPose; both OpenPose and the RotationNet subsystem were tested on the images of the CMU panoptic trainable dataset which had been designated for testing and had not previously been seen by any of the algorithms. The results that were compared between the two models were the MPJPE in image coordinates, resulting in an MPJPE expressed in pixels, and the frequency of misses, similarly to the evaluation of the state-of-the-art using modified CMU Panoptic. These results were then sorted by panel index, where each panel had 5-7 cameras, and by rotation of the input image.


8 Results

This section presents all the results obtained from the implemented methods, the

results are primarily presented in the form of images and tables but are also

explained in plain text.

8.1 Evaluation of applicability of the state of the art

The MPJPE of the right wrist and elbow for the unmodified images was 1.4 cm and overall shows a similar result regardless of the camera combination used. There are two outliers, however: between cameras nine and sixteen, and between cameras twenty-three and twenty-six. For the cropped images the MPJPE was 1.4961 cm and the same outliers exist. For the 90 degree rotated images, it is clear that cameras three, five, and sixteen have mislabeled the elbow and wrist joints; otherwise, the cameras have successfully identified the wrist and elbow joints. The MPJPE for the 90 degree rotated images is 24.0 cm. In the 180 degree rotated images, 10 different cameras had difficulties determining the position of the joints, which led to an MPJPE of 96.7 cm. For the 270 degree rotated images, there were three cameras that did not find the correct joints, similar to the 90 degree rotated images; however, the problem was located at cameras fifteen, seventeen, and twenty as opposed to cameras three, five, and sixteen, as seen in figure 12.

Figure 12: Results from OpenPose on the modified CMU panoptic dataset which has been cropped or rotated counterclockwise expressed as MPJPE between all different camera pairs in cm.

Figure 13 shows the percentage of frames in which OpenPose failed to detect the wrist or elbow joint. In the unmodified dataset, four cameras had more than 25% misses. The cropped dataset has similar results with a few more misses overall, but nothing significant. The 90 degree dataset has significantly more problems, with the same four cameras showing more than 40% misses in addition to camera seventeen, which has above 30%. For the 180 degree dataset, the results are much worse. A majority of the cameras have above 30% misses on the tested data, with camera 16, which had no misses on the normal and cropped datasets, spiking to more than 80% misses. The 270 degree dataset has problems with the same cameras as the 90 degree dataset; however, overall it has performed better: where the 90 degree dataset had five cameras around 15-20%, the 270 degree dataset only had one.

Figure 13: Results from OpenPose on the modified CMU panoptic dataset which has been cropped or rotated counterclockwise expressed as percentage of times the wrist or elbow was not found sorted by camera.

8.2 State of the art training

The OpenPose model was trained for just over 250 hours, which equates to seventeen epochs of the COCO 2017 dataset, which totals close to 120000 images. As can be seen in the graph in figure 14a, the loss value varies between 200 and 500 and does not converge with time.

The MobileNet thin model was allowed to train for approximately 120 hours, which equates to twenty-six and a half epochs since the model is less complex than OpenPose. The loss during training starts very high but then fluctuates between 200 and 450. The loss is somewhat lower than for OpenPose; however, MobileNet thin does not converge either. The training graph can be seen in figure 14b.

8.3 RotationNet training

Since the RotationNet addresses a regression problem, it did not require as much training; the model trained for just over two days before the termination condition was met. The training graphs, which display the loss plotted against epochs, can be seen in figures 16a and 16b. The RotationNet converged after 30 epochs, and the data put aside for testing had a mean absolute error of 4.5, standard deviation of 6.3, and variance of 40.1. The error distribution can be seen in figure 15.
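The three reported statistics can be reproduced from a list of per-image errors with the standard library. The values below are hypothetical, and the sketch assumes that the standard deviation and variance are taken over the signed errors as population statistics, which the text does not specify.

```python
import statistics

# Hypothetical signed angle errors (prediction minus ground truth).
errors = [-3.0, 1.0, 4.0, -8.0, 6.0]

mean_absolute_error = statistics.fmean(abs(e) for e in errors)
standard_deviation = statistics.pstdev(errors)
variance = statistics.pvariance(errors)
```

Since the variance is the square of the standard deviation, the reported values of 6.3 and 40.1 are consistent with each other up to rounding.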


(a) Openpose (b) MobileNet thin

Figure 14: Training graph of the Openpose & MobileNet thin respectively using rotated coco images where the x-axis represents every 500th batch size and the y-axis represents the loss value.

Figure 15: Histogram showing the error distribution of the RotationNet where the Y-axis represent the frequency expressed in percent and the X-axis the error.

8.4 RotationNet subsystem

(a) Mean squared error (b) Mean absolute error

Figure 16: Training data of RotationNet with the mean squared error & mean absolute error of the validation data expressed in the y-axis and the epoch expressed in the x-axis.

(a) Openpose (b) RotationNet

Figure 17: MPJPE of Openpose & RotationNet respectively tested on CMU Panoptic trainable testing data where the Y-axis represents MPJPE and the X-axis represents the rotation of the image.

The MPJPE of the subsystem in figure 8 was 8.9 pixels, whereas for OpenPose the MPJPE was 31.1 pixels, a reduction of 71.4% for the cases in which both the right elbow and wrist were found. With the RotationNet subsystem, either the elbow or the wrist was not found in 32.1% of cases, compared to 60.9% for OpenPose. When sorting the MPJPE by rotation, one can see that OpenPose performs consistently within the -50 to 50 degree range; outside of that interval, however, the MPJPE suffers greatly. The graph resembles an inverted bell curve with peaks close to 120 and valleys at approximately 10. The plot can be seen in figure 17a. In comparison to OpenPose, RotationNet is consistent across all angles with an average MPJPE similar to the -50 to 50 degree range of OpenPose. This can be seen in figure 17b. When comparing the MPJPE based on the panel index, it is clear that RotationNet is consistently better than OpenPose, although both methods have views which are close to 150% of the average. For RotationNet these are from panels one, two, three, nine, and nineteen, while for OpenPose these views are from panels one, two, three, sixteen, and seventeen. The graphs can be seen in figures 18a and 18b.
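The relative reduction in MPJPE quoted in this section follows directly from the two reported values. The second line applies the same calculation to the miss frequencies; that figure is derived here for comparison and is not stated in the text.

```latex
\[
\frac{31.1 - 8.9}{31.1} = \frac{22.2}{31.1} \approx 0.714 \quad (71.4\%\ \text{lower MPJPE})
\]
\[
\frac{60.9 - 32.1}{60.9} = \frac{28.8}{60.9} \approx 0.473 \quad (47.3\%\ \text{fewer misses})
\]
```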

The frequency of misses for OpenPose looks like an inverted bell curve similar to the MPJPE curve; however, the misses degrade faster, with the valley between -30 and 30 degrees, where approximately 30% of all images miss. The peaks reach almost 90%, as can be seen in figure 19a. Just like the MPJPE sorted by rotation, the misses of RotationNet are consistent across all rotations with an average close to the best section of OpenPose (figure 19b). When sorting the misses by panel, RotationNet has panels one and two above 70% misses and a total of six out of twenty panels above 40% (figure 20b). In contrast, for OpenPose every panel has at least 40% misses, and panels one and two are close to and above 80% respectively.


Figure 18: MPJPE of Openpose and RotationNet respectively tested on CMU Panoptic trainable testing data where the Y-axis represents MPJPE and the X-axis represents the viewpoint expressed in panel index.

Figure 19: Frequency of misses of Openpose & RotationNet respectively tested on CMU Panoptic trainable testing data where the Y-axis represents the percentage of misses and the X-axis represents the rotation of the image.


Figure 20: Frequency of misses of Openpose & RotationNet respectively tested on CMU Panoptic trainable testing data where the Y-axis represents the percentage of misses and the X-axis represents the viewpoint expressed in panel index.

9 Discussion

In this section the results will be interpreted and compared to the research

questions.

9.1 Evaluation of applicability of the state of the art

Based on the results from OpenPose on the modified CMU Panoptic dataset, several conclusions can be drawn. Firstly, on the unmodified dataset, the MPJPE was 1.4 cm, which is significantly lower than what is presented by the current state-of-the-art. This is most likely because the current state-of-the-art uses the MPJPE across all key points, and there are a lot of ambiguities in the torso. Secondly, the cropped dataset only had a 6.8% increase in the MPJPE, which suggests that the zoom of the approach is no issue for OpenPose. Thirdly, all the cameras that had a lot of misses on the unmodified and cropped datasets were situated to capture the left side of the human, which means the torso obscured the right arm. This can be seen in figure 21, and considering that a robot could grab the left arm in these scenarios, this is not considered an issue. Fourthly, no correlation between the cameras situated in the ceiling or floor and the MPJPE or number of misses could be found; therefore, it is concluded that the height of the approach is not an issue. An important distinction is that the cameras placed near the floor do not capture the same angle as the ones placed above the head, which means the height of the approach could be an issue when approached from the feet; however, no argument for that theory can be made with the modified CMU Panoptic dataset. Lastly, the angle of the approach shows a worse result the more it is rotated from normal. Six out of the seven cameras that had the most misses on the 90 and 270 degree datasets were looking at the human from the subject's left side. In most of these cases, the arm was visible, but an argument can be made that it was partially to fully obscured in some of the frames. This indicates that when the human is rotated, it is harder for OpenPose to recreate the 2D skeleton based on where it thinks the arm should be. This is further corroborated by the


Figure 21: Self-occlusion present in the modified CMU Panoptic dataset, as seen from HD cameras 7, 9, 11 and 15.
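The difference between an MPJPE computed over all key points and one restricted to the arm, together with the miss counts discussed above, can be illustrated with a minimal sketch. The joint names, the dictionary layout and the helper names `mpjpe` and `relative_increase` are assumptions made for illustration only; they are not taken from OpenPose or from the evaluation code used in this work.

```python
import math


def mpjpe(pred, truth, joints):
    """Mean per-joint position error over the selected joints.

    pred and truth map a joint name to a coordinate tuple. A joint that the
    estimator failed to detect is stored as None in pred and counted as a miss.
    Returns (mean_error, misses); mean_error is None if no joint was detected.
    """
    errors = []
    misses = 0
    for name in joints:
        p = pred.get(name)
        if p is None:
            misses += 1
            continue
        # Euclidean distance between estimated and ground-truth joint
        errors.append(math.dist(p, truth[name]))
    mean_error = sum(errors) / len(errors) if errors else None
    return mean_error, misses


def relative_increase(baseline, value):
    """Percentage change of value with respect to baseline."""
    return 100.0 * (value - baseline) / baseline


# Hypothetical joint names and coordinates, for illustration only.
ARM_JOINTS = ("right_elbow", "right_wrist")
truth = {"right_elbow": (0.0, 0.0), "right_wrist": (3.0, 4.0)}
pred = {"right_elbow": (0.0, 1.0), "right_wrist": None}
error, misses = mpjpe(pred, truth, ARM_JOINTS)
```

Restricting `joints` to the arm leaves out the torso key points, which is one way to see why an arm-only MPJPE is not directly comparable to a figure averaged over the whole skeleton.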



1 Introduction

2 Background

2.1 Computer Vision

2.2 Stereo vision

2.3 CNN

2.4 Human pose estimation

3 Related work

3.1 Pose estimation

4 Problem formulation

4.1 Limitations

4.2 Constraints

4.3 Hypothesis

A monocular camera mounted on a robot manipulator provides

sufficient information to detect a prone human and identify a suitable

gripping location.
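As stated in the abstract, the gripping location is situated on the arm and can be extracted from the positions of the wrist and elbow joints. A minimal sketch of one way to do this is given below. The linear interpolation, the function name `grip_point` and the default `fraction` are assumptions made for illustration; the source does not specify where along the forearm the grip should be placed.

```python
def grip_point(elbow, wrist, fraction=0.5):
    """Return a point on the segment between the elbow and the wrist.

    fraction = 0.0 gives the elbow, fraction = 1.0 gives the wrist.
    The default of 0.5 (forearm midpoint) is an assumed value.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must lie in [0, 1]")
    # Interpolate each coordinate independently (works for 2D and 3D points)
    return tuple(e + fraction * (w - e) for e, w in zip(elbow, wrist))


# Hypothetical 3D joint positions, for illustration only.
midpoint = grip_point((0.0, 0.0, 0.0), (2.0, 4.0, 6.0))
```

Because the point depends only on the two joints, the accuracy of the elbow and wrist estimates directly bounds the accuracy of the gripping location.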

4.4 Research questions

RQ1 Can a gripping position be identified regardless of the direction from which

Valkyrie approaches a prone human?

RQ2 Is a data set designed to represent a prone human from a multitude of angles

necessary to achieve an acceptable estimation of the arm?

5 Methodology

This project will follow Agile guidelines [34] as they are prevalent within the

industry, with employers inquiring whether recent graduates are familiar with them. There

are several agile methodologies to choose from, however, since there is only one

participant in this project a modified model of SCRUM and feature-driven

development has been devised and is explained in the section below.

The project starts with a research phase, the goal of which is to develop a better

understanding of the problem and to find state-of-the-art solutions; this phase will