V¨
aster˚
as, Sweden
Thesis for the Degree of Master of Science in Engineering - Robotics
30.0 credits
3D POSE ESTIMATION IN THE
CONTEXT OF GRIP POSITION FOR
PHRI
Jacob Norman
jnn13008@student.mdh.se
Examiner: Martin Ekstr¨
om
M¨
alardalen University, V¨
aster˚
as, Sweden
Supervisor: Fredrik Ekstrand
M¨
alardalen University, V¨
aster˚
as, Sweden
Supervisor: Joaqu´ın Ballesteros
University of M´
alaga, M´
alaga, Spain
Supervisor: Jesus Manuel G´
omez de Gabriel
University of M´
alaga, M´
alaga, Spain
June 27, 2021
Abstract
For human-robot interaction with the intent to grip a human arm, it is necessary that the ideal gripping location can be identified. In this work, the gripping location is situated on the arm and thus it can be extracted using the position of the wrist and elbow joints. To achieve this human pose estimation is proposed as there exist robust methods that work both in and outside of lab environments. One such example is OpenPose which thanks to the COCO and MPII datasets has recorded impressive results in a variety of different scenarios in real-time. However, most of the images in these datasets are taken from a camera mounted at chest height on people that for the majority of the images are oriented upright. This presents the potential problem that prone humans which are the primary focus of this project can not be detected. Especially if seen from an angle that makes the human appear upside down in the camera frame. To remedy this two different approaches were tested, both aimed at creating a rotation-invariant 2D pose estimation method. The first method rotates the COCO training data in an attempt to create a model that can find humans regardless of orientation in the image. The second approach adds a RotationNet as a preprocessing step to correctly orient the images so that OpenPose can be used to estimate the 2D pose before rotating back the resulting skeletons.
Table of Contents
1 Introduction 1 2 Background 4 2.1 Computer Vision . . . 4 2.2 Stereo vision . . . 5 2.3 CNN . . . 52.4 Human pose estimation . . . 6
3 Related work 7 3.1 Pose estimation . . . 7
3.1.1 Single view RGB image . . . 7
3.1.2 Multi view RGB image . . . 8
3.1.3 Data sets and evaluation . . . 9
4 Problem formulation 10 4.1 Limitations . . . 10 4.2 Constrains . . . 10 4.3 Hypothesis . . . 10 4.4 Research questions . . . 10 5 Methodology 11 6 Method 12 6.1 Evaluation of State of the Art . . . 12
6.1.1 Modified CMU Panoptic . . . 12
6.1.2 Choice of state-of-the-art method . . . 13
6.2 Adaption of State-of-the-art . . . 14
6.2.1 Training OpenPose . . . 14
6.2.2 RotationNet . . . 14
6.2.3 CMU panoptic trainable . . . 14
6.3 Ethical considerations . . . 15
7 Implementation 16 7.1 Modified CMU panoptic . . . 16
7.2 Training openpose . . . 16 7.3 Trainable CMU . . . 16 7.4 RotationNet . . . 17 7.4.1 Architecture . . . 17 7.4.2 Training . . . 17 7.4.3 Evaluation . . . 19 8 Results 20 8.1 Evaluation of applicability of the state of the art . . . 20
8.2 State of the art training . . . 21
8.3 RotationNet training . . . 21
8.4 RotationNet subsystem . . . 22
9 Discussion 25 9.1 Evaluation of applicability of the state of the art . . . 25
9.2 Training of the state of the art . . . 26
9.3 RotationNet training . . . 27
9.4 RotationNet subsystem . . . 27
10 Goals and research questions 29
12 Acknowledgements 32
List of Figures
1 Different perspectives created by approaching a prone human from different angles. 1 2 Different perspectives created by approaching a prone human from different heights. 2 3 Different perspectives created by viewing a prone human from three different distances. 2 4 The search and rescue/assistive robot Valkyrie at UMA consisting of a robot
ma-nipulator with six degrees of freedom mounted on a mobile platform. A gripper is mounted on the end effector with a camera just above. . . 3 5 Gantt chart depicting the initial timeline for the project week by week. . . 11 6 Flowchart of the complete system read left to right where the two-dimensional (2D)
pose estimation block represents the models presented in section 6.2 . . . 12 7 One of the perspective of the modified CMU panoptic dataset where the first row
from left to right is the original image and the obscured image. On the second row from left to right are the images that are rotated 90, 180 and 270 degrees respectively. All images have the same amount of pixels, the 90 and 270 degrees have been cropped for this figure to reduce its size. . . 13 8 Flowchart of the 2D pose estimation using RotationNet and OpenPose, in the
flowchart of the whole system in figure 6, . . . 14 9 Four different frames from the CMU trainable dataset with the vectors showing the
offset rotation plotted to the left and the correctly orientated image(desired before openpose) to the right. On the left images the blue vector is the line between the pelvis and neck while the orange line is the vertical vector starting at the pelvis. . 15 10 Architecture of the RotationNet with the MobileNetV2 model summarized into one
block. The global average pooling layer and Dense(fully connected) layer following the MobileNet interprets the features extracted from the MobileNet to classify the ImageNet dataset, the remaining layers are implemented to adapt the structure to RotationNet. . . 18 11 The first six elements in the shuffled and preprocessed dataset used for training, on
top of each image is the ground truth rotation of each image with positive values being counterclockwise oriented and negative values being clockwise oriented. . . . 19 12 Results from OpenPose on the modified CMU panoptic dataset which has been
cropped or rotated counterclockwise expressed as MPJPE between all different cam-era pairs in cm. . . 20 13 Results from OpenPose on the modified CMU panoptic dataset which has been
cropped or rotated counterclockwise expressed as percentage of times the wrist or elbow was not found sorted by camera. . . 21 14 Training graph of the Openpose & MobileNet thin respectively using rotated coco
images where the x-axis represents every 500th batch size and the y-axis represents the loss value. . . 22 15 Histogram showing the error distribution of the RotationNet where the Y-axis
rep-resent the frequency expressed in percent and the X-axis the error. . . 22 16 Training data of RotationNet with the mean squared error & mean absolute error
of the validation data expressed in the y-axis and the epoch expressed in the x-axis. 23 17 MPJPE of Openpose & RotationNet respectively tested on CMU Panoptic trainable
testing data where the Y-axis represents MPJPE and the X-axis represents the rotation of the image. . . 23 18 MPJPE of Openpose and RotationNet respectively tested on CMU Panoptic
train-able testing data where the Y-axis represents MPJPE and the X-axis represents the viewpoint expressed in panel index. . . 24 19 Frequency of misses of Openpose & RotationNet respectively tested on CMU
Panop-tic trainable testing data where the Y-axis represents the percentage of misses and the X-axis represents the rotation of the image. . . 24
20 Frequency of misses of Openpose & RotationNet respectivelytested on CMU Panop-tic trainable testing data where the Y-axis represents the percentage of misses and the X-axis represents the viewpoint expressed in panel index. . . 25 21 This image shows the self occlusion present in the CMU Panoptic modified dataset
from HD cameras 7, 9, 11 and 15 . . . 26
List of Tables
Acronyms
ROS Robot Operating System 2D two-dimensional
3D three-dimensional UMA University of Malaga MDH M¨alardalen University
CNN Convolutional Neural Network DCNN Deep Convolutional Neural Network MPJPE Mean Per Joint Position Error HRI Human-Robot Interaction
pHRI Physical Human-Robot Interaction ReLU Rectified Linear Unit
1
Introduction
Human-Robot Interaction (HRI) is a field of study which focuses on developing robots that are able to interact with humans in various everyday occurrences. The core of the field is physically embodied social robots which result in design challenges around the complex social structures that are inherent to human interaction. For example, robots with a human-inspired design encourage people to interact similarly to how they would in human-human interaction. This can be used to plan the interaction, However, if the human’s expectations of the interaction are not achieved it can result in frustration. For social robots to interact with humans, sensors that interpret the surroundings are necessary, the most common of which are the primary senses used in human-human interaction; vision, audition, and touch [1]. Some application areas for HRI are education, mental and physical health, and, applications in industry, domestic chores, and search and rescue [2]. In areas such as search and rescue, physical human robot contact is necessary and these interactions require the utmost care to not harm the patient even more. Therefore safety should remain the top priority and the system should behave in a predictable manner since there is generally no way to anticipate how the human would react [3,4].
At the department of “Ingenier´ıa de Sistemas y Automatica”1at University of Malaga (UMA),
Spain (where this thesis was partly conducted) a search and rescue/assistive robot named Valkyrie is being developed. It is a robot manipulator mounted on a mobile platform that allows the robot to move to the location of a human in need and initiate contact by grasping their wrist. From this position, it will be possible to monitor the vital signs of the human, as well as, helping the human up and eventually leading the human to safety in the event of a disaster. This method of HRI is preferable because conveying information to panic-stricken humans is challenging, this method also allows Valkyrie to be used in elder care where it can assist people to stand up after falling in their homes or other environments where there are no other humans around that can help.
The aim of this thesis is to investigate the feasibility to detect and reconstruct three-dimensional (3D) poses of humans laying on the ground with a monocular camera so that it could be used in the future to grasp a human arm with a robot manipulator. To achieve this three factors will be considered, firstly the angle from which Valkyrie is approaching the human. This is necessary to investigate because depending on the angle of approach it can appear as if the human is rotated relative to the camera. This is a scenario not found in all applications of 3D pose estimation and could therefore present an area that has to see improvement to realize Valkyrie. Figure 1 shows the different perspectives that this issue creates. Secondly, the elevation of the approach is
Figure 1: Different perspectives created by approaching a prone human from different angles. interesting because if a prone human is approached on a flat surface the elevation of the cameras will be somewhere around chest height. This would result in an angle relative to the prone human
which deviates from the normal circumstance where both camera and subject are at the same elevation. In the scenario that the elevation is not high enough the soles of the feet could be the most prominent feature of the image as opposed to a picture taken from straight above which would never occur when approaching a prone human. Figure 2 shows the different perspectives that this issue creates. Lastly, the occlusion of the camera frame has to be investigated. When
Figure 2: Different perspectives created by approaching a prone human from different heights. the mobile manipulator approaches a human arm the other features will be left out of the frame of the camera. Therefore it is important that the proposed method can still identify the arm when no other features are visible.
Figure 3: Different perspectives created by viewing a prone human from three different distances. The motivation behind this project is to create a research platform on which HRI can be tested, the researchers at Departamento de Ingenieria de Sistemas y Automatica have a particular interest in HRI where the robot initiates the contact as this is a largely underrepresented field compared to human initiated contact. Furthermore, there are several projects focused on search and rescue robots at UMA, however, none of them are small enough that the they can be used to research human assistive robots. The need for search and rescue robots in general stem from two main sources, firstly, human-robot collaboration has showed to be effective in assembly lines when adopting robots[5]. Secondly adopting robots for search and rescue scenarios will reduce the risk rescue workers put themselves in when they enter hazardous areas to search for survivors. After finding a human Valkyrie can access the life conditions of the human and take appropriate response, such as leading them to safety or sending a signal that immediate medical assistance is necessary while monitoring the vital signs. This way rescue workers would only expose themselves to danger when necessary as opposed to constantly while searching for survivors. Valkyrie can also be used as a research platform for assistive robotics, many elderly live alone and it is not feasible to have help around at all times. With an assistive robot in the home, if one were to fall help would be able to arrive instantly whether the patient was able to signal for help or not. From that point the assistive robot would be able to help the human up or signal for help as well as assessing the condition of the human.
back-ground information that would be useful in order to understand the concepts and methods pre-sented in this paper. After the background the related work is prepre-sented which consists of more up do date methods on select fields that influenced the decision making in this project. Further on in the report the problem is broken down and the hypothesis, as well as, research questions are stated. In the methodology section the way in which the problem will be approached is presented followed by the methods used to solve the problem in the method section of the report. Details of the implementation and descriptive information how each step was performed is presented next which will allow one to replicate this work, should the need arise. Further on in the report the results will be presented after which they will be discussed in section result and lastly discussion. The last section will also address the hypothesis, research questions and future work.
Figure 4: The search and rescue/assistive robot Valkyrie at UMA consisting of a robot manipulator with six degrees of freedom mounted on a mobile platform. A gripper is mounted on the end effector with a camera just above.
2
Background
This section introduces some of the methods and concepts that were deemed necessary to un-derstand in order to fully comprehend this thesis, the first of which is computer vision and how an image on the computer screen translates to the objects captured by the camera. The second method of importance is Stereo vision, which can capture the depth otherwise lacking in an im-age. Thirdly, Convolutional Neural Networks(CNN) are explained which have revolutionized image processing and are being incorporated with the previously mentioned techniques.
2.1
Computer Vision
In order to find a human and grasp its arm, the robot first needs to detect the human it intends to rescue; this will be done with a camera. The simplest model of a camera is called a pinhole camera, it consists of a small hole and a camera house on the back wall of which the image will be projected upside down [6]. In a modern camera, the image is projected onto sensors that interpret the image into a digital format. This results in four different coordinate systems, the first two are for the object relative to the camera and relative to the base of the robot manipulator. The third is a 2D coordinate system for the image sensors and lastly, there is a coordinate system expressed in pixels for the digital image [7]. To represent an object in pixel coordinates requires a transformation from the camera coordinates. This is done with the intrinsic camera matrix, which describes the internal geometry of the camera. By taking pictures of a checkerboard with a known pattern it is possible to estimate it as well as rectify the lens distortion, this process is called camera calibration. The translation and rotational difference between the world and camera coordinates are described by the extrinsic camera matrix, and the whole transformation from world coordinates to pixel coordinates is performed with the camera matrix [7], this relation is expressed by equation 1 where the left most vector represents homogeneous pixel coordinates. The matrices from left to right are the affine matrix, projection matrix, and finally the extrinsic matrix which includes rotation and translation. The vector on the right side is the world coordinates in homogeneous coordinates.
u v 1 ∼ a11 a12 a13 a21 a22 a23 a31 a32 a33 f 0 0 0 0 f 0 0 0 0 1 0 R3x3 T3x1 0 1 U V W 1 (1)
In this transformation the depth data is lost, however, there are several ways to recreate it, one of which is stereo vision. By using two cameras and placing them next to each other the same object can be seen in both image planes, if one knows the translation and rotation between the cameras and the difference in pixel coordinates between the two images it is possible to determine how far away the object is [6]. Another way of recreating the depth of the object is through perspective projection, if the distance from the principal point in the camera is known as well as the focal length together with the size of the object, the distance to the object can be determined. This relationship is expressed by equation 2 where x represents the distance from the principal point horizontally in pixels, X represents the horizontal distance from the principal point axis to the object in the camera coordinates. Finally, F represents the focal length, and Z the distance to the object[7]. The same is true for the Y coordinates.
Z = fX
x (2)
Initially it seems like both of these methods are inapplicable since weak perspective projection requires the size of the object and stereo vision requires two different perspectives.
However, by mounting the camera on the robot manipulator it is possible to move the ma-nipulator to capture multiple perspectives with the camera, the robot configuration can provide the position of the camera for each frame. Additionally, the right hand side of equation 2 can represents the focal length times the scaling of the object which can be estimated by comparing the 2D and 3D pose [8].
2.2
Stereo vision
The most researched area within stereo vision is stereo matching, it is the practice of finding the same pixel or feature between two or more perspectives. This is necessary to estimate the depth to that feature and may seem like an easy task since humans do it constantly, however, it is a challenge within computer vision [9]. In early works finding and matching features was done using Harris corner detector [6] which is still used and serves as the foundation for several approaches today. A downside to this algorithm, however, is that it responds poorly to changes in scale. Changes in scale are common in real-world applications when changing camera focus or zooming [6]. This problem was remedied by Lowe when SIFT was published [10], which is a feature descriptor that is invariant to translations, rotations, scaling and robust to moderate perspective transformations and illumination. This solution has proven useful for a variety of applications such as object recog-nition and image matching. Another solution to the correspondence problem is non-parametric local transforms [11], by transforming the images and then computing the correspondence problem it is possible to get an approach that is robust against factionalism which is when subsections of the image have their own distinct parameters. This is possible since the correspondence is no longer calculated on the data values but rather the relative order of the data values. For this approach to work, there has to be significant variation between the local transforms and the results in the corresponding area must be similar.
Once the correspondence problem has been solved and the same point has been found in both images it is possible to calculate the depth by projecting two rays from the focus of the camera to the object in the image plane. This process is called triangulation and makes it possible to calculate the coordinates of a point in space if the translation, rotation, and camera calibration are known [6]. Initially, this problem seems trivial since finding the intersection between two vectors is straightforward, however, due to noise, the two vectors rarely intersect. Hartley [12] presented an optimal solution to this problem that is invariant to projective transformations and finds the best point of intersection non iteratively.
Scharstein and Szeliski [13] present a taxonomy for stereo vision in an attempt to categorize the different components for dense stereo matching approaches, that is approaches that estimate the depth for all pixels in the image. Furthermore, Scharstein and Szeliski devised evaluation metrics for each of the components together with a framework to allow researchers to analyze their algorithms. This serves as a standardized test in the field of dense stereo matching with researchers submitting their algorithms to add them to Middlebury’s database of state of the art algorithms. This is presented on their website [14] and is still maintained. The state of the art in stereo vision today has progressed to the point where all the top methods on Middlebury utilize a Convolutional Neural Network (CNN).
2.3
CNN
A CNN is a machine learning algorithm that is most commonly used to extracts information from images by applying a series of filters, the size of the filters varies and the values are tuned to extract information that is relevant to achieve the desired result. On the more complex CNN architectures, there can be several hundred layers of filters, pooling layers, and activation functions that extract and condense information[15]. These architectures are referred to as Deep Convolutional Neural Network (DCNN)s because they have many layers which increase the complexity. DCNNs can have several million variables which makes training time-consuming. A downside to CNNs is that they are a black box in the sense that you can not take the result from one layer and distinguish if it is working or not. Within the field of computer vision, deep learning has revolutionized several topics with DCNNs, in particular, has had an impact on the field [16].
One major downside to CNNs is that they are dependent on the right data for training. In order to implement a CNN for a specific purpose, a representative dataset is necessary, However, this also allows them to be versatile, and consequently, CNNs are used for a variety of applications such as object detection, object classification, segmentation, and pose estimation to name a few.
A CNN is only as good as the dataset used to train it and in 2009 Deng et al. [17] published a dataset titled ImageNet intended for object recognition, image classification, and, automatic object clustering. This led to the creation of several DCNNs that because of the diversity of Imagenet are
versatile and thus have been used for a number of different applications. This is in part because of a method called transfer learning which circumvents the long training time by using the already existing weights learned from ImageNet and finetunes them with a different dataset. This takes advantage of the features the model already has learned and instead of re-learning everything the already known features can be adapted for the new purpose[18].
2.4
Human pose estimation
Human pose estimation is a field of study which attempts to extract the skeleton from a human in either 2D or 3D. This is achieved with a CNN that has fifteen to twenty-five outputs that represent coordinates of different body parts. A wireframe is then constructed between adjacent parts and a skeletal frame is created. Human pose estimation has also been applied to hands, feet, and, face to include fingers, toes, and eyes ears, etc. in the wireframe. Mean Per Joint Position Error (MPJPE) is used to evaluate models and calculates the mean error for every joint. This requires a dataset that has the ground truth when training a model of which there are several, however, not all datasets use the same skeleton structure so the MPJPE of different models is not always directly comparable. Furthermore, there exists a lot of ambiguities in the torso and, as a result, getting an accurate estimation of the hips is harder than the arms.
3
Related work
This section addresses the related work in HRI. First, 3D pose estimation will be investigated to find a human and locate an appropriate position to grip, then ways of controlling the robot manip-ulator and trajectory generation will be discussed. Lastly, evaluation methods will be investigated together with metrics to compare the systems to related work.
3.1
Pose estimation
Unlike the majority of the research in this area[19], this project will mount a monocular camera on a robot manipulator. This allows the possibility of moving the camera to get images from different perspectives or a sequence of images. With this additional information, methods such as stereo vision can be considered assuming the object is static. In this section, the methods used in related work will be sorted by input to the model, after which different data sets and evaluation methods will be discussed.
3.1.1 Single view RGB image
Human pose estimation from a single RGB image is a topic that has seen great progress thanks to deep learning [19,20]. Challenges in this topic include estimating the poses of multiple people in the same image, inferring the locations of occluded limbs and training a robust model that works outside of lab environments [20,21]. Since machine learning is the primary method for this problem it is important that a good dataset is used for training. Currently, there exists no 3D pose dataset that has ground truth data outside lab environments [22,8,23,24]. As a result most research has divided the problem in the two separate tasks, 2D pose estimation and 2D to 3D pose conversion. This allows the 2D pose estimator to train on a diverse dataset with ground truth while the 2D to 3D pose converter can train on infer depth on the joints from a motion capture dataset. While this does not solve the problem, it allows the use of several methods that alleviate it [22,8,23,24].
2D pose estimation 2D pose estimation is a problem that is largely considered solved, however, when it is used in the process of 3D pose estimation it has a smaller margin of error. This is the case because minor errors in the locations of the 2D body joints can have large consequences in the 3D reconstruction [19] since the errors within 2D pose estimation amplifies the error when moving to 3D. A 2D pose estimation method that have seen widespread use in 3D pose estimation for single RGB image input is the Stacked Hourglass Networks[22] proposed by Newell et al. [25]. By pooling and then up-sampling the image from many different resolutions it is possible to capture features at every scale. This allows the network to understand local features such as faces or hands while simultaneously being able to interpret them together with the rest of the image and identify the pose. Another 2D pose estimation algorithm developed by Cao et al. [26] was used before estimating the 3D pose for images of multiple views [27, 28]. Openpose uses a CNN to predict confidence maps for body parts and affinity fields that save the association between them, a greedy parser is then used to get the resulting 2D pose estimation. This method has been proven to work in real time and can provide the location of fingers as well as facial features.
2D to 3D pose conversion 3D pose estimation from a single image is an ill posed problem because of the 2D nature of the source data, this imposes challenges where the 3D pose estimator has to resolve depth ambiguity while also trying to infer the position from occluded limbs [19]. Among related work, a common approach to this problem is regression [8, 24, 22]. By using a CNN to calculate a heat map of the human, it is possible to create a bounding box around the human which normalises the subjects size and position. Thus freeing the regressor from having to localise the person and estimating the scale. The downside to this approach is that global position information is lost [8]. The evaluation methods for this problem also use a pelvis centred coordinate system in which there is no transformation between the subject and the camera. Dabral et al. [21] realised this was a problem for applications such as action recognition and proposed a weak-perspective projection assumption. This assumes that all points on a 3D object are at
roughly the same depth(the depth of the root join) and requires a scaling factor for the object which is estimated by comparing the 2D and 3D poses. A limitation on this method is that it does not work when the human is aligned with the optical axis. Furthermore, it is not intended to be highly accurate, but rather a system to make spatial ordering apparent. Similarly to this approach, Mehta et al. [8] proposed weak perspective projection, however, their approach do not require iterative optimisation which makes it less time-consuming. By using a generalised form of Procrustes analysis to align the projection of the 2D and 3D pose, the translation relative the camera can be described by a linear least squares equation.
To improve the robustness on 3D pose estimation outside of lab environments Yasin et al. [23] proposed two separate sources of training data for 2D and 3D pose estimation. The 3D data was gathered from a motion capture database and projected as normalised 2D poses. A 2D pose estimator would then estimate the pictorial structure model and retrieve the nearest normalised 3D pose which would be an estimate of the final pose. By projecting the 3D pose back into 2D the final 3D pose is found by minimising the projection error.
Similarly to this Wang et al. [24] also suggested that the final 3D pose should be projected back into 2D and used to improve the result. The major distinction between the two methods is that Yasin et al. utilises a K-nearest neighbour to improve the estimate while Wang et al. feeds the projection error into a CNN.
Another approach to solving the lack of a sufficient dataset is Adversarial learning [22], which employs the use of two networks: a generator that creates training samples and a discriminator that tries to distinguish them from real samples. The objective of the generator is to create 3D poses good enough to fool the discriminator into thinking the samples are real. Its architecture is based on the popular stacked hourglass with input both from 3D and 2D annotated data. The 2D to 3D converter in a generative adversarial network can only become as good as the discriminator, therefore a lot of emphases is placed on the discriminator which is based on a multi-source architecture that combines CNNs with input from the image used to generate the data, geometric descriptors and a 2D heat map as well as a depth map.
3.1.2 Multi view RGB image
According to Sarafianos et al. resolving depth ambiguities in 3D pose estimation would be a much simpler task if depth information could be obtained from a sensor [19]. Additionally, Amin et al. [29] argues that the search complexity can be reduced significantly by treating this problem as a joint inference problem between two 2D poses as opposed to a single 3D pose. With two different viewpoints available, stereo matching can be used to calculate the depth which unlike methods used for single view depth inference does not rely on estimations. Therefore, the methods presented in this section will be more robust than the ones presented for single view.
2D to 3D pose conversion Both Garcia et al. [28] and Schwartz et al. [27] utilised OpenPose explained in section 3.1.1 for 2D pose estimation. Garcia et al. used the joint locations from open pose to rectify the image and then as features for triangulation. Schwartz et al. also used the joint locations to remove joints which were only visible from one camera. A heat map generated from OpenPose was then randomly sampled from and back projected the pixel coordinate as a ray from which a 3D joint hypothesis was constructed from the point closest to all the rays. The 2D pose confidence was then calculated by projecting the 3D position to the 2D heat maps. On top of this Belief propagation was used for posterior estimation and temporal smoothness was used to reduce the jitter between frames. Hofman et al. [30] suggested a reversed approach in which the 2D pose is used to find a set of similar poses in a 3D pose library, the 3D pose is then evaluated by projecting it to the other cameras and comparing it with the 2D poses for each camera. If the error is too large the 2D poses are discarded, otherwise the triangulation and projection error is minimised by trying with more 3D poses and calculating the error. The best ranked results are then optimised with gradient descent.
Direct 3D pose estimation One of the problems with 2D to 3D pose reconstruction is that 3D information has to be inferred before the depth can be calculated [27,31]. Gai et al. [31] proposes a solution to this by first finding the relation between the different views and then estimating the
pose. This is done with a ResNet that inputs each image from different views and then merges the information in the pooling layer. Regression is then used to estimate the pose and shape of the human after which an adversarial network was trained to estimate the mesh whose error is propagated through the entire pipeline. This solution runs in real-time and is comparable to similar implementations done on single view RGB images with the distinction that it calculates the global coordinates of the 3D pose. Gai et al. also discovered that the joint error decreased when the number of viewpoints increased.
3.1.3 Data sets and evaluation
Pose estimation is often implemented with an AI approach that requires large datasets. While 3D pose estimation from a single image suffers heavily from poor data sets which contain either diverse or ground truth for joint depth. This is detrimental for machine learning approaches, how-ever, approaches to mitigate this issue exists [22,23,24]. 2D pose estimation does not suffer from this issue since the 2D pose estimators are trained on 2D pose estimation data which is extensive, diverse, and has ground truth. The depth is then calculated through geometry and therefore does not suffer from bad training data. Another downside to the lack of a good data set is that every researcher has to decide which data set to use, this results in several methods that are not directly comparable to each other. This can be a problem when every paper is claiming to improve on state-of-the-art by only comparing to the papers that used the same data set.
One data set that is recurrent among articles in 3D pose reconstruction is Human3.6 by Ionescu et al. [32] which has a standard evaluation protocol. The metric to determine the quality of the match is MPJPE which represents the error between the estimate and ground truth using Procrustes alignment. Another dataset of interest is CMU Panoptic which represents humans in a lab environment using a dome mounted with cameras. The dataset includes a total of 31 HD cameras, 480 VGA cameras, and 10 RGB-Depth sensors. Full 3D poses are captured of humans socializing, dancing, playing musical instruments, or showing off a range of motion [33].
4
Problem formulation
There are several challenges associated with human pose estimation, the first of which is related to the 3D annotated datasets. Since 3D annotated datasets are both expensive and difficult to create outside lab environments it can be difficult to adapt the model to the desired scenario. In order to build a model that can accurately find the wrist and elbow in the perspectives presented in figures 1, 2, and 3, it is necessary to have a dataset that represents these situations.
4.1
Limitations
To make this thesis manageable in the set time frame several limitations and constraints have been put on the work. Originally, this endeavour started in Spain, where access to the physical robot was possible, therefore, testing the solution directly on the robot was a big focus. Due to the impact on Covid-19 the project was moved to Sweden and access to the robot was no longer possible. Consequently, the project took on a more theoretical approach, in practice, this meant a lot more focus was put into identifying a good gripping location and the robustness of the solution as opposed to the interaction between the camera and robot manipulator.
L1 There will be one stationary human lying down in the camera frame.
L2 The exact gripping location will not be identified, instead, the wrist and elbow joints will be identified as it is assumed the ideal gripping location is on a vector between the joints. L3 Movement of the mobile manipulator will not be considered.
4.2
Constrains
C1 Images to test the solution will not be taken from a camera mounted on a robot manipulator. C2 Navigation of the robot manipulator will be considered out of scope for this thesis.
C3 Only solutions which use a monocular camera will be considered.
C4 The images used for this thesis is collected with a monocular camera, therefore, the approach has to take this into consideration.
4.3
Hypothesis
A monocular camera mounted on a robot manipulator provides
sufficient information to detect a prone human and identify a suitable
gripping location.
4.4
Research questions
RQ1 Can a gripping position be identified regardless from which direction Valkyrie
approaches a prone human?
RQ2 Is a data set designed to represent a prone human from a multitude of angles
necessary to achieve an acceptable estimation of the arm?
5
Methodology
This project will follow Agile guidelines [
34
] as they are prevalent within the
industry with employers inquiring if recent graduates are familiar with it. There
are several agile methodologies to choose from, however, since there is only one
participant in this project a modified model of SCRUM and feature-driven
development has been devised and is explained in the section below.
The project starts with a research phase, the goal of which is to develop a better
understanding of the problem and finding state-of-the-art solutions, this phase will
end with a review of the information after which a solution will be decided upon.
The next stage then starts which is implementation-specific planning which
consists of creating a backlog of features where every feature has a development
and design plan as well as a priority list in which order the features should be
completed. The features will also be divided into several different stages that
represent core functionality and then future expansions. The next step an iterative
design process begins, where, similarly to SCRUM a feature(sprint) is selected and
then implemented. The feature is considered completed after each item has been
fulfilled in the definition of done (see table 1). When a feature is complete the next
feature in the list is selected, however, if the feature fails to meet all the criteria in
the definition of done that feature is skipped and instead placed back into step one
or two depending on the issue. Similar to SCRUM, this implementation phase will
consist of the number of days decided upon during the implementation planning.
After the implementation is complete the system will be evaluated as a whole and
once that is done finalization of the report and presentation are the last steps in
this project. The Gantt chart can be seen in figure 5
Definition of done
FeatureFunctional test passed O
Feature evaluated and results recorded O
Acceptance criteria met O
Feature documented in project report O
Table 1: All 4 criteria required to fulfil the definition of done
6
Method
The proposed method to find the ideal gripping location on a prone human
consists of first taking a picture, then moving the robot manipulator on which the
camera is mounted, and taking another picture. Each of the pictures is then used
for 2D pose estimation, and the results are later triangulated. The flowchart of
this system can be seen in figure 6.
Figure 6: Flowchart of the complete system read left to right where the 2D pose estimation block represents the models presented in section 6.2
6.1
Evaluation of State of the Art
To simulate a prone human being approached from a multitude of angles, several
different datasets were considered. A common theme among them was a focus on
upright humans with a camera at chest height, most often in social or
sports-related scenarios. This is most commonly to represent humans for the
purpose of action understanding, surveillance, HRI, motion capture, and CGI [
35
].
As a result, an already existing dataset could not be used for the purpose of this
report, instead, a dataset had to be modified in an attempt to represent the
scenario. This limits the number of available datasets since not all datasets are
under a license that allows modifications. One dataset that does is CMU Panoptic
[
36
]. Furthermore, the dataset is constructed using a dome mounted with cameras
to capture different views. This allows the simulation of approaching the human
from different directions.
6.1.1 Modified CMU Panoptic
The CMU Pantoptic dataset has been used extensively in research and consists of
segments of social situations, range of movement, and dancing. CMU panoptic is
the largest 3D annotated dataset seen from the number of camera views [
37
],
unfortunately there are no segments where the focus is on a human in a prone
position. To remedy this, the dataset will be modified by rotating images taken
from a range of motion segment 90, 180, and 270 degrees to represent the pose a
prone human would have when approached from the head or sides this can be seen
more clearly in Figure 1. Furthermore, a zoomed-in view of the right arm will also
be added to test if the arm is identifiable when the rest of the human is obscured.
Figure 7 shows example images taken from the modified dataset. This dataset will
be refered to in the report as ”Modified CMU Panoptic”
Figure 7: One of the perspective of the modified CMU panoptic dataset where the first row from left to right is the original image and the obscured image. On the second row from left to right are the images that are rotated 90, 180 and 270 degrees respectively. All images have the same amount of pixels, the 90 and 270 degrees have been cropped for this figure to reduce its size.
The state-of-the-art method was evaluated based on MPJPE of the wrist and
elbow joint. This is the same metric that is used by related work except for the
purpose of this report only the wrist and elbow joint of one of the arms is
considered. Since the cameras cover a 360 degree perspective around the human
only one of the arms is considered since using the closest arm would result in
mirrored results from cameras facing each other. The results are segmented by
rotation, crop, and placed in a grid to display triangulation between individual
cameras. A table graph showing how often all the required joints were not
detected is also presented in section 8.
6.1.2 Choice of state-of-the-art method
Among the state-of-the-art, there are several interesting methods, some of which
are already implemented and are free to use for research purposes and some which
are not implemented. When choosing which method to evaluate several factors
were considered. Firstly, how difficult would the model be to implement on
Valkyrie including training and eventual porting to Robot Operating
System (ROS)? Secondly, is this method proven to work in literature? Is the
method available with the author’s implementation or does it require
implementation and training from scratch?
The chosen state-of-the-art method to evaluate is Openpose because it already has
a ROS implementation that can integrate with Valkyrie, free to use Caffe model,
as well as, several TensorFlow ports which make transfer learning easier.
Furthermore, Openpose is well established in the literature and this choice also
coincides with the wishes of UMA.
6.2
Adaption of State-of-the-art
In an attempt to create a rotation-invariant model two different approaches were
tested. The first of which attempts to make Openpose rotation invariant by
rotating the input training data that is used to train the model end to end. The
intention behind this is if differently orientated humans are present in the training
data, hopefully, the CNN can adapt to be able to identify human limbs in all
scenarios. The second approach adds a DCNN as a preprocessing step that
extracts the orientation of the human which is used to rotate the image before
rotating the skeleton back after 2d pose estimation. This has been done previously
by Kong et al. [
38
] to make a hand pose estimation model rotation-invariant.
6.2.1 Training OpenPoseThe creators of Openpose has provided the training code to train Openpose end to
end, on the original model this was done using the Common Objects in
Context (COCO) dataset[
39
]. In an effort to achieve a rotation-invariant 2D pose
estimation model the input COCO dataset was rotated randomly in the
preprocessing step and used to train both openpose and a MobileNet 2D pose
estimation solution.
6.2.2 RotationNet
As an alternative to rotation invariant 2D pose estimation, a preprocessing step
that attempts to extract the angle of the upper body was created. By taking the
MobileNetV2 architecture and adding two fully connected layers separated by an
activation layer with the Rectified Linear Unit (ReLU) function, it is possible to
create a system that takes an input image and treats it as a regression problem.
The desired output of this architecture is the angle at which the image has to be
rotated to get the human aligned with the vertical axis of the image. The 2D
skeleton can then be extracted using 2D pose estimation after which the skeleton
will be rotated back the same amount so that it aligns with the original image.
The flowchart of this system is presented in figure 8.
Figure 8: Flowchart of the 2D pose estimation using RotationNet and OpenPose, in the flowchart of the whole system in figure 6,
6.2.3 CMU panoptic trainable
The ”CMU panoptic trainable” dataset was created using 120 different views of six
people doing the same movements. The movements were selected from a sequence
where first a series of arm motions are enacted after which the whole upper body
Figure 9: Four different frames from the CMU trainable dataset with the vectors showing the offset rotation plotted to the left and the correctly orientated image(desired before openpose) to the right. On the left images the blue vector is the line between the pelvis and neck while the orange line is the vertical vector starting at the pelvis.
is moved. There is only one human in the frame at a time, and each picture has a
corresponding 3D skeleton and offset rotation. The offset rotation is extracted
with the formula presented in equation 3 where n
1represents a unit vector
originating in the pelvis orientated towards the neck and n
2represents a unit
vector originating in the pelvis orientated vertically with respect to the image.
θ = arccos n
1· n
2(3)
These vectors are further demonstrated in figure 9 where the left image shows the
offset and the right image shows the corrected image RotationNet is supposed to
feed to Openpose.
6.3
Ethical considerations
According to the license of CMU Panoptic[
40
], the modified datasets used in this
thesis are not allowed to be distributed. No other additional sources of data were
collected during this project, therefore, there are no ethical considerations
7
Implementation
This section explains how each of the methods were realized, as well as, more in
depth descriptions.
7.1
Modified CMU panoptic
The modified dataset was created using 25 HD views from the CMU panoptic
pose1 sample [
41
] which depicts one human moving his arms for 101 frames. These
images were then modified so for each frame there exists one normal image, one
cropped around the right arm, and three rotated 90, 180, and, 270 degrees. The
images that were rotated were not cropped, instead, they have the resolution
1080x1920, as opposed to, 1920x1080.
7.2
Training openpose
In the preprocessing step used for the end-to-end training of openpose, several
image augmentations are made. These include random scaling, rotation, flip, X,
and crop. In addition to creating more robust models that can handle nonperfect
images, this also reduces the risk of overfitting by artificially increasing the
dataset. With random variables in the preprocessing, the dataset can be used for
several epochs without seeing the same image twice. This was taken advantage of
when training openpose to be rotation-invariant because all that was necessary to
feed rotated training images was increase the max and min allowed rotation in the
preprocessing function. The hardware used for this project was Jetson Xavier
AGX, unfortunately, the machine learning framework Caffe which openpose is
built on does not support cuDNN 8.0
2[
42
]. As a substitute, a TensorFlow port
which recreated all the original preprocessing [
43
] was used which was allowed to
train for 10 days on the Jetson Xavier platform. After the OpenPose training was
interrupted an ImageNet model that was implemented by the same git repository
was trained for 7 days.
7.3
Trainable CMU
The CMU trainable dataset is also created from the CMU-panoptic range of
motion pose 1, however, CMU trainable consists of 1851 frames as opposed to the
101 in the modified dataset. These frames were hand-selected from six different
subjects performing a series of range of motion movements including moving the
arms and upper body. The images are captured at the resolution of 640x480 from
120 different camera views. In total there are around 220000 images split 60-20-20
for training, validation, and testing. Each of the images has a groud truth 3D
skeleton, as well as, an offset rotation which was calculated using the cross product
of the vector between the pelvis and neck and a vertical vector. The testing
dataset was then rotated randomly between -179.99 and 180.00 degrees and this
was added to the offset rotation to create a ground truth rotation. The rotation
cropped the images so the resulting resolution stayed the same after the rotation
and the empty spaces were filled with black pixels. Bilinear interpolation was used
to avoid artifacts created by the rotation. A copy of the testing data was also
copied and rescaled to 224x224 before it was rotated, this was done so that the
RotationNet could be tested on data that was preprocessed the same way as the
training and validation data. Both of these models were trained with a batch size
of 16 because it was the highest possible value without running out of memory.
7.4
RotationNet
The RotationNet was implemented in TensorFlow 1.15[
44
] using Keras [
45
]
implementation of MobileNetV2 which ends with 1000 outputs that represent
different classes in ImageNet. On top of this was a dense layer that reduced the
number of outputs to 32, followed by an activation layer with the ReLU function,
followed by another dense layer that reduces the total number of outputs to one.
The architecture can be seen in figure 10.
7.4.1 Architecture
RotationNet is based on MobileNetV2 and only adds three layers to its
architecture. A fully connected layer which reduces the number of variables from
the 1000 classes output of ImageNet models down to 32, the second added layer is
an activation layer with a ReLU followed by a second fully connected layer that
brings the total number of variables down to one. MobileNet is 157 layers deep
and is considered a DCNN, when using the MobileNetV2 architecture of 224x224
it has a total of 3.4 million variables which puts it at the lower end of ImageNet
models. The goal of using MobileNetV2 is to use transfer learning to adapt the
features already learned when training on the ImageNet and fine tune them to find
the correct orientation.
7.4.2 Training
The training images were first resized to 224x224, normalized, and then rotated
randomly between -179.99 and 180 degrees which were then added with the offset
rotation native to the image. This was done in an effort to artificially increase the
size of the dataset in an effort the prevent overfitting. The dataset was then
randomly indexed into a buffer the size of the dataset in order to shuffle the
images. Figure 11 shows a grid of six images after the preprocessing step.
The input resolution of 224x224 was used since that is the native resolution on
which ImageNetV2 was trained so to take advantage of the pre-trained weights
this input resolution was necessary. A batch size of 128 was used during the
training, this means inference on 128 images was ran and the weights were
updated to minimize the error on all images. This value was used to add as many
rotations as possible to the batch so as the model would converge. The loss
function used was mean squared error, this is a common loss function for
regression problems and was chosen because it punishes bad estimates with a
higher loss value compared to mean absolute error.
To prevent the model from unlearning all previous knowledge when the new
weights are tuned from random, the model is trained in stages. During the first
pass, only the weights of the three last layers are updated and after that, all
weights will be updated. This was the largest power of two possible with the
hardware of this project. Throughout the training validation loss was monitored
Figure 10: Architecture of the RotationNet with the MobileNetV2 model summarized into one block. The global average pooling layer and Dense(fully connected) layer following the MobileNet interprets the features extracted from the MobileNet to classify the ImageNet dataset, the remain-ing layers are implemented to adapt the structure to RotationNet.
Figure 11: The first six elements in the shuffled and preprocessed dataset used for training, on top of each image is the ground truth rotation of each image with positive values being counterclockwise oriented and negative values being clockwise oriented.
and when three epochs without an improvement occurred, the training was
interrupted and the weights from the best performing epoch were saved.
7.4.3 EvaluationThe RotationNet was individually evaluated based on the mean absolute error,
standard deviation, and variance in order to get an understanding of how successful
the training had been and what result one can expect from the RotaionNet. The
entire subsystem shown in image 8 was then compared to OpenPose, both
Openpose and RotationNet were tested on the images of the CMU panoptic
trainable dataset which had been designated for testing and not been previously
seen by any of the algorithms. The results that were compared between the two
models were MPJPE of the image coordinates, resulting in MPJPE expressed in
pixel values and the frequency of misses similarly to the evaluation of the
state-of-the-art using modified CMU Panoptic. These results were then sorted by
panel index where each panel had 5-7 cameras and by rotation of the input image.
8
Results
This section presents all the results obtained from the implemented methods, the
results are primarily presented in the form of images and tables but are also
explained in plain text.
8.1
Evaluation of applicability of the state of the art
The MPJPE of the right wrist and elbow for the unmodified images was 1.4cm and
overall shows a similar result regardless of the camera combination used. There are
two outliers, however, between cameras nine and sixteen, as well as, camera
twenty-three and twenty-six. For the cropped images the MPJPE was 1.4961cm
and the same outliers exist. For the 90 degree rotated images, it is clear that
cameras three, five, and sixteen have miss-labeled the elbow and wrist joint,
otherwise, the cameras have successfully identified the wrist and elbow joint. The
MPJPE or the 90degeree rotated images is 24.0cm. In the 180degree rotated
images, 10 different cameras had difficulties determining the position of the joints,
which led to an MPJPE of 96.7 cm. For the 270 degree rotated images, there were
three cameras that did not find the correct joints similar to the 90 degree rotated
images, however, the problem was located at cameras fifteen, seventeen, and
twenty as opposed to cameras three, five, and sixteen as seen in figure 12.
Figure 12: Results from OpenPose on the modified CMU panoptic dataset which has been cropped or rotated counterclockwise expressed as MPJPE between all different camera pairs in cm.
Figure 13 shows the percentage of frames open-pose failed to detect the wrist or
elbow joint. In the unmodified dataset four cameras had more than 25% misses.
The cropped dataset has similar results with a few more misses overall but nothing
significant. The 90 degree dataset has significantly more problems with the same
four cameras with more than 40% misses in addition to camera seventeen which
has above 30%. For the 180 degree dataset, the results are much worse. A
majority of the cameras have above 30% miss on the tested data with camera 16
that had no misses on the normal and cropped dataset spiking with more than
80% misses. The 270 degree dataset has problems with the same cameras as the 90
degree dataset, however, overall it has performed better. where the 90 degree
dataset had five cameras around 15-20% the 270 degree dataset only had one.
Figure 13: Results from OpenPose on the modified CMU panoptic dataset which has been cropped or rotated counterclockwise expressed as percentage of times the wrist or elbow was not found sorted by camera.
8.2
State of the art training
The open-pose model was trained for just over 250 hours, this equates to seventeen
epochs of the coco2017 dataset which totals close to 120000 images. As can be
seen in the graph in figure 14a the loss value varies between 200 and 500 and is not
converging with time.
The MobileNet thin model was allowed to train for approximately 120 hours,
which equates to twenty six and a half epochs since the model has less complexity
compared to Openpose. The loss during training starts very high but then
fluctuates between 200 and 450. The loss is somewhat lower than openpose,
however, the Mobilenet thin does not converge either. The training graph can be
seen in figure 14b
8.3
RotationNet training
Since the RotationNet addresses a regression problem it did not require as much
training, the model trained for just over two days before the termination condition
was met. The training graph displays loss plotted against epochs can be seen in
figure 16a and 16b. The RotationNet converged after 30 epochs and the data put
aside for testing had a mean absolute error of 4.5, standard deviation of 6.3, and
variance of 40.1. The error distribution can be seen in figure 15.
(a) Openpose (b) MobileNet thin
Figure 14: Training graph of the Openpose & MobileNet thin respectively using rotated coco images where the x-axis represents every 500th batch size and the y-axis represents the loss value.
Figure 15: Histogram showing the error distribution of the RotationNet where the Y-axis represent the frequency expressed in percent and the X-axis the error.
8.4
RotationNet subsystem
The MPJPE of the subsystem in figure 8 was 8.9 pixels, whereas, for Openpose the
MPJPE was 31.1 resulting in a reduction of 71.4% when both the right elbow and
wrist were found. On the RotationNet subsystem, the number of cases where
either the elbow or wrists were not found was 32.1% compared to the 60.9% of
Openpose. When sorting the MPJPE by rotation one can see that Openpose
performs consistently within the -50-50 degree range, however, outside of that
interval the MPJPE suffers greatly. The graph resembles an inverted bell curve
with peaks close to 120 and valleys at approximately 10. The plot can be seen in
image 17a. In comparison to Openpose, RotationNet is consistent across all angles
with an average MPJPE similar to the -50-50 range of Openpose. This can be seen
in image 17b When comparing the MPJPE based on the panel index it is clear
that RotationNet is consistently better than Openpose, although, both methods
have views which are close to 150% the average. For RotationNet these are from
panel one, two, three, nine and nineteen. While for Openpose, these views are
from panel one, two, three, sixteen, and seventeen. The graphs can be seen in
(a) Mean squared error (b) Mean absolute error
Figure 16: Training data of RotationNet with the mean squared error & mean absolute error of the validation data expressed in the y-axis and the epoch expressed in the x-axis.
(a) Openpose (b) RotationNet
Figure 17: MPJPE of Openpose & RotationNet respectively tested on CMU Panoptic trainable testing data where the Y-axis represents MPJPE and the X-axis represents the rotation of the image.
figure 18a and 18b.
The frequency of misses on Openpose looks like an inverted bell curve similar to
the MPJPE curve, however, the misses degrade faster with the valley between -30
and 30 where approximately 30% of all images miss. The peaks reach almost 90%
as can be seen in figure 19a. Just like the MPJPE sorted by rotation, the misses of
RotationNet are consistent across all rotations with an average close to the best
section of OpenPose 19b. When sorting the Misses by panel RotationNet has panel
one and two above 70% misses and a total of six out of twenty above 40% 20b. In
contrast to this, every panel has at the least 40% misses where panels one and two
are close to and above 80% respectivly.
(a) Openpose (b) RotationNet
Figure 18: MPJPE of Openpose and RotationNet respectively tested on CMU Panoptic trainable testing data where the Y-axis represents MPJPE and the X-axis represents the viewpoint expressed in panel index.
(a) Openpose (b) RotationNet
Figure 19: Frequency of misses of Openpose & RotationNet respectively tested on CMU Panop-tic trainable testing data where the Y-axis represents the percentage of misses and the X-axis represents the rotation of the image.
(a) Openpose (b) RotationNet
Figure 20: Frequency of misses of Openpose & RotationNet respectivelytested on CMU Panop-tic trainable testing data where the Y-axis represents the percentage of misses and the X-axis represents the viewpoint expressed in panel index.
9
Discussion
In this section the results will be interpreted and compared to the research
questions.
9.1
Evaluation of applicability of the state of the art
Based on the results from OpenPose on the modified CMU panoptic dataset several
conclusions can be drawn. Firstly, on the unmodified dataset, the MPJPE was 1.4
which is significantly lower than what is presented by the current state-of-the-art.
This is most likely due to the fact that the current state-of-the-art uses MPJPE
across all key points and there are a lot of ambiguities in the torso. Secondly, The
cropped dataset had only had a 6.8% increase in the MPJPE which suggests that
the zoom of the approach is no issue for Openpose. Thirdly, all the cameras that
had a lot of misses on the unmodified and cropped dataset were situated to
capture the left side of the human which means the torso obscured the right arm.
This can be seen in figure 21 and considering that a robot could grab the left arm
in these scenarios this is not considered an issue. Fourthly, no correlation between
the cameras situated in the ceiling/floor and the MPJPE or number of misses can
be made, therefore, the conclusion that the height of the approach is not an issue
has been made. An important distinction is that the cameras placed near the floor
do not capture the same angle as the ones place above the head, which, means the
height of the approach could be an issue then approached from the feet, however,
no argument for that theory can be made with the modified CMU panoptic
dataset. Lastly, the angle of the approach shows a worse result based on how much
it is rotated from normal. six out of the seven cameras that had the most misses
on the 90 and 270 degree datasets were looking at the human from the subject’s
left side. In most of these cases, the arm was visible but an argument can be made
that it was partially to fully obscured in some of the frames. This indicates that
when the human is rotated it is harder for Openpose to recreate the 2D skeleton
based on where it thinks the arm should be. This is further corroborated by the
Figure 21: This image shows the self occlusion present in the CMU Panoptic modified dataset from HD cameras 7, 9, 11 and 15