
LiU-ITN-TEK-A--20/054-SE

Implementation and Evaluation of Gesture-less Mid-Air Interactions on a Stereoscopic Display

Thesis work carried out in Medieteknik at Tekniska högskolan at Linköpings universitet

Nicholas Frederiksen
Jakob Gunnarsson

Norrköping, 2020-09-24

Department of Science and Technology, Linköping University, SE-601 74 Norrköping, Sweden
Institutionen för teknik och naturvetenskap, Linköpings universitet, 601 74 Norrköping

Abstract

This report presents the work of a master thesis aimed at creating a modern stereoscopic mid-air interaction setup and at developing and evaluating gesture-less mid-air interactions on said setup. A mid-air interaction system was created that combines OpenPose's 2D hand joint detection with the depth data from an Intel depth camera to track a user's hand position in three dimensions. Four mid-air interactions were developed: three gesture-less, and one gesture-based for comparison. One of the developed gesture-less interactions is capable of 6 Degrees-of-Freedom (DoF). A user study was conducted to test the three gesture-less interactions' intuitiveness, the setup's stereoscopic immersion, and the task completion speed of the new gesture-less 6-DoF interaction.

Acknowledgments

First of all, we would like to thank Karljohan Lundin Palmerius for creating this exciting opportunity that would become the subject of this master thesis. We would also like to thank him for sharing his deep knowledge of VR. Secondly, we would like to thank our supervisors Henry Fröcklin and Ali Samini for helping out with the report and the outline of our user study, and for giving us great inspiration for our interaction modes, especially Plane. We could not have made it without you sharing your vision. Lastly, we would like to thank the people at Carnegie Mellon University who created, released, and maintain the OpenPose library, which became the basis of our 3D hand tracking solution.

Contents

Abstract
Acknowledgments
Contents
List of Figures
List of Tables

1 Introduction
  1.1 Background and Motivation
  1.2 Aim
  1.3 Research questions
  1.4 Delimitations
2 Related Work
  2.1 Mid-Air Interaction
    2.1.1 Setups
    2.1.2 Occlusion
    2.1.3 Interactions
  2.2 Camera-based Hand Tracking
    2.2.1 Hand Pose Estimation
    2.2.2 OpenPose
  2.3 LiU's NanoSim
3 Implementation of a Mid-Air Interaction System
  3.1 Platform and Framework
  3.2 RGB-D Camera
  3.3 Stereoscopic Rendering of Scene in Unity
  3.4 3D-Tracking of the Hand
    3.4.1 OpenPose with a Dynamic Bounding Box
    3.4.2 Converting Pixel- to Metric Coordinates: Going From 2D to 3D
  3.5 Calibrating for the Correct Coordinate System
  3.6 Reducing the Effect of Input Noise
  3.7 Improving Performance Issues
4 The Four Mid-Air Interaction Modes
  4.1 Spring
  4.2 Sticky
  4.3 Plane
  4.4 Proximity Grab-and-Drag
  4.5 Optimized RGB-D Camera Positioning
5 User Study
  5.1 The Three Tests
  5.2 Test Environment
  5.3 Pilot Studies
  5.4 Participants
  5.5 Design and Procedure
    5.5.1 Test A: Interaction Intuitiveness
    5.5.2 Test B: Stereoscopic Immersion
    5.5.3 Test C: Manipulation Performance
  5.6 Expected Outcome and Hypotheses
6 Results
  6.1 The Setup
  6.2 User Study
    6.2.1 Test A
    6.2.2 Test B
    6.2.3 Test C
7 Discussion
  7.1 The Setup
  7.2 User Study Results
    7.2.1 Test A
    7.2.2 Test B
    7.2.3 Test C
  7.3 System Implementation
  7.4 Future Work
8 Conclusion
Bibliography

List of Figures

3.1 Diagram depicting a green cylinder in front of a screen whose center is shown by the pink triangle. a) is a side view of the viewer's frustum, b) is a 3D view, c) is a top view, and d) shows the two resulting renderings of the cylinder on the left and right eyes' near planes.
3.2 Perspective view frustum for off-center projection. l is the position of the left frustum plane on the near plane, and o_n is the projection of the eye position, o, onto the near plane, whilst o_s is the projection of o onto the screen plane. c is the center of the screen, and l_s, r_s, t_s, and b_s are the distances from o_s to the positions of the left-, right-, top- and bottom frustum planes on the screen plane. x, y, and z are the right, up, and back vectors of the eye's coordinate system, while x', y', and z' are the same vectors in the screen's frame of reference.
3.3 The basics of creating interlaced stereo images. The final image contains the odd pixel rows of the left view and the even pixel rows of the right view. By wearing polarized glasses the viewer receives the corresponding view for each eye.
3.4 The skeletal model that OpenPose bases its keypoints on. We currently only make use of 3 of the 21 keypoints, namely points 2, 3 and 7, with the potential to use more in the future.
3.5 Screenshots of the OpenPose hand skeleton and the green dynamic bounding box, which adapts to the size of the hand. A closer/bigger hand in A produces a bigger bounding box compared to the farther/smaller hand in B.
3.6 3D schematic of the setup for the manual point measurement process. The image includes representations of a TV screen, a box with three sticks, and the camera, as well as the coordinate systems for the camera and the screen, located at the respective system's origin.
3.7 The weighted mean distribution where only the most recent position and the oldest get new weights. Depending on the number of positions, ones would either be added to or removed from the function, where the ends would always stay the same.
4.1 Illustration of a user manipulating a virtual cube with the Spring interaction: first touching the surface of the cube, a), then dragging the hand to the left, creating tension in the spring and causing the object to move towards the hand, b), then finally coming to a rest state once close enough, c). Note that the orientation of the cube has changed after its journey; this is due to the effects of Unity's physics engine on a rigid body with no gravity.
4.2 Illustration of a user manipulating a virtual cube with the Sticky interaction: first touching the surface of the cube, a), then dragging the hand to the left, essentially bringing it along, b); once the hand stops, so does the cube, c).
4.3 Illustration of a user manipulating a virtual cube with the Plane interaction: first touching the surface of the cube, a), resulting in the cube leaping towards the palm, b), followed by a rotation and translation motion of the palm, with the cube following and maintaining a matching orientation, c).
4.4 Showing how the palm's plane is estimated. P0, P1, and P2 are the positions of the 3 detected joints, which together define a plane. n⃗ is the normal of the resulting plane, where X, Y, and Z are the screen's coordinate system.
4.5 Yellow lines are the borders of the P0P2P1 triangle, with point Pc as the triangle centroid. Blue lines are the normal vector of the P0P2P1 triangle. The red line, in this image, is set to be half the length of the vector P2P0 and is perpendicular to n⃗ and P2P0. The offset position, P_offset, can then be specified to be any point on n⃗, shifted to the end of the red line.
4.6 Illustration of an example of how the roll-rotation is found for a left-to-right swiping motion. The two grids represent the two planes from two consecutive frames. Inside each grid lies a blue triangle, representing the corresponding P0P2P1 triangle of each frame. Within each triangle lies the associated PcP0 vector, represented as a red arrow for the previous vector and a green arrow for the current one. The PcP0' vector is the red one rotated through a quaternion. Lastly, θ shows the desired roll angle. X, Y, and Z are the screen's coordinate system.
4.7 The inner and outer manipulation zones on a cube viewed from the left plane. The inner zone is the pink colored region on the cube, and the outer zone is the yellow region. d is the set proximity/distance between the border of the inner zone and the outer zone.
4.8 Illustration of a user manipulating a virtual cube's position with the Proximity Grab-and-Drag interaction: first placing the index finger inside the cube, having it change color to magenta, b), then forming a fist, turning it red, c). Once the cube is red, the user may re-position the cube by moving their hand freely in 3-DoF, d). Once the user has found the desired position and wants to stop the manipulation, e), they simply open their hand, after which the cube returns to its original color, f).
4.9 Illustration of a user manipulating a virtual cube's orientation with the Proximity Grab-and-Drag interaction: first placing the index finger just outside the cube, having it change color to yellow, b), then forming a fist, turning it green, c). Once the cube is green, the user may rotate the cube along a chosen axis, either the up, right, or forward vector in the screen's frame of reference. For example, by moving the hand to the right, rotation around the up vector would be locked, and any change in the hand's x-coordinate would correspond to a rotation, d). Once the user has found the desired orientation and wants to stop the manipulation, e), they simply open their hand, after which the cube returns to its original color, f).
4.10 Example images demonstrating the zone of interaction, where A is the side view and B is the top view of the same scene. The yellow lines are the RGB-D frustum, where the red lines are the clamping distances for the depth data. The blue triangle is the virtual camera frustum. The green zone illustrates the zone of visual interaction, which lies within the virtual frustum where objects can be seen. The green zone together with the yellow zone makes up the complete zone of interaction.
5.1 The left image shows an overview of the room where testing and development took place. To minimize glare on the TV, lights on the ceiling and in the hallway were covered up/blocked with paper sheets, improving the general stereoscopic immersion. The right image shows the view the user had of the setup when using the system.
5.2 Stills from Test A where A is from task 1 and B is from task 2. In B one can see the "plank" the subjects were asked to place the cubes under.
5.3 Illustration of the three screen regions.
5.4 Stills from Test B where the big cube is used in A and the small cube is used in B.
5.5 Stills from Test C where 1a is from the start of a Proximity Grab-and-Drag level and 1b is from the end of the same level. 2a is from the start of a Plane level and 2b is from the end of the same level.
6.1 Schematic of the setup. x, y, and z are the right, up, and forward vectors in the height-adjustable table's frame of reference. φ, α, and θ are 19°, 25°, and 35° respectively. The RealSense D435i cameras were positioned 0.22 m back and 0.65 m up (0.75 m in the case of camera 2) relative to the screen center, C_S, in the table's frame of reference. The user was to keep their head at a constant distance of 1 m from the screen center for optimal stereoscopy.
6.2 Images from the two cameras showing their view of the user. A is a frame from camera one, the camera used for Plane and PGD. B is a frame from camera two, used for Sticky and Spring. Both cameras use a horizontal field of view of 90 degrees.
6.3 Illustration of the final system architecture, containing the key components.
6.4 Diverging stacked bar chart of all subjects' answers from the 5-point Likert scale in Test A, expressed in percentage.
6.5 Each box shows the min, median, mean, and max angles for each rotation type in every region when using a larger cube. The type of rotation is represented by color, and the shade of the color represents which region they belong to, with the lighter and darker shades belonging to the right and left region respectively.
6.6 Test C task completion times for each subject for both PGD and Plane.

List of Tables

6.1 Subjects' answers from the 5-point Likert scale in Test A.
6.2 The most and least preferred interaction mode according to the N subjects, expressed in percentages.
6.3 Common subject sentiments formed from their comments about Spring.
6.4 Common subject sentiments formed from their comments about Sticky.
6.5 Common subject sentiments formed from their comments about Plane.
6.6 A collection of all of the mean angles for when the stereoscopic immersion breaks at its earliest, holding a 10 × 10 × 10 cm virtual cube, in front of all three regions of the TV. Each cell is the mean angle in degrees for a type of rotation in a certain region. The last column is the combined mean angles from all regions.
6.7 A collection of all of the mean angles for when the stereoscopic immersion breaks at its earliest, holding a 5 × 5 × 5 cm virtual cube, in front of all three regions of the TV. Each cell is the mean angle in degrees for a type of rotation in a certain region. The last column is the combined mean angles from all regions.
6.8 Paired samples statistics for interaction modes PGD and Plane in Test C. The mean is the task completion time for all N users in seconds. SD is the standard deviation.

1 Introduction

The thought of seamlessly interacting with computers by just waving your hands in thin air may seem futuristic when one's frame of reference is scenes from special-effects movies like Iron Man. The truth is that means of mid-air interaction exist today, and at a time when staying sanitary is at the forefront of everyday life, it has never been more relevant. Although methods of mid-air interaction exist, they are nowhere near as robust and versatile as depicted in movies, which indicates that there is much to improve in today's methods and technology.

1.1 Background and Motivation

A very applicable place to use mid-air interaction is at public exhibition spaces, to manipulate 3D digital content on a stereoscopic display for an educational purpose, as demonstrated in [1]. Mid-air interaction shows the potential of providing fun and engaging learning experiences when it is immersive and responsive. The biggest hurdle for mid-air systems at public exhibition spaces is that if the system is not intuitive to use and not robust, it is harder to capture the exhibition guests' interest and engagement. Therefore, it would be beneficial if such systems were as intuitive and robust as possible. The question is then how one makes an intuitive and robust mid-air interaction.

A majority of mid-air interactions control the 3D digital content through gesture controls, based on the metaphors of physical object operations [2] or on gestures from user-elicitation studies [3], with varying Degrees-of-Freedom (DoF). However, there lies a potential weakness in gesture-based controls, in that they limit the usability and intuitiveness of the system, as a gesture must be learned and performed correctly to give a good and consistent user experience. Therefore, it could be worth exploring alternative paths for solving manipulation controls for 3D content that do not use gestures at all.

Research on mid-air interactions that use non-gesture or gesture-less designs is not as prevalent. Some works have previously used off-the-shelf hand tracking devices like Leap Motion¹ to develop gesture-less mid-air interactions [4]. However, the Leap Motion device does not have a range or tracking accuracy sufficient for a robust experience in an exhibition environment.

¹ https://www.ultraleap.com/product/leap-motion-controller/

However, it can be seen in the field of visual deep learning that more accurate and robust hand tracking methods are being developed and introduced to the public [5]. Through deep learning, more accurate tracking of detailed points on a hand can be incorporated in the design of a mid-air interaction, allowing for more natural interactions. In 2019, a retail VR device called Oculus Quest² came out that provided support for such interactions with its version of deep learning hand tracking, showing promising potential in the concept.

Pushing mid-air interactions to be gesture-less and incorporating state-of-the-art hand tracking through deep learning may be the logical next step for the future of mid-air interaction, as it opens up possibilities for more intuitive and robust interaction designs. This is why these kinds of mid-air interactions need to be further explored and researched.

1.2 Aim

This work aimed to develop a proof of concept of a mid-air interaction system, where users can interact with 3D objects on a stereoscopic display, through 3D hand tracking based on depth data from one RGB-D camera, with gesture-less interaction techniques/modes that would improve the user experience by making the system more intuitive and robust. It was also desired to investigate, through a user study, the immersion limitations and possible benefits of a gesture-less 6-DoF interaction compared with a 6-DoF gesture interaction.

We hope that our work will contribute towards bringing mid-air interaction systems and setups to a level where they could be seen as a plausible replacement for a traditional touch table in a public exhibition environment/context. For this to happen, we believe that the intuitiveness and robustness of a system must come to a level competing with that of a touch table.

1.3 Research questions

The following research questions that this thesis aims to answer are based on the aim presented above and involve our developed mid-air interactions.

1. How intuitive are the three developed mid-air interaction modes: Spring, Sticky and Plane?
2. At which rotation angle for yaw and pitch, respectively, is the stereoscopic immersion broken when the interaction Plane is used?
3. Which interaction, Plane or a gesture-based interaction, is the fastest for solving tasks where objects are required to be rotated?

1.4 Delimitations

Our work is delimited as follows: there is no support for more than one user or more than one hand at a time; only one depth camera is used; the area of interaction for the user is limited; and there is no support for interaction with anything other than the custom Unity program.

² https://www.oculus.com/blog/introducing-hand-tracking-on-oculus-quest-bringing-your-real-hands-into-vr/

2 Related Work

As a foundation and starting point for the development of the mid-air interaction system, previous work on the subject was examined. This included a wide search among different techniques for hand tracking, which is a vital part of hand-based mid-air interaction.

2.1 Mid-Air Interaction

Mid-air interaction is a term describing gesture-based and touch-less interaction with a remote device and is often connected to, but not limited to, hand and finger movements. Real-time sensor tracking of body parts using non-intrusive sensors, often vision-based tracking using a camera, is used to achieve touch-less interaction with a system. Mid-air interaction falls under the broad term Human-Computer Interaction (HCI) and is a natural type of HCI. The concept of using mid-air interaction as a form of HCI dates back a couple of decades, but it is not until recently that reasonably priced depth cameras have made it a research subject worth pursuing [6]. The depth cameras in question are mainly the Microsoft Kinect cameras, with the first iteration of the Kinect being launched in 2010. Other depth sensors following the Kinect are the Leap Motion depth camera and Intel's RealSense series of depth cameras, specifically the D400 series. The depth cameras all have something that differentiates them from each other, which means that there is no depth camera suited for all situations. The setup conditions and the use case for the mid-air interaction dictate the choice of depth camera.

2.1.1 Setups

At the moment there is no standardized way to set up a mid-air interaction station; it varies between research papers. For example, Mendes et al. [7] use a setup consisting of two Kinect cameras and a touch table as the display. One of the cameras was placed on a tripod to the side, overseeing the user and tracking the head, while the other camera was placed above the table pointing down in order to track the user's hands. Another example of fixed positions of depth cameras is the work of Chen et al. [8], where the setup's display was a projector screen and an Intel RealSense camera was mounted right below the screen facing the user, who would be standing at a fixed distance from the display and camera.

More compact setups, like that of Speicher et al. [9], combined a Head-Mounted Display (HMD) with a Leap Motion camera mounted on the HMD as a means to achieve mid-air interaction. Then there are studies that focus on the effects of mid-air interaction rather than the methods themselves, and use dedicated tracking gloves or other intrusive markers to retrieve tracking data [10][11].

A vital part of mid-air interaction is that the interaction should not put unnecessary strain on the user, causing early fatigue. Paper [11] studied the question of how comfort influences task performance for 3D selection using mid-air interaction. The authors conducted a user study on which arm positions, out of 26 predetermined positions, were the most comfortable for the user while standing. The study showed that positions where the elbow was bent with the arm below the chest were the most comfortable. A follow-up study, also presented in [11], concluded that the user was most comfortable when they had the option to rest their elbow on an armrest during mid-air interaction tasks. They also conducted a Fitts' law¹ experiment, which showed that comfort had an impact on performance for 3D selection.

2.1.2 Occlusion

Specific problems that arise when using mid-air interaction on a distant display, especially together with a stereoscopic display, are occlusions and visual conflicts between the hand and interactive 3D objects on the screen. Occlusion occurs because the hand is used directly as a means of interacting with objects displayed on a distant display behind the hand. It is often mentioned, but not fully explored, what ramifications occlusion has on user performance [7][11]. A study from 2013 investigated whether visual conflicts affected the user's ability to complete 3D selection tasks [10]. It specifically tested the performance difference between a direct hand approach, where occlusion occurred, and a cursor approach, where there was an offset between the user's hand position and the cursor object that was used to interact with the other objects in the scene. The findings showed that the direct hand approach was more effective in completing the tasks because it was a more natural interaction in comparison to the offset cursor. This supports the idea that visual conflict and occlusion do not have a negative effect on completing trivial tasks in mid-air interaction. However, the authors also concluded that occlusion becomes a bigger issue for a more visually cluttered scene and recommend using the cursor approach in those cases. Although the direct hand had a greater effective throughput, based on both speed and accuracy, the cursor was deemed superior as it provided higher precision at selecting and gave fewer errors.

2.1.3 Interactions

There are different ways of interacting with digital content, such as 3D objects, when using mid-air interaction. As mentioned earlier, occlusion is undesirable, and direct hand interaction without an offset will always cause occlusion when dealing with a distant display, which makes the search for other types of interaction methods interesting. There are several methods that can be called direct hand, all depending on the types of hand poses that are defined and the system's DoF. In 2014, Mendes et al. [7] conducted a user evaluation study in order to find which interaction, out of four different mid-air interactions and one touch-based interaction, was the most compelling for the user.

As mentioned before, the setup in this study used a stereoscopic interactive table with two Microsoft Kinect devices, one for head tracking and one mounted above the table for hand tracking. The hand tracking was done using the open-source library 3Gear Systems, which was discontinued in late 2014. The interaction techniques tested were 6-DoF Hand, 3-DoF Hand, Handle-bar, Air Translate-Rotate-Scale (TRS), and Touch TRS + Widget.

¹ Fitts, Paul M. (June 1954). "The information capacity of the human motor system in controlling the amplitude of movement". Journal of Experimental Psychology.

The 6-DoF Hand interaction was done with one hand, where translation was done by grabbing the object and moving the hand, while rotation was done by rotating the wrist. The 3-DoF Hand interaction was done by dividing the DoF between the user's hands: one hand was used for translation while the other was used for rotation, again by rotating the wrist. The Handle-bar interaction was a way of solving the occlusion problem. Translation and rotation were done through a midpoint located between the user's two hands. By moving the hands in the same direction, a translation would occur, and by rotating the hands like a handlebar on a bike, a rotation would occur instead. The last interaction technique was the Air TRS interaction, which was also a two-handed interaction. One hand was used for selection and translation while the other was used for scaling and rotation. By pinching outside the selected object with one hand, rotation of the object could be achieved by rotating the second hand around the first hand. The scaling was done through movement and by calculating the relative distance between the first and second hand, once both had pinched.

The authors found that the 6-DoF Hand approach was the most natural interaction for the user because it mimics interaction with real objects. The only downside was the unwanted occlusion, which was expected from a direct hand approach. Furthermore, the Handle-bar interaction was as fast as the 6-DoF Hand in completing the tasks, and it did not have problems with occlusion. Additionally, they concluded that further research on offset mid-air interaction methods should be conducted.

A later study from 2016 explored the possibilities of using direct hand interaction as a shape modeling tool [12]. The authors conducted a user study comparing, as they call it, free manipulation with an alternative interaction technique, to test the learnability, naturalness, mental load, and controllability of their system. A Leap Motion camera was used to track the participants' hands. The free manipulation was done by letting the hands and objects have their own local coordinate systems and then transforming the selected object's position based on the relation between the coordinate systems. There was also support for bi-manual interaction so that the user could use both hands; for that case, a point between the hands was defined and used as an origin for transformations. The free manipulation allowed for 6-DoF interaction with the objects. However, the method of defining the hand coordinate system was not explained and seemed to be tied to the Leap Motion device, which meant that it could not be used in this thesis. The alternative interaction technique was what they called the steering wheel metaphor and was a means for more precise interaction with an object. When the user approaches a side of an object, a graphic of a wheel appears from that side, protruding out of the object with a small offset. In contrast to direct hand interaction, when using this technique the user indirectly controls position, rotation, and scaling by moving and rotating the wheel attached to the object. Their results from the user study concluded that the users preferred free manipulation even though the wheel metaphor provided more precise movement. The difference in precision was something the users were willing to sacrifice because the naturalness of free manipulation was more rewarding to use.

Both interaction techniques were easy to learn since they mimic real-world interactions and were familiar. The authors also concluded that the controllability was largely affected by the Leap Motion camera, which had insufficient tracking accuracy and did not register the small finger movements which the user relied on for precision.

2.2 Camera-based Hand Tracking

An essential part of mid-air interaction is the motion tracking of the hand(s). Tracking hands is important for recognizing gestures and manipulating content in the digital environment.

Studies on mid-air interaction in recent years have used a variety of tracking technologies to track the user's hand(s) [6]. These tracking techniques vary from Motion Capture Systems [13][14][15], to wearable tracking devices [16][10], to computer-vision-based tracking with depth/IR cameras (e.g. Leap Motion, Microsoft Kinect) [4][7]. However, many modern mid-air interaction prototypes use the latter technique and end up using a form of hand pose estimation in order to track hand movement [7][8][17].

2.2.1 Hand Pose Estimation

Hand pose estimation is the process of trying to recreate a digital hand which matches the position and orientation of the palm and fingers to that of an observed human hand, in either 2D or 3D space. Lately, it has become more popular to describe hand pose estimation as a process where one finds all of the defined (often 21) joints of a hand and their positions in a 2D or 3D image. It is then from these found joint positions that one can proceed to make an estimation of the inputted real hand pose [5].

Throughout the years, different approaches to the problem have surfaced and can be divided into three classes: discriminative, generative, and hybrid. Discriminative approaches are based on data-driven machine learning techniques. Recently, Convolutional Neural Networks (CNNs) have been successfully used for hand pose estimation [18][19]. Generative approaches use a generative hand model for comparing the current pose estimate and the observation [20]. Hybrid methods, on the other hand, combine discriminative and generative approaches to achieve both robust and accurate hand pose estimation and hand tracking [21][5].

In more recent years, following great strides in the field of deep learning in computer vision, there have been several interesting advancements in discriminative and hybrid approaches to hand pose estimation. The different approaches for estimating hand poses using deep learning can be categorised as either image-based or depth-based [5]. In brief, image-based methods use a single RGB image as input to the CNNs to find and locate the hand joints. Generally, these methods yield good generalization power, i.e. they can be used in many situations/setups and environments. The drawback is that it is deemed to be a harder task in comparison and needs a considerably larger amount of data to train [5]. For the depth-based methods, the idea is to find hand joints using depth maps as input to the CNNs.

Additionally, network architectures and hand pose algorithms themselves can be divided into two types: detection-based and regression-based [5]. Detection-based algorithms detect each individual joint separately. The network produces a probability density map (also called a confidence map or belief map) as a heat map for each joint, where the precise joint location (pixel coordinates in the image) can be found by extracting the largest value in the respective heat map. Once all joints are found, a reconstruction of the hand pose can be made. Comparatively, regression-based algorithms attempt to find the position of each joint directly, meaning that the networks try to predict the (x, y, z) coordinates of every joint at once [22][23][24].

Depth-based methods, regardless of network type, are well suited for outputting a 3D hand pose estimation. Moon et al. [25] developed V2V-PoseNet, a depth- and detection-based method, which used a voxel-to-voxel network (V2V) to directly estimate the position of each hand joint based on the estimated 3D hand shape.

In short, they produce a 3D heat map of an inputted depth map by first voxelizing it. They believed that great optimizations could be made to the training process if they could train the model to produce the same pose for different depth map inputs. The idea was that depth maps of the same hand but from different angles should have a common 3D hand pose. So instead of having a huge dataset to cover all the shapes of a hand, they would train the model on the 3D point cloud of that hand and directly generate 3D poses via 3D encoders and decoders.

Chen et al. [24] developed an approach named Pose-REN, a depth- and regression-based method, which relies on region grouping.

It starts by taking a previously estimated hand pose as input and puts it through an iterative refinement process to gain a more accurate estimate of the hand pose. More specifically, each iteration crops spatial regions around each joint of the previously predicted hand pose from the feature maps. The cropped feature regions are then hierarchically distributed to the network following the topology of hand joints, producing a refined hand pose, which in turn is used as a feature region guide for cropping in the next iteration.

It should be noted that, according to the findings of Yuan et al. [5], who conducted an investigation of different pose estimation methods, detection-based methods tend to outperform regression-based methods in terms of average error rate. A difference between depth-based methods and image-based methods is that it is more common for depth-based methods to estimate the hand pose in 3D, which can lead to superior HCI in terms of digital interaction capabilities. Although image-based methods can also achieve 3D estimations [26][27], the drawback (as mentioned earlier) is that they require much larger datasets for training, datasets which the big and influential studies had to create on their own.

Apart from the promising results of many depth-based methods, one must bear in mind that many solutions were created and optimized in controlled environments and have poorer performance in the wild. Furthermore, the depth data must be as clean and high-resolution as possible in order to get accurate results, which requires the hands to be as close to the sensors as possible; this may make depth-based methods less suitable for a mid-air interaction station in an exhibition environment. For this reason, we were more inclined to apply an image-based method for hand pose estimation, as such methods tend to allow the subject hand to be further away from the sensors in comparison.

2.2.2 OpenPose

The API known as OpenPose, developed by researchers Cao et al. [28] at Carnegie Mellon University, is a real-time multi-person joint detection library for body, face, hand, and foot estimation. OpenPose is originally written in C++ and uses deep learning with image-based methods and detection-based algorithms for 2D joint detection on the aforementioned body parts. The unique part of their work is the vast number of body joints (135), or keypoints as they call them, that they can detect on multiple people in real-time, through a method which predicts a set of 2D heat maps of body part locations and a set of 2D vector fields called Part Affinity Fields (PAFs), which indicate the degree of association between parts. In other words, PAFs help associate keypoints with the correct body part/limb of the detected person/people [28].

However, OpenPose does not achieve hand pose estimation this way. The library solves the hand pose problem by incorporating a procedure called multi-view bootstrapping, developed by Simon et al. [29]. They used a multi-camera approach to estimate hand poses using Carnegie Mellon University's Panoptic Studio [30], which is a semi-sphere space that contains more than 500 cameras (480 VGA and 30+ HD cameras). They first trained a weak hand pose estimator (based on the architecture of the detection-based pose estimator CPM [31], but with some modifications), using a dataset of synthetic hands.

They then recorded video from 31 HD cameras of a person standing in the center of the panoptic space showing various hand motions and applied the hand pose estimator to the footage. For all of these images/views the algorithm produced a hand pose estimate. It worked in most of the views, but it did not do so well in the views where the hand was self-occluding or occluded by the body. For this reason, they followed this part up with a triangulation step. Given that they had information about all the cameras' intrinsic parameters and physical positions relative to one another, they could convert their 2D estimations into 3D estimations to evaluate their results. In more detail, they used the RANSAC algorithm [32] to select 2D views at random and have them converted to 3D views. These 3D views can be described as 3D models, and the model that matched and aligned best with the most 2D views was kept for the next step.

Following this, they re-projected the 3D view onto the 2D views with bad pose estimates and added corrected annotations for those images. Notably, this was how they annotated all the data in the dataset (excluding the initial synthetic data); no manually inputted annotations were needed. Furthermore, with an expanded dataset containing the annotated images of hands from multiple views, they trained a new hand pose estimator which yielded more accurate outputs. They repeated this process of re-projecting, annotating, and re-training the model for three iterations and thereby ended up with an accurate model and a large dataset of annotated hand poses.

The OpenPose library itself is well managed, open-source, supports Windows [28], and can be implemented in C++ native plugins for Unity. For this reason, we believed that OpenPose was ideal for the development of this thesis work. We also wanted the benefits of an image- and detection-based hand pose estimator, as we planned to design a mid-air interaction setup that places the user roughly 1.5+ meters away from the camera. Although there exist faster hand pose estimators than OpenPose's real-time hand detector, which runs at a maximum of 8 Frames-per-Second (FPS) on our setup, we found that its accuracy and robustness outperformed the other, faster estimators that were available.

Finally, OpenPose also includes 3D keypoint pose detection, by using the results of multiple synchronised camera views and performing 3D triangulation with non-linear Levenberg-Marquardt refinement [33]. However, in our work, since we will only be using one consumer RGB-D camera, we plan to achieve 3D-point tracking by other means, namely by combining the information from the 2D hand joint detector of OpenPose with the camera's depth map.

2.3 LiU's NanoSim

The previous mid-air interaction station at Linköping University, named NanoSim, was made up of a stereoscopic television and a first-generation Microsoft Kinect depth camera. The TV was mounted on a rig that leaned backward at an angle of 25 degrees, and the Kinect was mounted on the same rig above the TV, pointed parallel to the TV's screen plane. The depth image from the Kinect was used to track the position of the user's head to update the stereoscopic image accordingly. The depth image was also used to track the position of the user's finger (defined as the blob closest to the screen) and the user's hand when gesturing a circle formed with the fingers, similar to the 'okay' sign. The latter gesture was used to move and manipulate 3D objects in the scene with 3-DoF [1]. In our work, the design of the setup was based on this station.

3 Implementation of a Mid-Air Interaction System

The following chapter starts with details about the implementation of the mid-air interaction system using relevant computer graphics theory, followed by details of issues with the system and how they were solved.

3.1 Platform and Framework

The implementation was done on a PC running Windows 10 and is therefore tailored to Windows desktops. The game engine Unity was chosen to render all the necessary 3D graphics because of the low workload to create stereoscopic scenes, create and modify objects, and because of its native plugin functionality. This made Unity the cornerstone and framework for this implementation. Unreal Engine was considered a valid option, but Unity was chosen because of familiarity. Unity uses C# scripts as its native means of implementation, but pre-compiled C++ code can also be used thanks to the native plugin functionality. The flexibility to use pre-compiled C++ code opened up the possibility of implementations outside the Unity framework whilst still using Unity functionalities.

3.2 RGB-D Camera

The camera plays an essential role in the system as it enables tracking in three dimensions. RGB-D cameras are cameras that also measure the depth of each pixel in an RGB image. The RGB-D camera used for the implementation was the Intel RealSense D435i. The camera was modern and came with its own software development kit (SDK). The implementation could be done with any modern RGB-D camera, given that it has an API that allows for more advanced camera controls. However, the main motivation for choosing the Intel RealSense D435i was availability.

After the RealSense camera was acquired, different settings were set through the SDK in order to get optimal performance for hand tracking. The horizontal field of view was set to 90 degrees, and both the color and depth images were set to stream at 30 FPS with a resolution of 848x480, while the depth data filtering was set to the High Density preset.
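As an illustration of the stream settings described above, the following is a minimal sketch using the Intel RealSense C# wrapper (the Intel.RealSense package). It is not the thesis code, and the exact API surface may differ between SDK versions.

```csharp
using Intel.RealSense;

public static class CameraSetup
{
    // Starts a RealSense camera streaming color and depth at 848x480, 30 FPS,
    // roughly matching the settings described in section 3.2.
    public static Pipeline StartCamera()
    {
        var cfg = new Config();
        cfg.EnableStream(Stream.Color, 848, 480, Format.Bgr8, 30);
        cfg.EnableStream(Stream.Depth, 848, 480, Format.Z16, 30);

        var pipeline = new Pipeline();
        pipeline.Start(cfg);

        // The "High Density" depth preset would additionally be applied through
        // the depth sensor's options (Option.VisualPreset); the exact enum values
        // vary between SDK versions, so that step is omitted here.
        return pipeline;
    }
}
```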

3.3 Stereoscopic Rendering of Scene in Unity

The stereoscopic rendering in Unity was achieved with two virtual camera objects, perspective projection with skewed view frustums, and a shader. The two camera objects were placed at a distance from each other matching the Interpupillary Distance (IPD) of an average adult (64 millimeters). Each camera corresponded to the view and perspective of one eye. To form the proper perspective of each eye and have it match the user's point of view, each camera's frustum needed to be skewed (asymmetrical). The skewed frustums would then together result in correct off-center stereoscopy, see Figure 3.1.

In computer graphics, a skewed frustum is most commonly defined by the standard perspective projection matrix¹, see equation 3.6, where the decisive terms are l, r, t, and b. These terms correspond to the positions of the left, right, top, and bottom planes on the near plane, respectively, see Figure 3.2. To estimate the proper l, r, t, and b values for each eye's skewed frustum, the following geometry-based method was used. First, measurements were taken of the screen's width and height, the perpendicular distance between the viewer's face and the screen, and the IPD. Then, using the equations below, it was possible to estimate the desired terms in the projection matrix.

In equations 3.1 to 3.5, x̂', ŷ', and ẑ' are the right, up, and back vectors in the screen's frame of reference. Equation 3.1 shows how o_s is defined, which is the projection of the eye position, o, onto the screen. Equations 3.2 through 3.5 utilise triangle similarity, see Figure 3.2, where n is the distance from o to the near plane, a value that can be set arbitrarily so long as it is not greater than |o_s - o|, the distance from face to screen. The term c is the center position of the screen, and |c - o_s| is equal to IPD/2, under the condition that the shift off-center only takes place along the x-axis. Finally, w and h are the width and height of the screen.

$$o_s = o + \hat{z}'\left(\hat{z}' \cdot (c - o)\right) \tag{3.1}$$

$$l = \frac{n}{|o_s - o|}\left(-\tfrac{1}{2}w - \hat{x}' \cdot (c - o_s)\right) \tag{3.2}$$

$$r = \frac{n}{|o_s - o|}\left(\tfrac{1}{2}w - \hat{x}' \cdot (c - o_s)\right) \tag{3.3}$$

$$t = \frac{n}{|o_s - o|}\left(\tfrac{1}{2}h - \hat{y}' \cdot (c - o_s)\right) \tag{3.4}$$

$$b = \frac{n}{|o_s - o|}\left(-\tfrac{1}{2}h - \hat{y}' \cdot (c - o_s)\right) \tag{3.5}$$

Figure 3.2 shows the geometry of the off-center view frustum for the left eye. The figure shows how to estimate, through triangle similarity, where the left plane lies on the screen plane and on the near plane.

Lastly, to show the two images from the camera objects at once, a shader was made which achieves a 3D interlace effect. This type of shader was necessary since the 3D display that the TV used was designed for interlaced 3D. The shader constructs a third image whose pixel rows alternate between the corresponding rows of the two images, see Figure 3.3.

$$M_{proj} = \begin{pmatrix} \frac{2n}{r-l} & 0 & \frac{r+l}{r-l} & 0 \\ 0 & \frac{2n}{t-b} & \frac{t+b}{t-b} & 0 \\ 0 & 0 & -\frac{f+n}{f-n} & -\frac{2nf}{f-n} \\ 0 & 0 & -1 & 0 \end{pmatrix} \tag{3.6}$$

¹ http://www.songho.ca/opengl/gl_projectionmatrix.html#perspective
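To make the geometry above concrete, the following Unity sketch (a reconstruction, not the authors' implementation) computes the skewed frustum of one eye camera from the quantities in equations 3.1-3.5 and assigns the resulting projection matrix. The field names (eyeCamera, screenCenter, screenRight, screenUp, screenBack, screenWidth, screenHeight) are assumptions; in practice these would come from the calibration described later.

```csharp
using UnityEngine;

public class OffAxisEye : MonoBehaviour
{
    public Camera eyeCamera;        // one of the two per-eye cameras
    public Vector3 screenCenter;    // c
    public Vector3 screenRight;     // x', unit right vector of the screen
    public Vector3 screenUp;        // y', unit up vector of the screen
    public Vector3 screenBack;      // z', unit back (out of screen) vector
    public float screenWidth = 1.2f;   // w in metres (assumed value)
    public float screenHeight = 0.7f;  // h in metres (assumed value)

    void LateUpdate()
    {
        // The camera is assumed to be positioned at the eye and oriented
        // parallel to the screen (no toe-in).
        Vector3 o = eyeCamera.transform.position;

        // o_s: projection of the eye onto the screen plane (eq. 3.1)
        Vector3 os = o + screenBack * Vector3.Dot(screenBack, screenCenter - o);

        float d = (os - o).magnitude;        // perpendicular eye-to-screen distance
        float n = eyeCamera.nearClipPlane;   // near-plane distance
        float f = eyeCamera.farClipPlane;

        // Signed offsets of the eye axis from the screen centre (eqs. 3.2-3.5)
        float offX = Vector3.Dot(screenRight, screenCenter - os);
        float offY = Vector3.Dot(screenUp, screenCenter - os);

        float l = n / d * (-0.5f * screenWidth  - offX);
        float r = n / d * ( 0.5f * screenWidth  - offX);
        float b = n / d * (-0.5f * screenHeight - offY);
        float t = n / d * ( 0.5f * screenHeight - offY);

        // Equivalent to the matrix in equation 3.6
        eyeCamera.projectionMatrix = Matrix4x4.Frustum(l, r, b, t, n, f);
    }
}
```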

Figure 3.1: Diagram depicting a green cylinder in front of a screen whose center is shown by the pink triangle. a) is a side view of the viewer's frustum, b) is a 3D view, c) is a top view, and d) shows the two resulting renderings of the cylinder on the left and right eyes' near planes.

Figure 3.2: Perspective view frustum for off-center projection. l is the position of the left frustum plane on the near plane, and o_n is the projection of the eye position, o, onto the near plane, whilst o_s is the projection of o onto the screen plane. c is the center of the screen, and l_s, r_s, t_s, and b_s are the distances from o_s to the positions of the left-, right-, top- and bottom frustum planes on the screen plane. x, y, and z are the right, up, and back vectors of the eye's coordinate system, while x', y', and z' are the same vectors in the screen's frame of reference.
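The interlacing described in section 3.3 and in Figure 3.3 can be illustrated on the CPU side as follows. This is only a sketch of what the actual shader computes per pixel; Combine is a hypothetical helper, and Texture2D pixel operations like this would be far too slow for real-time use.

```csharp
using UnityEngine;

public static class Interlacer
{
    // Builds an interlaced image: odd pixel rows from the left view,
    // even pixel rows from the right view (cf. Figure 3.3).
    public static Texture2D Combine(Texture2D left, Texture2D right)
    {
        var result = new Texture2D(left.width, left.height);
        for (int y = 0; y < left.height; y++)
        {
            Texture2D source = (y % 2 == 1) ? left : right;
            for (int x = 0; x < left.width; x++)
                result.SetPixel(x, y, source.GetPixel(x, y));
        }
        result.Apply();
        return result;
    }
}
```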

Figure 3.3: The basics of creating interlaced stereo images. The final image contains the odd pixel rows of the left view and the even pixel rows of the right view. By wearing polarized glasses the viewer receives the corresponding view for each eye.

Figure 3.4: The skeletal model that OpenPose bases its keypoints on. We currently only make use of 3 of the 21 keypoints, namely points 2, 3 and 7, with the potential to use more in the future. Image credited to Cao et al. [28] and Simon et al. [29].

3.4 3D-Tracking of the Hand

For tracking a user's hand in 3D space, a frame-by-frame 2D hand joint detection solution was used together with RGB-D imagery. The hand joint detection was achieved through the open-source API OpenPose. OpenPose was integrated into the system in order to retrieve pixel positions of specific hand joints on the user, tracking either the right hand, the left hand, or both, the latter at a significant performance drain and with limited positioning of the hands. The right hand was chosen to be tracked because of performance, simplicity, and because it is often the dominant hand. OpenPose uses CNNs to recreate a simplified skeleton of the hand, from which the positions (pixel coordinates) of specific joints were retrieved. For this implementation, the specified joints were the second joint from the tip of the index finger and two joints in the thumb, see Figure 3.4.

3.4.1 OpenPose with a Dynamic Bounding Box

The RGB video feed from the camera could not be sent to OpenPose directly, since the CNN required square images as input. Instead, a bounding box is used that marks the part of the image that will be sent to the CNN for hand pose estimation. The OpenPose hand estimator was trained on images that contained a hand in the middle with a margin of 20% to the image borders. Hence, the optimal input for OpenPose would be an image of a hand with similar proportions. The default bounding box was static and covered a large area of pixels in the middle of the frame. This default bounding box was used for initialization and re-initialization of the hand pose estimator. In order to get consistent hand pose estimations, a dynamic bounding box was created that could follow the hand and optimally crop the current frame.

Figure 3.5: Screenshots of the OpenPose hand skeleton and the green dynamic bounding box, which adapts to the size of the hand. A closer/bigger hand in A produces a bigger bounding box compared to the farther/smaller hand in B.

The dynamic bounding box updates each frame with the information from the previous frame. The new size and position of the bounding box are determined by the predicted image coordinate (u, v) of each joint, with the origin at the top left of the image, as well as their corresponding confidence scores. The confidence score is a percentage given by the CNN of how sure OpenPose is of its detection. The mean score over all joints was calculated, and if it fell under a threshold of 0.08, the new box would instead be the default bounding box. In practice, this means that when the hand generates a poor estimation or is completely/partially outside of the frame, the system forces the user to re-initiate the hand pose estimation from a more robust location. On the contrary, if the score was higher, the dynamic bounding box would proceed to update with a new size and position for the next frame.

The size was determined by the distance between the hand joints in pixel length. The distance between the smallest and largest u value (amongst the detected joints) was calculated, and the same was done for the v values. Whichever distance was the largest would then be multiplied by two and become the side length of the dynamic bounding box. Additionally, if the largest pixel distance between joints was smaller than a threshold value of 76, the box length would instead be equal to the threshold multiplied by two. This was done so that the bounding box would have a limited minimum size, which was discovered to be necessary since, with too small a box, it was common for the hand joints to end up outside of the bounding box in the next frame, triggering a re-initialization.

The new (u, v) pixel position of the center of the dynamic bounding box was set as the mean of all the joints' u positions and v positions respectively. If the new center position was located such that the bounding box would have areas poking outside of the frame, the center position would be shifted so that the bounding box stays inside the frame. This was done so that only real pixel values were sent to the CNN. Screenshots of the dynamic bounding box in use can be seen in Figure 3.5.
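The bounding-box update described above can be summarised in a short sketch (a reconstruction under the thresholds quoted in the text, 0.08 and 76 pixels; the joint coordinates, confidence scores, and the default box are assumed to come from the OpenPose plugin and the chosen stream resolution).

```csharp
using UnityEngine;

public struct BoundingBox { public float u, v, size; }  // centre (u, v) and side length, in pixels

public static class DynamicBoundingBox
{
    const float ConfidenceThreshold = 0.08f; // mean confidence below this -> re-initialize
    const float MinSpread = 76f;             // minimum joint spread, i.e. half the minimum box size
    // Hypothetical default box covering the middle of an 848x480 frame
    static readonly BoundingBox Default = new BoundingBox { u = 424f, v = 240f, size = 480f };

    // joints: detected (u, v) pixel coordinates; confidences: per-joint scores from the CNN
    public static BoundingBox Update(Vector2[] joints, float[] confidences, int imageWidth, int imageHeight)
    {
        // Fall back to the default box when the mean confidence is too low.
        float mean = 0f;
        foreach (float c in confidences) mean += c;
        if (mean / confidences.Length < ConfidenceThreshold) return Default;

        // Box side = 2x the largest joint spread in u or v, with a minimum size.
        float minU = float.MaxValue, maxU = float.MinValue;
        float minV = float.MaxValue, maxV = float.MinValue;
        float sumU = 0f, sumV = 0f;
        foreach (Vector2 j in joints)
        {
            minU = Mathf.Min(minU, j.x); maxU = Mathf.Max(maxU, j.x);
            minV = Mathf.Min(minV, j.y); maxV = Mathf.Max(maxV, j.y);
            sumU += j.x; sumV += j.y;
        }
        float spread = Mathf.Max(maxU - minU, maxV - minV, MinSpread);
        float size = 2f * spread;

        // Centre = mean joint position, shifted so the square box stays inside the
        // frame (assumes the box is smaller than the frame).
        float half = size / 2f;
        float u = Mathf.Clamp(sumU / joints.Length, half, imageWidth - half);
        float v = Mathf.Clamp(sumV / joints.Length, half, imageHeight - half);
        return new BoundingBox { u = u, v = v, size = size };
    }
}
```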

3.4.2 Converting Pixel- to Metric Coordinates: Going From 2D to 3D

With a dynamic bounding box, OpenPose can function as a tracker in pixel space, outputting 2D pixel coordinates of the specified hand joints. The next step was to convert the 2D pixel coordinates to metric coordinates in 3D. The conversion was done by first aligning the RGB image with the depth image, which could be done through the RealSense SDK, followed by perspective projection using homogeneous coordinates. The projection converts a pixel position with a corresponding depth value (e.g. [100, 150, 0.69]) to metric (x, y, z)-coordinates in the camera coordinate system, using the camera's intrinsic parameters (focal length and principal point). Equation 3.7 shows the formula used for perspective projection, where X, Y and Z denote the metric coordinates in camera space. Initially, only Z is known, through the depth image. x and y are the coordinates of the point of interest defined in image coordinates (x, y) instead of pixel coordinates (u, v). Going from pixel to image coordinates is done through x = u − pp_x and y = pp_y − v, where pp is the principal point. The terms f_x and f_y are the focal lengths of the image plane in x and y. Once a point in the image plane could be defined in metric coordinates, successful 3D tracking was achieved.

\[ x = f_x \frac{X}{Z} \;\Longleftrightarrow\; X = \frac{x}{f_x} Z, \qquad y = f_y \frac{Y}{Z} \;\Longleftrightarrow\; Y = \frac{y}{f_y} Z \tag{3.7} \]
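A minimal sketch of this deprojection step, assuming the RGB and depth images are already aligned and the intrinsics (focal lengths and principal point) have been read from the camera; all names are illustrative:

```csharp
using UnityEngine;

public struct Intrinsics { public float fx, fy, ppx, ppy; }

public static class Deprojection
{
    // Converts a pixel (u, v) with a metric depth value Z to camera-space coordinates,
    // following equation 3.7: X = (x / fx) * Z and Y = (y / fy) * Z.
    public static Vector3 PixelToCamera(float u, float v, float depthZ, Intrinsics k)
    {
        float x = u - k.ppx;   // pixel -> image coordinates
        float y = k.ppy - v;   // v grows downwards, so the sign is flipped
        return new Vector3(x / k.fx * depthZ, y / k.fy * depthZ, depthZ);
    }
}
```

The RealSense SDK also provides its own deprojection helper (rs2_deproject_pixel_to_point), but the sketch above follows the sign convention of equation 3.7, where y points upwards.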
3.5 Calibrating for the Correct Coordinate System

Since the virtual objects should appear to follow the real-world movements of the user's hand in front of the 3D TV, it was necessary to define the tracked 3D coordinates in the screen's frame of reference. To do this, a corresponding affine transformation matrix needed to be calculated. The transformation matrix was found through a manual point-measurement-based method². To find the relation between the camera's coordinate system and the screen's coordinate system, four calibration points, forming three vectors that are each parallel to one of the screen space's three base vectors, need to be described in both the camera's coordinate system and the screen's coordinate system. To define the calibration points in the camera's coordinate system, they only need to be visible in the camera's view; the method from section 3.4.2 then outputs their coordinates in 3D. To define the same points in the screen's coordinate system, one would usually measure them in real life with a measuring tape, defining the screen center as the origin. Since the screen center is not in the view of the camera, see section 4.5, a different approach was required. By attaching a rectangular box to the bottom of the screen, with equally long sticks protruding outwards at the corners of the box and the stick tips visible in the camera's view, a shifted-screen coordinate system could be defined. The stick tips were chosen as three of the calibration points, since the fourth one could be calculated through vector geometry. To define the calibration points in the shifted-screen coordinate system, measurements were made between each stick tip and described in a coordinate system whose origin is the center of the plane spanned by the three stick tips, see Figure 3.6. As long as the four points form the three base vectors (two parallel to the screen and one perpendicular to the screen), the conversion still works. The only drawback is that the origin will not be located at the screen center but elsewhere, which requires offset procedures to align Unity's virtual scene origin with it.

Once the coordinates of the stick tips in both the shifted-screen coordinate system and the camera coordinate system were obtained, the following method, utilizing equations 3.8–3.10, was used to calculate the desired transformation matrix. Matrix C contains the camera-space coordinates of the four calibration points found through the stick tips. Matrix S contains the shifted-screen-space coordinates of the same four points. Matrix T is the arbitrary 4x4 transformation matrix, which can be solved for through equation 3.10, given that C⁻¹, the inverse matrix of C, is first calculated.

² Calibration video: https://www.youtube.com/watch?v=zx-jLitVgow
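A minimal sketch of this solve, using Unity's Matrix4x4 type and assuming the four measured point pairs are already available (all names are illustrative; the equations referred to are shown below):

```csharp
using UnityEngine;

public static class Calibration
{
    // Builds a 4x4 matrix whose columns are the homogeneous calibration points.
    static Matrix4x4 FromPoints(Vector3 p0, Vector3 p1, Vector3 p2, Vector3 p3)
    {
        var m = Matrix4x4.identity;
        m.SetColumn(0, new Vector4(p0.x, p0.y, p0.z, 1f));
        m.SetColumn(1, new Vector4(p1.x, p1.y, p1.z, 1f));
        m.SetColumn(2, new Vector4(p2.x, p2.y, p2.z, 1f));
        m.SetColumn(3, new Vector4(p3.x, p3.y, p3.z, 1f));
        return m;
    }

    // Solves S = C T  =>  T = C^-1 S, as in equation 3.10.
    public static Matrix4x4 Solve(Vector3[] cameraPts, Vector3[] screenPts)
    {
        Matrix4x4 C = FromPoints(cameraPts[0], cameraPts[1], cameraPts[2], cameraPts[3]);
        Matrix4x4 S = FromPoints(screenPts[0], screenPts[1], screenPts[2], screenPts[3]);
        return C.inverse * S;
    }
}
```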

Figure 3.6: 3D schematic of the setup for the manual point measurement process. The image includes representations of the TV screen, the box with three sticks, and the camera, as well as the coordinate systems of the camera and the screen, each located at the respective system's origin.

\[ C = \begin{bmatrix} c_{0x} & c_{1x} & c_{2x} & c_{3x} \\ c_{0y} & c_{1y} & c_{2y} & c_{3y} \\ c_{0z} & c_{1z} & c_{2z} & c_{3z} \\ 1 & 1 & 1 & 1 \end{bmatrix} \tag{3.8} \]

\[ S = \begin{bmatrix} s_{0x} & s_{1x} & s_{2x} & s_{3x} \\ s_{0y} & s_{1y} & s_{2y} & s_{3y} \\ s_{0z} & s_{1z} & s_{2z} & s_{3z} \\ 1 & 1 & 1 & 1 \end{bmatrix} \tag{3.9} \]

\[ S = CT \;\Longleftrightarrow\; C^{-1}S = C^{-1}CT \;\Longleftrightarrow\; C^{-1}S = T \tag{3.10} \]

With T solved for, it became possible to convert points in camera space to the shifted-screen space. These new coordinates could then be used in Unity and would allow for matching movement between the real world and the virtual world. However, instead of shifting the shifted-screen coordinate system back with another matrix, so that its origin aligns with the position of the real-world screen center, the contents of the scene in Unity were shifted instead.

3.6 Reducing the Effect of Input Noise

While developing the hand tracking, three sources of input noise were discovered that affected the user experience negatively. The input noise consists of undesired changes to the data used to define a point in 3D space: one source is related to the user, and the others are connected to the depth data and OpenPose. During the tracking of the hand, the user may want to keep their hand perfectly still, but it is practically impossible to do so. This results in correct but undesired changes in the position of the tracked 3D point. The second source is the depth data from the RealSense camera itself, which is noisy. Tracking points that move without the user's intent translate, depending on the interaction, into jittery movement of scene objects.

Figure 3.7: The weighted mean distribution, where only the most recent position and the oldest one receive new weights. Depending on the number of positions, weights would either be added to or removed from the function, while the ends always stay the same.

A weighted mean feature was implemented to counteract the effects of the input noise and, in turn, enhance the user experience of object interaction. The mean value was made up of eight positions (coordinates in 3D space), where one is the current position of the tracked point and the rest are the previous seven positions. Eight positions were enough to even out most of the jittery movement while still giving a relatively direct response during intended movement. The weight distribution is shown in Figure 3.7. This mean method was applied to all three tracked points, making their movements smoother.

The third source of noise occurs when the pixel positions of the tracked joints from OpenPose do not agree with the depth data. This was the case either when OpenPose made a poor but confident estimation or when the depth data gave a false representation of reality. The depth data was clamped within a range where, preferably, only the user's hands and arms were visible. Data points outside this range were set to zero and were thus easy to compare against. If a misalignment between OpenPose and the depth data occurred, the zero value would trigger the use of the previous 3D point instead of updating it.

3.7 Improving Performance Issues

At a point during development, it became clear that the performance of the system was too poor to work with. Interacting with a system running at approximately 8 FPS proved to be too distracting and severely affected the user experience. A multi-threading scheme was implemented where the function call to update the tracked 3D coordinates was made on a separate thread, making the hand tracking and the Unity graphics run independently of each other. The frame rate in Unity was manually set to 16 FPS to allow smoother motion and physics, while the hand tracker still ran at 8 FPS.
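A minimal sketch of such a scheme, with the tracker polled on a background thread and the latest result consumed in Unity's update loop; the HandTracker class and its GetPalmPoints method are illustrative placeholders for the OpenPose/depth pipeline, not the actual implementation:

```csharp
using System.Threading;
using UnityEngine;

public class HandTrackingRunner : MonoBehaviour
{
    Thread trackerThread;
    volatile bool running;
    readonly object resultLock = new object();
    Vector3[] latestPoints = new Vector3[3];   // index-finger joint + two thumb joints

    void Start()
    {
        QualitySettings.vSyncCount = 0;
        Application.targetFrameRate = 16;      // render/physics rate, as in section 3.7
        running = true;
        trackerThread = new Thread(TrackLoop) { IsBackground = true };
        trackerThread.Start();
    }

    void TrackLoop()
    {
        var tracker = new HandTracker();       // placeholder for the tracking pipeline
        while (running)
        {
            Vector3[] points = tracker.GetPalmPoints();   // blocking call, ~8 FPS
            lock (resultLock) { points.CopyTo(latestPoints, 0); }
        }
    }

    void Update()
    {
        Vector3[] points;
        lock (resultLock) { points = (Vector3[])latestPoints.Clone(); }
        // Feed the tracked points to the smoothing and interaction code here.
    }

    void OnDestroy()
    {
        running = false;
        trackerThread?.Join();
    }
}
```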

4 The Four Mid-Air Interaction Modes

As mentioned in Chapter 1, this work set out to find out whether the user experience of a mid-air interaction system can be made more intuitive and robust through intuitive and robust interaction techniques/modes. For this reason, several interaction modes had to be designed and implemented to test this idea. The following chapter presents our interaction modes. Four mid-air interaction modes were developed in total, using different metaphors, DoF, and interaction concepts. The four interactions were named Spring, Sticky, Plane, and Proximity Grab-and-Drag. The latter interaction is the only gesture-based one and was only used for comparing the performance of 6-DoF manipulation, see research question no. 3 in section 1.3.

4.1 Spring

The main characterization of this interaction mode is that it uses a rubber band, or spring, metaphor in a zero-gravity environment. The mode allows up to 5-DoF: three for translation and two for rotation (pitch and yaw). However, control of the manipulations differs in precision.

How to Select

As mentioned in section 3.4, the positions of three points on the hand are tracked. In this interaction, only two points are used: the top index finger point and the top thumb point. The positions of these points are fed into two sphere-objects. When the user's index finger comes in contact with an object, a spring is formed with its ends tied between the user's finger sphere-object and the contact point on the touched object. To convey that the object has been selected, a green line is rendered between the object and the finger.

How to Move

This interaction provides two means of moving an object. The aforementioned sphere-objects each have a collider, meaning that they can collide with rigid bodies. This allows the user to push/poke objects in the virtual scene with their index finger and/or thumb.
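A minimal sketch of how the spring connection described under How to Select could be created in Unity, assuming the finger sphere carries this script together with a kinematic Rigidbody and a trigger collider; the tuning values are illustrative:

```csharp
using UnityEngine;

public class SpringSelector : MonoBehaviour
{
    public float springForce = 50f;   // spring stiffness (illustrative value)
    public float breakForce  = 200f;  // force needed to break the connection
    SpringJoint joint;

    void OnTriggerEnter(Collider other)
    {
        if (joint != null || other.attachedRigidbody == null) return;

        // Tie one end of the spring to the finger sphere and the other end to the
        // contact point on the touched object, expressed in that object's local space.
        joint = gameObject.AddComponent<SpringJoint>();
        joint.connectedBody = other.attachedRigidbody;
        joint.autoConfigureConnectedAnchor = false;
        joint.connectedAnchor = other.transform.InverseTransformPoint(
            other.ClosestPoint(transform.position));
        joint.spring = springForce;
        joint.breakForce = breakForce;   // Unity destroys the joint when this is exceeded
    }

    void OnJointBreak(float force)
    {
        joint = null;   // the spring snapped; a new selection can be made
    }
}
```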

Figure 4.1: Illustration of a user manipulating a virtual cube with the Spring interaction, by first touching the surface of the cube, a), then dragging the hand to the left, creating tension in the spring and causing the object to move towards the hand, b), and finally coming to a rest state once close enough, c). Note that the orientation of the cube has changed after its journey. This is due to the effects of Unity's physics engine on a rigid body with no gravity.

Additionally, once the user has formed a spring joint connection, the user can pull the object around in a zero-gravity environment, using the game engine's rigid body dynamics. The game engine's spring joint dynamics also enable the simulation of a rubber band, meaning that the object travels towards the finger at a speed corresponding to the spring force, see Figure 4.1. The force applied to the spring is visualized through variance in the green line's thickness: the greater the spring's tension force, the thinner the line gets. The line thickness has a max and min clamp.

How to Release

Removing the spring joint connection is physics-based. The user needs to break the spring by applying a force upon it greater than the set break force threshold. Visit Unity's online manual¹ to learn more about Unity's SpringJoint class.

4.2 Sticky

The metaphor which most closely relates to this interaction is that of a sticky surface. The basic idea is that once the index finger comes in contact with a surface, that surface becomes stuck to the finger. However, this interaction only supports 3-DoF, meaning that the surface will always be facing the same direction.

How to Select

Sticky only uses data from one hand joint, and much like the previously mentioned interaction, Spring, this interaction defines an object as selected by simply touching it with the index finger sphere-object.

How to Move

Manipulating a selected object is done by moving the index finger. The object follows the position of the fingertip, see Figure 4.2, although the manipulation is limited to 3-DoF translation only. This is due to the limitations of only using tracking data from one point on the hand.

¹ https://docs.unity3d.com/Manual/class-SpringJoint.html
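A minimal sketch of the Sticky attach-and-follow behaviour, attached to the fingertip sphere; the offset handling is illustrative, and the release check described under How to Release is omitted here:

```csharp
using UnityEngine;

public class StickySelector : MonoBehaviour   // attached to the fingertip sphere
{
    Transform selected;
    Vector3 grabOffset;   // keeps the touched point, rather than the object's centre, on the finger

    void OnTriggerEnter(Collider other)
    {
        if (selected != null) return;
        selected   = other.transform;
        grabOffset = selected.position - transform.position;
    }

    void Update()
    {
        if (selected != null)
            selected.position = transform.position + grabOffset;   // 3-DoF translation only
    }
}
```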

Figure 4.2: Illustration of a user manipulating a virtual cube with the Sticky interaction, by first touching the surface of the cube, a), then dragging the hand to the left, essentially bringing the cube along, b), and finally, once the hand stops, so does the cube, c).

Figure 4.3: Illustration of a user manipulating a virtual cube with the Plane interaction, by first touching the surface of the cube, a), resulting in the cube leaping towards the palm, b), followed by a rotation and translation motion of the palm, with the cube following and maintaining a matching orientation, c).

How to Release

To be consistent with the sticky surface metaphor, release from an object is done by rapid movement; more specifically, by pulling away from the object at a velocity higher than a set threshold. However, in this implementation, the distance between the finger's positions in two consecutive frames is measured rather than the velocity.

4.3 Plane

The main contribution of this paper is the following interaction mode. This interaction is named Plane, as it is based on the concept of matching the object's orientation with the orientation of a plane that approximates the right hand's palm. The closest metaphor that can be used to describe this interaction is that of having an object stick to the center of one's palm, following its orientation and position as one moves the hand around freely in 6-DoF, see Figure 4.3. The plane of the palm was estimated as the plane defined by the positions of the three tracked points on the hand, see section 3.4. Figure 4.4 shows the three points: one along the index finger, P0, and two along the thumb, P1 and P2.

Figure 4.4: Showing how the palm's plane is estimated. P0, P1, and P2 are the positions of the three detected joints, which together define a plane. n̂ is the normal of the resulting plane, and X, Y, and Z are the screen's coordinate system.

The triangle that the three points form, in the order P0 P2 P1, was used for two purposes: firstly, to estimate the center of a user's palm, and secondly, to define the roll rotation angle.

How to Select

As in Sticky, the selection of an object is done by touching it with the index finger sphere-object. Once selected, the object is moved to an offset position. The offset position was designed to lie roughly at the middle of the user's palm and was defined by equations 4.1 and 4.2, where P_offset is the final position of the selected object.

\[ \vec{i} = \hat{n} \times \overrightarrow{P_2 P_0} \tag{4.1} \]

\[ P_{\text{offset}} = P_c + \hat{i}\,\lvert\overrightarrow{P_2 P_0}\rvert\, a + \hat{n}\, b \tag{4.2} \]

As the equations above show, the offset position is found by starting at the triangle's centroid, P_c, then shifting by a multiple, a, of the length between P0 and P2 along the vector î, which is perpendicular to both the plane's normal n̂ and the P2P0 vector, and finally shifting along the normal vector by a multiple, b. See Figure 4.5 for clarity.

The idea of this offset was to add a level of immersion and intuitiveness to the interaction by having it behave as closely as possible to the metaphor it was based on, while also being offset consistently for all hand sizes by depending on the P2P0 vector.

How to Move

As mentioned earlier, this interaction has 6-DoF, where the 3-DoF for translation work identically to the translation in Sticky, and the remaining 3-DoF for rotation work in the following way.
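A minimal sketch of the palm-plane estimation and the offset position from equations 4.1 and 4.2; the multiples a and b are left as tunable parameters, since their actual values are not stated here:

```csharp
using UnityEngine;

public static class PalmPlane
{
    // p0: index-finger joint, p1/p2: thumb joints (see Figure 4.4).
    // Note that the sign of the normal depends on the triangle winding.
    public static Vector3 Normal(Vector3 p0, Vector3 p1, Vector3 p2)
    {
        return Vector3.Cross(p2 - p0, p1 - p0).normalized;
    }

    // Equations 4.1 and 4.2: start at the triangle centroid, shift along
    // i = n x (P2 -> P0) by a * |P2P0|, then along the normal by b.
    public static Vector3 OffsetPosition(Vector3 p0, Vector3 p1, Vector3 p2,
                                         float a, float b)
    {
        Vector3 n        = Normal(p0, p1, p2);
        Vector3 p2p0     = p0 - p2;
        Vector3 centroid = (p0 + p1 + p2) / 3f;
        Vector3 iHat     = Vector3.Cross(n, p2p0).normalized;
        return centroid + iHat * p2p0.magnitude * a + n * b;
    }
}
```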

Figure 4.5: The yellow lines are the borders of the P0 P2 P1 triangle, with the point P_c as the triangle centroid. The blue line is the normal vector of the P0 P2 P1 triangle. The red line, in this image, is set to be half the length of P2P0 and is perpendicular to both n̂ and P2P0. The offset position, P_offset, can then be specified as any point along n̂, shifted to the end of the red line.

To transform the selected object so that it rotates by the correct angles for pitch, yaw, and roll, two methods were used. For finding the appropriate pitch and yaw rotation angles, the plane in the current frame was compared to the previous one; more specifically, the angle between the two planes' normals. In Unity, this is achieved through the FromToRotation² function of the Quaternion class. The function takes two vectors as input and finds the quaternion that is needed to align one vector with the other. The resulting quaternion is then used to transform the selected object so that its pitch and yaw orientation matches the current plane.

The method for finding the appropriate roll rotation required a few more steps. Similarly, it focuses on comparing the angle between two vectors, but this time they need to be compared on a common plane. The vectors used for the comparison were the respective planes' P_cP_0 vectors. However, as shown in equation 4.4, the angle of interest, θ, was between the previous P_cP_0 vector (P_cP_0 prev) and the current P_cP_0 vector, both lying on the current plane.

To find where P_cP_0 prev would lie on the current plane, equation 4.3 was used, which applies the same quaternion rotation used for the pitch and yaw rotation to the points P_0.prev and P_c.prev, forming P_cP_0'_prev. However, the angle alone was not enough, since a user's roll rotations can be either clockwise or anti-clockwise. To know if the angle was to be inverted, equation 4.5 was used, where s is either 1 or -1 depending on the sign of the value on the right-hand side. In Unity, the roll rotation was achieved using the Quaternion.AngleAxis³ function, by giving sθ and the plane normal n̂ as input arguments.

\[ \overrightarrow{P_c P_0}{}'_{\text{prev}} = q P_{0.\text{prev}} - q P_{c.\text{prev}}, \quad \text{where } q \text{ is the quaternion rotation between } \vec{n}_{\text{prev}} \text{ and } \vec{n} \tag{4.3} \]

² https://docs.unity3d.com/ScriptReference/Quaternion.FromToRotation.html
³ https://docs.unity3d.com/ScriptReference/Quaternion.AngleAxis.html
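A minimal sketch of the rotation update described above, combining the pitch/yaw quaternion with the signed roll angle from equations 4.3–4.5; the function and variable names are illustrative:

```csharp
using UnityEngine;

public static class PlaneRotation
{
    // Returns the rotation to apply to the selected object this frame, given the
    // palm-plane normals and the Pc->P0 vectors of the previous and current frame.
    public static Quaternion FrameRotation(Vector3 prevNormal, Vector3 currNormal,
                                           Vector3 prevPcP0,  Vector3 currPcP0)
    {
        // Pitch and yaw: align the previous plane normal with the current one.
        Quaternion q = Quaternion.FromToRotation(prevNormal, currNormal);

        // Roll: bring the previous Pc->P0 vector onto the current plane (eq. 4.3),
        // measure its angle to the current Pc->P0 vector (eq. 4.4) and the sign (eq. 4.5).
        Vector3 prevOnCurrentPlane = q * prevPcP0;
        float theta = Vector3.Angle(prevOnCurrentPlane, currPcP0);   // degrees
        float sign  = Mathf.Sign(Vector3.Dot(currNormal,
                        Vector3.Cross(prevOnCurrentPlane, currPcP0)));
        Quaternion roll = Quaternion.AngleAxis(sign * theta, currNormal);

        // Applied on top of the object's current orientation, e.g.
        // selected.rotation = roll * q * selected.rotation;
        return roll * q;
    }
}
```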

Figure 4.6: Illustration of how the roll rotation is found for a left-to-right swiping motion. The two grids represent the two planes from two consecutive frames. Inside each grid lies a blue triangle, representing the corresponding P0 P2 P1 triangle of each frame. Within each triangle lies the associated P_cP_0 vector, represented as a red arrow for the previous vector and a green arrow for the current one. The P_cP_0' vector is the red one rotated through a quaternion. Lastly, θ shows the desired roll angle. X, Y, and Z are the screen's coordinate system.

\[ \theta = \arccos \frac{u \cdot v}{\lvert u \rvert \lvert v \rvert}, \quad \text{where } u = \overrightarrow{P_c P_0}{}'_{\text{prev}}, \; v = \overrightarrow{P_c P_0} \tag{4.4} \]

\[ s = \operatorname{sign}\!\left(\vec{n} \cdot \left(\overrightarrow{P_c P_0}{}'_{\text{prev}} \times \overrightarrow{P_c P_0}\right)\right) \tag{4.5} \]

How to Release

To release a selected object, similar to Sticky, the user needs to rapidly pull away from the object at a velocity greater than a set threshold. The threshold can be tuned to allow for a more strict or relaxed release sensitivity.

4.4 Proximity Grab-and-Drag

This interaction uses two tracking points for its gesture-based controls and does not follow any particular metaphor. The control design most resembles that of slider controls and pinching gestures. The interaction mode allows for a total of 6-DoF. However, the DoF are separated, as the most a user can manipulate an object at a time is three (translating). When manipulating orientation, only 1-DoF can be used at a time: either pitch, yaw, or roll. The gesture-state of the hand depends on the distance between the two tracked points on the hand.
