
Degree project in Computer Science, Second cycle
Stockholm, Sweden 2013

Estimation of the 3D Pose of objects in a scene captured with Kinect camera using CAD models


VALENTINA ARMENISE

Master's Thesis at KTH
Supervisor: Renaud Detry


Abstract

This report presents a method for estimating the 3D pose of an object in a picture captured by recent structured-light sensors, such as the PrimeSense sensor commercialized under the name Kinect. In particular, the work focuses on estimating the pose of a target represented by a CAD model without first converting the model into a point cloud dataset, as required by the traditional approach. We study an iterative method which uses the geometrical entities a CAD model is composed of and which works on the point-face distance rather than on the point-point distance used by the traditional ICP. For this reason we refer to this method as "Iterative Closest Face" (ICF). The work demonstrates that the algorithm, in contrast with the traditional approach, is able to converge to the solution without any initial knowledge of the pose of the target, although its efficiency could be improved by the use of rough-alignment algorithms. In order to find the pose of an object in a scene composed of several objects, we adopt an algorithm proposed by the Point Cloud Library to extract the points belonging to different objects and associate them with different clusters. For each cluster, the ICF maximizes an objective function whose value is proportional to the quality of the match. The presented pose-estimation method is intended as an alternative to the ICP algorithm when the geometry of the CAD model representing the target is simple enough to make the ICF more efficient than the traditional approach.


I would like to express my sincere gratitude to Renaud Detry, who gave me the possibility of working on this project and who constantly supported me when I did not know how to go further. I would also like to thank Professor Andrea Bottino for his help and advice, and for the time and trust he gave my project.

A sincere and deep gratitude goes to Ana Cecilia Martins, much more than a friend, with whom I managed to turn the difficulties of this year into extraordinarily intense experiences and who gave me the strength to go on with this project since the very beginning.

I would also like to thank my friends, who give me more joy of life every day, and without whom I would never be able to fight for my dreams.

I really have no words to thank my parents and my sister for supporting my choices of life and for loving me the way I am.

Finally, I would like to thank Matteo, who decided to believe in me when not even I could, and who was by my side when I needed it the most.


Contents

1 Introduction
  1.1 Motivations behind the project
  1.2 Contributions
  1.3 Outline

2 Related Works
  2.1 Point Feature Histogram (PFH) Descriptor and Fast Point Feature Histogram (FPFH) Descriptor
  2.2 Sample Consensus Initial Alignment
  2.3 ICP Algorithm

3 Work Environment
  3.1 Depth Sensing Cameras
  3.2 PrimeSense sensor
  3.3 How to hack the PrimeSense sensor

4 Problem resolution
  4.1 Problem definition
  4.2 Theory behind the Main Algorithm
    4.2.1 The objective function
    4.2.2 Distance Point-Face
  4.3 Iterative Closest Face Algorithm (ICF)
    4.3.1 The Search Strategy
    4.3.2 Find the 6D pose of the object

5 The environment issue

6 Cluster Extraction and Pose Estimation Results
  6.1 Models used during the experiments
  6.2 Experiments

7 Statistical Analysis
  7.1 The computational cost of the algorithm

8 The visualization tool

9 Conclusion

Appendices

A Libraries
  A.1 VTK library
  A.2 PCL library
  A.3 Nuklei library

B Software and Platforms
  B.1 OpenNI
  B.2 MeshLab

C The Code
  C.1 Objects and Classes
    C.1.1 Ply class
    C.1.2 Class: Vertex, Point
    C.1.3 Class Plane
    C.1.4 Class Face
    C.1.5 Class Translation
    C.1.6 Class ClusterSet
    C.1.7 Class Debug
  C.2 Relation among classes

D File Data Formats
  D.1 ply data file
  D.2 pcd data file

E Mathematical details
  E.1 How to calculate the error in the orientation of the EisKaffe model


List of Figures

2.1 The influence region diagram for a Point Feature Histogram [15]
2.2 The influence region diagram for a Fast Point Feature Histogram [14]
3.1 The Kinect camera [37]
3.2 How structured scanning lasers work [38]
3.3 Scene captured using OpenNI
4.1 CAD model in Ply format
4.2 Point cloud target
4.3 Best alignment
4.4 Regions defined by a rectangular face
4.5 Vectors connecting the vertex with a point inside the triangle and a point outside
5.1 Objects on a table
5.2 Result of Cluster Extraction
5.3 Cluster Extraction and Pose Estimation
6.1 Ply models
6.2 Scenario 1: Box5, a cup and an Indian statue on the table
6.3 Scenario 2: Box5 and another rectangular shaped object on the table
6.4 Scenario 3: Box5 and YellowRoundSalt on the table
6.5 Scenario 4: Box5, YellowRoundSalt and a cup on the table
6.6 Scenario 5: EisKaffe and Amicelly on the table
6.7 Scenario 6: EisKaffe and Amicelly on the table
6.8 Scenario 7: EisKaffe and Amicelly on the table
6.9 Scenario 8: EisKaffe and Amicelly on the table
7.1 Scenario 1: Error in the Translation and Orientation
7.2 Scenario 2: Error in the Translation and Orientation
7.3 Scenario 3: Error in the Translation and Orientation
7.4 Scenario 5: Error in the Translation and Orientation
7.5 Mean Error
7.6 Standard Deviation of the Error
8.2 Mouse event handler
B.1 CAD models with MeshLab
B.2 Point Cloud with MeshLab
C.1 Relation among classes
D.1 PLY file format example [27]


List of Acronyms

CAD   Computer Aided Design
ICP   Iterative Closest Point
ICF   Iterative Closest Face
PFH   Point Feature Histogram
FPFH  Fast Point Feature Histogram
TOF   Time of Flight
VTK   Visualization ToolKit
PCL   Point Cloud Library
PCD   Point Cloud Data


Chapter 1

Introduction

This thesis studies the problem of object pose estimation in 3D point clouds. It specifically focuses on data captured with recent structured-light sensors, such as the PrimeSense sensor [1] recently commercialized by Microsoft under the name Kinect. The aim of this thesis is to evaluate a pose-estimation algorithm that directly uses the faces of an object's CAD model to match the object to a scene, instead of sampling points from the model and resorting to point-cloud methods such as the popular ICP [2].

Indeed, although several pose-estimation algorithms have been developed, the current way of estimating the pose of a 3D object in a scene captured by the Kinect relies on point-cloud-based approaches [3].

This project has been developed as a Master's thesis project in the Computer Vision & Active Perception laboratory, part of the School of Computer Science and Communication (CSC) at KTH, Stockholm, and is part of a larger project on robotic object grasping carried out in the same department.

1.1 Motivations behind the project

One of the biggest limitations of a conventional camera is the difficulty of extracting depth information from the recorded data. Several methods have been developed for computing the distance of objects from the camera, such as 3D reconstruction using a stereo camera. This method requires the accurate calibration of two cameras and textured objects, which makes it difficult to implement.

Only in 2009 did Microsoft develop a cheap and easy-to-use camera able to reconstruct a 3D space: the Kinect for Xbox 360. The Kinect features an RGB camera and a depth sensor which consists of an infrared light projector and an IR camera [4].

Nowadays, with the progress reached in the robotics field, knowing the exact position of objects in space is an important goal. Giving a robot a 3D representation of the environment makes it possible for the robot to move freely in a space, easily detecting and avoiding obstacles.


The robotics and computer vision department of Kungliga Tekniska Högskolan (KTH), the Swedish university, is investing many resources in the 3D reconstruction problem.

Computer vision (or machine vision) is the science which studies how a machine can understand a scene, solving tasks such as recognizing objects, finding the position of objects and avoiding obstacles, and pushing the frontiers ever further so that a machine can see what a human sees and understand a scene by looking at it. So far computer vision has obtained interesting results in the analysis of two-dimensional pictures and is quite successful in processing pixels and their RGB values. Of course, the information a 2D picture can give is limited because of the lack of depth information. After the arrival of technologies able to capture depth data, the possibility of modeling the human capacity to understand a scene became more concrete; however, these technologies were often expensive and quite complex to use. The PrimeSense sensor represents the first cheap and ready-to-use depth sensor and attracted the attention of many researchers and developers. In less than one year impressive results have been reached in detection, recognition, segmentation and filtering, not only of objects but also of human faces and gestures, thanks to the cooperation of researchers from all around the world, all attracted by the availability of such a cheap depth sensor.

The aim of the thesis is to find an object in a scene using the geometric properties of a CAD model. In the Point Cloud Library [5] the proposed approach to pose estimation of an object requires that both the scene and the model of the target are point cloud datasets. Following this approach, if we want to find an object in the scene given the CAD representation of the target, a transformation of the model from the CAD representation to a point cloud dataset is required. Here, the importance of maintaining the geometrical information included in a CAD model is defended: instead of transforming the CAD model into a cluster of points and matching two point clouds, we want to match the CAD model directly against a point cloud dataset. In fact, if the number of faces of the CAD model is low, pose estimation can be done more efficiently than if many points were sampled from each face. The cost of the algorithm is about $O(n_o \times n_i \times n_p \times n_f)$, where $n_p$ is the number of points of the range scene, $n_f$ the number of faces of the CAD model, $n_o$ the number of objects in the scene and $n_i$ the number of iterations.

A CAD model is a 3D model: the representation of a 3D object described by points in a 3D space connected by geometrical entities [6]. Today, 3D models are widely used in several industries. In the movie industry they are used as characters and objects for motion pictures. In architecture they are used to build models of buildings, houses and landscapes for architecture software. They are also used in the video-game industry and in the medical industry, where they represent complex organs of the human body [6]. It is easy to understand, and well known, how important and useful CAD models are.


1.2 Contributions

As said before, a wide range of algorithms have been developed to segment the scene, remove the background and the points which do not belong to the objects (e.g., the table the objects lie on), remove noise, filter points, and accomplish many other tasks. In particular, for object pose estimation, which is the aim of this project, quite reliable methods have been developed to segment the scene and extract the clusters which most likely represent the target. Several approaches have been considered, including feature extraction, keypoint identification and many others. The current way of applying these methods to the data captured by the PrimeSense sensor relies on feature descriptors [7]. Many computer vision algorithms use feature detectors in an initial step which consists in analyzing the whole image looking for the presence of one or more particular features. According to the result of this first step, the algorithm then examines the picture only in the regions containing the searched feature. Clouds of close points are referred to as clusters; each cluster represents an object in the scene and is further analyzed only if it contains the feature the algorithm is looking for. In this way, only a subset of all the clusters present in a 3D picture is extracted: these are the clusters which have a good probability of being the searched target. To compute the best matching and eventually find the right solution, an iterative algorithm known as "Iterative Closest Point" [2] is used. This algorithm is, from the logical point of view, similar to the algorithm developed in this project, but where the former works with points, the latter works with faces and planes. A more accurate description of the ICP algorithm is given in the chapter "Related Works", together with an explanation of the algorithms which use ICP to eventually find the exact pose of the object in the scene. Such a method solves the problem of aligning two point clouds and requires a transformation from CAD model to point cloud when the target is in CAD representation. This costs resources and time, and could easily be avoided when the target has a simple geometry. Moreover, a rough estimation of the pose is necessary before applying the ICP. Indeed, the extraction of features not only helps to select clusters which are similar to the target, but also gives a rough estimation of the pose. The ICP is applied to refine the solution, but alone, without an algorithm which finds a rough initial alignment, it is not able to find the correct pose of the target.

In this thesis an alternative to the ICP is developed. The iterative algorithm we propose, which I named "Iterative Closest Face" and which will be analyzed throughout the report, is based on the point-face distance rather than on the point-point distance and carries the advantages that no feature extraction, no rough estimation of the initial pose and no transformation of the CAD model are required. The algorithm proposed here could potentially work with an initial rough feature-based alignment, as the ICP does, but this project is focused on the algorithm itself and demonstrates that it can work without any initial alignment. With this method it is possible to look for an object represented by a CAD model directly in a range scene.


Finally, a visualization tool has been developed in order to display the results graphically: after the computation, a graphical interface displays the position of the model in the scene, at the coordinates and orientation resulting from the algorithm. This was really helpful for the improvement and debugging of the algorithm during development, because I was able to see the result directly in the scene and get a concrete idea of the distance of the model from the right cluster, considering both orientation and translation.

1.3 Outline

The report is structured as follows. First, an overview of the related works and existing algorithms is given in the chapter "Related Works", followed by a description of 3D sensors, and in particular of the PrimeSense sensor, in the chapter "Work Environment". The mathematical theory behind the algorithm is introduced and explained in detail in the chapter "Problem Resolution". In the last part of the report the experimental results and the statistical analysis are presented and commented on. The Appendix contains details regarding the structure of the code and the software and libraries which were useful for the development of the algorithm.


Chapter 2

Related Works

Research on the pose estimation problem involves a wide range of algorithms and approaches. They mainly vary according to the features used to describe the object, the mathematical approach used to solve the problem, and the format of the input data representing the model of the object to find [8].

The localization of an object in a 3D scene is mainly addressed by using feature descriptors which uniquely identify an object or a part of it. In particular, "features" can refer either to the result of a neighborhood operation applied to the image (this is the way PFH and FPFH work) or to specific structures in the image (edges, corners, blobs, ridges, shapes). Elizabeth Gonzalez et al. used Fourier descriptors of the silhouettes of the object from different views to solve the recognition/pose problem in their work "Active object recognition based on Fourier descriptors clustering" [9]. Marius Bulacu and Lambert Schomaker presented a method for estimating planar surfaces using the edge-direction distribution (EDD) [10].

Another typical approach uses keypoints in the image to find the pose of the object. The SIFT keypoint detector identifies points of interest by processing Difference-of-Gaussian (DoG) images [11] and uses these distinctive invariant features to perform reliable matching between different views of an object or scene [12].

The pose estimation methods also vary according to the dimensionality of the image. For example the approach must vary if the matching is 3D-3D or 3D-2D. In the latter case a projection model (either a perspective or orthographic projection) is required.

The POSIT algorithm is a model-based approach to estimating the 3D pose of an object, including both the rotation around the x, y, z axes and the translation along them. It requires as input the model and the image coordinates of at least four non-coplanar points on the surface of the object [13]. The method is based on an approximation of the perspective projection to find the rotation matrix and the translation, and iterates on them to refine the result.


Researchers have since applied these methods to the data captured by this new device in order to obtain the 3D pose of objects in a scene captured by the PrimeSense sensor. The approach applied so far to the point cloud data coming from the PrimeSense sensor has been based on feature descriptors, like the (Fast) Point Feature Histogram Descriptor [14] [15]. In the next section an overview of this descriptor is given, together with an explanation of the algorithm which uses it.

2.1 Point Feature Histogram (PFH) Descriptor and Fast Point Feature Histogram (FPFH) Descriptor

"The goal of the PFH formulation is to encode a point’s k-neighborhood geometri-cal properties by generalizing the mean curvature around the point using a multi-dimensional histogram of values" [15]. The idea is to provide a point with a signature independent from the 6D pose and robust to several levels of noise using the surface normal of the points within a given radius. For this reason the accuracy of this descriptor depends on the quality of the surface normal estimation at each point. Figure 2.1 shows the target point (the one considered for the estimation of the descriptor) with the neighbors within a given radius. The histogram is computed considering the interaction between all the pair of points. The distance between two points is described by the angular variation of their normals and the euclidean distance between the two points. In a more recent version of this descriptor the eu-clidean distance has been removed, not having a significant impact on the robustness of the method. The steps followed by the algorithm are [15]:

• for each point $p$, all of $p$'s neighbors enclosed in the sphere with a given radius $r$ are selected ($k$-neighborhood);
• for every pair of points $p_i$ and $p_j$ ($i \neq j$) in the $k$-neighborhood of $p$, with estimated normals $n_i$ and $n_j$ ($p_i$ being the point with the smaller angle between its normal and the line connecting the two points), define a Darboux $uvw$ frame: $u = n_i$, $v = (p_j - p_i) \times u$, $w = u \times v$;
• compute the angular variations of $n_i$ and $n_j$.

The FPFH [14] is a simplified version of the original PFH [15], which had the drawback of being computationally costly. Indeed, for a given point cloud P with n points the cost is $O(n \cdot k^2)$, where k is the number of neighbors of each point. The new version speeds up the computation, achieving a cost of only $O(n \cdot k)$. The steps followed by the algorithm are [14]:

• the SPFH (Simplified Point Feature Histogram) is computed: for each query point, only the relationships between the point itself and its neighbors are computed;
• for each point $p$, the $k$ neighbors are re-determined and the neighboring SPFH values are used to weight the final FPFH histogram.


Figure 2.1. The influence region diagram for a Point Feature Histogram[15].

The formula which expresses the computation of the final histogram is

$$\mathrm{FPFH}(p) = \mathrm{SPFH}(p) + \frac{1}{k}\sum_{i=1}^{k}\frac{1}{\omega_i}\,\mathrm{SPFH}(p_i)$$

where $\omega_i$ weights the contribution of the neighbor $p_i$ by its distance from the query point $p$ [14].

Figure 2.2 shows the influence region diagram for the FPFH.
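For reference, the following minimal sketch shows how FPFH signatures can be computed with the Point Cloud Library; it assumes a reasonably recent PCL version, and the search radii are illustrative values rather than the ones used in this work.

#include <pcl/point_types.h>
#include <pcl/features/normal_3d.h>
#include <pcl/features/fpfh.h>
#include <pcl/search/kdtree.h>

// Minimal sketch: compute FPFH signatures for a point cloud with PCL.
pcl::PointCloud<pcl::FPFHSignature33>::Ptr
computeFPFH(const pcl::PointCloud<pcl::PointXYZ>::Ptr& cloud)
{
  pcl::search::KdTree<pcl::PointXYZ>::Ptr tree(new pcl::search::KdTree<pcl::PointXYZ>);

  // Surface normals: the quality of the descriptor depends on this estimate.
  pcl::NormalEstimation<pcl::PointXYZ, pcl::Normal> ne;
  pcl::PointCloud<pcl::Normal>::Ptr normals(new pcl::PointCloud<pcl::Normal>);
  ne.setInputCloud(cloud);
  ne.setSearchMethod(tree);
  ne.setRadiusSearch(0.01);    // 1 cm neighborhood for the normals (illustrative)
  ne.compute(*normals);

  // FPFH: for each point, combine the SPFH values of its neighborhood.
  pcl::FPFHEstimation<pcl::PointXYZ, pcl::Normal, pcl::FPFHSignature33> fpfh;
  pcl::PointCloud<pcl::FPFHSignature33>::Ptr features(new pcl::PointCloud<pcl::FPFHSignature33>);
  fpfh.setInputCloud(cloud);
  fpfh.setInputNormals(normals);
  fpfh.setSearchMethod(tree);
  fpfh.setRadiusSearch(0.025); // must be larger than the normal-estimation radius
  fpfh.compute(*features);
  return features;
}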

2.2 Sample Consensus Initial Alignment

Shape descriptors can be used to provide pose samples in a RANSAC alignment of a model to a scene. The alignment of two point clouds P and Q developed by the same authors [16] [17] [18] of the FPFH and PFH employs the following scheme [15]:

• given a point cloud P, select s sample points which have a minimum user-defined pairwise distance;
• for each sample point extracted, find a corresponding point in Q whose histogram has a similar value;
• compute the rigid transformation defined by the sample points and their correspondences;
• define an error metric for the point cloud.

This scheme finds a good alignment by trying different sets of correspondences and repeating these steps several times, storing the transformation which gives the best error metric.
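As an illustration, the sketch below shows how this scheme maps onto PCL's SampleConsensusInitialAlignment class, assuming FPFH features have already been computed for both clouds (for instance with the sketch in the previous section); the parameter values are illustrative.

#include <pcl/point_types.h>
#include <pcl/registration/ia_ransac.h>

// Sketch of the Sample Consensus Initial Alignment step with PCL.
Eigen::Matrix4f roughAlign(
    const pcl::PointCloud<pcl::PointXYZ>::Ptr& source,
    const pcl::PointCloud<pcl::FPFHSignature33>::Ptr& sourceFeatures,
    const pcl::PointCloud<pcl::PointXYZ>::Ptr& target,
    const pcl::PointCloud<pcl::FPFHSignature33>::Ptr& targetFeatures)
{
  pcl::SampleConsensusInitialAlignment<pcl::PointXYZ, pcl::PointXYZ, pcl::FPFHSignature33> sac;
  sac.setInputSource(source);
  sac.setSourceFeatures(sourceFeatures);
  sac.setInputTarget(target);
  sac.setTargetFeatures(targetFeatures);
  sac.setMinSampleDistance(0.05f);         // minimum pairwise distance between sample points
  sac.setMaxCorrespondenceDistance(0.1f);  // threshold used by the error metric
  sac.setMaximumIterations(500);           // number of times the scheme is repeated

  pcl::PointCloud<pcl::PointXYZ> aligned;
  sac.align(aligned);                      // rough alignment, to be refined by ICP
  return sac.getFinalTransformation();
}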


Figure 2.2. The influence region diagram for a Fast Point Feature Histogram [14].

Once a first rough alignment between the two point clouds has been found, the ICP algorithm is used to refine it.

2.3 ICP Algorithm

The ICP, "Iterative Closest Point", algorithm is the most used algorithm to align two point clouds. The algorithm requires as input two point clouds and gives as output the rotation and translation required to overlap the two datasets in order to minimize the distance between them. Starting from rough estimation of the pose, it iterates following these major steps for each point [2]:

• find the closest point;
• find the best transform for this correspondence;
• transform the dataset.

Given two point clouds M and P, for each point of M the closest point in P is found according to the Euclidean distance. The error minimized by these steps can be expressed as

$$\frac{1}{|M|}\sum_{v \in M}\left\| v - \mathrm{match}_P(v) \right\|^2$$

Then the transformations R and t which minimize

$$\sum_{v}\left\| R\,m_v - t - p_v \right\|^2$$

are computed, where $R$ is the 3D rotation matrix, $t$ the 3D translation vector, $m_v \in M$ and $p_v \in P$. The ICP algorithm is guaranteed to converge to a local minimum; if the initial guess is accurate enough, it converges to the global minimum. The error decreases monotonically.

The main drawback of the ICP algorithm is its complexity ($O(N_p^2)$, where $N_p$ is the number of points).
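For comparison with the ICF developed in this thesis, the following sketch shows how the point-cloud ICP just described is typically invoked through PCL; the thresholds are illustrative, not the ones used in the experiments.

#include <pcl/point_types.h>
#include <pcl/registration/icp.h>

// Sketch: refine a rough alignment of two point clouds with point-to-point ICP.
void refineWithICP(const pcl::PointCloud<pcl::PointXYZ>::Ptr& source,
                   const pcl::PointCloud<pcl::PointXYZ>::Ptr& target)
{
  pcl::IterativeClosestPoint<pcl::PointXYZ, pcl::PointXYZ> icp;
  icp.setInputSource(source);
  icp.setInputTarget(target);
  icp.setMaximumIterations(50);            // stop after 50 iterations at most
  icp.setMaxCorrespondenceDistance(0.05);  // ignore pairs farther apart than 5 cm

  pcl::PointCloud<pcl::PointXYZ> aligned;
  icp.align(aligned);                      // iterate: match points, estimate R and t, transform

  if (icp.hasConverged())
  {
    Eigen::Matrix4f transform = icp.getFinalTransformation(); // homogeneous matrix with R and t
    (void)transform;                       // use the estimated pose here
  }
}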


Chapter 3

Work Environment

3.1 Depth Sensing Cameras

Before the launch of the PrimeSense sensor in 2009, several depth sensors were produced. They use different mechanisms to capture depth information: range-gated ToF, RF-modulated ToF, pulsed-light ToF, and projected-light stereo [20]. All of them provide RGB color information and depth information for each pixel.

Time-of-Flight cameras [21] are based on the time-of-flight principle, where the whole scene is captured by each pulse of light. This allows depth information to be captured in real time, since the whole scene is captured in one shot. Moreover, the extraction of the depth map is much easier than in stereo vision or triangulation systems, where complex algorithms need to be implemented. The drawback of a TOF camera is that, due to multiple reflections, the light may reach the object along several paths, altering the measured distance of an object. Time-of-flight sensors, however, can operate over very large distances (kilometers).

Triangulation mechanisms require finding corresponding points in two different pictures of the same scene, taken by two different cameras or by a moving camera, in order to simulate human binocular vision. This is the technique used by stereo cameras. The main difficulty of such a system is finding the corresponding points: for example, it can be very hard to find correspondences for points of the image which lie in regions of homogeneous intensity or color. Nevertheless, triangulation systems have the advantage of being quite accurate under any ambient-light condition.

While cameras capture the image in one shot, laser scanning systems [22] scan the scene point by point with a laser. If this, on one side, has the drawback of requiring more time, it also carries the advantage of being more accurate. Compared with a TOF camera, for example, it does not suffer from multiple reflections, because only one point is illuminated at a time. Like the cameras, laser scanners can be based on both triangulation and time-of-flight systems. The advantage of triangulation-based laser scanning systems, compared with the TOF ones, is that the result is more accurate: in time-of-flight systems the high speed of light makes it difficult to compute the round-trip time with sub-millimeter precision, whereas triangulation-based scanning systems have an accuracy of micrometers.

The PrimeSense sensor, the latest depth sensor produced by Microsoft in cooperation with PrimeSense, uses a stereo triangulation system to capture the depth information, using one laser-based IR projector and one IR camera [23]. The images captured by the laser projector and the camera are not the same but correspond to different camera positions. Thus it carries all the advantages of a triangulation sensor, further enhanced by its cheap price: in contrast with the majority of the other depth sensing cameras produced so far, which cost thousands of dollars, it costs only about $100.

3.2 PrimeSense sensor

The PrimeSense sensor, commercialized under the name Kinect for XBOX 360, or simply Kinect, is a "controller-free gaming and entertainment experience" [24] by Microsoft for the XBOX 360 video game platform. It enables users to interact with the video game platform without any controller, using only gestures and spoken commands. Kinect was launched in North America on November 4th, 2010, and arrived in Europe a few days later, on the 8th of November 2010. It became the fastest-selling electronic device, selling 8 million units in the first 60 days after its release.

The sensor features an RGB camera and a depth sensor which consists of an infrared light projector and an IR camera. Figure 3.1 shows what the sensor looks like.

Figure 3.1. The Kinect camera[37]

It works on the same principle as a 3D structured-light scanner [23], where a beam of light is projected onto a 3D surface and appears distorted from perspectives other than that of the projector. The IR camera looks at the deformation of the pattern and computes the depth information by means of triangulation. Unlike normal scanning lasers, which illuminate one point at a time, structured-light scanners illuminate multiple points or the entire scene at once. Figure 3.2 shows the depth computation with a structured-light scanning laser.

Figure 3.2. How structured scanning lasers work [38]

3.3 How to hack the PrimeSense sensor

To hack the Kinect camera, the only things required are a standalone sensor with its own power supply and the installation of three pieces of software: OpenNI [25], SensorKinect (a PC driver) and NITE [26]. To obtain the depth data from a picture captured with the PrimeSense sensor, I used the OpenNI open-source code sample "Sample-NiSimpleRead" and modified it to print the x, y, z coordinates to a file with the .asc extension. Since the PrimeSense sensor cannot acquire data closer than 50 mm, points closer than this value are reported with a depth equal to zero; these values were filtered out and eliminated from the dataset. Figure 3.3 provides an example of a 3D picture both in the normal RGB representation and in the depth-map representation, where the intensity of each pixel depends on the distance of the point from the sensor.
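A minimal sketch of this acquisition step is shown below. It follows the spirit of the NiSimpleRead sample and uses the OpenNI 1.x C++ wrapper; error handling is omitted and the exact initialization and cleanup calls may differ slightly between OpenNI versions.

#include <XnCppWrapper.h>
#include <cstdio>

// Sketch: grab one depth frame and dump it as x, y, z coordinates (in mm) to an .asc file.
// Zero-depth readings (e.g. points closer than the ~50 mm limit) are skipped.
int main()
{
  xn::Context context;
  context.Init();

  xn::DepthGenerator depth;
  depth.Create(context);
  context.StartGeneratingAll();
  context.WaitOneUpdateAll(depth);

  xn::DepthMetaData depthMD;
  depth.GetMetaData(depthMD);

  FILE* out = std::fopen("scene.asc", "w");
  for (XnUInt32 y = 0; y < depthMD.YRes(); ++y)
    for (XnUInt32 x = 0; x < depthMD.XRes(); ++x)
    {
      XnDepthPixel z = depthMD(x, y);
      if (z == 0) continue;                                  // no valid depth measured
      XnPoint3D proj = { (XnFloat)x, (XnFloat)y, (XnFloat)z };
      XnPoint3D real;
      depth.ConvertProjectiveToRealWorld(1, &proj, &real);   // pixel + depth -> metric x, y, z
      std::fprintf(out, "%f %f %f\n", real.X, real.Y, real.Z);
    }
  std::fclose(out);
  context.Release();
  return 0;
}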


Figure 3.3. Scene captured using OpenNI.


Chapter 4

Problem resolution

4.1 Problem definition

The goal to achieve in the present work is to align the surface of the model, which represents the target to find in the scene, to points sensed by the PrimeSense sensor. The approach used in this project is based on geometrical notions and uses simple geometrical elements such as planes, points, distances in a 3D space. The algorithm takes a CAD model and a range scene as input. We want to find the exact pose of an object (the target) described by a CAD model in a range scene captured by the PrimeSense sensor. The pose of an object in a 3D scene consists of both its orientation and translation.

4.2 Theory behind the Main Algorithm

The aim of the algorithm is to find the pose of an object in a scene using the geometrical features of the CAD model (faces and vertexes) which represents the target and the 3D euclidean distance between points and surfaces. The geometrical approach explored in this project is based on the maximization of an objective function which estimates the matching between the CAD model and a cloud of points belonging to a single object.

The matching function is evaluated for different poses (translations and orientations), and the pose corresponding to the best value of the objective function is the combination of translation and orientation that best matches the object among the poses explored.

4.2.1 The objective function

The value of the objective function represents the matching score between an object in the scene and the model in a specific pose; the matching value depends on the distance between the object and the model in 3D space. The issue to solve is how to compute the distance between solids given in two different representations: the object as a point cloud dataset and the target as a CAD model. The idea is to calculate, for each point of the point cloud dataset, its distance from the closest face of the model ($d_P(x)$ = distance between point $x$ and the closest face of the model aligned to pose $P$). The objective function $F$ is defined as follows:

$$F(P, S) = \sum_{x \in S} f(x, P)$$

where $f(x, P)$ is a function of the distance of the point $x$ from the model at pose $P$, defined as

$$f(x, P) = \begin{cases} 1 - \dfrac{g_P(x)}{m}, & \text{if } g_P(x) < m \\[4pt] 0, & \text{if } g_P(x) \geq m \end{cases}$$

and $g_P(x)$ is defined as

$$g_P(x) = \begin{cases} d_P(x), & \text{if } d_P(x) < 30 \\[4pt] 4 \cdot d_P(x), & \text{if } d_P(x) \geq 30 \end{cases}$$

The factor 4 is introduced in order to create a bigger numerical gap between the objective-function values of physically distant poses. The function $f(x, P)$ is defined so that, for each pose, only the points whose distance is smaller than $m$ millimeters influence the objective function.

Figure 4.1. CAD model in Ply format.


Figure 4.2. Point cloud target

The objective function expresses the degree of overlap which the model, in a specific pose, has with the considered object; its value is determined by the sum, over the points, of the contributions $f(x, P)$, which depend on the real distance each point has from the closest face.

Because of occlusion, the target in the range scene does not show all the faces the model has (see Fig. 4.1 and 4.2). The goal is to recognize the target among all the objects in the scene and align the model (Fig. 4.1) with it (Fig. 4.2), minimizing the distance between the two solids in their two different representations. Figure 4.3 shows the alignment which gives the best value of the objective function.
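To make the definitions above concrete, the sketch below evaluates $F(P, S)$ for one cluster. Point3D and the distance callable are hypothetical stand-ins (the callable is assumed to return $d_P(x)$, the distance in millimeters from a point to the closest face of the model at pose $P$); the constants follow the definitions of this section.

#include <functional>
#include <vector>

// Sketch of the objective function F(P, S) for a single cluster S and pose P.
struct Point3D { double x, y, z; };

double objectiveFunction(const std::vector<Point3D>& cluster,
                         const std::function<double(const Point3D&)>& distanceToClosestFace,
                         double m = 200.0)
{
  double F = 0.0;
  for (const Point3D& x : cluster)
  {
    double d = distanceToClosestFace(x);   // d_P(x)
    double g = (d < 30.0) ? d : 4.0 * d;   // g_P(x): amplify distances above 30 mm
    if (g < m)
      F += 1.0 - g / m;                    // f(x, P); points with g_P(x) >= m contribute 0
  }
  return F;                                // higher value = better overlap with the cluster
}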

4.2.2 Distance Point-Face

The distance (d) of a point from a face is the distance of the point from the plane which the face belongs to, if its projection falls inside the edges of the face; in the case the projection of the point falls outside the face, the method applied is based on simple geometrical considerations. Since the CAD model is defined by a Ply file [27], the method is slightly different according to the shape of the faces the model is described with: these can be either rectangular or triangular.


Figure 4.3. Best alignment

The case of the rectangular face

In the case of rectangular faces the idea is to consider the 9 regions in which a face divides a plane (see Fig. 4.4).

Consider a 3D point P and its projection P' on the plane of the face; P' can fall either inside or outside the face. When P' falls inside the face, the projection of P' on each of the edges of the face falls inside that edge. In this case the distance is simply the distance of the point from the plane,

$$d = d(P, P').$$

If P' falls outside the face, it can lie in any of the 8 regions outside the rectangle (Fig. 4.4). If it falls in region 2, 4, 6 or 8, then

$$d = \sqrt{d(P, P')^2 + d(P', \text{closest edge})^2}.$$

If P' falls in region 1, 3, 5 or 7,

$$d = \sqrt{d(P, P')^2 + d(P', \text{closest vertex})^2}.$$

The easiest way to determine in which region the projection P' falls is to consider the projection P'' of P' on each of the 4 edges and check whether it belongs to that edge or not.


Figure 4.4. Regions defined by a rectangular face

How to determine if the projection of a point belongs to a segment

The projection of a point P(x, y, z) onto the line through A and B belongs to the segment AB if the angles PAB and PBA are both acute or right, which means AP · AB ≥ 0 and BP · BA ≥ 0. In Cartesian coordinates this can be expressed as

$$(P_x - A_x)(B_x - A_x) + (P_y - A_y)(B_y - A_y) + (P_z - A_z)(B_z - A_z) \geq 0$$

and

$$(P_x - B_x)(A_x - B_x) + (P_y - B_y)(A_y - B_y) + (P_z - B_z)(A_z - B_z) \geq 0.$$
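A small self-contained sketch of this test is given below; the Point3D struct and function names are hypothetical, introduced only for illustration.

// Dot-product test: the projection of P onto the line through A and B falls
// inside the segment AB iff AP·AB >= 0 and BP·BA >= 0.
struct Point3D { double x, y, z; };

static double dot(const Point3D& u, const Point3D& v)
{
  return u.x * v.x + u.y * v.y + u.z * v.z;
}

static Point3D sub(const Point3D& a, const Point3D& b)
{
  return { a.x - b.x, a.y - b.y, a.z - b.z };
}

bool projectionFallsOnSegment(const Point3D& P, const Point3D& A, const Point3D& B)
{
  // Both angles PAB and PBA must be acute or right.
  return dot(sub(P, A), sub(B, A)) >= 0.0 &&
         dot(sub(P, B), sub(A, B)) >= 0.0;
}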

The case of the triangular face

There are several accurate methods to calculate the distance of a point from a triangle in 3D space. The common way to determine whether a point falls inside a triangle is to consider the vectors connecting the point to each vertex of the triangle and sum the angles between these vectors: if the sum is $2\pi$, the point falls inside the triangle. This method is computationally slow, and a fast and efficient method is required in order not to overload the algorithm. Among the possible ones, the most straightforward method has been chosen, so that the same functions and the same logic used for the rectangular faces can be reused.


Algorithm 1 Distance point-face pseudo-code

if P' falls between A and B then
    dist ← dist + 0
else
    entered ← entered + 1
end if
if P' falls between B and C then
    dist ← dist + 0
else
    entered ← entered + 1
end if
if P' falls between C and D then
    dist ← dist + 0
else
    entered ← entered + 1
end if
if P' falls between D and A then
    dist ← dist + 0
else
    entered ← entered + 1
end if
if entered = 2 then
    compute distance from the closest edge
else
    if entered = 4 then
        compute distance from the closest vertex
    end if
end if

First, the projection of the point onto the plane in which the face lies is computed. If the projected point falls inside the triangle, the distance is the point-plane distance. If the projection falls outside, the distance between the point and each vertex and between the point and each edge segment is computed, and the minimum among these distances is taken.

To understand whether the projection of the point falls inside or outside the triangle, the cross product between two vectors is used. The result of the cross product of two vectors is perpendicular to both of them: if the second vector is on the left of the first one, looking from above, the resulting vector points up, otherwise it points down. Considering a triangle with vertices A, B, C and a point P, we calculate the cross product (B − A) × (P − A). The resulting vector will point either up or down depending on the position of P − A with respect to B − A. The problem is that a triangle in 3D space can have any orientation, and it is not possible to know which direction the cross product should have for a point inside the triangle


unless a reference point is used. The reference point can be the opposite vertex of the triangle. For example, if the cross product (B − A) × (P − A) does not point in the same direction as (B − A) × (C − A), then the point is not inside the triangle. If it does, the same test has to be done with the other vertices: if the point is on the same side of AB as C, of BC as A, and of CA as B, then the point is inside the triangle. The function is simple and does not require square roots, projections or other operations that would slow down the algorithm. The method used is easy to understand and fast to implement, but it is not the fastest one. The Barycentric Technique [28] is another simple method which has the advantage of being faster and the downside of being mathematically more involved. In this project the first method has been implemented; the second, faster method is left as a suggestion for future revisions of the work.

Figure 4.5 shows an example of a point inside a triangle and a point outside. The cross product between the vector connecting the point to the vertex and BC will point outside in one case and inside in the other.

Figure 4.5. Vectors connecting the vertex with a point inside the triangle and a point outside
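The same-side test described above can be sketched as follows; Vec3 and the function names are hypothetical and used only for illustration.

// Same-side test for deciding whether the projected point P lies inside triangle ABC.
struct Vec3 { double x, y, z; };

static Vec3 sub(const Vec3& a, const Vec3& b) { return { a.x - b.x, a.y - b.y, a.z - b.z }; }
static double dot(const Vec3& a, const Vec3& b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
static Vec3 cross(const Vec3& a, const Vec3& b)
{
  return { a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x };
}

// P is on the same side of line AB as the reference point Ref if the two
// cross products point in the same direction (non-negative dot product).
static bool sameSide(const Vec3& P, const Vec3& Ref, const Vec3& A, const Vec3& B)
{
  Vec3 cp1 = cross(sub(B, A), sub(P, A));
  Vec3 cp2 = cross(sub(B, A), sub(Ref, A));
  return dot(cp1, cp2) >= 0.0;
}

bool insideTriangle(const Vec3& P, const Vec3& A, const Vec3& B, const Vec3& C)
{
  // Inside iff P is on the same side of AB as C, of BC as A, and of CA as B.
  return sameSide(P, C, A, B) && sameSide(P, A, B, C) && sameSide(P, B, C, A);
}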

4.3 Iterative Closest Face Algorithm (ICF)

The distance between each point of the range scene and the closest face of the model is computed for different poses (translations and orientations). The set of poses to iterate over is generated following criteria aimed at accelerating the convergence to the global maximum and at avoiding local maxima. Since the cost function defined to solve the problem is not easily differentiable, traditional optimization methods cannot be applied. Therefore, the approach used is based on random search, which iterates over randomly generated poses for a first set of iterations. The space of search becomes smaller each time the algorithm gets closer to a maximum. At each iteration, the seed of the next pose to be generated is the last maximum. To avoid being trapped in a local solution, the algorithm generates a new random pose, no longer dependent on the previous best solution, each time the best solution does not change for a given number of iterations.

4.3.1 The Search Strategy

The implemented algorithm is an iterative, non-deterministic method, in the sense that the probability of reaching the correct result increases as the number of iterations grows. The idea behind the algorithm is to reduce the space of search at each iteration. Starting from a randomly generated set of coordinates and rotational angles, each iteration produces a set of values close to the best values found up to the previous iteration. These values follow a Gaussian distribution whose mean is the best value reached up to the previous iteration and whose standard deviation is given. At each iteration the standard deviation is decreased in order to reduce the space of search around the solution. In practice, at each iteration a kernel is defined: a function whose shape is described by a mean value and a standard deviation, and whose samples follow that shape. At each iteration a set of samples is extracted from the kernel and evaluated. As the number of iterations grows, the Gaussian gets tighter around the solution.
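The kernel-based sampling can be sketched as follows; the Pose struct, the standard-deviation arguments and the helper are hypothetical placeholders used only to illustrate the strategy, not the exact implementation of the thesis.

#include <random>

// Draw one candidate pose from a Gaussian kernel centred on the best pose found so far.
struct Pose { double x, y, z, thetaX, thetaY, thetaZ; };

Pose samplePose(const Pose& best, double locStdDev, double oriStdDev, std::mt19937& rng)
{
  auto gauss = [&rng](double mean, double sigma) {
    return std::normal_distribution<double>(mean, sigma)(rng);
  };
  Pose p;
  p.x = gauss(best.x, locStdDev);            // translation sampled around the current best
  p.y = gauss(best.y, locStdDev);
  p.z = gauss(best.z, locStdDev);
  p.thetaX = gauss(best.thetaX, oriStdDev);  // orientation sampled around the current best
  p.thetaY = gauss(best.thetaY, oriStdDev);
  p.thetaZ = gauss(best.thetaZ, oriStdDev);
  return p;
}

// The caller evaluates the objective function for each sampled pose, keeps it if it
// improves on the best one, and decreases locStdDev / oriStdDev so that the kernel
// gets tighter around the solution as the iterations proceed.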

4.3.2 Find the 6D pose of the object

One of the trickiest issues faced in the present work has been the implementation of the iterative search in such a way that it converges to good results in reasonable time. As said before, the objective function is not easily differentiable, so the search for the solution (the global maximum) is done with an iterative algorithm which produces results closer to the solution at each iteration.

The structure of the algorithm consists of three nested cycles. The number of iterations of each cycle is given by the parameters threshold1, threshold2 and threshold3. The parameters used in these experiments are threshold1 = 50, threshold2 = 100, threshold3 = 20, and the value of m in the objective function is set to 200: only those points whose distance from the closest face is lower than 200 mm contribute to the objective function. A local maximum is identified when the solution does not change for more than 3 iterations; in this case the LocalMaximumDetector flag is set to TRUE. If more than 500 points have a distance from the closest face bigger than max = 50 mm, the objective function is set to 0 and the computation for that pose is stopped. This condition accelerates the algorithm, helping to recognize a bad pose before the distance has been evaluated for every point. The pseudocode of the iterative algorithm is given below, together with its explanation.

The first step of the pseudocode is to generate random values for the x, y and z coordinates. The iterative algorithm has an external cycle of 50 iterations. For each of these iterations:


Algorithm 2 Rotations and Translation Test pseudo-code

[xbest, ybest, zbest] ← random(xmin, xmax, ymin, ymax, zmin, zmax)
k.loch ← 50
k.orih ← 180 ∗ PI/180
while iteration1 ≤ threshold1 do
    if localMaximumDetectorOnTranslation = FALSE then
        decrease k.loch
    end if
    if localMaximumDetectorOnTranslation then
        increase k.loch
        [xbest, ybest, zbest] ← random(xmin, xmax, ymin, ymax, zmin, zmax)
        BestTranslList ← BestTranslList + best
    end if
    k ← generateKernel([xbest, ybest, zbest], k.loch)
    t ← extractTranslation(k)
    pose ← generatePose(t)
    if pose.objValue > best.objValue then
        best ← pose
    end if
    [tetax, tetay, tetaz] ← random()
    while iteration2 ≤ threshold2 do
        decrease k.orih
        if localMaximumDetector then
            BestList ← BestList + PoseList.begin()
            PoseList ← EMPTY
            increase k.orih
            [tetax, tetay, tetaz] ← random()
        end if
        while iteration3 ≤ threshold3 do
            k ← generateKernel([tetax, tetay, tetaz], k.orih)
            o ← extractOrientation(k)
            pose ← generatePose(t, o)
            PoseList ← PoseList + pose
        end while
        bestPose ← findTheBestPose(PoseList)
        PoseList ← EMPTY
        PoseList ← PoseList + bestPose
        [tetax, tetay, tetaz] ← [bestPose.tetax, bestPose.tetay, bestPose.tetaz]
    end while
    BestList ← BestList + bestPose
    bestPose ← findBestPose(BestList)
    if bestPose.objValue ≥ best.objValue then
        best ← bestPose
        [xbest, ybest, zbest] ← [best.x, best.y, best.z]
    end if
    BestTransl ← BestTransl + best
end while


• if the algorithm is not trapped in a local solution, the space of search is reduced (the standard deviation for the translation is decreased by 10 mm);
• if the algorithm is trapped in a local solution and a matching has been found (the value of the objective function for the best pose is different from 0), the space of search is increased again (the standard deviation for the translation is brought back up to 20 mm) to help the algorithm escape from the local solution, and the values of x, y, z are generated randomly. Also, the best pose of the previous cycle is inserted in a list called BestTransl.

From the current values of x, y, z, a set of coordinates is generated following a Gaussian distribution with a given standard deviation and the values of x, y, z as mean. These values define a kernel from which a sample is extracted at each iteration. The power of this strategy is that the search is tightly directed towards the solution: at each iteration a new kernel is defined around the best value found up to the previous iteration, so the algorithm searches closer and closer to the solution, reducing the space of search. After a sample of coordinates is extracted, the model is translated to the point described by those coordinates and the value of the objective function is computed. This value is compared with the objective-function value of the best pose (best) found up to that iteration, and if it is higher the best pose is updated. With this change of coordinates the model is translated to a point which is closer to the solution, and the search for the right orientation starts by generating random values for the rotation angles around the x, y, z axes. The algorithm then iterates 100 times on the pose having the previously computed translation, seeking the angle values which maximize the matching for that translation. For each of these iterations:

• the space of search is decreased by 10 degrees;

• if the algorithm is trapped in a local solution, the current solution is saved in a list, the axes are rotated randomly between −180 and +180 degrees, and the space of search is increased back to 180 degrees.

Again, for each of these iterations, which corresponds to a combination of translation and rotation, a kernel having as mean values the angles of the external cycle is defined and 20 samples are extracted from it. These 20 poses are evaluated in order to find the pose of the model which best matches the object in the scene. This pose is the one around which the kernel for the following set of 20 iterations will be computed; in other words, the next iteration of the external cycle will produce 20 poses around the best value produced in the previous iteration. In particular, the rotational angles of the best pose become the mean values for the definition of a new kernel function in the next set of 20 iterations, unless a local maximum is found: in that case the kernel is defined with randomly generated parameters, the pose corresponding to the local solution is saved in the list BestList, and the search starts again. At the end of the 100 iterations, BestList contains all the solutions which trapped the algorithm in a local maximum. These poses are compared in order


to find the one which gives the best matching among those in the list. This pose (bestPose) represents the best solution the algorithm finds, starting by translating the model to a sample point extracted from a kernel and searching for the best orientation for that translation. If this solution is better than the previous one, it is stored as the best value found (best), and its coordinates become the mean values for the kernel function of the following iteration of the most external cycle. This means that the following iteration will search in a space around those coordinates, unless the solution is recognized to be a local maximum, in which case the algorithm restarts the search randomly and everything happens all over again. At the end of all the iterations, the BestTransl list contains the best values found during the search; these poses are evaluated in order to find the final best solution.


Chapter 5

The environment issue

The ICF easily aligns the model to the target when the point cloud given as input contains only the points belonging to the searched object, but we want to be able to identify the right object among several objects in a scene. For example, we want to identify and find the pose of the target in a scene composed of a table and some other objects lying on it. Of course, we need the table to be recognized as part of the background and not as an object to be analyzed. The idea is to filter out the points belonging to the table and separate the objects into different data structures. Each object is defined by all the points which are closer to each other than a certain threshold: points farther apart than, for instance, 20 mm belong to different objects. The PCL open-source library provides an algorithm that is able to find and segment individual objects lying on a plane (e.g., a table) into different clusters, recognizing the plane as part of the background [29].

Figure 5.1 shows the point cloud dataset belonging to the scene represented in figure 3.3. Figure 5.2 shows the clusters extracted from the scene and put in different datasets.

Figure 5.1. Objects on a table

A clustering method needs to divide an unorganized point cloud P into smaller parts. It divides the 3D space using fixed-width boxes or octree data structures and determines the region connected to a given node using a flood-fill algorithm [29] (for a detailed explanation refer to the pointclouds.org web page).

Figure 5.2. Result of Cluster Extraction

Algorithm 3 Extract Euclidean Clusters

create a kd-tree representation of the point cloud dataset P
set up an empty list of clusters C and a queue Q of the points that need to be checked
for every point pi ∈ P do
    Q ← Q + pi
    for every point pi ∈ Q do
        search in the set Pik for the neighbors of pi in a sphere with radius r < d
        for every neighbor pik ∈ Pik do
            if pik has not been processed then
                Q ← Q + pik
            end if
        end for
        if all of the list Q has been processed then
            C ← C + Q
            Q ← empty
        end if
    end for
end for

The algorithm used to extract the clusters, explained in the pseudo-code, belongs to the pcl::EuclideanClusterExtraction class [29]. Before the cluster extraction, the algorithm applies a filter to reduce the number of points of the range scene and, thus, the processing time. The Cluster Extraction algorithm is based on RANSAC planar segmentation and works only if the objects lie on a planar surface, a table in this specific case. An example of what we expect from the application of the cluster extraction algorithm and of the Iterative Closest Face algorithm is shown in Figure 5.3. A minimal sketch of this PCL pipeline is given below.
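The sketch below shows one way this pipeline can be assembled with PCL (downsampling, RANSAC plane removal and Euclidean clustering); the thresholds are illustrative and not necessarily the ones used in this work.

#include <vector>
#include <pcl/point_types.h>
#include <pcl/ModelCoefficients.h>
#include <pcl/filters/voxel_grid.h>
#include <pcl/filters/extract_indices.h>
#include <pcl/segmentation/sac_segmentation.h>
#include <pcl/segmentation/extract_clusters.h>
#include <pcl/search/kdtree.h>

// Sketch: table removal and Euclidean cluster extraction with PCL.
std::vector<pcl::PointIndices>
extractClusters(const pcl::PointCloud<pcl::PointXYZ>::Ptr& scene)
{
  // Downsampling filter to reduce the number of points and the processing time.
  pcl::PointCloud<pcl::PointXYZ>::Ptr filtered(new pcl::PointCloud<pcl::PointXYZ>);
  pcl::VoxelGrid<pcl::PointXYZ> vg;
  vg.setInputCloud(scene);
  vg.setLeafSize(0.005f, 0.005f, 0.005f);
  vg.filter(*filtered);

  // RANSAC planar segmentation: find the table plane and remove its inliers.
  pcl::SACSegmentation<pcl::PointXYZ> seg;
  pcl::PointIndices::Ptr inliers(new pcl::PointIndices);
  pcl::ModelCoefficients::Ptr coefficients(new pcl::ModelCoefficients);
  seg.setModelType(pcl::SACMODEL_PLANE);
  seg.setMethodType(pcl::SAC_RANSAC);
  seg.setDistanceThreshold(0.01);
  seg.setInputCloud(filtered);
  seg.segment(*inliers, *coefficients);

  pcl::ExtractIndices<pcl::PointXYZ> extract;
  pcl::PointCloud<pcl::PointXYZ>::Ptr objects(new pcl::PointCloud<pcl::PointXYZ>);
  extract.setInputCloud(filtered);
  extract.setIndices(inliers);
  extract.setNegative(true);               // keep everything that is NOT the plane
  extract.filter(*objects);

  // Euclidean clustering: points closer than 2 cm end up in the same cluster.
  pcl::search::KdTree<pcl::PointXYZ>::Ptr tree(new pcl::search::KdTree<pcl::PointXYZ>);
  tree->setInputCloud(objects);
  std::vector<pcl::PointIndices> clusterIndices;
  pcl::EuclideanClusterExtraction<pcl::PointXYZ> ec;
  ec.setClusterTolerance(0.02);
  ec.setMinClusterSize(100);
  ec.setMaxClusterSize(25000);
  ec.setSearchMethod(tree);
  ec.setInputCloud(objects);
  ec.extract(clusterIndices);
  return clusterIndices;
}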

Figure 5.3. Cluster Extraction and Pose Estimation

The ICF, thus, iterates over all the clusters, maximizing the objective function for each of them. The cluster whose cost function shows the best matching represents the target in the scene.


Chapter 6

Cluster Extraction and Pose Estimation Results

6.1 Models used during the experiments

The models used during the experiments were essentially four (see Fig. 6.1) and belong to the set of CAD models of the KTH Computer Vision Department [30]. The models are:

• Box5.ply: a blue box which consists of 24 vertexes and 6 rectangular faces;
• EisKaffe.ply: a cylindrical coffee container which consists of 102 vertexes and 75 faces, 50 of them triangular and 25 rectangular;
• YellowRoundSalt.ply: a cylindrical salt container which consists of 130 vertexes and 96 faces, 32 rectangular and 64 triangular;
• Amicelly.ply: a hexagonal parallelepiped container which consists of 38 vertexes and 24 faces, all of them triangular.

The number of faces is what distinguishes these models in terms of complexity. The most complex ones are the EisKaffe model and the YellowRoundSalt model, whose numbers of faces (75 and 96 respectively) imply a higher computational time for the algorithm: for these models the algorithm runs for several minutes before reaching a solution and needs a large amount of resources. Although the algorithm can still be improved to increase the performance and reduce the computational time, for models whose complexity is higher than that of the coffee and salt containers the efficiency of the ICF decreases to the point that the traditional approach is definitely preferred.


Figure 6.1. Ply models: (a) Box5, (b) EisKaffe, (c) YellowRoundSalt, (d) Amicelly.

6.2 Experiments

The experiments have been run on a recent laptop computer. For each experiment the figures show the 2D RGB picture together with the point cloud dataset of the scene from different perspectives. The grey object in the point cloud picture represents the CAD model of the target. The aim of the algorithm is to find the target in the scene with the right translation and orientation, recognizing it among several objects lying on the table. Six scenarios are proposed, and the experiments that have been run in order to assess the robustness of the algorithm are discussed in the chapter "Statistical Analysis". In this section, instead, the scenes are analyzed and described in order to point out the difficulties of each scenario and evaluate the results of the algorithm according to the features of the specific scene.

Figure 6.2 shows a simple situation where Box5 (the target to find), a cup and an Indian statue lie on the table. In this situation the three objects have completely different shapes, and the algorithm easily recognizes the right cluster. In fact, since the number of points belonging to the box is much bigger than the number of points belonging to the other two objects, the objective function of the box has a higher value. The only difficulty in this case is finding the right orientation and translation of the box in the scene.

Figure 6.2. Scenario 1: Box5, a cup and an Indian statue on the table

The situation proposed in Figure 6.3 is more complex because the object to find is the smallest one. Since the value of the objective function increases with the number of points (it is based on the sum of the distances between each point of a cluster and the closest face of the model), without any extra condition the algorithm would match the model with the wrong object, the biggest one. An additional condition is therefore required: the solution is the cluster with the highest value of the cost function, provided it does not have more than 5 points with a distance from the closest face bigger than 2.7 cm. With this condition, an object which has a significantly different size cannot be a candidate for the correct match, even if its objective-function value is higher than that of the target.

Figure 6.4 shows a situation where a box and a cylinder-shaped salt package lie on a table. If we wanted to find the pose of the box, the situation would be close to the one proposed in Figure 6.2, since the number of points belonging to the box is higher than the number of points belonging to the cylinder. In this test, instead, the target is the cylinder. The objective function of the cylinder would have a higher score as long as we consider the same number of points for the two objects, since the model would match the shape of the cylinder exactly. If the number of points affecting the objective function is different for the two clusters, it is not possible to rely only on the value of the cost function; some other considerations are required, otherwise the model would be matched with the box. The additional condition explained before solves the problem. When the difference in the number of points is due to a different resolution (a different distance from the camera) rather than to a difference in the size of the objects, this ploy does not work: other features, such as the color, could be used. The same consideration holds for objects which have both similar shape and size.

In the scenario presented in Figure 6.5, three objects lie on the table: a cup, a salt container and a box. This case is interesting because of the presence of the cup, whose shape is similar to that of the salt container. The latter is slightly thinner and taller, but both are cylinders of similar size. The target is the salt container. The cost function has a value of about 39 for the box, about 13 for the cup and about 15 for the salt container. The matching with the box is excluded because it has 19 points whose distance from the model is greater than 2.7 cm. Neither the cup nor the salt container has any point at such a distance, and the matched cluster is the right one.

In the scenario presented in Figure 6.6, two objects of similar shape and dimension lie on the table: a cylindrical container (EisKaffe) and a hexagonal container (Amicelly). The number of points belonging to the hexagonal parallelepiped is higher than the number of points belonging to the cylinder. For this reason the target is always matched with the former. Figure 6.6 shows the resulting matching when the target is the hexagonal parallelepiped, and Figure 6.7 when the target is the cylinder.



Figure 6.3. Scenario 2: Box5 and another rectangular-shaped object on the table.



Figure 6.4. Scenario 3: Box5 and YellowRoundSalt on the table.

Because of the similar size of the objects, the additional condition used in the previous scenarios does not work. As said before, in this case considering only the geometrical features of the model is not enough. Other information, such as the colors of the clusters, should be used to distinguish the two objects. For example, most of the cylinder's pixels lie around the RGB value of blue, while most of the hexagonal parallelepiped's pixels lie around the RGB value of yellow. In cases like this, where there is a clear difference between the colors of the two objects, this information can easily be used by the algorithm to find the right cluster; a possible color heuristic is sketched below.
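A possible way to exploit this cue, sketched here under the assumption that each cluster comes with an (N, 3) array of RGB pixel values, is to pick the cluster whose pixels are on average closest to a reference colour; the function names are illustrative only and not part of the thesis implementation:

```python
import numpy as np

def mean_colour_distance(cluster_rgb, reference_rgb):
    # Mean Euclidean distance in RGB space between the cluster's pixels and a
    # reference colour (e.g. blue for EisKaffe, yellow for Amicelly).
    diff = np.asarray(cluster_rgb, dtype=float) - np.asarray(reference_rgb, dtype=float)
    return float(np.mean(np.linalg.norm(diff, axis=1)))

def pick_cluster_by_colour(clusters_rgb, reference_rgb):
    # Index of the cluster whose pixels best match the reference colour.
    return int(np.argmin([mean_colour_distance(c, reference_rgb) for c in clusters_rgb]))
```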

Figure 6.8 shows the same scenario as before, with the only difference that the hexagonal parallelepiped has a different orientation. If the experiment is run with the Amicelly as target, the algorithm produces the right matching, since the hexagonal container has a higher number of points. If the experiment is run with the cylinder as target, instead, the model is matched with the wrong object (although with the correct orientation), as shown in Figure 6.9.



Figure 6.5. A cup, a salt container and a box on the table.



Figure 6.6. Scenario 5: EisKaffe and Amicelly on the table.



Figure 6.7. Scenario 6: EisKaffe and Amicelly on the table.



Figure 6.8. Scenario 7: EisKaffe and Amicelly on the table.

Figure 6.9. Scenario 8: EisKaffe and Amicelly on the table.


Chapter 7

Statistical Analysis

In this chapter the results of the experiments made on the most significant scenarios are displayed and commented. In particular, the analyzed scenarios are 1, 2, 3 and 5, since they vary enough to give the reader a complete idea of the robustness of the algorithm in different cases. For each scenario the algorithm has been executed 100 times and the error in both the translation and the orientation has been computed. The position of the model is defined by the coordinates of its centre. The error in the position is computed by comparing the coordinates resulting from the experiment with the expected coordinates of the centre of the model; in particular, the error in the translation is calculated as the Euclidean distance between the two points in space.

To compute the error in the orientation, the rotation angles resulting from the experiment were converted to the quaternion format and compared with the quaternion representing the expected rotation. The distance between two quaternions [31] was computed using a function of the Nuklei library [32], which calculates the angular distance between two rotations expressed as quaternions, in this way:

γ = |q1 · q2|
dist = 2 arccos γ

The result is a number which varies from 0 to π and represents the angular error. It is further wrapped so that if dist > π/2 then dist = π − dist, constraining the error to the interval [0, π/2]. The error in the coordinates x, y, z is expressed in millimeters and the error in the orientation is expressed in degrees.
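The two error measures can be reproduced with plain NumPy as in the sketch below; this is only an illustration of the formulas above, not the Nuklei routine actually used, and quaternions are assumed to be unit 4-vectors:

```python
import numpy as np

def translation_error(centre_estimated, centre_expected):
    # Euclidean distance between the estimated and expected model centres.
    return float(np.linalg.norm(np.asarray(centre_estimated) - np.asarray(centre_expected)))

def rotation_error_deg(q1, q2):
    # Angular distance between two unit quaternions, wrapped to [0, 90] degrees
    # as described above.
    gamma = min(abs(float(np.dot(q1, q2))), 1.0)  # clamp to avoid rounding issues
    dist = 2.0 * np.arccos(gamma)                 # in [0, pi]
    if dist > np.pi / 2.0:
        dist = np.pi - dist
    return float(np.degrees(dist))
```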

For the EisKaffe model the procedure to compute the rotational error has been slightly different. Because of the cylindrical shape of the model, any orientation whose main axis is parallel to the main axis of the expected solution gives a good match. Thus, the main axis was computed for each orientation resulting from the experiments and compared with the main axis of the model in the expected orientation. The angle between the two lines gives the error in the orientation; a short sketch of this computation is given below. A chapter in the appendix explains the mathematical approach for computing the orientation error in this case.
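A minimal sketch of this axis-based error, assuming the two main axes are given as 3D vectors, could look as follows (illustrative only):

```python
import numpy as np

def axis_error_deg(axis_estimated, axis_expected):
    # Angle between the two main axes; the absolute value of the dot product
    # is taken because the sign (up or down) of the axis is irrelevant.
    a = np.asarray(axis_estimated, dtype=float)
    b = np.asarray(axis_expected, dtype=float)
    a, b = a / np.linalg.norm(a), b / np.linalg.norm(b)
    return float(np.degrees(np.arccos(np.clip(abs(np.dot(a, b)), 0.0, 1.0))))
```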



Figure 7.1. Scenario 1: Error in the Translation and Orientation.

For the Amicelly model there are 6 possible orientations about the y axis which give a good match, since the parallelepiped is hexagonal. In scenario 5, then, the quaternion resulting from the experiment is compared with the quaternions corresponding to the 6 equivalent orientations; the minimum among these distances is taken as the error, as sketched below.
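The symmetry-aware error can be sketched as below; `q_equivalents` is assumed to hold the six quaternions of the equivalent orientations (an illustration, not the thesis code):

```python
import numpy as np

def quaternion_angle_deg(q1, q2):
    # Angular distance between two unit quaternions, wrapped to [0, 90] degrees.
    dist = 2.0 * np.arccos(min(abs(float(np.dot(q1, q2))), 1.0))
    return float(np.degrees(min(dist, np.pi - dist)))

def symmetric_rotation_error_deg(q_estimated, q_equivalents):
    # Minimum distance over the six orientations that are equally valid
    # matches for the hexagonal container.
    return min(quaternion_angle_deg(q_estimated, q) for q in q_equivalents)
```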

The figures below show the results for the different scenarios. Figures 7.5 and 7.6 summarize in bar plots the mean and the standard deviation of both the translation and orientation errors.

The mean value of the error in the translation in scenario 1 is 22.86 mm and the standard deviation is 11.12 mm. The mean value of the rotational error is 15.48° and the standard deviation is 24.57°.


Figure 7.2. Scenario 2: Error in the Translation and Orientation.

The mean value of the error in the translation in scenario 2 is 21.62 mm and the standard deviation is 9.35 mm. The mean value of the rotational error is 10.38° and the standard deviation is 31.86°.

The mean value of the error in the translation in scenario 3 is 16.03 mm and the standard deviation is 10.17 mm. The mean value of the rotational error is 7.08° and the standard deviation is 4.16°.

The mean value of the error in the translation in scenario 5 is 12.44 mm and the corresponding standard deviation is 7.99 mm. The mean value of the rotational error is 7.58° and the standard deviation is 8.19°.



Figure 7.3. Scenario 3: Error in the Translation and Orientation.

7.1 The computational cost of the algorithm

The computational cost of the algorithm is strictly related to the number of points in the picture (np), the number of objects lying on the table (no), and the complexity of the CAD model which represents the target, i.e. the number of its faces (nf). Indeed, after extracting the clusters from the scene, the algorithm iterates 50 × 100 × 20 times over each point of each cluster. Also, for each point of each cluster the algorithm computes the point-face distance for every face of the model, meaning that for complex models with a high number of faces, like the EisKaffe one, the computational time rises significantly. The algorithm complexity is O(no × (50 × 100 × 20) × np × nf); the loop structure behind this estimate is sketched below.
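The following schematic, with placeholder functions and the iteration counts quoted above, makes the structure behind this complexity estimate explicit; it is an illustration of the counting argument, not the actual implementation:

```python
def point_face_distance(point, face):
    # Placeholder for the point-to-face distance used by the ICF.
    return 0.0

def count_distance_evaluations(clusters, model_faces):
    # Counts how many point-face distances would be evaluated, reproducing
    # the O(no * (50 * 100 * 20) * np * nf) estimate given above.
    evaluations = 0
    for cluster in clusters:                      # no clusters
        for _restart in range(50):
            for _iteration in range(100):
                for _candidate in range(20):
                    for point in cluster:         # np points per cluster
                        for face in model_faces:  # nf faces of the CAD model
                            point_face_distance(point, face)
                            evaluations += 1
    return evaluations
```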

A significant reduction of the computational time could be achieved by using parallel programming to process all the clusters at the same time.



Figure 7.4. Scenario 5: Error in the Translation and Orientation.

In this project this improvement has not been implemented due to time constraints, but it is proposed here as a direction for future work; a minimal sketch of the idea follows.
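A minimal sketch of this idea, using Python's multiprocessing module with a hypothetical per-cluster fitting routine, could look like this:

```python
from multiprocessing import Pool

def fit_model_to_cluster(cluster):
    # Placeholder: run the ICF on a single cluster and return (score, pose).
    return (0.0, None)

def fit_all_clusters(clusters, workers=4):
    # Process every cluster in a separate worker process.
    with Pool(processes=workers) as pool:
        return pool.map(fit_model_to_cluster, clusters)

if __name__ == "__main__":
    print(fit_all_clusters([[1, 2], [3, 4]]))
```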

The computational times for scenarios 1, 2, 3 and 5 are 33.688 s, 30.404 s, 9 min 13 s and 2 min 29 s, respectively.



Figure 7.5. Mean Error.



Figure 7.6. Standard deviation of the error.


Chapter 8

The visualization tool

When the algorithm reaches a solution, the result is written to a file with extension .ply [27]. This file contains the coordinates of the model translated and rotated according to the solution. The visualization tool has been programmed using the VTK library [33], a powerful software package to display and process both point cloud datasets and ply files. The tool displays the point cloud scene captured by the PrimeSense sensor, saved in a .xyz file, together with the CAD model in the .ply format, overlapping the area in the scene which gave the highest matching score (see Fig. 8.1).
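A viewer of this kind can be sketched with VTK's Python bindings as below; the file names are placeholders and the sketch is only an illustration of the approach, not the tool's actual code:

```python
import numpy as np
import vtk

# CAD model in the estimated pose (placeholder file name).
ply_reader = vtk.vtkPLYReader()
ply_reader.SetFileName("result_model.ply")

model_mapper = vtk.vtkPolyDataMapper()
model_mapper.SetInputConnection(ply_reader.GetOutputPort())
model_actor = vtk.vtkActor()
model_actor.SetMapper(model_mapper)
model_actor.GetProperty().SetColor(0.7, 0.7, 0.7)  # grey model

# Point cloud of the scene, assumed to be stored as one "x y z" triple per line.
xyz = np.loadtxt("scene.xyz")
points = vtk.vtkPoints()
for x, y, z in xyz:
    points.InsertNextPoint(float(x), float(y), float(z))
cloud = vtk.vtkPolyData()
cloud.SetPoints(points)
glyph = vtk.vtkVertexGlyphFilter()
glyph.SetInputData(cloud)

cloud_mapper = vtk.vtkPolyDataMapper()
cloud_mapper.SetInputConnection(glyph.GetOutputPort())
cloud_actor = vtk.vtkActor()
cloud_actor.SetMapper(cloud_mapper)

renderer = vtk.vtkRenderer()
renderer.AddActor(model_actor)
renderer.AddActor(cloud_actor)

window = vtk.vtkRenderWindow()
window.AddRenderer(renderer)
interactor = vtk.vtkRenderWindowInteractor()
interactor.SetRenderWindow(window)
window.Render()
interactor.Start()
```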

Figure 8.1. Visualization Tool Example


The tool also provides a mouse event handler which displays the coordinates of the point clicked with the mouse in the Terminal window, as shown in Figure 8.2.
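Such a handler can be sketched with VTK's point picker, assuming a `renderer` and `interactor` like those in the previous sketch; again this is an illustration, not the tool's code:

```python
import vtk

def add_click_printer(interactor, renderer):
    # Attach a point picker that prints the 3D coordinates of the clicked
    # point to the terminal.
    picker = vtk.vtkPointPicker()
    interactor.SetPicker(picker)

    def on_left_click(obj, event):
        x, y = obj.GetEventPosition()
        if picker.Pick(x, y, 0, renderer):
            print("Picked point:", picker.GetPickPosition())

    interactor.AddObserver("LeftButtonPressEvent", on_left_click)
```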

Figure 8.2. Mouse event handler.

References
