
Master of Science Thesis in Electrical Engineering

Department of Electrical Engineering, Linköping University, 2016

Recognition and registration of 3D models in depth sensor data

Ola Grankvist


Master of Science Thesis in Electrical Engineering

Recognition and registration of 3D models in depth sensor data

Ola Grankvist
LiTH-ISY-EX--16/4993--SE

Supervisor: Martin Danelljan
ISY, Linköpings universitet

Morgan Bengtsson

Fotonic

Examiner: Per-Erik Forssén

ISY, Linköpings universitet

Computer Vision Laboratory
Department of Electrical Engineering
Linköping University
SE-581 83 Linköping, Sweden

Copyright © 2016 Ola Grankvist


Abstract

Object recognition is the art of localizing predefined objects in image sensor data. In this thesis a depth sensor was used, which has the benefit that the 3D pose of the object can be estimated. This has applications in e.g. automatic manufacturing, where a robot picks up parts or tools with a robot arm.

This master thesis presents an implementation and an evaluation of a system for object recognition of 3D models in depth sensor data. The system uses several depth images rendered from a 3D model and describes their characteristics using so-called feature descriptors. These are then matched with the descriptors of a scene depth image to find the 3D pose of the model in the scene. The pose estimate is then refined iteratively using a registration method. Different descriptors and registration methods are investigated.

One of the main contributions of this thesis is that it compares two different types of descriptors, local and global, a comparison which has seen little attention in research. This is done for two different scene scenarios and for different types of objects and depth sensors. The evaluation shows that global descriptors are fast and robust for objects with a smooth visible surface, whereas the local descriptors perform better for larger objects in clutter and occlusion. This thesis also presents a novel global descriptor, the CESF, which is observed to be more robust than the other global descriptors. As for the registration methods, the ICP is shown to perform most accurately and the point-to-plane ICP to be more robust.


Contents

Notation

1 Introduction
1.1 Background
1.2 Related Work
1.3 Goal
1.4 System Overview
1.5 Limitations
1.6 Method
1.6.1 Tools
1.7 Thesis Structure

2 Background
2.1 Depth sensors
2.1.1 Structured Light
2.1.2 Time-of-flight
2.1.3 Rasterization
2.2 Point Cloud
2.2.1 Normal Estimation
2.3 Kd tree
2.4 Rigid Transformation
2.4.1 Estimate a rigid transformation
2.5 Quaternions

3 Theory
3.1 Local Pipeline
3.1.1 Preprocessing
3.1.2 Keypoint Detection
3.1.3 Feature Descriptors
3.1.4 Matching
3.1.5 Correspondence grouping
3.1.6 Initial Pose Estimation
3.1.7 Pose Refinement
3.1.8 Choosing Final Pose
3.2 Global Pipeline
3.2.1 Pre-processing and Segmentation
3.2.2 Descriptors
3.2.3 Matching
3.2.4 Initial Pose Estimation

4 Experiments and Results
4.1 Experiments
4.1.1 3D models
4.1.2 Evaluation
4.2 Results
4.2.1 The System
4.2.2 Registration

5 Discussion
5.1 Result
5.1.1 The System
5.1.2 Registration
5.2 Method
5.2.1 Evaluation
5.2.2 The System

6 Conclusions

A 3D models

Bibliography


Notation

Abbreviations

Abbreviation | Meaning
ICP | Iterative Closest Point
DOF | Degree(s) of Freedom
PCL | Point Cloud Library
CPU | Central Processing Unit
CAD | Computer Aided Design
RGB | Red-Green-Blue
CRH | Camera Roll Histogram
VFH | Viewpoint Feature Histogram
CVFH | Clustered Viewpoint Feature Histogram
ESF | Ensemble of Shape Functions
CESF | Clustered Ensemble of Shape Functions
SHOT | Signature of Histograms of OrienTations
FPFH | Fast Point Feature Histogram
SPFH | Simplified Point Feature Histogram
RANSAC | Random Sample Consensus
GICP | Generalized ICP
LM | Levenberg-Marquardt
3D-NDT | Three-Dimensional Normal-Distributions Transform
EVD | Eigenvalue Decomposition
OPP | Orthogonal Procrustes Problem
SVD | Singular Value Decomposition
ToF | Time-of-Flight
RF | Reference Frame
ISS3D | Intrinsic Shape Signatures 3D
FOV | Field-Of-View


1

Introduction

This chapter gives an introduction to the report, presenting some background together with a formulation of the problem and the method used to solve it. This master thesis was carried out at the office of Fotonic in Stockholm.

1.1

Background

Object recognition is the art of localizing predefined objects in an image. The applications of object recognition are numerous. Some examples are people recognition for surveillance and statistical applications such as counting people passing through a door, microscopy cell counting, and OCR, optical character recognition. The approach to the problem depends on the scene, the sensor, the objects and the application.

This thesis investigates recognition of rigid objects described by a 3D model. Thus, the 3D structure of the object is well defined and does not vary between scenes, in contrast to people or faces. Applications of this are typically found in automatic manufacturing, where a robot picks up a tool or an object for assembly or quality inspection. To do this, the robot needs to know which object to pick up and where and how it is localized in the scene. Another application is for a robot or a vehicle to find an object in a warehouse, for example a pallet. In this situation the objects can be fairly large and cover a larger part of the scene. Using an image sensor for these applications removes the need for marking the objects with a marker, e.g. an RFID tag, and it also makes it possible for the robot to use the image sensor for other tasks, such as navigating in a warehouse.

A depth sensor was used in this thesis. It captures the depth of the scene measured from the sensor. As the sensor is calibrated, each pixel value is a depth measurement from the sensor to the corresponding point in the scene. An example of a depth image is shown in figure 1.1b, and the same scene captured in colored sensor data is shown in figure 1.1a. Using a depth sensor, the 3D structure of a captured scene can be estimated. This is often represented by a point cloud, as in figure 1.1c; point clouds are described more thoroughly in section 2.2.

(a) An RGB image; the black pixels are areas where the depth image could not provide a depth value.

(b) A depth image. The color representation ranges from small depth values in green to higher values in red through blue.

(c) A point cloud representation using the same color representation as in figure 1.1b.

Figure 1.1: An RGB image, a depth image and a point cloud representation of the same scene.

A benefit of using a depth sensor is that the 3D structure of the object's 3D model can be compared with the 3D structure of the scene in order to localize the object. One method of doing this is to use feature descriptors, referred to as just "descriptors", which describe the 3D structure of both the object's 3D model and the captured scene in a parametric manner. By matching the descriptors of the model with those of the scene, the object's 3D pose can be determined. The pose has six degrees of freedom, 6-DOF, with three degrees for the translation and three for the orientation.

Another approach is to perform the recognition in regular camera images, using texture and color information of the 3D model together with an illumination model. However, color and texture depend on the illumination source and intensity, which makes this approach difficult.

With a depth sensor the 3D pose can be determined, which is important if e.g. a robot arm is to pick up the object. Another benefit of knowing the 3D structure of the scene is that unwanted pixels can be excluded, such as pixels situated far away in the background, or pixels belonging to 3D shapes such as a plane, which removes points localized on a table or a wall.

1.2

Related Work

Existing research on 3D object recognition based on 3D data of an object, such as a 3D model, often categorizes the approach as using either global or local feature descriptors. Local descriptors describe a local region of the scene or of the 3D model, while global descriptors describe the whole object using only one descriptor.

Guo et al. [7] present a survey of descriptors, mainly local, and claim that local descriptors are better suited than global ones for objects in occlusion and clutter. [7] also presents a commonly used pipeline for local descriptors, including different varieties of it. Most articles, such as [15], [2], [16], [3], [19], [6], [21], [12], follow these pipelines using local descriptors and compare them with other local descriptors. Global descriptors, as presented in [1], [20], instead describe the whole object using one descriptor. Both [1] and [20] have compared global descriptors. In the existing research, less attention has been given to quantitatively comparing the pipeline based on global descriptors with the pipeline based on local descriptors.

1.3

Goal

The goal of this thesis is to implement a system and investigate different approaches for object recognition and registration of a 3D model in depth sensor data, with the aim of real-time performance. Two different scenarios are considered. In the first, a few objects cover a large part of the scene and may be occluded. In the second scenario, more objects are present on a table, some of which are partly occluded. An application of the first scenario is where a robot or a vehicle wants to find a larger object of interest in a warehouse; hence this scenario will be called the warehouse scenario further on in this thesis. The second scenario represents scenes where a robot picks a specific tool out of several on a table, or finds a manufactured object for inspection. This scenario will be called the table scenario.


Figure 1.2: A system overview.

1.4

System Overview

A system overview can be seen in figure 1.2. The system has an offline training stage where the inputs are synthetic depth images of the 3D model. These are renderings of the model captured from different viewpoints in order to simulate a depth sensor. Each of the synthetic depth images is preprocessed and its features are described by feature descriptors. The descriptors try to describe either the whole depth image or smaller parts of it in a compact and robust manner.

The input to the system's online stage is a depth image of a scene. This image is also preprocessed and described by feature descriptors. These descriptors are then matched with the ones of the model. Using the matches that are considered good enough, a set of initial six-degrees-of-freedom object poses is hypothesized. These are later refined using a registration method, and the pose hypothesis that is considered the best is chosen.

Two different pipelines are implemented, one for the local descriptors as presented in [7] and one for the global descriptors as presented in [1] and [13]. Figure 1.2 shows a general overview independent of the pipelines.

1.5

Limitations

The input to the system will not be a 3D model itself but several synthetic depth images created by rendering a 3D model from different viewpoints. The viewpoints are manually chosen and thus not a part of the system. The technique used for creating these images is described in section 2.1.3.

The system is designed to only find one object at a time and the scenes of both scenarios will only contain one of those objects.


1.6

Method

This section describes the tools used and briefly the evaluations of the system and algorithms. A more thorough explanation of the experiments is given in section 4.1. The system, briefly described in section 1.4, was evaluated with the two scenarios described in section 1.3. Both synthetic and real images were used in the evaluation. The real images were taken with two different depth sensors. Due to time constraints and difficulties with finding large real objects for the warehouse scenario, this scenario was only evaluated with synthetic depth images.

The setup for the experiments with the table scenario can be seen in figure 4.3a.

1.6.1

Tools

The system was developed in the programming language C++. Most of the functions and algorithms used in this thesis come from the Point Cloud Library, PCL, version 1.8.0, described in [11], which is also developed in C++. During the work on the thesis, bugs were found in PCL. When encountered, the source code was modified so that the algorithm worked satisfactorily.

The program was built as an application in Fotonic Viewer, a depth image visualizer. The input scene depth image was visualized with the best hypothesis aligned onto it.

MATLAB® has been used for visualizing plots in this report.

Two different depth sensors were used: a Structured Light sensor, the Fotonic Astra, and a Time-of-Flight sensor, the Microsoft Kinect One.

The system was implemented on a computer equipped with a 3.00 GHz Intel® Core™ i7-3540M CPU and 16 GB of RAM.

1.7

Thesis Structure

The structure of the thesis is as follows.

• Chapter 2 presents some background theory about depth sensors and gives a quick guideline and references to some techniques used in this thesis.

• Chapter 3 gives a thorough presentation of the different algorithms used in the system.

• Chapter 4 presents the results of the evaluations.

• Chapter 5 presents a discussion of the results and of the method.

• Chapter 6 gives a conclusion of the evaluations.


2

Background

This chapter presents some background that is needed to fully understand this thesis. Where an area is only briefly described, a reference is given to a more thorough explanation.

2.1

Depth sensors

There are several different techniques for estimating the depth of a scene. In this thesis two different approaches were used, Structured Light and Time-of-Flight, ToF. The resolution and the Field-of-View, FOV, of the sensors are listed in table 2.1. The Structured Light technique results in a smoother depth image but with less detail, while ToF captures more detail but with edge artifacts.

Sensor | Resolution | Horizontal FOV | Vertical FOV
Fotonic Astra | 640x480 | 59° | 46°
Microsoft Kinect One | 512x424 | 57° | 43°
Synthetic sensor | 512x512 | 30° | 30°

Table 2.1: Depth sensor data.

2.1.1

Structured Light

In this thesis, a Fotonic Astra structured light depth sensor was used. The Astra projects an infrared random dot pattern onto the scene. The sensor captures the dot pattern and estimates the depth by looking at the horizontal offset in small regions.

As the sensor finds the local regions by correlation, the dots can not be too deformed. To still be able to estimate the depth, a larger region is used. The result is that detail is lost in areas rich in 3D structure.


Figure 2.1: Time-of-Flight sensor illustration.

Figure 2.2: ToF edge effects at the LS5 and at the Milk Carton Box.

2.1.2

Time-of-flight

A Time-of-Flight (ToF) sensor measures depth by quantifying the phase shift of an emitted light signal when it bounces back from objects in a scene. In this thesis a Microsoft Kinect One ToF sensor was used, whose emitted signal is an infrared pulse train. The reflected signal's phase is shifted, and the distance the signal has traveled is proportional to this phase shift. Figure 2.1 shows an illustration. When measuring the depth there is an ambiguity, since several different depths result in the same phase shift. To resolve this, several pulse trains with different frequencies are used.

When a pixel in the sensor captures a reflected signal from an edge of an object, both the reflected light from the object and the light from the background are captured. This results in artifacts that appear as interpolated values between the object and the background. An example of this is seen in figure 2.2.

2.1.3

Rasterization

To create the synthetic depth images, both for the scenes and for the model point clouds, a technique called Perspective Projection and Rasterization was used. Given the 3D models, their 6-DOF poses, the 6-DOF pose of the viewpoint, and the resolution and Field-of-View of the sensor, rasterization creates images in which only the surfaces visible from the viewpoint are part of the depth image.

Each object is rendered and the computed depth for each pixel is stored in a depth buffer. The depth buffer is a two-dimensional array where each element represents a pixel in the output image. If several objects render a depth value in the same element, the lowest value is stored.

2.2

Point Cloud

Using a depth image we can create a point cloud, which is a set of points in a three-dimensional coordinate system (x, y, z) with the camera center at the origin. If the camera is calibrated, each pixel represents a 3D point, using its value as depth information.

When using a depth image, the organization of the points is known from how they are organized in memory. Hence, the approximate nearest neighbours of each point in (x, y) are its nearest pixels. This can be used to speed up computations.

As the system in this thesis uses many algorithms from PCL, which does not offer many possibilities to exploit the fact that point clouds are organized, this was not taken advantage of. The depth images are treated as arbitrary point clouds without an organization. A discussion of this is found in section 3.1.1.

2.2.1

Normal Estimation

Estimating the normal vector of a point in a point cloud is an important part of several of the algorithms used in this thesis. The problem of estimating the normal of a point is the same as estimating the surrounding surface as a plane. This problem is solved using an Eigenvalue Decomposition, EVD, of the covariance matrix of the points surrounding the query point. The covariance matrix C is computed as in (2.1), where p is the query point and p_i are its k neighbours inside a chosen radius.

C = \sum_{i=1}^{k} (p_i - p)(p_i - p)^T    (2.1)

Using an EVD of C, the two eigenvectors corresponding to the two largest eigenvalues span the plane, and the third eigenvector is the estimated normal of the plane. There is an ambiguity in the direction of the third eigenvector, but since the viewpoint vp of the sensor is known, the correct orientation of the normal n is defined as the one pointing towards the viewpoint. Thus the constraint n · (vp − p) > 0 must be satisfied.
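As an illustration of how this normal estimation can be carried out in PCL, the library used in this thesis, here is a minimal sketch; the function name and the 30 mm search radius are assumptions, and the viewpoint is placed at the sensor origin:

```cpp
#include <pcl/point_types.h>
#include <pcl/features/normal_estimation.h>
#include <pcl/search/kdtree.h>

// Estimate normals by fitting a local plane (EVD of the neighbourhood covariance)
// and orienting each normal towards the viewpoint, here the sensor at the origin.
pcl::PointCloud<pcl::Normal>::Ptr
estimateNormals(const pcl::PointCloud<pcl::PointXYZ>::ConstPtr &cloud)
{
    pcl::NormalEstimation<pcl::PointXYZ, pcl::Normal> ne;
    ne.setInputCloud(cloud);
    ne.setSearchMethod(pcl::search::KdTree<pcl::PointXYZ>::Ptr(
        new pcl::search::KdTree<pcl::PointXYZ>));
    ne.setRadiusSearch(0.03);          // neighbours within 30 mm (assumed value)
    ne.setViewPoint(0.0f, 0.0f, 0.0f); // enforces n · (vp - p) > 0
    pcl::PointCloud<pcl::Normal>::Ptr normals(new pcl::PointCloud<pcl::Normal>);
    ne.compute(*normals);
    return normals;
}
```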

2.3

Kd tree

A Kd tree is a data structure used for fast nearest neighbour searches. Kd trees are a special case of binary trees. At creation, all points are split at the root level by the median of the first dimension. The points with a first-dimension value lower than the median are placed in the left sub-tree and the points with a higher value in the right. At each level down in the tree the points are split in the next dimension, returning to the first after all dimensions have been exhausted. For a point cloud the first dimension is the x-axis, the second the y-axis, and so on. A more thorough explanation of Kd trees is presented in [4].
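A minimal sketch of a nearest neighbour query with PCL's Kd tree implementation; the function and variable names are illustrative:

```cpp
#include <pcl/point_types.h>
#include <pcl/kdtree/kdtree_flann.h>
#include <vector>

// Find the k nearest neighbours of a query point in a point cloud.
std::vector<int> kNearest(const pcl::PointCloud<pcl::PointXYZ>::ConstPtr &cloud,
                          const pcl::PointXYZ &query, int k)
{
    pcl::KdTreeFLANN<pcl::PointXYZ> tree;
    tree.setInputCloud(cloud);                 // builds the tree

    std::vector<int> indices(k);
    std::vector<float> sq_distances(k);        // squared distances to the neighbours
    tree.nearestKSearch(query, k, indices, sq_distances);
    return indices;
}
```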

2.4

Rigid Transformation

A rigid transformation of a point cloud preserves the distances between every pair of points. This includes translations, rotations and combinations of them.

2.4.1

Estimate a rigid transformation

The problem of estimating a rigid transformation between two point clouds is called the Orthogonal Procrustes Problem, OPP. The problem has a closed-form solution based on Singular Value Decomposition, SVD. SVD is a generalization of Eigenvalue Decomposition, EVD; where EVD can only be applied to square matrices, SVD can be used for arbitrary m×n matrices. Both the OPP and SVD are described in [9].
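A sketch of the closed-form OPP solution using an SVD in Eigen, for two point sets that are already in correspondence; names are illustrative and the reflection check follows the standard Kabsch formulation:

```cpp
#include <Eigen/Dense>
#include <vector>

// Estimate the rigid transformation (R, t) that maps src[i] onto dst[i]
// by solving the Orthogonal Procrustes Problem with an SVD.
void estimateRigid(const std::vector<Eigen::Vector3d> &src,
                   const std::vector<Eigen::Vector3d> &dst,
                   Eigen::Matrix3d &R, Eigen::Vector3d &t)
{
    // Centroids of both point sets.
    Eigen::Vector3d c_src = Eigen::Vector3d::Zero(), c_dst = Eigen::Vector3d::Zero();
    for (size_t i = 0; i < src.size(); ++i) { c_src += src[i]; c_dst += dst[i]; }
    c_src /= static_cast<double>(src.size());
    c_dst /= static_cast<double>(dst.size());

    // Cross-covariance of the centred point sets.
    Eigen::Matrix3d H = Eigen::Matrix3d::Zero();
    for (size_t i = 0; i < src.size(); ++i)
        H += (src[i] - c_src) * (dst[i] - c_dst).transpose();

    Eigen::JacobiSVD<Eigen::Matrix3d> svd(H, Eigen::ComputeFullU | Eigen::ComputeFullV);
    R = svd.matrixV() * svd.matrixU().transpose();
    if (R.determinant() < 0) {                 // guard against a reflection
        Eigen::Matrix3d V = svd.matrixV();
        V.col(2) *= -1.0;
        R = V * svd.matrixU().transpose();
    }
    t = c_dst - R * c_src;
}
```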

2.5

Quaternions

A quaternion is a unit-normalized four-dimensional vector that represents a rotation or an orientation in R3. When comparing two orientations, represented by the quaternions p and q, they represent the same orientation if p = ±q. Using quaternions to represent orientations has several benefits according to [9]. First, they are considered more numerically stable since they only use four components instead of the nine of a matrix representation. Compared to other representations, such as the axis-angle or Euler angles, they have no singularities and no ambiguities except for the undetermined sign.

When evaluating the system in this thesis, the orientations of the poses are compared using quaternions. Two orientations represented by the quaternions p and q are considered the same if |p · q| > 1 − ε, where ε is the rotation error threshold and is a small number.

A thorough explanation of quaternions and other representations of orientations is found in [9].
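A sketch of this orientation comparison with Eigen quaternions; the default threshold 0.05 is the value later used in the evaluation (chapter 4):

```cpp
#include <Eigen/Geometry>
#include <cmath>

// Two unit quaternions represent (approximately) the same orientation
// if |p · q| > 1 - eps, which also handles the p = -q ambiguity.
bool sameOrientation(const Eigen::Quaterniond &p, const Eigen::Quaterniond &q,
                     double eps = 0.05)  // 0.05 is the value used in chapter 4
{
    return std::abs(p.dot(q)) > 1.0 - eps;
}
```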


3

Theory

This section presents the theory of the implemented system. The system's inputs are point clouds of the model object as well as a point cloud of a scene in which the object is present. The number of model point clouds varies depending on the object and on the algorithms. In some cases only one point cloud was needed, whereas in others fifteen were used.

The model point clouds are rendered from different viewpoints as described in section 2.1.3. This is because matching between a complete 3D point cloud and a scene depth image does not work well.

The system has an offline training stage where the model point clouds are preprocessed. The online stage starts with a scene point cloud as input and it outputs a 6-DOF pose of the model in the scene.

Depending on whether global or local feature descriptors are used, the system has two different pipelines.

3.1

Local Pipeline

Using local feature descriptors, the pipeline used is seen in figure 3.1. Similar pipelines are used in [17], [6]. For the online stage, the system takes a point cloud of the scene as input and outputs a 6-DOF pose for the model.

Each input model point cloud and the scene point cloud are downsampled by a voxelgrid. From these new point clouds some points are chosen as keypoints. The surrounding structure of each keypoint is described by a feature descriptor, computed from the downsampled point clouds. By matching the scene descriptors with the model descriptors, sets of correspondences between points in the scene and points in the model point clouds are estimated. As many of the correspondences will be incorrect, they are clustered according to a geometric relation. Clusters with more than a certain number of members are considered as hypotheses. Each hypothesis pose is estimated by solving the Orthogonal Procrustes Problem as described in section 2.4.1. The pose is then refined by a registration method. The best hypothesis, in terms of the number of points considered inliers, is chosen.

Figure 3.1: The local pipeline.

3.1.1

Preprocessing

Both the scene and the model point clouds are downsampled using a voxelgrid to speed up the computations. As the scenes of both scenarios are assumed to have a dominant plane, a table in the table scenario and a floor in the warehouse scenario, Plane Extraction is used to find and remove the points on the dominant plane. To further speed up the system, points too distant from the sensor are removed.

Voxelgrid

A Voxelgrid filter creates a 3D voxel grid over the input data and approximates the distribution of points in each voxel by the centroid of all points present in it. This results in one point per voxel, and thus a downsampling that represents the surface better than just approximating an occupied voxel by its center. In this thesis only cubic voxels have been used, so when the voxel size is referred to it means one side; e.g. a voxel size of 5 mm refers to a voxel of 5 mm x 5 mm x 5 mm.

In general the voxelgrid filter removes the organization of the point clouds. This can be solved by setting the removed points to Not-a-Number, NaN, though this led to errors in several of the other algorithms in the system.

Some experiments were performed during the work on this thesis without downsampling, using the whole organized point cloud in the system. This approach turned out to be slower than using the voxelgrid and losing the organization.

Another possible downsampling technique is to lower the sampling rate, e.g. by removing every third pixel, which keeps the organization of the points. A benefit of using the voxelgrid instead is that it makes the sampling of points, in world coordinates, of an object in the scene less dependent on the distance from the sensor. A voxelgrid removes more points in regions close to the sensor, where the sampling is denser, than in regions further away from the sensor.
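A sketch of the voxelgrid downsampling described above, using PCL's VoxelGrid filter; the 5 mm leaf size matches the example voxel size mentioned earlier but is otherwise an arbitrary choice:

```cpp
#include <pcl/point_types.h>
#include <pcl/filters/voxel_grid.h>

// Downsample a point cloud so that each occupied 5 mm voxel is replaced
// by the centroid of the points falling inside it.
pcl::PointCloud<pcl::PointXYZ>::Ptr
downsample(const pcl::PointCloud<pcl::PointXYZ>::ConstPtr &cloud, float voxel = 0.005f)
{
    pcl::VoxelGrid<pcl::PointXYZ> grid;
    grid.setInputCloud(cloud);
    grid.setLeafSize(voxel, voxel, voxel);   // cubic voxels, as in the thesis
    pcl::PointCloud<pcl::PointXYZ>::Ptr filtered(new pcl::PointCloud<pcl::PointXYZ>);
    grid.filter(*filtered);
    return filtered;
}
```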

Plane Extraction

For finding the dominant plane in the scene the RANSAC algorithm is used as follows:

1. Pick 3 random points from the scene point cloud and estimate a plane model using those.

2. Test all points in the scene point cloud against the model. Those that fit according to a distance threshold are considered inliers. If the number of inliers is higher than for the best model so far, this model is considered the best model.

3. Iterate.

The plane model with the most inliers is returned.
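A sketch of this plane extraction using PCL's RANSAC-based segmentation followed by removal of the plane inliers; the 10 mm distance threshold is an assumed value:

```cpp
#include <pcl/point_types.h>
#include <pcl/ModelCoefficients.h>
#include <pcl/segmentation/sac_segmentation.h>
#include <pcl/filters/extract_indices.h>

// Find the dominant plane with RANSAC and return the cloud with its inliers removed.
pcl::PointCloud<pcl::PointXYZ>::Ptr
removeDominantPlane(const pcl::PointCloud<pcl::PointXYZ>::Ptr &cloud)
{
    pcl::ModelCoefficients::Ptr coefficients(new pcl::ModelCoefficients);
    pcl::PointIndices::Ptr inliers(new pcl::PointIndices);

    pcl::SACSegmentation<pcl::PointXYZ> seg;
    seg.setModelType(pcl::SACMODEL_PLANE);
    seg.setMethodType(pcl::SAC_RANSAC);
    seg.setDistanceThreshold(0.01);   // 10 mm inlier threshold (assumed)
    seg.setInputCloud(cloud);
    seg.segment(*inliers, *coefficients);

    pcl::PointCloud<pcl::PointXYZ>::Ptr remaining(new pcl::PointCloud<pcl::PointXYZ>);
    pcl::ExtractIndices<pcl::PointXYZ> extract;
    extract.setInputCloud(cloud);
    extract.setIndices(inliers);
    extract.setNegative(true);        // keep everything except the plane points
    extract.filter(*remaining);
    return remaining;
}
```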

3.1.2

Keypoint Detection

Keypoints are points that will later be described by some sort of feature signature or histogram, a descriptor. They are either found by keypoint detection algorithms that find points in areas rich in structure, or they are picked by sampling the point cloud using a voxelgrid. Note that the point clouds are already downsampled by a voxelgrid at the preprocessing stage; this keypoint sampling is a second sampling. During the work on the thesis, the voxelgrid sampling was faster and gave enough interesting keypoints, and it was therefore used in the evaluation of the system.

3.1.3

Feature Descriptors

For each keypoint in both the scene and the model point clouds, a feature descriptor will be computed to describe the surrounding structure of that point using its neighbouring points. Two local feature descriptors are used in this thesis, the Fast Point Feature Histogram, FPFH, and the Signature of Histograms of OrienTations, SHOT.

The descriptors need to be invariant to rigid transformations, i.e. viewpoint changes as long as the same surface is seen in both the model and the scene. Scale invariance is not needed as the depth sensor is calibrated and the model point clouds are derived from a mesh in correct scale.

Fast Point Feature Histogram

FPFH, presented in [12], is a development of the Point Feature Histogram with reduced computational complexity and almost the same discriminative power. The FPFH for a keypoint p is computed as in (3.1), where ω_i is the distance between p and its neighbour p_i.

FPFH(p) = SPFH(p) + \frac{1}{k} \sum_{i=1}^{k} \frac{1}{\omega_i} \, SPFH(p_i)    (3.1)

SPFH, the Simplified Point Feature Histogram, is binned by three angular features computed between point pairs consisting of p and each of its neighbours inside a local surrounding sphere.

For each point pair p_i and p_j, a Darboux coordinate system (u, v, w) is defined as in (3.2) using their respective normals n_i and n_j, with p_i being the point with the smaller angle between its normal and the line interconnecting the points.

u = n_i
v = \frac{p_j - p_i}{\|p_j - p_i\|} \times u
w = u \times v    (3.2)

The three angular features are computed using (3.3).

\cos \alpha = v \cdot n_j
\cos \phi = u \cdot \frac{p_j - p_i}{\|p_j - p_i\|}
\theta = \arctan(w \cdot n_j, \, u \cdot n_j)    (3.3)

Figure 3.2 shows an illustration of the angular features.


Figure 3.2: Illustration of the angular features.

Each feature creates a histogram using 11 bins over its interval. The final histogram is a concatenation of the three, resulting in a 33-bin histogram.

For the computation of the FPFH for a whole point cloud, the SPFHs are first computed for all points by creating pairs with their neighbours. In a second step, the FPFH of each point is computed using both its own SPFH and the weighted SPFHs of its neighbours, as in (3.1).


Figure 3.3: The FPFH of the red point is a weighted combination of the SPFH of the points inside the red circle. These SPFH are computed from point pairs between these points and their neighbours inside the colored dashed circles.
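A sketch of computing FPFH descriptors at the keypoints with PCL's implementation; the support radius is an assumed value:

```cpp
#include <pcl/point_types.h>
#include <pcl/features/fpfh.h>
#include <pcl/search/kdtree.h>

// Compute 33-bin FPFH descriptors for a set of keypoints, using the full
// (downsampled) cloud and its normals as the search surface.
pcl::PointCloud<pcl::FPFHSignature33>::Ptr
computeFPFH(const pcl::PointCloud<pcl::PointXYZ>::ConstPtr &keypoints,
            const pcl::PointCloud<pcl::PointXYZ>::ConstPtr &surface,
            const pcl::PointCloud<pcl::Normal>::ConstPtr &normals)
{
    pcl::FPFHEstimation<pcl::PointXYZ, pcl::Normal, pcl::FPFHSignature33> fpfh;
    fpfh.setInputCloud(keypoints);
    fpfh.setSearchSurface(surface);
    fpfh.setInputNormals(normals);
    fpfh.setSearchMethod(pcl::search::KdTree<pcl::PointXYZ>::Ptr(
        new pcl::search::KdTree<pcl::PointXYZ>));
    fpfh.setRadiusSearch(0.05);   // 50 mm support radius (assumed)
    pcl::PointCloud<pcl::FPFHSignature33>::Ptr descriptors(
        new pcl::PointCloud<pcl::FPFHSignature33>);
    fpfh.compute(*descriptors);
    return descriptors;
}
```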

Signature of Histograms of Orientations

SHOT, described in [18], divides the 3D space surrounding the keypoint into a spherical grid. In each grid cell, a histogram of orientations is created by computing the angle between the normals of the points in the cell and the keypoint's z-axis in a local Reference Frame, RF.

[7] is a survey on local 3D descriptors and the authors describe SHOT as "... highly descriptive, computationally efficient and robust to noise."

For computing the local RF, SHOT uses the distance-weighted covariance matrix COV(p_i) of the spherical neighbourhood of radius r_frame around the keypoint p_i. COV(p_i) is computed according to (3.4), using the weights calculated in (3.5),

COV(p_i) = \frac{\sum_{|p_j - p_i| < r_{frame}} w_j (p_j - p_i)(p_j - p_i)^T}{\sum_{|p_j - p_i| < r_{frame}} w_j}    (3.4)

where

w_j = \frac{1}{\left| \{ p_k : |p_k - p_j| < r_{density} \} \right|}    (3.5)

and r_{density} is the radius of the neighbourhood used to measure the sampling density around the point.

The three eigenvectors from an EVD of this matrix span the RF. According to [18] there is a sign ambiguity in the EVD, which results in a non-unique RF. Their solution is to reorient each eigenvector in the direction of the majority of the vectors it represents.

The grid is divided into 8 azimuth divisions, 2 elevation divisions and 2 radial divisions around the z-axis of the local RF, that is, the reoriented eigenvector corresponding to the largest eigenvalue. In each grid cell, a histogram is created by computing cos(θ_q) for each point, where θ_q is the angle between the normal n_q of the point and the z-axis of the local RF, computed as cos(θ_q) = z · n_q. Using 11 bins in the histograms, the resulting descriptor length is 8 × 2 × 2 × 11 = 352.

An illustration of the grid is seen in figure 3.4.


Figure 3.4: The SHOT grid, seen from two different viewpoints. The radial divisions are seen in red, the azimuth in blue and turquoise and the elevation in green.

To avoid boundary effects, SHOT uses quadrilinear interpolation. Interpolation is performed both between neighbouring bins in each histogram and between bins with the same index in neighbouring grid cells. Finally, the descriptor is normalized.
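A sketch of computing SHOT descriptors with PCL; the support radius is an assumed value:

```cpp
#include <pcl/point_types.h>
#include <pcl/features/shot.h>

// Compute 352-bin SHOT descriptors at the keypoints.
pcl::PointCloud<pcl::SHOT352>::Ptr
computeSHOT(const pcl::PointCloud<pcl::PointXYZ>::ConstPtr &keypoints,
            const pcl::PointCloud<pcl::PointXYZ>::ConstPtr &surface,
            const pcl::PointCloud<pcl::Normal>::ConstPtr &normals)
{
    pcl::SHOTEstimation<pcl::PointXYZ, pcl::Normal, pcl::SHOT352> shot;
    shot.setInputCloud(keypoints);
    shot.setSearchSurface(surface);
    shot.setInputNormals(normals);
    shot.setRadiusSearch(0.05);   // descriptor support radius in metres (assumed)
    pcl::PointCloud<pcl::SHOT352>::Ptr descriptors(new pcl::PointCloud<pcl::SHOT352>);
    shot.compute(*descriptors);
    return descriptors;
}
```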

3.1.4

Matching

The descriptors of the model point clouds are stored in Kd trees as described in section 2.3. Once the scene descriptors are computed, a nearest neighbour search is performed to find a corresponding model keypoint for each scene keypoint. If the distance between the two descriptors is below a certain threshold, the correspondence is saved using the indices of the scene and the model keypoint. Note that a scene keypoint can have several correspondences among the model keypoints.
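A sketch of this matching step for FPFH descriptors, searching a Kd tree built over the model descriptors; the squared-distance threshold is an assumed value:

```cpp
#include <pcl/point_types.h>
#include <pcl/kdtree/kdtree_flann.h>
#include <pcl/correspondence.h>
#include <cmath>
#include <vector>

// For every scene descriptor, find its nearest model descriptor and keep the pair
// as a correspondence if the squared distance in feature space is small enough.
// index_query refers to the model keypoint and index_match to the scene keypoint,
// following the convention used by PCL's correspondence grouping algorithms.
pcl::CorrespondencesPtr
matchDescriptors(const pcl::PointCloud<pcl::FPFHSignature33>::Ptr &model_descr,
                 const pcl::PointCloud<pcl::FPFHSignature33>::Ptr &scene_descr,
                 float max_sq_dist = 250.0f)   // assumed threshold
{
    pcl::KdTreeFLANN<pcl::FPFHSignature33> tree;
    tree.setInputCloud(model_descr);

    pcl::CorrespondencesPtr corrs(new pcl::Correspondences);
    std::vector<int> index(1);
    std::vector<float> sq_dist(1);
    for (std::size_t i = 0; i < scene_descr->size(); ++i) {
        if (!std::isfinite(scene_descr->at(i).histogram[0]))
            continue;                          // skip invalid descriptors
        if (tree.nearestKSearch(scene_descr->at(i), 1, index, sq_dist) == 1 &&
            sq_dist[0] < max_sq_dist)
            corrs->push_back(pcl::Correspondence(index[0], static_cast<int>(i), sq_dist[0]));
    }
    return corrs;
}
```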

3.1.5

Correspondence grouping

As several of the correspondences may be incorrect, a correspondence grouping algorithm is used to group correspondences that are geometrically consistent into clusters. A cluster bigger than a certain threshold is considered a hypothesis for the model pose. [7] presents several variants of grouping algorithms, from 1-DOF constraints to 6-DOF constraints. According to [17], 6-DOF constraints are desirable but not practically feasible since the voting process is too computationally demanding.


3D Hough Voting

In this system, 3D Hough Voting as presented in [17] is used. It lets correspondences vote in a 3D parameter space; hence the resulting clusters share a 3-DOF pose (translation). Tombari and Di Stefano [17] show that 3D Hough Voting outperforms other existing methods.

The notation used in this section is as follows:

• V^M_{i,G} denotes the i-th vector associated with object M, expressed in reference frame G.

• R^M_{GL} denotes the transformation from reference frame G to reference frame L for object M.

Offline, some quantities can be precomputed from each model point cloud:

1. Compute the centroid C^M of the point cloud.

2. Compute a local RF for all keypoints using an EVD of the distance-weighted covariance matrix in (3.4). The eigenvectors λ_1, λ_2, λ_3 define the RF.

3. Compute a vector from each keypoint p^M_i to C^M in the global RF as V^M_{i,G} = C^M − p^M_i.

4. Transform V^M_{i,G} to the local RF as V^M_{i,L} = R^M_{GL} · V^M_{i,G}, where · is the matrix product and R^M_{GL} is a rotation matrix whose rows are the unit vectors of the local RF of the point p^M_i.

Online, given the scene point cloud, the following is done for each scene keypoint p^S_j that has a correspondence match p^M_i in a model point cloud:

1. Compute a local RF for p^S_j in the same way as in the offline stage. The correspondence gives V^S_{j,L} = V^M_{i,L}.

2. Transform V^S_{j,L} to the global RF of the scene as V^S_{j,G} = R^S_{LG} · V^S_{j,L} + p^S_j, where R^S_{LG} is a rotation matrix whose columns are the unit vectors of the local RF of p^S_j.

The resulting vector V^S_{j,G} points to a potential centroid of the model object in 3D space. A sufficiently high number of votes in a bin of the 3D voting space results in a hypothesis. The size of the bins and the minimum number of votes for a hypothesis are input parameters. Figure 3.5 illustrates an example.
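PCL ships an implementation of this voting scheme, Hough3DGrouping. The sketch below follows the structure of the PCL correspondence-grouping example; the local reference frames are computed with the BOARD estimator rather than the EVD-based RF of (3.4), and the bin size, vote threshold and RF radius are assumed values:

```cpp
#include <pcl/point_types.h>
#include <pcl/features/board.h>
#include <pcl/recognition/cg/hough_3d.h>

typedef pcl::PointXYZ PointT;
typedef pcl::PointCloud<PointT> Cloud;
typedef pcl::PointCloud<pcl::Normal> Normals;
typedef pcl::PointCloud<pcl::ReferenceFrame> RFCloud;

// Compute a local reference frame for every keypoint (BOARD estimator used here;
// the thesis instead uses the EVD-based RF of equation (3.4)).
static RFCloud::Ptr computeRF(const Cloud::Ptr &keypoints, const Cloud::Ptr &surface,
                              const Normals::Ptr &normals)
{
    pcl::BOARDLocalReferenceFrameEstimation<PointT, pcl::Normal, pcl::ReferenceFrame> est;
    est.setRadiusSearch(0.02);           // assumed RF support radius (metres)
    est.setInputCloud(keypoints);
    est.setInputNormals(normals);
    est.setSearchSurface(surface);
    RFCloud::Ptr rf(new RFCloud);
    est.compute(*rf);
    return rf;
}

// Cluster model-scene correspondences with 3D Hough voting; each sufficiently
// supported cluster yields one 6-DOF pose hypothesis.
std::vector<Eigen::Matrix4f, Eigen::aligned_allocator<Eigen::Matrix4f> >
houghGrouping(const Cloud::Ptr &model, const Normals::Ptr &model_normals,
              const Cloud::Ptr &model_keypoints,
              const Cloud::Ptr &scene, const Normals::Ptr &scene_normals,
              const Cloud::Ptr &scene_keypoints,
              const pcl::CorrespondencesConstPtr &correspondences)
{
    RFCloud::Ptr model_rf = computeRF(model_keypoints, model, model_normals);
    RFCloud::Ptr scene_rf = computeRF(scene_keypoints, scene, scene_normals);

    pcl::Hough3DGrouping<PointT, PointT> hough;
    hough.setHoughBinSize(0.05);         // 50 mm bins (table 4.2 uses 50-100 mm)
    hough.setHoughThreshold(5.0);        // at least five votes per cluster
    hough.setInputCloud(model_keypoints);
    hough.setInputRf(model_rf);
    hough.setSceneCloud(scene_keypoints);
    hough.setSceneRf(scene_rf);
    hough.setModelSceneCorrespondences(correspondences);

    std::vector<Eigen::Matrix4f, Eigen::aligned_allocator<Eigen::Matrix4f> > poses;
    std::vector<pcl::Correspondences> clustered;
    hough.recognize(poses, clustered);   // one rigid transform per hypothesis
    return poses;
}
```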

3.1.6

Initial Pose Estimation

For each hypothesis, a 6-DOF pose is computed by solving the OPP as described in section 2.4.1 using all its correspondences.


Figure 3.5: 3D Hough Voting. Two correct matches, in green, vote for the blue bin, and two incorrect matches, in red, vote for the red bins.

3.1.7

Pose Refinement

The art of aligning two point clouds is called registration. In this pipeline it is used to better align the model point cloud onto the scene after the coarser initial pose estimation, which is often not accurate enough.

Iterative Closest Point

Iterative Closest Point, ICP, as originally described in [5], is a widely used algorithm for aligning two point clouds given initial estimates of their relative poses. The algorithm iteratively finds matches between points in the two point clouds and estimates a rigid transformation aligning one point cloud with the other by minimizing an error metric. The ICP algorithm is shown in Algorithm 1.

while iteration < maxIterations do
    1. For each point in the source point cloud, find the closest point in the target point cloud. If the distance is less than a certain threshold, mark the pair as corresponding points;
    2. Estimate a rigid transformation by minimizing the sum of squared distances between all corresponding points, using the closed-form solution described in section 2.4.1;
    3. Transform the source point cloud using the transformation;
    if squareDistanceError < distanceEpsilon then
        break;
    end
    if transformationDistance < transformationEpsilon then
        break;
    end
end

Algorithm 1: The original ICP.

In this system, the source is the model point cloud and the target is the scene point cloud. To speed up the algorithm, only a random subset of the points is used. Figure 3.6 illustrates the ICP.
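A sketch of refining a pose hypothesis with PCL's ICP; apart from the 40-iteration cap mentioned in the text, the thresholds are assumed values:

```cpp
#include <pcl/point_types.h>
#include <pcl/registration/icp.h>

// Refine an initial 6-DOF pose by aligning the model (source) to the scene (target).
Eigen::Matrix4f refinePose(const pcl::PointCloud<pcl::PointXYZ>::Ptr &model,
                           const pcl::PointCloud<pcl::PointXYZ>::Ptr &scene,
                           const Eigen::Matrix4f &initial_pose)
{
    pcl::IterativeClosestPoint<pcl::PointXYZ, pcl::PointXYZ> icp;
    icp.setInputSource(model);
    icp.setInputTarget(scene);
    icp.setMaximumIterations(40);             // cap used in the thesis
    icp.setMaxCorrespondenceDistance(0.05);   // 50 mm matching threshold (assumed)
    icp.setTransformationEpsilon(1e-6);       // stop when the transform barely changes (assumed)
    icp.setEuclideanFitnessEpsilon(1e-6);     // stop when the total error barely changes (assumed)

    pcl::PointCloud<pcl::PointXYZ> aligned;
    icp.align(aligned, initial_pose);         // start from the initial pose estimate
    return icp.getFinalTransformation();      // refined model-to-scene pose
}
```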


ICP variants

Many variants of ICP have been proposed; [10] gives an overview of different variants of ICP.

According to [10], one variant that has been shown to both converge in fewer iterations and be more accurate uses a point-to-plane error metric. This is computed as a sum of squared distances between each source point and the tangent plane of its corresponding point. Another variant is to solve the least-squares minimization with an iterative optimizer such as the Levenberg-Marquardt method. In this thesis, the original ICP, an ICP using Levenberg-Marquardt and the point-to-plane ICP were all used.

Generalized ICP, GICP, as first presented in [14], uses another variant in the estimation of the rigid transform: a probabilistic plane-to-plane approach. Assuming that the measured points are drawn from normal distributions around true points with perfect correspondences in the other point cloud, Maximum Likelihood Estimation, MLE, is used to compute the transformation. The normal of each point is used to compute its covariance matrix, so that each point is considered to have a high probability of lying in the plane of its covariance matrix. There is also a parameter ε that represents the uncertainty along the normal in the covariance matrix. An example covariance matrix with the normal as the first axis is seen in (3.6).

\begin{pmatrix} \epsilon & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}    (3.6)

As seen in figure 3.7, points on the green surface with vertical covariance matrices will not match with the closest point on the red surface, as they would in the original ICP.

Figure 3.7: An example of the GICP plane-to-plane matching. Only the points on the two surfaces with similar covariance matrices will match.

The stopping criteria for the ICP are that the change of the transformation or of the total error between two consecutive iterations is considered too small, or that the algorithm reaches the maximum number of iterations, 40.
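The variants discussed above are also available in PCL; a sketch of how the point-to-plane and GICP refinements could be instantiated (the Levenberg-Marquardt variant is available as pcl::IterativeClosestPointNonLinear):

```cpp
#include <pcl/point_types.h>
#include <pcl/registration/icp.h>       // point-to-point and point-to-plane ICP
#include <pcl/registration/gicp.h>      // Generalized ICP

// Point-to-plane ICP: the clouds must carry normals, since the error is
// measured against the tangent plane of each corresponding target point.
void pointToPlaneICP(const pcl::PointCloud<pcl::PointNormal>::Ptr &model,
                     const pcl::PointCloud<pcl::PointNormal>::Ptr &scene,
                     pcl::PointCloud<pcl::PointNormal> &aligned)
{
    pcl::IterativeClosestPointWithNormals<pcl::PointNormal, pcl::PointNormal> icp;
    icp.setInputSource(model);
    icp.setInputTarget(scene);
    icp.setMaximumIterations(40);
    icp.align(aligned);
}

// Generalized ICP: the probabilistic plane-to-plane matching described above.
void generalizedICP(const pcl::PointCloud<pcl::PointXYZ>::Ptr &model,
                    const pcl::PointCloud<pcl::PointXYZ>::Ptr &scene,
                    pcl::PointCloud<pcl::PointXYZ> &aligned)
{
    pcl::GeneralizedIterativeClosestPoint<pcl::PointXYZ, pcl::PointXYZ> gicp;
    gicp.setInputSource(model);
    gicp.setInputTarget(scene);
    gicp.setMaximumIterations(40);
    gicp.align(aligned);
}
```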

Three-Dimensional Normal-Distributions Transform

The Three-Dimensional Normal-Distributions Transform, 3D-NDT, presented in [8], represents the target point cloud by a 3D grid of normal distributions. This results in a smooth representation of the point cloud with piecewise continuous derivatives. The algorithm then maximizes the likelihood that the points of the source point cloud lie on the target surface, using Newton's method. Magnusson [8] reports 3D-NDT to be quicker, more accurate and more robust than ICP.

3.1.8

Choosing Final Pose

To choose the correct pose for the model among the hypotheses, the one with the largest number of inliers is chosen. An inlier is a point in the model point cloud whose closest point in the scene point cloud is less than a certain distance threshold away.

3.2

Global Pipeline

Using global feature descriptors results in a different pipeline, as seen in figure 3.8. The inputs and output are the same as for the local pipeline.

Global descriptors describe the model and different clusters of the scene with one or a few descriptors. The clusters are created by clustering the scene point cloud and can be seen as standalone blobs in 3D space. In the same way as in the local pipeline, the system tries to find the closest model descriptor for each of the clusters in the scene. Each cluster gets a number of matches, which results in a number of hypotheses. As in the local pipeline, refinement registration is used to better align the point clouds of the hypotheses, and the best one is chosen by the number of inliers, see section 3.1.

Figure 3.8: The global pipeline.

3.2.1

Pre-processing and Segmentation

Both the scene and the model point clouds are downsampled using the voxelgrid filter described in section 3.1.1. Also in this pipeline, points too distant from the sensor and those situated on the dominant plane are removed, as described in section 3.1.1.


Euclidean Cluster Extraction

Euclidean Cluster Extraction starts a cluster with a random point in the scene point cloud. All neighbouring points within a distance threshold from the point are added to the cluster. This then continues for the points in the cluster until no more points can be added. At that point a new cluster is started and the above is repeated. All clusters sufficiently large are kept.

An example of a depth image of a scene and the result after plane removal and clustering is seen in figure 3.9.
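A sketch of this clustering with PCL's Euclidean cluster extraction; the tolerance and size limits are assumed values:

```cpp
#include <pcl/point_types.h>
#include <pcl/segmentation/extract_clusters.h>
#include <pcl/search/kdtree.h>
#include <vector>

// Split the remaining scene points into Euclidean clusters ("blobs").
std::vector<pcl::PointIndices>
clusterScene(const pcl::PointCloud<pcl::PointXYZ>::ConstPtr &cloud)
{
    pcl::search::KdTree<pcl::PointXYZ>::Ptr tree(new pcl::search::KdTree<pcl::PointXYZ>);
    tree->setInputCloud(cloud);

    std::vector<pcl::PointIndices> clusters;
    pcl::EuclideanClusterExtraction<pcl::PointXYZ> ec;
    ec.setClusterTolerance(0.02);    // 20 mm neighbour distance (assumed)
    ec.setMinClusterSize(100);       // discard clusters that are too small (assumed)
    ec.setMaxClusterSize(100000);
    ec.setSearchMethod(tree);
    ec.setInputCloud(cloud);
    ec.extract(clusters);            // one PointIndices entry per cluster
    return clusters;
}
```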

3.2.2

Descriptors

In this thesis three global descriptors have been used: the Clustered Viewpoint Feature Histogram, CVFH, the Ensemble of Shape Functions, ESF, and the proposed Clustered Ensemble of Shape Functions, CESF. CVFH and CESF perform some clustering in addition to calculating a descriptor histogram.

Clustered Viewpoint Feature Histogram

CVFH, presented in [1] is a global descriptor that is a further development of the Viewpoint Feature Histogram, VFH. VFH is inspired by the local descriptor Point Feature Histogram but is computed for all points in a cluster instead of in a local area.

The VFH describes an object's partial view by histograms of angles between the centroid of all surface points and their normals. Let p_c and n_c be the centroid of all points and the centroid of their normals, respectively. A Darboux coordinate system (u_i, v_i, w_i) is defined for each point pair p_i and p_c as in (3.2).

From these coordinates, the normal angular deviations can be calculated as in (3.7). Note the similarity between the CVFH and the SPFH: the CVFH computes the SPFH between each point and the centroid, as illustrated in figure 3.10, as well as β_i and the SDC.

\cos \alpha_i = v_i \cdot n_i
\cos \beta_i = n_i \cdot \frac{p_c}{\|p_c\|}
\cos \phi_i = u_i \cdot \frac{p_i - p_c}{\|p_i - p_c\|}
\theta_i = \mathrm{atan2}(w_i \cdot n_i, \, u_i \cdot n_i)
SDC = \frac{(p_c - p_i)^2}{\max_i \left( (p_c - p_i)^2 \right)}    (3.7)

The CVFH descriptor is a histogram constituted by 45 bins each for \cos \alpha_i, \cos \phi_i, \theta_i and SDC, and 128 bins for \cos \beta_i, summing up to 308 dimensions. SDC is the Shape Distribution Component, which can distinguish surfaces with similar normal distributions but different point distributions.

Figure 3.9: (a) A depth image of a scene. (b) The result after plane removal and clustering.


Figure 3.10: The CVFH computes angular features of point pairs between the centroid of all points and each point in a cluster.

To avoid scale invariance, each bin in the CVFH counts the absolute number of points. [1] shows this to be more robust to missing parts of the object, as missing parts only influence local parts of the histogram.

The CVFH histogram is calculated for each stable region in the input point cloud. Stable regions are identified by a smooth region-growing algorithm after removing points with high curvature caused by noise, sharp edges or non-planar patches. Each new cluster is initialized with a random point p_i with normal n_i. A point p_j with normal n_j is added to the cluster C_k if the constraint in (3.8) is satisfied.

C_k := \{ p_j : \|p_i - p_j\| < t_d \;\wedge\; n_i \cdot n_j > t_n \}    (3.8)

Here t_d and t_n are distance and angular thresholds, respectively. Only clusters with more than a certain number of points are considered. Using this clustering, only one region needs to be visible in the scene or model view.

If the clustering creates several clusters, one histogram is computed for each cluster. This results in several histograms for either a model point cloud or a scene segment.
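A sketch of computing CVFH histograms for one segmented cluster with PCL's implementation; the angle and curvature thresholds are assumed values:

```cpp
#include <pcl/point_types.h>
#include <pcl/features/cvfh.h>
#include <pcl/search/kdtree.h>

// Compute one or more 308-bin CVFH histograms for a segmented cluster.
pcl::PointCloud<pcl::VFHSignature308>::Ptr
computeCVFH(const pcl::PointCloud<pcl::PointXYZ>::ConstPtr &cluster,
            const pcl::PointCloud<pcl::Normal>::ConstPtr &normals)
{
    pcl::CVFHEstimation<pcl::PointXYZ, pcl::Normal, pcl::VFHSignature308> cvfh;
    cvfh.setInputCloud(cluster);
    cvfh.setInputNormals(normals);
    cvfh.setSearchMethod(pcl::search::KdTree<pcl::PointXYZ>::Ptr(
        new pcl::search::KdTree<pcl::PointXYZ>));
    cvfh.setEPSAngleThreshold(0.087f);  // ~5 degrees, angular threshold t_n (assumed)
    cvfh.setCurvatureThreshold(1.0f);   // curvature cut-off (assumed)
    cvfh.setNormalizeBins(false);       // keep absolute point counts, as described above
    pcl::PointCloud<pcl::VFHSignature308>::Ptr hist(new pcl::PointCloud<pcl::VFHSignature308>);
    cvfh.compute(*hist);                // one histogram per stable region
    return hist;
}
```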

Ensemble of Shape Functions

The Ensemble of Shape Functions, ESF, first presented in [20], is a global descriptor based on three shape functions describing areas, distances and angles of the points in the point cloud. This descriptor does not need any preprocessing such as normal estimation.


ESF uses a voxelgrid filter that approximates the point cloud with 64x64x64 voxels, as described in section 3.1.1. It then iterates, using the number of points as the number of iterations. In every iteration three random points are chosen and used to update ten histograms. The histograms are of four types:

• D2: Computes the distances between the point pairs of the three chosen points. The lines connecting the pairs are then traced through the voxelgrid and classified as on the surface, off the surface, or a mixed combination of both. Lines classified as off the surface only have their endpoints on the surface. Each line is binned into the On, Off or Mixed histogram according to its length.

• D2-ratio: Lines classified as mixed are also binned into another histogram, depending on the ratio of voxels on and off the surface along the line.

• D3: Computes the area spanned by the three points and bins it according to its size. In the same way as for D2, the area is classified as either On or Off the surface, or a Mixed combination of both.

• A3: Computes the angle formed by two of the lines and bins it depending on its size. Using the surface classification of the line opposite the angle, the angle is binned as On, Off or Mixed in the same way as above.

This creates three histograms each for D2, D3 and A3, and one D2-ratio histogram, each with 64 bins, resulting in a concatenated histogram of 10 × 64 = 640 dimensions. Figure 3.11 shows an illustration of the On, Off and Mixed classification.
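A sketch of computing an ESF histogram for a cluster or model view with PCL; note that ESF needs no normals:

```cpp
#include <pcl/point_types.h>
#include <pcl/features/esf.h>

// Compute the 640-bin ESF histogram of a cluster or model view.
pcl::PointCloud<pcl::ESFSignature640>::Ptr
computeESF(const pcl::PointCloud<pcl::PointXYZ>::ConstPtr &cluster)
{
    pcl::ESFEstimation<pcl::PointXYZ, pcl::ESFSignature640> esf;
    esf.setInputCloud(cluster);
    pcl::PointCloud<pcl::ESFSignature640>::Ptr hist(new pcl::PointCloud<pcl::ESFSignature640>);
    esf.compute(*hist);   // a single global descriptor for the whole cluster
    return hist;
}
```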

Clustered Ensemble of Shape Functions

This thesis presents a novel global descriptor, named Clustered Ensemble of Shape Functions, CESF. In the same way as the CVFH is a clustered variant of the VFH, the CESF is a clustered variant of the ESF.

The CESF starts with the stable-region clustering and the removal of points with high curvature described in section 3.2.2. It then computes an ESF histogram for each sufficiently large cluster.

3.2.3

Matching

For each cluster histogram in each segment of the scene, a nearest neighbour search is performed among the model histograms. This results in a number of hypotheses for histogram matches. The top hypotheses, ranked by the Euclidean distance between histograms, are chosen for continued evaluation and pose estimation.

3.2.4

Initial Pose Estimation

Figure 3.11: An ESF toy example of points on the side of a cylinder. Three randomly chosen points are marked in bold black, with D2 lines in orange, the D3 area in blue and the A3 angle in green.

For each of the best hypotheses from the matching step, a 6-DOF pose is estimated. The 3-DOF translation is computed from the difference of the centroids of the two clusters. By aligning the average normals of the two clusters, a 2-DOF rotation is estimated, but this leaves the last rotation, around the camera axis, unknown. The global descriptors presented here are identical regardless of the camera roll, so the object's pose can only be determined up to a 1-DOF ambiguity. For the last rotation, [1] uses a Camera Roll Histogram, CRH, which they show to work very satisfactorily.

Camera Roll Histogram

Each point's normal is projected onto the plane orthogonal to the vector from the camera center to the centroid p_c of the given cluster.

The histogram is binned by the angles of the projected normals relative to the camera's up vector. The histogram uses 90 bins, resulting in a resolution of 4 degrees.

Matching two histograms, one for the model cluster and one for the scene cluster, is a correlation maximization problem. [1] solves this by computing the Discrete Fourier Transform of both histograms, multiplying the complex coefficients of the model histogram with the complex conjugate coefficients of the scene histogram, and computing the cross power spectrum R. Peaks in this spectrum are rotation angles that align the two histograms. Multiple distinct peaks result in multiple new hypotheses for the 6-DOF pose.


4

Experiments and Results

This chapter presents the experiments and the results of the evaluations. The evaluation was performed both for the system as a whole, while varying the descriptors, and separately for the registration. When not stated otherwise, normally distributed noise with a standard deviation of 1 mm was added to the synthetic scenes. When the ground truth was known, the maximum distance error for a convergence to be classified as correct was 100 mm for the warehouse scenario and 50 mm for the table scenario. The maximum rotation error, ε in section 2.5, was set to 0.05 for both scenarios.

4.1

Experiments

This section presents the 3D models and how the evaluation was performed.

4.1.1

3D models

The 3D models used in this thesis were either found on the Internet, provided by Fotonic, or created by the author of this thesis. The 3D models that the system was to recognize are presented below.

• Statue of Liberty: found at
https://sketchfab.com/models/6626abe1f3e8469a9d8f4b74d8aa2a71 on the 9th of June 2016, created by the user jerryfisher. The model visualized in Microsoft 3D Builder is seen in figure 4.1b. The model measures 1500 mm x 450 mm x 380 mm in the warehouse scenario and 12.6 mm x 3.8 mm x 3.2 mm in the table scenario.

• LS5 : An Optronic Lidar sensor, 3D model created by Optronic. The LS5 is situated to the left in figure 4.1a.


• Milk Carton Box: 3D model created by the author to match the geometry of a 1L Milk Carton Box. The Milk Carton Box is situated in the middle of figure 4.1a.

Other 3D models were used for creating the synthetic depth images. These can be found in Appendix A.

4.1.2

Evaluation

The system has been evaluated both using real depth images taken with the two depth sensors and using synthetic depth images created from 3D models with the technique described in section 2.1.3. For the synthetic depth images the poses of the 3D models were known, hence there was a ground truth to compare with. Normally distributed noise was added to the synthetic depth images to simulate real depth images.

Warehouse scenario

Due to problems with finding 3D models for large real objects, the first of the two scenarios described in section 1.3 was only evaluated using synthetic depth images. The models used for the recognition were the LS5 and the Statue of Liberty.

30 images were created with a mix of fully visible objects and occluded ones. A plane was added behind the objects to simulate a floor or a wall. Two examples of synthetic depth images with added noise with a standard deviation of 1 mm are seen in figure 4.4. The other images are similar to these.

The evaluation of the warehouse scenario was performed in terms of:

• Robustness: The fraction of correctly converged solutions, defined as having a distance error and an angle error below a certain threshold.

• Computation time: The total time the system used for the computations.

Table scenario

The table scenario was evaluated both using synthetic depth images as well as using real depth images.

For the real scenes, two different models were used for recognition, the LS5 and the Milk Carton Box. The setup can be seen in figure 4.3a, with a depth image of the scene taken by the Microsoft Kinect One in figure 4.3b. For each of the two depth sensors, 60 images were taken where the two models were visible, and 20 images where a small amount of occlusion was introduced.

The evaluation of the real scenes was also done in terms of robustness and computation time. But as there was no ground truth for these images, the robustness was estimated by visual inspection of the model aligned onto the scene depth image.


(a) The LS5 and the Milk Carton Box.

(b) The Statue of Liberty.


Figure 4.2: A correctly converged model in purple.

An example of this is seen in figure 4.2, where the LS5, in purple, has been aligned onto the RGB image of the scene. An RGB sensor is mounted on both the Astra and the Kinect One.

Using the synthetic depth images, the models recognized by the system were the LS5 and the Statue of Liberty. 30 images were created with a mix of fully visible objects and occluded ones. Also here a plane was added to simulate the table. Two examples of synthetic depth images with added normally distributed noise with a standard deviation of 1 mm are seen in figure 4.5. The Statue of Liberty is upside down to the left of the Boletus Mushroom in figure 4.5a and down to the left in figure 4.5b; the LS5 is slightly tilted in the middle of both images.

The evaluation of the synthetic depth images was performed in the same terms as for the synthetic depth images in the warehouse scenario.


(a) The setup of the scene.

(b) A depth image of the same scene using the Microsoft Kinect One.



Figure 4.4: Synthetic depth images of the warehouse scenario with added noise.


Figure 4.5: Synthetic depth images of the table scenario with added noise.


Registration

The registration methods were evaluated on their own on the Statue of Liberty, using the synthetic depth images of the warehouse scenario. Due to time limitations, the registration methods were not evaluated using other scenarios or objects. Only those depth images for which at least one of the registration methods could align the model to a correct solution were used. The evaluation was done in terms of:

• Robustness: The fraction of correctly converged solutions, defined as having a distance error and an angle error below a certain threshold.

• Accuracy: Standard deviation of the distance error and of the angle error from the correctly converged solutions.

• Computation time: The total time the system used for the computations.

4.2

Results

This section presents the results of the experiments presented in section 4.1.

4.2.1

The System

This section presents the results of the evaluation of the object recognition system while varying descriptors (and with these the pipeline). The recognition was evaluated in terms of:

• Robustness: Fraction of correctly converged solutions.

• Computation time: The average computation time the system used for the computations.

When not stated otherwise, the global pipeline passed on the top five hypotheses to the pose refinement step, and at least five correspondence votes were needed for the 3D Hough Voting to consider a cluster.

During the evaluations, the point-to-plane ICP was noticed to work most satisfactorily for the warehouse scenario and was therefore used in these scenes. The same applied to the point-to-point ICP in the table scenario.

The warehouse scenario was only evaluated using the synthetic depth images, with the LS5 and the Statue of Liberty. The table scenario was evaluated using both synthetic depth images (the LS5 and the Statue of Liberty) and real depth images (the LS5 and the Milk Carton Box). The real depth images were taken with the Astra and Kinect One depth sensors.

LS5

Table 4.1 presents the results for different scenarios for the LS5. Each scenario has two or three rows. The first presents the robustness and the last the average computation time.

Twelve model point clouds rendered from different viewpoints were used for the synthetic scenes and eight for the real scenes. The four extra point clouds for the synthetic scenes were rendered from viewpoints situated below the model; since the LS5 was standing upright in all images of the real scenes, these were not needed there.

As the LS5 is more or less symmetrical around its own vertical axis, it was not possible to determine which side was facing the sensor. A correct result therefore has an ambiguity in one degree of freedom.

Scenario | Measurement | CVFH | ESF | CESF | SHOT | FPFH
Warehouse synt. | Robustness | 35% | 0% | 20% | 90% | 100%
Warehouse synt. | Comp. time (s) | 0.61 | 0.96 | 0.80 | 3.13 | 0.72
Table synt. | Robustness | 43% | 0% | 37% | 60% | 67%
Table synt. | Comp. time (s) | 0.19 | 0.60 | 0.43 | 1.95 | 1.22
Table Astra | Robustness | 73% | 0% | 58% | 10% | 35%
Table Astra Occ. | Robustness | 55% | 0% | 55% | 10% | 35%
Table Astra | Comp. time (s) | 0.34 | 0.59 | 0.56 | 5.68 | 3.10
Table Kinect One | Robustness | 32% | 0% | 37% | 3% | 40%
Table Kinect One Occ. | Robustness | 25% | 0% | 40% | 0% | 30%
Table Kinect One | Comp. time (s) | 0.36 | 0.66 | 0.70 | 7.6 | 3.51

Table 4.1: Recognition of the LS5 using multiple model point clouds.

The parameters that changed between the scenarios are presented in table 4.2. In this table, Views means the number of model point clouds, Downsampl. and Keypoints the voxel sizes of the respective voxelgrids, and Hough cluster the size of the bins in the 3D Hough Voting. These parameters were used for all the LS5 evaluations if not stated otherwise.

Scenario | Views | Downsampl. | Keypoints | Hough cluster
Warehouse synthetic | 12 | 20 mm | 50 mm | 100 mm
Table synthetic | 12 | 10 mm | 15 mm | 50 mm
Table Astra | 8 | 10 mm | 10 mm | 50 mm
Table Kinect One | 8 | 10 mm | 10 mm | 50 mm

Table 4.2: Chosen parameters for table 4.1.

Evaluations of the system using only one model point cloud of the LS5 were also performed, and are presented in table 4.3. The single point cloud was rendered from the side, so neither the top nor the bottom was captured. Otherwise the same parameters as in table 4.2 were used.


Scenario | Measurement | CVFH | ESF | CESF | SHOT | FPFH
Warehouse synt. | Robustness | 20% | 0% | 15% | 75% | 95%
Warehouse synt. | Comp. time (s) | 0.77 | 0.96 | 0.88 | 0.70 | 0.43
Table synt. | Robustness | 73% | 0% | 77% | 33% | 27%
Table synt. | Comp. time (s) | 0.23 | 0.59 | 0.46 | 0.23 | 0.19
Table Astra | Robustness | 76% | 2% | 69% | 10% | 29%
Table Astra Occ. | Robustness | 80% | 15% | 80% | 10% | 35%
Table Astra | Comp. time (s) | 0.30 | 0.39 | 0.60 | 1.51 | 0.50
Table Kinect One | Robustness | 78% | 0% | 78% | 3% | 18%
Table Kinect One Occ. | Robustness | 70% | 0% | 70% | 0% | 10%
Table Kinect One | Comp. time (s) | 0.40 | 0.74 | 0.72 | 3.54 | 2.34

Table 4.3: Recognition of the LS5 using a single model point cloud.

The LS5 was standing upright in all images in the real scenes, hence its camera roll was close to zero. This was exploited by skipping the Camera Roll Histogram for the global descriptors, as the model point cloud already had an upright orientation. The result is presented in table 4.4, still using only one model point cloud. This experiment could not be performed with the synthetic depth images, as the LS5 is not standing upright in these.

Scenario                      Measurement       CVFH   ESF    CESF
Table Astra                   Robustness        100%   3%     95%
Table Astra Occlusion         Robustness        95%    0%     95%
                              Comp. time (s)    0.32   0.70   0.60
Table Kinect One              Robustness        98%    2%     98%
Table Kinect One Occlusion    Robustness        90%    0%     85%
                              Comp. time (s)    0.28   0.72   0.62

Table 4.4: Recognition of the LS5 skipping the CRH.

Figure 4.6 shows the recognition when varying the number of hypotheses passed on to the pose refinement step in the global pipeline in the real scenes. In this plot, the occlusion scenes and the normal ones are combined. Also here the CRH was left out and only one model point cloud was used. As the ESF had so few correct alignments it was left out. Due to time limitations this experiment was not performed with the synthetic scenes.
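For clarity, varying the number of hypotheses amounts to refining only the n best-matching candidates and keeping the best fit. The sketch below is a simplified stand-in for this step; the hypothesis list, the refinement call and the fitness threshold are placeholders, not the actual interfaces of the system.

    def recognize(scene_cluster, hypotheses, n_hypotheses, refine, max_fitness):
        """Refine the n best-matching hypotheses and keep the best result.

        hypotheses  : list of (descriptor_distance, initial_pose) pairs,
                      e.g. one per matched model view (placeholder format).
        refine      : pose refinement function such as ICP, returning
                      (refined_pose, fitness), where lower fitness is better.
        max_fitness : reject the detection if even the best fit is worse.
        """
        best_pose, best_fitness = None, float("inf")
        # Keep only the n hypotheses with the smallest descriptor distance.
        ranked = sorted(hypotheses, key=lambda h: h[0])[:n_hypotheses]
        for _, initial_pose in ranked:
            pose, fitness = refine(scene_cluster, initial_pose)
            if fitness < best_fitness:
                best_pose, best_fitness = pose, fitness
        return best_pose if best_fitness < max_fitness else None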

Figure 4.7 presents the recognition of the LS5 with different amounts of added noise in the synthetic scenes. The global descriptors used one model point cloud and the local descriptors used twelve, as the pipelines performed best with these numbers. Also here the ESF was left out since it had so few correct alignments.
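As a minimal sketch of the noise experiments, zero-mean normally distributed noise of a given standard deviation can be added to the scene points as below. Whether the noise should perturb all three coordinates or only the depth direction is an assumption here; this variant perturbs all coordinates independently.

    import numpy as np

    def add_gaussian_noise(points, sigma_mm, seed=0):
        """Perturb an N x 3 point cloud with zero-mean Gaussian noise.

        sigma_mm: standard deviation of the noise in millimetres.
        Note: this sketch perturbs x, y and z independently; perturbing
        only the depth (viewing) direction would be an alternative.
        """
        rng = np.random.default_rng(seed)
        return points + rng.normal(0.0, sigma_mm, size=points.shape)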

Milk Carton Box

The evaluation of the Milk Carton Box is seen in table 4.5 with its corresponding parameters in table 4.6.


Figure 4.6: Recognition of the LS5 when varying the number of hypotheses. (Two panels plot the fraction of convergences and the computation time (s) against the number of hypotheses, 1-10, for CVFH Astra, CESF Astra, CVFH Kinect One and CESF Kinect One.)

When using the CESF and CVFH, each side of the Milk Carton Box became a cluster. These sides converged with arbitrary sides of the Milk Carton Box in the scene, hence it was not possible to determine the correct pose. Instead, an alignment was classified as correct if the sides of the model point cloud of the Milk Carton Box were aligned with the sides of the correct object in the scene. An example of a convergence classified as correct can be seen in figure 4.8. To be able to compare the pipelines, the same assessment was used for the local descriptors: if they, for example, aligned the model point cloud upside-down at the correct scene cluster, as in figure 4.9, that was considered correct since the sides were aligned correctly.

Scenario                 Measurement       CVFH   ESF    CESF   SHOT   FPFH
Table Astra              Robustness        31%    0%     68%    0%     16%
Table Astra Occ.         Robustness        30%    0%     60%    0%     0%
                         Comp. time (s)    0.43   0.72   0.74   6.16   1.73
Table Kinect One         Robustness        28%    0%     77%    0%     5%
Table Kinect One Occ.    Robustness        35%    0%     40%    0%     5%
                         Comp. time (s)    1.67   1.56   1.86   8.23   11.42

Table 4.5: Recognition of the Milk Carton Box.


Figure 4.7: Robustness of the LS5 with added noise. (Two panels, First Scenario and Second Scenario: fraction of convergences versus standard deviation of noise (mm), 0-25 mm, for CVFH, CESF, SHOT and FPFH.)


Scenario               Views   Downsampl.   Keypoints   Hough cluster
Table Astra            15      10 mm        15 mm       100 mm
Table Kinect One       15      10 mm        15 mm       100 mm

Table 4.6: Chosen parameters for table 4.5.

Statue of Liberty

Table 4.7 presents the results of the recognition of the Statue of Liberty, with the corresponding chosen parameters in table 4.8. Figure 4.10a shows a correct convergence of the Statue of Liberty and figure 4.10b an incorrect one.

Scenario                 Measurement       CVFH   ESF    CESF   SHOT   FPFH
Warehouse synt.          Robustness        0%     0%     0%     80%    76%
                         Comp. time (s)    0.65   0.73   0.70   8.87   4.86
Table synt.              Robustness        0%     0%     0%     30%    30%
                         Comp. time (s)    0.12   0.23   0.24   14.1   10.2

Table 4.7: Recognition of the Statue of Liberty.


Figure 4.8: A correct convergence of a side cluster of the Milk Carton Box in pink.

Figure 4.9: An upside-down convergence of the Milk Carton Box in pink classified as correct.



Scenario               Views   Downsampl.   Keypoints   Hough cluster
Warehouse synt.        12      20 mm        30 mm       100 mm
Table synt.            12      5 mm         5 mm        50 mm

Table 4.8: Chosen parameters for table 4.7.

Figure 4.11 presents the recognition of the Statue of Liberty with added noise of increasing standard deviation.

4.2.2 Registration

Table 4.9 presents the results of the registration, using the Statue of Liberty in the synthetic depth images from the warehouse scenario. Translation accuracy is measured in mm and rotation accuracy with the quaternion-based measure described in section 2.5. Both are averages over the correctly classified models.
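A sketch of how such accuracy measures can be computed from an estimated and a ground-truth transform is given below. The rotation measure shown is the common quaternion distance 1 - |<q_est, q_gt>|; treating this as the measure of section 2.5 is an assumption made for the example.

    import numpy as np
    from scipy.spatial.transform import Rotation

    def pose_accuracy(T_est, T_gt):
        """Translation and rotation error between two 4x4 rigid transforms.

        Translation error is the Euclidean distance between the estimated
        and ground-truth translations (in the units of the point clouds).
        Rotation error is the quaternion distance 1 - |<q_est, q_gt>|,
        which is 0 for identical rotations (an assumed metric, see text).
        """
        trans_err = np.linalg.norm(T_est[:3, 3] - T_gt[:3, 3])
        q_est = Rotation.from_matrix(T_est[:3, :3]).as_quat()
        q_gt = Rotation.from_matrix(T_gt[:3, :3]).as_quat()
        rot_err = 1.0 - abs(np.dot(q_est, q_gt))
        return trans_err, rot_err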

Measurement              ICP      ICP-LM   ICP-plane   GICP     3D-NDT
Robustness               95%      84%      100%        84%      84%
Translation acc. (mm)    8.46     5.69     20.87       8.30     13.14
Rotation acc.            0.0004   0.0005   0.0042      0.0023   0.0095
Comp. time (s)           0.02     0.06     0.02        0.06     0.09

Table 4.9: Registration of the Statue of Liberty from the synthetic warehouse scenario using twelve views.

Figure 4.12 presents the robustness of the methods with added normally distributed noise of increasing standard deviation.


(a) A correct convergence of the Statue of Liberty in purple.

(b) An incorrect convergence of the Statue of Liberty in purple.

Figure 4.10: Convergences of the Statue of Liberty in the warehouse scenario.


Figure 4.11: Robustness of the Statue of Liberty with added noise. (Two panels, First Scenario and Second Scenario: fraction of convergences versus standard deviation of noise (mm), 0-25 mm, for CVFH, CESF, SHOT and FPFH.)


Figure 4.12: Robustness of the registration on the Statue of Liberty with added noise. (Fraction of convergences versus standard deviation of noise (mm), 0-25 mm, for ICP, ICP-LM, ICP-plane, GICP and NDT.)


5 Discussion

This chapter presents a discussion of the method and of the results of the evaluations.

5.1 Result

This section presents a discussion of the results of the evaluation.

5.1.1 The System

In some scenarios the system performed fast, at a couple of frames per second, but it did not process in real time if that is defined as 30 frames per second.

Some methods in the system are programmed to execute in parallel on several cores, but not all. If implemented on a GPU, while exploiting the fact that the point clouds are organized, the system should be able to process in real time or close to it.
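As one example of exploiting the organized structure, surface normals can be estimated from pixel-grid neighbours instead of a nearest-neighbour search, which parallelizes trivially. The sketch below is a rough illustration of the idea, not the normal estimation actually used in the system.

    import numpy as np

    def normals_from_organized_cloud(points):
        """Estimate normals for an organized H x W x 3 point cloud.

        Uses cross products of the horizontal and vertical differences of
        neighbouring pixels, so no nearest-neighbour search is needed.
        Border pixels and invalid points are left as NaN for simplicity,
        and the sign convention (towards or away from the sensor) is not
        fixed here.
        """
        h, w, _ = points.shape
        normals = np.full((h, w, 3), np.nan)
        # Central differences along the image grid.
        dx = points[1:-1, 2:, :] - points[1:-1, :-2, :]
        dy = points[2:, 1:-1, :] - points[:-2, 1:-1, :]
        n = np.cross(dx, dy)
        norm = np.linalg.norm(n, axis=2, keepdims=True)
        with np.errstate(invalid="ignore", divide="ignore"):
            normals[1:-1, 1:-1, :] = n / norm
        return normals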

An interesting phenomenon in the evaluation of the synthetic scenes was that in most of the scenarios the system performed better with some added noise, up to a standard deviation of about 2 mm. One reason for this could be that the pose refinement registration methods used in this thesis are iterative and can easily end up in a local minimum. With some noise there is a possibility that such local minima disappear and the registration can continue to a global minimum instead.

Throughout the evaluation the robustness of the ESF was poor. One reason for this could be that different viewpoints of the depth images result in large variations among the visible surfaces of the object. This affects how many shapes end up in the On, Off or Mixed histograms. An example of this is the LS5: when seen from a viewpoint situated slightly above the top, both the object's top and side are visible. This results in a high number of shapes binned in the


Off histogram compared to when the object is seen directly from the side, which would instead bin more shapes as On. Thus it is important to have similar viewpoints. This argument is strengthened by the fact that in [20], which first presents the ESF, the number of different model point clouds used for object recognition is 80, which is considerably higher than what has been used in this thesis.

The evaluation showed that the processing of the Kinect One depth sensor data was more computationally heavy than that of the Astra, though this may be because the Kinect One scenes contain slightly more and bigger objects on the table.

Warehouse scenario

The local descriptors performed more robustly than the global descriptors in the warehouse scenario. For the recognition of the LS5 they performed both quickly and robustly, but for the Statue of Liberty the computation time was considerably higher. This is because the system still performed well on the LS5 when the voxel size of the keypoint sampling was 50 mm, whereas the Statue of Liberty needed more keypoints, and thus the denser sampling of 30 mm, to reach a high robustness. With that many keypoints, the computation of the descriptors and the descriptor matching was slower, and the number of hypotheses from the 3D Hough Voting that was passed on to the pose refinement was also higher. Attempts to reduce the number of hypotheses were made by requiring a higher minimum number of correspondence votes, but this lowered the robustness; sometimes fewer than ten votes out of thousands of correspondences were cast on the correct pose.

A reason for this low descriptiveness of the local descriptors on the Statue of Liberty can be the nature of depth images: features occlude themselves. A feature seen from a certain viewpoint will occlude its background. An example could be a wrinkle in the dress: when seen from above, the whole wrinkle may be visible, but when the viewpoint is situated at the side, the background will not be visible. As a result, local descriptors can only describe what is seen from the actual viewpoint. Perhaps they are more descriptive if the model and scene depth images have more similar viewpoints, as in panorama stitching or SLAM systems, or when the data is in full 3D.

The global descriptors performed worse than the local ones on the LS5. A reason for this is that there was a fairly high amount of occlusion in the images, so that the global descriptors were not able to describe the surfaces. However, figure 4.7 shows that the global descriptors were more robust against noise than the local ones. This could be because the general geometry of the LS5 is preserved while small local regions are more affected by the noise. For the Statue of Liberty, the clustering of the CVFH and CESF divided the point clouds into many small clusters that were less descriptive.

Table scenario

In the table scenario, using the global descriptors CVFH and CESF, the system performed both robustly and quickly on the LS5, on the synthetic as well as the real scenes. However, it is clear that the CRH does not perform well.
