
Master of Science Thesis in Electrical Engineering

Department of Electrical Engineering, Linköping University, 2018

Feature Based Learning for Point Cloud Labeling and Grasp Point Detection


Fredrik Olsson
LiTH-ISY-EX--18/5165--SE

Supervisor: Felix Järemo-Lawin, ISY, Linköping University
            Ola Petersson, SICK IVP
Examiner: Per-Erik Forssén, ISY, Linköping University

Division of Computer Vision
Department of Electrical Engineering
Linköping University
SE-581 83 Linköping, Sweden

Copyright © 2018 Fredrik Olsson


Abstract

Robotic bin picking is the problem of emptying a bin of randomly distributed objects through a robotic interface. This thesis examines an SVM approach to extract grasping points for a vacuum-type gripper. The SVM is trained on synthetic data and used to classify the points of a non-synthetic 3D-scanned point cloud as either graspable or non-graspable. The classified points are then clustered into graspable regions from which the grasping points are extracted.

The SVM models and the algorithm as a whole are trained and evaluated against cubic and cylindrical objects. Separate SVM models are trained for each type of object, in addition to one model trained on a dataset containing both types of objects. It is shown that the performance of the SVM in terms of accuracy is dependent on the objects and their geometrical properties. Further, it is shown that the algorithm is reasonably robust in terms of successfully picking objects, regardless of the scale of the objects.


Acknowledgments

I would like to thank SICK IVP for the opportunity to work on this thesis. I thank my supervisor at SICK IVP, Ola Petersson, as well as his team, for all their technical help and support. I also wish to thank my supervisor at ISY, Felix Järemo-Lawin, for the help with the report. Additionally, I would like to thank my examiner Per-Erik Forssén for all the help and encouragement. Lastly, I would like to thank Jens Edhammer for an exceptionally well implemented simulator, as well as for all the help regarding it.

Linköping, August 2018
Fredrik Olsson


Contents

Notation

1 Introduction
  1.1 Motivation
  1.2 Problem formulation
  1.3 Limitations
  1.4 Related Work
  1.5 Thesis Outline

2 Theory
  2.1 Support Vector Machines (SVM)
    2.1.1 Introduction
    2.1.2 Linear case
    2.1.3 Non-linear case
    2.1.4 Non-separable case and soft margins
  2.2 Fast Point Feature Histogram (FPFH)
    2.2.1 Point Feature Histogram
    2.2.2 Fast Point Feature Histogram
  2.3 Clustering

3 Method
  3.1 Description
  3.2 Evaluation
  3.3 Implementation
    3.3.1 Data generation
    3.3.2 SVM implementation
    3.3.3 Grasp point extraction

4 Results
  4.1 SVM performance
    4.1.1 Datasets
    4.1.2 Model hyperparameters
    4.1.3 Model performance
  4.2 Bin picking performance
    4.2.1 Picking cylinders
    4.2.2 Picking cubes
    4.2.3 Picking mixed objects

5 Conclusion and Future Work
  5.1 Conclusions
  5.2 Future work

A Scene configuration of evaluation datasets


Notation

Abbreviations

Abbreviation   Description
CAD            Computer-aided design
SVM            Support Vector Machine
PFH            Point Feature Histogram
FPFH           Fast Point Feature Histogram
RBF            Radial Basis Function
PCL            Point Cloud Library
SMO            Sequential Minimal Optimization


1 Introduction

Robotic bin picking is the art of picking objects, often randomly distributed, out of a container (bin) through a robotic interface. Many algorithms work on point cloud data, and different methods exist for localising objects. Developing methods that discriminate individual points as pickable or not pickable can lead to greater generalizability when segmenting the image into graspable regions.

1.1 Motivation

Robotic bin picking solutions have become a natural part of the manufacturing and logistics industries, with a wide range of solutions available for different applications. A common approach is to use a Computer-aided design (CAD) model of the object that is to be picked, transform it into a point cloud and perform a point cloud registration (point matching) of the model in a point cloud depicting the bin with objects. Since this requires a model of the object that is to be picked up, such methods are subject to poor generalization. Therefore it is of interest to study methods for bin picking which generalize better, in order to generate solutions which can handle a wider range of different tasks.

Computer vision solutions for robotic grasping are a widely studied subject. Using Support Vector Machines (SVMs) to learn robotic grasping is not a new idea and has been studied by [2] and [14], for example. They, however, study the problem of n-finger robotic grasping, i.e. the problem of identifying grasping poses for a robotic hand with n fingers. This thesis will instead apply an SVM approach to the problem of using vacuum-type grippers, such as the one shown in Figure 1.1, for picking objects. This is in itself an interesting investigation, and by using results and ideas from object/shape recognition as in [19], the aim of this thesis is to achieve results which are industrially applicable.


Figure 1.1: Example of a vacuum-type gripper.


1.2 Problem formulation

As mentioned in Section 1.1, the idea of using SVMs to learn robotic grasping has been studied before. Rephrasing the problem as classifying single points as graspable, instead of identifying grasping positions, opens up some interesting investigations. Leaning on the promising results of geometric primitive classification through histogram-based feature discrimination done by [19], along with the work done by [6] that shows the usefulness of SVMs for histogram discrimination, this thesis aims to answer the following questions:

1. Given simulated data of objects in a bin presented as a point cloud, is it possible to train an SVM to successfully discriminate graspable points from non-graspable ones?

2. Is it possible to use an SVM trained on synthetic data for successful discrimination on non-synthetic 3D-scans?


1.3 Limitations

The work presented in this thesis is subject to the following limitations:

1. The accuracy of the SVM is only evaluated on synthetic data.
2. The robotic interface for picking objects is simulated.
3. A single type of feature and a single SVM kernel is tested.
4. Only cylinder and cube objects are studied.

Annotation of data is time consuming, and therefore the accuracy of the SVM is only evaluated on synthetic data, due to the availability of ground truth. This limits the evaluation of the SVM since the purpose is to use the SVM for classification on real scanned data, and one can rarely expect synthetic data to model real data perfectly. The performance of the SVM will however be implicitly measured when the algorithm is evaluated as a whole, as described in Section 3.2. This in turn introduces another limitation since the robotic interface had to be simulated. Again this limits the evaluation somewhat, since the evaluation does not completely take place in the real world domain.

Due to limitations in the amount of available time, the work was limited to testing only a single type of feature and a single SVM kernel. The use of different features and kernels would most likely produce different results, which in turn could lead to an interesting investigation. Further, only cylinder and cube objects are studied. This means that the algorithm is only tested on symmetrical objects.

1.4 Related Work

This thesis investigates the use of an SVM for segmentation of scenes, with a bin containing objects, into regions which are "graspable" or "not graspable" in the sense of suitability for placing a vacuum-type picking tool. The SVM will discriminate a histogram feature, the Fast Point Feature Histogram (FPFH) [21], through training on simulated scenes.

In 2008 Rusu et al. [19] introduced the Point Feature Histogram (PFH), the predecessor of the FPFH, in a work similar to that of this thesis. They evaluate different approaches (SVM, KNN and K-means clustering) for classifying geometric primitives in indoor scenes based on the PFH histograms, and found that SVMs are particularly suitable for this type of classification. This was also shown earlier, in 1999, in the work by Chapelle et al. [6], where they successfully used SVMs to classify images based on color histograms. Common to these works is the use of a Radial Basis Function (RBF) as the kernel for the SVM, though the results in [19] differ marginally depending on the kind of RBF used.

In 2010 Bohg and Kragic studied the problem of n-finger robotic grasping, using an SVM approach with Shape Context [1], a feature that could be seen as an ancestor of the PFHs in that it summarizes global geometry in a local descriptor.


They compared the performance of the SVM to that of logistic regression and found that the ability of the SVM to non-linearly discriminate feature vectors outperforms the linear discrimination of logistic regression.

The direct use of height maps, instead of point clouds, was studied in 2014 by Domae et al. [8]. In their work they studied both the use of a 2-finger gripper and of a vacuum gripper. They use 2D image processing, representing the grasping tool as a binary mask containing both a grasping area and a collision area. By checking the intersection of the height map with the masks, patches which intersect well with the grasping area and not at all with the collision area are deemed to have a high graspability. From this, a graspability map is generated, with its peaks chosen as grasp points.

Due to the advancements of neural networks, in recent years they have made their entry into the subject of point cloud segmentation. In 2017 Qi et al. [16] introduced PointNet, which can be used for 3D object classification, 3D object part segmentation, as well as segmentation of scenes. Of specific interest is the network's ability to combine the global feature used for classification into the segmentation network together with the point features previously used for classification, yielding new point features which combine both local and global information.

1.5 Thesis Outline

Chapter 2 introduces and explains the theory and concepts used within the thesis, such as SVMs and FPFH features. Chapter 3 describes the methods used, followed by Chapter 4 in which the results are presented. Chapter 5 concludes the thesis with a discussion regarding results and methodology, as well as some final conclusions and a section regarding future work.


2 Theory

This chapter presents the theory used within the thesis. Each section starts with a brief introduction before delving deeper into the theory with more detailed descriptions, in order to supply the reader with the understanding needed for the work presented in the following chapters.

2.1 Support Vector Machines (SVM)

Support Vector Machines (SVMs) have developed in both complexity, and hence applicability, through the years. They were introduced in the early 1960s by Vapnik as a linear separator for datasets with two labels. In 1992, Vapnik et al. [3] brought the SVM closer to its modern form by introducing the kernel function as a method for separating non-linear datasets. Continuing this work, in 1995 Vapnik and Cortes [7] integrated the soft margin into the SVM, making it applicable to non-separable datasets.

2.1.1 Introduction

Linear classification

The basic idea behind the theory of SVMs is based on the case where we have a labelled dataset (a training set), e.g. with labels −1 and 1, from which we want to find a discriminator (a hyperplane) which maximizes the margin between the data of different labels. An example is shown in Figure 2.1, where the data is labeled as red and blue, the discriminator is shown in black, and the maximized margin in purple.


Figure 2.1: Example of an SVM with a maximized margin.

After the SVM has been trained (the margin on the training data has been maximized) it can be used to classify new data, as shown in Figure 2.2, where the new data is shown in a lighter shade of its respective label. Of specific interest is the blue point at (2, −1), since this is a point which is wrongly classified: the correct label would be blue, but given the training data the discriminator classifies it as red. This illustrates the common machine learning problem of selecting a training dataset that correctly models the data to which the classifier will be applied.

Figure 2.2: The SVM from Figure 2.1 used to classify new data (bright colour).

Non-linear classification

Of course, not all datasets are as "kind" and suitable for classification as those in Figure 2.1 and Figure 2.2. In fact, most datasets of interest are not even linearly separable. As mentioned in the beginning of Section 2.1, the SVM was first introduced for linearly separable data, but was extended in 1992 by Vapnik et al. [3] through the introduction of the kernel function. The idea is to take the dataset which is not linearly separable and map it (through the use of a kernel function) into a higher-dimensional space where it is linearly separable.

Figure 2.3: Non-linearly separable data.

As an example, look at the dataset in Figure 2.3, which is clearly not linearly separable. By the use of a simple kernel function, as defined by (2.1),

z(x, y) = x^2 + y^2 \qquad (2.1)

we introduce a new z-dimension, and by looking at Figure 2.4, where x is plotted against z, the dataset can now easily be deemed linearly separable.

Figure 2.4: The dataset from Figure 2.3 mapped with Equation (2.1).
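To make the mapping concrete, consider the following sketch (an illustrative addition, not from the original thesis): two concentric rings of points cannot be separated by any line in the (x, y) plane, but after applying the mapping of (2.1) a single threshold on z separates them.

import numpy as np

rng = np.random.default_rng(0)
angles = rng.uniform(0, 2 * np.pi, 200)
r_inner = rng.normal(1.0, 0.1, 200)   # one class: inner ring
r_outer = rng.normal(3.0, 0.1, 200)   # other class: outer ring

def z(x, y):
    # The kernel mapping of Eq. (2.1).
    return x**2 + y**2

z_inner = z(r_inner * np.cos(angles), r_inner * np.sin(angles))
z_outer = z(r_outer * np.cos(angles), r_outer * np.sin(angles))

# In the lifted z-dimension the classes no longer overlap, so a single
# threshold (a hyperplane in the new space) separates them.
print(z_inner.max() < z_outer.min())   # True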

2.1.2 Linear case

The starting point is a set of labelled training data represented as n pairs {x_i, y_i}, where x_i ∈ R^d is the data or feature vector (of dimension d) and y_i ∈ {−1, 1} is its corresponding label. The goal is to find a separating hyperplane w which separates the data x_i according to their respective labels y_i with the maximum margin. Suppose that there exists a hyperplane (w, b) which separates the data x_i; then we know from linear algebra that points which lie on the plane will satisfy

\mathbf{x} \cdot \mathbf{w} + b = 0 \qquad (2.2)


where w is the normal of the hyperplane and b is some constant. However, the idea is to maximize the margin between this hyperplane and the closest positive and negative samples respectively. To enforce points to lie outside of the margin, Burges [4] introduces the constraints

\mathbf{x}_i \cdot \mathbf{w} + b \geq +1, \quad \text{if } y_i = +1 \qquad (2.3)

\mathbf{x}_i \cdot \mathbf{w} + b \leq -1, \quad \text{if } y_i = -1 \qquad (2.4)

which means that the closest points with the respective labels will each lie on the respective hyperplanes gained from enforcing equality in (2.3) and (2.4); see Figure 2.5. These constraints can be rewritten as

y_i(\mathbf{x}_i \cdot \mathbf{w} + b) \geq 1 \;\; \forall i \qquad (2.5)

summarizing both inequalities into one constraint. Further, the distance between the separating hyperplane (2.2) and those at the margin (Equations (2.3) and (2.4) with enforced equalities) is 1/||w||, and thus the width of the whole margin (the quantity which we wish to maximize) is 2/||w||. Equivalently, we could (for mathematical convenience) choose to minimize ½||w||². This means that the problem we now wish to solve is

\min_{\mathbf{w},b} \; \tfrac{1}{2}\|\mathbf{w}\|^2 \quad \text{s.t.} \quad y_i(\mathbf{x}_i \cdot \mathbf{w} + b) \geq 1 \;\; \forall i \qquad (2.6)

The first step in solving this problem, according to Burges [4], is to introduce positive Lagrange multipliers α_i for each of the constraints in (2.6). This makes the problem easier to handle, since the constraints in (2.6) are replaced by constraints on the Lagrange multipliers themselves. This reformulation also has the convenient result that the training data will appear in the form of dot products in the training and test algorithms. Through the introduction of the Lagrange multipliers we get the Lagrangian L, which is the new function we wish to minimize. This means that the problem can now be formulated as

\min_{\mathbf{w},b,\alpha_i} \; L = \tfrac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^{k} \alpha_i \left( y_i(\mathbf{x}_i \cdot \mathbf{w} + b) - 1 \right) \quad \text{s.t.} \quad \frac{\partial L}{\partial \alpha_i} = 0, \; \alpha_i \geq 0 \qquad (2.7)

Figure 2.5: Illustration of the basic SVM problem formulation. Adapted from [22].

which is a convex quadratic programming problem. This opens up the possibility of solving a dual problem

\max \; L_D = \tfrac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^{k} \alpha_i \left( y_i(\mathbf{x}_i \cdot \mathbf{w} + b) - 1 \right) \quad \text{s.t.} \quad \frac{\partial L_D}{\partial \mathbf{w}} = \frac{\partial L_D}{\partial b} = 0, \; \alpha_i \geq 0 \qquad (2.8)

which is called the Wolfe dual [23]. The partial derivative conditions of (2.8) give the conditions that

\mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i \qquad (2.9)

and

\sum_i \alpha_i y_i = 0 \qquad (2.10)

(20)

which are equality constraints in this dual formulation, and thus we can substitute them into (2.8), yielding the final dual formulation of the problem

\max_{\alpha_i} \; L_D = \sum_i \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, \mathbf{x}_i \cdot \mathbf{x}_j \quad \text{s.t.} \quad \sum_i \alpha_i y_i = 0, \; \alpha_i \geq 0 \qquad (2.11)

which can be solved through a quadratic programming algorithm.
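As an illustration (not part of the thesis), the dual (2.11) can be solved for a small toy dataset with a general-purpose optimizer; w is then recovered from (2.9), and b from the support vectors, for which y_i(x_i · w + b) = 1 holds with equality.

import numpy as np
from scipy.optimize import minimize

# Toy linearly separable dataset with labels +1 / -1.
X = np.array([[1.0, 1.0], [2.0, 2.5], [3.0, 1.5],
              [-1.0, -1.0], [-2.0, -2.5], [-3.0, -1.0]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])

G = (y[:, None] * X) @ (y[:, None] * X).T    # G_ij = y_i y_j x_i . x_j

def neg_dual(a):
    # Maximizing L_D of Eq. (2.11) = minimizing its negation.
    return -(a.sum() - 0.5 * a @ G @ a)

res = minimize(neg_dual, np.zeros(len(y)),
               bounds=[(0.0, None)] * len(y),                   # alpha_i >= 0
               constraints={"type": "eq", "fun": lambda a: a @ y})
alpha = res.x
w = (alpha * y) @ X                          # Eq. (2.9)
sv = alpha > 1e-6                            # the support vectors
b = np.mean(y[sv] - X[sv] @ w)
print(w, b)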

2.1.3 Non-linear case

In 1992 Vapnik et al. [3] extended the SVM by introducing kernel functions, in order to handle non-linearly separable data. As described by Burges [4], it is noticeable that the training data for our problem appears only in dot products in (2.11). The idea is to map the training data through a transform Φ,

\Phi : \mathbb{R}^d \mapsto \mathcal{H} \qquad (2.12)

where R^d is the Euclidean feature space of the training data and H a higher (possibly infinite) dimensional Euclidean space. The training data would then only depend on the dot products Φ(x_i) · Φ(x_j) in H. By defining a function as

K(\mathbf{x}_i, \mathbf{x}_j) = \Phi(\mathbf{x}_i) \cdot \Phi(\mathbf{x}_j) \qquad (2.13)

where K is called the kernel function, the training data only depends on K, and there is no explicit need to know the definition of Φ. This means that the optimization problem for the non-linear case is defined as

\max_{\alpha_i} \; L_D = \sum_i \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j) \quad \text{s.t.} \quad \sum_i \alpha_i y_i = 0, \; \alpha_i \geq 0 \qquad (2.14)

i.e. the same problem as in the linear case is solved but on data transformed into a higher dimensional space where the data is linearly separable.

A commonly used kernel is the Gaussian Radial Basis Function (RBF) defined as

K(\mathbf{x}_i, \mathbf{x}_j) = \exp(-\gamma \|\mathbf{x}_i - \mathbf{x}_j\|^2) \qquad (2.15)
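In practice the kernel is evaluated as a Gram matrix over all pairs of training vectors; substituting it for the dot products x_i · x_j in (2.11) yields (2.14). A small illustrative sketch:

import numpy as np

def rbf_gram(X1, X2, gamma):
    # K[i, j] = exp(-gamma * ||x1_i - x2_j||^2), Eq. (2.15).
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * d2)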


2.1.4 Non-separable case and soft margins

In 1995 the SVM was further extended when Vapnik and Cortes [7] introduced the soft margin classifier. This means that some training data samples are allowed to be on the "wrong" side of the derived margins, which in turn allows deriving SVMs from non-separable data. This is done through relaxation of the constraints (2.3) and (2.4) by introducing slack variables ξ_i, which yields the new constraints

\mathbf{x}_i \cdot \mathbf{w} + b \geq +1 - \xi_i, \quad \text{if } y_i = +1
\mathbf{x}_i \cdot \mathbf{w} + b \leq -1 + \xi_i, \quad \text{if } y_i = -1
\xi_i \geq 0 \;\; \forall i \qquad (2.16)

which in turn means that the slack variables ξ_i need to be included in the objective function. This is done by revising the objective function in (2.6) to

\tfrac{1}{2}\|\mathbf{w}\|^2 + C \sum_i \xi_i \qquad (2.17)

where C is a hyperparameter chosen by the user at design time. A larger C means assigning a higher cost to errors, i.e. a higher cost for passing the margin. One of the main advantages of this formulation is that neither ξ_i nor their respective Lagrange multipliers appear in the Lagrange dual formulation of the problem. This means that when the problem, after the relaxation of the constraints, is put on the form of (2.14), it is formulated as

\max_{\alpha_i} \; L_D = \sum_i \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j) \quad \text{s.t.} \quad \sum_i \alpha_i y_i = 0, \; 0 \leq \alpha_i \leq C \qquad (2.18)


2.2 Fast Point Feature Histogram (FPFH)

In 2008 Rusu et al. [19], [20] introduced the Point Feature Histogram (PFH) as a pose-invariant local feature describing the underlying surface model properties at a query point p_q. The aim was to be able to use a multiclass SVM (Section 2.1) to classify points in a point cloud as belonging to different geometric primitives such as edges, cylinders and planes. In 2009 Rusu et al. [21] extended their work through the modified Fast Point Feature Histogram (FPFH), a trimmed-down version of the PFH which is much faster to compute while retaining much of the discriminative power of the PFH.

2.2.1 Point Feature Histogram

The computation of PFHs relies on 3D points p_i in a point cloud P, with coordinates {x_i, y_i, z_i}, and their respective estimated surface normals n_i. The computations are outlined in Algorithm 1.

Algorithm 1: Point Feature Histogram
Input: A dataset of points P
for every point p_q in P do
    Select all neighbours within a given radius r to construct a k-neighbourhood K_q.
    for every pair of points (p_i, p_j), where i ≠ j, in K_q do
        Define a Darboux frame according to (2.19) and compute the angles α, φ and θ in (2.20).
        Increment the bin in the histogram corresponding to the specific set of values of all three angles.
    end
    Normalize the histogram so that each bin represents the percentage of points in K_q that belong to that bin.
end

Essentially, the idea behind Algorithm 1 is to calculate the Darboux frame {u, v, w} as

\mathbf{u} = \mathbf{n}_i, \quad \mathbf{v} = (\mathbf{p}_j - \mathbf{p}_i) \times \mathbf{u}, \quad \mathbf{w} = \mathbf{u} \times \mathbf{v} \qquad (2.19)

and the angles

\alpha = \mathbf{v} \cdot \mathbf{n}_j, \quad \phi = \mathbf{u} \cdot (\mathbf{p}_j - \mathbf{p}_i) / \|\mathbf{p}_j - \mathbf{p}_i\|, \quad \theta = \arctan(\mathbf{w} \cdot \mathbf{n}_j, \; \mathbf{u} \cdot \mathbf{n}_j) \qquad (2.20)

and then form a histogram for the query point p_q. This is done by binning over the k-neighbourhood of p_q, K_q. This constitutes the Point Feature Histogram. An illustration of the influence diagram of p_q and K_q, i.e. how the points p_q and p_ki ∈ K_q relate to each other, is shown in Figure 2.6.

Figure 2.6: Illustration of the PFH influence diagram of p_q and the points p_ki in K_q. Adapted from [21].
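For concreteness, a minimal sketch of the pair relations (2.19)-(2.20) for a single point pair is given below (an illustrative addition, not the PCL implementation; the frame vector v is normalized here so that α and φ are true cosines):

import numpy as np

def pair_features(p_i, n_i, p_j, n_j):
    # Darboux frame of Eq. (2.19) for the pair (p_i, p_j).
    d = p_j - p_i
    u = n_i
    v = np.cross(d, u)
    v /= np.linalg.norm(v)
    w = np.cross(u, v)
    # Angle features of Eq. (2.20).
    alpha = v @ n_j
    phi = (u @ d) / np.linalg.norm(d)
    theta = np.arctan2(w @ n_j, u @ n_j)
    return alpha, phi, theta

A PFH is then the histogram of these triplets over every pair in K_q.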

2.2.2 Fast Point Feature Histogram

As mentioned in the beginning of this section, the FPFH was introduced in order to reduce the computational complexity of the PFH, namely from O(n · k²) to O(n · k), where n is the total number of points in P and k is the number of neighbours in K_q. The idea is to first calculate the relations of (2.19) and (2.20) between the query point p_q and all its neighbours in K_q, as shown by the influence diagram in Figure 2.7. This constitutes a Simplified Point Feature Histogram (SPFH). The FPFH is then computed by adding the weighted sum of the SPFHs of the k neighbours in K_q to the SPFH of the query point. This means that

\text{FPFH}(\mathbf{p}_q) = \text{SPFH}(\mathbf{p}_q) + \frac{1}{k} \sum_{i=1}^{k} \frac{1}{\omega_i} \cdot \text{SPFH}(\mathbf{p}_i) \qquad (2.21)

where ω_i is the distance between p_q and p_i. An influence diagram is shown in Figure 2.8 and the computations are outlined in Algorithm 2. The idea behind the algorithm is to form an SPFH for every point p in the cloud P; these histograms need to be computed only once (thus reducing the computational complexity), since they do not depend on the query point p_q.
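Given precomputed SPFHs, the combination step of (2.21) is a simple weighted sum; a sketch (illustrative, with hypothetical arrays spfh, nbr_idx and nbr_dist holding the per-point histograms, neighbour indices and neighbour distances):

import numpy as np

def fpfh(spfh, nbr_idx, nbr_dist, q):
    # Eq. (2.21): combine SPFH(p_q) with the weighted SPFHs of its neighbours.
    h = spfh[q].copy()
    k = len(nbr_idx[q])
    for i, w in zip(nbr_idx[q], nbr_dist[q]):
        h += spfh[i] / (k * w)     # weight 1/omega_i, averaged over k neighbours
    return h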


Figure 2.7: Illustration of the SPFH influence diagram of p_q and the points p_ki in K_q.

Figure 2.8: Illustration of the FPFH influence diagram of p_q and the k points in K_q. Relations which occur twice are marked with grey. Adapted from [21].

2.3 Clustering

In 2006 Rabbani et al. [17] introduced the Region Growing algorithm for point cloud segmentation. It utilizes the idea of comparing normals and their residuals towards a "smoothness constraint" for determining whether neighbouring points belong to the same smooth surface. It is essentially based on the comparison of the angles between the point normals. During work with the Point Cloud Library (PCL) [11] the algorithm was extended [12] to utilize point curvatures as an alternative to residuals for testing the smoothness constraint.

The algorithm starts by sorting the points by their curvature value. This is because it will start each region growth at the available point (i.e. one not yet belonging to a cluster) with minimum curvature, since this point is located in a flat area,


Algorithm 2: Fast Point Feature Histogram
Input: A dataset of points P
for every point p_q in P do
    Select all neighbours within a given radius r to construct a k-neighbourhood K_q.
    for every point p_i in K_q do
        Select all neighbours within the given radius r to construct a k-neighbourhood K_i.
        Define a histogram SPFH(p_i).
        for every point p_j ∈ K_i do
            Define a Darboux frame according to (2.19) and compute the angles α, φ and θ in (2.20) for the pair (p_i, p_j).
            Increment each bin corresponding to the value of each of the angles in SPFH(p_i).
        end
    end
    Construct FPFH(p_q) according to (2.21).
end

and growth from these areas tends to reduce the total number of segments. The selected point is then added to a set called seeds. Then, for every seed, the algorithm finds the seed's neighbours. Every neighbour's normal is compared against the normal of the seed, and if the angle between them is less than a specified threshold, the neighbour is added to the current region/cluster. After the angle comparison, the neighbour's curvature is compared to a specified threshold, and if it is lower, the neighbour is added to the current set of seeds. When the curvature has been tested, the current seed is removed from the set of seeds, and the algorithm starts over, performing the same neighbour tests for all the seeds and updating the set with each iteration. When the set of seeds becomes empty, the current region cannot grow any further. The whole process is then repeated, growing a new region from the available point with the smallest curvature. This continues until every point belongs to a region, meaning that even single outlier points will constitute a region.

An overview of the Region Growing Algorithm as implemented by PCL [12] is given in Algorithm 3, where P is the input point cloud, N and c are the sets of their respective normals and curvatures, c_th and θ_th are the thresholds for curvature and angle, and Ω is a neighbour-finding function (such as k-nearest neighbours). The output is the set R, which contains the regions/clusters generated by the algorithm. Further, A is the list of available points, R_c the region which is currently being grown, S the set of seeds, and B_c the set containing the neighbours of the current seed.


Algorithm 3: Region Growing Algorithm
Input: P, N, c, c_th, θ_th, Ω
Output: Region list R
R ← Ø
Sort points in P according to curvature and put them in the list of available points A
while A is not empty do
    Current region R_c ← Ø
    Current seeds S ← Ø
    Point with min curvature in A → p_min
    Add p_min to S and R_c
    Remove p_min from A
    for i = 0 to size(S) do
        Find nearest neighbours of current seed: B_c ← Ω(S{i})
        for j = 0 to size(B_c) do
            Current neighbour p_j ← B_c{j}
            if p_j ∈ A and cos⁻¹(|N{S{i}} · N{p_j}|) < θ_th then
                Add p_j to R_c
                Remove p_j from A
                if c(p_j) < c_th then
                    Add p_j to S
                end
            end
        end
    end
    Add R_c to R
end


3 Method

This chapter describes the methods used within this work and their implementation, including the generation of synthetic point clouds with graspability measures as well as the SVM. The chapter also includes a description of how the used methods are evaluated.

3.1 Description

As declared in Section 1.2, the aim of this work is to investigate whether it is possible to successfully train an SVM to discriminate graspable points from non-graspable ones. Further, it should be investigated whether it is possible to cluster these points and generate grasping points and poses for a vacuum-type gripper. To accomplish this, data was generated for training an SVM. The trained SVM model is then used for classifying graspable points. The classified points are then clustered and a grasping point is extracted from each cluster. An overview of the flow of this process is shown in Figure 3.1.

This approach was inspired by the work of Rusu et al. [19] (see Section 1.4), in which an SVM was used to classify geometric primitives. Their work showed that it was possible to successfully discriminate between edges and planes, for example. This is of great interest in this work, since placing a vacuum-type gripper on an edge would result in failure to grip the object. The segmentation/clustering into graspable regions is inspired by the works of Domae et al. [8] and Rao et al. [18], which perform segmentation on depth maps through detection of edges. The clustering in this work is instead performed on a point cloud, using a surface and curvature based segmentation. In turn, the availability of precisely computed normals at each point allows for a naive approach for extracting the grasping point and pose of the gripper.


Figure 3.1: Overview of the process used to generate grasping points: generate training data → train SVM → use SVM for classification → cluster graspable points → extract grasping points.

Training the SVM corresponds to solving the quadratic programming problem of (2.18), Section 2.1.4, which is done through the use of the Sequential Minimal Optimization (SMO) [15] algorithm. In (2.18), the kernel function K chosen for this work is the Gaussian Radial Basis Function of (2.15) in Section 2.1.3. FPFH features (Section 2.2) are selected as input data vectors due to the promising work of Rusu et al. [19], as well as the SVM's proven ability to successfully discriminate histograms, as described in Section 1.4. In order to generate the training data for the SVM, a synthetic point cloud with graspability labels is generated from a simulator based on the work by Edhammer [9]. From this point cloud, the FPFH features are calculated, thus generating pairs of FPFH feature vectors and graspability labels corresponding to x_i and y_i respectively in (2.18).

The trained SVM model is then used for classifying individual points as either graspable or non-graspable. When the points in a cloud have been classified, all points labeled as non-graspable are removed, leaving a point cloud consisting of only the graspable points. These points are then subject to the clustering algorithm described in Section 2.3. Clusters with fewer than 50 points are discarded, since these are deemed too small to be of significance. This threshold for discarding clusters is of course dependent on the density of the processed point cloud, as well as the application for which the algorithm is used. For each cluster, the centroid is then calculated by taking the mean of the respective coordinates (x, y, z) in the cluster. The point that is closest (by Euclidean distance) to the centroid is then chosen as the grasping point.

The pose of the gripper (the transform orienting it in relation to the grasping point) is then determined by aligning the z-axis (z) of the world coordinates with the normal (n) of the grasping point. This is equivalent to rotating z towards n, which is done by determining a rotation matrix. The matrix is computed using Rodrigues' rotation formula, as described by Nordberg [13]:

R(\mathbf{v}, \alpha) = I + (1 - \cos\alpha)\,[\mathbf{v}]_\times^2 + \sin\alpha\,[\mathbf{v}]_\times \qquad (3.1)

where R is a 3 × 3 rotation matrix, I the identity matrix, α the angle between z and n, v the unit-length axis of rotation along z × n, and [·]_× the cross-product matrix. Since cos α = z · n and sin α = ||z × n||, (3.1) turns into

R(\mathbf{z}, \mathbf{n}) = I + \frac{1 - \mathbf{z} \cdot \mathbf{n}}{\|\mathbf{z} \times \mathbf{n}\|^2}\,[\mathbf{z} \times \mathbf{n}]_\times^2 + [\mathbf{z} \times \mathbf{n}]_\times \qquad (3.2)

i.e. R depends only on z and n (the division by ||z × n||² accounts for z × n not being of unit length). Though (3.2) describes how R is actually computed in the implementation, (3.1) may give a clearer idea of what is happening. As can be seen in Figure 3.2, z is rotated around the axis v by the angle α so that it becomes aligned with n.

Figure 3.2: Illustration of the alignment of z towards n.
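A sketch of the rotation computed via (3.1)-(3.2) (an illustrative addition; the degenerate case n ≈ −z is handled separately, since the rotation axis vanishes there):

import numpy as np

def align_z_to_normal(n):
    # Rotation matrix rotating the world z-axis onto the unit normal n,
    # Rodrigues' rotation formula, Eqs. (3.1)-(3.2).
    z = np.array([0.0, 0.0, 1.0])
    n = n / np.linalg.norm(n)
    v = np.cross(z, n)                # rotation axis, ||v|| = sin(alpha)
    s = np.linalg.norm(v)
    c = float(z @ n)                  # cos(alpha)
    if s < 1e-12:                     # n (anti-)parallel to z
        return np.eye(3) if c > 0 else np.diag([1.0, -1.0, -1.0])
    K = np.array([[0.0, -v[2], v[1]],
                  [v[2], 0.0, -v[0]],
                  [-v[1], v[0], 0.0]])   # cross-product matrix [v]x
    return np.eye(3) + K + ((1.0 - c) / s**2) * (K @ K)

R = align_z_to_normal(np.array([0.0, 1.0, 1.0]))
print(R @ np.array([0.0, 0.0, 1.0]))     # ~ [0, 0.707, 0.707]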

3.2 Evaluation

The SVM is evaluated by accuracy, defined as the number of correctly classified samples (both graspable and non-graspable) divided by the total number of samples, given as a percentage. This is done solely on synthetic data, due to the availability of ground truth for point clouds generated by the simulator. The evaluation is done on datasets consisting of different types of models/objects and varying scales and quantities of objects. The influence of noise in the data is examined through the addition of white noise with varying variances, applied directly to the x, y and z coordinates.

The algorithm as a whole is evaluated on real scanned data. A scene with a bin of objects is set up, the scene is scanned by a real camera, the algorithm outputs a grasping point and a pose (as shown in Figure 3.3), and the object is removed by hand in order to simulate robotic bin picking. The algorithm ranks the possible poses according to the size of their respective clusters (in terms of number of points). The pose with the largest cluster is selected, based on the notion that a large cluster corresponds to a surface with many graspable points and is thus suitable for grasping. Once an object has been removed, a new scan is performed and the algorithm starts over.


The performance of the algorithm is measured through success rate, defined as the number of successful attempts to pick objects from the bin divided by the total number of attempts, similar to the evaluation in [8]. An attempt is deemed to have failed if the gripper would collide with either the bin or an object, i.e. the simulated gripper (the blue object in Figure 3.3) overlaps with either the bin or an object. It is also deemed to have failed if an object is significantly blocked, meaning that roughly one fourth of the surface upon which the gripper is placed is blocked by another object. Further, an attempt placing the gripper too close to the edge of an object (within a distance corresponding to 10% of the object's length in that direction) is also marked as a failed attempt. All these cases are evaluated through visual inspection in the bin picking interface shown in Figure 3.3. If an attempt is marked as failed, the next best candidate (the next largest cluster) is chosen for the next attempt. The algorithm terminates when it finds no eligible grasping point and pose. If the algorithm terminates before the bin is empty, each remaining object is marked as a failed attempt.

Figure 3.3: Example output of a picking pose from the algorithm.

3.3 Implementation

The following section presents the implementation of the methods described in Section 3.1.

3.3.1 Data generation

The generation of training data for the SVM is done in two steps. First, synthetic point cloud data is generated with each individual point assigned a binary graspability value, where 1 corresponds to the point being graspable and 0 otherwise. Second, the FPFH features (Section 2.2) of the point cloud are generated, and each histogram is paired with the corresponding point label. These pairs then constitute the training data for the SVM.


Figure 3.4: A cylinder model with a graspability texture mapped onto it.


Generating synthetic point cloud data

As mentioned in Section 3.1, the synthetic point cloud data is generated by a simulator based on the work by Edhammer [9]. It simulates randomly dropping rigid objects (loaded from a model) into a box and acquiring an image from a 3D-camera, including artifacts such as occlusion and with the possibility of adding noise according to a simple model. The graspability information is added through the mapping of a texture to the model, as depicted in Figure 3.4. Here green depicts graspable and red not graspable. The red channel is used solely for visualization purposes.

Generating SVM data

In order to generate the data that the SVM needs for both training and classification, i.e. the pairs consisting of FPFH features and a corresponding label, a separate interface is implemented. Essentially this consists of a parser reading the files containing the point cloud (x, y and z coordinates) and the graspability labels respectively, along with a computation of the FPFH features. A second parser formats the data into a form that the SVM interface can interpret. This means that the same interface is used for turning both training data and test data into the SVM format in a very similar manner, the separating factor being the presence of graspability labels. This in turn means that the interface can also be used for viewing the result of a classification, using the original point cloud and the output of the SVM as its input. An overview of this system is shown in Figure 3.6, where green boxes and arrows indicate files and file parsing while blue represents the internal structure of the interface.


Figure 3.5: A rendering of the simulated scene of randomly dropped models with graspability textures. Green depicts graspable and red not graspable.

Figure 3.6: Overview of the system turning point clouds into SVM data (graspability labels and point cloud data → create point cloud → estimate normals → compute FPFH → format SVM data). Green boxes and arrows indicate files and file parsing, while blue represents the internal structure of the interface, i.e. the blue boxes together constitute a single program.

For the internal representation of the point clouds, as well as the normal estimation and the computation of the FPFH features, the open-source Point Cloud Library (PCL) [11] is used. An example of a point cloud representation of the data generated by the simulator and then used by the interface is given in Figure 3.7. Some minor artifacts can be noticed close to the edges of the cylinder, where some points on the floor have wrongly been marked as green, a problem inherited from the simulator.
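The same normals-plus-FPFH computation can be sketched with the Open3D library instead of the thesis's PCL-based interface (an illustrative sketch; the file name is hypothetical, and the radii follow the pattern of Section 4.1.2, where the FPFH radius is 2.5 times the normal-estimation radius):

import numpy as np
import open3d as o3d

pts = np.loadtxt("scene.xyz")                 # one "x y z" row per point (hypothetical file)
pcd = o3d.geometry.PointCloud()
pcd.points = o3d.utility.Vector3dVector(pts)

# Estimate normals, then 33-bin FPFH histograms (units here: mm).
pcd.estimate_normals(o3d.geometry.KDTreeSearchParamRadius(radius=10.0))
fpfh = o3d.pipelines.registration.compute_fpfh_feature(
    pcd, o3d.geometry.KDTreeSearchParamRadius(radius=25.0))

X = np.asarray(fpfh.data).T                   # one FPFH feature vector per point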

3.3.2 SVM implementation

The SVM classifier is implemented using the open-source library LIBSVM [5]. The pre-built binaries svm-scale, svm-train and svm-predict are used in order to obtain a model as well as classification results.


Figure 3.7: Visualization of a point cloud representation of the data generated by the simulator.

A flowchart of how these binaries are used is shown in Figure 3.8. Here the green boxes are files used in the training phase of the SVM, the blue boxes represent the pre-built binaries, and the yellow boxes are files used when the SVM interface is used for predicting graspability labels on unclassified data.

Training the model

The training data is acquired as described in Section 3.3.2. This data is then scaled so that each data point belongs to the interval [0, 1] using svm-scale, which outputs a scaled version of the training data as well as the parameters (the range file) used for scaling. These parameters need to be saved, since the model will be trained on a dataset of this scale, and thus any unlabeled data that is to be subject to classification will then need to be scaled using the same parameters (as depicted by the yellow arrow in Figure 3.8). The scaled data is used along with hyperparameters (Section 3.3.2) by svm-train to generate a model utilising a Gaussian RBF kernel (Section 2.1.3), which can then be used for classification. As mentioned earlier, training the model corresponds to solving the quadratic programming problem of (2.18) (see Section 2.1.4), where LIBSVM employs the Sequential Minimal Optimization (SMO) [15] algorithm.
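An equivalent flow can be sketched in Python with scikit-learn, whose SVC class wraps LIBSVM (an illustrative sketch; the file names are hypothetical and C = γ = 8 are example values from Table 4.2):

import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

X_train = np.load("fpfh_train.npy")     # one FPFH histogram per row
y_train = np.load("labels_train.npy")   # 1 = graspable, 0 = non-graspable

scaler = MinMaxScaler(feature_range=(0, 1)).fit(X_train)   # the role of svm-scale
model = SVC(kernel="rbf", C=8.0, gamma=8.0)                # Gaussian RBF, Eq. (2.15)
model.fit(scaler.transform(X_train), y_train)

# Unlabeled data must be scaled with the *same* parameters (the "range file").
X_new = np.load("fpfh_new.npy")
labels = model.predict(scaler.transform(X_new))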

Classifying unlabeled data

First, the unlabeled data (assumed to be the output of the system in Section 3.3.1) needs to be scaled using svm-scale into the same space as the model was trained in, i.e. using the range file generated when scaling the training data (see Figure 3.8).


Figure 3.8: Overview of the flow and usage of the SVM interface. Training: SVM training data → svm-scale → scaled SVM training data and range file → svm-train → SVM model. Prediction: unlabeled SVM data → svm-scale → scaled SVM data → svm-predict → output labels. Blue corresponds to prebuilt binaries from LIBSVM and arrows correspond to file parsing performed by those binaries. Green corresponds to files used and generated during the training step, and yellow corresponds to files used and generated when the interface is used for classification.

The scaled data is then used together with the model file by svm-predict, which classifies the data and outputs a list of graspability labels.

Creating the training dataset

The training dataset for the SVM consists of concatenated subsets from several scenes. First, point clouds and grasping labels for several scenes are synthetically generated as described in Section 3.3.1. They are then used as input to the PCL interface described in Section 3.3.1, which computes FPFH features and generates data files in a format which LIBSVM can use. These data files are then used as input to the LIBSVM Python script subset.py, which takes a data file and generates a new file containing a subset of the input data with the same label distribution as the original data file. This means that the data file from every separate scene is sampled into a subset, thus reducing the size of the dataset. These subsets are then concatenated into a single data file, which is then used for training the model as described above. This sampling and concatenation of subsets reduces the size of the training set while still retaining varied examples from several different scenes. An overview of the training set creation process is shown in Figure 3.9.

Hyperparameter optimization

The hyperparameters in the model that need optimization (i.e. the ones that are user specified) are γ in (2.15) and C in (2.18). In addition, the radius r (Algorithm 2, Section 2.2.2) used for computing the FPFH features also affects the accuracy of the classification and can be seen as a hyperparameter.


Figure 3.9: Overview of the creation of a training dataset (SVM data from the PCL interface → subset.py → subsets → concatenation → SVM training data). Blue corresponds to scripts, where subset.py originates from LIBSVM. Green corresponds to files used and created during the process, and arrows correspond to file parsing.

Optimization in this context means that the values chosen for these parameters are the ones which generate the highest accuracy (as defined in Section 3.2) on the training datasets. This is done through a grid search as in [10], using n-fold cross-validation (with n = 5) to determine the accuracy for each combination of parameters.
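Such a search can be sketched with scikit-learn's GridSearchCV (an illustrative sketch; the exponentially spaced grid follows common practice for LIBSVM, and the FPFH radius r would be varied by recomputing the features outside this loop):

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X = np.load("fpfh_train_scaled.npy")    # hypothetical scaled training features
y = np.load("labels_train.npy")

grid = {"C": 2.0 ** np.arange(-2, 10),       # exponentially spaced candidates
        "gamma": 2.0 ** np.arange(-4, 5)}
search = GridSearchCV(SVC(kernel="rbf"), grid, cv=5, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, search.best_score_)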

3.3.3 Grasp point extraction

Once a point cloud has been classified, all non-graspable points are removed. The remaining graspable points are then clustered through the region-growing algorithm described in Section 2.3, using the PCL implementation [12]. For each cluster, the grasping point and gripper pose are then calculated as described in Section 3.1. The grasping points are then ranked according to the size of the clusters, and the grasping point belonging to the largest cluster is then selected as the chosen candidate.
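The extraction and ranking step reduces to a few lines; a sketch (illustrative; clusters is assumed to be a list of point-index arrays produced by the region-growing step):

import numpy as np

def select_grasp_point(points, clusters, min_size=50):
    # Discard clusters that are too small, rank the rest by size, and pick
    # the point closest to the centroid of the largest cluster.
    clusters = sorted((c for c in clusters if len(c) >= min_size),
                      key=len, reverse=True)
    best = points[np.asarray(clusters[0])]
    centroid = best.mean(axis=0)
    return best[np.argmin(np.linalg.norm(best - centroid, axis=1))]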


4 Results

The following chapter presents the results of the work described in Chapter 3. The methods of evaluation are described in detail in Section 3.2. The chapter is divided into two main sections with 4.1 presenting the classification performance of the SVM and 4.2 presenting the bin-picking performance of the system as a whole.

4.1 SVM performance

This section presents the performance of the SVM evaluated by accuracy as defined in Section 3.2.

4.1.1 Datasets

The datasets used in this work are generated as described in Section 3.3.1. They are listed in Table 4.1, where σ x,y and σ z are the standard deviations (in mm) of the white noise applied directly onto the x, y and z coordinates respectively. Each dataset is named after the object model used in the scenes (with dimensions given in mm) and the level of noise applied to the respective coordinates, with the decimal separator placed after the first figure (e.g. XY025 denotes σ x,y = 0.25). Cylinder dimensions are given as diameter × height, while cube dimensions are the length of the sides. Noteworthy is that the simulator generates data with a sampling distance between points of approximately 3 mm. Datasets were generated at different scales, both larger and smaller than the ones used for training models, in order to examine scale invariance in the feature space. Cylinders and cubes were chosen as objects since they differ in their geometrical properties, the cube having many sharp edges and corners while the cylinder has a circular curvature. Some datasets were generated in two versions, where one was used for training the model (name ending with T) and one for validation (name ending with V). Both training and validation datasets consist of the same number of objects and scenes and the same noise levels. Each dataset consists of 30 scenes. The datasets used for training were sampled (with 1500 samples per scene) and concatenated as described in Section 3.3.2.

Dataset                  Objects  σ x,y  σ z
cyl70x70_XY0_Z0T/V        243     0.0    0.0
cyl70x70_XY025_Z05T/V     243     0.25   0.5
cyl70x70_XY05_Z1T/V       243     0.5    1.0
cyl50x50_XY0_Z0V          288     0.0    0.0
cyl50x50_XY025_Z05V       288     0.25   0.5
cyl50x50_XY05_Z1V         288     0.5    1.0
cyl100x100_XY0_Z0V        135     0.0    0.0
cyl100x100_XY025_Z05V     135     0.25   0.5
cyl100x100_XY05_Z1V       135     0.5    1.0
cyl140x140_XY0_Z0V         63     0.0    0.0
cyl140x140_XY025_Z05V      63     0.25   0.5
cyl140x140_XY05_Z1V        63     0.5    1.0
cube90_XY0_Z0T/V          117     0.0    0.0
cube90_XY025_Z05T/V       117     0.25   0.5
cube90_XY05_Z1T/V         117     0.5    1.0
cube60_XY0_Z0V            288     0.0    0.0
cube60_XY025_Z05V         288     0.25   0.5
cube60_XY05_Z1V           288     0.5    1.0
cube130_XY0_Z0V            63     0.0    0.0
cube130_XY025_Z05V         63     0.25   0.5
cube130_XY05_Z1V           63     0.5    1.0

Table 4.1: Datasets used for training and validation. They are named after the object model used in the dataset and its dimensions (diameter and height for cylinders, side length for cubes), the amount of noise, and whether the dataset is used for training (T) or validation (V). The σ columns show the standard deviation (in mm) of the white noise applied directly onto the x, y and z coordinates respectively.

Object models

The same cylinder object model, shown in Figure 4.1, is used for all datasets named after it. The measurements are given as cylinder height and diameter in mm. The difference in measurements between the objects in different datasets is a result of scaling. The width of each non-graspable area on the side of the cylinder is 15% of the total cylinder height, regardless of scale.


Figure 4.1: The cylinder model.

As with the cylinder datasets, the cube datasets all originate from the same object model, shown in Figure 4.2, again with different scalings. The measurement is given as the length of the sides of the cube in mm. The width of the non-graspable areas at the edges of the cubes is 10% of the cube side length.

Figure 4.2: The cube model.

4.1.2 Model hyperparameters

Table 4.2 presents the values of the hyperparameters r, C and γ (found through the hyperparameter optimization of Section 3.3.2) that each model was trained with. It is important to note that the r presented in this chapter is the radius used for determining the normals; the radius used for calculating the FPFH features is 2.5 times this value, following the recommendations in the PCL implementation of FPFH. Table 4.2 also presents the amount of noise σ used in the generation of the training dataset for each model. Both the radius r and the standard deviations of the noise σ are given in mm.


Model                        σ x,y  σ z  C      γ    r
cyl70x70_XY0_Z0T             0.0    0.0  5.7    5.7  10.0
cyl70x70_XY025_Z05T          0.25   0.5  8.0    8.0  11.5
cyl70x70_XY05_Z1T            0.5    1.0  8.0    8.0  13.5
cube90_XY0_Z0T               0.0    0.0  512.0  8.0  15.0
cube90_XY025_Z05T            0.25   0.5  8.0    8.0  8.5
cube90_XY05_Z1T              0.5    1.0  5.7    8.0  7.0
cyl70x70+cube90_XY025_Z05T   0.25   0.5  22.6   8.0  15.0

Table 4.2: The amount of noise and the values of the hyperparameters used for training each model.

As can be seen from Table 4.2, the optimal radius increases with the amount of noise in the case of cylinders, while for cubes the relationship is reversed.

4.1.3 Model performance

The models are named after the dataset upon which they were trained. The two last columns of Tables 4.3–4.7 show which dataset the model was tested against and the average accuracy (the sum of the accuracies of all scenes divided by the number of scenes) for that dataset.

Accuracy and influence of noise

In Tables 4.3 and 4.4 the models and their performance on the test sets are shown. The models are trained on a specific level of noise and then tested against various levels of noise.

Model                 Test set              Acc
cyl70x70_XY0_Z0T      cyl70x70_XY0_Z0V      91.29%
cyl70x70_XY0_Z0T      cyl70x70_XY025_Z05V   84.96%
cyl70x70_XY0_Z0T      cyl70x70_XY05_Z1V     66.19%
cyl70x70_XY025_Z05T   cyl70x70_XY0_Z0V      87.79%
cyl70x70_XY025_Z05T   cyl70x70_XY025_Z05V   90.50%
cyl70x70_XY025_Z05T   cyl70x70_XY05_Z1V     83.85%
cyl70x70_XY05_Z1T     cyl70x70_XY0_Z0V      85.33%
cyl70x70_XY05_Z1T     cyl70x70_XY025_Z05V   88.73%
cyl70x70_XY05_Z1T     cyl70x70_XY05_Z1V     88.92%

Table 4.3: Performance of cylinder models trained on various amounts of noise on test sets.

Looking at Table 4.3, it can be seen that in the case of cylinders, models trained on a higher level of noise also perform reasonably well on scenes with less added noise. The reverse is not true, which becomes especially clear when looking at model cyl70x70_XY0_Z0T, where the addition of noise in the datasets heavily reduces the accuracy of the classification.


Model               Test set            Acc
cube90_XY0_Z0T      cube90_XY0_Z0V      74.09%
cube90_XY0_Z0T      cube90_XY025_Z05V   73.00%
cube90_XY0_Z0T      cube90_XY05_Z1V     48.25%
cube90_XY025_Z05T   cube90_XY0_Z0V      69.69%
cube90_XY025_Z05T   cube90_XY025_Z05V   79.25%
cube90_XY025_Z05T   cube90_XY05_Z1V     51.36%
cube90_XY05_Z1T     cube90_XY0_Z0V      38.10%
cube90_XY05_Z1T     cube90_XY025_Z05V   67.62%
cube90_XY05_Z1T     cube90_XY05_Z1V     80.63%

Table 4.4: Performance of cube models trained on various amounts of noise on test sets.

In the case of cubes, Table 4.4 shows a different behaviour than that of the cylinder models. Cube models trained on a specific amount of noise do not generally perform reasonably well on datasets with lower levels of noise. In the case of cylinders, the drop in accuracy was a few percentage points, but for cubes the reduction in accuracy is around 10 percentage points. Further, all cube models perform worse than their cylinder counterparts. Part of the explanation lies with the models' inability to detect edges if there are not points on both sides of the edge; see Figure 4.3. This problem arises because the FPFH feature utilizes point normals, meaning that if there are no points "on the other side of the edge" (compare the cyan marked edge with the yellow one in Figure 4.3), the normals will point in the same direction as those of a flat surface. This means that such an edge is classified as a flat surface that ends, instead of as an actual edge.

Figure 4.3: Example of when the model fails to detect an edge due to lack of points on both sides of the edge. The cyan marked edge is wrongly classified, as opposed to the yellow marked one, where the presence of just a few points proves to be sufficient.

Accuracy on objects of different scale

Tables 4.5 and 4.6 show the results of testing a model on the same type of object but at a different scale. This is to examine how well the models generalize to objects with the same basic shape but of a different size. The results of testing the models on objects of the same scale are included in the tables in order to ease the comparison.

Model                 Test set                Acc
cyl70x70_XY0_Z0T      cyl70x70_XY0_Z0V        91.29%
cyl70x70_XY0_Z0T      cyl50x50_XY0_Z0V        83.15%
cyl70x70_XY0_Z0T      cyl100x100_XY0_Z0V      85.83%
cyl70x70_XY0_Z0T      cyl140x140_XY0_Z0V      86.76%
cyl70x70_XY025_Z05T   cyl70x70_XY025_Z05V     90.50%
cyl70x70_XY025_Z05T   cyl50x50_XY025_Z05V     81.04%
cyl70x70_XY025_Z05T   cyl100x100_XY025_Z05V   84.69%
cyl70x70_XY025_Z05T   cyl140x140_XY025_Z05V   86.69%
cyl70x70_XY05_Z1T     cyl70x70_XY05_Z1V       88.92%
cyl70x70_XY05_Z1T     cyl50x50_XY05_Z1V       77.48%
cyl70x70_XY05_Z1T     cyl100x100_XY05_Z1V     85.16%
cyl70x70_XY05_Z1T     cyl140x140_XY05_Z1V     83.87%

Table 4.5: Performance of cylinder models on datasets containing differently scaled objects.

In general, the performance of the classification on objects with a different scale than the one used to train the model is worse than the performance on objects of the same scale, which could be expected. It can also easily be discerned from Table 4.5 and Table 4.6 that the models generally perform better on larger objects than on smaller ones, regardless of the noise level. In general terms, in the case of cylinders, scaling the objects to twice the size reduces the accuracy by 4-5 percentage points. In the case of cubes, an increase in size of about 44% results in a drop in accuracy of approximately 2 percentage points, except in the case of cube90_XY0_Z0, which actually performed slightly better on the datasets with larger objects, with an increase in accuracy of 1.5 percentage points.

Model               Test set             Acc
cube90_XY0_Z0T      cube90_XY0_Z0V       74.09%
cube90_XY0_Z0T      cube60_XY0_Z0V       67.93%
cube90_XY0_Z0T      cube130_XY0_Z0V      75.57%
cube90_XY025_Z05T   cube90_XY025_Z05V    79.25%
cube90_XY025_Z05T   cube60_XY025_Z05V    73.17%
cube90_XY025_Z05T   cube130_XY025_Z05V   77.35%
cube90_XY05_Z1T     cube90_XY05_Z1V      80.63%
cube90_XY05_Z1T     cube60_XY05_Z1V      74.55%
cube90_XY05_Z1T     cube130_XY05_Z1V     78.32%

Table 4.6: Performance of cube models on datasets containing differently scaled objects.


Accuracy of a model trained on a mixed dataset

The model cyl70x70+cube90_XY025_Z05T was trained on a mixed dataset containing the first 15 scenes of cyl70x70_XY025_Z05T and cube90_XY025_Z05T respectively (totalling 30 scenes). Table 4.7 shows the performance of the mixed model on cylinder and cube datasets, along with the performance of the models trained solely on either cylinders or cubes for easy comparison.

Model                        Test set              Acc
cyl70x70_XY025_Z05T          cyl70x70_XY025_Z05V   90.50%
cube90_XY025_Z05T            cube90_XY025_Z05V     79.25%
cyl70x70+cube90_XY025_Z05T   cyl70x70_XY025_Z05V   84.36%
cyl70x70+cube90_XY025_Z05T   cube90_XY025_Z05V     74.37%

Table 4.7: Performance of the model trained on the mixed dataset, compared with the models trained on a single object type.

Looking at Table 4.7, it becomes apparent that the mixed model performs worse than its specialized counterparts, which could be expected. The drop in accuracy is in the range of 5-6 percentage points for both cylinders and cubes. However, judging from the accuracy, it seems that the mixed model can discriminate reasonably well between the graspable flat surface of a cube and the non-graspable flat surface of a cylinder.


4.2 Bin picking performance

This section presents the performance of the algorithm as a whole, evaluated by success rate as defined in Section 3.2. An example of the type of scenes that are used in the evaluation is shown in Figure 4.4. A complete account of the scenes used in the evaluation is given in Appendix A.

Figure 4.4: Example scene configuration of the datasets used in the algo-rithm evaluation.

4.2.1 Picking cylinders

The model selected for tests on real scanned data was cyl70x70_XY025_Z05T. A visual inspection of the results of the classification showed that models trained on this level of noise generalized best to real scanned data. The results of the performed tests are shown in Table 4.8.

Table 4.8 shows that the algorithm performs well on real scanned datasets. It can also be seen that it generalizes well to objects of different scale. However, it does fail when the cylinders are aligned with their flat surfaces towards each other, as shown in Figure 4.5. This is because the algorithm fails to discriminate between the two cylinders and treats them as a single object, placing the gripper at the seam between the objects, as shown in Figure 4.5.


Dataset            #Objects  Success  Notes
cyl70x70rand1      14        100.0%   -
cyl70x70rand2      14        100.0%   -
cyl70x70rand3      14        100.0%   -
cyl70x70pyramids   14        100.0%   -
cyl70x70corners    4         100.0%   -
cyl70x70aligned    14        14.3%    Terminated early because of grasping points at the edge between cylinders
cyl103x78rand1     14        85.7%    Terminated early (2 objects left) because of grasping points at the edge between cylinders
cyl103x78rand2     14        100.0%   -
cyl46x55rand1      24        100.0%   -
cyl46x55rand2      24        100.0%   -
cylMixedRand       40        95.2%    One pick pose would cause a collision between the gripper and the bin; one picked object was significantly blocked by other objects

Table 4.8: Performance of model cyl70x70_XY025_Z05T on real scanned datasets.

Figure 4.5: A failed grasping attempt due to the model's inability to discriminate the two cylinders.

The SVM classification on cylinders

In Figure 4.6, an example of using the model cyl70x70_XY025_Z05T on the dataset cyl70x70rand1 is shown. As can be seen, many points close to the edge of a cylinder are marked as non-graspable, though far from every such point receives this correct classification. These imperfections can be seen as a result of the shift from the synthetic to the real-world domain. Further, it can be seen that non-graspable points seem to be subject to misclassification more often than graspable points.

Figure 4.6: Classification results of model cyl70x70_XY025_Z05T on dataset cyl70x70rand1.
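As an illustration of this classification step, the sketch below computes FPFH descriptors for a scanned cloud with Open3D and classifies each point with a trained SVM. The file name and the search radii are assumptions, as are the model and scaler objects (taken from the training sketch above); they are not the exact values used in this work.

```python
import numpy as np
import open3d as o3d

# Load a real scanned cloud (hypothetical file name).
pcd = o3d.io.read_point_cloud("cyl70x70rand1.ply")

# FPFH requires normals; the radii here are illustrative and would
# need tuning to the scale and resolution of the scanned objects.
pcd.estimate_normals(
    o3d.geometry.KDTreeSearchParamHybrid(radius=10.0, max_nn=30))
fpfh = o3d.pipelines.registration.compute_fpfh_feature(
    pcd, o3d.geometry.KDTreeSearchParamHybrid(radius=25.0, max_nn=100))

# fpfh.data is 33 x N; transpose to one 33-dim descriptor per point.
X = np.asarray(fpfh.data).T

# 'scaler' and 'model' are assumed to come from the training sketch.
labels = model.predict(scaler.transform(X))  # 1 = graspable, 0 = not

# Visualize: graspable points green, non-graspable points red.
colors = np.where(labels[:, None] == 1, [0.0, 1.0, 0.0], [1.0, 0.0, 0.0])
pcd.colors = o3d.utility.Vector3dVector(colors)
o3d.visualization.draw_geometries([pcd])
```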

In Figure 4.7 the clustering of the classification results presented in Figure 4.6 is shown. Each cluster is marked with an individual colour; clusters marked in red contain too few points and are therefore discarded.

Figure 4.7: Clustering of the results of model cyl70x70_XY025_Z05T on dataset cyl70x70rand1. Each cluster has a separate colour. Red clusters consist of too few points and are discarded.
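A rough sketch of this clustering step is given below, using DBSCAN from scikit-learn as a stand-in for the clustering described in Section 3.1. The distance threshold, minimum sample count and minimum cluster size are illustrative assumptions, and pcd and labels are assumed to come from the classification sketch above.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Keep only the 3D coordinates of points classified as graspable.
points = np.asarray(pcd.points)
graspable = points[labels == 1]

# Euclidean clustering of the graspable points; eps is the neighbour
# distance threshold (in the cloud's units) and is an assumption.
clustering = DBSCAN(eps=5.0, min_samples=5).fit(graspable)

# Discard clusters too small to fit the vacuum gripper; the threshold
# of 50 points is a placeholder, not the value used in the thesis.
MIN_CLUSTER_SIZE = 50
clusters = []
for cid in set(clustering.labels_) - {-1}:  # -1 marks DBSCAN noise
    member_idx = np.flatnonzero(clustering.labels_ == cid)
    if member_idx.size >= MIN_CLUSTER_SIZE:
        clusters.append(graspable[member_idx])

# Rank the remaining clusters by size, largest first.
clusters.sort(key=len, reverse=True)
```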



4.2.2 Picking cubes

The model selected for tests on real scanned data was cube90_XY025_Z05T. The results of the performed tests are shown in Table 4.9.

Dataset          #Objects   Success   Notes
cube90rand1      6          66.7%     Terminated early because grasping point at the edge between cubes
cube90rand2      6          100.0%    -
cube90corners    4          100.0%    -
cube90Aligned    6          57.1%     Terminated early because grasping point at the edge between cubes
cuboid90x45x45   15         100.0%    -
cuboidMixed      25         100.0%    Includes cube90, cuboid90x45x45 and cuboid102x90x43

Table 4.9: Performance of model cube90_XY025_Z05T on real scanned datasets.

Table 4.9 shows results similar to those of the cylinder model in Section 4.2.1. The model generally performs well on real scanned datasets regardless of the scale of the objects. It shares the same difficulty with discriminating objects which are aligned towards each other, as shown in Figure 4.8. Here it is shown that the cubes do not need to be perfectly aligned edge-to-edge for the algorithm to fail. It is sufficient that a large enough part of their edges lie close to each other, so that artifacts in the data acquisition make the algorithm unable to differentiate between the two objects.

Figure 4.8: A failed grasping attempt due to the model's inability to discriminate the two cubes.


The SVM classification on cubes

In Figure 4.9, an example of using the model cube90_XY025_Z05T on the dataset cube90rand1 is shown. As can be seen, in the case of cubes, an SVM trained on synthetic data does not seem to generalize as well to real scans as it does in the case of cylinders. Interestingly, the misclassification relationship seems to be the reverse of that for cylinders: for cubes, it is more common for a graspable point to be wrongly classified than for a non-graspable one. However, as shown in Table 4.9, these results seem to be sufficient for generating viable grasping points, since in most cases the misclassified graspable points do not lie close to the edges.

Figure 4.9: Classification results of model cube90_XY025_Z05T on dataset cube90rand1.

4.2.3 Picking mixed objects

As explained in Section 4.1.3 the model cyl70x70+cube90_XY025_Z05T was trained on a mixed dataset containing 15 scenes with cubes and 15 scenes with cylinders. In Table 4.10 the performance of this model on real scanned datasets is shown.



Dataset            #Cyl.   #Cubes   Success   Notes
mixedRand          6       4        100.0%    -
mixedCylOnTop      6       4        100.0%    -
mixedCylStanding   3       3        50.0%     The three cylinders were picked but marked as failures since the model is trained to not classify the flat surfaces of a cylinder as graspable

Table 4.10: Performance of model cyl70x70+cube90_XY025_Z05T on real scanned datasets. Columns #Cyl. and #Cubes give the number of cylinders and cubes present in the respective datasets.

Table 4.10 shows that the model performs reasonably well on datasets containing different types of objects. However, it fails to discriminate between the non-graspable flat surface of a cylinder and the graspable surface of a cube, as shown by the 50% success rate for the dataset mixedCylStanding. In this case, grasping the cylinders on their flat surface is marked as a failure, since the model is trained to label the flat surfaces of a cylinder as non-graspable. Part of the explanation for this failure to discriminate can be found in Figure 4.10. Due to the position of the camera, there are no points on the side of two of the cylinders. This leads to a situation similar to that shown in Figure 4.3 in Section 4.1.3, where the edge is not detected but rather a "flat surface that ends". Again, this shows that the presence of just a few points on the side is sufficient for the curved edge to be detected, and thus also the flat surface of the cylinder which lies close to it.


5 Conclusion and Future Work

This chapter presents some final conclusions regarding the work presented in this thesis as well as thoughts on possible future work.

5.1 Conclusions

The results presented in Chapter 4 show that the chosen approach to the task of bin picking is a viable one, given the robustness in terms of the success rates presented in Section 4.2. It could be argued that a CAD-based approach, as described in Section 1.1, would suffer from the same discriminative problem of separating cylinders and cubes which are aligned side-by-side. Such an approach would, however, most probably not be as sensitive to alignment and would most likely produce successful results in the case shown in Figure 4.8. On the other hand, a CAD-based approach would not generalize to objects of different dimensions as well as the algorithm presented in this work; it would require more prior knowledge of the dimensions of the objects in the scene.

Regarding the questions asked in Section 1.2, the results of Section 4.1 show that it is indeed possible to train an SVM to successfully discriminate graspable points from non-graspable ones. It should however be mentioned that the quality of the classification is highly dependent on the type of object used. As was shown, in the noise-free case, the cylinder model achieved an average accuracy of 91.29% while the cube model achieved an average accuracy of 74.09%. It was shown in Figure 4.3, and explained in Section 4.1, that the cube model has problems with detecting edges if there are no points on both sides of the edge, due to the inherent dependency on point normals of the FPFH feature. Thus there is a strong dependency on the objects being classified and what discriminative geometrical properties they possess, as well as on which regions of the objects are declared as graspable or non-graspable. All in all, this means that it is indeed possible to successfully discriminate graspable points from non-graspable ones in some cases, but claiming that it would work generally for all objects and applications would be too daring a statement to make.

The results of Section 4.2 implicitly show that it is possible to use an SVM trained on synthetic data for successful discrimination on non-synthetic 3D-scans. It should be clarified that the results in Section 4.2 do not say that much about the quality of the SVM classification in terms of accuracy, as in the case with synthetic data. Rather, they show that the quality of the classification is sufficient for successful extraction of grasping points according to the approach described in Section 3.1. This in turn answers the last question of Section 1.2, i.e. that it is indeed possible to successfully extract grasping points from non-synthetic 3D-scans.

5.2 Future work

There exist many possible extensions and further investigations of the algorithm presented in this work. One interesting investigation would be to extend the feature vector with e.g. the number of neighbours of the query point. This could possibly help in solving the problem with wrongly classified points close to edges, as described in Section 4.1, since points close to these "flat edges" have fewer neighbours than those that lie centrally on the surface on the side of the cube, thus introducing a way to discriminate between the two cases. The approach would probably require some kind of weighting of this extra feature dimension for it to be of significance, since the FPFH is a 33-dimensional feature vector. A minimal sketch of such an extension is given below.
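In this sketch, the support radius and the weight w are hypothetical tuning parameters, and the fpfh array is assumed to hold one 33-dim FPFH descriptor per point, as in the classification sketch in Section 4.2.1.

```python
import numpy as np
from scipy.spatial import cKDTree

def extend_features(points, fpfh, radius=25.0, w=10.0):
    """Append a weighted neighbour-count dimension to each 33-dim
    FPFH descriptor. Points near a 'flat edge' have fewer neighbours
    within the support radius than points on the interior of a face.
    Both radius and the weight w are hypothetical parameters."""
    tree = cKDTree(points)
    counts = np.array(
        [len(tree.query_ball_point(p, radius)) for p in points],
        dtype=float)
    # Normalize to [0, 1] so that the weight alone controls how much
    # the new dimension influences the SVM relative to the FPFH bins.
    counts /= counts.max()
    return np.hstack([fpfh, w * counts[:, None]])  # N x 34
```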

Lastly, improvements on the algorithm as a whole could be made by modifying the way in which the grasping points are ranked. As described in Section 3.3.3, the grasping points are ranked according to the size of their respective cluster. If the position of the grasping point were weighed in together with the cluster size, grasping points closer to the top of the bin would be favoured even if their clusters are slightly smaller; one possible form of such a combined ranking is sketched after this paragraph. This could improve robustness, since it would help deal with the problem of selecting grasping points on large objects which are occluded by smaller objects, as in one case in Section 4.2.1. Further, an improvement could be made in determining the pose of the gripper. Instead of simply selecting the normal of the point closest to the centre of gravity, it would be interesting to perform some kind of search in order to obtain a grasping point, and a pose of the gripper, that is positioned and directed more towards the opening of the bin. This could help eliminate the cases where a grasping attempt fails because the gripper would collide with the bin or other objects. The reasoning here is based on the notion that a pose which is more aligned with the approach of the gripper (from the top of the bin) would reduce the risk of a collision. The result would be an algorithm with increased robustness.
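The combined ranking could look as follows, assuming the z-axis points towards the opening of the bin; the mixing weight alpha is a hypothetical parameter, and the clusters are arrays of 3D points as produced by the clustering sketch in Section 4.2.1.

```python
import numpy as np

def rank_grasp_points(clusters, alpha=0.5):
    """Rank candidate grasping points by a weighted combination of
    cluster size and height (z) of the cluster centroid. alpha = 0
    reproduces the pure size-based ranking of Section 3.3.3, while
    alpha = 1 ranks purely by height. Both terms are normalized to
    [0, 1] so that alpha alone controls the trade-off."""
    sizes = np.array([len(c) for c in clusters], dtype=float)
    heights = np.array([c.mean(axis=0)[2] for c in clusters])
    size_score = sizes / sizes.max()
    height_score = (heights - heights.min()) / (np.ptp(heights) + 1e-9)
    scores = (1.0 - alpha) * size_score + alpha * height_score
    return np.argsort(scores)[::-1]  # cluster indices, best first
```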


A Scene configuration of evaluation datasets


[Figure: scene configurations of the cube evaluation datasets: (a) cube90rand1, (b) cube90rand2, (c) cube90corners, (d) cube90Aligned, (e) cuboid90x45x45, (f) cuboidMixed]



[Figure: scene configurations of the cylinder evaluation datasets: (a) cyl70x70rand1, (b) cyl70x70rand2, (c) cyl70x70rand3, (d) cyl70x70pyramids, (e) cyl70x70corners, (f) cyl70x70aligned, (g) cyl103x78rand1, (h) cyl103x78rand2, (i) cyl48x55rand1, (j) cyl48x55rand2, (k) cylMixedRand]


[Figure: scene configurations of the mixed evaluation datasets: (a) mixedRand, (b) mixedCylOnTop, (c) mixedCylStanding]
