Evaluation of probabilistic representations for modeling and understanding shape based on synthetic and real sensory data

(1)

,

STOCKHOLM SWEDEN 2017

Evaluation of probabilistic

representations for modeling and

understanding shape based on

synthetic and real sensory data

GABRIELA ZARZAR GANDLER

KTH ROYAL INSTITUTE OF TECHNOLOGY

(2)

representations for modeling and

understanding shape based on

synthetic and real sensory data

GABRIELA ZARZAR GANDLER

Master in Machine Learning Date: September 20, 2017

Principal: ABB Corporate Research Supervisor: Yasemin Bekiroglu (ABB)

Co-supervisors: Mårten Björkman (KTH) and Carl Henrik Ek (University of Bristol, UK)

Examiner: Danica Kragic

Swedish title: Utvärdering av probabilistiska representationer för modellering och förståelse av form baserat på syntetisk och verklig sensordata

(3)

Abstract

(4)

Sammanfattning

(5)

Contents iii Glossary v 1 Introduction 1 1.1 News Value . . . 2 1.2 Research Question . . . 3 1.3 Scope . . . 3

1.4 Societal and ethical aspects . . . 3

1.5 Outline . . . 4 2 Background 5 2.1 Gaussian Process . . . 5 2.1.1 Preliminaries . . . 5 2.1.1.1 The Prior . . . 6 2.1.1.2 The Posterior . . . 7 2.1.1.3 The Evidence . . . 8 2.1.1.4 Kernel Functions . . . 8

2.1.2 Gaussian Process Implicit Surface . . . 10

2.1.3 Sparse Gaussian Process . . . 10

2.1.4 Variational Sparse Gaussian Process . . . 12

2.2 Shape Learning . . . 16

2.2.1 Preliminaries . . . 16

2.2.2 Probabilistic Surface Learning . . . 16

2.2.3 Shape Descriptors . . . 18 2.2.3.1 Principal Curvatures . . . 19 2.2.4 Object Clustering . . . 20 2.2.4.1 Spectral Clustering . . . 21 3 Method 22 3.1 Data Sets . . . 22

3.1.1 Princeton ModelNet Synthetic Data Set . . . 22

3.1.2 Real Sensory Data Set . . . 24

3.2 System Outline . . . 26

3.2.1 Surface Reconstruction . . . 27

3.2.2 Object Clustering . . . 30

3.3 Experiments . . . 31

(6)

3.3.1 Phase 1: Exploration . . . 31

3.3.2 Phase 2: Exploitation . . . 32

3.4 Evaluation . . . 32

3.4.1 Triangular Mesh with Associated Uncertainty . . . 32

3.4.2 Distances to Ground Truth . . . 32

3.4.3 Probabilistic Occupancy Map . . . 33

3.4.4 Clustering Performance . . . 34

3.4.5 Spectral Embedding . . . 34

3.5 Software and Libraries . . . 34

4 Results and Discussion 35 4.1 Phase 1: Exploration . . . 35

4.1.1 Kernel Functions . . . 35

4.1.2 Sparse and Full Gaussian Processes . . . 39

4.1.3 Number of Inducing Variables . . . 42

4.2 Phase 2: Exploitation . . . 43

4.2.1 Surface Reconstruction . . . 43

4.2.1.1 Princeton ModelNet Synthetic Data Set . . . 43

4.2.1.2 Real Sensory Data Set . . . 45

4.2.2 Object Clustering . . . 48

5 Conclusions and Future Work 50

Bibliography 52

(7)

List of abbreviations

Acronym Definition

GP Gaussian Process GRF Gaussian Random Field

DTC Deterministic Training Conditional SVI Stochastic Variational Inference SE Squared Exponential

RBF Radial Basis Function

MA Matérn

MA52 Matérn with ν = 5/2

TP Thin-plate

KL Kullback-Leibler ICP Iterative Closest Point

SVD Singular Value Decomposition POM Probabilistic Occupancy Map

BFGS Broyden–Fletcher–Goldfarb–Shanno

L-BFGS-B Limited-memory BFGS that handles bound constraints

(8)

Mathematical notation

Symbol Definition

Gaussian Processes

n Number of observations

D Number of dimensions in the covariate

x The covariate (input), a column vector in RD, i.e. x = [x1, ..., xD]T

X A matrix in Rn,D_{with all n observed covariates (inputs), i.e. X = [x}

1, ..., xn]T

y The dependent variable (output), a scalar in R

y A column vector in Rnwith all n observed dependent variables (outputs), i.e. y = [y1, ..., yn]T

S A dataset, i.e. S = {(xi, yi)|i = 1, ..., n} = (X, y)

f A latent function value in R

f A column vector in Rn_{with latent function values, i.e. f = [f}

1, ..., fn]T

m(·) A column vector that represents a mean k(·, ·) A kernel/covariance function

K(·, ·) A kernel matrix

θ A vector with kernel hyperparameters

µ A column vector that represents the mean of a multivariate Gaussian distribution Σ A square matrix that represents the covariance of a multivariate Gaussian distribution σ2 A variance of a Gaussian noise

l A lengthscale

ν A hyperparameter of the Matérn class of kernel functions R The hyperparameter of the thin-plate kernel function Γ(·) The Gamma function

Hν(·) A modified Bessel function of the second kind of order ν

KL(BkA) The KL divergence from an arbitrary probability distribution A to an arbitrary probability distribution B

q(·) A probability distribution that approximates another probability distribution p(·)

∗

Superscript that refers to test variables

u _{Superscript that refers to inducing variables}

q _{Superscript that refers to something associated to an approximate distribution}

Principal curvatures

u, v Parameters of a surface r

r A regular surface function r(u, v) : R2 _{→ R}3

S A surface

p A point represented by a vector in R3 n A unit-length normal vector in R3

C A curve that belongs to a surface α(s) A function that parameterizes a curve s The arc length of a curve

c The normal curvature of a curve at a certain point

(9)

Symbol Definition

Distance between features

Gi,j A matrix containing the squared Euclidean distance between the principal curvatures in object i and in object j

Gi,j_kernel A matrix defined as Gi,j_kernel = exp −G₂i,j

gi,j A scalar obtained by calculating the mean of each column in Gi,j_kernel followed by the mean of the means

di,j Distance measure di,j = gi,i+ gj,j− 2gi,j _{between objects i and j}

Similarity and spectral clustering

wi,j A similarity measure between objects i and j

W A square and symmetric affinity matrix containing similarity measures wi,j between pairs of objects

Z A diagonal matrix whose diagonal values correspond to the sum of the respective row values in W

L A normalized Laplacian matrix, where L = Z−12W Z− 1 2 t Number of clusters used as input to a clustering algorithm

V A matrix formed by stacking the t eigenvectors corresponding to the t largest eigenvalues of L in columns

General

k·k The Euclidean distance of a vector

(10)

(11)

Introduction

Deriving 3D object shape information from point clouds is a key component in a vast num-ber of industrial and scientific applications, ranging from archaeology [1] to robotics [2]. A 3D point cloud - i.e. a collection of 3D point coordinates - consists of means to represent scenes and objects in the real world. Sensors such as Kinect cameras, laser sensors and hap-tic devices commonly perceive object shape by capturing a point cloud of object’s surface. This raw data is thus a simplified representation of the real world object and typically does not directly encode the full shape.

With this in mind, this work tackles the extraction of object shape information from point clouds. We choose to understand shape particularly by learning the object complete surface. Object’s surface reconstruction has shown to be a comprehensive way to reach shape infor-mation, because it enables i.a. surface feature extraction [3], surface points smoothing and re-sampling [4], occlusion checks [5] as well as object recognition or clustering [3].

The scientific interest towards extracting object shape information within robotics has increased during the last years. The advancements in robotic perception learning are pro-gressively enabling robots to better execute tasks in different environments, as highlighted by [2]. These authors recognize the importance of building object models to enhance robotic perception, as well as the crucial connection between the model construction and the active interactions between the robot and the environment. The work [6] draws attention to the use of shape information as input to robot grasp and motion planning tasks, while [7] employs this information to estimate object pose. Furthermore as the diversity, ease of use, and popu-larity of 3D acquisition methods continues to increase, so does the urge for the development of new surface reconstruction techniques [8].

The problem at hand involves various challenges. In robotic perception tasks, there exists many sources of uncertainty, such as noisy sensor data. Through noisy sensors the robot can only partially observe the current state of the world [2]. Additionally [8] underlines the im-portance of adapting the surface reconstruction method to the actual properties of the point cloud at hand. The authors list some relevant kinds of artifacts that are commonly present in point clouds, such as uneven sampling density, outliers, misalignment, missing/incomplete data and, as mentioned, sensor noise. Furthermore particular objects have complex shapes - differently from primitive shapes, such as a box and a cylinder - and are therefore more challenging to model.

In order to reconstruct object’s surfaces, we propose 3D probabilistic models of surfaces, based on the raw data. Probabilistic representations play an important role in this work, because they provide an adequate framework to - among other reasons - describe not only

(12)

clean (i.e. without noise), but also uncertain (i.e. with noise) and incomplete data. In this context Gaussian Process (GP) regression [9] has been in the recent years a promising mod-eling framework to reconstruct object’s surfaces [10, 11, 3, 5, 12], commonly because it is a data-driven probabilistic modeling framework and hence very flexible. In particular we em-ploy Gaussian Process Implicit Surface (GPIS), a mechanism to estimate the surface of an object with uncertainty, by extracting the level-set of an underlying GP in 3D. By employing GPIS, we aim to find adequate probabilistic representations of object shapes, which are sat-isfactorily capable of capturing the true object shape characteristics in a principled manner.

Despite coping better with noisy observations in comparison to deterministic models -3D probabilistic models usually become slower to compute as the size of the solution space grows [2]. This commonly creates a trade-off between modeling uncertainty and computa-tional speed. Instead we propose an alternative solution to tackle this problem. Rather than modeling object’s surface via conventional GP models, we utilize sparse GPs with variational formulation [13]. This alternative GP formulation has a much lower computational demand and allows surface modeling even when raw data is relatively large. Further information about these different GP variants is given in the next chapter.

Besides applying sparse GP models to estimate object’s surfaces, we analyze how these models can be used to distinguish different object classes. Various approaches have been put forward to perform object clustering from 3D raw data, such as point cloud registration [4], surface feature extraction followed by calculation of similarity between features [3] and even more recently deep neural networks [14]. Similarly to [3], we choose to extract trans-lation and rotation invariant surface features and subsequently cluster objects based on the similarity between their corresponding feature matrices.

The research work was conducted at ABB Corporate Research, at the Automation Solu-tions department and particularly in the Mechanical Systems & Mechatronics group. ABB is a leading supplier of industrial robots and has a strong research focus on robotics and machine learning. Therefore the research topic is tightly connected to ABB’s business value.

1.1 News Value

This work is located at the intersection of three main research fields, namely machine learn-ing, computer vision and robotics. As a research contribution, this work aims to offer valu-able insights regarding how to encode shape information of objects in a principled and computationally viable way. Potential benefits from the outcomes of this work include fur-ther knowledge about 3D object modeling and particularly sparse shape representation from point clouds with and without noise. The outcomes may also provide a ground for future new insights regarding e.g. robot grasp and motion planning, since shape information is usually necessary for such planning tasks [6, 15].

The outcomes of this thesis extend the analysis previously published in [3] as described below. The overall news value offered by the proposed method can be summarized in the following topics:

(13)

perform all computations in a PC computer1_;

• For object’s surface estimation with GPIS, we adopt sparse GP with variational formu-lation [13], which has not been studied before, to the best of our knowledge;

• We undertake optimization of GP parameters to find most adequate model representa-tion, given the available raw data for each object. Contrarily GP model parameters are chosen manually in [3];

• Finally besides handling the real sensory data in [3], synthetic noiseless data is also used in the experiments. This data is extracted from the Princeton ModelNet data set [14]. The use of synthetic data perfectly smooth and generally complete surfaces -serves as benchmarks to evaluate our approach.

1.2 Research Question

The main research question is:

• How does modeling based on sparse Gaussian Processes with variational formulation perform as 3D probabilistic object’s surface models, i.e. as object representations that capture shape information, given synthetic and real sensory observations?

This leads to the following auxiliary questions:

• How do different kernel choices affect the probabilistic object’s surface models? • How does the proposed shape representation model both noisy and noiseless data? • How does the proposed shape representation perform when data is incomplete? • Can the proposed shape representation be used for object clustering?

1.3 Scope

This thesis focuses on extracting object shape information from raw data. This is performed through 3D object modeling, and especially through kernel methods in machine learning, in particular GPs. As mentioned, both synthetic and real sensory data are considered in the experiments. Through these experiments and further analysis we aim to address the problem of creating probabilistic representations of objects in a principled and computationally viable way, and further applying them for object clustering. Ultimately we intend to highlight the potential benefits that principled methods may offer to model object’s surfaces, from both noisy and noiseless data.

1.4 Societal and ethical aspects

The research contributions may potentially create a societal impact e.g. by enhancing per-ception skills of robots that can operate in cooperation with people for various tasks such as assembly in industrial environments or aiding people in domestic or office environments.

(14)

Figure 1.1: Illustration of a scenario where a human teaches a task to a YuMi robot which requires understanding object properties, such as pose (position and orientation).

These can be quite complex and dynamically changing tasks, requiring advanced learning and perception skills as addressed in this thesis. Robots are frequently faced with unknown objects (e.g. even in industrial settings CAD models of objects are not always available), which makes 3D object modeling an important task e.g. to enhance perception [2], support further grasp and motion planning tasks [6, 15] and aid object pose estimation [7]. Collabo-rative robots, such as ABB YuMi robot seen in Figure 1.1, perform tasks collaboCollabo-ratively with humans. Humans have knowledge about the given tasks, but often do not know about the specifics of the robot’s embodiment, and how the robot senses and perceives its surrounding, e.g. how a robot understands objects geometry - an important parameter for planning actions - from its observations. It is therefore important to equip robots with advanced perception skills to perceive objects based on their observations, making them more autonomous. In future scenarios, users will not only use robots as tools but they will be assisted by agents with advanced cognitive and learning abilities, often relying on advanced robotic perception, such as perception of objects, which is addressed in the thesis.

Robotic applications often raise ethical questions of socio-economic and legal natures. Researchers from diverse areas have been working towards tackling ethical questions about creating robotic technology and implementing it in societies, in a way that ensures the safety of humans [16]. Additionally computer vision is a field connected to various ethical aspects that have yet to be solved completely, in the first place privacy and data storage. While we recognize the importance of the mentioned ethical implications and the consequences of robotic technology connected to the studied topic, we believe they significantly rely on the particular application domain.

1.5 Outline

(15)

Background

Gaussian Processes have been used in recent literature of robotic perception [3, 4, 5, 11, 12, 17, 18, 19], especially due to their capability to learn spatial representations from noisy data in a non-parametric Bayesian formulation [9]. In this chapter we first review fundamental theory that is key to our work. Then the general field connected to learning shape representations of objects is analyzed, by reviewing previous works in the area.

2.1 Gaussian Process

The problem tackled by this work is, in essence, a regression problem. Consider the function f (x), where x = [x1, ..., xD]T ∈ RD represents a covariate (input) and f defines a mapping.

This mapping is in fact unknown - it is a latent function - and all one can observe is a dataset S = {(xi, yi)|i = 1, ..., n} = (X, y) filled with input locations xi ∈ RD and corresponding

noisy observations yi ∈ R. X is a matrix in Rn,Dwith all n observed covariates (inputs), i.e.

X = [x1, ..., xn]T, while y is a vector in Rnwith all n noisy observations, i.e. y = [y1, ..., yn]T.

The relation between function values and corresponding noisy observations is considered to be:

yi = f (xi) + i, (2.1)

where i ∼ N (0, σn2)are noise terms that are independent and identically distributed. They

are assumed to follow a zero-mean Gaussian distribution with variance σ2n. By identifying

the patterns in the set S, one wishes to find a reasonable way to make predictions, i.e. to estimate the function value f (x∗)in a new input (test point) location x∗outside the set S.

Section 2.1.1 shortly describes GPs, their main assumptions and why they can be great mathematical tools to tackle regression problems. Then Section 2.1.2 introduces GPIS, a 3-dimensional surface that is estimated via GP inference on a 3-3-dimensional input domain. Section 2.1.3 illustrates a class of approximate solutions, i.e. sparse GP variants to overcome computational limitations. Finally Section 2.1.4 explains a more elaborate alternative for sparse GPs with a variational formulation.

2.1.1 Preliminaries

This section is subdivided in separate explanations for each of the following elements present in Bayesian statistics: the prior, the posterior and the evidence. At the end some choices of

(16)

kernel functions are further analyzed.

2.1.1.1 The Prior

A Gaussian Process can be considered a generalization of a Gaussian probability distribution. While the latter describes a random variable which is either a scalar or a vector -the former is a stochastic process, i.e. a collection of random variables. More specifically, a GP is (infinitely) many random variables, any finite number of which is jointly Gaussian distributed [9].

For the regression problem at hand, the collection of random variables is the collection of variables fi, f (xi)defined in any finite collection of input locations xi ∈ RD, for i in a finite

set. The variables fi can be seen as instantiations of function values in the corresponding

input locations xi.

A mean function and a covariance function are all the elements needed to fully specify a GP. As defined in [9], the mean function m(x) and the covariance function k(x, x0)of f (x) can be defined as

m(x) = E[f (x)],

k(x, x0) = cov(f (x), f (x0))

= E[(f (x) − m(x))(f (x0) − m(x0))],

(2.2)

where x, x0 ∈ RD_{. k(x, x}0₎_{is a kernel function k : (R}D_{, R}D_{) → R that takes as arguments}

two input locations x and x0 and outputs a real value. This real value defines the covariance between the function values f (x) and f (x0)evaluated on the corresponding input locations. Note this interesting fact: the covariance between outputs is written as a (kernel) function in the input domain. Finally f (x) can be generally described as

f (x) ∼ GP(m(x), k(x, x0)). (2.3)

Equation 2.3 reveals how f (x) can be represented as a GP. It also shows how GPs are suitable for regression: by its very nature, a GP is able to govern the properties of a function with continuous domain [9]. This is done in a "soft" way, by a distribution over functions implied by the choice of m(x) and k(x, x0). This choice - and especially the one for k(x, x0) - fixes the properties of the functions considered for inference, such as smoothness and sta-tionarity. The choice of the kernel function is thus a crucial decision when using GPs. It can be selected to have a parametric form, which depends on some parameters θ, the so-called hyperparameters. Some alternatives of kernel functions are presented later.

(17)

2.1.1.2 The Posterior

There is valuable information in S - the training set - about the unknown function f . The n la-tent function values corresponding to the training set are in f = [f (x1), ..., f (xn)]T. Similarly

the n∗latent function values to be predicted from test points are in f∗ = [f (x∗₁), ..., f (x∗_n∗)]T. According to the prior, the joint distribution of f and f∗is

f f∗ ∼ N m(X) m(X∗) , K(X, X) K(X, X ∗₎ K(X∗, X) K(X∗, X∗) , (2.4) where m(X) = [m(x1), ..., m(xn)]T ∈ Rnand m(X∗) = [m(x∗1), ..., m(x∗n∗)]T ∈ Rn ∗ are the concatenations of the prior means evaluated on the training and test input locations, respec-tively. Additionally K(X, X) ∈ Rn,n stands for the covariance matrix between the training outputs, while K(X∗, X∗) ∈ Rn∗,n∗ for the covariance matrix between the test outputs. Fi-nally K(X, X∗) ∈ Rn,n∗ is the cross-covariance matrix between each pair of training and test points, and K(X∗, X) = K(X, X∗)T ∈ Rn∗_,n

. Each element of these K(·, ·) matrices is calculated via the previously defined kernel function.

Instead of f , the available training output variables are y, which are corrupted by noise. Therefore it is useful to calculate the joint Gaussian prior distribution between y and f∗, which is y f∗ ∼ N m(X) m(X∗) ,K(X, X) + σ 2 nI K(X, X∗) K(X∗, X) K(X∗, X∗) , (2.5)

where σ2nis the variance of the Gaussian noise.

This joint Gaussian prior distribution can be now conditioned on the observations, in order to derive the joint posterior distribution of f∗, which can be written as

p(f∗|y, X, X∗) ∼ N (µ∗, Σ∗),where

µ∗ , E[f∗|y, X, X∗] = m(X∗) + K(X∗, X)[K(X, X) + σ_n2I]−1(y − m(X)), Σ∗ = K(X∗, X∗) − K(X∗, X)[K(X, X) + σ_n2I]−1K(X, X∗).

(2.6)

Equation 2.6 is the core predictive equation in GP regression. The mean of the posterior distribution µ∗can be seen as the prediction of f∗ itself, i.e. the estimated function values in every input test location. The variances - the diagonal elements of Σ∗- in turn indicate how certain the prediction is in every input test location.

(18)

2.1.1.3 The Evidence

A valuable element in Bayesian statistics worth commenting at this point is the marginal likelihood p(y|X). It represents the probability of the observed data under the model. The marginal likelihood can be calculated as

p(y|X) = Z

p(y|f , X)p(f |X)df . (2.7)

In Equation 2.7, the first element in the integral p(y|f , X) is called the likelihood, i.e. the probability of the observed data, given the underlying mapping f . Induced by Equation 2.1, the likelihood distribution can be expressed as

p(y|f , X) ∼ N (f , σ2_nI). (2.8)

Additionally the second element in the integral in Equation 2.7 p(f |X) is the prior distri-bution over function values. Thus the marginal likelihood in Equation 2.7 combines infor-mation about the possible mappings (from the prior distribution) with inforinfor-mation about the noisy data (from the likelihood) by integrating over all possible latent mappings f . This re-sults in a marginalization over the latent mappings f . This marginalization thus expresses a modeling setting where a whole family of functions is simultaneously considered [21]. Tak-ing the prior from Equation 2.3 and assumTak-ing m(x) = 0, the distribution of the marginal likelihood can then be expressed as

p(y|X) ∼ N (0, K(X, X) + σ_n2I). (2.9) The marginal likelihood often takes the role of the objective function in the modeling sce-nario, because the goal is usually to increase the data fit. Then optimization of the objective function is done with respect to the kernel’s hyperparameters θ and the noise variance σn2,

by following the maximum likelihood procedure [21].

2.1.1.4 Kernel Functions

As mentioned previously, the selection of kernel functions is the key element when solving regression problems through GPs, as they encode assumptions about the functions consid-ered for inference. The kernel hyperparameters represent means to fit the model to the data, by e.g. maximizing the marginal likelihood with respect to them. Note that from now on the terms kernel function and covariance function will be used interchangeably.

Most commonly used kernel functions are stationary. Stationary kernel functions are functions of x − x0, which makes them invariant to translations in the input space. A more constrained definition is given by isotropy: in case a kernel function is a function of kx − x0k it is said to be isotropic. The last definition means that the kernel is invariant to all rigid motions [9]. Below three functions used in this work are introduced.

Squared-exponential covariance function

(19)

also because it models well the smoothness characteristics of various random processes [12]. The SE covariance function can be expressed as

kSE(x, x0|θSE = {σ2, l }) = σ2exp − 1 2l2kx − x 0_k2 , (2.10)

where the operator k·k stands for the Euclidean distance (or the l2_{-norm) of a vector. The}

lengthscale l can be interpreted as an indicator of how close two input locations have to be in order for their function values to influence each other significantly. In other words, this hyperparameter determines how quickly the function should vary (the shorter l is, the more sudden changes there are) [9].

Matérn class covariance functions

The Matérn class (MA) of covariance functions is also isotropic and given by [9]:

kM A(x, x0|θM A= {σ2, l , ν}) = σ22 1−ν Γ(ν) √ 2νkx − x 0_k l !ν Hν √ 2νkx − x 0_k l ! , (2.11) where Γ(·) and Hν(·) stand for the Gamma function and a modified Bessel function of the

second kind of order ν [22], respectively. Its hyperparameters are the variance σ2, the length-scale l and an additional parameter ν.

As ν → ∞, the Matérn covariance converges to the squared exponential covariance. Therefore this class can be considered to be a generalization of the squared exponential co-variance function, where the parameter ν controls how many times the process is differen-tiable, i.e. ν controls the smoothness of the function [9]. Additionally [22] reminds that the Matérn class includes not only the squared-exponential, but a large class of kernel functions, and proves very useful for applications due to this flexibility.

The mathematical expression becomes especially simple when ν is a half-integer. In this case the covariance function is equivalent to the product of an exponential and a polynomial functions. An interesting case for machine learning is ν = 5/2 [9], which is used in this thesis. Thus the Matérn covariance function is given by:

kM A52(x, x0|θM A52= {σ2, l }) = σ2 1 +√5kx − x 0_k l + 5 3 kx − x0k2 l2 ! exp " −√5kx − x 0_k l # . (2.12)

Thin-plate covariance function

More recently the thin-plate (TP) covariance function has become a popular choice for GPIS estimation [10, 3, 12]. It is derived from the thin plate spline in [10] and presents in-teresting properties for shape analysis and reconstruction. It is again an isotropic function and it has R as its only hyperparameter. It is set to R = maxkx − x0k, for x, x0 ∈ RD_{. This}

means that R is the maximum Euclidean distance between input points. The TP covariance function can be expressed as

(20)

Differently from the SE function, this covariance function does not provide design pa-rameters to adapt itself to desired surface properties. In [3], however, it is argued that the TP covariance function may provide better reconstructions than the SE function e.g. for rectan-gular objects, where the flatness of surfaces should be preserved.

The presented theoretical explanation in Section 2.1.1 skips mathematical proofs and is focused on results rather than on derivations that lead to the results. For a more mathe-matically detailed explanation, the interested reader is invited to explore the works [9, 20]. Particularly in [9] there exists a broad content about GPs.

2.1.2 Gaussian Process Implicit Surface

In this work GP regression is employed to estimate the surface of objects given noisy obser-vations in the R3domain. GP regression is applied - as explained in Section 2.1.1 - to estimate a continuous function whose level-set induces the object’s surface. In this context, the work [12] provides a physical interpretation: the level-sets implied by functions in R3 _represent

surfaces of objects with non-zero volume. When the functions are estimated by GPs, these surfaces are then called Gaussian Process Implicit Surfaces (GPISs). In other words, they are representations of uncertain shape estimates by using GP. The work [11] has shown an application of GPIS on robotics domain, by applying GPIS for surface estimation and grasp planning. As stated by the authors, this allows to fuse uncertain information from different sensors in a probabilistic shape estimate.

In order to induce the object’s surface, the model should provide means to distinguish data points that are part of the object from data points that are not. This can be done by speci-fying a mapping f : R3→ R that is - without loss of generality - zero-valued for points on the surface, while positive and negative for points outside and inside the surface, respectively. This can be described as

f (x)       

> 0 if x outside the surface, = 0 if x on the surface, < 0 if x in the surface.

(2.14)

The core idea of GPIS is to let f (x) be distributed according to a GP, so that the estimated surface contour is random and characterized by the underlying process [12].

2.1.3 Sparse Gaussian Process

Probably the most significant limitation of GPs is how its computational demand scales with n(the data size): the associated complexity is O(n3), due to the inversion of a n × n matrix [9]. In terms of computational time and storage, this is prohibitive for large datasets. There-fore sparse methods have been developed in the literature [23, 9, 13, 24, 21]. Most of these methods are associated with a computational cost of O(n(nu)2), for nu n.

Such methods adopt nupairs of auxiliary input-output pairs of variables, which are called inducing variables, denoted as xu

i ∈ RD and fiu ∈ R, for i = 1, ..., nu and nu n. Sparse

algorithms choose the inducing variables in various different ways. The simplest way to choose inducing inputs is to let them be a subset of the training set. Some algorithms, how-ever, allow them to lie anywhere in the input domain [23].

(21)

nu × nu _{matrix instead. Similar to the notation introduced in the beginning of this}

chap-ter, Xu ∈ Rnu_,D

and fu ∈ Rnu

represent a matrix with all inducing inputs and a vector with all inducing outputs, respectively. In this context, the primary goal of the sparse method is to define a good low-rank approximation to the full covariance, as well as to select a good set of inducing inputs [21].

In [21] there exists an insightful unified view of different sparse methods proposed in the literature, which is originally derived from [23]. Firstly consider that the inducing outputs fuare under the prior distribution similarly to f , which makes

p(fu|Xu) ∼ N (0, K(Xu, Xu)). (2.15) Additionally the joint prior distribution of f and fu _is

f fu ∼ N 0 0 , K(X, X) K(X, X u₎ K(Xu, X) K(Xu, Xu) . (2.16)

From this it is possible to conclude that the conditional GP prior of f is

p(f |fu, X, Xu) ∼ N (µ, Σ),where

µ , E[f |fu, X, Xu] = K(X, Xu)K(Xu, Xu)−1fu, Σ = K(X, X) − K(X, Xu)K(Xu, Xu)−1K(Xu, X).

(2.17)

The original probability space can be recovered by means of marginalizing out the induc-ing outputs fu. For simplicity, from now on p(fu|Xu₎_{and p(f |f}u_{, X, X}u₎_{are referred as p(f}u₎

and p(f |fu_{, X), respectively. In the exact case, the marginal likelihood can be expressed as}

p(y|X) = Z

p(y|f )p(f |fu, X)p(fu)dfudf , (2.18) which leads to the exact result found in Equation 2.9.

The work [23] introduces the fundamental approximation which induces sparsity, namely the approximation q(f |fu_{, X)}_{to the true conditional p(f |f}u_{, X). Different sparse methods take}

different assumptions regarding the form of q(f |fu, X), while they maintain the exact p(y|f ) and p(fu). Thus various sparse approaches presented in the literature can be unified under this common view, which takes the approximate distribution q(f |fu_{, X)}_{as the core}

distinc-tion among these approaches [21]. Addidistinc-tionally it is reminded by [23] that, even though the inducing outputs are always marginalized out in the predictive distribution, the choice of inducing inputs does impact the final solution.

Consider the approximation

q(f |fu, X) ∼ N (µ, Σq),where Σq6= Σ. (2.19) In Equation 2.19, µ and Σ are defined in Equation 2.17, while Σq _{represents an}

(22)

In Equation 2.20 different sparse methods can be derived by means of choosing a matrix Σqthat depends on the inversion of a nu× nu_{matrix. One example of such a method is the}

one used in [25]. In [25] the so-called deterministic training conditional (DTC) approximation assumes that Σq _{= 0, which implies that the approximation to the conditional GP prior is}

determinant, i.e. qDT C(f |fu, X) ∼ N (µ, 0). In this particular case, the approximate marginal likelihood is

qDT C(y|X) ∼ N (0, K(X, Xu)K(Xu, Xu)−1K(Xu, X) + σ_n2I). (2.21) It can be concluded from Equation 2.21 that sparse approaches that fit into the framework presented in this section explicitly replace the covariance matrix using a low-rank approxi-mation via assumptions in the approximate distribution q(f |fu_{, X). As noticed in the DTC}

case, this is equivalent to modifying the GP prior.

Interested readers should refer to the works [25, 23, 9] for further reading about sparse solutions for GPs.

2.1.4 Variational Sparse Gaussian Process

This section is intended to describe a formulation for sparse GPs which directly approximates the posterior distribution described in Equation 2.6. The explanation given in this section is mainly derived from the work [13].

Consider a set of function values f∗ which are intended to be predicted. The posterior can be described by the following predictive Gaussian

p(f∗|y, X, X∗) = Z

p(f∗|f , X∗)p(f |y, X)df . (2.22) The first step towards an approximation of the posterior is to augment the model by introducing the inducing pairs of variables fu_{and X}u_{, which allow to update the predictive}

Gaussian to

p(f∗|y, X, X∗) = Z

p(f∗|f , fu, X∗)p(f |fu, y, X)p(fu|y)df dfu. (2.23) Suppose that fu is a sufficient statistic for f , which means that f∗ and f are independent given fu_{, i.e. p(f}∗_{|f , f}u_{) = p(f}∗_|fu_{). It also follows that p(f |f}u_{, y, X) = p(f |f}u_{, X). Since in}

practice it is improbable to find inducing outputs that are sufficient statistics, consider an approximation to the exact posterior, referred as q(f∗|X∗_{). Additionally consider q(f}u_{) ∼}

N (µq(fu)_{, Σ}q(fu)₎_{to be a free variational Gaussian distribution, which is in general different}

than the true posterior counterpart, i.e. q(fu_{) 6= p(f}u_{|y). Thus q(f}∗_|X∗₎_{can be given as}

q(f∗|X∗) = Z p(f∗|fu_{, X}∗_{)p(f |f}u_{, X)q(f}u_{)df df}u = Z p(f∗|fu, X∗)q(fu)dfu = Z q(f∗, fu|X∗)dfu. (2.24)

(23)

q(f∗|X∗) ∼ N (µq(f∗|X∗), Σq(f∗|X∗)),where µq(f∗|X∗) = K(X∗, Xu)K(Xu, Xu)−1µq(fu),

Σq(f∗|X∗) = K(X∗, X∗) − K(X∗, Xu)K(Xu, Xu)−1K(Xu, X∗)+ K(X∗, Xu)K(Xu, Xu)−1Σq(fu)K(Xu, Xu)−1K(Xu, X∗).

(2.25)

Equation 2.25 defines the general form of the sparse posterior GP, with complexity O(n(nu₎2_).

It is however still not clear how the variational distribution q(fu)should be set - i.e. how the variational parameters µq(fu) and Σq(fu) should be determined - and how the inducing in-puts Xu _{should be selected.}

The work [13] explains a principled procedure to determine the variational distribution q(fu) and the inducing inputs Xu: to form the variational distribution q(f |X) and the ex-act posterior p(f |y, X) on the training function values f , and subsequently to minimize a distance between these two distributions. Then [13] further highlights that this procedure is equivalent to minimize a distance between the joint exact posterior p(f , fu|y, X) and the corresponding joint variational approximation q(f , fu_|X).

Notice from Equation 2.24 that

q(f , fu|X) = p(f |fu, X)q(fu). (2.26) The inducing inputs Xu_{can also be considered free variational parameters in this}

formu-lation. Notice that the joint exact posterior p(f , fu|y, X) is associated to the following joint model

p(y, f , fu|X) = p(y|f )p(f |fu, X)p(fu). (2.27) By marginalizing out fu, one gets the equivalent joint model

p(y, f |X) = p(f |y, X)p(y|X). (2.28)

Then notice that the conditional prior p(f |fu, X), as in Equation 2.17, and the marginal prior p(fu), as in Equation 2.15, are dependent on the choice of the inducing inputs Xu. The posterior p(f |y, X) and the marginal likelihood p(y|X), however, do not depend on the choice of the inducing inputs Xu. This leads to the conclusion that the joint posterior model has a set of free variational parameters Xu.

The next natural step is then to find an adequate distance function that measures the closeness of two probability distributions. The Kullback-Leibler (KL) divergence is used for this purpose. This is the objective function to be minimized. The KL divergence from an ar-bitrary probability distribution A(·) with density a(·) to an arar-bitrary probability distribution B(·)with density b(·) can be expressed as KL(BkA) and its formulation follows

KL(BkA) = EB " log b(·) a(·) # = Z +∞ −∞ b(·)log b(·) a(·)d·, (2.29)

where EBrepresents the expectation with respect to the distribution B.

(24)

the amount of information lost when the prior is used to approximate the posterior. The KL divergence in the context of variational inference is however the reverse KL divergence, which is KL(AkB) = EA " log a(·) b(·) # = Z +∞ −∞ a(·)loga(·) b(·)d·. (2.30)

In the context of variational inference A represents the variational distribution q(f , fu|X) and B, the exact posterior p(f , fu_{|X, y). Even though the reverse KL divergence is less}

in-tuitive, it enables to take expectations with respect to the variational distribution, instead of the true posterior.

The objective function to be minimized can then be stated as the KL divergence from the exact joint posterior over the latent function values to the variational approximation of it:

min KL(q(f , fu|X)kp(f , fu|X, y)). (2.31) Additionally the minimization in Equation 2.31 is equivalent to the maximization of a variational lower bound of the exact log marginal likelihood p(y|X). This is the core dif-ference from what is presented in Section 2.1.3. A lower bound L can be derived, which corresponds to the approximate log marginal likelihood log q(y|X) proposed by the varia-tional inference. The derivation can be done in the following way: from the expression of the exact marginal likelihood in Equation 2.18, apply the logarithm, multiply with p(f |f_{p(f |f}uu,X)q(f_,X)q(fuu)₎ inside the integral and finally apply Jensen’s inequality. This yields the lower bound L, i.e.

L = Z p(f |fu, X)q(fu)logp(y|f )p(f |f u_{, X)p(f}u₎ p(f |fu_{, X)q(f}u₎ df u_df = Z p(f |fu, X)q(fu)logp(y|f )p(f u₎ q(fu₎ df u_{df .} (2.32)

Thus it can be stated that

log p(y|X) ≥ L. (2.33)

Then the variational parameters - i.e. the inducing inputs Xu_{and the parameters µ}q(fu)

and Σq(fu₎

of the variational distribution q(fu₎_{- can be determined by means of optimizing}

L with respect to them.

At this point it is possible to notice some key differences between the original sparse formulation (exposed in Section 2.1.3) and the variational sparse formulation [21]:

• The variational formulation makes an explicit approximation with respect to the exact GP posterior, which represents a rigorous way to approximate a model, by defining a clear distance minimization procedure. Differently the original approach makes an approximation by changing the GP prior, usually followed by some optimization with respect to the marginal likelihood of the modified model. However, as stated by [13], the latter is not an optimal way to approximate the exact GP, because it doesn’t mini-mize any distance between the exact GP and its approximate counterpart;

(25)

• Finally the variational formulation uses in general a free q(fu_{), while the originally}

proposed sparse formulation does not adopt such an approximation.

In [24] the lower bound in Equation 2.32 served as objective function to find values to the variational parameters - namely the inducing inputs and the parameters of the Gaussian distribution of q(fu)-, as well as to find values to the kernel hyperparameters and the noise variance from the observations. The authors describe a Stochastic Variational Inference (SVI) procedure through which the parameters of the distribution q(fu₎_{are optimized in the}

natu-ral gradient space. The computational complexity associated to this variational formulation is O((nu)3). Thus an advantage of the SVI procedure is that it does not scale with the number of data points.

However the current work does not focus on the solution proposed by [24], which re-quires optimizing a large number of parameters - especially due to the Gaussian distribution q(fu)- and involves a somewhat complex optimization procedure, characterized i.a. by the adjustments of the stochastic procedure’s step length. The current work instead focuses on the solution proposed by [13].

In [13] the bound L in Equation 2.32 is maximized by analytically solving for the optimal choice of the variational distribution q(fu). The author thus eliminates q(fu) by differenti-ating Equation 2.32 with respect to q(fu) and setting its parameters to the optimal values, which are q(fu) ∼ N (µq(fu), Σq(fu)),where µq(fu)= 1 σ2 n K(Xu, Xu) K(Xu, Xu) + 1 σ2 n K(Xu, X)K(X, Xu) −1 K(Xu, X)y, Σq(fu)= K(Xu, Xu) K(Xu, Xu) + 1 σ2 n K(Xu, X)K(X, Xu) −1 K(Xu, Xu). (2.34)

Therefore the variational lower bound can be updated by setting these optimal parame-ters. This guides to the solution proposed by [13]:

Lf inal=log qf inal(y|X) =log qDT C(y|X) − 1 2σ2

n

tr (Σ), (2.35)

where Σ is explained in Equation 2.17. Lf inal _{is a variational lower bound to the exact log}

marginal likelihood, for any selection of inducing inputs Xu. Furthermore Lf inal is tighter in comparison to the lower bound proposed in [24], which means that the solution proposed by [13] results in an approximation that is closer to the exact log marginal likelihood.

The curious conclusion from Equation 2.35 is that the variational lower bound Lf inal differentiates with the sparse DTC approximation by a trace term only. This trace term plays an important role and is tightly related to the advantages of treating sparse GPs through a variational formulation. It acts as a regularizer and consequently avoids overfitting, because • it corresponds to the squared error of predicting f from fu _{and therefore facilitates to}

find an appropriate set of inducing inputs Xu_{[13]; and}

• it regularizes the kernel hyperparameters and the variance σ2

n, associated to the model

(26)

By optimizing the variational lower bound in Equation 2.35 with respect to the inducing inputs, one can find the optimal inducing inputs Xu. Note that the selection of Xu deter-mines the flexibility of the variational distribution q(f , fu|X) [13]. Added to the fact that the optimal parameters of the variational distribution q(fu₎_{are given in Equation 2.34, this}

concludes the explanation about the variational formulation for sparse GPs. The approxima-tion q(f∗|X∗₎_{in Equation 2.25 can then be used with the fitted parameters as a proxy for the}

posterior, e.g. to form predictions about unseen data.

For a deeper understanding about variational sparse GPs, as well as for more detailed mathematical explanation, the works [13, 24, 21] are highly recommended.

2.2 Shape Learning

In this thesis GP represents the fundamental model to tackle the problem of probabilistic ob-ject shape estimation from raw data with artifacts. In this section we address shape learning approaches from a broader perspective instead. Previously published works addressing 3D object models within robotic perception are commented. Furthermore the role of shape fea-tures as descriptors of object shape is addressed, as well as object clustering based on such features.

2.2.1 Preliminaries

Robots commonly need information (e.g. shape, position, orientation, ...) about objects in the surroundings in order to accomplish simple tasks. Point cloud data obtained from various sensors - such as Kinect camera, haptic and laser sensors - has been a valuable input to generate information about objects. The collected data provides a potentially rich ground for further insights regarding e.g. object shape. Object shape information in turn represents a common input for robots to take decisions regarding object exploration, object classification or recognition, as well as object grasping and collision-free motion.

There are several challenges related to representing object shape. Visual data may present noise and incompleteness, e.g. objects might be seen from only one view direction or might be occluded. Different ways of determining object shape have been proposed. The authors of [6] base their method on the assumption that most objects applied in service robotic se-tups have symmetries. By starting off with partial visual data, they estimate hidden parts of the object by calculating the best symmetry plane and reflecting the given points accord-ingly. However this symmetry assumption may not always hold. Additionally in some applications it may be more accurate to extract shape information by modeling the complete object’s surface from observations. In this context [26] performs complete object’s surface construction. The authors let a robot grasp an object and move it in front of a depth camera, to explore the object from various view directions. These are moves governed by an algo-rithm that calculates the best next view, which takes in consideration both the view quality (prioritizing surface regions that maximize the information gain) and the actuation cost.

2.2.2 Probabilistic Surface Learning

(27)

to help identify points of most uncertainty and hence prioritize areas for tactile exploration. This work builds a first surface hypothesis via visual data and later improves the surface esti-mation by gradually adding tactile inforesti-mation to the system. This means that tactile data is used to refine object models initially learned from visual points provided by a stereo system. Similarly to [26], the work [17] performs action selection for tactile exploration by consid-ering the trade-off between the uncertainty in the surface estimation and the corresponding travel cost. GPIS modeling is applied in this case to estimate the shape of the bottom part of an object. Additionally [27] also performs object modeling with guided haptic exploration, both via GPIS and Local Implicit Shapes.

Besides GPIS, Gaussian Random Fields (GRF) may be used to represent surfaces. GRF consist of a GP regression applied on a two-dimensional domain. [5] presents a way to build a 3D compact representation of a robot’s work space area, which especially provides a shape representation of incomplete areas and occluded objects/regions. They rely on both GPIS and GRF. The authors firstly train GRF and GPIS models from point cloud data provided by a stereo camera. They further exploit GRF and Delaunay triangulation to detect highly uncertain areas and target points to be explored by touch. These points in turn are added to the original visual data to continuously improve the GPIS estimations. This approach results in a reduced amount of interactions and reduced computational time. Additionally [18] does not build implicit surface models, but explicit representations using only GRF instead. They constrain the reconstructed object area to a predefined rectangular zone and guide the exploration via a next-best-touch strategy that prioritizes areas of maximum uncertainty, as dictated by the posterior variance of the GRF model. While being a tool to construct pieces of object’s surfaces or terrains, GRF don’t enable the complete construction of closed objects, which GPIS does.

There exists many advantages in using GPs to model object shapes. GPs allow to combine knowledge about object format from a priori belief to knowledge provided by the collected data [9]. A GP regression framework - thanks to its principled probabilistic nature - is also able to combine uncertainties from different levels - e.g. from a priori beliefs, from input data points [28], as well as from function values. This is a beneficial property, since it gathers multiple sources of knowledge in a soft (probabilistic) way and it allows the practitioner to know the input locations with most predictive uncertainty.

The estimation of implicit surfaces via GPs is significantly influenced by the choice of the covariance (or kernel). In this context [29] shows a novel way to represent 3D surfaces of objects, by presenting a new approach to regularize multi-scale compactly supported basis functions, which corresponds to a GP with modified covariance function. The thin-plate co-variance function has been a commonly used coco-variance to encode shape information [3, 10]. However other approaches have been suggested, such as the introduction of shape primi-tives to define geometric object priors. This idea suggested in [12] leads to positive recon-struction results, even on surface regions that are not observed. Particularly a non-stationary kernel for reconstruction of 3D surfaces is proposed in [30]. Non-stationary kernels may provide benefits by letting the model adapt to functions whose smoothness varies with the input data, while they commonly have more elaborate mathematical formulations. Finally [31] proposes radial basis functions, which may provide interesting tools for 3D surface es-timation. Differently from the aforementioned approaches, this thesis presents a study on how different stationary kernel functions affect object’s surface representation.

(28)

kernel functions to model multi-scale surface geometries. Their model involves data fusion from multiple sensors, as well as adaptive sampling. The core idea behind their method is that, in a GP modeling scenario, such complex surfaces require covariance functions that are able to represent different topological features at different scales. Therefore they propose the use of combinations of kernel functions, namely the squared exponential function, the Matérn function, as well as a periodic function. This approach allows them to represent complex surfaces considerably well.

Kernels are usually parametrized by the so-called hyperparameters [9]. These hyperpa-rameters are typically elements of high importance in the predictive framework, since they - together with the kernel function choice itself - determine the prior knowledge encoded in the kernel. Finding out suitable hyperparameters is not a trivial task. Different methods have been proposed, with the simplest one being maximization of the marginal likelihood [9]. Alternative methods have been proposed as well, such as variational approximations of GPs. As mentioned in Section 2.1.4, the work [13] presents such an approach, in which instead of maximizing the exact marginal likelihood, it relies on maximizing a variational lower bound to the exact marginal likelihood. This thesis studies optimization of GP param-eters via variational approximation.

2.2.3 Shape Descriptors

Feature extraction is an important step for machine learning applications. By extracting fea-tures from 3D data it is possible to describe objects regarding their shape characteristics. A shape descriptor attempts to quantify shape in ways that agree with human intuition or task-specific requirements [32].

Simple geometric measures can be used to describe shapes, such as center of gravity, eccentricity, circularity ratio and etc. These simple features can however only discriminate shapes with large differences and are therefore not suitable to be standalone shape descrip-tors. In this context, shape descriptors should meet some requirements, as mentioned by [32]:

• The descriptor should be as complete as possible to represent the object; • The descriptor should be described and stored compactly;

• The computation of distance between descriptors should be simple.

3D shape descriptors may be extracted from a mesh, such as a triangular mesh, i.e. a representation of object’s surface. A triangular mesh consists of a particular type of polygon mesh, which comprises of a set of triangles connected by their common edges or corners. Even though many representations have been proposed for 3D models, polygon and trian-gular meshes are commonly used in most domains. In addition to being extremely flexible and expressive, these meshes have been directly supported by accelerated graphics hard-ware for several years, which contributed to their diffusion and establishment [33]. In this thesis triangular meshes are used as surface representations.

(29)

Figure 2.1: A surface S with a curve C, a point p and a unit-length normal vector n (adapted from [37]).

features are addressed in [35]. The authors highlight that principal curvatures are intrin-sic surface properties, i.e. they are not affected by the choice of the coordinate system, the position of the viewer relative to the surface or the chosen parameterization of the surface. Principal curvatures are further discussed below. For alternative 3D shape descriptors, the work [36] is recommended.

2.2.3.1 Principal Curvatures

Curvatures are local measurements of the bending of regular surfaces [35]. They can be uti-lized as intrinsic shape properties to describe surfaces. Following the explanation in [37], firstly some important definitions on differential geometry are given, namely normal curva-tures, principal directions and finally principal curvatures.

Consider a regular surface function r(u, v), i.e. an object’s surface defined in the 3D space and parameterized by u and v. Consider S to be defined by r(u, v) and p to be a point lying on S. The orientation of S at p is specified by the unit-length normal vector n, where [37]

n(u, v) =

∂r(u,v) ∂u ×

∂r(u,v) ∂v

k∂r(u,v)_∂u ×∂r(u,v)_∂v k. (2.36) Further consider a curve C ⊂ S, which is parameterized by α(s) = r(u(s), v(s)), where s represents the arc length of C, and α(0) = p. Figure 2.1 shows C, as well as S, p and n. The normal curvature c of C at p is then defined as

c =< d

2_α(0)

ds2 , n >, (2.37)

where < ·, · > stands for the internal product between two vectors.

(30)

minimum values. These two directions are called principal directions. The corresponding maximum and minimum normal curvatures are in turn called principal curvatures [35].

There exists different methods for locally estimating differential properties of a triangular mesh. Within the framework of these criteria [35] proposes a scheme for computing the principal curvatures at the vertices of a triangular mesh, to create local descriptors of surface shape. Later on [37] presents an algorithm based on the method from [35], but improving the way normal vectors are estimated from a triangular mesh. The authors remind that approximations to the surface normal and the surface curvatures are often necessary, since the surface is commonly defined by a triangular mesh, i.e. a set of discrete points.

Given a surface represented by a triangular mesh, the calculation of principal curvatures outputs a feature matrix containing two measurements per mesh vertex. In this thesis princi-pal curvatures calculation follows essentially the method from [35], but adapts the estimation of normal vectors as proposed in [37]. The interested reader can refer to [35, 37] for the actual algorithmic details concerning the computations of principal curvatures.

2.2.4 Object Clustering

The task of finding representative clusters has been the focus of extensive research in ma-chine learning and pattern recognition [39]. In virtually every scientific field dealing with empirical data, practitioners attempt to get a first impression on the data by trying to identify groups of “similar behavior” in the data [40]. There are in fact numerous ways to perform classification or clustering from 3D point clouds. For instance in [14] neural networks are proposed to i.a. tackle object recognition using a convolutional deep belief network. As ad-dressed in Section 2.2.3, 3D shape descriptors capture objects characteristics that support to distinguish different object classes. In [41, 42], similarity between different objects is quan-tified by comparing so-called object shape distributions. Their approach outputs a shape distribution from a shape function that measures global geometric properties of the object.

Moreover object identification can provide further insights about the environment sur-rounding a robot. In [4] objects are identified through bimanual tactile exploration. This method applies GPs to perform data smoothing: originally noisy and non-uniform point clouds are uniformly sampled, by means of prioritizing points that diminish the uncertainty the most. The Iterative Closest Point (ICP) algorithm [43] is used in this case for compari-son of point clouds and subsequent object identification. The work [44], in its turn, focuses on acquiring meaningful haptic knowledge about an object during brief and functional ac-tion. They propose a hybrid approach composed by a random forest classifier, for object identification, and a parametric object property estimator. This hybrid approach implies col-laboration between both components, which increases overall identification accuracy and leads to object pose estimation as well.

Differently from the aforementioned approaches and similarly to [3], this thesis addresses object clustering by first formulating a distance measure between features extracted from triangular meshes for different objects. The work [45] offers a framework to measure such a distance. The authors test whether two distributions are different on the basis of samples drawn from each of them. They propose a test statistic consisting of the difference between the mean function values on the two samples. Then in case this statictic is large, the samples are likely from different distributions, and vice-versa.

(31)

for instance a clustering method that attempts to distinguish objects based on a similarity measure. With this in mind, two objects can be said to be similar if the distance between them is small. Spectral clustering is applied in this thesis and is further discussed below.

2.2.4.1 Spectral Clustering

Spectral clustering is a particular clustering algorithm that relies on the top eigenvectors of a matrix derived from the distance between objects [39]. The work [40] highlights that spectral clustering often outperforms traditional approaches, such as k-means, besides being very simple to implement and being solved efficiently by standard linear algebra methods.

Consider a square and symmetric affinity matrix W , containing the similarity measures between every pair of objects. Additionally consider t to represent the total number of clus-ters. The steps in the spectral clustering algorithm proposed by [39] are described below:

1. Consider Z to be a diagonal matrix whose diagonal values correspond to the sum of the respective row values in W . Then calculate the normalized Laplacian matrix 1 L = Z−12W Z−

1 2;

2. Find the t eigenvectors corresponding to the t largest eigenvalues of L and form the matrix V by stacking these eigenvectors in columns;

3. Normalize each of V ’s rows to have unit length;

4. Consider each row of the normalized V as a point in Rt, i.e. a lower-dimensional rep-resentation for each object. Then cluster all points into t clusters via k-means algorithm (it could in fact be any other algorithm that aims at minimizing some distortion); 5. Finally assign each object to the cluster assigned to the corresponding lower-dimensional

representation in V .

Note that this algorithm essentially changes the representation of the original data points to points in Rt_{. Due to the properties of the graph Laplacian, this change of representation}

enhances the clustering properties in the data, so that objects can be discriminated in the new representation [40]. In practice spectral clustering can be very useful when the individual clusters do not form convex regions or are not clearly separated [39]. For further information on spectral clustering algorithms, the interested reader is invited to refer to [39, 40, 46].

(32)

Method

This chapter describes the undertaken investigation of probabilistic representations for mod-eling and understanding object shape. Firstly we present two different data sets, followed by the description of the system outline, i.e. the approach we use to tackle shape information extraction from raw data. Further the experiments are described and the different evaluation tools are introduced.

3.1 Data Sets

Our study employs data from different sources, as a means to collect various insights about 3D shape modeling. With this in mind, two essential data types are evaluated:

• A synthetic public data set, namely Princeton ModelNet data set [14], with object point clouds from different categories; and

• A real sensory data set, introduced in [3], with object point clouds collected from two sensors and belonging to three different categories.

3.1.1 Princeton ModelNet Synthetic Data Set

The Princeton ModelNet data set [14] contains a clean collection of 3D CAD models for objects in various categories, which are publicly available. The corresponding labels were obtained by Princeton researchers via Amazon Mechanical Turk service and are provided freely.

In this thesis objects from four categories are used in the experiments: apples, bottles, pots and ducks. The raw data for each object is a point cloud, i.e. a collection of positions in the 3D space. Table 3.11 displays, for each object, the corresponding name, CAD model and number of data points. The selection of objects is based on the completeness of the point clouds, i.e. objects with complete clouds and few "holes" are chosen. The object names are used throughout the rest of this thesis as a means to refer to each object. As one can see in Table 3.1, the number of points in the point clouds varies considerably: the most scarce cloud has 832 points, while the densest has 6879 points. The figures of the CAD models in Table 3.1 are obtained from [47].

1_{In the Princeton ModelNet data set, object pot1 belongs to category coffee pot, while objects pot2 and pot3} belong to category tea pot.

(33)

Name Object # of points Name Object # of points apple1 1084 pot1 2659 apple2 832 pot2 6879 apple3 3575 pot3 1843 bottle1 2287 duck1 2112 bottle2 6576 duck2 2329 bottle3 5228 duck3 4846

(34)

Figure 3.1: The experimental robot platform: Kuka arm with a Schunk hand is located on the left, while Kinect camera is on the right (source: [3]).

3.1.2 Real Sensory Data Set

The real sensory data utilized in the experiments is the same as in [3]. The data was collected using a setup composed of a fixed Kinect stereo vision camera and an industrial Kuka arm (6 degrees of freedom) with a Schunk hand (7 degrees of freedom), which was equipped with three fingers. The fingers had pressure sensitive tactile pads and the object to be explored was placed on a table, as in Figure 3.1. Figure 3.2 shows example tactile readings from the Schunk hand. During exploration the fingers were closed until contacts were sensed. More detailed information regarding the experimental robot platform can be found in [3].

The data itself consists of a collection of 3D point coordinates, i.e. a point cloud. Data collection was performed through a perceptually-guided procedure, which is explained be-low. For each object, firstly a point cloud was recorded from the visual sensor. Then the robotic hand was guided to touch the objects at different locations to gather tactile observa-tions. The goal was to progressively refine the surface reconstructions by guiding the hand towards surface points for which the predictive variance was large, in order to explore the most uncertain surface regions and decrease uncertainty. This procedure considered an ac-tion space defined by 6 different heights and 9 different approaching angles. Through this procedure at most 54 tactile readings2 _{were recorded, complementing the original visual}

data. Data from both sensors were defined in the same frame, thanks to the calibration per-formed on the arm-hand configuration with respect to the camera system, with a precision of few millimeters [3].

Table 3.2 displays the objects, which belong to 3 different basic shape categories, namely boxes, cylinders and spray bottles. For each object, the name, the object picture, the number of sensory data points (from visual and tactile sensors) and the number of ground-truth data points (from scans with high precision) are given. The object names are used throughout the rest of this thesis as a means to refer to the objects. As one can see in Table 3.2, the number of points in the sensory point clouds varies considerably: taking into consideration all visual and tactile readings, the most scarce cloud has 3594 points, while the densest has 14572 points. Additionally the ground-truth point clouds are denser than the sensory point clouds.

(35)

# of sensory points # of ground-truth Name Object Visual Tactile points

box1 6979 1714 122413 box2 5450 445 106497 box3 12620 1952 209962 cyl1 5029 1580 103662 cyl2 4528 1460 101707 cyl3 2765 829 60923 cyl4 5071 1975 114010 spray1 4252 1214 106690 spray2 4084 1508 104193 spray3 2937 1166 71734

(36)

Figure 3.2: The Schunk hand, with the thumb opposing the other two fingers. There are two tactile pads in each finger with 14 × 6 and 13 × 6 cells. The red regions correspond to tactile readings from the sensor cells where hypothetical contacts are sensed.

Therefore, differently from the Princeton ModelNet data set, this set contains considerably noisy data.

3.2 System Outline

The system outline describes the approach we use to tackle extraction of shape information from raw data. Figure 3.3 shows this approach in detail. It can be seen that the two main cornerstones in the system outline are: surface reconstruction - through GPIS based on sparse GP with variational formulation - and object clustering.

The first cornerstone in the system outline - surface reconstruction - contains a number of steps. Firstly the raw data (possibly incomplete and noisy) is pre-processed. This may involve data centralization, normalization and addition of topological information. A kernel function needs to be selected, in order to compose the prior distribution, i.e. a GP repre-senting our prior beliefs or preferences regarding the modeling problem. Further the GP parameters are optimized (learned) to take adequate values given the available data. This procedure involves running the L-BFGS-B [48, 49] algorithm, which is further discussed in this section. The next step consists of deriving the sparse GP posterior distribution, following the calculations in Equation 2.25. Finally the surface is reconstructed by defining a triangular mesh corresponding to the level-set of the sparse GP posterior mean. The Marching Cubes algorithm [50] is employed to derive the triangular mesh.