DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING,
SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2020

Integration of a visual perception pipeline for object manipulation

XIYU SHI

KTH ROYAL INSTITUTE OF TECHNOLOGY

SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE


Integration of a visual perception pipeline for object manipulation

XIYU SHI

Master’s Programme, Embedded Systems, 120 credits
Date: November 21, 2020

Supervisor: Ioanna Mitsioni
Examiner: Danica Kragic Jensfelt

School of Electrical Engineering and Computer Science
Swedish title: Integration av en visuell perceptionssystem för objektmanipulering


Integration of a visual perception pipeline for object manipulation / Integration av en visuell perceptionssystem för objektmanipulering

© 2020 Xiyu Shi


Abstract

The integration of robotic modules is common both in industry and in academia, especially when it comes to robotic grasping and object tracking.

However, there are usually two challenges in the integration process. Firstly, the respective fields are extensive, making it challenging to select a method in each field for integration according to specific needs. Secondly, because the integrated system is rarely discussed in academia, there is no set of metrics to evaluate it. To address the first challenge, this thesis reviews and categorizes popular methods in the fields of robotic grasping and object tracking, summarizing their advantages and disadvantages. This categorization provides the basis for selecting methods according to the specific needs of application scenarios. For evaluation, two well-established methods for grasp pose detection and object tracking are integrated for a common application scenario. Furthermore, the technical as well as the task-related challenges of the integration process are discussed. Finally, in response to the second challenge, a set of metrics is proposed to evaluate the integrated system.

Keywords

System integration, Robotic grasping, Object tracking, System evaluation, Robot Operating System


Sammanfattning

Integration av robotmoduler är vanligt förekommande både inom industrin och i den akademiska världen, särskilt när det gäller robotgrepp och objektspårning.

Det finns dock vanligtvis två utmaningar i integrationsprocessen. För det första är både respektive fält omfattande, vilket gör det svårt att välja en metod inom varje fält för integration enligt relevanta behov. För det andra, eftersom det integrerade systemet sällan diskuteras i den akademiska världen, finns det inga etablerade mätvärden för att utvärdera det. För att fokusera på den första utmaningen, granskar och kategoriserar denna avhandling populära metoder inom robotgreppning och spårning av objekt, samt sammanfattar deras fördelar och nackdelar. Denna kategorisering utgör grunden för att välja metoder enligt de specifika behoven i olika applikationsscenarion. Som utvärdering, integreras samt jämförs två väletablerade metoder för posdetektion och objektspårning för ett vanligt applikationsscenario. Vidare diskuteras de tekniska och uppgiftsrelaterade utmaningarna i integrationsprocessen. Slutligen, som svar på den andra utmaningen, föreslås en uppsättning mätvärden för att utvärdera det integrerade systemet.

Nyckelord

Systemintegration, gripande av robotar, objektspårning, systemutvärdering, robotoperativsystem


Acknowledgments

Firstly, I appreciate all the support provided by the Division of Robotics, Perception, and Learning (RPL) at the School of Electrical Engineering and Computer Science at KTH. The process of writing this thesis was challenging. I would like to thank Danica Kragic for examining my thesis, and my supervisor Ioanna Mitsioni for providing me with constant guidance, from the selection of the thesis topic and the project implementation to the revision of the thesis.

I am also grateful to my parents and all my friends for their moral support. Finally, special thanks go to my friend Lynn, who kept me company over online video calls during the special period of COVID-19.

Stockholm, November 2020 Xiyu Shi


Contents

1 Introduction
1.1 Background
1.2 Research question
1.3 Purpose
1.4 Scope and limitations
1.5 Structure of the thesis

2 Background
2.1 Robotic grasping
2.2 Object detection and tracking
2.2.1 Two Dimensional (2D) scenario
2.2.2 Three Dimensional (3D) scenario
2.3 Summary

3 Methodology and implementation
3.1 Hardware and operating environment
3.1.1 Kinect v2
3.1.2 GeForce GTX 980M
3.1.3 Operating environment
3.2 Implementation Process
3.2.1 Parameter setting
3.2.2 Reference frame relationship
3.2.3 Point cloud publishing
3.3 Object modelling

4 Experiments
4.1 Metrics
4.2 Experimental design

5 Results and Analysis
5.1 GPU utilization and memory usage
5.1.1 Number of objects
5.1.2 Types of objects
5.2 Robustness and run-time results
5.2.1 Number of objects
5.2.2 Types of objects
5.3 Summary

6 Conclusions and Future work
6.1 Conclusions
6.2 Limitations
6.3 Future work
6.4 Sustainability and Ethical Considerations
6.4.1 Sustainability reflection
6.4.2 Ethical Consideration

References

A Operating environment
B CMakeLists.txt
C camera_kinect2.launch
D Instructions for recording GPU information


List of Figures

2.1 Robotic grasping [1]
2.2 Object tracking
3.1 Overall diagram of the system
3.2 Differences between Kinect v1 and v2 [2]
3.3 Hardware composition of Kinect v2 [3]
3.4 GeForce GTX 980M GPU information
3.5 System composition diagram
3.6 Run time result in paper [4]
3.7 Reference frame relationship: horizontal position (left); working position (right)
3.8 Euler angles
3.9 tf tree of the system
3.10 Role of node pcl_pub
3.11 Integrated system: grasp pose generated (above) and reference frames in tracking process (below)
4.1 Flow chart of the experiment
4.2 Scenarios: 1-4 objects
4.3 Scenarios: Object 1-4
4.4 Failed grasp pose (left). Successful grasp pose (right)
5.1 GPU utilization and memory usage in a one object scenario
5.2 GPU utilization and memory usage in a two objects scenario
5.3 GPU utilization and memory usage in a three objects scenario
5.4 GPU utilization and memory usage in a four objects scenario
5.5 GPU utilization and memory usage for Object One
5.6 GPU utilization and memory usage for Object Two
5.7 GPU utilization and memory usage for Object Three
5.8 GPU utilization and memory usage for Object Four
5.9 Time recording for Scenario Set One
5.10 Object Two and Object Four


List of Tables

2.1 Advantages and disadvantages of different methods
3.1 Introduction of Grasp Pose Detection (GPD) and Simtrack
5.1 GPU utilization and memory usage for Scenario Set One
5.2 Run-time and success rate of Scenario Set One
5.3 Run-time and success rate of Scenario Set Two
D.1 The data memory.used will be divided by the total installed GPU memory (8126 MiB) to get the percentage of memory usage


List of acronyms and abbreviations

2D Two Dimensional
3D Three Dimensional

DPM Deformable Part Model

GPD Grasp Pose Detection

ROS Robot Operating System

SIFT Scale Invariant Feature Transform


Chapter 1 Introduction

1.1 Background

Since the birth of industrial robots in the late 1950s, robot technology has achieved rapid development, and its application has been integrated into many aspects of our lives. In order to expand the application field of robots and improve their flexibility, autonomy and intelligence, more and more robots need visual information.

With the widespread application of vision in robots, the problems of visual tracking and vision-based grasping have become a focus. According to a survey [5], the most commonly used robot in industrial production is the grasping robot, and visual information plays a vital role before and after the grasp itself. Object tracking in this context is analogous to how humans track objects with their eyes: it provides the visual perception for a robot, and the resulting information usually includes the position and the orientation of an object.

Robotic grasping and object tracking are two essential research fields in robotics whose academic achievements have been widely used in industrial applications. In order to perform a grasping task, the robot needs to detect the position of the object and then execute the appropriate motion. According to G. Du et al. [1], this process usually requires four steps: object localization, object pose estimation, grasp detection and motion planning. Object tracking in the context of robotics is similar to the action of tracking objects through human eyes, so it is also referred to as visual tracking. With tracking, it is possible to continuously infer the object's pose after a grasp in order to further manipulate it. Both fields are expanding rapidly, with new methods being proposed every day, and some of them provide open-source code, which makes it possible for engineers to reproduce the methods quickly.

In research, robotic grasping and object tracking are usually treated separately. In practice, however, the two actions are executed sequentially on the same platform, and thus system integration is needed.

System integration for robotic applications usually relies on the Robot Operating System (ROS). ROS is an open-source system originally developed by the Stanford Artificial Intelligence Laboratory, specifically designed for the multi-node and multi-task scenarios of robotic applications. ROS adopts a distributed framework: through its modular design, the processes of the robot can run separately, which makes it convenient to isolate potential problems in their respective components without affecting the whole system. At the same time, it provides generic software tool-kits and libraries, which significantly improves code re-usability in robotics.
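As a minimal illustration of this modularity (not part of the thesis itself), the sketch below shows a self-contained ROS node written in C++ with roscpp; the node and topic names are chosen for illustration only. Each such node can be started, stopped or replaced independently while the rest of the system keeps running.

#include <ros/ros.h>
#include <std_msgs/String.h>

// Minimal sketch of a single ROS node: it does one job (publishing a status
// message) and communicates with the rest of the system only via topics.
int main(int argc, char** argv) {
  ros::init(argc, argv, "status_node");   // node name is illustrative
  ros::NodeHandle nh;
  ros::Publisher pub = nh.advertise<std_msgs::String>("status", 10);

  ros::Rate rate(1.0);                    // publish once per second
  while (ros::ok()) {
    std_msgs::String msg;
    msg.data = "alive";
    pub.publish(msg);
    rate.sleep();
  }
  return 0;
}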

However, even though there are plenty of methods in the two research fields and ROS provides a convenient integration platform, there are usually two challenges in the integration process of a robotic system. Firstly, the respective fields are extensive, so it is a challenge to discern which methods are appropriate for a specific application scenario. Secondly, because the integrated system is rarely discussed in the academic field, there is no set of metrics to evaluate and analyze it.

1.2 Research question

How to select appropriate methods for building a visual perception pipeline for manipulation tasks that combines robotic grasping with visual tracking and is able to run real-time? What challenges are present when integrating perception modules for a robotic system and what metrics can we use to analyze the resulting system?

1.3 Purpose

The purpose of this work is to answer the research question and attempt to provide solutions to the two challenges mentioned in Section 1.1. In response to the first challenge, the thesis reviews and categorizes the state of the art in robotic grasping and object tracking methods and summarizes the advantages and disadvantages of each category. This provides the basis for selecting methods according to the specific needs of different application scenarios.


Additionally, it integrates specific methods for a common scenario in the laboratory and discusses the technical, as well as task-related challenges in the integration process. Finally, in response to the second problem, a set of metrics is proposed to comprehensively evaluate the integrated system, which provides reliable information for the integrated pipeline.

1.4 Scope and limitations

This thesis classifies many methods according to their applicability in different scenarios. To illustrate the integration process, a prominent method from each field is selected. The goal is to elaborate on the problems commonly met during integration along with potential solutions, and to discuss the aspects that need to be considered when selecting metrics. It should be noted, however, that in a complete robot system there might be a need for the integration of further modules, depending on task requirements.

1.5 Structure of the thesis

Chapter 2 presents and categorizes the state of the art in robotic grasping and object tracking methods. Chapter 3 presents the methodology used to solve the problem. Chapter 4 presents the experimental design and process. Chapter 5 analyzes the experimental results and finally, Chapter 6 summarizes the whole thesis.


Chapter 2 Background

With the continuous development of information technology and the remarkable improvement of hardware computing capabilities, the application scenarios and service models of robots are continuously expanding. In turn, this has led the intelligent robot industry into an intense period of technological demand. At present, most commercial robots can only achieve simple functions, for example sweeping robots and toy robots. High-end service robots equipped with robotic arms are unable to reach the market on a large scale due to immature technologies and massive consumption of computing resources, which has caused a demand gap in the market. Research on more complex robots, capable of grasping objects using the visual information available in the real environment, has become an important direction for the future development of robots.

According to the field of neurology, the human brain mainly relies on visual and tactile information to understand the physical world. The eyes and hands are the main organs that help humans complete most tasks. For robots, "eyes" and "arms" are also very important: their visual information comes from the camera, and the robotic arm and fingers are used for various tasks. As an example, a typical task often encountered in industrial applications is to locate an object visually, grasp it and move it to a different location.

For robots, visual information plays a decisive role in the perception of the external environment, in object recognition as well as in robotic grasping. Moreover, compared to other perceptual information, vision provides a large amount of information at a low acquisition cost and with wide applicability.

Generating a grasping pose for unknown objects is equivalent to the analytical ability of the human "brain": the robot can obtain information about objects and the surrounding environment based on visual information and evaluate its grasping pose. The ability to recognize and track unknown objects is equivalent to the resolving power of the human eyes. In a robotic context, vision is responsible for the detection, recognition and tracking of objects in the robot's field of view. By providing sufficient visual information, these modules enable the robot to interact with its environment, e.g. by picking and placing objects.

The above reasons make robotic grasping and object tracking two important research directions. Robot grasping enables the robot’s "arm" to grasp objects accurately, while object tracking enables the robot’s "eyes" to obtain information regarding the objects at any time.

In the following sections, Section 2.1 and Section 2.2 classify prominent robotic grasping methods and object detection and tracking methods into different categories. Section 2.3 summarizes the methods and papers mentioned in this chapter.

2.1 Robotic grasping

Robotic grasping mainly involves object localization, object pose estimation, grasp detection and motion planning (Figure 2.1). Detecting objects and generating grasping poses is the first step and an essential part of successful grasping, which is also known as grasp synthesis. This section mainly summarizes the research on grasp synthesis methods.

Figure 2.1: Robotic grasping [1]

Before data-driven methods based on machine learning were widely employed, most robot grasping methods used analytical approaches [6]. Analytical approaches require accurate prior knowledge of the object properties and the mechanics of robotic arms. They are based on object geometry and physical models, using kinematics and dynamics to detect a potential grasp.

Since there are a large number of conditions to be met, the computational complexity is usually prohibitive, and such approaches cannot be applied in scenes without prior knowledge of the environment and the objects. In response to the problems of analytical methods, data-driven methods were introduced. The application of data-driven methods avoids complex calculations with regard to the objects' physical properties and significantly reduces the modelling requirements.

Among the data-driven methods, some methods [7, 8] need to establish an object model database in advance, so that a database containing grasp poses for the different models can also be generated offline. This database is employed to facilitate searching during the online stage. In the online stage, the scene is segmented, the objects are identified, the corresponding objects in the model database are found, and then the object pose estimation is carried out. According to the estimated pose, the proposed grasping position and orientation in the grasping database are then identified. Finally, a grasp is generated based on reachability filtering. However, the application scenarios of this approach are limited. In addition to requiring prior knowledge of the items, relying on searching in the database is also limited in cluttered scenarios, where occluded objects often cannot be matched to the database entries as they are only partially visible.

In the data-driven category that does not need to establish a model database in advance, some methods [9, 10] first segment the scene’s point cloud, and then plan the grasp according to the geometric characteristics of the segmented point cloud. Other methods allow the robot to grasp partially occluded, known objects [11], known objects with unknown poses [12], and completely unknown objects [13,4].

In this thesis, the method described in [4] will be used. Ten Pas et al. [4] calculate surface normals and curvatures based on geometric information, generate a large number of 6-DOF candidate poses and then utilize deep learning to classify the candidates. Finally, the best grasping pose is selected according to how easily the robot can reach it. This method achieves a 93% success rate for novel objects in clutter.

2.2 Object detection and tracking

Object detection is the task of identifying objects from backgrounds with different levels of complexity and separating the background to complete subsequent tasks such as tracking and recognition. Therefore, object detection is an essential task for the high-level understanding of the environment and its performance will directly affect the performance of subsequent high-level tasks such as object tracking, motion recognition, and behaviour understanding.

The most important part of an object detection task is the segmentation of the image to isolate the foreground target from the background. Therefore, object detection methods can be divided into detection methods that are based on background modelling and ones that are based on foreground modelling. Among them, the methods based on background modelling estimate the background, establish the correlation between the background model and time, compare the current frame with the background model, and indirectly separate the object [14, 15]. In contrast, modelling based on the foreground aims to establish the object model directly according to the grayscale, colour, and textural features of the foreground objects and usually includes online and offline stages. In the offline stage, object samples are used to train a classifier and, consequently, the classifier is used to classify and detect the object during the online stage [16, 17].

Object tracking is a continuous estimation process of the object state (Figure 2.2), which adds a temporal dimension to object detection. Object detection is a component of object tracking: it is used to initialize the object's state, and in some methods the tracking itself is based on object detection. Therefore, it is preferable to categorize the object detection methods according to characteristics that are simultaneously useful for the categorization of tracking methods.

Figure 2.2: Object tracking

In the following sections, the methods will be classified into 2D and 3D scenarios, and popular methods will be listed as examples.

2.2.1 2D scenario

In object detection and tracking, the box that closely surrounds the detected target is called the bounding box of the target. In 2D object detection, the bounding box is usually a rectangle. The main task of 2D object detection is to detect the predefined objects of interest in a given 2D image, classify the objects of interest, and locate them through a bounding box.


Object detection methods can be divided into two categories according to whether the image features are manually designed or obtained automatically.

One is feature representation based on human design, and the other is based on deep learning.

Feature representation based on human design

Artificially designed features can be roughly divided into four categories according to the differences in visual characteristics and feature calculations:

gradient features, pattern features, shape features, and colour features.

Gradient features describe the target by calculating the distribution of gradient intensity and direction in space. This type of feature has excellent scale and rotation invariance. The most widely used is the Scale Invariant Feature Transform (SIFT) [16], and several improved algorithms have also been proposed [18, 19]. Pattern features pertain to feature descriptions obtained by analyzing the relative differences of local regions of the image and are usually employed to express textural information [20].

Shape features are derived from model-based object detection [21] and are generally used to describe the contour of objects [22]. Like the gradient features, shape features also have excellent scale, rotation and translation invariant properties. However, many different types of objects may have similar shapes, so the detection methods based on shape features are limited in application.

Lastly, colour features are utilized by calculating the probability distribution of local image attributes (such as grayscale and colour). In recent years, they have been widely used in object detection [23], object tracking [24] and other tasks, achieving good results.
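As a brief, hedged illustration of the gradient features mentioned above (this example is not taken from the thesis and assumes OpenCV 4.4 or newer, where SIFT is available in the main features2d module; the image path is a placeholder), the following C++ sketch extracts SIFT keypoints and descriptors from a single image:

#include <opencv2/features2d.hpp>
#include <opencv2/imgcodecs.hpp>
#include <vector>

int main() {
  // Load a grayscale image of the object (path is illustrative).
  cv::Mat image = cv::imread("object.png", cv::IMREAD_GRAYSCALE);
  if (image.empty()) return 1;

  // Detect SIFT keypoints and compute their 128-dimensional descriptors.
  cv::Ptr<cv::SIFT> sift = cv::SIFT::create();
  std::vector<cv::KeyPoint> keypoints;
  cv::Mat descriptors;
  sift->detectAndCompute(image, cv::noArray(), keypoints, descriptors);

  // Each keypoint carries a scale and an orientation, which is what gives
  // SIFT its scale- and rotation-invariance; matching such descriptors
  // between frames is the basis of feature-based trackers such as Simtrack.
  return 0;
}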

Learning-based feature representation

In 2006, Hinton et al. [25] put forward the concept of deep learning, which brought the long-silent neural networks back into the public view. In the following years, deep learning methods have made significant progress. In 2012, the deep convolutional neural network AlexNet [26] won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC); its accuracy far exceeded that of the second-place entry, which was based on the Deformable Part Model (DPM), breaking the historical record of the challenge. In 2013, Sermanet et al. proposed an AlexNet-based object detection algorithm, OverFeat [27], which won the object detection task of ILSVRC 2013. Since then, more and more deep learning methods have been used in computer vision.


With the improvements in detection accuracy of deep learning algorithms such as Fast R-CNN [28], SSD [29] and YOLO [30], 2D object detection has been put into practical use. However, for applications in robotics, autonomous driving or augmented reality, the results of 2D object detection cannot meet the requirements. These methods only output the object's category and its 2D bounding box, which represents the approximate position of the target in the image. In a real scene, however, the object to be detected has 3D information, and several applications need to utilize both the position and the rotation information to assist further decision-making.

2.2.2 3D scenario

In 3D object detection and tracking, the bounding box is a cuboid rather than a rectangle. Knowing the 3D bounding box vertex coordinates and the CAD model of the rigid object, the six degrees of freedom of the target can be obtained, three of which describe the 3D orientation and the other three the position. Thus, 3D object detection is also called the 6D pose estimation problem. According to the sensors used, existing 3D object detection algorithms can be roughly divided into those employing visual features, point clouds and multi-modal fusion.

The visual feature methods have the advantages of low cost and rich texture features. They can be further divided into monocular vision and binocular (depth) vision methods according to the type of camera. The weakness of the former is that depth information cannot be obtained directly, which may lead to large positioning errors of the target in 3D space. The latter methods, such as Simtrack [31], not only provide rich texture information but also more accurate depth information, and at present they have a higher detection accuracy than the monocular-based ones. However, binocular/depth vision is sensitive to light conditions and other factors, which can quickly lead to deviations in the depth calculation.

Another category is the point cloud methods. A point cloud is a group of data points which can record location, colour or other information. Compared with visual features, point cloud data have accurate depth information and prominent 3D spatial characteristics. As such, they are widely used in 3D object detection. At present, 3D object detection algorithms based on point clouds generally follow two approaches: 3D point cloud projection and 3D voxel features. Complex-YOLO [32] and BirdNet [33] use point cloud projection to convert 3D point clouds into 2D images and apply standard 2D object detection networks (Faster R-CNN, YOLO, etc.); finally, they use the position dimension in the point cloud to restore the geometric pose in 3D space. 3DFCN [34] and VoxelNet [35] use 3D voxel methods to encode the 3D features of point clouds and use 3D convolution to extract the geometric pose. However, point cloud information lacks textural features, so it is difficult to achieve good performance in object detection and classification.

In the extreme case where the point cloud is relatively sparse, it cannot even provide useful spatial features.

Therefore, 3D object detection methods based on the multi-modal fusion of point clouds and visual information have been studied and used, for example in MV3D [36] and AVOD [37]. In these methods, the texture and other features of the image are used to detect the object, and the depth information of the point cloud is combined to recover the 3D geometric position and orientation of the object.

In addition to being classified by sensor or by scene scale, 3D object detection tasks can be divided into indoor and outdoor scenes. Due to the large differences between indoor and outdoor scenes, such as object size, type, and environmental complexity, there are many differences in the research methods for the two. The outdoor case mainly involves the large-scale detection and positioning of vehicles, pedestrians, bicycles and other targets. In this work, however, we are concerned with indoor scenes. The main challenges associated with an object's 6D pose estimation are the large variety of object types, as well as how greatly their appearance varies depending on the viewing perspective.

2.3 Summary

Table 2.1 summarizes the advantages and disadvantages of the above methods. Concretely, among the robotic grasping methods, analytical methods can accurately calculate the grasp pose. However, because of their complexity and dependence on prior knowledge, they have been gradually replaced by data-driven methods. The data-driven methods can be classified according to whether or not an object or grasp pose database needs to be established in advance. When selecting a method for system integration, it is necessary to consider whether it is worthwhile to build a database for the application scenario at hand. Database methods are useful for scenarios in which no new objects appear, as they achieve higher accuracy.

As for object detection and tracking, methods based on 2D images have been widely used in industry, for example in face recognition, pedestrian detection and intelligent video surveillance. In robotic grasping, however, the 3D position and 3D orientation information are significant, as they are needed for further manipulation of the object. 3D object detection and tracking can be divided into three categories according to the sensors used. Visual feature methods and point cloud methods provide rich texture features and accurate depth and spatial characteristics, respectively. As for the multi-modal methods, the accuracy and the increased demand for computational resources, caused by the complexity of the implementation, need to be taken into account.

Therefore, for scenarios where no novel objects appear and there is little to no occlusion between the objects, the method of establishing a database in advance can be appropriate. In contrast, in the case of a heavily cluttered scenario with unknown objects, it is advisable to select methods that do not require establishing a database in advance. The detection and tracking method can be selected according to the type of sensors and the available computing resources. When the computing resources are limited and the target is relatively large, the deviation in depth will not have a significant impact, so a visual feature-based method can be selected. On the other hand, when the target object does not have rich textures, a point cloud method will give better results. A final point to consider is the availability of the code, which promotes reproducibility.

The above sections classify the robotic grasping and object tracking methods based on scenarios. Based on this classification, GPD [4], from the category of data-driven methods that do not require a database in advance, and Simtrack [31], from the category of visual feature-based methods, are selected for integration. This combination can generate grasp poses in clutter and track objects by their visual features, which allows the robot to perform grasping and tracking tasks in more complex environments while consuming fewer computational resources. In the next chapter (Chapter 3), the technical and task-related challenges in the integration process will be discussed and the system will be evaluated.


Table 2.1: Advantages and disadvantages of different methods

Robotic grasping

Analytical methods ([6])
Advantages: 1. Accurate results.
Disadvantages: 1. Cannot run in real-time; 2. Prior knowledge of the environment needed.

Data-driven methods, database needed ([7], [8])
Advantages: 1. High accuracy in uncluttered scenarios; 2. Less execution time.
Disadvantages: 1. Prior knowledge of the environment and object needed; 2. Low accuracy in cluttered scenarios.

Data-driven methods, no database needed in advance ([9], [10], [11], [12], [13], [4])
Advantages: 1. Less prior knowledge of the environment and the object needed; 2. Higher accuracy and shorter execution time; 3. Some methods can work in clutter.
Disadvantages: 1. Different algorithms have room for improvement in speed, computational complexity and accuracy.

Object detection and tracking

2D ([16], [18], [19], [20], [21], [22], [23], [24], [25], [26], [27], [28], [29], [30])
2D tracking is widely used in scenes that do not require the 3D information of targets, such as vehicles and pedestrians, while robotic grasping usually requires detailed pose information.

3D, visual features ([31])
Advantages: 1. Low cost; 2. Rich texture features.
Disadvantages: 1. Sensitive to light conditions and other environmental factors, which can lead to inaccurate depth information.

3D, point cloud ([32], [33], [34], [35])
Advantages: 1. Accurate depth information and spatial characteristics.
Disadvantages: 1. Lack of texture features, thus difficult to achieve object detection and classification.

3D, multi-modal ([36], [37])
Advantages: 1. Accurate depth information and spatial characteristics; 2. Rich texture features.
Disadvantages: 1. Complex implementation.


Chapter 3

Methodology and implementation

The overall diagram of the system is shown in Figure 3.1. Each part of the diagram will be introduced in detail in the following sections.

Figure 3.1: Overall diagram of the system

When attempting to integrate two distinct systems, the challenges fall into two main categories. The first is technical challenges: the repositories that hold the code often require different hardware and software library support, so solving the compatibility issues between libraries and adjusting each package to the specific hardware is necessary. The second is task-related challenges. In this project, with the camera used for visual input fixed in place, the grasp pose of one object needs to be generated in the clutter and the object then needs to be tracked while it is moved manually. The visualization tool Rviz is responsible for showing this pipeline. To achieve this, several further steps need to be implemented. Table 3.1 briefly introduces GPD [4] and Simtrack [31].

Before launching the two packages in sequence, some parameters need to be set to accommodate the specific experimental conditions. Then, because each module works in its own reference frame, the frame relationship between the two modules and the point cloud received by the camera need to be handled. Therefore, the following necessary steps for integration will be described in detail in the following sections:

• Setting appropriate parameters;

• Creating a reference frame relationship for GPD and Simtrack;

• Adjusting the orientation of the point cloud and publishing it to GPD.

Table 3.1: Introduction of GPD and Simtrack

GPD: Package to detect grasp poses in dense clutter, with point clouds as input and poses of viable grasps as output; the input point cloud needs to be adjusted to be displayed in Rviz. Link: GPD (GitHub).

Simtrack: Package to detect and track the pose of objects in real-time. Link: Simtrack (GitHub).

In this chapter, technical-related challenges will be described in Section 3.1, where the hardware and operating environment needed in this project will be introduced. The task-related challenges will be introduced in detail in Section 3.2. Finally, Section 3.3 introduces how to model the objects for the system.


3.1 Hardware and operating environment

3.1.1 Kinect v2

Kinect is a motion-sensing peripheral launched by Microsoft, originally for the video game industry, which gradually ended up being utilized in both industrial and academic fields. Two generations of the camera have been released. Compared with Kinect v1 (released in 2012), Kinect v2 (released in 2014) is greatly improved in many aspects, such as the resolution of the depth and RGB cameras and the range of detection (Figure 3.2).

Figure 3.2: Differences between Kinect v1 and v2 [2]

The depth camera used in this project is the Kinect v2. As shown in Figure 3.3, its hardware mainly includes three parts: an RGB camera, a depth sensor (IR camera) and an infrared emitter. Kinect v2 uses Time of Flight (TOF) technology to obtain depth data: the infrared emitter sends pulsed light, the light is projected onto the target object, the reflected light is received by the depth sensor and, finally, the distance of the target object is calculated from the time difference. Therefore, the infrared camera can collect not only infrared images (in grayscale representation) but also depth information.
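As a short illustration of the time-of-flight principle (this formula is standard and is not quoted from the thesis): with $c$ the speed of light and $\Delta t$ the measured round-trip time of the emitted pulse, the distance to the object is

$$ d = \frac{c \, \Delta t}{2}, $$

where the factor of two accounts for the pulse travelling to the object and back.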


Figure 3.3: Hardware composition of Kinect v2 [3]

A driver is the software that allows a hardware device and the operating system to communicate with each other; therefore, the Kinect v2 also needs a driver. This project uses the driver libfreenect2 [38] to provide the RGB image, depth image and point cloud data to work with, and uses iai_kinect2 [39] as the bridge between libfreenect2 and ROS.

3.1.2 GeForce GTX 980M

Figure 3.4: GeForce GTX 980M GPU information

Two GeForce GTX 980M Graphics Processing Units (GPUs) are used in this project. The 980M uses the second-generation Maxwell architecture, which has higher performance and power efficiency than the previous generation of NVIDIA GPU architectures.

As shown in Figure 3.4, the 980M has 1536 CUDA cores and 8192 MB of dedicated memory. One of this project's objectives is measuring the GPU usage; the recording method will be introduced in detail in Chapter 4 (Experiments).


3.1.3 Operating environment

Further system requirements and details are listed in Appendix A, and solutions to compatibility issues are listed in Appendix B and Appendix C.

3.2 Implementation Process

3.2.1 Parameter setting

Figure 3.5: System composition diagram

GPD

There are three types of parameters that need to be determined in order to use GPD:

1. Workspace

The workspace is the area where the grasp pose is generated. The system composition diagram is shown in Figure 3.5. In order to successfully produce the candidate grasps, it is necessary to establish the spatial relationship between the camera and the workspace, as well as to determine the workspace's size. These parameters allow GPD to generate grasp poses only in the desired area (a minimal point-cloud cropping sketch is given after this parameter list). In this project, the workspace is located on the ground; its length, width and height are 0.6, 0.5 and 0.25 meters respectively.


2. Representations for the neural network

GPD has two different input representations for the neural network, a 3-channel and a 15-channel one. For all of our experiments, we chose the 15-channel representation as it provides higher accuracy.

3. Number of grasp candidates

Figure 3.6 compares the processing time of the different representations for different numbers of point cloud points. As introduced in Section 3.1, in this project we use a Kinect camera, whose point clouds contain more points than those shown in Figure 3.6, and the processing time scales accordingly. In order to reduce the running time, the experiments reduce the number of candidates to approximately 200. The reason for this is that in preliminary experiments, no significant difference in accuracy was found between 1000 and 200 candidates. The number of grasp candidates cannot be edited directly, but the parameter which controls the number of samples to be generated can be reduced; for this project, it is set from the default value of 500 to 50.

Figure 3.6: Run time result in paper [4]
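To make the workspace parameter above concrete, the sketch below shows one way to crop an input point cloud to the stated 0.6 x 0.5 x 0.25 m volume using PCL's CropBox filter. This is an illustrative sketch rather than the project's actual code: GPD itself is configured with the workspace limits, so the snippet only demonstrates the geometric effect of that parameter, and it assumes the cloud is already expressed in the world frame with the origin centred at the bottom of the workspace. The function name is hypothetical.

#include <Eigen/Core>
#include <pcl/point_cloud.h>
#include <pcl/point_types.h>
#include <pcl/filters/crop_box.h>

// Keep only the points inside the 0.6 x 0.5 x 0.25 m workspace so that grasp
// candidates are generated in the desired area only.
pcl::PointCloud<pcl::PointXYZ>::Ptr
cropToWorkspace(const pcl::PointCloud<pcl::PointXYZ>::Ptr& cloud_in) {
  pcl::CropBox<pcl::PointXYZ> box;
  box.setInputCloud(cloud_in);
  // Workspace limits in the world frame (x: length, y: width, z: height).
  box.setMin(Eigen::Vector4f(-0.30f, -0.25f, 0.00f, 1.0f));
  box.setMax(Eigen::Vector4f( 0.30f,  0.25f, 0.25f, 1.0f));

  pcl::PointCloud<pcl::PointXYZ>::Ptr cloud_out(new pcl::PointCloud<pcl::PointXYZ>);
  box.filter(*cloud_out);
  return cloud_out;
}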

Simtrack

In Simtrack, the detector and the tracker, which are responsible for detecting and tracking the objects respectively, can run either on one GPU or separately on two GPUs. When running on one GPU, the detector stops working once all objects are detected. When an object is removed from the scene and replaced, or when the tracking is suddenly interrupted, the response speed is affected, since the position of the object needs to be detected again. Therefore, for the experiments of this project, the tracker and the detector work on separate GPUs.

3.2.2 Reference frame relationship

Since both modules will operate in the same space, where the generated poses need to be known and unified, an essential step in the system integration process is establishing the frame relationships. In this project, GPD has no default reference frame, while Simtrack uses the same reference frame as the Kinect driver (kinect2_rgb_optical_frame, referred to as the Kinect frame in the following). Thus, this step creates a frame for GPD, calculates the relationship between GPD and Simtrack, and then publishes this relationship.

The positional and rotational relationship is depicted in Figure 3.7. The world frame is placed vertically below the Kinect and at the same height as the bottom of the workspace. The default orientation of the Kinect frame is set by the Kinect driver, and the camera works at its maximum rotation angle (about 30 degrees) in this project.

Figure 3.7: Reference frame relationship: horizontal position (left); working position (right)

The integrated environment uses ROS as the common operating system. ROS provides a software package (tf) to describe the transformation relationships between frames and to establish the full pose relationship for existing or newly created frames. Two common ways of expressing orientations are Euler angles, which are denoted by three numbers, and axis-angle representations, which are usually denoted by four numbers.

Euler angles are used to express the orientation of a rigid body and direction transformations in a 3D coordinate system. There are two commonly used conventions: proper Euler angles and Tait-Bryan angles. The rotation orders of proper Euler angles are (x, y, x), (x, z, x), (y, x, y), (y, z, y), (z, x, z), (z, y, z). That is, in an (a, b, a) order: after rotating by a certain angle around the a-axis, rotate around the newly generated b-axis by an angle, and finally rotate around the new a-axis obtained after these two rotations. The Tait-Bryan angles are usually represented by the following rotation orders: (x, y, z), (x, z, y), (y, x, z), (y, z, x), (z, x, y), (z, y, x). This convention traverses the three axes of the Cartesian coordinate system; for example, the roll-pitch-yaw angles correspond to the (x, y, z) case. Given a representation, it is straightforward to derive the rotation matrix. For example, in the case of Euler angles (x, y, z), the rotation matrix can be derived as follows:

Figure 3.8: Euler angles


$$
M = \mathrm{Rot}(x,\alpha)\,\mathrm{Rot}(y,\beta)\,\mathrm{Rot}(z,\gamma)
=
\begin{pmatrix}
1 & 0 & 0\\
0 & \cos\alpha & -\sin\alpha\\
0 & \sin\alpha & \cos\alpha
\end{pmatrix}
\begin{pmatrix}
\cos\beta & 0 & \sin\beta\\
0 & 1 & 0\\
-\sin\beta & 0 & \cos\beta
\end{pmatrix}
\begin{pmatrix}
\cos\gamma & -\sin\gamma & 0\\
\sin\gamma & \cos\gamma & 0\\
0 & 0 & 1
\end{pmatrix}
\tag{3.1}
$$

$$
M =
\begin{pmatrix}
c_\beta c_\gamma & -c_\beta s_\gamma & s_\beta\\
c_\alpha s_\gamma + c_\gamma s_\alpha s_\beta & c_\alpha c_\gamma - s_\alpha s_\beta s_\gamma & -c_\beta s_\alpha\\
s_\alpha s_\gamma - c_\alpha c_\gamma s_\beta & c_\gamma s_\alpha + c_\alpha s_\beta s_\gamma & c_\alpha c_\beta
\end{pmatrix}
\tag{3.2}
$$

where $c$ and $s$ denote the cosine and sine of the subscripted angle.

There are many other choices of rotation sequence and alternative orientation representations [40]. Among them, a commonly used formulation in robotics is the angle-axis representation. The difference with Euler angles is that this method no longer requires multiple rotations to reach the target orientation; instead, it finds a single rotation axis, and the target orientation is obtained by rotating once around that axis. Closely related is the quaternion representation q = (x, y, z, w), where w = cos(θ/2) encodes the rotation angle θ and the vector (x, y, z) = sin(θ/2) · a encodes the rotation axis a. The rotation matrix of a (unit) quaternion is:

$$
M =
\begin{pmatrix}
1 - 2y^2 - 2z^2 & 2(xy - zw) & 2(xz + yw)\\
2(xy + zw) & 1 - 2x^2 - 2z^2 & 2(yz - xw)\\
2(xz - yw) & 2(yz + xw) & 1 - 2x^2 - 2y^2
\end{pmatrix}
\tag{3.3}
$$

Thus, Euler angles (α, β, γ) about (x, y, z) can be converted into a quaternion as:

$$
q = \begin{pmatrix} w \\ x \\ y \\ z \end{pmatrix} =
\begin{pmatrix}
\cos\frac{\alpha}{2}\cos\frac{\beta}{2}\cos\frac{\gamma}{2} - \sin\frac{\alpha}{2}\sin\frac{\beta}{2}\sin\frac{\gamma}{2}\\[2pt]
\sin\frac{\alpha}{2}\cos\frac{\beta}{2}\cos\frac{\gamma}{2} + \cos\frac{\alpha}{2}\sin\frac{\beta}{2}\sin\frac{\gamma}{2}\\[2pt]
\cos\frac{\alpha}{2}\sin\frac{\beta}{2}\cos\frac{\gamma}{2} - \sin\frac{\alpha}{2}\cos\frac{\beta}{2}\sin\frac{\gamma}{2}\\[2pt]
\cos\frac{\alpha}{2}\cos\frac{\beta}{2}\sin\frac{\gamma}{2} + \sin\frac{\alpha}{2}\sin\frac{\beta}{2}\cos\frac{\gamma}{2}
\end{pmatrix}
\tag{3.4}
$$

In this project, the relationship between the Kinect frame and the world frame can be described as: rotate the world frame around its x-axis by approximately −120 degrees to obtain the orientation of the Kinect frame. The Euler angles are therefore (yaw, pitch, roll) = (0, 0, −2.112) rad, and the corresponding quaternion is (x, y, z, w) = (−0.870, 0, 0, 0.492). The frame is then translated along the z-axis by 0.75 meters to obtain the pose of the Kinect frame.
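As a small sanity check of these numbers (not part of the thesis; it simply reuses the tf2 math library that ships with ROS), the following C++ snippet converts the stated roll angle into a quaternion:

#include <tf2/LinearMath/Quaternion.h>
#include <cstdio>

int main() {
  // Roll-pitch-yaw of the Kinect frame relative to the world frame,
  // as stated above: roll = -2.112 rad, pitch = 0, yaw = 0.
  tf2::Quaternion q;
  q.setRPY(-2.112, 0.0, 0.0);   // setRPY(roll, pitch, yaw)
  std::printf("(x, y, z, w) = (%.3f, %.3f, %.3f, %.3f)\n",
              q.x(), q.y(), q.z(), q.w());
  // Prints values close to (-0.870, 0.000, 0.000, 0.492).
  return 0;
}

In a ROS setup, this fixed world-to-Kinect relationship could then be broadcast, for instance with tf's static_transform_publisher, using this quaternion together with the 0.75 m translation along z.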

Figure 3.9 depicts the different frames while the integrated system is running, showing that the relationship between the world frame and the Kinect frame has been established successfully.


Figure 3.9: tf tree of the system

3.2.3 Point cloud publishing

In the previous section we created a reference frame, the world frame, for GPD, so now we can publish the point cloud received by the camera to the world frame of GPD. Since the default point cloud orientation is the same as the camera's, the point cloud needs to be transformed before being published to GPD.

In order to transform the point cloud to the appropriate orientation, we follow the steps denoted in Algorithm 1.

Algorithm 1: Point cloud publisher

Input: Original point cloud: cloud_msg_in

Output: Point cloud after transformation: cloud_msg_out

1 subscribe to the topic publishing the cloud cloud_msg_in;

2 convert ROS message cloud_msg_in to format cloud_in;

3 read relationship X between world frame and Kinect frame;

4 apply X to cloud_in to get cloud_out;

5 convert format cloud_out to ROS message cloud_msg_out;

6 publish cloud_msg_out to the topic of GPD;

The function of the node created in Algorithm 1 can be summarized as: read the original point cloud from the camera, adjust the orientation, and publish it to GPD. Figure 3.10 also describes this process.
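A minimal C++ sketch of such a node is given below. It is not the project's actual pcl_pub implementation but an illustration of Algorithm 1 using standard ROS facilities (tf and pcl_ros); the topic and frame names are placeholders that would have to match the Kinect driver and the GPD configuration.

#include <ros/ros.h>
#include <sensor_msgs/PointCloud2.h>
#include <tf/transform_listener.h>
#include <pcl_ros/transforms.h>

// Placeholder topic and frame names (assumptions, not taken from the thesis).
static const std::string kInputTopic  = "/kinect2/qhd/points";
static const std::string kOutputTopic = "/cloud_for_gpd";
static const std::string kWorldFrame  = "world";

ros::Publisher pub;
tf::TransformListener* listener = nullptr;

void cloudCallback(const sensor_msgs::PointCloud2ConstPtr& cloud_msg_in) {
  // Steps 3-4 of Algorithm 1: wait for the world<-Kinect relationship and
  // apply it to the incoming cloud.
  if (!listener->waitForTransform(kWorldFrame, cloud_msg_in->header.frame_id,
                                  cloud_msg_in->header.stamp, ros::Duration(1.0))) {
    ROS_WARN("Transform to %s not available yet", kWorldFrame.c_str());
    return;
  }
  sensor_msgs::PointCloud2 cloud_msg_out;
  if (!pcl_ros::transformPointCloud(kWorldFrame, *cloud_msg_in,
                                    cloud_msg_out, *listener)) {
    ROS_WARN("Failed to transform the point cloud");
    return;
  }
  // Step 6: publish the re-oriented cloud on the topic GPD listens to.
  pub.publish(cloud_msg_out);
}

int main(int argc, char** argv) {
  ros::init(argc, argv, "pcl_pub");
  ros::NodeHandle nh;
  tf::TransformListener tf_listener;
  listener = &tf_listener;
  pub = nh.advertise<sensor_msgs::PointCloud2>(kOutputTopic, 1);
  ros::Subscriber sub = nh.subscribe(kInputTopic, 1, cloudCallback);
  // Steps 1-2 and 5 (subscription and message conversion) are handled by
  // ROS and pcl_ros internally.
  ros::spin();
  return 0;
}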

After the necessary operations to make the two components of the pipeline compatible, the task-related modifications are completed. The integrated system of GPD and Simtrack can now operate as expected and use the same reference frames; an example of this can be seen in Figure 3.11.

Figure 3.10: Role of node pcl_pub

Figure 3.11: Integrated system: grasp pose generated (above) and reference frames in tracking process (below)


3.3 Object modelling

After the system integration is completed, the next step is to model the objects involved in the experiment. Autodesk 123D Catch is the modeling software recommended by the Simtrack tutorial [41]; however, it is no longer available. This section therefore introduces a modeling method using the software Autodesk ReCap Photo.

Autodesk ReCap Photo is an alternative to Autodesk 123D Catch. To build a model of a real object, pictures of the object from different angles need to be collected, which allows the software to build the model from the information in the photos. After the model is generated, in order to get a real-scale model in the simulation world, it is necessary to manually edit the position of the reference frame and the scale of the model using the software Blender.

In all of the following experiments, objects from the Yale-CMU-Berkeley (YCB) object set were utilized. YCB is a collection of objects that aims to provide a benchmark for robotic grasping and manipulation. The set includes common, everyday objects of different shapes, sizes, textures, weights and rigidities. The objects used in the experiments are shown in Figure 4.3, where Objects 1-3 are YCB objects and Object 4 is a randomly selected object.

The models of the YCB objects are accessible online [42]. Thus, for the YCB objects, only the reference frame needs to be adjusted.


Chapter 4

Experiments

This chapter will discuss the design of metrics and introduce the experimental process in detail.

4.1 Metrics

The evaluation metrics vary depending on the application field. For instance, visual tracking usually considers center location error, the accuracy of the tracker, failure scores and so on [43], while the run-time efficiency, the success rate of grasping in clutter, etc. are the more appropriate metrics in robotic manipulation. Thus, while designing the metrics of the integrated system, two aspects need to be considered. Firstly, we need to find common metrics for the methods used. Secondly, we need to consider the characteristics of the integrated system itself and how to measure it.

For the integrated system of this thesis, the common metric for GPD and Simtrack is the success rate. In the GPD work, the success rate denotes the number of grasp successes as a fraction of the total number of grasp attempts, while Simtrack measures the proportion of synthetic sequences that can be tracked successfully. However, in a practical application, it is of little significance to test the success rate on synthetic sequences. This is due to the fact that, in possible applications of the whole robot system, the real-time pose information of the object may be fed back to other modules which are executing tasks in parallel. Hence, if the tracker cannot work continuously and only tracks parts of a sequence, the stability of the system will be negatively affected.

For this reason, in this experiment we introduce the metric of robustness.

As mentioned in Section 1.4, a complete robot system usually has more tasks to execute, which requires more computing resources. Therefore, monitoring the use of computing resources helps to determine whether the system is too resource-intensive. Algorithms that involve image processing, such as the tracking part of this system, consume a large number of processing units on the GPU. Thus, this thesis takes real-time GPU utilization and memory usage as a metric. Besides helping to monitor whether the system is consuming large amounts of resources, GPU utilization and memory usage also provide information for possible optimization of the algorithms.

The CUDA toolkit will be used to facilitate the measurement.

Finally, since we are interested in practical applications, whether the system can run in real-time is a key parameter that needs to be measured. For this purpose, we add run-time results to show how long the system takes to accomplish tasks.

4.2 Experimental design

The experiment is designed to introduce how the above metrics will be used to analyze and evaluate the integrated system.

The experiments explore the relationship between the system performance and the number and types of objects. Thus, we consider two sets of experimental scenarios. The first set is based on the number of objects included, which ranges from 1 to 4 (Figure 4.2). The second set considers each type of object individually (Figure 4.3). Since in real-world scenarios the placement of the objects is usually disorderly, each scenario was repeated 30 times, and the objects were placed randomly each time to avoid outliers and biases. The results for each scene were averaged.

For each scene, the experimental flow chart is shown below:


Figure 4.1: Flow chart of the experiment


Figure 4.2: Scenarios: 1-4 objects

Figure 4.3: Scenarios: Object 1-4

Three metrics are introduced in Section 4.1: 1. robustness, 2. GPU utilization and memory usage, 3. run-time results. In the following we will introduce how these three metrics were measured during the experiment.

• Robustness

Robustness denotes the success rate of the experiment. A run is considered successful only when both steps, grasp pose generation and tracking, succeed. Failure type A represents a failure to generate a stable grasp pose (Figure 4.4) and Failure type B represents an interruption of the tracking.

Figure 4.4: Failed grasp pose (left). Successful grasp pose (right)

• GPU utilization and memory usage

Memory usage is the percentage of GPU memory used. GPU utilization is the percentage of time during which one or more kernels were executing on the GPU over the sampling period, where the sampling period denotes how often we polled the GPU. NVIDIA provides an efficient tool to manage and monitor GPU-related information; Appendix D provides the detailed method used to record this metric, and a minimal recording sketch is given after this list.

• Run-time results

Run-time results denote the run-time of the system to accomplish a task. In this experiment, since the two modules are started almost at the same time (ignoring the running speed of the launch file), the grasp pose generation and the pose detection process coincide. Moreover, as it takes longer to generate the grasp pose, the run-time here is recorded as the time from the beginning until the grasp pose is generated.
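As a hedged illustration of how such values can be recorded programmatically (this sketch uses NVIDIA's NVML library rather than the exact command-line procedure of Appendix D; the 500 ms sampling period mirrors the one used in Chapter 5):

#include <nvml.h>
#include <chrono>
#include <cstdio>
#include <thread>

// Periodically print GPU utilization and memory usage for every device.
int main() {
  if (nvmlInit() != NVML_SUCCESS) return 1;

  unsigned int device_count = 0;
  nvmlDeviceGetCount(&device_count);

  for (int sample = 0; sample < 10; ++sample) {     // record 10 samples
    for (unsigned int i = 0; i < device_count; ++i) {
      nvmlDevice_t device;
      nvmlUtilization_t util;
      nvmlMemory_t memory;
      nvmlDeviceGetHandleByIndex(i, &device);
      nvmlDeviceGetUtilizationRates(device, &util);
      nvmlDeviceGetMemoryInfo(device, &memory);
      std::printf("Device %u: GPU utilization %u%%, memory %.1f%% (%llu / %llu MiB)\n",
                  i, util.gpu,
                  100.0 * static_cast<double>(memory.used) / memory.total,
                  memory.used / (1024 * 1024), memory.total / (1024 * 1024));
    }
    std::this_thread::sleep_for(std::chrono::milliseconds(500));  // sampling period
  }

  nvmlShutdown();
  return 0;
}

The program is linked against the NVML library (e.g. -lnvidia-ml); the same numbers can also be obtained with NVIDIA's nvidia-smi utility.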


Chapter 5

Results and Analysis

This chapter presents the results as captured by the chosen metrics and analyzes them. Section 5.1 and Section 5.2 show and analyze the GPU utilization and memory usage as well as the robustness and run-time results for the two sets of experimental scenarios, respectively.

5.1 GPU utilization and memory usage

5.1.1 Number of objects

For the first set of experiments, the scenarios are divided according to the number of objects. Figures 5.1, 5.2, 5.3 and 5.4 illustrate the GPU utilization and memory usage in the four scenarios where the number of objects is 1, 2, 3 and 4 respectively.

Figure 5.1: GPU utilization and memory usage in a one object scenario


Figure 5.2: GPU utilization and memory usage in a two objects scenario

Figure 5.3: GPU utilization and memory usage in a three objects scenario

Figure 5.4: GPU utilization and memory usage in a four objects scenario

The sampling period is 500 ms, and the average experiment time of the four scenarios is about 15 minutes each. Within every 15 minutes, we repeat the following steps: grasp pose generation, object tracking, object replacement and, finally, rearrangement of the remaining objects. The corresponding percentages for the above four figures are shown in Table 5.1. From the table, we observe that the mean value and standard deviation of the GPU parameters are similar in each scenario, which indicates that the number of objects has little effect on the GPU utilization and memory usage. The average memory usage of the two devices is around 9% and 25% respectively, and the GPU utilization is far from 100%. This means more than 70% of the memory and 50% of the GPU time can be used for other tasks in the robot system (for example, motion planning) and for additional GPU-accelerated tasks.

Table 5.1: GPU utilization and memory usage for Scenario Set One (mean ± standard deviation)

           Device 0                              Device 1
           Memory usage      GPU utilization     Memory usage      GPU utilization
1 object   10.29% (±3.13%)   22.97% (±4.52%)     26.42% (±6.60%)   51.56% (±9.36%)
2 objects   9.24% (±2.60%)   19.68% (±4.56%)     25.65% (±7.21%)   51.53% (±11.96%)
3 objects   9.13% (±2.61%)   19.82% (±4.40%)     26.29% (±7.23%)   52.15% (±11.04%)
4 objects   8.97% (±2.75%)   19.57% (±4.68%)     26.26% (±7.44%)   51.61% (±11.43%)

5.1.2 Types of objects

For the second set of experiments, the scenarios are divided by the types of objects. Figures 5.5, 5.6, 5.7, 5.8 contain the GPU utilization and memory usage of scenarios for different objects.

Figure 5.5: GPU utilization and memory usage for Object One


Figure 5.6: GPU utilization and memory usage for Object Two

Figure 5.7: GPU utilization and memory usage for Object Three


Figure 5.8: GPU utilization and memory usage for Object Four

There are two essential stages of the experiment in the above figures: Phase One (P1) and Phase Two (P2). P1 is the stage from the start of system booting to its completion, which includes the start of all the system nodes (the first grasp pose is generated in this step) and the start of the visualization software Rviz. Phase P2 corresponds to when the object is moving and the system is tracking it.

It can be observed that the GPU utilization and memory usage of the two GPUs increase significantly at the same time at the beginning of P1; this is because Device 0 is responsible for booting the tracker in addition to driving the screen, while Device 1 is responsible for starting the detector. At the end of P1, the GPU utilization of Device 0 suddenly rises; this is due to Rviz starting the visualization, meaning that Device 0 needs to render the tracker image. There is no noticeable change at the start and end of the P2 stage, which indicates that the GPU works continuously in this stage and that the movement of the object does not increase the GPU workload.

5.2 Robustness and run-time results

This section evaluates the system's robustness and reports the run-time results for the two sets of scenarios; a method for improving the success rate on small objects is proposed at the end of Subsection 5.2.2.


5.2.1 Number of objects

For Scenario Set One, Figure 5.9 depicts the time required to generate 30 grasp poses and how it evolves with the number of objects. Table 5.2 summarizes the number of occurrences of each failure type, the success rate, and the mean and standard deviation of the run-time in each scenario.

Figure 5.9: Time recording for Scenario Set One

Table 5.2: Run-time and success rate of Scenario Set One

             Failure type A   Failure type B   Success rate   Time mean (s)   Time std. dev. (s)
1 object            9                6            50.00%          2.0176           0.5549
2 objects           7                9            46.67%          1.6706           0.5208
3 objects          12                8            33.33%          1.5066           0.3967
4 objects          12                9            30.00%          1.7507           0.6794

Combining Figure 5.9 and Table 5.2, it can be seen that there is no apparent correlation between the run-time and the number of objects. The average generation time is 1.5–2.0 s, and the small standard deviation shows that the system can stably generate grasp poses for any number of objects. In contrast, the success rate decreases as the number of objects increases. Looking at the two failure types, type A failures increase slightly when the number of objects grows. It can therefore be inferred that the increase in scene complexity affects the success rate of grasp pose generation. For example, when a grasp pose is generated between two objects, the trial is regarded as a failure, and the fewer objects there are, the lower the probability of this happening. In addition, because a single camera yields only a partial point cloud, the perceived shape of each object deviates from its actual shape, and the more objects there are, the larger this deviation becomes. This is another reason why the success rate decreases with scene complexity.
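As an illustration of how the per-trial measurements could be aggregated into the failure counts, success rate, and run-time statistics reported above, consider the sketch below; the Trial record and the summarize function are hypothetical, not part of the thesis code.

```python
# Hypothetical aggregation of grasp-pose-generation trials into the statistics
# reported in Tables 5.2 and 5.3 (the record format is illustrative).
from dataclasses import dataclass
from statistics import mean, stdev
from typing import List, Optional


@dataclass
class Trial:
    runtime_s: float        # time to generate the grasp pose, in seconds
    failure: Optional[str]  # None on success, otherwise "A" or "B"


def summarize(trials: List[Trial]) -> dict:
    failures_a = sum(1 for t in trials if t.failure == "A")
    failures_b = sum(1 for t in trials if t.failure == "B")
    successes = len(trials) - failures_a - failures_b
    # Assumption: run-time statistics are computed over all trials,
    # successful or not; the thesis does not state this explicitly.
    times = [t.runtime_s for t in trials]
    return {
        "failure_A": failures_a,
        "failure_B": failures_b,
        "success_rate": successes / len(trials),
        "time_mean": mean(times),
        "time_std": stdev(times),
    }


# Example: for the 30 trials of the one-object scenario this would yield
# 9 type A failures, 6 type B failures, and a 50% success rate.
```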

5.2.2 Types of objects

For Scenario Set Two, Table 5.3 summarizes the number of occurrences of each failure type, the success rate, and the mean and standard deviation of the run-time in each scenario.

Table 5.3: Run-time and success rate of Scenario Set Two

               Failure type A   Failure type B   Success rate   Time mean (s)   Time std. dev. (s)
Object One            8                6            53.33%          1.9348           0.4985
Object Two            7               11            40.00%          1.5672           0.4325
Object Three          8               14            26.67%          1.6345           0.5384
Object Four          11               15            13.33%          1.5203           0.7384

Table 5.3 shows that the success rate decreases from Object One to Object Four. A closer look at the table reveals that the number of type B failures increases significantly across the objects; this is because Simtrack relies on SIFT keypoints to identify the features of objects. As introduced before, Object Two and Object Four are textureless (Figure 5.10), while Object Three and Object Four are smaller than the first two. When the Kinect is far away from an object, the resolution is too low to recognize its texture, which is another reason for the decrease in success rate.

Figure 5.10: Objects Two and Four
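Since Simtrack's ability to lock onto an object depends on how many SIFT keypoints its appearance yields, one quick way to anticipate type B failures is to count keypoints on an image crop of the object. The sketch below does this with OpenCV; the image path and the threshold are chosen purely for illustration.

```python
# Rough texture-richness check: count SIFT keypoints on an object crop.
# Assumes opencv-python with SIFT available (cv2.SIFT_create, OpenCV >= 4.4).
import cv2


def sift_keypoint_count(image_path: str) -> int:
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    if img is None:
        raise FileNotFoundError(image_path)
    keypoints = cv2.SIFT_create().detect(img, None)
    return len(keypoints)


if __name__ == "__main__":
    count = sift_keypoint_count("object_crop.png")  # hypothetical object crop
    # Very few keypoints suggests a textureless object or an object imaged at
    # too low a resolution, matching the type B failures discussed above.
    print(count, "keypoints:", "texture-poor" if count < 50 else "textured")
```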


5.3 Summary

From the above analysis, the number of objects does not affect the consumption of GPU computing resources, and while the system is running, more than 70% of the memory and 50% of the GPU time remain available for other tasks in the full robot system and for additional GPU-accelerated tasks. In the scenario set where the scenarios are divided according to the types of objects, the object type likewise has little influence on GPU resource consumption. Section 5.1 analyzes the reasons for the resource consumption in the different stages and shows that, while the object is moving, the computing resource consumption is stable and the GPU workload does not increase.

As for the robustness and run-time results, there is no correlation between the system run-time and the number or type of objects. The average run-time is 1.5–2.0 s, which shows that the system can meet the real-time requirements.

The small standard deviation shows that the system can stably generate grasp poses for any number of objects. Combining the two sets of scenarios, it can be concluded that the factors affecting the robustness of the system are the scene complexity, the richness of the object's surface texture, and the resolution of the camera.


Chapter 6

Conclusions and Future work

6.1 Conclusions

This thesis addresses the research question by categorising robotic grasping and object tracking methods based on different application scenarios. Additionally, it demonstrates the integration process required for a system that is able to run in real time.

By reviewing robotic grasping and object tracking methods, the advantages and disadvantages of each category are listed. Robotic grasping methods fall into analytical and data-driven approaches. Analytical methods use kinematics and dynamics to detect a potential grasp, whereas data-driven methods avoid complex calculations of the objects' physical properties and significantly reduce the object modelling requirements. Object tracking can be divided into 2D tracking and 3D tracking: 2D tracking is widely used in scenes that do not require the 3D information of the targets, such as vehicle and pedestrian tracking, whereas robotic grasping usually requires detailed pose information.

Based on these categories, GPD and Simtrack are selected for integration. The integrated system generates grasp poses in clutter and tracks objects by their visual features, which allows the robot to perform grasping and tracking in more complex environments while consuming fewer resources.

The experimental results show that the system is able to run in real time, and the proposed metrics provide valuable information for evaluating and analyzing the integrated system.
