
DEGREE PROJECT IN ELECTRICAL ENGINEERING,
SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2020

Human-robot Collaboration Focusing on Image Processing

XUETAO ZHANG


Abstract

The future manufacturing industry calls for a new generation of production solutions with high efficiency and flexibility. Therefore, research towards Human-Robot Collaboration (HRC) has recently attracted much attention in the field of manufacturing, as it provides the possibility of harmonizing the accuracy of a robot with the flexibility of a human. The objective of this thesis is to propose a novel architecture for a Human-Robot Collaboration input system. The system receives video frames and depth information from an Intel® RealSense™ Depth Camera D435 and recognizes the closest point to the target object in the 3D point cloud environment. Moreover, different algorithms and technologies, namely image filters for video preprocessing and nearest neighbor search algorithms for target detection, are applied to the system and evaluated to optimize its performance. In this thesis, the system workflow is elaborated and an analysis of the performance of the system is presented based on quantitative measurements, including refresh rate and error calculation.


Sammanfattning

Den framtida tillverkningsindustrin kräver en ny generation produktionslösningar med hög effektivitet och flexibilitet. Därför har forskning om samarbete mellan människa och robot nyligen väckt stor uppmärksamhet på tillverkningsområdet eftersom den ger möjlighet att harmonisera en robots noggrannhet med en människas flexibilitet. Syftet med avhandlingen är att föreslå en ny arkitektur för ett inmatningssystem för människa-robot-samarbete. Systemet tar emot videoramar och djupinformation från en Intel® RealSense™ Depth Camera D435 och känner igen den närmaste punkten till målobjektet i den tredimensionella bilden. Dessutom tillämpas olika algoritmer och tekniker, i form av bildfilter, för förbehandling av videon och närmaste granne-sökalgoritmer för måldetektering, på systemet och utvärderas för att optimera prestanda. I denna avhandling utarbetas systemstrukturen och en analys av systemets prestanda presenteras på grundval av kvantitativa mätningar, inklusive uppdateringshastighet och felberäkning.


Acknowledgment

I would like to express my gratitude and appreciation to Professor Xi Wang, my research project supervisor, for his patient guidance, enthusiastic encouragement and constructive advice. I would also like to express my very great appreciation to Professor Mats Bengtsson for his advice and assistance in keeping my progress on schedule.

I would also like to acknowledge the support provided by Mo Chen for the guidance on the robot UR5.

I also wish to thank Ying Yang for the help and suggestions on manipulating the robot UR5 and her assistance with the images used in this report.


Contents

1 Introduction
1.1 Human-robot collaboration overview
1.2 Depth camera overview
1.3 HRC input system overview
1.4 Methodology overview
1.5 Research ethics
1.6 Thesis objective
1.7 Thesis structure

2 Literature Review
2.1 Perspective projection
2.2 Video preprocessing
2.2.1 Persistence control
2.2.2 Noise smoothing
2.3 Nearest neighbor search
2.3.1 K-d tree
2.3.2 Octree

3 Proposed Approach and Methodology
3.1 System architecture
3.1.1 Processing Module
3.1.2 Calculation Module
3.1.3 Visualisation Module

4 Implementation and Result Analysis
4.1 Implementation
4.2 Performance of video preprocessing
4.2.1 Performance of the persistence filter
4.2.2 Performance of the smoothing filter
4.3 Performance of nearest neighbor search algorithms

5 Summary and Future Work


List of Figures

1.1 D435 camera modules [5]
1.2 A brief structure of the HRC system
2.1 Projection transformation
2.2 (a) The ideal noise-free measurement; (b) an actual measurement with depth noise, where the depth information fluctuates randomly within 2 mm [18]
2.3 The depth information smoothed by different filters [18]
2.4 Voronoi diagram
2.5 A k-d tree decomposition for seven points in a 128×128 region [4]
2.6 The octree data structure and its serialization [23]
3.1 Module structure of the HRC input system
3.2 Flow chart of the HRC input system architecture
3.3 Flow chart of the calculation module
3.4 Flow chart of the visualisation module
4.1 Visualization of the test environment without and with the human
4.2 The instability value and mean squared error of the persistence filter with different persistence values
4.3 The instability value and mean squared error of the smoothing filter with different values of α or δ
4.4 (a) The background environment with no filtering; (b) the experimental HRC environment without filtering; (c) the background environment with filtering; (d) the experimental HRC environment with filtering
4.5 The refresh rate of the NNS algorithms with different data volumes


List of Tables

2.1 The hole-filling strategy of the persistence filter
4.1 The quantitative measurement of the persistence filter
4.2 The quantitative measurement of the smoothing filter
4.3 The quantitative measurement of the NNS algorithms


Chapter 1

Introduction

Globally, industrial robots are in a period of rapid growth. Driven by supportive government policies, growing manpower costs and market demand, replacing manpower with robots is an irresistible trend. From automobile production and metalwork to consumer electronics manufacturing and food packaging, the applicable field of industrial robots is expanding at high speed.

As mechanized production is applied in more and more industries, it opens up a number of new opportunities as well as challenges. Traditional industrial robots are expensive, difficult to operate, and need to be isolated from people when they work. However, compared with traditional industrial robots isolated from people by security fences, collaborative robots have good versatility and accessibility, and can work together with human partners to finish one or more tasks efficiently. Their ease of programming enables the same robot to be quickly applied to different positions and to perform various tasks.

The study of how collaborative robots work with humans belongs to the field of Human-Robot Collaboration (HRC). The method discussed in this paper is to collect information from the camera and recognize the target for collaborative robot control. However, with the development of video monitoring equipment and network bandwidth, people are no longer satisfied with traditional recognition based on plane images. The real-time transmission and processing of big data has aroused much concern. At the same time, the processing of 3D point cloud data also shows distinct advantages in the HRC field.

Thus, this paper focuses on the study of the input end of HRC, including video preprocessing, target recognition and visualization. Recent research on the input end of HRC is reviewed and a novel HRC input architecture is proposed. After that, the work is implemented and the recognition result is validated through synchronous video streams. The performance of the video preprocessing methods and target recognition algorithms is quantitatively evaluated.

1.1 Human-robot collaboration overview

Human-Robot Collaboration (HRC) defines a special type of coordinated activity, which provides the possibility of harmonizing the accuracy of a robot with the flexibility of a human [31]. It refers to a work scenario where humans and automated machines share a workspace and work at the same time.


In the classic HRC mode, the collaborative robot acts as the assistant that helps the human with hard work, such as carrying, loading or other kinds of repetitive work, while the human acts as the operator who controls and monitors the production process. They both contribute their own specialties.

An effective HRC can be realized when the robot has the ability to understand and resolve the communication mechanisms that humans use to interact with each other [10]. Besides, the robot must also be able to express its own needs and objectives. In this case, a common work goal between the human and the robot can be established, and the robot can regulate its behavior and work towards the goal [16].

Driven by Industry 4.0, a new stage of industrial development for automation and data exchange in manufacturing technologies, including cyber-physical systems, the Internet of Things, cloud and cognitive computing and the smart factory [1], HRC enables highly flexible work processes, maximum productivity, and economic efficiency. At the same time, however, the relationship between the human and the robot in the HRC system is also undergoing subtle changes. How to coordinate the division of labor between the human and the robot has become another focus in the field of HRC.

The first feasible relationship between the human and the robot is Human Emulation (HE) [30]. HE requires the robot to behave like humans or have abilities similar to those of humans. This method focuses on studying the interaction modes between humans, and then applies the mature modes to the HRC system. The robot in HE should be able to establish its own goals and plans for its work, and it should also infer the goals and plans of its work partners and help them to complete their work.

Another approach to coordinating the human and the robot is called Human Complementary (HC) [30]. Unlike HE, the goal of HC is to create a more intelligent and smart robot that complements and collaborates with humans in an efficient and safe way, leading to an asymmetric relationship between the robot and the human. Therefore, research towards HC focuses on assigning different types of work to the robot and the human, so as to give full play to their advantages while avoiding their drawbacks.

The first collaborative robot in the world was launched by Universal Robots (UR). The founder of the company had the idea of a universal robot when he was in college, and began to sell the first UR5 collaborative robot in 2008. After ten years of development, UR has expanded its business to the world, and established Universal Robots Academy to help users quickly learn robot programming. At present, the company has launched a series of collaborative robot products, such as UR3, UR5 and UR10, with payloads ranging from 3kg to 10kg [6].

The driving force behind the strong growth of the collaborative robot market is not only the demands of the traditional manufacturing industry, such as improving productivity and reducing labor cost, but also the new requirement to handle small-batch production of multiple varieties and to transform from batch production to mass customization. The three elements of production, which are information, logistics and manipulation, have all been digitized. Robots and humans share the work space and time to achieve efficient optimization of flexible automation, with HRC as the operation link.


Figure 1.1: D435 camera modules [5]

1.2 Depth camera overview

Human binocular vision is the main basis for realizing stereo vision, which relies on disparity to estimate depth. Cameras without a depth detection function can use stereo vision to calculate depth, while those with depth detection, such as the Kinect, often use the parallax principle to find depth. They usually first project patterns, then compare the two patterns from the left and right cameras to compute the depth of every point in the scene. In addition, the range finder [2], which is now widely used for stereo vision, was the first approach to obtaining depth information. However, because of its high cost, it is only used for military and industrial purposes.

In order to introduce the structure and functions of the depth camera in more detail, the Intel® RealSense™ depth camera D435 (hereinafter called the "D435") is taken as an example of a depth camera. It is a stereo tracking camera which can provide depth information, based on calculating depth from stereo vision. As shown in Figure 1.1 [5], the D435 has two depth sensors that are separated by a predefined baseline. The depth of every pixel in the scene is calculated by the cameras to understand the scene. The camera can also be integrated into the HRC system with the Intel® RealSense™ SDK 2.0 and cross-platform support, and has a visual field range of up to 10 m.

The D435 derives the depth information by establishing correspondences between the synchronously captured left and right video frames, working out the disparity for each pixel (i.e., the offset between the pixels in the left and right images that correspond to the same object in the scene), and generating the depth map from disparity and triangulation [17].
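As a rough numerical illustration of that last step, depth follows the standard triangulation relation $z = f\,b/d$, where $f$ is the focal length in pixels, $b$ the baseline and $d$ the disparity in pixels. The Python sketch below uses made-up values, not the D435's actual calibration parameters.

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Triangulation: depth (m) from disparity (px), focal length (px) and baseline (m)."""
    if disparity_px <= 0:
        return float("inf")   # zero disparity: the point is effectively at infinity
    return focal_px * baseline_m / disparity_px

# Illustrative values only: a 50 mm stereo baseline and a 640-pixel focal length.
for d in (64, 32, 16):
    print(d, "px ->", depth_from_disparity(d, focal_px=640, baseline_m=0.05), "m")
# Larger disparity means a closer point: 0.5 m, 1.0 m and 2.0 m respectively.
```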

1.3 HRC input system overview

As shown in Figure 1.2, the HRC input system is a part of the whole HRC environment and is responsible for capturing and recognizing the information from the working environment. The initial robot information is set in the input system, so the system can process the robot information together with the information captured from the working environment to compute the information around the robot. A novel point cloud segmentation method is invented, and perspective projection, temporal and spatial filtering, and nearest neighbor search algorithms are applied during this process.


Figure 1.2: A brief structure of the HRC system

After the information is computed in the input system, it is transmitted via the transmission system to the robot control end, which then commands the robot to perform the corresponding actions. The robot control system then obtains the latest robot information and transmits it back to the input system. The input system updates the robot information, captures information from the working environment again, and so on.

In addition, the input system introduces a visualization module, which the classic HRC mode does not possess, to increase the safety factor of the whole HRC system. The arrow from the input system to the working environment (the human) indicates this visualization part. The traditional HRC system only gives safety control commands to the robot when it detects security issues in the work environment; for example, the system will control the robot to avoid the human when the distance between the human and the robot falls below the safety distance. However, the human can hardly know that he has broken a safety rule of the robot if he knows little about the robot and the system, because the system cannot help him realize it. When the visualization module is added to the system, it can visualise the information of the working environment and let the human know the current state of the working environment. So, in a sense, it improves the security of the whole system.

1.4 Methodology overview

The system proposed in this paper is meant to meet the requirement of pursuing low latency while maintaining a certain accuracy in the HRC system. The two image filters encapsulated in the SDK and the two classic nearest neighbor search algorithms are introduced into the system for this reason. Several quantitative methods are used to evaluate the system. As there is no standard methodology in the field of HRC, both classic and novel evaluation criteria are introduced. The refresh rate of the system is introduced to evaluate both the filters and the algorithms, since it provides an intuitive expression of the synchronization ability of the system. The mean squared error is used as it is a classic standard that reflects the fluctuation and precision of the recognition results fairly well. Finally, some novel criteria are invented to assess the system. The instability value is introduced to evaluate the noise-reducing and hole-filling abilities of the filters by counting up the occurrence frequency of the mis-recognized points.


The average distance error is the average over ten measurements of the deviation between the true position of the target object and the value the system calculates. The instability value not only reflects the frequency of mis-recognition well, but is also easy to compute. The average distance error is similar to the mean squared error, but it describes the error with respect to the true value and is more appropriate for evaluating the NNS algorithms, which are more concerned with accuracy than with precision.

1.5 Research ethics

All the experiments conducted in this paper are carried out in a safe environment. It is guaranteed that the robot used in the experiments, a UR5, is fixed and in a shutdown state. The usage of the experiment site and all experiment tools has been approved by the person in charge.

In addition, the study of the HRC system aims to find a better relationship between the human and the robot in work situations. The purpose behind it is to relieve humans from physically strenuous work. It is true that the study of robots may cause people to worry about the future. However, from the perspective of the author, this is not the intention of the paper.

1.6 Thesis objective

As mentioned above, HRC has become one of the fastest-growing markets worldwide and a key technology for the next generation of the manufacturing industry. The main purpose of the thesis is to improve the performance of collaborative robots and make them coordinate better with humans. Moreover, the thesis also tries to improve the security of the HRC system by introducing a visualization module that conveys the information of the working environment to the human in HRC, since the classic HRC system only sends security commands to the robot, while the human in HRC can hardly know whether he has broken the safety rules.

The main specific goals are listed below:

• Make a new attempt at an HRC input system based on 3D point cloud recognition, with the D435 as the input tool for capturing information.

• Smooth and reduce the noise caused during information capturing by introducing image filters.

• Improve the efficiency of the 3D point cloud recognition by applying nearest neighbor search algorithms.

• Improve the safety of HRC by introducing a visualization part for system performance reflection.

1.7 Thesis structure

In the rest of the thesis, Chapter 2 presents the basic concepts and recent research on the video preprocessing and target recognition of the HRC input system. The whole system architecture is then discussed in Chapter 3.


In Chapter 4, the analysis and discussion of the performance of the system is presented based on quantifiable measurements, and the system implementation is also elaborated. In the end, Chapter 5 summarizes the thesis work and contributions, bringing ideas for possible future work.


Chapter 2

Literature Review

In this chapter, perspective projection, the approach for transforming plain images and depth information into a 3D point cloud, is presented in the first section. Then several preprocessing methods and NNS algorithms used in the system are elaborated. The second part introduces two filters, a temporal filter for persistence control and a spatial filter for noise smoothing. Then two prevailing NNS algorithms in point cloud recognition, the k-d tree and the octree, are demonstrated. Finally, the basic concepts of the visual interface in the HRC system are mentioned.

2.1 Perspective projection

A perspective projection is obtained by projecting an object onto a single projection plane from a projection center. It conveys distance in the sense that objects far away from the projection center appear smaller than close objects. The 2D images that cameras take provide a perspective projection of the cameras' field of view. Photographic lenses and the human eye use the same projection, and this is why perspective projection looks most realistic [19]. Once it is understood how perspective projection works, the inverse projection transformation can easily be applied to the depth information obtained by the depth camera to identify and restore the true position of every point in the plain images.

The projection lines of perspective projection start from the viewpoint and are not parallel. Lines that are parallel in the scene but not parallel to the projection plane converge in the projection to a point, which is called the vanishing point. Perspective projection is divided into one-point, two-point and three-point perspective according to the number of vanishing points to which the projection appears to converge [13]. A simple projection transformation is described in Figure 2.1, where

• $a_{x,y,z}$: the position of the target point A that the camera projects.

• $c_{x,y,z}$: the position of the viewpoint C.

• $e_{x,y,z}$: the position of the intersection E of the line passing through C and perpendicular to the display surface.

• $b_{x,y}$: the 2D projection of the point A.

Figure 2.1: Projection transformation

When $c_{x,y,z} = (0, 0, 0)$ and $\theta_{x,y,z} = (0, 0, 0)$, the 3D vector (2, 1, 0) is projected to the 2D vector (2, 1).

Otherwise, a vector $d_{x,y,z}$ is defined, which is the position of the point A in the camera coordinate system, to calculate $b_{x,y}$. In order to compute $d_{x,y,z}$, the point $a_{x,y,z}$ must be camera transformed [28]:

$$
\begin{bmatrix} d_x \\ d_y \\ d_z \end{bmatrix} =
\begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos\theta_x & \sin\theta_x \\ 0 & -\sin\theta_x & \cos\theta_x \end{bmatrix}
\begin{bmatrix} \cos\theta_y & 0 & -\sin\theta_y \\ 0 & 1 & 0 \\ \sin\theta_y & 0 & \cos\theta_y \end{bmatrix}
\begin{bmatrix} \cos\theta_z & \sin\theta_z & 0 \\ -\sin\theta_z & \cos\theta_z & 0 \\ 0 & 0 & 1 \end{bmatrix}
\left( \begin{bmatrix} a_x \\ a_y \\ a_z \end{bmatrix} - \begin{bmatrix} c_x \\ c_y \\ c_z \end{bmatrix} \right)
\tag{2.1}
$$

It can also be represented without using matrices, writing $x = a_x - c_x$, $y = a_y - c_y$ and $z = a_z - c_z$:

$$d_x = \cos\theta_y(\sin\theta_z\, y + \cos\theta_z\, x) - \sin\theta_y\, z \tag{2.2a}$$

$$d_y = \sin\theta_x(\cos\theta_y\, z + \sin\theta_y(\sin\theta_z\, y + \cos\theta_z\, x)) + \cos\theta_x(\cos\theta_z\, y - \sin\theta_z\, x) \tag{2.2b}$$

$$d_z = \cos\theta_x(\cos\theta_y\, z + \sin\theta_y(\sin\theta_z\, y + \cos\theta_z\, x)) - \sin\theta_x(\cos\theta_z\, y - \sin\theta_z\, x) \tag{2.2c}$$

Then the 2D projection B can be obtained from the transformed point D by the following formulas [29]:

$$b_x = \frac{e_z}{d_z} d_x + e_x \tag{2.3a}$$

$$b_y = \frac{e_z}{d_z} d_y + e_y \tag{2.3b}$$


or expressed in the form of a homogeneous coordinate matrix:

$$
\begin{bmatrix} f_x \\ f_y \\ f_w \end{bmatrix} =
\begin{bmatrix} 1 & 0 & \frac{e_x}{e_z} \\ 0 & 1 & \frac{e_y}{e_z} \\ 0 & 0 & \frac{1}{e_z} \end{bmatrix}
\begin{bmatrix} d_x \\ d_y \\ d_z \end{bmatrix}
\tag{2.4}
$$

where

$$b_x = \frac{f_x}{f_w} \tag{2.5a}$$

$$b_y = \frac{f_y}{f_w} \tag{2.5b}$$

The distance between the viewpoint and the display surface, $e_z$, is decided by the field of view, and $\alpha = 2\arctan\frac{1}{e_z}$ is the view angle.
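To make the chain of transformations concrete, here is a minimal Python sketch that implements Eqs. (2.1) and (2.3) directly with NumPy. The camera pose and the example point are arbitrary illustrative values, not parameters of the actual system.

```python
import numpy as np

def project_point(a, c, theta, e):
    """Project the 3D world point a onto the 2D display surface (Eqs. 2.1 and 2.3).

    a: target point A, c: viewpoint C, theta: camera angles (x, y, z) in radians,
    e: display surface position E relative to the viewpoint.
    """
    tx, ty, tz = theta
    rx = np.array([[1, 0, 0],
                   [0, np.cos(tx), np.sin(tx)],
                   [0, -np.sin(tx), np.cos(tx)]])
    ry = np.array([[np.cos(ty), 0, -np.sin(ty)],
                   [0, 1, 0],
                   [np.sin(ty), 0, np.cos(ty)]])
    rz = np.array([[np.cos(tz), np.sin(tz), 0],
                   [-np.sin(tz), np.cos(tz), 0],
                   [0, 0, 1]])
    # Camera transform: the point A expressed in the camera coordinate system (Eq. 2.1)
    d = rx @ ry @ rz @ (np.asarray(a, float) - np.asarray(c, float))
    # Perspective division onto the display surface (Eq. 2.3)
    bx = e[2] / d[2] * d[0] + e[0]
    by = e[2] / d[2] * d[1] + e[1]
    return bx, by

# Illustrative call: camera at the origin looking along +z, display surface at e_z = 1.
print(project_point(a=(0.5, 0.25, 2.0), c=(0, 0, 0), theta=(0, 0, 0), e=(0, 0, 1)))
# -> (0.25, 0.125): the coordinates are halved because the point lies two units away.
```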

2.2 Video preprocessing

Video preprocessing is applied before the information in the video is extracted and processed, in order to reduce the error introduced during the content creation stage. There are many video preprocessing methods, such as color correction and noise reduction, which process and optimize videos so that they are more efficient for subsequent use. The two approaches used in this system are a persistence filter and an edge-preserving filter. They filter the video frames in the time and space domains respectively, and both contribute to reducing the error rate of recognition.

The reason these two filters are chosen for the system is that they are encapsulated in the Intel® RealSense™ SDK 2.0 and built on the hardware interface at the bottom layer. Their filtering speed is much higher than that of filters built on the application layer, and they also place the least burden on the system. Hence, for the sake of latency, only these two filters are selected in this paper.

2.2.1 Persistence control

A temporal filter is applied in the system for persistence control and hole filling [18]. For every single pixel in the image, if the filter observes a hole, that is, a pixel whose depth equals 0, it checks the last several frames at the same spot and determines whether the observed hole is a "true hole". If it is, the hole is retained. Otherwise, the hole is filled with the non-zero depth value of the same pixel in the nearest previous frame.

In order to recognize a "true hole", a variable called the persistence value is introduced to the filtering. It can be set from 0, which means the persistence filter is not activated and no hole-filling is performed, to 8, which indicates that persistence is always imposed regardless of the stored history. From 0 to 8, the criterion for conducting the hole-filling keeps loosening. The specific strategy corresponding to each value is listed in Table 2.1.


Persistence Value   Strategy
0                   no filling
1                   filling if the pixel was valid in 8 of the last 8 frames
2                   filling if the pixel was valid in 2 of the last 3 frames
3                   filling if the pixel was valid in 2 of the last 4 frames
4                   filling if the pixel was valid in 2 of the last 8 frames
5                   filling if the pixel was valid in one of the last 2 frames
6                   filling if the pixel was valid in one of the last 5 frames
7                   filling if the pixel was valid in one of the last 8 frames
8                   always filling

Table 2.1: The hole-filling strategy of the persistence filter
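In the implementation, this persistence filter corresponds to the temporal filter exposed by the Intel® RealSense™ SDK 2.0. A minimal configuration sketch in Python (pyrealsense2) is shown below; the option names follow the public SDK, while the chosen values are only illustrative.

```python
import pyrealsense2 as rs

# Temporal filter from the RealSense SDK; rs.option.holes_fill holds the
# persistence index (0-8) corresponding to the strategies in Table 2.1.
persistence_filter = rs.temporal_filter()
persistence_filter.set_option(rs.option.holes_fill, 3)            # e.g. valid in 2 of the last 4 frames
persistence_filter.set_option(rs.option.filter_smooth_alpha, 0.4)
persistence_filter.set_option(rs.option.filter_smooth_delta, 20)

# Given a depth frame from a running pipeline (not shown here):
# filtered_frame = persistence_filter.process(depth_frame)
```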


Figure 2.2: (a) shows the ideal noise-free measurement, while (b) presents an actual measurement situation with the depth noise, affecting the result of depth information randomly fluctuating within 2mm [18].

2.2.2 Noise smoothing

The spatial filter used in the system can reduce noise while preserving edges at the same time [18]. Figure 2.2 shows an example: a 10×10 mm box is placed against a wall 500 mm away from the depth camera. The camera tracks and records the distance to it along the X axis. Figure 2.2(a) presents the ideal result of the obtained depth information, while Figure 2.2(b) is the practical measurement with noise [18].

Figure 2.3 presents the depth information of the camera scene after applying the different smoothing filters. The filter used in Figure 2.3(a) is a median filter with a left rank of 5. In Figure 2.3(b), a simple moving average filter is applied with a window size of 13. Figure 2.3(c) shows the result of a bidirectional exponential moving average with α = 0.1, while Figure 2.3(d) is filtered by an exponential moving average with the same α value and a δ value of 3 [18].

Applying filters to smooth the data in Figure 2.2(b) will reduce the noise but may introduce unwanted artifacts, such as curved or elongated edges, or deviations that influence the final result even more than the original noise. A median filter and a simple moving average filter are applied in Figures 2.3(a) and 2.3(b). The filtered results do not perform well, as the edges are still rugged. Figure 2.3(c) is a typical example of overfiltering with a bidirectional exponential moving average: the edge is well smoothed but the whole contour is badly deformed, which makes the measurement even more distorted than the unfiltered result.



Figure 2.3: The depth information smoothed by different filters [18]

In Figure 2.3(d), an exponential moving average is applied, reaching a relatively decent performance.

The noise-smoothing filter applied in Figure 2.3(d) is a kind of domain-transform filter. The depth map is raster scanned along the X and Y axes twice, and the one-dimensional exponential moving average (EMA) is computed while scanning, with an alpha value determining the smoothing degree and a delta value defining the edge-preserving threshold. The recursive equation is shown below:

$$
S_n =
\begin{cases}
D_1, & n = 1 \\
\alpha D_n + (1 - \alpha) S_{n-1}, & n > 1 \text{ and } \Delta = |S_n - S_{n-1}| < \delta_{thresh} \\
D_n, & n > 1 \text{ and } \Delta = |S_n - S_{n-1}| > \delta_{thresh}
\end{cases}
\tag{2.6}
$$

where $D_n$ is the instantaneous depth value of the current pixel, $S_n$ is the EMA value of the pixel, and $\alpha$ determines the influence that the past pixels have on the current pixel.

α = 1 means no filtering, while α = 0 means that a pixel always keeps the value carried over from the previous one as long as the threshold δ is not exceeded. As soon as the depth difference between neighboring pixels exceeds the threshold δ, the alpha value is set to 1, which means no filter is applied. This helps the filter recognize and preserve edge values as edges are observed. However, it may lead to different results depending on whether the filter is applied from the left side of an edge to the right or vice versa. This is why the bidirectional scan along both the X and Y axes is chosen, to sufficiently eliminate the artifacts caused by this problem during filtering.
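To make the recursion concrete, the following is a minimal one-dimensional sketch of such an edge-preserving EMA over a single scan line. It evaluates the Δ test between the incoming depth value and the running average, which is one reading of Eq. (2.6); the parameter values are illustrative.

```python
import numpy as np

def edge_preserving_ema(depth_row, alpha=0.4, delta=20.0):
    """One-dimensional edge-preserving EMA over a scan line of depth values (in mm).

    alpha: smoothing degree (1 keeps the raw value, 0 keeps the running average)
    delta: edge-preserving threshold; a jump larger than delta resets the average
    """
    smoothed = np.empty(len(depth_row), dtype=float)
    s = float(depth_row[0])                      # S_1 = D_1
    smoothed[0] = s
    for n in range(1, len(depth_row)):
        d = float(depth_row[n])
        if abs(d - s) < delta:                   # within the threshold: blend
            s = alpha * d + (1.0 - alpha) * s
        else:                                    # an edge: keep the raw value
            s = d
        smoothed[n] = s
    return smoothed

# A noisy plateau followed by a 100 mm step: the plateau is smoothed, the edge survives.
row = np.array([500, 502, 499, 501, 600, 601, 599, 600], dtype=float)
print(edge_preserving_ema(row, alpha=0.4, delta=20.0))
```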


Figure 2.4: Voronoi diagram

2.3 Nearest neighbor search

Nearest neighbor search (NNS) is a kind of proximity search whose goal is to find the point in a given point set that is nearest to a target point. For example, given a point set S and a target point p in a metric space M, find the point in S that has the smallest distance to p. In most cases, M is a multidimensional Euclidean space, and the distance is measured by the Euclidean distance or the Manhattan distance.

There are several solutions to the NNS problem. The merits of these algorithms depend on the time complexity of their solution and the spatial complexity of the data structure used for the search. The simplest NNS algorithm involves the brute-force calculation of the distances between all points in the data set and the target point, which is called linear search. For N samples in d dimensions, the complexity of this approach is O(dN). When the number of samples is small or the dimensionality of the space is high, linear search is efficient enough to be very competitive and even outperform space-partitioning approaches [32]. However, as the number of samples N increases, this naive search approach quickly becomes impractical.

The branch and bound methodology has been applied to this problem since the 1970s. For Euclidean space, this approach is also called a spatial index. There are several space-partitioning methods that are already well developed and commonly used for this problem. One of the most common and simplest is the k-d tree, which searches the space recursively, dividing the region at each intermediate node into two adjacent parts that each contain half of the nodes. When querying, the target point is evaluated at each split point from the root node down to a leaf node. For constant dimension, the average query time complexity is O(log N) [25].


The Voronoi diagram, a plane partition in which every region is closest to one point of a given point set, offers an implicit geometric expression for NNS and also plays a fundamental and crucial role in many proximity search algorithms. The box search and mapping-table-based search are based on the Voronoi diagram and have decent search performance in the case of vector quantization [27].

NNS has a group of prevalent solutions that can be used in many different fields for evaluation and optimization. For point cloud segmentation and recognition, because the point cloud data is formed by large numbers of points in three dimensions, the k-d tree and the octree are two of the most efficient methods. Besides, the system requires high real-time performance, while current research on NNS algorithms mostly focuses on improving the algorithmic complexity at the expense of the system latency in order to achieve better detection results. However, the HRC input system does not require very high detection accuracy. Therefore, the k-d tree and the octree, which do not need to occupy much of the system's resources but still show decent detection performance, are selected for the system.

2.3.1 K-d tree

A k-d tree (short for k-dimensional tree) is a binary tree constructed by multidimensional Euclidean space segmentation, in which each leaf node is a k-dimensional point, while each non-leaf node represents a hyperplane perpendicular to the coordinate axis of the current splitting dimension. In this dimension, the whole space is divided into two parts: one is in the left subtree, the other in the right subtree. If the current splitting dimension is D, the coordinate values of all points in the left subtree are less than the coordinate value of the hyperplane in dimension D, while the points in the right subtree have coordinate values greater than or equal to that of the hyperplane. The k-d tree can quickly exclude or query data with little relevance by building indices. Its properties are very similar to those of binary search trees [7].

In a balanced k-d tree, the distance from all leaf nodes to the root node is approximately equal. However, a balanced k-d tree is not necessarily optimal for scenarios such as nearest neighbor search or spatial search. The construction of a k-d tree proceeds as follows. The dimensions of the data points are sequentially taken as the splitting dimension. The median of the data points in this dimension is then taken as the hyperplane. The data points on the left side of the hyperplane form the left subtree, while the points on the right side compose the right subtree. The subtrees are processed recursively until they cannot be divided further. The remaining data points are saved in the leaf nodes.

In order to optimize the selection of the splitting dimension, the distribution of the data points in all dimensions is first compared before the construction starts. The dimensions are then sorted by variance, and the dimension with the largest variance is picked as the starting dimension. This method achieves good segmentation and balance performance [26].

There are also two alternative ways to select the median points. The first is to sort and store all the data points in all dimensions before the algorithm starts. Then, in the subsequent median selection, there is no need to sort the data every time the subtrees are classified, which improves the processing speed of the algorithm [9]. The other is to select and sort a fixed number of points randomly from all data points as sample points.


Figure 2.5: A k-d tree decomposition for seven points in a 128×128 region [4]

Each time points are classified, the median of these sample points is regarded as the splitting hyperplane. This method has been shown to produce nicely balanced trees.

The nearest neighbor search using a k-d tree is mainly divided into two steps:

• Query the data point Q from the root node by comparing Q with each splitting node until a leaf node is reached. If the value of Q in the corresponding dimension k is less than that of the splitting node, the left subtree is accessed; otherwise the right subtree is accessed. When a leaf node is reached, the distance between Q and the leaf node is calculated, and the node is recorded as the current "nearest neighbor" P with the minimum distance D.

• Backtracking is introduced to find the node closest to Q, that is, to determine whether there is a node closer to Q in the branches that have not yet been visited. If the distance between Q and an unvisited branch under a queried parent node is less than D, it is considered that a node closer than P may exist in that branch. That branch is entered and the same search as in the previous step is performed. If a closer data point is found, it is updated as the current "nearest neighbor" P with the minimum distance D. Conversely, if the distance between Q and the branch is greater than D, there is no point closer to Q in that branch. Backtracking proceeds from the bottom to the top of the tree until, when backtracking reaches the root node, there is no branch closer than P.

From the perspective of geometric space, the whole backtracking process is to determine whether the hypersphere with Q as the center and D as the radius intersects with the hyperrectangle represented by the branches of the tree.

In the implementation, the distance between Q and the tree branches can be obtained in two ways. The first is to record, during tree construction, the boundary parameters of all data points contained in each subtree in the corresponding dimension. The second is to record the splitting dimension k and the splitting value M; the distance between Q and the splitting node is then |Q(k) − M| [11].
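As a minimal illustration of such a query (not the thesis's own implementation), the following Python sketch builds a k-d tree over a random 3D point cloud with SciPy and verifies the result against a linear search; the point cloud and the query position are made up.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
cloud = rng.uniform(-1.0, 1.0, size=(10_000, 3))  # stand-in for a foreground point cloud
query = np.array([-0.5, 0.5, 0.5])                 # stand-in for the robot position

tree = cKDTree(cloud)                              # k-d tree construction (median splits)
dist, idx = tree.query(query)                      # descend to a leaf, then backtrack
print("nearest point:", cloud[idx], "distance:", dist)

# Brute-force (linear search) check of the k-d tree result
brute_idx = int(np.argmin(np.linalg.norm(cloud - query, axis=1)))
assert brute_idx == idx
```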


Figure 2.6: The octree data structure and its serialization [23]

2.3.2 Octree

An octree is a tree data model first proposed by Hunter in 1978 [20]. Each node of the octree represents an octant, and each internal node has eight child nodes. The octants represented by these eight child nodes together equal the volume of the parent node. The center point of each octant is usually regarded as the datum point for branching. If a parent node in an octree is not empty, it has exactly eight child nodes; that is, the number of child nodes is never anything other than 0 or 8.

The steps to structure an octree [21] are listed below; a minimal sketch follows the list.

• Set the maximum depth for recursion.

• Identify the dividing space and build the first octant with this space.

• Sequentially put all data points into the octant.

• If the maximum depth is not reached, divide the octant into eight equal parts with the center point as the datum point for branching. Then assign all the data points in the original octant to the eight child octants.

• If the number of data points allocated to a child octant is not zero while the other seven child octants are all empty, so that the child contains the same points as its parent octant, the child octant is not subdivided further.

• Repeat the subdividing process until the maximum recursion depth is reached.
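The following Python sketch builds an octree along these lines. It subdivides an octant only when it holds more than a handful of points and the maximum depth has not been reached, a common variant of the procedure above; all names, sizes and limits are illustrative.

```python
import numpy as np

class OctreeNode:
    """A cubic octant that stores points until it is subdivided."""
    def __init__(self, center, half_size, depth):
        self.center = np.asarray(center, dtype=float)
        self.half_size = half_size      # half the edge length of this octant
        self.depth = depth
        self.points = []                # points held by this leaf
        self.children = None            # None for a leaf, otherwise a list of 8 children

def child_index(node, p):
    # Index (0-7) of the child octant the point falls into, relative to the center
    return (int(p[0] > node.center[0])
            + 2 * int(p[1] > node.center[1])
            + 4 * int(p[2] > node.center[2]))

def subdivide(node, max_depth, max_points):
    # Split the octant into eight equal parts around its center point
    h = node.half_size / 2.0
    offsets = [np.array([sx, sy, sz]) * h
               for sz in (-1, 1) for sy in (-1, 1) for sx in (-1, 1)]
    node.children = [OctreeNode(node.center + off, h, node.depth + 1) for off in offsets]
    points, node.points = node.points, []
    for q in points:                    # redistribute the stored points
        insert(node, q, max_depth, max_points)

def insert(node, p, max_depth=6, max_points=8):
    """Insert one point; a full leaf is subdivided while the depth budget allows."""
    if node.children is None:
        node.points.append(np.asarray(p, dtype=float))
        if len(node.points) > max_points and node.depth < max_depth:
            subdivide(node, max_depth, max_points)
        return
    insert(node.children[child_index(node, p)], p, max_depth, max_points)

# Build an octree over a random cloud inside a 2 m cube centered on the origin.
rng = np.random.default_rng(0)
root = OctreeNode(center=(0.0, 0.0, 0.0), half_size=1.0, depth=0)
for point in rng.uniform(-1.0, 1.0, size=(1000, 3)):
    insert(root, point)
```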

The octree has excellent performance in NNS implementations in terms of running time and efficiency. Due to its regular partitioning of the search region and its high branching factor, the coordinate query does not have to recurse much and can balance the time penalty of tree construction, showing decent performance when dealing with certain types of data [14].

An efficient way to traverse the octree [12] is as follows:

• Determine the query point q with its coordinate i, the maximal allowed distance d and the current closest point N.


• For each child C, determine if C is inside of the bounding ball with d as the radius. If it is, determine again if C is a leaf node. Update the current closest point N as C if C is a leaf node. Otherwise, repeat the determination process with C as the parent node until every leaf node is reached.

• N is the closest point to q.

The access order of the child nodes is determined by which octant the query point is closest to. First, the closest octant is accessed, then the three neighboring octants that are directly next to it are visited. After that, the three neighbors of those are traversed, and finally the most distant remaining node is processed. According to [14], no floating-point operations are needed during the index computation.


Chapter 3

Proposed Approach and Methodology

In this chapter, the system planning of the HRC input system is first summarized, including a brief introduction of how the system is divided into different parts, in accordance with separability of work and parallelism of threads, and of the tasks each part of the system completes. Then the detailed architecture of the whole HRC input system is elaborated and disassembled into four modules: the hardware module, the processing module, the calculation module and the visualization module. A simple but efficient point cloud segmentation method is invented for foreground extraction in the calculation module. The operation procedure of each module and the approach for integrating the modules are elaborated.

3.1 System architecture

As mentioned in Chapter 1, the HRC input system is a part of the whole HRC system. It captures the depth information and video frames from the HRC environment and processes them with the robot information, which is first set during initialization and later updated from the robot control end, to obtain the information of the closest point to the robot. This recognition result is then sent to the robot control system for further use, and is also visualized to help the human in the HRC environment.

The HRC input system is composed of one hardware module and three software modules. The hardware module is dominated by the Intel® RealSense™ Depth Camera D435, which is fitted with an RGB camera, two stereo imagers and an infrared projector. The RGB camera records color video like a normal video camera, while the two imagers are identical camera sensors configured with identical settings. They have a built-in depth module which can sense the depth of the shooting scene. The infrared projector improves the ability of the two stereo imagers to estimate depth by projecting a static infrared pattern onto the scene to add texture to low-texture scenes [3].

As shown in Figure 3.1, the three software modules are the processing module, the calculation module and the visualization module.


Figure 3.1: Module structure of the HRC input system

The processing module is responsible for preprocessing the video frames from the D435, while the calculation module is in charge of computing the nearest point to the object, and the main task of the visualization module is to visualize the processed data. To be more specific, the color and the depth video of the experiment scene are first captured by the two cameras and divided into color and depth frames to be transmitted to the first software module, the processing module. In the processing module, the depth frames are first transformed into point clouds by the inverse perspective projection. A point cloud is a data set of points in a coordinate system that can be used to recognize objects and reflect scene information. The point clouds are then smoothed by the temporal and spatial filters. The background point cloud is processed first to set the background of the experiment scene. This is done by capturing the depth information with no measured objects in the scene during the first one hundred frames after the camera starts shooting. These first one hundred frames are then averaged according to the depth of every pixel to diminish the noise and holes. The processed background point cloud and the other depth frames are then delivered to the calculation module for further analysis. At the same time, an alignment method is introduced to merge and align the color and the depth frames into one frame. Subsequently, the aligned frames are sent to the visualization module to generate visible images. After the depth frames reach the calculation module, the foreground point cloud is extracted by the foreground extraction algorithm, then processed by the nearest neighbor search algorithms to compute the position information of the nearest point. The module then also sends the result to the last software module, the visualization module. After all the data is acquired, the visualization module starts to render and visualize the different parts of the data sequentially until the visualization is completed.

As the flow charts in Figure 3.2 show, the system we build consists of an independent initialization process and three threads that realize video preprocessing, nearest point computation and data visualization at the same time.


3.1.1 Processing Module

The system starts when the background frames are captured by the D435 and transmitted to the processing module. The background point cloud then keeps being extracted and processed with the temporal filter until the background is completely set, which finishes the initialization stage. While the background data is sent to the calculation module, the processing module keeps acquiring the new video frames that contain the target object in the experiment scene.

The video processing thread in the processing module fetches the frames and first spatially aligns all the streams in the frames to the depth viewport to get the aligned frames, then smooths the depth frames by applying temporal filtering. The temporal filter is a kind of domain-transform filter that can reduce the depth noise by calculating over multiple frames with a one-dimensional exponential moving average. The filtered depth frames are then delivered to the calculation module.
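A minimal sketch of this align-then-filter step with pyrealsense2, assuming a connected D435; the stream settings and loop length are illustrative, and error handling is omitted.

```python
import numpy as np
import pyrealsense2 as rs

pipeline = rs.pipeline()
config = rs.config()
config.enable_stream(rs.stream.depth, 640, 480, rs.format.z16, 30)
config.enable_stream(rs.stream.color, 640, 480, rs.format.bgr8, 30)
pipeline.start(config)

align_to_depth = rs.align(rs.stream.depth)  # align the color stream to the depth viewport
temporal = rs.temporal_filter()             # the persistence filter from Section 2.2.1

try:
    for _ in range(300):                    # illustrative processing loop
        frames = pipeline.wait_for_frames()
        aligned = align_to_depth.process(frames)
        depth_frame = aligned.get_depth_frame()
        color_frame = aligned.get_color_frame()
        filtered = temporal.process(depth_frame)
        depth_image = np.asanyarray(filtered.get_data())  # handed on to the calculation module
finally:
    pipeline.stop()
```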

3.1.2 Calculation Module

Figure 3.3 presents how the nearest point is calculated in the calculation module, controlled by the closest-point thread. First, the thread fetches the depth frames and extracts their point cloud. Then it differences the extracted point cloud with the background point cloud to acquire the foreground point cloud. The foreground point cloud is used to build an octree or a k-d tree. After the tree is fully generated, the position information of the robot arm is introduced into the tree, and the nearest point to the robot arm is found by the query algorithm of the tree. The position information of the nearest point is sent to the main thread in the visualization module for video rendering.

Foreground extraction

As mentioned above, a difference is taken to extract the foreground point cloud after the point clouds are received by the calculation module. Foreground extraction has always been an important issue in the field of 3D point cloud segmentation. An efficient method of connected component extraction is proposed by Börcs, Balázs Nagy and Csaba Benedek to extract vehicles and pedestrians in the neighborhood of a moving sensor [8]. Golovinskiy and Thomas Funkhouser also mention a simple way to extract the foreground, which is to assume that all points within a certain distance of the position of the input target are foreground [15]. Junejo and Naveed Ahmed propose a method based on RGB and depth data that realizes foreground extraction by extracting corner features and training a non-linear SVM on these features [22]. In addition, another point cloud segmentation approach for interacting objects in a stream of point clouds, exploiting spatio-temporal coherence, is presented by Xiao Lin, Josep R. Casas and Montse Pardas [24].

Because the system proposed in this paper is real-time based and cannot afford complex foreground extraction algorithms that are very time-consuming, even if their detection performance is excellent, a new point cloud segmentation approach is invented to extract the foreground which is relatively simple and lightweight but still ensures adequate performance.

The first step of the method is to determine the background of the scene.


Figure 3.3: Flow chart of the calculation module

This is done by capturing the depth information with no measured objects in the scene during the first one hundred frames after the camera starts shooting. These first one hundred frames are then averaged according to the depth of every pixel to diminish the noise and holes. After that, the background is set and stored for the next step. In the next stage, the point clouds with the target objects included are obtained and used to compute the difference with the background point cloud. This difference is the foreground point cloud used for NNS.
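A minimal sketch of this background-averaging and differencing step, assuming the depth frames are available as NumPy arrays in millimeters; the helper names, the 50 mm threshold and the commented usage are illustrative rather than the thesis's exact choices.

```python
import numpy as np

def build_background(depth_frames):
    """Average the first frames per pixel, ignoring zero-depth holes."""
    stack = np.stack(depth_frames).astype(float)      # shape: (N, H, W)
    valid = stack > 0
    counts = np.maximum(valid.sum(axis=0), 1)
    return (stack * valid).sum(axis=0) / counts       # background depth map

def extract_foreground(depth, background, threshold_mm=50.0):
    """Keep pixels whose depth deviates from the background by more than the threshold."""
    valid = (depth > 0) & (background > 0)
    mask = valid & (np.abs(depth.astype(float) - background) > threshold_mm)
    ys, xs = np.nonzero(mask)
    return xs, ys, depth[ys, xs]                      # pixel coordinates and depth of the foreground

# Usage sketch: background from 100 empty-scene frames, then a frame containing the human.
# background = build_background(first_100_depth_frames)
# xs, ys, z = extract_foreground(current_depth_frame, background)
```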

3.1.3 Visualisation Module

As illustrated in Figure 3.4, three main visualization tasks are undertaken by the main thread: the aligned frames, the object and the robot, and their position information. Several OpenGL libraries are used for rendering.

The visual interface of the system provides intuitive feedback on the recognition results and helps the human know the current situation during the process of HRC.


Chapter 4

Implementation and Result Analysis

The chapter comprises two main parts: the specific implementation of the HRC input system, and the performance evaluation and analysis of the two preprocessing filters (the persistence filter and the smoothing filter) and the two NNS algorithms (the k-d tree and the octree) applied in the system. A comparison is then drawn between the two NNS algorithms.

4.1 Implementation

During the implementation of the HRC input system, the D435 is utilised as a camera sensor to obtain depth and color frames for the subsequent filtering and NNS algorithms. The source code package used in this paper is the Intel® RealSense™ SDK 2.0, which includes integrated interfaces for the D435 and OpenGL interfaces for visualisation. The D435 is fixed in a position where it can capture all target objects, including the robot UR5 and the human body, at a proper distance against the background. The D435 is then connected to a laptop where the processing, calculation and visualization modules execute. The results of the recognition are printed on the system console and also visualized as video streams that play synchronously in a window. The processing, calculation and visualization modules are developed based on the architecture presented in Chapter 3.

The recognition results of the system test environment are shown in Figure 4.1. The whole frame is composed of the depth frame and the color frame aligned together. Because only one color sensor is built into the D435 while there are two depth sensors, the area that the depth sensors capture is larger and embraces the colored area. In the colored area, the frames look perfectly aligned when no filtering is applied in the video preprocessing stage. This is because the pixels in the color frame cannot be aligned with the depth frame where the depth is zero and are lost; even though it looks perfectly aligned, a lot of information is missing. The orange circle present in both figures is the indicator that points out the position of the arm end of the UR5, which is fixed in this experiment for convenience. In Figure 4.1(a), the environment background is set so that the foreground can be extracted with Figure 4.1(b).


Figure 4.1: Visualization of the test environment (a) without and (b) with the human

In Figure 4.1(b), the human appears as the target object in the experiment scene, and the closest point is calculated and presented by the other orange circle. The blue dashed line links the two points and also displays the distance between them. At the same time, the detailed information about the two points is computed and printed to the system console for further analysis and evaluation.

4.2 Performance of video preprocessing

As mentioned in Chapter 2, a temporal filter for persistence control and a spatial filter for noise smoothing are the preprocessing methods applied in the system before the foreground is extracted. The filter evaluation is carried out with every part of the input system in normal operation and with enumeration (linear) search as the NNS algorithm. In this set of experiments, the target recognition object is a static black box of size 14 cm × 9 cm × 5 cm. The box is put on a table in front of the D435, with the position [0.5 m, 0 m, 1 m] in the camera coordinate system as the position of the front top left vertex of the box. The position [−0.5 m, 0.5 m, 0.5 m] is assigned to the system as the robot position. It is also ensured that the front top left vertex of the box is the closest point to the robot position, and the box position is recorded. At the beginning of the experiments, the box is removed from the experiment environment. After the background is set, the box is put at the specified position and the data collection starts.

This section assesses the filtering performance of the two filters by calculating the refresh rate of the system and the mean squared error of the position of the closest point. The refresh rate is estimated as the average frequency of the recognition results printed on the system console: the total time of 10 results is counted to calculate the frequency each time, and the frequencies of five such runs are averaged to get the average frequency. To compute the mean squared error, the first step is to average 10 results of the position of the closest point to obtain the average position. Then the distances between all the positions of the 10 results and the average position are calculated to get the mean squared error. When 5 mean squared errors have been computed, they are averaged to get the final value. In addition, a novel variable called the instability value is introduced to evaluate the noise-reducing and hole-filling abilities of the filters by counting the occurrence frequency of mis-recognized points in one minute when the two filters are applied. The true position of the closest point is pre-measured, so recognized closest points that are more than one meter away from the true position are regarded as mis-recognized points. The mis-recognized points are not included in the computation of the mean squared error.
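The following Python sketch shows one way these three measures could be computed from logged results; the helper names and data layout are hypothetical, while the grouping into runs of ten and the one-meter outlier rule follow the description above.

```python
import numpy as np

def refresh_rate(timestamps):
    """Average frequency (Hz) over a run of consecutive result timestamps (seconds)."""
    t = np.asarray(timestamps, dtype=float)
    return (len(t) - 1) / (t[-1] - t[0])

def mse_of_positions(positions):
    """Mean squared distance (m^2) of recognized positions from their own mean."""
    p = np.asarray(positions, dtype=float)            # shape: (10, 3)
    return float(np.mean(np.sum((p - p.mean(axis=0)) ** 2, axis=1)))

def instability_value(positions, true_position, outlier_m=1.0):
    """Count of mis-recognized results: more than one meter from the true position."""
    d = np.linalg.norm(np.asarray(positions, float) - np.asarray(true_position, float), axis=1)
    return int(np.sum(d > outlier_m))

# Usage sketch with hypothetical logged data:
# rate = refresh_rate(console_timestamps[:10])
# mse  = mse_of_positions(closest_points[:10])
# bad  = instability_value(all_closest_points, true_position=[0.5, 0.0, 1.0])
```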

4.2.1 Performance of the persistence filter

Table 4.1 shows the filtering performance of the persistence filter applied in the system with different parameters. The first column from the left gives the different persistence values used as the input variable of the filter, as interpreted in Chapter 2. Because the system is for real-time use, the second column measures a crucial capability of synchronous systems, the refresh rate, which is introduced into the filter measurement to determine whether the applied filter affects the reaction time of the system. The third column takes the instability value as another standard to evaluate the filter. Finally, in the last column, the mean squared error is calculated for each persistence value.


Persistence Value   Refresh Rate (Hz)   Instability Value   MSE (m²)
0                   12.93               24                  0.0057
1                   11.81               27                  0.0068
2                   12.44               21                  0.0039
3                   12.32               13                  0.0026
4                   12.56               11                  0.0024
5                   11.60               9                   0.0030
6                   12.11               3                   0.0017
7                   11.78               0                   0.0005
8                   12.35               1                   0.0011

Table 4.1: The quantitative measurement of the persistence filter


As the table indicates, the refresh rate is slightly influenced by the persistence value chosen for the filter. The persistence values 1, 5 and 7, which refer to the strategies "filling if the pixel was valid in 8, 2 and 1 out of the last 8 frames", correspond to refresh rates that are all below 12 Hz. This is because the last 8 frames are required for the check, compared with fewer frames when the other values are applied.

In addition, Figure 4.2 shows how the instability value and mean squared error change with the persistence value. As the persistence value increases, the instability value and mean squared error both decrease in general. The value 0 means never filling, and 1 means filling only when the pixel had a depth value in all of the last 8 frames, while the value 7 means that as long as one of the last 8 frames has a depth value at that pixel, the current zero value is replaced, and 8 denotes always applying the hole-filling regardless of what happened in the last frames. This indicates that the requirement for implementing the hole-filling strategy gets lower and lower, leading the filtered frames to become increasingly persistent. As the picture becomes more and more "solidified" with increasing value, the instability value and mean squared error also decline.

When a proper persistence value is to be chosen for the system, the three aspects above should all be considered carefully. The values 2, 3 and 6 are suitable, since they do not have a low refresh rate and also check enough frames for filtering. When their instability value and mean squared error are then compared, the value 2 has much larger values of these two parameters than 3 and 6. From 2 to 3 there is also a relatively big drop in both the instability value and the mean squared error, while from 3 to 6 only the instability value shows an appreciable reduction. Another reason why 3 is preferred over 6 is that when the degree of persistence is too high, the frames are over-filtered and the whole video looks laggy and "sluggish". Even if the instability value and mean squared error are good, many points that should not have had values are filled with wrong values, causing errors and misrecognitions when the NNS algorithms are applied for target detection.

4.2.2 Performance of the smoothing filter

In Table 4.2, the performance of the smoothing filter elaborated in Chapter 2 is evaluated.


Figure 4.2: The instability value and mean squared error of the persistence filter with different persistence values

α      δ      Refresh Rate (Hz)   Instability Value   MSE (m²)
0      20     11.90               0                   0.0003
0.1    20     12.41               0                   0.0002
0.2    20     11.82               2                   0.0008
0.3    20     12.31               3                   0.0010
0.4    20     12.88               5                   0.0015
0.5    20     12.20               6                   0.0025
0.6    20     12.35               8                   0.0022
0.7    20     12.11               11                  0.0032
0.8    20     12.45               9                   0.0018
0.9    20     12.34               13                  0.0020
1      20     12.29               17                  0.0019
0.4    1      12.30               12                  0.0022
0.4    10     12.44               9                   0.0020
0.4    20     12.88               5                   0.0015
0.4    30     11.71               6                   0.0021
0.4    40     12.75               5                   0.0017
0.4    50     12.46               3                   0.0015
0.4    60     12.66               7                   0.0009
0.4    70     11.98               5                   0.0019
0.4    80     12.50               4                   0.0025
0.4    90     11.82               6                   0.0031
0.4    100    11.53               5                   0.0014
no filter     12.32               13                  0.0026

Table 4.2: The quantitative measurement of the smoothing filter



Figure 4.3: The instability value and mean squared error of the smoothing filter with different values of α or δ



Figure 4.4: (a) The background environment with no filtering. (b) The experimental HRC environment without filtering. (c) The background environment with filtering. (d) The experimental HRC environment with filtering.

However, the input variables of this filter are different, namely α and δ. The α value determines how much influence the past pixels have on the current pixel, while δ is the threshold that decides whether the current pixel should inherit a part of its value from the past pixels. α lies between 0 and 1, where 1 means no filtering and 0 means the pixel always keeps its value from the previous frame, as long as the threshold δ is not exceeded; otherwise, the current pixel retains its own value.
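To make the roles of α and δ concrete, the following is a minimal per-pixel sketch of the smoothing rule as described above: an exponential moving average that is only applied when the frame-to-frame change stays within δ. It operates on raw 16-bit depth values and is only an illustration under these assumptions, not the SDK's internal implementation.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

// One smoothing step: blend each pixel with its value from the previous frame
// when the change is small (<= delta, in depth units); larger changes are kept
// as-is because they are treated as real scene changes rather than noise.
void smoothDepth(std::vector<uint16_t>& current,
                 const std::vector<uint16_t>& previous,
                 float alpha, int delta) {
    for (std::size_t i = 0; i < current.size(); ++i) {
        if (current[i] == 0 || previous[i] == 0) continue;   // no depth, nothing to blend
        const int diff = std::abs(int(current[i]) - int(previous[i]));
        if (diff <= delta) {
            // alpha = 1 keeps the current value (no filtering),
            // alpha = 0 keeps the previous value entirely.
            current[i] = uint16_t(alpha * current[i] + (1.0f - alpha) * previous[i]);
        }
    }
}
```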

First of all, the refresh rate fluctuates slightly around 12Hz regardless of the values of α and δ. Moreover, when the filter is not activated, the refresh rate of the system is 12.32Hz, which is similar to the results with filtering. It can therefore be concluded that the choice of input parameters does not noticeably influence the refresh rate of the system. Next, the instability value and mean squared error are considered together in Figure 4.3(a). When δ is kept at 20, the instability value and mean squared error both increase as α grows from 0 to 1. A watershed can be observed between 0.4 and 0.5, where the mean squared error surges from 0.0015m² to 0.0025m². As a result, 0.4 is chosen as the final α value for the system.

Additionally, the filter performance is measured when α is fixed at 0.4 and δ rises from 1 to 100. As shown in Figure 4.3(b), the instability value keeps decreasing as δ grows from 1 to 20; after that it stops dropping and remains around 5, while the mean squared error fluctuates between 0.0010m² and 0.0020m² throughout. This is because as δ becomes higher, fewer and fewer edges can be detected by the filter and almost every pixel on the frame is smoothed. At the same time, the edges that cannot be detected are smoothed away by the filter, so detailed information in the image is lost, which degrades the performance of the subsequent NNS algorithms. For this reason, 20 is selected as the final value of δ.

Figure 4.4 directly reveals the difference between frames with and without the two filters applied. Figure 4.4(a) and Figure 4.4(b) show the visualization of the test environment without the persistence and smoothing filters, while Figure 4.4(c) and Figure 4.4(d) are filtered during the video preprocessing. When proper input parameters are chosen, the filtering process fills the holes and smooths the faulty points caused by noise as far as possible while leaving the true points untouched. It is apparent that the two filtered figures contain additional points along the edges of holes that are not aligned with the color images. This is because, in the processing module, the depth frames are first aligned with the color frames and then processed by the filters; some new edge points are generated during filtering in places where no points existed before, so they cannot be matched to the color map. If the order were reversed, i.e. the depth frames were filtered first and then aligned, the newly generated points would be colored by the color frames, but then they would not be distinguishable in the visualization result.
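The align-then-filter order described above can be sketched with the SDK's align processing block; the ordering below reproduces what the text describes (align first, then filter), and swapping the two calls would give the alternative behaviour discussed. This is an illustrative sketch under those assumptions, not the system's actual processing module.

```cpp
#include <librealsense2/rs.hpp>

int main() {
    rs2::pipeline pipe;
    pipe.start();

    rs2::align align_to_color(RS2_STREAM_COLOR);  // maps depth pixels onto the color frame
    rs2::temporal_filter temporal;                // persistence + smoothing, as configured above

    while (true) {
        rs2::frameset frames = pipe.wait_for_frames();

        // 1) align depth to color, 2) filter the aligned depth.
        // Points created by hole-filling therefore have no color counterpart,
        // which is why they appear unaligned with the color map around edges.
        rs2::frameset aligned = align_to_color.process(frames);
        rs2::depth_frame depth = aligned.get_depth_frame();
        rs2::frame filtered = temporal.process(depth);
        // ... build the point cloud from `filtered` ...
    }
    return 0;
}
```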

4.3 Performance of nearest neighbor search algorithms

In this section, the two NNS algorithms used in this system, the k-d tree and the octree, are analyzed and evaluated via quantitative measurement and compared with the simple enumeration search. In addition, another variable, the average distance error, is introduced to help assess the performance of these algorithms. The average distance error is the average, over ten repetitions, of the deviation between the true position of the closest point of the target object to the robot position and the position the system calculates. The true position is measured manually in the camera coordinate system.
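Written out, this definition corresponds to the expression below, where p_true denotes the manually measured closest point in the camera coordinate system and p̂_i the point reported by the system in repetition i; the symbols are introduced here only for notation.

```latex
\bar{e} = \frac{1}{10} \sum_{i=1}^{10} \left\lVert \mathbf{p}_{\mathrm{true}} - \hat{\mathbf{p}}_{i} \right\rVert
```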

Table 4.3 shows the measurement results of the NNS algorithms. The k-d tree and octrees with heights from 1 to 7 are compared with the brute-force search. In order to obtain a stable and accurate result while increasing the total point number of the target object, the first measurement uses the same setup as the filter evaluation but adds a static, standing human body as another part of the target object, making sure the human body is not closer to the robot position than the black box. The data is filtered by the persistence filter and the smoothing filter with the optimal parameters before the algorithms are applied. In order to control other variables that may affect the results, every algorithm is executed in the same environment and the total point number of the target object is around 10⁴. The second column of the table presents the refresh rate of each approach, and the average distance error caused by each algorithm is shown in the last column.

The enumeration search has a refresh rate of 10.88Hz and an average distance error of 0.11m. Since all data points are visited during the search, its error is the smallest, but its refresh rate is relatively low among these algorithms. As for the k-d tree, the time needed for tree construction surges as the number of data points increases, even though the actual search is faster than a sequential traversal.
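As a point of reference for the figures above, a brute-force nearest-point search over the target-object point cloud can be written in a few lines. The sketch below (squared-distance comparison, single query point, hypothetical names) reflects the general idea of the enumeration baseline rather than the exact system code.

```cpp
#include <array>
#include <cstddef>
#include <limits>
#include <vector>

using Point = std::array<float, 3>;

// Returns the index of the point in `cloud` closest to `query`, or -1 if empty.
// Every point is visited once, so the cost grows linearly with the cloud size.
std::ptrdiff_t nearestBruteForce(const std::vector<Point>& cloud, const Point& query) {
    std::ptrdiff_t best = -1;
    float bestDist2 = std::numeric_limits<float>::max();
    for (std::size_t i = 0; i < cloud.size(); ++i) {
        const float dx = cloud[i][0] - query[0];
        const float dy = cloud[i][1] - query[1];
        const float dz = cloud[i][2] - query[2];
        const float d2 = dx * dx + dy * dy + dz * dz;
        if (d2 < bestDist2) {
            bestDist2 = d2;
            best = static_cast<std::ptrdiff_t>(i);
        }
    }
    return best;
}
```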


Search Algorithm   Refresh Rate (Hz)   Average Distance Error (m)
Enumeration        10.88               0.11
K-d Tree           0.08                -
Octree (h=1)       11.49               0.54
Octree (h=2)       11.15               0.32
Octree (h=3)       11.26               0.19
Octree (h=4)       11.20               0.12
Octree (h=5)       11.09               0.14
Octree (h=6)       9.61                0.09
Octree (h=7)       3.84                0.12

Table 4.3: The quantitative measurement of the NNS algorithms

Search Algorithm   Number of Points
                   10²      10³      10⁴      10⁵      10⁶
Enumeration        11.94    11.56    10.88    9.27     6.71
Octree (h=4)       11.74    11.43    11.2     10.76    9.33
K-d Tree           11.07    2.46     0.08     -        -

Table 4.4: The refresh rate (Hz) comparison of the NNS algorithms

For example, when the total point number of the target object is around 10⁴, the height of the tree reaches 13 or 14. Building such a large tree is very time-consuming, causing the refresh rate to plunge to 0.08Hz. At such a low refresh rate the system is extremely laggy and cannot be used in real time, so the corresponding average distance error also becomes meaningless.

Since the time of tree construction becomes increasingly important as the height of the tree grows, octrees with different heights are measured. Unlike the k-d tree, whose height increases as the data grows, the height of the octree is fixed, since it scopes the data area and recursively divides it into multiples of 8 sub-regions. In the table, the refresh rate stays roughly constant and the average distance error drops sharply when the height increases from 1 to 4. After that, the refresh rate starts to decline and the average distance error basically stays around the error of the enumeration search. It can therefore be concluded that the octree performs best when the tree height is 4.

After the optimal height of the octree has been found, another important step is to measure the performance of the algorithms when different amounts of data are involved. As shown in Table 4.4 and Figure 4.5, the first column lists the three search algorithms and the first row lists different magnitudes of the number of data points; the measured property is the refresh rate of each algorithm. The enumeration search shows a gradual downward trend in refresh rate as the number of points grows, and when the number reaches 10⁶ the refresh rate drops markedly from 9.27Hz to 6.71Hz. For the octree with height 4, the descent is gentler and only falls to 9.33Hz when 10⁶ points are processed. For the k-d tree, the reduction is far sharper than for the other two algorithms even when fewer than 10⁴ points are involved, and when the number increases to 10⁵ the system can no longer deliver a usable refresh rate (marked “-” in Table 4.4).
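The fixed height of the octree follows from the way it recursively halves the bounding box of the data: after h splits every point falls into one of 8^h cells regardless of how many points are stored. A minimal sketch of that cell assignment is given below; the function name and interface are invented for this illustration and it is not the full search structure used in the system.

```cpp
#include <array>
#include <cstdint>

using Point = std::array<float, 3>;

// Computes the leaf-cell index of `p` in an octree of height `h` built over the
// axis-aligned box [lo, hi]. At each level the box is halved along x, y and z,
// so the result identifies one of 8^h cells; the height is fixed by `h`, not by
// how many points are stored.
uint64_t octreeCellIndex(Point p, Point lo, Point hi, int h) {
    uint64_t index = 0;
    for (int level = 0; level < h; ++level) {
        uint64_t child = 0;
        for (int axis = 0; axis < 3; ++axis) {
            const float mid = 0.5f * (lo[axis] + hi[axis]);
            if (p[axis] >= mid) {
                child |= (1u << axis);   // upper half along this axis
                lo[axis] = mid;
            } else {
                hi[axis] = mid;
            }
        }
        index = (index << 3) | child;    // append the 3-bit child id for this level
    }
    return index;
}
```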


Figure 4.5: The refresh rate of the NNS algorithms with different data volumes

All three algorithms show similar average performance when the total number of points is around 100.

Therefore, it is concluded that when the data volume is low, all three algorithms perform well; in that case the enumeration search is recommended since it is the simplest and the easiest to deploy. When the data volume is relatively large, the k-d tree is not suitable for real-time use because of its long construction time, while the octree shows increasingly better performance than enumeration as the data grows.
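In practice this conclusion can be reduced to a simple rule that picks the search strategy from the current size of the extracted foreground. The threshold of 10⁴ points below is only indicative and would need tuning on the target hardware; the names are chosen for this sketch and do not come from the system code.

```cpp
#include <cstddef>

enum class SearchStrategy { Enumeration, Octree };

// Illustrative selection rule based on the evaluation above: brute force for
// small foregrounds, a fixed-height octree once the point count becomes large.
// The k-d tree is excluded because its per-frame construction cost makes it
// unsuitable for real-time use at these data volumes.
SearchStrategy chooseStrategy(std::size_t foregroundPoints) {
    constexpr std::size_t kSwitchThreshold = 10'000;  // hypothetical cut-over point
    return foregroundPoints < kSwitchThreshold ? SearchStrategy::Enumeration
                                               : SearchStrategy::Octree;
}
```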


Chapter 5

Summary and Future Work

The future manufacturing industry demands a new generation of production solutions that are smart and efficient while remaining productive, shifting the protagonist of the manufacturing process from the robot to the human. Alongside work efficiency, guaranteeing human safety has become an important part of HRC development. With the advance of information capture equipment such as depth cameras, more advanced real-time recognition and control technology can be applied to HRC systems to improve their safety and accuracy. Meanwhile, the architecture of HRC systems can be further optimized to meet stricter real-time requirements and improve ease of use.

Therefore, the key contribution of this thesis is a new attempt at an HRC system based on 3D point cloud recognition, with the D435 as the input tool for capturing information. The recently launched D435 depth camera is utilised as the data acquisition unit of the system, providing more accurate and efficient depth information. Moreover, the thesis opens up a new mode of HRC based on point cloud recognition with the D435 depth camera and offers a system prototype that can be further improved and applied in the manufacturing industry. The design of the system emphasises safety and promotes the development of HRC systems towards being more human-friendly.

The specific scientific contributions include:

• Two video preprocessing methods, the persistence filter and the smoothing filter, are introduced to the system and evaluated with different input parameters to make the video stream more stable and easier to recognize. The most reasonable filter configuration for this system is determined and discussed to achieve better detection performance.

• A foreground extraction method for 3D point cloud data in a real-time system is developed and implemented in order to reduce the total amount of data to be processed in the subsequent nearest-point recognition. In the process of selecting and developing algorithms, more attention is paid to algorithm efficiency because of the system latency requirements.

• Two object searching algorithms, the k-d tree and the octree, are applied to recognize the target and compared in order to optimize the system performance. The evaluation shows that the k-d tree and the octree have performance similar to the sequential search at low data volumes, while the octree shows decent efficiency when a large amount of data is involved. This makes it possible to select the corresponding recognition algorithm for different data volumes to improve system efficiency.

• A visualization part is built based on the Intel® RealSense™ SDK 2.0 and OpenGL to reflect and monitor the real-time execution results.

As a new attempt in the field of HRC, the system combines efficiency and accuracy. The use of the D435 depth camera and the multi-threaded modular architecture brings efficiency improvements, while the appropriate foreground recognition and the NNS algorithms ensure stable performance. It is believed that, as a camera-based HRC mode built on point cloud recognition, it will provide more possibilities for the future development of HRC systems.

There are also several possible directions for future research that could bring further improvement and innovation:

• The system is only the input part of a whole HRC system. The output part can be designed via ROS and integrated with the input part to form a complete HRC system. The real-time communication between the two subsystems can also be an interesting research direction.

• The video preprocessing filters used in the system are based on the interfaces that the D435 development kit offers. More advanced hole-filling and noise-smoothing strategies can be applied once more interfaces of the D435 are developed and opened to developers.

• The foreground extraction method implemented in the system is a lightweight algorithm intended especially for real-time use. Any fast and efficient foreground extraction algorithm based on point clouds could be a possible substitute for better performance.

• The object detection algorithms applied in the system are NNS algorithms. There is still considerable room for development towards algorithms such as k-nearest neighbors or artificial neural networks that could be used in HRC systems.

• The visualization module of the system is realized only for the case where the UR5 robot is fixed. Further tracking and recognition algorithms could be utilised to trace and predict trajectories.


