
DEGREE PROJECT IN MECHANICAL ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2019

The Optimal Hardware Architecture for High Precision 3D Localization on the Edge

A Study of Robot Guidance for Automated Bolt Tightening

JACOB EDSTRÖM
PONTUS MJÖBERG


Master of Science Thesis TRITA-ITM-EX 2019:427

The Optimal Hardware Architecture for High Precision 3D Localization on the Edge

Jacob Edström
Pontus Mjöberg

Examiner: Martin Törngren
Supervisor: Daniel Frede

Abstract

The industry is moving towards a higher degree of automation and connectivity, where previously manual operations are being adapted for interconnected industrial robots. This thesis focuses specifically on the automation of tightening applications with pre-tightened bolts and collaborative robots. The use of 3D computer vision is investigated for direct localization of bolts, to allow for flexible assembly solutions. A localization algorithm based on 3D data is developed with the intention of creating lightweight software to be run on edge devices. A restrictive use of deep learning classification is therefore included, to enable product flexibility while minimizing the computational load.

The cloud-to-edge and cluster-to-edge trade-offs for the chosen application are investigated to identify smart offloading possibilities to cloud or cluster resources. To reduce operational delay, image partitioning into sub-images is also evaluated, to more quickly start the operation with a first coordinate and to enable processing in parallel with robot movement.

Four different hardware architectures are tested, consisting of two different Single Board Computers (SBC), a cluster of SBCs and a high-end computer as an emulated local cloud solution.

All systems but the cluster are seen to perform without operational delay for the application. The optimal hardware architecture is therefore found to be a consumer grade SBC, optimized for energy efficiency, cost and size. If only the variance in communication time can be minimized, the cluster shows potential to reduce the total calculation time without causing an operational delay.

Smart offloading to deep learning optimized cloud resources or a cluster of interconnected robot stations is found to enable increasing complexity and robustness of the algorithm. The SBC is also found to be able to switch between an edge and a cluster setup, to either optimize on the time to start the operation or the total calculation time. This offers a high flexibility in industrial settings, where product changes can be handled without the need for a change in visual processing hardware, further enabling its integration in factory devices.


Examensarbete TRITA-ITM-EX 2019:427

Den Optimala Hårdvaruarkitekturen för 3D-lokalisering med Hög Precision på Nätverksgränsen

Jacob Edström
Pontus Mjöberg

Examiner: Martin Törngren
Supervisor: Daniel Frede

Sammanfattning (Swedish Abstract)

The industry is moving towards a higher degree of automation and connectivity, where previously manual operations are adapted for interconnected industrial robots. This master's thesis focuses specifically on the automation of tightening applications with pre-mounted bolts and collaborative robots. The use of 3D computer vision is investigated for direct localization of bolts, to enable flexible assembly solutions. A localization algorithm based on 3D data is developed with the intention of creating lightweight software to be run on edge devices. A restrictive use of deep learning classification is therefore included, to enable product flexibility while minimizing the required computational power.

The trade-offs between edge and cloud or cluster computation for the chosen application are investigated to identify smart offloading possibilities to cloud or cluster resources. To reduce operational delay, image partitioning is also evaluated, to more quickly start the operation with a first coordinate and to enable calculations in parallel with robot movement.

Four different hardware architectures are tested, consisting of two different single board computers, a cluster of single board computers and a high-end computer as an emulated local cloud solution. All systems except the cluster are found to perform without operational delay for the application. The optimal hardware architecture is therefore found to be a consumer grade single board computer, optimized for energy efficiency, cost and size. If only the variance in communication time can be minimized, the cluster shows potential to reduce the total calculation time without causing operational delay.

Smart offloading to deep learning optimized cloud resources or a cluster of interconnected robot stations is found to enable increased complexity and reliability of the algorithm. The single board computer is also found to be able to switch between an edge and a cluster configuration, to either optimize for the time to start the operation or for the total calculation time. This provides a high flexibility in industrial settings, where product changes can be handled without the need for changes in the visual processing hardware, further enabling its integration in factory devices.


Acknowledgements

We would first like to thank our Industrial Supervisor Stefan Olofsson, our Academic Supervisor Daniel Frede and our Examiner Martin Törngren.

We would also like to extend a thanks to Dr. Stefan Quinders and Dr. Felix Bertelsmeier for their support and guidance regarding industrial robotics and vision systems, to Mikael Wendel for hardware and lab assistance, to Jörgen Maas for discussions about cloud solutions, to Erik Persson for discussions about bolt tightening tolerances and especially to the local team for their continuous support.

Jacob Edström
Pontus Mjöberg
Stockholm, June 2019


Contents

List of Figures
List of Tables
Nomenclature

1 Introduction
1.1 Background
1.2 Research Aim
1.3 Research Questions
1.4 Delimitations
1.5 Limitations
1.6 System Requirements
1.7 Research Methodology
1.8 Report Disposition

2 Theoretical Framework
2.1 3D Computer Vision
2.1.1 Computer Vision Concepts
2.1.2 3D Vision Technologies
2.2 Computational Hardware
2.2.1 FPGA
2.2.2 GPU
2.2.3 CPU
2.2.4 Cluster Computing
2.3 Localization Algorithm
2.3.1 Image Processing
2.3.2 Deep Learning
2.3.3 Parallelization
2.3.4 Normal Approximation

3 Implementation
3.1 Hardware
3.1.1 Robotics
3.1.2 Vision System
3.1.3 Computational Units
3.1.4 Test Rig
3.1.5 Complete Station Setup
3.2 Software
3.2.1 Input Data
3.2.2 Software Development Environment
3.2.3 Localization Algorithm
3.2.4 Parallelization

4 Results
4.1 Verification
4.2 Algorithm
4.2.1 UP Board SBC
4.2.2 Cluster
4.2.3 Cloud
4.2.4 Raspberry Pi SBC
4.3 Application
4.4 Hardware
4.4.1 CPU Usage
4.4.2 RAM Usage

5 Discussion and Conclusion
5.1 Discussion
5.2 Conclusion
5.3 Future Work

Bibliography

Appendices
A Camera Capturing Parameters
B CNN Training
C Sub-image Division


List of Figures

1.1 Edge to Cloud Configurations
1.2 Evaluation Layers
2.1 Shot Noise Signal Ratio [1]
2.2 Laser Properties [2]
2.3 Laser Classes According To EN/IEC 60825-1
2.4 Isometric view, orthogonal projection
2.5 Top view, right position perspective projection
2.6 Real-life perspective projection
2.7 Stereo Vision Disparity [3]
2.8 Verged vs. Parallel Stereo
2.9 Time-of-Flight Working Principle [4]
2.10 Mono Structured Light [5]
2.11 Laser Profiler
2.12 Basic FPGA-architecture [6]
2.13 Difference between CPU and GPU [7]
2.14 Master-Slave versus Peer-to-Peer [8]
2.15 Canny Edge Detection [9]
2.16 Morphological transformations [15]
2.17 Shape detecting with ApproxPolyDP [10]
2.18 Convolutional Neural Network Visualization [11]
2.19 The YOLO-algorithm [12]
3.1 Universal Robot UR5e [13]
3.2 Phoxi 3D Scanner S [14]
3.3 SBC hardware setup
3.4 Cluster hardware setup
3.5 Test Rig
3.6 Tightening Setup
3.7 Sensor and tool mounting
3.8 Complete Station Setup
3.9 Localization algorithm
3.10 Example of partitioning
3.11 Overlap to ensure object detection
3.12 Overlap for different divisions with a total of 24 sub-images
3.13 Overlap for different optimal divisions
3.14 Normal Distribution
3.15 Comparison between mean, median & mode
3.16 Localization Algorithm Steps
3.17 Algorithm structure for the cluster configuration
4.1 Center Point Precision Method
4.2 Center Point Errors
4.3 Timing Distribution
4.4 UP Board Sub-image Division Calculation Times with CL 99.85%
4.5 UP Board SBC - Function Timings
4.6 Raw Cluster Timing Data
4.7 Raw Timing Distribution
4.8 Refined Timing Distribution
4.9 Cluster Sub-image Division Calculation Times with CL 99.85%
4.10 Cluster - Function Timings
4.11 Cloud Sub-image Division Calculation Times with CL 99.85%
4.12 Cloud - Function Timings
4.13 Raspberry Pi Sub-image Division Calculation Times with CL 99.85%
4.14 Raspberry Pi SBC - Function Timings
4.15 Process Timings
4.16 Robot Movement to Standby Position
4.17 Robot Movement
4.18 Robot Operation
4.19 CPU usage for the UP Board SBC
4.20 CPU usage for the Cluster
4.21 CPU usage for the Cloud
4.22 CPU usage for the Raspberry Pi SBC
4.23 RAM usage
5.1 Image Partitioning
5.2 Queued up stations for complete cloud calculations
5.3 Timings for the chosen divisions of each platform
B.1 Neural Network Summary


List of Tables

3.1 3D Vision Technology Comparison
3.2 Phoxi 3D Scanner S Specifications [14]
3.3 SBC comparison
3.4 CUDA vs. CPU on Cloud
4.1 Normal Approximation Performance
4.2 Binary Classification Notation
4.3 Deep Learning Performance
4.4 Best divisions for the UP Board SBC
4.5 Best divisions for the cluster
4.6 Best divisions for the Cloud
4.7 Best divisions for the Raspberry Pi SBC
5.1 Best Sub-image Divisions Across Devices
A.1 Camera Settings
C.1 Image Partitioning Table - Cluster
C.2 Image Partitioning Table - Single Devices


Nomenclature

CAD Computer Aided Design
CL Certainty Level
CNN Convolutional Neural Network
CPU Central Processing Unit
FOV Field Of View
FPGA Field-Programmable Gate Array
GPU Graphics Processing Unit
OS Operating System
PCB Printed Circuit Board
PCI Peripheral Component Interconnect
ROI Region of Interest (Vision Systems)
RAM Random Access Memory
SBC Single Board Computer
YOLO You Only Look Once (Deep Learning Algorithm)


Chapter 1

Introduction

In this chapter, the background of the thesis is introduced, followed by the research aim and research questions. The project's delimitations and limitations are stated, followed by the system requirements. The research methodology is described, as well as the report disposition.

1.1 Background

Automated Assembly

The manufacturing industry is going through the fourth industrial revolution, or Industry 4.0 as it is also referred to. Automated assembly is becoming increasingly common as skilled labour becomes harder to find, at the same time as industrial robot solutions become more sophisticated. These new smart factories enable cost reductions as well as increased process performance. Flexible automation, where assembly operations are performed while objects are moving or have moved to an approximate area, enables new industry processes and production flexibility [16].

While automation is growing and promising, complete automation is oftentimes impossible or unnecessarily expensive. While robot automation moves towards general and flexible solutions, it may be hard to justify very advanced robotic systems for complex applications like wiring or cabling, which can be handled well by human operators. The optimal solution is often semi-automation, where certain tasks with the highest return on investment are automated, while human labor is kept for the other tasks. The combination of both these activities in close proximity has been termed Collaborative Automation, where the market for Collaborative Industrial Robots, or "cobots", is expected to grow at a CAGR of 50 % between 2017 and 2025 [17]. These cobots have safety features for collision detection and quick emergency stop, but to enable this they are limited in payload and speed.

The tightening of pre-mounted bolts is an assembly operation where the preceding steps of part joining and bolt insertion may be complex tasks that are expensive to automate. As bolts may be placed at different angles due to the product design, and as some bolts may be tilting due to a play between the bolt and the threaded hole, direct 3D localization of the bolts is advantageous. With direct bolt localization, there is also no need for information about the part geometry, enabling quick product alterations without tedious adaptations of assembly processes.

3D Computer Vision

To locate objects for automated assembly, sophisticated sensor technology is used, where vision systems enable wide and precise analysis of the surroundings. Traditional 2D vision is still very popular, as it is relatively inexpensive and can be made into compact systems. Object detection can however be problematic, for instance detecting a black bolt on a black background. To overcome issues with contrast and lighting, 3D vision is starting to emerge on the market.

A range of different technologies for depth perception is available, where the result is a point cloud or depth map of the 3D geometry of the scanned surface. This also allows for precise 3D inspections, as for inline control of glue dispensing [18] and quality control of printed circuit boards [19], as well as more large-scale object detection, as for gesture control [20] and pedestrian detection for autonomous driving [21]. These two divisions of applications have vastly different precision requirements, making it hard to define a "good precision" for general 3D vision systems. Centimeter precision for autonomous driving is remarkable, while it is disastrous for high precision inspections, which may require micrometer precision. Another important aspect is the measurement range, the limits of the vision system's depth data. The measurement range for autonomous driving needs to be about 1-100 m to detect objects both near and far [22], while some high precision systems operate with a measurement range of only 10 mm [23].

The assembly application of pre-tightened bolts places itself somewhere in the middle of these extremes, which has not been the target of extensive research. Systems from both divisions may be applicable, or even combined to leverage their respective strengths. Different technologies may be required for different applications, where it is important to analyze the range of tasks that the system is intended to be used for.

Edge, Cluster and Cloud Computing

With a clear drive for automated assembly and accompanying 3D vision systems, the processing of the image data into robot coordinates becomes an interesting point of optimization. Three different configurations will here be explained, namely edge, cloud and cluster computation.

The term Edge Computing refers to systems where processing is done on embedded devices at the "edge" of the network, close to the actual system. Edge Computing normally uses single board embedded systems, which are small, cheap and energy efficient. Vision systems with embedded processing are generally referred to as Smart Cameras, but due to the increased complexity of 3D vision this is more common for 2D systems. 3D vision systems instead usually require a separate processing station, often called an accelerator, for general computer vision tasks with a high frame rate requirement. However, for the application of robotic guidance and tightening, there is no need for a high frame rate since only one picture needs to be taken, and it may be possible to optimize the system into an embedded system. This could enable a "Smart 3D Camera" to relieve the use of an accelerator for this application, or allow the processing to be done on adjacent devices. One such possibility is to utilize the computational resources of the tightening tool, or enable new tools with computer vision processing support. This could reduce the number of devices and maximize their utility, as the tightening tool otherwise is idle during image acquisition and robot movement.

Cloud Computing instead performs the calculations on a more powerful server. In line with the general need for an accelerator, high precision computer vision with 3D data may be too computationally intense for edge devices. For such cases, both external and local cloud solutions could be used. The basis for both is to serve multiple users with high computational power, which comes at the cost of communication delay. It may also be accompanied by query delays due to the cloud being preoccupied with other calculations at the time. Amazon Web Services (AWS), Google Cloud Platform (GCP) and Microsoft Azure are all examples of external cloud services with increasing popularity, but they are rarely used for real-time operations due to communication delays. As a result, current services and pricing are not optimized for quick real-time queries, where minimum time limits may well exceed the usage. While this may well change in the future, it is still an entry barrier to currently existing external cloud systems.

A local cloud solution is instead an on-premise solution in the factory, which could serve multiple stations. This configuration would result in significantly higher initial costs compared to external cloud systems, but could potentially reduce the connection times. A downside of a local cloud solution would be that the system is either utilized fully, resulting in problems similar to those of external cloud solutions, with query delays and limited availability of resources, or that it is not utilized enough, resulting in idle time where the system still needs to be maintained and consumes power.

The cluster configuration combines multiple edge devices, which increases the computational power without requiring significant investments and maintenance of cloud servers. At the cost of communication delays, this configuration may perform parallel computations on idle devices in the network to leverage already existing processing power. Extending further on the idea of embedded computer vision support in the tools, multiple tools from different synchronized stations could potentially be connected as a cluster and used together.

The options are illustrated in fig. 1.1, where the optimal choice depends on the application. If an edge device is sufficient, it is assessed by factors such as cost, size, power consumption and connection time. If more computational power is needed, a cluster of edge devices may alleviate the need for a dedicated accelerator. For even more computationally heavy applications, a local or external cloud solution can be used. In other words, there generally exist application-specific cloud-to-edge and cluster-to-edge trade-offs regarding where to place the computations. The configurations can also be combined in the case where a specific function is too computationally heavy for an edge device; it can then be offloaded to a cloud device or cluster while keeping the core functionality at the edge. The analysis must therefore be made on a sub-functional level to identify bottlenecks and smart offloading opportunities.

Figure 1.1: Edge to Cloud Configurations
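To make that sub-functional trade-off concrete, the following minimal sketch compares keeping a processing step on the edge with offloading it, given its execution times and an estimated communication delay. The function names and timing values are illustrative assumptions, not measurements from this thesis.

```python
# Hypothetical placement rule: offload a step only if remote execution plus
# communication overhead beats local execution on the edge device.

def place_step(edge_time_s: float, remote_time_s: float, comm_delay_s: float) -> str:
    """Return 'edge' or 'offload' for a single processing step."""
    remote_total_s = remote_time_s + comm_delay_s  # remote compute plus transfer overhead
    return "offload" if remote_total_s < edge_time_s else "edge"

# Illustrative per-step timings in seconds: (edge, remote, communication).
steps = {
    "preprocessing": (0.15, 0.05, 0.20),
    "shape_detection": (0.40, 0.10, 0.20),
    "deep_learning_check": (1.20, 0.15, 0.20),
}

for name, (edge_t, remote_t, comm_t) in steps.items():
    print(f"{name}: {place_step(edge_t, remote_t, comm_t)}")
```

Under these assumed numbers only the deep learning step would be offloaded, mirroring the idea of keeping the core functionality at the edge.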

1.2 Research Aim

The project aimed to realize an edge computing 3D vision sensor system for localizing multiple pre-tightened bolts with high precision 6DOF coordinates for the position and 3D rotation of the center of the bolt head. The localization enabled the positioning of a robot-mounted nutrunner, to tighten the localized bolt in an industrial environment. The system was optimized to reduce the operational delay of the robot, meaning the time where the robot has to wait for further instructions, from the moment the image acquisition is completed to when the final bolt is successfully tightened.

The project evaluated the cloud-to-edge and cluster-to-edge trade-offs with respect to this operational delay, as well as investigated which hardware architecture was most suitable for the application. It compared the performance of three different types of hardware architectures for distributed processing of the above-mentioned task, as well as the impact of preprocessing steps such as sub-image division.

1.3 Research Questions

“What is the optimal hardware architecture for an edge computing system for extraction of several precise 6DOF coordinates from 3D data, to reduce delays for a serial robot operation?”

Additionally,

• “What is the Cloud-to-Edge trade-off for such a system?”

• “What is the Cluster-to-Edge trade-off for such a system?”

• “How are these trade-offs affected by preprocessing steps such as sub-image window size?”


1.4 Delimitations

The system used a collaborative industrial robot from Universal Robots, namely a UR5e with a payload of 5 kg. The chosen vision system was mounted on the robot tool flange, next to an electric fixtured nutrunner. This limited the choice of vision system, which in turn delimited the study to such systems. Only one vision system was tested due to limited resources.

Three different types of hardware architectures were evaluated in four different configurations, namely a cluster of single board computers (SBCs), two individual SBCs (both the model used for the cluster and a more powerful model) and lastly a high-end computer as an emulated cloud solution. The localization algorithm was based on open source code for ease of use and the ability to explore multiple image processing methods. Computational parallelization was used in the cases where existing support was available, but was not further developed as part of the project.

1.5 Limitations

As the project was delimited to general purpose and standardized hardware, with open source code, there were limitations to the computational efficiency of the systems. With custom-built hardware and new computer vision algorithms, there could be ways to achieve a higher level of parallelism of the computations, which could affect the conclusions of this thesis. This would however require significant development, outside of the scope of the project, and also lock the result to specific hardware.

1.6 System Requirements

Precision

The intended application, tightening of pre-tightened bolts with a robot-mounted tool, required high precision. The system was intended to be able to handle a variety of bolt sizes, initially restricted to hexagonal M10 bolts, but it had to be able to extend the functionality to hexagonal M6, M8 and M12 bolts. Multiple bolts had to be detected simultaneously, where the bolts could be placed in different orientations to simulate a more complex product. This orientation was restricted to a maximum of 30° in relation to the image plane of the sensor.

The width of the bolt heads ranged between 10 and 19 mm for the bolts intended for the system, where the hexagonal shape needed to be identified. It was hard to define a clear requirement for X/Y resolution to detect this shape, as this is algorithm-dependent. At different orientations, the shape would be skewed, resulting in a range of different shapes to be allowed. At the maximum tilt, the minimum distance of 10 mm would be reduced to 7 mm. Sub-millimeter resolution would be necessary, where 0.5 mm was decided to be sufficient.

In addition to the shape detection, the 3D rotation was also required. The selected bolts were found to have a ±5.7° play when fitted in the selected socket, which meant that the orientation from the vision system could be a maximum of ±5.7° off while still aligning properly to the bolt. A further problem is the tension from pushing in that wrongful direction during the tightening. Steel threading is not significantly affected by this, but plastics and soft materials may be damaged. After discussions with internal experts at the thesis company, the required angle accuracy of the vision system was set to ±2°. A one degree rotational resolution was therefore required to properly guarantee this maximum deviation. The rotation was to be calculated by measuring the tilt of the bolt head, and by simple trigonometry, the required depth accuracy to detect a one degree angle over the 10-19 mm wide bolt heads ranges from 0.17 to 0.33 mm.
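As a check of that trigonometry, a minimal sketch using the bolt head widths stated above (the helper name is an illustrative choice):

```python
# The depth difference across a bolt head of width w tilted by 1 degree is about w * tan(1°).
import math

def depth_step_mm(head_width_mm: float, angle_deg: float = 1.0) -> float:
    return head_width_mm * math.tan(math.radians(angle_deg))

print(round(depth_step_mm(10.0), 2))  # ~0.17 mm for the smallest (10 mm) head
print(round(depth_step_mm(19.0), 2))  # ~0.33 mm for the widest (19 mm) head
```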

The selected socket was of pathfinder type, with chamfered edges on all bolt sides so that when pressure was applied, the bolt would rotate to slide in. The chamfer extended 2 mm to all sides, which meant that the positional accuracy had a margin of ±2 mm where it still would align properly.

As such, the required resolution for the vision system was decided to be 0.5 mm in X/Y and 0.17 mm in Z (depth). The required accuracy of the algorithm was decided to be ±2 mm in position and ±2° in rotation.

Real-Time Performance

The real-time performance of the system was important, as the intended factory applications normally had strict requirements on low takt times. The project was not intended for a specific customer and application, but in order to develop a competitive system, it had to aim at similar requirements. Without similar existing systems, there was no predefined takt time to meet, but a slower process had to be motivated in terms of increased system performance.

The complete process time consisted of the image acquisition, the coordinate calculations, the communication between devices, the robot movement and the tightening process, where the optimization of the last three was outside of the scope of the project. The first two sub-processes set requirements for different parts of the system, where the image acquisition concerned the choice of camera and the coordinate calculation concerned the choice of computational hardware.

The process was completely paused during the image acquisition, where no robot operation except for the image acquisition purpose could be performed. The coordinate calculation could however be parallelized with robot movement, which reduced the penalty for this delay.

The targets were derived through discussion of the potential productivity increase from automation and the general aim to not be slower than current manual stations. That multiple objects were to be located simultaneously increased the allowed vision delay, and so did the importance of the system being accurate enough to not damage the item to be operated on. A rough target of 1 s was set for the image acquisition delay, a time used for similar vision system initiatives within the thesis company, with emphasis on creating high quality data from the beginning. The coordinate calculation target was split into one target for the first coordinate acquisition and a second target for the complete processing, where both were stated as to not cause operational delays for the robot in a representative application. The first coordinate acquisition time was determined by the movement from the capturing position to a standby position close to the intended operation, which was roughly estimated to 1 s.

As such, the image acquisition delay and the first coordinate delay were both set to roughly 1 s.
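A minimal sketch of how these targets combine into the time until the first tightening and the resulting operational delay; the split into variables is an illustrative assumption rather than the evaluation code used in the thesis:

```python
# The process is fully paused during image acquisition; the coordinate calculation
# can run in parallel with the robot movement to the standby position, so the robot
# only waits if the first coordinate is not ready when it arrives there.

def time_until_first_tightening_s(acquisition_s: float,
                                  first_coordinate_s: float,
                                  move_to_standby_s: float) -> float:
    return acquisition_s + max(first_coordinate_s, move_to_standby_s)

def operational_delay_s(first_coordinate_s: float, move_to_standby_s: float) -> float:
    return max(0.0, first_coordinate_s - move_to_standby_s)

acquisition_s = 1.0       # rough image acquisition target
first_coordinate_s = 1.0  # target for the first coordinate calculation
move_to_standby_s = 1.0   # estimated robot movement to the standby position

print(time_until_first_tightening_s(acquisition_s, first_coordinate_s, move_to_standby_s))  # 2.0
print(operational_delay_s(first_coordinate_s, move_to_standby_s))  # 0.0, i.e. no waiting
```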

1.7 Research Methodology

The project was initiated with a study of 3D computer vision systems and the ordering of such a sensor. The vision systems were evaluated based on their suitability for the application, both in terms of measurement performance and hardware integration. The chosen vision system was mounted on a collaborative robot and a test rig with pre-tightened bolts was built as a true reference, to which the algorithm’s precision could be compared.

Four computational systems were assembled and evaluated for this project. System One was a higher-performing embedded microprocessor, directly performing the image processing on chip, to transfer coordinates to the robot. System Two was a cluster with a central embedded microprocessor coordinating the image split, distributing the image processing of sub-images on several low-performing microprocessors, receiving the results and transmitting the coordinates to the robot. System Three used the same high-performance embedded microprocessor as in System One, but transferred the image processing to a cloud server, returning coordinates to be sent to the robot. System Four used the same embedded microprocessor as System Two but performed the image processing directly on chip, to then transfer the coordinates to the robot.
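As an illustration of the image split coordinated by the cluster's central node, the following sketch partitions a depth map into a grid of overlapping sub-images. The resolution, grid size and overlap are illustrative assumptions; the thesis' own partitioning scheme is described in Chapter 3.

```python
# Hypothetical sketch: split a depth map into overlapping tiles so that a bolt head
# lying on a cut line is still fully contained in at least one tile.
import numpy as np

def split_with_overlap(depth_map: np.ndarray, rows: int, cols: int, overlap_px: int):
    """Yield (row, col, tile) views for a rows x cols grid with an overlap margin."""
    h, w = depth_map.shape
    tile_h, tile_w = h // rows, w // cols
    for r in range(rows):
        for c in range(cols):
            y0 = max(0, r * tile_h - overlap_px)
            y1 = min(h, (r + 1) * tile_h + overlap_px)
            x0 = max(0, c * tile_w - overlap_px)
            x1 = min(w, (c + 1) * tile_w + overlap_px)
            yield r, c, depth_map[y0:y1, x0:x1]

depth_map = np.zeros((772, 1032), dtype=np.float32)  # illustrative depth map size
tiles = list(split_with_overlap(depth_map, rows=4, cols=6, overlap_px=40))
print(len(tiles))  # 24 sub-images, comparable to the 24-sub-image divisions evaluated later
```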

The system performance was evaluated on a three level hierarchical framework, as shown in fig. 1.2. The middle layer could be seen as an entry point, focused on the algorithm performance. It concerned process timings without application or hardware relation, for high-level decisions such as sub-image division. The process timings were split into a time for calculation of the first coordinate, as this could start the robot process, and a total calculation time for when the device would be done and available for new tasks. It also evaluated the algorithm parts individually to identify bottlenecks and potential improvements on the various platforms. The top layer instead focused on the application requirements, strongly related to the Real-Time Performance section under System Requirements. It investigated how the application was affected, what delays were introduced and their impact due to their timing. Given the selected system, it analyzed the timings of the robot system to be put in relation to the computational timings. The bottom layer in turn focused on the hardware platform, with an "under-the-hood" approach to seek the reason for the specific performance. It investigated both CPU multi-core utilization and RAM usage.


Figure 1.2: Evaluation Layers

1.8 Report Disposition

The structure of the report is as follows:

Chapter 2 Theoretical Framework presents the theoretical foundation of the thesis, with literature studies of the hardware and software used.

Chapter 3 Implementation describes both the hardware and software implementations of the examined systems.

Chapter 4 Results presents the system verification and the results of the tests.

Chapter 5 Discussion and Conclusion discusses the results based on the research questions, concludes the result and discusses areas of potential future work.


Chapter 2

Theoretical Framework

In this chapter, the theoretical foundation of the project is presented. It is divided into three sections, based on 3D computer vision, computational hardware and localization software respectively.

2.1 3D Computer Vision

3D vision is emerging to facilitate more complex tasks where a 3D geometry is necessary, for instance for more advanced quality inspections and flexible object manipulation. The data naturally represents 3D coordinates, which can come in many variants. More complex variants capture the 3D scene from different angles to create a 3D environment. The simplest variant is a projected depth map, where the 3D shape is projected onto the plane of the vision system's lens. It has the same format as a single-channel grayscale image, with the depth data as pixel (z) values, which allows for traditional 2D image processing. The single-angle image acquisition also has the benefit of not requiring movement, which makes the process potentially faster if this method proves sufficient.

There are numerous ways to achieve the projected depth map, where the most common strategies will be investigated in the section 3D Vision Technologies. Before this, the following section will go through some fundamental concepts for 3D computer vision.

2.1.1 Computer Vision Concepts

Ambient light

Ambient light refers to all light sources that are not controlled by the vision system. This includes sunlight and factory lamps, the latter of which are normally driven by a flickering alternating current source. In contrast to human vision, computer vision systems have a hard time auto-adjusting to varying light intensities, which can cause the algorithms to behave differently in different lighting conditions. This is most crucial for systems without active illumination, and a general drawback of 2D vision systems. One way to counter this is through flashes, much stronger than the possible variation in lighting. This however creates very strong reflections, which can reduce the ability to detect structure.


Shot Noise

Shot noise is defined as the standard deviation of the ambient light [1], as seen in fig. 2.1. When the shot noise is stronger than an active illumination signal, it is hard to distinguish the signal and the image becomes noisy. Ways to counter this are to reduce the ambient light, by for instance shielding off the workstation or switching to DC lighting, or more commonly to increase the active signal. Increasing this signal however makes the light source potentially dangerous to humans working in collaboration with the robot.

Figure 2.1: Shot Noise Signal Ratio [1]

Laser and Laser Classes

When in need of strong light sources, laser is often used. In contrast to ordinary light, laser is directed, monochrome and coherent, as illustrated in fig. 2.2. The monochromatic aspect allows for filtering of all other wavelengths, reducing the impact of most ambient light. Coherence and directivity make the signal strong and less variant, both of which reduce the noise of the image.

Figure 2.2: Laser Properties [2]

While this strong light source appears to be optimal for a vision system, it may also pose a health risk. In collaborative environments, where an operator may be close, laser light could inflict eye damage. With strong lasers, safety distances can be long, which can cause trouble when firing the laser in a direction that may coincide with the eye level of factory personnel. Laser sources are therefore safety classed according to the standard EN/IEC 60825-1, as shown in fig. 2.3. A clear distinction can be made between laser class 2 (LC2), which is safe for accidental exposure, in contrast to laser class 3 (LC3).

Figure 2.3: Laser Classes According To EN/IEC 60825-1

It may now instead seem optimal to have the lowest possible laser class, and while that is true from a safety standpoint, it puts limitations on the application. The reason behind stronger laser light is not only domination over ambient light, but also an increased Region of Interest (ROI), increased speed and material flexibility. The larger the area that needs to be illuminated, the stronger the laser has to be to output the required number of photons per area unit, per time unit. For moving lasers, the same applies with regard to illumination per physical point, which means that a stronger laser allows for a faster movement. Lastly, different materials absorb light differently, where a very absorbent material like rubber requires a strong laser for a sufficiently reflected signal. It is therefore a trade-off between safety and performance, which has to be balanced for each application.

Perspective Projection

Perspective projection is a concept of perceived distortions with real-life optics. Given a pinhole approximation of a camera in a single point, a flat surface will be perceived as tilting the further away it is from the camera center axis. One way to explain this is to first demonstrate its counterpart: orthogonal projection. This is often used in computer graphics such as CAD software, as seen in fig. 2.4. Here, the object orientation remains the same independent of the distance from the center axis.


Figure 2.4: Isometric view, orthogonal projection

With a real-life camera, this is however not the case. Fig. 2.5 displays the same object as in fig. 2.4, but from a top view with perspective projection from a point along a center axis placed on the center-right edge of the object. The flat cubes in the top row appear to be increasingly tilted the further left they are placed. While human vision is capable of countering this and understanding that all these shapes lie on the same plane, the left-most shapes appear to be tilting almost 45 degrees to a computer vision system. Given depth data, this is not an issue when calculating planes, but 2D shape detection will be made more complex due to distorted shapes and reduced shape sizes.

Figure 2.5: Top view, right position perspective projection

Fig. 2.6 shows a real-life example of this perspective projection with the selected vision system. It displays four cuboid objects distributed along the complete FOV of the vision system, where the camera is positioned above the second right-most object. The top surface of this object appears to be flat along the vertical axis and somewhat tilted along the horizontal axis due to this perspective projection. With the shading of the background, the human eye can perceive the left-most object along the same flat surface, but the shape itself appears to be tilting almost 33° to a 2D computer vision algorithm. The height of the object's top surface is reduced by 10 % due to its placement further away from the camera, but its width is reduced by 25 %. When compensating for the reduced height, the relative width is reduced by 16 %, which makes it appear tilted by 33° along the vertical axis.


Figure 2.6: Real-life perspective projection
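As a check of that apparent tilt, a minimal sketch using the reductions stated above; treating the compensated width ratio as the cosine of the apparent tilt is the assumed model:

```python
# The apparent tilt follows from foreshortening of the width once the overall
# perspective scaling (estimated from the height reduction) is compensated for.
import math

height_ratio = 0.90  # top surface height reduced by 10 %
width_ratio = 0.75   # top surface width reduced by 25 %

relative_width = width_ratio / height_ratio           # width ratio after compensation
apparent_tilt_deg = math.degrees(math.acos(relative_width))

print(round(100 * (1 - relative_width), 1))  # ~16.7 % width reduction after compensation
print(round(apparent_tilt_deg, 1))           # ~33.6°, close to the stated 33°
```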

The conclusion of this is that vision system performance and robustness are affected by the object placement in the field of view. Highly tilted objects will become increasingly difficult to detect the further away from the camera center they are placed. It is thus not possible to state a fixed product angle requirement for a vision system without stating where the product will be positioned in relation to the camera.

2.1.2 3D Vision Technologies

Passive Stereo Vision

Passive stereo vision is the simplest strategy for 3D vision, using two traditional 2D cameras with a known interocular distance and vergence angle, similar to human vision. While the hardware architecture is not advanced, the software solution may be quite sophisticated to compensate for this. The stereo structure also serves as a foundation for more advanced techniques, described later in this section.

The passivity comes from the lack of active illumination, further discussed under the section Structured Light, where the system relies on ambient light or homogeneous illumination from camera flashes. As seen in fig. 2.7, the two cameras will capture different images due to their placement, meaning that physical points will appear in different pixels in the two images. By cross-referencing the pictures, a disparity map can be created, describing how much a point is shifted in pixels between the images. The closer the object is, the larger the shift will be, enabling a conversion from disparity to depth.

Figure 2.7: Stereo Vision Disparity [3]
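A minimal sketch of that disparity-to-depth conversion for a rectified, parallel stereo pair; the focal length, baseline and disparity values are illustrative assumptions rather than parameters of any system discussed here:

```python
# Standard pinhole relation for rectified stereo: depth = focal_length * baseline / disparity,
# so a closer point produces a larger pixel shift between the two images.

def depth_mm(focal_length_px: float, baseline_mm: float, disparity_px: float) -> float:
    return focal_length_px * baseline_mm / disparity_px

focal_length_px = 1400.0  # assumed focal length expressed in pixels
baseline_mm = 80.0        # assumed interocular distance

for disparity_px in (40.0, 80.0, 160.0):
    print(disparity_px, round(depth_mm(focal_length_px, baseline_mm, disparity_px)))
# Doubling the disparity halves the estimated depth.
```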


This technique requires the point to be distinguishable from both cameras, which is not the case for flat, even surfaces. In the absence of structure, the depth map will be noisy. This may be smoothed with advanced image processing, but it will inherently lack high homogeneous precision and detail.

As points have to be detected in both pictures, stereo systems only work where the two images overlap, where camera vergence can control the measuring range. This vergence provides a larger overlap of the two images, allowing the system to function at closer distances, as shown in fig. 2.8. As a reference, human eyes also verge when looking at an object that comes closer. Inversely, a more parallel positioning allows measurements further away. A wider interocular distance combined with a larger vergence angle can increase the depth of overlap between the systems, and thus the measuring range. Increasing the beam angle of a projecting pattern and the camera's angle of view respectively have the same effect.

Figure 2.8: Verged vs. Parallel Stereo

Another version of stereo vision is structure-from-motion, which uses a single camera in multiple positions to calculate the disparity map. This only requires one 2D camera, which makes it the cheapest possible system with regards to hardware. It requires movement of the camera between multiple positions and that the object is still, as the generated structure otherwise breaks. It is also sensitive to shot noise, as lighting conditions may vary during the sampling of data. The same flaws of stereo vision can be found here, but magnified by the impact of shading and variable exposure from different angles.

As a conclusion, passive stereo vision is a cheap configuration where it is hard to achieve a high uniform accuracy. It is an indirect measurement of distance, where pixel disparity based on intensity is approximated to depth. Therefore, it is most suitable for object detection (in terms of presence, as opposed to object localization) or gesture control.

Time-of-Flight

The time-of-flight (ToF) principle is best known from sonar systems, where a pulse is sent out and the time delay is measured until its reflection is sensed. ToF cameras however can be divided into two categories, pulsed ToF and modulated ToF, where only the former is based on the delay of the reflected signal. The modulated ToF instead calculates the depth from the phase difference of the reflection of a modulated signal [24]. The principle is the same: by knowing the phase difference and the speed of light, the depth can be calculated. Fig. 2.9 illustrates the concept.

Figure 2.9: Time-of-Flight Working Principle [4]
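A minimal sketch of the modulated (phase-difference) relation; the modulation frequency and phase value are illustrative assumptions:

```python
# For a modulated ToF signal the depth is c * phase / (4 * pi * f_mod), since the
# light travels to the object and back within one modulation period.
import math

SPEED_OF_LIGHT_M_S = 299_792_458.0

def depth_m(phase_rad: float, modulation_freq_hz: float) -> float:
    return SPEED_OF_LIGHT_M_S * phase_rad / (4 * math.pi * modulation_freq_hz)

modulation_freq_hz = 20e6  # assumed 20 MHz modulation frequency
print(round(depth_m(math.pi / 2, modulation_freq_hz), 3))  # ~1.874 m at a quarter-cycle shift
```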

The systems are fairly inexpensive, small and harmless (LC1), which has led to their adoption in, for instance, smartphones to enable face detection [25]. The resolution however is limited, as the depth resolution is dependent on the distance. Most commercial systems never promise more than a 1 cm depth resolution at the time of writing, while sensors from the manufacturer pmd have a depth resolution of 2 mm at their closest measuring distance of 100 mm [26].

The measurement range can be made quite large, as the modulated signal can be made strong while being harmless, and it can be measured over large distances without significant impact of ambient light [27]. It is therefore in a unique position for general purpose 3D vision without high precision demands.

In conclusion, while time-of-flight vision is a promising technology for mid and large range 3D applications, it does not have sufficient precision for the intended application.

Structured Light

As discussed in the Passive Stereo Vision section, passive vision systems have a hard time calculating depth for areas without structure. One example of this is a flat and even surface, where it is hard to measure the pixel shift for the disparity map. To overcome this issue, a known pattern can be projected onto the surface to create an artificial structure. This pattern will be deformed by the height variations, which makes it possible to calculate the depth [28].

Structured light systems can be built as both mono and stereo systems, with one projecting unit and either one or two capturing units, where fig. 2.10 shows a mono setup. By projecting multiple patterns in succession and combining the results, a very high precision can be achieved, where commercial systems have an X/Y/Z resolution of about 0.2-0.5 mm [29]. This complex processing has however made the cameras quite expensive, where commercial units range from 8k-12k euro. Both versions normally employ a verged positioning of either the projector and the camera or the two stereo cameras. As explained in the Passive Stereo Vision section, this controls the measurement range, enabling close capture and a large ROI. However, as the pattern has a finite resolution, the accuracy of the system declines with distance. In addition, it is hard to properly project a pattern over a long distance while overpowering ambient light. Even over smaller distances, the systems are quite sensitive to ambient light, where the active signal has to overpower the shot noise. It may also be sensitive to reflections, as these may distort the active signal. In conclusion, it is a high precision system which is quite sensitive to disturbances.

Figure 2.10: Mono Structured Light [5]

Laser Profiler

A laser profiler uses the traditional time-of-flight concept, as used by the pulsed time-of-flight cameras, with triangulation of a laser pulse. By measuring the reflection time and knowing the speed of light, the depth can be calculated very precisely. This is the only commercially available option for the earlier discussed high precision metrology applications, where 1 µm precision can be achieved [30]. Instead of measuring a point source, it projects a beam of light onto a surface. By moving either the sensor or the object, the laser will scan over the surface, stitching together profiles into a 3D image. This requires knowledge of the relative movement, which has to be precisely controlled, to offset the profiles properly.

The sensor can normally be controlled to either capture a profile on demand or to capture with a fixed scan rate. In the case of a fixed sensor position and an object moving on a conveyor belt, an encoder can be used to signal the sensor at known distance intervals. In the case of controlled and precise movement, a fixed scan rate may instead be a far simpler choice. To enable an inter-profile accuracy of 0.17 mm when moving at a normal robot speed of 250 mm/s, a scan rate of close to 1.5 kHz is needed, which is hard to achieve without a dedicated encoder. Due to the narrow projection at fixed angles, these systems have a quite limited measuring range. It is fixed by the geometry needed to capture the reflection, illustrated in fig. 2.11, and the selected power of the laser to support the expected travel distances. While these factors can be varied between models, each system has a quite narrow measuring range.


Figure 2.11: Laser Profiler
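A minimal check of that scan rate figure, using the speed and inter-profile spacing stated above:

```python
# Required scan rate so that consecutive profiles are no more than 0.17 mm apart
# while the sensor moves over the surface at a given speed.

def required_scan_rate_hz(speed_mm_s: float, profile_spacing_mm: float) -> float:
    return speed_mm_s / profile_spacing_mm

print(round(required_scan_rate_hz(250.0, 0.17)))  # ~1471 Hz, i.e. close to 1.5 kHz
```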

The price range of entry level laser profilers with a scan rate of about 300-500 Hz starts from about 5k euro, where "high speed" versions with a scan rate of about 2-5 kHz cost about 7k euro.

With its narrow bandwidth, it is easy to filter out ambient light. Reflections may be more troublesome however, as each point is only scanned once. As discussed in the previous section on laser classes, laser safety has to be considered.

In conclusion, it is a high precision system with a narrow measurement range.

Parallel Structured Light

Parallel structured light can be thought of as a mix of structured light and laser profiling. It projects structured patterns with laser light, with a resulting X/Y/Z resolution of 0.17 mm [31]. It leverages the ambient light robustness of laser light while sampling each point multiple times with various patterns, intensities and exposure times, which may reduce the impact of reflections. As with laser scanners, laser safety has to be considered. This technology often also deploys a mono-structure with a single laser projection unit and a single receiving camera, which creates a shading effect from its single viewing angle.

In conclusion, it is a system with high resolution together with both ambient light and reflection robustness.

2.2 Computational Hardware

In order to perform all the required tasks of this application, a computational unit is required. There are multiple architectures available for different kinds of computations, which will be presented in the following section.

2.2.1 FPGA

Field-Programmable Gate Arrays (FPGAs) are integrated circuits which can be programmed to change their physical structure. They contain an array of programmable logic blocks, as seen in fig. 2.12, which can be configured to perform any task. They are programmed by flashing the memory with a specific programmed task, which will then be applied to the logic blocks once the FPGA is powered. This allows the FPGA to perform certain tasks with very low latency and with high parallelism, but also limits it to that certain task until it is turned off and its memory is re-flashed. This makes it ideal for highly parallelizable tasks such as data mining, while being unable to perform general unspecified computational tasks. It also requires a large amount of work to implement the specified task, since tasks are implemented in a completely different programming framework compared to those available for a CPU or GPU.

Figure 2.12: Basic FPGA-architecture [6]

2.2.2 GPU

A GPU is a processing unit used for vector-based calculations, which uses hundreds or thousands of cores in contrast to the few cores of the CPU. This allows a GPU to perform several calculations simultaneously, similarly to FPGAs, while still being flexible in terms of what kind of calculations are made. GPUs are commonly used for tasks such as 3D rendering and deep learning because of their computationally intense and parallelizable nature, where a large number of independent tasks can be spread out over the large number of processing cores. The cores of a GPU are not very powerful individually as a result, making the GPU unfit for more serialized and heavy operations usually associated with most operating system tasks. Thus, a GPU always needs a complementary CPU to function. GPUs can be divided into two different groups, namely Dedicated GPUs and Integrated GPUs.

Dedicated GPU

For most heavy GPU usage, a dedicated GPU is required. It is a separate unit with its own circuit board, equipped with dedicated cooling and directly powered by the Power Supply Unit (PSU). It is interfaced with the CPU using PCI through the motherboard or Thunderbolt if using an external GPU. Because a dedicated GPU is a separate unit, it can have a much more extensive design with more cooling capabilities, resulting in vastly higher performance.

Integrated GPU

For smaller designs, it is common to integrate the GPU into the CPU. The shared cooling and limited size severely limit the processing power of the integrated GPU, resulting in most integrated GPUs being used mostly for basic user interfaces and 2D rendering.

2.2.3 CPU

The CPU is often referred to as the brain of a computer, as it performs a majority of the tasks required to run an operating system. In contrast to the GPU's core count, a CPU is equipped with only a small number of processing cores, as shown in fig. 2.13, which individually possess a higher processing power.

Figure 2.13: Difference between CPU and GPU [7]

In recent times there have been major advancements in the development of power efficient microprocessors, catapulted by the emerging smartphone market. While there existed powerful CPUs in PCs, they consumed a lot of power and in turn generated a lot of heat. The microprocessors on the other hand are constructed to run simpler tasks with extra focus on power, size and heat limitations. As the market has expanded, so have the microprocessor capabilities, with the latest in the smartphone industry having performance rivaling many laptops. This development has opened the market for less costly microprocessors for other use cases, for example in the Raspberry Pi, which allows for light to medium intensity operations to be integrated into new environments. This has played a major role in the recent Edge Computing boom, where computations need to be performed closer to the operation. As the hardware becomes both cheaper and more powerful, the possibilities become exponentially larger.


2.2.4 Cluster Computing

Cluster Computing is a way for many computational units to perform tasks together. This is done by either splitting up the computation task into sub-tasks that can be processed individually, or going even lower and linking the processing units together so that they basically function like a single multi-core processing unit. The application of this kind of system might be anything from a giant cloud data center with incredible processing power to a small local cluster of smaller computers that do continuous analysis.

The communication between the different nodes is often the limiting factor, since they have to interact often and quickly. In most cases this is done via Ethernet communication through a network, since it is a fast and easy way to coordinate communication. There is also a distinction between architectures where the individual nodes have their own memory space and those where the memory is only part of a central node. Every node having individual memory allows for more independent operations, while a central memory allows for faster and less communication.

There are two general architectures when it comes to clusters: Master-Slave and Peer-to-Peer [32], shown in fig. 2.14. Master-Slave consists of a master node that delegates all tasks to the slave nodes, which perform the actual computing. This is the easiest one to implement due to the divided nature of the nodes' responsibilities. However, it is more susceptible to errors or failures, since the master node is such a central part of the architecture and as a result can make the whole system fail. Peer-to-Peer fixes this by handling the coordination collectively with all the nodes. While this comes at a cost of individual computational capacity, the impact of a node failing is not as dire since every other node still functions. The overall impact of a failed node will then just be a proportional loss of overall computational power with little risk of complete failure.

Figure 2.14: Master-Slave versus Peer-to-Peer [8]
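A minimal sketch of the master-slave pattern applied to the sub-image case, using Python's standard multiprocessing pool as the coordination mechanism. The worker function and its output are illustrative assumptions; the cluster examined in this thesis instead communicated between separate single board computers over Ethernet.

```python
# Hypothetical master-slave sketch: the master splits the work into sub-image tasks,
# a pool of workers processes them independently, and the master collects the
# per-tile results into a single list of candidate coordinates.
from multiprocessing import Pool

def locate_bolts_in_tile(tile_id):
    """Stand-in for the per-tile localization; returns (tile_id, x, y) candidates."""
    return [(tile_id, 10.0 * tile_id, 5.0 * tile_id)]

if __name__ == "__main__":
    tile_ids = range(24)                      # e.g. a 4 x 6 sub-image division
    with Pool(processes=4) as pool:           # the worker processes act as slave nodes
        per_tile = pool.map(locate_bolts_in_tile, tile_ids)
    coordinates = [c for tile in per_tile for c in tile]
    print(len(coordinates))                   # 24 candidate coordinates collected by the master
```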
