• No results found

GPU-Assisted Collision Avoidance for Trajectory Optimization

N/A
N/A
Protected

Academic year: 2021

Share "GPU-Assisted Collision Avoidance for Trajectory Optimization"

Copied!
111
0
0

Loading.... (view fulltext now)

Full text

(1)

GPU-Assisted Collision Avoidance for Trajectory Optimization

Parallelization of Lookup Table Computations for Robotic Motion Planners Based on

Optimal Control ABHIRAJ BISHNOI

KTH ROYAL INSTITUTE OF TECHNOLOGY

SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE

(2)
(3)

Optimization

Parallelization of Lookup Table

Computations for Robotic Motion Planners Based on Optimal Control

ABHIRAJ BISHNOI

Master of Science, Computer Science and Engineering Date: January 13, 2021

Industry Supervisor: Roberto Lampariello Supervisor: Wei Der Chien

Examiner: Stefano Markidis

School of Electrical Engineering and Computer Science Host organization: German Aerospace Center

Swedish title: GPU-assisterad kollisionsundvikande för banaoptimering

Swedish subtitle: Parallellisering av uppslagstabellberäkningar för robotrörelsesplanerare baserat på optimal kontroll

(4)

© 2021 Abhiraj Bishnoi

(5)

Abstract

One of the biggest challenges associated with optimization based methods for robotic motion planning is their extreme sensitivity to a good initial guess, especially in the presence of local minima in the cost function landscape.

Additional challenges may also arise due to operational constraints, robot controllers sometimes have very little time to plan a trajectory to perform a desired function. To work around these limitations, a common solution is to split the motion planner into an offline phase and an online phase. The offline phase entails computing reference trajectories for varying parameterizations of the task space in the form of a lookup table. During the online phase, a stripped down version of the optimizer is supplied with a suitable initial guess from the lookup table using the current state estimate of the robot and its surrounding bodies. This method helps in alleviating problems related to both local minima and operational time constraints, by seeding the optimizer with a suitable initial guess that allows it to converge to the global minimum much faster.

The problem however, shifts to the computational complexity of computing a lookup table of reference trajectories for a fine enough discreti- zation of the input state space. For many robotic scenarios of interest, it is often impractical and sometimes computationally infeasible to compute a look up table using a serial, single core implementation of the offline phase of a motion planner. The main contribution of this work is to develop and evaluate a method for reducing the time spent on computing a lookup table of reference trajectories during the offline phase of motion planners based on optimal control. We implement a method to offload the computation of collision avoidance constraints during trajectory optimization on a Graphics Processing Unit (GPU), while simultaneously benefiting from a task based approach to distribute lookup table computations for independent subsets of the input state space across multiple processes on a cluster of machines. We demonstrate the efficacy of the proposed method in a practical setting by implementing and evaluating it within a representative motion planner based on optimal control.

We observe that the implemented method is 115x faster than the original serial version of the planner, using 86 processes on 5 machines with standard server grade hardware and 5 Graphics Processing Units in total. Additionally, we observe that the implemented method results in solutions identical to the original serial version in 96.6% of cases, lending credibility for its use in robotic motion planning.

(6)

Keywords

Motion Planning, Robotics, Trajectory Optimization, GPGPU, Parallel Programming

(7)

Sammanfattning

En av de största utmaningarna med optimeringsbaserade metoder för rörelse- planering inom robotik är deras extrema känslighet för en bra initial gissning, särskilt i närvaro av lokala minima i kostnadsfunktionslandskapet. Ytterligare utmaningar kan också uppstå på grund av operativa begränsningar. Robot- kontroller har ibland väldigt lite tid att planera en väg för att utföra en önskad funktion. För att kringgå dessa begränsningar är en vanlig lösning att dela upp rörelseplaneraren i en offline-fas och en online-fas. Offlinefasen inkluderar beräkning av referensvägar för olika punkter i ingångstillståndsutrymmet i form av en uppslagstabell. Under online-fasen levereras en avskalad version av optimeraren med en lämplig initial gissning från uppslagstabellen med den aktuella uppskattningen av roboten och dess omgivande kroppar. Denna metod hjälper till att lindra problem relaterade till både lokala minima och driftstidsbegränsningar genom att sådd optimeraren med en lämplig initial gissning som gör att den kan konvergera till det globala minimumet mycket snabbare.

Problemet flyttas emellertid nu till beräkningskomplexiteten för att beräkna en uppslagstabell över referensvägar för ett tillräckligt fint utrymme för ingång- stillståndsutrymmet. För många robotscenarier av intresse är det ofta opraktiskt och ibland beräkningsmässigt omöjligt att beräkna en uppslagstabell med hjälp av en seriell, enda kärnimplementering av offline-fasen i en rörelseplanner.

Huvudbidraget till detta arbete är att utveckla och utvärdera en metod för att minska tiden som används för att beräkna en uppslagstabell över referensvägar under offline-fasen för rörelsesplanerare baserat på optimal kontroll. Vi imple- menterar en metod för att utföra en kollision undvika en grafikbehandlingsenhet (GPU), medan du använder en uppgiftsbaserad metod för att distribuera uppslagningsberäkningar för oberoende delmängder av inmatningsutrymme över flera processer i ett kluster av maskiner. Vi demonstrerar effektiviteten av den föreslagna metoden i en praktisk miljö genom att implementera och utvärdera den inom en representativ rörelseplanner baserat på optimal kontroll. Vi noterar att den implementerade metoden är 115 gånger snabbare än den ursprungliga serieversionen av schemaläggaren, med 86 processer på 5 maskiner med standardhårdvara och totalt 5 GPU: er. Dessutom observerar vi att den implementerade metoden resulterar i lösningar som är identiska med den ursprungliga serieversionen i mer än 96,6 % av fallen, vilket ger trovärdighet för dess användning i robotrörelse planering.

(8)

Nyckelord

Rörelseplanering, robotik, optimal kontroll, GPGPU, parallell programmering

(9)

Acknowledgments

First and foremost, I would like to thank my advisor Dr. Roberto Lampariello for inspiring me and for being a constant source of support, guidance and encouragement throughout the duration my master’s thesis. This work would not have been possible without his involvement and I am extremely grateful to him for being such a wonderful advisor.

I would like to thank the German Aerospace Center for graciously hosting me for the past 9 months and for giving me the opportunity to conduct research on an exciting topic in my field of interest. I would also like to thank them for providing access to resources and infrastructure required to conduct this thesis, and for giving me the exposure to the culture of working at a world class research facility. My sincere thanks also goes towards Mr. Michael Steinmetz at the German Aerospace Center for his help and for the useful discussions at the beginning of my thesis.

I would like to express gratitude towards my examiner Prof. Stefano Markidis and my academic supervisor PhD Candidate Wei Der Chien for taking me under their wing and giving me the knowledge, training and skills necessary to conduct this thesis. I would also like thank them for their guidance, help and support throughout my thesis.

Last but not least, I would like thank my parents for their constant love, encouragement and support throughout my lifetime. Nothing in my life would be possible without them and I am indebted to them beyond what words can express.

Stockholm, January 2021 Abhiraj Bishnoi

(10)
(11)

Contents

1 Introduction 1

1.1 Motivation. . . 1

1.2 Problem . . . 4

1.3 Purpose . . . 5

1.4 Goals . . . 5

1.5 Delimitations . . . 6

1.6 Ethics and Sustainability Aspects . . . 6

1.7 Structure of the thesis . . . 7

2 Background and Related Work 9 2.1 Trajectory Optimization for Motion Planning in Robotics . . . 9

2.2 Collision Detection and Avoidance . . . 12

2.2.1 Application Domain Representation . . . 13

2.2.2 Types of Queries . . . 17

2.3 Parallel Computing . . . 24

2.3.1 Single Program Multiple Data (SPMD) . . . 25

2.3.2 Amdahl’s law . . . 26

2.3.3 Gustafson’s law . . . 27

2.3.4 GPGPU . . . 27

3 Design and Implementation 31 3.1 Application Scenario . . . 31

3.1.1 Mathematical formulation of the optimization problem 34 3.2 Profiling the serial version of the planner . . . 40

3.3 Parallelization using GPUs . . . 42

3.3.1 Representation of objects . . . 42

3.3.2 Initial Parallelization Strategy using GPUs . . . 45

3.3.3 Improved Parallelization Strategy using GPUs . . . 50

(12)

3.4 Parallelization of the lookup table on multiple cores using an

SPMD parallel programming approach . . . 53

3.4.1 CUDA Multi-Process Service (CUDA MPS) . . . 56

4 Experimental Setup 57 4.1 Technology . . . 57

4.1.1 Profiling . . . 58

4.1.2 Programming languages, libraries and frameworks . . 58

4.1.3 Frameworks for parallel computation . . . 58

4.2 Testbed . . . 58

5 Experimental Results 61 5.1 Performance . . . 61

5.1.1 GPU v/s CPU . . . 61

5.1.2 Comparisons between sub-parts of the implemented method . . . 63

5.2 Accuracy . . . 70

5.3 Resource Utilization . . . 74

5.3.1 CPU Utilization. . . 74

5.3.2 GPU Utilization . . . 74

6 Discussion and Conclusion 76 6.1 Performance Discussion . . . 76

6.1.1 Accuracy Discussion . . . 80

6.1.2 Resource Utilization Discussion . . . 82

6.2 Conclusion . . . 83

6.3 Future Work . . . 84

References 85

(13)

List of Figures

1.1 Examples of application scenarios where the computation of lookup tables for robotic motion planning is useful . . . 3 2.1 Figure outlining the difference between a closed loop solution

and an open loop solution. [1] . . . 10 2.2 Examples of bounding volumes: sphere, axis-aligned bounding

box (AABB), oriented bounding box (OBB), eight-direction discrete orientation polytope (8-DOP), and convex hull [2]. . . 15 2.3 Example representation of a spacecraft with a robotic manipulator

mounted on it. The zoomed in figures show the use of capsules to represent robotic links and an Oriented-Bounding Box to represent the spacecraft itself. It can be seen that the bounding volumes are close in fidelity to the actual objects they encompass. 16 2.4 Pictorial representation of two spheres intersecting in 2-D.

The spheres are defined by a center and a radius. . . 19 2.5 Pictorial representation of a sphere and capsule intersecting in

2-D. . . 20 2.6 Pictorial representation of two capsules intersecting in 2-D. . . 22 2.7 Pictorial representation of the separating axis test in 2-D

(Adapted from [3]). . . 23 2.8 Figure outlining the difference between a CPU and a GPU . . 29 2.9 Illustration of the CUDA Programming Model (reproduced

from [4]). . . 29 3.3 Angular velocity of the tumbling target satellite . . . 38 3.4 Profiler output demonstrating the computation of collision

avoidance constraints as a major performance bottleneck in the algorithm for computing a lookup table of reference trajectories 41

(14)

3.6 Profiler output for the computation of collision avoidance constraints on the GPU with a stream-based, concurrent kernel method for 10 independent queries of the lookup table . . . . 49 3.7 Profiler output for breakup of computational effort for computing

collision avoidance constraints on the GPU without using multiple streams and kernels. The times are measured for a total of 10 independent queries of the lookup table. . . 52 4.1 Specifications of machines used for experiments. The column

S:C:T refers to the configuration of the CPU cores and stands for Socket(s):Core(s):Thread(s) per core. . . 59 5.1 Comparison of execution times between the GPU and the

single core CPU version, for varying problem sizes (number of queries). The height of each bin represents the overall computation time. The bars on top represent the standard deviation for the overall computation time. . . 62 5.2 The stacked bar graph represents the fractions of overall

computation time spent on collision avoidance, as well as the time spent on all other aspects of the problem, using the implemented method and the reference baseline for varying number of queries. . . 63 5.3 Speedup obtained for a varying number of processes and

combinations of the implemented optimizations on different nodes. . . 65 5.4 Comparison of execution times between nodes for the same

problem size (100 queries) and an optimal number of processes per node . . . 67 5.5 Comparison of execution times obtained using different configurations

of the implemented method. The baseline is considered to be the fastest single core CPU version of algorithm 1.. . . 70 5.6 Cost function histograms of the trajectories obtained using

the original serial implementation of the planner and the implemented method using GPUs . . . 73 5.7 Simultaneous logical core utilization during the computation

of a lookup table reported using the Intel(R) VTune Profiler.

The histograms show a percentage of the wall time a specific number of CPUs were running simultaneously. . . 75

(15)

6.1 Overlapped cost function histograms using the CPU and the GPU . . . 81

(16)
(17)

List of Tables

2.1 Feature matrix for implemented primitive intersection tests and methods for calculating penetration depth for pairs of bounding volumes on the GPU. . . 18 3.1 Average execution time using a single core implementation of

algorithm to compute a lookup table of reference trajectories (wall time in seconds). . . 41 3.2 Average execution time using a single core CPU with the

initial version of the algorithm for computing collision avoidance constraints on the GPU . . . 49 3.3 Average execution time using a single core CPU with the

improved version of the algorithm for computing collision avoidance constraints on the GPU . . . 52 4.1 Specific versions of tools, libraries and frameworks of relevance

used for experiments . . . 57 5.1 Configuration of machines used for conducting experiments

in this section . . . 64 5.2 Configuration of machines used for conducting experiments

in this section . . . 66 5.3 Speeds of machines relative to the slowest machine (rmc-gpu07) 68 5.4 Configuration of machines used for conducting experiments

in this section . . . 68 5.5 Configurations of machines used for computing a lookup table

of 1000 reference trajectories . . . 69 5.6 Order of magnitude of the mean and maximum relative errors

between the penetration depth constraints and cost function values computed using the GPU based method, and independently using ODE . . . 72

(18)

5.7 GPU utilization as reported by the nvidia-smi utility . . . 75

(19)

Chapter 1 Introduction

In this work, we explore the use ofGeneral Purpose Computation on Graphics Processing Units (GPGPU), coupled with a Single Program Multiple Data (SPMD) parallel programming approach to gain substantial computational efficiency in the computation of lookup tables of reference trajectories, during the offline phase of robotic motion planners based on optimal control. This work was carried within the Institute of Robotics and Mechatronics at the German Aerospace Center(DLR) in Oberpfaffenhofen, Germany. The Institute of Robotics and Mechatronicsis one of the leading robotics research institutes in the world and develops a wide array of robots to enable humans to interact more safely and efficiently with their surrounding environments. As a member of the German Aerospace Center, they also develop highly autonomous robots towards space missions for exploring remote planets, moons and small bodies of the solar system.

1.1 Motivation

Robots are set to play an increasingly important role in society in the near future. The rapidly increasing pace of robotics research coupled with the changing demographics of society have made robots more commonplace than ever before. Motion planning is a fundamental and well studied problem in robotics. The task of moving a robot from a given initial state to a desired final state, while guaranteeing safety from collisions and ideally satisfying some measure of performance is pivotal to the function of any robot. Traditionally, trajectory optimization as a mathematical technique was almost exclusively used for military applications and within the aerospace industry to compute

https://www.dlr.de/rm/en/desktopdefault.aspx/tabid-8017/

(20)

trajectories of missiles, aircraft and spacecraft. In recent years, there has been a rising interest in its use outside the aerospace community as a tool for motion planning in robotics [5] [6] [7] [8]. It has proven to be useful for generating safe and efficient trajectories that can be executed by a controller to cause a robot to move from a given initial state to a desired goal state, while minimizing a defined measure of cost and simultaneously ensuring a given set of constraints are satisfied throughout its motion. For robotic application scenarios involving complicated dynamics, motion constraints arising from the highly non-linear dynamics of the system, as well as constraints such as collision avoidance make the resulting control problem highly non-linear and non-convex [8] [9]. In such cases, trajectory optimization is often the most viable approach to solve the motion planning problem due to its proven ability to handle non-linear dynamical constraints.

The biggest challenges associated with optimization based methods for motion planning are their extreme sensitivity to a good initial guess and the presence of local minima in the cost function space [8] [1] [10]. Additional challenges may also arise due to operational constraints, robot controllers sometimes have finite time to plan a trajectory to perform a desired function.

Prior research demonstrates that such problems can generally be solved within the allocated operational time of 100 seconds [6]. However, if a cost function is to be optimized, the run-times become longer and may take much longer than the operational limits allow. In the special case of robotic application scenarios in space, this problem is exacerbated by the lack of powerful computational resources on board the robot and and the additional lack of a reliable network connection at all times between the station on ground and the robot or spacecraft in space. Some practical examples of such application scenarios are shown in figure1.1.

To work around these limitations, a common solution is to split the motion planner into an offline phase and an online phase. Most of the computational heavy lifting is performed beforehand during the offline phase and the results of the offline phase are subsequently used while computing a trajectory in real time during the online phase. This is very similar in spirit to the dynamic programming approach in computer science, wherein results stemming from a previous calculation are stored and subsequently re-used to save time and computational resources in the future. In the context of trajectory optimization for robotics, a concrete implementation of the offline phase can translate to computing reference trajectories for varying parameterizations of the task

(21)

(a) Grasping and stabilization of spinning debris by means of a

free-flying robot

(b) Grasping and stabilization of a non-cooperative tumbling target satellite by means of a free-flying

robot

Figure 1.1 – Examples of application scenarios where the computation of lookup tables for robotic motion planning is useful

space in the form of a lookup table [11] [5] [6] [12] [13] [10]. The lookup table can be thought of as a data base for mapping between a D-dimensional task space and an N-dimensional output space representing the family of time- parameterized optimal robot trajectories.

During the online phase, the state of the robot and that of other bodies in its vicinity are estimated using sensors on board the robot and this state estimate is used to extract a trajectory from the lookup table which was computed during the offline phase. In the simple case, the extracted trajectory corresponds to a state which is closest to the estimated state as defined by a suitable measure of distance in the D-dimensional task space. Another viable approach could be the use of a learning approach to learn a function mapping the task space to the solution space of feasible trajectories. The extracted trajectory is then supplied as an initial guess to a stripped down online version of the optimizer in a process known as ’warm-starting’. The advantage of such an approach is that the most computationally intensive aspects of the motion planning problem need to be performed only once with more time and resources at our disposal, and the results can be used any number of times thereafter online which much lesser computational effort. Crucially, this method also helps in alleviating problems related to local minima by providing a suitable initial guess for the optimizer to to use in real-time, as evidenced in [11] [5] [6] [10].

(22)

1.2 Problem

The main problem addressed in this thesis is the computational complexity of computing a lookup table of reference trajectories during the offline phase of motion planners based on optimal control. For many robotic scenarios of interest, it is impractical and often computationally infeasible to compute a look up table using a serial, single core implementation of the offline phase of a motion planner. This is especially true for application scenarios with many independent degrees of freedom wherein the computational complexity is exacerbated due to the infamous curse of dimensionality as the number of degrees of freedom increase [14] [15].

Data from profiling experiments conducted at theDLRshow that it takes over 32 hours to compute a lookup table of 1000 reference trajectories for a representative application scenario using a serial, single core implementation of a motion planner based on optimal control. The application scenario used for this analysis is that of computing a lookup table for the grasping and stabilization of a non-cooperative, tumbling target satellite by means of a free- flying robot. The input space is considered to be six dimensional and the output space is an O(10) dimensional output space of robot trajectories. Information from further experiments reveal the most computationally intensive tasks within the motion planner to be the computation of collision avoidance constraints (40-55% of total computation time), and the convex optimization solver (15-20% of total computation time). These findings are corroborated by literature on the subject [16] [17] and it is well known in the robotics research community that collision avoidance is one of the most time consuming aspects of motion planning, accounting for up to 90% of the total computation time in some cases.

A serial implementation limits scalability and serves as a major performance bottleneck while computing a lookup table of trajectories for different paramete rizations of the task space. The problem we wish to solve is - Can General Purpose Computation on Graphical Processing Units and techniques from parallel programming be leveraged to gain a substantial speedup in the computation of lookup tables of reference trajectories during the offline phase of robotic motion planners based on optimal control theory?

(23)

1.3 Purpose

The purpose of this work is to investigate the potential of GPUs to gain substantial computational efficiency in the computation of collision avoidance constraints during trajectory optimization, while simultaneously leveraging a traditionalSPMDparallel programming approach to parallelize the computation of a lookup table of reference trajectories across multiple cores on a cluster of machines. If successful, the results of this thesis will be used by researchers at theDLRto save a significant amount of time and computational resources during the offline phase of robotic motion planners based on optimal control.

The proposed solution can enable a wide range of applications, a few of which include - trajectories for the capture of a non-cooperative tumbling target satellite using a robotic manipulator mounted on a chaser satellite, the capture of tumbling objects on board the International Space Station (ISS) using a robotic manipulator on a freely floating base, and whole body trajectories of humanoid robots on earth. The results of the thesis might also be useful to researchers and engineers working on similar applications in similar or different domains, as a guideline for creating an effective parallelization strategy.

1.4 Goals

The main contribution of this work is to develop and evaluate a method for reducing the time spent on computing a lookup table of reference trajectories during the offline phase of motion planners based on optimal control. The advantage of the method is validated in a practical setting through an implemen- tation and evaluation within a representative motion planner for a practical application scenario. Hence, the project is considered successful if:

1. We can provide a method to accelerate the computation of collision avoidance constraints during trajectory optimization for robot motion planning, using the CUDA programming model to program an NVIDIA GPU.

2. We can demonstrate that the methods developed on theGPUto accelerate the computation of collision avoidance constraints do not lead to a dramatic loss in accuracy, which would render the method useless in a practical setting.

(24)

3. We can provide a method that builds on top of the aforementionedGPU based method to further accelerate the computation of lookup tables of reference trajectories by splitting the computation over multiple cores across multiple nodes in a cluster, each with its ownGPU.

4. We can implement the proposed method within a representative motion planner for a practical application scenario, and validate the advantage of the proposed method over the existing serial, single core based method.

1.5 Delimitations

This thesis is limited in scope to using GPGPU and parallel programming to improve the computational efficiency of the offline phase of an existing robotic motion planer based on optimal control. The use of the generated lookup table to train a machine learning model to output a good initial guess for the online phase in real-time falls outside the scope of this thesis. The thesis is also limited in the following aspect - we use theCompute Unified Device Architecture (CUDA) parallel computing platform and programming model to build a system to compute collision avoidance constraints usingGPUs. We acknowledge the presence of alternatives such asOpen Computing Language (OpenCL) but NVIDIA recommends the use of CUDA over for their own accelerators. This makes it the framework of preference for use in this thesis as most systems with theDLRhaveNVIDIAgrapics cards.

1.6 Ethics and Sustainability Aspects

This work addresses a subject that could be potentially classified as dual-use.

Due to confidentiality concerns, parts of this thesis, especially parts related to the application scenario addressed have deliberately been written in an academic manner to avoid giving information about how the method could be applicable for dual-use scenarios to avoid potential conflicts.

This work addresses the performance and scalability aspects for motion planners trajectory optimization based approach for orbital robotic applications in space, with a focus on collision avoidance. One of the applications targeted in this thesis is that of on-orbit servicing, which is expected to increase the service life of satellites and lead to a reduction in space debris. This

(25)

contributes to fulfilling the 12 United Nations Sustainable Development Goals 2020, namely, that of ensuring sustainable consumption and production patterns.

This work also contributes towards sustainability by significantly reducing the time required for the computation of lookup tables of reference trajectories for robotic motion planners based on optimal control. It can be argued that as a result of this, we save energy by making more efficient use of the resources we have at our disposal. This argument is also supported through the use ofGPUs in this thesis and the observation thatGPUstend to be more energy efficient than multi-coreCPUs[18].

1.7 Structure of the thesis

This thesis is organized as follows: In Chapter 2 we first provide the necessary background to understand the problem and contributions of this thesis. Chapter 3 is dedicated to the detailed description on the design and implementation of the GPU based system for computing collision avoidance constraints, as well as the design of a system to the computation of a distribute a lookup table of trajectories across multiple cores on multiple nodes of a cluster. In Chapter 4, we briefly describe the bench-marking environment used for our experiments.

In Chapter 5, results of an experimental evaluation of our implementation are presented. Finally, in Chapter 6, we conclude the thesis by summarizing the results, elaborating on limitations of our work and showing future research directions.

(26)
(27)

Chapter 2

Background and Related Work

This chapter provides basic background for topics related to this thesis. We also describe related work in the field and motivate our contribution to the existing body of research.

2.1 Trajectory Optimization for Motion Planning in Robotics

Trajectory Optimization is a mathematical technique for generating optimal trajectories by selecting the best inputs to a control system as functions of time [1]. A solution to a trajectory optimization problem is said to be feasible if it satisfies all of the problem requirements, known as constraints. The set of controls that produce feasible trajectories are known as admissible controls.

Trajectory optimization is concerned with finding the input set of controls that produce an ’optimal’ trajectory, i.e, a trajectory that satisfies all the given constraints while also minimizing some measure of cost, or alternatively, maximizing some measure of performance. Trajectory optimization is usually used for computing an open-loop solution to an optimal control problem, and is often used for systems where computing the full closed-loop solution is impractical or impossible. Figure2.1demonstrates the difference between an open loop solution and a closed-loop solution. The fundamental difference is that while a closed loop solution gives the set of input controls that can move a robot from any given point in the state space to a given final state, whereas an open loop solution is only concerned with the set of control inputs that move a body from a given initial state to a desired goal state.

(28)

; Figure 2.1 – Figure outlining the difference between a closed loop solution

and an open loop solution. [1]

Mathematically defined , given an initial condition, x0, and an input trajectory u(t) defined over a finite interval, t in [t0, tf], we can compute the finite horizon cost of executing that trajectory using the a standard additive- cost optimal control objective,

Ju(·)(x0) = lf(x(tf)) + Z tf

t0

l(x(t), u(t))dt.

The trajectory optimization problem can be formulated as

min

u(·)

lf(x(tf)) + Z tf

t0

l(x(t), u(t))dt

˙x(t) = f (x(t), u(t)), ∀t ∈ [t0, tf] x(t0) = x0.

Where x represents the state vector, u represents the set of input controls, J represents the cost function, x0 represents the initial state, t0 represents the initial time and tf represents the final time. A more detailed reference of the topic for the interested reader is available in [1]. Some trajectory optimization problems may also include additional constraints, such as collision avoidance (e.g., where the constraint is that the signed distance between the robot’s geometry and the obstacles stays positive) or input limits which can be defined for all time or some subset of the trajectory, such as limits on the velocities of the robot, and limits on accelerations and torques on its joints.

(29)

In recent times, there has been a revival of interest in the application of trajectory optimization as a tool for motion planning in robotics. A large body of prior work exists related to trajectory optimization being applied to motion planning in robotics. As far back as in the year 1986, Khatib proposed the use of artificial potential fields to work around obstacles [19]. Lampariello [20]

applied a direct shooting based method to compute optimal trajectories for the capture and stabilization of a tumbling target satellite by means of a freely- floating manipulator. Lampariello et. al [6] extend this work motivating the use of a lookup table for computing trajectories for the task of grasping and stabilization of spinning debris using a freely-floating robot. The same group also use a similar method with a different application scenario, for computing optimal trajectories for ball-catching using a humanoid robot [5]. As such, this work serves as an extension to these works, by developing a method to accelerate the computation of the lookup table offline, using techniques from parallel programming andGPUs.

Zucker et. al [7] develop a method for trajectory optimization invariant to re-parameterization, by using functional gradients to improve the quality of the trajectory. For collision avoidance, they use a voxel-based signed distance field for computing distances. Our work differs from this work in the use of signed distance/penetration depth constraints for guaranteeing collision avoidance, and through the use of a lookup table to provide a good initial guess to the online version of the motion planner.

Schulman et. al [8] apply a variant of trajectory optimization known as sequential convex optimization to solve motion planning problems in robotics.

The method they develop penalizes collisions with a hinge loss instead of formulating as a constraint. The authors argue the computational complexity of the optimization problem can be reduced through a reduction in the number of via points, so long as guarantees of collision avoidance can be delivered.

They go on to make a case for their argument by developing a unique method to ensure that the trajectory is collision free with a fewer number of via points, using swept volumes, support mappings and the Gilbert-Johnson- Keerthi algorithm(GJK) [21] for computing signed distances between convex shapes.

A limitation we see with their approach is that although the collision avoidance problem can be satisfied using a fewer number of via points, this

(30)

by itself is not a sufficient enough reason to justify a coarser discretization of the trajectory for all application scenarios. In our experience, for situations with complex dynamical constraints such as the ones addressed in this work, it is generally not possible to get smooth solutions using coarser discretizations of trajectories. Our approach to collision avoidance differs from theirs, as we enforce collision avoidance as a constraint at each via point instead of formulating it as a penalty. Furthermore, we do not make use of theGJKfor collision detection, as the popular method of usingGJKin combination with theExpanding Polytope algorithm(EPA) for computing the penetration depth is known to be numerically unstable for 32 bit floating point representations [22]. For more information, please see section3.3.1.

More recently, Merkt et. al [23] propose the use of a harmonic fields to guarantee continuous collision avoidance in between the via points of a trajectory. They also make use of the Bullet physics engine [24], which in turn makes use of theGJKalgorithm to compute signed distances used to ensure collision avoidance. Bonalli et al. [25] provide an algorithmic framework to solve trajectory optimization problems with rigorous theoretical guarantees of convergence. As such, their method could prove to be useful alternative to the use of a lookup table and presents an alternate approach to the one we take in this work.

The main difference between the method used in this thesis and previous work is the use of a look-up table computed during the offline phase to generate a suitable initial guess for the online planner to use in real-time. We acknowledge that this process is computationally intensive for a fine enough discretization of the input space. This is the problem we wish to solve and the gap in the existing body of research we seek to fill with this work.

2.2 Collision Detection and Avoidance

One of the most important constraints that is often imposed on the trajectory is that of collision avoidance. Collision avoidance in robotic motion planning refers to the process of ensuring that a planned path is safe and collision free.

This can include being free of collisions between the robot and static obstacles in its environment, collisions between the robot and other robots or dynamic obstacles in motion, as well as self collisions between different parts of the robot itself. Collision detection and avoidance as sub-fields of robotics have been given a lot of importance in both academia and industry, as collision

(31)

avoidance has historically been a major bottleneck in robotic motion planners.

A number of prior related works exist of on the subject [26] [17] [27] [28].

There are several factors affecting the choices made in designing a collision detection system. Real Time Collision Detection [29], the de facto Bible for implementing collision detection systems breaks down these factors into the following categories - application domain representation, types of queries supported, simulation environment parameters, performance, robustness, and ease of implementation and use. For the purposes of brevity, we limit ourselves in this section to the ideas used in this thesis. The motivation for using these ideas (choice of representations, algorithms) is described in some more detail in the section3.3.1.

2.2.1 Application Domain Representation

This refers to the geometrical representations used for bodies in the scene of interest. It is concerned with how we represent bodies of interest in the scene mathematically and geometrically. In layman terms, it can be thought of as the process of choosing which shapes we want to use to represent bodies in our application. The chosen geometric representations have a direct consequence on the algorithms that must be used to work with these representations. Generally, with fewer restrictions, more general purpose collision detection algorithms have to be used, with increasing complexity and decreasing performance. Alternatively, with more specialized representations that place restrictions on the representation, the collision detection systems can be vastly more performant at the expense of generality. This makes sense intuitively as well - it is generally far more difficult, complicated and expensive to design a system that works for all possible scenarios as opposed to one that works for a specific subset of scenarios.

Depending on how complicated the chosen representation for geometries of objects in the scene of interest is, it can be computationally expensive to directly test two objects for collision against once another. This is especially true if objects are represented by many thousands of polygons. To optimize the performance of intersection tests in such cases, bounding volumes are usually tested for overlap before the more complicated geometrical intersection tests are performed. ABounding Volume(BV) is a single simplified volume encapsulating one or more objects of more complex nature. The rationale behind this is that simpler bounding volumes / representations such as spheres

(32)

and boxes have much cheaper intersection tests than the complex objects they bound. Empirically, most application scenarios are found to have a larger proportion of intersection tests between pairs of objects resulting in a negative outcome than a positive one.

This observation is exploited through the use of bounding volumes to represent objects. For the cases where the bounding volumes are found to intersect, we have to perform two intersection tests - one for the pairs of bounding volumes, and one for the case of the more complex representations encompassed by the bounding volume. However, since a greater proportion of bounding volumes are usually not close enough to lead to an intersection, this method leads to an overall performance gain during collision detection.

For some applications, the bounding volume itself might be used for representing objects in the scene and a bounding volume intersection test serves as as sufficient proof of collision. We argue that this is true for robotic scenarios of interest in general, so long as we do not deal with the infamous motion planning problem of narrow passages [30] [31]. In situations where there exist narrow passages, a major drawback of using the bounding volume to represent objects in the scene is that we might register collisions between bounding volumes used to represent the bodies even though there might not be an actual collision between the bodies themselves (depending on how compact the bounding volume is).

If objects in the scene are represented by a more complicated representation such as a mesh, polygon soup or even a more complicated bounding volume, collisions that might be found using larger bounding volumes are likely to be rightly not registered. This is due to the fact that more complex, detailed representations tend to be closer in fidelity to the actual objects. We have faster intersection tests and lesser memory requirements for simpler bounding volumes at the expense of accuracy and fidelity, but a better bound and higher fidelity with more complex representations at the expense of performance.

Figure 2.2 taken from [29] shows this trade-off beautifully. The advantage of using bounding volumes as the representation of objects however, generally far outweighs the disadvantages. This is especially true for robotic application scenarios of interest in this thesis, where we do not have to deal with narrow passages. The upshot of this decision is that we save a significant amount of time in collision detection and avoidance dealing with much simpler representations.

(33)

; Figure 2.2 – Examples of bounding volumes: sphere, axis-aligned bounding

box (AABB), oriented bounding box (OBB), eight-direction discrete orientation polytope (8-DOP), and convex hull [2].

In this work, we are interested in building a system for the computation of collision avoidance constraints specifically for motion planning in robotics.

We are not interested in building a general purpose system capable of handling collision detection and avoidance for any possible generalized scenario.

Additionally, we do not have to deal with problems of narrow passages in the application scenario(s) considered. These conditions motivate the use of spheres, capsules and Oriented Bounding Boxes (OBBs) as being sufficient and ideal for representing objects in robotic scenarios of interest. A capsule can be thought of as the geometric shape that is obtained upon sweeping a sphere along a line. A common occurrence of a capsule in daily life is in the shape of medical pills. AnOBBis a bounding parallelepiped whose faces and edges are not necessarily parallel to the basis vectors of the frame in which they’re defined. Figure 2.2 shows how anOriented Bounding Box(OBB) is different from its close cousin, an Axis-Aligned Bounding Boxes(AABBs).

We choose OBBsas the representation of choice over AABBsbecause they are closer in fidelity to the actual object (have a tighter bound). Figure 2.3 shows the use of capsules andOriented Bounding Boxes(OBBs) in robotics by using them to represent a satellite with a robotic manipulator mounted on top of it. It can be seen that the bounding volumes used represent these bodies are very close in fidelity to the actual bodies they represent, only being larger by a a few centimetres at most in each dimension. Further motivating reasons for using the chosen bounding volumes to represent objects in particular, and not others can be found in section3.3.1.

(34)

(a)

(b)

Figure 2.3 – Example representation of a spacecraft with a robotic manipulator mounted on it. The zoomed in figures show the use of capsules

to represent robotic links and an Oriented-Bounding Box to represent the spacecraft itself. It can be seen that the bounding volumes are close in

fidelity to the actual objects they encompass.

(35)

2.2.2 Types of Queries

The types of queries refers to the type of questions we wish to ask of the collision detection system. For instance, we may be interested in answering the question whether two (or more) objects intersect / collide at a given instant in time? If yes, we might want to find out where and how do they intersect?

Or what can we done to bring two colliding objects out of collision? Another potentially interesting question we may wish to ask is - given two objects, how do we determine their closest points? The closest points are defined as being two points P1 in object A, and P2 in object B such that the separation distance between these two points is minimum. The closest points are not necessarily unique, there may be an infinite number of closest points between two objects. An easy way of visualizing this scenario is to imagine two boxes with faces parallel to each other. In this situation, there are an infinite pairs of closest points on the planes of the boxes that are closest to and facing one another. For the purposes of motion planning in robotics, we are primarily interested in determining whether two objects intersect and if they do, how could we go about bringing them out of a collision. We are generally not so interested in determining where and how they intersect? This is known in literature as computing the ’contact manifold’ and is more relevant for games and rigid body simulations. A number of references exist on this subject and the interested reader can find more information in [29] [22].

Answering the first question, namely, whether two or more objects intersect at a given instant in time, is fundamental to any collision detection system. It is commonly referred to as a ’static intersection test in literature’ and deals with answering the Boolean question of whether two static objects are overlapping, given their given positions and orientations with respect to some universal co- ordinate frame at an instant in time. In general, answering this question is fairly straightforward for simple representations of objects.

Answering the second question, namely - if two (or more) objects intersect, what can we do to bring them out of collision is in general, much more difficult to handle. In the context of robotic motion planning, we deal with this question by computing the Penetration Depth (PD) between pairs of objects. The Penetration Depth between two objects is defined as the minimum translational distance to bring the objects out of collision, i.e, the length of the shortest movement vector that would separate the objects or bring them out of a collision when they find themselves in one. If the two objects

(36)

do not collide, the penetration depth is set to 0. The computation of the penetration depth between two objects is the single most important operation in computing collision avoidance constraints during trajectory optimization for motion planning. The computed penetration depth is used to enforce signed distance constraints (essentially the opposite of the penetration depth) at each via point of the discretized trajectory while solving a trajectory optimization problem with direct shooting based methods [6] [12] [8]. The optimizer computes gradients with respect to this constraint to force the trajectory to move in the direction of steepest descent in the cost function space, pushing a trajectory out of a collision when there is one.

Answering the aforementioned questions sometimes requires computing the closest points between two objects as an intermediate step. In the following subsections, we briefly describe the algorithms to check for an intersection between two objects, and for subsequently computing the penetration depth between them if they are found to be in collision. In the following sections, we briefly describe the algorithms for intersection tests and computing penetration depths for different pairs of object representations. For brevity, we limit ourselves to algorithms for different pairs of bounding volumes chosen to represent objects in this thesis (spheres, capsules andOBBs).

A feature matrix of implemented tests for intersection and subsequent computation of penetration depth in the case of intersection, between pairs of representations considered in this work is described in table2.1. We briefly describe the implemented tests and penetration depth computations for the 6 unique cases implemented in this thesis, paying special attention in terms of detail to the tests that are used for representing robotic objects of interest in the application scenario considered for bench-marking the proposed method (sphere-sphere and sphere-capsule, please see3.1for more details).

Bounding Volume Sphere Capsule OBB AABB

Sphere X

Capsule X

OBB X

AABB X X X X

Table 2.1 – Feature matrix for implemented primitive intersection tests and methods for calculating penetration depth for pairs of bounding volumes on

the GPU.

(37)

Sphere v/s Sphere

The intersection test and penetration depth computation between a pair of spheres is simple to understand and implement. Consider a sphere A defined by its center C1 and radius r1, and a sphere B defined by its center C2 and radius r2. A 2-D representation of this scenario is shown in figure 2.4. The spheres intersect if and only if the Euclidean distance between their centers is less than the sum of their radii. The magnitude of penetration depth, i.e, the magnitude of the vector that can be used to bring the two spheres out of a collision is equal to the difference between the Euclidean distance between the centers of the two spheres, and the sum of their radii.

Figure 2.4 – Pictorial representation of two spheres intersecting in 2-D. The spheres are defined by a center and a radius.

If we consider the set of all points enclosed by sphere A including its boundary by a set A, and the set of all points enclosed by sphere B including its boundary by another set B. The intersection test can be mathematically defined as A ∩ B 6= ∅ ⇐⇒ |C1− C2| ≤ r1+ r2.

The Euclidean distance between the two centers is defined as

|C1− C2| =p(C1.x − C2.x)2+ (C1.y − C2.y)2+ (C1.z − C2.z)2

(38)

where C1.x, C1.y and C1.z represent the co-ordinates of the center of sphere A along the x, y and z axes. Similarly C2.x, C2.y and C2.z represent the co-ordinates of the center of sphere B along the x, y and z axes.

P D = |C1− C2| − (r1 + r2) Sphere v/s Capsule

The intersection test and penetration depth computation for the sphere-capsule case slightly more involved than that of a sphere-sphere computation, but still relatively simple to implement. Consider a capsule C defined by radius rC and a medial axis line segment connecting the centers of the two hemispheres at its ends, the midpoint of which is designated by a point L. Consider also a sphere S defined by its center C and radius rS. A 2-D representation of this scenario is shown in figure2.5.

Figure 2.5 – Pictorial representation of a sphere and capsule intersecting in 2-D.

The intersection test for testing whether the sphere and capsule intersect involves computing whether the distance between the center of the sphere and the closest point on the medial line segment (closest is defined relative to the center of the sphere) is greater than the sum of the radii of the sphere and capsule. The book ’Real Time Collision Detection’ [3] serves as a good reference for the implementation of the distance between the closest point on

(39)

the medial line segment of the capsule (relative to the center of the sphere), and the center of the sphere. The code listing below demonstrates a concrete implementation of this distance computation in C++ for reference. For the sphere-capsule intersection test, Point a and Point b refer to the end points of the medial axis lie segment of the capsule, Point c refers to the center of the sphere.

Listing 2.1 – C++ code for computing the squared distance between a point and a line segment

/ / R e t u r n s t h e squared d i s t a n c e between p o i n t c and segment ab CUDA_HOSTDEV f l o a t SqD i s tPo int Se gme n t ( P o i n t &a , P o i n t &b , P o i n t &c ) {

Vector3 ab , ac , bc ;

f 3 s _ a s s i g n ( ab . components , f3v_sub ( b . c o o r d i n a t e s , a . c o o r d i n a t e s ) ) ; f 3 s _ a s s i g n ( ac . components , f3v_sub ( c . c o o r d i n a t e s , a . c o o r d i n a t e s ) ) ; f 3 s _ a s s i g n ( bc . components , f3v_sub ( c . c o o r d i n a t e s , b . c o o r d i n a t e s ) ) ;

f l o a t e = f 3 v _ d o t ( ac . components , ab . components ) ; / / Handle c a s e s where c p r o j e c t s o u t s i d e ab i f ( e <= 0 . 0 f )

r e t u r n f 3 v _ d o t ( ac . components , ac . components ) ;

f l o a t f = f 3 v _ d o t ( ab . components , ab . components ) ; i f ( e >= f )

r e t u r n f 3 v _ d o t ( bc . components , bc . components ) ; / / Handle c a s e s where c p r o j e c t s onto ab

r e t u r n f 3 v _ d o t ( ac . components , ac . components ) − ( e ∗ e / f ) ; }

Just like the previous case, if we consider the set of all points enclosed by sphere S including its boundary by a set S, and the set of all points enclosed by capsule C including its boundary by another set C. The intersection test can be mathematically defined as S ∩ C 6= ∅ ⇐⇒ |C − L| ≤ rC+ rS.

The Euclidean distance between the two centers is defined as

|L − C| =p(L.x − C.x)2+ (L.y − C.y)2+ (L.z − C.z)2

where C.x, C.y and C.z represent the co-ordinates of the center of sphere A along the x, y and z axes. Similarly L.x, L.y and L.z represent the co- ordinates of the midpoint of the medial axis segment of the capsule B along the x, y and z axes.

P D= |L − C| − (rC + rS) Capsule v/s Capsule

The intersection test and penetration depth computation for the capsule- capsule case is described here. Consider a capsule C1 defined by radius r1

(40)

and a medial axis line segment connecting the centers of the two hemispheres at its ends. Consider also a capsule C2 defined by its center C2 and radius r2. A 2-D representation of this scenario is shown in figure2.6.

Figure 2.6 – Pictorial representation of two capsules intersecting in 2-D.

Let the closest points on the medial axis segments of the capsules be L1

and L2. The intersection test for testing whether the capsule and capsule intersect involves computing whether the distance between the medial axis line segments of the two capsules is greater than the sum of the radii of the sphere and capsule. Once again, the book ’Real Time Collision Detection’

[3] serves as a good reference for the implementation of the distance between the closest points of the two medial axis line segments of the capsules. If we consider the set of all points enclosed by sphere C1 including its boundary by a set A, and the set of all points enclosed by capsule C2 including its boundary by another set B. The intersection test can be mathematically defined as A ∩ B 6= ∅ ⇐⇒ |L1− L2| ≤ r1+ r2.

The Euclidean distance between the two centers is defined as

|L − C| =p(L1.x − L2.x)2+ (L1.y − L2.y)2+ (L2.z − L2.z)2

where L1.x, L1.y and L1.z represent the co-ordinates of closest point of capsule C1(where closest is defined with respect to the other capsule C2) along

(41)

the x, y and z axes. Similarly L2.x, L2.y and L2.z represent the co-ordinates of the closest point of Capsule C2 along the x, y and z axes (where closest is defined relative to Capsule C1).

P D= |L1− L2| − (r1 + r2)

Oriented Bounding Box(OBB) v/sOriented Bounding Box(OBB) An intersection test for the OBB-OBBcase can be implemented in terms of what is known in literature as the separating axis test. This test is discussed in detail in [3] for the interested reader. For the purposes of this work, it is sufficient to note that two Oriented Bounding Boxes (OBBs) are separated (non-intersecting) if, with respect to some axis L, the sum of their projected radii is less than the distance between the projection of their center points (as illustrated in Figure2.7).

Figure 2.7 – Pictorial representation of the separating axis test in 2-D (Adapted from [3]).

If we consider the set of all points enclosed by box B1 including its boundary by a set A, and the set of all points enclosed by capsule B2including its boundary by another set B. Let T represent the distance between the projection of their center points, L represent a potential axis of separation, rArepresent the projection of the radius of box B1on axis L, and rBrepresent

(42)

the projection of the radius of box B2on axis L. Mathematically, the separating axis test reads the two boxes B1 and B2 are separated on axis L if |T · L| >

rA+ rB.

For OBBs it is possible to show that at most 15 of these separating axes must be tested to correctly determine if the twoOBBsintersect. These axes correspond to the three coordinate axes of box B1, the three coordinate axes of box B2, and the nine axes perpendicular to an axis from each. If the boxes fail to overlap on any of the 15 axes, they are determined to be non-intersecting. If no axis provides this early out, it follows that the boxes must be overlapping.

Computing the overlap of projections on the 15 axes of separation and taking the least of the computed values represents the penetration depth between the twoOBBs.

Oriented Bounding Box(OBB) v/s Sphere

The intersection test for a sphere and an Oriented Bounding Box (OBB) is fairly straight forward and follows as a natural extension to some of the tests described earlier. First, we find the closest point on theOBBrelative to the sphere and subsequently the distance between the closest point on the box and the center of the sphere. If this distance is less than the radius of the sphere, we have an intersection. The penetration depth is computed as the difference between the radius of the sphere and the distance between the closest point and the center of the sphere.

Oriented Bounding Box(OBB) v/s Capsule

The intersection test for a capsule and anOriented Bounding Box (OBB) is slightly more involved than the OBB-Sphere case. The first step involves computing the closest point between the medial axis line segment of the capsule and the center of the OBB, just like in the sphere-capsule case. Once we have this point on the medial axis line segment, the intersection test and penetration depth are identical to the OBB-sphere case.

2.3 Parallel Computing

Parallel computing refers to a computing paradigm wherein many computations are carried out simultaneously during the same instant in time. The key idea is that large problems can be divided into smaller ones, which can often be

(43)

solved independently of each other simultaneously. A computational task is typically broken down into several smaller and often similar sub-tasks that can be processed independently and whose results can be combined afterwards, upon completion. There are many levels at which parellelism can be achieved, a few of them include - bit-level, instruction level, data level and task level.

Bit-level refers to a kind of parallelism based on increasing word size. It refers to the the observation that increasing the word size reduces the number of instructions the processor must execute in order to perform an operation on variables whose sizes are greater than the length of the word. At the instruction level, parallelism entails the simultaneous execution of two or more instructions of the same or different streams at the same instant in time.

Data parallelism refers to the kind of parallelism often exploited in GPUs, wherein the same instruction can be performed on multiple independent data streams simultaneously in parallel. Finally, task-level parallelism refers to the parallelization of computer code across multiple processors in parallel computing environments. Task parallelism focuses on distributing tasks such that they are concurrently performed by processes or threads across different processors. In contrast to data parallelism which involves running the same task on different components of data, task parallelism is distinguished by running many different tasks at the same time on the same data. An example of task parallelism is pipelining, wherein a single set of data is moved through separate tasks and each task can be executed independently from the others.

Parallelism has long been employed in high-performance computing, but recently, ever since we hit the architectural power wall that prevents further frequency scaling at the beginning of the 21stcentury, parallel computing has gained broader interest and emerged as the defacto paradigm of choice for improving performance in modern processors, mainly in the form of multi- core processors.

2.3.1 Single Program Multiple Data (SPMD)

Single Program Multiple Data(SPMD) is the most common style of parallel programming employed to achieve parallelism, wherein tasks are split into smaller sub-tasks which are then simultaneously run on separate input in order to obtain results faster. SPMD is a subset of the category Multiple Input Multiple Data (MIMD) As opposed to , wherein multiple processors

(44)

operate on the same program at different points in lockstep, SPMDemploys multiple autonomous processors to simultaneously execute the same program at independent points. WithSPMD, tasks can be executed on general purpose CPUswhereas SIMD requires vector processors to manipulate data streams.

Distributed Memory

A distributed memory computer consists of a collection of independent computers, called nodes. Each node starts its own program and communicates with other nodes by sending and receiving messages, calling send/receive routines for that purpose. The messages can be sent by a number of communi- cation mechanisms, such as TCP/IP over Ethernet or other specialized high- speed interconnects. Serial sections of the program are implemented by identical computation on all nodes rather than computing the result on one node and sending it to the others. SPMD usually refers to message passing programming on distributed memory computer architectures, commonly impl- emented using a standard such asMessage Passing Interface(MPI) [32].

Shared Memory

A shared memory model of parallel computing refers to a computer with several CPUs that can access the same memory space. Communication between processes executing on separateCPUscan occur through the use of shared memory mechanisms. The Single Program Multiple Data (SPMD) paradigm on a shared memory machine is usually implemented by standard (heavyweight) processes. This is the approach we follow in this thesis as well.

2.3.2 Amdahl’s law

A theoretical upper bound on the speed-up of a single program as a result of parallelization is given by Amdahl’s law. Ideally, the speedup from parallelization should be linear, doubling the number of processing elements should halve the run-time, and doubling it a second time should result in a run-time that is a quarter of what i originally was. However, in practice, very few parallel algorithms achieve this ideal speedup. Most of them have a near- linear speedup for small numbers of processing elements which flattens out into a constant value for large numbers of processing elements. Amdahl [33]

came up with a formula to calculate the potential speedup of an algorithm on a parallel computing platform.

(45)

Amdahl’s law says that Slatency(s) = 1/((1 − p) + p/s)

Where; Slatency is the potential speedup in latency of the execution of the whole task, s is the speedup in latency of the execution of the parallelizable part of the task and p is the percentage of the execution time of the whole task concerning the parallelizable part of the task before parallelization.

Since Slatency < 1/(1 − p), it shows that even a tiny fraction of the program which cannot be parallelized will limit the overall speedup available from parallelization. A program solving a large mathematical or engineering problem will typically consist of several parallelizable parts and several non- parallelizable (serial) parts. If the non-parallelizable part of a program accounts for 10% of the runtime (p = 0.9), we can get no more than a 10 times speedup, regardless of how many processors are added. This puts an upper limit on the usefulness of adding more parallel execution units.

2.3.3 Gustafson’s law

Amdahl’s law only applies to cases where the problem size is fixed. In practice, as more computing resources become available, they tend to get used on larger problems (larger data-sets), and the time spent in the parallelizable part often grows much faster than the inherently serial work. In this case, Gustafson came up with a less pessimistic and more realistic assessment of parallel performance [34]. Gustafson’s law calculates the theoretical speedup to be

Slatency(s) = 1 − p + sp

Both Amdahl’s law and Gustafson’s law assume that the running time of the serial part of the program is independent of the number of processors.

Amdahl’s law assumes that the entire problem is of fixed size so that the total amount of work to be done in parallel is also independent of the number of processors, whereas Gustafson’s law assumes that the total amount of work to be done in parallel varies linearly with the number of processors.

2.3.4 GPGPU

Ordinarily, aGPUis a specialized processor used for rendering graphics on a screen in quick time. General Purpose Computation on Graphics Processing

(46)

Units (GPGPU) refers to the use of a Graphics Processing Unit (GPU) for general purpose computations outside the purview of its traditionally intended purpose of graphics processing. Algorithms well-suited to execute on a GPUimplementation usually exhibit two common properties: they tend to be data parallel and throughput intensive. Data parallel means that a processor can execute the operation on independent data elements simultaneously, ideally without having to synchronize in between execution. In simple terms, throughput intensive means that we care more about processing more data in a given quantum of time, as opposed to processing a small amount of data relatively quickly. A throughput intensive applications usually has a lot of data that needs be processed, making it ideal forGPUsto process many independent pieces of data in parallel. For applications that exhibit these two key properties, GPUsachieve extremely performance by incorporating thousands of relatively simple processing units to operate on many data elements simultaneously.

In terms of architecture,GPUstend to differ substantially from traditional CPUs. CPUs are usually built for latency driven applications and responsiveness.

This leads to a very large proportion of resources on the CPU die being dedicated to make single streams of instructions run as fast as possible, including multiple layers of cache memory to hide latency from main memory and complex control logic to support pipe-lining, branch prediction and out of order execution). GPUstake a fundamentally different approach, choosing to pack the available area with thousands of tiny compute units that are capable of executing a single instruction stream on many data elements simultaneously.

Figure2.8shows this difference in a simple pictorial fashion. While theCPU has a lot of space dedicated to complex control logic and more complicated Arithmetic and Logic Units (ALUs), the GPU uses the same space for a very large number of simpler ALUs and dedicates a much lesser amount of space for complex control logic. Memory latency is a GPU is hidden through accelerated context switching, and the cache occupies a relatively tinier fraction of space on the die.

CUDA

In this section, we briefly describe theCompute Unified Device Architecture (CUDA) and the CUDA programming model. A more complete reference for the interested reader is available in [35]. Before the introduction of CUDA by NVIDIA in 2006, only users with a very specialized knowledge about the internal hardware ofGPUscould use them for purposes other than

References

Related documents

nk is the delay between input and output and e(t) is the noise. na, nb and nk are chosen in the identification process. The identified model can for example be used for prediction of

Standard 2 - Specific, detailed learning outcomes for per- sonal and interpersonal skills, and product, process, and system building skills, as well as disciplinary

För det tredje har det påståtts, att den syftar till att göra kritik till »vetenskap», ett angrepp som förefaller helt motsägas av den fjärde invändningen,

Section 4.4 highlighted the need to integrate already existing safety functions like ABS and SSC along with the CADA functions. It is interesting to observe how these

Chapter 3 covers collision detection specifics, such as different ways to implement middle phase collision detection and an important theorem often used for collision detection

This includes an introduction to the power flow problem, how the Newton-Raphson method is applied to the power flow problem, and different approaches on how to solve the

[r]

Den naturliga beständigheten och fukttrögheten hos furukäma och gran kan vara ett bra komplement till andra åtgärder (konstruktion, fuktawisade behandling) när man söker ett