
DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2016

Parallelizing Map Projection of Raster Data on Multi-core CPU and GPU Parallel Programming Frameworks

DANIEL CHAVEZ

KTH ROYAL INSTITUTE OF TECHNOLOGY

Parallelizing Map Projection of Raster Data on Multi-core CPU and GPU Parallel Programming Frameworks

DANIEL CHAVEZ

Master's Thesis at CSC
Supervisor: Stefano Markidis
Examiner: Erwin Laure

Abstract

Map projections lie at the core of geographic information systems and numerous projections are used today. The reprojection between different map projections recurs in a geographic information system and it can be parallelized with multi-core CPUs and GPUs. This thesis implements a parallel analytic reprojection algorithm for raster data in C/C++ with the parallel programming frameworks Pthreads, C++11 STL threads, OpenMP, Intel TBB, CUDA and OpenCL.

The thesis compares the execution times of the different implementations on small, medium and large raster data sets, where OpenMP had the best speedups of 6, 6.2 and 5.5, respectively. Meanwhile, the GPU implementations were 293 % faster than the fastest CPU implementations, and profiling shows that the CPU implementations spend most of their time in trigonometric functions. The results show that the reprojection algorithm is well suited for the GPU, while OpenMP and Intel TBB are the fastest of the CPU frameworks.

Parallelization of map projection of raster data on multi-core CPU and GPU programming frameworks

Map projections are a central part of geographic information systems and a vast number of map projections are in use today. Reprojection between different map projections occurs regularly in a geographic information system and it can be parallelized with multi-core CPUs and GPUs. This master's thesis implements a parallel and analytic reprojection of raster data in C/C++ with the frameworks Pthreads, C++11 STL threads, OpenMP, Intel TBB, CUDA and OpenCL.

The thesis compares the execution times of the different implementations on three rasters of varying size, where OpenMP had the best speedups of 6, 6.2 and 5.5. The GPU implementations were 293 % faster than the fastest CPU implementations, where profiling shows that the latter spent most of their time in trigonometric functions. The results show that the GPU is best suited for reprojection of raster data, while OpenMP is the fastest among the CPU frameworks.

Contents

1 Introduction
  1.1 Motivation
  1.2 Limitations
  1.3 Contribution
  1.4 Related work
    1.4.1 Map projection
    1.4.2 Comparison of frameworks
  1.5 Outline

2 Background
  2.1 Reprojection
    2.1.1 Map projection
    2.1.2 Reprojection
    2.1.3 Raster data
    2.1.4 Reprojection in GIS
  2.2 Shared-memory systems
    2.2.1 Threads
    2.2.2 Tasks
    2.2.3 Speedup
    2.2.4 Hardware limitations
    2.2.5 SIMD
    2.2.6 CPU Architecture
    2.2.7 GPU Architecture
  2.3 Parallel programming frameworks
    2.3.1 Pthreads
    2.3.2 C++11 STL threads
    2.3.3 OpenMP
    2.3.4 Intel TBB
    2.3.5 CUDA
    2.3.6 OpenCL

3 Method
  3.1 Reprojection algorithm
  3.2 Parallelization
    3.2.1 Pthreads
    3.2.2 C++11 STL threads
    3.2.3 OpenMP
    3.2.4 Intel TBB
    3.2.5 CUDA
    3.2.6 OpenCL

4 Benchmark Environment
  4.1 Data
  4.2 Verification
  4.3 Hardware
  4.4 Compilers

5 Results
  5.1 Sequential implementation
  5.2 CPU frameworks
    5.2.1 Pthreads
    5.2.2 C++11 STL threads
    5.2.3 OpenMP
    5.2.4 Intel TBB
  5.3 GPU frameworks
  5.4 Overview of all frameworks
  5.5 Memory usage and CPU load

6 Discussion
  6.1 Methodology
  6.2 CPU frameworks
    6.2.1 Threading frameworks
    6.2.2 OpenMP
    6.2.3 Intel TBB
  6.3 GPU frameworks

7 Conclusions
  7.1 Future work

Bibliography

Chapter 1

Introduction

Representing the surface of a 3-dimensional sphere on a 2-dimensional plane cannot be done without distortion. For instance, we cannot lay the peel of an orange flat on a table without tearing and stretching the peel. The same problem occurs when cartographers try to represent the surface of Earth on a paper map or on a computer screen. Nevertheless, cartographers have created a number of mathematical formulas, called map projections, to deal with this problem.

A basic feature of a geographical information system (GIS) is to provide a way for the user to visualize several maps at the same time, either as layers or beside each other. In order to avoid misalignments between maps, all maps must be projected by the same map projection. This creates a new task for the GIS to perform before visualization: all maps have to be transformed, or reprojected, to the same map projection before connecting them into a single map. In desktop GIS applications, the reprojection should be performed as fast as possible.

Besides misalignments, it is crucial to consider the maps’ projections before doing any analysis or comparisons on the maps. Analysing maps projected by different map projections will lead to incorrect conclusions. The choice of which map projection to reproject to is usually defined by the user and depends on the region, country or continent being analysed.

In programming, maps are represented as a raster data structure and the rasters are reprojected coordinate by coordinate. The coordinates in a raster can be reprojected independently of each other, which makes the problem embarrassingly parallel.

This thesis will implement the reprojection of three rasters in C++ on a number of parallel programming frameworks on both multi-core CPUs and GPUs. The implementations will use a commercial GIS toolkit for reading and writing to raster data files.

1.1 Motivation

Map projection and reprojection are relevant in a wide array of geographic applications, ranging from geolocation services to geographic information systems. With the recent development of multi-core processors and GPUs, the parallelization of map projection and reprojection algorithms becomes relevant.

The purpose of this thesis is to analyze the execution time of the different implementations of the reprojection algorithm. The goals of this thesis are:

• Implement the reprojection algorithm in the following frameworks:

– Pthreads, a low-level API for parallel programming

– C++11 STL threads, standard library support for parallel programming

– OpenMP, a high-level API for parallel programming

– Intel TBB, a high-level and portable library for parallel programming

– NVIDIA CUDA, an extension of the C language that targets the GPU

– OpenCL, a specification for frameworks that target the GPU

• Compare tuning parameters specific to each framework

• Identify bottlenecks in the algorithm and in the implementations

• Compare the execution times of the implementations

These goals give rise to the following question: Which CPU and GPU parallel programming frameworks achieve low execution times on the reprojection algorithm?

1.2 Limitations

The implementations in this thesis are portable and they do not fine-tune for any specific compiler or hardware. Only multi-core CPU and GPU parallelization with the frameworks mentioned in Section 1.1 will be considered. While parallel programming is also possible on a distributed-memory architecture, this thesis will only consider shared-memory architectures.

The geographical data used in the experiments are three sets of categorical raster data consisting of 8-bit grayscale pixels; geographical data represented as vector data is not considered. The reprojection implemented in this thesis uses the Lambert conformal conic projection.

1.3 Contribution

This thesis focuses on analytic reprojection, which is relevant for geodesy and any application that requires high precision of projected maps. While alternative approaches to reprojection exist, this thesis uses a precise mathematical approach in order to focus on the parallelization.

To the author's knowledge, there are no publications regarding the implementation of map projections and their parallelization. This thesis contributes by giving a comparison of different implementations, which can be used as a reference for other implementations. The conclusions drawn in this thesis show important consequences that hardware can have on any parallelized reprojection algorithm.

1.4 Related work

The subject of this thesis can be separated into two different research areas: it applies parallel computing, a growing research subject, to map projections, a large part of geodesy and GIScience.

Performance comparisons of parallel programming frameworks are usually done with benchmarks consisting of popular real-world algorithms, while this thesis uses the reprojection algorithm as a benchmark.

1.4.1 Map projection

In [1] an alternative approach to reprojection, used by ArcGIS, is presented. The demand for fast reprojections requires faster methods, and in this paper coordinates are reprojected by building a third-order polynomial using least squares fitting. The results show accurate and faster projections than the affine transformation approach.

The books [2], [3] and [4] show all the different aspects of a geographical information system. They all define map projection as a key part of geographical information systems and they show the need for the hundreds of map projections used today. These books, together with most of the work done in map projections, use the work of John P. Snyder [5] as a foundation for the definitions of map projections.

1.4.2 Comparison of frameworks

In [6] an execution time comparison of OpenMP and OpenCL is done on three different algorithms. OpenCL has shorter execution time than OpenMP on two algorithms. The results also show that the speedup gained from OpenCL depends on the input size, i.e. OpenCL’s background work has a considerable overhead.

In [7] the execution times of three algorithms are compared on several programming frameworks: Pthreads, OpenMP, Intel Cilk++, Intel TBB, Intel ArBB, SMPSs, SWARM and FastFlow. The results show that Intel ArBB, SWARM and Pthreads had the best performance. The Intel TBB library performed well for a specific number of cores and a specific input size.

In [8] the execution time of Intel TBB, Cilk Plus and OpenMP is compared on three parallel algorithms. The authors also measure the relation between the speedup achieved and the number of threads used by the libraries. The results show that Intel TBB and Cilk Plus were slightly faster than OpenMP. Loop vectorization supported by Cilk Plus and OpenMP gave significant performance boosts.

In [9] CUDA and OpenCL are compared on identical kernels. Both the time needed for memory transfers from main memory to GPU memory and the kernel execution times were measured. CUDA performed better on both measurements. The wide hardware support and portability of OpenCL may be the reason for its worse performance.

In [10] CUDA and OpenCL are compared on 16 benchmarks ranging from synthetic to real-world applications. The results show that CUDA performs 30 % better than OpenCL; however, the authors show that this difference in performance is due to unfair comparisons. When a fair comparison is defined, the portability of OpenCL does not fundamentally hurt its performance. The authors conclude that OpenCL can be a good alternative to CUDA.

1.5 Outline

In Chapter 2, map projections and reprojections are formally defined together with the two map projections used in this thesis. The chapter also explains how geographical data is represented with rasters and the importance of reprojection in geographical information systems. The chapter ends with an overview of shared-memory systems and the parallel programming frameworks used in this thesis.

Chapter 3 shows the methodology and what is being measured during the benchmarks. The implementations of the reprojection algorithm on each parallel programming framework are also presented. Chapter 4 shows the testing environment used in the benchmarks.

In Chapter 5 the results of the benchmark are presented. These include scalability results, speedup and profiling information. Some tuning parameters specific to each parallel programming framework are also presented.

The methodology and the results of the thesis are discussed in Chapter 6, and the conclusions of this thesis are presented in Chapter 7, together with some ideas that can be applied to future research.

Chapter 2

Background

2.1 Reprojection

A point on Earth is defined as its geographical coordinates (φ, λ, h), where φ is latitude, λ is longitude and h is the vertical distance from the Earth’s surface. For mathematical purposes, reference ellipsoids are used to fit Earth and approximately represent the true shape of Earth. Hence, the geographical coordinates (φ, λ, h) are given together with their reference ellipsoid [4].

The geographical coordinates in this thesis are represented as (φ, λ), where the height is hidden for brevity, although the coordinates are still referred to as being 3-dimensional coordinates.

2.1.1 Map projection

A map projection is a mathematical formula which projects or translates the Earth's surface to a flat surface. It projects any 3-dimensional coordinate $(\varphi, \lambda)$ on the Earth's surface to a two-dimensional coordinate $(x, y)$ on a flat surface. The process can also be reversed, going from a flat surface to a spherical surface, known as the inverse projection. We define a map projection as
\[
\begin{cases}
x = f(\varphi, \lambda) \\
y = g(\varphi, \lambda)
\end{cases}
\qquad \text{and} \qquad
\begin{cases}
\varphi = f^{-1}(x, y) \\
\lambda = g^{-1}(x, y)
\end{cases}
\]
as its forward and inverse projections, respectively [4].

Distortion from map projections

The definition of any map projection is not free from errors. The fundamental problem is that a map projection has to approximate the true shape of Earth [2]. The approximation of the shape of Earth is done through a mathematical model of Earth, called a datum surface [4].

In theory, one could define an infinite number of datum surfaces, resulting in an infinite number of map projections. In practice, the challenge is to find which map projection gives a good approximation of the true shape of Earth for a particular region. Here, "good" depends on which cartographic quantity the map-user requires. The three fundamental cartographic quantities on a projected map are area, shape and distance. Since no map projection can simultaneously preserve all quantities, the choice of map projection becomes important [4].

Plate Carrée projection

The Plate Carrée projection simply maps longitude to $x$ and latitude to $y$:
\[
\begin{cases}
x = \lambda \\
y = \varphi
\end{cases}
\]

The projection results in a distorted image of Earth, as seen in Figure 2.1. Although the top and bottom edges of the map appear smeared out, the projection is often used for satellite data of the entire Earth [3]. The only cartographic quantity that the Plate Carrée projection preserves is the distance between points along the Equator. The shape and area properties are distorted and the projection is not ideal for analysis [3].

Figure 2.1. Earth projected by the Plate Carrée projection.

Lambert conformal conic map projection

In the Lambert conformal conic (LCC) projection, the map is assumed to be a cone wrapped around the datum surface. It uses an ellipsoidal model of Earth as a datum surface [4].

The LCC forward projection is defined [5] as
\[
\begin{cases}
x = \rho \sin[n(\lambda - \lambda_0)] \\
y = \rho_0 - \rho \cos[n(\lambda - \lambda_0)]
\end{cases}
\]
where
\[
\rho = a F t^{n}, \qquad
\rho_0 = a F t_0^{\,n}, \qquad
n = \frac{\ln(m_1) - \ln(m_2)}{\ln(t_1) - \ln(t_2)}, \qquad
m = \frac{\cos\varphi}{\left(1 - e^2 \sin^2\varphi\right)^{1/2}},
\]
\[
t = \frac{\tan\!\left(\frac{\pi}{4} - \frac{\varphi}{2}\right)}{\left[(1 - e\sin\varphi)/(1 + e\sin\varphi)\right]^{e/2}}, \qquad
F = \frac{m_1}{n\, t_1^{\,n}}
\]
and $t_0$ is obtained from the formula for $t$ using $\varphi_0$. In the same way, $m_1$ and $m_2$ are obtained from the formula for $m$ using $\varphi_0$. The remaining constants depend on the datum surface:

Origin latitude: $\varphi_0 = 52°$, origin longitude: $\lambda_0 = 10°$, semi-major axis of the reference ellipsoid: $a = 6378137$ m, and eccentricity: $e = 0.08181919$.

The LCC inverse projection, used by the implementations in this thesis, is defined as
\[
\begin{cases}
\varphi = \dfrac{\pi}{2} - 2\arctan\!\left(t\left[\dfrac{1 - e\sin\varphi}{1 + e\sin\varphi}\right]^{e/2}\right) \\[2ex]
\lambda = \dfrac{\theta}{n} + \lambda_0
\end{cases}
\]
where
\[
t = \left(\frac{\rho}{aF}\right)^{1/n}, \qquad
\rho = \left[x^2 + (\rho_0 - y)^2\right]^{1/2} \qquad \text{and} \qquad
\theta = \arctan\!\left(\frac{x}{\rho_0 - y}\right).
\]

Earth projected by LCC is shown in Figure 2.2 below.

Figure 2.2. Earth projected by the Lambert conformal conic projection.

The cartographic quantity that LCC preserves is direction (conformality). Thus, it is often used by airplane pilots and in sea navigation. Any straight line in an LCC projected map corresponds to a great circle distance, i.e. the shortest distance between two points on the surface of a sphere [4].

2.1.2 Reprojection

Carefully choosing a specific map projection is the first step towards a correct map analysis. However, geographic data often comes from different sources and all of it must be projected by the same map projection. The process of converting geographic data through projections and inverse projections in order to achieve a common map projection is called reprojection [2].

More formally, we define reprojection as converting from one map projection to another. This is a two-step process. Given a 2-dimensional coordinate $(x_p, y_p)$ in some map projection $p$:

1. Inverse project it to a 3-dimensional coordinate $(\varphi, \lambda)$ with $f_p^{-1}(x_p, y_p)$

2. Project $(\varphi, \lambda)$ to the desired map projection $q$ with $f_q(\varphi, \lambda)$

From step 2 we obtain $(x_q, y_q)$ and the coordinate has been reprojected from $p$ to $q$.

2.1.3 Raster data

One way of storing geographical data (e.g. maps) is in a raster. A raster is a matrix of cells where each cell may contain colour values, temperature values, elevation values or other types of information [2]. In Figure 2.3 below, the raster to the right represents a park with a lake and a beach where the cells correspond to a colour, e.g. 0 is green.

Figure 2.3. A park with a lake and a beach represented as a map and as a raster.

The raster in Figure 2.3 does not show where this lake is located on Earth, nor does it say which map projection was used or what the values 0, 1, 2 mean. This information, called georeferencing metadata, is given together with the raster, for example in the form of GeoTIFF .tif and World .tfw files. These files include the map projection, coordinates for the corner pixels, colour tables and more.

The georeferencing metadata gives geographical meaning to the coordinates in a raster. For example, the first element in the raster has an index of (0, 0), but after adding offsets from the metadata, (0 + eastOffset, 0 + northOffset), the element represents a location on Earth.
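As a small illustration of this idea, the sketch below applies hypothetical metadata offsets (the names are taken from the example above) to turn a raster index into a georeferenced coordinate; real GeoTIFF and World files also carry cell size and rotation terms, which are omitted here.

// Simplified georeferencing: index (row, col) -> coordinate in the raster's map projection.
struct GeoMetadata {
    double eastOffset;   // x of the first column
    double northOffset;  // y of the first row
};

struct Coordinate { double east, north; };

Coordinate georeference(int row, int col, const GeoMetadata& meta)
{
    // The example in the text: element (0, 0) maps to (0 + eastOffset, 0 + northOffset).
    return Coordinate{ col + meta.eastOffset, row + meta.northOffset };
}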

2.1.4 Reprojection in GIS

Map projections and reprojections have considerable disadvantages and limitations. Why are they used in GIS, and are there any alternative methods? There are well-defined references, such as the axis of rotation and the centre of mass, which accurately define latitudes and longitudes. Locations on Earth defined in longitudes and latitudes are suitable for analysis and calculations.

However, GIS inherit techniques for working with geographic data which existed long before digitalization. These techniques work with flattened surfaces of Earth, so working with 2-dimensional surfaces is inevitable in GIS [3].

There are three main methods [11] used for flattening or projecting 3-dimensional surfaces:

• Analytical transformation projects through mathematical formulas and is the slowest (most computationally intensive) but most precise of the three.

• Affine transformation is commonly used in GIS. It uses translations, scaling and rotations in order to switch between coordinate systems. Since the operations are based on 2-dimensional surfaces, it is only used for reprojection and not for map projection.

• Polynomial transformation reprojects a few coordinates analytically and builds a third-order polynomial that represents the transformation of the analytically reprojected coordinates. The remaining coordinates are projected using the polynomial.

In Figure 2.4 below, a simplified data flow of a GIS is shown. Note that reprojection is a key part of the process. The reason that GIS must address the reprojection of maps is the large number of map projections used in practice. If only one standardized map projection existed, there would be no need for the reprojection step. In fact, the data flow would be less complicated if all geographic data were of the same type, format, scale, resolution and map projection. This is one of the reasons why effort has been made in recent years to universally standardize data types and exchange formats [3].

Figure 2.4. A simplified part of the data flow inside a geographic information system: paper maps, satellite data and digital data feed into editing and cleaning, reprojection and generalization, which populate a geographic database used for visualization.

However, as we saw in Section 2.1.1 above, no single map projection can preserve all three cartographic quantities. Hence, the reprojection step will remain as a key part of the GIS data flow [3].

2.2 Shared-memory systems

The parallel programming frameworks used in this thesis are designed for shared-memory multiprocessors (SMP). An SMP, see Figure 2.5, offers the programmer a single physical address space shared by all processors, and the communication between processors is done implicitly through the shared memory [12]. Programming on an SMP is done via threads or higher level abstractions like tasks, built on top of threads [13].

Figure 2.5. The structure of a shared memory multiprocessor: several CPUs connected to a single shared memory.

2.2.1 Threads

Threads are light-weight processes. Multiple threads run inside a process and each thread has its own stack. At any point in time, a thread is running on a single CPU core [14]. For example, in Figure 2.5 four threads can run simultaneously.

If a program creates more threads than the cores available, the operating system's scheduler will give time slices to all processes and distribute them among cores, a method known as context switching. When too many threads are waiting for their time slice, the operating system spends many resources on context switching instead of actually executing the process; this problem is known as oversubscribing the processor [15].

2.2.2 Tasks

Tasks are more light-weight than threads: they do not need a time slice and they do not explicitly participate in the operating system's context switching. Multiple tasks can be executed by one thread, resulting in less scheduling overhead and avoiding oversubscription. The mapping of tasks to threads during runtime is flexible and a task scheduler can balance the workload of the threads, a method known as load balancing [16].

2.2.3 Speedup

The speedup $S$ gained from parallelizing a sequential program is defined as
\[
S = \frac{T_{sequential}}{T_{parallel}}
\]
where $T_{parallel}$ and $T_{sequential}$ are the times it takes to finish the program with and without parallelization [14].

Amdahl's law states that the speedup $S$ is bounded by the time it takes to execute the sequential part of the program, because not all parts of the program can be parallelized. The speedup $S$ in Amdahl's law is defined as
\[
S = \frac{1}{\dfrac{T_{parallel}}{n} + T_{sequential}}
\]
where $n$ is the number of cores available. Given an infinite number of cores, i.e. $n \to \infty$, the speedup will be bounded by the sequential part [14].
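As a small worked example (the fractions below are illustrative, not measured values): if the parallelizable part accounts for $T_{parallel} = 0.95$ of the normalized run time and the sequential part for $T_{sequential} = 0.05$, then on $n = 4$ cores
\[
S = \frac{1}{\frac{0.95}{4} + 0.05} \approx 3.5,
\]
and even with $n \to \infty$ the speedup is bounded by $1/0.05 = 20$.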

2.2.4 Hardware limitations

Programs consist of algorithms and calculations which can be bounded by hardware, and their overall performance is improved with better hardware. A calculation can be compute-bound, memory-bound or I/O-bound.

Compute-bound (also known as CPU-bound) calculations are bounded by the speed of the processor. In these calculations, the processor is usually working at its full capacity. Memory-bound calculations are bounded by the size of the memory and the bandwidth of memory transfers. Similarly, I/O-bound calculations are bounded by the speed of I/O transfers, for example reading and writing to disk [17].

2.2.5 SIMD

Flynn’s taxonomy is used to compare different architectures. The classification of computer architectures is done in terms of number of instructions and number of data streams. Parallel systems are variations of SIMD (single instruction multiple data) systems. A SIMD system can apply an instruction on multiple data streams simultaneously. The type of parallelism which applies the same instruction on dif- ferent parts of data, for example simultaneous operations on different parts of an array, is called data-parallelism. SIMD systems are ideal for data-parallelism [14].

2.2.6 CPU Architecture

The different CPUs in Figure 2.5 above work independently; at any point in time they are executing different instructions on different data streams. This kind of architecture is called MIMD (multiple instruction multiple data) and it is the kind of architecture we see today in dual- and quad-core desktop machines [18].

CPUs are equipped with different levels of high-speed cache memories to minimize long trips to the main memory, see Figure 2.6 below.

Figure 2.6. A more detailed structure of a shared memory multiprocessor: each CPU has private L1-L2 caches, the CPUs share an L3 cache, and all are connected to main memory.

Cache memories are efficient in a SISD (single instruction single data) system. On MIMD systems, however, the parallel nature of thread execution brings along a number of problems [18]. The data inside caches are in effect copies of the data inside main memory, and since each processor has its own local cache, the same copies of data are duplicated in each cache. The problem of cache coherence occurs when one of the processors updates its local cache and the neighbouring processors' local caches become outdated. Clever protocols deal with this problem, but caches are still subject to false sharing [19].

False sharing occurs when two or more processors are updating different parts of memory that lie close to each other (in the same cache line). The program and the processors have an illusion of data being shared without any conflicts, when in fact the cache coherency protocol is constantly invalidating caches (also known as cache ping-pong). False sharing is common when threads work on relatively small arrays with small types of elements, like bytes. Padding between data and assuring that threads are working on parts that lie far enough in memory are ways to avoid false sharing [19].
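A common way to apply the padding mentioned above is to align each thread's data to its own cache line. The sketch below is illustrative: the structure and variable names are hypothetical, and the 64-byte figure is an assumption about the cache line size, which is hardware-dependent.

#include <thread>
#include <vector>

// Each counter occupies its own 64-byte cache line, so threads that update
// neighbouring counters do not invalidate each other's cache lines.
struct alignas(64) PaddedCounter {
    long value = 0;
};

int main()
{
    const unsigned int numThreads = std::thread::hardware_concurrency();
    std::vector<PaddedCounter> counters(numThreads);
    std::vector<std::thread> threads;
    for (unsigned int t = 0; t < numThreads; ++t) {
        threads.emplace_back([&counters, t] {
            for (int i = 0; i < 1000000; ++i)
                counters[t].value++;   // one cache line per thread: no false sharing
        });
    }
    for (auto& th : threads)
        th.join();
    return 0;
}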

2.2.7 GPU Architecture

GPUs today work as a variation of a SIMD system called SIMT, single instruction multiple threads. The "SI" part of SIMT is not a single instruction; instead it is a kernel defined by the programmer. A kernel can be seen as a number of instructions or a function. Threads execute the kernel in a sequential-lockstep SIMD fashion. The number of threads available for kernel execution is in the order of thousands to tens of thousands [18]. Note the simplified design of the GPU architecture in Figure 2.7 below.

Figure 2.7. A simplified view of the GPU architecture: a GPU consists of several Streaming Multiprocessors (SM) with access to the GPU memory.

Here, data operations are done inside the GPU's memory and not explicitly in main memory. The data flow in the GPU is

1. Host-to-device, data is transferred from main memory to the GPU

2. Kernel execution, data is modified inside the GPU

3. Device-to-host, data is transferred from the GPU to main memory

where steps 1. and 2. are known as the "GPU overhead".

In Figure 2.7 above, the kernels are executed inside the Streaming Multiprocessors (SM). The GPU is essentially a number of SMs, which in turn contain cores or Streaming Processors (SP). For example, NVIDIA's Fermi GPU architecture contains 32 SPs per SM, and the number of SPs and SMs in GPUs today is increasing [18].

The contents of a single SM are shown in Figure 2.8 below. Recall that kernels are executed by SPs, which in CPU terminology can be seen as cores. At any point in time during kernel execution, all SPs in one SM are running the same single instruction. Updates of data in private registers and shared memory run at the same speed as the SPs, resulting in zero waiting time for reading and writing of memory [18].

Figure 2.8. The contents of a single Streaming Multiprocessor: a number of Streaming Processors (SP) together with a fast shared memory.

The execution patterns of the SPs are different from CPU threads. All SPs in an SM execute instructions as a unit, while CPU threads could be at any point in the program if no synchronization is present. However, the SPs' lockstep execution can collide with the instructions in the source code. If any branching instruction appears in the code, some SPs will diverge from the rest and no longer follow the same execution path. For example, consider the GPU code which branches for even numbers:

if (myNum % 2 == 0) {
    ...
}

If half of the SPs have an even myNum value in their registers, they will enter the body of the if-statement. The GPU is forced to stall all the remaining SPs with odd myNum until all SPs can continue running as a unit. The main reason for the GPU to prefer a single flow of execution is its efficient coalesced bursts of reads and writes to memory. This is known as branch divergence, and branching instructions can hurt the performance of the GPU [18].

2.3 Parallel programming frameworks

Below we give an overview of different programming frameworks for parallelizing the reprojection problem.

2.3.1 Pthreads

POSIX threads, or Pthreads, specify an API for multithreaded programming. The API is a low-level C library providing the programmer with full control over thread creation and memory synchronization, at the expense of difficult and error-prone programming. Threads are created with the pthread_create(..., f) function call, which spawns a thread with the lifetime of the provided function f. Note that the master thread creates threads in a non-blocking manner; synchronization must be stated explicitly. Pthreads provide several synchronization mechanisms for this reason, for example barriers, mutex locks and condition variables [14] [20].

The number of threads used by Pthreads is user-defined in the implementation. The library provides function calls and global variables for changing the default thread stack size and scheduling policies [21].
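A minimal sketch of the create/join pattern described above is shown below; the worker function and the thread count are illustrative, and synchronization beyond pthread_join is omitted.

#include <pthread.h>
#include <cstdio>

// Worker with the lifetime of the spawned thread; the argument points to the thread's ID.
void* worker(void* arg)
{
    int id = *static_cast<int*>(arg);
    std::printf("thread %d running\n", id);
    return nullptr;
}

int main()
{
    const int numThreads = 4;          // illustrative thread count
    pthread_t threads[numThreads];
    int ids[numThreads];
    for (int i = 0; i < numThreads; ++i) {
        ids[i] = i;
        pthread_create(&threads[i], nullptr, worker, &ids[i]);  // non-blocking spawn
    }
    for (int i = 0; i < numThreads; ++i)
        pthread_join(threads[i], nullptr);                      // wait for every thread
    return 0;
}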

2.3.2 C++11 STL threads

The C++11 standard introduced support for a thread-aware memory model, with libraries for portable multithreaded C++ code that do not require platform-specific details. The new C++ <thread>, <mutex> and <atomic> libraries provide mechanisms for handling threads, synchronization and atomic operations.

Like Pthreads, a thread is spawned to run a function:

std::thread myThread(f);

myThread.join();

The master thread spawns a new thread to run the function f and waits for it to return. The thread constructor also supports function objects in case running only one function is limiting.
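As a small illustration of passing a function object instead of a plain function (the functor name and its member are hypothetical):

#include <thread>

// A callable object carrying per-thread state.
struct ChunkWorker {
    int chunkId;
    void operator()() const {
        // work on chunk number chunkId
    }
};

int main()
{
    std::thread t1(ChunkWorker{0});  // the thread constructor accepts any callable
    std::thread t2(ChunkWorker{1});
    t1.join();
    t2.join();
    return 0;
}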

2.3.3 OpenMP

Open Multi Processing, or OpenMP, is an API for shared-memory parallel programming in C/C++ and Fortran. The programmer defines regions of code that should be executed in parallel through preprocessor directives [14]. For example

#pragma omp parallel for
for (int i = 0; i < n; i++) {
    // this will run on multiple threads
} // implicit barrier

The #pragma directive makes the compiler generate multi-threaded code at compile time. The programmer is able to define data parallelism in an SPMD (single program multiple data) manner. OpenMP uses a thread pool with the fork-join model, but this is an implementation detail which is abstracted away from the programmer. In the loop above, the number of threads and the threads' workload are not controlled by the programmer. However, the inherent problems of parallel programming, like race conditions and deadlocks, are still present. What happens inside the parallel region still requires synchronization. Hence, the OpenMP API provides more control through runtime calls and clauses [16].

Task parallelism was introduced in OpenMP version 3.0. Instead of defining parallel regions or loops to be run in parallel, the programmer can use the task directive to define a task and its data environment. A new task is generated as soon as a thread encounters the task construct. One of the benefits of tasking is that threads that have finished their work do not need to wait for slower threads; instead, they can help with the remaining tasks, a concept known as load balancing [22].

The pragma directives can include clauses to control the runtime behaviour. For example, schedule(dynamic) makes threads execute chunks of iterations and keep requesting new chunks until none remain, while schedule(auto) leaves all scheduling decisions to the runtime system [23].
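As a small sketch of how such clauses attach to a loop (the chunk size and thread count below are illustrative, not the values used in the benchmarks):

#include <omp.h>

void processRows(int height, int width)
{
    omp_set_num_threads(8);  // optional: request a specific number of threads
    // Rows are handed out dynamically in chunks of 16 iterations;
    // threads that finish early request new chunks (load balancing).
    #pragma omp parallel for schedule(dynamic, 16)
    for (int i = 0; i < height; ++i) {
        for (int j = 0; j < width; ++j) {
            // per-element work goes here
        }
    }
}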

2.3.4 Intel TBB

Intel Threading Building Blocks (TBB) is a portable C++ library for shared-memory parallel programming with an emphasis on task parallelism. TBB runs on any operating system, compiles with any C++ compiler and promotes idiomatic C++ code. The library is used by specifying tasks instead of threads, and the library handles load balancing, cache optimization and the efficient mapping of tasks onto threads [24].

TBB provides parallel loops with the parallel_for function. The parallel_for function arguments are a range and a function object. The range specifies the whole iteration space and the granularity of a task. The data to iterate over is sent to the function object. For example

parallel_for(tbb::blocked_range<int>(0, n, grainSize), MyFunctor(array));

class MyFunctor {
    float *const _array;
public:
    MyFunctor(float array[]) : _array(array) {}
    void operator()(const tbb::blocked_range<int>& range) const {
        for (int i = range.begin(); i != range.end(); ++i)
            Foo(_array[i]);
    }
};

will create tasks which handle at most grainSize iterations. Note that the iteration range in this example is one-dimensional and that TBB iterates from 0 to n−1 by creating smaller tasks of grainSize iterations. The TBB runtime scheduler distributes the tasks between threads.

TBB also provides explicit creation of tasks through a task_group. Tasks can dynamically be added to a task group during execution. This is done through a non-blocking call to the task group method run(f) where f is some function.
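A minimal sketch of this mechanism is shown below; the work inside the tasks is a placeholder and the snippet only illustrates the run/wait pattern, not the thesis's divide-and-conquer implementation.

#include <tbb/task_group.h>
#include <atomic>

std::atomic<int> tasksDone{0};

void runBothHalves()
{
    tbb::task_group g;
    // run() is non-blocking: both tasks may execute on different worker threads.
    g.run([] { tasksDone++; });   // placeholder for e.g. the upper half of a raster
    g.run([] { tasksDone++; });   // placeholder for the lower half
    g.wait();                     // block until every task added to the group has finished
}

int main() { runBothHalves(); return tasksDone == 2 ? 0 : 1; }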

2.3.5 CUDA

Compute Unified Device Architecture (CUDA) is an extension of the C programming language and it supports parallel processing on NVIDIA GPUs [18] [25]. A kernel function is code to be executed on the GPU and it is written in CUDA C. For example, a traditional vector addition

for (int i = 0; i < 512; i++)
    a[i] = b[i] + c[i];

would have its loop structure removed in CUDA C, because the CUDA programming model handles the serial code execution:

int my_id = threadIdx.x;
a[my_id] = b[my_id] + c[my_id];

There are a few observations to be made on the CUDA C version of the vector addition:

• Arrays a, b and c have to be transferred from host to device through the CUDA functions cudaMalloc and cudaMemcpy

• The distribution of indices to threads is done with the built-in variable threadIdx.x, which represents the one-dimensional range of the data

• The size of the array is not explicitly stated in the kernel, so how do we avoid indexing out of bounds? (In practice, a guard such as if (my_id < n) is added when the launch may create more threads than elements.)

After the host-to-device memory transfer, the CPU is ready to invoke a kernel on the GPU with my_kernel<<<1, 512>>>, where the parameters specify how threads will be laid out on the GPU. After the kernel invocation, the CPU does a blocking call to cudaMemcpy to wait for the kernel to complete and move data from device to host [18].

In GPU programming, 512 threads does not fully utilize the GPU, even though it is a relatively high number of threads in the CPU domain. CUDA's thread model defines grids, blocks, warps and threads as abstractions on top of the Streaming Multiprocessors. With these abstractions, the programmer can fully utilize the GPU and achieve performance and scalability [18].

CUDA defines a block as a group of threads running a kernel. The number of blocks is given in the kernel invocation: kernel_name<<<num_blocks, num_threads>>>. If the kernel above were invoked with the parameters my_kernel<<<2, 512>>>, there would be two blocks of 512 threads executing the same code on the same arrays. In order to map blocks to different parts of the array, CUDA provides blockIdx so that threads can compute their unique indices [18].

An optimal CUDA program is a synergy between an efficient mapping of threads to blocks and efficient usage of the different memory regions available. Efficiently mapping the problem domain to blocks and threads goes a long way, and CUDA also defines a memory model of the GPU's memory. There are four memory regions with different properties: global, shared, texture and constant memory [18].

2.3.6 OpenCL

Open Computing Language (OpenCL) also allows the programmer to target the GPU [26][27]. OpenCL is not restricted to NVIDIA GPUs, nor does it restrict code to being executed on GPUs. The OpenCL platform model defines a high-level compute device, which may be a CPU or a GPU, and the goal of OpenCL is to hide hardware details from the programmer through high-level abstractions:

• Platform model: defines a device, which is a high-level description of processors (CPU or GPU)

• Execution model: an abstract representation of how streams of instructions execute on the host

• Memory model: the abstraction of the device’s memory

• Programming model: task and data parallelism for designing algorithms

The programming flow of OpenCL is similar to CUDA's. Generally, it includes three additional steps:

1. choosing an OpenCL platform

2. creating a context

3. creating a command-queue

4. moving data host-to-device

5. executing the kernel

6. moving data device-to-host

The three additional steps add complexity to the code, but this is due to OpenCL's wide support for different devices. For example, step 1 involves choosing between the different CPUs and GPUs on the machine, in source code.
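A condensed host-side sketch of steps 1-6 using the standard OpenCL C API is shown below; the kernel name "vec_add", the buffer layout and the work sizes are placeholders, and error checking is omitted for brevity.

#include <CL/cl.h>

// Steps 1-6 for a kernel named "vec_add"; src holds the OpenCL C source.
void runOnGpu(const char* src, const float* hostIn, float* hostOut, size_t n)
{
    cl_int err;
    // 1. Choose a platform and a GPU device on it.
    cl_platform_id platform;
    clGetPlatformIDs(1, &platform, nullptr);
    cl_device_id device;
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, nullptr);
    // 2. Create a context for the device.
    cl_context ctx = clCreateContext(nullptr, 1, &device, nullptr, nullptr, &err);
    // 3. Create a command queue; transfers and kernel launches go through it.
    cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, &err);
    // Build the kernel from source.
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, nullptr, &err);
    clBuildProgram(prog, 1, &device, nullptr, nullptr, nullptr);
    cl_kernel kernel = clCreateKernel(prog, "vec_add", &err);
    // 4. Move data host-to-device.
    cl_mem dIn  = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  n * sizeof(float), nullptr, &err);
    cl_mem dOut = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, n * sizeof(float), nullptr, &err);
    clEnqueueWriteBuffer(queue, dIn, CL_TRUE, 0, n * sizeof(float), hostIn, 0, nullptr, nullptr);
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &dIn);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &dOut);
    // 5. Execute the kernel over a one-dimensional global range of n work-items.
    size_t globalSize = n;
    clEnqueueNDRangeKernel(queue, kernel, 1, nullptr, &globalSize, nullptr, 0, nullptr, nullptr);
    // 6. Move data device-to-host (a blocking read also waits for the kernel).
    clEnqueueReadBuffer(queue, dOut, CL_TRUE, 0, n * sizeof(float), hostOut, 0, nullptr, nullptr);
    clReleaseMemObject(dIn);  clReleaseMemObject(dOut);
    clReleaseKernel(kernel);  clReleaseProgram(prog);
    clReleaseCommandQueue(queue);  clReleaseContext(ctx);
}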

A vector addition loop written in OpenCL C would look similar to CUDA C:

int my_id = get_global_id(0);

a[my_id] = b[my_id] + c[my_id];

Here get_global_id(0) returns the one-dimensional global ID of a work-item. A work-item is the smallest unit of execution in OpenCL. Each work-item executes the same kernel and each work-item belongs to a work-group. Work-items and work-groups map directly to CUDA's threads and blocks, respectively. The terminology is different but the purpose is the same: allowing the programmer to fully utilize the device [26].

OpenCL also offers five different memory regions: global, constant, texture, local and private memory.

Chapter 3

Method

The methodology in this thesis will be a quantitative evaluation measuring:

• Execution time of the reprojection algorithm

• Execution time when scaling the number of threads

• Profiling information

The execution time measurements are wall-clock times given by C++ date and time utilities in the chrono library. No computationally intensive applications were running during the benchmarks.
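A minimal sketch of how such a wall-clock measurement can be taken with the chrono utilities is shown below; the measured function is a placeholder, not the thesis's reprojection code.

#include <chrono>
#include <cstdio>

// Placeholder for the measured work.
static void reprojectRaster()
{
    volatile double sink = 0.0;
    for (int i = 0; i < 1000000; ++i) sink += i * 0.5;
}

int main()
{
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();   // start of the measured region
    reprojectRaster();                 // the work being timed
    const auto stop = clock::now();    // end of the measured region
    const auto ms =
        std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count();
    std::printf("execution time: %lld ms\n", static_cast<long long>(ms));
    return 0;
}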

Apart from the execution time of the reprojection algorithm it is also interesting to see where the time is spent. Intel VTune Amplifier XE 2016 was used for profiling and the five most time consuming functions are reported. Intel VTune uses sampling profiling (see Figure 3.1) where it interrupts the processor at fixed intervals and records the current function and the call stack.

Figure 3.1. A screenshot of the Intel VTune profiler. The profiling information shows where the program spends most of its time.

3.1 Reprojection algorithm

In Section 2.1.2 we saw how to reproject a single coordinate from one map projection to another, while the reprojection algorithm reprojects all coordinates in a raster.

A raster in C++ is implemented as a contiguous one-dimensional array of bytes, and the two-dimensional representation is obtained by giving the one-dimensional array a fixed row size. In short, the reprojection algorithm iterates through this array and reprojects each element's geographic location to another map projection.
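As a small illustration of this row-major layout (the structure below is illustrative, not the thesis's raster class):

#include <vector>

// A raster stored as a contiguous one-dimensional array of bytes
// with a fixed row size, indexed as cells[width * row + col].
struct Raster {
    int width;
    int height;
    std::vector<unsigned char> cells;  // size = width * height

    unsigned char at(int row, int col) const {
        return cells[width * row + col];
    }
};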

The input and output of the reprojection algorithm are the same raster projected with two different map projections, see Figure 3.2 below.

Figure 3.2. Example of input and output rasters to the reprojection algorithm.

The algorithm begins with an empty output raster and fills it up with colour values point by point. A flow diagram of the algorithm is shown in Figure 3.3 below. Note that the reprojection step (highlighted in blue) applies the inverse formula shown in Section 2.1.1.

Figure 3.3. A flow diagram of the reprojection algorithm: pick a coordinate (x, y) in the empty output raster, reproject (x, y) to (i, j), and if (i, j) is inside the input raster, set coordinate (x, y) in the output raster to the value of the input raster at (i, j); then continue with the next coordinate.

The algorithm essentially builds up a new matrix (or raster) by transforming coordinates and querying the input matrix. A key thing to understand here is that the values inside the matrix are not transformed. The matrices contain discrete colour values which are not altered; instead, it is the indices or coordinates that are reprojected. The pseudocode below is a sequential implementation of the reprojection algorithm.

for (i = 0; i < outputRasterHeight; ++i)
    for (j = 0; j < outputRasterWidth; ++j) {
        // apply the georeferencing offsets without modifying the loop indices
        x, y = inverseReproject(i + northOffset, j + eastOffset)
        if (isInside(inputRaster, x, y)) {
            colorValue = inputRaster[inputRasterWidth*y + x]
            outputRaster[outputRasterWidth*i + j] = colorValue
        }
    }

Note that the input raster remains unchanged and it is treated as a read-only raster.


If a reprojected coordinate was not inside the input raster, the element will get an undefined, or transparent, value. The inverseReproject function is defined as:

// Inverse LCC projection: (x, y) in projected coordinates -> (longitude, latitude) in degrees.
void inverseReproject(double& x, double& y) {
    double rho   = sqrt(x * x + (rho0 - y) * (rho0 - y));
    double t     = pow(rho / (a * F), 1 / n);
    double theta = atan2(x, rho0 - y);           // theta = arctan(x / (rho0 - y)), cf. Section 2.1.1
    // phi appears on both sides of the latitude formula, so it is solved by
    // fixed-point iteration starting from a first estimate.
    double phi = HALFPI - 2 * atan(t);
    for (int k = 0; k < 5; ++k)
        phi = HALFPI - 2 * atan(t *
            pow((1 - e * sin(phi)) / (1 + e * sin(phi)), e / 2));
    double lambda = (theta / n) + lambda0;
    x = lambda * rad_to_degree;
    y = phi * rad_to_degree;
}

The inverseReproject function applies the trigonometry, power and square root functions shown in Section 2.1.1.

3.2 Parallelization

The reprojection algorithm is embarrassingly parallel since each element in the raster can be reprojected independently. In the sections below, we show the parallel implementations of the sequential reprojection algorithm.

3.2.1 Pthreads

The raster is evenly divided into n chunks of m rows. Each chunk is given to one of n threads; for example, in Figure 3.4 the first thread to arrive would apply the reprojection algorithm to the uppermost chunk.

Figure 3.4. A raster divided into n chunks of size m.

The implementation consists of a master thread spawning n threads with pthread_create and waiting for all threads to finish with pthread_join. The spawned threads find their chunk through their thread ID and apply the sequential reprojection algorithm to it:

for (i = 0; i < myChunkHeight; ++i) {
    for (j = 0; j < outputRasterWidth; ++j) {
        ...
    }
}

3.2.2 C++11 STL threads

The raster is divided into chunks, like in the Pthread implementation in Section 3.2.1. The threads are spawned with std::thread() and they apply the sequential reprojection algorithm to their chunk. Finally, the master thread waits for all threads to finish with std::thread::join(). The code for spawning the threads is shown below; note that threads find their specific chunk through their unique ID and the total number of threads:

void spawnThreads() {
    const unsigned int numThreads = std::thread::hardware_concurrency();
    std::vector<std::thread> threads;
    for (unsigned int idThread = 0; idThread < numThreads; idThread++) {
        // Capture the thread ID by value so every thread gets its own copy.
        auto threadFunc = [idThread, numThreads]() {
            reproject(idThread, numThreads);
        };
        threads.push_back(std::thread(threadFunc));
    }
    for (auto& t : threads)
        t.join();
}

3.2.3 OpenMP

A pragma compiler directive was added to the outermost loop, because one iteration of the outer loop is more expensive than one iteration of the inner loop. The sequential code remains the same except for the directive:

#pragma omp parallel for [parameters]
for (i = 0; i < outputRasterHeight; ++i) {
    for (j = 0; j < outputRasterWidth; ++j) {
        ...
    }
}

The task-based implementation used a divide-and-conquer strategy. The raster is split into four subrasters and four tasks are created to reproject each subraster. The splitting continues recursively until the size of the subrasters reaches a threshold. Figure 3.5 below visualizes the task-based implementation during splitting. Note that the threads work on square chunks instead of rows.

Figure 3.5. The process of recursively splitting a raster into four quadrants.

The tasks were spawned with #pragma omp task until the chunks were of size threshold × threshold. Finally, all threads were synchronized with #pragma omp taskwait. The main part of this implementation is shown in the code below:

void reproject(const Point lowerLeft, const Point upperRight) {
    int threshold = 100; // tuning parameter
    if (height > threshold && width > threshold) {
        // upper-left square
        lowerLeftNew.x(lowerLeft.x());
        lowerLeftNew.y(middle.y() + 1);
        upperRightNew.x(middle.x());
        upperRightNew.y(upperRight.y());
        #pragma omp task
        {
            reproject(lowerLeftNew, upperRightNew);
        }
        // Continue with lower-left, upper-right and lower-right
    }
}

The reprojection part is omitted from the code above; however, when the threshold is reached, the threads apply the sequential reprojection algorithm to their specific square chunks.

3.2.4 Intel TBB

In the TBB implementation, the sequential algorithm was encapsulated inside a function object:

class Reprojector {
private:
    const int _rowsize;
public:
    Reprojector(int rowsize) : _rowsize(rowsize) {}
    void operator()(const tbb::blocked_range<int>& r) const {
        for (int i = r.begin(); i != r.end(); ++i) {
            for (int j = 0; j < _rowsize; j++) {
                ...
            }
        }
    }
};

The outermost loop was divided into ranges of grainSize iterations and the task scheduler mapped tasks (chunks of rows) onto threads. The function object was invoked with:

parallel_for(blocked_range<int>(0, outputRasterHeight, grainSize),
             Reprojector(finalRasterWidth));

The task-based implementation followed the same divide-and-conquer strategy as the OpenMP task-based implementation, where tasks were spawned with tbb::task_group::run(f), and f either created more tasks or reprojected a square chunk if the threshold was reached. The main part of this implementation is shown in the code below:

void reproject(const Point lowerLeft, const Point upperRight) {
    int threshold = 100; // tuning parameter
    tbb::task_group g;
    if (height > threshold && width > threshold) {
        Point middle((width - 1) / 2 + lowerLeft.x(),
                     (height - 1) / 2 + lowerLeft.y());
        Point lowerLeftNew, upperRightNew;
        // upper-left square
        lowerLeftNew.x(lowerLeft.x());
        lowerLeftNew.y(middle.y() + 1);
        upperRightNew.x(middle.x());
        upperRightNew.y(upperRight.y());
        // Start a task for the upper left square
        g.run([=]{ reproject(lowerLeftNew, upperRightNew); });
        // Continue with lower-left, upper-right and lower-right,
        // followed by a wait for the spawned tasks to finish
    }
}

Note in the code above that a new tbb::task_group is being created on every recursive call. Intel recommends this approach, because creating a large number of tasks on a single task group does not scale. The large number of tasks created becomes a bottleneck and serializes the algorithm.

3.2.5 CUDA

The design used in CUDA avoids looping through the raster. Instead of assigning parts of the raster to threads, the CUDA implementation maps one thread to each coordinate (or one thread per pixel in CUDA terminology). The implementation starts by doing a host-to-device transfer of the input raster, the output raster and the constants needed for the reprojection with cudaMemcpy. Second, the kernel is invoked with parameters that ensure every coordinate gets a thread assigned to it:

dim3 threadsPerBlock(p, q);
dim3 numBlocks(outputRasterWidth/p, outputRasterHeight/q);
// launch configuration: <<<blocks, threads per block>>>
kernel_reproject<<<numBlocks, threadsPerBlock>>>(inputRaster,
                                                 outputRaster, constants);

where each block has p · q threads. The number of blocks is chosen to fit the output raster, while the number of threads per block is used as a tuning parameter. After invoking the kernel, the CPU does a device-to-host blocking call to cudaMemcpy, waiting for all kernels to finish.

The kernel applies the reprojection algorithm to one coordinate. The threads identify their unique coordinate through CUDA's built-in blockIdx and threadIdx variables. The simplified code for the CUDA kernel is shown below; note that inverseReproject also needs to be translated into CUDA device code.

__global__ void kernel_reproject(unsigned char* inputRaster,
                                 unsigned char* outputRaster,
                                 double* constants) {
    int i = (blockIdx.y * blockDim.y) + threadIdx.y;
    int j = (blockIdx.x * blockDim.x) + threadIdx.x;
    // A guard such as (i < outputRasterHeight && j < outputRasterWidth) is needed
    // if the raster dimensions are not multiples of the block dimensions.
    x, y = inverseReproject(i + northOffset, j + eastOffset);
    if (isInside(inputRaster, x, y)) {
        colorValue = inputRaster[inputRasterWidth*y + x];
        outputRaster[outputRasterWidth*i + j] = colorValue;
    }
}

3.2.6 OpenCL

The OpenCL implementation follows the same one thread per coordinate design as the CUDA implementation in Section 3.2.5, and their kernels are similar. However, the work done before invoking the kernel is different from CUDA.

First, the GPU platform is found with clGetDeviceIDs and a context is created with clCreateContext. After the context is defined, a command queue is created for the host-to-device transfers of rasters and constants. Finally, the kernel is invoked with clEnqueueNDRangeKernel, where the global and local work-group sizes are given. The global work-group size is two-dimensional, like the block size in CUDA, and it covers the whole output raster.

Chapter 4

Benchmark Environment

The experiment is a benchmark of seven implementations of the reprojection algorithm in OpenMP, Pthreads, C++11 STL threads, Intel TBB, OpenCL and NVIDIA CUDA. A sequential algorithm was also implemented as a reference implementation.

The main performance metric is execution time reported in ms. The time needed to reproject a raster is measured ten times and the mean value is reported together with the standard deviation of the ten measurements as error bars. The pseudocode below shows how the execution time was measured:

start = now()
reproject(framework, raster)
stop = now()

where framework is one of the parallel programming frameworks mentioned above and raster is one of three rasters of different sizes.

The second measurement was memory usage and CPU load during the execution of the reprojection algorithm. Microsoft’s Performance Monitor (perfmon) was used to record both measurements.

4.1 Data

The raster data used in these experiments are three raster images of Earth:

Name    File size  Resolution     Format          Map projection
Small   338 KB     2000 × 2000    GeoTiff (.tif)  Plate Carrée
Medium  55 MB      10800 × 5400   GeoTiff (.tif)  Plate Carrée
Large   222 MB     21600 × 10800  GeoTiff (.tif)  Plate Carrée

Both the Medium and Large rasters are in the public domain (http://www.naturalearthdata.com). The Small raster is a subraster of the Medium raster.

In the commercial GIS toolkit used in this thesis, the larger rasters are split into smaller rasters of size 2000 × 2000. For example, the Medium raster is not a single data structure in memory; instead, it is split up into 40 smaller rasters.

4.2 Verification

The notion of a correctly projected map is not well-defined. In order to verify the correctness of the implementations' reprojection, there needs to be some reference for comparison. The commercial GIS toolkit and the output from its map projection implementation are used as a reference. The outputs from the benchmarks were compared with the commercial GIS output both visually and in terms of equal raster values.

4.3 Hardware

The specifications of the computer on which the results were produced are given in the table below:

Model          Dell Precision M4600
OS             Windows 7 Professional 64-bit
Processor      Intel Core i7 2760QM @ 2.40 GHz, 4 cores (4 Hyper-threads)
RAM memory     8 GB
Graphics card  NVIDIA Quadro 1000M (Dell)

4.4 Compilers

The MSVC++ compiler version 12.0 was used for all the implementations, together with the O2 (maximize speed) optimization flag. The GCC compiler version 4.9.2 was used to implement tasks in OpenMP, because of MSVC's lack of support for OpenMP versions newer than 2.0. The CUDA implementation used the NVCC compiler version 7.5 and the OpenCL implementation used Intel's SDK for OpenCL applications. Both GPU implementations were compiled with the O2 and -use_fast_math flags for faster floating point operations.

Chapter 5

Results

This chapter shows the results from the benchmark presented in Chapter 4. Each implementation of the reprojection algorithm was evaluated by measuring its execution time. The execution time of the sequential implementation is also presented. For an overview of the benchmark, the reader can find the fastest results from each implementation in Section 5.4.

The figures for the parallel frameworks use red, green and blue to distinguish between the Small (338 KB), Medium (55 MB) and Large (222 MB) rasters. The coloured bars represent the mean execution time measured over ten executions, while the black error bars represent the standard deviation of these executions.

5.1 Sequential implementation

The execution time for the sequential implementation is shown in Table 5.1 below. Note that the sequential implementation needed approximately 80 seconds to reproject the Large raster, while the parallel implementations in Figure 5.19 needed 2 to 16 seconds to reproject the same raster.

         Execution time (ms)  Standard deviation
Small    930.41               9.32
Medium   36238.80             28.39
Large    77196.30             42.13

Table 5.1. The execution times and their standard deviations for the sequential implementation.

5.2 CPU frameworks

The results of the CPU frameworks (Pthreads, C++11 STL threads, OpenMP and TBB) are shown below. The number of threads used in their implementations was increased from 1 to 16, both to analyze the behavior of the algorithm and to find the number of threads that gave a fast execution time. Recall from Chapter 4 that the computer used in the benchmarks had 4 cores and 4 additional hyperthreads.

5.2.1 Pthreads

Figure 5.1 shows the time needed to reproject each raster with the Pthread implementation. The x-axis represents the number of threads used by the implementation. The execution time decreases as more threads are added, until the fastest execution times are reached at around 7-8 threads.

Figure 5.1. Scaling the amount of threads used in the Pthread implementation of the reprojection algorithm. (Panels: Small, Medium and Large raster; x-axis: number of threads, y-axis: time in ms.)

Figure 5.2 shows the speedup of the Pthread implementation on the three rasters. The speedup is linear from 1 to 2 threads. When more threads are added, the speedup converges to the speedup of 7-8 threads.


Figure 5.2. The speedup of the Pthread implementation on the three rasters. (x-axis: number of threads, y-axis: speedup; series: Small, Medium, Large.)

Figure 5.3 shows the profiling information of the Pthread implementation and represents where most of the time was spent during reprojection. The legend shows atan2, a function that takes two parameters x, y and calculates the arctangent of y/x, and atan, a function for calculating the arctangent; pow and sqrt are the power and square root functions. The Pthread implementation spends most of the time computing arctangent functions.

Figure 5.3. Where time is spent during the Pthreads implementation of the reprojection algorithm.


5.2.2 C++11 STL threads

Figure 5.4 shows the time needed to reproject each raster with the C++11 STL thread implementation. The x-axis represents the number of threads used by the implementation. The execution time decreases as more threads are added, until the fastest execution times are reached at around 7-8 threads.

Figure 5.4. Scaling the amount of threads used in the C++11 STL threads implementation of the reprojection algorithm. (Panels: Small, Medium and Large raster; x-axis: number of threads, y-axis: time in ms.)

Figure 5.5 shows the speedup of the C++11 STL thread implementation on the three rasters. The speedup is linear from 1 to 2 threads. The maximum speedup is around 7-8 threads. Note that the speedup is almost equal for all three rasters.


Figure 5.5. The speedup of the C++11 STL threads implementation on the three rasters. (x-axis: number of threads, y-axis: speedup; series: Small, Medium, Large.)

Figure 5.6 shows the profiling information of the C++11 STL thread implementation and represents where most of the time was spent during reprojection. The legend shows vectorised versions of the atan2, atan, pow and sqrt functions: atan2 takes two parameters x, y and calculates the arctangent of y/x, atan calculates the arctangent, and pow and sqrt are the power and square root functions. The C++11 STL thread implementation spends most of the time computing arctangent functions.

Figure 5.6. Where time is spent during the C++ STL threads implementation of the reprojection algorithm.


5.2.3 OpenMP

Figure 5.7 shows the time needed to reproject each raster with the OpenMP parallel for and tasks implementations. The x-axis represents the number of threads used by the implementations. The execution time decreases as the number of threads increases; after 8 threads, the execution time remains the same.

Figure 5.7. Scaling the amount of threads used in the OpenMP implementation of the reprojection algorithm. (Panels: Small, Medium and Large raster, each comparing the tasks and parallel for variants; x-axis: number of threads, y-axis: time in ms.)

Figure 5.8 shows the speedup of the OpenMP parallel for and tasks implementations on the three rasters. The speedup is linear from 1 to 2 threads. When more threads are added, the speedup converges to the speedup of 8 threads.
