
Linköping University

Department of Computer and Information Science

Final Thesis

A Skeleton Programming Library for

Multicore CPU and Multi-GPU Systems

by

Johan Enmyren

LIU-IDA/LITH-EX-A--10/037--SE

2010-10-05

Supervisor: Christoph W. Kessler

Examiner: Christoph W. Kessler


Abstract

This report presents SkePU, a C++ template library which provides a simple and unified interface for specifying data-parallel computations with the help of skeletons on GPUs using CUDA and OpenCL. The interface is also general enough to support other architectures, and SkePU implements both a sequential CPU and a parallel OpenMP back end. It also supports multi-GPU systems.

Benchmarks show that copying data between the host and the GPU is often a bottleneck. Therefore a container which uses lazy memory copying has been implemented to avoid unnecessary memory transfers.

SkePU was evaluated with small benchmarks and a larger application, a Runge-Kutta ODE solver. The results show that skeletal parallel programming is indeed a viable approach for GPU Computing and that a generalized interface for multiple back ends is also reasonable. The best performance gains are achieved when the computational load is large compared to the memory I/O (which the lazy memory copying helps to achieve). SkePU also offers good performance on a more complex and realistic task such as ODE solving, with run times up to ten times faster when using SkePU with a GPU back end than with a sequential solver running on a fast CPU.

SkePU does, however, have some disadvantages. There is some overhead in using the library, which can be seen in the dot product and LibSolve benchmarks; although small, it means that a hand-coded solution would be preferable if performance is of utmost importance. Nor can all computations be expressed in terms of skeletons; for such problems, specialized routines must still be written.


Acknowledgments

I would like to thank my examiner and supervisor Christoph W. Kessler for his time and advice during the project. I would also like to thank Mattias Korch and Thomas Rauber for sharing their ODE solver library and Markus Ålind for sharing his BlockLib port of LibSolve. Further development of SkePU has been funded by project PEPPHER.


Contents

1 Introduction
   1.1 Project Goal
   1.2 Project Approach
   1.3 Thesis Outline
   1.4 Clarifications

2 Skeleton Programming
   2.1 Types of skeletons

3 NVIDIA GPU Architecture
   3.1 Streaming Multiprocessors
   3.2 Memory
   3.3 Execution Model
   3.4 Optimizing for NVIDIA GPUs
   3.5 Future - NVIDIA Fermi

4 Programming Frameworks
   4.1 CUDA
      4.1.1 Programming Interface
   4.2 OpenCL
      4.2.1 Programming Interface
   4.3 OpenMP
      4.3.1 Programming Interface

5 SkePU
   5.1 Container
   5.2 User Functions
   5.3 Skeleton Functions
      5.3.1 CUDA Implementation
      5.3.2 OpenCL Implementation
      5.3.3 Types of Skeletons
   5.4 Multi-GPU Support
      5.4.1 Map
      5.4.2 Reduce and MapReduce
      5.4.3 MapOverlap
      5.4.4 MapArray
      5.4.5 Scan
   5.5 Helpers
      5.5.1 Device Management
      5.5.2 Memory Management
      5.5.3 Testing
   5.6 Dependencies

6 Evaluation
   6.1 Time Distribution
   6.2 Gaussian Blur
   6.3 Dot Product
   6.4 Scan
   6.5 A Runge-Kutta ODE Solver
      6.5.1 LibSolve
      6.5.2 Porting
      6.5.3 Benchmarking

7 Related Work
   7.1 GPU Libraries
   7.2 Non GPU Skeleton Libraries

8 Limitations and Future Work

9 Conclusions
   9.1 Interface
   9.2 Performance
   9.3 Back Ends

Bibliography

A Glossary
   A.1 Words and Abbreviations

B SkePU API Reference
   B.1 Vector< T >
      B.1.1 Detailed Description
      B.1.2 Constructor & Destructor Documentation
      B.1.3 Member Function Documentation
      B.1.4 Friends And Related Function Documentation
   B.2 Vector< T >::iterator< T >
      B.2.1 Detailed Description
   B.3 Vector< T >::proxy_elem< T >
      B.3.1 Detailed Description
   B.4 Map< MapFunc >
      B.4.1 Detailed Description
      B.4.2 Constructor & Destructor Documentation
      B.4.3 Member Function Documentation
   B.5 Reduce< ReduceFunc >
      B.5.1 Detailed Description
      B.5.2 Constructor & Destructor Documentation
      B.5.3 Member Function Documentation
   B.6 MapReduce< MapFunc, ReduceFunc >
      B.6.1 Detailed Description
      B.6.2 Constructor & Destructor Documentation
      B.6.3 Member Function Documentation
   B.7 MapOverlap< MapOverlapFunc >
      B.7.1 Detailed Description
      B.7.2 Constructor & Destructor Documentation
      B.7.3 Member Function Documentation
   B.8 MapArray< MapArrayFunc >
      B.8.1 Detailed Description
      B.8.2 Constructor & Destructor Documentation
      B.8.3 Member Function Documentation
   B.9 Scan< ScanFunc >
      B.9.1 Detailed Description
      B.9.2 Constructor & Destructor Documentation
      B.9.3 Member Function Documentation

C Code for the Map Skeleton
   C.1 map_kernels.h
   C.2 map.h
   C.3 map.inl
   C.4 map_cpu.inl
   C.5 map_omp.inl
   C.6 map_cl.inl
   C.7 map_cu.inl


Chapter 1

Introduction

With the development of computers, there has always been a need to make programs and their calculations run faster. One approach to fulfilling this need has long been to build new hardware that can be run at higher clock rates. Today, this approach is becoming less and less attractive due to the problems that arise as the technology reaches its physical limits. Instead, other ways must be found to meet the need for more computing power. Parallelization is a natural way of speeding up programs and their calculations, and seems to be the road taken by the hardware manufacturers. Central Processing Units (CPUs) are constructed with multiple cores, and Graphics Processing Units (GPUs) are transforming from specialized hardware, only for graphics, into more general computing units supporting massively parallel programming.

This paradigm shift from sequential to parallel computing is not an easy one. For years, programmers have learned how to think in a sequential way when developing algorithms. Different techniques and tricks have been invented to get as much computation power out of the sequential machine as possible, techniques which are often hard to utilize in a parallel machine. Furthermore, there are today several different parallel architectures, such as homogeneous SMP-like multi-cores, heterogeneous multi-cores like the Cell Broadband Engine, and hybrid CPU/GPU systems, all of which can be complex and time-consuming to program for. Moreover, with new architectures constantly arriving, using different programming tools and interfaces for each of them creates a portability problem.

Skeleton programming [7, 26, 27] is one way of making parallel programming easier by hiding a lot of the parallelism and identifying a set of higher-order functions that can be used when building parallel algorithms. Using skeletons solves the portability problem to a large degree since they are not bound to a certain architecture but can unify those architectures under the same interface.

This thesis takes the skeleton approach to parallel programming and applies it to modern GPUs by constructing SkePU, a skeleton library on top of CUDA and OpenCL, two platforms used for programming GPUs. It is modeled after BlockLib [1], a skeleton library for the Cell BE processor. The interface of SkePU is, however, not bound to GPU back ends only; both a single-core CPU and an OpenMP based multi-core back end are also implemented.

The most important features of SkePU are also described in a workshop paper [11].

1.1 Project Goal

The main goal of this master thesis is to construct a programming library with the purpose of making it easier to write efficient programs for GPUs using data parallel concepts. This should be done using a skeleton approach similar to that in BlockLib [1], and the interface should be general enough to support architectures other than graphics hardware. Further, the project should also investigate whether the skeleton approach is viable for GPU Computing and how the library performs for various tasks.

1.2 Project Approach

To learn more about the different concepts and technologies used in the project, the first step was to conduct a literature study in the area of GPU Computing using OpenCL and CUDA, as well as in the area of skeleton programming. The result can be seen in Chapter 7, where related work is discussed briefly.

After that, an interface was designed and it was decided that C++ with templates was going to be used. Then the skeleton library itself was created and named SkePU, a contraction of Skeleton, GPU and CPU.

The last step was to evaluate the library and the approach by implementing a few smaller benchmarks and a larger real-world application using SkePU functions.

1.3 Thesis Outline

The report has the following structure:

• A short introduction to the skeleton concept will be given in Chapter 2.

• NVIDIA GPU Overview: Since only NVIDIA GPUs were used in the evaluation of the library, an overview of this architecture and its most relevant features is the topic of Chapter 3.

• Chapter 4 introduces the different frameworks used internally by SkePU, OpenCL, CUDA and OpenMP.

• After that, the main result of this thesis, SkePU, is presented in more detail. Both its functionality and implementation are covered. This is the topic of Chapter 5.

• In Chapter 6 the results from the evaluation of SkePU are presented. They consist of a few smaller benchmarks and a larger application.


• There has been a lot of research in both the area of GPU Computing and in skeletal parallel programming. The topic of Chapter 7 is therefore related work. This chapter presents the results from the literature study and compares SkePU to a few similar libraries.

• The limitations of SkePU and suggestions for future improvements are discussed in Chapter 8.

• Conclusions and further discussion of the overall results of SkePU and the thesis project as a whole are covered in Chapter 9.

• A Glossary is available in Appendix A and a short SkePU API reference in Appendix B.

1.4 Clarifications

To avoid confusion, some initial clarifications about the terminology used in the report should be made. When talking about GPUs, the word device will often be used. In the same sense, the word host is sometimes used instead of CPU. A kernel is a piece of code that is callable from the host and executed on the device. The words vector and array are used interchangeably and mean a one-dimensional data structure. Also see Appendix A for a complete glossary.


Chapter 2

Skeleton Programming

Most programmers today are used to a sequential way of thinking when developing applications and implementing algorithms. Since the hardware platforms are making a transition towards parallel architectures, so must software development, and there are several ways to make this transition smoother. At one end, we could continue to program in a sequential way and hope that the tools we use will be developed to handle this transition for us, for example compilers that automatically parallelize sequential code. At the other end, programmers need to learn how to utilize the parallel performance of the machine they are working on and "manually" program it in an efficient way. Both of these methods have their respective advantages and disadvantages. The first does not require much effort from the regular programmer, but instead the compiler and compiler designers must do the bulk of it. Often this also limits the amount of parallelization that can be done and the performance gains. The second way offers the most possibilities for performance gains, but also requires a considerable amount of effort from the programmers. Portability also suffers when trying to optimize for a specific architecture.

Parallel programming with skeletons [7] is a middle way between those endpoints. The skeleton approach offers an abstraction from different parallel architectures by identifying certain patterns used when developing parallel programs and implementing those with the help of higher-order functions, functions which take other functions as parameters. The interface is often sequential-like and therefore makes it easy for programmers without prior experience with parallel programming to start using them. Skeletons also offer some portability since they can be expert-implemented and optimized for different architectures and encapsulate things like load balancing, communication and memory management. The skeleton approach does, however, set some constraints for the programmer on what kind of computations can be parallelized automatically. If a computation cannot be expressed with the defined set of skeletons, it must still be coded manually.


Figure 2.1. Task graph of example expression.

2.1 Types of skeletons

Parallelism in an application can usually be divided into data and task parallelism. In the task parallel model, different parts of the program or algorithm are executed concurrently in different places, and in the data parallel model, the same operations are executed in parallel on different parts of a data set. Often these models can also be combined, with several tasks doing operations on large data sets. For example, consider evaluating the expression (A + B) ∗ (C − D), also illustrated in figure 2.1, where A, B, C and D are vectors and the operations are performed element-wise. Here each operation is data parallel since it is done element-wise, and two of the operations, A + B and C − D, can be performed in parallel and are therefore task parallel.
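To make the distinction concrete, the following is a minimal sketch (not from the thesis) of evaluating (A + B) ∗ (C − D) with the OpenMP directives introduced later in Chapter 4: the two independent operations run as separate tasks (sections), and each loop body is data parallel. The function name evaluate and the requirement that result is pre-sized by the caller are assumptions made for this illustration.

#include <omp.h>
#include <vector>

// Evaluates result = (A + B) * (C - D) element-wise.
// The two sections are task parallel; each loop is data parallel.
void evaluate(const std::vector<double>& A, const std::vector<double>& B,
              const std::vector<double>& C, const std::vector<double>& D,
              std::vector<double>& result)
{
    std::vector<double> t1(A.size()), t2(A.size());

    #pragma omp parallel sections
    {
        #pragma omp section               // task 1: A + B
        for(std::size_t i = 0; i < A.size(); ++i)
            t1[i] = A[i] + B[i];

        #pragma omp section               // task 2: C - D
        for(std::size_t i = 0; i < A.size(); ++i)
            t2[i] = C[i] - D[i];
    }

    #pragma omp parallel for              // final element-wise multiplication
    for(long i = 0; i < (long)result.size(); ++i)
        result[i] = t1[i] * t2[i];
}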

Skeletons can in a similar way also be categorized as being either task or data parallel. Examples of popular task parallel skeletons are pipe and farm and data parallel skeletons include map, reduce and scan. They can also be combined, which is done in the two-tier model of P3L [26]. In [26], a detailed explanation of the previously mentioned skeletons can also be found.

In this thesis, only data parallel skeletons were considered. Chapter 5 gives a full description of the skeletons in SkePU and their implementation details. Skeletons for parallel programming are a well-documented topic in the research literature, and several skeleton implementations exist for various parallel architectures. See Chapter 7 for a short survey of some of them.


Chapter 3

NVIDIA GPU Architecture

For the last decades, the demand for faster and faster graphics acceleration has led the way for a quick development in the area of GPU (Graphics Processing Unit) architecture. At first, GPUs were mainly used for pure graphics calculations, but when programmable shaders started to arrive, people began experimenting with utilizing the GPU's processing power for other purposes. Its potential in non-graphics related areas was recognized, and in 2006 NVIDIA released the first version of CUDA, a general purpose parallel computing architecture whose purpose was to simplify the development of applications that utilize the power of the GPU [25]. One of the big differences between a CPU and a GPU is how they are composed, or what the chip area is used for. Since the CPU is not specialized for a specific task, it is instead aimed at being fairly general, with a lot of transistors being used for caching and flow control. The GPU, on the other hand, devotes much more space on the chip to pure floating-point calculations, since a high throughput of massively parallel computations is what graphics processing is all about. This makes the GPU very powerful on certain kinds of problems, especially those that have a data-parallel nature, preferably with many more arithmetic operations than memory operations [25]. The sections below give a short introduction to the hardware model in CUDA and highlight some important features. Also see figure 3.1, which illustrates this model.

3.1 Streaming Multiprocessors

The CUDA architecture consists of several compute units called Streaming Multiprocessors (SMs). They do all the thread management and are able to switch threads with no scheduling overhead. This zero overhead allows one to use, for example, one thread per data element in data parallel problems. The multiprocessor executes threads in groups of 32, called warps, but each thread executes with its own instruction address and register state, which allows for separate branching. It is, however, most efficient if all threads in one warp take the same execution path; otherwise the execution in the warp is serialized [25]. The SMs themselves consist of several Scalar Processor cores (SPs), also called CUDA cores.
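The following small, hypothetical kernel (names divergent and v are placeholders, not from the thesis) illustrates the branching behavior described above: half the threads in each warp take one branch and half the other, so the two paths execute one after the other within the warp.

// Sketch: threads in the same warp taking different branches are serialized.
__global__ void divergent(float* v)
{
    int i = threadIdx.x;
    if(i % 2 == 0)              // half of each warp takes this branch...
        v[i] = v[i] * 2.0f;
    else                        // ...and the other half this one; the warp
        v[i] = v[i] + 1.0f;     // executes the two paths one after the other
}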


Figure 3.1. Illustration of the CUDA hardware model. A device consists of several Streaming Multiprocessors (SMs), which themselves are made up of several Scalar Processor cores (SPs).

3.2 Memory

There exist several memory types in the GPU that are part of the CUDA architecture. They compose a memory hierarchy similar to that of regular computers. First there is a global device memory which can be accessed by the entire GPU, that is, by all the streaming multiprocessors. It is fairly slow but usually large (hundreds of megabytes or even gigabytes). Next, each of the streaming multiprocessors has the following types of on-chip memory, shared by all its scalar processor cores: a fast shared memory, a read-only constant cache that is used to speed up reads from constant regions of global memory, and a read-only texture cache which speeds up reads from texture regions of global device memory. Each scalar processor core also has a set of 32-bit registers. [25]

3.3 Execution Model

The entity that is executed on a GPU is called a kernel which, when launched, is instantiated for each participating thread. The threads are divided into groups called blocks, and all the blocks in a launch compose a grid. The threads within a block have access to the same shared memory and can be synchronized with function calls. The block size is limited, typically to around 512 threads. Both the block size and the grid size are parameters that need to be set before a kernel can be launched.
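A common way to choose these parameters for n elements is sketched below; myKernel, d_data and n are placeholder names used only for this illustration.

// Sketch: choosing launch parameters for n elements with 512-thread blocks.
int blockSize = 512;
int gridSize  = (n + blockSize - 1) / blockSize;   // enough blocks to cover all n elements
// myKernel<<<gridSize, blockSize>>>(d_data, n);   // hypothetical kernel launch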


3.4 Optimizing for NVIDIA GPUs

To utilize an NVIDIA GPU effectively, one should make sure to run tasks with high arithmetic intensity and to keep many threads active per SM. If these two conditions are fulfilled, memory accesses can be overlapped with computation by switching threads, which, as mentioned earlier, has zero overhead.

The GPU is very fast when doing floating point calculations and should therefore be left doing that as much as possible by avoiding too many memory transactions. When they are unavoidable, shared memory should be utilized as often as possible. This means reading a part of device memory and placing it in the local shared memory before doing the calculations. This leads to a common structure when programming for the GPU (a minimal kernel sketch follows the list below):

1. Read a chunk of global device memory and place its contents in shared memory.

2. Do calculations with the information in shared memory.

3. Write the results back to global memory.
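The kernel below is a minimal sketch of this three-step pattern, not SkePU code; the name scaleKernel, the factor parameter and the trivial computation are assumptions chosen only to keep the staging pattern visible (a computation this simple would not need shared memory in practice).

// Each block stages a tile of global memory in shared memory,
// computes on it, and writes the result back to global memory.
__global__ void scaleKernel(const float* in, float* out, int n, float factor)
{
    extern __shared__ float tile[];            // dynamically sized shared memory

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if(i < n)
        tile[threadIdx.x] = in[i];             // 1. global -> shared
    __syncthreads();

    if(i < n)
    {
        float v = tile[threadIdx.x] * factor;  // 2. compute on the shared copy
        out[i] = v;                            // 3. write back to global memory
    }
}

// Launch sketch: 512 threads per block, shared memory sized to one tile.
// scaleKernel<<<numBlocks, 512, 512 * sizeof(float)>>>(d_in, d_out, n, 2.0f);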

Another important aspect to consider is allowing simultaneous global memory accesses to be coalesced into a single memory transaction, which can greatly improve performance. The shared memory access patterns are also important to think about, since the shared memory is composed of different memory banks and accessing the same bank from two threads serializes the accesses. Both of these memory optimizations are described in detail in [25].

3.5 Future - NVIDIA Fermi

NVIDIA GPUs have been developed for over a decade. During this time they have gone through several types of architectures, but as mentioned before, it was in 2006 when they released CUDA that they deliberately tried to focus more on GPU Computing. The GPU architecture at the time was the G80 architecture. It was a unified processor with shared memory and a SIMT (Single Instruction, Multiple Thread) execution model. It was later upgraded to the GT200 architecture which featured more computing power in terms of more SMs and larger register files. It also implemented more effective memory access and added support for double precision floating point calculations. The previous sections described the G80 and GT200 architectures.

At the time of writing, a new architecture was under development that is worth mentioning, although it was not used in the development of SkePU. It is code-named "Fermi" and takes another step towards a more general purpose GPU. It is designed to offer more computational power but also increased programmability and compute efficiency. Some of the key features of the Fermi architecture are [9]:

More CUDA cores: The Fermi architecture will feature 32 CUDA cores (SPs) per streaming multiprocessor.

More memory: The amount of shared memory will be increased, and a configurable L1 cache and a unified L2 cache will be introduced.

Double Precision: Although the GT200 supported double precision floating point calculations to some extent, the Fermi architecture will support the full IEEE 754-2008 floating-point standard, and with much higher performance.

Unified address space: This will allow for full C++ support. Fermi will also implement the ability to use function pointers to device functions, as well as recursion.

Concurrent Kernel Execution: Fermi will allow several kernels to execute concurrently on the GPU, which will greatly improve the utilization of the GPU resources. It also allows for task parallelism within one device.


Chapter 4

Programming Frameworks

SkePU is a multi back end library with the ability to choose which back end to compile the application for. These back ends have been implemented using three programming frameworks targeting different types of hardware: NVIDIA's CUDA framework for NVIDIA GPUs, OpenCL for heterogeneous processing platforms (GPUs included), and OpenMP for shared memory systems, especially multi-core CPUs. These frameworks all use different approaches to parallel programming, but with some similarities, especially between CUDA and OpenCL. In this chapter, each is given a brief introduction and the features most important for the development of SkePU are highlighted.

4.1 CUDA

One of the easiest ways to start programming for the NVIDIA CUDA architecture is to use the company's own software environment, which allows developers to use the common high-level language C. The software environment is divided into two interfaces that work somewhat differently: C for CUDA and the CUDA driver API.

When C for CUDA is used, device and host code are mixed in the same source. To separate them, the source code must be compiled with NVIDIA’s own compiler NVCC. With the help of a conventional C/C++ compiler like GCC, it generates an executable with an embedded binary form of the device code. Uploading of code to the device and its execution is handled automatically.

The CUDA driver API instead works at a lower level by providing API functions to load precompiled device code and launch it. It is a pure C API, and therefore no special compiler is needed except when precompiling the kernels. Both interfaces provide facilities to, for example, manage device memory and transfer data between host and device. C for CUDA does this through the runtime API, which is built on top of the driver API.


4.1.1 Programming Interface

When developing SkePU, only the C for CUDA interface was used, due to the possibility of mixing device and host code in the same source, which, for example, greatly simplifies the implementation of user functions. Below follows a short introduction to this interface. For more information about both the CUDA driver API and C for CUDA, please refer to [25].

C for CUDA

To be able to mix host and device code in the same source, some extensions to the C programming language are needed. C for CUDA provides these and a runtime library which can be used for memory management, error handling, etc. Initialization is done implicitly at the first function call.

For the compiler to know what is to be executed on the host and what is to be executed on the device, function and variable type qualifiers are prepended to the function or variable names. There are three function qualifiers. __device__ means that the function is executed on the device and callable from device only. With __host__, the function is executed on the host and callable from host only. If no qualifier is used, __host__ is the default. __global__ stands for a special kind of device function which is executed on the device but callable from host. The global functions are usually called kernels and are the bridge between host and device.

Variable type qualifiers include __device__, __constant__ and __shared__. For more information see appendix B in [25].
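The short sketch below (not from the thesis; the names cube, square, scale and kernel are placeholders) shows how these qualifiers appear in source code; a function may also carry both __host__ and __device__ to be callable from either side.

// Illustrative use of the function and variable type qualifiers described above.
__device__ float cube(float x) { return x * x * x; }          // device only
__host__ __device__ float square(float x) { return x * x; }   // callable from host and device

__constant__ float scale;               // variable in constant memory

__global__ void kernel(float* v)        // kernel: runs on the device, launched from the host
{
    __shared__ float buf[256];          // shared memory, one copy per thread block (blockDim.x <= 256 assumed)
    int i = threadIdx.x;
    buf[i] = square(v[i]) + cube(v[i]) * scale;
    v[i]  = buf[i];
}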

The syntax for calling a kernel is a bit special in C for CUDA. The basics are the same as when calling a regular function: one provides a name and a parameter list. There are, however, two more pieces of information that need to be supplied, the block size in threads and the number of blocks that will execute the kernel, as explained in Chapter 3. They are added to the call in a special parameter list enclosed by the characters <<< and >>>. Another optional parameter in that parameter list is the allocated shared memory size, if that is to be decided externally.

To understand how these are used in the source code, an example is provided in listing 1. It shows how two arrays can be added together, producing a result array. First, a device function which adds two elements and returns the result is defined. After that comes the kernel which will be called from the host. The kernel is run by several threads and each thread adds one pair of elements. In the main procedure, the host arrays are first created, then space is allocated in device memory and the data is copied from host to device. Then the execution parameters, the number of threads and thread blocks to execute with, are set to match the number of elements. Lastly the kernel is launched and the result is copied from device memory to host memory.


#define ARRAY_LENGTH 20

__device__ int plus(int a, int b)
{
    return a + b;
}

__global__ void mapKernel(int* input1, int* input2, int* output, unsigned int n)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    output[i] = plus(input1[i], input2[i]);
}

int main()
{
    // Host arrays
    int *v0, *v1;
    int *r;

    // Device arrays
    int *d_v0, *d_v1;
    int *d_r;

    size_t aSize = sizeof(int)*ARRAY_LENGTH;

    // Allocate host arrays
    v0 = (int*)malloc(aSize);
    v1 = (int*)malloc(aSize);
    r  = (int*)malloc(aSize);

    // Allocate device arrays and copy data from host to device
    cudaMalloc((void**)&d_v0, aSize);
    cudaMemcpy(d_v0, v0, aSize, cudaMemcpyHostToDevice);
    cudaMalloc((void**)&d_v1, aSize);
    cudaMemcpy(d_v1, v1, aSize, cudaMemcpyHostToDevice);
    cudaMalloc((void**)&d_r, aSize);

    // Specify execution parameters, use 512 as max thread block size.
    int numThreads = std::min(512, ARRAY_LENGTH);
    int numBlocks = ARRAY_LENGTH/numThreads +
                    (ARRAY_LENGTH%numThreads == 0 ? 0 : 1);

    // Execute kernel
    mapKernel<<< numBlocks, numThreads >>>(d_v0, d_v1, d_r, ARRAY_LENGTH);

    // Copy result from device to host
    cudaMemcpy(r, d_r, aSize, cudaMemcpyDeviceToHost);
}

Listing 1. Adding two arrays with C for CUDA.


4.2 OpenCL

OpenCL is an open standard by the Khronos group [24]. It does not specifically target NVIDIA GPUs but instead aims at being a more general standard for programming heterogeneous processing platforms [24]. By having the vendors of different devices create their own OpenCL implementations following the standard, the same piece of code can be used to program AMD and NVIDIA GPUs or even embedded devices.

4.2.1 Programming Interface

There are similarities between OpenCL and CUDA in terms of programming interfaces. Exposing its functionality as a pure C API, compilable by any C compiler, OpenCL mostly resembles the CUDA driver API. There are, however, several differences, most notably how the two libraries generate the device code. In the OpenCL case, device code is stored as strings in the program, which can be used as arguments to API calls that compile and upload the code to the device at runtime. With the CUDA driver API, the device source code is compiled offline and then uploaded at runtime.

Before one can start using a device in OpenCL, there are some initializations that have to be done. A platform needs to be created, as well as the devices to be used, execution contexts and command queues. When this is completed, memory transfers and kernel executions can be done through API function calls.

The OpenCL language, which is used when creating kernels and device functions, is based on the ISO C99 standard but with some extensions similar to those of C for CUDA described above. For example, to mark a function as a kernel callable from the host, the identifier __kernel is used. There are also identifiers such as __local, which describes shared memory, and __global for global device memory. For a more comprehensive description of the OpenCL language see [24]. Kernel launches are also done by API calls: first the parameters are set by calling clSetKernelArg, then the kernel can be launched by enqueuing it in the command queue.

To get a feeling for how an OpenCL program can look, see the example in listing 2. It does the same as the C for CUDA example in listing 1, adding two arrays to produce a third. The basic structure is the same, but note how the device source code is in the form of a string in the OpenCL example. Notice also the increased amount of initialization needed: platform, context, device and command queue need to be created before anything can be done with the device. This has the advantage that the library can be more general, but more code is also needed.

#define ARRAY_LENGTH 20

std::string source(
    "int plus(int a, int b)\n"
    "{\n"
    "    return a + b;\n"
    "}\n"
    "__kernel void mapKernel(__global int* input1, __global int* input2, "
                            "__global int* output, unsigned int n)\n"
    "{\n"
    "    int i = get_global_id(0);\n"
    "    output[i] = plus(input1[i], input2[i]);\n"
    "}\n"
);

int main()
{
    unsigned int numElements = ARRAY_LENGTH;
    size_t globalWorkSize[1];
    size_t localWorkSize[1];
    cl_int err;

    // Host arrays
    int *v0, *v1;
    int *r;

    // Device arrays
    cl_mem d_v0, d_v1;
    cl_mem d_r;

    size_t aSize = sizeof(int)*ARRAY_LENGTH;
    v0 = (int*)malloc(aSize);
    v1 = (int*)malloc(aSize);
    r  = (int*)malloc(aSize);

    // Create platform for devices
    cl_platform_id platform;
    clGetPlatformIDs(1, &platform, NULL);

    // Create and get those devices found
    cl_device_id device;
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    // Specify context properties
    cl_context_properties props[3];
    props[0] = (cl_context_properties)CL_CONTEXT_PLATFORM;
    props[1] = (cl_context_properties)platform;
    props[2] = (cl_context_properties)0;

    // Create a context
    cl_context context = clCreateContextFromType(props, CL_DEVICE_TYPE_GPU,
                                                 NULL, NULL, &err);

    // Create a command-queue on the GPU device
    cl_command_queue queue = clCreateCommandQueue(context, device, 0, &err);

    // Build program
    const char* c_src = source.c_str();
    cl_program program = clCreateProgramWithSource(context, 1, &c_src, NULL, &err);
    clBuildProgram(program, 0, NULL, NULL, NULL, NULL);

    // Create kernel object
    cl_kernel kernel = clCreateKernel(program, "mapKernel", &err);

    // Allocate and copy data
    d_v0 = clCreateBuffer(context, CL_MEM_COPY_HOST_PTR, aSize, (void*)v0, &err);
    d_v1 = clCreateBuffer(context, CL_MEM_COPY_HOST_PTR, aSize, (void*)v1, &err);
    d_r  = clCreateBuffer(context, CL_MEM_READ_WRITE, aSize, NULL, &err);

    int numThreads = std::min(512, ARRAY_LENGTH);
    int numBlocks = ARRAY_LENGTH/numThreads + (ARRAY_LENGTH%numThreads == 0 ? 0 : 1);

    // Set the kernel arguments
    clSetKernelArg(kernel, 0, sizeof(cl_mem), (void*)&d_v0);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), (void*)&d_v1);
    clSetKernelArg(kernel, 2, sizeof(cl_mem), (void*)&d_r);
    clSetKernelArg(kernel, 3, sizeof(unsigned int), (void*)&numElements);

    globalWorkSize[0] = numBlocks * numThreads;
    localWorkSize[0]  = numThreads;

    // Launch the kernel
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, globalWorkSize,
                           localWorkSize, 0, NULL, NULL);

    // Copy result from device to host
    clEnqueueReadBuffer(queue, d_r, CL_TRUE, 0, aSize, (void*)r, 0, NULL, NULL);
}

Listing 2. Adding two arrays with OpenCL.

4.3 OpenMP

OpenMP is an API used to program shared memory multiprocessors, which includes today's popular multi-core processors. It is composed of several compiler directives, library routines and environment variables that allow users to create and manage parallel programs on this kind of hardware. It uses a fork-and-join style of parallel programming where a master thread forks a number of slave threads that do work in parallel and are later joined with the master thread. Implementations exist for C/C++ and Fortran [5].

4.3.1 Programming Interface

As mentioned before, the OpenMP API consists of both compiler directives and library routines. In the case of C and C++, the compiler directives are implemented using the #pragma mechanism and all start with #pragma omp, followed by the directive in question on the same line. The library routines are provided by the header file omp.h and are named with the prefix omp_.

The fundamental OpenMP directive is the parallel region, defined by #pragma omp parallel. It is used to express that the block following the statement will be run in parallel. There are also various work-sharing constructs, with the loop construct being the most important for SkePU. It is defined with #pragma omp for within a parallel region and distributes the loop iterations among the participating threads. The two directives can also be combined into #pragma omp parallel for to parallelize just a certain loop. Variables used in parallel regions are usually shared but can also be declared private by adding private to the parallel construct; for example, #pragma omp parallel private(myid) makes the variable myid private. The library routines used by SkePU are omp_get_max_threads(), which returns the maximum number of threads that will participate in a parallel region, omp_set_num_threads(int), which sets this number, and omp_get_thread_num(), which returns the executing thread's number within a parallel region.
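A minimal sketch (not from the thesis) combining these directives and routines is shown below; the program and its output are assumptions made only to illustrate their use.

#include <omp.h>
#include <cstdio>

int main()
{
    printf("max threads: %d\n", omp_get_max_threads());
    omp_set_num_threads(4);             // request four threads for parallel regions

    int myid;
    #pragma omp parallel private(myid)  // each thread gets its own copy of myid
    {
        myid = omp_get_thread_num();
        printf("hello from thread %d\n", myid);
    }
    return 0;
}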

This was only a short introduction to OpenMP; for a complete description, see the official specification [5]. Listing 3 shows a short example that performs the same calculation as listings 1 and 2. The only thing needed compared to a sequential implementation is the #pragma directive; the OpenMP implementation automatically handles the work-sharing.


#define ARRAY_LENGTH 20

int main()
{
    int *v0, *v1;
    int *r;

    size_t aSize = sizeof(int)*ARRAY_LENGTH;
    v0 = (int*)malloc(aSize);
    v1 = (int*)malloc(aSize);
    r  = (int*)malloc(aSize);

    #pragma omp parallel for
    for(int i = 0; i < ARRAY_LENGTH; ++i)
    {
        r[i] = v0[i] + v1[i];
    }
}

Listing 3. Adding two arrays with OpenMP.


Chapter 5

SkePU

SkePU is a C++ template library designed to make parallel programming easier with the use of higher-order functions, skeletons. It is modeled after BlockLib [1], a similar C library for the IBM Cell BE, and implements the same set of data parallel skeletons. SkePU was originally constructed with the aim of providing the same functionality as BlockLib, but instead of targeting the Cell processor, it was geared towards GPUs, using CUDA and/or OpenCL as back end. A large portion of the library therefore consists of GPU memory management, kernels and, in the case of OpenCL, code generation and compilation. The interface is, however, fairly general and does not bind the library to GPUs only. This can also be seen in SkePU, as there is a sequential CPU and an OpenMP based implementation of all the skeletons. Which variant to use can be controlled by defining either SKEPU_CUDA, SKEPU_OPENCL or SKEPU_OPENMP in the source code. With no defines, the single-thread CPU variant is used. Other modifications of the source code are not necessary since the interface is the same for all back ends.

In addition to the skeletal functions, SkePU also includes one container which must be used when doing computations with the skeletons. It is a vector/array type, designed after the STL container vector. Actually, its implementation uses the STL vector internally and therefore also has an interface mostly compatible with it. The SkePU Vector hides GPU memory management and also uses lazy memory copying to avoid unnecessary memory operations.

5.1 Container

As mentioned earlier, the only container in the current version of SkePU is a vector. It is modeled after the STL vector and is largely compatible with it. However, since it uses proxy elements for some operations to be able to distinguish between reads and writes, some behavior might differ. See [23] for more information.

The SkePU Vector keeps track of which parts of it are currently allocated and uploaded to the GPU. If a computation changes the vector in GPU memory, it is not directly transferred back to the host memory. Instead, the vector waits until an element is accessed on the host side before any copying is done (for example through the [] operator); this lazy memory copying is of great use if several skeletons are called one after the other, with no modifications of the vector by the host in between. In that case, the vectors are kept on the device (GPU) through all the computations, which greatly improves performance. Most of the memory copying is done implicitly, but the vector also contains a flush operation which updates the vector from the device and deallocates its memory. Listing 4 shows how a vector of size 100 is created with all elements initialized to 10, and also how flush is called.

skepu::Vector<double> input(100, 10);
.
.
.
input.flush();

Listing 4. Vector creation.
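The sketch below illustrates the lazy copying behavior described above. It assumes a hypothetical unary Map skeleton instance called square (created as in Section 5.3); the exact point of the copy-back is as described in the text, not verified against the implementation.

// Sketch: chained skeleton calls with a GPU back end.
skepu::Vector<double> v(100, 1.0);
skepu::Vector<double> r1;
skepu::Vector<double> r2;

square(v, r1);      // result computed and left in device memory
square(r1, r2);     // r1 is reused on the device, no host transfer needed
double x = r2[0];   // first host-side access triggers the copy back
r2.flush();         // alternatively, update the host copy and free device memory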

5.2 User Functions

To provide a simple way of defining functions that can be used with the skeletons, several preprocessor macros have been implemented that expand to the right kind of structure that constitutes the function. In SkePU, user functions are basically a struct with member functions for CUDA and CPU, and strings for OpenCL. Listing 5 shows one of the macros and its expansion.

The macros available in the current version of SkePU are shown in Listing 6.

BINARY_FUNC(plus, double, a, b,
    return a+b;
)

// expands to:

struct plus
{
    skepu::FuncType funcType;
    std::string func_CL;
    std::string funcName_CL;
    std::string datatype_CL;

    plus()
    {
        funcType = skepu::BINARY;
        funcName_CL.append("plus");
        datatype_CL.append("double");
        func_CL.append(
            "double plus(double a, double b)\n"
            "{\n"
            "    return a+b;\n"
            "}\n");
    }

    double CPU(double a, double b)
    {
        return a+b;
    }

    __device__ double CU(double a, double b)
    {
        return a+b;
    }
};

Listing 5. User function, macro expansion.

UNARY_FUNC(name, type1, param1, func)
UNARY_FUNC_CONSTANT(name, type1, param1, const1, func)
BINARY_FUNC(name, type1, param1, param2, func)
BINARY_FUNC_CONSTANT(name, type1, param1, param2, const1, func)
TERNARY_FUNC(name, type1, param1, param2, param3, func)
TERNARY_FUNC_CONSTANT(name, type1, param1, param2, param3, const1, func)
OVERLAP_FUNC(name, type1, over, param1, func)
ARRAY_FUNC(name, type1, param1, param2, func)

Listing 6. The available user function macros.

5.3 Skeleton Functions

In the object-oriented spirit of C++, the skeleton functions in SkePU are represented by objects. By overloading operator() they can be made to behave in a way similar to regular functions. All of the skeletons contain member functions representing each of the different implementations: CUDA, OpenCL, OpenMP and CPU. The member functions are called CU, CL, OMP and CPU respectively. If the skeleton is called with operator(), the library decides which one to use depending on what is available. In the OpenCL case, the skeleton objects also contain the necessary code generation and compilation procedures. When a skeleton is instantiated, it creates an environment to execute in, containing all available OpenCL or CUDA devices in the system. This environment is created as a singleton so that it is shared among all skeletons in the program.

The skeletons can be called with whole vectors as arguments, applying the operation to all elements of the vector. Another way to call them is with iterators; in that case, a start iterator and an end iterator are provided instead, which makes it possible to apply the skeleton to only part of a vector. When calling with whole vectors, the result vector is always resized and overwritten with the results of the computation; this is, however, not the case when calling with iterators. Listing 7 shows how a skeleton is created from a user-defined function and how it is used.


BINARY_FUNC(plus, double, a, b,
    return a+b;
)

// Creates a reduction skeleton from the plus operator
skepu::Reduce<plus> globalSum(new plus);

skepu::Vector<double> input(100, 10);

// Applies the skeleton to the vector input.
// Backend depends on what #defines are made.
double sum = globalSum(input);
double halfsum = globalSum(input.begin(), input.begin()+50);

// Call CPU backend explicitly
sum = globalSum.CPU(input);

Listing 7. Skeleton creation.

First a user function called plus is created as described in Section 5.2. Then a skeleton object is instantiated with that function as a parameter. The function needs to be provided both as a template parameter and as a pointer to an instantiated version of the user function (remember that the user functions are in fact structs).

To see how a skeleton class is implemented in SkePU, all code for the Map skeleton is provided in appendix C.

5.3.1 CUDA Implementation

When skeletons are instantiated, they save a pointer to a user function struct. In the CUDA case, the actual user function is a member of that struct with the __device__ specifier, a function executed on the device. When a skeleton is called, the struct is sent as a parameter to a predefined kernel for that skeleton and the kernel calls the device function when it is needed.
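The following is a minimal, simplified sketch of this mechanism, not SkePU's actual kernel source: a plain functor with a __device__ member CU is passed by value to a templated kernel, which invokes it per element. The names PlusFunc and mapKernelSketch are assumptions made for the illustration.

struct PlusFunc
{
    __device__ double CU(double a, double b) { return a + b; }
};

template <typename MapFunc>
__global__ void mapKernelSketch(MapFunc mapFunc, double* in1, double* in2,
                                double* out, unsigned int n)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if(i < n)
        out[i] = mapFunc.CU(in1[i], in2[i]);   // call the user function's device member
}

// Hypothetical launch:
// mapKernelSketch<<<numBlocks, numThreads>>>(PlusFunc(), d_in1, d_in2, d_out, n);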

5.3.2 OpenCL Implementation

For the OpenCL back end, a lot more work is done behind the scenes than in the CUDA case. First, when a skeleton using OpenCL is instantiated, the OpenCL source code for that skeleton is generated, including both kernel source and user function source. The user function source is saved in the user function struct as a string, and the kernel source is stored as a predefined static string in the library source code. Both strings are combined into an OpenCL program, which is compiled, and a handle to the kernel is stored in the skeleton object. When the skeleton is later called, this handle is used to set parameters and launch the kernel.
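A sketch of this code-generation step is given below. The function name buildMapKernel, its parameters and the exact concatenation order are assumptions, not SkePU's internals; the OpenCL API calls are the same ones used in listing 2.

#include <string>
#include <CL/cl.h>

// Concatenates the user function source (from the macro expansion) with a
// predefined kernel source string, compiles it at runtime and returns the
// kernel handle that a skeleton object would store.
cl_kernel buildMapKernel(cl_context context, const std::string& userFuncSource,
                         const std::string& kernelSource, const char* kernelName)
{
    std::string total = userFuncSource + kernelSource;
    const char* c_src = total.c_str();

    cl_int err;
    cl_program program = clCreateProgramWithSource(context, 1, &c_src, NULL, &err);
    clBuildProgram(program, 0, NULL, NULL, NULL, NULL);

    return clCreateKernel(program, kernelName, &err);
}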


5.3.3 Types of Skeletons

As mentioned earlier, SkePU implements the same set of skeletons as BlockLib: the two common data parallel patterns Map and Reduce, a combination of those called MapReduce, and a variant of Map called MapOverlap. Apart from these, SkePU also implements another variant of Map called MapArray and an operation called Scan. Below is a short description of each of the skeletons.

Map

In the Map skeleton, every element in the result vector r is a function f of the corresponding elements in one or more input vectors v0, ..., vk, each of length N. More formally: r[i] = f(v0[i], ..., vk[i]) ∀i ∈ {0, ..., N − 1}. In SkePU, the number of input vectors is limited to a maximum of three (k ≤ 2).

The parallelization of Map is a fairly easy task. In OpenMP it is basically done by placing the #pragma omp parallel for directive before a loop over all the elements and therefore leaving the work of parallelizing to the implementation. In OpenCL and CUDA, the kernels are also quite simple: each thread in the execution grid maps one element at a time until there are none left.

An example of Map, which calculates the result vector as the sum of two input vectors, is shown in listing 8. The output is shown as a comment at the end. A Map skeleton with the name sum is instantiated in the same way as described earlier and is then applied to vectors v0 and v1 with the result in r.

#include <iostream>
#include "skepu/vector.h"
#include "skepu/map.h"

BINARY_FUNC(plus, double, a, b,
    return a+b;
)

int main()
{
    skepu::Map<plus> sum(new plus);

    skepu::Vector<double> v0(10, 10);
    skepu::Vector<double> v1(10, 5);
    skepu::Vector<double> r;

    sum(v0, v1, r);

    std::cout << "Result: " << r << "\n";

    return 0;
}

// Output
// Result: 15 15 15 15 15 15 15 15 15 15

Listing 8. A Map example.


Reduce

Reduction is another common data parallel pattern. The scalar result is computed by applying a commutative associative binary operator ⊕ between each element in the vector. With the same notation as before, reduction can be described as: r = v[0] ⊕ v[1] ⊕ ... ⊕ v[N − 1]. A reduction using + as operator would, for example, yield the global sum of the input vector.

In OpenMP, the parallel reduction is done by letting each thread reduce one part of the vector, producing a thread-result. The thread-results are then reduced by the master thread, yielding the final result. The GPU back ends implement Reduce as described in [13], which is optimized for NVIDIA GPUs.

An example is shown in listing 9. The syntax of skeleton instantiation is the same as before, but note that when calling the Reduce skeleton, the scalar result is returned by the function rather than through a parameter.

#include <iostream>
#include "skepu/vector.h"
#include "skepu/reduce.h"

BINARY_FUNC(plus, double, a, b,
    return a+b;
)

int main()
{
    skepu::Reduce<plus> globalSum(new plus);

    skepu::Vector<double> v0(1000, 2);

    double r = globalSum(v0);

    std::cout << "Result: " << r << "\n";

    return 0;
}

// Output
// Result: 2000

Listing 9. An example of a reduction with + as operator.

MapReduce

MapReduce is basically just a combination of the two above: it produces the same result as if one would first Map one or more vectors to a result vector and then do a reduction on that result. It is provided since it combines the mapping and reduction in the same computation kernel and therefore avoids some synchronization, which speeds up the calculation. Formally: r = f(v0[0], ..., vk[0]) ⊕ ... ⊕ f(v0[N − 1], ..., vk[N − 1]).

MapReduce is basically implemented in the same way as Reduce, but with the addition of a mapping operation before the values are combined.

A typical example of such a computation is the dot product. The MapReduce skeleton is instantiated in a similar way as the Map and Reduce skeletons, but since it needs two user functions, one for the mapping part and one for the reduction, two parameters are needed at instantiation time: first the mapping function, then the reduce function. In listing 10, a MapReduce skeleton is created which maps two vectors with mult and then reduces the result with plus, producing the dot product of the two vectors.

#include <iostream>
#include "skepu/vector.h"
#include "skepu/mapreduce.h"

BINARY_FUNC(plus, double, a, b,
    return a+b;
)

BINARY_FUNC(mult, double, a, b,
    return a*b;
)

int main()
{
    skepu::MapReduce<mult, plus> dotProduct(new mult, new plus);

    skepu::Vector<double> v0(1000, 2);
    skepu::Vector<double> v1(1000, 2);

    double r = dotProduct(v0, v1);

    std::cout << "Result: " << r << "\n";

    return 0;
}

// Output
// Result: 4000

Listing 10. A MapReduce example that computes the dot product.

MapOverlap

The higher-order function MapOverlap is similar to a Map, but each element r[i] of the result vector is a function of several adjacent elements of one input vector that reside at a certain constant maximum distance from i in the input vector. The number of these elements is controlled by the parameter overlap. With the notation used above: r[i] = f(v[i − d], v[i − d + 1], ..., v[i + d]) ∀i ∈ {0, ..., N − 1}, where d denotes the value of overlap. Convolution is an example of a calculation that fits this pattern. The edge policy, i.e. how MapOverlap behaves when a read outside the array bounds is performed, can be either cyclic or constant. When cyclic, the value is taken from the other side of the array; when constant, a user-defined constant is used. When nothing is specified, the default behavior in SkePU is constant with 0 as value. When using any of the GPU variants of MapOverlap, the maximum overlap that can be specified is limited by the shared memory available to the GPU and by the maximum number of threads per block. These two factors typically limit the overlap to less than 256.

The OpenMP implementation of MapOverlap is done by parallelizing two for-loops in the same way as in the Map skeleton. The CUDA and OpenCL kernels have the same basic structure as presented in Section 3.4: first data is copied to shared memory, both the actual data and the overlap, then the mapping is done and the new data is stored back into device memory.

An example program that performs a convolution with the help of MapOverlap is shown in listing 11. Note that the indexing is relative to the element being calculated, 0 ± overlap. A MapOverlap skeleton is instantiated with over as user function and is then called with vector v0 as input and vector r as result. It also uses the constant 0 at the edges, as specified by the parameters skepu::CONSTANT and (double)0.

#include <iostream>
#include "skepu/vector.h"
#include "skepu/mapoverlap.h"

OVERLAP_FUNC(over, double, 2, a,
    return a[-2]*0.4f + a[-1]*0.2f + a[0]*0.1f +
           a[1]*0.2f + a[2]*0.4f;
)

int main()
{
    skepu::MapOverlap<over> conv(new over);

    skepu::Vector<double> v0(10, 10);
    skepu::Vector<double> r;

    conv(v0, r, skepu::CONSTANT, (double)0);

    std::cout << "Result: " << r << "\n";

    return 0;
}

// Output
// Result: 7 9 13 13 13 13 13 13 9 7

Listing 11. A MapOverlap example.

MapArray

MapArray is yet another variant of Map. It produces a result vector from two input vectors, where each element of the result, r[i], is a function of the corresponding element of one of the input vectors, v1[i], and any number of elements from the other input vector v0. This means that at each call to the user-defined function f, which is done for each element in v1, all elements of v0 can be accessed. The notation for accessing an element of v0 is the same as for arrays in C: v0[i], where i is a number from 0 to the length of v0. Formally: r[i] = f(v0, v1[i]) ∀i ∈ {0, ..., N − 1}.

MapArray was first devised to avoid the overlap constraints of MapOverlap and to provide a simple yet powerful way of mapping several elements of an input vector.

MapArray is basically implemented in the same way as Map, both for OpenMP and the GPU back ends, with the only difference that a pointer to one of the arrays is used instead of a single element.

Listing 12 shows how MapArray can be used to reverse a vector by using v1[i] as an index into v0. A MapArray skeleton is instantiated and called with v0 and v1 as inputs and r as output. v0 corresponds to parameter a in the user function arr and v1 to b; therefore, when the skeleton is applied, each element in v1 can be mapped to any number of elements in v0. In listing 12, v1 contains indices into v0 of the form 9, 8, 7, ...; as the user function arr specifies, the first element in r will therefore be v0[9], the next v0[8], and so on, resulting in a reversed v0.

#include <iostream>
#include "skepu/vector.h"
#include "skepu/maparray.h"

ARRAY_FUNC(arr, double, a, b,
    int index = (int)b;
    return a[index];
)

int main()
{
    skepu::MapArray<arr> reverse(new arr);

    skepu::Vector<double> v0(10);
    skepu::Vector<double> v1(10);
    skepu::Vector<double> r;

    // Sets v0 = 1 2 3 4...
    //      v1 = 9 8 7 6...
    for(int i = 0; i < 10; ++i)
    {
        v0[i] = i+1;
        v1[i] = 10-i-1;
    }

    reverse(v0, v1, r);

    std::cout << "Result: " << r << "\n";

    return 0;
}

// Output
// Result: 10 9 8 7 6 5 4 3 2 1

Listing 12. A MapArray example that reverses a vector.

Scan

The Scan operation, also known as prefix sum, is related to Reduce, but instead of producing a single scalar result it produces an output vector, r, of the same length as the input, v, with each element being the reduction of itself and all elements preceding it in the input. The reduction is performed with a commutative associative binary operator ⊕. With the same notation as used previously it can be described as: r[i] = v[0] ⊕ v[1] ⊕ ... ⊕ v[i] ∀i ∈ {0, ..., N − 1}. For example, the input vector [4, 3, 7, 6, 9] would produce the result vector [4, 7, 14, 20, 29] if the + operator was used.

There are actually two variants of Scan in SkePU, inclusive and exclusive. The one described above is the inclusive Scan, because it includes the current element in the reduction. The exclusive Scan does not, and can be described as: r[0] = ∅; r[i] = v[0] ⊕ v[1] ⊕ ... ⊕ v[i − 1] ∀i ∈ {1, ..., N − 1}. In the previous example, the exclusive result would be [0, 4, 7, 14, 20].

In OpenMP, the parallel scan is implemented by letting each thread scan one part of the array and store its last value into a temporary results array. This temporary results array is then scanned by the master thread, and each thread adds its corresponding value from this array to its scanned part, producing the final result. The GPU back ends use an implementation based on the naive scan from [15], although changed a bit to allow both inclusive and exclusive scans as well as arbitrarily sized arrays.

Listing 13 shows a complete program using the Scan skeleton from SkePU. A Scan skeleton is instantiated in the same manner as seen previously, with plus as operator, and the input vector is initialized with the numbers [1, 2, ..., 10]. The skeleton is then called with the type specified as inclusive. The output can be seen as a comment at the end of the listing.

#include <iostream>
#include "skepu/vector.h"
#include "skepu/scan.h"

BINARY_FUNC(plus, double, a, b,
    return a+b;
)

int main()
{
    skepu::Scan<plus> prefixSum(new plus);

    skepu::Vector<double> v0(10);
    skepu::Vector<double> r;

    // Sets v0 = 1 2 3 4 5...
    for(int i = 0; i < 10; ++i)
        v0[i] = i+1;

    prefixSum(v0, r, skepu::INCLUSIVE);

    std::cout << "Result: " << r << "\n";

    return 0;
}

// Output
// Result: 1 3 6 10 15 21 28 36 45 55

Listing 13. An example of the skeleton Scan with a plus operator.


5.4 Multi-GPU Support

SkePU has support for carrying out computations with the help of several GPUs on a data-parallel level. It utilizes the different GPUs by dividing the input vectors equally amongst them and doing the calculations in parallel on the devices. Here CUDA and OpenCL differ a lot. In OpenCL, one CPU thread can control several GPUs by switching command queues. In CUDA, or to be more precise, in the CUDA runtime system which SkePU is based on, this is not possible; instead, each CPU thread is bound to one device. To make multi-GPU computation possible, several host threads must then be created. This is done in SkePU, but in the current version new threads are started for each skeleton call and bound to devices; this binding can take a lot of time, and hence the multi-GPU support in SkePU with CUDA is not very efficient. With OpenCL, however, it works much better. Below follows a short description of how multi-GPU support was implemented for the various skeletons in SkePU.

By default, SkePU will utilize as many GPUs as it can find in the system; however, this can be controlled by defining SKEPU_NUMGPU. Setting it to 0 keeps the default behavior, while any other number specifies how many GPUs SkePU should try to utilize.
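For illustration, and assuming compile-time configuration, the macro could for example be defined before including the SkePU headers (or passed on the compiler command line); the value 2 below is arbitrary:

// Use at most two GPUs; defining the macro as 0 (or not defining it at all)
// keeps the default behavior of using all GPUs found in the system.
#define SKEPU_NUMGPU 2

#include "skepu/vector.h"
#include "skepu/map.h"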

5.4.1 Map

The multi-GPU implementation of the Map skeleton is fairly straightforward since no dependency exists between elements. The vector is divided into equally sized parts, one for each GPU in the system (the last part may be slightly bigger since it also receives the remainder, if there is one). Each part is transferred to its GPU and the mapping is performed in parallel, both element-wise on each GPU and part-wise among the GPUs. The vector on which the mapping was performed stores on which device each part is located, so that the lazy memory copying also works in the multi-GPU case. The division of elements among the GPUs is done the same way each time, so if the mapping is done twice in a row, no extra device<->host memory transfers are needed thanks to the lazy memory copying. The same holds between different skeletons, since the division is made the same way for all of them.
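The even split can be sketched as follows (an illustrative helper, not SkePU's actual code); because the same split is computed on every call, data already resident on a device can be reused by the lazy memory copying:

#include <cstddef>

// Compute which part of an n-element vector device d (out of numDevices)
// is responsible for: the half-open range [offset, offset + size).
// Only the last device receives the remainder, if there is one.
void devicePart(std::size_t n, std::size_t numDevices, std::size_t d,
                std::size_t& offset, std::size_t& size)
{
    const std::size_t base = n / numDevices;
    offset = d * base;
    size   = (d == numDevices - 1) ? n - offset : base;
}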

5.4.2 Reduce and MapReduce

The Reduce skeleton is, however, not as easy to parallelize on a multi-GPU level since the elements need to be combined with each other. The strategy is initially the same as for Map: the elements are divided among the GPUs and each GPU reduces its part, producing a scalar result. The GPUs are then synchronized and their results are combined with the reduce operator on the CPU to yield the final result.
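The host-side combination step simply folds the per-GPU partial results with the user's operator; a minimal sketch with + as operator (the function name is illustrative):

#include <cstddef>
#include <vector>

// partials[d] holds the scalar result produced by device d.
// The partial results are combined on the CPU with the reduce operator.
double combinePartials(const std::vector<double>& partials)
{
    double result = partials[0];             // assumes at least one device
    for (std::size_t d = 1; d < partials.size(); ++d)
        result = result + partials[d];       // here the operator is +
    return result;
}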

Since the MapReduce skeleton is very similar to Reduce, it works the same way. Each part is mapped and reduced separately on its GPU, producing per-GPU results, which are then reduced on the CPU to yield the final result.



5.4.3 MapOverlap

The MapOverlap skeleton has a multi-GPU implementation similar to that of Map; however, the overlap introduces a dependency between the GPUs. This means that parts of the vector need to be present on two GPUs at a time, which has the implication that the lazy memory copying is not very effective, since the overlap region needs to be transferred from one GPU to the host and then to the other GPU between two MapOverlap calls.

5.4.4 MapArray

In the MapArray skeleton, only one of the input vectors is divided among the participating GPUs. The other one is copied to all devices, but no synchronization is done to update this vector between skeleton calls; it is up to the user to update it if needed with the help of the flush() functionality.

5.4.5 Scan

Similar to Reduce, the Scan skeleton requires some extra work besides just dividing the vector among the GPUs. Each GPU scans its respective part and saves its last value, which after a scan equals the reduction of that part, on the host. The host takes all the devices' last values and performs a scan on them on the host side. Then the first value is added to all the elements of the first GPU, the second value to all the elements of the second GPU, and so on. This produces the correct result, and since no other dependency between the GPUs exists, the lazy memory copying works as expected.
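One way to realize the host-side step is to turn the devices' last values into per-device offsets (an exclusive scan of those values) and then add each offset to all elements of the corresponding device's part; the sketch below is illustrative only and uses + as operator:

#include <cstddef>
#include <vector>

// lastValues[d] is the last element of device d's locally scanned part,
// i.e. the reduction of that part. The returned vector holds the offset
// that must be added to every element of device d's part.
std::vector<double> scanOffsets(const std::vector<double>& lastValues)
{
    std::vector<double> offsets(lastValues.size(), 0.0);
    double acc = 0.0;
    for (std::size_t d = 0; d < lastValues.size(); ++d)
    {
        offsets[d] = acc;        // device 0 gets 0, device 1 the sum of part 0, ...
        acc += lastValues[d];
    }
    return offsets;
}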

5.5 Helpers

Apart from the functionality described in the previous sections, SkePU also contains some helper objects that are used by the library itself and that the user does not come into contact with directly. They help with memory and device management as well as providing functionality for testing purposes.

5.5.1 Device Management

Since SkePU supports multiple back ends and the use of multiple GPUs for those back ends, two classes exist to ease the development of the skeletons: one representing an execution environment and the other representing a device.

Environment

The Environment class defines an execution environment for the skeletons. It mainly keeps track of which devices are available and gives access to them. It is implemented as a singleton so that only one environment is actually instantiated; the skeletons store a pointer to this instance, which is created by the first skeleton that is defined.
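The singleton pattern used here can be sketched as follows (simplified, with illustrative names; this is not SkePU's actual class definition):

// The first skeleton that asks for the environment triggers its construction;
// all later calls return a pointer to the same instance.
class Environment
{
public:
    static Environment* getInstance()
    {
        static Environment instance;   // constructed once, on first use
        return &instance;
    }

private:
    Environment()
    {
        // enumerate the available OpenCL/CUDA devices here
    }
    Environment(const Environment&);            // non-copyable
    Environment& operator=(const Environment&); // non-assignable
};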



Device

The Device class exists in two versions, one representing an OpenCL device and one representing a CUDA device. They are quite similar: apart from storing the device ID, or in the case of OpenCL the cl_device_id object, they also contain various properties of the device and means of accessing it, for example the maximum thread block size and the maximum global memory size. The OpenCL device also contains a command queue and an execution context.

5.5.2 Memory Management

Device memory management is handled by the container class which, as mentioned earlier, implements a lazy memory copying scheme. To simplify the development of the container, a class representing a device memory allocation is available. It is created with a specific size and, optionally, a host memory pointer where the data is stored on the host side. It also keeps track of when device data has changed and, depending on this status, can copy data either from host to device or from device to host.

With the help of this device-memory-pointer, the skepu::Vector can implement lazy memory copying as laid out in section 5.1. Each vector contains a list of device-memory-pointers, each paired with its device, start address and length, so that the vector can keep track of where, and which parts of it, are stored on devices. The vector also contains methods for adding, updating and invalidating the device-memory-pointers. When a skeleton is called, it asks the vector whether a pointer already exists for that data on a specific device; if it does, the data is updated if needed, otherwise the data already on the device and pointed to by the pointer is used in the skeleton's calculation. This is the heart of the lazy memory copying: by keeping track of device-memory-pointers, and by the pointers themselves keeping track of when data is modified, unnecessary memory transfers between host and device are avoided. All methods on the vector that access data also go through the list of device-memory-pointers and check whether the data needs to be updated first. Here, reads and writes to the vector are distinguished: a read updates the data from the device if needed, while a write first updates from the device and then invalidates the device-memory-pointer.
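A simplified sketch of what such a device-memory-pointer could look like for the CUDA back end is shown below; the class and member names are illustrative and do not reproduce SkePU's exact API:

#include <cstddef>
#include <cuda_runtime.h>

// Wraps one device allocation, remembers the corresponding host pointer and
// tracks whether the device copy has been modified since the last transfer.
template <typename T>
class DeviceMemPointer
{
public:
    DeviceMemPointer(T* hostPtr, std::size_t numElements)
        : m_hostPtr(hostPtr), m_numElements(numElements),
          m_devicePtr(0), m_deviceDataHasChanged(false)
    {
        cudaMalloc((void**)&m_devicePtr, numElements * sizeof(T));
    }

    ~DeviceMemPointer() { cudaFree(m_devicePtr); }

    void copyHostToDevice()
    {
        cudaMemcpy(m_devicePtr, m_hostPtr, m_numElements * sizeof(T),
                   cudaMemcpyHostToDevice);
        m_deviceDataHasChanged = false;
    }

    void copyDeviceToHost()
    {
        if (m_deviceDataHasChanged)
            cudaMemcpy(m_hostPtr, m_devicePtr, m_numElements * sizeof(T),
                       cudaMemcpyDeviceToHost);
        m_deviceDataHasChanged = false;
    }

    T*   getDeviceDataPointer() const { return m_devicePtr; }
    void changeDeviceData()           { m_deviceDataHasChanged = true; }

private:
    T*          m_hostPtr;
    std::size_t m_numElements;
    T*          m_devicePtr;
    bool        m_deviceDataHasChanged;
};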

5.5.3 Testing

SkePU also includes some functionality to simplify testing of the library and of applications developed with it. The SkePU Vector class can randomize its contents, and it can also store, load and print its elements.

To make it easier to measure time, a timer class has been implemented, utilizing the gettimeofday() function on Linux systems. It stores the times from all runs until a reset and can return both the average and the total time of the runs.
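A minimal sketch of such a timer, built on gettimeofday(), is shown below (the interface is illustrative, not SkePU's exact one):

#include <cstddef>
#include <sys/time.h>
#include <vector>

// Records the elapsed time of repeated runs; average and total time can be
// queried until the next reset.
class Timer
{
public:
    void start() { gettimeofday(&m_start, 0); }

    void stop()
    {
        timeval end;
        gettimeofday(&end, 0);
        const double elapsed = (end.tv_sec - m_start.tv_sec)
                             + (end.tv_usec - m_start.tv_usec) * 1e-6;
        m_times.push_back(elapsed);
    }

    double totalTime() const
    {
        double sum = 0.0;
        for (std::size_t i = 0; i < m_times.size(); ++i)
            sum += m_times[i];
        return sum;
    }

    double averageTime() const
    {
        return m_times.empty() ? 0.0 : totalTime() / m_times.size();
    }

    void reset() { m_times.clear(); }

private:
    timeval m_start;
    std::vector<double> m_times;
};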

To save results from measurements, there is also a data collector class which can store pairs of values, for example an integer and a double, and then save them to a file.



5.6 Dependencies

SkePU does not use any third-party libraries except for CUDA and OpenCL. It does, however, make some use of the C++ standard library, especially the STL, which must be available for the library to compile. If either CUDA or OpenCL is to be used, it must of course also be installed. CUDA programs need to be compiled with NVCC since CUDA support is provided through the CUDA runtime API. SkePU is a template library, which means that there is no need to link against a precompiled library; it is enough to include the header files.


Chapter 6

Evaluation

All of the following evaluations were performed on a server equipped with 24 gigabytes of RAM and two quad-core Intel(R) Xeon(R) E5520 CPUs clocked at 2.27 GHz, making a total of 8 cores available on the CPU side. Furthermore, it had two NVIDIA GT200 (Tesla C1060) GPUs installed. It ran Linux with kernel version 2.6.33 and GCC 4.5 (for compilation with NVCC, GCC 4.3 was used due to problems with using the STL together with CUDA code in newer versions of GCC). CUDA version 3.0 was used with NVIDIA drivers 195.36.15 and their included OpenCL implementation.

Since there are problems with running multi-GPU using CUDA as back end in the current version of SkePU (see Section 5.4 for an explanation), no evaluation was done with this combination. Instead, the multi-GPU support was evaluated using OpenCL.

6.1 Time Distribution

To see how the execution time is distributed when calling a skeleton with a GPU back end, some measurements were made on small test cases; the result can be seen in figure 6.1. Time was measured with OpenCL as back end on an array of 10 000 000 elements. The test functions are shown in listing 14. The dominating time for all skeletons is the memory transfer time; this is of course dependent on how computation-heavy the user-defined functions are, and since the test functions are quite small, this result is expected. It is, however, important to note this difference in time, since eliminating unnecessary memory transfers is a key factor in reaching good performance. The reason for Map, MapReduce and MapArray having double the host-to-device copy time is that the tested variants used two vectors as input. Reduce and MapReduce have a small device-to-host copy time because they only return scalar results.


[Bar chart omitted: time (sec) for Map, Reduce, MapReduce, MapOverlap, MapArray and Scan, broken down into CopyUp, CopyDown, Calculate and Other.]

Figure 6.1. Time distribution for the implemented skeletons. OpenCL back end.

BINARY_FUNC(map_test, double, a, b,
    return ( fmax(a,b)*(a-b) );
)

BINARY_FUNC(reduce_test, double, a, b,
    return ( a + b );
)

OVERLAP_FUNC(mapoverlap_test, double, 5, a,
    return a[-5]*0.5f + a[-4]*0.6f + a[-3]*0.7f +
           a[-2]*0.8f + a[-1]*0.9f + a[0] +
           a[1]*0.9f + a[2]*0.8f + a[3]*0.7f +
           a[4]*0.6f + a[5]*0.5f;
)

ARRAY_FUNC(maparray_test, double, a, b,
    int index;
    index = (int)b;
    return ( a[index] );
)

BINARY_FUNC(scan_test, double, a, b,
    return ( a + b );
)

Listing 14. The user functions used in the time distribution measurements.

