
Institutionen för datavetenskap

Department of Computer and Information Science

Final thesis

Generating SkePU Code from

Automatically Detected Algorithmic

Patterns in C Source Programs

by

Marcus A Johansson

LIU-IDA/LITH-EX-G--16/003--SE

2016-04-17

Linköpings universitet

SE-581 83 Linköping, Sweden



Abstract

Modern heterogeneous multi-core architectures containing one or multiple GPU devices require expert knowledge in order to be fully utilized through parallelization by the programmer. Software written for one hardware setup might not easily be portable to work as efficiently on a differing architecture. Automatic parallelization of sequential C code to make efficient use of such architectures in an extensible manner would facilitate the porting of legacy code and provide a non-expert programmer with a tool granting access to modern hardware architectures.

We present an early prototype of such an extensible tool-chain and attempt to apply it on domain-specific C source code. It is based on a generic tool for hierarchical pattern matching in C source code, where the user can define their own patterns and recognition rules, and a code generation back-end. We show how it, combined with existing libraries, can be used to automatically port sequential legacy code to different multicore architectures, such as multicore CPUs and GPUs. Our tool is an attempt to do this and yields valid parallelized code, but fails to reach speedup for most implemented patterns. The tool is applied on one test case, a legacy ODE implementation in C, with similar results. A reason for the slowdown is discussed in the concluding section.


Acknowledgements

I would like to show my gratitude towards Erik Hanson (my supervisor) and Christoph Kessler (my examiner) for their comments and advice during this thesis. This work was partly funded by SeRC (project OpCoReS).


Contents

1 Introduction
  1.1 Background
    1.1.1 Parallel and GPU Computing
    1.1.2 PRT - Pattern Recognition Tool
    1.1.3 SkePU - Skeletons Processing Unit
  1.2 Method
  1.3 Thesis Outline

2 Tool Design
  2.1 Auxiliary Data Structures
  2.2 Extensions of PRT
    2.2.1 Data Consistency Management
  2.3 Auto-Tuning and Overhead Minimization
  2.4 Supported Wrapper Functions
    2.4.1 Level1 Patterns

3 Evaluation
  3.1 Wrapper Validation
  3.2 Speedup Measurements
  3.3 Case Study – ODE solver

4 Related Work

5 Conclusion
  5.1 Further Work

Bibliography

A Comprehensive Set of Speedup Graphs

B PRT System Architecture
  B.1 Main Program Flow

C Known Bugs
  C.1 Known bugs
    C.1.1 SkePU
    C.1.2 PRT

Chapter 1

Introduction

1.1 Background

In the domain of multi-core architectures, the task of writing optimized parallel code requires expert knowledge of each specific target architecture developed for. Parallelizing sequential legacy C code is desirable for the sake of making optimal use of modern hardware [19] [20]. In an attempt to solve this problem we introduce a merge of two tools developed at IDA, Linköping University: the algorithmic pattern matching tool PRT [19], and the skeleton-based automatically parallelizing C++ template library SkePU [4] [10] [6]. Our algorithmic pattern recognition tool is based on the earlier work on PARAMAT [11]; it deterministically matches algorithmic patterns in the source code and can then replace the corresponding legacy code with calls to wrapper functions implementing the matched patterns. These wrapper calls are then supposed to be implemented specifically for each target architecture the code is to be run on. Combining this tool with an automatically parallelizing template library by implementing the wrapper functions has resulted in a tool that automatically parallelizes sequential legacy C code for different target parallel platforms.

1.1.1 Parallel and GPU Computing

There are two main reasons for running a program on parallel cores: to lower the time it takes for a process to complete, and to reduce the hardware's energy consumption on completing a task. In theory, running a scalable task on two processors simultaneously will halve the wall-clock time of the process. Wall-clock time is the real time that has passed upon completion of the task, as opposed to the total number of ticks1. This is often carried out by having an operating system manage the process through multiple threads which can run on one core each. However, overhead such as thread creation and destruction and scheduling of core access is generated by the operating system when administering the multitasking of a process. Therefore, in general, for small problem sizes the overhead will be higher than the time gained by parallelizing a process, while for larger problem sizes this overhead becomes insignificantly small. The degree of parallelism, i.e. the amount of code/operations that can be parallelized, and the fact that certain parallelizations require generation of large data structures that imply a high communication overhead between the processor cores, also affect the speedup of a parallelized program. Measuring the reduction of wall-clock time is achieved by calculating the parallel speedup metric. Parallel speedup is defined as the ratio between the fastest known sequential version of a particular process being run on only one core of a system (sequential time) and the total wall-time of the process being run in parallel on the same system (parallel time) [9] [8].

1 A tick is an arbitrary time unit for measuring internal system time; for further details see the Linux manual pages http://man7.org/linux/man-pages/man2/times.2.html

On the hardware level, parallel computation is achieved by multicore processors, among which there generally exist two common types: the CPU (central processing unit), see Figure 1.1, and the GPU (graphics processing unit), see Figure 1.2. The CPU is used as the main processing unit in a computer and carries out operations at a relatively high frequency; it is thus known as a general purpose processing unit. Historically the GPU was built for graphics processing only, often using SIMD (single instruction multiple data) operations such as vector calculation, matrix calculation, pixel transformations and rendering. The GPU is therefore regarded as a special purpose processing unit; however, today the usage of GPUs has extended into the general purpose computing domain, yielding hardware known as GPGPU (general purpose graphics processing unit). While both a CPU and a GPU can contain multiple cores, these two types differ in their hardware specifications. A CPU generally has much fewer cores than a GPU but runs at higher frequencies, and is better at handling serial operations. A GPU has a vast number of cores that are designed with fewer registers than those of a CPU, and simpler ALUs. A GPU's cache memory and memory bandwidth are usually higher, and the cores are built for data-parallelism, which makes them highly superior to a CPU when it comes to running multiple similar and simple operations in parallel over different chunks of data. Examples of programs that would suit well to be run on GPU hardware over a CPU are thus problems which have regular memory access patterns and a high arithmetic intensity, such as dense linear algebra operations, fast Fourier transforms, or grid generation used in finite element analysis, et cetera.

This topic is also very much related to threading. Threading is a seemingly parallel execution of one or many computational tasks in the same process. For example, this can in essence be a software implementation of so-called context switching done by the operating system. By saving a thread's or process' current context (program counter, registers, etc.), the thread can be paused and another thread can enter a logical processor core to be executed. As this context switching is carried out far more often than a human user can perceive, the result is a seemingly parallel run of the two or more threads. While traditional threading is not carried out in parallel, a sophisticated enough operating system can make use of multiple processor cores and let several threads truly run in parallel. Threading can, however, affect the speedup of a computer program that has multiple internal threads that need to be synchronized. By making use of context switching, each waiting thread can have its time on a CPU core reduced to the benefit of the task waiting for it.

Figure 1.1: Block diagram of a general quad-core CPU.

If the process is parallelized on a GPU, which uses its own device memory, the PCIe bus and memory bandwidth will determine the transfer time to and from the GPU. This affects the parallel speedup, as sequential computation on a CPU usually does not involve hard-copying data. Parallel speedup is defined as:

Speedup = T_seq / T_par

where T_seq is the sequential wall-clock time and T_par is the parallel wall-clock time that the same algorithm takes. The same metric can be defined for the reduction of energy consumption by measuring sequential and parallel throughput per time unit, but that is beyond the scope of this thesis [9] [8].
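As a small worked example of this metric (the numbers are purely illustrative): a task that takes $T_{seq} = 12$ s sequentially and $T_{par} = 4$ s in parallel on the same system reaches a speedup of $12/4 = 3.0$.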

1.1.2 PRT - Pattern Recognition Tool

PRT [19] is an extensible tool written in Java that parses a C source program and detects algorithmic patterns. This is done by utilising the Cetus compiler framework [12] to search for matching computation patterns among the nodes in the program's abstract syntax tree. PRT can then replace these legacy algorithms with, preferably more efficient, target-specific code. Extending PRT with a specific target involves adding classes to the source code; thus PRT requires a recompilation. PRT is extensible in the sense that the patterns it recognises and the recognition rules are defined by the user in a hierarchical structure building on lower level patterns; the patterns and rules are defined in XML. New patterns and rules can thus be defined by extending the current pattern definitions. The node structure of this tree is then pattern matched against pattern descriptions given in the pattern definition XML file. PRT divides the current pattern descriptions into, at the moment, four levels:

• Trivial patterns, which denote access to basic operands such as scalar variables, array identifiers, integers and floats.

• Level0 patterns, which represent simple operations such as assignments and mathematical operations.

• Level1 patterns, which refer to the concept of looping over a Level0 pattern. Typically this involves mathematical vector operations. For example,

  for (i=0; i<N; ++i) m_vctr[i] = 2.2;

  is an assignment inside a for-loop.

• Level2 patterns, which refer to the occurrence of a Level1 pattern inside a loop, thus operating on two different array indices. Typically this involves matrix operations. For example, matrix initialization

  for (i=0; i<N; ++i)
    for (j=0; j<N; ++j) m_mtrx[i][j] = 2.2;

  can be expressed as a Level1 pattern (vector initialization) inside a for-loop.

Figure 1.2: Block diagram of Nvidia's Tesla (GPU) architecture.

The PRT source code is open for extension to add target architectures. Extending PRT with a target architecture is done through implementing the following available classes/interfaces:

ida.pelab.cgn.ICodeGenetator

cetus.hir.Annotation

ida.pelab.cng.TargetArchitecture

ida.pelab.cng.XMLSettingReader

During this project PRT has been extended with wrapper functions as source code output. These wrappers can then be implemented for different target architectures, in our case the wrappers have implemented the pattern algorithms in SkePU code. The enumeration ida.pelab.prt.constraints.RuleParameterType is used to indicate different kinds of elements in a pattern. Among these types are: Number, Identifier, Vector, Subscripted, nonSubscripted, String, Loopvariable, Invalid, and Condition.

1.1.3 SkePU - Skeletons Processing Unit

SkePU [4] [10] [6] is a C++ template library that provides a way to specify data- and task-parallel computations on multiple CPUs and multiple GPUs. The library is developed at IDA, Linköping University, mainly by Christoph Kessler, Johan Enmyren and Usman Dastgeer. SkePU allows the programmer to define parallel programming tasks in a sequential way by the use of skeletons parameterised with so-called user functions. SkePU supports multiple back-ends, to which the unsavvy programmer is in this way granted access. A user function is specified in a macro-based definition language for a function or task that is expanded by the preprocessor to target-specific code depending on which backend SkePU is set to. SkePU currently supports a single CPU through sequential code, multithreading on CPUs through OpenMP, and multi-GPU computations with CUDA or OpenCL.

The designation of which backend to use for which problem size can be handled by SkePU through execution plans. These plans can either be set manually by the programmer or be generated by the auto-tuner available with SkePU. The auto-tuner runs a given skeleton on a given user function with the available backends over a training span. This process is carried out offline and the results are stored in .dat files next to the binary in the filesystem. The tuner uses inter- and extrapolation to determine which backend to use for vector sizes that were not part of the training process. CUDA and OpenCL cannot be compiled into the same executable, and currently SkePU only supports auto-tuning for multiple GPUs using the CUDA backend. An example of the compilation of a SkePU program which uses auto-tuning is given:

nvcc -o a.out ./main.cu -I(PATH-TO-SKEPU-INCLUDES)

-DSKEPU_CUDA -DSKEPU_OPENMP -Xcompiler -fopenmp

Below is a short example of SkePU code calculating the PRT VAMUL pattern, without enabling auto-tuning or using a specified training set. The VAMUL pattern denotes a computation that, expressed in sequential C code, equals:

for (i=0; i<N; ++i) arr1[i] *= arr2[i];

Its implementation in SkePU is the following:

BINARY_FUNC(mult_f, float, a, b, return a*b;)

int main() {
    skepu::Vector<float> vec1;
    skepu::Vector<float> vec2;
    skepu::Map<mult_f> vamul(new mult_f());
    vamul(vec1, vec2, vec1);
}

OpenMP

OpenMP is a standardized API for C, C++ and Fortran whose main purpose is to provide support for parallel programming on shared-memory multiprocessors such as standard multicore CPUs. An inexperienced C programmer can, for example, by utilizing OpenMP's pragma directives parallelize parts of the code without much knowledge of parallel programming. As a simple example, a for-loop can be marked with the pragma directive #pragma omp parallel for; an OpenMP-supporting compiler can then detect the pragma and automatically parallelize the following for-loop section of the code. OpenMP consists of a collection of compiler directives, library routines and environment variables, and in this way portability is provided. The API is also supported by multiple compilers from various vendors, as it does not alter the grammar of the base language (C/C++) by adding language tokens; i.e. any standard C or C++ compiler not supporting OpenMP should be able to parse and compile OpenMP code as if it were sequential code. The OpenMP API extends the base language with worksharing, synchronization, and tasking constructs, et cetera [16].
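As a minimal, self-contained sketch of the directive described above (the array size and variable names are made up for the example), the single pragma below lets an OpenMP-capable compiler split the loop iterations over the available CPU threads, while a compiler without OpenMP support simply ignores it and produces the sequential loop:

#include <stdio.h>

#define N 1000000

static double a[N], b[N], result[N];

int main(void) {
    int i;

    /* Each iteration is independent, so the iterations can be divided
       among the threads of a multicore CPU. */
    #pragma omp parallel for
    for (i = 0; i < N; ++i)
        result[i] = a[i] + b[i];

    printf("result[0] = %f\n", result[0]);
    return 0;
}

Compiled with, for example, gcc -fopenmp (or nvcc -Xcompiler -fopenmp as used later in this thesis) the loop runs in parallel; without the flag it compiles and runs sequentially.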

CUDA - Compute Unified Device Architecture

CUDA is a software architecture that provides high-level access to parallel computing on an Nvidia GPU. CUDA requires GPU hardware from Nvidia to run on, such as Fermi. Whereas OpenMP programs are not compiled for a specific hardware with its own virtual instruction set, CUDA programs are. CUDA also alters the grammar of its base language C, which in turn requires the specific compiler nvcc provided by Nvidia. An inexperienced programmer can quite easily set up computational kernels and define GPU devices to which kernel definitions and data are sent, without having to care about the specific instruction set of a particular Nvidia GPU. However, making full use of the performance of the underlying GPU hardware usually requires expert knowledge of the programmer [9].
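For comparison, below is a minimal CUDA sketch (not taken from the thesis code base; error checking omitted) of the same kind of elementwise operation. It shows what the programmer writes when using CUDA directly: a kernel, explicit device memory management and a kernel launch, all compiled by nvcc:

__global__ void vadd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  /* one element per GPU thread */
    if (i < n)
        c[i] = a[i] + b[i];
}

void run_vadd(const float *a, const float *b, float *c, int n) {
    float *d_a, *d_b, *d_c;
    size_t bytes = n * sizeof(float);

    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, a, bytes, cudaMemcpyHostToDevice);  /* host-to-device transfer */
    cudaMemcpy(d_b, b, bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vadd<<<blocks, threads>>>(d_a, d_b, d_c, n);        /* launch on the GPU */

    cudaMemcpy(c, d_c, bytes, cudaMemcpyDeviceToHost);  /* device-to-host transfer */
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
}

The explicit cudaMemcpy calls correspond to the host-device transfer time that is included in the speedup measurements later in this thesis.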

SkePU supports OpenCL, CUDA and OpenMP as backends. A backend can be enabled by setting the compiler flag SKEPU_CUDA or SKEPU_OPENMP, or by adding such a define directive in the source file. However, auto-tuning is not yet supported for the OpenCL backend. SkePU thus provides the inexperienced programmer with ways to better utilize multicore-CPU or GPU hardware.

1.2 Method

Wrapper functions for source code generated by PRT have been implemented using the SkePU library. These wrapper functions have then been tested for validity and measured for speedup over a span of data sizes. In addition, a practical example that uses a subset of all implemented wrappers to solve a practical problem, obtained by automatic code transformation using PRT, has been tested for validity and had its speedup measured.

All tests have been carried out on server hardware consisting of two Intel Xeon E5520 processors with four cores each, running at a base frequency of 2.26 GHz and able to run eight hardware threads each. The total amount of RAM is 24 gigabytes.4 The server also contains two NVIDIA Tesla M2050 cards with one GPU each running at a core clock frequency of 1.15 GHz, each with 3 gigabytes of onboard memory.

The operating system running on the server is ArchLinux version 3.10.10-1-ARCH, and the CUDA version is release 5.0, V0.2.1221. This version of CUDA does not support operations on double; at compile time occurrences of double are demoted to float. The code has been compiled with the Nvidia C++ compiler nvcc of the same version. The size of an integer on this system is 4 bytes, a float 4 bytes and a double 8 bytes.

4 http://ark.intel.com/sv/products/40200/Intel-Xeon-Processor-E5520-8M-Cache-2_26-GHz-5_86-GTs-Intel-QPI

The largest size of an array of doubles that the main memory of this system can hold, when allocating three such arrays, then becomes (24 · 1024^3) / (3 · 8) ≈ 10^9 elements. In practice, test programs that evaluate operations running on three double-array operands are able to allocate arrays of size 10^8. The 6 gigabytes of total memory on the GPUs do not prevent such large arrays from being processed, but will lead to multiple data transfers.

1.3 Thesis Outline

The rest of the thesis is organized as follows.

In Chapter 2, the tool-chain is presented along with extensions to the initial code base.

Chapter 3 evaluates the wrapper functions’ implementations. Also a sequential ODE solver is parsed with the tool-chain and the result of its evaluation is presented.

Chapter 4 covers research and work related to the project.

Chapter 5 summarizes the project, presents the general conclusion and discusses further work.

Appendix A provides a complete set of speedup and overhead graphs.

Appendix B includes a general presentation of the system architecture of PRT. Appendix C holds a list of known bugs in SkePU and PRT.

Chapter 2

Tool Design

The main idea of the toolchain is to automatically detect parallelizable algorithmic patterns in legacy C source code and exchange these parts of the code with automatically parallelizing SkePU code versions. The toolchain consists of the extensible knowledge-based C source-to-source compiler PRT, the automatically parallelizing C++ template library SkePU, and a set of wrapper functions.

The SkePU template library provides one generic Vector and one Matrix container for parallel operations on one- and two-dimensional arrays. The internal data representation of SkePU::Vector<T> is a pointer-to-T (which by default is a dynamically allocated legacy C array, but could be used to wrap static arrays as well), while SkePU::Matrix<T> instead operates on a std::vector<T> container internally.

Patterns suitable for parallelization are thus Level1 and Level2 patterns. However, SkePU's Matrix container (SkePU::Matrix<T>) does not support direct manipulation of a linear C-array representation of matrix data. This leads to the need of hard-copying data from a legacy array to the internal std::vector<T> of the Matrix container upon parallelization of matrix data, yielding a high overhead. This problem has led to the wrapper functions for Level2 patterns not yet having been implemented.
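As a sketch of why the Vector case avoids this overhead (assuming the SkePU 1.x constructor form that is also used by the wrapper functions in Section 2.3, and the skepu/vector.h header path), a legacy C array can be wrapped without copying:

#include <skepu/vector.h>

/* Wrap an existing legacy C array in a SkePU Vector container without
   hard-copying it. The third constructor argument (false) is the form used
   by the wrappers in Section 2.3 and is assumed here to prevent SkePU from
   taking ownership of, and copying, the memory. */
void wrap_legacy_array(float *legacy, int size) {
    skepu::Vector<float> v(legacy, size, false);
    /* v can now be handed to SkePU skeletons; the data still lives in the
       legacy array, which is why Level1 patterns avoid the copying overhead
       that the Matrix container would incur. */
}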

There are two requirements that the wrapper functions need to satisfy:

(i) all algorithmic parameters must be conserved and sent to the wrapper function, and

(ii) as SkePU is compiled with a C++ compiler, the wrapper functions can be written in C++ but must be callable from a program written in C.

The first requirement is satisfied by using auxiliary data objects that serve to pass the algorithmic parameters to the wrapper functions. Providing a C function interface for each wrapper satisfies the second requirement.
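A sketch of how the second requirement can be met is shown below: the wrapper bodies are compiled as C++ (so that they can use SkePU), but their declarations are given C linkage so that the PRT-generated C code can call and link against them. The header name and the pointer-based parameter passing are illustrative only and do not claim to match PrtSkepuWrappers.cpp exactly:

/* PrtSkepuWrappers.h (illustrative) -- included both by the C++ wrapper
   implementation and by the parsed C program. */
#ifdef __cplusplus
extern "C" {
#endif

struct PrtRange;    /* defined in PrtDataAccess.h */
struct PrtVector;

/* Implemented in C++ using SkePU, but exported with C linkage so the
   PRT-generated C source can call it directly. */
void prt_vadd(const struct PrtRange *range,
              struct PrtVector *result,
              const struct PrtVector *a,
              const struct PrtVector *b);

#ifdef __cplusplus
}
#endif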

Figure 2.1: Overview of the toolchain

Figure 2.1 describes the workflow of compiling C source code with PRT and SkePU. A C source file is first compiled with PRT, producing a parsed version of the C source. In this step, Level1 patterns are found by rules matching XML-defined structures with the original source program's abstract syntax tree. The matched node patterns are exchanged with auxiliary object allocations, see the next section, and a wrapper function call corresponding to that pattern. The wrapper function takes the auxiliary data objects as arguments and implements a parallelized version of the matched algorithm using SkePU. As an example, the dotproduct operation will take two vector objects encapsulating the C array data, and one scalar object referring to the memory location holding the resulting data; these are then sent to the wrapper function as arguments and used in the SkePU algorithm implementation of dotproduct. The wrapper function definitions are provided in PrtSkepuWrappers.cpp. The parsed C source can then be compiled with the NVCC compiler by including the PrtSkepuWrappers wrapper function definitions and the SkePU template library. The compilation of the parsed version of the code is done using the following command:

nvcc -o a.out ./parsed.c PrtSkepuWrappers.cpp -I(path to PrtSkepuWrappers)

-DSKEPU_CUDA -DSKEPU_OPENMP -Xcompiler -fopenmp

Setting both the SKEPU_CUDA and the SKEPU_OPENMP flags makes the CUDA and OpenMP backends of SkePU available. These are the two backends used by the automatic tuner, see Section 2.3.

2.1 Auxiliary Data Structures

Level1 and Level2 patterns in PRT are characterized by simple assignment and mathematical operations occurring in one or two nested for-loops. In order to pass the parameters of a certain algorithm to its corresponding wrapper function, three auxiliary data structures are used: a loop range structure, a vector structure and a matrix structure. Each vector and matrix holds its own ranges to allow for array indexing such as a*i+b, where i is the iteration variable and a and b are constants.

Range and For-Loops

A for loop carries information on starting index, end index and the stride length of the iterations. A PrtRange object is supposed to carry this information on the iteration variable.

struct PrtRange {
    int lower_bound;
    int upper_bound;
    int step_size;

    PrtRange(int _lower_bound, int _upper_bound, int _step_size);
    PrtRange(const PrtRange & copy);
};

Both lower_bound and upper_bound are inclusive loop indices. step_size, the incremental step size, is allowed to be negative.
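As a small illustration of the mapping (the loop bounds are made up for the example), the iteration space of a strided loop is captured directly in the three fields:

int N = 100;                        /* example problem size */
/* Iteration space of:  for (i = 2; i <= N - 1; i += 3) ...  */
struct PrtRange rng(2, N - 1, 3);   /* lower_bound = 2, upper_bound = N-1 (inclusive), step_size = 3 */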


Vectors

An array or vector is identified by a pointer to the beginning of its data, its type information and the range over which the elements are to be accessed. This information is encapsulated in a PrtVector object, see Listing 1. The currently supported types through the field type are: 1 for int, 2 for float, 3 for double and 4 for long. The fields factor and is_divided provide a compact way, through implicit unary-minus and binary-division operators, to transmit information on whether the legacy array in the original source is to be preceded by a constant factor, and/or inverted when accessed.1 The PrtVector data type is also allowed to appear as the denominator in a division operation. This is used by PRT when matching the VINV pattern, for instance, and sets the is_divided field so that elementwise inversion can be enabled in the VINV SkePU operation.

struct PrtVector {
    void * base_pointer;
    int type;
    struct PrtRange access_range;
    double factor;
    int is_divided;

    PrtVector(void* _base_pointer, int _type, PrtRange _access_range);
    PrtVector& operator-();
    friend PrtVector& operator/(double _factor, PrtVector& vector);
};

PrtVector& PrtVector::operator-() {
    this->factor = -1;
    return *this;
}

PrtVector& operator/(double _factor, PrtVector& vector) {
    vector.factor = _factor;
    vector.is_divided = 1;
    return vector;
}

Listing 1: Definition of PrtVector.

A data structure for matrices has also been made available, but is omitted here as no wrapper functions have been developed for matrix operations. For the full source code, please see the source files PrtDataAccess.h and PrtDataAccess.cpp.

1 For example, the VADD pattern in PRT recognises both elementwise summation and difference between two array elements inside a for loop. This information needs to be fed through to the wrapper function in the tool-chain.
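To illustrate how the overloaded operators in Listing 1 are meant to be used (a sketch built directly on that listing; whether PRT emits exactly this expression is an assumption), a VINV-style statement result[i] = 5.0 / a[i] can be communicated by letting the constant divide the auxiliary vector object:

double a[100];
struct PrtVector vtr_a(a, 3, PrtRange(0, 99, 1));   /* type 3 = double */

/* The friend operator/ from Listing 1 stores the constant in factor and
   marks the vector as appearing in the denominator: */
PrtVector &inv = 5.0 / vtr_a;   /* inv.factor == 5.0, inv.is_divided == 1 */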

2.2 Extensions of PRT

This section depicts the alterations made to the PRT code base.

The file PatternDefinition.xml, which holds all pattern definitions, has been broken down into five files: one main XML file, and four files containing the patterns for a certain pattern level. This is to increase readability when extending PRT with new patterns.

PRT works by running through the abstract syntax tree multiple times; the first run is to redistribute loop structures in the IR tree, the second run is to perform bottom-up pattern matching. Upon a pattern match during the second run, PRT creates an ida.pelab.prt.pattern.MatchedInstance object which holds the parameters of the matched computation unit. These objects are then used in a third traversal of the IR tree where pattern annotation is performed. Depending on which target architecture is chosen, its respective code generation functionality must be implemented in the PRT source code. Therefore, a new package ida.pelab.cgn.ccode has been added that holds classes for generating wrapping target code. Upon pattern annotation, a target code implementation such as ida.pelab.cgn.ccode can replace pattern-matched nodes in the IR tree with string objects. The Ccode target architecture thus uses strings to replace pattern-matched IR nodes with code including wrapper function calls and auxiliary data objects. With each auxiliary data object created, an identifier (a serial number) is coupled, which is appended to the data object name in the output source to distinguish between created data objects.

2.2.1 Data Consistency Management

As a target architecture can imply data transfer to one or multiple computing devices, for example multiple GPUs, the data transfer between the host and a device may be time consuming. To address this issue PRT has been extended with an interprocedural analyser (IPA). The IPA, found in the module

ida.pelab.prt.annotation.DataConsistencyAnnotation

and its helper classes, traverses the IR tree after pattern annotation has been done. The idea is to add extra wrapper function calls (namely prt_get_data_to_host(void*) and prt_get_data_to_device(void*)) that take a pointer to the data in question, on the host, and update either the host or the device memory areas with the correct version of the data. The memory address on the host then works as an identifier for a particular set of data.

As an example, consider multiple consecutive pattern instances that operate on the same data. If a target architecture that needs to transfer data to a device is chosen, that data does not need to be copied back to the host until after the consecutive set of pattern-matched code. To ensure this, it is left up to the implementation of the pattern wrapper functions whether to always transfer data between host and device or not, and specific wrapper functions for doing so are provided and generated automatically by PRT.

The reason why the SkePU containers (vector and matrix) cannot handle this on their own is that there is no way for such an object to sense changes made to its data on the host by sequential parts (unmatched and unaltered by PRT) of the original code. Whenever such a change occurs, SkePU, or a hypothetical other target architecture implementing the wrapper functions, needs to be informed in some way. Whenever a SkePU vector or matrix container is constructed, it will hold its own flags telling whether its data is up-to-date on the host and on a device, respectively [6]. Calling a SkePU operation on such a container could, if the operation is carried out on a GPU device, lead to the data on the host not being updated (a flush would need to be called upon the container in question). SkePU is, in this manner, employing smart containers for the very reason of minimizing transfer times. However, as data declared and operated on in the sequential parts, i.e. the parts unmatched by PRT, also acts as data being operated on by SkePU, the smart container functionality no longer plays much of a role unless these sequential parts are rewritten using SkePU containers as well.

To address this issue, a hash table consisting of SkePU vector containers has been introduced. Taking the host memory address as hash key, this hash table is supposed to manage all container creation and destruction throughout the toolchain. In this way the container objects are always available to any wrapper function implementation, and data flushes can be performed on demand. By providing the functions prt_get_data_to_host(void*) and prt_get_data_to_device(void*), and by proper insertion of such function calls in the annotated code, the data transfer to and from a device can still be kept at a minimal rate. As the SkePU containers are constructed on demand, i.e. when a pointer to an array in host memory first appears in a pattern instance, PRT is also thought to append the IR with a call to prt_free_data(void*) at any point where data wrapped by a SkePU container is freed or goes out of scope. The functions prt_get_data_to_host(void*) and prt_free_data(void*) have been successfully implemented using SkePU, along with a fully functional SkePU vector hash table. However, as SkePU contains no means for updating all allocated devices, the function prt_get_data_to_device(void*) is not successfully implemented yet; also, PRT is as yet unable to make a complete set of such function call insertions in the IR to make this work on a C program. Therefore, no evaluation has been available nor carried out.
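A sketch of what the hash table and prt_get_data_to_host could look like is given below. It assumes, as stated above, that the SkePU vector container offers a flush-style operation for writing device data back to host memory; the map type, the restriction to float, and the function bodies are illustrative rather than the exact implementation:

#include <map>
#include <skepu/vector.h>

/* Host address of the legacy array -> SkePU container wrapping it
   (only the float case is shown). */
static std::map<void*, skepu::Vector<float>*> prt_container_table;

extern "C" void prt_get_data_to_host(void *host_ptr) {
    std::map<void*, skepu::Vector<float>*>::iterator it =
        prt_container_table.find(host_ptr);
    if (it != prt_container_table.end())
        it->second->flush();   /* assumed flush-style call: copy device data back to host */
}

extern "C" void prt_free_data(void *host_ptr) {
    std::map<void*, skepu::Vector<float>*>::iterator it =
        prt_container_table.find(host_ptr);
    if (it != prt_container_table.end()) {
        delete it->second;     /* the container wrapped, but did not own, the legacy data */
        prt_container_table.erase(it);
    }
}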

Technically, PRT does the insertion of the data consistency function calls by performing static analysis on the IR after pattern matching; this is done by traversing the FCG (function call graph). Upon a matched pattern instance/node, the data identifiers involved are added to a program-data-consistency-state object, in order to note that the identifier/data is being monitored. Such a program-data-consistency-state object consists of a set of identifier names coupled with flags for determining whether their data is up-to-date on the host and device respectively. When, during the traversal of the FCG, an already monitored identifier appears in a pattern and its up-to-date-on-device flag is negative, a prt_get_data_to_device(void*) call is appended to the IR tree before the matched pattern instance replacement code, taking the monitored identifier name as its only argument. The same procedure applies when a monitored identifier appears in a statement that will be run on the host and has its up-to-date-on-host flag set to negative: a prt_get_data_to_host(void*) call is appended to the IR tree before the statement node, taking the monitored identifier name as its only argument.

An example of how this system is supposed to work, as a part of a sequential program, is shown below.

int a[10], b[10], c, i;

// Initialization of a and b with values omitted

for (i=0; i<10; ++i) a[i] *= b[i]; // Will match the VAMUL pattern

a[1] += 50; // Sequential code part

c = 0;

for (i=0; i<10; ++i) c += a[i]; // Will be matched as the VSUM pattern,
                                // but that is not shown here

printf("c = %d", c);

This should, after proper application of a data consistency subsystem, be transformed into the following.

int a[10], b[10], c, i;

// Initialization of a and b with values omitted

// PRT auxiliary objects rng, vtr_a, vtr_b declaration omitted

prt_get_data_to_device(a);
prt_get_data_to_device(b);
prt_vamul(rng, &vtr_a, vtr_a, vtr_b);

prt_get_data_to_host(a);
a[1] += 50;

c = 0;
prt_get_data_to_device(a);
for (i=0; i<10; ++i) c += a[i]; // Will be matched as the VSUM pattern

prt_get_data_to_host(c);
printf("c = %d", c);

2.3 Auto-Tuning and Overhead Minimization

Each skeleton call to SkePU requires an execution plan to know which backend to use for the execution; see Section 1.1 for more details. To allow C source programs parsed with PRT to use the auto-tuning facility in SkePU, this must be provided by the wrapper functions. At the same time, the work performed upon calling a wrapper function, such as object creation and loading of execution plans, needs to be kept low to minimise function overhead. The design issues coupled with this problem are then:

• How can the number of tunings in a program be minimized? Excessive tuning results in unnecessary CPU time.

• How can the tuners be accessible through C-style wrapper calls? PRT cannot produce C++ code.

• How can the overhead of the wrapper functions be minimised over multiple function calls?

This is achieved by providing a helper object for each PRT pattern that the corresponding wrapper function(s) in turn call. The helper object is implemented as a singleton struct whose member fields include one skeleton instantiation for each of the data types supported by PRT (int, float and double)2, and one tuner object for each skeleton. Because the helper object is global, the member objects are not destroyed after completion of a wrapper call, thus reducing the overhead by eliminating object creation and destruction. Each tuner object is accompanied by a boolean flag that is used to limit running the tuner to one time only; this also reduces overhead. The typical structure for a wrapper function then involves creating skepu::Vector<> objects to wrap the legacy arrays passed as arguments inside PrtVector objects (this is later thought to be done by the yet incomplete data consistency hash table), and to send these SkePU container objects by reference to the accompanying helper object. In this way object creation is reduced to only occur at the first call to a wrapper function, and the wrapping SkePU containers do not hard-copy any data from the legacy arrays.

2 PRT supports long as well, but longs are not supported by CUDA and for that reason have been left out in this implementation.

As an illustrative example, the non-compilable code excerpt below serves to explain this implementation further.

struct prt_dotproduct__ { // the global helper singleton object

    int TUNED_I; // flag for tuner file loading

    skepu::MapReduce<mult_i, plus_i> operation_i;
    skepu::ExecPlan plans[MAX_EXEC_PLANS];
    skepu::Tuner<mult_i, MAPREDUCE, plus_i> tuner_i;

    // the constructor runs the tuner's constructor
    // as the constructor is only called once
    prt_dotproduct__() :
        TUNED_I(0),
        operation_i(new mult_i, new plus_i),
        tuner_i("mult_i_mapreduce_plus_i", 2, 1e3, 1e8)
    {}

    inline void operator()(skepu::Vector<int> &v1,
                           skepu::Vector<int> &v2, int &lhs) {
        if (TUNED_I == 0) {
            tuner_i(plans);
            operation_i.setExecPlan(plans);
            TUNED_I = 1;
        }
        // dotproduct performs accumulative add
        // since no pattern distinction for accumulative
        // / non-accumulative exists for PRT currently
        lhs += operation_i(v1, v2);
    }
};

void prt_dotproduct(const struct PrtRange &_range,
                    int * lhs,
                    const struct PrtVector & rhs1,
                    const struct PrtVector & rhs2,
                    const int & value) {

    int size = _range.upper_bound - _range.lower_bound + 1;
    if (!(size > 0)) return;

    int* rhs1ptr = (int*) rhs1.base_pointer + rhs1.access_range.lower_bound;
    int* rhs2ptr = (int*) rhs2.base_pointer + rhs2.access_range.lower_bound;

    // pointer arithmetics should be done by SkePU iterators instead
    skepu::Vector<int> v1(rhs1ptr, size, false);
    skepu::Vector<int> v2(rhs2ptr, size, false);

    // re-allocating existing SkePU vector objects
    prt_dotproduct__(v1, v2, *lhs);
}

2.4 Supported Wrapper Functions

The toolchain does not support PRT Level2 patterns, as the Matrix container of SkePU does not provide for direct memory manipulation of a C array. Below is a list of Level1 patterns with their corresponding wrapper function calls. All functions have been tested throughout the data size span (1e3, 1e8) for the data types int, float and double, but not long. As CUDA release 5.0 does not support double, it is automatically demoted to float. long, which is recognised by PRT, is not supported by CUDA as it yields unaligned memory access on the GPU.

Below are lists of the PRT patterns with corresponding implemented parallelised wrapper functions, and of the patterns that are not supported by the toolchain:

2.4.1 Level1 Patterns

In Table 2.1, rng is a PrtRange object containing the information given in the loop, i.e. lower bound, upper bound (N) and stride length. vtr_X is a PrtVector object corresponding to the array X. value and result are variables or arrays of type int, float or double. CONSTANT is a placeholder for an int, float or double constant.

Pattern Identifier | Pattern Example with Corresponding Wrapper Call

DOTPRODUCT
  result = initial;
  for (i=0; i<length; i++) result += a[i] * b[i];
  prt_dotproduct(rng, &result, vtr_a, vtr_b, result);

SV
  for (i=0; i<N; i++) result[i] = a[i] * value;
  prt_sv(rng, &vtr_result, vtr_a, value);

VAADD
  for (i=0; i<N; i++) result[i] += a[i];
  prt_vaadd(rng, &vtr_result, vtr_result, vtr_a);

VAADDSS
  for (i=0; i<N; ++i) result[i] += value1 * value2;
  prt_vaaddss(rng, &vtr_result, vtr_result, value1, value2);

VAADDSV
  for (i=0; i<N; i++) result[i] += a[i] * value;
  prt_vaaddsv(rng, &vtr_result, vtr_result, vtr_a, value);

VADD
  for (i=0; i<N; i++) result[i] = a[i] + b[i];
  prt_vadd(rng, &vtr_result, vtr_a, vtr_b);

VADDMUL
  for (i=0; i<N; i++) result[i] = a[i] + b[i] * c[i];
  prt_vaddmul(rng, &vtr_result, vtr_a, vtr_b, vtr_c);

VAINC
  for (i=0; i<N; i++) result[i] += value;
  prt_vainc(rng, &vtr_result, vtr_result, value);

VAMOD
  for (i=0; i<N; i++) result[i] = result[i] % a[i];
  prt_vamod(rng, &vtr_result, vtr_result, vtr_a);

VAMUL
  for (i=0; i<N; i++) result[i] *= a[i];
  prt_vamul(rng, &vtr_result, vtr_result, vtr_a);

VAND
  for (i=0; i<N; i++) result[i] = a[i] && b[i];
  prt_vand(rng, &vtr_result, vtr_a, vtr_b);

VASSIGN
  for (i=0; i<N; i++) result[i] = value;
  prt_vassign(rng, &vtr_result, value);

VCOPY
  for (i=0; i<N; i++) result[i] = a[i];
  prt_vcopy(rng, &result, prt_a);

VINC
  for (i=0; i<N; i++) result[i] = a[i] + value;
  prt_vinc(rng, &vtr_result, vtr_a, value);

VINIT
  for (i=0; i<N; ++i) result[i] = CONSTANT;
  prt_vinit(rng, &prt_result, CONSTANT);

VINV
  for (i=0; i<N; ++i) result[i] = CONSTANT / a[i];
  prt_vinv(rng, &vtr_result, vtr_a, CONSTANT);

VMOD
  for (i=0; i<N; i++) result[i] = a[i] % b[i];
  prt_vmod(rng, &vtr_result, vtr_a, vtr_b);

VMUL
  for (i=0; i<N; i++) result[i] = a[i] * b[i];
  prt_vmul(rng, &prt_result, prt_a, prt_b);

VMULMUL
  for (i=0; i<N; ++i) result[i] = a[i] * b[i] + c[i] * d[i];
  prt_vmulmul(rng, &prt_result, prt_a, prt_b, prt_c, prt_d);

VOR
  for (i=0; i<N; i++) result[i] = a[i] || b[i];
  prt_vor(rng, &vtr_result, vtr_a, vtr_b);

VSUM
  result = initial;
  for (i=0; i<N; i++) result += a[i];
  prt_vsum(rng, &vtr_result, vtr_a, initial);

VSHIFT
  for (i=0; i<N; i++) result[i] = result[i+c];
  prt_vshift(rng, &vtr_result, vtr_result, c);

Table 2.1: List of PRT patterns currently supported by the SkePU/PRT toolchain.

As a reference the set of PRT Level-1 patterns that have no wrapper function implementation yet is given below:

• FOLRHELP, FOLR, VSIN, VCOS, VTAN, VEXP, VLOG, VADDSV, VSWAP, VPROD, VMAXVAL, VMINVAL, VMAXLOC, VMINLOC, VMAXLOCM, VMINLOCM, VMAXVL, VMINVL.


Chapter 3

Evaluation

The evaluation of the tool-chain has involved checking correctness of implemented pattern operations, speedup and overhead measurements of wrapper functions, and speedup measurements of an example ODE solver case. All evaluations have been carried out on the same hardware, which consists of:

• two Intel Xeon E5520 64bit processors with 4 cores each, 8M cache, running at 2.26 GHz,

• two NVIDIA Tesla M2050 GPU cards with 3GB DDR5 memory (1.54 GHz) each, running at 1.15 GHz,

• 24 GB of main memory,

• running the operating system ArchLinux version 3.10.10-1-ARCH,

• SkePU version 1.2,

• and CUDA version release 5.0, V0.2.1221.

Time measurements have been carried out using MeterPU [13], a measuring tool for GPU-based systems developed at IDA, Linköping University. The define flag ENABLE_CUDA_TIME is used to enable MeterPU to measure wall-clock time on the GPU. All code has been compiled with the nvcc compiler of the same version as CUDA, with the flags:

nvcc -Xcompiler -fopenmp -O2 -DSKEPU_OPENMP -DSKEPU_CUDA -DENABLE_CUDA_TIME

3.1 Wrapper Validation

To test the validity of each wrapper implementation, an automated tester was developed. Over a span of data sizes, a specific wrapper was tested for validity against its respective sequential implementation, on the three data types int, float and double. The data size span, i.e. the number of elements in a sequential int/float/double vector, selected was [10^3, 10^4, ..., 10^8]. A data size lower than 10^3 is seldom worth parallelizing, and the underlying hardware (see the Method section in Chapter 1) supported the pattern operations known by PRT on arrays of at most size 10^8. As floating-point operations are seldom carried out in the same way on different hardware, it is often assumed that the results of operations carried out on a CPU versus a GPU may differ. Therefore, float and double operations are tested by requiring the relative error between the sequential and the parallel operation to be less than 0.1%. An exemplifying print-out from the validity testing program is given below.

Validity check for: PRT SV

vsize = 1e3: integer OK, float OK, double OK
vsize = 1e4: integer OK, float OK, double OK
vsize = 1e5: integer OK, float OK, double OK
vsize = 1e6: integer OK, float OK, double OK
vsize = 1e7: integer OK, float OK, double OK
vsize = 1e8: integer OK, float OK, double OK
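A sketch of the relative-error criterion used by the tester (the function names and the zero-reference handling are illustrative; the actual tester is not listed in the thesis):

#include <cmath>
#include <cstddef>

/* Accept a parallel (wrapper) result if its relative error against the
   sequential reference is below 0.1%, as described above. */
static bool values_match(double seq, double par) {
    if (seq == 0.0)
        return std::fabs(par) < 1e-9;             /* avoid division by zero */
    return std::fabs(par - seq) / std::fabs(seq) < 0.001;
}

static bool validate(const double *seq, const double *par, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        if (!values_match(seq[i], par[i]))
            return false;
    return true;
}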

All wrappers in Table 2.1 have had their validity rendered positive, except for some special cases listed below.

For floating-point operations with data size 10^7, PRT DOTPRODUCT generates a relative error of 1%, and for 10^8, 37%. PRT VADDMUL fails to run on data size 10^8 for all three data types. PRT VMULMUL generates a relative error of 50% for data size 10^8 and data type double. PRT VSUM generates relative errors higher than 0.1% for float operations with data sizes greater than or equal to 10^7.

3.2 Speedup Measurements

For each wrapper function the speedup has been measured by dividing the wall-clock time of the wrapper function's respective sequential version by the wall-clock time of the PRT/SkePU implementation:

Speedup = T_seq / T_par

Thus a speedup of 1.0 means that the parallel and the sequential version of an operation/algorithm consume the same time on a specific hardware for a certain data size. In this metric the data transfer time to and from a possible GPU device is included, since a system for data consistency analysis is not yet completed; please see Section 2.2.1 for more details.

In general, almost all implemented wrappers show a speedup of less than one over the whole training span ([10^3, 10^4, ..., 10^8]); Figure 3.1 shows this typical behaviour. This is thought to be due to the transfer time to and from the GPU for the higher data sizes (approximately sizes over 10^6). For the lower data sizes, the overhead of the wrapper functions and the auxiliary data objects is too large to allow for any significant speedup. The speedup and overhead measurements at the lowest data sizes (10^3 and 10^4) are likely to be unstable and show high variations in both speedup and overhead. This is due to the fact that the total wall-clock times for both the sequential and the parallel cases are almost of the same size as the pseudorandom differences in microseconds between different runs of the testing program. Data sizes lower than 10^3 produce unstable results over many trials, and the hardware (see Section 1.2) only supports data sizes up to 10^8.

The overhead is here defined as the share of the total wrapper wall-clock time that remains when the time SkePU takes to carry out the operation is subtracted, where the total wrapper wall-clock time is the wall-clock time for the wrapper function call plus possible creation of auxiliary data objects; better expressed as:

Overhead = 1.0 − T_operation / T_wrapper_call

Figure 3.1: Speedup and Overhead of the dotproduct operation. Both speedup and overhead include data transfer to and from a possible GPU in the cases where a GPU is used. The overhead is measured as all the extra instructions added by PRT in the wrapping of the SkePU operations; thus SkePU's operation is not included in the overhead.
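To make the two definitions concrete, below is a generic timing sketch; the thesis uses MeterPU for the actual measurements, and std::chrono is used here only for illustration:

#include <chrono>

/* Wall-clock time, in seconds, of an arbitrary piece of work. */
template <typename F>
double wall_seconds(F &&work) {
    auto t0 = std::chrono::steady_clock::now();
    work();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}

/* Speedup  = T_seq / T_par
   Overhead = 1.0 - T_operation / T_wrapper_call */
struct Measurement {
    double t_seq;           /* sequential version of the pattern            */
    double t_wrapper_call;  /* wrapper call incl. auxiliary object creation */
    double t_operation;     /* time spent inside the SkePU operation        */

    double speedup()  const { return t_seq / t_wrapper_call; }
    double overhead() const { return 1.0 - t_operation / t_wrapper_call; }
};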

A comprehensive set of measurement charts is available in Appendix A; only a few will be discussed in more detail here.

A positive trend can be seen in all of the wrapper functions' speedups, which is to be expected given previous work on SkePU, see for example [4]. As SkePU applies automatic tuning between the sequential CPU, OpenMP and CUDA backends, and at the moment does not provide any access to which backend is used at which data size, it is not possible to know which backend is generating which speedup. However, in general the backends are applied in the order CPU, OpenMP, and CUDA for higher and higher data sizes. Since only CUDA involves data transfer time between the host and a device, the higher data sizes are the ones involving such transfer times.

In Figure 3.2 we can see a sharp drop in speedup at data size 10^6. It is therefore reasonable to assume that the auto-tuner has selected CUDA for this operation at this data size. Hence, transfer time plays a large role in determining the speedup and is therefore highly unwanted.

Figure 3.2: Speedup and Overhead of the VADD operation. The overhead is measured as all the extra instructions added by PRT in the wrapping of the SkePU operations; thus SkePU's operation is not included in the overhead.

In Figure 3.3 we can see a steep rise for the integer operation at data size 10^7. This time it is more likely that the CPU backend has been chosen by SkePU up until data size 10^7 for the integer operation; since CUDA incurs transfer time, CUDA is an unlikely option here. The same phenomenon can be seen with the VINC pattern.

In Figure 3.4, a speedup higher than 1 is produced for data sizes larger than approximately 2·10^5. A plausible explanation is that VINV is the only operation over a one-dimensional array that operates on the operand's own elements at the same index, without the presence of an arbitrary scalar constant or the like.

3.3 Case Study – ODE solver

A legacy C ODE solver was parsed with PRT, generating wrapper calls to SkePU code. The ODE solver originates from the book Numerical Algorithms in C, see [18], and is examined in more detail below. The parsed code included wrapper calls to prt_vaddsv, prt_vshift, prt_dotproduct, and prt_vinit. The total wall time of running the solver over different numbers of variables in a sample ODE problem, i.e. the parallelized data size, was measured both for the sequential unparsed case and for the parsed, parallelized case. The upper limit of the data size was determined by the total amount of memory of the hardware. The resulting speedups can be seen in Figure 3.5. This case study only involves measurements made on float operations, as the original code operated on floats.


Figure 3.3: Speedup and Overhead of the VCOPY operation. The overhead is measured as all the extra instructions added by PRT in the wrapping of the SkePU operations; thus SkePU's operation is not included in the overhead.

The ODE problem that the solver was applied to is

\sum_{k=0}^{n} a_k \cdot y^{(k)} = r \cdot \exp(vx)    (3.1)

where a_k, r, and v are constants, and n is the order of the ODE. This can also be written in the form

a_n y^{(n)} + a_{n-1} y^{(n-1)} + ... + a_1 y^{(1)} + a_0 y = r \cdot e^{vx}    (3.2)

This ODE has a particular solution of the form y = A \cdot e^{vx}, which yields A = r / \sum_{k=0}^{n} a_k v^k.

Figure 3.4: Speedup and Overhead of the VINV operation. The overhead is measured as all the extra instructions added by PRT in the wrapping of the SkePU operations; thus SkePU's operation is not included in the overhead.

The reason for choosing such an ODE is that it is easy to extend by choosing an appropriate value of n, which in turn allows us to alter the parallelizable data size. The solver consists of three helper functions: rkdumb, rk4 (continued in 3.3) and derivs; these are listed below. The derivs function, see Listing 2, implements the actual ODE, and is used to take incremental steps over the function in the integration. derivs uses the global variable eqc to store the constants a_k, see the mathematical formulation above. The Runge-Kutta solver is applied by calling the rkdumb function, see Listing 4, with the initial y-value, the number of variables NVAR (i.e. n in the mathematical formulation), the x boundary values, and the number of steps. By altering the number of variables, NVAR, the parallelizable data size can be altered for testing different data sizes. The rk4 function, see Listing 3, is a helper function used by rkdumb. The code parts that have been parsed and replaced by PRT, and are thus running SkePU code, are described in the captions of the three successive listings.

As can be seen in Figure 3.5, the speedup never reaches above 1.0, i.e. the sequential case is always better. For the lower data sizes in the chart, SkePU is likely to choose either the CPU or the OpenMP backend. The reason for getting speedups lower than 1 in these cases is that the overhead of the wrapper functions is too high. In the upper parts of the chart, it is reasonable to believe that SkePU is choosing CUDA as backend, and here the transfer times to and from the device are likely to push the speedup below 1. In [5] an ODE solver implementation called LibSolve, built on SkePU, shows speedup above 1. The main difference in our application is that data flushes back to the host are made after each SkePU operation; this is not the intended way of using SkePU, as it supports smart containers, which are put out of action in our scenario. A note about Amdahl's law should be made at this point: if not enough parts of the original sequential program have been parallelized, then the amount of speedup that each pattern achieves makes less difference to the total speedup of the program. However, as can be seen from the pattern matching of our ODE solver, a major part of the program has been parallelized. This leads to the conclusion that the overhead added by the parallel implementation of our wrappers is too large to achieve speedup with this particular sequential ODE solver program.


Figure 3.5: Speedup of the ODE solver.

float rr, vv;
float *eqc;

void derivs(float x, float y[], float dydx[]) {
int i;
{
struct PrtRange rng_1021781237(1, (-1+NVAR), 1);
struct PrtVector vtr_2090630597(dydx, 2, PrtRange(1, (-1+NVAR), 1));
struct PrtVector vtr_1189291825(y, 2, PrtRange(2, NVAR, 1));
prt_vshift(rng_1021781237, &vtr_2090630597, vtr_1189291825, 1);
}
{
struct PrtRange rng_1928367471(1, (-1+NVAR), 1);
struct PrtVector vtr_723356838(eqc, 2, PrtRange(1, (-1+NVAR), 1));
struct PrtVector vtr_1092542236(y, 2, PrtRange(2, NVAR, 1));
prt_dotproduct(rng_1928367471, &dydx[NVAR], vtr_723356838, vtr_1092542236, 0);
}
dydx[NVAR]=(((rr*vv)*exp((vv*x)))-dydx[NVAR]);
dydx[NVAR]/=eqc[NVAR];
return ;
}

Listing 2: Listing of the derivs function. Both the for loops have been automatically replaced in the parallelized version with calls to prt_vshift and prt_dotproduct.


float *dym, *dyt, *yt;

void rk4(float y[], float dydx[], int n, float x, float h, float yout[]) {
int i;
float xh, hh, h6;
hh=(h*0.5);
h6=(h/6.0);
xh=(x+hh);
{
struct PrtRange rng_515742560(1, n, 1);

struct PrtVector vtr_1882786420(yt, 2, PrtRange(1, n, 1));

struct PrtVector vtr_499913613(y, 2, PrtRange(1, n, 1));

struct PrtVector vtr_1332617007(dydx, 2, PrtRange(1, n, 1));

prt_vaddsv(rng_515742560, &vtr_1882786420, vtr_499913613, vtr_1332617007, hh); }

/* First step. */

derivs(xh, yt, dyt);

/* Second step. */

{

struct PrtRange rng_841206628(1, n, 1);

struct PrtVector vtr_137400604(yt, 2, PrtRange(1, n, 1));

struct PrtVector vtr_1158986359(y, 2, PrtRange(1, n, 1));

struct PrtVector vtr_454930340(dyt, 2, PrtRange(1, n, 1));

prt_vaddsv(rng_841206628, &vtr_137400604, vtr_1158986359, vtr_454930340, hh); }

derivs(xh, yt, dym);

/* Third step. */

{

struct PrtRange rng_1274948595(1, n, 1);

struct PrtVector vtr_981356300(yt, 2, PrtRange(1, n, 1));

struct PrtVector vtr_176869961(y, 2, PrtRange(1, n, 1));
prt_vcopy(rng_1274948595, &vtr_981356300, vtr_176869961);
}

{

struct PrtRange rng_469206399(1, n, 1);

struct PrtVector vtr_2031402128(yt, 2, PrtRange(1, n, 1));

struct PrtVector vtr_934404771(dym, 2, PrtRange(1, n, 1));

prt_vaaddsv(rng_469206399, &vtr_2031402128, vtr_2031402128, vtr_934404771, h); }

{

struct PrtRange rng_391996209(1, n, 1);

struct PrtVector vtr_1050227781(dym, 2, PrtRange(1, n, 1));

struct PrtVector vtr_771474036(dyt, 2, PrtRange(1, n, 1));

prt_vaadd(rng_391996209, &vtr_1050227781, vtr_1050227781, vtr_771474036); }

derivs((x+h), yt, dyt);

Listing 3: Listing of the rk4 function. Includes calls to prt_vaddsv, prt_vcopy, prt_vaaddsv, prt_vaadd, prt_vadd.


/* Fourth step. */

/* Accumulate increments with proper weights */

{

struct PrtRange rng_1947939314(1, n, 1);

struct PrtVector vtr_1921861250(yout, 2, PrtRange(1, n, 1));

struct PrtVector vtr_2025207331(dydx, 2, PrtRange(1, n, 1));

struct PrtVector vtr_1827648495(dyt, 2, PrtRange(1, n, 1));

prt_vadd(rng_1947939314, &vtr_1921861250, vtr_2025207331, vtr_1827648495); }

{

struct PrtRange rng_627851883(1, n, 1);

struct PrtVector vtr_733244429(yout, 2, PrtRange(1, n, 1));

struct PrtVector vtr_47799046(dym, 2, PrtRange(1, n, 1));

prt_vaaddsv(rng_627851883, &vtr_733244429, vtr_733244429, vtr_47799046, 2.0);
}
for (i=1; i<=n; i ++ ) {
yout[i]*=h6;
}
{
struct PrtRange rng_1719439870(1, n, 1);

struct PrtVector vtr_322592618(yout, 2, PrtRange(1, n, 1));

struct PrtVector vtr_951406889(y, 2, PrtRange(1, n, 1));

prt_vaadd(rng_1719439870, &vtr_322592618, vtr_322592618, vtr_951406889); }

return ; }


float *v, *vout, *dv;

void rkdumb(float vstart[], int nvar, float x1, float x2, int nstep) {

int i, k;

float x, h;

/* Load starting values. */

{

struct PrtRange rng_936816937(1, nvar, 1);

struct PrtVector vtr_703415046(v, 2, PrtRange(1, nvar, 1));

struct PrtVector vtr_638798081(vstart, 2, PrtRange(1, nvar, 1));
prt_vcopy(rng_936816937, &vtr_703415046, vtr_638798081);
}
{
struct PrtRange rng_1879644002(1, nvar, 1);
struct PrtVector vtr_264347170(y, 2, PrtRange(1, nvar, 1));
struct PrtVector vtr_93856956(v, 2, PrtRange(1, nvar, 1));
prt_vcopy(rng_1879644002, &vtr_264347170, vtr_93856956);
}
xx[1]=x1;
x=x1;
h=((x2-x1)/nstep);
for (k=1; k<=nstep; k ++ ) {

/* Take nstep steps. */

derivs(x, v, dv);

rk4(v, dv, nvar, x, h, vout);

if ((((float)(x+h))==x)) {

nrerror("Step size too small in routine rkdumb"); }

x+=h;

xx[(k+1)]=x;

/* Store intermediate steps. */

{

struct PrtRange rng_1692294244(1, nvar, 1);

struct PrtVector vtr_528323352(v, 2, PrtRange(1, nvar, 1));

struct PrtVector vtr_314527853(vout, 2, PrtRange(1, nvar, 1));
prt_vcopy(rng_1692294244, &vtr_528323352, vtr_314527853);
}

{

struct PrtRange rng_1115938040(1, nvar, 1);

struct PrtVector vtr_584221347(y, 2, PrtRange(1, nvar, 1));

struct PrtVector vtr_285940258(v, 2, PrtRange(1, nvar, 1)); prt_vcopy(rng_1115938040, &vtr_584221347, vtr_285940258); }

}

return ; }


Chapter 4

Related Work

Over the years there have been several approaches to pattern recognition for automatic parallelization, mostly for Fortran programs. Examples include work done with the Polaris compiler [17] and the XARK compiler [1].

The Polaris project has focused strongly on extensible and robust transformation techniques for the internal representation. Apart from many commonly known compilation passes, the Polaris compiler utilizes a set of more advanced techniques, namely array privatization, data dependence testing, induction variable recognition, interprocedural analysis, and symbolic program analysis.

The XARK compiler works in three steps. First, a GSA (Gated Single Assignment) form is produced from the IR tree. Secondly, simple kernels (based on SCCs, Strongly Connected Components) are searched for by running pattern recognition on the GSA. Lastly, inter-dependencies among these simple kernels are recognised, producing a more sophisticated kernel graph with more complex and higher-level kernels [1].

Both of these compilers require recompilation of their respective source codes in order to extend the compiler with additional IR analysis features. Our tool, PRT, instead allows the user to add further patterns through its XML pattern definitions, without recompiling the tool itself. We refer to [11], [14] and [7] for more details and related work concerning algorithm recognition.

More recent work on pattern recognition of skeletons for parallel programming has been done at Eindhoven University of Technology by Nugteren and Corporaal [15] and by Custers [3]. Custers developed ASET, a program that can detect so-called algorithmic species, which to some degree correspond to our patterns. ASET is able to recognize 134 such manually constructed algorithmic species and to automatically extract memory requirements for parallelization. By utilizing the information on inter-species memory requirements, memory transfers can, for example, be performed in parallel with computations on a GPU device [3]. Nugteren and Corporaal introduce bones, a source-to-source compiler based on algorithmic skeletons [15]. The user annotates (C) source code with pragmas containing information that is used by bones to generate parallel target code for CUDA, OpenCL for CPU, and OpenCL for AMD GPUs. In contrast to that approach, our recognized patterns are replaced with function calls instead of being marked with pragmas. This has the advantage that any code can be wrapped inside the implementation of our function calls. Another feature of our tool is that it, in our view, offers more extensibility to the user.


In the ParaPhrase project, a tool based on the Eclipse IDE has been developed that can refactor C++ programs using the FastFlow skeleton library to utilize multicore architectures [2]. Even though the transformation rules are applied automatically, the user needs to select which of the possible refactorings to use. This semi-automatic (user-guided) parallelization approach gives scalable speedups of up to 21 times on a 24-core computer [2].


Chapter 5

Conclusion

In this thesis project a way to implement automatic parallelization of C source code through automatic pattern detection has been developed and tested. This has been carried out by collecting data access information related to algorithmic patterns suitable for parallelization, which is passed to pattern wrapper functions. The wrapper functions, in turn, have been implemented using SkePU, a C++ template library that uses skeletons to give the programmer access to OpenMP, CUDA and OpenCL functionality and thereby to multi-core and multi-GPU hardware. As SkePU can auto-tune skeleton implementations and switch between backends on a specific hardware configuration, these wrapper functions extend to any hardware that SkePU (i.e. OpenMP, OpenCL or CUDA) supports. To make full use of SkePU's smart containers, a functionality that automatically reduces the amount of data transfer between memory hosts when running code on a GPU device by keeping track of where data is up to date, a hash table for storing such containers was proposed. The hash table extends the tool-chain with the ability to perform data consistency analysis, that is, to capture how wrapped sequential data such as C arrays is used within matched pattern instances, and to provide the user with additional wrapper function calls for handling data transfer to and from a device when necessary.

We have shown that automatic parallelization of sequential C code can be achieved in an extensible way with our PRT/SkePU tool-chain. We have also presented a way to extend the use of SkePU's smart containers to wrapped static C arrays; however, this has not been successfully implemented during this project. In order to evaluate the potential speedup of the above-mentioned smart container hash table, SkePU needs to be extended with functionality to update all allocated GPU device memory held by a container. With the current version of our tool-chain no relevant speedup has been reached. This does not, however, mean that the approach must fail, since the full potential of the well-tested SkePU library cannot yet be exploited.

5.1 Further Work

As stated earlier in this report, the transfer time to and from a GPU will have a significant negative impact on speedup. A way to work around this problem has been introduced in this thesis, see Section 2.2.1. A natural continuation of this work would be to complete and test the static analyser in PRT, to implement two (or more) data consistency wrapper functions for controlling data flow to and from a device, and then to reimplement all wrapper functions to make use of the SkePU container hash table object instead.
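As an illustration only, the following sketch outlines one possible shape of such a consistency table and the two wrapper calls. All names (PrtContainerState, prt_sync_to_device, prt_sync_to_host) are hypothetical and are not part of the current PRT or SkePU code; the actual transfers would be delegated to the associated SkePU smart container.

#include <cstddef>
#include <unordered_map>

/* Hypothetical sketch of the proposed consistency bookkeeping: each wrapped
   C array is mapped to a state record; a real implementation would also store
   the associated SkePU smart container in this record. */
struct PrtContainerState {
    std::size_t length;      /* number of wrapped elements            */
    bool hostUpToDate;       /* the C array holds the latest data     */
    bool deviceUpToDate;     /* the device copy holds the latest data */
};

static std::unordered_map<void*, PrtContainerState> prt_container_table;

/* Hypothetical wrapper call inserted by PRT before a pattern wrapper runs on a
   device: transfer host data only if the host copy is newer than the device copy. */
void prt_sync_to_device(void* hostArray) {
    PrtContainerState& st = prt_container_table[hostArray];
    if (st.hostUpToDate && !st.deviceUpToDate) {
        /* ...copy hostArray into the associated smart container... */
        st.deviceUpToDate = true;
    }
}

/* Hypothetical counterpart inserted before sequential host code reads the array again. */
void prt_sync_to_host(void* hostArray) {
    PrtContainerState& st = prt_container_table[hostArray];
    if (st.deviceUpToDate && !st.hostUpToDate) {
        /* ...copy the smart container back into hostArray... */
        st.hostUpToDate = true;
    }
}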

To allow for detection and replacement of higher-level patterns, such as matrix operations and nested for-loops, SkePU would need to be redesigned slightly to avoid hard-copying data from and to host memory. The SkePU matrix container stores its internal data in a std::vector<> object, as opposed to a SkePU vector container, which stores its data in a legacy C array. This makes pointer assignment for redirecting a SkePU container inappropriate for matrix operations at the moment. One way of addressing this could be to introduce an Allocator class in the SkePU library that provides the programmer with a way to choose the underlying data storage structure. It should also be noted that SkePU currently cannot be compiled separately in different translation units that are later linked together; in this sense SkePU is not yet a fully functional template include library.
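As a rough illustration of this suggestion, and not of the actual SkePU design, a container parameterized on a storage policy could look as follows; all class names are hypothetical.

#include <cstddef>
#include <vector>

/* Hypothetical storage policies: one that owns its data (similar to the current
   std::vector-backed matrix) and one that merely wraps an existing C array. */
template <typename T>
struct OwningStorage {
    std::vector<T> data;
    explicit OwningStorage(std::size_t n) : data(n) {}
    T* ptr() { return data.data(); }
};

template <typename T>
struct WrappingStorage {
    T* data;
    std::size_t n;
    WrappingStorage(T* existing, std::size_t n) : data(existing), n(n) {}
    T* ptr() { return data; }
};

/* Hypothetical matrix container parameterized on its storage policy, so that a
   wrapped legacy array can be used without a hard copy to and from host memory. */
template <typename T, typename Storage = OwningStorage<T> >
class MatrixSketch {
    Storage storage_;
    std::size_t rows_, cols_;
public:
    MatrixSketch(Storage s, std::size_t rows, std::size_t cols)
        : storage_(s), rows_(rows), cols_(cols) {}
    T* raw() { return storage_.ptr(); }
};

A wrapper for a matched matrix pattern could then construct a container around the legacy array using the wrapping policy and pass it on to a skeleton call without copying the data.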


Bibliography

[1] M. Arenaz, J. Touriño, and R. Doallo. XARK: An extensible framework for automatic recognition of computational kernels. ACM Trans. Program. Lang. Syst., 30(6):32:1–32:56, Oct. 2008.

[2] C. Brown, V. Janjic, K. Hammond, H. Schoner, K. Idrees, and C. Glass. Agricultural reform: More efficient farming using advanced parallel refactoring tools. In Parallel, Distributed and Network-Based Processing (PDP), 2014 22nd Euromicro International Conference on, pages 36–43, Feb. 2014.

[3] P. Custers. Algorithmic species: Classifying program code for parallel computing. Master's thesis, Eindhoven University of Technology, Electronic Systems group, 2012.

[4] U. Dastgeer. Skeleton Programming for Heterogeneous GPU-based Systems. Licentiate thesis, Linköping University, Linköping, Sweden, 2011.

[5] U. Dastgeer. Performance-Aware Component Composition for GPU-based Systems. PhD thesis, Linköping University, Linköping, Sweden, 2014.

[6] U. Dastgeer and C. Kessler. Smart containers and skeleton programming for GPU-based systems. International Journal of Parallel Programming, Mar. 2015.

[7] B. Di Martino and C. Kessler. Two program comprehension tools for automatic parallelization. IEEE Concurrency, 8(1):37–47, Mar. 2000.

[8] J. Dongarra, I. Foster, G. Fox, W. Gropp, K. Kennedy, L. Torczon, and A. White. Sourcebook of Parallel Computing. Elsevier Science, San Francisco, USA, 1st edition, 2003.

[9] F. Gebali. Algorithms and Parallel Computing. John Wiley and Sons, Inc., New Jersey, USA, 1st edition, 2011.

[10] J. Enmyren and C. Kessler. SkePU: A multi-backend skeleton programming library for multi-GPU systems. In Proc. High-Level Parallel Programming and Applications, Baltimore, USA, Oct. 2010. ACM.

[11] C. W. Kessler. Pattern-driven automatic parallelization. Scientific Programming, 5:251–274, 1996.

[12] S.-I. Lee, T. A. Johnson, and R. Eigenmann. Cetus – An Extensible Compiler Infrastructure for Source-to-Source Transformation, pages 539–553. Springer Berlin Heidelberg, Oct. 2003.

[13] L. Li and C. Kessler. MeterPU: A generic measurement abstraction API enabling energy-tuned skeleton backend selection. In Proc. International Workshop on Reengineering for Parallelism in Heterogeneous Parallel Platforms (REPARA-2015) at ISPA-2015 (to appear), 2015.

[14] R. Metzger and Z. Wen. Automatic Algorithm Recognition and Replacement: A New Approach to Program Optimization. MIT Press, Cambridge, MA, USA, 2000.

[15] C. Nugteren and H. Corporaal. Introducing 'bones': A parallelizing source-to-source compiler based on algorithmic skeletons. In Proceedings of the 5th Annual Workshop on General Purpose Processing with Graphics Processing Units, GPGPU-5, pages 1–10, New York, NY, USA, 2012. ACM.

[16] OpenMP ARB. OpenMP API. http://www.openmp.org/mp-documents/openmp-4.5.pdf, 2015. [Online; accessed 23-November-2015].

[17] B. Pottenger and R. Eigenmann. Idiom recognition in the Polaris parallelizing compiler. In Proc. 9th Int. Conf. on Supercomputing, pages 444–448. ACM, July 1995.

[18] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. Numerical Recipes in C: The Art of Scientific Computing. Cambridge University Press, New York, USA, 2nd edition, 1992.

[19] A. S. Sarvestani. Automated recognition of algorithmic patterns in DSP programs. Master's thesis, Linköping University, Linköping, Sweden, Dec. 2011.

[20] A. S. Sarvestani, E. Hansson, and C. Kessler. Extensible recognition of algorithmic patterns in DSP programs for automatic parallelization. International Journal of Parallel Programming, 41:806–824, 2013.


Appendix A

Comprehensive Set of Speedup Graphs


Appendix B

PRT System Architecture

This Appendix is provided to the reader as a resource for further development of PRT.

Figure B.1: Overview of PRT

B.1 Main Program Flow

A list of the main function calls in one typical run of PRT follows.

1. ida.pelab.prt.patternrecognitionTool.PatternRecognitionTool.execute(String filename);
   PRT is invoked by calling this static function with a file path as argument. This function in turn calls the following:

2. cetus.exec.Driver.main(String[2] programArgs);
   This is the main entry point for Cetus. It takes two string arguments. The first is the driver to be used; here "-idaprt=" is sent. The second is the file path to the C source file to be compiled.

   Calling main() of Cetus with the first argument "-idaprt=" will result in Cetus calling this function. This function works on the IR produced by Cetus in modular steps. It has a member field cetus.hir.Program cSourceCode which will hold the IR tree after Cetus has built it from the input source code. This function performs the following function calls:

   (a) ida.pelab.prt.patterndefinition.PatternParser.parse();
       This function loads the patterns from the PatternDefinition.xml XML file and builds the Pattern Hierarchy Graph object, which is stored statically in the class ida.pelab.prt.patternhierarchy.PatternHierarchyGraph.

   (b) ida.pelab.prt.analysis.RecognitionAnalysis.Start();
       Start() initializes the member fields of RecognitionAnalysis.

   (c) ida.pelab.prt.loopdistributionTool.TopDownLoopDistributor.applyForLoopDistribution(Program cSourceCode);
       applyForLoopDistribution applies loop distribution on a certain subtree by generating new loops and extending the IR structure for this subtree. E.g. a for-loop including two different patterns can be divided into two sequential for-loops so that the patterns can be matched (see the example following this list).

   (d) prt.patternrecognitionTool.traverse(Program cSourceCode);
       traverse() recursively traverses the node given as argument in bottom-up post order and performs the pattern recognition. The recognition is done through the functions MatchingModule.applyVerticalRecognition and MatchingModule.applyHorizontalRecognition.

   (e) chooseTargetArchitechture();
       The target architecture is found from the XML file (defaultarchitecture) and the corresponding ida.pelab.cgn class is instantiated here and sent to the ida.pelab.prt.annotation.AnnotationModule class.

   (f) AnnotationModule.annotateSourceCode(Program cSourceCode);
       The AnnotationModule exchanges the matched parts of the IR tree in cSourceCode with code parts proper for the given target architecture.

   (g) cetus.hir.Program.print();
       This function is called on the cSourceCode object to print the new (annotated) source to an output file.
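To make the loop distribution in step (c) concrete, the following small example illustrates the transformation; the loop body and function names are hypothetical and chosen only so that the loop contains two different patterns.

/* Hypothetical input: a single loop containing two different patterns,
   a vector addition and a sum reduction. */
void before(float* a, const float* b, const float* c, float& sum, int n) {
    for (int i = 0; i < n; i++) {
        a[i] = b[i] + c[i];   /* vector addition pattern */
        sum += b[i];          /* reduction pattern       */
    }
}

/* After loop distribution: the loop is split into two sequential loops,
   so that each pattern can be matched and replaced on its own. */
void after(float* a, const float* b, const float* c, float& sum, int n) {
    for (int i = 0; i < n; i++)
        a[i] = b[i] + c[i];
    for (int i = 0; i < n; i++)
        sum += b[i];
}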
