Data-parallel Acceleration of PARSEC Black-Scholes Benchmark



Data-parallel acceleration of PARSEC Black-Scholes benchmarks

Patrik Hagernäs and August Andrén

Stockholm 2013

Parallel programming
School of Information and Communication Technology
Royal Institute of Technology (KTH)


Abstract

Programmers can no longer rely on processor improvements to speed up their applications in the same fashion as before. They usually have to parallelize their code to utilize the CPU cores in the system to gain a significant speedup. To accelerate parallel applications further, there are a couple of techniques available. One technique is to vectorize some of the parallel code. Another is to move parts of the parallel code to the GPGPU and utilize this highly capable multithreading unit of the system.

The main focus of this report is to accelerate the data-parallel workload Black-Scholes of the PARSEC benchmark suite. We compare three accelerations of this workload: using vector instructions in the CPU, using the GPGPU, and using a combination of them both.

The two fundamental aspects are the speedup and how much programming effort each technique requires. To accelerate with vectorization in the CPU we use the SSE & AVX instruction sets, and to accelerate the workload on the GPGPU we use OpenACC.


Contents

1 Introduction
  1.1 Background
  1.2 Problem description
    1.2.1 Problem
    1.2.2 Problem statement
  1.3 Purpose
  1.4 Goal
  1.5 Method
  1.6 Limitations

2 Theoretic background
  2.1 PARSEC
    2.1.1 PARSEC benchmark suite
    2.1.2 Black-Scholes
  2.2 Vector instructions
    2.2.1 SIMD - Single Instruction, Multiple Data
    2.2.2 SSE - Streaming SIMD Extensions
    2.2.3 AVX - Advanced Vector Extensions
  2.3 GPGPU - General Purpose GPU
    2.3.1 CUDA
    2.3.2 OpenACC

3 Methodology
  3.1 Vector instructions
  3.2 GPGPU
    3.2.1 OpenACC
    3.2.2 Larger input
  3.3 Combined vector instruction with GPGPU

4 Result
  4.1 Vector instructions
  4.2 GPGPU acceleration
  4.3 Compare GPU to vector instruction
  4.4 Combined GPU with vector instructions

5 Summary & Conclusion
  5.1 Conclusion
    5.1.1 Vectorization
    5.1.2 GPGPU - OpenACC
    5.1.3 Combined CPU and GPGPU
    5.1.4 General conclusions
  5.2 Summary
  5.3 Future research

6 Appendix
  6.1 GPGPU - OpenACC
    6.1.1 blackscholes-acc.c
  6.2 Vector instructions - SSE
    6.2.1 blackscholes-simd.c
  6.3 Combined - OpenACC & SSE


List of Figures

2.1 Single Instruction Single Data vs Single Instruction Multiple Data
2.2 YMM and XMM registers
2.3 Structure of a CUDA kernel
3.1 SSE and AVX functions used to test vector instructions
3.2 Definition set for SSE and AVX
3.3 Alignment for SSE and AVX. Note that the _MM_ALIGN16 is present in more than one place in the code.
4.1 Graph to establish how well vector instructions work with threads for SSE and different sizes of data. Data span: Large.
4.2 Graph to establish how well vector instructions work with threads for AVX and different sizes of data. Data span: Large.
4.3 Graph to establish how well vector instructions work with threads for SSE and different sizes of data. Data span: Small - Medium.
4.4 The ratio between how much time the application spends on transferring data and how much it actually spends on the algorithm.
4.5 SSE speedup & GPGPU speedup.
4.6 Execution time of the CPU & the GPU kernel.
4.7 Execution time of combined Black-Scholes & GPGPU accelerated Black-Scholes prior to and after the GPGPU has reached maximum capacity.
4.8 Execution time in seconds of all three versions of Black-Scholes used in the research.


List of Tables

2.1 PARSEC workloads
4.1 SSE
4.2 AVX
4.3 OpenACC - Zorn vs Single GPU system
5.1 Programming effort


1 | Introduction

This section introduces the background of the problem statement and the goals and purpose of the research. It also covers the tools that have been used and the delimitations of this research.

1.1 Background

Computer industries have for a very long time relied on processor improvements that led to increases in clock cycles per second. These processor improvements can be connected to the prediction known as Moore's Law. As early as 1965, Gordon Moore predicted that the number of transistors on a processor chip would double every twelfth month. It is now more commonly quoted as every eighteenth month or even every second year, but it is still considered accurate [3]. These processor improvements were a reality for developers, who could rely on them for their applications to gain speedup. This went on until processors reached a clock frequency limit that was hard to push further without running into serious thermal problems. To maintain speedup in processors, the industry set a new standard of using multicore processors.

Multicore processors as a standard did not come without a sacrifice. Developers could no longer simply wait for processors to improve and for clock rates to increase in order to achieve speedup. Software had to be parallelised to utilize the processor and achieve the desired speedup. This parallelization was not always considered an easy task. There are a lot of techniques to make a program run in parallel, and more are still being developed. These techniques are based on different ways to thread the program, divide the workload and utilize the hardware. Developers are constantly trying to adapt to these new techniques to achieve speedup.


1.2 Problem description

Many of today's parallel applications cannot be fully utilized because they are limited by the number of cores in the CPU. There are techniques which can be used to achieve a speedup without increasing the number of cores, by instead utilizing the advanced extended CPU registers. Speedup can also be achieved by moving parts of the workload from the CPU to the GPGPU (General Purpose Graphics Processing Unit).

1.2.1 Problem

The main focus in this study is to utilize the PARSEC benchmark and the Black-Scholes workload in a heterogeneous system, using both vector instructions in the CPU and moving part of the workload to the GPGPU, to achieve a speedup compared to the already parallelized code. The reason for using the Black-Scholes workload from the PARSEC benchmark suite is that it is already data-parallel code which can easily be vectorized and GPGPU accelerated.

There are already many comparisons between running data-parallel code on the CPU and running it on the GPGPU. Most of the time these comparisons are made between a non-vectorized data-parallel algorithm running on the CPU and an equivalent algorithm on the GPGPU; these conditions are not considered fair. In some cases the comparisons have been remade using vectorized code, and the results came out differently.

1.2.2 Problem statement

Which technique is most efficient for increasing the speed of a data-parallel benchmark application within the PARSEC suite? Vector instructions, accelerating it on the GPGPU, or a combination of both?

1.3 Purpose

The purpose of this study is to focus on two fundamental acceleration techniques: vectorization in the CPU and running code on the GPGPU. We apply these techniques to data-parallel applications from the PARSEC benchmark suite to determine the optimal solution for speedup and performance acceleration. This research also looks into how much programming effort each technique requires to create a relevant and functional program.


1.4 Goal

The goal of this study is to run the PARSEC benchmark Black-Scholes with both vectorization and GPGPU acceleration and get a speedup. Hopefully the result will help developers map out a general idea of when it is good to vectorize the code, use the GPGPU, or perhaps combine them both.

1.5 Method

This research is not about reinventing the wheel, and we will try, if possible, not to rewrite any code that has been written by people with greater knowledge in this area. We must however learn these techniques and understand the code we manage to find, or be able to write the code that we could not find. To help achieve the goals, we have access to a four-core Intel i7-3630QM processor with AVX/SSE support, a twelve-core AMD Opteron 6172 with SSE instructions, an Nvidia GTX 680 graphics card, and a cluster of graphics cards with Nvidia Tesla M2090, Nvidia GeForce GTX 580 and Nvidia Tesla C2050.

In order to apply these techniques we have read previous research reports, tutorials and manuals. Our research materials and manuals have been obtained mostly from the archives of the developers of these different techniques.

To obtain our results we have programmed in C using GPGPU and vector instruction APIs, with different compilers such as g++, ICC, PGI, nvcc and Visual Studio.

We started programming with smaller programs in each technique and then moved forward to more advanced programs. Eventually we applied the techniques to the desired application. By timing operations and estimating the effort needed to complete the task, we could determine our results in acceleration and programming effort.


1.6 Limitations

This report may not include fully optimized code, as determined through tests, due to our limited experience with these techniques. The performance tests may have been run at different levels of optimization due to the difficulty of applying the techniques.

The hardware architectures cannot really be compared equally, which means there is no guarantee that the results are completely fair.


2 | Theoretic background

This research is about accelerating applications using different acceleration techniques. The techniques used in this study are vector instructions and usage of the GPGPU. To limit the area of interest, the main focus was set on a few specific techniques within these areas. This section introduces both areas and the techniques chosen for this study. It also explains PARSEC (The Princeton Application Repository for Shared-Memory Computers) and compares the PARSEC benchmark suite against other benchmark suites.

2.1 PARSEC

PARSEC (The Princeton Application Repository for Shared-Memory Computers) is one of the more popular benchmark suites within parallel programming. PARSEC has earned this popularity because it provides a large and diverse repository of applications. These applications have been selected from several application domains and cover different areas of parallel programming. In the latest version of PARSEC (3.0) the benchmark suite contains thirteen different workloads. Every single one of these workloads has been parallelised using different techniques [2].

When learning about PARSEC, other benchmarks are often mentioned. The most common is SPLASH-2, a benchmark suite that also focuses on parallel applications. Naturally this has led to a lot of comparisons between PARSEC and SPLASH-2. There is no conclusion as to which of these benchmark suites is better. SPLASH-2 is an older benchmark suite, started in 1990, and its applications are outdated in some respects. Despite the younger and more diverse PARSEC benchmark suite, SPLASH-2 with its old programs is still useful depending on model and research area [1][2].


Program        Application Domain   Parallelization Model   Granularity   Working Set   Data Sharing   Data Exchange
blackscholes   Financial Analysis   data-parallel           coarse        small         low            low
bodytrack      Computer Vision      data-parallel           medium        medium        high           medium
canneal        Engineering          unstructured            fine          unbounded     high           high
dedup          Enterprise Storage   pipeline                medium        unbounded     high           high
facesim        Animation            data-parallel           coarse        large         low            medium
ferret         Similarity Search    pipeline                medium        unbounded     high           high
fluidanimate   Animation            data-parallel           fine          large         low            medium
freqmine       Data Mining          data-parallel           medium        unbounded     high           medium
raytrace       Rendering            data-parallel           medium        unbounded     high           low
streamcluster  Data Mining          data-parallel           medium        medium        low            medium
swaptions      Financial Analysis   data-parallel           coarse        medium        low            low
vips           Media Processing     data-parallel           coarse        medium        low            medium
x264           Media Processing     pipeline                coarse        medium        high           high

Table 2.1: PARSEC workloads

2.1.1 PARSEC benchmark suite

PARSEC contains thirteen applications, as previously mentioned; these can be seen in table 2.1. Each is an application in a specific area of interest. What makes these applications so valuable is the fact that they are all state-of-the-art applications within their area. Each workload is parallelised in multiple ways, which enables various benchmark studies [2][1]. Every workload is interesting, but this research focuses on Black-Scholes.

2.1.2 Black-Scholes

Black-Scholes is the workload in PARSEC that this study is trying to accelerate. It is an application that derives from a mathematical model of investment instruments for the financial market. The application is already data-parallel, which is very convenient since it does not need to be made data-parallel before accelerating it. The parallel part of Black-Scholes consists of only one for loop, which contains the Black-Scholes algorithm. Parallelizing it with OpenMP is therefore very simple, and the flow of the application consists of three basic parts: copy in the data, run the parallel for loop with the Black-Scholes algorithm, and copy out the result.
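As a minimal sketch of this three-part flow (our own reconstruction, assuming the standard no-dividend European option model; the real PARSEC source uses a function called BlkSchlsEqEuroNoDiv and an fptype typedef):

#include <math.h>
#include <omp.h>

/* Cumulative normal distribution, here via the C math library's erfc(). */
static float cndf(float d) {
    return 0.5f * erfcf(-d / sqrtf(2.0f));
}

/* Black-Scholes price of one European option (no dividends). */
static float black_scholes_one(float s, float x, float r, float v,
                               float t, int is_put) {
    float d1 = (logf(s / x) + (r + 0.5f * v * v) * t) / (v * sqrtf(t));
    float d2 = d1 - v * sqrtf(t);
    float call = s * cndf(d1) - x * expf(-r * t) * cndf(d2);
    /* put price via put-call parity: P = C - S + X * exp(-rT) */
    return is_put ? call - s + x * expf(-r * t) : call;
}

/* Part 2 of the flow: the single data-parallel loop over all options.
   Parts 1 and 3 (copying data in and results out) surround this call. */
void price_all_options(int numOptions, const float *sptprice,
                       const float *strike, const float *rate,
                       const float *volatility, const float *otime,
                       const int *otype, float *prices) {
    int i;
    #pragma omp parallel for private(i)
    for (i = 0; i < numOptions; i++)
        prices[i] = black_scholes_one(sptprice[i], strike[i], rate[i],
                                      volatility[i], otime[i], otype[i]);
}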

2.2 Vector instructions

This section introduces vector instructions as used by the CPU. To explain in more detail how they can be used, we introduce one architecture concept and two vector instruction techniques that we used in our research: SIMD (Single Instruction, Multiple Data) as the architecture, and SSE (Streaming SIMD Extensions) and AVX (Advanced Vector Extensions) as the techniques.

2.2.1 SIMD - Single Instruction, Multiple Data

SIMD is a CPU architecture concept which allows instructions to be vectorized: one CPU instruction works on multiple blocks of data simultaneously. For instance, if you need to perform an operation on all elements of an array, with SIMD you can work on chunks of data instead of one element at a time. On top of this, the number of cores in the system sets the limit for the level of thread parallelism.

Figure 2.1: SISD vs SIMD

SIMD is an architecture that has been around for some time now, and it is rooted in floating point calculations. Earlier processors were not capable of handling floating point, which meant that floating point calculations had to be done in a separate unit. Floating point calculations were in high demand, and it was not long before floating point capabilities were introduced into the processors. Along with the introduction of floating point calculation capabilities in the processors, new classifications were introduced; one of these was SIMD [4].


In order to use SIMD, the developer needs to invoke vector instructions so that the compiler converts the instructions correctly. A correct conversion can be verified in the assembler code, where one instruction operates on multiple data elements. Otherwise the compiler converts them to the more common SISD (Single Instruction, Single Data), where each instruction can only handle one set of data. Note that this can also be done in parallel, depending on the algorithm's parallelism and the number of available cores in the processor [6][4].

2.2.2 SSE - Streaming SIMD Extensions

SSE is one of the vector instruction sets that have been developed as the evolution of SIMD. SSE was not the first instruction set implemented on the SIMD architecture; before it, Intel had implemented an instruction set called MMX. The MMX instructions were later modified into what is now known as the first version of SSE. SSE has since been updated in several versions, with functions implemented based on user feedback. The most commonly used version is SSE3, which came with the second generation family of Intel processors [9].

In modern processors there are certain registers used for vector instructions such as SSE. These are commonly known as XMM registers (as can be seen in figure 2.2) and contain 128 bits. By using these registers instead of the 64- or 32-bit registers (depending on CPU architecture), you can fold two to four instructions into one single instruction if done properly [8].
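As a small illustration of folding four float additions into one instruction (our own example, not code from the thesis; PARSEC uses the _MM_ALIGN16 macro for the alignment shown here with a GCC attribute):

#include <xmmintrin.h>   /* SSE intrinsics */

float a[4] __attribute__((aligned(16)));  /* 16-byte aligned for XMM loads */
float b[4] __attribute__((aligned(16)));
float c[4] __attribute__((aligned(16)));

void add4(void) {
    __m128 va = _mm_load_ps(a);           /* load 4 floats into an XMM register */
    __m128 vb = _mm_load_ps(b);
    _mm_store_ps(c, _mm_add_ps(va, vb));  /* one addps computes all 4 sums */
}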

2.2.3 AVX - Advanced Vector Extensions

AVX is an advanced version of the previously mentioned SSE. This newer instruction set has been developed with an extended register architecture. As mentioned in the SSE section, the XMM registers are 128 bits in size. The new register architecture has 256-bit registers called YMM, where the lower 128 bits are the formerly known XMM registers (see figure 2.2) [5][6].
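The AVX counterpart of the addition example above works on a full 256-bit YMM register, eight floats at a time (again our own illustration):

#include <immintrin.h>   /* AVX intrinsics */

float x[8] __attribute__((aligned(32)));  /* 32-byte aligned for YMM loads */
float y[8] __attribute__((aligned(32)));
float z[8] __attribute__((aligned(32)));

void add8(void) {
    __m256 vx = _mm256_load_ps(x);             /* load 8 floats into a YMM register */
    __m256 vy = _mm256_load_ps(y);
    _mm256_store_ps(z, _mm256_add_ps(vx, vy)); /* one vaddps computes all 8 sums */
}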


Figure 2.2: YMM and XMM registers

2.3 GPGPU - General Purpose GPU

GPGPU is a term that was coined when it became advantageous to run parallel code on the GPU that did not have anything to do with graphics. The advantage of a GPU over an ordinary CPU is its multithreading capability: the GPU can multithread much better and with many more threads. The GPU cores also run at a lower frequency, generating less heat. This is what many parallel applications are looking for.

2.3.1 CUDA

CUDA is an environment for parallel programming. It comes with a full C++ compiler, and its purpose is to scale code to hundreds of cores and thousands of threads on a GPU. With CUDA you enable heterogeneous systems, meaning you can combine both the CPU and the GPU within the code. In detail, the main serial thread runs on the CPU, and the parallel parts of the code run on the GPU. These parallel parts are called kernels, and the GPU can run multiple kernels at the same time. Each kernel runs as a grid of blocks of threads, as shown in figure 2.3.


Figure 2.3: CUDA kernel. The grid contains three blocks, each block contains threads.

Communication between threads in parallel code is available within the same block. The communication is achieved with shared memory and synchronization. A thread knows its own global index in the kernel by multiplying its block index by the block dimension (the number of threads per block) and adding its thread index, as shown in the code below.

int idx = blockIdx.x * blockDim.x + threadIdx.x;

There are ways to optimize CUDA code and reduce overheads: memory coalescing, shared memory, cache-efficient texture memory accesses, loop unrolling, parallel reduction and page-locked memory allocation.
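As a minimal sketch of the kernel structure described above (our own illustrative example, not code from the thesis), a kernel that doubles every element of an array, launched over a grid of 128-thread blocks, looks like this:

__global__ void scale(float *data, int n) {
    /* global index: block index * threads per block + thread index */
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)            /* guard: the last block may overshoot n */
        data[idx] *= 2.0f;
}

void run(float *d_data, int n) {
    int threads = 128;
    int blocks = (n + threads - 1) / threads;  /* enough blocks to cover n */
    scale<<<blocks, threads>>>(d_data, n);
}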

2.3.2 OpenACC

OpenACC is a programming standard developed to make it easier to program in parallel on heterogeneous systems. Accelerating an application on the GPGPU with plain CUDA can be very painful if the programmer is new to these concepts. OpenACC can be used as a bridge between programming in parallel on the CPU and programming plain CUDA on the GPGPU. OpenACC can be compared to OpenMP: both use PRAGMA directives to execute code in parallel. However, OpenMP can only execute on the CPU, whereas OpenACC can execute on both the CPU and the GPGPU. The standard is developed by the companies Cray, CAPS, Nvidia and PGI. An OpenACC directive can look like this:

#pragma acc kernels loop independent

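A fuller version of this directive, reconstructed to match the description in the next paragraph (the array names and the MAX definition follow that description; the loop body is our own illustration), could look like this:

#define MAX 1000000          /* predefined input size */

float array[MAX], resultArray[10];
int i, j;

#pragma acc kernels loop independent copyin(array[0:MAX]) copy(resultArray[0:10])
for (i = 0; i < 10; i++) {
    /* each iteration reduces its own chunk of array,
       so the iterations are independent of each other */
    float s = 0.0f;
    for (j = i * (MAX / 10); j < (i + 1) * (MAX / 10); j++)
        s += array[j];
    resultArray[i] = s;
}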

The directive first defines that it is an independent loop that will be accelerated. The kernel needs the input array, which is copied in using the copyin() clause; the whole array is copied in, from zero to the predefined MAX definition. The resultArray is copied in before the kernel launch and copied out after the kernel has terminated, using the copy() clause; only the first ten values are copied.


3 | Methodology

This chapter documents the methodology of the study: vectorization, accelerating with the GPGPU, and using a combination of both. If the reader follows this chapter, the same results should be obtained if the exact same hardware architecture environment is used.

The source files and input files used in this study come from the standard PARSEC 3.0 package [11].

3.1 Vector instructions

In order to vectorize the PARSEC application Black-Scholes, we first had to learn how to work with vector instructions. PARSEC already included a vectorized version of Black-Scholes, but to understand it and adjust it to AVX we needed to know more about vector instructions, so we started from the beginning. We decided to figure out when and where it is possible to use vector instructions. First of all we wanted to know when we could apply vector instructions to the code, and then evaluate when we thought it was worth applying them. It was important for us to analyze and know what actually happened with the code. From that point we would evolve our code to a more advanced stage and later, with our new knowledge, work with Black-Scholes.

While researching vector instructions and looking at Black-Scholes, we narrowed down the critical sections in which we could utilize vector instructions. These critical sections contained loops and arrays. With this knowledge we could minimize the area of code we needed to look at. Knowing that we should focus on loops and arrays, we created a simple program using arrays and two kinds of loops: one with dependencies between iterations and one without. We predicted that the non-dependent loop was going to work, because if there were no dependencies there would be no restrictions on working with multiple instructions at the same time, so we were more interested in the dependent loop. As suspected, the non-dependent loop was vectorized, while the dependent one could not be vectorized.
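The two kinds of loops we tested can be sketched as follows (our own reconstruction of the test program, not the exact code):

/* Non-dependent loop: iterations touch disjoint elements,
   so the compiler can vectorize it. */
for (i = 0; i < N; i++)
    c[i] = a[i] + b[i];

/* Dependent loop: each iteration reads the result of the
   previous one, so it cannot be vectorized. */
for (i = 1; i < N; i++)
    a[i] = a[i - 1] + b[i];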

The way we were able to determine whether a loop could be vectorized was to use the Intel C++ Compiler (ICC). This compiler can automatically check the code and validate whether a loop can or cannot be vectorized. If a loop can be vectorized, it is automatically vectorized in the most optimal way the compiler is capable of. This is much like the gcc -O2 option, which enables automatic optimization. ICC also provides a report option that prints which loops have been vectorized. Using these features we could determine which parts of the code we could apply vector instructions to.

To analyze our auto-vectorized code further, we generated an assembler file during compilation to compare the non-vectorized and vectorized code, in order to see which instructions the compiler used and how it decided to vectorize the code. We also used GDB (the GNU Project Debugger) with split view options, so we could see the assembler code while debugging the program. We were able to find the instructions that used the XMM or YMM CPU registers, which is the signature of vector instructions. Learning from this analysis, we considered ourselves ready to apply vector instructions on our own.

To evolve our code we created one version for SSE and one for AVX, and timed the three methods against each other to see if we could manage to control vector instructions, and to compare the time difference between our results. We successfully produced three different parallel functions: using AVX, using SSE, and without vector instructions. At this point we felt that we could apply and understand vector instructions well enough to look into Black-Scholes.

Looking into the vectorized Black-Scholes code, we could follow the content and understand how it maps to the registers. To convert the SSE functions to AVX, we followed the already implemented definition set used to select whether the program should run with float or double instructions. We implemented a new definition set using the AVX 256-bit instructions.


Figure 3.1: SSE and AVX functions used to test vector instructions.

The AVX registers use 32 bytes, so we had to create our own definition set when using the AVX instructions, and changed it to the definition set shown in figure 3.2.
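Figure 3.2 itself is not reproduced in this text; a hypothetical sketch of such a definition set, switching between 16-byte SSE and 32-byte AVX variants, could look like this (names are illustrative, not the exact PARSEC definitions):

#ifdef USE_AVX
#include <immintrin.h>
#define VEC_ALIGN   __attribute__((aligned(32)))  /* 32 bytes for YMM */
#define VEC_WIDTH   8                             /* floats per register */
typedef __m256 vfloat;
#define VEC_ADD(a, b) _mm256_add_ps(a, b)
#else
#include <xmmintrin.h>
#define VEC_ALIGN   __attribute__((aligned(16)))  /* 16 bytes for XMM */
#define VEC_WIDTH   4
typedef __m128 vfloat;
#define VEC_ADD(a, b) _mm_add_ps(a, b)
#endif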

We also activated OpenMP to allow for parallel execution, and used the OpenMP time measurement functions. With this experiment we received both expected and unexpected results, which we present in the result chapter.

We ran a lot of different tests to gather the results we considered necessary to determine how well the vector instructions worked with parallelism, and how well the GPU managed with smaller data sets. To get the most fair comparison between vector instructions and the GPU, we ran the tests on our different hardware architectures to see which of them gave the most fair result.

3.2 GPGPU

The idea was to first learn about CUDA and OpenACC in order to implement Black-Scholes on the GPGPU. The OpenACC standard is very clear and easy to use, hence the first implementation was with OpenACC. The Portland Group compiler is used to accelerate the code on the device. A licence is needed for the compiler; this was obtained from their website (http://www.pgroup.com).

3.2.1 OpenACC

To run OpenACC on our test devices we used the PGI (The Portland Group) [10] compiler for C (pgcc). Since PGI is one of the founders of OpenACC, there is a lot of help and tutorials for this compiler, and its availability to us supported our choice to use it.

Changing the code is very simple, since there is an OpenMP version of Black-Scholes and OpenMP is very similar to OpenACC. The Black-Scholes algorithm is implemented in the function BlkSchlsEqEuroNoDiv, and the first thing that must be changed is to make this function inline. If we try to compile with functions that are not inlined, the compiler tells us that function calls are not supported. An inlined function is a function whose entire body can simply be put where the function was called.

Before: fptype BlkSchlsEqEuroNoDiv(
After:  inline fptype BlkSchlsEqEuroNoDiv(

This notifies the compiler that the complete body of the function can be transferred into the code where the function was called. The CNDF function also needs to be inlined in the Black-Scholes application. Now it is time to locate all the data that the device will need inside the function and copy it in using the copyin() clause. With a quick look at the function, we see that these arrays are used:

price, sptprice, strike, rate, volatility, otime, otype, prices

Looking after the function call, we see that only the prices array is needed afterwards, so we use the copy() clause for that array so that it gets copied out from the device after the kernel has terminated. The OpenMP Black-Scholes version is changed to run OpenACC by changing the PRAGMA directive.

OpenMP:

#pragma omp parallel for private(i, price, priceDelta)

OpenACC:

#pragma acc kernels loop independent \
        copyin(price[0:numOptions], sptprice[0:numOptions], strike[0:numOptions], \
               rate[0:numOptions], volatility[0:numOptions], otime[0:numOptions], \
               otype[0:numOptions]) \
        copy(prices[0:numOptions])

When we give the kernels loop attribute, the compiler will try to accelerate the loop below this line on the GPGPU. The independent keyword tells the compiler that the loop is independent: no iteration of the loop changes data used by another iteration.

Using the PGI accelerator compiler we set four flags:

-fast -ta=nvidia -Minfo=all,accel -Minline

These flags tell the compiler to accelerate the code, that the target is an Nvidia device, and to give us all information about the automatic acceleration during compilation. With the flag -Minline we tell the compiler to support inline functions; the additional -Mfcon flag used in the compile command makes it consider float constants as type float. The whole compilation process is in the appendix under OpenACC Compile & Run.


The altered code has a timing function for the total execution time. The PGI compiler has support for timings on the device, transfer time and kernel time; to use this, the environment variable PGI_ACC_TIME needs to be set to 1 (e.g. export PGI_ACC_TIME=1 in the shell).

The code can now be accelerated on the GPGPU with OpenACC; the whole code is attached in the appendix. We then ran tests with the given input files of the PARSEC Black-Scholes application, with input sizes of 4, 16, 4 000, 16 000, 64 000 and 10 000 000 rows. This gives us results that show speedup and the ratio between transporting data to the GPGPU and executing the algorithm on that data.

3.2.2 Larger input

After managing to run the two major programs, we ran tests to gather the information needed to evaluate our problem statement. After doing speedup tests on both the GPU and the CPU, we realised that our data was not sufficient to determine which optimization technique was optimal. We therefore changed the code slightly to be able to work with larger sets of data. We generated our own data set, which we ran up to 280 million rows instead of the largest given data set of 10 million rows from PARSEC. With these new data sets we managed to compare the GPU against the CPU successfully.

To change the application for this, we add a for loop that initializes the data array with values. It needs to be added before the for loop that fills the input arrays. The two for loops then look like this.

For CPU:

for (loopnum = 0; loopnum < numOptions; ++loopnum) {
    data[loopnum].s = 42.00;
    data[loopnum].strike = 40.00;
    data[loopnum].r = 0.1000;
    data[loopnum].divq = 0.00;
    data[loopnum].v = 0.20;
    data[loopnum].t = 0.50;
    data[loopnum].OptionType = 'C';
    data[loopnum].divs = 0.00;
}

For GPU:

for (i = 0; i < numOptions; i++) {
    otype[i] = (data[i].OptionType == 'P') ? 1 : 0;
    sptprice[i] = data[i].s;
    strike[i] = data[i].strike;
    rate[i] = data[i].r;
    volatility[i] = data[i].v;
    otime[i] = data[i].t;
}

This fills the input arrays with dummy values that will not give any meaningful prices, but it lets us measure the speedup gained beyond 10 million data rows.

3.3 Combined vector instruction with GPGPU

We created a test file based on the vectorized version of Black-Scholes. We looked at where the program called the calculation algorithm for the first time and divided that part into two methods: one calling the GPU and the other using the CPU. The major problem was getting these methods to run in parallel, meaning that the GPU and the CPU would execute at their fullest potential simultaneously. We used OpenMP to try to get this to work properly, but without success. It was not that we did not manage to run them in parallel, but rather that it was not the best solution for the way our program worked, given the data transfer compared to the execution time on the GPU.

Another problem we ran into was that we had a hard time determining how much data the GPU could actually handle. Despite different attempts at solving this, we did not manage to create a self-sufficient program able to determine the upper limit of the GPU. So we had to manually adjust the size that would be computed on the GPU and the CPU, or run twice on the GPU. In the code this can be seen as the MAX_SPAN variable, which is the number of data rows transferred to the GPU. We found that the max capacity for the GPU in our case was 198 million data rows. Beyond that point it became interesting to see what we could achieve by combining the GPU with the CPU.
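A sketch of how we arranged the split (the wrapper function names are hypothetical; MAX_SPAN is the hand-tuned GPU capacity mentioned above):

#define MAX_SPAN 198000000L  /* measured GPU capacity, in data rows */

/* Send the first MAX_SPAN rows to the OpenACC (GPU) path and leave any
   remainder to the vectorized (SSE) CPU path. run_gpu_part() and
   run_cpu_part() stand in for our two Black-Scholes variants. */
long gpuRows = (numOptions > MAX_SPAN) ? MAX_SPAN : (long)numOptions;

run_gpu_part(0, gpuRows);
if (numOptions > gpuRows)
    run_cpu_part(gpuRows, numOptions);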


We moved the program over to our GPU cluster Zorn, which was capable of running both vector instructions (SSE4.2) and GPU code properly. In order to compile with the PGI compiler, a few adjustments had to be made to the vectorized code. The major change was that we could no longer align data on 16 bytes, because the compiler did not allow it. Even though we had to remove the _MM_ALIGN16 definition from the code, the PGI compiler helps out with the vectorization, which meant that we could utilize the SSE part of Black-Scholes anyway.


4 | Result

In this chapter all the results we have gathered are presented; note that we will not draw any conclusions in this chapter. Conclusions are presented in the next chapter. We have divided the results into four parts: vector instructions, GPGPU acceleration, the comparison between GPU and SSE, and the combination.

All the tests that generated the results below were executed at least ten times, and we selected the median to minimize the risk of gathering results affected by hardware issues or other unforeseen complications. We also booked and ran the tests at different times of day and validated the results, to make sure we utilized the CPU and GPU to their fullest potential.

4.1 Vector instructions

We started by comparing our two CPUs: the third generation Intel i7 quad core using hyper-threading, and the AMD Opteron 6172 with 48 cores available. The following results were gathered running with the maximum number of threads on both CPUs and the SSE vector instructions:

Hardware Architecture   Workload (data rows)   Execution time (maximum number of cores)
Intel i7 4 cores        4                      0.012
AMD Opteron 6172        4                      0.031
Intel i7 4 cores        16                     0.227
AMD Opteron 6172        16                     0.324
Intel i7 4 cores        4096                   0.549
AMD Opteron 6172        4096                   2.35
Intel i7 4 cores        16384                  1.32
AMD Opteron 6172        16384                  3.512
Intel i7 4 cores        65536                  1.994
AMD Opteron 6172        65536                  8.005
Intel i7 4 cores        10000000               299.966
AMD Opteron 6172        10000000               427.598

Table 4.1: SSE


The AMD Opteron processor does not support the AVX vector instructions, so only the Intel i7 processor was able to run the following tests, which were also done using the maximum number of threads. Note that the data set containing four rows could not be run with AVX instructions because it was too small:

Hardware Architecture   Workload (data rows)   Execution time (maximum number of cores)
Intel i7 4 cores        4                      -
Intel i7 4 cores        16                     0.204
Intel i7 4 cores        4096                   0.583
Intel i7 4 cores        16384                  1.083
Intel i7 4 cores        65536                  2.071
Intel i7 4 cores        10000000               316.59

Table 4.2: AVX

Gathering the data, we could see that the Intel i7 was faster, so we made a few tests to determine which of the data sets would be interesting to look at in a speedup scenario. The only interesting data set was the one with 10 million rows. We generated the following graphs containing execution time for one, two, four and eight cores, using both AVX and SSE, to see if there is a similar pattern.


Figure 4.1: Graph to establish how well vector instructions work with threads for SSE and different sizes of data. Data span: Large.

Figure 4.2: Graph to establish how well vector instructions work with threads for AVX and different sizes of data. Data span: Large.

These tests gave us a view of how well vector instructions can work with parallelism. Using eight threads generated the following speedup:

Speedup = Execution time(1 thread) / Execution time(n threads)

Speedup SSE = 1453.404 / 299.966 = 4.9x
Speedup AVX = 1745.111 / 316.59 = 5.5x

Even though there is a good speedup from sequential execution to eight threads, there is potential for even greater speedup. Going back to the AMD Opteron processor and its 48 cores, we ran new tests to determine how much speedup we could obtain using many more cores, in order to see if there is a limit to the speedup with vector instructions. The result can be viewed in figure 4.3 below.


Figure 4.3: Graph to establish how well vector instructions work with threads for SSE and different sizes of data. Data span: Small - Medium.

After only reaching a speedup of 21x on the AMD processor, we decided to run the tests on Zorn to see if this was a hardware issue, and also to determine which of our hardware would be the most optimal for our comparison. The best speedup we could manage to get was


4.2 GPGPU acceleration

The following data compares our GPGPU cluster Zorn to a single GTX 680; this result may help explain the conclusions about the long initiation time on Zorn.

Hardware             Workload   Transfer time   Alg time   Total time
NVIDIA GTX 680       4          0.05            0.011      117.536
NVIDIA Tesla M2090   4          20.454          0.029      465.208
NVIDIA GTX 680       16         0.048           0.011      116.026
NVIDIA Tesla M2090   16         20.457          0.029      454.325
NVIDIA GTX 680       4096       0.074           0.014      109.621
NVIDIA Tesla M2090   4096       20.689          0.03       459.028
NVIDIA GTX 680       16384      0.141           0.033      108.722
NVIDIA Tesla M2090   16384      12.115          0.035      499.26
NVIDIA GTX 680       65536      0.404           0.11       96.729
NVIDIA Tesla M2090   65536      12.924          0.083      401.532
NVIDIA GTX 680       10000000   54.336          15.331     229.408
NVIDIA Tesla M2090   10000000   108.42          9.198      517.582

Table 4.3: OpenACC - Zorn vs Single GPU system

In table 4.3 we see both the transfer time and the algorithm execution time. We can also see that the total time does not add up with the other results; the reason the total time is so much higher is that each GPU has an initiation time. The initiation time can be determined by subtracting the algorithm execution time and the transfer time from the total time. As we can see, moving the data to the GPU is very time consuming. When running Black-Scholes with 64 000 input rows we see the performance output below, using the environment variable PGI_ACC_TIME=1. It shows how much time the acceleration spends on executing the kernel and how much on moving data.

kernels=87ms data=12678ms

This tells us that the program spent 145 times more time on moving data than on the algorithm itself. The power of the GPU first shows when there are many more input rows: when running Black-Scholes with as much as 10 million input rows, the corresponding output shows that the execution spends only about 11 times more on data transfer than on algorithm execution. When we do runs larger than 10 million, we see that this ratio stagnates at around 10 times more on data transfer than on algorithm execution time.

Figure 4.4: The ratio between how much time the application spends on transferring data and how much it actually spends on the algorithm.

From figure 4.4 we see that the time spent on moving data is overwhelming until the workload is close to 10 million input rows. Then the ratio lies steady at around ten times more on moving data than on the execution.

4.3 Compare GPU to vector instruction

In order to answer our problem statement we have to compare these different optimization techniques against each other. We decided to look at the execution time over a span of data sets. We chose the data span of 10 million to 100 million data rows and compared SSE vs OpenACC, CPU vs GPU.


Figure 4.5: SSE speedup & GPGPU speedup.

We also decided to show how powerful the GPU kernel is when executing the algorithm, and how much time is actually spent transferring the data to the GPU. Using a log10 graph of the execution times in milliseconds, where we compare the CPU and the GPU kernel, we can view the difference in execution time (figure 4.6).


4.4 Combined GPU with vector instructions

After comparing the CPU and the GPU, we combined the two into one solution to determine at which point we benefit from using the CPU and GPU together. As mentioned before, we found the point where the memory of our GPU reached maximum capacity (198 million data rows) and ran tests around that area, measuring from 180 million to 280 million data rows.

Figure 4.7: Execution time of combined Black-Scholes & GPGPU accelerated Black-Scholes, prior to and after the GPGPU has reached maximum capacity.

In figure 4.7 we have used the most optimal time for the combined version prior to the GPU memory allocation cap. During our tests, the combined version was most optimal using only the GPU before the cap.

We also wanted to view the entire scope we have been working on with all the different techniques. We plotted the following graph (figure 4.8) to show the difference between the three techniques in terms of execution time. We chose the data span from 10 million to 280 million rows to really show the difference in execution time and how these techniques compare to each other.


Figure 4.8: Execution time in seconds of all three versions of Black-Scholes used in the research.

From this graph we can actually see that we managed to increase the speed from the start as well, by adjusting the SSE version to run on the PGI compiler. We can also see how the SSE-only Black-Scholes version becomes less effective than the GPU version when working with such big data sets.


5 | Summary & Conclusion

This chapter includes explanations of and ideas about the results in this thesis. It also includes a summary of the whole thesis.

5.1 Conclusion

Within this section we discuss and evaluate the gathered results. We talk about both the speedup and the programming effort in the two areas, vectorization and GPGPU acceleration.

5.1.1 Vectorization

This part focuses on the results of the vectorized version of Black-Scholes: how well the speed of vectorized Black-Scholes increases with parallelism, and how much programming effort is needed to compile and run a vectorized Black-Scholes.

Speedup

Our first speedup result was to look at AVX and SSE instructions to compare which one was faster. In terms of speedup we managed to obtain a 5.5 times speedup using the AVX instructions, while SSE only reached a 4.9 times speedup. Still, SSE managed to perform better in absolute time when doing our tests on our only AVX-capable CPU, which had only four cores. We started looking for a reason why this occurred, because AVX is generally said to be faster than SSE. We could not determine for sure why this was happening, but we found out what other developers had answered when asked about the same problem. The first interesting reason was pointed out by a developer at Intel, who said that SSE and AVX have the same upper limit for store performance: the amount of data that can be loaded and stored simultaneously is equal. The second interesting point we found was that two move operations are needed for the wider AVX registers. This gave us the realisation that we might have reached the upper limit of store performance, and with slower instructions for moving data it seemed reasonable that SSE managed to be faster. However, since AVX managed to reach a higher speedup, we cannot be sure that it would not exceed or reach the same speed limit as SSE with more cores applied; unfortunately we did not have the hardware to test this.

When looking only at SSE in terms of speedup, we were a bit disappointed when it came to utilizing the parallelism. We managed to get speedups, but not in the fashion we expected. Running the vectorized version on our AMD Opteron 6172, we only managed to get a 21 times speedup (see figure 4.3), while the normal version got a 47 times speedup at most. The major concern here was that when applying more than 32 cores, we got a speed decrease. This made us wonder if there was some cap on the registers, or an increased amount of cache misses when applying more cores while working with the vector instructions. We were not able to determine why this happened; we just had to accept the fact that we did not manage to get the speedup we expected. Looking at the speedup generated over the different hardware, we can conclude that our solution does not scale all too well with parallelism. This is not proof that SSE in general is not optimal with parallelism, but in our case it was not.

Programming effort

In terms of programming effort, we managed to avoid a lot of work with the vector instruction version of Black-Scholes since, as mentioned earlier, PARSEC supplied us with an already vectorized version. Looking through the vectorized version and the normal Black-Scholes, it contains a lot of changes, and it would probably have taken a very long time to implement this ourselves and make it as optimal as they have done. Even though it is not at all impossible to recreate a fully functional vectorized version, the effort of vectorizing Black-Scholes ourselves would have been demanding, while implementing the AVX support in the already vectorized version took very little effort. That being said, if you have already vectorized a program for SSE, the transition to AVX is almost effortless, whereas converting a non-vectorized program into a vectorized one would be difficult and take time if you have no earlier experience of doing so. You really have to understand the code in its entirety to be able to apply the correct functions for the most optimal use.

(39)

5.1.2 GPGPU - OpenACC

OpenACC is a good technique for accelerating code on the GPGPU. For this thesis we chose between OpenCL, CUDA and OpenACC. All three were tried out, but the main research used OpenACC because of the minimal problems it produced when programming and compiling the code.

Speedup

In this part the main discussion is about the speedup and timings when running Black-Scholes with OpenACC on a GPU cluster. The big issue when accelerating on the GPGPU is the transfer time. This issue cannot be minimized in any way; the results tell us that with the best data optimization possible and the largest workload, the data movement takes about ten times longer than the execution of the algorithm. This conclusion led us not to spend more time trying to optimize the application further. Where we could have done some optimization is in the algorithm part of the application, by mapping the OpenACC code better to CUDA code by specifying the number of grids and blocks to use. Since the data transfer is optimized as much as possible, and the data transfer time is so much larger than the algorithm execution time, that optimization would not gain the study very much.

The speedup we get when we compare running Black-Scholes on the CPU without vector instructions to accelerating it on the GPU is what we had expected. With small inputs there is no speedup at all; it is actually slower to run on the GPU, since the initiation time and transfer time are longer than the whole execution time on the CPU.

Programming effort

When accelerating code on the GPGPU there are some important concepts to understand, which is somewhat time consuming. The biggest thing to understand is the architecture of a GPGPU, with its grids, blocks and threads. This research used OpenACC, where the compiler maps our algorithm to CUDA code. If the acceleration does not need to be optimized any further on the GPGPU, knowledge about the GPGPU architecture is not necessary. Since it is so easy to translate an OpenMP application to OpenACC, and this research was lucky enough to start off with a good OpenMP version of the Black-Scholes application, the programming effort was very small. The main restrictions to learn about are that functions have to be able to be inlined, and the lack of support for ambiguous pointers. Learning this does not need to take more than one or two days.

To accelerate with plain CUDA or OpenCL, more knowledge is required. From our perspective those techniques are far more complicated.

5.1.3 Combined CPU and GPGPU

In this section we discuss the results we received when running our combined version of Black-Scholes.

Speedup

We got some interesting results when combining the CPU with the GPU in terms of speedup. We started doing tests from a workload below the max capacity of the GPU, where we realised there was no real need to use both the GPU and the CPU, because we had already shown that beyond 30 million rows the GPU was much faster than the CPU. We decided to focus on the area where the GPU reaches max capacity (180 million to 198 million rows) and see if we could manage to get a speedup with the help of the CPU. We made sure the GPU-only version was faster than the GPU-combined-with-CPU version for our data span before the cap, to validate our results.

We predicted that when the GPU fills up, another data transfer would be needed once the GPU was done, and for the lower data sets of 10 and 20 million rows we had already shown that it was worth using the CPU. So these were the most interesting values to look at when we started running the tests. To our surprise, when combining the code for the CPU and the GPU we actually managed to get a speedup on the vectorized part compared to our previous vectorized version of Black-Scholes. So not only did we get a speedup on the lower part of the data sets, but even up to 60 million data rows we obtained a speedup by using the CPU. We cannot really explain the speedup of the CPU using the PGI compiler, because we already had vectorized code, but we suspect it has something to do with the alignment of data. When using the align functions you set boundaries in bytes to the amount you declared to the alignment functions, and they are also used to align data to the cache line to improve cache performance. We suspect that the compiler optimizes this alignment in a way that was more efficient than our own alignments.


Using this technique we can see that there is a great span where we actually benefit from applying the CPU once the GPU reaches its maximum capacity. The problem is knowing what the capacity limit of the GPU is, and we did not manage to create functionality in our code for that problem. So our code is not optimal for running this kind of problem, but we were able to determine a speedup, which was what we were looking for. If one is able to control this and knows when it is worth using the GPU instead of the CPU, this technique is definitely the best option when working with very large data sets.

Programming effort

Considering our situation, with the two programs in hand it would seem easy to just combine them into a correct program. We had to change the code due to compiler issues and adjust the data span, which took a long time until we settled for a rather poor solution. Working with parallelism also caused issues, and we have no idea why the result was worse when trying to use the GPU and CPU simultaneously. The program works for our purposes, but we consider it unfinished for now, and applying the proper features would require a lot of effort, since we have not yet found a proper solution to the problem.

5.1.4 General conclusions

In this section we discuss the results gathered from GPU and vector optimization to get a general idea of how they compare to each other.

Speedup

When starting this research we had a general idea of the results we would obtain when comparing vector instructions with GPU optimization. We predicted that the GPU would run slower than the vector instructions on small amounts of data, due to the memory transfer time over the bus relative to the work actually performed on the GPU. We also predicted that at some point, when increasing the data size, we would gain a lot of speed using the GPU, since the memory transfer time would be worth the execution speed obtained by running the algorithm on the GPU. As the results show, our predictions were correct, and we could also see where the GPU outperformed the CPU in terms of execution time. Basically, the conclusion is that the GPU is a lot more preferable for large data sets. When going past the max capacity of the GPU's memory, the process starts over: combine it with the CPU to gain speedup for the first couple of data sets, and then move over to only using the GPU when it can utilize the kernel effectively compared to the transfer time. We conclude that the best way to create an optimal program is to combine both techniques into one file and determine a way to adjust the technique based on the data size.

Programming effort

When comparing programming effort between the techniques there are two major things to take into consideration: the first is the amount of code rows you have to write in order to reach your goal, and the second is knowledge. We sat down and elaborated on how we judged the different files we have written or tried to write, and came up with the following.

             Minimal    Medium       Large
Code rows    OpenACC    SSE, CUDA    OpenACC + SSE
Knowledge    OpenACC    SSE          CUDA, OpenACC + SSE

Table 5.1: Programming effort

The motivation for our decisions is related to our own experience with the techniques and has not been published elsewhere. We considered CUDA to be the most demanding individual technique, due to our failure to complete the program and the amount of rows and time we spent learning CUDA compared to the result we did not manage to obtain. When looking at SSE and OpenACC in terms of code rows there was only one obvious decision: since OpenACC is built in a very similar way to OpenMP, the number of rows was very small compared to any other technique we looked into. The SSE code was basically the same number of rows as the normal Black-Scholes file, but in terms of keystrokes to generate the vector methods it is quite a bit more, and a lot more than OpenACC. The combined file required mixing SSE and OpenACC together and adding additional code to make them compatible with each other; increasing the work also meant an increase in code rows.

Regarding how we reasoned about knowledge, we feel obligated first to clarify that none of these techniques require little or even medium knowledge. The results we present in table 5.1 are based on a comparison against each other. These are all advanced techniques which require practice to learn.


As can be seen in table 5.1, we consider CUDA and SSE to both demand more knowledge than OpenACC, due to the automatically generated code which is supplied with OpenACC. Every method call you make in SSE and CUDA requires knowledge of how the CPU or the GPU will interpret that information; otherwise the optimization might not be as useful as expected. Working with optimization, you want to do it the best way possible, because there is a reason you wanted to optimize in the first place, and that takes a great deal of knowledge with CUDA and vector instructions.

5.2 Summary

When looking at CPU architecture today, we have reached a limit in terms of clock frequency. This means that applications have to utilize the CPU better in order to gain speedup. A common way to achieve speedup is parallelism, where you utilize a multicore CPU in order to make the program run in parallel. But there are more optimization techniques. We have focused on two of these: GPGPU acceleration and vector instructions. We set out to answer the problem statement: which of the optimization techniques is faster, and how much effort is required? GPGPU acceleration, vector instructions, or a combination of them both? To determine this, we used different hardware architectures and the PARSEC application Black-Scholes.

There are different APIs for applying GPGPU acceleration and vector instructions. The API we chose for GPGPU acceleration was OpenACC, but we also looked into CUDA and, briefly, OpenCL. For vector instructions we mainly used SSE over AVX, because the hardware architectures and CPUs with more cores only supported SSE.

The results of this research came out as we predicted. With a small workload the vector instructions were faster, and with a large workload accelerating on the GPGPU was faster. For workload sizes that exceeded the memory capacity of the GPGPU, we created a combined OpenACC and SSE version of Black-Scholes to create an optimal program. In conclusion, the general idea is to realise what data size your program is going to use and effectively apply the correct technique to match the data. For us the most effective program overall was the combined one, because we could adjust the technique based on the data size.

In terms of programming effort, OpenACC was the least demanding of the techniques we looked into, and CUDA was the most demanding. Working with the GPU there is a trade-off in time versus functionality between CUDA and OpenACC: if you want full control, CUDA is the proper choice; if you only want a functional GPGPU acceleration, OpenACC is preferable. SSE is also very time consuming, but with good knowledge of the API you only have to convert variables into SSE objects and apply the appropriate vector instructions to use the CPU more effectively. To generate the most optimal code, all these techniques require a lot of knowledge and practice for the desired results.

5.3 Future research

To continue this research from this point, one idea is to go back and complete the Cuda version, to see whether it could affect the results further and to experiment with the additional functions available in Cuda. Another idea is to improve the combined Black-Scholes, perhaps changing it to run with Cuda, and see whether there is a way to control the GPU and CPU better.

In terms of vector instructions, there is the option of finding a cluster that supports AVX, or even AVX 2 which is new for 2013. This could change the results even more, but it might require revisiting the AVX part of Black-Scholes. The AVX part works, but it can surely be improved to utilize the CPU better.
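For reference, moving the kind of SSE code shown earlier to AVX mainly means switching from 128-bit __m128 registers holding four floats to 256-bit __m256 registers holding eight. A minimal sketch, again with hypothetical names, assuming AVX intrinsics from <immintrin.h>, 32-byte aligned arrays and n being a multiple of eight:

    #include <immintrin.h> /* AVX intrinsics */

    /* AVX version: eight floats per iteration instead of four. */
    void scale_add_avx(float *out, const float *a, const float *b, int n) {
        for (int i = 0; i < n; i += 8) {
            __m256 va = _mm256_load_ps(&a[i]);   /* load eight floats */
            __m256 vb = _mm256_load_ps(&b[i]);
            __m256 vr = _mm256_add_ps(_mm256_mul_ps(va, vb), va);
            _mm256_store_ps(&out[i], vr);        /* store eight floats */
        }
    }

Processors with AVX 2 typically also support fused multiply-add (FMA) instructions, which could combine the multiply and the add above into a single instruction.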



6 Appendix

6.1 GPGPU - OpenACC

Compile & run

Compile command:

    pgcc -fast -ta=nvidia -Minfo=all,accel -Minline blackscholes-acc.c -Mfcon

Output:

    293, Loop not vectorized/parallelized: contains call
    299, Generating copyin(otype[0:numOptions])
         Generating copyin(otime[0:numOptions])
         Generating copyin(volatility[0:numOptions])
         Generating copyin(rate[0:numOptions])
         Generating copyin(strike[0:numOptions])
         Generating copyin(sptprice[0:numOptions])
         Generating copyin(price)
         Generating copy(prices[0:numOptions])
         Generating NVIDIA code
         Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
         Generating compute capability 3.0 binary
    300, Loop is parallelizable
         Accelerator kernel generated
         300, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
         Loop not fused: no successor loop
         Loop not vectorized: data dependency
    308, BlkSchlsEqEuroNoDiv inlined, size=25 (inline) file blackscholes-acc.c (159)
         213, CNDF inlined, size=27, file blackscholes-acc.c (95)
         214, CNDF inlined, size=27, file blackscholes-acc.c (95)
    main:
    380, Loop not vectorized/parallelized: contains call
    415, Loop not vectorized/parallelized: not countable
    492, Loop not vectorized/parallelized: contains call

Run command:

    $ blackscholes-acc.exe 1 in_64K.txt output.txt

Output:

    PARSEC Benchmark Suite
    Num of Options: 65536
    Num of Runs: 100
    Size of data: 2621440

    Accelerator Kernel Timing data
    blackscholes-acc.c
      bs_thread  NVIDIA  devicenum=0
        time(us): 59,116
        301: data copyin reached 800 times
             device time(us): total=38,174 max=317 min=5 avg=47
        302: kernel launched 100 times
             grid: [512]  block: [128]
             device time(us): total=16,334 max=169 min=163 avg=163
             elapsed time(us): total=15,000 max=1000 min=0 avg=150
        314: data copyout reached 100 times
             device time(us): total=4,608 max=52 min=45

6.1.1 blackscholes-acc.c

// Copyright (c) 2007 Intel Corp.

// Black-Scholes
// Analytical method for calculating European Options
//
// Reference Source: Options, Futures, and Other Derivatives, 3rd Edition,
// Prentice Hall, John C. Hull

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <string.h>
#include <time.h>

#define ENABLE_OPENMP 1

#ifdef ENABLE_PARSEC_HOOKS
#include <hooks.h>
#endif

// Multi-threaded pthreads header
#ifdef ENABLE_THREADS
// Add the following line so that icc 9.0 is compatible with pthread lib.
#define __thread __threadp
MAIN_ENV
#undef __thread
#endif

// Multi-threaded OpenMP header
#ifdef ENABLE_OPENMP
#include <omp.h>
#endif

// Moving workload to GPGPU
#ifdef ENABLE_OPENACC
/* printf("Using OpenACC"); */
#endif

#ifdef ENABLE_TBB
#include "tbb/blocked_range.h"
#include "tbb/parallel_for.h"
#include "tbb/task_scheduler_init.h"
#include "tbb/tick_count.h"
using namespace std;
using namespace tbb;
#endif //ENABLE_TBB

// Multi-threaded header for Windows
#ifdef WIN32
#pragma warning(disable : 4305)
#pragma warning(disable : 4244)
#include <windows.h>
#endif

//Precision to use for calculations
#define fptype float

#define NUM_RUNS 1

typedef struct OptionData_ {
    fptype s;          // spot price
    fptype strike;     // strike price
    fptype r;          // risk-free interest rate
    fptype divq;       // dividend rate
    fptype v;          // volatility
    fptype t;          // time to maturity or option expiration in years
                       // (1yr = 1.0, 6mos = 0.5, 3mos = 0.25, ..., etc)
    char OptionType;   // Option type. "P"=PUT, "C"=CALL
    fptype divs;       // dividend vals (not used in this test)
    fptype DGrefval;   // DerivaGem Reference Value
} OptionData;

OptionData *data;
fptype *prices;
int numOptions;

int    * otype;
fptype * sptprice;
fptype * strike;
fptype * rate;
fptype * volatility;
fptype * otime;
int numError = 0;
int nThreads;

//////////////////////////////////////////////////////////
// Cumulative Normal Distribution Function
// See Hull, Section 11.8, P.243-244
#define inv_sqrt_2xPI 0.39894228040143270286

//compares if the float f1 is equal with f2 and
//returns 1 if true and 0 if false
inline int compare_float(float f1, float f2)
{
    float precision = 0.01;
    if (((f1 - precision) < f2) && ((f1 + precision) > f2)) {
        return 1;
    } else {
        return 0;
    }
}

inline fptype CNDF ( fptype InputX )
{
    int sign;

    fptype OutputX;
    fptype xInput;
    fptype xNPrimeofX;
    fptype expValues;
    fptype xK2;
    fptype xK2_2, xK2_3;
    fptype xK2_4, xK2_5;
    fptype xLocal, xLocal_1;
    fptype xLocal_2, xLocal_3;

    // Check for negative value of InputX
    if (InputX < 0.0) {
        InputX = -InputX;
        sign = 1;
    } else
        sign = 0;

    xInput = InputX;

    // Compute NPrimeX term common to both four & six decimal accuracy calcs
    expValues = exp(-0.5f * InputX * InputX);
    xNPrimeofX = expValues;
    xNPrimeofX = xNPrimeofX * inv_sqrt_2xPI;

    xK2 = 0.2316419 * xInput;
    xK2 = 1.0 + xK2;
    xK2 = 1.0 / xK2;
    xK2_2 = xK2 * xK2;
    xK2_3 = xK2_2 * xK2;
    xK2_4 = xK2_3 * xK2;
    xK2_5 = xK2_4 * xK2;

    xLocal_1 = xK2 * 0.319381530;
    xLocal_2 = xK2_2 * (-0.356563782);
    xLocal_3 = xK2_3 * 1.781477937;
    xLocal_2 = xLocal_2 + xLocal_3;
    xLocal_3 = xK2_4 * (-1.821255978);
    xLocal_2 = xLocal_2 + xLocal_3;
    xLocal_3 = xK2_5 * 1.330274429;
    xLocal_2 = xLocal_2 + xLocal_3;
    xLocal_1 = xLocal_2 + xLocal_1;

    xLocal   = xLocal_1 * xNPrimeofX;
    xLocal   = 1.0 - xLocal;
    OutputX  = xLocal;

    if (sign) {
        OutputX = 1.0 - OutputX;
    }

    return OutputX;
}

////////////////////////////////////////////////////////////////
inline fptype BlkSchlsEqEuroNoDiv( fptype sptprice,
                                   fptype strike, fptype rate, fptype volatility,
                                   fptype time, int otype, float timet )
{
    fptype OptionPrice;

    // local private working variables for the calculation
    fptype xStockPrice;
    fptype xStrikePrice;
    fptype xRiskFreeRate;
    fptype xVolatility;
    fptype xTime;
    fptype xSqrtTime;
    fptype logValues;
    fptype xLogTerm;
    fptype xD1;
    fptype xD2;
    fptype xPowerTerm;
    fptype xDen;
    fptype d1;
    fptype d2;
    fptype FutureValueX;
    fptype NofXd1;
    fptype NofXd2;
    fptype NegNofXd1;
    fptype NegNofXd2;

    xStockPrice = sptprice;
    xStrikePrice = strike;
    xRiskFreeRate = rate;
    xVolatility = volatility;
    xTime = time;

    xSqrtTime = sqrt(xTime);

    logValues = log( sptprice / strike );
    xLogTerm = logValues;

    xPowerTerm = xVolatility * xVolatility;
    xPowerTerm = xPowerTerm * 0.5;

    xD1 = xRiskFreeRate + xPowerTerm;
    xD1 = xD1 * xTime;
    xD1 = xD1 + xLogTerm;

    xDen = xVolatility * xSqrtTime;
    xD1 = xD1 / xDen;
    xD2 = xD1 - xDen;

    d1 = xD1;
    d2 = xD2;

    NofXd1 = CNDF( d1 );
    NofXd2 = CNDF( d2 );

    FutureValueX = strike * ( exp( -(rate)*(time) ) );

    if (otype == 0) {
        OptionPrice = (sptprice * NofXd1) - (FutureValueX * NofXd2);
    } else {
        NegNofXd1 = (1.0 - NofXd1);
        NegNofXd2 = (1.0 - NofXd2);
        OptionPrice = (FutureValueX * NegNofXd2) - (sptprice * NegNofXd1);
    }

    return OptionPrice;
}

#ifdef ENABLE_TBB
struct mainWork {
    mainWork() {}
    mainWork(mainWork &w, tbb::split) {}

    void operator()(const tbb::blocked_range<int> &range) const
    {
        fptype price;
        int begin = range.begin();
        int end = range.end();

        for (int i=begin; i!=end; i++) {
            /* Calling main function to calculate option value based on
             * Black & Scholes's equation.
             */
            price = BlkSchlsEqEuroNoDiv( sptprice[i], strike[i],
                                         rate[i], volatility[i], otime[i],
                                         otype[i], 0);
            prices[i] = price;

#ifdef ERR_CHK
            fptype priceDelta = data[i].DGrefval - price;
            if( fabs(priceDelta) >= 1e-5 ){
                fprintf(stderr,"Error on %d. Computed=%.5f, Ref=%.5f, Delta=%.5f\n",
                        i, price, data[i].DGrefval, priceDelta);
                numError ++;
            }
#endif
        }
    }
};
#endif // ENABLE_TBB

///////////////////////////////////////////////////////////////
#ifdef ENABLE_TBB
int bs_thread(void *tid_ptr) {
    int j;
    tbb::affinity_partitioner a;
    mainWork doall;

    for (j=0; j<NUM_RUNS; j++) {
        tbb::parallel_for(tbb::blocked_range<int>(0, numOptions), doall, a);
    }

    return 0;
}

#else // !ENABLE_TBB

#ifdef WIN32
DWORD WINAPI bs_thread(LPVOID tid_ptr){
#else
int bs_thread(void *tid_ptr) {
#endif
    int i, j;
    fptype price;
    fptype priceDelta;
    int tid = *(int *)tid_ptr;
    int start = tid * (numOptions / nThreads);
    int end = start + (numOptions / nThreads);

    for (j=0; j<NUM_RUNS; j++) {
        clock_t begin, end;
        float time_spent;
        begin = clock();
        double startTime = omp_get_wtime();

        // Offload the pricing loop to the GPGPU with OpenACC
        #pragma acc kernels loop copyin(sptprice[0:numOptions],    \
            strike[0:numOptions], rate[0:numOptions],              \
            volatility[0:numOptions], otime[0:numOptions],         \
            otype[0:numOptions]) copy(prices[0:numOptions])
        for (i=0; i<numOptions; i++) {
            prices[i] = BlkSchlsEqEuroNoDiv( sptprice[i], strike[i],
                                             rate[i], volatility[i], otime[i],
                                             otype[i], 0);
        }
        #pragma acc wait

        printf("Execution time: %f\n", (omp_get_wtime() - startTime));

        // Verify the GPU results against a CPU recomputation
        fptype check;
        for(i=0; i<numOptions; i++){
            check = BlkSchlsEqEuroNoDiv( sptprice[i], strike[i], rate[i],
                                         volatility[i], otime[i], otype[i], 0);
            if(!compare_float(prices[i], check)){
                printf("Wrong result!\n");
            }
        }
    }

    return 0;
}
#endif //ENABLE_TBB

int main(int argc, char **argv)
{
    FILE *file;
    int i;
    int loopnum;
    fptype * buffer;
    int * buffer2;
    int rv;

#ifdef PARSEC_VERSION
#define __PARSEC_STRING(x) #x
#define __PARSEC_XSTRING(x) __PARSEC_STRING(x)
    printf("PARSEC Benchmark Suite Version "__PARSEC_XSTRING(PARSEC_VERSION)"\n");
    fflush(NULL);
#else
    printf("PARSEC Benchmark Suite\n");
    fflush(NULL);
#endif //PARSEC_VERSION

#ifdef ENABLE_PARSEC_HOOKS
    __parsec_bench_begin(__parsec_blackscholes);
#endif

    if (argc != 4) {
        printf("Usage:\n\t%s <nthreads> <inputFile> <outputFile>\n", argv[0]);
        exit(1);
    }
    nThreads = atoi(argv[1]);
    char *inputFile = argv[2];
    char *outputFile = argv[3];

    //Read input data from file
    file = fopen(inputFile, "r");
    if(file == NULL) {
        printf("ERROR: Unable to open file `%s'.\n", inputFile);
        exit(1);
    }
    rv = fscanf(file, "%i", &numOptions);
    if(rv != 1) {
        printf("ERROR: Unable to read from file `%s'.\n", inputFile);
        fclose(file);
        exit(1);
    }
    if(nThreads > numOptions) {
        printf("WARNING: Not enough work, reducing number of threads to match number of options.\n");
        nThreads = numOptions;
    }

#if !defined(ENABLE_THREADS) && !defined(ENABLE_OPENMP) && !defined(ENABLE_TBB)
    if(nThreads != 1) {
        printf("Error: <nthreads> must be 1 (serial version)\n");
        exit(1);
    }
#endif

    // alloc spaces for the option data
    data = (OptionData*)malloc(numOptions*sizeof(OptionData));
    prices = (fptype*)malloc(numOptions*sizeof(fptype));
    for ( loopnum = 0; loopnum < numOptions; ++ loopnum )
    {
        rv = fscanf(file, "%f %f %f %f %f %f %c %f %f", &data[loopnum].s,
                    &data[loopnum].strike, &data[loopnum].r, &data[loopnum].divq,
                    &data[loopnum].v, &data[loopnum].t, &data[loopnum].OptionType,
                    &data[loopnum].divs, &data[loopnum].DGrefval);
        if(rv != 9) {
            printf("ERROR: Unable to read from file `%s'.\n", inputFile);
            fclose(file);
            exit(1);
        }
    }
    rv = fclose(file);
    if(rv != 0) {
        printf("ERROR: Unable to close file `%s'.\n", inputFile);
        exit(1);
    }

#ifdef ENABLE_THREADS
    MAIN_INITENV(,8000000,nThreads);
#endif

    printf("Num of Options: %d\n", numOptions);
    printf("Num of Runs: %d\n", NUM_RUNS);

#define PAD 256
#define LINESIZE 64
