
Parallel Assembly of Sparse Matrices using CCS and OpenMP

Aidin Dadashzadeh and Simon Ternsjo

Project in Computational Science: Report, 17 February 2013

PROJECT REPORT


Abstract

Moore's law has remained accurate for decades after it was formulated, while development has changed course from higher clock frequencies on single cores to multiple cores because of the power wall. Programming efficiency, however, has not developed as fast. Various tools have been created to simplify the use of all the extra computing power available, but to fully utilize the multicore machines of today, parallel programming is necessary. One such parallel programming tool is OpenMP.

In computational science, and especially in finite element methods (FEM), sparse matrices are more important than ever because problem sizes keep increasing, and with them the memory needed. Assembling sparse matrices not only saves crucial memory but also speeds up the actual calculations. Assembly is thus an important step in using sparse matrices and should be done in a relatively short time. To simplify the use of sparse matrices, computational tools such as Matlab provide built-in functions that take care of this. However, these tools tend to be rather slow for large problem sizes, and users have therefore started creating their own routines for faster assembly.

A routine for assembling a sparse matrix using the compressed column storage (CCS) format has been written in C by our supervisor Stefan Engblom. The purpose of this project is to parallelize this routine with the goal of becoming five times faster on six cores than the built-in Matlab function, which runs sequentially. Since the C routine is already more than twice as fast as Matlab in its original form, the goal essentially reduces to doubling the speed of the C code on six cores.

From the results it is clear that the goal was reached and to some extent even exceeded. However, it is also clear that there is room for further improvement, mainly by changing the algorithm. Comparing the results to Amdahl's law shows that our attempts have managed to parallelize about 60% to 70% of the entire program.

This report explains what impact Moore's law and the power wall have had on computational development. We also show step by step how the parallelization was carried out and what methods were considered. A brief explanation of the sequential algorithms, such as the CCS format, is also given.


Contents

1 Introduction
2 Background and Purpose
  2.1 Reason behind the project and goals
3 Method
  3.1 Matlab's sparse function and the C code fsparse
  3.2 OpenMP
  3.3 The fsparse code, its functionality and parallelization
    3.3.1 The getix function
    3.3.2 The sparse function
    3.3.3 The sparse_insert function
4 Results
  4.1 Testing environment
  4.2 Creating test data
  4.3 Comparing the performance of Matlab's sparse function and the C code fsparse
  4.4 Parallelization
    4.4.1 The getix function
    4.4.2 The sparse function
    4.4.3 The sparse_insert function
  4.5 Overall performance analysis
5 Discussion and conclusions


1 Introduction

Assembling sparse matrices is very common in scientific computing; in finite element methods (FEM), for instance, it is done regularly. Because of the large problem sizes it has become common practice to transfer data from a regular matrix format into a sparse matrix format to save memory. In a typical simulation a matrix is generated in each time step, and the size of the matrix depends on the mesh size. As sparse matrices become increasingly important and consistently used, it is of great value to reduce the assembly time.

To increase the performance of a program, various methods are available. A boost in performance can be achieved both by changes in hardware and in software. Hardware changes include more memory, a higher processor clock frequency, better pipelining, or adding more hardware for simultaneous work. On the software side, the code can be further optimized, migrated to a faster language, or the algorithm may be changed entirely. In this report we focus on adding more hardware and rewriting a sequential code into a parallelized code to achieve better speed.

By adding more hardware we are in this case referring to utilizing more processing units for simultaneous work. These additional processing units may either be in the central processing unit (CPU) or in the graphics processing unit (GPU). A CPU in a normal personal computer usually consists of one to eight cores, where each core has a high clock frequency. A GPU, on the other hand, may contain several hundred cores, but each usually operates at a significantly lower frequency.

For the last couple of decades, Moore's law has held due to the fast improvements and the drop in cost of transistors over the years. However, microstructure capabilities have developed so far that the earlier problem of fitting more transistors on a chip has been resolved. Due to the power wall, the problem is now instead how to efficiently cool a chip with a large number of transistors on it. The power wall has steered development away from simply adding more transistors toward adding more cores to a CPU; this is what is referred to as a multicore machine, and it is a reason why multicore processing is becoming increasingly important. Moore's law still holds because of a fast increase in the number of cores, which in turn has made the single core machine almost extinct.

In this report we do not consider GPU programming any further and only describe parallel programming on CPU cores using the parallelization tool OpenMP. We explain how OpenMP can be applied to a sparse matrix assembly program written in C to achieve parallelization and better performance. The C program is later compared to sparse, the built-in function for assembling sparse matrices in Matlab.

2 Background and Purpose

In the earlier stages of computer development, hardware was expensive to build and took up much space. As time passed, the building blocks of the hardware were developed and reduced both in price and size, and this was the trend for many years. The observation describing this trend was made by Gordon E. Moore, co-founder of Intel, and is referred to as Moore's law; it can be read about further in (Moore, 1965, 2005, 2007).

What Moore predicted was that the number of integrated transistors doubles approximately every two years. Looking back at the development in technology, lower prices and advances in microstructure technology, it turns out Moore's prediction was close to perfect.

This trend has been true for a few decades, but recently it has started to slow down. Transistors are still getting cheaper and smaller, and there are no manufacturing problems involved with squeezing even more transistors onto a chip for improved computational power. The reason for the slowdown is instead heat management. More transistors mean more power consumption, and more power consumption means more heat. The issue is thus to cool the chip efficiently, and so far no such method has been found except at great cost. This is referred to as the power wall.

The power for a transistor can be calculated from equation 1 as follows,

\[ P = C V^2 f, \tag{1} \]

where P is the power, C is the capacitance of a transistor, V is the voltage over a transistor and f is the frequency. The power wall implies that the maximum power that can be efficiently cooled has been reached, and thus the frequency cannot be increased any further. As equation 1 shows, to increase the frequency without the power rising, either the capacitance or the voltage must be lowered. The equation also shows that a decrease in voltage would be the most beneficial, since the power depends quadratically on the voltage.

The power used to decrease as the transistors got smaller. However, the voltage has reached a critical level where a lower voltage would result in slower transistors, which in turn prevents the clock frequency from increasing. Furthermore, if the voltage gets too low, close to the threshold voltage Vth, there will be more leakage current (Sugii et al., 2002), even when the transistor is off, due to the tunneling phenomenon known as Fowler-Nordheim tunneling (Sze, 1981). As for the transistors themselves, a decrease in size gives a decrease in capacitance; however, shrinking the transistors while adding more of them on a chip results in the total capacitance staying the same.

With the power wall being a hindrance, new ways to improve performance are needed. One such way is, instead of increasing the clock speed of a CPU any further, to add more cores to the CPU. This is known as multicore processor architecture and is very common today; in fact, it has become the standard to such an extent that the single core processor is very outdated and uncommon. A high CPU frequency means that the computer can work on a job faster but sequentially. Adding more cores to the processor instead means that the job can be divided into smaller parts that are run in parallel. So, even if each part executes more slowly, the overall speed may be increased.

To be able to use all the cores, programming has become more complex, as the jobs have to be divided manually by the programmer. Several programming tools have therefore been developed to help programmers exploit all the added computational power. One such tool is OpenMP. But even with good programming tools, not all programs are parallelizable, and for the parts that are parallelizable there is still the issue of load balancing. All in all, achieving linear speedup is nearly impossible for most programs, and thus some computing power is wasted.

By estimating how large a fraction of the program is parallelizable, the expected speedup can be calculated using Amdahl's law (Amdahl, 1967), as shown in equation 2,

\[ S(N) = \frac{1}{(1 - P) + \frac{P}{N}}, \tag{2} \]

where S(N) is the speedup using N cores and P is the portion of the program that can be made parallel (P ≤ 1). This way it is possible to weigh the amount of work spent on parallelization against the expected gain and decide whether it is worth the effort.
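As a concrete illustration (our own minimal sketch, not part of the original report), equation 2 can be evaluated directly in C; the values of P below are chosen to match the figures discussed later in this report:

#include <stdio.h>

/* Amdahl's law: expected speedup on N cores when a fraction P of the
   program can be run in parallel. */
static double amdahl(double P, int N) {
    return 1.0 / ((1.0 - P) + P / (double)N);
}

int main(void) {
    printf("P = 0.60, N = 6: S = %.2f\n", amdahl(0.60, 6)); /* about 2.0 */
    printf("P = 0.70, N = 6: S = %.2f\n", amdahl(0.70, 6)); /* about 2.4 */
    printf("P = 0.99, N = 6: S = %.2f\n", amdahl(0.99, 6)); /* about 5.7 */
    return 0;
}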


2.1 Reason behind the project and goals

Matlab is commonly used in scientific computing. It is easy to use compared to, for instance, C or Java, and has a lot of built-in functions ready to be used.

One such function is the sparse function, which sequentially assembles a sparse matrix from regular matrix input. However, as easy as Matlab may be, it pays for this with a lack of speed, and it is therefore not uncommon for people to write their own routines. One such routine is a sequential C code provided to us by our supervisor Stefan Engblom; this program is called fsparse. The fsparse program has the same purpose as the Matlab sparse function but works differently. The purpose of this project is to rewrite fsparse to become parallel using the parallelization tool OpenMP, with the goal of running five times faster on six cores than the Matlab routine.

3 Method

Throughout this report a few tools and programs are introduced that are essential and used extensively in this project. A Matlab program has been compared to a C program, which in turn has been parallelized using the parallelization tool OpenMP. This section gives a short introduction to these concepts. It also describes how the project progressed and was carried out: how the programs were compared, tested, parallelized and profiled. Finally, it is explained how OpenMP was applied to fsparse in order to make a faster parallelized version.

3.1 Matlab’s sparse function and the C code fsparse

Matlab is very commonly used for scientific computing because it is relatively easy to program and user friendly compared to languages such as C or Java. In Matlab, for instance, there is no need to allocate memory manually, and there are many built-in functions where only input data need to be supplied. Additionally, vector and matrix arithmetic are very simple and straightforward. These are all tools to let the scientist concentrate on the problem rather than the programming. On the other hand, despite all of these advantages, Matlab has a big disadvantage: it is overall slow when compared to a compiled language such as C or Fortran.

One reason Matlab is lacking in speed is that it is mostly an interpreted language. This means that the program interprets the code as it goes, instead of first compiling it into machine code. Because of this there is no need for a separate compiler, but on the other hand the code is less portable and access to the Matlab environment is needed to execute it.

The Matlab function that this report focuses on is called sparse. It is a function that assembles a sparse matrix from input data. A sparse matrix is the same as a regular matrix but with only the necessary information; that is, all the zero values are removed and only indices and nonzero values are kept. This is done to save memory when handling very big matrices. The sparse function has five constructors and can thus handle five different input variations. The two most common inputs are either a regular matrix or three arrays, where the arrays contain the row indices, the column indices and the values.

Since Matlab is not an open source program, the code is hidden from the user and the algorithms used can only be speculated about. Hence, there is no certain way of knowing what algorithm is used in the sparse function. However, in a blog post from 2007 (Loren, 2007) on the Mathworks homepage, Loren Shure writes that the Matlab sparse function uses quicksort to sort the input data. Loren explains that this method is slower than if a linear sorting algorithm such as bucket sort had been used, but argues that it is still much faster than it could be.

The C code named fsparse is a program with the same purpose as the Matlab sparse function. This program is coded in the C language and based on the CCS (compressed column storage) format. Being coded in C already suggests it should be faster than the Matlab sparse function. Also, the fsparse program is very memory efficient and able to allocate the exact amount of memory needed.

3.2 OpenMP

Parallel programming is, as mentioned earlier, a very important and useful way of improving the performance of a program. It is not a trivial matter though; it is generally harder to parallelize a program than to write sequential code. Parallel programming has been developing so fast and progressed to such an extent that users have a hard time keeping up (Mo et al., 2010). Programming tools have been developed that let the user write the parallel code manually, but these require a deeper understanding of what is under the hood. To help the users, parallel software architectures such as Jasmin (Mo et al., 2010) have been introduced. In this type of architecture the user starts by writing a sequential code following specific guidelines, after which the software tries to automatically parallelize the program. This kind of architecture significantly lowers the burden, and the user can focus on the actual problem.

OpenMP is not quite as easy as a software architecture designed to help the user; however, compared to other parallel programming tools, such as MPI or pthreads, it is very user friendly. The syntax of OpenMP is short and intuitive, while it is still a high performance tool for accessing all the computational power in a computer. OpenMP is a shared memory paradigm, with memory shared between the threads, which greatly reduces the difficulty level. It works together with C, C++ and FORTRAN and on most platforms, granting the user flexibility and portability as well as being user friendly, which makes it ideal for personal use on a home computer. OpenMP is very similar no matter which language it is used with, although there may be some differences in available functionality; in FORTRAN, for instance, the OpenMP reduction clause includes a wider range of options.

To write a parallel code in OpenMP, it is suggested to first write a sequential code that is later transformed into the parallel code. This is because OpenMP often only adds a few additional lines and is largely based on the sequential code. When the sequential code is written, OpenMP can be applied to the parallelizable parts, and the result is a sequential code where defined parts are parallelized. In OpenMP there is a thread, a working unit, that forks a specified number of additional threads among which the work is divided. This thread is referred to as the master thread and the forked threads are referred to as slave threads. For every parallel part, the master thread creates a user-defined number of slave threads with a unique ID attached to each one. All the slave threads then execute the parallel section independently, and when the parallel part is over all the slave threads are terminated. This method of parallelizing is called fork/join and is visualized in Figure 1.
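As a minimal, self-contained sketch of the fork/join model (our own example, not taken from fsparse; the choice of four threads is arbitrary), the following program prints the ID of every thread created in a parallel region:

#include <stdio.h>
#include <omp.h>

int main(void) {
    omp_set_num_threads(4); // request four worker threads (arbitrary choice)

    // The master thread forks the other threads here...
    #pragma omp parallel
    {
        int id = omp_get_thread_num();      // unique ID of this thread
        int total = omp_get_num_threads();  // number of threads actually created
        printf("Hello from thread %d of %d\n", id, total);
    } // ...and they are joined here; only the master thread continues

    return 0;
}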

Figure 1: Visualization of the fork/join model of OpenMP.

Creating and terminating threads as in the fork/join method creates some overhead and, if used carelessly and improperly, may reduce the performance. It is therefore a good idea to be cautious about where and how the parallel regions are defined when writing an OpenMP parallelization.

It was mentioned before that OpenMP is user friendly and adds only a few extra lines to the code. In Listings 1 and 2 this is shown by first writing a simple for-loop in plain C, as in Listing 1, and then adding OpenMP to parallelize the loop, as in Listing 2. The for-loop in this example takes the predefined myArray and inserts the square of the counter i in each of its slots. This for-loop is independent, which means that what happens at a certain i has nothing to do with what happens at any other i. Independence is essential to parallel programming and simplifies it a lot.

Listing 1: C: for-loop

// A general independent for-loop written in C
for (int i = 0; i < 1000; i++) {
    myArray[i] = i*i;
}

Listing 2: OpenMP: for-loop

// An OpenMP implementation of a general for-loop
#pragma omp parallel for // Starts the parallel section and parallelizes the for-loop
for (int i = 0; i < 1000; i++) {
    myArray[i] = i*i;
}

In the example code in Listings 1 and 2, only one additional line has been added to parallelize the code, excluding the line for including the OpenMP header in the program. For the parallelization, OpenMP creates the threads, which all run the for-loop but for different i at the same time, giving this part a very good speedup.

A slightly more complex example is shown in Listings 3 and 4, where the for-loop is a little more dependent. The sum that is calculated depends on its previous value, but since the loop is executed in parallel the previous value is unknown at the time it is needed. This can be solved by letting each thread create its own mySum that is local and hidden from the other threads. Each thread can then update mySum as it wants, and at the end of the for-loop it contains a partial value of sum. At the end a critical section is introduced, which essentially makes that part of the parallel region sequential by only allowing one thread to enter at a time. This section adds each of the partial mySum values into the global sum. Adding a critical section slows down the performance and should be avoided whenever possible, but as can be seen, it is not always possible.

The basics of OpenMP, as in Listings 1 through 4, are not hard to grasp and can be learned easily. However, there are more complex cases which require a lot of tweaking and thought to successfully parallelize a code. Such examples are shown later in Section 3.3, where the method for parallelizing the C fsparse code is explained.

Listing 3: C: dependent for-loop

// A slightly dependent for-loop written in C
int sum = 0;
for (int i = 0; i < 1000; i++) {
    sum += myArray[i];
}

Listing 4: OpenMP: dependent for-loop

// An OpenMP implementation of a slightly dependent for-loop
int sum = 0; // A global sum shared by all threads

#pragma omp parallel // Starts the parallel section
{
    int mySum = 0; // Every thread creates its own local mySum

    #pragma omp for
    for (int i = 0; i < 1000; i++) {
        mySum += myArray[i]; // myArray[i] is added to the local mySum
    }

    // Reduction
    #pragma omp critical // Only one thread can enter this area at a time
    sum += mySum; // Adding all the partial local mySum into the global sum
}
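As an aside (our own sketch, not code from the original report), the manual reduction in Listing 4 can also be expressed with OpenMP's reduction clause, which creates the per-thread partial sums and combines them automatically:

// Equivalent to Listing 4, using the reduction clause instead of a critical section
int sum = 0;
#pragma omp parallel for reduction(+:sum)
for (int i = 0; i < 1000; i++) {
    sum += myArray[i]; // each thread accumulates a private copy of sum;
                       // OpenMP adds the copies into the shared sum at the end
}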

3.3 The fsparse code, its functionality and parallelization

This section discusses the C code fsparse and briefly explains how it is structured and what functionality it has. The section is divided into subsections, where each subsection explains a crucial part of the code and shows how it has been parallelized. Section 4.4 has the same structure where the results for each of the program parts are presented.

The fsparse code consists of several functions performing different steps until a sparse matrix is finally assembled. However, even though there are several functions, some of them only handle special cases. In fact, only three of the functions play a crucial role and are run in all cases, and this report only focuses on these three. Their names are getix, sparse and sparse_insert.


The fsparse program primarily takes two different kinds of input data: either a regular matrix A or a triplet II, JJ and SS. The triplet input is probably somewhat more common and useful. Input in the form of a regular matrix is very straightforward: all the zero values are removed and only the important data are stored and given as output, a sparse matrix. The output has the same format as the triplet input, that is, three arrays containing row indices, column indices and values. The inputs II and JJ are arrays containing row and column indices respectively, while SS is an array containing the value for each pair of indices. There are no restrictions on II and JJ being sorted, or on the same index pair occurring more than once; if it does, the program simply adds the values for the duplicate index pairs together. The output in this case is also a sparse matrix with the triplet format.

There are two more possible inputs that can be given; these are, however, options for the triplet input. The user may specify the dimensions of the output or specify that the columns do not need to be sorted with respect to the row indices. A complete call to fsparse may look like S = fsparse(II, JJ, SS, [M N Nzmax]), where M, N and Nzmax are the optional input data, M and N being the largest values in II and JJ respectively and Nzmax the number of nonzero elements in the final matrix. This report has only focused on inputs in the form of the triplet without any options given.

Note that the code examples in the following parts only show the crucial parts of the entire code; some parts may have been removed to give a clearer picture. Also, not every detail of the code will be explained, but mainly the structure and what to keep in mind when parallelizing.

3.3.1 The getix function

Out of the three core functions, getix is the first to run in the program. This function is executed twice for each program run: the first time it finds the largest value in the input II, and the second time it finds the largest value in JJ. The largest values in II and JJ correspond to the greatest indices, which in the end decide the matrix dimensions. Finding these largest indices is getix's only purpose.

In Listing 5 the basic code for the getix function is shown. The code consists of a for-loop searching for the largest value in the input array. This code is similar to the code in Listing 3 in the way that it is slightly dependent. The variable mx, containing the currently largest value in each iteration, may be overwritten with an incorrect value if one thread is evaluating the if-statement while another is writing its value to mx. Since there is a possibility of the right value getting overwritten by a wrong value, special care needs to be taken while parallelizing.

Listings 6 and 7 show two different attempts to parallelize this code, A and B respectively. In attempt A the code tests whether the value needs to be updated or not. If it needs to be updated, the thread enters a critical area where only one thread can be active at a time; the rest of the threads are idle while waiting for their turn to enter. Inside the critical area the same statement is tested again to make sure that nothing has changed since the last test before making the update. This way of doing the parallelization uses the critical region only when an update is needed. However, the critical region is inside a for-loop and may potentially be entered many times, which greatly decreases performance.

In attempt B a local variable myMx is introduced. Every thread has its own myMx and may update it whenever necessary. In this way, at the end of the for-loop each thread holds the maximum value of its range of i values. In this attempt the critical region is moved outside the for-loop and is only entered to determine which of the threads has the largest value of myMx; this is the final value passed on. By moving the critical region outside of the loop we ensure that it only runs once, hence we reduce the idle times of the threads and improve performance.

Listing 5: getix: sequential code

// The original and sequential code for getix

const double *ival = mxGetPr(IX);

int *iix = (*ix = mxMalloc(M*N*sizeof(int)));

for (int i = 0; i < M*N; i++) {
    if ((iix[i] = ival[i]) > mx)
        mx = ival[i];
}

*max = mx;

Listing 6: getix: parallel code A

// The first attempt at parallelizing: Attempt A
int mx = *max;

/* typecast copy */
const double *ival = mxGetPr(IX);
int *iix = (*ix = mxMalloc(M*N*sizeof(int)));

// Straightforward parallelization over i
#pragma omp parallel for
for (int i = 0; i < M*N; i++) {
    // Test to see if we need to enter the critical region at all
    if ((iix[i] = ival[i]) > mx) {
        // Only one thread at a time can do the update of mx
        #pragma omp critical
        {
            // Additional test to make sure that nothing has changed
            // before assigning the new value to mx
            if ((iix[i] = ival[i]) > mx)
                mx = ival[i];
        }
    }
}
*max = mx;

Listing 7: getix: parallel code B

// The second attempt at parallelizing: Attempt B
int mx = *max;

/* typecast copy */
const double *ival = mxGetPr(IX);
int *iix = (*ix = mxMalloc(M*N*sizeof(int)));

// Parallelized region where all threads share mx
#pragma omp parallel shared(mx)
{
    // Create a local myMx for each thread
    int myMx = 0;

    #pragma omp for
    for (int i = 0; i < M*N; i++) {
        // Make changes to the local myMx
        if ((iix[i] = ival[i]) > myMx) {
            myMx = ival[i];
        }
    }

    // Update the original mx from the local myMx
    // Test to see if we need to enter the critical region at all
    if (mx < myMx) {
        // Only one thread at a time can do the update of mx
        #pragma omp critical
        {
            // Additional test to make sure that nothing has changed
            // before assigning the new value to mx
            if (mx < myMx)
                mx = myMx;
        }
    }
}
*max = mx;

3.3.2 The sparse function

Since the fsparse program and the Matlab sparse function do essentially the same thing, it is natural to name them similarly. This may, however, cause confusion in this section. As was mentioned earlier, one of the functions inside the fsparse program is named sparse. This section describes that function, and there is no connection between this sparse and the Matlab sparse function.

The sparse function is the most important function inside fsparse; it is where the most crucial part of the program takes place. This is the function that constructs the sparse matrix using the CCS format. Being such an important function also makes it longer and more complex than the other functions. To simplify the structure, this function can be divided into four parts, where each part consists of one or more for-loops performing a step in the assembly process.

During the parallelization of the four parts, parts 1, 2 and 4 could be parallelized separately, while part 3 turned out to be different: to parallelize part 3, it had to be merged with part 4. Because of this, the order in which the different parts are explained is part 1, part 2, part 4 and lastly part 3.

CCS, also called CSC (compressed sparse column), is a sparse matrix storage format. To build a CCS representation, three arrays are needed. The first array is a value array that saves all the non-zero values of the matrix from top to bottom and then left to right. The second array saves the row indices corresponding to each value in the value array. The last array is a pointer array, containing a list of indices where each column starts. An example is shown below.

Given a matrix A defined by

\[
A = \begin{pmatrix}
10 & 0 & 0 & -2 \\
 3 & 9 & 0 &  0 \\
 0 & 7 & 8 &  7 \\
 3 & 0 & 8 &  5
\end{pmatrix},
\]

the three arrays for the CCS format are

\[
\begin{aligned}
\text{val}      &= \begin{pmatrix} 10 & 3 & 3 & 9 & 7 & 8 & 8 & -2 & 7 & 5 \end{pmatrix}, \\
\text{row\_ind} &= \begin{pmatrix}  1 & 2 & 4 & 2 & 3 & 3 & 4 &  1 & 3 & 4 \end{pmatrix}, \\
\text{col\_ptr} &= \begin{pmatrix}  1 & 4 & 6 & 8 & 11 \end{pmatrix},
\end{aligned}
\]

where val contains the non-zero values of the matrix A, row_ind contains the row indices at which the values in val are stored, and col_ptr contains a column-wise accumulating sum of the number of non-zero elements in A, starting with the value 1.

Only a brief explanation of CCS is given for the code in the different parts. For further knowledge about the CCS format, the reader is encouraged to read the article by Duff et al. (Duff et al., 1989).
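To illustrate how the three CCS arrays are typically used (this sketch is our own and not part of fsparse; note that it uses zero-based indexing, unlike the one-based example above), a sparse matrix-vector product y = A*x can be written as follows:

// Minimal CCS (compressed column storage) matrix-vector product y = A*x.
// n is the number of columns, col_ptr has n+1 entries and y is assumed
// to be zero-initialized by the caller.
void ccs_matvec(int n, const int *col_ptr, const int *row_ind,
                const double *val, const double *x, double *y)
{
    for (int col = 0; col < n; col++) {
        // all non-zeros of column 'col' lie in val[col_ptr[col] .. col_ptr[col+1]-1]
        for (int k = col_ptr[col]; k < col_ptr[col + 1]; k++) {
            y[row_ind[k]] += val[k] * x[col];
        }
    }
}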

Part 1 This part builds the row pointer, as shown in Listing 8. The first loop is dependent, since ii[i] does not contain unique numbers and the same memory address may therefore be updated more than once. The parallelization of the first loop is consequently much longer and is done in two steps. In the first step, each thread gets its own local copy of irS, called myirS, which is updated with that thread's contributions. In the second step the local myirS arrays are reduced into the global irS; this is done by the threads updating irS simultaneously but at different locations.

The second loop is harder to parallelize. It can be compared to the Fibonacci sequence (Beck and Geoghegan, 2010) and can be parallelized likewise: by changing the algorithm into a recursive one, a parallelization may be more easily attained, for instance using the #pragma omp section directive in OpenMP. However, according to the runtimes of the program, as shown in Section 4.4, this part does not take any significant amount of execution time. The parallelization effort compared to the gain was considered too large, and thus this loop is not parallelized in this project.

Listing 8: sparse part 1: sequential code

// Sparse function part 1
for (int i = 0; i < len; i++)   // First loop
    irS[ii[i]]++;

for (int r = 2; r <= M; r++)    // Second loop
    irS[r] += irS[r-1];

A first attempt, attempt A, at parallelizing is shown in Listing 9. Again, as in the case of getix, a local variable myirS can be created. Since irS contains a sum, the final value can easily be found by adding all the local variables together. In attempt A the local myirS is created by each thread and updated independently. Lastly, the threads enter a critical region where they one by one add their local, partial sum myirS to the global final sum in irS. As has been mentioned before, critical regions are not good for performance and should be avoided if possible, since not all threads can work at the same time. A solution for this has been worked out and is introduced as attempt B in Listing 10.

Attempt A and attempt B differ only in the way they add their partial sums to the global sum. In attempt A, only one thread at a time is allowed to add its values, inside a critical region. In attempt B this critical region has been removed. Instead, the array irS is divided into as many chunks as there are threads, and each thread is assigned a different chunk, so that all threads can update irS in parallel. If the array cannot be evenly divided, a small chunk is left unassigned; this chunk is then updated by a single thread after its assigned chunk has already been updated (preferably, the first available thread should take the leftover chunk). When all the threads are done updating their respective chunks, they rotate: thread 0 goes on to update the chunk thread 1 was updating, and so on. This method of adding all the local variables into the global variable in parallel is referred to as a rotational reduce and is visualized in Figure 2 for a case with four running threads. Rotational reduce greatly enhances performance compared to using a critical region, since no thread is idle except in the case where a small chunk is left over.

Listing 9: sparse part 1: parallel code, attempt A

// Parallelization of the first loop in Sparse function part 1 - Attempt A

// Create a parallel region
#pragma omp parallel
{
    // Allocate a local myirS for temporary storage of each thread's updates
    mwIndex *myirS = mxCalloc(M+1, sizeof(irS[0]));
    int myId = omp_get_thread_num(); // The ID of the thread

    // Update the local myirS
    #pragma omp for
    for (int i = 0; i < len; i++) {
        myirS[ii[i]]++;
    }

    // Reduce the threads one by one inside a critical region
    for (int i = 0; i < M+1; i++) {
        #pragma omp critical
        {
            irS[i] += myirS[i];
        }
    }
}

// Second loop - No parallelization
for (int r = 2; r <= M; r++) {
    irS[r] += irS[r - 1];
}


Listing 10: sparse part 1: parallel code, attempt B

// Parallelization of the first loop in Sparse function part 1 - Attempt B

// Create a parallel region
#pragma omp parallel
{
    // Allocate a local myirS for temporary storage of each thread's updates
    mwIndex *myirS = mxCalloc(M+1, sizeof(irS[0]));
    int myId = omp_get_thread_num(); // The ID of the thread

    // Update the local myirS:
    #pragma omp for
    for (int i = 0; i < len; i++) {
        myirS[ii[i]]++;
    }

    // Rotational reduce
    // Reducing irS simultaneously but at different offsets;
    // maxThreads and chunkLength ((int)(array_size)/maxThreads) are
    // declared before the parallel region
    for (int loop = 0; loop < maxThreads; loop++) {
        int chunk = (loop + myId) % maxThreads; // Determines the offset

        for (int i = chunk * chunkLength; i < (chunk+1) * chunkLength; i++) {
            irS[i] += myirS[i];
        }

        // Update the last part of the array if array_size was not
        // evenly divided by maxThreads
        if (myId == loop) {
            for (int i = maxThreads * chunkLength; i < M+1; i++) {
                irS[i] += myirS[i];
            }
        }

        // Synchronize threads to be sure all have updated before continuing
        #pragma omp barrier
    } // end reduction for-loop
}

// Second loop - No parallelization
for (int r = 2; r <= M; r++) {
    irS[r] += irS[r - 1];
}

Although the critical region has been removed, a barrier has been inserted in its place. Similar to critical regions, barriers should be avoided if possible. A barrier is used to synchronize all the threads; in attempt B it is inserted so that no thread can start updating the next chunk before all threads have finished their current chunk. Without the barrier, a thread might start updating a chunk that is already being updated, and if this were to occur the data would be corrupted.


Figure 2: Visualization of the rotational reduce method for four threads adding their local variables into a global variable in parallel.

Part 2 In part 2 of the sparse function the array rank is created. This is done by traversing the data in a row-wise manner; the sequential code is shown in Listing 11. Due to the indirect memory accesses involved in building rank, it is very hard to parallelize. Since ii[i] can have the same value for different values of i, the loop can increment the same place in irS more than once. A straightforward parallelization would therefore lead to wrong values of i being written to the wrong memory addresses in rank, and some positions in rank might not be written to at all.

To solve this problem, the loop has been divided into two parts: one part that is not parallelizable and one part that can be run in parallel, as shown in Listing 12. Since irS[ii[i]] can be incremented at the same place more than once, and this decides which place in rank should be written to, the update of irS[ii[i]] must be done sequentially. In the first part the values of irS are saved into a temporary array; these values are the memory addresses that rank will be written to in the second part.

The second part updates rank with its values in the right positions. Since rank only contains unique values, this part can be parallelized with a straightforward parallelization of the loop. The performance of the whole part 2 is not improved as much as it potentially could have been if the irS update could be run in parallel as well; since only half of the loop is executed in parallel, about half the increase in performance is expected.

Listing 11: sparse part 2: sequential code

rank = mxMalloc(len*sizeof(rank[0]));
irS--; /* unit offset in ii */
for (int i = 0; i < len; i++)
    rank[irS[ii[i]]++] = i;

Listing 12: sparse part 2: parallel code

/* build rank which provides for a row-wise traversal */
rank = mxCalloc(len, sizeof(rank[0]));
irS--; /* unit offset in ii */

// Allocate a temporary array to save irS values in for later use in rank
int *temp = mxCalloc(len, sizeof(int));

// First part - sequential
for (int i = 0; i < len; i++) {
    temp[i] = irS[ii[i]];
    irS[ii[i]]++;
}

// Second part - parallel
// Straightforward parallelization since we know rank only contains unique values
#pragma omp parallel for
for (int i = 0; i < len; i++) {
    rank[temp[i]] = i;
}

// Free the temporary array temp
mxFree(temp);

Part 4 Due to the specific features of the algorithm in part 3 and part 4, it is better to explain part 4 before stepping into part 3. Part 4 is the last part of the sparse function and is similar to part 1 in two ways: like part 1 it consists of two for-loops, and the first for-loop in part 4 is the same as the second for-loop in part 1. The sequential code is shown in Listing 13. Similarly, this for-loop can be parallelized in the same way as a Fibonacci sequence (Beck and Geoghegan, 2010); however, due to its relatively small computing time, it was found unnecessary to parallelize this loop for this particular project.

Since irank is updated at memory addresses corresponding to i in the second loop, it is known that multiple threads will not write to the same memory address, and hence the loop is perfectly independent. Knowing this, it is straightforward to parallelize it in the simplest way, as in Listing 14.

Listing 14 shows how part 4 can be parallelized separately. However, because part 3 is so complex and hard to parallelize, it is shown in the next paragraph how part 4 can be merged with part 3 to achieve parallelization of part 3. This means that the parallelization shown in Listing 14 is not implemented in the final fsparse program; a different parallelization of part 4 has instead been made inside part 3 and is shown in the following paragraph.

Listing 13: sparse part 4: sequential code

// Sparse function part 4
for (int c = 2; c <= N; c++)
    jcS[c] += jcS[c-1];

/* account for the accumulation of indices */
jcS--;
for (int i = 0; i < len; i++)
    irank[i] += jcS[jj[i]];
jcS++;

Listing 14: sparse part 4: parallel code

// Parallelization of sparse function part 4
for (int c = 2; c <= N; c++)
    jcS[c] += jcS[c-1];

jcS--;
int temp2; // declared before the loop so that it can be made private per thread
#pragma omp parallel for private(temp2)
for (int i = 0; i < len; i++)
{
    temp2 = jj[i];
    irank[i] += jcS[temp2];
}
jcS++;

Part 3 In this part of the code, the program loops over the input and makes each column unique with respect to row indices, building both an index vector irank and the final column pointer at the same time. This is done by two nested loops with somewhat intertwined loop indices. Since part 4 was merged into part 3 in the final version, Listing 15 shows both part 3 and part 4 as the sequential code.

Listing 15: sparse part 3: sequential code

/* loop over input and make each column unique with respect to row
   indices, building both an index vector irank and the final column
   pointer at the same time */
jcS = mxCalloc(N+1, sizeof(jcS[0]));
hcol = mxCalloc(N, sizeof(hcol[0]));
hcol--; /* unit offset in jj */
irank = mxMalloc(len*sizeof(irank[0]));

for (int row = 1, i = 0; row <= M; row++)
    for ( ; i < irS[row]; i++) {
        const int ixijs = rank[i];
        const int col = jj[ixijs];

        /* new element; count it */
        if (hcol[col] < row) {
            hcol[col] = row;
            jcS[col]++;
        }

        /* remember where it should go */
        irank[ixijs] = jcS[col]-1;
    }

mxFree(++irS);
mxFree(rank);
mxFree(++hcol);

for (int c = 2; c <= N; c++)
    jcS[c] += jcS[c-1];

/* account for the accumulation of indices */
jcS--;
for (int i = 0; i < len; i++)
    irank[i] += jcS[jj[i]];
jcS++;

In order to parallelize part 3, the data is divided so that the different threads get their own chunk of rows of the matrix. This is done manually and achieved by the lowBound and hiBound values, which are different for all threads but arranged so that all rows in the matrix are covered. After that, each thread does the same calculations as in the sequential version but works on its own data.

This nested loop creates a matrix of jcS counters, one row per thread, stored in myJcSP, instead of the single vector jcS in the sequential case; this matrix then needs to be accumulated to get the same values as in the sequential case. This is done very efficiently in parallel; see the code in Listing 16, below the comment "accumulates myJcSP over the processors", for more details.

The nested loop also creates a version of irank that is not the same as the corresponding sequential irank, since each thread only uses its own data. This must be accounted for, and is done by adding specific values from jcS to specific places in irank; see the code in Listing 16, below the comment "accounting for wrong irank", for more details.

The accounting for wrong irank is only done by the threads with an ID higher than 0. This lets thread 0 start working on the next part of the code, which is the first subpart of part 4. Since that subpart is done sequentially, thread 0 can work on it while the other threads account for the wrong irank. There is no risk of thread 0 overwriting values in jcS used by the other threads, since thread 0 is writing to a different row.

After the first subpart of part 4 is done, the next subpart of part 4 begins: a second accounting for wrong irank. This accounting is not a consequence of parallelizing the code, but was in the sequential code from the beginning. This part is straightforwardly parallelizable, and although the variable temp2 could be omitted in favor of jj[i], we noticed a slight increase in speed when using it, and therefore it is incorporated in the final code.

Listing 16: sparse part 3: parallel code

/* loop over input and make each column unique with respect to row
   indices, building both an index vector irank and the final column
   pointer at the same time */
jcS = mxCalloc(N+1, sizeof(jcS[0]));
irank = mxMalloc(len*sizeof(irank[0]));

int numThreads = omp_get_max_threads();
mwSize **myJcSP;
myJcSP = (mwSize**) mxMalloc(numThreads*sizeof(mwSize*));
for (int i = 0; i < numThreads; i++) {
    myJcSP[i] = (mwSize*) mxCalloc(N+1, sizeof(jcS[0]));
}

#pragma omp parallel
{
    // declaration of the thread-local (my-) variables:
    int myId = omp_get_thread_num();
    int *hcol;
    hcol = mxCalloc(N, sizeof(hcol[0]));
    hcol--; /* unit offset in jj */

    int lowBound = M*myId/numThreads+1; // +1 because of Matlab standard (not starting at 0)
    int hiBound = M*(myId+1)/numThreads+1;

    /* calculate the thing */
    int iStart = irS[lowBound-1];
    if (myId == 0) iStart = 0;

    for (int row = lowBound, i = iStart; row < hiBound; row++) {
        for ( ; i < irS[row]; i++) {
            const int ixijs = rank[i];
            const int col = jj[ixijs];

            /* new element; count it */
            if (hcol[col] < row) {
                hcol[col] = row;
                // myJcSP[myId][col]++;
                irank[ixijs] = myJcSP[myId][col]++;
            } else
                irank[ixijs] = myJcSP[myId][col]-1;
        }
    }

    #pragma omp barrier

    /* accumulates myJcSP over the processors */
    #pragma omp for
    for (int i = 0; i < N+1; i++) {
        for (int row = 1; row < numThreads; row++) {
            myJcSP[row][i] += myJcSP[row-1][i];
        }
    }

    /* accounting for wrong irank */
    if (myId != 0) {
        for (int i = iStart; i < irS[hiBound-1]; i++) {
            irank[rank[i]] += myJcSP[myId-1][jj[rank[i]]];
        }
    }

    // don't need a barrier because read & write from myJcSPs is from different rows
    #pragma omp single
    for (int c = 2; c <= N; c++) {
        myJcSP[numThreads-1][c] += myJcSP[numThreads-1][c-1];
    }

    /***************************/
    #pragma omp single
    *myJcSP[numThreads-1]--;

    #pragma omp barrier

    int temp2 = 0;
    #pragma omp for
    for (int i = 0; i < len; i++) {
        temp2 = jj[i];
        irank[i] += myJcSP[numThreads-1][temp2];
    }

    #pragma omp barrier
    #pragma omp single
    *myJcSP[numThreads-1]++;
} // end parallel

3.3.3 The sparse_insert function

The sparse_insert function inserts the elements into the final sparse matrix and is called at the end of the sparse function. It is divided into four different cases depending on what the input data look like. However, there are no big differences between the cases, and if one is parallelized, the same parallelization technique can be applied to the others. In Listing 17 one of the four cases is shown to illustrate how it can be parallelized.

Listing 17: sparse_insert: sequential code

// One of four cases in sparse_insert
case 3: /* full case */
    for (int i = 0; i < len; i++) {
        irS[irank[i]] = ii[i]-1;
        prS[irank[i]] += sr[i];
    }


As is shown in Listing 17, the updates of irS and prS are independent of each other. However, because the value of irank at position i is unknown in advance and irank could contain the same value at two different positions, the updates of irS and prS are hard to parallelize over i. A straightforward parallelization of the for-loop would give a wrong result. Using local variables to store partial values and accumulating them into a global value at the end also fails, simply because in this case no partial values are stored; only absolute values are assigned.

When analyzing the computation time for the irS and prS updates separately, it turns out that they take approximately the same time. Since they are independent of each other, a simple solution is to let different threads handle irS and prS in parallel with each other. A way of doing this is shown in Listing 18.

Listing 18: sparse_insert: parallel code A

// The first attempt at parallelizing: Attempt A

// Create a parallel region
#pragma omp parallel
{
    int myID = omp_get_thread_num();        // Thread unique ID
    int maxThreads = omp_get_num_threads(); // Number of available threads

    // Only two threads do any job
    if (myID == 0) {
        // Job 1
        for (int i = 0; i < len; i++) {
            irS[irank[i]] = ii[i]-1;
        }
    }

    // If only 1 thread is available it will do both jobs sequentially
    if (myID == 1 || maxThreads == 1) {
        // Job 2
        for (int i = 0; i < len; i++) {
            prS[irank[i]] += sr[i];
        }
    }
}

In attempt A, Listing 18, the irS and prS updates are moved into two separate for-loops, job 1 and job 2. The program then assigns these two jobs to two different threads, in this case threads 0 and 1. This is, however, not the best way of dividing the work between threads in a case such as this. Imagine, for instance, that threads 0 and 1 are the last ones to finish the previous work; the program is then idle waiting for those specific threads instead of assigning the work to an available thread. A better way is shown in attempt B, Listing 19.

This method does the same as attempt A, but with the difference that the jobs are assigned to the first threads available rather than to a predefined thread, thus saving some execution time.

Listing 19: sparse_insert: parallel code B


// The second attempt at parallelizing: Attempt B

// Create a parallel region
#pragma omp parallel
{
    // Only one thread does this work, the rest of the threads move on
    #pragma omp single nowait
    {
        // Job 1
        for (int i = 0; i < len; i++) {
            irS[irank[i]] = ii[i]-1;
        }
    }

    // Only one thread does this work, the rest of the threads move on
    #pragma omp single nowait
    {
        // Job 2
        for (int i = 0; i < len; i++) {
            prS[irank[i]] += sr[i];
        }
    }
}

The benefit of this method is its simplicity and easy-to-follow execution. However, while simple and convenient, it is not to be preferred if better methods are available; not using all of the computational power granted is a waste of energy, and with these methods the threads that are not doing job 1 or job 2 are waiting idly. The expected performance on two threads is twice as fast as the sequential run, but because at most two threads are used it will not scale any further. In other words, sparse_insert will not benefit from adding more threads than two and hence may never be more than twice as fast.

4 Results

In this section the results of the parallelization are presented. First the results are presented individually for the different functions and parts, and later they are combined to give a better overview. This section also explains the testing conditions and limitations under which the results were obtained.

To measure the time for the programs, a profiling tool such as gprof (Graham et al., 1982) would have been preferred. However, because the fsparse program has its interface written in Matlab and is connected to the C language through MEX, gprof could not be used without rewriting the interface in C as well. For the same reason, a memory analysis tool such as Valgrind (Nethercote and Seward, 2007) could not be used either. Instead, to time the program, manually inserted wall-clock timers based on gettimeofday have been placed at critical areas. For the Matlab sparse function, tic and toc have been used for time measurements. An analysis of the memory usage has not been made in this report.
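As an illustration of the kind of manual wall-clock timing used (a minimal sketch of our own, not the actual instrumentation inside fsparse):

#include <stdio.h>
#include <sys/time.h>

// Returns the current wall-clock time in seconds.
static double wall_time(void) {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + 1e-6*(double)tv.tv_usec;
}

// Usage around a region of interest:
//   double t0 = wall_time();
//   /* ... code section to be timed ... */
//   double t1 = wall_time();
//   printf("elapsed: %f s\n", t1 - t0);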

Note that not all results for all the codes are in this report. It contains only the results from the code attempts that have been used in the final version of the code. The codes that are used and have their results presented here are described in Section 3.3.


4.1 Testing environment

To be able to recreate all of the tests performed and achieve similar results, hardware and software information is gathered in this section. All the tests throughout this report were performed on a Linux server at Uppsala University. The hardware of the cluster is as follows:

• CPU: AMD Opteron(TM) Processor 6274

• Frequency: 2200 MHz

• Processors: 2

• Cores: 8

• Hyperthreading: Yes

• Cache size: 2048 KB

• Memory: 112 GB

The program was compiled using GCC version 4.4.6 (2012-03-05, Red Hat 4.4.6-4) and the tests were run on a 64 bit Matlab, version R2012b (8.0.0.783).

4.2 Creating test data

To test the two programs, the sequential and the parallelized fsparse, input data had to be created. The same input data is used in both programs to guarantee fairness. The test data is created in the form of a triplet, three arrays ii, jj and ss, where ii contains the row indices, jj contains the column indices and ss contains the values. It is constructed in such a way that for every column index there exist between 90 and 100 unique row indices in the range of one to 500000, and for each of these row indices there exist between four and 20 duplicates. The ss array contains only the value one in each position. These values are uniformly distributed, and the upper limit of the unique values is referred to as the problem size, which determines the final size of the sparse matrix. Finally, the arrays ii and jj are randomized according to the same seed, which means that a certain size of the final matrix and a certain degree of disorder can be guaranteed. If the final sparse matrix is analyzed, its values correspond to how many duplicates existed for every row-column index pair. The test data was created with the routine written in Matlab shown in Listing 20 and used throughout the project.

Listing 20: Creating test data

nnz_row = 95; % average nnz per row...

nnz_span = 5; % ...+/- this value

dupl = 12; % average number of duplicates of each entry...

dupl_span = 8; % ...+/- this value

N = 500000; % final size of sparse matrix - increase...

nnz_row = nnz_row+round(2*nnz_span*(rand(1,N)-0.5));

nnz = sum(nnz_row); % total number of non-zeros


dupl = dupl+round(2*dupl_span*(rand(1,nnz)-0.5));

ii = 1:N;

ix = cumsum(sparse(1,cumsum([1 nnz_row]),1));

rep = sparse(1,ii(ix(1:end-1)),dupl);

ix = cumsum(sparse(1,cumsum([1 rep]),1));

ii = ii(ix(1:end-1));

jj = ceil(N*rand(1,nnz));

ix = cumsum(sparse(1,cumsum([1 dupl]),1));

jj = jj(ix(1:end-1));

p = randperm(numel(ii));

ii = ii(p);

jj = jj(p);

ss = ones(1, size(ii,2));

4.3 Comparing the performance of Matlab’s sparse function and the C code fsparse

It is expected that fsparse is faster than the Matlab sparse function, mainly because it is written in C. Creating test data as described in Section 4.2 and using it as input to both fsparse and the sparse function while measuring the execution time reveals that fsparse is indeed faster than the Matlab sparse function. In fact, from the measurements visualized in Figure 3 it is seen that fsparse is about 2.5 times faster than the Matlab sparse function in its original form. This is true for both smaller and larger problem sizes. However, whether this is because it is written in C or because it uses a different algorithm is not known from this test.

The goal of this project is to parallelize fsparse to become five times faster on six cores than the Matlab sparse function, which runs sequentially. Since fsparse is already 2.5 times faster than the Matlab sparse function, the goal for this project is essentially to double the speed of fsparse on six cores. This can be related to Amdahl's law in equation 2, and Figure 4 shows what the possibility of such a speedup is for this program.

In addition to the test shown in Figure 3, another test was made. This test was only done out of curiosity and no results from it are shown. It used sorted rather than unsorted input data, and revealed that for a sorted list the execution time for fsparse decreases somewhat, while the execution time for the Matlab sparse function decreases greatly; in fact, when the input data is sorted, the Matlab sparse function executes faster than fsparse. This may be an additional indication that the Matlab sparse function relies on sorting and can detect whether a list is sorted or not, while fsparse, which does not rely on sorting, is not affected as much. However, since this was a small test and no results are presented, it is not discussed further. Tests with partially sorted input data have not been made, hence no results are presented for that case.

4.4 Parallelization

Figure 3: Comparison of the Matlab sparse function and fsparse in its original form. The two bars to the left show the times for a smaller problem size, while the two bars to the right show the times for a larger problem size. The blue bars show the times for the Matlab sparse function and the yellow bars show the times for fsparse; fsparse is about 2.5 times faster than Matlab sparse independent of problem size.

Figure 4: Plot of Amdahl's law for six cores: the obtainable speedup on six cores depending on the fraction of the sequential code that is parallelizable.

The structure in which the results are presented in this section is similar to that of the analysis of the code in Section 3.3. By first presenting the results individually, a better view is given of how much each function and part has been improved thanks to the parallelization. By later combining the results, a good overview is given of how well the entire program has been improved.

As mentioned in Section 4.3, the goal of this project has been reduced to doubling the speed of fsparse on six cores. To know whether this kind of speedup is possible, Amdahl's law, equation 2, is useful: if the fraction of the sequential program that is parallelizable is known, the maximum speedup can easily be calculated. This fraction is, however, not easily obtained exactly, so rather than knowing the degree of parallelization exactly, an estimate is used.

Estimating how much of fsparse is parallelizable is in this case done by timing all the separate functions of the program individually and comparing the time of the functions that are attempted to be parallelized to the total execution time. This method assumes that all the parts attempted to be parallelized in this project are perfectly parallelizable, which is not the case, as has been seen for sparse insert for instance. Ignoring that for now and assuming a good level of parallelism is achievable for all the attempted parts, the fraction of the program that is parallelizable is as much as 99.77%. From Figure 4 it can be seen that with a fraction of 99.77%, an almost perfect speedup of six times on six cores is ideally obtainable. Figure 4 also shows that the possible speedup decreases very fast as the fraction decreases. Even so, the degree of parallelization required to reach the goal of twice as fast is only about 60%.
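As a check of these numbers, and assuming equation 2 has the standard form with $p$ the parallelizable fraction and $n$ the number of cores, the speedup is

\[
S(n) = \frac{1}{(1-p) + p/n}, \qquad
S(6)\Big|_{p=0.9977} = \frac{1}{0.0023 + 0.9977/6} \approx 5.9,
\]

and requiring $S(6) = 2$ gives

\[
(1-p) + \frac{p}{6} = \frac{1}{2}
\quad\Longrightarrow\quad
p = 0.6,
\]

which matches the roughly 60% quoted above.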

Before presenting the results for each of the individual functions, it is important to explain how much execution time each function needs in comparison to the others, so that a sense of which functions are the most important to parallelize can be derived. Figure 5 shows how much time the three functions getix, sparse and sparse insert consume. As can be seen in the figure, the program spends by far the most time inside the sparse function, which indicates that this function should be prioritized the most. However, the program spends a fair amount of time in the other two functions as well and thus they cannot be neglected.
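A minimal sketch of how such per-function timing can be obtained with OpenMP's wall-clock timer is given below. The function names mirror those in the report but the empty bodies are placeholders; the real fsparse code is not reproduced here.

#include <stdio.h>
#include <omp.h>

/* Placeholders standing in for the real fsparse routines. */
static void getix(void)         { /* ... */ }
static void sparse_fn(void)     { /* called sparse in the report */ }
static void sparse_insert(void) { /* ... */ }

int main(void)
{
    double t0, t_getix, t_sparse, t_insert;

    t0 = omp_get_wtime(); getix();         t_getix  = omp_get_wtime() - t0;
    t0 = omp_get_wtime(); sparse_fn();     t_sparse = omp_get_wtime() - t0;
    t0 = omp_get_wtime(); sparse_insert(); t_insert = omp_get_wtime() - t0;

    printf("getix %.3e s, sparse %.3e s, sparse_insert %.3e s\n",
           t_getix, t_sparse, t_insert);
    return 0;
}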


Figure 5: The amount of time the three functions getix, sparse and sparse insert need in comparison to each other.

4.4.1 The getix function

From Figure 5 it could be seen that getix is the least time consuming function in fsparse. However, this does not mean it should be neglected, especially since a parallelization of this function is relatively easy, as shown in Section 3.3. Implementing the parallelization and measuring the run time for different numbers of cores reveals how successful the parallelization is. Figure 6 shows how much the time decreases for every added core, and Figure 7 gives a clearer picture of the actual result in the form of speedup.

There are reasons why the parallel code is slower than the sequential code when running with only one thread. For instance, even with a single thread, the thread team has to be checked and created for each of the parallel regions in the code; fork/join overhead is a problem even when only one core is used. For ideal runs, two versions of every function should be made: a sequential version not using OpenMP at all, and a parallelized version. If the user wants to run on only one core the program should then choose the sequential path, while for two or more cores the parallel path should be chosen. In some cases the overhead is so large, and the degree of parallelization so low, that the performance does not start improving until a certain number of cores is used. In those cases the two versions may again be implemented, but with the sequential path chosen for all core counts below the critical number at which performance starts to improve, as sketched below.
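A minimal sketch of such a dispatch, with hypothetical names (work_seq, work_par and work_dispatch are not the names used in fsparse) and a placeholder loop body, could look as follows.

#include <stdio.h>
#include <omp.h>

/* Sequential and parallel variants of the same loop; the body is a
 * placeholder, not the real getix computation. */
static void work_seq(int *ix, int n)
{
    for (int i = 0; i < n; i++) ix[i] = i;
}

static void work_par(int *ix, int n)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++) ix[i] = i;
}

/* Dispatch: below the thread count where the parallel version pays off,
 * fall back to the OpenMP-free sequential path. */
static void work_dispatch(int *ix, int n, int critical_threads)
{
    if (omp_get_max_threads() < critical_threads)
        work_seq(ix, n);
    else
        work_par(ix, n);
}

int main(void)
{
    int ix[1000];
    work_dispatch(ix, 1000, 2);   /* use the parallel path from two threads and up */
    printf("%d\n", ix[999]);
    return 0;
}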

From Figures 6 and 7 it is clearly visible that the parallelization is very successful. The speedup is almost optimal for up to eight cores; after that it is damped. The reason it does not achieve perfect speedup for 16 cores may be that at this point the overhead from the critical section is too large. Recall that running the program on many threads means the threads are idle for longer times at every critical section. Overall, the result for the getix function is very satisfactory and the parallelization is thus considered successful.


Figure 6: The execution time for the getix function when run on different numbers of threads compared to the sequential version.

Figure 7: The speedup achieved for getix compared to the optimal speedup, which is obtainable only if the code is perfectly parallelizable and parallelized.


Figure 8: The amount of time the parts inside the sparse function consume in relation to the whole of the sparse function.

4.4.2 The sparse function

Since Figure 5 revealed the sparse function to be very time consuming and it contains multiple parts, a separate profiling was made. Figure 8 visualizes the profiling of only the sparse function; for this figure the parts were each measured separately and compared to the total function time. Every loop has been designated a number, so in the chart the first loop in part 1 is called part 1,1, the second loop in part 4 is called part 4,2, and so on. Part 3 is, however, an exception due to its complexity and is measured as a whole.

From Figure 8 it is visible that part 3 is the most time consuming part, followed by part 2. It also shows that one of the loops in each of parts 1 and 4 is relatively insignificant, as was explained in Section 3.3. Part 3 is thus both the most important and the most complex part, and a great deal of changes were implemented to parallelize it.

Part 4 has not been tested separately, since in the final program it is used as a merged version inside part 3. Hence, the results from parts 3 and 4 are presented together.

Part 1 In this part it was shown in Section 3.3 that only one of the two available loops was parallelized. However, the loop left sequential is insignificant in relation to the parallelized loop. In Figure 9 the execution time is again shown for different numbers of cores. As before, the parallel version is slower than the sequential version when run on only one core.

In Figure 10 the speedup of part 1 is measured. It can be seen that the speedup is not optimal but still good. The speedup levels off and even decreases for more than eight cores, mainly for two reasons. Firstly, for every additional core the rotational reduce needs more iterations, which increases the overhead. Secondly, one of the loops has not been parallelized, so the faster the parallelized loop becomes, the more significant the sequential loop becomes.

Figure 9: The execution time for sparse part 1 when run on different numbers of threads compared to the sequential version.

Considering these two reasons for the speedup not being optimal, the results are acceptable, especially for up to eight cores. To optimize further, a condition similar to the one explained earlier can be used: if more than eight cores are requested, the number of cores for this particular section can be forced down to eight to achieve better performance.

Part 2 Part 2 is a little different from the previous ones. It starts off with only one loop, but to be able to parallelize it, the loop is divided into two loops, of which only one is parallelizable. Each loop takes approximately the same execution time when run sequentially, while the total execution time stays the same. For this part the same phenomenon as in part 1 is expected: the faster the parallelized loop runs, the more significant the sequential loop becomes. This can be clearly seen in Figures 11 and 12.

If only the parallelized loop were presented, on the other hand, the speedup would be nearly optimal. Being able to parallelize the entire part 2 would therefore increase the performance significantly: as shown in Figure 8, part 2 consumes a great deal of time, and only half of that amount has been successfully parallelized.
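As a generic illustration of this kind of loop fission (not the actual part 2 loop from fsparse), consider a loop that mixes an independent element-wise update with a loop-carried prefix accumulation. After splitting, only the independent half is parallelized with OpenMP while the dependent half stays sequential.

#include <stdio.h>

/* Original fused loop: val[i] depends only on in[i], but prefix[i]
 * depends on prefix[i-1], so the loop as a whole cannot simply be
 * parallelized. */
static void fused(const double *in, double *val, double *prefix, int n)
{
    for (int i = 0; i < n; i++) {
        val[i] = 2.0 * in[i];                          /* independent */
        prefix[i] = (i ? prefix[i-1] : 0.0) + val[i];  /* loop-carried */
    }
}

/* Fissioned version: the independent loop is parallelized, the
 * prefix loop keeps its dependence and stays sequential. */
static void fissioned(const double *in, double *val, double *prefix, int n)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        val[i] = 2.0 * in[i];

    for (int i = 0; i < n; i++)
        prefix[i] = (i ? prefix[i-1] : 0.0) + val[i];
}

int main(void)
{
    double in[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    double v1[8], p1[8], v2[8], p2[8];
    fused(in, v1, p1, 8);
    fissioned(in, v2, p2, 8);
    printf("%g %g\n", p1[7], p2[7]);   /* both print 72 */
    return 0;
}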

Part 3 and 4 In order to parallelize part 3, it has been shown that a significant amount of tweaking and adding of loops was necessary. One example of this is the loop that accumulates jcS, which is done only once in the sequential code but twice in the parallel code, because jcS is turned into a matrix in the parallel version.

With this done, a perfect speedup cannot be expected, and both Figures 13 and 14 show this.

The overhead makes the parallel code run significantly slower on one thread, and there is actually a slowdown when going from one thread to two threads. This is probably because of how the parallel code behaves when run on one thread: jcS is a matrix with as many rows as there are threads, so with one thread it has a single row. Therefore, the first accumulation of jcS over the threads is practically not run, since the innermost loop exits before its body is executed. Also, the first accounting for the wrong irank is not done when the program is run on one thread, because only the threads with id greater than 0 perform it.
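A minimal sketch of this per-thread layout, assuming jcS holds one row of column counters per thread and an int element type (the exact type used in fsparse is not shown in this section), is given below. Note that with numThreads equal to one, the accumulation loop over rows in Listing 21 never executes, which is exactly the behaviour described above.

#include <stdlib.h>

/* Allocate jcS as a numThreads-by-(N+1) matrix: one row of column
 * counters per thread, matching the jcS[row][col] indexing used later
 * in Listing 21. Returns NULL on allocation failure. */
int **alloc_jcS(int numThreads, int N)
{
    int **jcS = malloc((size_t)numThreads * sizeof *jcS);
    if (!jcS) return NULL;
    for (int t = 0; t < numThreads; t++) {
        jcS[t] = calloc((size_t)N + 1, sizeof **jcS);
        if (!jcS[t]) {
            for (int k = 0; k < t; k++) free(jcS[k]);
            free(jcS);
            return NULL;
        }
    }
    return jcS;
}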


Figure 10: The speedup achieved for sparse part 1 compared to the optimal speedup, which is obtainable only if the code is perfectly parallelizable and parallelized.

Figure 11: The execution time for sparse part 2 when run on different numbers of threads compared to the sequential version.


Figure 12: The speedup achieved for sparse part 2 compared to the optimal speedup, which is obtainable only if the code is perfectly parallelizable and parallelized.

When going from one to two threads, two extra loops are added and therefore the execution time increases. However, when reaching four or more threads, the extra processing power more than makes up for the extra overhead added by these loops.

There is another way to accumulate jcS and account for the wrong irank after the nested "calculate irank" loop: accumulate jcS in both directions and then account for the wrong irank only once, as shown in Listing 21. This is equivalent to performing both accountings at the same time. The code works and may even look nicer, but it was found to be slightly slower than the code presented in Listing 16. The reason might be that while the first thread does the first subpart of part 4, which is sequential, the other threads have to wait, whereas in Listing 16 they would work on one of the "accounting for wrong irank" loops while waiting.

Listing 21: sparse part 3; after ”calculate irank”-loop (not used)

#pragma omp barrier
/* accumulating jcS over the processors */
#pragma omp for
for (int i = 0; i < N+1; i++)
    for (int row = 1; row < numThreads; row++)
        jcS[row][i] += jcS[row-1][i];

#pragma omp barrier
/* accumulating jcS over the lowest row */
#pragma omp single
for (int c = 2; c <= N; c++)
    jcS[numThreads-1][c] += jcS[numThreads-1][c-1];

#pragma omp barrier
/* accounting for wrong irank, and former Part 4 */
if (myId != 0)
    for (int i = iStart; i < irS[hiBound-1]; i++)
        irank[rank[i]] += jcS[myId-1][jj[rank[i]]] + jcS[numThreads-1][jj[rank[i]]-1];
else
    for (int i = iStart; i < irS[hiBound-1]; i++)
        irank[rank[i]] += jcS[numThreads-1][jj[rank[i]]-1];
} // end parallel

Figure 13: The execution time for sparse part 3 merged with sparse part 4 when run on different numbers of threads compared to the sequential version.

Figure 14: The speedup achieved for sparse part 3 merged with sparse part 4 compared to the optimal speedup, which is obtainable only if the code is perfectly parallelizable and parallelized.

Figure 15: The execution time for the sparse insert function when run on different numbers of threads compared to the sequential version.

4.4.3 The sparse insert function

For this function a good speedup is not expected for more than two cores, because the code specifies that a maximum of two cores may be used. For two cores, however, a speedup of about two is expected, since the work is evenly divided in two. Figures 15 and 16 show that the expected results are confirmed.
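A minimal sketch of how such a cap can be expressed in OpenMP is shown below; the work split is a placeholder, not the actual sparse insert code.

#include <stdio.h>
#include <omp.h>

/* Placeholder for the real per-half work in sparse_insert. */
static void do_half(int lo, int hi)
{
    printf("thread %d handles [%d, %d)\n", omp_get_thread_num(), lo, hi);
}

/* Run with at most two threads and split the work in two halves. */
static void capped_insert(int n)
{
    #pragma omp parallel num_threads(2)
    {
        int id  = omp_get_thread_num();
        int nth = omp_get_num_threads();   /* 1 or 2 */
        int mid = n / 2;

        if (nth == 1)
            do_half(0, n);                 /* a single thread does everything */
        else if (id == 0)
            do_half(0, mid);
        else
            do_half(mid, n);
    }
}

int main(void)
{
    capped_insert(10);
    return 0;
}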

From Figure 5 it is known that this function takes a considerable amount of time, so not being able to parallelize it in an optimal way is a big setback. A future improvement of this function is recommended and would greatly enhance the performance of the entire program.

4.5 Overall performance analysis

From the individual results presented in Section 4.4 it is hard to tell how much the entire program has been improved. Some of the functions have had a good speedup while others have had a very limited boost in performance. To determine whether the results are enough to accomplish the goals, they are in this section presented as a whole.

In Figures 17 and 18 the Matlab sparse function, the sequential fsparse and the parallel fsparse are compared to each other. Figure 17 compares the times for the three programs when run on different numbers of cores. It is seen that the Matlab sparse function
