4.4 Parallelization
4.4.2 The sparse function
Since Figure 5 revealed this function to be very time consuming, and since it consists of multiple parts, it was profiled separately. Figure 8 visualizes the profiling of the sparse function alone. For this figure, the parts were each measured separately and compared to the total function time. Every loop has been assigned a number: in the chart, the first loop in part 1 is called part 1,1, the second loop in part 4 is called part 4,2, and so on. Part 3 is an exception; due to its complexity it is measured as a whole.
From Figure 8 it is visible that part 3 is the most time-consuming part, followed by part 2. It also shows that one of the loops in each of part 1 and part 4 is relatively insignificant, as was explained in Section 3.3. Part 3 is thus both the most important and the most complex part. To parallelize it, a great many changes were implemented.
Part 4 has not been tested separately because, in the final program, it is merged into part 3. Hence, the results for parts 3 and 4 are presented together.
Part 1 Section 3.3 showed that only one of the two loops in this part was parallelized. However, the loop left sequential is insignificant in relation to the parallelized loop. Figure 9 again shows the execution time for different numbers of cores. As before, the parallel version is slower than the sequential version when run on only one core.
Figure 10 shows the speedup of part 1. The speedup is not optimal but still good. It levels off and even decreases for more than eight cores, mainly for two reasons. Firstly, every additional core adds iterations to the rotational reduce, which increases the overhead. Secondly, one of the loops has not been parallelized, so the faster the parallelized loop becomes, the more significant the sequential loop becomes.
Figure 9: The execution time for sparse part 1 when run on different numbers of threads compared to the sequential version.
Considering these two reasons for the suboptimal speedup, the results are acceptable, especially for up to eight cores. To optimize further, a similar condition as explained in the previous paragraph can be applied: if more than eight cores are available, the number of cores used for this particular section can be limited to eight to achieve better performance.
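Such a limit could be expressed with the OpenMP num_threads clause. The following is a minimal sketch under stated assumptions, not the code used in this work: the names capped_threads, scale_capped and MAX_PART1_THREADS are hypothetical, and the loop body is only a stand-in for the parallelized loop of part 1.

```c
#include <stddef.h>

/* Hypothetical cap: eight is where the measured speedup of part 1 levels off. */
enum { MAX_PART1_THREADS = 8 };

/* Return the number of threads to use for the region, limited by the cap. */
int capped_threads(int requested)
{
    if (requested < 1)
        return 1;
    return requested > MAX_PART1_THREADS ? MAX_PART1_THREADS : requested;
}

/* Stand-in for the parallel loop of part 1: doubles every element of a.
 * The num_threads clause limits the team size for this region only. */
void scale_capped(double *a, size_t n, int requested)
{
    #pragma omp parallel for num_threads(capped_threads(requested))
    for (long i = 0; i < (long)n; i++)
        a[i] *= 2.0;
}
```

Because the clause applies only to the region it is attached to, the other parts of the program can still use all available cores.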
Part 2 Part 2 is a little different from the previous ones. It starts off as a single loop. To make parallelization possible, the loop is divided into two loops, of which only one is parallelizable. Each loop takes approximately the same execution time when run sequentially, while the total execution time stays the same. The same phenomenon as in part 1 is therefore expected: the faster the parallelized loop runs, the more significant the sequential loop becomes. This can be clearly seen in Figures 11 and 12.
If, on the other hand, only the parallelized loop were presented, the speedup would be nearly optimal. However, being able to parallelize all of part 2 would increase the performance significantly. As is shown in Figure 8, part 2 consumes a great deal of time, and only half of that time has been successfully parallelized.
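The limit imposed by the sequential loop can be estimated with Amdahl's law, S(p) = 1 / ((1 - f) + f / p), where f is the parallelizable fraction of the sequential run time and p is the number of cores. The sketch below is only illustrative: the function name amdahl_speedup is hypothetical, and f = 0.5 is an assumption derived from the statement that the two loops take approximately the same time.

```c
/* Amdahl's law: ideal speedup on p cores when a fraction f of the
 * sequential run time is parallelizable (0 <= f <= 1, p >= 1).
 * Overhead is ignored, so measured speedups will be lower. */
double amdahl_speedup(double f, int p)
{
    return 1.0 / ((1.0 - f) + f / (double)p);
}
```

With f = 0.5 the estimate is 1.6 on four cores, and it stays below 2 no matter how many cores are added, which is consistent with the statement that only half of part 2 benefits from parallelization.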
Part 3 and 4 In order to parallelize part 3, it has been shown that a significant amount of tweaking and adding of loops to the code is necessary. One example of this is the loop to accumulate jcS that is only done once in the sequential code, but is done twice in the parallel code. This is because jcS is turned into a matrix in the parallel version.
With these additions, a perfect speedup cannot be expected, and Figures 13 and 14 confirm this.
The overhead makes the parallel code run significantly slower on one thread, and there is actually a slowdown when going from one thread to two threads. This is probably because jcS is effectively not a matrix when the parallel code runs on one thread: jcS has as many rows as there are threads. Therefore, the first accumulation of jcS over the threads is practically not run, since the innermost loop exits before its body is executed. Also, the first accounting for wrong irank is not done when the program runs on one thread, because only the threads with id greater than 0 perform it.
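To make the role of the matrix form of jcS concrete, the following sketch shows one way per-thread column counts can be combined. It is a simplified illustration, not the code from Listing 16: the function name accumulate_over_threads and the flat row-major layout are assumptions, and only the accumulation over the thread rows is shown.

```c
/* jcS is stored row-major with numThreads rows and ncols columns;
 * row t initially holds the per-column counts found by thread t.
 * After the call, row t holds the counts summed over threads 0..t,
 * so the last row holds the totals for every column. */
void accumulate_over_threads(int *jcS, int numThreads, int ncols)
{
    /* Columns are independent of each other, so they can be shared
     * among the threads without synchronization inside the loop. */
    #pragma omp parallel for
    for (int c = 0; c < ncols; c++)
        for (int row = 1; row < numThreads; row++)
            jcS[row * ncols + c] += jcS[(row - 1) * ncols + c];
}
```

With numThreads equal to 1 the body of the inner loop never executes, which matches the observation above that this accumulation is practically skipped on one thread.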
Figure 10: The amount of speed up achieved for sparse part 1 compared to the optimal speed up, which is obtainable only if the code is perfectly parallelizable and parallelized.
Figure 11: The execution time for sparse part 2 when run on different numbers of threads compared to the sequential version.
Figure 12: The amount of speed up achieved for sparse part 2 compared to the optimal speed up, which is obtainable only if the code is perfectly parallelizable and parallelized.
When going from one to two threads, two extra loops are added and the execution time of the program therefore increases. However, with four or more threads, the extra processing power more than makes up for the overhead added by the loops.
There is another way to handle accumulating jcS and accounting for wrong irank after the nested “calculate irank”-loop: accumulate jcS in both directions and then account for wrong irank only once, as shown in Listing 21. This is equivalent to performing both accountings at the same time. This code works, and although it may look cleaner, it was found to be slightly slower than the code presented in Listing 16. A possible reason is that while the first thread executes the first subpart of part 4, which is sequential, the other threads have to wait, whereas in Listing 16 they would work on one of the “accounting for wrong irank” steps while waiting.
Listing 21: sparse part 3; after “calculate irank”-loop (not used)
#pragma omp barrier
/* accumulating jcS over the processors */
#pragma omp for
for (int i = 0; i < N+1; i++)
    for (int row = 1; row < numThreads; row++)
        jcS[row][i] += jcS[row-1][i];
#pragma omp barrier
/* accumulating jcS over the lowest row */
#pragma omp single
for (int c = 2; c <= N; c++)
    jcS[numThreads-1][c] += jcS[numThreads-1][c-1];
#pragma omp barrier
/* accounting for wrong irank, and former Part 4 */
if (myId != 0)
    for (int i = iStart; i < irS[hiBound-1]; i++)
        irank[rank[i]] += jcS[myId-1][jj[rank[i]]] + jcS[numThreads-1][jj[rank[i]]-1];
else
    for (int i = iStart; i < irS[hiBound-1]; i++)
        irank[rank[i]] += jcS[numThreads-1][jj[rank[i]]-1];
} // end parallel
Figure 13: The execution time for sparse part 3 merged with sparse part 4 when run on different numbers of threads compared to the sequential version.
Figure 14: The amount of speed up achieved for sparse part 3 merged with sparse part 4 compared to the optimal speed up, which is obtainable only if the code is perfectly parallelizable and parallelized.
Figure 15: The execution time for the sparse insert function when run on different numbers of threads compared to the sequential version.