Overall performance analysis - PROJECT REPORT: Parallel Assembly of Sparse Matrices using CCS and OpenMP

Figure 15: The execution time of the sparse insert function when run on different numbers of threads, compared to the sequential version.

if (myId != 0)
    for (int i = iStart; i < irS[hiBound-1]; i++)
        irank[rank[i]] += jcS[myId-1][jj[rank[i]]] + jcS[numThreads-1][jj[rank[i]]-1];
else
    for (int i = iStart; i < irS[hiBound-1]; i++)
        irank[rank[i]] += jcS[numThreads-1][jj[rank[i]]-1];
} // end parallel

Figure 16: The speedup achieved for the sparse insert function compared to the ideal speedup, which is obtainable only if the code is perfectly parallelizable and perfectly parallelized.

is the slowest as expected. The fsparse parallel is slower than fsparse sequential. Once again, as explained before, this is because parallelizing a program creates overhead regardless of whether it runs on one core or several. However, the execution time for fsparse parallel decreases rapidly over the first few cores and stabilizes at around eight cores. Sixteen cores seem to give the best performance, but the improvement is small, and in terms of efficiency it is better to use six or eight cores.
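The trade-off between the lowest execution time and the number of cores spent to reach it can be made concrete with the usual definitions of speedup and parallel efficiency. The following is a minimal sketch in C. The function names and the tolerance parameter are illustrative assumptions and are not part of fsparse, and no timings from this report are built in.

```c
#include <assert.h>
#include <stddef.h>

/* Speedup of a parallel run relative to a reference run. */
double speedup(double t_ref, double t_par) {
    assert(t_ref > 0.0 && t_par > 0.0);
    return t_ref / t_par;
}

/* Parallel efficiency: speedup per core, where 1.0 is the ideal value. */
double efficiency(double t_ref, double t_par, int cores) {
    assert(cores >= 1);
    return speedup(t_ref, t_par) / (double)cores;
}

/* Hypothetical helper: given measured times for ascending core counts,
 * return the smallest core count whose time is within a relative
 * tolerance `tol` (for example 0.10 for 10%) of the best time. */
int smallest_adequate_cores(const int *cores, const double *times,
                            size_t n, double tol) {
    assert(n > 0 && tol >= 0.0);
    double best = times[0];
    for (size_t i = 1; i < n; i++)
        if (times[i] < best)
            best = times[i];
    for (size_t i = 0; i < n; i++)
        if (times[i] <= best * (1.0 + tol))
            return cores[i];
    return cores[n - 1]; /* not reached: the best entry always qualifies */
}
```

With such a helper, a run whose time barely improves between eight and sixteen cores would be assigned eight cores, which mirrors the reasoning in the paragraph above.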

Figure 18 shows the total speedup over a wide range of problem sizes. It gives a good indication of how the different programs compare to each other, and not only for the large problem sizes. The figure contains three lines, each of which corresponds to the speedup of one program relative to one of the other programs.

From Figure 18 it is clearly visible that for six cores, our goal of being five times faster than the Matlab sparse function, or twice as fast as fsparse sequential, has been achieved. Although the speedup is not stable, it shows indications of converging to acceptable levels.

To further show how the program behaves in its parallel form, a final profiling is needed. Timing the different functions again while running on six cores shows which functions and parts still take time and are important for further improvements. Figures 19 and 20 show that getix has been reduced to being almost insignificant and should not be prioritized for further improvements. The main cost still lies within the sparse function, while sparse insert takes a fair amount of time as well. Inside the sparse function it is clear that the complex part 3, and also part 2, should be prioritized in further work.

Finally, an extensive run was made in which all three programs, the Matlab sparse function and the two fsparse programs, were tested to see how their dependencies differed. Two variables were changed in this test: the problem size and the number of unique row indices per column index. The test revealed that all the programs have the same dependencies, and that the execution time increases mainly when both the number of unique indices and the problem size are increased. This is first shown for only the fsparse sequential program in Figure 21, whose purpose is to give a better picture of what the test included. In Figure 22 all three programs, the Matlab sparse function, fsparse sequential and fsparse parallel, are compared to each other, and it can be seen that they all have the same dependencies. The test was made to see whether the input data can be manipulated or chosen in some way to speed up the assembly of the matrix.

Figure 17: A comparison of the execution time for the three different programs. The plot shows results for a problem size of N = 3.7 × 10⁸ (where the matrix size is N × N) and between 90 and 100 unique row indices per column.

Figure 18: A comparison of speedup between the three different programs. Each line corresponds to a speedup where two programs are compared. The legend tells which programs have been compared, and the first name in each legend entry is the faster program.

Figure 19: A second profiling of the functions inside fsparse after it has been parallelized and run on six cores. sparse and sparse insert should be prioritized in the future to reach further improvements.

Figure 20: A second profiling of the parts inside the sparse function after it has been parallelized and run on six cores. Part 3 and part 2 are the most time-consuming parts and should be prioritized in the future for greater performance.

Figure 21: An extensive test to find the dependencies of the programs. This figure shows the execution time for fsparse sequential when the problem size and the number of unique indices are varied.

5 Discussion and conclusions

The parallelization of fsparse can overall be considered successful, since the goal was achieved and even slightly exceeded. The greatest achievement lies in being able to parallelize part 3 inside the sparse function. This complex and very time-consuming part seemed impossible to parallelize at first, but after extensive work and troubleshooting it was finally parallelized. Even if its speedup is not optimal, the goal would not have been reached without it.

There is, however, still room for more improvement. Both sparse and sparse insert still take the majority of the execution time, and this is where most attention should fall in the future. There are, however, other ways to improve performance than parallelizing the existing code directly.

Recalling Amdahl's law and comparing the results to Figure 4, it can be said that roughly 60% to 70% of the fsparse program has been parallelized. Since the individual parts also showed a decent speedup in most cases, finding new ways of parallelizing the program in its current form might be hard. Instead, an entirely new algorithm that offers a greater degree of parallelizability may be considered.
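For reference, Amdahl's law gives the speedup on n cores as 1 / ((1 - p) + p / n), where p is the fraction of the execution time that is parallelized. The sketch below, in C, evaluates the law and its inverse. The function names are illustrative assumptions and are not part of fsparse.

```c
#include <assert.h>

/* Amdahl's law: speedup on n cores when a fraction p of the
 * execution time is parallelized. */
double amdahl_speedup(double p, int n) {
    assert(p >= 0.0 && p <= 1.0 && n >= 1);
    return 1.0 / ((1.0 - p) + p / (double)n);
}

/* Inverse of the law: the parallel fraction implied by an
 * observed speedup s on n > 1 cores. */
double amdahl_parallel_fraction(double s, int n) {
    assert(s >= 1.0 && n > 1);
    return (1.0 - 1.0 / s) / (1.0 - 1.0 / (double)n);
}

/* Upper bound on the speedup as the number of cores grows
 * without limit, for a parallel fraction p < 1. */
double amdahl_limit(double p) {
    assert(p >= 0.0 && p < 1.0);
    return 1.0 / (1.0 - p);
}
```

Taking the stated goal of a speedup of two over fsparse sequential on six cores as input, amdahl_parallel_fraction(2.0, 6) gives 0.6, which is consistent with the 60% to 70% estimate. For p = 0.7 the law gives a speedup of about 2.4 on six cores and a limit of about 3.3 for any number of cores, which illustrates why a different algorithm may be needed for substantially larger gains.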

Our suggestion for the current parallelized program is to write a routine that chooses the optimal number of cores for each of the individual functions and parts. For instance, part 3 of sparse should not be run in parallel unless more than four cores are available, and so on. Using a different number of cores for each individual part may give some further improvement and should not be hard to implement given the results in this report.
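One possible shape for such a routine is sketched below in C with OpenMP. It is an assumption-laden illustration and not the fsparse implementation: the part_tuning struct, the function names and the example loop are hypothetical, and the thresholds would have to come from measurements such as those in this report. The OpenMP parts are guarded so that the code also compiles without OpenMP support.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical tuning entry for one function or part. */
typedef struct {
    int min_threads; /* below this many available threads, run serially */
    int max_threads; /* beyond this many threads, no further gain is expected */
} part_tuning;

/* Choose the number of threads for one part, given what is available. */
int choose_threads(part_tuning t, int available) {
    assert(available >= 1 && t.min_threads >= 1 && t.max_threads >= t.min_threads);
    if (available < t.min_threads)
        return 1;
    return available < t.max_threads ? available : t.max_threads;
}

/* Number of threads OpenMP would use by default (1 without OpenMP). */
int available_threads(void) {
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;
#endif
}

/* Example part: a reduction that runs with the chosen thread count. */
long sum_with_threads(const int *x, int n, int nthreads) {
    long s = 0;
    (void)nthreads; /* unused when compiled without OpenMP */
#ifdef _OPENMP
    #pragma omp parallel for num_threads(nthreads) reduction(+:s)
#endif
    for (int i = 0; i < n; i++)
        s += x[i];
    return s;
}
```

A tuning entry of {5, 8} would, for example, express "run serially unless more than four cores are available, and never use more than eight". A caller would then pass choose_threads(entry, available_threads()) to each part. The cap of eight is only a placeholder value here.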

Figure 22: The execution time for all three programs compared to each other when the problem size and the number of unique indices are varied. The three surfaces look the same, and the programs thus have the same dependencies.

References

Gene Amdahl. Validity of the single processor approach to achieving large-scale computing capabilities. AFIPS Conference Proceedings, 30:483–485, 1967.

Matthias Beck and Ross Geoghegan. The art of proof: Basic training for deeper mathematics. New York: Springer, 2010.

I. Duff, R. Grimes, and J. Lewis. Sparse matrix test problems. ACM Trans. Math. Soft., 15:1–14, 1989.

Susan L. Graham, Peter B. Kessler, and Marshall K. McKusick. gprof: a call graph execution profiler. Proceedings of the SIGPLAN ’82 Symposium on Compiler Construction, 17:120–126, 1982.

Loren Shure. Creating sparse finite-element matrices in MATLAB. Loren on the Art of MATLAB (MathWorks), March 2007.

Zeyao Mo, Aiqing Zhang, Xiaolin Cao, Qingkai Liu, Xiaowen Xu, Hengbin An, Wenbing Pei, and Shaoping Zhu. JASMIN: a parallel software infrastructure for scientific computing. Front. Comput. Sci. China, 4(4):480–488, 2010.

Gordon E Moore. Cramming more components onto integrated circuits. Electronics Magazine, page 4, 1965. Retrieved 2006-11-11.

Gordon E Moore. Excerpts from a conversation with Gordon Moore: Moore’s law. Intel Corporation, page 1, 2005. Retrieved 2006-05-02.

Gordon E Moore. 1965 - Moore’s law predicts the future of integrated circuits. Computer History Museum, 2007. Retrieved 2009-03-19.

Nicholas Nethercote and Julian Seward. Valgrind: A framework for heavyweight dynamic binary instrumentation. Proceedings of ACM SIGPLAN 2007 Conference on Programming Language Design and Implementation, 2007.
