
Improving performance of sequential code through automatic parallelization

CLAUDIUS SUNDLÖF

KTH ROYAL INSTITUTE OF TECHNOLOGY

SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE


Improving performance of sequential code through automatic parallelization

CLAUDIUS SUNDLÖF

Master in Computer Science
Date: October 12, 2018
Supervisor: Håkan Lane
Examiner: Elena Troubitsyna

Swedish title: Prestandaförbättring av sekventiell kod genom automatisk parallellisering

School of Electrical Engineering and Computer Science


Abstract

Automatic parallelization is the conversion of sequential code into multithreaded code with little or no supervision. An ideal implementation of automatic parallelization would allow programmers to fully utilize available hardware resources to deliver optimal performance when writing code. Automatic parallelization has been studied for a long time, with one result being that modern compilers support vectorization without any input.

In this study, contemporary parallelizing compilers are studied in order to determine whether or not they can easily be used in modern software development, and how code generated by them compares to manually parallelized code. Five compilers, ICC, Cetus, autoPar, PLUTO, and TC Optimizing Compiler, are included in the study. Benchmarks are used to measure the speedup of parallelized code; these benchmarks are executed on three different sets of hardware. The NAS Parallel Benchmarks (NPB) suite is used for ICC, Cetus, and autoPar, and PolyBench for the previously mentioned compilers in addition to PLUTO and TC Optimizing Compiler. Results show that parallelizing compilers outperform serial code in most cases, with certain coding styles hindering their capability to parallelize code. In the NPB suite, manually parallelized code is outperformed by Cetus and ICC for one benchmark. In the PolyBench suite, PLUTO outperforms the other compilers to a great extent, producing code optimized not only for parallel execution but also for vectorization. Limitations in code generated by Cetus and autoPar prevent them from being used in legacy projects, while PLUTO and TC do not offer fully automated parallelization. ICC was found to offer the most complete automatic parallelization solution, although the speedups it offered were not as great as those offered by other tools.


Sammanfattning

Automatisk parallellisering innebär konvertering av sekventiell kod till multitrådad kod med liten eller ingen tillsyn. En idealisk implementering av automatisk parallellisering skulle låta programmerare utnyttja tillgänglig hårdvara till fullo för att uppnå optimal prestanda när de skriver kod. Automatisk parallellisering har varit ett forskningsområde under en längre tid, och har resulterat i att moderna kompilatorer stöder vektorisering utan någon insats från programmerarens sida. I denna studie studeras samtida parallelliserande kompilatorer för att avgöra huruvida de lätt kan integreras i modern mjukvaruutveckling, samt hur kod som dessa kompilatorer genererar skiljer sig från manuellt parallelliserad kod. Fem kompilatorer, ICC, Cetus, autoPar, PLUTO och TC Optimizing Compiler, inkluderas i studien. Benchmarks används för att mäta speedup av parallelliserad kod. Dessa benchmarks exekveras på tre skilda hårdvaruuppsättningar. NAS Parallel Benchmarks (NPB) används som benchmark för ICC, Cetus och autoPar, och PolyBench för samtliga kompilatorer i studien. Resultat visar att parallelliserande kompilatorer genererar kod som presterar bättre än sekventiell kod i de flesta fallen, samt att vissa kodstilar begränsar deras möjlighet att parallellisera kod. I NPB presterar kod parallelliserad av Cetus och ICC bättre än manuellt parallelliserad kod för en benchmark. I PolyBench presterar PLUTO mycket bättre än de andra kompilatorerna och producerar kod som inte endast är optimerad för parallell exekvering, utan också för vektorisering. Begränsningar i kod genererad av Cetus och autoPar förhindrar användningen av dessa redskap i etablerade projekt, medan PLUTO och TC inte är kapabla till fullt automatisk parallellisering. Det framkom att ICC erbjuder den mest kompletta lösningen för automatisk parallellisering, men möjliga speedups var ej på samma nivå som för de andra kompilatorerna.


Contents

1 Introduction
  1.1 Problem definition
  1.2 Delimitation
  1.3 Intended readers

2 Background
  2.1 Early automatic parallelization
  2.2 Dependency analysis
    2.2.1 Types of data dependencies
    2.2.2 Aliasing
    2.2.3 Polyhedral model
    2.2.4 Banerjee-Wolfe inequalities test
  2.3 Automatic parallelization techniques
    2.3.1 Dependency elimination techniques
    2.3.2 Automatic vectorization
    2.3.3 Run-time tests
  2.4 Code generation
  2.5 Contemporary parallelizing compilers
    2.5.1 PLUTO
    2.5.2 TC optimizing compiler
    2.5.3 Intel C++ Compiler
    2.5.4 Cetus
    2.5.5 Par4All
    2.5.6 ROSE Compiler Framework - autoPar
  2.6 Benchmarks
    2.6.1 NAS Parallel Benchmarks
    2.6.2 The Polyhedral Benchmark Suite
  2.7 Cyclomatic complexity

3 Related work
  3.1 Vectorizing compilers
  3.2 Parallelizing Compilers
    3.2.1 Limitations and potential improvements
    3.2.2 Evaluation of parallelizing compilers

4 Method
  4.1 Choice of compilers
  4.2 Choice of Benchmarks
    4.2.1 NAS Parallel Benchmarks 2.3
    4.2.2 The Polyhedral Benchmark Suite
  4.3 Compiler options
    4.3.1 Cetus
    4.3.2 Intel C++ Compiler
    4.3.3 PLUTO
    4.3.4 TC Optimizing Compiler
  4.4 Execution of benchmarks
    4.4.1 Identification of performance-critical loops and sections
    4.4.2 Classification of benchmarks as complex or non-complex
  4.5 Investigation of parallelized source code

5 Results
  5.1 Compilation of benchmarks
  5.2 Benchmarks
    5.2.1 NPB
    5.2.2 PolyBench
  5.3 Speedups and cyclomatic complexity
  5.4 Average speedups by compiler

6 Discussion
  6.1 Hardware differences and their impact on measurements
  6.2 NPB results
  6.3 PolyBench results
  6.4 Cyclomatic complexity
  6.5 Applicability of automatic parallelization in software development
  6.6 Economics of automatic parallelization
  6.7 Conclusions
  6.8 Sustainability and ethics of automatic parallelization
  6.9 Future work

Bibliography

1 Introduction

Automatic parallelization refers to the conversion of sequential code into multi-threaded code with little or no supervision. Modern electronic devices such as computers and smartphones generally utilize processors with several cores and are thus capable of executing code in parallel.

In the past, new generations of processors improved performance mainly by improving single-core performance, e.g. through higher clock speeds, but since the mid-2000s the focus has shifted towards multicore processors [1]. Figure 1.1 illustrates the change, where the transistor count keeps increasing while the clock speed levels out. Applications need to be parallel in order to make use of the CPU throughput gains that occur over time [2]. Writing parallel code is a time-intensive and complicated task involving multiple steps [3], with much room for error. There is thus a need for compilers and tools that can assist in the parallelization of code. Automatic parallelization tools can be useful when parallelizing applications, as no explicit knowledge of parallelization is necessary to use them. However, in order to make full use of them, users must be aware of the limitations of the tools and have knowledge of parallelism.

Due to limitations in contemporary tools, [4] suggests that automatic parallelization should not be considered when the application is a real-time application or when it is important to achieve perfect parallelization. Benchmarks show that automatic parallelization can at times achieve speedups similar to those of hand-parallelization, while failing to achieve any noteworthy parallelization or speedups in other cases [5]. Recent developments [6] have been able to improve on current methods by further increasing the number of synchronization-free parallel transformations.


Figure 1.1: Intel CPU trends [2].

1.1 Problem definition

The research question of interest that will be explored in the study is: "How can automatic parallelization tools be used to parallelize sequential code?"

In order to explore the question, the following hypothesis will be tested: "Current tools are able to parallelize certain (possibly simple) programs to a large degree, but more complex programs that are harder for developers to parallelize will also be difficult for these tools to parallelize."

In order to answer the question and test the hypothesis, the limitations and use areas of the different tools must be explored. This will be done through experimentation with the tools in order to produce measurement data, and by investigating which code transformations these tools perform. Causes for failed transformations will also be investigated. Programs parallelized using the tools will be classified as complex or not complex, in order to see if the complexity of the program has an effect on the ability of the tools to parallelize the code. The classification will be done using cyclomatic complexity, where a number of 15 or higher indicates that the program is complex.

Benchmarks will be used to collect measurement data in terms of speedup, as well as data on the transformations applied to the source code. The speedup data will help answer an extended question: whether the tools can not only parallelize sequential code, but do so such that performance is increased over the serial version of the application. There is also an interest in knowing whether automatically parallelized code exhibits performance close to that of manually parallelized code.

Benchmark data will be collected on different sets of hardware, to ensure that observed speedup data is not exclusive to a specific configuration.

1.2 Delimitation

The aim of the thesis is to explore currently available tools and whether they can easily be integrated into software development projects, as well as whether they offer performance improvements over the original sequential code. The degree project does not include the implementation of any automatic parallelization, and is strictly confined to the evaluation of current tools, code transformations, and methods. The objective is not to produce any new methods or transformations. Existing benchmarks will be used in the study, and no new ones will be constructed.

The collection of measurements on different sets of hardware will mainly serve to verify that results are consistent. While differences between architectures and generations of hardware are important when evaluating performance gains from parallelization, conducting a deeper study into this area is outside the scope of this degree project.

1.3 Intended readers

The aim of the report is to produce a document that can be used to help motivate why automatic parallelization would or would not be a fit for software projects, both established and new. As automatic parallelization could possibly save development time that would otherwise be spent writing parallel code, being able to make such a decision is valuable for anyone interested in planning a project where parallelization could be useful.

Stakeholders interested in reading the report may thus be involved in the planning of projects, as they might be interested in improving the performance of their programs without the vast effort that normally comes with planning and implementing parallel code. The report may also be of interest to anyone conducting similar research into automatic parallelization, as the limitations highlighted in the report may provide insight into the usage of parallelizing compilers and possible areas of improvement.


2 Background

The purpose of this chapter is to introduce the reader to important concepts within the field of automatic parallelization and important history related to the field, as well as a selection of state-of-the-art compilers supporting automatic parallelization. For the sake of brevity, compilers supporting automatic parallelization will henceforth be referred to as parallelizing compilers.

A short description of cyclomatic complexity is also included.

2.1 Early automatic parallelization

Early research concerning automatic parallelization was done mainly within the context of C and FORTRAN [3], [7]–[9], with the latter being of specific interest due to its history as the dominant programming language in scientific computing and in the context of supercomputers [10], [11], which historically were the only multiprocessor devices. Vector machines were introduced in the 1970s [12], and by the 1980s vendors of these machines provided vectorizing compilers.

Program transformation techniques employed by 1990s-era tools include the parallelization of acyclic code, the parallelization of for loops, and the making of run-time decisions [8]. The techniques depend on the availability of knowledge, in this case gained through dependence analysis, and on the ability to change the dependence structure of the program to increase parallelism. As not all dependency information is available at compile time, certain tests may be inserted into the code to determine which path to take.


2.2 Dependency analysis

When parallelizing a program, it is important to determine what data is accessed by which statements, and how these statements depend on each other. This section will introduce the reader to data dependencies that are of importance, as well as two methods of determining dependencies: the Polyhedral Model (which actually transcends dependency analysis), and the Banerjee-Wolfe inequalities test.

2.2.1 Types of data dependencies

According to [13], four unique kinds of data dependencies exist, of which three are of interest when parallelizing loops. These three are:

2.2.1.1 True dependence

A true dependence exists when a value is written to a variable or an element of an array, and at a later stage read from.

for(i = 1; i < n; ++i)
    a[i] = a[i - 1];

Listing 2.1: A loop where a[i] is modified in iteration i and read in iteration i + 1.

2.2.1.2 Anti dependence

An anti dependence exists when a value is read from a variable or an ele- ment of an array, and at a later stage written to.

for(i = 0; i < n; ++i)
    a[i] = a[i + 1];

Listing 2.2: A loop where a[i] is modified in iteration i and read in iteration i - 1.

2.2.1.3 Output dependence

An output dependence exists when a value is written to a variable or an element of an array, and at a later stage overwritten.


for(i = 0; i < n; ++i) {
    a[i] = i;
    a[i + 1] = 5;
}

Listing 2.3: A loop where a[i + 1] is written in iteration i and overwritten in iteration i + 1.

2.2.2 Aliasing

Aliasing occurs when a single memory location is referenced by different variables, e.g. two pointers pointing to the same address. Having knowledge of which locations in memory are shared is important when determining dependencies, and if different branches in a program cause the aliasing to change, parallelizing compilers must take these different possibilities into account when parallelizing code. Alias analysis is imprecise: the analysis must hold for all paths through the program, which is achieved by merging alias information at control flow joins, causing the resulting analyses to be conservative in nature [13].
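As a minimal illustration (the function and pointer names below are hypothetical, and dst is assumed to have room for n + 1 elements), consider a loop over two pointer parameters that may refer to overlapping memory; without alias information, a parallelizing compiler must assume the worst case:

/* If dst and src refer to overlapping memory (e.g. dst == src), iteration i
 * writes the element that iteration i + 1 reads, creating a loop-carried
 * dependence.  Without alias information the compiler must assume this and
 * keep the loop serial or guard it with a run-time overlap test; declaring
 * the pointers `restrict` is one way for the programmer to rule aliasing out. */
void shift_scale(double *dst, const double *src, double k, int n)
{
    for (int i = 0; i < n; ++i)
        dst[i + 1] = k * src[i];
}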

for (i = 1; i <= n; i++)
    for (j = 1; j <= n; j++)
        if (i <= n + 2 - j)
            S(i,j);

(a) Surrounding control of S.

DS = { (i, j) ∈ Z² | A · (i, j)ᵀ + a ≥ 0 }, where

A = [  1   0 ]       a = [ −1    ]
    [ −1   0 ]           [  n    ]
    [  0   1 ]           [ −1    ]
    [  0  −1 ]           [  n    ]
    [ −1  −1 ]           [ n + 2 ]

(b) Iteration domain of S.

Figure 2.1: Static control and iteration domain [14].


2.2.3 Polyhedral model

In the polyhedral model, Static Control Parts (SCoPs) are the subset of loop nests that can be represented [14]. The iteration domain of these loops can be represented as polyhedra consisting of sets of points in a Zⁿ vector space bounded by a set of inequalities defined by

D = {x | x ∈ Zⁿ, Ax + a ≥ 0}     (1)

where A is a constant matrix, a a constant vector, and x the iteration vector of the SCoP. Figure 2.1 (b) shows the set of inequalities defined by (1) and figure 2.1 (a). The resulting set of inequalities is shown in figure 2.2, where i − 1 ≥ 0 is the lower bound of the outermost loop, and −i − j + n + 2 ≥ 0 shows when the function S will be called. Using the set of inequalities, it is possible to map out array accesses to determine data dependencies and to determine on which iterations in a loop other iterations depend [15].

DS :   i − 1 ≥ 0
      −i + n ≥ 0
       j − 1 ≥ 0
      −j + n ≥ 0
      −i − j + n + 2 ≥ 0

Figure 2.2: Resulting set of inequalities defined by figure 2.1.
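To make the iteration domain concrete, the short program below (an illustrative sketch, not part of any of the studied tools) enumerates the integer points of DS from figure 2.1 for a small n; a polyhedral compiler reasons about this set symbolically rather than by enumeration.

#include <stdio.h>

int main(void)
{
    int n = 4;
    /* The loop bounds give i - 1 >= 0, n - i >= 0, j - 1 >= 0, n - j >= 0,
     * and the guard gives -i - j + n + 2 >= 0, matching figure 2.2. */
    for (int i = 1; i <= n; i++)
        for (int j = 1; j <= n; j++)
            if (i <= n + 2 - j)
                printf("S(%d,%d)\n", i, j);
    return 0;
}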

2.2.4 Banerjee-Wolfe inequalities test

The Banerjee-Wolfe test is a heuristic used to test for dependence between references to memory [13]. The references are represented as linear functions, and together with loop bounds and directions for the variables used in the functions, a Diophantine equation for the linear functions is formed, and an attempt is made to find an upper and a lower bound on its right-hand side. Using the found bounds together with the Diophantine equation, a dependence is assumed to exist if a solution exists within the iteration space.
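The following self-contained sketch (not Cetus's implementation; the function names and the restriction to a single one-dimensional subscript pair are assumptions made for illustration) shows how a GCD test combined with Banerjee-style bounds can disprove a dependence between a write x[a*i + c0] and a read x[b*i + c1] within one loop:

#include <stdio.h>
#include <stdlib.h>

/* A dependence requires an integer solution of a*i - b*j = c1 - c0
 * with lo <= i, j <= hi. */
static long gcd(long x, long y) { return y == 0 ? labs(x) : gcd(y, x % y); }

int may_depend(long a, long c0, long b, long c1, long lo, long hi)
{
    long rhs = c1 - c0;

    /* GCD test: no integer solution exists if gcd(a, b) does not divide rhs. */
    if (rhs % gcd(a, b) != 0)
        return 0;

    /* Banerjee-style bounds of f(i, j) = a*i - b*j over the iteration space. */
    long min = (a > 0 ? a * lo : a * hi) - (b > 0 ? b * hi : b * lo);
    long max = (a > 0 ? a * hi : a * lo) - (b > 0 ? b * lo : b * hi);

    /* If rhs lies outside [min, max], the references cannot overlap. */
    return min <= rhs && rhs <= max;
}

int main(void)
{
    /* for(i = 1; i <= 100; ++i) x[2*i] = x[2*i - 1];  -> no dependence (prints 0) */
    printf("%d\n", may_depend(2, 0, 2, -1, 1, 100));
    /* for(i = 1; i <= 100; ++i) x[i] = x[i - 1];      -> dependence assumed (prints 1) */
    printf("%d\n", may_depend(1, 0, 1, -1, 1, 100));
    return 0;
}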

2.3 Automatic parallelization techniques

Contemporary automatic parallelization techniques include the transformation of code to eliminate and modify dependencies in order to facilitate parallelization and vectorization, using transformations such as scalar expansion and array privatization [13], of which scalar expansion has been measured to have the most impact on the performance of automatically parallelized code [5]. Most loops cannot be directly parallelized due to dependencies; applying transformations to modify dependencies can greatly affect the number of parallelizable loops in a program. In order to ensure that the generated parallel code provides a speedup, parallelizing compilers may optimize for locality and estimate the overhead of parallelizing loops through heuristics.

2.3.1 Dependency elimination techniques

In their studies of parallelizing compilers and the impact of different transformations and features, [5] found that scalar expansion and array privatization, along with array reduction transformations, were the most effective techniques in terms of speedup gains.

2.3.1.1 Scalar Expansion

Scalar expansion is a method of converting scalar data to match the dimensions of an array. If a scalar's value is always assigned within an iteration of a loop before its use, then each iteration can store the scalar in a separate memory location [13]. Since every iteration then reads and writes a different memory location, there are no loop-carried dependencies on the variable.
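A minimal before/after sketch of the transformation, with illustrative names (the expanded array t_exp is assumed to be allocated with one element per iteration):

/* Before: the scalar t is reused by every iteration. */
void kernel(const double *a, const double *b, double *c, int n)
{
    double t;
    for (int i = 0; i < n; ++i) {
        t = a[i] + b[i];        /* t is always written before it is read */
        c[i] = t * t;
    }
}

/* After scalar expansion: each iteration owns t_exp[i], so there is no
 * loop-carried dependence on the scalar and the loop can be parallelized. */
void kernel_expanded(const double *a, const double *b, double *c,
                     double *t_exp, int n)
{
    for (int i = 0; i < n; ++i) {
        t_exp[i] = a[i] + b[i];
        c[i] = t_exp[i] * t_exp[i];
    }
}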

2.3.1.2 Privatization

According to [13], scalar expansion along with array privatization are the most important transformations a compiler can do to eliminate dependencies. Privatization aims to remove dependencies on shared data that is both read and modified inside loops by making local copies for each thread operating on the loop. A key difference between privatization and scalar expansion is that privatization creates only one copy per thread and stores it on the thread stack, while scalar expansion creates copies based on the number of iterations.

var x;

for(i = 0; i < n; ++i) {
S1:     x = b[i] + c[i];
S2:     b[i] = a[i + 1] + x;
}

Listing 2.4: A loop that cannot be run in parallel without applying transformations.


In listing 2.4, the variable x is shared between iterations, and a dependence between S2 and S1 exists such that S2 in iteration i has to be executed immediately after S1 in iteration i. Due to interleaved execution between threads, the loop run in parallel will not always yield the same result. In this example, the code can be transformed in such a way that x is not shared between iterations by making a private copy of the variable, as seen in listing 2.5.

for(i = 0; i < n; ++i) {
    private x';
S1:     x' = b[i] + c[i];
S2:     b[i] = a[i + 1] + x';
}

Listing 2.5: A parallelizable loop with no shared variables.

2.3.1.3 Reduction substitution

Reductions are operations of the form s = s ⊕ expr that reduce the dimensionality of the variables on which they are performed. Reductions are amenable to parallelization if the intermediate variable s to which expr is reduced is not read within the loop [13]. Parallelization is done by performing the operations on multiple threads, storing the results on a per-thread basis, and then merging them after the computation has finished.

Figure 2.3 (a) shows a loop performing the reduction s = s + a[i], and figure 2.3 (b) shows the parallelization of the same loop, containing one loop run in parallel where partial sums are stored in an array, which are then summed up in a separate loop at the end.

for(i = 0; i < n; ++i)
    s = s + a[i];

(a) A parallelizable loop containing a reduction.

int ps[numThreads];

parallel for(i = 0; i < n; ++i)
    ps[threadID] = ps[threadID] + a[i];

for(i = 0; i < numThreads; ++i)
    s = s + ps[i];

(b) The loop transformed to execute in parallel.

Figure 2.3: Reduction substitution of a loop, adapted from [13].


2.3.2 Automatic vectorization

Vectorization of array accesses is the transformation of code that operates on one element at a time into code that works on multiple elements simultaneously. Supported by hardware, vector operations are up to 16 times faster than sequential operations on modern machines, depending on data types [16]. Automatic vectorization has been studied for over 40 years and is a special case of automatic parallelization. In order to vectorize code, compilers need to analyze dependencies and perform transformations to change or remove them.
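As a small illustration (the function and array names are assumptions), a loop of this form can typically be vectorized automatically, and since OpenMP 4.0 the simd directive can also be used to request vectorization explicitly:

void saxpy(int n, float a, const float * restrict x, float * restrict y)
{
    /* restrict tells the compiler that x and y do not alias; the simd
     * directive asks it to execute the loop with vector instructions. */
    #pragma omp simd
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}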

2.3.3 Run-time tests

When transforming sequential code into parallel code, there are decisions that are difficult or impossible to make when the exact values of certain variables are not known [8]. In such cases, compilers may generate different branches of code and insert tests that determine which branch to take.

In figure 2.4 (a), the loop can only be parallelized when k is not zero. If, for example, k's value is determined by data from outside of the application, the compiler will not be able to generate safe parallel code. Figure 2.4 (b) solves this by adding a second branch that is taken when k is zero.

for(i = 0; i < n; ++i)
    a[m + k * i] = b[i];

(a) Possibly parallelizable code.

if(k == 0)
    a[m] = b[n - 1];
else
    parallel for(i = 0; i < n; ++i)
        a[m + k * i] = b[i];

(b) The loop transformed to execute in parallel, with a run-time test on k.

Figure 2.4: Run-time test example, adapted from [8].

2.4 Code generation

The OpenMP API is a specification for parallel programming, providing developers with an interface for developing parallel applications [17]. Programmers specify which parts of the program should run in parallel, and how they should run in parallel, using OpenMP directives that instruct supporting compilers. OpenMP supports SPMD constructs, tasking constructs, device constructs, worksharing constructs, and synchronization constructs. There is also support for sharing, mapping, and privatization of data. OpenMP is supported by commonly used compilers such as GCC and ICC [18]. Parallelizing compilers make wide use of OpenMP when generating parallel code [13].

#pragma omp parallel for
for(i = 0; i < n; ++i)
    a[start + i] = b[i];

(a) Parallelization of a for loop without dependencies.

#pragma omp parallel for reduction(+:s)
for(i = 0; i < n; ++i)
    s = s + a[i];

(b) Parallelization of figure 2.3 (a).

#pragma omp parallel for private(x)
for(i = 0; i < n; ++i) {
    x = b[i] + c[i];
    b[i] = a[i + 1] + x;
}

(c) Parallelization of listing 2.4.

Figure 2.5: OpenMP examples.

2.5 Contemporary parallelizing compilers

The following section contains descriptions of several contemporary parallelizing compilers, of which one is commercial, namely the Intel C++ Compiler.

2.5.1 PLUTO

PLUTO is a modern automatic parallelization tool for C based on the polyhedral model, with an open-source scheduling algorithm [19], and it uses OpenMP directives to achieve parallelization. A speedup of up to 20 times over non-parallelized code has been reported for some benchmarks [20].

PLUTO is not the first tool to use polyhedral transformations, but previous implementations did not scale well and were not practical; PLUTO improved on these methods [21].

PLUTO is open-source in its entirety and available on GitHub [22].
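A sketch of how a loop nest is typically prepared for PLUTO follows; the SCoP pragmas are the markers PLUTO looks for, while the driver invocation shown in the comment (polycc with the --parallel and --tile options used later in this study) is an assumption about the exact command line:

/* Hypothetical input file, e.g. matvec.c; the region between the pragmas
 * is the SCoP that PLUTO transforms.  Assumed invocation:
 *     polycc matvec.c --parallel --tile
 * which emits a tiled, OpenMP-annotated version of the marked loops. */
void matvec(int n, double A[n][n], double x[n], double y[n])
{
#pragma scop
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            y[i] += A[i][j] * x[j];
#pragma endscop
}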


2.5.2 TC optimizing compiler

The TC optimizing source-to-source compiler supports a novel approach to the generation of parallel synchronization-free tiled code using a combination of the Polyhedral and Iteration Space Slicing frameworks [6]. The technique presented is capable of extracting parallelism when other, well-known techniques fail. The TC optimizing compiler has been shown to outperform or match PLUTO in a set of benchmarks. The compiler is open-source [23] and parallelizes code with OpenMP directives.

2.5.3 Intel C++ Compiler

The Intel C++ Compiler contains an automatic parallelization feature that translates serial code into equivalent multithreaded code [24]. The parallelization can be combined with other optimization features such as automatic vectorization. ICC performs automatic parallelization in six steps:

1. Data flow analysis, where the flow of data through the program is computed.

2. Loop classification, where loop candidates for parallelization are determined based on their correctness and efficiency.

3. Dependency analysis for references in loop nests.

4. High-level parallelization, where the dependency graph is analyzed to determine which loops can be run in parallel, and their run-time efficiency is estimated.

5. Data partitioning.

6. Multithreaded code generation through OpenMP.

2.5.4 Cetus

Cetus is an open-source compiler infrastructure for source-to-source transformation for ANSI C [25]. Written to be able to compile large and realistic applications while supporting interprocedural analysis and being easily extensible, Cetus includes a set of optimization and analysis passes of source code that are used to parallelize loops [26]. Cetus uses the Banerjee-Wolfe inequalities to test for data dependency. Cetus generates OpenMP directives to parallelize code. Developers can extend the framework by writing their own passes.


Cetus is developed by researchers at Purdue University who were involved in the development of the earlier FORTRAN compiler Polaris [26].

2.5.5 Par4All

Par4All is an open-source source-to-source compiler supporting C99, Fortran 77, OpenCL, and CUDA [27]. OpenMP is used to parallelize code for multi-core platforms. Par4All uses the PIPS [28] framework for parallelization and optimization of code. PIPS supports a wide range of interprocedural program analyses as well as transformations, and makes use of the polyhedral model to represent programs [29]. Users can specify which transformations and analyses they wish to use when compiling code by using commands from PIPS. This allows users to fine-tune the compilation based on the program to be parallelized.

Development of Par4All was supported by SILKAN, a company participating in research on automatic HPC code generation, distributed real-time synchronization, and technologies for accurate and fast simulation of physical models [30]. Par4All is no longer actively developed [31].

2.5.6 ROSE Compiler Framework - autoPar

ROSE is an open-source source-to-source compiler supporting C, C++, Fortran 77/95/2003, and UPC 1.1 [32]. ROSE contains many tools which users can use to perform transformations or analysis of source code; autoPar is one such tool, which automatically inserts OpenMP pragmas into C/C++ code based on dependence analysis and liveness analysis.

ROSE is developed by a team of researchers from the Center for Applied Scientific Computing at the Lawrence Livermore National Laboratory, an American federal research facility [33].

2.6 Benchmarks

While measuring the performance of parallelizing compilers has been of interest since their early days, no benchmarks explicitly written to test parallelizing compilers have gained traction besides the PolyBench suite [35], which is primarily used in studies dealing with the polyhedral model. A commonly used [5], [36]–[38] test suite for the benchmarking of contemporary parallelizing compilers and their techniques is the NAS Parallel Benchmarks suite [39].


Figure 2.6: Diagram outlining the compilation steps of autoPar [34].

2.6.1 NAS Parallel Benchmarks

The NAS Parallel Benchmarks (NPB) consist of eleven different benchmarks and seven different problem classes as of NPB 3.3.1. The problems are derived from computational fluid dynamics applications. The problem classes S, W, A, B, C, D, and E contain workloads of differing sizes suitable for different scenarios. Implementations of the benchmarks are supplied in three different versions: a serial implementation, an OpenMP implementation, and an MPI¹ implementation. The reference implementations are a mix of C/Fortran, and alternative implementations are readily available, such as [40] implementing NPB 2.3 and [41] implementing NPB 3.3.

¹ MPI (Message Passing Interface) was originally presented in 1993 and is a communication protocol for the programming of parallel computers. MPI is not considered in this thesis and will not be discussed further.

The NPB 2.3 benchmarks included in the study are the eight benchmarks in tables 2.1 and 2.2, with descriptions adapted from the benchmark specification [39]. BT, SP, and LU use the same synthetic system of nonlinear partial differential equations.

Benchmark                               Problem size   Iterations
MultiGrid (MG)                          256³           4
Conjugate Gradient (CG)                 14000          15
Fourier Transform (FT)                  256² × 128     6
Integer Sort (IS)                       2²³            -
Embarrassingly Parallel (EP)            2²⁸            -
Block Tri-diagonal (BT)                 64³            200
Scalar Penta-diagonal (SP)              64³            400
Lower-Upper Gauss-Seidel Solver (LU)    64³            250

Table 2.1: Problem sizes for NPB 2.3 under problem class A, adapted from [39].

Benchmark   Problem or task                                      Method
MG          Solve a discrete Poisson problem                     V-cycle multigrid method
CG          Estimate the largest eigenvalue of a symmetric       Inverse power method
            definite sparse matrix
FT          Solve a 3D partial differential equation             Forward and inverse fast
                                                                 Fourier transformation
IS          Sort N keys uniformly distributed in memory          Parallel bucket sort
EP          Generate pairs of Gaussian random deviates and       Marsaglia polar method
            tabulate the number of pairs in successive
            square annuli
BT          Solve a system of nonlinear partial differential     Beam-Warming method
            equations
SP          Solve a system of nonlinear partial differential     Variant of the Beam-Warming
            equations                                            method transforming the
                                                                 systems into uncoupled
                                                                 diagonal form
LU          Solve a system of nonlinear partial differential     Successive over-relaxation
            equations                                            method, a variant of the
                                                                 Gauss-Seidel method

Table 2.2: Problems and methods used to solve them in NPB 2.3.


2.6.2 The Polyhedral Benchmark Suite

The Polyhedral Benchmark Suite (PolyBench) is a collection of 30 benchmarks that have been used in several previous studies, including [6], [19]. The benchmarks contain static control parts (SCoPs), and the suite was created with the intent of uniformizing the execution of kernels for use in publications [35]. The benchmarks are typically used when dealing with the polyhedral model, and no reference parallel versions of the benchmark programs are provided. Five different datasets are provided with the benchmark: MINI, SMALL, MEDIUM, LARGE, and EXTRALARGE.

2.7 Cyclomatic complexity

The cyclomatic complexity number of a program, function, or module is a metric describing the amount of decision logic contained within it [42]. It is based entirely on the control flow of the software. As a metric, it can be used to infer the number of recommended tests for software, but it can also be used to ensure that software is designed such that it is reliable, testable, and maintainable. Software with a larger cyclomatic complexity number is harder or impossible to cover fully with tests, and getting an overview of the existing paths in the source code is difficult. For this reason cyclomatic complexity is often limited, with the limit proposed by the developer of the metric being 10, and with data supporting 15 as a valid alternative [42].
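For a single function, the number can be computed as the number of decision points (if statements, loop conditions, case labels, and short-circuit operators) plus one; the illustrative function below (not taken from any of the benchmarks) therefore has a cyclomatic complexity of 4:

/* Decision points: the for condition, the if, and the && operator,
 * giving a cyclomatic complexity of 3 + 1 = 4. */
int count_positive_pairs(const int *a, const int *b, int n)
{
    int count = 0;
    for (int i = 0; i < n; ++i) {
        if (a[i] > 0 && b[i] > 0)
            count++;
    }
    return count;
}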


3 Related work

Parallelizing compilers have been benchmarked in the past, by comparing the performance of programs parallelized by a compiler with programs parallelized by an expert programmer, where the programmer outperformed the compiler by a factor of up to 2.5 [3], and by running well-known benchmarks such as the Perfect Benchmarks [9] and the NAS Parallel Benchmarks [5], [36]–[38]. In studies where parallelizing compilers have been tested using benchmarks, the focus has been on measuring the speedup of different subroutines/loops in the code.

While the code written by the expert programmer in [3] outperformed the compiler's code, the programmer spent eight working days parallelizing the code. There is no mention of compilation time, which could be an indicator that it is negligible. In the study, the compiler was able to achieve up to 50% of ideal speedup, compared to the programmer, who at times achieved ideal speedup.

3.1 Vectorizing compilers

Vectorizing compilers have been commercially available since the 1980s, and even though vectorizing compilers for FORTRAN could partially or completely vectorize 96% (combined; any single compiler could vectorize at most 80%) of the loops from the Test Suite for Vectorizing Compilers in 1988 [43], ICC and GCC were shown to be capable of vectorizing at most 71% of the loops in 2011 [16]. ICC and GCC were found unable to perform certain transformations of code that result in higher degrees of vectorization, due to lacking dependency analysis. The number of vectorized loops increased when transformations were applied manually, showing that the number of vectorized loops is limited by the transformations that ICC and GCC implement.

3.2 Parallelizing Compilers

Parallelizing compilers are currently not capable of matching the performance of hand-parallelized code, but there are trade-offs to be made; automatic parallelization can yield significant performance improvements without necessitating algorithmic change [9]. [4] suggests that automatic parallelization should not be used when perfect parallelization is important, but when human capital is limited, automatic parallelization could be useful if performance improvements are desired.

3.2.1 Limitations and potential improvements

The Banerjee-Wolfe test that Cetus uses to test for dependencies has previously been shown to sometimes yield false positives, which result in compilers failing to parallelize sections of code whose reported dependencies cannot be transformed away [44]. The study highlighting the problems with the test also proposed a combination of the Banerjee-Wolfe test and the GCD test that allegedly results in perfect accuracy.

[38] found that Cetus would parallelize loops with small iteration counts in the EP benchmark in such a way that performance degraded and the code was outperformed by the serial implementation. They also identified loops within the benchmark that Cetus would not parallelize at all, even though OpenMP directives could be used to directly parallelize the code without transformations. Results from [5] also show that increased parallel coverage does not necessarily result in better performance. ICC implements a heuristic for deciding when the parallelization of loops will lead to improved performance [5], which could mitigate the observed pattern of higher parallel coverage sometimes resulting in worsened performance. [45] suggests a solution using profiling data to further improve accuracy.

In [36], the authors discovered that several contemporary parallelizing compilers, including PLUTO and Cetus, ran into issues when parallelizing the NPB. The number of problems that PLUTO ran into resulted in it being excluded from the benchmarking, as solving the many issues was not deemed feasible. The study found that ICC and Par4All were able to parallelize the NPB suite without any manual intervention, and that they were able to achieve reasonable speedup for some benchmarks. A need for an environment that highlights problems faced while parallelizing loops was identified, as many of the problems the tools had when parallelizing code could be solved with user intervention. It is not clear from the study which compiler was used to compile the output from Cetus or Par4All, which is an issue considering that there are demonstrable performance differences between different compilers [46]. The speedup measurements from the study are thus hard to compare to results from other studies.

[6] compared the performance of PLUTO and the TC Optimizing Compiler using the PolyBench suite, showing that the TC compiler outperformed PLUTO in some benchmarks and highlighting problems with techniques depending solely on the polyhedral framework. The article demonstrated that well-known techniques were not able to guarantee the generation of synchronization-free code on the tile level, even when synchronization-free parallelism existed. The authors plan to further improve the performance of the TC compiler by implementing certain transformations that increase the locality of the generated code.

3.2.2 Evaluation of parallelizing compilers

Traditionally [9], [36]–[38], parallelizing and vectorizing compilers have been evaluated by running various benchmarks and typically comparing the results with the results for other compilers. These results are valuable when comparing the overall success of the compilers, but they do not necessarily show the impact of different techniques on the final result, nor their impact on each other. Having knowledge of the performance gains of individual techniques allows developers to make informed choices in terms of cost versus performance gains during the design phase of a compiler [5].

Recently, there have been attempts to develop evaluation tools for parallelizing compilers that provide more information than simply running benchmarks does. One such tool is PETRA [5]. The authors of the paper on PETRA make use of the Performance Drop Ratio metric, which measures the performance reduction when one optimization technique of a compiler is turned off while the others remain the same. The tool supports the specification of optimization techniques to be tested and the performance measurement of all their combinations. For the evaluation of compilers, the tool is used to collect the speedup of benchmarks and the number of parallelized/vectorized loops. For the compilers measured with the tool, scalar expansion and array privatization, along with reduction substitution, were shown to have the greatest impact on performance, which is in line with previous studies [8]. The main compiler evaluated in the study, Cetus, was shown to achieve 73% of the hand-parallel performance on average over NPB, although it was not able to successfully parallelize all benchmarks.


4 Method

In order to evaluate contemporary parallelizing compilers and the transformations performed by them, a selection of production and research compilers employing a wide spread of techniques had to be made. These compilers were evaluated by running well-known benchmarks on the code generated by them, and comparing the resulting runtimes and code with hand-parallelized versions of the same benchmark in the case of NPB, or with the other compilers in the case of PolyBench. In order to collect data on how the generated code performs under different circumstances, the benchmarks were run on several sets of hardware. The programming language used as the base for the study was C.

4.1 Choice of compilers

The parallelizing compilers evaluated in this study were chosen based on their prevalence in previous studies, whether or not they are actively being worked on, and whether source code and documentation are readily available. Compilers also had to support the C programming language, as the benchmarks are implemented in C. ICC v18.0.3, Cetus v1.4.4, ROSE v0.9.9, PLUTO v0.11.4, and TC Optimizing Compiler v0.2.26 were chosen based on these criteria; ICC is the only one of these that is neither open-source nor a source-to-source compiler. Documentation exists for all of these compilers, but is most complete for ICC and Cetus.

Par4All showed promising results in previous studies, but is no longer maintained and binaries are no longer distributed. The public GitHub repository for the project is not complete and will not build on its own, making it impossible to include in the study.


ICC was chosen as the C compiler for the study, partially due to its demonstrated performance advantage over GCC [46], but also since ICC's parallelization capabilities were to be included in the study; using the same compiler throughout all benchmark variations should result in more comparable and consistent measurements.

4.2 Choice of Benchmarks

Benchmarks were primarily chosen based on their prevalence in previous studies: the NAS Parallel Benchmarks have been used extensively to benchmark parallel supercomputers in the past and have seen use in the benchmarking of parallelizing compilers. The Polyhedral Benchmark Suite (PolyBench), on the other hand, was designed specifically for the benchmarking of polyhedral compilers and has seen wide use in studies dealing with them. PolyBench was deemed suitable to include in the study as two polyhedral compilers, PLUTO and TC Optimizing Compiler, were evaluated.

4.2.1 NAS Parallel Benchmarks 2.3

The NAS Parallel Benchmarks have been used in many previous studies of parallelizing compilers, and several studies have made use of a specific implementation of NPB 2.3 [40]. The implementation contains no serial versions of the benchmarks but consists only of OpenMP versions. The serial version was obtained by compiling NPB without the -fopenmp flag. The hand-parallelized benchmarks are used as reference points for the performance of the parallelizing compilers.

Problem class A was chosen as the primary set of data to be used in the study, as testing showed that executing all versions of the benchmark under problem class A took a reasonable amount of time (roughly 30 minutes for one pass over all 7 versions). Problem class B takes roughly four times as long to execute, and it was not deemed reasonable for one pass to take around 120 minutes, given the desire both to run all benchmarks multiple times and to run PolyBench in addition to NPB. Results collected from running Cetus were consistent across the different problem classes in [5].

When parallelizing the code, Cetus will ignore existing OpenMP directives unless instructed not to, while ICC and ROSE have no option to do so. In order to ensure that the programs generated by ICC and ROSE were not impacted by existing OpenMP directives, the directives were stripped from the source files before applying automatic parallelization. PLUTO and TC Optimizing Compiler will not consider sections of code not marked with SCoP pragmas, and were thus not benchmarked with NPB.

The source files generated by Cetus and ROSE were compiled using ICC with the -O3 flag, as has been done in previous studies.

NPB outputs files documenting execution time as well as verification of the results of the computation done, making it possible to measure not only the performance of the code but also its correctness.

4.2.2 The Polyhedral Benchmark Suite

A complete implementation of PolyBench in C is available on the PolyBench website [35]. Version 4.2.1 of the benchmark with the LARGE dataset was used. No modification of source files was necessary to ensure a level playing field, in contrast to NPB 2.3, where files had to be stripped of existing OpenMP directives. Tools for timing the benchmark are provided with the source code. The default optimization level for the compilation of PolyBench is -O2, and it was not changed.

Correctness of parallelization of PolyBench was verified by compiling benchmarks with -DPOLYBENCH_DUMP_ARRAYS and comparing output from the generated files with the output from the reference file.

4.3 Compiler options

Several of the parallelizing compilers have an extensive set of parallelization techniques that can be turned on or off at the compilation stage. Enabling or disabling techniques can have a great impact on performance and on the resulting transformed code. The options used are documented here for the compilers that support them. ROSE's autoPar contains no options that impact the resulting source code besides "no_aliasing", which instructs the compiler to assume that no aliasing exists; it was not used.

4.3.1 Cetus

For NPB, Cetus was run with three sets of compiler options: the default options, experimental options, and benchmark-specific options found to perform well in [5]. The options are outlined in table 4.1. The ddt option selects the data dependence test used, where 1 corresponds to the Banerjee test and 2 to the Range test. The range option determines the level of symbolic analysis used, where 1 is local analysis only and 2 is inter-procedural analysis. A value of 1 for privatize indicates scalar privatization only, while 2 enables scalar and array privatization. An alias level of 2 assumes no aliasing, 1 enables advanced alias analysis, and 0 enables conservative analysis. tinline was set to 0 for the experimental and individualized options, enabling inlining. [5] used selective inlining for some of the benchmarks, yielding a significant performance gain for BT and SP; unfortunately, the functions selected for inlining are not mentioned in the study, and inline level 0 is instead used for all benchmarks.

Options set / benchmark   induction  privatize  reduction  ddt  alias  range  tinline
Default options           3          2          2          2    1      1      N/A
Experimental options      3          2          2          2    2      2      0
BT                        0          2          2          2    2      2      0
CG                        0          1          2          1    1      1      0
EP                        0          2          2          1    1      1      0
FT                        0          0          0          1    0      1      0
IS                        0          0          1          1    0      2      0
MG                        0          1          0          1    0      0      0
SP                        0          2          2          2    2      2      0
LU                        0          2          0          1    1      0      0

Table 4.1: Cetus compiler options and values used for NPB.

4.3.2 Intel C++ Compiler

When parallelizing code with ICC, the following flags were used for both NPB and PolyBench:

• -parallel, enables the generation of multi-threaded code.

• -unroll, enables the unrolling of loops. The compiler makes use of heuristics to decide how many times to unroll.


• -qopt-prefetch, enables the insertion of hints to the processor of when data should be loaded into cache to avoid cache misses.

• -scalar-rep, enables scalar replacement optimization as part of loop transformations. Replaces array references with register references.

• -align, enables the alignment of data in memory to facilitate vectorization.

The options besides -parallel were chosen due to their positive impact on the performance of the generated multi-threaded code, as measured in [5].

For the final compilation of all benchmarks, -static-intel was also used to statically link Intel-provided libraries to the resulting executables.

4.3.3 PLUTO

The options used with PLUTO when compiling PolyBench were:

• --parallel, enabling the generation of parallel code.

• --tile, enabling optimization for locality.

4.3.4 TC Optimizing Compiler

When compiling PolyBench with TC Optimizing Compiler, the options used were:

• --omp-for-codegen, enabling the generation of parallel code.

• --correction-tiling, tiling with correction so that all dependencies of the original loop nests are preserved.

• --sfs-multiple-scheduling, enabling tiling of synchronization-free slices with multiple sources.

This approach to automatic parallelization, i.e. generating synchronization-free code, was presented in [6] and differentiates TC Optimizing Compiler from PLUTO.


4.4 Execution of benchmarks

The benchmarks were run on three different machines: a desktop computer, a laptop, and a server. The machines were chosen due to their immediate availability. Other than system-critical processes, no other processes were running on the machines while the benchmarks were executed.

Machine 1, Intel i7-4790k (8 threads) clocked at 4.47 GHz, 16GB DDR3 RAM clocked at 1600 MHz, running Ubuntu 16.04.

Machine 2, Intel i5-6200U (4 threads) clocked at 2.3 GHz, 8GB DDR3 RAM clocked at 1600 MHz, running Ubuntu 14.04.

Machine 3, Intel Xeon E5-2603 v2 (4 threads) clocked at 1.8 GHz, 64GB DDR3 RAM clocked at 1600 MHz, running Ubuntu 16.04.

The NPB set of benchmarks was executed 10 times in succession on each machine, with the average execution time being used when measuring speedup. PolyBench contains a utility for timing the benchmarks that executes them 5 times and returns the average execution time of the 3 median runs. This average was used when measuring the speedup of PolyBench.
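Throughout the results, speedup is assumed to follow the usual definition relative to the serial version of the same benchmark on the same machine, so that values above 1 indicate that the parallelized version is faster:

speedup = average serial execution time / average parallel execution time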

4.4.1 Identification of performance-critical loops and sections

In order to identify which parts of the original code were critical for the execution time of the programs, ICC's profiling option -profile-loops=all was used on serial versions of the benchmarks. Intel's Loop Profile Viewer tool was then used to interpret the resulting output data.

4.4.2 Classification of benchmarks as complex or non-complex

The cyclomatic complexity of the serial code was calculated using lizard [47]; benchmarks were considered complex if functions critical to the execution of the benchmarks had a cyclomatic complexity number over 15. Verification and initialization functions were considered non-essential.

4.5 Investigation of parallelized source code

Parallelized source code for NPB was compared to the manually parallelized version. Differences between the techniques used were identified, and problems inhibiting the parallelization of benchmarks were identified and outlined.

The different versions of PolyBench were compared to each other. No manually parallelized version exists, and so specific focus was given to benchmarks of particular interest: those reporting the maximum and minimum speedups, as well as negative speedups across all versions.

ICC vectorization reports and Cetus parallelization reports were used to provide further insight into encountered problems for both benchmark suites.


5 Results

Results for the study include observations from compiling the benchmarks and generating code with the chosen parallelizing compilers, analysis of the resulting code and identification of pitfalls, measurement data collected when running the benchmarks, and observations related to the cyclomatic complexity of the benchmarks.

5.1 Compilation of benchmarks

When parallelizing NPB 2.3, autoPar was unable to parallelize LU, generating a file containing syntax errors. ICC and Cetus encountered no problems using the options outlined in the previous chapter.

PLUTO and TC Optimizing Compiler encountered several problems when parallelizing PolyBench, despite the fact that the benchmark suite was specifically designed for polyhedral compilers. PLUTO was unable to parallelize adi without modification of the source file (the modification is visible in figure 5.1), and parallelization of deriche failed for similar reasons, while TC Optimizing Compiler was able to parallelize only six benchmarks such that they would compile without errors: bicg, durbin, symm, syr2k, syrk, and trisolv. This is not in line with [6], where the authors studied the parallelization of 2mm, bicg, gemm, gesummv, mvt, syr2k, and trmm. PolyBench v4.1 and TC v0.2.24 were used in that study, possibly explaining the difference in results. The output from the TC compiler consists of only the parallelized code in the SCoP; this code was manually inserted into the original programs.

#pragma scop
DX = SCALAR_VAL(1.0)/_PB_N;
DY = SCALAR_VAL(1.0)/_PB_N;
DT = SCALAR_VAL(1.0)/_PB_TSTEPS;
B1 = SCALAR_VAL(2.0);
B2 = SCALAR_VAL(1.0);
mul1 = B1 * DT / (DX * DX);
mul2 = B2 * DT / (DY * DY);
a = -mul1 / SCALAR_VAL(2.0);
b = SCALAR_VAL(1.0)+mul1;
c = a;
d = -mul2 / SCALAR_VAL(2.0);
e = SCALAR_VAL(1.0)+mul2;
f = d;

(a) Assignments within SCoP causing errors for PLUTO.

DX = SCALAR_VAL(1.0)/_PB_N;
DY = SCALAR_VAL(1.0)/_PB_N;
DT = SCALAR_VAL(1.0)/_PB_TSTEPS;
B1 = SCALAR_VAL(2.0);
B2 = SCALAR_VAL(1.0);
mul1 = B1 * DT / (DX * DX);
mul2 = B2 * DT / (DY * DY);
a = -mul1 / SCALAR_VAL(2.0);
b = SCALAR_VAL(1.0)+mul1;
c = a;
d = -mul2 / SCALAR_VAL(2.0);
e = SCALAR_VAL(1.0)+mul2;
f = d;
#pragma scop

(b) Assignments moved out of SCoP.

Figure 5.1: Modification of adi.c for PLUTO.

Cetus and autoPar generated files stripped of the code for timing when parallelizing PolyBench (figure 5.2), thereby changing the functionality of the original code. In order to facilitate timing, timers were manually restored to the files. Code for dumping arrays was also modified so that it no longer worked identically to the original code, and code for printing was thus also restored.

Similar behavior was observed when parallelizing NPB, where Cetus and autoPar incorrectly identified certain variables as constants, e.g. a thread counter storing the number of threads used by OpenMP, retrieved from the function omp_get_num_threads. This function call was replaced with a 1, resulting in benchmarks incorrectly reporting that only one thread was utilized.

/* Start timer. */

polybench_start_instruments;

/* Run kernel. */

...

/* Stop and print timer. */

polybench_stop_instruments;

polybench_print_instruments;

(a) Original code for timing.

/* Start timer. */

;

/* Run kernel. */

...

/* Stop and print timer. */

;

;

(b) Cetus and autoPar output.

Figure 5.2: Timing code before and after parallelization of PolyBench.


5.2 Benchmarks

The speedup measurements (relative to the serial versions) from executing the benchmarks are presented here, grouped by benchmark and by the machine on which they were collected. Causes for performance differences between compiled versions of the benchmarks will also be investigated.

Figure 5.3: Speedup measurements for NPB on Machine 1.


Figure 5.4: Speedup measurements for NPB on Machine 2.

5.2.1 NPB

Problem definitions and problem size specifics for the benchmarks can be found in tables 2.1 and 2.2.

The benchmarks show significant differences in speedup between machines. A higher thread count does not necessarily result in a higher speedup across the board. Results are fairly consistent between machines as to whether speedups are positive or negative for specific benchmark versions. No speedup was measured for any version of IS; the problem class B and C handparallel/serial versions of IS failed with segmentation faults when executed, which suggests that the distributed version of IS may not work as intended when compiled on the setup used in the study, and it was thus excluded from the results. autoPar was not able to parallelize LU, as mentioned previously, and data for that version is thus absent from the measurements. Output was verified to be within error tolerance for all benchmark versions, where the error tolerance for each benchmark is defined in the official specification.

Figure 5.5: Speedup measurements for NPB on Machine 3.


5.2.1.1 Parallelization techniques applied and their impact on performance

In order to explain the observed speedups, the original benchmark code was inspected and compared to the code generated by the parallelizing compilers. Characteristics of these benchmarks that impact parallelization were identified, and the handparallel version was considered the baseline for parallelization techniques to look out for. For this section, Cetus using default compiler options will be referred to as Cetus 1, Cetus using experimental options as Cetus 2, and Cetus with individualized options as Cetus 3. ICC will not be discussed, as the resulting code after its transformations is not available.

Parallelization of BT: The bulk of the execution time of BT is spent inside various functions, and in many cases loops inside these functions make calls to other functions, which means that if performance gains are to be substantial, parallelizing compilers need to consider parallelizing loops that contain function calls. The handparallel version mostly parallelizes outer loops, but in some cases inner loops are parallelized instead, indicating that parallelizing compilers may need to consider both inner and outer loops when parallelizing.
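As a hedged sketch of the pattern just described (the grid size, array, and helper update_cell are placeholders, not BT code), parallelizing the outer loop even though its body contains a function call might look as follows:

    #include <stdio.h>

    #define N 64

    /* Hypothetical stand-in for the kind of routine BT calls from its loops. */
    static void update_cell(double grid[N][N], int i, int j)
    {
        grid[i][j] *= 0.5;
    }

    static double grid[N][N];

    int main(void)
    {
        int i, j;
        /* Parallelizing the outer loop keeps per-iteration work large even
           though the loop body contains a function call; a compiler that
           refuses to parallelize loops containing calls misses this case. */
        #pragma omp parallel for private(j)
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++)
                update_cell(grid, i, j);

        printf("%f\n", grid[0][0]);
        return 0;
    }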

autoPar fails to parallelize any loop in BT that makes a function call, and in many cases parallelizes trivial loops with questionable impact on performance. The indentation of the source file is changed, but autoPar performs no transformations of the code besides normalization of loops. The loops that autoPar does parallelize are parallelized to an unnecessarily nested level. The handparallel version of listing 5.1 parallelizes only the outermost loop, without any privatization. The parallelization done by autoPar results in a degradation of performance, most likely due to the overzealous parallelization, which has previously been shown to not always result in improved performance [9].

#pragma omp parallel for private (i,j,k,m)
for (j = 1; j <= grid_points[1] - 1 - 1; j += 1)
    #pragma omp parallel for private (i,k,m)
    for (k = 1; k <= grid_points[2] - 1 - 1; k += 1)
        #pragma omp parallel for private (i,m)
        for (m = 0; m <= 4; m += 1)
            #pragma omp parallel for private (i) firstprivate (dt)
            for (i = 1; i <= grid_points[0] - 1 - 1; i += 1)
                rhs[i][j][k][m] = rhs[i][j][k][m] * dt;

Listing 5.1: Loops parallelized by autoPar in BT.
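For comparison, the handparallel strategy described above applies a single directive to the outermost loop. The following is a sketch derived from listing 5.1 and the description above, not a copy of the NPB sources; the loop indices are declared locally so that no privatization clauses are required.

    /* Sketch of the handparallel form: one directive on the outermost loop,
       no privatization. Uses the same identifiers as listing 5.1. */
    #pragma omp parallel for
    for (int j = 1; j <= grid_points[1] - 1 - 1; j += 1)
        for (int k = 1; k <= grid_points[2] - 1 - 1; k += 1)
            for (int m = 0; m <= 4; m += 1)
                for (int i = 1; i <= grid_points[0] - 1 - 1; i += 1)
                    rhs[i][j][k][m] = rhs[i][j][k][m] * dt;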


Cetus 1 identifies several parallelizable trivial loops in BT, but decides not to parallelize them due to low profitability. No loops containing function calls are recognized as parallelizable, and the only parallelized loop outside of the initialization code is the innermost loop from listing 5.1, visible in listing 5.2. The condition is always false for the chosen problem class A, as grid_points[0] is 64, and the condition thus evaluates to 10000 < 187. The highest possible value for grid_points[0] is 1020, using problem class E, which means that the loop will never run in parallel regardless of the problem class chosen (among currently existing ones). Parallelization of BT using Cetus 1 should thus result in no performance gain regardless of problem size, but no performance loss should occur either.

#pragma omp parallel for if((10000<(-5L+(3L*grid_points[0L]))))
for (i=1; i<(grid_points[0]-1); i ++ )
    rhs[i][j][k][m]=(rhs[i][j][k][m]*dt);

Listing 5.2: Loop parallelized by Cetus 1 in BT.

Cetus 2 inlines and parallelizes the functions used to initialize the benchmark; these transformations do not impact the measured performance of the benchmark, however, as timing starts after initialization has finished. Performance-critical functions are inlined and parallelized, and function calls within these inlined functions are inlined as well. As with Cetus 1, many parallelizations are guarded by if-conditions; however, unlike Cetus 1, these conditions are overall very lax, being fulfilled even by problem class S. The outputs of Cetus 2 and Cetus 3 are identical, as induction variables play no role in the benchmark.

Parallelization of CG: 95% of the execution time in CG is spent inside one function. This function consists of many small loops, and in the handparallel version makes use of reduction, privatization, and synchronization-free parallelism.
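As an illustration of these techniques, the following hedged sketch (array names and sizes are placeholders, not the CG sources) combines a reduction over a dot-product-style loop with an independent loop whose iterations need no synchronization of their own:

    #include <stdio.h>

    #define N 1000

    static double p[N], q[N], r[N], z[N];

    int main(void)
    {
        double rho = 0.0;
        int j;

        #pragma omp parallel private(j)
        {
            /* Reduction: each thread accumulates a private copy of rho that
               is combined when the loop finishes. */
            #pragma omp for reduction(+:rho)
            for (j = 0; j < N; j++)
                rho = rho + r[j] * z[j];

            /* Iterations here are fully independent; nowait drops the
               implicit barrier, which is safe because nothing later in the
               parallel region depends on q. */
            #pragma omp for nowait
            for (j = 0; j < N; j++)
                q[j] = 0.5 * p[j];
        }

        printf("%f %f\n", rho, q[0]);
        return 0;
    }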

autoPar does not use any synchronization-free parallelism when parallelizing CG; reduction and privatization are both used, although privatization is done even in cases where it is unnecessary. autoPar also parallelizes a nested loop at one point (listing 5.3), necessitating a reduction of the modified variable where parallelization of the outermost loop would have allowed the variable to be private instead. This specific miss causes a lot of extra synchronization between threads and is most likely the reason for autoPar's performance degradation on this benchmark. The time spent inside this nested loop accounts for 80% of the execution time, and the outer loop is executed 400 times, with each execution of the outer loop resulting in 14000 executions of the inner loop, and thus 5.6 million unnecessary reductions taking place.

//#pragma omp for private(sum,k)
for (j = 1; j <= lastrow - firstrow + 1; j += 1) {
    sum = 0.0;
    #pragma omp parallel for private (k) reduction (+:sum)
    for (k = rowstr[j]; k <= rowstr[j + 1] - 1; k += 1) {
        sum = sum + a[k] * p[colidx[k]];
    }
    w[j] = sum;
}

Listing 5.3: Loop parallelized by autoPar in CG with handparallel OpenMP directive commented out.
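The commented-out directive in listing 5.3 indicates the handparallel arrangement: the outer loop is parallelized inside an enclosing parallel region, sum is privatized, and no reduction is needed. A sketch of that form, using the identifiers from listing 5.3:

    /* Assumed to appear inside an enclosing #pragma omp parallel region. */
    #pragma omp for private(sum,k)
    for (j = 1; j <= lastrow - firstrow + 1; j += 1) {
        sum = 0.0;
        for (k = rowstr[j]; k <= rowstr[j + 1] - 1; k += 1) {
            sum = sum + a[k] * p[colidx[k]];
        }
        w[j] = sum;
    }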

Neither Cetus 1, 2, nor 3 makes use of synchronization-free parallelism when parallelizing CG. Cetus 1 does not make use of any privatization when parallelizing the benchmark. If-condition guards are used for minor loops where the handparallel version used synchronization-free parallelism; the conditions used will, however, always evaluate to true under problem class W and above. Cetus 1 repeats the mistake of autoPar, but puts the parallelization behind the if-condition 10000 < 1L+(3L*rowstr[(1L+j)])-3L*rowstr[j], which is never true for problem class A. rowstr is filled with increasing values starting at 1 and ending at 1852961, and rowstr[j + 1] - rowstr[j] is never larger than 500, which means that the condition can never be fulfilled under problem class A. The performance degradation observed on Machine 2 could be related to the parallelism of smaller loops introduced into the code, but the number of condition checks could also have an impact (e.g. the 5.6 million condition checks within the loop in listing 5.3).

Cetus 2 inlines functions used when initializing the benchmark, but does not inline the main function. Reduction and privatization are used to great effect, mimicking the parallelization done in the handparallel version for the performance-critical sections.

Cetus 3 performs the same inlining of initialization functions as Cetus 2, while all performance-critical code is identical to Cetus 1. As timing only starts after initialization, the execution times of Cetus 3 and Cetus 1 would be expected to be identical. That is, however, not the case; the exact reasons are unknown.

Parallelization of EP: Close to 100% of the execution time in EP is spent inside a loop that in the handparallel version is parallelized using reduction and static scheduling. Within the loop, 41% of the time is spent generating random numbers using a function that is neither vectorizable nor parallelizable. The remaining time is spent within a nested loop that is not parallelized further.
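A hedged sketch of that parallelization pattern with placeholder names (the simple generator below stands in for EP's sequential random-number routine; this is not the EP source): the loop is statically scheduled and the partial sums are combined with a reduction.

    #include <stdio.h>

    #define N 1024

    /* Placeholder for a sequential random-number routine. */
    static double next_random(unsigned int *seed)
    {
        *seed = *seed * 1664525u + 1013904223u;   /* simple LCG, illustration only */
        return (double) *seed / 4294967296.0;
    }

    int main(void)
    {
        double sx = 0.0, sy = 0.0;
        int k;

        /* Reduction combines per-thread partial sums; static scheduling gives
           each thread a fixed, equally sized block of iterations. */
        #pragma omp parallel for schedule(static) reduction(+:sx,sy)
        for (k = 0; k < N; k++) {
            unsigned int seed = (unsigned int) k;
            sx += next_random(&seed);
            sy += next_random(&seed);
        }

        printf("%f %f\n", sx, sy);
        return 0;
    }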

Figure 5.6: Diagram outlining major loop in EP. Cetus 2 parallelizes the nested loop even though the outer one is parallelizable, resulting in the idling of threads for a substantial portion of execution time.


autoPar fails to parallelize the performance-critical loop within the benchmark. Several minor loops that constitute less than 0.01% of execution time are parallelized, possibly explaining the performance degradation observed.

Cetus 1 identifies possible reductions in the main loop of EP, but does not parallelize the loops. The only parallelized loop is one run in the initialization stage. The output of Cetus 3 is identical to that of Cetus 1.

Cetus 2 parallelizes the nested loop using privatization and reduction, both through OpenMP constructs and by using an array to store values and performing an array reduction in a critical section. This parallelization is sub-optimal, as the single-threaded random number function will keep most threads idle between executions of the parallelized nested loop, as illustrated in figure 5.6.
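A hedged sketch of the array-reduction-in-a-critical-section pattern described above (names and sizes are placeholders, not Cetus output): each thread accumulates into a private array and merges it into the shared array inside a critical section.

    #include <stdio.h>
    #include <string.h>

    #define NQ 10

    static double q[NQ];                  /* shared result array */

    int main(void)
    {
        int k;

        #pragma omp parallel
        {
            double q_private[NQ];
            memset(q_private, 0, sizeof(q_private));

            /* Each thread accumulates into its own private copy. */
            #pragma omp for
            for (k = 0; k < 1000; k++)
                q_private[k % NQ] += 1.0;

            /* Merge the private copies one thread at a time. */
            #pragma omp critical
            for (int i = 0; i < NQ; i++)
                q[i] += q_private[i];
        }

        printf("%f\n", q[0]);
        return 0;
    }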

Parallelization of FT: 72% of the execution time in FT is spent inside three functions containing loops that run in parallel. The functions are called from another function that is run using a parallel construct, making the arrays declared inside the function, which are used for computation, implicitly private. The implicitly private arrays are used to temporarily store computational data in the sub-functions. The loops consuming execution time contain calls to sequential functions.
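The privatization pattern described can be sketched as follows (function and array names are placeholders, not FT code): an array declared inside a function called from a parallel region is allocated per call and is therefore implicitly private to each thread.

    #include <stdio.h>

    #define N 256

    /* Placeholder for a computational sub-step: the local array is allocated
       per call, so each thread calling this from a parallel region gets its
       own copy -- it is implicitly private. */
    static double process_slice(int slice)
    {
        double scratch[N];            /* thread-local by virtue of being a local */
        double sum = 0.0;
        for (int i = 0; i < N; i++) {
            scratch[i] = slice + i * 0.5;
            sum += scratch[i];
        }
        return sum;
    }

    int main(void)
    {
        double total = 0.0;

        #pragma omp parallel for reduction(+:total)
        for (int s = 0; s < 64; s++)
            total += process_slice(s);

        printf("%f\n", total);
        return 0;
    }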

autoPar fails to parallelize any code in the benchmark. Dependencies on the temporary arrays are identified, preventing loops from being parallelized, but the arrays are not privatized.

Cetus 1 does not parallelize any code, due to dependencies on the temporary arrays; no privatization is done. Cetus 2 inlines and parallelizes two functions constituting 4% and 5% of the execution time, similarly to the handparallel version. The three major functions are inlined, and nested loops performing work on the arrays are parallelized; no privatization is done. The performance drop can be explained by the overly fine parallelization: the run-time checks put in place are always true for problem class A, indicating that the model expects a performance increase from running them. Cetus 3 inlines all functions as Cetus 2 does, but parallelizes none.
