
Master’s thesis

Automated Reasoning Support for Invasive Interactive Parallelization

by

Kianosh Moshir Moghaddam

LIU-IDA/LITH-EX-A–12/050–SE

2012-10-18

Linköpings universitet
SE-581 83 Linköping, Sweden


Supervisor: Christoph Kessler


Copyright

The publishers will keep this document online on the Internet – or its possible replacement – from the date of publication barring exceptional circumstances.

The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purposes. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.


Abstract

To parallelize a sequential source code, a parallelization strategy must be defined that transforms the sequential source code into an equivalent parallel version. Since parallelizing compilers can sometimes transform sequential loops and other well-structured codes into parallel ones automatically, we are interested in finding a solution for semi-automatically parallelizing codes that compilers are not able to parallelize automatically, mostly because of the weakness of classical data and control dependence analysis, in order to simplify the process of transforming the codes for programmers.

Invasive Interactive Parallelization (IIP) hypothesizes that by using an intelligent system that guides the user through an interactive process, one can boost parallelization in the above direction. The intelligent system's guidance relies on classical code analysis and pre-defined parallelizing transformation sequences. To support its main hypothesis, IIP suggests encoding parallelizing transformation sequences in terms of IIP parallelization strategies that dictate default ways to parallelize various code patterns by using facts which have been obtained both from classical source code analysis and directly from the user.

In this project, we investigate how automated reasoning can support the IIP method in order to parallelize a sequential code with acceptable performance but faster than manual parallelization. We have looked at two special problem areas: divide and conquer algorithms and loops in the source codes. Our focus is on parallelizing four sequential legacy C programs, namely Quick sort, Merge sort, the Jacobi method, and Matrix multiplication and summation, for both the OpenMP and MPI environments, by developing an interactive parallelizing assistance tool that provides users with the assistance needed for parallelizing a sequential source code.


Acknowledgements

First of all, I would like to offer my sincerest gratitude to my examiner, Professor Christoph Kessler, who always had time for me and my questions. His advice and comments on my technical questions, my implementation and even my report, through proof-reading, were very precise and useful. Thank you Dr. Kessler for your invaluable guidance and support during my thesis work. I am also very grateful to Professor Anders Haraldsson, who was the first person who made me familiar with the world of Lisp programming. Thank you Dr. Haraldsson, I will never forget your kindness and our fruitful discussions. I would like to thank Mikhail Chalabine, who was my supervisor in the first phase of my thesis and made me familiar with the concept of IIP. His comments were very instructive. It was a pity that I could not have his collaboration for the rest of this project.

Thanks also to the PELAB group at the IDA department of Linköping University that made this project possible for me.

I would like to express my gratitude to the National Supercomputer Center (NSC) in Linköping, Sweden, for giving me access to their servers.

Last, but certainly not least, I would like to thank my family members. Without your encouragement and support throughout my life none of this would have happened.

Contents

1 Introduction
1.1 Motivation
1.2 Overview and Contributions
1.3 Research questions
1.4 Scope of this Thesis
1.4.1 Limitations
1.4.2 Thesis Assumptions
1.5 Evaluation Methods
1.6 Outline

2 Foundations and Background
2.1 Compiler Structure
2.2 Dependence analysis
2.2.1 GCD Test
2.2.2 Banerjee Test
2.3 Artificial intelligence
2.3.1 Knowledge representation and Reasoning
2.3.2 Inference
2.3.3 Planning, deductive reasoning and problem solving
2.3.4 Machine Learning
2.3.4.1 Classification and statistical learning methods
2.3.4.2 Decision Tree
2.3.5 Logic programming and AI languages
2.4 Parallelization mechanisms
2.5 Parallel Computers
2.6 Parallel Programming Models
2.7 Performance Metrics
2.8 Code Parallelization
2.8.1 Loop Parallelization
2.8.1.1 OpenMP
2.8.1.2 MPI
2.8.1.3 Different methods for loop parallelization in MPI
2.8.2 Function Parallelization
2.9 Divide and Conquer Algorithms
2.9.1 Parallelization of Divide and Conquer Algorithms
2.10 Sorting algorithms
2.10.1 Sequential Quick sort
2.10.2 Parallel Quick sort
2.11 Jacobi method
2.12 Software composition techniques
2.12.1 Aspect-Oriented Programming (AOP)
2.12.2 Template Metaprogramming
2.12.3 Invasive Software Composition
2.13 Invasive Interactive Parallelization

3 System Architecture
3.1 Code Analysis
3.2 Dependence Analysis
3.3 Strategy Selection
3.4 Weaving

4 Implementation
4.1 System Overview
4.2 Implemented Predicates and Functions
4.2.1 Code Analysis
4.2.2 Dependency Analysis
4.2.3 Strategy Selection and Weaving
4.2.3.1 OpenMP
4.2.3.2 MPI

5 Evaluation
5.1 System Parallelization of Test Programs
5.1.1 Quick sort
5.1.2 Jacobi method
5.1.3 Other test programs
5.2 Correctness of the parallelized test programs
5.3 Performance of the parallelized test programs
5.3.1 Quick sort
5.3.2 Jacobi method
5.3.3 Other test programs
5.4 Usefulness

6 Related Work
6.1 Dependency analysis in parallelizing sequential code
6.2 Parallelizing by using Skeletons
6.3 Automatically parallelizing sequential code
6.4 Semi-automatically parallelizing sequential code

7 Conclusion

8 Future work
8.1 Header files
8.2 Graphical user interface
8.3 Extension of loop parallelization
8.4 Add profiler's support
8.5 Pointer analysis
8.6 Extension for D&C algorithms parallelization

Appendix A Divide and Conquer Templates
A.1 MPI
A.2 OpenMP

Appendix B Source Codes
B.1 Quick sort
B.1.1 Sequential Code (code example1)
B.1.2 Sequential Code with better pivot selection (code example2)
B.1.3 System-generated MPI Parallel code
B.1.4 System-generated OpenMP Parallel code
B.2 Merge sort
B.2.1 Sequential Code
B.2.2 System-generated MPI Parallel code
B.2.3 System-generated OpenMP Parallel code
B.3 Jacobi method
B.3.1 Sequential Code
B.3.2 Manual MPI Parallel Code
B.3.3 System-generated MPI Parallel code
B.3.4 Manual OpenMP Parallel Code
B.3.5 System-generated OpenMP Parallel code
B.4 Matrix multiplication and summation
B.4.1 Sequential Code
B.4.2 System-generated MPI Parallel code


List of Figures

1.1 Overview of our system
2.1 Compiler Front end and Back end concept
2.2 Dominance Relations
2.3 Data Dependence Graph
2.4 Distributed Memory Architecture
2.5 Shared Memory Architecture, here with a shared bus as interconnection network
2.6 Fork/Join Concept
2.7 Foster's PCAM methodology for parallel programs
2.8 Index Set Splitting
2.9 Parallel nested loops
2.10 Distributing loop iterations
2.11 Quick sort data division
2.12 Jacobi Method
3.1 System Architecture Steps
3.2 Example of code analysis decision tree
3.3 Code Analyzer Structure
3.4 Strategy Selection
3.5 Weaving
4.1 Implementation work flow
4.2 The system asks for a sequential code and reads it
4.3 The system will save the result
4.4 Loop dependency analysis
4.5 Function parallelization analysis
5.1 Data division between processors
5.2 Processors division
5.3 Quick sort MPI parallelization
5.4 Quick sort OpenMP parallelization
5.5 Processors communications
5.6 Jacobi method MPI parallelization
5.7 Jacobi method OpenMP parallelization
5.8 Merge sort
5.9 Matrix multiplication and summation
5.10 System-parallelized MPI Quick sort (code example1) speedup for the problem size 10^7
5.11 System-parallelized OpenMP Quick sort (code example1) speedup for the problem size 10^7
5.12 System-parallelized MPI Quick sort (code example2) speedup for the problem size 10^7
5.13 System-parallelized OpenMP Quick sort (code example2) speedup for the problem size 10^7
5.14 System-parallelized MPI Jacobi method speedup for the problem size 10000 * 1000
5.15 System-parallelized OpenMP Jacobi method speedup for the problem size 10000 * 1000
5.16 System-parallelized MPI Merge sort speedup for the problem size 10^7
5.17 System-parallelized OpenMP Merge sort speedup for the problem size 10^7
5.18 System-parallelized MPI Matrix multiplication and summation speedup for the problem size 10000 * 1000 × 1000 * 10000
5.19 System-parallelized OpenMP Matrix multiplication and summation speedup

List of Tables

4.1 Expression analysis
4.2 For-loop analysis
4.3 Function definition analysis
4.4 Function call analysis
5.1 Sequential Quick sort (code example1) execution time (in seconds)
5.2 System-parallelized MPI Quick sort (code example1) execution time (in seconds)
5.3 System-parallelized OpenMP Quick sort (code example1) execution time (in seconds)
5.4 Sequential Quick sort (code example2) execution time (in seconds)
5.5 System-parallelized MPI Quick sort (code example2) execution time (in seconds)
5.6 System-parallelized OpenMP Quick sort (code example2) execution time (in seconds)
5.7 Sequential Jacobi method execution time (in seconds)
5.8 System-parallelized MPI Jacobi method execution time (in seconds)
5.9 Hand-parallelized MPI Jacobi method execution time (in seconds)
5.10 System-parallelized OpenMP Jacobi method execution time (in seconds)
5.11 Hand-parallelized OpenMP Jacobi method execution time (in seconds)
5.12 Sequential Merge sort execution time (in seconds)
5.13 System-parallelized MPI Merge sort execution time (in seconds)
5.14 System-parallelized OpenMP Merge sort execution time (in seconds)
5.15 Sequential Matrix multiplication and summation execution time (in seconds)
5.16 System-parallelized MPI Matrix multiplication and summation execution time (in seconds)
5.17 System-parallelized OpenMP Matrix multiplication and summation execution time (in seconds)

Listings

2.1 OpenMP for-loop parallelization
2.2 OpenMP shared vs. private variables
2.3 OpenMP firstprivate variable
2.4 OpenMP lastprivate variable
2.5 OpenMP nested for-loop parallelization
2.6 OpenMP keeping the order of for-loop execution
2.7 OpenMP reduction for-loop
2.8 OpenMP multiple loops parallelization
2.9 OpenMP schedule(static) parallelization
2.10 Sequential for-loop to calculate the sum of all array elements
2.11 Sequential for-loop to increase all elements of an array by one
2.12 MPI for-loop parallelization method1
2.13 MPI for-loop parallelization method2
2.14 MPI for-loop parallelization method3
2.15 MPI for-loop parallelization method4
2.16 Sequential Quick sort
2.17 Jacobi iteration, sequential code
3.1 Dependency Example
4.1 For-loop code example
4.2 For-loop parallelization by using method2
4.3 Optimized for-loop parallelization
5.1 MPI parallel Quick sort (pseudocode)
5.2 OpenMP parallel Quick sort function
5.3 MPI pseudocode of implemented Jacobi algorithm
5.4 OpenMP implemented Jacobi algorithm
A.1 MPI D&C template
A.2 OpenMP D&C template
B.1 Sequential Quick sort source code example1
B.2 Sequential Quick sort source code example2
B.3 System-parallelized MPI Quick sort source code
B.4 System-parallelized OpenMP Quick sort source code
B.5 Sequential Merge sort source code
B.6 System-parallelized MPI Merge sort source code
B.7 System-parallelized OpenMP Merge sort source code
B.8 Sequential Jacobi method source code
B.9 Hand-parallelized MPI Jacobi method source code
B.10 System-parallelized MPI Jacobi method source code
B.11 Hand-parallelized OpenMP Jacobi method source code
B.12 System-parallelized OpenMP Jacobi method source code
B.13 Sequential Matrix multiplication and summation source code
B.14 System-parallelized MPI Matrix multiplication and summation source code
B.15 System-parallelized OpenMP Matrix multiplication and summation source code

1 Introduction

Computers were originally developed with a single processor. With increased demand over the years for faster computation and for processing larger amounts of data for scientific and engineering problems, single-processor computers were unable to process all the incoming data, or ultimately processed it with low performance. To overcome the performance problem, computer architectures have changed towards multi-processor architectures. For these multi-processor computers we need to write programs in such a way that they are able to run in parallel on a number of processors in order to achieve the targeted performance.

Since there exists a large number of sequential legacy programs, and most of the time it is not possible or economical to write parallel code from scratch, we should find a way to convert them into parallel programs. There are several ways to do this conversion, each with its own advantages and disadvantages, and we will discuss them further in the following pages.

1.1 Motivation

The use of multi-core computers spreads over different science areas and each day we encounter more requests for parallel applications which are able to run on multi-core computers. In creating parallel programs we encounter two approaches. The first refers to the situation where there is no parallel program for our problem and we should write one from scratch. The second approach indicates that there exists a serial program which should be changed into parallel to be able to execute on multi-processor computers. In this thesis our focus is on the second approach.

There exist several methods for parallelizing serial programs (see section 2.4); among these methods, manual parallelization is the most precise, since the programmer, based on the character of the program, can decide how to add and merge code statements in a way that reaches better performance, but basically this is an exhausting task for a programmer. Manual parallelization needs extensive parallelization knowledge for analysis and implementation and, since it has limited reusability in implementation, takes a lot of time. Compiler-based automatic parallelization, by automatically parallelizing some parts of the problem, can increase the speed of parallelization but is restricted to special code structures, and sometimes it is not able to parallelize some codes, such as recursive problems [47].

In the semi-automatic parallelization method, the steps of parallelization become easier for the user to some extent by combining automatic and user-interactive tasks for converting a specific part of sequential code into parallel code. But still this method is not capable of assisting the user in analyzing the program, and it can increase the speed only by avoiding repetitive tasks. Our remedy to this problem is using invasive interactive parallelization (IIP), see section 2.13, together with a reasoning system that will assist the user in the analysis of the code and help to make better decisions based on existing rules. We believe these methods will increase the reusability and speed of parallelization. Our research now focuses on parallelizing four different kinds of test programs: Quick sort and Merge sort, which both use the divide and conquer paradigm, the Jacobi method, which consists of several loops, and a Matrix multiplication and summation, which has a loop and a reduction statement. All require deep understanding of the code and reasoning.

Our motivation for choosing these programs is as follows: if we are capable of parallelizing them with the same quality as a manual parallelization method but faster, we should be able to generalize the approach, e.g., to other divide and conquer programs and also to programs which consist of loops.

1.2 Overview and Contributions

This thesis contributes to Invasive Interactive Parallelization (IIP) - a semi-automatic approach to parallelization of legacy code rooted in static invasive code refactoring and separation of concerns.


Chalabine [13] suggests that an IIP system should comprise three core components - interaction engine, reasoning engine, and weaving engine.

• The interaction engine interacts with the user and the reasoning engine in three steps: first, through it the user loads a sequential program into the system. Second, the user asks for parallelization by pinpointing a part of the code and indicating a specific defined parallelization strategy and a target architecture. Third, the reasoning engine's suggestions are shown to the user, and the user selects the best suggestion among them.

• The reasoning engine interacts with the two other components, the interaction engine and the weaving engine. It preprocesses the code, analyzes the user's request that was entered through the interaction engine, and tries to give parallelization suggestions based on defined rules.

• The weaving engine contributes to the system in two ways. Initially, it refactors the code based on the user's selected suggestion and the defined IIP strategies. Finally, based on software composition technology, it combines the refactored codes in the right order. The result of this component is a complete parallel program.

In this project we initially focus on IIP parallelization strategies for semi-automatic refactoring of one loop-based iterative function (the Jacobi method) and one recursion-based divide and conquer function (Quick sort), both written in C.

Our first goal is to construct a prototype of an interactive reasoning engine by implementing a set of IIP strategies which are capable of guiding the user through the parallelization of the two functions above by following patterns formulated as Lisp programs. We will build a list of predicates and functions like isArray(), hasDependencies(), etc. Then, user interaction can proceed as follows (U=user, S=system):

U: [Want to parallelize]
S: [Select the sequential code]
U: [Filename]
S: [Select the target architecture (currently, either SHMEM or DMEM)]
U: [Architecture name, e.g. SHMEM]
S: [Present system analysis result, e.g. system can parallelize parts B1 and B2] [Select part of the code to be parallelized]
U: . . .

Our second goal is to investigate how the use of the developed prototype affects the productivity of a typical parallelization expert.


1.3 Research questions

In our research we try to answer the following questions: How can we help a user with an intermediate knowledge of parallelization to parallelize sequential codes such as Quick sort and Jacobi method? In this respect, how would an IIP parallelization system be helpful? How can we define the facts in this system? What reasoning strategy should we use for selecting the facts? What is the best strategy for refactoring the code?

These questions lead us toward the following hypothesis:

Hypothesis

While manual parallelization is a time-consuming, tedious task, and while compilers have difficulties in automatically parallelizing recursive constructs, we are, by encoding IIP strategies using decision trees and letting the programmer interact with the system, capable of parallelizing Quick sort and the Jacobi method, achieving an acceptable level of performance but simpler and faster than manual parallelization.

1.4 Scope of this Thesis

1.4.1 Limitations

This thesis prototype is implemented in the Common Lisp language. It accepts sequential C code as input and converts it to parallel code. In order to be able to do this conversion, it must be able to understand and represent the C syntax in Common Lisp; therefore, we have implemented a C parser in Lisp to the extent we needed for our work, but we did not provide a full compiler for C in Lisp.

In our implementation we did not restrict ourselves to a specific hardware architecture, and this system can be used for both shared and distributed memory architectures.

In nested loop parallelization for two-dimensional arrays in distributed memory, we have not implemented column-wise parallelization in our prototype, since arrays in C are stored in row-wise order and parallelization of arrays in column-wise order needs several loops and more communication. Later on, in section 4.2.3.2, we will describe one method for column-wise parallelization of two-dimensional arrays.

For divide and conquer (D&C) algorithms, and specifically Quick sort, we have investigated different methods and load balancing issues, but we did not implement our system based on the best mentioned load balancing technique, because this technique can be used only for some kinds of D&C algorithms and we cannot generalize it.

1.4.2 Thesis Assumptions

For simplicity, the following assumptions have been made:

• We assume that all relevant code for parallelization of sequential code exists in a single file and all statements, even “{” and “}” symbols, are in separate lines.

• For both examples we assume that we have more data items than processors, which means that sizes of arrays and matrices are bigger than the number of processors.

• We suppose that the sequential Quick sort uses the best pivot selecting method.

• In the end both examples are output in C language.

1.5 Evaluation Methods

We evaluate our approach based on the following metrics:

Correctness: We evaluate the correctness of our system for the two examples Quick sort and Jacobi method in two ways. First, we will use testing to compare the result of executing the sequential code with that of the parallel code that we get from the system; in principle, they should both be the same for any legal input, which means that parallelization does not change the semantics (input-output behavior) of the sequential code.

Second, we will by inspection compare the result of manual parallelization of the sequential code with the parallel code that we get from the system; both should be basically the same.

Performance: We evaluate the system's performance by comparing the execution time of the sequential code on one processor with the execution time of the parallel code on multiple processors and calculating the speedup. We will show that for a large number of data items we reach a considerably high speedup.

Usefulness: We will show that our system increases the speed of parallelization for users with an intermediate knowledge of parallelization.

1.6 Outline

The rest of this thesis is organized as follows:

Chapter 2 explains foundations and background to this thesis.

Chapters 3 and 4 describe the contributions in terms of both the model and the algorithms used to construct the structures that make up the model. Chapter 5 contains an evaluation of the usefulness of the model.

Chapter 6 describes related work in the area of automatic and semi-automatic parallelization of sequential codes.


Chapters 7 and 8 summarize the contributions of this work and propose some suggestions for extending this research.

2 Foundations and Background

This chapter provides the background necessary to understand the research described in this master's thesis. It begins with a description of compiler structure and continues with dependence analysis, some notions of artificial intelligence, parallelization, and software engineering techniques; finally, it finishes with invasive interactive parallelization. The order of the mentioned sections is based on their priority for implementing the selected method.

2.1 Compiler Structure

A compiler is a software system which receives a program in one high level language as an input and then processes and translates it to a program in another high level or lower level language usually with the same behavior as the input program [56].

As presented in e.g. Cooper et al. [56], the compilation process is traditionally decomposed into two parts, the front end and the back end. The front end takes the source program as input and concentrates on understanding the language (both syntax and semantics) and encoding it into an intermediate representation (IR). In order to complete this procedure, the input program's syntax and semantics must be well formed. Each language has its own grammar, which is a finite set of rules; the language is an infinite set of strings defined by a grammar. The scanner and parser check the input code and determine its validity based on the defined grammar. The scanner discovers and classifies words in a string of characters. The parser, in a number of steps, applies the grammar rules. These rules specify the syntax of the input language. It may happen that sentences are syntactically correct but meaningless; therefore, besides syntax, the compiler also checks the semantics of programs based on contextual knowledge. Finally, the front end generates the IR [56].

After generating the IR, the compiler's optimizer analyzes the IR and rewrites the code in a way that reduces the running time, code size or other properties in order to increase efficiency. This analysis includes data flow analysis and dependence analysis. Rountev et al. [46] indicate that with the semantic information which can be obtained from data-flow analysis, code optimization and testing, program slicing and restructuring, and semantic change analysis become possible.

The back end reads the IR and maps it to the instruction set of the target machine or to source code. Figure 2.1 depicts this concept.

Figure 2.1: Compiler Front end and Back end concept.

2.2 Dependence analysis

In the procedure of converting sequential code into parallel code, besides checking the syntax, we need to check dependencies in order to be sure that the result of executing the new code is the same as for the source code. So, we have to perform both control and data dependence analysis. In a control dependence, the dependence arises from control flow. The control flow represents the order of execution of the statements based on the code syntax, by identifying the statements which lead the execution of other statements. The following code depicts a control dependence between statements S1 and S2. In this example S2 cannot be executed before S1, because if the value of W is equal to zero, changing the order of execution of statements S1 and S2 would cause a divide-by-zero exception. Here, we can execute S2 only if the condition in S1 is fulfilled.

S1: if (W>0) {

S2: L=area/W;

}

Data dependence is about producing and consuming data in the correct order. See the following example, where S1 stands for any statement that assigns a value to L:

S1: L=4;
S2: W=5;
S3: area=L*W;

Statement S3 cannot be moved before either S1 or S2. There is no constraint on the order of execution of S1 and S2. To understand both these dependencies we can use a dependence graph. The dependence graph is a directed graph whose vertices are statements and whose arcs show control and data dependencies [51, 5].

A control dependence graph represents parallelism constraints based on control dependence in a program. If node M is controlled (dominated) by node N or vice versa we cannot run them in parallel, see Figure 2.2(a). But if they are siblings we can execute them in parallel, see Figure 2.2(b), unless prevented by data dependence.

Figure 2.2: Dominance Relations

It may happen that two statements are control-independent but have data dependency. Look at the following example:

S1: a = b + c
S2: d = 4 * a
S3: e = a + b
S4: f = d + e

We cannot run all the above statements in parallel since S2 and S3 need the output value of S1, and S4 needs the results of S2 and S3.

In these situations we can use the Data Dependence Graph (DDG). The data dependence graph represents all statements of a program with their dependencies and can help to understand the semantic constraints on parallelism. In a data dependence graph, nodes are statements and edges show data dependencies among these statements [38]. Figure 2.3 represents the DDG of the above example.

We categorize data dependencies into flow dependence (true dependence), anti dependence, and output dependence.


Figure 2.3: Data Dependence Graph

• True dependence (flow dependence): In a true dependence from S1 to S2, one statement (S2) needs the result of a (possibly) previously executed statement (S1). As we can see in the following example, S1 writes to the memory location while S2 reads from it. We cannot change the order of S1 and S2; if statement S2 executed earlier it would read the wrong value.

S1: a = b + c
S2: d = 4 * a

• Anti dependence: Anti dependence happens when one statement reads a value that is later changed by a possibly successively executing statement. In the next example S1 reads from the memory location (a) while S2 writes to that location. If statement S2 executed first it would overwrite the value before S1 uses the old one.

S1: d = 4 * a
S2: a = b + c

• Output dependence: In an output dependence two statements write into the same memory location [29, 39]. The following example shows that both S1 and S2 write to the same memory location (a).

S1: a = 4 * d
S2: a = b + c

The latter two dependences can be eliminated by introducing new variables to avoid the storage reuse [39].
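As an illustration of this renaming, consider the following minimal C sketch; the variable names and values are ours and are not taken from the thesis examples:

/* Eliminating an anti and an output dependence by renaming.
   All names and values here are illustrative. */
void rename_example(void)
{
    int a = 1, b = 2, c = 3, d;

    /* Anti dependence: S1 reads a, S2 writes a. */
    d = 4 * a;          /* S1 */
    a = b + c;          /* S2 */

    /* After renaming, S2 writes a fresh variable a2, so S1 and S2 no longer
       conflict, provided later uses of a are redirected to a2. */
    int a2;
    d  = 4 * a;         /* S1  */
    a2 = b + c;         /* S2' */

    (void)d; (void)a2;  /* suppress unused-variable warnings */
}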

More complex data dependence issues can arise in data-dependent loop iterations, such as recurrences and reductions.

Recurrence means a constraint to execute the loop iterations in the proper order, for example when we need, in the current iteration, a value that is computed in the previous iteration.


A reduction operation is used to reduce the elements of an array, using an operation such as sum, multiply, min, max, etc., into a single value. For example, in the case of a loop summing the elements of an array, in each iteration the value of the sum variable is updated to add a new element. If we parallelize this loop, in the parallel version we divide the loop iterations between the active processors and each processor calculates such a sum for its own subset; therefore, the processors may interfere with each other and overwrite each other's values in the same memory location. To overcome this problem we have to ensure that at any time only one processor is able to execute the summation, which again serializes the loop execution. Data dependency in loops can in some cases be eliminated by rewriting the loops; for more information see [2].
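The following C/OpenMP sketch shows one common way to resolve this: each thread accumulates a private partial sum, and the partial sums are combined one at a time at the end. The function and variable names are our own illustration, not code from the thesis:

#include <omp.h>

/* Sum reduction with private partial sums; names are illustrative. */
double array_sum(const double *a, int n)
{
    double sum = 0.0;
    #pragma omp parallel
    {
        double partial = 0.0;          /* private to each thread        */
        #pragma omp for
        for (int i = 0; i < n; ++i)
            partial += a[i];           /* no writes to shared data here */

        #pragma omp critical           /* one thread at a time          */
        sum += partial;                /* combine the partial sums      */
    }
    return sum;
}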

Loop execution is one of the situations where we need to do data dependency analysis. There are two kinds of loop dependencies, loop-independent dependences and loop-carried dependences.

Loop-independent dependence means that the dependence occurs in the same loop iteration. For example, assume we have two statements S1 and S2 in the same loop that both access the same memory location (a[i]) in each iteration, but the memory location in each iteration is different. Since we have a distinct memory location in each iteration, the iterations are independent of each other; see the following example.

for (i=0; i<10; ++i) {
S1:   a[i] = 2*i;
S2:   e[i] = a[i] - 2;
}

Loop-carried dependence occurs when one statement accesses a memory location in one iteration and in another iteration there is a second access to that memory location, and at least one of these accesses is a write access [29]. The following example demonstrates this idea; as we can see, in every iteration statement S2 uses an element of a that was computed in the previous iteration by S1.

for (i=0; i<10; ++i) {
S1:   a[i] = 2*i;
S2:   e[i] = a[i-1] - 2;
}

Most compilers can perform control and data dependency analyses but in a limited scale and mostly for loops.

In our project we use both control and data dependency analysis for selected parts of the code that include control statements such as if-statements, iteration (for-loops, nested for-loops) and recursive functions.

In order to determine whether data dependences may exist among the code statements inside a loop or not, there exist several tests such as ZIV (zero index variable), SIV (single index variable), MIV (multiple index variable), the GCD test, the Banerjee test, etc. All these tests are based on the index variables of the loops enclosing the statements and the arrays' indices that occur inside the statements. We refer the reader to [5] for details concerning these tests.

The two traditionally most well known tests that find dependencies among loop statements are GCD and Banerjee tests. In [5] Allen et al. mention that most compilers for automatic parallelization use these two tests to find dependencies.

2.2.1 GCD Test

Both the GCD and Banerjee tests are based on solving linear Diophantine equations, and they determine whether a dependence equation of the form (2.1) may have an integer solution which satisfies the constraint (2.2) [5].

f(I_1, I_2, \dots, I_n) = a_0 + a_1 I_1 + a_2 I_2 + \dots + a_n I_n

g(J_1, J_2, \dots, J_n) = b_0 + b_1 J_1 + b_2 J_2 + \dots + b_n J_n

f(I_1, I_2, \dots, I_n) = g(J_1, J_2, \dots, J_n)

a_1 I_1 - b_1 J_1 + \dots + a_n I_n - b_n J_n = b_0 - a_0    (2.1)

L_k \le I_k, J_k \le H_k \quad \forall k, \ 1 \le k \le n    (2.2)

In equation (2.2), L_k and H_k denote the lower and upper limits for the index variable of loop k in a loop nest with n levels.

The GCD test does not consider the loop index limits; it only checks whether there may exist an integer solution that fulfills (2.1). The test computes the gcd (greatest common divisor) of all the coefficients of the loop index variables, \gcd(a_1, \dots, a_n, b_1, \dots, b_n), and checks whether it divides the constant term b_0 - a_0. If not, there is no integer solution for the equation and definitely no dependence; otherwise, there may be a dependence. Even when the test finds that a solution may exist, we are still not sure about the existence of a dependence, because the integer solutions may not fulfill the iteration space constraint (2.2) [5, 43].
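A compact C sketch of this check is given below. It assumes the dependence equation has already been brought into the form (2.1); the function names and array layout are our own illustration, not the thesis prototype:

#include <stdlib.h>

/* Greatest common divisor of two integers (by absolute value). */
static int gcd(int x, int y)
{
    x = abs(x); y = abs(y);
    while (y != 0) { int t = x % y; x = y; y = t; }
    return x;
}

/* GCD test for a1*I1 - b1*J1 + ... + an*In - bn*Jn = b0 - a0.
   Returns 0: definitely no dependence; 1: there may be a dependence. */
int gcd_test(const int a[], const int b[], int n, int a0, int b0)
{
    int g = 0;
    for (int k = 0; k < n; ++k) {
        g = gcd(g, a[k]);
        g = gcd(g, b[k]);
    }
    if (g == 0)                        /* all coefficients are zero    */
        return (b0 - a0) == 0;
    return (b0 - a0) % g == 0;         /* divisible => maybe dependent */
}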

In this situation we can run another dependency test and for this thesis we have selected the Banerjee test.


2.2.2 Banerjee Test

In contrast to the GCD test, the Banerjee test considers the loop index limits for its computations. It uses the loop index limits to calculate the min and max values on the left-hand side of equation (2.3).

a_1 I_1 + a_2 I_2 + \dots + a_n I_n = a_0    (2.3)

L_k \le I_k \le H_k \quad \forall k, \ 1 \le k \le n

Assume that for equation (2.3), min and max are calculated as follows:

\min = \sum_{i=1}^{n} (a_i^+ L_i - a_i^- H_i)

\max = \sum_{i=1}^{n} (a_i^+ H_i - a_i^- L_i)

where a^+ = a if a \ge 0, else 0, and a^- = 0 if a \ge 0, else -a.

Now, if a_0 does not lie between min and max (\min \le a_0 \le \max), there is definitely no dependence; otherwise, there may be a dependence [5, 43, 44]. In the situation that both the GCD and Banerjee tests answer "maybe", we will consider that there is a dependency.
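A straightforward C sketch of the bounds computation above, with array and function names chosen by us to match the formulas:

/* Banerjee bounds check for a1*I1 + ... + an*In = a0 with Lk <= Ik <= Hk.
   Returns 0: definitely no dependence; 1: there may be a dependence. */
static long pos_part(long a) { return a >= 0 ?  a : 0; }  /* a+ */
static long neg_part(long a) { return a >= 0 ?  0 : -a; } /* a- */

int banerjee_test(const int a[], const int L[], const int H[], int n, int a0)
{
    long min = 0, max = 0;
    for (int i = 0; i < n; ++i) {
        min += pos_part(a[i]) * L[i] - neg_part(a[i]) * H[i];
        max += pos_part(a[i]) * H[i] - neg_part(a[i]) * L[i];
    }
    return (min <= a0 && a0 <= max);
}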

2.3 Artificial intelligence

Artificial intelligence (AI) is an area of computer science. Among different existing definitions we will define it as, simulation of human intelligence on a computer in a way that enables it to make efficient decisions to solve complex problems with incomplete knowledge. Such a system must be capable of planning in order to select and execute suitable tasks at each step. AI is a wide research area and many researchers are working on different aspects of it. Now we will describe some of them.

2.3.1 Knowledge representation and Reasoning

Knowledge representation's (KR) focus is on understanding requirements, capturing required knowledge, representing this knowledge in symbols and automatically making it available based on reasoning strategies. A KR system is responsible for analyzing the user's queries and answering them in a reasonable time. KR is a part of a larger system which interacts with other parts by answering their queries and letting them add and modify concepts, roles and assertions; for more details see [7].


2.3.2 Inference

In our daily life it may happen that, according to existing facts or our experiences, we draw a conclusion for some problem; in these situations we infer the result. For example, if the ground is wet in autumn we usually infer that it was raining. This notion appears in different areas, and we also have it in AI. Inference is classified into two groups, deductive and inductive. Deductive inference proceeds in two steps: first, by assigning truth values to the sentences, it specifies the premises; in the second step, it provides an inference procedure based on the given premises which leads to certain conclusions. Inductive inference, similarly to deductive inference, also proceeds in two steps: first, it specifies premises by assigning probability values to the sentences, and in the second step it provides an inference procedure based on the given premises which leads to the most probable conclusions. Since inductive inference is only probable, even in situations where the evidence is accurate, the conclusion can be wrong. From the mathematical point of view, how the conclusions are determined from the premises is the difference between inductive and deductive inference; for more details see [36].

2.3.3 Planning, deductive reasoning and problem solving

An intelligent agent must be able to set goals and predict the states necessary to achieve them by reasoning about the effect of each state in an efficient manner. Deductive reasoning means taking decisions based on existing facts, and intelligent agents use this method for creating, step by step, a plan to solve a specific problem even with incomplete information.

2.3.4 Machine Learning

We can define learning as improving some task with experience. The field of machine learning includes studies of designing computer programs that improve their performance at some tasks through learning from previous experience or history. In machine learning we encounter two concepts, supervised and unsupervised learning. In supervised or inductive learning, similar to human learning that gains knowledge from past experiences to improve the human's ability to perform real world tasks, the machine learns from past experience data.

In this concept we have a set of inputs, outputs, and algorithms that map these inputs to outputs. For mapping data, supervised learning first classifies the data based on their similarities and then executes the set of functions which lead from the existing inputs to the specified outputs. Here the input data must be complete; otherwise the system cannot infer correctly. One of the most common techniques of supervised learning is decision tree learning, which creates a decision tree based on predefined training data.

Unsupervised learning is a technique with emphasis on the computer's ability to solve a classification problem from a chain of observations; therefore, it is able to solve more complex problems. In this technique we do not need training data [48].

2.3.4.1 Classification and statistical learning methods

In supervised learning we have the notion of class which is defined as a decision to be made, and patterns belong to these classes. AI applications are grouped into classifiers and controllers. Classifiers are functions that based on pattern matching methods, find the closest pattern, and controllers infer actions [48]. Classifiers are used in support vector machine (SVM), k-nearest neighbor algorithm, decision tree, etc.

2.3.4.2 Decision Tree

Decision tree learning is a method of supervised learning. It uses a tree-like graph or model of decisions. In a decision tree, the internal nodes test attributes, branches represent the values corresponding to the attributes and classifications are assigned by leaf nodes [53].

Interpretability, flexibility and usability make this model transparent and understandable to human experts, applicable to a wide range of problems and accessible to non-specialists. This model is also highly scalable and the result is accurately predictable, for more details we refer the reader to [53].
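As a concrete illustration of this structure, a decision-tree node could be represented in C roughly as follows. The field names are our own and do not come from the thesis implementation (which is written in Lisp):

#include <stddef.h>

/* One node of a decision tree: internal nodes test an attribute, branches
   carry attribute values, and leaves hold a classification. */
struct dt_node {
    const char      *attribute;       /* attribute tested here; NULL at a leaf  */
    const char      *classification;  /* class label; non-NULL only at a leaf   */
    const char     **branch_values;   /* value attached to each outgoing branch */
    struct dt_node **children;        /* one child node per branch value        */
    size_t           n_branches;      /* number of outgoing branches            */
};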

2.3.5 Logic programming and AI languages

Logic programming is a part of the AI research area. In logic programming we categorize languages into two groups, declarative and imperative languages. A declarative language describes what the problem is, while an imperative language describes how to solve the problem. A declarative statement is also called a declarative sentence. It is a complete expression in natural language which is true or false. A declarative program consists of a set of declarative statements and shows the relationship between them. An imperative sentence or command says what to do. An imperative program consists of a sequence of commands; for more information see [40].

Two main logic languages that are mostly used in AI are Lisp and Prolog, which both are declarative languages.

Lisp was born in 1958; it seems that after Fortran it is the second oldest surviving language. Lisp is a functional language which is based on defining functions. In Lisp all symbolic expressions and other information are represented by a list structure, which makes manipulation easy [37].


Prolog is a special-purpose language which is based on first order logic and mostly used for logic and reasoning. It is a declarative language. It has a limited number of key words which makes it easy to learn.

2.4 Parallelization mechanisms

There are four different methods for parallelizing a program:

• Manual parallelization: Traditionally, parallel programs have been manually written by expert programmers. The programmer is responsible for the identification and implementation of parallelism. This mechanism is flexible since the programmer can decide how to implement it. However, the programmer must have good knowledge about the characteristics of the architecture where the program is intended to run, and take decisions about the ways to decompose and map data and about how the scheduling and synchronization procedures must work [19]. Since the programmer is responsible for doing all of the above tasks by him/herself, and repetitive tasks may also occur, this method is time consuming, complex and error-prone [9].

• Compiler-based automatic parallelization: In this method compilers automatically generate a parallel version of the sequential code. This method is less flexible than the previous method. Here, most focus is on loop level parallelization, since most of the program execution time is spent in executing loop iterations. Compilers parallelize loops based on data dependence analysis. However, it is not always possible to detect data dependency at compile time. Overall we can say that compilers can parallelize loops in the situations mentioned in section 2.8.1.

Different parallel algorithms for the same sequential code may present different degrees of parallelism. As González-Escribano et al. [19] mention, many compilers will not parallelize a loop when the overhead of the parallel execution is expected to exceed the gained performance. Some compilers have problems with parallelizing divide and conquer algorithms, and generally recursive procedures, due to dependencies that may exist among the recursive calls [47].

• Skeletons: Algorithmic skeletons, which were introduced by Cole [15], are another approach to parallel programming. In this method, the details of parallel implementations are abstracted in skeletons, which can increase the programmer's productivity. But this method restricts parallelization, since it is only suitable for well-structured parallelism and the source code also must be rewritten according to existing templates.

• Semi-automatic: The semi-automatic method provides an intermediate alternative between manual and compiler-based automatic parallelization [33] as a method for locality optimization [54]. In this method the programmer guides the compiler in how to parallelize the code. For example, if we have a simple loop or a nested loop and we want to parallelize it, we can specify how the compiler distributes data between different processors in such a way that each processor executes operations on a specific amount of data. For loops with accumulative operations, where two concurrent iterations update the same variable simultaneously, we would lose one of the updates because the value is overwritten by the other iteration; in this case, the programmer annotates the specific part of the code as a critical section and the compiler will parallelize it accordingly.

Semi-automatic parallelization uses directives, which means inserting pragmas before the selected statement or block of statements, such as "#pragma omp parallel for" in OpenMP.

#pragma omp parallel for
for (i=0; i<n; ++i)
{
  ...
}
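For the accumulation case mentioned above, the annotation could look roughly as follows. This is our own illustration with made-up variable names, using OpenMP's reduction clause; wrapping the update in a critical section is an alternative:

double sum = 0.0;
#pragma omp parallel for reduction(+:sum)
for (i = 0; i < n; ++i)
{
  sum += a[i];   /* each thread accumulates privately; partial sums are combined */
}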

2.5 Parallel Computers

Parallel computers are computers with multiple processor units. There are two architecture models for these computers, multicomputer and multiprocessor.

In the multicomputer or distributed memory model, a number of computers are interconnected through a network and the memory is distributed among the processors, which improves scalability. Each processor has direct access to its local memory, and in order to access data in other processors' memory they use message passing (Figure 2.4). There are two models for distributed memory: message passing architecture, and distributed shared memory, where the global address space is shared among multiple processors and an operating system helps to give the shared memory view to the programmer; see [52, 27] for more information.

In the second model, which is called multiprocessor or shared memory, a number of processors are connected to a common memory through a network, such as a shared bus or a switch. Memory is located at a centralized location and all processors have direct access to that place (Figure 2.5). There are two designs for shared memory multiprocessors: symmetric multiprocessor (SMP) with UMA (uniform memory access) architecture style and Cc-NUMA (cache-coherent NUMA, i.e. non-uniform memory access, where the access mechanism and access time to various parts of memory vary for a processor) [52, 27].


Figure 2.4: Distributed Memory Architecture.

Figure 2.5: Shared Memory Architecture, here with a shared bus as interconnection network.


2.6 Parallel Programming Models

As we have described in the previous section, there are two different memory architectures, shared memory and distributed memory, for parallel computers. Each architecture has its own design and programming model; therefore, before beginning to program we should define our architecture.

The Message Passing Interface (MPI) standard supports distributed memory architectures and is defined for C/C++ and Fortran programs. In this model, during run time the number of tasks is fixed [10]. In MPI all processors run the same code. The programmer's parallelizing skills are required in order to write a parallel program.
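The following minimal MPI program is our own illustration of this SPMD style (it is not taken from the thesis): every process runs the same code and branches on its rank, and process 1 sends a single value to process 0.

#include <mpi.h>
#include <stdio.h>

/* Run with at least two processes, e.g. mpirun -np 2 ./a.out */
int main(int argc, char **argv)
{
    int rank, value = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 1) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    } else if (rank == 0) {
        MPI_Recv(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("process 0 received %d from process 1\n", value);
    }

    MPI_Finalize();
    return 0;
}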

OpenMP is an API (Application Programming Interface) that supports shared memory architectures in the C/C++ and Fortran languages. OpenMP is simple, and just by inserting some directives in different parts of the sequential program we can parallelize the code. As Quinn describes in [45], in the shared memory model processors interact with each other through shared variables. Initially one thread, which is called the master thread, is active, and during the execution of the program it creates or awakens a number of threads to execute some section of the code in parallel; this process is called fork. At the end, through the dying or suspension of the created threads, just the master thread remains; this is called join [45]. Figure 2.6 illustrates the fork/join concept.

Here we should mention that for both models, together with the full PCAM process (see section 2.8), we need to check dependencies among the different parts of the code during the process of parallelization (see section 2.2 for dependency analysis).


In contrast with MPI where the number of active processes during the execution of the program is fixed and all of them are active, in this method the number of active threads during the execution will change. As we have mentioned above, at the start and end of the code execution we have just one active thread.
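A minimal C/OpenMP program illustrating this fork/join behaviour (our own sketch; the thread count of four is arbitrary):

#include <omp.h>
#include <stdio.h>

int main(void)
{
    printf("before the parallel region: only the master thread is active\n");

    #pragma omp parallel num_threads(4)      /* fork: a team of threads starts  */
    {
        printf("hello from thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }                                        /* join: only the master continues */

    printf("after the parallel region: only the master thread remains\n");
    return 0;
}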

2.7 Performance Metrics

Speedup and efficiency are two metrics to evaluate the performance of parallel programs.

Speedup measures the gain we get by running certain parts of the program in parallel to solve a specific problem. There are several concepts for calculating speedup, and among them we will describe the two most well-known: absolute and relative speedup [34].

• Absolute speedup is calculated by dividing the time taken to run the best known serial algorithm on one processor (T_{ser}) by the time taken to run the parallel algorithm on p processors (T_{par}):

S_{abs} = \frac{T_{ser}}{T_{par}}

• Relative speedup is calculated by dividing the execution time of the parallel algorithm on one processor (T_1) by the execution time of the same parallel algorithm on p processors (T_{par}):

S_{rel} = \frac{T_1}{T_{par}}

Efficiency (E) is the ratio between the speedup and the number p of processors used, which indicates the resource utilization of the parallel machine by the parallel algorithm:

E = \frac{S}{p}

In the ideal situation the efficiency is equal to one, which means S = p and all processors use their maximum potential, but in practice this cannot be achieved. Usually the performance decreases for several reasons, which typically causes the efficiency to be less than one. For more details we refer the reader to [34].
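As a small worked example with made-up timings: if the serial program runs in T_{ser} = 120 s and the parallel version runs in T_{par} = 40 s on p = 4 processors, then

S_{abs} = 120 / 40 = 3.0, \qquad E = S / p = 3.0 / 4 = 0.75,

i.e. a threefold speedup with 75% utilization of the four processors.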

2.8 Code Parallelization

As we have discussed before, the aim of parallelizing sequential code is to increase the speed of computations. Thus, our strategy for parallelization is based on parallelizing the parts of the code which use most of the CPU time, such as loops and function calls. According to Foster's PCAM methodology [18], see Figure 2.7, parallel algorithms are designed in four steps:

• Partitioning: How to decompose the problem into pieces or subtasks by considering the existence of concurrency among the tasks. Grama et al. [20] mention several methods for partitioning, such as domain decomposition, functional decomposition (including recursive, speculative, and exploratory decomposition) and hybrid decomposition.

– Domain decomposition decomposes large amounts of data (and accordingly the computations on them) into a number of tasks.

– Recursive decomposition is suitable for divide and conquer problems (see section 2.9). In this method, each sub-problem resulting from the dividing step becomes a separate task.

– Speculative decomposition relates to applications where we cannot define the parallelism between the tasks from the beginning. This means that at each phase of running the application there are several choices selectable as the next task, and we can identify the next task only when the current task is completely finished. For parallelism, speculation must be done on possibly independent tasks whose independence is not statically provable. In the case of misspeculation, we have to roll back the state to the safe state.

– Exploratory decomposition is applied to break down computations in order to search a space of solutions.

– Hybrid decomposition is used when we need the combination of previously described decomposition methods.

• Communication: In this step the required communication and synchronization among the tasks is defined.

• Agglomeration: Tasks and the communication between them are investigated in this step and, if necessary, tasks are combined into bigger ones in order to improve performance and reduce communication cost.

• Mapping: Tasks are assigned to the processors by fulfilling the goals of maximizing processor utilizations and minimizing communication costs.

2.8.1 Loop Parallelization

In most of the programs, loops are the critical points which take a lot of CPU time. In automatic parallelization compilers can parallelize the loop if:


• There is no loop-carried dependence between iterations. In order to be able to parallelize the loops we have to do dependency analysis for all statements inside the loop body. If there exists loop-carried dependence we can not parallelize the code unless we remove these dependencies [39].

• The function calls inside the loop do not affect variables accessed in other iterations, nor the loop index.

• The loop index variable must be integer [2].

Together with the above constraints, two points are really important and they should be considered while parallelizing the code. First, the number of loop iterations must be known since we usually divide the loop iterations between the processors.

The second point refers to conditional statements inside the body of the loop, where for every iteration we have to execute one branch of the conditional statements. Sometimes these statements cause different behavior in loops that makes the loop impossible to parallelize. Therefore, we should find them by analyzing the code and, if possible, remove them.

Index set splitting (ISS) is a method for loop parallelization which decomposes the loop index set into several loops with different ranges. This idea has been described by e.g. Banerjee [8], Allen and Kennedy [4] and Wolfe [58]. Barton [11] used the ISS technique to decompose a loop containing conditional statements into a number of simple loops. The code in Figure 2.8(a) shows a loop with an if statement and (b) represents the transformed model.

Figure 2.8: Index Set Splitting
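Since the figure itself is not reproduced here, the following minimal sketch illustrates the same transformation on a made-up loop (the loop body and the split point k are illustrative assumptions, not the code from Figure 2.8): when the if condition depends only on the loop index, the index set can be split at k so that each resulting loop is branch-free.

/* (a) original loop: the branch depends only on the index i;
       k is a hypothetical, loop-invariant split point            */
for (i = 0; i < n; i++) {
    if (i < k)
        a[i] = b[i] + 1;
    else
        a[i] = b[i] - 1;
}

/* (b) after index set splitting: two branch-free loops over
       disjoint index ranges                                      */
for (i = 0; i < k; i++) {
    a[i] = b[i] + 1;
}
for (i = k; i < n; i++) {
    a[i] = b[i] - 1;
}

Each of the two resulting loops can then be parallelized independently, for example with the OpenMP directives described later in this section.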

In nested loops we usually parallelize the outermost loop where possible, because it minimizes the overhead and maximizes the work done by each processor. For loops with accumulative operations such as Sum, Product, Dot product, Minimum, Maximum, Or, And we usually use reduction operations [2].


When we are sure that there is no loop-carried dependence inside the loop, we can, based on the selected parallel programming model (MPI / OpenMP), parallelize the code as follows.

2.8.1.1 OpenMP

Quinn [45] has mentioned that in the OpenMP programming model, the compiler is responsible for generating the code for fork/join of threads and also allocating the iterations to threads. In a shared memory architecture the user interacts with the compiler through the compiler directives (in C: Pragmas, pragmatic information). The syntax of OpenMP pragmas in C/C++ is as follows:

#pragma omp <rest of pragma>

By inserting pragmas in different parts of the code the user will indicate to the compiler which parts he/she wants to parallelize.

In order to parallelize a for-loop, the loop must be in canonical shape.

for (index = start; index {<, <=, >=, >} end; step)

The for-loop is in canonical shape if:

• The initial expression has one of the following formats:
  loop-variable = lower-bound (e.g. i=0)
  integer-data-type loop-variable = lower-bound (e.g. int i=0)

• It contains a loop exit condition.

• The incremental expression (step) has one of the following formats:
  ++loop-variable (e.g. ++i)
  loop-variable++ (e.g. i++)
  --loop-variable (e.g. --i)
  loop-variable-- (e.g. i--)
  loop-variable += integer-value (e.g. i+=2)
  loop-variable -= integer-value (e.g. i-=2)
  loop-variable = loop-variable + integer-value (e.g. i=i+2)
  loop-variable = integer-value + loop-variable (e.g. i=2+i)
  loop-variable = loop-variable - integer-value (e.g. i=i-2)

For nested loops, all loops can often be executed in parallel, but we usually parallelize only the outer loop in order to reduce the number of fork/joins [45]. Figure 2.9 illustrates two parallel versions of the following code:


L1: for (i = 0; i < n; ++i) {
L2:     for (j = 0; j < n; j++) {
            statements
        }
    }
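Since Figure 2.9 is not reproduced here, the sketch below shows two plausible parallel versions of this nested loop (the exact code in the figure may differ): the first parallelizes only the outer loop L1, the second parallelizes only the inner loop L2 and therefore forks and joins threads once per outer iteration.

/* Version 1 (assumed): parallelize the outer loop L1; j must be private */
#pragma omp parallel for private(j)
for (i = 0; i < n; ++i) {
    for (j = 0; j < n; j++) {
        /* statements */
    }
}

/* Version 2 (assumed): parallelize the inner loop L2; threads are
   forked and joined n times, which is usually slower               */
for (i = 0; i < n; ++i) {
    #pragma omp parallel for
    for (j = 0; j < n; j++) {
        /* statements */
    }
}

As noted above, the first version is usually preferred because it pays the fork/join overhead only once.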


The rest of this subsection will describe different OpenMP methods for parallelizing for-loops.

• For-loop parallelization: Just by adding the “#pragma omp parallel for” directive before the loop, the loop iterations will be distributed among the threads. The code in Listing 2.1 represents an example of loop parallelization.

#pragma omp parallel for
for (i = 0; i < m; i++) {
    a[i] = a[i] + 1;
}

Listing 2.1: OpenMP for-loop parallelization

Based on the statements inside the loop we can add some clauses to this directive. For more details see the following examples.

• Shared vs private variables: By default all variables in a parallel region are shared and all active threads can access and modify them, but sometimes this may cause the result of the execution to be incorrect. OpenMP allows to define such variables as private in parallel regions, see Listing 2.2 for an example. In this example, the temp variable is defined as private; otherwise, the threads would overwrite each other's values in it and the final result would not be correct.

#pragma omp parallel for private(temp) shared(a, b, n)
for (i = 1; i <= n; i++) {
    temp = a[i];
    a[i] = b[i];
    b[i] = temp;
}

Listing 2.2: OpenMP shared vs. private variables

• Firstprivate variables: If privatized variables inside a parallelized loop are to be initialized with the value they had before the loop in the sequential version, we declare them in a firstprivate clause (Listing 2.3).

tmp = 100;
#pragma omp parallel for firstprivate(tmp) shared(a, n)
for (i = 1; i <= n; i++) {
    a[i] = a[i] + tmp;
}

Listing 2.3: OpenMP firstprivate variable


• Lastprivate variables: It may happen that we need the result of the last iteration in the sequential execution order of a parallelized loop to be copied into the shared version of a variable, in order to get the same result as with the sequential execution of the code. The lastprivate clause will copy the value produced by the iteration with the final loop index into the specified variable (Listing 2.4).

#pragma omp parallel for lastprivate(tmp)
for (i = 1; i <= n; i++) {
    tmp = i * 2 - 3;
    a[i] = i;
}
lasttmp = tmp;

Listing 2.4: OpenMP lastprivate variable

• Nested loops: To parallelize nested loops, in order to reduce the number of forks/joins, we usually parallelize the outer loop and make the indices of the inner loops private for each thread, see Listing 2.5.

#pragma omp parallel for private(j)
for (i = 0; i < m; i++) {
    for (j = 0; j < n; j++) {
        a[i][j] = a[i][j] + 1;
    }
}

Listing 2.5: OpenMP nested for-loop parallelization

• Keeping the same order: In some situations we need the output order of our parallel code to be the same as in the sequential execution; then we can add the “ordered” clause to the parallel for directive and place an “ordered” pragma before the specific statements (see Listing 2.6).

#pragma omp parallel for ordered
for (i = 0; i < n; i++) {
    #pragma omp ordered
    {
        printf("%d\n", a[i]);
    }
}

Listing 2.6: OpenMP ordered for-loop


• Reductions: There are several ways to parallelize loops with reduction statements. In Listing 2.7, part (a) shows the sequential code, part (b) represents one way of parallelizing the code by using features we have described earlier, and in part (c) we can see another way which generates parallel reduction code automatically as hinted by the reduction clause.

(a)
for (i = 0; i < n; i++) {
    sum += a[i];
}

(b)
#pragma omp parallel private(partialSum)
{
    partialSum = 0.;
    #pragma omp for
    for (i = 0; i < n; i++) {
        partialSum += a[i];
    }
    #pragma omp atomic
    sum += partialSum;
}

(c)
#pragma omp parallel for reduction(+:sum)
for (i = 0; i < n; i++) {
    sum += a[i];
}

Listing 2.7: OpenMP reduction for-loop

• Multiple loops: If we have multiple separate loops and we want to parallelize all of them, instead of adding a separate “#pragma omp parallel for” before each loop, we can parallelize them as shown in Listing 2.8, which increases the parallel code execution speed.

#pragma omp parallel
{
    #pragma omp for
    for (i = 0; i < 100; i++) {
        a[i] = i + 10;
    }
    #pragma omp for
    for (i = 0; i < 50; i++) {
        b[i] = i + 2;
    }
}

Listing 2.8: OpenMP multiple for-loop parallelization


• Schedule(static): With static scheduling the loop iterations are distributed almost equally among the threads. Each thread executes a contiguous range of loop iterations (see Listing 2.9).

#pragma omp parallel for schedule(static)   /* an optional chunk size can be given as a second argument */
for (i = 0; i <= 100; i++) {
    a[i] = 2 * i - c[i];
}

Listing 2.9: OpenMP schedule(static) parallelization

2.8.1.2 MPI

In message passing interface (MPI) programming we have six basic functions which we can use to parallelize a full code in a simple way. These functions are MPI_Init, MPI_Finalize, MPI_Comm_size, MPI_Comm_rank, MPI_Send and MPI_Recv [45]. Besides these functions there are many other functions which can be used in some situations, based on the characteristics of the program, to increase the performance or the ease of programming. Among those functions, MPI_Bcast, MPI_Scatter, MPI_Scatterv, MPI_Gather, MPI_Gatherv and MPI_Reduce are the most well-known. We refer the reader to [20] for details concerning the functionality of these functions. By adding a call to MPI_Init to the main function, the MPI system will be initialized. If some statements in a program need to be executed by a single processor or only some of the processors, we can place them inside conditional statements such as if-statements.
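As a minimal, self-contained sketch (not taken from the thesis) of how the six basic functions fit together, the program below initializes MPI, queries the number of processes and the rank of the calling process, sends one integer from the root to process 1, and shuts MPI down; the message content and tag are arbitrary.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int np, rank, value = 0;
    MPI_Status status;

    MPI_Init(&argc, &argv);                 /* initialize the MPI system   */
    MPI_Comm_size(MPI_COMM_WORLD, &np);     /* number of active processes  */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* rank of the calling process */

    if (rank == 0 && np > 1) {
        value = 42;
        /* the root sends one integer to process 1 */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        printf("process %d received %d\n", rank, value);
    }

    MPI_Finalize();                         /* shut down the MPI system    */
    return 0;
}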

As described so far, for parallelizing a loop we usually divide the loop iterations between the active processors. Figure 2.10 depicts two examples of this division. In example (a), the iterations are divided into contiguous blocks between the active processors and each processor works on its own part. Example (b) demonstrates another method, where iteration i is executed by the processor whose rank equals i mod np. In both examples np is the number of processors. In example (b) rank represents the processor rank, and in example (a) Lb, Ub, and Step respectively represent the lower bound, upper bound and incremental value of the parallelized for-loop in each processor.
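The two division schemes can be sketched as follows (the variable names Lb, Ub, Step, rank and np follow the text above; the loop body and the assumption that the block bounds have already been computed are illustrative):

/* (a) block division: each processor executes its own contiguous
       range [Lb, Ub] of the iteration space                        */
for (i = Lb; i <= Ub; i += Step) {
    a[i] = a[i] + 1;
}

/* (b) cyclic division: processor 'rank' executes exactly the
       iterations i with i mod np == rank                           */
for (i = rank; i < n; i += np) {
    a[i] = a[i] + 1;
}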

2.8.1.3 Different methods for loop parallelization in MPI

This subsection will present four different loop parallelization methods. Methods one, two and four explain three different examples for parallelizing the for-loop in Listing 2.10. This loop calculates the sum of all elements of array a. In method three, we will show one method for parallelizing the for-loop in Listing 2.11. For all examples, we assume that the array a is created by the root processor (processor number zero) and the other processors do not have access to it. In these examples np holds the number of active processors and MyRank the rank of the specified processor.

for (i = 0; i < n; ++i) {
    sum += a[i];
}

Listing 2.10: Sequential for-loop to calculate the sum of all array elements

for (i = 0; i < n; ++i) {
    a[i] = a[i] + 1;
}

Listing 2.11: Sequential for-loop to increase all elements of an array by one

• Method 1: In this method the root processor broadcasts array a to all processors; then each processor calculates its own part and sends back the result to the root processor. The performance of this method is low since the communication time is high. The source of the problem is the broadcast, where one node tries to send the array to all others, which causes high traffic in the network [12]. The next problem arises from the point-to-point communication (send and receive) after executing the for-loop, where all processors send their data to the root processor. Listing 2.12 shows the pseudocode of this method.

Bcast(a);
calculate_lb_ub(a, a_size);
for (i = lb; i <= ub; i++) {
    partsum += a[i];
}
// Each processor sends its own partsum value with the size equal to one to the root processor
if (MyRank != 0) {
    Send(partsum, 1, 0);
}
// The root processor receives the partsum values from the other processors and sums them up
if (MyRank == 0) {
    for (i = 1; i < np; i++) {
        Receive(partsum, 1, i);
        sum += partsum;
    }
}

Listing 2.12: MPI for-loop parallelization method1
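A possible concrete MPI realization of this pseudocode is sketched below; the use of MPI_Bcast, MPI_Send and MPI_Recv follows the pseudocode, while the element type, the array size, the block-distribution arithmetic (which assumes the size is divisible by np) and the root contributing its own partial sum are illustrative assumptions, not code from the thesis.

#include <mpi.h>
#include <stdio.h>

#define N 1000                              /* hypothetical array size           */

int main(int argc, char *argv[]) {
    int np, MyRank;
    double a[N], partsum = 0.0, sum = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &np);
    MPI_Comm_rank(MPI_COMM_WORLD, &MyRank);

    if (MyRank == 0)                        /* only the root creates the data    */
        for (int i = 0; i < N; i++) a[i] = 1.0;

    MPI_Bcast(a, N, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    int chunk = N / np;                     /* block distribution; assumes N % np == 0 */
    int lb = MyRank * chunk;
    int ub = lb + chunk - 1;
    for (int i = lb; i <= ub; i++)
        partsum += a[i];

    if (MyRank != 0) {
        /* each non-root processor sends its partial sum to the root */
        MPI_Send(&partsum, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    } else {
        sum = partsum;                      /* the root contributes its own part */
        for (int p = 1; p < np; p++) {
            MPI_Recv(&partsum, 1, MPI_DOUBLE, p, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            sum += partsum;
        }
        printf("sum = %f\n", sum);
    }

    MPI_Finalize();
    return 0;
}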

References
