
Bachelor of Science Thesis

NICKLAS WAHLÉN

A Comparison of Different Parallel Programming Models for Multicore Processors

A comparison of different parallel programming models for multicore processors

Niklas Wahlén, nwahlen@kth.se

Abstract

As computers are used in most areas today, improving their performance is of great importance. Until recently a faster processor was the main contributor to the increase of overall computer speed. Today the situation has changed, as heating is becoming a bigger problem. Running a processor faster requires more power, which also leads to the processor's components getting warmer. A solution to this is to use several somewhat slower processors in the same computer, a so-called multiprocessor or multicore processor. That way programs can execute on different processors, or the functionality of one program can be divided and run on several processors simultaneously.

Programming for multicore architectures is however more complex than programming for computers with a single processor, as data in memory can now be accessed by several instances of a program, called threads, at the same time. This calls for some kind of synchronization between such threads.

Many different models are available to simplify the implementation of programs for multicore computers, and such models are compared in this thesis. The models in question are Pthreads, OpenMP and Cilk++.

The models differ from each other in many ways, and are found to be useful for different areas. While Pthreads is a good tool when one wants to expose the threading mechanisms and be sure to have high flexibility, OpenMP and Cilk++ offer simpler interfaces. OpenMP's main strengths are its interface and good portability. Cilk++ is suitable when high performance is the most important aspect.

Sammanfattning

Since computers are used in most areas today, it is important to improve their performance in order to speed up various processes. Previously, faster processors were the main way of increasing the computing power of computers. Today the situation is different, as heating has become an ever larger problem. Faster processors often require more power, which causes the processors' components to become warmer. One solution to this problem is to use several somewhat slower processors in the same computer, or to use so-called multicore processors. That way different programs can be run on different processors at the same time, or the functionality of one program can be divided and run on several processors.

Programming for multicore processors, or for computers with several processors, is however harder than programming for a system with a single processor. This is because the computer's memory is now shared by several processors that execute simultaneously. Therefore some kind of synchronization is needed between the executing instances, called threads.

There are many models available for simplifying the implementation of programs for multicore processors; these models are evaluated and compared in this thesis. The models examined are Pthreads, OpenMP and Cilk++.

These models differ from each other in many ways, and are therefore useful in different areas. While Pthreads can be a good tool when it is important to see the threads and their mechanisms clearly, or when high flexibility is required, OpenMP's main strengths are its simple interface and good portability, which means that implementations can easily be moved between different operating systems. Cilk++ is a good fit when high performance is the most important factor.

Contents

1 Introduction
  1.1 Acknowledgements
  1.2 Background
  1.3 Objective
  1.4 Limitation
  1.5 Motivation
  1.6 Thesis outline

2 The parallel universe
  2.1 Execution concepts
    2.1.1 Processes
    2.1.2 Threads
    2.1.3 Tasks
  2.2 Introduction to parallelism
  2.3 Synchronization mechanisms
    2.3.1 Mutual exclusion mechanisms
    2.3.2 Conditional synchronizing mechanisms
  2.4 Multicore hardware
    2.4.1 Multiprocessors
    2.4.2 Distributed systems
  2.5 Parallel programming paradigms
    2.5.1 Iterative parallelism
    2.5.2 Recursive parallelism
    2.5.3 Producers and consumers paradigm
    2.5.4 Interacting peers
    2.5.5 Servers and clients paradigm
  2.6 Parallel programming models
    2.6.1 Pthreads
    2.6.2 OpenMP
    2.6.3 Cilk++

3 Methodology
  3.1 Approach
    3.1.1 Studying the model
    3.1.2 Using the model to parallelize programs
    3.1.3 Evaluation of the model
    3.1.4 Analysis of the evaluation
  3.2 Evaluation criteria
    3.2.1 Quantitative evaluation
    3.2.2 Qualitative evaluation
  3.3 Programs
    3.3.1 Primes calculation
    3.3.2 Adaptive quadrature
    3.3.3 Post office simulation
    3.3.4 Game of life

4 Procedure
  4.1 Working environment
    4.1.1 Hardware environment
    4.1.2 Software environment
  4.2 Parallelization targets/goals
  4.3 Pthreads
    4.3.1 The Pthreads environment
    4.3.2 Pthreads parallelization of Prime
    4.3.3 Pthreads parallelization of Adaptive quadrature
    4.3.4 Pthreads parallelization of Post office simulation
    4.3.5 Pthreads parallelization of Game of life
  4.4 OpenMP
    4.4.1 The OpenMP environment
    4.4.2 OpenMP parallelization of Prime
    4.4.3 OpenMP parallelization of Adaptive quadrature
    4.4.4 OpenMP parallelization of Post office simulation
    4.4.5 OpenMP parallelization of Game of life
  4.5 Cilk++
    4.5.1 The Cilk++ environment
    4.5.2 Cilk++ parallelization of Prime
    4.5.3 Cilk++ parallelization of Adaptive quadrature
    4.5.4 Cilk++ parallelization of Post office simulation
    4.5.5 Cilk++ parallelization of Game of life

5 Results
  5.1 Prime
  5.2 Adaptive quadrature
  5.3 Post office
  5.4 Game of life

6 Analysis
  6.1 Analysis of quantitative evaluation
    6.1.1 Prime
    6.1.2 Adaptive quadrature
    6.1.3 Post office simulation
    6.1.4 Game of life
  6.2 Analysis of qualitative evaluation


List of Figures

1  Process overview
2  Thread overview
3  Task overview
4  A race condition example
5  Semaphores
6  Overview of multiprocessor architecture
7  Pseudo code for Prime calculation program
8  Pseudo code for Adaptive quadrature program
9  Pseudo code for Post office simulation
10 Pseudo code for Game of life
11 Call to Pthreads variant of Adaptive quadrature function
12 OpenMP parallelization of a for loop
13 OpenMP parallelization of recursive calls
14 OpenMP parallelization of Adaptive quadrature using untied tasks
15 OpenMP parallelization of Post office simulation
16 Cilk++ parallelization of a for loop
17 Cilk++ example using cilk_spawn and cilk_sync
18 Cilk++ parallelization of Prime
19 Cilk++ parallelization of Adaptive quadrature
20 Speedup on Prime calculation program
21 Speedup on Adaptive quadrature program
22 Speedup on Adaptive quadrature using additional threads
23 Speedup on Adaptive quadrature using OpenMP tasks
24 Speedup on Post office simulation program
25 Speedup on Game of life program

List of Tables

1 Programs used to evaluate the parallel programming models
2 Programs used for evaluation and the targeted parallel paradigm
3 Results of parallelizing the Prime calculation program
4 Results of parallelizing the Adaptive quadrature program
5 Results of running Adaptive quadrature using additional threads
6 Results of parallelizing the Adaptive quadrature program using OpenMP tasks
7 Results of parallelizing the Post office simulation
8 Results of parallelizing the Game of life program

1 Introduction

1.1 Acknowledgements

I would like to thank Mats Brorsson for giving me the opportunity of working on this thesis and for helping me along the way. I am also grateful for getting to do my work at SICS and to use their environment.

I would also like to thank everyone working or doing their thesis at the multicore centre at SICS during this time, as they provided useful tips and inspired me with their work.

Finally I want to acknowledge my family and friends who have been very supportive during the time I have been working on this thesis.

1.2 Background

The need for better performance in computers is always a pressing issue, and there are several ways to achieve this, e.g. better hardware and more efficiently written programs. More specifically one could develop a processor that can execute more instructions per second or a bus with higher bandwidth, come up with faster algorithms, distribute calculations to several computers et cetera. In reality a combination of most of these solutions is used. However, every area has its own problems and obstacles. Faster processors have up until recently been the main contributor to a faster computer as a whole, but this is no longer the case. The nowadays well-known, and surprisingly accurate, Moore's Law shows us that the number of transistors that can be placed on an integrated circuit doubles every second year. Despite this, the increase of CPU speed has been stagnating during the last decade, due to heating, heat dissipation and similar problems. To deal with this, multiprocessors, with several CPUs in one computer, and multicores, with several cores in one CPU, have been developed to continue increasing the computing power. Thanks to this, several processes, possibly part of the same program, can be executed simultaneously - in parallel. This results in other difficulties, mainly synchronization of data shared between the processes.

1.3 Objective

This thesis focuses on how to improve programs by designing them for parallel execution on multiprocessor and multicore machines. There are many different models available that can be used to parallelize programs. They differ in several ways such as syntax, available tools, scalability, whether they make use of threads or processes, shared memory or channels (network), portability and so on. Because of this it can be hard to figure out which model fits a certain task or program. The models compared in this thesis are Pthreads, OpenMP and Cilk++. They will be used to parallelize sequential programs written in C.

The main questions the thesis tries to answer are:

• What are the characteristics of the models and how do they differ from each other?

1.4 Limitation

The thesis will not

• evaluate any kind of hardware
• evaluate compilers
• focus on fine-tuning programs for certain compilers or architectures
• evaluate any tools provided with the models

1.5 Motivation

The benefits of increased knowledge about parallelization techniques are huge, as can be seen in many different fields and applications: easier and faster parallel development, better and simpler syntax, serial versus parallel execution, optimization of already parallel code, high-performance computing, energy savings etc.

One of the most important areas to focus on is programmers' ability to produce well optimized parallel programs. This will then reflect on other areas such as high performance computing or (user) program responsiveness[9]. Possible ways to increase this ability of a programmer are to offer a clean and simple interface for thread initialization and data synchronization, or to try to build automatic parallelizing compilers. With easier tools available to the programmer, programs can be optimized faster and will follow a more general standard. This leads to great savings as less education is needed and less time is spent on programming threads, data synchronization and so on. Another economical benefit is that using several low-end processors can now give the same performance as a single high-end processor, but at a lower price. Thus many companies can save money by using parallel or distributed solutions.

The most apparent technical benefit is in the area of personal computers. The share of PCs equipped with multiprocessors is steadily increasing and thus a large number of people and companies could take advantage of (more) parallel applications. The parallelization improves performance on several levels, e.g. responsiveness in GUI applications can be largely increased just by having a dedicated thread for handling graphics, and threads that are waiting for I/O can be switched with another thread instead of another process, so less time is spent on heavy context switching. Yet another technical area is high-performance computing, which includes cluster-based supercomputers, data warehouses and grid computing, all of which require very well-formed and very optimized synchronization mechanisms. Multicore processors are also often more energy efficient and use less memory[9].

Which model fits which type of program or company (organization) best is going to be investigated in this comparison.[2, 7, 13]

1.6 Thesis outline

In Section 1 the background to this thesis and to parallel computing was given. It also summarized the thesis' objectives and gave a motivation for why parallel programming models can be of great importance.

The thesis will continue by giving an introduction to different parallel concepts, hardware and paradigms in Section 2. Thereafter the models used for parallelization of programs are presented.

In Section 3 the methodology used in this thesis is dealt with; the approach taken and the criteria used for evaluation are discussed.

Section 4 contains documentation and comments regarding the parallelization procedure of the programs according to the different parallel programming models used.

Section 5 presents the results of the benchmarks made on the programs. In Section 6 the models will be analyzed with respect to the criteria presented earlier.

Section 7 presents the conclusions of this thesis.

2 The parallel universe

In this section some of the basic characteristics of the multicore and parallel programming environment will be introduced.

First some concepts of concurrent execution are presented and then there will be an overview of parallelism and mechanisms for thread synchronization. Some examples of concurrent hardware architectures and parallel programming paradigms are also given.

Finally the programming models investigated in this thesis are presented.

2.1 Execution concepts

2.1.1 Processes

Figure 1: Process overview

A process is generally an instance of program code. It contains the program code it is executing and data associated with the code. It can be thought of as having a virtual CPU, since the process contains all the data that is needed in a CPU: a program counter, registers, and variables. Whether this virtual CPU is currently running or not is decided by the scheduler.

There are three main states in which a process can be. A process which is currently executing is said to be in the running state. On a processor with only one core (see Section 2.4 for more information about hardware architectures) at most one process can be in the running state at a time. A process waiting for an I/O operation or to gain access to a currently locked resource is in the blocked/waiting state. When a process is in the ready state it is waiting to gain access to the processor, i.e. to be switched to the running state by the system's scheduler.

2.1.2 Threads

Figure 2: Thread overview

Threads can roughly be described as a lightweight version of processes. A thread resides inside a process and shares some of the process’ memory areas, and also has private ones. Some of these private—also called local—memory areas are a stack area, registers and a program counter. Threads usually have the same running states as processes.

Several threads can exist inside one process, which allows the programmer to distribute work amongst the threads in a more dynamic and intuitive way. Consider some part of the program being blocked, e.g. waiting for an I/O operation to return. If that part is put in a separate thread, another thread in the same process can be scheduled to run, instead of blocking the whole process. With this refined picture of processes and threads, the process concept in the previous section can be described as a single-threaded process.

2.1.3 Tasks

Figure 3: Task overview

A task is something that needs to be done. Common tasks are execution of a function, a loop or other code region, explicitly defined by the programmer. Tasks are assigned to and carried out by threads. A task can either be tied or untied. When using tied tasks a task gets locked to a thread at thread creation. When the thread is done executing the task it will terminate. Untied tasks on the other hand are added to an already running thread’s task queue. Using untied tasks can reduce overhead in programs where threads are frequently created and destroyed, due to executing very small tied tasks. [11, 7]

2.2 Introduction to parallelism

Parallel programs are programs that use several threads to achieve speedup by running the threads on different processors simultaneously. The threads work independently of but also dependently on each other. The dependency occurs when threads need to communicate. Often communication is needed to synchronize the threads—to reach consensus amongst them—so that it is clear what needs to be done and by whom. This is crucial to the correctness of the program. There are several types of correctness in parallel programs; two which are often mentioned are:

• Partial correctness – A property of a program that computes the desired result, provided the program terminates.

• Total correctness – A property of a program that has partial correctness and terminates.

As stated, a program executing in parallel has several threads executing simultaneously. Depending on environmental circumstances—hardware, scheduling policy, etc.—as well as momentary ones—user input, CPU load, network usage—the execution order can differ a lot between program runs. This execution order is often called the program's history or trace. For a program to have correctness the correctness property has to hold for every possible history of the program. To achieve this, locks and other synchronization mechanisms need to be introduced in the program.[2]

The independent parts of the threads’ execution are what allow parallelism. Therefore a main goal of parallel programming is to maximize the amount of code that can be executed without having to synchronize with other parts of the program. This means minimizing the amount of synchronization needed while still not risking the correctness of the program—this is the main issue when it comes to parallelizing programs.

The above is expressed mathematically in Amdahl’s Law, which defines the speedup S gained on n processors as

S = 1 / ((1 − p) + p/n)

where p is the fraction of the code which can be executed in parallel.[7]
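For example, if 95 % of a program's execution can be parallelized (p = 0.95), the speedup on eight processors is at most S = 1/(0.05 + 0.95/8) ≈ 5.9, so even a small sequential fraction limits the achievable speedup considerably.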

2.3 Synchronization mechanisms

Figure 4: A race condition example

Sequential execution:
i ≡ 0
T1: Read i = 0 from memory
T1: Edit i = i + 1 locally
T1: Write i = 1 to memory
T2: Read i = 1 from memory
T2: Edit i = i + 2 locally
T2: Write i = 3 to memory
i ≡ 3

Parallel execution:
i ≡ 0
T1: Read i = 0 from memory
T2: Read i = 0 from memory
T1: Edit i = i + 1 locally
T2: Edit i = i + 2 locally
T1: Write i = 1 to memory
T2: Write i = 2 to memory
i ≡ 2

What has been described here is the so-called critical section problem. A critical section is a part of the code which must not be accessed by more than one thread at once. Otherwise the program is said to contain a race condition. A critical section can contain one non-atomic operation, or several atomic or non-atomic operations. When there is a race condition in a program the result will depend on the execution order of the program's threads, and thus not fulfill the requirements for the partial correctness property. A concept or implementation that allows code to be executed by at most one thread at a time, providing so-called mutual exclusion, is often called a mutex. Such implementations are presented in Section 2.3.1.

Other important synchronization concepts are condition variables and barriers, which are used when threads must wait for a certain condition to be fulfilled. Barriers are presented in Section 2.3.2.

2.3.1 Mutual exclusion mechanisms

In this section different methods of providing mutual exclusion will be described.

Locks  A lock is conceptually one of the most basic ways to achieve mutual exclusion. It only has two states: the initial unlocked state and the locked state. A lock call can either be wait-free or blocking. If a lock call is wait-free, the lock operation will return a value which indicates whether the lock was already locked or not. If a lock call is blocking and the lock is already locked, the thread will reside in the lock call until the lock is available to the calling thread. While the thread is waiting it will either yield (abandon the processor) or retry to acquire the lock until it is switched out by the scheduler. In the latter case the lock is said to be a spinlock.[13]

Locks are usually implemented with an atomic instruction called test-and-set. A lock which only uses the test-and-set instruction without any modification is called a test-and-set lock. Some variants of it are the test-and-test-and-set lock and the (exponential) backoff lock.[7]
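As an illustration of the test-and-set idea (a minimal sketch, not code from the thesis programs, assuming the __sync atomic builtins provided by GCC):

#include <stdio.h>

static volatile int lock = 0;              /* 0 = unlocked, 1 = locked */

static void spin_lock(volatile int *l)
{
    /* Atomically write 1 and get the old value back; keep retrying (spinning)
     * as long as the old value shows that the lock was already taken. */
    while (__sync_lock_test_and_set(l, 1))
        ;                                  /* busy-wait, i.e. spinlock behaviour */
}

static void spin_unlock(volatile int *l)
{
    __sync_lock_release(l);                /* atomically reset the lock to 0 */
}

int main(void)
{
    spin_lock(&lock);
    printf("inside the critical section\n");
    spin_unlock(&lock);
    return 0;
}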

Semaphores  A semaphore holds one value, which is a nonnegative integer. It has two operations, called P and V. P decrements the value of the semaphore—assuming that the value is larger than zero—and V increments it. Both operations are atomic and can be defined as in Figure 5.

Figure 5: Semaphores

P(s):
    wait for value to be positive
    decrement value

V(s):
    increment value

Common semaphore concepts are general semaphores, binary semaphores and split binary semaphores. A general semaphore can take any non-negative value, while a binary semaphore only takes the values 0 or 1.

To provide mutual exclusion with semaphores, the semaphore protecting the critical section is initially set to 1. When a thread wants to enter the critical section it uses the P operation, executes the critical section and then uses the V operation.[2]
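As a concrete illustration (a minimal sketch, not code from the thesis programs), the POSIX semaphore interface maps the P and V operations onto sem_wait and sem_post:

#include <semaphore.h>
#include <stdio.h>

static sem_t mutex_sem;   /* binary semaphore protecting the critical section */
static int counter = 0;   /* shared data accessed inside the critical section */

static void increment(void)
{
    sem_wait(&mutex_sem);    /* P: wait for the value to be positive, then decrement it */
    counter++;               /* critical section */
    sem_post(&mutex_sem);    /* V: increment the value, letting one waiting thread proceed */
}

int main(void)
{
    sem_init(&mutex_sem, 0, 1);   /* initial value 1 gives mutual exclusion */
    increment();
    printf("counter = %d\n", counter);
    sem_destroy(&mutex_sem);
    return 0;
}

The program is linked with -pthread; with several threads calling increment, the semaphore guarantees that at most one of them is inside the critical section at a time.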

Passing the baton technique Unlike a lock’s lock and unlock operations, a semaphore’s P and V operations do not have to be executed in pairs in the same thread. A lock can not be unlocked by any other thread than the one that locked it. A semaphore on the other hand is never tied to a specific thread, and so it can be operated on by any thread at any time. This allows more fine-grained actions when it comes to thread management. Passing the baton is generally the technique of doing a V operation on a different semaphore than the one which the thread most recently did a P operation on. The goal of this technique is often to give a specific thread or group of threads access to a critical region rather than to any thread in the program.[2]

2.3.2 Conditional synchronizing mechanisms

Barriers  Sometimes it is necessary for the threads not to continue execution until every thread has executed the previous code section. In such a case a barrier is useful. It will keep track of how many threads have arrived at it, and does not let any of them through until all of them have arrived.[2]

2.4 Multicore hardware

2.4.1 Multiprocessors

Figure 6: Overview of multiprocessor architecture

This allows threads to communicate through shared variables in the memory. Therefore this kind of architecture is often called a shared-memory multiprocessor. Because the memory is shared by the threads, access to shared variables has to be done in a synchronized manner, as described in Section 2.3.[2]

2.4.2 Distributed systems

Distributed systems consist of several computers using a network to exchange messages, instead of using shared memory. The message passing interface (MPI) is often available through library routines, and the computers in the system can run different operating systems and use different architectures.[2]

2.5 Parallel programming paradigms

2.5.1 Iterative parallelism

Programs using an iterative style are often the result of parallelizing loops. The main characteristics of a program using iterative parallelism are that the threads often are homogeneous and work together to solve a single problem. They usually communicate with shared variables or message passing. Iterative parallelism is most commonly used in the area of scientific computing.[2]

Bag of tasks  Instead of distributing work immediately when the loop is encountered, a bag of tasks can be used to let the threads dynamically fetch work tasks while inside the loop. This provides better load balancing, especially when executing on a NUMA architecture, as threads that execute faster than the others will get to execute more tasks. However, this technique comes at the price of a slight synchronization overhead, as modification of the bag has to be done with mutual exclusion. To minimize the time threads are waiting to gain access to the bag, the task size has to be chosen with some consideration.

2.5.2 Recursive parallelism

Recursion is a concept that involves a function calling itself, often several times. For each call the task which the function has been assigned is made easier, smaller or more refined. Every call will wait until all its child calls have returned.

When parallelizing a program that uses recursive calls, the result is recursive parallelism. Often a new thread is created for every new recursive function call, with some restriction to keep the overhead at reasonable levels. One important precondition for using recursive parallelism is that the calls to the recursive function are independent, i.e. the threads work on different parts of the shared data. Recursive parallelism is especially useful in logic and functional programming languages since they already use the recursive paradigm, but it is also very common in imperative programs that use divide-and-conquer algorithms. Important areas that use recursive parallelism are sorting and scheduling.[2]

2.5.3 Producers and consumers paradigm

The producers and consumers paradigm is a communication paradigm involving two or more different sets of threads. The producers usually compute something and then output results to one or several consumers. The consumers read the output from the producers, analyze the data and possibly execute depending on the results.

2.5.4 Interacting peers

Interacting peers are threads or processes that execute the same or similar code, but on different data, and then synchronize by exchanging messages with each other. The messages are either sent to all other threads or to a subset of them. Interacting peers are often used in decentralized distributed programs.[2]

Heartbeat algorithm  This algorithm is an interacting peers variant where the processes work in three steps repeatedly: 1) Send messages to others, 2) Receive messages from others, 3) Compute according to local and received data.

If the heartbeat algorithm is run on a machine with shared memory, steps (1) and (2) are executed through reads and writes, and after computation the threads are synchronized at a barrier (see Section 2.3.2).

2.5.5 Servers and clients paradigm

Servers and clients communicate through requests and replies. A client sends a request to a server and then waits for an answer. A server waits for requests and then executes according to the request. The server itself can either be multithreaded and execute several requests in parallel, or run on a single thread. [2]

2.6 Parallel programming models

This section briefly explains the different parallelization models that are evaluated in this thesis: the story behind them, their characteristics and which organizations and applications make use of them today.

2.6.1 Pthreads

Pthreads is a shortening of POSIX threads, and as such Pthreads is available on most computers running Unix or any flavor thereof.[2] Most of its functions are also available under Windows using Pthreads-win32[8].

The Pthreads library is used by Apache and MySQL[9].

2.6.2 OpenMP

OpenMP tries to move the programmer's focus away from having to program the parallelism and synchronization explicitly. Instead the programmer should only have to tell the compiler which code blocks can be executed in parallel, and how it should be done. This is done by letting the programmer insert keywords—called directives—in sequential code. Development of OpenMP began in the latter half of the 1990s and a first set of directives for Fortran was introduced in 1997. Directives for C and C++ followed shortly thereafter. The directives are expressed with comments (Fortran) or pragmas (C/C++) so that they will be ignored by a compiler which is not using OpenMP. OpenMP supports parallelization of programs under both Unix and Windows. [4]

OpenMP with tasks  With the most recent version of OpenMP a new way of parallelization is available—called the task construct—which allows the programmer to declare and add tasks that can be executed by any thread, regardless of which thread encounters the construct, i.e. an implementation of the task concept.[12]

2.6.3 Cilk++

The concept of Cilk++ is very much like that of OpenMP. Cilk++ is an extension to C++ in which the programmer inserts keywords into sequential code to tell the compiler which parts of the code should be executed in parallel. Development of Cilk++ started in 2009 and is based on technology from Cilk[5], a parallel programming model for C. Cilk++ also provides some additional tools like performance analysis and the race condition detector Cilkscreen. Cilk++ has support for both Linux and Windows.[10, 14]

3 Methodology

In this chapter the outline of this thesis’ methodology will be described. First there will be a presentation of the approach. The evaluation criteria for the models are then described and finally there is a short overview of the programs which the models will be applied on.

3.1 Approach

This section will give a brief description of the different research steps in this thesis.

3.1.1 Studying the model

First the models presented in Section 2.6 are going to be studied, so that when parallelization of the programs begins it is reasonably clear which modifications need to be done and how to do them. The reading material will mostly consist of documentation provided with the models and books about the models or parallel programming in general[2, 3, 4, 14].

3.1.2 Using the model to parallelize programs

When the model has been studied it is going to be used to parallelize the programs listed in Section 3.3. These programs have been chosen carefully to cover most of the styles—listed in Section 2.5—that parallel programs use.

3.1.3 Evaluation of the model

While the model is being used for parallelization it will be evaluated according to the qualitative evaluation criteria in Section 3.2.2. It will thus be a continuous evaluation of the model, which in the end will be summarized and used in the analysis.

When the parallelization is done the programs are going to be benchmarked and evaluated by the quantitative evaluation criteria listed in Section 3.2.1.

3.1.4 Analysis of the evaluation

The results of the qualitative evaluation are then going to be analyzed and discussed. Finally some conclusions about the models can hopefully be drawn, with respect to the evaluation criteria.

3.2 Evaluation criteria

The evaluation criteria have been developed together with the examiner and by looking at desirable properties of both programs in general and of parallel programs[2, 3, 4].

3.2.1 Quantitative evaluation

Run time  The real clock time interval between program start and program termination. One of the main goals of parallelization is to minimize this.

Wall time  The time the program (threads) actually spent executing, i.e. the time which the program spent in the processor(s). Two different kinds of code are executed during program runs – user code and operating system code. Generally, execution of operating system code, through system calls, is done when managing threads, and the code which the threads execute is user code. Measuring both of these will give some information about how much overhead comes with the model's thread management, and how it increases with the number of threads.

Speedup The speedup of a program shows how much performance it gained from being parallelized. The speedup is calculated by dividing the run time for the serial program by the run time for the parallel one: Ts/Tp.
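For example, in Table 3 the sequential Prime program has a run time of 6330 ms while the Pthreads version on eight threads has a run time of 1047 ms, which gives a speedup of 6330/1047 ≈ 6.05.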

Efficiency  How efficiently the processors are being used. Can easily be explained by its formula: Ts/(Tp · p), which is equal to the speedup divided by the number of processors.

3.2.2 Qualitative evaluation

This section describes the qualitative evaluation, which is more subjective and deals with factors that describe how easy the model is to work with. The models in this thesis will be evaluated according to the following criteria:

Portability  There are different kinds of portability, but the one considered most important in this thesis is operating system portability. Other kinds of portability are binary portability and architecture portability. Having good operating system portability means that only minor modification is needed to compile the code on different operating systems.

Documentation A very important tool when programming is documentation of the functions and the environment that are used. Accessibility and detail of the model’s documentation is going to be looked at when applying it. The feedback from the program environment, e.g. warnings and errors, will also be taken into consideration.

Modification  Depending on which model is used for parallelization, different amounts of modification to the sequential program are required. This criterion mainly concerns the amount of modification needed, not necessarily how hard it is, as that is mostly covered in the Complexity criterion.

Complexity  Before using a model for the first time one has to gain some knowledge about its functions, the syntax, paradigms etc. The complexity of a model considers how deep knowledge one has to gain before being able to implement a function, or a set of functions.

3.3 Programs

3.3.1 Primes calculation

The primes calculation program is an iterative program that mainly consists of a for loop. It tests all integers in a given interval using the Miller-Rabin primality test. To deal with large numbers the GNU Multiple Precision library[1] (GMP) is used.

When running the program the variables startnumber, maxnumber and certainty can be defined. startnumber and maxnumber define the interval. The Miller-Rabin primality test can return either false, which means that the number is definitely composite, or true, which means probably prime. This means that there is no risk of the test returning a false negative. However, the probability of a false positive—that the probable prime is in fact composite—is 4^(-certainty).
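With certainty set to 10, for example, the probability of a false positive is 4^(-10) = 1/1048576, i.e. less than one in a million.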

Pseudo code for the main functionality of the Prime calculation program is shown in Figure 7.

Figure 7: Pseudo code for Prime calculation program

for(i = startnumber; i <= maxnumber; i++) {
    do Miller-Rabin test on i with certainty certainty;
}

Note: the inner functionality of the Miller-Rabin test will not be subject to any parallelization.

3.3.2 Adaptive quadrature

Adaptive quadrature calculates the integral of a given function by using the trapezoidal rule, which is an approximation of Simpson's rule. To achieve a good enough answer the program splits the interval in two and makes recursive calls until the margin of error has been reduced to a given level.

The user can specify epsilon before running the program, which is the error tolerance of the integration.

Figure 8: Pseudo code for Adaptive quadrature program

quad(left, right, fleft, fright, lrarea) {
    mid = (left + right) / 2;
    fmid = function(mid);
    larea = (fleft + fmid) * (mid - left) / 2;
    rarea = (fmid + fright) * (right - mid) / 2;
    if(abs((larea + rarea) - lrarea) > epsilon) {
        larea = quad(left, mid, fleft, fmid, larea);
        rarea = quad(mid, right, fmid, fright, rarea);
    }
    return larea + rarea;
}

The code can be interpreted as "continue integrating with smaller intervals until the difference between integrating the whole interval [a, a + 2b] and integrating the intervals [a, a + b] and [a + b, a + 2b] separately is smaller than epsilon".

3.3.3 Post office simulation

The post office simulation program consists of two classes of actors: one class, called the consumer, fetches mail from its own, personal buffer (mailbox) and delivers the mail. The other class, called the producer, adds mail to each of the consumers' buffers. Pseudo code of the Post office simulation is shown in Figure 9.

Figure 9: Pseudo code for Post office simulation

producer() {
    for(i = 0; i < rounds; i++) {
        do work;
        for(n = 0; n < consumers; n++) {
            add mails to consumer n's buffer;
        }
    }
}

consumer(id) {
    for(i = 0; i < rounds; i++) {
        take from my buffer;
        do work;
    }
}

3.3.4 Game of life

Game of life is an implementation of the cellular automaton Conway’s Game of Life. When the game starts a number of cells on a two-dimensional surface are declared to be alive. Over some iterations, called generations, the following rules are applied to the cells[2]:

• A live cell with zero or one live neighbors dies from loneliness

• A live cell with two or three live neighbors survives for another generation
• A live cell with four or more live neighbors dies due to overpopulation
• A dead cell with exactly three live neighbors comes to life.

The new state of a cell must not be applied until the new state of every other cell in the game has been decided. Therefore it is suitable to save the new state in some other place while investigating the old one.

Figure 10: Pseudo code for Game of life

for(g = 0; g < generations; g++) {
    for(i = 0; i < size; i++) {
        for(j = 0; j < size; j++) {
            calculate livingneighbors for cell (i,j);
            set cell (i,j) in other dataset according to livingneighbors;
        }
    }
    switch working dataset;
}

3.3.5 Summary of the programs

A summary of the evaluation programs used in this thesis and their respective paradigm can be seen in Table 1.

Table 1: Programs used to evaluate the parallel programming models

Program                  Sequential programming paradigm
Primes calculation       Iterative
Adaptive quadrature      Recursive
Post office simulation   Iterative
Game of life             Iterative

4 Procedure

In this section the parallelization of the programs is documented. First some details about the working environment are given and then the parallelization procedures of the programs according to each of the models are presented.

4.1 Working environment

4.1.1 Hardware environment

The machine used for programming and tests is equipped with dual Quad-Core AMD Opteron processors with 512 kB L1 cache and 16 GB of memory. This means that a total of eight cores running at 2.3 GHz are available for the programs to run on.

4.1.2 Software environment

The operating system used is Ubuntu Linux with the 2.6.28-18-generic kernel and support for symmetric multiprocessing (SMP).

Three compilers will be used for compiling. One is the C compiler from the GNU Compiler Collection (GCC) since it is a very widely used compiler. For compiling Cilk++ the Intel Cilk++ SDK is going to be used, which is a wrapper compiler around GCC and is the only compiler available for Cilk++. Finally, for testing and benchmarking non-Cilk++ programs the Intel C++ Compiler (ICC) will be used—which despite its name is a compiler collection of C and C++ compilers—since it should provide similar optimizations as the Cilk++ SDK compiler.

4.2 Parallelization targets/goals

The programs listed in Section 3.3 are chosen carefully to cover most of the paradigms from Section 2.5. In Table 2 the programs are shown complemented with their respective targeted parallel programming paradigm.

Table 2: Programs used for evaluation and the targeted parallel paradigm

Program                  Sequential programming paradigm   Parallel programming paradigm
Primes calculation       Iterative                         Iterative (Bag of tasks)
Adaptive quadrature      Recursive                         Recursive
Post office simulation   Iterative                         Producers/Consumers
Game of life             Iterative                         Interacting peers (Heartbeat algorithm)

4.3 Pthreads

4.3.1 The Pthreads environment

The Pthreads library is included by default with GCC. To enable Pthreads support in a program the header file pthread.h has to be included, and at compile time the flag -lpthread has to be passed to the compiler's linker.

When creating a thread it immediately has to be associated with a function. This function has to be of the type void * function(void *) and is sent as an argument to the thread creation call, i.e. untied tasks are not supported.

4.3.2 Pthreads parallelization of Prime

As previously stated, a function has to be specified when creating a thread. Therefore the part of the program that is parallelizable—i.e. can run in parallel—has to be moved to a separate function to work with Pthreads. In the Prime program that part is the for loop which investigates one integer at a time; there is no problem with several integers being investigated simultaneously. This modification is not very complex, thanks to the fact that no communication is required when the threads are created or terminate, and so the thread function does not need any input.
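A minimal sketch of this pattern, creating a number of threads that each run such a function and waiting for them to terminate, could look as follows (the function body, the thread count and the printout are illustrative and not the thesis code):

#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4

/* The thread function must have exactly this signature. */
static void *worker(void *arg)
{
    printf("thread %ld running\n", (long)arg);
    return NULL;
}

int main(void)
{
    pthread_t threads[NUM_THREADS];
    long t;

    for (t = 0; t < NUM_THREADS; t++)
        pthread_create(&threads[t], NULL, worker, (void *)t);  /* start the threads */

    for (t = 0; t < NUM_THREADS; t++)
        pthread_join(threads[t], NULL);                        /* wait for them to terminate */

    return 0;
}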

Some of the variables used have to be made global, since the threads can reach neither the main thread's nor the other threads' local variables. The most important of those variables are the bag of tasks and the largest number to test. A variable holding the grainsize is also added to the global variables. The grainsize specifies how many new tasks should be fetched from the bag of tasks by a thread when it is done executing all the previously fetched tasks.

The largest modification needed is the bag of tasks implementation. In this program the bag of tasks simply holds the next number to be tested. At program start the bag of tasks is set to the lowest number that shall be tested. The threads fetch the number of tasks specified by the grainsize variable and increment the bag of tasks variable by the same number. This continues until every number up to the largest number has been tested. A variant of the for loop used in the sequential program—the actual for loop uses GMP functions—can be seen below:

for(i = startnumber; i <= maxnumber; i++)

and the bag of tasks implementation will then be:

for(i = nexttask; i <= maxnumber; i = nexttask)

where nexttask is the next value in the bag of tasks to test and is set to startnumber prior to thread creation. The reason for not setting the loop variable i to startnumber in the first expression is that this would result in every thread testing startnumber, when it should only be tested by the thread that is first to encounter the loop. In addition to the modification of the loop expression, a lock has to be added to provide mutual exclusion when modifying the bag of tasks.
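A sketch of such a lock-protected bag of tasks is shown below. The variable names nexttask, grainsize and maxnumber follow the description above, while the helper function and the use of plain long arithmetic (the real program works with GMP numbers) are simplifications made for illustration:

#include <pthread.h>

static pthread_mutex_t bag_lock = PTHREAD_MUTEX_INITIALIZER;
static long nexttask;     /* the bag of tasks: next number to test, set to startnumber */
static long maxnumber;    /* largest number to test */
static long grainsize;    /* how many numbers a thread reserves per fetch */

/* Reserves grainsize numbers from the bag and returns the first of them,
 * or -1 when every number up to maxnumber has already been handed out. */
static long fetch_tasks(void)
{
    long first;

    pthread_mutex_lock(&bag_lock);    /* modifying the bag requires mutual exclusion */
    first = nexttask;
    nexttask += grainsize;
    pthread_mutex_unlock(&bag_lock);

    return (first <= maxnumber) ? first : -1;
}

Each worker thread repeatedly calls fetch_tasks and tests the reserved numbers until -1 is returned.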

4.3.3 Pthreads parallelization of Adaptive quadrature

In the sequential program the quad function is defined as

double quad(left, right, fleft, fright, lrarea)

where left and right are the points which the mathematical function should be integrated between, and the fleft and fright are the mathematical function values at points left and right. The lrarea is the result of a previous integration of this interval that was found to have an error above the tolerable limit. The function definition should now be modified to

void * quad(void * arguments)

and a call to the function now has to be preceded by setting up the arguments structure, as seen in Figure 11.

Figure 11: Call to Pthreads variant of Adaptive quadrature function

double arguments[5] = {left, right, fleft, fright, lrarea};
quad(arguments);

and inside the function the values have to be extracted from the structure. When the modification of the quad function is done there is not much work left. As this is a strict divide-and-conquer implementation, no shared variables or mutual exclusion are needed. The only thing that has to be controlled is the number of threads created. If a new thread is created with every function call, the overhead will eventually be so large that the program either runs much slower than the sequential version or crashes. Therefore the execution is carried out sequentially if a certain limit of threads is reached. Otherwise—when it is determined that an interval has to be split up in two—a new thread is created to integrate the left half of the interval. The current thread then integrates the right half. When they are both done they merge and the function returns.

4.3.4 Pthreads parallelization of Post office simulation

The Post office simulation program has two parts which can be run in parallel, both with respect to the part itself and to the other part. These two parts have to be moved into two separate functions. Their functionality is described in detail in Section 3.3.3. In the first part—the producer part—some work is done and then data is added to each of the buffers. This functionality is moved to the function

void * producer(void * argument)

and the second part—the consumer—which reads from its own buffer and then does some computation is moved to

void * consumer(void * argument)
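One possible shape of such a per-consumer buffer, protected by a Pthreads mutex and condition variables, is sketched below; the struct, its size and the function names are assumptions made for illustration and not necessarily how the thesis program implements the buffers:

#include <pthread.h>

#define BUFSIZE 16

typedef struct {
    int             mail[BUFSIZE];   /* stored mail items */
    int             count;           /* number of items currently in the buffer */
    pthread_mutex_t lock;            /* protects mail and count */
    pthread_cond_t  not_empty;       /* signalled when mail is added */
    pthread_cond_t  not_full;        /* signalled when mail is removed */
} mailbox_t;

/* Called by the producer on each consumer's buffer. */
static void put_mail(mailbox_t *box, int mail)
{
    pthread_mutex_lock(&box->lock);
    while (box->count == BUFSIZE)                      /* wait while the buffer is full */
        pthread_cond_wait(&box->not_full, &box->lock);
    box->mail[box->count++] = mail;
    pthread_cond_signal(&box->not_empty);
    pthread_mutex_unlock(&box->lock);
}

/* Called by a consumer on its own buffer. */
static int take_mail(mailbox_t *box)
{
    int mail;

    pthread_mutex_lock(&box->lock);
    while (box->count == 0)                            /* wait while the buffer is empty */
        pthread_cond_wait(&box->not_empty, &box->lock);
    mail = box->mail[--box->count];
    pthread_cond_signal(&box->not_full);
    pthread_mutex_unlock(&box->lock);
    return mail;
}

A buffer is initialized with count set to 0 and with PTHREAD_MUTEX_INITIALIZER and PTHREAD_COND_INITIALIZER (or the corresponding _init calls) before the threads are created.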

4.3.5 Pthreads parallelization of Game of life

The Game of life program does not need any locks to run in parallel. Every thread is assigned its own subset of the whole dataset and does not interfere with the other threads' sets. But the program works with discrete states—the threads read some data from other threads' sets to decide how to update their own data. These updates cannot be done at any time, as that would result in incorrect behavior. To deal with this, the threads decide how they will update their own data by reading both their own and the others' data. They then wait for each other at a barrier, and only when all threads are done reading do they update their respective data sets.

The first modification which is needed is to create the function that the threads will run:

void * worker(void * thread_id)

The thread_id together with the number of threads running decides which and how large a dataset the thread should work on. The inner functionality of the worker function is mainly to iterate over the dataset and prepare the updates of it. This goes on until the dataset has been updated the number of times specified at program initialization, by setting the generations variable.

The only data of any importance that has to be made global to all threads is the dataset itself.
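A sketch of how the barrier step could look inside the worker function is given below. The barrier object and the helper functions are illustrative assumptions; the read-then-wait-then-update structure follows the description above, and the second barrier, which keeps a fast thread from starting the next generation before all updates are written, is also an assumption of this sketch:

#include <pthread.h>

extern pthread_barrier_t barrier;   /* initialized in main with the number of threads */
extern int generations;

/* Hypothetical helpers operating on the thread's own subset of the dataset. */
extern void prepare_updates(long thread_id);   /* read old states, compute new ones */
extern void apply_updates(long thread_id);     /* write the new states */

static void *worker(void *thread_id)
{
    long id = (long)thread_id;
    int g;

    for (g = 0; g < generations; g++) {
        prepare_updates(id);                /* only reads the shared dataset */
        pthread_barrier_wait(&barrier);     /* wait until every thread is done reading */
        apply_updates(id);                  /* now it is safe to modify the dataset */
        pthread_barrier_wait(&barrier);     /* ensure all updates are done before the next read */
    }
    return NULL;
}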

4.4 OpenMP

4.4.1 The OpenMP environment

OpenMP support is enabled by default in GCC. To enable OpenMP directives and functions in a program the header file omp.h has to be included. The flag -fopenmp has to be passed to both the compiler and the linker at compile time.

To create threads, different pragma keywords are used. The OpenMP pragmas are called directives and are usually of the form

#pragma omp construct [clause 1] [clause 2] ...

Figure 12: OpenMP parallelization of a for loop

Sequential program:

for(i = 0; i < 1000000; i++) {
    do work;
}

Parallel program:

#pragma omp parallel
{
    #pragma omp for
    for(i = 0; i < 1000000; i++) {
        do work;
    }
}

Some of the clauses available to the parallel construct are an if clause which by evaluating its expression decides whether to execute the following region in parallel or not, a private clause that contains the variables which should be copied and made local to every thread, and a shared clause that contains the variables that should be shared (global) between the threads. The for construct provides the schedule clause that defines how the work should be distributed, the private clause, and some others.

More information about OpenMP functions, constructs and clauses can be found in the OpenMP API[12].

The task construct  In previous versions of OpenMP there was no way of declaring tasks. With the most recent version a task construct was added, which allows tasks to be added to an implicit bag of tasks and then executed by any thread. For more information about tasks see Section 2.1.3. To see how useful this new directive is, the programs will first be parallelized without using the task construct. Thereafter, programs which are suitable—i.e. perform badly and could make use of tasks—will be parallelized with the task construct too.

4.4.2 OpenMP parallelization of Prime

The main part of the Prime program is the for loop in which numbers are tested. The most intuitive thing is maybe to use the for directive. However the loop expression is built up of GMP library functions, and the OpenMP for directive can only handle primitive datatypes. Therefore the for directive cannot be used. Moreover the for directive is mainly used to distribute work when the execution is predetermined. This is not the case when using a bag of tasks implementation, as the execution is dynamic and work load may not be equal between threads.

To parallelize the program a parallel region is added around the loop, and the loop expression is modified to implement a bag of tasks:

#pragma omp parallel shared(nexttask, maxnumber) \
        private(i) \
        num_threads(threads)
for(i = nexttask; i <= maxnumber; i = nexttask)

The number of threads that will execute the for loop in parallel has to be decided prior to encountering the parallel directive, by setting the variable threads. A lock also has to be added since modification of the bag of tasks has to be done with mutual exclusion.

4.4.3 OpenMP parallelization of Adaptive quadrature

When modifying the Adaptive quadrature program the target is to do the calls to the recursive function in parallel. To execute two different parts of the code in parallel the easiest way is to use the sections construct. To implement this construct the modification shown in Figure 13 is made.

Figure 13: OpenMP parallelization of recursive calls

Sequential code:

larea = quad(arguments for left interval);
rarea = quad(arguments for right interval);

Parallel code:

#pragma omp parallel sections shared(interval variables)
{
    #pragma omp section
    larea = quad(arguments for left interval);
    #pragma omp section
    rarea = quad(arguments for right interval);
}

Every code region in a section construct is executed by exactly one thread, and can be executed in parallel with every other section region in the same sections construct. Sections constructs in the same parallel region cannot be nested, i.e. a sections construct cannot reside inside another. Therefore the parallel construct is also added, since this will create a new thread and so the two threads can execute one section each. But this leads to the same problem as in the Pthreads implementation of Adaptive quadrature—with every new recursive call a new thread will be created and with it some overhead. This means that a check similar to the one in the Pthreads implementation has to be introduced. If the total number of threads reaches a certain limit, the subsequent calls will be executed sequentially until the number of threads once again drops below the limit.
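A hedged sketch of how such a limited, nested sections version of quad could look is shown below. The thread counter, the limit, the critical sections around the counter and the call to omp_set_nested are assumptions made so that the example is self-contained; they are not necessarily how the thesis implementation performs the check:

#include <math.h>
#include <omp.h>
#include <stdio.h>

static double epsilon      = 1e-6;   /* error tolerance (example value) */
static int    threads_used = 1;      /* rough count of threads created so far */
static int    thread_limit = 8;      /* e.g. one thread per core */

static double function(double x) { return x * x; }   /* example integrand */

static double quad(double left, double right,
                   double fleft, double fright, double lrarea)
{
    double mid   = (left + right) / 2;
    double fmid  = function(mid);
    double larea = (fleft + fmid) * (mid - left) / 2;
    double rarea = (fmid + fright) * (right - mid) / 2;

    if (fabs((larea + rarea) - lrarea) > epsilon) {
        int spawn;

        /* Only open a new nested parallel region while below the limit. */
        #pragma omp critical
        {
            spawn = (threads_used < thread_limit);
            if (spawn)
                threads_used++;
        }

        /* When if(spawn) evaluates to false the two sections are executed by a
         * single thread, i.e. the subsequent calls run sequentially. */
        #pragma omp parallel sections if(spawn) num_threads(2) shared(larea, rarea)
        {
            #pragma omp section
            larea = quad(left, mid, fleft, fmid, larea);
            #pragma omp section
            rarea = quad(mid, right, fmid, fright, rarea);
        }

        if (spawn) {
            #pragma omp critical
            threads_used--;
        }
    }
    return larea + rarea;
}

int main(void)
{
    double a = 0.0, b = 1.0;
    double whole = (function(a) + function(b)) * (b - a) / 2;

    omp_set_nested(1);   /* allow the nested parallel regions created in quad */
    printf("integral = %f\n", quad(a, b, function(a), function(b), whole));
    return 0;
}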

Figure 14: OpenMP parallelization of Adaptive quadrature using untied tasks

#pragma omp task untied
larea = quad(arguments for left interval);
#pragma omp task untied
rarea = quad(arguments for right interval);

And because no parallel region is created inside the function the first call to the function has to be preceded by a parallel construct as follows:

#pragma omp parallel
#pragma omp single
result = quad(arguments);

The single construct is used because the quad function should only be executed once. As soon as the single thread executing the function starts to encounter the task constructs, the other threads will start executing the tasks.

4.4.4 OpenMP parallelization of Post office simulation

The OpenMP implementation of this producers/consumers program also uses the sections construct. First the inner functionality of the two parallelizable parts—the producers' part and the consumers' part—is made parallel by adding locking to the buffers and encapsulating their code regions inside parallel constructs. Thus the producers are run by writing the code as follows:

#pragma omp parallel shared(buffers) num_threads(producers)
{
    producer code
}

and the consumers as

#pragma omp parallel shared(buffers) num_threads(consumers)
{
    consumer code
}

Figure 15: OpenMP parallelization of Post office simulation

#pragma omp parallel sections shared(buffers) num_threads(2)
{
    #pragma omp section
    #pragma omp parallel shared(buffers) num_threads(producers)
    {
        producer code
    }

    #pragma omp section
    #pragma omp parallel shared(buffers) num_threads(consumers)
    {
        consumer code
    }
}

Thus after encountering the first parallel construct two threads will be working in the sections region. One of the two threads will enter the first section and the other the second section. When the parallel construct in the first section is reached a team consisting of producers number of threads will be formed to execute the producer code. The same applies to the second section but with consumers number of threads.

4.4.5 OpenMP parallelization of Game of life

The main functionality of the Game of life program is a for loop iterating over a dataset using primitive datatypes. Thanks to this only minimal modification is needed—the sequential for loop is

for(i = 0; i < size of dataset; i++)

and with the for construct it will read

#pragma omp for shared(dataset)
for(i = 0; i < size of dataset; i++)

The dataset will be iterated over a specified number of times. As stated before, the dataset may not be changed until every thread is done reading from it. So when the threads are done reading they wait for each other at a barrier before they continue with modifying their respective dataset. A barrier in OpenMP is declared as:

#pragma omp barrier

4.5 Cilk++

4.5.1 The Cilk++ environment

To compile a Cilk++ program the header file cilk.h needs to be included and the main() function renamed to cilk_main().

To parallelize programs using Cilk++, so-called cilk keywords are used. The most common keywords are cilk_for, cilk_spawn and cilk_sync. They are used to tell the running environment or scheduler that a certain piece of code or function can be run in parallel. Whether it will actually run in parallel or not is decided depending on the circumstances during execution.

The cilk_for keyword is used as shown in Figure 16.

Figure 16: Cilk++ parallelization of a for loop

Sequential program:

for(i = 0; i < 1000000; i++) {
    do work;
}

Parallel program:

cilk_for(i = 0; i < 1000000; i++) {
    do work;
}

Example implementation of the cilk_spawn and cilk_sync keywords, where function1() and function2() are functions that can be executed in parallel, is presented in Figure 17.

Figure 17: Cilk++ example using cilk_spawn and cilk_sync

cilk_spawn function1();
function2();
cilk_sync;

4.5.2 Cilk++ parallelization of Prime

Just like with OpenMP, the Cilk++ for keyword cilk_for is used together with deterministic loops that use primitive datatypes. This is not the case in the Prime program, so instead we use a hack to spawn off all the available workers, which can be seen in Figure 18.

Figure 18: Cilk++ parallelization of Prime

cilk_for(w = 0; w < available cilk workers; w++) {
    for(i = nexttask; i <= maxnumber; i = nexttask) {
        test i;
    }
}

4.5.3 Cilk++ parallelization of Adaptive quadrature

Parallelizing the Adaptive quadrature program with Cilk++ only requires two keywords. The modification is shown in Figure 19.

Figure 19: Cilk++ parallelization of Adaptive quadrature

Sequential code

    larea = quad(arguments for left interval);
    rarea = quad(arguments for right interval);

Parallel code

    larea = cilk_spawn quad(arguments for left interval);
    rarea = quad(arguments for right interval);
    cilk_sync;
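For context, a fuller sketch of what the recursive function could look like, here assuming a trapezoid-based adaptive quadrature; the name quad matches the figure, but the parameters and the body are illustrative rather than the thesis code:

    #include <math.h>

    double quad(double (*f)(double), double a, double b,
                double fa, double fb, double whole, double tol)
    {
        double m = (a + b) / 2.0, fm = f(m);
        double larea = (fa + fm) * (m - a) / 2.0;   /* trapezoid over the left half  */
        double rarea = (fm + fb) * (b - m) / 2.0;   /* trapezoid over the right half */
        if (fabs(larea + rarea - whole) <= tol)     /* accurate enough, stop here    */
            return larea + rarea;
        larea = cilk_spawn quad(f, a, m, fa, fm, larea, tol / 2);  /* left half may run in parallel */
        rarea = quad(f, m, b, fm, fb, rarea, tol / 2);             /* right half runs here          */
        cilk_sync;                                                 /* wait for the spawned call     */
        return larea + rarea;
    }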

4.5.4 Cilk++ parallelization of Post office simulation

To parallelize the Post office simulation using Cilk++ the functionality of the producers code region and the consumers code region was moved to two separate functions, producer() and consumer() respectively. By doing so the producers and the consumers were run by executing

    cilk_spawn producer();
    consumer();

And inside the functions a hack similar to the one in Prime was used:

    producer() {
        cilk_for(i = 0; i < producers; i++) {
            ...
        }
    }

and similarly in the consumer() function.

cilk_mutex.h also had to be included to support the locks needed for the consumers’ buffers.
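A rough sketch of how such a lock could be used is given below; the cilk::mutex class with lock()/unlock() members is assumed here (the exact class name should be checked against the Cilk++ Programmer's Guide), and the buffer type and insert routine are hypothetical:

    #include <cilk_mutex.h>

    cilk::mutex buffer_lock;                /* in the real program: one lock per consumer buffer */

    void add_to_buffer(buffer_t *buf, item_t item)
    {
        buffer_lock.lock();                 /* mutual exclusion around the shared buffer */
        buffer_insert(buf, item);           /* hypothetical insert routine               */
        buffer_lock.unlock();
    }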

4.5.5 Cilk++ parallelization of Game of life

As the Game of life program mainly consists of a for loop, the easiest way is to use the cilk_for keyword. It is enough to change the outer loop expression from

    for(i = 0; i < size of dataset; i++)

to

    cilk_for(i = 0; i < size of dataset; i++)
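A sketch of how this fits into the generation loop is shown below (the grid names, sizes and neighbour helper are hypothetical); note that cilk_for joins all its iterations before execution continues, so no explicit barrier is needed before the grids are swapped:

    for (int gen = 0; gen < generations; gen++) {
        cilk_for (int i = 0; i < rows * cols; i++) {
            int n = live_neighbours(cur, i);          /* hypothetical helper counting live neighbours */
            nxt[i] = cur[i] ? (n == 2 || n == 3)      /* a live cell survives with 2 or 3 neighbours  */
                            : (n == 3);               /* a dead cell becomes alive with exactly 3     */
        }
        /* all iterations have finished here, so the grids can be swapped safely */
        unsigned char *tmp = cur; cur = nxt; nxt = tmp;
    }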


5 Results

In this section the results of the quantitative evaluation will be presented. Graphs will be used to display the speedup of the parallelizations. Unless otherwise mentioned, the tests were carried out using Intel C++ Compiler 11.0 and Cilk++ (GCC) 4.2.4. The results represent the median of 10 trials and are presented in milliseconds.

In the result tables the following abbreviations are used: Sequential program - S, Pthreads parallelization - P, OpenMP parallelization - O, OpenMP with tasks - OT, Cilk++ parallelization - C, Wall time (user) - WTU, Wall time (kernel) - WTK, Run time - RT, Speedup - S, Efficiency - E.
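For reference, the speedup and efficiency values in the tables correspond to the usual definitions, with p denoting the number of threads (or cores) used:

    S = \frac{RT_{\text{sequential}}}{RT_{\text{parallel}}}, \qquad E = \frac{S}{p}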

5.1 Prime

The Prime calculation program was run with startnumber 2, maxnumber 1 000 000, certainty 10 and grainsize 10. The results of doing the quantitative test on this program are shown in Table 3.

Table 3: Results of parallelizing the Prime calculation program

             1      2      3      4      5      6      7      8
S   WTU   6300
    WTK     24
    RT    6330
P   WTU   6528   6822   6954   6890   6952   7076   7186   7982
    WTK     34     54     52     60     74     76    108    234
    RT    6557   3445   2336   1741   1411   1195   1051   1047
    S     0.97   1.84   2.71   3.64   4.49   5.30   6.02   6.05
    E     0.97   0.92   0.90   0.91   0.90   0.88   0.86   0.76
O   WTU   6802   6900   7138   7272   7422   7522   7780   8510
    WTK     24     52     70     64     70     72    116    292
    RT    6823   3473   2401   1841   1504   1278   1144   1130
    S     0.93   1.82   2.64   3.44   4.21   4.95   5.53   5.60
    E     0.93   0.91   0.88   0.86   0.84   0.83   0.79   0.70
C   WTU   6540   7010   7164   7236   7448   7510   7772   8379
    WTK     20     56     80     56     68     82     98    262
    RT    6565   3533   2414   1824   1503   1265   1129   1094
    S     0.96   1.79   2.62   3.47   4.21   5.00   5.61   5.79
    E     0.96   0.90   0.87   0.87   0.84   0.83   0.80   0.72

The table shows how the behavior of the parallelized program depends on the number of threads executing on eight cores.


Figure 20: Speedup on Prime calculation program

The graph shows how the speedup depends on the number of threads executing.

5.2 Adaptive quadrature


Table 4: Results of parallelizing the Adaptive quadrature program

              1      2      3      4      5      6      7      8
S   WTU   14789
    WTK       0
    RT    14788
P   WTU   14819  14877  14919  15779  16379  16757  18097  18809
    WTK       0      0      8     40    100    200    350    424
    RT    14817  10324  10262   9146   8213   8048   7592   7060
    S      1.00   1.43   1.44   1.62   1.80   1.84   1.95   2.09
    E      1.00   0.72   0.48   0.40   0.36   0.31   0.28   0.26
O   WTU   14983  15075  16009  16069  16861  17579  17877  18327
    WTK       0      0     22     32     70    104     90     98
    RT    14907  10475   9358   9236   7582   6457   6535   5949
    S      0.99   1.41   1.58   1.60   1.95   2.29   2.26   2.49
    E      0.99   0.71   0.53   0.40   0.39   0.38   0.32   0.31
C   WTU   14953  14981  15095  15125  15209  15219  15207  15335
    WTK       0      0      0      0      4      0      4      4
    RT    14951   7490   5035   3784   3045   2539   2175   1919
    S      0.99   1.97   2.94   3.91   4.86   5.82   6.8    7.71
    E      0.99   0.99   0.98   0.98   0.97   0.97   0.97   0.96

The table shows how the behavior of the parallelized program depends on the number of threads executing on eight cores.

Figure 21 shows the speedup gained by parallelization of the Adaptive quadrature program.

Figure 21: Speedup on Adaptive quadrature program


Pthreads and OpenMP perform very poorly in this parallelization, while Cilk++ reaches almost 100 % efficiency. Table 5 (and Figure 22) shows that one of the reasons for this is lack of proper load balancing. Apparently many threads are just waiting for other threads to return; if all threads were running, no speedup would be gained by creating more threads than there are processors available. By allowing more threads, here up to 50, to be created, better utilization of the processors is achieved. This does, however, result in poor efficiency and a large overhead.
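With OpenMP, for example, the oversubscription simply amounts to requesting more threads than cores on the parallel construct (50 being the largest count tried in Table 5):

    #pragma omp parallel num_threads(50)
    {
        /* 50 threads on 8 cores: while some threads block waiting for results, */
        /* other runnable threads keep the processors busy                      */
    }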

Table 5: Results of running Adaptive quadrature using additional threads

             10     20     30     40     50
P   WTU   20265  20461  15585  14829  14821
    WTK     760    884   1052   1028   1222
    RT     7033   4021   2591   2427   2399
    S      2.10   3.68   5.71   6.09   6.16
    E      0.21   0.18   0.19   0.15   0.12
O   WTU   17941  18955  21793    not needed since
    WTK    3300   8995  21549    speedup is going
    RT     4951   4681   6730    down
    S      2.99   3.16   2.2
    E      0.30   0.16   0.07

The table shows how the behavior of the parallelized program depends on the number of threads executing on eight cores.

Figure 22: Speedup on Adaptive quadrature using additional threads


While this gave some improvement it is far from being as good as the Cilk++ variant, as can be seen in Table 6, and perhaps even more clearly in Figure 23.

Table 6: Results of parallelizing the Adaptive quadrature program using OpenMP tasks

              1      2      3      4      5      6      7      8
OT  WTU   14967  15075  15119  15261  15383  15583  15933  16537
    WTK       0      6      8     10     18     26     72    118
    RT    14967   7607   7597   5832   5685   5275   5146   4472
    S      0.99   1.94   1.95   2.54   2.60   2.80   2.87   3.31
    E      0.99   0.97   0.65   0.63   0.52   0.47   0.41   0.41

The table shows how the behavior of the parallelized program depends on the number of threads executing on eight cores.

Figure 23: Speedup on Adaptive quadrature using OpenMP tasks

The graph shows how the speedup depends on the number of threads executing. Note: the OpenMP program using tasks was compiled with GCC, as ICC made optimizations which resulted in the program running 150 times faster than the sequential version. There was no significant difference in speedup between using GCC and ICC.

5.3 Post office


The results of running the Post office simulation with different numbers of available processors are shown in Table 7.

Table 7: Results of parallelizing the Post office simulation

              1      2      3      4      5      6      7      8
S   WTU    5094
    WTK       0
    RT     5094
P   WTU    5410   6866   6918   7512   7668   7870   7898   8750
    WTK    2198   3134   4250   4744   4836   4716   3938   4272
    RT     7605   5014   4119   3473   3336   3213   2819   2953
    S      0.67   1.02   1.24   1.47   1.53   1.59   1.81   1.73
    E      0.67   0.51   0.41   0.37   0.31   0.26   0.26   0.22
O   WTU   14974  10378  11559  11547  10997  11055  11615  13327
    WTK     950   1152   1850   2702   3058   2988   2668   3192
    RT    15952   6249   5547   4313   4195   3834   3704   3430
    S      0.32   0.82   0.92   1.18   1.21   1.33   1.38   1.49
    E      0.32   0.41   0.31   0.30   0.24   0.22   0.20   0.19
C   WTU    8443   9479   9981  11065  11135  11747  12151  12911
    WTK       0    232   2880   2836   5330   7726   9561  11099
    RT     8440   6787   5070   4758   4642   3514   3352   3292
    S      0.60   0.75   1.00   1.07   1.10   1.45   1.52   1.55
    E      0.60   0.38   0.33   0.27   0.22   0.24   0.22   0.19

The table shows how the behavior of the parallelized program depends on the number of cores available.

Figure 24: Speedup on Post office simulation program

The graph shows how the speedup depends on the number of cores available.

5.4 Game of life


Table 8: Results of parallelizing the Game of life program

              1      2      3      4      5      6      7      8
S   WTU    5066
    WTK       0
    RT     5065
P   WTU    5590   5816   5888   6134   6590   7386   8341   9333
    WTK       0      2      4      8     12      8     18     16
    RT     5591   2991   2013   1619   1411   1339   1298   1276
    S      0.91   1.69   2.52   3.13   3.59   3.78   3.90   3.97
    E      0.91   0.85   0.84   0.78   0.72   0.63   0.56   0.50
O   WTU    5944   6090   6304   6670   7112   7926   9135  10193
    WTK       0      0      6      8     12     14     24     24
    RT     5944   3046   2100   1671   1430   1332   1317   1298
    S      0.85   1.66   2.41   3.03   3.54   3.80   3.85   3.90
    E      0.85   0.83   0.8    0.76   0.71   0.63   0.55   0.49
C   WTU    4410   4466   4644   5200   5948   6864   7836   8869
    WTK       0     58    120    140    162    132    216    290
    RT     4412   2268   1599   1335   1220   1166   1150   1146
    S      1.15   2.23   3.17   3.79   4.15   4.34   4.40   4.42
    E      1.15   1.12   1.06   0.95   0.83   0.72   0.63   0.55

The table shows how the behavior of the parallelized program depends on the number of threads executing on eight cores.

Figure 25: Speedup on Game of life program


6 Analysis

6.1 Analysis of quantitative evaluation

The analysis in this section is done program-wise, since the different parallelizations of each program have much in common when benchmarking. There will first be a closer look at each program's characteristics, and then some comments about why the models differ in performance, where they do.

6.1.1 Prime

This program shows very good speedup overall. The threads can work independently most of the time; synchronization is only needed when accessing the bag of tasks. Since the bag of tasks implementation by definition offers a shared task queue, load balancing support is not required from the programming model used.

6.1.2 Adaptive quadrature

The benchmark of the Adaptive quadrature program shows how the utilization of untied tasks and work distribution can have a great impact on programs where the number of threads changes dynamically and often. Just by looking at the basic characteristics of the program it should be fully parallelizable: there are no shared variables and no synchronization is needed to continue execution. The only synchronization needed is prior to returning the calculated value.

The Pthreads model has no support for untied tasks, and therefore a thread associated with a waiting task has to wait too. The OpenMP implementation also shows that not using tasks is a great disadvantage, as some improvement is gained when switching to a task implementation instead. The fact that OpenMP with tasks still runs much slower than Cilk++ could be due to OpenMP's task scheduling implementation, as others have noted that it has a large overhead [11].
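A sketch of the task-based transformation, written in the placeholder style of Figure 19 (the clauses shown are assumptions about how such an implementation could look, not a quote of the thesis code): shared(larea) is needed because the task would otherwise work on its own copy of the variable, and taskwait plays the role of cilk_sync.

    Inside quad():

        #pragma omp task untied shared(larea)
        larea = quad(arguments for left interval);
        rarea = quad(arguments for right interval);
        #pragma omp taskwait

    Initial call:

        #pragma omp parallel
        #pragma omp single
        total = quad(arguments for the whole interval);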

Thanks to its work-stealing scheduler and task concept, Cilk++ achieves nearly 100 % efficiency.

6.1.3 Post office simulation

This communicating program does not achieve very good speedup with any of the models. The reason is most likely the fact that the threads communicate by modifying buffers, something which is done quite often and requires mutual exclusion. Therefore much of the time is spent waiting for locks and adding to buffers, something which cannot be done in parallel.

To achieve better speedup in this kind of program (paradigm), a larger fraction of the work has to be done independently of buffer access.

6.1.4 Game of life


Since the threads synchronize at a barrier after each generation, no thread can proceed faster than the slowest thread, which is most likely an important factor that lowers the speedup. Otherwise the threads have simple, well-defined tasks which they run from program start to termination.

6.2 Analysis of qualitative evaluation

In this section the parallel programming models will be evaluated according to the criteria listed in Section 3.2.2 on page 20.

6.2.1 Pthreads

Portability Pthreads has great portability between systems supporting POSIX. There exists some support for translating Pthreads calls to Windows API threads using Pthreads-win32, but it requires POSIX to be installed on the system.

Documentation As Pthreads provides manpages it is easy to get a good and clear overview of its functions. Since Pthreads has been around for a while it is also quite well-known in the community, which can be of great help.

Modification Of all the models evaluated in this thesis Pthreads is by far the one which requires the most modification of the code. One often has to create new functions to use with the thread creation function, and many locks and attribute variables have to be initialized.

Complexity The Pthreads model is, however, not very complex. The one thing that is complex is how threads are created and associated with functions. Because Pthreads only accepts functions of the type void *function(void *argument), a new data structure has to be created if more than one argument is to be passed to the function. Other Pthreads functions are straightforward and do not require much studying.
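A minimal sketch of this pattern, with a hypothetical argument struct and worker function:

    #include <pthread.h>
    #include <stdio.h>

    struct worker_args {                    /* bundles several arguments into one pointer */
        int start, end;
        const char *label;
    };

    static void *worker(void *arg)
    {
        struct worker_args *a = (struct worker_args *) arg;
        printf("%s: %d..%d\n", a->label, a->start, a->end);
        return NULL;
    }

    int main(void)
    {
        struct worker_args args = { 0, 99, "range" };
        pthread_t tid;
        pthread_create(&tid, NULL, worker, &args);  /* only void *(*)(void *) is accepted */
        pthread_join(tid, NULL);
        return 0;
    }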

Scope Pthreads has a very large scope and thanks to all thread management and synchronization being done explicitly this model offers great flexibility. Several synchronization mechanisms are available, such as locks of different kinds, semaphores, barriers and condition variables.

6.2.2 OpenMP

Portability Portability of OpenMP is good as it is supported by several compilers on several platforms, such as GCC on many POSIX operating systems and Visual Studio on Windows.


Modification When parallelizing programs with OpenMP very little modification is needed. Most of the time no existing code needs to be changed or moved; it is enough to add the OpenMP pragmas.

Complexity Complexity is probably the only problem with OpenMP in this qualitative evaluation. While the directives seem very intuitive, some of them often require detailed study; especially the clause shared and the variants of private need to be used with care. This should be stressed more in the OpenMP specification and shown with different examples. It should be pointed out that most of the directives nonetheless really are as intuitive as they seem.
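A small example of the kind of detail these clauses hide (variable and function names are hypothetical): without private(tmp) below, all threads would share a single tmp and race on it.

    int i, tmp;
    #pragma omp parallel for shared(data) private(tmp)
    for (i = 0; i < n; i++) {
        tmp = expensive_transform(data[i]);   /* tmp must be private to each thread */
        data[i] = tmp;
    }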

Scope OpenMP has support for most concurrent and parallel concepts. Many options are available for the programmer to control, such as the number of threads, nested parallelism and the scheduling of loop regions. Many of the directives also offer a wide range of clauses to specify the execution environment.

6.2.3 Cilk++

Portability Cilk++ is currently only available through the Intel Cilk++ Compiler, which works with both Linux and Windows. The Cilk++ compiler in Linux is a wrapper around GCC and requires the same environment as GCC. The Windows version requires Visual Studio 2005 Service Pack 1 or newer. This is currently the main issue with Cilk++. If this one compiler doesn’t work on the system there is no other to turn to, in contrast to having the model available as a standalone library.

Documentation Intel provides the Cilk++ Programmer’s Guide which is very detailed and exhaustive about functions, compiler options, the programming environment and available tools. Apart from the Programmer’s Guide there is however not much documentation to turn to, which can be a problem when information about a certain function is needed quickly.

Modification Parallelizing a program according to the Cilk++ model requires a minimal amount of modification. A few keywords are often enough.

Complexity The keywords in Cilk++ are very intuitive and straightforward; no deeper analysis of them is necessary to understand how they should be used.

Scope The Cilk++ model offers only a small set of functions and keywords. These are often enough to do what is needed to efficiently parallelize a program but do not offer much flexibility.

6.3 Evaluation summary

References
