
Analysis of Automatic Parallelization Methods for Multicore Embedded Systems

FREDRIK FRANTZEN

DEGREE PROJECT IN INFORMATION AND COMMUNICATION TECHNOLOGY, SECOND LEVEL

STOCKHOLM, SWEDEN 2015


Analysis of Automatic Parallelization Methods for Multicore Embedded Systems

Fredrik Frantzen

2015-01-06

Master’s Thesis

Examiner: Mats Brorsson
Academic adviser: Detlef Scholle

KTH Royal Institute of Technology

School of Information and Communication Technology (ICT)
Department of Communication Systems

SE-100 44 Stockholm, Sweden


Acknowledgement

I want to thank my examiner Mats Brorsson and my two supervisors Detlef Scholle and Cheuk Wing Leung for their helpful advice and for making this report possible. I also want to thank the other two thesis workers, Andreas Hammar and Anton Hou, who have made the time at Alten really enjoyable.


Abstract

There is a demand for reducing the cost of porting legacy code to different embedded platforms.

One such platform is the multicore system, which allows higher performance with lower energy consumption and is a popular solution in embedded systems. In this report, I have made an evaluation of a number of open source tools supporting the parallelization effort. The evaluation is made using a set of small, highly parallel programs and two complex face recognition applications that show what the current advantages and disadvantages of different parallelization methods are. The results show that parallelization tools are not able to parallelize code automatically without substantial human involvement; therefore it is more profitable to parallelize by hand. The outcome of the study is a number of guidelines on how developers can parallelize their programs and a set of requirements that serves as a basis for designing an automatic parallelization tool for embedded systems.


Sammanfattning

Det finns ett behov av att minska kostnaderna för portning av legacykod till olika inbyggda system. Ett sådant system är de flerkärniga systemen som möjliggör högre prestanda med lägre energiförbrukning och är en populär lösning i inbyggda system. I denna rapport har jag utfört en utvärdering av ett antal open source-verktyg som hjälper till med arbetet att parallelisera kod. Detta görs med hjälp av små paralleliserbara program och två komplexa ansiktsigenkänningsapplikationer som visar vilka för- och nackdelar de olika paralleliseringsmetoderna har. Resultaten visar att paralleliseringsverktygen inte klarar av att parallellisera automatiskt utan avsevärd mänsklig inblandning. Detta medför att det är lönsammare att parallelisera för hand. Utfallet av denna studie är ett antal riktlinjer för hur man ska göra för att parallelisera sin kod, samt ett antal krav som agerar som bas för att designa ett automatiskt paralleliseringsverktyg för inbyggda system.


Contents

Acknowledgement
List of Tables
List of Figures
Abbreviations

1 Introduction
1.1 Background
1.2 Problem statement
1.3 Team goal
1.4 Approach
1.5 Delimitations
1.6 Outline

2 Parallel software
2.1 Programming Parallel Software
2.1.1 Where to parallelize
2.1.2 Using OpenMP for shared memory parallelism
2.1.3 Using MPI for distributed memory parallelism
2.1.4 Using vector instructions for spatially close data
2.1.5 Offloading to accelerators
2.2 To code for different architectures
2.2.1 Use of hybrid shared and distributed memory
2.2.2 Tests on accelerator offloading
2.3 Conclusion

3 Parallelizing methods
3.1 Using dependency analysis to find parallel loops
3.1.1 Static dependency analysis
3.1.2 Dynamic dependency analysis
3.2 Profiling
3.3 Transforming code to remove dependencies
3.3.1 Privatization of variables to remove dependencies
3.3.2 Reduction recognition
3.3.3 Induction variable substitution
3.3.4 Alias analysis
3.4 Parallelization methods
3.4.1 Traditional parallelization methods
3.4.2 Polyhedral model
3.4.3 Speculative threading
3.5 Auto-tuning
3.6 Conclusion

4 Automatic parallelization tools
4.1 Parallelizers
4.1.1 PoCC and Pluto
4.1.2 PIPS-Par4all
4.1.3 LLVM-Polly
4.1.4 LLVM-Aesop
4.1.5 GCC-Graphite
4.1.6 Cetus
4.1.7 Parallware
4.1.8 CAPS
4.2 Translators
4.2.1 OpenMP2HMPP
4.2.2 Step
4.3 Assistance
4.3.1 Pareon
4.4 Comparison of tools and reflection
4.4.1 Polyhedral optimizers and performance
4.4.2 Auto-tuning incorporation and performance
4.4.3 Functional differences
4.5 Conclusion

5 Programming guidelines for automatic parallelizers
5.1 How to structure loop headers and bounds
5.2 Static control parts
5.3 Loop bodies
5.4 Array accesses and allocation
5.5 Variable scope
5.6 Function calls and stubbing
5.7 Function pointers
5.8 Alias analysis problems: Pointer arithmetic and type casts
5.9 Reductions
5.10 Conclusion

6 Implementation
6.1 Implementation approach
6.2 Requirements

7 The applications to parallelize
7.1 Face recognition applications
7.1.1 Training application
7.1.2 Detector application
7.2 PolyBench benchmark applications

8 Results from evaluating the tools
8.1 Compilation flags
8.2 PolyBench results
8.3 Parallelization results on the face recognition applications
8.4 Discussion

9 Requirements fulfilled by automatic parallelizers
9.1 Code handling and parsing
9.2 Reliability and exposing parallelism
9.3 Maintenance and portability
9.4 Parallelism performance and tool efficiency

10 Conclusions
10.1 Limitations of parallelization tools
10.2 Manual versus Automatic parallelization
10.3 Future work

References


List of Tables

4.1 Functional differences in the tools.
4.2 A rough overview of what the investigated tools take as input and what they can output.
6.1 The list of requirements for an automatic parallelization tool.
8.1 Compilation flags for the individual tools.
8.2 Refactoring time and validity of parallelized training application.
8.3 Refactoring time and validity of parallelized classification application.


List of Figures

2.1 Two parallel tasks are in separate critical sections and holding a resource each; when requesting to get the other's resource, a deadlock is created.
2.2 Parallelism in a loop.
2.3 A false sharing situation.
2.4 A sequential program split up into pipeline stages.
2.5 Pipeline parallelism, displaying different balancing of the stages.
2.6 Thread creation and deletion in OpenMP. [1]
2.7 A subset of OpenMP pragma directives.
2.8 Dynamic and static scheduling side by side. Forking and joining is done only once.
2.9 Example of a SIMD instruction.
2.10 An overview of different architectures.
3.1 Example on data dependencies, revealed after unrolling the loop once.
3.2 GCD test on the above code segment yields that there is an independence.
3.3 Example of a more difficult loop.
3.4 An example of a variable and an array that is only live within the scope of an iteration.
3.5 A reduction recognition example using OpenMP.
3.6 A simple example of induction variable substitution.
3.7 A simple example of a pointer aliasing an array.
3.8 Example code to illustrate dependence vectors.
3.9 A loop nest that has been transformed to be parallelizable.
5.1 Allowed loop bounds.
5.2 Disallowed loop bounds.
5.3 A loop that does not satisfy as a static control part because of the unpredictable branch.
5.4 A loop that satisfies as a static control part.
5.5 Critical region within the loop.
5.6 Critical region fissioned out of the loop.
5.7 Move private dynamic allocation inside the loop scope.
5.8 A is classified as shared, even though it is private in theory.
5.9 A is in a scope where it cannot be shared between the iterations over i, thus is private.
5.10 Function pointers should be avoided.
5.11 Two examples on how to complicate alias analysis.
5.12 Fission out the reduction.
7.1 Training application for face recognition.
7.2 Detector application for face recognition.
8.1 Results from Polybench benchmarks (part 1). Y axis is speed-up.
8.2 Results from Polybench benchmarks (part 2). Y axis is speed-up.
8.3 Results from Polybench benchmarks (part 3). Y axis is speed-up.
8.4 Speed-up on different numbers of cores on the training application after parallelization using the different tools.
8.5 Speed-up on different numbers of cores on the classification application after parallelization using the different tools.


Abbreviations

CPU: Central Processing Unit
CUDA: Compute Unified Device Architecture, a parallel computing platform for Nvidia devices
DSP: Digital Signal Processor
GPU: Graphics Processing Unit
Heterogeneous system: System containing different computing units
Homogeneous system: System containing multiple identical cores
HMPP: Hybrid Multi-core Parallel Programming, a standard for writing programs for heterogeneous systems
IP core: Intellectual Property core
MCAPI: Multicore Communications API, a standard for communication between cores on-chip or on-board for embedded systems
MPI: Message Passing Interface, a standard defining library routines for writing portable message passing applications
OpenACC: Standard for writing programs for heterogeneous systems
OpenCL: Standard for writing programs for heterogeneous systems
OpenMP: Standard for writing parallel programs for shared memory
SMP: Symmetric Multi-Processor (homogeneous system using shared memory)


Chapter 1

Introduction

1.1 Background

The demand for high performance in embedded systems is increasing, but at the same time the systems need to be power efficient. A way to increase performance is to add cores to the system and decrease the frequency. Power consumption can remain constant, but to get the performance it is important to utilize the cores. Today, applications are still written for single-core execution. To utilize the processing power of a many-core system, developers have to modify their software. This can be very time consuming and difficult, and the complexity is worse if the software has grown big, with thousands of lines of code.

The state-of-the-art report [2] by the ITEA2/MANY [3] project concludes that there is no unique architecture that will provide the best performance for all kinds of applications, only for a set of applications with known complexity and required resources. It also predicts that future embedded systems will consist of hundreds of heterogeneous intellectual property (IP) cores, which will execute one parallel application or even several applications running in parallel. Developing for these architectures will get more complex, and this makes it necessary to create tools that will close the gap between hardware and software.

One big gap is the parallelism that exists in hardware but not in software. Tools that help developers create parallel software are needed. One such tool is a compiler that analyses the code and automatically parallelizes it. This allows developers to reuse their existing code and continue developing software without having to think about the hardware architecture.

There is a wide range of tools that a developer can use together with a compiler to get a more optimized application or more knowledge of their application.

The study reported here was done as a master's degree project. It was conducted at Alten Sverige [4], an engineering and IT consulting firm whose main customers belong to the energy, telecommunication, manufacturing and automotive industries. This degree project is part of the ITEA2/MANY project, which is working on putting together a development environment that will allow more code reuse to lower the time-to-market for embedded systems development.

The degree project is also, to some extent, part of the ARTEMIS/Crafters [5] project, which is developing a framework for developing applications on many-core embedded systems.


1.2 Problem statement

Parallelization of code can be a complex task depending on the legacy application, and it can take a lot of time to move to a parallel platform. It is necessary to investigate whether there are cost-effective alternatives to parallelizing code by hand, such as using automatic parallelization tools. There are several automatic parallelization tools available, but in the context of the MANY and Crafters projects it was unknown how to draw benefits from them. It was also of interest how the tools can be improved to increase the benefits of using them.

The goal was to give a model for how automatic parallelization can be used in production. This report can be seen as a package containing guidelines and a knowledge base for using automatic parallelization tools. This will hopefully lead to a decrease in the amount of resources needed to port legacy code and serve as a basis for future improvements in automatic parallelization tools.

1.3 Team goal

During this degree project, a sub-project together with two other thesis workers was carried out. Each thesis worker studied a separate subject, and the goal was to combine the knowledge gained from these studies to design and develop a face-recognition use case application that makes use of the automatic parallelization tools investigated in this thesis and middleware components supplied by the other two workers. The other two technologies are run-time adaptive code using self-managing methods, and high-performance interprocess communication. The implementation was conducted on Linux on an x86 multi-core system. The use case application was used to validate the efficiency of the automatic parallelizer.

1.4 Approach

In this master's thesis, I have made an academic study of the state of the art in automatic parallelization of software. I investigated what methods there are to create parallel code, both manually and automatically. I also looked at the current technologies and methods used in different automatic parallelizing tools. This includes material on compilers, parallel theory and scientific articles on parallelization. Different methods for parallelizing software were investigated and analyzed, but the focus was on automatic parallelization.

The second half of the work consists of an evaluation of the automatic parallelization tools to get an insight into their usability. The result of the study gives a comparison of existing automatic parallelizing tools and distinguishes the differences between the tools in terms of what parallelizing methods they use and their efficiency. An analysis was then carried out, using the results from the evaluation and the findings of the study, of how these tools can be improved and what technology should be incorporated in an automatic parallelization tool for embedded systems.

1.5 Delimitations

This report considers only the parallelization of sequential C code, which is a widely used language in developing embedded systems. Furthermore, thread-level parallelism for SMP systems was the main focus, but findings and discussions on how to target other systems were presented as well. Tools that automatically parallelize code, and those that can improve the workflow when parallelizing by hand, were investigated.

1.6 Outline

This report is divided into nine chapters excluding the introduction chapter. Chapter 2 describes the concept of parallelizing a program. This includes a description of how developers can parallelize their programs using libraries and compiler directives, concepts that developers can keep in mind when doing parallelization, and an overview of the different system architectures that a developer can decide to target with their parallel program. Chapter 3 presents common techniques that are used in automatic parallelization compilers; a summary concluding the chapter reflects on their strengths and weaknesses and why one method might be more favorable than another.

Chapter 4 gives the reader a summary of existing tools that can perform automatic parallelization or assist the developer in making a program perform better on a parallel system. Some of the tools do the same things as others and some do entirely unique things. A detailed map depicting how the tools differ is presented here, together with data supporting why one tool is more favorable than another. Chapter 5 presents the refactoring steps needed to be able to take advantage of parallelization tools.

Chapter 6 discusses the implementation approach needed for creating an automatic parallelizer that works efficiently on general problems. Chapter 7 presents the applications that were used in the evaluation of the selected automatic parallelization tools, and chapter 8 presents the results from the evaluation. In chapter 9, the selected tools are compared against the requirements that were identified in chapter 6, to get a basis for what improvements are necessary to make them useful. Chapter 10 presents the conclusions that can be drawn from all of the work carried out during this thesis project.


Chapter 2

Parallel software

Parallelizing software was a research topic for decades before the multi-core revolution, but now it is more relevant than ever. Since the CPU frequency increase in computer systems has begun to stall at about 3.5 GHz, it has become more interesting to add more cores. This allows computer systems to perform better using less power.

If the frequency (f) of a system is halved, the voltage (V) can be lowered. This means that the power consumption (P) becomes an eighth of that of the original system (see Equation 2.3). By adding an additional core, the theoretical performance is about the same as before, while only a fourth of the power is consumed. By adding two more cores, the performance can be doubled compared to the single-core system while only half of the power is consumed.

P = C \cdot V^2 \cdot f \qquad (2.1)

V = a \cdot f \qquad (2.2)

P = C \cdot a^2 \cdot \frac{f^3}{2^3} \qquad (2.3)
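Spelling out the arithmetic behind the "one eighth" claim, substituting Equation 2.2 into Equation 2.1 and halving the frequency gives:

P_{halved} = C \cdot \left(a \cdot \frac{f}{2}\right)^2 \cdot \frac{f}{2} = \frac{C \cdot a^2 \cdot f^3}{2^3} = \frac{P}{8}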

The performance in practice is, however, a different thing. Software has yet to be written to utilize multi-core architectures efficiently. Most software today is written for single-thread execution and has to be rewritten for the new architectures to achieve this performance.

Software is, however, limited by Amdahl's law, which states that the speed-up (S) of a program is limited by the proportion that is not parallelizable (1 - P) (see Equation 2.4). This can be seen in the formula: as the number of cores (N) increases, the second term in the denominator moves towards zero and no longer affects the performance of the program significantly.

S = \frac{1}{(1 - P) + \frac{P}{N}} \qquad (2.4)
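As a worked example of Equation 2.4 (the numbers are purely illustrative): with P = 0.95 and N = 8 cores,

S = \frac{1}{(1 - 0.95) + \frac{0.95}{8}} = \frac{1}{0.16875} \approx 5.9

and no matter how many cores are added, the speed-up can never exceed 1 / (1 - P) = 20.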

Therefore the importance lies in how much of the code can be parallelized, and thus methods for finding parallelism are needed. This chapter gives the reader an introduction to these methods and an understanding of how one can go about parallelizing a program, or porting a program to a multi-core architecture.

2.1 Programming Parallel Software

There are several programming languages in existence today. A large subset of those also support parallel programming in different ways. To mention a few, there are C, Ada, Java, Haskell and Erlang. These languages are interesting from a parallel programming perspective in different ways. This thesis only looks at C, which is one of the most popular languages, especially for embedded systems. C is a very low-level language compared to the others mentioned, and has little abstraction for parallel programming without extensions. As of the C11 standard, however, it is possible to use a standard thread library that does not require a POSIX-based system.

Figure 2.1: Two parallel tasks are in separate critical sections and holding a resource each; when each requests the other's resource, a deadlock is created.

To write parallel programs in C, a thread library such as the standard one or the POSIX threads library (pthreads) can be used. This gives the developer full support for programming tasks that execute in parallel. But programming for parallel systems is not trivial. When the developer wants tasks to share a resource, there are several problems that can occur. The parallel tasks cannot read and write the resource in any way they like, because this will result in race conditions. A race condition can occur if a task writes to a shared resource and plans to read it in the near future; another task that is also using the resource can write to it before the first task reads it. This means that the program will be non-deterministic at run-time. To the developer, it can look like there is no problem with the program, especially if he or she is programming on a single-core system, where tasks run concurrently but not in parallel.

To prevent this, the developer can use protection mechanisms, e.g. locks or semaphores, to surround a critical section, so that a task that enters this section is guaranteed that the shared resource is not modified while it is in use. Using locks, however, can impose other problems. A deadlock occurs when two tasks each want a resource the other task is holding. See Figure 2.1 for a visual explanation: Task1 holds R1 and wants to get R2, but R2 is already held by Task2, and Task2 wants R1. What happens is that both tasks end up in a waiting state, and execution is prevented.
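As an illustration of the problem, the following minimal pthreads sketch (the counter, thread count and iteration count are made up for the example) shows a shared counter whose read-modify-write update must be protected by a mutex; without the lock, updates from different threads can interleave and be lost.

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define NITER 100000

static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    for (int i = 0; i < NITER; i++) {
        /* Without the lock, the read-modify-write of counter can
           interleave between threads and some updates are lost. */
        pthread_mutex_lock(&lock);
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t threads[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&threads[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(threads[i], NULL);
    /* With the mutex, the result is always NTHREADS * NITER. */
    printf("counter = %ld\n", counter);
    return 0;
}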

An alternative to programming threads in this error-prone way is to use an annotation-based approach with OpenMP [1]. The annotations are pragma statements in C code which are handled by the compiler. When the compiler parses the statements, it inserts low-level code that implements the intended functionality. This method is less verbose and maps better to the parallel paradigm. The benefit of using this method is that it abstracts much of the required synchronization, meaning that the problems of race conditions and deadlocks can be avoided to some degree; however, it does not make the programmer completely safe. As a side note, to be able to program with OpenMP the developer needs to use a compiler that supports the OpenMP pragmas.

2.1.1 Where to parallelize

There can be several places in the software that can be parallelized. Loops are the most common target for refactoring when parallelizing a program, because a lot of the execution time is spent in these sections. The potential speed-up of a loop is equal to the number of iterations.

Figure 2.2 shows how a loop can be split up into workers that execute a number of iterations each.


Figure 2.2: Parallelism in a loop.

When parallelizing loops there are several problems that need to be avoided. Loops typically process a data set where elements are stored spatially close in memory. On a symmetric multiprocessing unit it is common to have some form of cache coherence. Cache coherence makes sure that cores that are working on the same memory always have the latest version of a memory block. If one core writes to a memory address, the data will be written to a cache line first, since it may be written or read again in the near future. When the core has written to the cache, the line has to be invalidated in all the other caches used by the neighboring cores, since their copy of the data is no longer the latest version.

In a cache-coherent system false sharing can occur. It means that a cache line is invalidated in neighboring caches by accident. When a memory address is read, the elements that are spatially close to it are loaded into the same cache line. Figure 2.3 shows a simple example of a cache line containing two data blocks. When core 1 writes to A, the cache line for core 2 is invalidated even though core 2 is only interested in B.

Figure 2.3: A false sharing situation.
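A minimal sketch of the situation in Figure 2.3 (the sizes and the 64-byte cache line are assumptions): each thread only touches its own counter, but because adjacent counters share a cache line the line is repeatedly invalidated; padding each counter to its own cache line avoids the false sharing.

#include <omp.h>

#define NTHREADS 2
#define NITER 10000000
#define CACHE_LINE 64

/* Adjacent counters: both fall in the same cache line (false sharing). */
long counters_shared[NTHREADS];

/* Padded counters: each one occupies its own cache line. */
struct padded { long value; char pad[CACHE_LINE - sizeof(long)]; };
struct padded counters_padded[NTHREADS];

void update(void)
{
    #pragma omp parallel num_threads(NTHREADS)
    {
        int id = omp_get_thread_num();
        for (long i = 0; i < NITER; i++) {
            counters_shared[id]++;        /* invalidates the neighbour's line */
            counters_padded[id].value++;  /* no interference between threads  */
        }
    }
}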

Another pattern to look out for in a sequential program is the pipeline (see Figure 2.4). This pattern assumes that there is a batch of data that needs to be processed. A pipeline processes one data element at a time, in several stages. When the first message is received at the second stage, a second message can be handled at the first stage simultaneously.

By splitting up a program into pipeline stages, the potential speed-up is equal to the number of pipeline stages. In reality, however, the load balancing of the pipeline stages has to be perfect to get that performance. The balancing in terms of execution time is shown in Figure 2.5: A shows a perfectly balanced pipeline, while B illustrates the stalling-pipeline problem. Replicating a pipeline stage may yield a non-stalling pipeline as seen in C, but performance is wasted because threads have to go idle. This pattern is harder to solve automatically and has to make use of techniques similar to that of Tournavitis et al. [6], where automatic tuning is used to find the optimal load balance.

Figure 2.4: A sequential program split up into pipeline stages.

Figure 2.5: Pipeline parallelism, displaying different balancing of the stages.

2.1.2 Using OpenMP for shared memory parallelism

OpenMP [1] is a standard that provides an extension for C. It provides a simple and flexible interface for handling threads. It is an API that consists of a set of compiler directives, library routines and environment variables that influence run-time behaviour. These compiler directives hide the complicated parts such as synchronization of threads, resource sharing and thread scheduling. This report will continuously return to OpenMP and its directives, since it is a popular format used by the automatic parallelizing compilers which will be presented in chapter 4.

Figure 2.6: Thread creation and deletion in OpenMP. [1]

The OpenMP parallel directive is used to fork a number of threads, where the number is defined either by the developer or by an environment variable, to execute the region that follows the directive in parallel. The forked threads will execute until they reach a synchronization clause, such as the barrier directive, where they will wait until all threads have finished executing. If it is at the end of a parallel region (where barriers are implicitly inserted), the threads will be joined with the master thread and the application will continue running on a single thread. Otherwise the threads will continue executing until the next synchronization clause is reached. Figure 2.6 displays the fork and join model. OpenMP also supports a tasking model where directives in the code put a task on a work queue. Each thread can put more work on the work queue, and when all tasks are finished the threads can continue. This is a model that can be used for recursive algorithms. This study will mainly look at the work-sharing constructs defined by OpenMP.
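Before turning to the work-sharing constructs, a small sketch of the tasking model just described (the tree type and the process function are made up for the example): each recursive call is turned into a task that any idle thread can pick from the work queue.

#include <omp.h>

struct node { int value; struct node *left, *right; };

/* Hypothetical per-node work. */
void process(struct node *n);

void traverse(struct node *n)
{
    if (n == NULL)
        return;
    process(n);
    #pragma omp task          /* each child subtree becomes an independent task */
    traverse(n->left);
    #pragma omp task
    traverse(n->right);
    #pragma omp taskwait      /* wait for the two child tasks to finish */
}

void run(struct node *root)
{
    #pragma omp parallel
    {
        #pragma omp single    /* one thread creates the initial task */
        traverse(root);
    }
}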

#pragma omp parallel
{
    // Code
}

#pragma omp for schedule((static)|(dynamic)|(guided), chunk_size)
for (i = 0; i < N; ++i)
{
    // Code
}

#pragma omp critical
{
    // Code
}

#pragma omp reduction(operator : variable1, ...)
#pragma omp private(variable1, ...)
#pragma omp parallel if (condition)

Figure 2.7: A subset of OpenMP pragma directives.

The work-sharing constructs in OpenMP are directives that define which thread is going to execute which part of the region. By using the parallel for directive (followed by a for loop), iterations of a loop can be executed in parallel, assigning each thread a number of iterations to execute. Iterations can be scheduled either at compile time or at run-time, as defined by the developer using the OpenMP schedule clause. The directives mentioned so far can be found in Figure 2.7.

There are three scheduling clauses in OpenMP: static, dynamic and guided. The static schedule determines at compile time which iterations are going to be executed by which thread; the work of the loop is split up into chunks of iterations, where the size of the chunk is determined in the schedule clause. In the dynamic schedule, the iterations are scheduled at run-time. The benefit of using a dynamic schedule over a static one is that the work load will be better balanced over all threads, as shown in Figure 2.8: when a thread is out of work, it can request more work from the scheduler. In contrast, with the static schedule a thread that has executed all its iterations has to go idle until the other threads have finished. The drawback of the dynamic schedule is the additional overhead of assigning chunks to threads at run-time. The guided schedule is in principle the same as the dynamic schedule; the difference is that it hands out chunks of varying size.

Like pthreads, OpenMP does not make the developer safe from race conditions. A variable can be shared between threads using the shared clause. This can be useful if a variable is going to be read by several threads. But if writes are going to be performed on the variable, it is up to the developer to either specify it as private, to let each thread have its own private copy of the variable, or to insert a critical section clause.


Figure 2.8: Dynamic and static scheduling side by side. Forking and joining is done only once.
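A minimal sketch combining the directives discussed above (the array size and the function g are placeholders): the iterations are distributed with a dynamic schedule, temp is private to each thread, and the update of the shared result is protected by a critical section.

#include <omp.h>

#define N 1000

double g(int i);   /* placeholder for per-iteration work */

double find_max(void)
{
    double best = -1e30;   /* shared between the threads */
    double temp;
    int i;

    #pragma omp parallel for schedule(dynamic, 16) private(temp) shared(best)
    for (i = 0; i < N; i++) {
        temp = g(i);            /* each thread has its own copy of temp */
        #pragma omp critical    /* protect the read-modify-write of best */
        {
            if (temp > best)
                best = temp;
        }
    }
    return best;
}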

Forking and joining threads may lead to significant overhead if they are placed inside nested loop statements. Creating threads is an expensive process [7]. If the creation of threads is done inside a loop, it occurs as many times as there are iterations of the outer loop. This can create significant overhead if the region that executes in parallel is short, causing the program to become slower. A better approach is to move the creation of threads to the outer loop, since this results in only one instance of forking threads.
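A sketch of the difference (the loop bounds and the work function are placeholders): in the first variant the parallel region, and therefore the fork and join, is repeated for every iteration of the outer loop; in the second the threads are forked once and only the work-sharing construct is entered repeatedly.

#include <omp.h>

#define N 100
#define M 100000

void work(int i, int j);   /* placeholder for the loop body */

void forked_every_iteration(void)
{
    /* Threads are forked and joined N times: high overhead. */
    for (int i = 0; i < N; i++) {
        #pragma omp parallel for
        for (int j = 0; j < M; j++)
            work(i, j);
    }
}

void forked_once(void)
{
    /* Threads are forked once; only the work-sharing construct repeats. */
    #pragma omp parallel
    for (int i = 0; i < N; i++) {
        #pragma omp for
        for (int j = 0; j < M; j++)
            work(i, j);
    }
}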

2.1.3 Using MPI for distributed memory parallelism

MPI [8], or the Message Passing Interface, is a standard that defines an API for interprocess communication. MPI is a useful abstraction for coarse-grain parallelism, i.e. when threads can be divided into separate tasks that communicate with each other only to a small degree. The advantage that MPI has over OpenMP is that nothing is assumed about the underlying architecture, which means that the application can be deployed on any system, while OpenMP only executes in parallel on SMP architectures [9], although version 4 of the standard has added support for heterogeneous systems with shared memory. An MPI application can also be distributed over several systems simultaneously, since the links connecting two tasks of a program hide the location of a task.

Several implementations of the MPI API exist. A full MPI implementation is too big to be included with most embedded systems, but there are libraries that implement a subset of functions with functionality similar to the MPI API, such as MCAPI [10].

The disadvantage of using MPI is that there will be additional overhead when tasks that are running on the same SMP unit communicate. When processes are running on the same node, sending data with MPI will use shared memory. Although the memory is shared, there will be a copy from the send buffer into the shared memory and then a copy from the shared memory to the receive buffer. Therefore, MPI is better suited for applications that do not have to send a lot of data around, or that are not running on a cache-coherent system.
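A minimal MPI sketch (the message size and tag are arbitrary): rank 0 sends a buffer to rank 1, and the same code runs unchanged whether the two ranks share a node or not.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, data[4] = {1, 2, 3, 4};

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Send four integers to rank 1 with tag 0. */
        MPI_Send(data, 4, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(data, 4, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d %d %d %d\n",
               data[0], data[1], data[2], data[3]);
    }

    MPI_Finalize();
    return 0;
}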


2.1.4 Using vector instructions for spatially close data

The finest-grain of parallelism is when each data element in an array can be processed in parallel.

Vectorization (SIMD, Single Instruction Multiple Data) is a technique that makes it possible to execute one instruction on several elements stored spatially close in memory (see Figure 2.9). This requires that the hardware has support for these instructions. The width, i.e. the number of elements that can be processed at a time, varies. Many popular compilers, e.g. GCC, have support for automatically inserting vector instructions in trivial cases. A trivial case can be a loop that contains a chain of binary instructions (addition, multiplication, etc.) performed on each data element in an array. These instructions can be replaced with vector instructions, lowering the iteration count of the loop.

Figure 2.9: Example of a SIMD instruction.
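A sketch of such a trivial case (the array length is arbitrary): every iteration performs the same independent addition, so the compiler can replace scalar adds with vector adds, for example GCC at -O3 or with -ftree-vectorize; the OpenMP 4.0 simd directive can be used to state the intent explicitly.

#define N 1024

void add_arrays(const float *a, const float *b, float *c)
{
    /* Each iteration is independent, so several elements can be
       processed by one SIMD instruction. */
    #pragma omp simd
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];
}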

2.1.5 Offloading to accelerators

Many architectures are heterogeneous, i.e. there is a CPU combined with an accelerator of some kind. PCs typically have a Graphics Processing Unit (GPU), and embedded systems have a whole range of different accelerators, such as Digital Signal Processing units (DSP). Accelerators are highly specialized for a particular type of problem, which often means that they can execute it faster than the CPU. Lately it has become very common to make use of accelerators for general purpose applications, especially GPUs. To program accelerators there are several libraries such as OpenCL [11] or CUDA [12]. CUDA is used for programming Nvidia GPUs and OpenCL is designed for programming accelerators in general. Similar to pthreads, programming with these languages can become complex and is quite verbose compared to OpenMP pragmas.

Annotation-based approaches similar to OpenMP are also available, such as OpenACC [13] and OpenHMPP [14]. They provide compiler directives that make it possible to offload pieces of the execution onto an accelerator. Which accelerator to offload to is specified in the directive. The compiler is then responsible for inserting accelerator code adapted to the specified accelerator.

The two annotation languages mentioned are very similar; OpenACC is heavily influenced by the OpenHMPP directives. OpenHMPP was first developed by CAPS for their own compiler, but later several companies working with accelerators created a committee that together developed the OpenACC standard [13]. Currently OpenHMPP has more directives than the OpenACC standard, but it has not gained popularity. OpenACC has slowly been adopted by GCC, but it is currently only able to target the host (CPU) and not accelerators. In the latest OpenMP version (4.0), support for accelerators has been added. It is not implemented in any compiler yet, although some preliminary implementations have been made [15]. In the next version of GCC, OpenMP 4.0 will be supported, but just as with OpenACC, the only possible target will be the host [16].
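A minimal OpenACC sketch (the vector length is arbitrary): the data clauses describe the copies between host and accelerator memory, and the compiler generates the accelerator code for the loop.

#define N 100000

void saxpy(float a, const float *x, float *y)
{
    /* Offload the loop; x is copied to the device, y is copied in
       and copied back out again. */
    #pragma acc parallel loop copyin(x[0:N]) copy(y[0:N])
    for (int i = 0; i < N; i++)
        y[i] = a * x[i] + y[i];
}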


2.2 To code for different architectures

There exist several different hardware architectures. The OpenMP, MPI and OpenACC standards solve problems in different domains, which makes it sensible to combine them. The standards are good in their own domains and can complement each other.

Figure 2.10: An overview of different architectures.

In Figure 2.10 an overview of different architectures is given. A) depicts a system with four threads using OpenMP to take advantage of shared memory, and B) depicts an MPI implementation on a system using distributed memory. C) and D) display two different setups of a heterogeneous system: C) displays one thread that is offloading tasks to a number of threads on a connected accelerator, while D) displays a heterogeneous system where accelerator threads and the CPU share memory. The coding standards for accelerators cannot be used to program multiple CPU threads. E) shows the combination of using MPI together with OpenMP. F) is a system that is able to take full advantage of the parallelism in the hardware. It requires the developer to program with the previously mentioned standards to achieve this mapping in software. This can become very complex depending on how the system is set up.

To give some examples of architectures, ARM, which has a big market share in mobile devices, has released an SMP processor called the Cortex-A53 [17], which corresponds to the system depicted in A). This processor supports being connected to one additional ARM processor, which creates a system similar to what is depicted in E). The system depicted in F) is similarly complex to Adapteva's Parallella board [18], which has a dual-core ARM Cortex-A9 [19] together with a Zynq-7000 series FPGA from Xilinx [20] with the capability to use shared memory. It also has a co-processor developed by Adapteva called Epiphany IV [21], which consists of 64 accelerator cores.

2.2.1 Use of hybrid shared and distributed memory

A common approach is to create a distributed program with MPI where each component uses OpenMP internally to benefit from shared memory. This is called hybrid programming. Comparisons of the performance of hybrid OpenMP/MPI programming and pure distributed MPI have been made by Jin et al. [22] and Rabenseifner et al. [23]. In summary, they show that using the hybrid combination is not always a performance increase over a pure distributed implementation.


Rabenseifner et al. [23] have tested different set-ups of distributed models. They tested a pure MPI implementation against a hybrid MPI/OpenMP implementation. The results are that the hybrid implementation outperforms the pure MPI implementation. There are, however, multiple issues with combining the standards, acknowledged by Rabenseifner et al. [23]: the OpenMP parallel region either has to join into the master thread for MPI communication, or communication has to be overlapped with the computations. The former has the disadvantage of putting threads idle during communication. Jin et al. have shown that a pure MPI implementation performs better on big clusters, and that the limitation of the hybrid version is the data locality in the implementations used.
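A skeleton of the hybrid model (the problem size, its block distribution and the local_work function are placeholders): MPI splits the iteration range between processes, OpenMP parallelizes the local loop inside each process, and a final MPI reduction combines the partial results.

#include <mpi.h>
#include <omp.h>

#define N 1000000

double local_work(int i);   /* placeholder computation */

int main(int argc, char **argv)
{
    int rank, size, provided;
    double local_sum = 0.0, global_sum = 0.0;

    /* Request thread support, since OpenMP threads run inside each rank. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int chunk = N / size;
    int begin = rank * chunk;
    int end   = (rank == size - 1) ? N : begin + chunk;

    /* Shared-memory parallelism within one MPI process. */
    #pragma omp parallel for reduction(+:local_sum)
    for (int i = begin; i < end; i++)
        local_sum += local_work(i);

    /* Distributed-memory reduction across the processes. */
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0,
               MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}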

2.2.2 Tests on accelerator offloading

Liao et al. [15] compared a preliminary implementation of the OpenMP 4.0 standard in the Rose [24] compiler infrastructure with the PGI [25] and HMPP (CAPS) [26] compilers, where the code was annotated with OpenACC. There was also an OpenMP implementation that did not use accelerators but made use of the 16 cores on the Xeon processor. The accelerator code was run on a Nvidia Tesla K20c. The comparison shows a significant speed-up on matrix multiplication compared to the 16-core OpenMP implementation; this is visible for matrices that are bigger than 1024x1024 elements. The paper also shows how much time is spent on preparing the kernel for the matrix multiplication. When the kernels were small (matrices smaller than 512x512 elements), the preparation accounted for 50% or more of the execution time.

In the second test, the computation is not heavy and the communication costs overshadow the computation cost. Here the accelerator code does not outperform the sequential version until the vector size reaches 100M floats. The third test is computationally heavy, but the threads on the GPU are not used optimally. Adding the collapse clause in OpenACC, which the HMPP compiler supports, showed that it can outperform the 16-core OpenMP implementation; this is, however, only visible for matrices of size 1024x1024 and above. It is at this point that the sequential version gets outperformed by the other accelerator implementations.

This shows that a disadvantage of using accelerators is that copying is needed between the memory of the accelerator and the main thread, since they typically do not share memory. This means that accelerators have to be used when there is a lot of data to be processed in computationally heavy kernels, so that the cost of copying is outweighed by the speed-up due to parallelism. Another alternative is to use architectures where accelerators and CPUs share memory, like the hybrid core developed by Convey Computer [27], which combines a CPU with an FPGA accelerator, or AMD's [28] APU, which combines the CPU with the GPU; this type of system is depicted in Figure 2.10.D.

Another problem with accelerators is that they are very target dependent. The portability versus the performance of accelerator programming has been discussed by Saà-Garriga et al. [29] and Dolbeau et al. [30], who confirm this point.

2.3 Conclusion

This chapter has presented several techniques for creating parallel software, such as OpenMP, MPI, OpenACC and vectorization. These simplify the work for a developer who parallelizes by hand, but there are multiple problems that the developer has to solve, such as load balancing, inefficient use of caches and overheads. It is not obvious how to parallelize code for optimal performance, thus tools that can parallelize code automatically become a valid alternative to manual coding. Potentially, a general version of the source code could be ported and optimized for multiple platforms without the involvement of the developer. OpenMP has well-defined library functions for dealing with SMP systems. It is also an active standard that will soon incorporate heterogeneous systems. Furthermore, compilers have the functionality to deal with load balancing using the dynamic scheduling clause. Thus it is the standard that will be the main focus in this report.


Chapter 3

Parallelizing methods

The popular compilers already have several optimization transforms and parallelization techniques incorporated; automatically inserting vector instructions is one example. New techniques have been researched and tested on research compilers, and many are still in an experimental state. Some of these techniques may, however, be ready for production compilers. This chapter gives the reader an overview of the different techniques used in compilers today, both in products and in research.

The compiler's main task is to take source code and produce binaries that can be run on a target hardware architecture. However, the compiler's responsibility is more than that. The compiler is able to detect syntactical errors and some semantic errors in the source code. The compiler is also able to reorganize code so that the processor is kept busy while fetching data from memory. Currently the main memory is the big bottleneck in computers, and fetching from it slows the execution. Keeping a resource that will be used again very soon in a register is one way of reducing the number of accesses to memory. This is one particular decision a compiler is able to make. This thesis is not about the common optimization techniques compilers use to speed up a program; it is instead focused on the particular methods that are useful for parallelizing a program.

3.1 Using dependency analysis to find parallel loops

Analysis of the program is needed to find parallelism. There are two kinds of analyses a compiler can do: static dependency analysis and dynamic dependency analysis.

Dependency analysis is an important step in order to find parallel regions. A parallel program does not have a specific order in which the threads execute; therefore the data the threads handle has to be independent. The dependencies come from the use of shared memory between the threads. There are three types of dependencies: read-after-write (RAW), write-after-read (WAR) and write-after-write (WAW) dependencies. In Figure 3.1, a simple loop is unrolled once to display the three different dependencies mentioned. A RAW dependency in a loop means that the iteration doing the read access has to come after the iteration doing the write access; if the read comes before the write, the read access will read the old data that was stored at that memory address. Similarly, the WAR dependency means that a write has to occur after a read, or the read access would read the overwritten data. A WAW dependency needs the writes to be in order, or otherwise a future read access will read the wrong data. Static dependency analysis relies on looking at the source as it is (in an intermediate representation form) and finds memory dependencies by using complex algorithms. In contrast, dynamic dependency analysis looks at how the program executes and finds dependencies by looking at which memory addresses are accessed at run-time.

for (i = 0; i < N; i = i + 1) {
    A[i]   = A[i+1] + 1;
    B[i+1] = B[i];
    C[i*i - i] = i;
}

for (i = 0; i < N; i = i + 2) {
    A[i]   = A[i+1] + 1;
    A[i+1] = A[i+2] + 1;   // WAR (A[i+1])
    B[i+1] = B[i];
    B[i+2] = B[i+1];       // RAW (B[i+1])
    C[i*i - i] = i;
    C[i*i + i] = i + 1;    // WAW (C[0])
}

Figure 3.1: Example on data dependencies, revealed after unrolling the loop once.

3.1.1 Static dependency analysis

Loop dependencies are a hot topic in automatic parallelization research, as loops are the regions where a program can potentially run in as many threads as there are iterations. But if there are dependencies between iterations, the loop has to be executed serially. To find out whether a dependency exists (if it is non-trivial), data dependence tests can be used.

Finding data dependencies is an NP-hard problem, meaning that the problem cannot be solved in reasonable time. Instead, approximations of the problem have to be made; these algorithms are called data dependency tests. These tests are able to tell, in polynomial time, that a dependence is absent for a subset of the problems. A dependency has to be assumed if independence cannot be proven; this guarantees that the program will execute correctly. In Figure 3.2, a simple dependency test called the greatest common divisor test (GCD test) is presented.

The test states that, given two array accesses A[a · i + b] and A[c · i + d], a dependence may exist only if GCD(a, c) divides (d − b).

for (i = 1; i < N; i = i + 1) {
    X[4*i + 2] = ...;
    ...        = X[6*i + 3];
}

Figure 3.2: GCD test on the above code segment yields that there is an independence.
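A sketch of the test as a plain function (the function names are made up): for accesses A[a*i + b] and A[c*i + d] it reports that a dependence may exist only when GCD(a, c) divides d - b. For the loop above, GCD(4, 6) = 2 does not divide 3 - 2 = 1, so independence is proven.

#include <stdlib.h>

/* Greatest common divisor by Euclid's algorithm. */
static int gcd(int a, int b)
{
    while (b != 0) {
        int t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* GCD test for accesses A[a*i + b] (write) and A[c*i + d] (read):
   returns 1 if a dependence may exist, 0 if independence is proven. */
int gcd_test_may_depend(int a, int b, int c, int d)
{
    int g = gcd(abs(a), abs(c));
    if (g == 0)
        return b == d;          /* both subscripts are constants */
    return (d - b) % g == 0;
}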

These tests do not say when a dependency occurs, only that one may occur. The loop in Figure 3.3.A contains a non-linear term in the array subscript, which is not handled by some dependency test algorithms, so a dependency will be assumed here. In this case, there is a WAR dependency between the first two iterations, but the rest are dependency free. This means that the loop could have been parallelized if the first iteration had been hoisted out (see B), or if each thread had executed a chunk of two iterations at a time (see C).

Popular data dependency analysis methods today are the GCD test, the I-test, the Omega test, the Banerjee-Wolfe test and the range test. Kyriakopoulos et al. [31] have compared the performance of these different dependency tests, together with a new dependency test that is compared to the ones mentioned. The results show that each of these is good in different ways: the I-test is fast and detects many dependencies, while the Omega test is slower but detects some dependencies that the I-test cannot detect.


// A: original loop
for (i = 0; i < N; i = i + 1) {
    X[i*i + 1] = X[i + 2];
}

// B: first iteration hoisted out
X[1] = X[2];
for (i = 1; i < N; i = i + 1) {
    X[i*i + 1] = X[i + 2];
}

// C: chunks of two iterations
for (i = 0; i < N; i = i + 2) {
    X[i*i + 1]       = X[i + 2];
    X[i*i + 2*i + 2] = X[i + 3];
}

Figure 3.3: Example of a more difficult loop.

3.1.2 Dynamic dependency analysis

In contrast to static dependency analysis, dynamic dependency analysis looks at the actual accesses to memory. If two iterations in a loop read and write the same memory address during execution, there is a dependency. With this method, it is possible to find more parallelizable loops than with static dependency analysis: for loops that may contain a dependency according to a static dependency test, it can be shown that the dependency does not occur within the actual loop range.

The problem with dynamic dependency analysis is that it can be overly optimistic and classify loops that contain a dependency as parallel, because the given input data did not result in the memory access pattern that might occur in other situations. This can be a huge problem, and to work around it the test input has to cover all memory access pattern cases to guarantee that a race condition does not occur in the future. Another drawback of dynamic analysis is that running the program to gather the data can be very time consuming, depending on the program, and is nothing you would want to do every time you compile.

3.2 Profiling

Profiling is a dynamic way to get more knowledge of how your program executes in terms of flow and memory usage. It is done at run-time, where different methods can be used, such as instrumenting the code or sampling where the program counter is at regular intervals. The methods can, for example, count the number of jumps to a function (edge counters), the branch prediction miss rate and the cache miss rate. Execution times can also be monitored. With this functionality, the developer can see after an execution where a program spends most of its execution time, which can be used as a hint to where he or she should try to optimize. It can also be used by a compiler to decide whether parallelization is beneficial when optimizing the program.
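A minimal instrumentation sketch (hot_loop is a placeholder for the region of interest): timing a region with clock_gettime gives the kind of execution-time data described above; dedicated profilers such as gprof gather this kind of information automatically.

#include <stdio.h>
#include <time.h>

void hot_loop(void);    /* placeholder for the region of interest */

int main(void)
{
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    hot_loop();
    clock_gettime(CLOCK_MONOTONIC, &end);

    double seconds = (end.tv_sec - start.tv_sec)
                   + (end.tv_nsec - start.tv_nsec) / 1e9;
    /* A large share of the total run-time here marks a candidate
       for parallelization. */
    printf("hot_loop took %.3f s\n", seconds);
    return 0;
}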

3.3 Transforming code to remove dependencies

Some recurring dependencies can be removed trivially, which may make the loop parallel. The following sections present some of them.

3.3.1 Privatization of variables to remove dependencies

If a loop region is to be executed in parallel, some of the dependencies that were introduced only for sequential efficiency can be removed. An array can, for example, be allocated outside of a loop region, and the allocated space is then reused during execution. This reduces the memory usage.


When running in parallel this results in dependencies, because the same space is written and read several times. Using liveness analysis, it may be detected that the variable or array is live only during one iteration at a time. This particular example can be seen in Figure 3.4.

float A[N];
float s = 0;
float t = 0;
for (i = 0; i < N; i = i + 1) {
    // A is defined
    for (j = 0; j < N; j = j + 1) {
        A[j] = i * j;
    }
    // t is defined, A is used
    t = f(A);
    // A is unused after this point
    // t is used
    s += t;
    // t is unused after this point
}

float s = 0;
for (i = 0; i < N; i = i + 1) {
    // A and t are only live in this iteration
    // and can therefore be classified private
    float A[N];
    float t;
    for (j = 0; j < N; j = j + 1) {
        A[j] = i * j;
    }
    t = f(A);
    s = s + t;
}

Figure 3.4: An example of a variable and an array that is only live within a scope of an iteration.

3.3.2 Reduction recognition

A reduction statement in a loop introduces a dependency between iterations. An example of a reduction statement is when a sum is calculated. Because the add operation is associative, i.e. (a + b) + c = a + (b + c), the order in which the terms are operated on does not matter.

This is not always true, however. Reductions on floats can give different results depending on the order in which they are executed, because of rounding errors. OpenMP implements a pragma that handles simple reductions. It works by letting each thread have its own private reduction variable, which is then reduced into a global shared variable in a succeeding critical region, similar to what is seen in Figure 3.5.

3.3.3 Induction variable substitution

Another common dependency is the one created by induction variables. An induction variable is a variable that, for every iteration in a loop, is changed in the same manner, meaning that it can be predicted what the value of the variable will be after N iterations. These can often be substituted with the induction formula that defines them, as shown in Figure 3.6. The benefit of this is that the loop dependency is removed, but the evaluation of the induction variable can be costly, meaning that the transform is not always beneficial.


#pragma omp parallel for reduction(+:q)
for (i = 0; i < N; i = i + 1) {
    l = f(i);   // heavy execution
    q += l;
}

#pragma omp parallel private(q1, l)
{
    q1 = 0;
    #pragma omp for
    for (i = 0; i < n; i++) {
        l = f(i);   // heavy execution
        q1 += l;
    }
    #pragma omp critical
    {
        q += q1;
    }
}

Figure 3.5: A reduction recognition example using OpenMP.

c = 10;
for (i = 0; i < 10; i++) {
    // c is incremented by 5 for each loop iteration
    c = c + 5;
    A[i] = c;
}

for (i = 0; i < 10; i++) {
    // the dependency on the previous iteration is now removed
    c = 10 + 5 * (i + 1);
    A[i] = c;
}

Figure 3.6: A simple example of induction variable substitution.

3.3.4 Alias analysis

When a pointer refers to the same memory address as another pointer (or array), it is said that they alias each other. In parallel programs, if threads have pointers that alias the same memory location, problems such as race conditions can arise if the address is both read and written. The behavior of the program can become non-deterministic, in that different runs of the same program on the same data may produce different results. Thus, it is often important to impose an explicit ordering between threads which have aliases. In Figure 3.7, a pointer is aliasing an array, and both variables are accessed within the loop. By replacing all aliases with the original variable name, it becomes simpler to analyze for dependencies.


p = &A[3];
for (i = 0; i < 10; i++) {
    *p = y;
    A[i] = x;
    p++;
}

p = 3;
for (i = 0; i < 10; i++) {
    A[p] = y;
    A[i] = x;
    p++;
}

Figure 3.7: A simple example of a pointer aliasing an array.

3.4 Parallelization methods

The following methods are optimization techniques that can make a program parallel. They have been categorized into three categories: traditional parallelization methods, polyhedral parallelization methods and thread-level speculative parallelization. These methods can also be combined with the pattern recognition techniques that were presented in the previous section, such as reductions and privatization of variables.

3.4.1 Traditional parallelization methods

The traditional way of parallelizing programs is to gather the dependencies within the loop and identify over which iteration space a loop is parallelizable. This is done using dependence vectors. Given a loop nest as seen in Figure 3.8, the dependence vector is equal to [g1 − f1, g2 − f2]. The statements are loop independent if the distance vector contains only zeros; otherwise there is a loop-carried dependency. Traditional methods typically do not change the iteration space like the polyhedral methods do; only loop switching (interchange) occurs, so that the loop with independent iterations becomes the outermost loop. The traditional methods are much simpler and faster than the polyhedral parallelization technique, but can only handle simple cases.

for (i1 = 0; i1 < N1; ++i1) {
    for (i2 = 0; i2 < N1; ++i2) {
        // Stmt 1
        a[f1(i1, i2)][f2(i1, i2)] = ...;
        // Stmt 2
        ... = a[g1(i1, i2)][g2(i1, i2)];
    }
}

Figure 3.8: Example code to illustrate dependence vectors.

3.4.2 Polyhedral model

The polyhedral model [32] is a mathematical model that is used for loop nest optimizations. It can be used to optimize data locality by using tiles based on cache sizes and levels. Tiling is the process of grouping data accesses into chunks that will fit in cache, to reduce cache misses. The polyhedral method can also optimize the tiles for parallelism. Figure 3.9 shows the iteration space of a loop nest with an outer loop iterating over j and an inner loop iterating over i. The polyhedral optimization method is able to transform this loop nest into a skewed loop, creating an outer loop that iterates over t and an inner loop iterating over P. The inner loop has now become parallelizable with two threads. In Figure 3.9, the circles represent a statement and the smaller arrows represent the dependencies the statement has. In the skewed loop there are two statements on the same iteration of t which are independent and thus able to run in parallel. The long arrows show how the statements can be distributed over two threads.

Figure 3.9: A loop nest that has been transformed to be parallelizable.
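To make the transformation more concrete, the sketch below shows what a skewed (wavefront) version of a dependent loop nest might look like; the stencil kernel and the loop bounds are assumptions chosen for this illustration, not the exact loop behind Figure 3.9.

#define MAX(a, b) ((a) > (b) ? (a) : (b))
#define MIN(a, b) ((a) < (b) ? (a) : (b))

/* Original: every element depends on its upper and left neighbour,
   so neither the i nor the j loop can be parallelized directly. */
for (j = 1; j < N; j++)
    for (i = 1; i < M; i++)
        A[i][j] = A[i - 1][j] + A[i][j - 1];

/* Skewed form: all iterations on the same wavefront t = i + j are
   independent, so the inner loop can be shared among threads. */
for (t = 2; t <= M + N - 2; t++) {
    #pragma omp parallel for
    for (i = MAX(1, t - N + 1); i <= MIN(M - 1, t - 1); i++) {
        int j2 = t - i;                      /* private per iteration */
        A[i][j2] = A[i - 1][j2] + A[i][j2 - 1];
    }
}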

3.4.3 Speculative threading

Speculative threading is a whole other approach to parallelization. Instead of analyzing dependencies statically or dynamically, speculative reads and writes can be used, which keep track of which memory blocks are being accessed. This way, all loops can be assumed parallel. If two iterations running in parallel at run-time violate a dependency through a read or write access, the loop is clearly not parallel and the loop will be restarted. Gonzales-Escribano et al. [33] have presented a proposal for support of speculative threading in OpenMP. They suggest that read and write operations are replaced with function calls that check whether a dependency violation occurs before reading and writing. From the programmer's point of view, a worksharing directive would just contain an additional clause defining whether the variables should be speculatively checked. This requires implementing a speculative scheduler that is able to restart iterations that have failed due to dependency violations. Automatic speculative threading has been shown to work by Bhattacharyya et al. [34]. They use Polly [35] to implement heuristics for parallelizing regions that have been classified as possibly containing a dependency (regions that would otherwise not have been parallelized). They show that these heuristics manage to gain speedups comparable to normally parallelized regions.
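A minimal sketch of what such speculative accesses could look like is given below; the runtime functions spec_load, spec_store and conflicts_detected, as well as the recovery strategy, are hypothetical stand-ins for illustration and are not the interface proposed in [33].

/* Hypothetical speculative runtime: every access is logged per
   iteration so the scheduler can detect cross-iteration conflicts.
   A real runtime would also need internal synchronization. */
int  spec_load (int *addr, int iteration);
void spec_store(int *addr, int value, int iteration);
int  conflicts_detected(void);

/* All iterations are optimistically executed in parallel. */
#pragma omp parallel for
for (i = 0; i < N; i++) {
    int tmp = spec_load(&A[idx[i]], i);   /* indirect access: may alias A[i] */
    spec_store(&A[i], tmp + 1, i);
}
if (conflicts_detected()) {
    /* A dependency was violated at run-time: discard the speculative
       results and re-execute the offending iterations sequentially. */
}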

3.5 Auto-tuning

Auto-tuning is the process of automatically tuning the program for optimal performance. It is not always profitable to parallelize a loop if, for example, it ruins cache locality. Different heuristics can be created and used to approach the typical problems of parallelization. An auto-tuning step can look at dynamic information such as cache misses and decide whether the loop benefits from executing in parallel. The auto-tuner could in this case try a loop swap (inner loop and outer loop switch places) and measure the execution time of the change. Without dynamic data, heuristics can base the decision on static properties such as the number of instructions in the loop body or the iteration count, and conclude that it is not profitable to parallelize a particular loop if these are too small.
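As a toy illustration of such a static profitability check (the cost model, names and threshold below are invented for this sketch and are not taken from any of the cited tools):

/* Decide, from statically estimated quantities only, whether
   parallelizing a candidate loop is likely to pay off. */
static int parallelization_is_profitable(long trip_count,
                                         long insns_per_iteration,
                                         int num_threads)
{
    const long MIN_WORK_PER_THREAD = 10000;    /* assumed tuning constant */

    if (trip_count < num_threads)              /* too few iterations to share */
        return 0;
    if ((trip_count * insns_per_iteration) / num_threads < MIN_WORK_PER_THREAD)
        return 0;                              /* thread start-up would dominate */
    return 1;
}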


There is work done on finding pipeline parallelism by Tournavits et al. [6], which uses auto-tuning. This is highly relevant for embedded systems that process streaming data. By finding pipeline stages, the tool obtains information on how long each pipeline stage takes to execute. The stage with the highest load is then targeted for extracting additional pipeline stages. The algorithm used targets the bottleneck stages and divides them into smaller pipeline stages in order to get a better load balancing of the system. However, this technique has not been researched to the same degree as parallelizing independent loops and has not yet been adopted by parallelization tools, thus it will not be investigated further in this report.

Another interesting auto-tuning technique is implemented by Wang et al. [36]. Their implementation uses dynamic dependency analysis to determine the parallelism of a loop. It also gathers run-time information to determine the profitability of parallelizing the loop, together with heuristics they developed. Their tests show that in many of the benchmarks their automatic parallelizer is able to achieve performance equal to hand-optimized code. Further evaluation of their implementation shows the strength of their tuning heuristics, where optimized programs never perform worse than the original.

3.6 Conclusion

In this chapter a summary of different methods that are commonly used to parallelize and optimize a parallel program has been presented. Many other optimization methods exist, but they generally apply not only to parallel programs but to sequential programs as well, such as in-lining of functions at call sites or unrolling a loop to minimize the number of jumps.

According to Bae et al. [37], the pass that by itself enabled the most speedup is the variable privatization pass, due to the parallelism that it creates. Reduction recognition showed an impact as well, in two programs. In the benchmark used, the induction variables were already substituted, so that pass did not show any impact. Bae et al. [37] acknowledge earlier work, independent of their research, showing that these three passes have significant impact, which is confirmed by their results.

In a future benchmark, it would be interesting to see how efficient parallelization tools are at recognizing these dependencies and how they deal with them. Many loops contain these types of dependencies, and handling them can be an essential part of parallelizing a program.

Additionally, static dependency analysis is preferable to dynamic dependency analysis, since both safer code and tolerable compilation times are desired. Speculative threading is similar to dynamic dependency analysis, but it can be used for loops where static dependency analysis is unsure of the parallelism, and it does not affect compilation time. This makes dynamic dependency analysis redundant for automatic parallelization. Dynamic information about a loop, such as its execution time, is however still something to consider when deducing the profitability of a parallelization for an auto-tuner.

The parallelization methods sound effective and could potentially relieve the programmer of the burden of parallelizing complex loops. The polyhedral method together with speculative threading appears to be the most effective parallelization technique.

Auto-tuning is another technique that sounds useful for getting the most performance out of a program. It does, however, appear to be a complicated subject, and finding the best heuristics, based on the target platform and other parameters such as code size, requires a lot of work, perhaps even a step that uses machine learning methods.


Chapter 4

Automatic parallelization tools

There are many tools that can help a developer to parallelize their programs. In this chapter a subset of them is presented, followed by a comparison of their functionality. They were selected based on popularity: they are commonly referenced in the studied material. The tools found have been categorized into Parallelizers, Translators and Assistance tools. Parallelizers are tools that analyze the code, find parallel regions and parallelize them. Translators are capable of taking an already parallelized application and generating code for a different architecture. Assistance tools are applications that can help a programmer to parallelize an application by hand, or to gain more knowledge of their program.

4.1 Parallelizers

The following parallelization tools transform sequential code to run in parallel. They all have in common that they operate on loops. What differentiates them are the methods they use and where the parallelization is visible. The methods do not differ much; two categories can be seen: traditional parallelization and polyhedral parallelization. Some tools transform source code into parallel source code, meaning that the source code is visibly modified. Other tools transform the code without writing parallel source code; the parallelization is then only visible in the executable.

4.1.1 PoCC and Pluto

The polyhedral compiler collection (PoCC [38]) is a chain of tools that is able to perform polyhedral optimizations on annotated loops. Each tool has its own responsibility in the chain, such as parsing, dependency analysis, optimization or code generation. The actual parallelizer in the tool chain is the tool called Pluto [39]. The code generation is capable of outputting OpenMP annotations and vector instructions. It can also generate a hybrid parallel solution using OpenMP and MPI. To use this tool, the loops in the source code must be annotated with a pragma that isolates them. Only the annotated loop is parsed and converted to a polyhedral representation, provided that it fulfills all the properties required to be considered a static control flow region. The output is then either a sequential but optimized loop, or a parallel optimized loop if it was found to be parallel.
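As an illustration of this workflow, the candidate region is typically delimited with #pragma scop and #pragma endscop markers; the matrix-vector kernel below is only an invented example of how an annotated input might look, not code taken from the PoCC distribution.

/* Only the region between the pragmas is parsed and converted to the
   polyhedral representation. */
#pragma scop
for (i = 0; i < N; i++) {
    y[i] = 0;
    for (j = 0; j < N; j++)
        y[i] += A[i][j] * x[j];
}
#pragma endscop

If the region qualifies as a static control part and is found parallel, the generated output would be a tiled loop nest annotated with OpenMP pragmas.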
