
Bachelor of Science Thesis
Stockholm, Sweden 2013
TRITA-ICT-EX-2013:159

JOHAN ELANDER AMAN and ALEKSANDAR SEKULIC

Characterization of Task-based Benchmarks from the Barcelona OpenMP Task Suite

KTH Information and Communication Technology


Characterization of Task-based Benchmarks from the Barcelona OpenMP Tasks Suite

JOHAN ELANDER AMAN
ALEKSANDAR SEKULIC

Bachelor Thesis at the Department of Software and Computer Systems, KTH
Supervisor: Ananya Muddukrishna

Examiner: Mats Brorsson


Abstract

The parallel programming community is witnessing two main trends: the growing popularity of task-based programming models and the growing complexity of multicore hardware. In order to see how these two trends fit each other, it is important to characterize the behavior of task-based benchmarks on modern multicore hardware. Besides leading to benchmark design optimizations, such characterization enables making educated trade-offs in the design of task-handling middleware such as compilers and runtime systems.

In this thesis work, we characterize task-based benchmarks from the Barcelona OpenMP Tasks Suite (BOTS) at the task level. We focus on two aspects: how does the task scheduler fare in handling programmer-exposed parallelism, and how does the task decomposition make use of the hardware memory hierarchy. Our characterization considers two diverse multicore architectures: one built for server systems, and the other for embedded systems.

With respect to middleware, we consider GCC's implementation of OpenMP tasks. Our contributions are two-fold. First, we complement existing thread and application level characterization of BOTS benchmarks with a finer task and memory hierarchy level characterization. Next, we identify BOTS performance bottlenecks and optimize the Sort and FFT benchmarks to reduce their execution time by a factor of 2 and 1.65 respectively on the server architecture.


Contents

1 Introduction
2 Background information
  2.1 Parallel Computer Architecture
    2.1.1 AMD Opteron 6172 Overview
    2.1.2 TILEPro64 Overview
  2.2 Task-based parallelism
    2.2.1 OpenMP
  2.3 Existing characterizations of task-based benchmarks
    2.3.1 Barcelona OpenMP Tasks Suite: A set of benchmarks targeting the exploitation of task parallelism in OpenMP
    2.3.2 A comparison of some recent task-based parallel programming models
    2.3.3 Scheduling task parallelism on multi-socket multicore systems
    2.3.4 The PARSEC Benchmark Suite: Characterization and Architectural Implications
    2.3.5 Need for a different type of characterization
  2.4 Barcelona OpenMP Task Suite
  2.5 Performance considerations
    2.5.1 Speedup
    2.5.2 Performance Counters
    2.5.3 Specific to the Opteron 6172 and the TILEPro64
    2.5.4 Granularity
    2.5.5 Improving memory access patterns
3 Experimental Setup
  3.1 Design decisions
    3.1.1 What to measure
    3.1.2 How to measure
  3.2 Performance counting
    3.2.1 Performance counting setup and considerations for AMD Opteron 6172
    3.2.2 Performance counting setup for TILEPro64
    3.2.3 BOTS inputs and implementations
  3.3 Thesis' limitations
4 Results
  4.1 AMD Opteron 6172
    4.1.1 Alignment
    4.1.2 Fast Fourier Transform
    4.1.3 Fibonacci
    4.1.4 Floorplan
    4.1.5 Health
    4.1.6 N-Queens problem
    4.1.7 Sort
    4.1.8 Sparse-LU
    4.1.9 Strassen
  4.2 TILEPro64
    4.2.1 Alignment
    4.2.2 Fast Fourier Transform
    4.2.3 Fibonacci
    4.2.4 Floorplan
    4.2.5 Health
    4.2.6 N-Queens problem
    4.2.7 Sort
    4.2.8 Sparse-LU
    4.2.9 Strassen
5 Analysis
  5.1 Sort - Opteron 6172
  5.2 FFT - Opteron 6172
6 Conclusion
Bibliography
A Additional Graphs and figures


Chapter 1

Introduction

Task-based programming frameworks such as OpenMP, Cilk and TBB are becoming increasingly popular in the parallel programming landscape. Task-based programming allows programmers to easily write parallel programs which are performance-portable across a diverse range of modern multicore hardware. Task-based programming requires the programmer to express potential parallelism using abstract structures known as tasks. In the back-end, an architecture-specific runtime system and compiler work together to map tasks to available execution resources in order to achieve the lowest possible execution time.

Modern multicore architectures are complex. Large core counts, deep memory hierarchies and multiple interconnection networks are all thrown into the mix. Getting performance on multicore processors with few cores (< 8) is a simple matter of load balancing. Getting performance on multicore processors with larger core counts is, however, not so simple. Memory access and communication costs begin to dominate in performance models. Both programmers and task-handling middleware must, in addition to load balancing, take memory and communication behavior into account.

Of particular interest to our thesis work are the overheads introduced by the memory hierarchy of modern multicore architectures. Cores typically have their own private caches, at one or two levels. At the last level, all cores typically share a cache. Keeping this cache hierarchy coherent, especially in a multi-socket configuration, introduces significant communication overhead. Another architectural feature that introduces overheads is the way in which memory modules are distributed. Each socket typically has its own memory module. In a multi-socket configuration, cores can access their own memory module faster than a memory module belonging to a remote socket. This non-uniformity in memory access time is called Non-Uniform Memory Access (NUMA) behavior. On tiled many-core processors with local and remote caches, cache access times are also non-uniform, leading to Non-Uniform Cache Access (NUCA) behavior.

In our thesis work we aim to characterize task-based benchmarks on different multicore architectures, with particular focus on memory system performance. Task-based benchmarks consist of different types of tasks, each with its own memory access patterns. A single-unit characterization, for example at the thread level, cannot provide detailed information regarding benchmark behaviour. We address this limitation by characterizing benchmark behavior at the task level.


We employ two different multicore machines in our characterization. The first is a 48-core server machine based on the AMD Opteron 6172 processor. The second is a 64-core embedded system machine based on the Tilera TILEPro64 processor. The Opteron 6172 based machine has a multi-socket multicore architecture. The TILEPro64 based machine has a single-chip manycore architecture.

The benchmarks we characterize are from the Barcelona OpenMP Task Suite (BOTS) [1], developed by the Barcelona Supercomputing Center. BOTS consists of ten task-based benchmarks from various domains and with various decomposition strategies. BOTS is used extensively in task-handling middleware design and especially in task scheduling research.

Our objective is to characterize the BOTS benchmarks and explain whether memory access and communication overheads degrade scaling performance.

The objectives of the thesis are as follows:

• Quantify, as baseline, the scaling and memory hierarchy performance of all BOTS task-based benchmarks.

• Modify code and use tools to improve memory hierarchy performance of at least two benchmarks from the baseline.


Chapter 2

Background information

2.1 Parallel Computer Architecture

A parallel system can be characterized as a system where multiple processing units cooperate in executing one or several tasks. There exist numerous parallel architectures with different approaches, which calls for a classification of some sort. Flynn's taxonomy [15] classifies computer architectures according to the multiplicity of their data and instruction flows. The two possible combinations for parallel architectures in Flynn's taxonomy are SIMD (Single Instruction Multiple Data) and MIMD (Multiple Instruction Multiple Data) architectures [22].

In SIMD architectures, a central control unit issues the same instruction stream to multiple processing units, which then execute the same instructions on different data sets. These architectures are also called data parallel or vector architectures and are well suited to fields where computations can be expressed as vector and matrix operations.

In contrast to SIMD architectures, every processing unit in a MIMD structure has its own control unit. This enables the processing units to work independently of each other and execute independent instructions on different data sets. However, this requires synchronization and data exchange between processing units over the interconnection network when collaborating on a task.

In this study we consider two diverse MIMD architectures. The first is a 48-core server system based on the AMD Opteron 6172 processor. The second is a 64-core embedded system machine based on the Tilera TILEPro64 processor. The two architectures are explained in more detail below.

2.1.1 AMD Opteron 6172 Overview

The AMD server is a multi-chip multicore machine with a total of 48 cores and 64 GB of main memory. The cores are divided over four Opteron 6172 Magny-Cours processor chips. Each chip contains two dies with 6 physical cores each, operating at 2100 MHz. The main memory is divided over 8 nodes, one for each die.


Memory hierarchy

Each core has its own 128 KB L1 cache, which is divided into two equally large parts, one for instructions and one for data. Each core also has a 512 KB L2 cache. There is also a 6 MB shared L3 cache per die; however, by default 1 MB of it is reserved for AMD's HyperTransport Assist (HT Assist). HT Assist is a directory-based cache coherency protocol which stores its directory in the L3 cache, reducing the probe traffic of broadcast protocols and the latency issues of DRAM-based directory protocols [13]. To ease contention on the main memory, the AMD machine uses a Non-Uniform Memory Access (NUMA) design where each die is fitted with its own dual-channel 1333 MHz DDR3 memory controller. This naturally leads to cores having lower latencies to their local memory controller than to remote ones. The complete memory hierarchy is shown in figure A.1 in the appendix.

Interconnects

The processors communicate through an interconnect of 8-bit and 16-bit HyperTransport 3 (HT3) links, as illustrated in figure 2.1 [13]. Dies on the same chip are connected via a 16-bit and an additional 8-bit HT3 link. Each die is also connected to three other dies on different chips via three 8-bit links. There are two groups of directly connected dies: {0,2,4,6} and {1,3,5,7}. The cost for data transfers between these groups is two hops, unless the dies are on the same chip. The latencies and costs for data transfers across the dies can be seen in table 2.1.

Table 2.1: Opteron 6172 access latencies in ns [17]

Source   Local   Within socket         Other socket
                 On die    2nd die     1 hop    2 hops
L1       1.4     70.4      109.5       113.3    153.3
L2       7.1     70.4      109.5       113.3    153.3
L3       19.0    19.0      107.6       111.9    152.4
RAM      65.7    65.7      114.3       119.0    159.0

2.1.2 TILEPro64 Overview

The TILEPro64 machine is a 64-core mesh-architecture distributed-memory processor with 4 GB of main memory. The cores, or tiles, are interconnected with one another in an 8x8 two-dimensional grid. Each tile consists of a fully featured 700 MHz 32-bit VLIW (Very Long Instruction Word) processor, a cache hierarchy and a network switch that connects it to the grid. Applications may create sub-grids of tiles, allowing them to run on 1-64 cores. The 4 GB of main memory are distributed over four 800 MHz DDR2 memory controllers connected to the grid.

Memory hierarchy

Each tile has its own local L1 cache, consisting of 16 KB for instructions and 8 KB for data, as well as a 64 KB L2 cache. The TILEPro64 uses what Tilera calls a Dynamic Distributed Cache (DDC) as a last-level cache shared across all tiles of the grid. DDC allows a page of shared memory to be homed on a specific tile, called a home tile, or distributed across several tiles. These pages can then be cached remotely by other tiles, forming a dynamic last-level cache out of the combined L2 caches in the grid. Cache coherence is upheld by letting each cache line have a home tile. On a local L2 cache miss, a tile will go to the home tile of the wanted cache line and read the cache line into its L2 and L1 caches. The home tile also maintains a directory of the tiles sharing the cache line and sends invalidation messages to them if the cache line is altered. This process is illustrated in figure 2.2. As messages are routed from tile to tile to their destination, latencies increase with every hop, making locality in the DDC highly important. Latencies throughout all levels of the memory system are listed in table 2.2. [9]

Figure 2.1: AMD Opteron 6172 interconnects [17]

Table 2.2: TILEPro64 access latencies in cycles, 1 cycle is 1.33 ns [9]

Source                       Cycles
L1                           2
Local L2                     8
Dynamic Distributed Cache    30-60
RAM (typical)                80
Tile-to-tile network hop     1

Interconnects

All communication between the tiles, and between tiles and I/O devices, is done over a collection of networks called the iMesh interconnect, shown in figure 2.3. The iMesh can be divided into two classes of networks: user-visible networks for streaming and messaging, and networks for the memory system, I/O devices and other system-related tasks. To fit onto the iMesh interconnect, each tile is equipped with a dedicated non-blocking switch engine, connecting not only the processing engine but also the cache system to the grid. This allows the data routing to be separated from the processing engine.

Figure 2.2: Illustration of DDC cache coherency [9]


Figure 2.3: The TILEPro64 iMesh interconnect [7]


2.2 Task-based parallelism

In task-based models, algorithms are designed in terms of tasks. A task is essentially a block of code that can be executed independently and is in general far less expensive to spawn and destroy compared to threads. This allows for very fine-grained parallelism.

When generated, tasks are placed in a task pool from which threads can fetch work. This allows the programmer to use an arbitrary number of threads without having to rewrite the application, unlike thread-based and data-parallel models such as Pthreads or MPI. Another advantage of having a task pool is that it becomes easier to exploit irregular parallelism, as threads do not have predefined workloads but instead fetch a task from the task pool when idle.

There are several task-parallel models in the industry, including Cilk, TBB, Wool and OpenMP. Since this report investigates the architectural implications of the BOTS benchmarks, which are parallelized using OpenMP, let us familiarize ourselves with it.

2.2.1 OpenMP

OpenMP (Open Multiprocessing) is a multi-threaded application programming interface (API) for C/C++ and Fortran. It is supported by numerous platforms and compilers and has been widely used for high-performance computing [4]. OpenMP uses a shared-memory programming model and all threads have access to the same, globally shared address space. Data, however, can be either shared amongst all threads or private to single threads and explicit tasks.

OpenMP has always had implicit tasks in the form of parallel constructs which, once encountered, create an implicit task per thread. The notion of creating explicit tasks, with their own data environments, is however a relatively new feature of OpenMP, introduced in version 3.0 [10]. Explicit tasks are created using the OpenMP task construct, shown in Listing 2.1. As a thread encounters a task construct, the structured block, or task region, is packaged into a task instance together with any private data [23]. Tasks are, however, not guaranteed to be executed immediately at the time of task creation. A generated task will be put in the task pool and may be fetched by any available thread in the worker group¹. This makes the order in which tasks are spawned and executed non-deterministic, leaving it to the programmer to take proper synchronization measures. Tasks can either be spawned as tied tasks (the default) or untied tasks; tied tasks are tied to the thread which begins executing them, while untied tasks can migrate to another thread at task scheduling points. Tied tasks may only be suspended at these scheduling points so that the thread can execute a new task while waiting. A task scheduling point can be any of the following [11]:

• task constructs

• taskyield constructs

• taskwait constructs

• barrier directives

• implicit barrier regions

• the end of the tied task region

¹The group of threads that were spawned in the enclosing parallel section.


Listing 2.1: The OpenMP task construct

#pragma omp task [clause[[,] clause] ...]
    structured-block

clause:
    if(expression)
    untied
    default(shared | none)
    private(list)
    firstprivate(list)
    shared(list)

Listing 2.2: Simple OpenMP task-parallel example in C

int n, y;

int foo() {
    int x;
    n = 1;
    #pragma omp task shared(x) firstprivate(n)
    x = work1(n);
    #pragma omp task private(n)
    y = work2(n);
    #pragma omp taskwait
    return x + y;
}

int main() {
    #pragma omp parallel
    #pragma omp single
    foo();
}

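As a complement to Listing 2.2, the following minimal sketch (not taken from BOTS or the thesis measurement code) illustrates the untied clause and the taskyield construct discussed above; result_ready and consume_result are hypothetical helper functions used only for this illustration.

/* Illustrative sketch: an untied task polling for a result and yielding
 * at an explicit task scheduling point. */
extern int result_ready(void);     /* hypothetical helper */
extern void consume_result(void);  /* hypothetical helper */

void poll_and_consume(void)
{
    #pragma omp task untied        /* may resume on another thread after a scheduling point */
    {
        while (!result_ready()) {
            #pragma omp taskyield  /* task scheduling point: the thread may run other tasks */
        }
        consume_result();
    }
}

Because the task is untied, the thread that resumes it after a taskyield need not be the thread that started it, which is exactly the migration behaviour discussed in the data-environment section below.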

The task data-environment

As mentioned above, OpenMP tasks are entitled to private data. Therefore there is a need for tasks to have their own data environment. New storage is allocated for private and firstprivate variables, and all references in the task region are redirected to point to the new storage. The difference between them is that the storage created for firstprivate variables is initialized to the original variable's value at the time of task creation, while storage for private variables is left uninitialized. No new storage is allocated for shared variables; all references to shared variables in the task region point to the original storage.

Global variables and variables in scope prior to the parallel section are shared by default and do not need a shared clause. Latencies for communication between distant cores can be considerably longer than between close ones. Therefore it is important to know where a shared variable is allocated and which threads reference it the most. Untied tasks experience similar problems. When a task migrates to another thread there are two possibilities: 1) its data environment is copied over to the new thread, or 2) the variable references point to the original storage [18]. As the OpenMP documentation does not state this clearly, it depends on the implementation. Both cases, however, incur overheads when a task migrates to a thread on a distant chip.


An illustration

To better understand the task data environment, and OpenMP in general, let us consider the simple example in Listing 2.2.

In the main function we see a parallel construct followed by a single construct. The parallel construct initiates a parallel region by spawning a pool of threads. The number of threads spawned is determined either by the environment variable OMP_NUM_THREADS or at runtime by the API call omp_set_num_threads. If the number of threads is not defined by either, as many threads as there are cores or hardware threads are used by default. The OpenMP scheduler will then dynamically bind threads to hardware threads or cores. The single construct states that only one of the threads within the parallel region is allowed to call foo.

The function foo consists of two task directives, as well as the variable x. When the thread executing foo encounters these constructs, it will package the two structured blocks x = work1(n); and y = work2(n); together with their respective private data into task instances and place them in the task pool, allowing any idle thread to pick them up. The thread will then wait at the taskwait construct before returning the result x + y. The first task has the shared(x) and firstprivate(n) clauses. The shared clause will, as previously mentioned, redirect all references to x in the task region to the original variable, while the firstprivate clause will create a copy of n initialized to the value 1. The second task only has a private(n) clause, which will create an uninitialized copy of n in that task's data environment. However, since the variable y is in the global scope and declared prior to the parallel region, it is shared by default, meaning that all references in the second task's task region will point to the original variable as well.

A simplified illustration of the data environments for the program in Listing 2.2 is shown in figure 2.4.


Figure 2.4: Data-environment illustration of the OpenMP example in Listing 2.2


2.3 Existing characterizations of task-based benchmarks

A literature study was made at the beginning of this project to find guidelines and motivation for a workload characterization of task-parallel benchmarks.

2.3.1 Barcelona OpenMP Tasks Suite: A set of benchmarks targeting the exploitation of task parallelism in OpenMP

Duran et al. [14] present an evaluation of the BOTS benchmark suite performed on a 32-core Altix 4700 system. The focus of this experiment lies on the different compilation and runtime alternatives of the benchmarks and how they affect the performance of the applications.

2.3.2 A comparison of some recent task-based parallel programming models

The work of Podobas et al. [21] compares the ability of different task-based parallel models to handle fine-grained tasks. Microbenchmarks are used to characterize task spawning and stealing costs, while the BOTS benchmark suite is used to characterize application performance. The machine used for the experiments was a Dell PowerEdge SC1435 dual quad-core Opteron server with 16 GB of memory. A noteworthy result for our experiment is that OpenMP performed poorly when the task granularity became small.

2.3.3 Scheduling task parallelism on multi-socket multicore systems

Olivier et al. [20] propose a work-stealing scheduler for OpenMP which is aware of memory locality. They characterize and compare their implementation to the OpenMP implementations of GCC and ICC using the BOTS benchmark suite. The characterization was performed on a quad-socket Dell PowerEdge M910 using four 8-core Intel X7550 processors.

2.3.4 The PARSEC Benchmark Suite: Characterization and Architectural Implications

Bienia et al. [12] present a workload characterization of a multithreaded benchmark suite which implements multiple parallel models. The objective of their work was to examine architectural implications of different parallel models on a simulated chip multiprocessor. The main points of interest for this report were working sets and locality, communication-to-computation ratio, sharing, and off-chip traffic.

2.3.5 Need for a different type of characterization

Although many of the existing studies evaluate and characterize the BOTS benchmark suite, there is, to our knowledge, none to date that proposes a task-wise workload characterization for our architectures. The benefit of doing a task-wise characterization is the possibility of excluding the overheads of the parallel model and therefore getting closer to the actual memory system implications for the benchmarks.

2.4 Barcelona OpenMP Task Suite

Barcelona OpenMP Task Suite (BOTS) is a task-parallel benchmark suite developed by the Barcelona Supercomputing Center with the purpose of testing different implementations of OpenMP tasks on multicore architectures. However, given the variety of the benchmarks and their respective domains, it is also well suited for investigating architectural effects on task-parallel applications. The suite consists of ten irregularly parallel applications which are written in C and use OpenMP tasks. Some of the applications were developed by BSC while others are modified versions from other benchmark suites, including the Cilk project, the Application Kernel Matrix project and the Olden suite. [14]

Listing 2.3: if-clause cutoff

#pragma omp task if(condition)
    work();

Listing 2.4: manual cutoff

if (condition) {
    #pragma omp task
    work();
} else {
    work_sequential();
}

The suite offers several implementations of each benchmark (Table 2.3), most of which have to do with the way tasks are spawned. A list of the implementation choices that are relevant to this thesis follows:

• Tied or untied tasks - Every benchmark in the suite can be run using either tied or untied tasks.

• Single and for versions - Some of the benchmarks also have the option to spawn tasks either from a single region (the single version) or from a parallel for loop (the for version).

• Depth-based cutoff - The goal of task parallelization is to have enough tasks so that no thread has to wait idly for work; however, this needs to be balanced against the fact that having too many concurrent tasks only creates unnecessary overhead for no gain. Task creation can therefore be limited in some benchmarks with a cutoff value. If the cutoff condition becomes false, no new tasks are spawned; the work is instead appended to the calling task. Hence, if no cutoff is used there is no limit on task creation. One of the two ways to limit the creation of new tasks in BOTS is the if_clause implementation, which relies on the task directive, as demonstrated in Listing 2.3. The other way, shown in Listing 2.4, is the manual-cutoff implementation, which does not rely on the task directive but on a regular if statement.

The baseline speedup graphs presented in figure 2.5 show only one implementation per benchmark, for reasons explained in chapter 3. The tied option is used for all benchmarks. The benchmarks that have a cutoff use the manual-cutoff version, and the benchmarks with the option of either single or for use the single version. The graphs show the speedup for the BOTS benchmarks with the same options and implementations as when counting events, except that there are no overheads from counting events. The declines and stagnations shown in the graphs are the points of interest for this thesis.



Figure 2.5: The baseline speedups for the BOTS benchmarks on the Opteron 6172 and TILEPro64 processors using gcc 4.4.6 (Opteron) and gcc 4.4.3 (TilePro64).

Table 2.3: BOTS benchmarks [14]

Application  Domain                  Computation   Nr of task  Nr of task  Tasks inside  Nested  Application
                                     structure     directives  types       omp...        tasks   cutoff
Alignment    Dynamic programming     Iterative     1           1           Single/For    no      none
FFT          Spectral method         At leaves     41          15          Single        yes     none
FIB          Integer                 At each node  2           1           Single        yes     Depth-based
Floorplan    Optimization            At each node  1           1           Single        yes     Depth-based
Health       Health-care simulation  At each node  1           1           Single        yes     Depth-based
Nqueens      Search                  At each node  1           1           Single        yes     Depth-based
Sort         Integer sorting         At leaves     9           2           Single        yes     none
SparseLU     Sparse linear algebra   Iterative     4           4           Single/For    no      none
Strassen     Dense linear algebra    At each node  8           1           Single        yes     Depth-based
UTS          Unbalanced Tree Search  At each node  2           1           Single        yes     none


2.5 Performance considerations

2.5.1 Speedup

The speedup is the performance gain of using a parallel application on a parallel platform. The offered speedup is the maximal theoretical speedup, calculated by dividing the performance of all cores by the performance of a single core. The achieved speedup is given by the sequential execution time divided by the parallel execution time, and provides the actual speedup of an application.

Offered speedup = Performance of n cores / Performance of 1 core

Achieved speedup = Execution time on 1 core / Execution time on n cores

For example, a benchmark that takes 120 seconds on one core and 15 seconds on 16 cores achieves a speedup of 8, against an offered speedup of 16.

2.5.2 Performance Counters

Another way to measure performance is to monitor the behaviour of the application through hardware counters. These counters are used to measure occurrences of hardware events, such as CPU clock cycles and cache misses. By counting the CPU clock cycles of a parallel application, both the work balance between cores and the core utilization can be determined. The hardware counters can also be used to find hardware bottlenecks in a system architecture.

2.5.3 Specific to the Opteron 6172 and the TILEPro64

Opteron 6172

Each core in the Opteron 6172 possesses a set of four 48-bit hardware counters, allowing it to count hardware events independently of other cores. The AMD processor allows multiplexing, making it possible for cores to count up to 512 simultaneous events. This, however, makes the results less precise and therefore not very suitable for measuring fine-grained tasks. There are also some hardware events that occur on shared resources and may not be counted by individual cores. One core will instead keep a shared count of event occurrences for the cores that share these resources, e.g. the L3 cache and main memory. [2]

TILEPro64

The tiles in the TILEPro64 are equipped with four 32-bit hardware counters each. These counters can, if needed, be combined and used as two 64-bit counters. Unlike the Opteron, however, the TILEPro64 does not allow multiplexing. Another difference from the Opteron is that there is no notion of shared hardware events. This allows each tile to count any event independently of the other tiles. [8]

2.5.4 Granularity

Having an appropriate task granularity is central to achieving good performance in a parallel application, since work balance is heavily correlated with granularity. If tasks are too coarse-grained, the work may not be evenly distributed over the cores, while too fine-grained tasks may introduce unnecessary task-creation overhead.


2.5.5 Improving memory access patterns

As shown in Tables 2.1 and 2.2, there is a significant difference in latency between retrieving data from lower and higher cache levels, or from local and distant NUMA nodes. This is why good cache locality and memory access patterns are key for parallel computing, and for algorithms in general.

It is important to, when possible, group accesses to variables that reside in the same cache lines, to avoid unnecessary evictions. Another important aspect is to consider where memory is allocated on many-core machines. Keeping data allocated on NUMA nodes close to the cores that use it the most can have a noticeable impact on performance if the data is accessed frequently. Having the majority of the data allocated on a single or only a few NUMA nodes can also cause unnecessary contention in the memory system. Using an allocation strategy which distributes the data over the NUMA nodes can in some cases be the answer to this issue.
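As an illustration of the last point, and not taken from the BOTS code, the sketch below relies on first-touch placement, assuming a Linux-like system where a page is physically allocated on the NUMA node of the core that first writes it and where threads are pinned to cores (as in Section 3.2.1). Initializing the data with the same static loop schedule as the later computation then places most pages on the local node of the core that uses them.

#include <stdlib.h>

#define N (1 << 24)

int main(void)
{
    double *a = malloc(N * sizeof(double));  /* virtual allocation only; no pages are placed yet */

    /* First touch: each thread writes the chunk it will later work on,
     * so the backing pages end up on that thread's local NUMA node. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++)
        a[i] = 0.0;

    /* The computation reuses the same static schedule and therefore
     * mostly accesses pages that are local to the executing core. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++)
        a[i] = 2.0 * a[i] + 1.0;

    free(a);
    return 0;
}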


Chapter 3

Experimental Setup

In this chapter we explain the methodology and tools used in order to characterize BOTS benchmark performance. We first explain the experimental design decisions. Next, we describe in detail the tools we used for performance counting. Lastly, we discuss the limitations of our characterization methodology.

3.1 Design decisions

Existing task-based benchmark characterizations from Bienia et al. [12] and Woo et al. [24] provide a deep architectural analysis at the thread level. However, benchmarks are often too complex and irregular in their parallelism to be characterized and measured as one unit (at the thread level).

We argue that thread-level characterization can be complemented with characterization at the task level. By doing so, not only does one get a finer-grained picture, but one can also avoid measuring the overheads of parallelization such as task creation and synchronization. The drawback of measuring task-wise, however, is the added work for the programmer. There are several tools that allow thread-level measurements with little or no code modification. For task-wise measurements, however, there are no such tools to the best of our knowledge. The programmer has to annotate the source code directly, which can be quite tedious and time consuming. Measurements could of course be made at the runtime system level, but this depends on runtime source code availability and the competence to understand it.

To further narrow down our testing environment and avoid measuring events and occurrences that are outside the scope of the characterization, tied tasks were used as well as threads tied to cores. This is due to the fact that a task or thread migration will inevitably cause increased cache misses when transferred to a new core, in addition to the overhead of the migration itself. This also simplifies the measurement of performance counters, as hardware counter values do not migrate with tasks or threads.

3.1.1 What to measure

This characterization is mainly focused on memory access patterns. To measure this, a set of hardware events to monitor has been selected for both architectures, including accesses and misses for all levels of cache and for the NUMA nodes. The measuring is focused only on the workload of the benchmarks and does not include the spawning and destruction of threads and tasks, or OpenMP's implicit synchronization of tasks. As the benchmarks consist of a large number of task types, one also gains some understanding of how the architectures handle tasks of different granularity.

Listing 3.1: Snapshot from the Fibonacci benchmark displaying how the performance counters are measured

int fib(int n, int d, int id0) {
    reset_counters();
    int x, y;
    if (n < 2) {
        store_counters(id0, tasktype);
        return n;
    }
    if (d < bots_cutoff_value) {
        store_counters(id0, tasktype);
        #pragma omp task shared(x) firstprivate(n)
        x = fib(n - 1, d + 1, __sync_fetch_and_add(&id, 1));
        #pragma omp task shared(y) firstprivate(n)
        y = fib(n - 2, d + 1, __sync_fetch_and_add(&id, 1));
        #pragma omp taskwait
        reset_counters();
    } else {
        state_cutoff(id0);
        x = fib_seq(n - 1, 1, id0);
        y = fib_seq(n - 2, 1, id0);
    }
    store_counters(id0, tasktype);
    return x + y;
}

3.1.2 How to measure

In order to avoid measuring as much as possible of the parallel model's overhead (OpenMP in this case), it is important to begin and stop monitoring the counters at the right points in the code. First, all counters are reset at the beginning of the task region by the core or tile that is executing the task, to avoid reading any garbage values. Upon encountering any task scheduling point, the counters are read and stored prior to the scheduling point and then reset again directly after it, for the reasons explained earlier in section 2.2.1. Lastly, the counters are read and stored a final time before exiting the task region. All these values are appended and stored together with a time stamp and a unique task id, to help us separate tasks and process the data. This is demonstrated in Listing 3.1.

To attain as accurate measurement results as possible, the testing on the AMD machine was done in undisturbed sessions. Since the TILEPro64 is controlled over the PCI interface, there can only be one user present at a time, which ensures undisturbed conditions.

To further increase the accuracy and avoid extreme values, each set of performance counters was measured 5 times on both machines.

3.2 Performance counting

The tools used for the performance counter measurements in this thesis are the Performance Application Programming Interface (PAPI) [6] and Tilera's own performance counter API, based on a modified version of OProfile [5]. They are used for the Opteron 6172 and the TILEPro64 respectively. These APIs are not specifically designed for task-wise measuring, but they allow the user to annotate the source code and offer the ability to choose which events to monitor and where to start, stop and reset counters. This is essentially all that is needed in order to measure performance counters task-wise. A special feature in PAPI is the overflow detection, which allows the user to specify the threshold for an overflow and how it should be treated. A combination of PAPI preset events and native events was used to be able to count all events of interest.

Listing 3.2: PAPI setup example for AMD Opteron 6172

void function() {
    init_papi();
    #pragma omp parallel
    {
        int thread_id = omp_get_thread_num();
        set_proc_affinity(thread_id);
        register_thread(thread_id);
        add_event(thread_id, event_list);
        start_counters(thread_id);
        /* PARALLEL WORK */
        unregister_thread(thread_id);
    }
    print_counters();
}

3.2.1 Performance counting setup and considerations for AMD Opteron 6172

Listing 3.2 shows the way PAPI was set up for our benchmarks. First, the init_papi function is invoked to initialize PAPI. The functions contained in the scope of the parallel section must be executed there so that they are executed by each running thread. The set_proc_affinity function ties the executing thread to a core by internally calling sched_setaffinity. This is to avoid thread migrations, as discussed in Section 3.1.

The register_thread function registers the executing thread with PAPI and creates an empty event set for that thread. Events are added to the empty event sets by invoking add_event, which also attaches an overflow handler to one of the events. The event we chose to trigger the overflow handler was the total-cycles counter, as it is always incremented and will therefore overflow first. If an overflow occurs, all counters are stored and reset; however, values are not stored if the overflow occurred in a synchronization state. Each task has an identifier, which is generated with the atomic function __sync_fetch_and_add. At the end of the parallel section all executing threads are unregistered from PAPI. Outside the parallel section the results are printed to either the output stream or a file.
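For readers unfamiliar with PAPI, the sketch below shows what such per-thread counting looks like when written directly against the PAPI API. It is a minimal illustration rather than our actual wrapper code: error handling is omitted and the two preset events are assumptions of the sketch (their availability on a given machine can be checked with papi_avail).

#include <papi.h>
#include <pthread.h>
#include <stdio.h>

int main(void)
{
    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
        return 1;
    PAPI_thread_init((unsigned long (*)(void)) pthread_self);

    #pragma omp parallel
    {
        int event_set = PAPI_NULL;
        long long values[2];

        PAPI_register_thread();
        PAPI_create_eventset(&event_set);
        PAPI_add_event(event_set, PAPI_TOT_CYC);   /* total cycles */
        PAPI_add_event(event_set, PAPI_L2_DCM);    /* L2 data cache misses (assumed available) */
        PAPI_start(event_set);

        /* ... task region: reset at entry, read and store at task
         * scheduling points and before the task returns ... */

        PAPI_stop(event_set, values);
        PAPI_unregister_thread();

        #pragma omp critical
        printf("cycles=%lld  l2_dcm=%lld\n", values[0], values[1]);
    }
    return 0;
}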

PAPI includes some useful tools, such as papi_avail and papi_native_avail, which list the available PAPI preset events and native events respectively. When choosing a native event it is important to choose an event code with a sub-mask defining how that event should be measured; choosing the event without a sub-mask will not give correct values, if any at all. The papi_event_chooser tool is also useful, as it tells which events can be counted simultaneously.


Listing 3.3: Resetting counters for TILEPro64

void reset_counters() {
    __insn_mtspr(SPR_PERF_COUNT_0, 0);
    __insn_mtspr(SPR_PERF_COUNT_1, 0);
    __insn_mtspr(SPR_AUX_PERF_COUNT_0, 0);
    __insn_mtspr(SPR_AUX_PERF_COUNT_1, 0);
}

Listing 3.4: Setting up performance counters for TILEPro64

void setup_counters() {
    __insn_mtspr(SPR_PERF_COUNT_CTL,
                 tilecounter[counter_group][0] | (tilecounter[counter_group][1] << 16));
    __insn_mtspr(SPR_AUX_PERF_COUNT_CTL,
                 tilecounter[counter_group][2] | (tilecounter[counter_group][3] << 16));
}

3.2.2 Performance counting setup for TILEPro64

The performance counter registers are accessed directly through the __insn_mtspr and __insn_mfspr intrinsics. To set up the counting, at the beginning of every parallel section each thread calls setup_counters (Listing 3.4), which assigns the chosen performance counters to be measured. In each parallel section, each thread also calls set_proc_affinity, a function that ties the thread to a specific core by internally calling sched_setaffinity.

Resetting of the counter values is shown in Listing 3.3. Reading the counters is done by calling __insn_mfspr with the number of the specific register, e.g. __insn_mfspr(SPR_PERF_COUNT_0), where SPR_PERF_COUNT_0 is 0x4205.
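For completeness, a read helper mirroring the reset and setup helpers in Listings 3.3 and 3.4 could look as follows. This is an illustrative sketch rather than the exact thesis code, and it assumes the SPR_* register names provided by Tilera's tool chain headers.

/* Read the four per-tile performance counters into values[0..3]. */
void read_counters(unsigned int values[4])
{
    values[0] = __insn_mfspr(SPR_PERF_COUNT_0);
    values[1] = __insn_mfspr(SPR_PERF_COUNT_1);
    values[2] = __insn_mfspr(SPR_AUX_PERF_COUNT_0);
    values[3] = __insn_mfspr(SPR_AUX_PERF_COUNT_1);
}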

3.2.3 BOTS inputs and implementations

Applications that run for a very short time are more vulnerable to small factors such as fluctuations in the CPUs or the runtime system. Therefore, in order to get reliable data, no benchmark should execute in under a few seconds. The BOTS inputs were, for this reason, custom-fitted for every benchmark on both the Opteron and the TILEPro64.

Benchmarks that require input files only have a limited set of inputs to choose from and can therefore not always uphold this standard.

As for the different implementations of the BOTS benchmarks, tied-task implementations are used for the reasons explained in 3.1, as well as manual-cutoff and single versions where available. With the if-clause cutoff versions it is impossible to know whether the cutoff condition is fulfilled or not. This means that it is also impossible to know if a new task is created or if the work is appended to the current task. The manual-cutoff implementation, however, uses separate calls depending on whether a new task is spawned or not. This allows us to know when a new task is created, so that it can be measured as an individual task. The reason for using the single versions is simply to avoid the implicit tasks that the for versions create. The single version also outperformed the for version in all benchmarks where it was available.


All the benchmarks in the suite were compiled using GCC. For the Opteron, GCC version 4.4.6 was used with the third optimization level (O3). The TILEPro64 benchmarks were cross-compiled with tile-gcc version 4.4.3. For the Tilera, only the second optimization level (O2) was used, since some of the Tilera-specific calls were affected by O3.

Table 3.1: Benchmark versions, inputs and parameters

Application  Version       Input Opteron  Cutoff Opteron  Input TILEPro64  Cutoff TILEPro64
Alignment    single, tied  prot.100.aa    -               prot.100.aa      -
FFT          tied          33554432       -               33554432         -
FIB          manual, tied  48             12              45               15
Floorplan    manual, tied  input.20       5               input.15         5
Health       manual, tied  medium.input   1               medium.input     2
Nqueens      manual, tied  15             5               13               5
Sort         tied          40000000       -               40000000         -
SparseLU     single, tied  30             -               15               -
Strassen     manual, tied  4096           5               1024             6

3.3 Thesis’ limitations

• The characterization will focus on the memory system.

• Only characterize task-parallel benchmarks.

• Using the GCC implementation of OpenMP. The reasoning behind this is to be consistent, as Tile-GCC is the only OpenMP version we had available for the TILEPro64, and to provide insight as to why GCC performs poorly.

• Only characterize tied-task implementations.

• Only characterize using threads that are tied to cores.

• Only characterize for one input per application.

• Using the manual-cutoff implementation where the option is given.

• Using the single implementation where the option is given.


Chapter 4

Results

In this chapter, we characterize benchmarks chosen in section 3.2.3 using three kinds of graphs.

The first graph type is a set of box plots indicating load balance over threads. A thin box plot means that the workload is well balanced, and vice versa. The box plots are calculated with a 95% confidence interval. Outliers, marked with red crosses, indicate values that are not within the confidence interval.

The second graph type shows total work and stall cycles for each task-type and thread configuration. Due to limited availability of performance counters on the Opteron 6172, we are only able to show combined stall cycles on any resource. However, extensive performance counters on TILEPro64 allow us to look at the stall cycles in more detail.

The third graph type shows cache behaviour separated by task type and cache level. Again, due to the limited availability of performance counters on the Opteron 6172, we are only able to characterize the L1 and L2 caches.

Table 4.1 explains what the different task-types compute and their granularity for both machines.


Table 4.1: Task-type descriptions for the benchmarks. '*' indicates a task-type for which the depth-based cut-off has been reached. Granularities are given in cycles.

Alignment
  1   pairalign: Computes one alignment of a protein sequence using the Myers and Miller algorithm [19].
      Granularity: ~6.15 * 10^6 (Opteron), ~6.74 * 10^7 (TILEPro64)

Fast Fourier Transform
  1   compute_w_coefficient: Computes the W (twiddle-factor) coefficients and stores them into an array.
      Granularity: ~17000 (Opteron), ~240000 (TILEPro64)
  2   fft_aux: Calculates the length of the FFT that will be computed.
      Granularity: ~930 (Opteron), ~60000 (TILEPro64)
  3   unshuffle_16: Separates the elements of a given sequence into two shorter sequences, the first containing the elements in the even-numbered positions of the given sequence, the second containing those in the odd-numbered positions.
      Granularity: ~4300 (Opteron), ~30000 (TILEPro64)
  4   twiddle_16: Performs the butterfly operation of the Cooley-Tukey algorithm, dividing the FFT computation into smaller DFT computations.
      Granularity: ~72000 (Opteron), ~9.2 * 10^5 (TILEPro64)

Fibonacci
  1   fib: Performs one step in the recursive Fibonacci algorithm. Spawns a new task at each recursive call.
      Granularity: ~800 (Opteron), ~280 (TILEPro64)
  1*  fib: Computes a recursive call where the depth-based cutoff has been reached and therefore does not spawn any new tasks.
      Granularity: ~8.9 * 10^6 (Opteron), ~6.2 * 10^6 (TILEPro64)

Floorplan
  1   add_cell: Computes a branch of the recursive branch-and-bound search of the floorplan algorithm.
      Granularity: ~1500 (Opteron), ~4100 (TILEPro64)
  1*  add_cell: Computes the recursive branch-and-bound search of the floorplan algorithm where the depth-based cutoff has been reached and therefore does not spawn any new tasks for new branches.
      Granularity: ~1.06 * 10^6 (Opteron), ~4.6 * 10^6 (TILEPro64)

Health
  1   sim_village_par: Each task performs the simulation for a single village, recursively creating new tasks for each village in its list.
      Granularity: ~45000 (Opteron), ~48000 (TILEPro64)
  1*  sim_village_par: Each task performs the simulation for all the remaining villages in its list once the cutoff has been reached (the cutoff is based on the village level in the hierarchy of the simulation).
      Granularity: ~8.2 * 10^6 (Opteron), ~6.2 * 10^5 (TILEPro64)

N-Queens
  1   nqueens: A task is generated for each possible position of a queen on the board in each step of the solution.
      Granularity: ~45000 (Opteron), ~770 (TILEPro64)
  1*  nqueens: Recursively tests each possible position of a queen on the board in each step of the solution without generating new tasks, once the depth-based cutoff has been reached.
      Granularity: ~8.25 * 10^6 (Opteron), ~2.7 * 10^5 (TILEPro64)

Sort
  1   Cilksort_par: Splits the array four ways in each recursive step, generating a task for each "quarter". Once the arrays are small enough they are sorted by a sequential quicksort.
      Granularity: ~62000 (Opteron), ~2.5 * 10^5 (TILEPro64)
  2   Cilkmerge_par: A parallel mergesort, generating a task for each recursive step.
      Granularity: ~14500 (Opteron), ~60000 (TILEPro64)

Sparse-LU
  1   sparselu_call_par:  Granularity: ~6.9 * 10^7 (Opteron), ~1.9 * 10^9 (TILEPro64)
  2   fwd_par:            Granularity: ~3.3 * 10^6 (Opteron), ~9.2 * 10^7 (TILEPro64)
  3   bdiv_par:           Granularity: ~3.06 * 10^6 (Opteron), ~9.7 * 10^7 (TILEPro64)
  4   bmod_par:           Granularity: ~1.29 * 10^7 (Opteron), ~2.23 * 10^8 (TILEPro64)

Strassen
  1   OptimizedStrassenMultiply: Performs the operation C = A x B efficiently for large matrices A, B and C, by splitting the matrices into quadrants and spawning a new task for each recursive call.
      Granularity: ~1.03 * 10^7 (Opteron), ~2.8 * 10^7 (TILEPro64)
  1*  OptimizedStrassenMultiply: A depth-based cutoff stops the generation of new tasks to avoid tasks becoming too fine-grained.
      Granularity: ~2.3 * 10^7 (Opteron), ~5.5 * 10^6 (TILEPro64)


Table 4.2: Breakdown of the BOTS benchmarks characterization for the Opteron 6172 architecture. Arrows are used to indicate the growth of the cache miss-rates.

Application  Work balance                 Thread workload             L1 cache miss-rate  L2 cache miss-rate
Alignment    Excellent                    Low stall-cycles            Very low            Low %
FFT          Poor                         High stall-cycles           Low                 High (Sparse)
FIB          Poor                         Low stall-cycles            Low                 High (Sparse)
Floorplan    Poor                         High for high core counts   Low                 Moderate
Health       Good apart from 48 cores     High stall-cycles           Low %               Moderate
Nqueens      Fairly good                  High for high core counts   Low                 High
Sort         Poor                         High for high core counts   Low                 Low
SparseLU     Very good                    High stall-cycles           Low                 Low/Moderate
Strassen     Good apart from eight cores  Moderate/High stall-cycles  Low (Sparse)        Moderate (Sparse)

Table 4.3: Breakdown of the BOTS benchmarks characterization for the TILEPro64 architecture. Arrows are used to indicate the growth of the cache miss-rates.

Application  Work balance                Thread workload             L1 cache miss-rate  Local L2 miss-rate  Remote L2 miss-rate
Alignment    Excellent                   Low stall-cycles            Low                 Very low            Low %
FFT          Poor                        Moderate stall-cycles       Moderate (Sparse)   Low (Sparse)        High (Sparse)
FIB          Poor                        Low stall-cycles            Low (Sparse)        Very low            Low %
Floorplan    Poor apart from four cores  Low                         High                Low                 Moderate/High %
Health       Poor                        High stall-cycles           Moderate            Low                 High
Nqueens      Very poor                   Low/Moderate stall-cycles   Low                 Low                 Moderate/High %
Sort         Poor                        Moderate/High stall-cycles  Very low            Low                 High (Sparse)
SparseLU     Good apart from two cores   Low/Moderate stall-cycles   Low %               Very low            Low
Strassen     Poor                        Low stall-cycles            Low (Sparse)        Low                 Moderate/High


4.1 AMD Opteron 6172

All benchmarks were run using 2, 4, 8, 16, 32 and 48 threads, on as many cores.

4.1.1 Alignment

Figure 4.1: Alignment on the Opteron 6172: (a) load balance, (b) task-type breakdown of total work and stall cycles, (c) L1 data cache miss-ratio, (d) L2 data cache miss-ratio.


4.1.2 Fast Fourier Transform

Figure 4.2: FFT on the Opteron 6172: (a) load balance, (b) task-type breakdown of total work and stall cycles, (c) L1 data cache miss-ratio, (d) L2 data cache miss-ratio.


4.1.3 Fibonacci

Figure 4.3: Fibonacci on the Opteron 6172: (a) load balance, (b) task-type breakdown of total work and stall cycles, (c) L1 data cache miss-ratio, (d) L2 data cache miss-ratio.


4.1.4 Floorplan

Figure 4.4: Floorplan on the Opteron 6172: (a) load balance, (b) task-type breakdown of total work and stall cycles, (c) L1 data cache miss-ratio, (d) L2 data cache miss-ratio.
