
Final thesis

On-chip Pipelined Parallel Mergesort on the Intel Single-Chip Cloud Computer

by

Kenan Avdić

LIU-IDA/LITH-EX-A–14/012–SE

October 18, 2014

Supervisor: Nicolas Melot, Christoph Kessler
Examiner: Christoph Kessler

Abstract

With the advent of mass-market consumer multicore processors, the growing trend in the consumer off-the-shelf general-purpose processor industry has moved away from increasing clock frequency as the classical approach to higher performance. This is commonly attributed to the well-known problems of power consumption and heat dissipation at high frequencies and voltages.

This paradigm shift has prompted research into a relatively new field of "many-core" processors, such as the Intel Single-Chip Cloud Computer. The SCC is a concept vehicle, an experimental homogeneous architecture employing 48 IA32 cores interconnected by a high-speed communication network.

As similar multiprocessor systems, such as the Cell Broadband Engine, demonstrate a significantly higher aggregate bandwidth in the interconnect network than in memory, we examine the viability of a pipelined approach to sorting on the Intel SCC. By tailoring an algorithm to the architecture, we investigate whether this is also the case with the SCC and whether employing a pipelining technique alleviates the classical memory bottleneck problem or provides any performance benefits.

For this purpose, we employ and combine different classic algorithms, most significantly parallel mergesort and samplesort.

Contents

1 Introduction
  1.1 Background
  1.2 Previous work
  1.3 Contributions of this thesis
  1.4 Organisation of the thesis
  1.5 Publications
2 The Intel SCC
3 Preliminary Investigation
  3.1 Main Memory
  3.2 Mesh Interconnect
  3.3 Conclusions
4 Mergesort Algorithm
  4.1 Simple approach
    4.1.1 Algorithm
    4.1.2 Experimental Evaluation
  4.2 Pipelined mergesort
    4.2.1 Design
    4.2.2 Algorithm
    4.2.3 Experimental Evaluation
5 Conclusions and Future Work
A Code Listing
  A.1 mem_sat_test.c
  A.2 mpb_trans.c
  A.3 priv_mem.c
  A.4 pipelined_merge.h
  A.5 pipelined_merge.c

1 Introduction

1.1 Background

The increasingly difficult problems of power consumption and heat dissipation have today all but eliminated the classic means of improving processor performance: increasing its frequency. Instead, to increase performance, technology has moved towards adding more cores to the chip. In combination with redundant processing units and multiple pipelines, this allows varying degrees of support for thread-level parallelism. In turn, software development in general is being forced to adapt to a parallel paradigm in all areas: desktop, entertainment and, recently, even embedded applications.

The transition towards hetero- and homogeneous multi- and many-core architectures is by no means a simple one. Efficient and effective utilisation of chip resources, which requires parallelisation, becomes more difficult.

The development of new hardware, such as processors and memories, has until recently largely been following the well-known Moore's law. Off-chip memory speeds, however, are lagging behind. As these memories are, in relative terms, orders of magnitude slower than on-chip memory, main memory access becomes prominent as one of the major causes of processor stalls. This bottleneck effect is especially pronounced in memory-intensive operations, such as sorting. In order to lessen the impact of the high latencies of main memory, program behaviour can be altered so as to reduce main memory access or avoid accessing main memory altogether. By employing on-chip pipelining, storing intermediate results of sub-tasks in memory can be avoided. These intermediate results can instead be immediately forwarded to the next processing unit. In addition, further performance improvement can be achieved by "parallelising" memory access to either main memory or buffers by making it concurrent with computation. This can be achieved e.g. through asynchronous memory transfers (Direct Memory Access) combined with multi-buffering.

In this thesis, we approach sorting on the Intel Single-Chip Cloud Computer 48-core concept vehicle as an algorithm engineering problem. The implementation of such an algorithm involves many variables, most significantly load balancing and memory access and communication patterns. A sorting algorithm shares similar requirements with many practical applications, such as image processing, which makes solving such a problem all the more relevant. As a pipelined variant of parallel mergesort [1] has been shown to achieve higher performance on other architectures [2], we focus primarily on this algorithm, but also look at parallel variations of samplesort [3] [4].

Parallel sorting algorithms have been investigated for many years, on many different platforms. Mergesort in particular originated as an external sorting algorithm and combined well with the sequential access requirements of early tape drives. Today, tapes have been replaced by disks or slower off-chip memory, but the sequential nature of mergesort is still highly beneficial due to its good synergy with the memory hierarchies available in almost all hardware and the locality effects of such memory accesses.

The mergesort algorithm operates recursively using a divide-and-conquer paradigm. The array to be sorted is split recursively into smaller chunks, until the chunk size is one. The chunks are then merged pairwise in the correct order, until the sequence again has the complete starting length (Fig. 1.1).

Figure 1.1: The mergesort algorithm [5].
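As a minimal illustration of the split and merge steps just described, the following C sketch implements sequential mergesort on an integer array (the implementation actually used on the SCC, sequential_merge, is listed in appendix A.5):

#include <string.h>

/* Merge the two sorted halves a[lo..mid) and a[mid..hi) into a[lo..hi),
 * using tmp as scratch space. */
static void merge(int *a, int *tmp, int lo, int mid, int hi)
{
    int i = lo, j = mid, k = lo;
    while (i < mid && j < hi)
        tmp[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
    while (i < mid) tmp[k++] = a[i++];
    while (j < hi)  tmp[k++] = a[j++];
    memcpy(a + lo, tmp + lo, (size_t)(hi - lo) * sizeof(int));
}

/* Split recursively until the chunk size is one, then merge pairwise upwards. */
static void mergesort_rec(int *a, int *tmp, int lo, int hi)
{
    if (hi - lo < 2)
        return;                       /* a single element is already sorted */
    int mid = lo + (hi - lo) / 2;
    mergesort_rec(a, tmp, lo, mid);   /* sort the left chunk */
    mergesort_rec(a, tmp, mid, hi);   /* sort the right chunk */
    merge(a, tmp, lo, mid, hi);       /* merge the two sorted chunks */
}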

The split operation has negligible cost and is considered trivial. The merge tasks are independent of each other and can be performed separately. This task independence is a natural recursive decomposition of tasks and allows for their concurrent execution on different processing units, resulting in a parallel mergesort algorithm. The splitting of the sequence results in a binary tree, the depth of which can be used as a variable for adjusting the task-parallel granularity. That is, tasks are assigned to processing units down to a certain tree level, after which each subtree is sorted locally, i.e. on the mapped processing unit. These lowest-level tasks are thus executed sequentially.

The obvious method of transferring sorted subsequences between the tasks is for the tasks to write the results into memory, where they are read by the processing unit that is assigned the next task. This is, however, not always necessary. Subsequent tasks do not need to wait until the previous task is completed: as soon as a task starts outputting a sorted sequence, it can be fed directly into the next. This is pipelined parallel mergesort. In general, memory access cost is traded for a higher communication cost instead. Such an algorithm is also significantly harder to optimise, as there are many interdependent variables to consider.

1.2 Previous work

No previous work exists on algorithm performance on the Intel Single-Chip Cloud Computer; however, a SIMD-enabled and/or pipelined approach has been shown to be very effective for sorting on the Cell Broadband Engine processor.

The Cell is a heterogeneous PowerPC-based architecture that consists of a single general-purpose core combined with 8 streaming coprocessors [6]. The main core, the Power Processing Element (PPE), is a standard 64-bit in-order dual-issue PowerPC core that supports two-way simultaneous multithreading (SMT) and Single-Instruction Multiple-Data (SIMD) instructions (AltiVec vector instructions). Being a general-purpose core, the PPE runs the operating system, but its main task is controlling the 8 coprocessors, the Synergistic Processing Elements (SPE). The SPEs, in turn, are each comprised of a Synergistic Processing Unit (SPU) and a Memory Flow Controller unit (MFC). The SPU is an in-order, dual-issue processing unit. It contains a large 128-entry 128-bit register file, supports integer and floating-point operations and is SIMD-capable, or rather its processor intrinsics consist of only SIMD instructions. The SPU has no direct access to system memory. Instead, it uses a local store of 256 KiB for both programs and data. The MFC is responsible for translating addresses between the SPUs and the system and for performing DMA transfers to the local stores.

At 3.2 GHz clock speed, the PPE theoretically delivers 25.6 GFLOPS (billion floating-point operations per second) using single-precision operations, while each SPE can reach 25.6 GFLOPS. The PPE, the SPEs, system memory and peripheral input-output interfaces on the Cell communicate via a high-speed bus called the Element Interconnect Bus (EIB). Typically, separate programs are compiled for the PPE and the SPEs. The PPE controls the SPEs, initialising and running small programs there. DMA transfers can be initiated by either the PPE or the SPEs.

Regarding sorting work on the Cell processor, advances in GPGPU programming [7] were recently considered and applied by Inoue et al. [8]. In their work, the authors follow the conclusions made by Furtak et al. [9] on the benefits of exploiting available SIMD streaming instructions and examine the SIMD capabilities of the Cell, attempting to exploit them in a similar way as previously done on GPUs [7] [10] [11]. The result is Aligned-Access sort, or AA-sort, which is a combination of an improved SIMD-optimised combsort [12], used in-core, and the odd-even merge algorithm [13], used out-of-core, both implemented with SIMD instructions. The relative speedup achieved by AA-sort is 7.87x and 3.33x for the two constituent algorithms over the same scalar implementation. The algorithm achieves a parallel speedup of 12.2 with 16 cores when sorting 32-bit integers.

Gedik, Bordawekar and Yu identify similar Cell-specific requirements of sorting algorithms: SIMD optimisation of the SPE code, memory transfer optimisation and effective utilisation of the EIB, but substitute the odd-even merge algorithm above with two variations of bitonic sort [14]. An SPE-local sort and two different variations of bitonic sort, a distributed in-core and a distributed out-of-core sort, are produced. The distributed in-core sort uses the local sort algorithm and cross-SPE transfers to internally merge a number of elements up to a size determined by the number of participating SPEs. For larger sequences, the distributed out-of-core sort is used, which utilises the in-core algorithm in phases to achieve the final sorted result. The achieved speedup sorting floats for the in-core and out-of-core sorts, over an Intel Xeon at 3.2 GHz, is 21x and 4x respectively.

By employing on-chip pipelining on the Cell, Hultén et al. [2] [15] improve further upon these results and achieve an additional speedup of 70% on the IBM QS20 and 143% on the PlayStation 3 over the AA-sort implementation. This is accomplished by minimising main memory access through on-chip pipelining and asynchronous multi-buffered DMA transfers. A pipelined on-chip version of the parallel mergesort algorithm is applied using binary tree task partitioning and subsequently mapped to the SPEs. The task mapping is optimised by expressing it as an integer linear programming problem and solving it with an ILP solver.

Scarpazza and Braudaway [16] examine text indexing on the Cell, adapting this specific workload to its hardware. The solution provided affords a 4x performance advantage over a non-SIMD reference implementation running on all four cores of a quad-core Intel Q6600 processor.

Haid et al. leverage Kahn process networks [17] to generalise streaming applications in general [18] and on the Cell specifically [19], by executing their model using protothreads [20] (for parallelism) and windowed FIFOs (for communication). The parallel speedup achieved here is nearly seven when using seven processors on the PlayStation 3. This is especially interesting due to the generic nature of a KPN application compared to the otherwise required architecture-specific code.

1.3 Contributions of this thesis

The most significant contribution of this thesis is the design and implementation of an on-chip pipelined parallel mergesort algorithm tailored to the unorthodox hardware of the Intel Single-Chip Cloud Computer. Building on the known work mentioned in the previous section, we attempt to achieve similar results on the SCC as on the Cell [2] [15]. Due to the lack of SIMD instructions on the SCC hardware, however, no optimisation in that direction is possible, but some other features of the SCC are shown to benefit from on-chip pipelining.

As there is no previous work on sorting on the SCC, an investigation of the memory and mesh interconnect capabilities is performed first. In addition, following the preliminary investigation, a simple naïve implementation is briefly handled and subsequently used for comparison with the final pipelined algorithm.

1.4 Organisation of the thesis

The remainder of this thesis is organised as follows. Chapter 2 gives a relatively high-level overview of the Intel SCC architecture, with the subsequent chapters each adding more detail to its constituent parts as necessary. Chapter 3 deals with the preliminary investigation of the architecture details that are identified as possibly impacting the final algorithm design.

Chapter 4 describes the theory behind the mergesort algorithm, a naïve parallel implementation of such an algorithm on the SCC, as well as our final design, implementation and results of the pipelined parallel mergesort algorithm. Chapter 5 offers our conclusions on the results from chapter 4, and future work.

1.5 Publications

Parts of this work have already been published in the following, in chronological order.

• Parallel sorting on Intel Single-Chip Cloud computer [5].
• Investigation of Main Memory Bandwidth on Intel Single-Chip Cloud Computer [21].
• Pipelined Parallel Sorting on the Intel SCC [22].
• Engineering parallel sorting for the Intel SCC [23].

2 The Intel SCC

Figure 2.1: Intel SCC Architecture Top View [24].

The Intel Single-Chip Cloud Computer [25] [24] is a chip multiprocessor. It is comprised of 24 tiles arranged in a 6x4 rectangular grid pattern. The tiles are connected by an on-chip two-dimensional mesh interconnection network. Each of the 24 tiles contains a pair of second-generation Intel Pentium IA32 cores (P54C), each in turn with its own L1 and L2 cache. The L1 cache is 32 KiB, split into a 16 KiB data and a 16 KiB instruction cache. The L2 cache is unified and 256 KiB in size. These caches are write-back, while the L1 can be configured as write-through.

The two cores on a tile are joined by a mesh interface unit (MIU) (Fig. 2.2) that has several responsibilities, but whose main task is to provide communication between the on-tile resources and the on-tile mesh interface, the router. In addition to the two L2 caches and the mesh router, a 16 KiB message-passing buffer, the MPB, is attached to the MIU. With 24 tiles, the total available mesh memory is thus 384 KiB. Since the IA32 cores on the SCC use local addresses and are not aware of the global chip configuration, the MIU translates core-local addresses using a look-up table (LUT) into non-local accesses, e.g. to a router, an MPB, etc. The MIU is also responsible for the hardware configuration of the cores, using the tile configuration registers.

Figure 2.2: An Intel SCC Tile [5].

The architecture supports a special type of data to facilitate message passing, new to the P54C Pentium cores: MPBT. This data type bypasses the L2 cache entirely and is only cached in L1. In addition, each line in the L1 cache is extended with a flag which marks whether the line in question holds MPBT data. The IA32 instruction set is further extended with an instruction (CL1INVMB) that invalidates all MPBT-marked data in L1.

Four DDR3 memory controllers are attached evenly to the routers on the two shorter sides of the mesh rectangle. Each memory controller supports DDR3-800 DRAM, up to 16 GB per channel, allowing for a total memory capacity of 64 GB. Six tiles are logically grouped into each quadrant, and each quadrant uses the closest memory controller. Memory comes in two variants: core-private memory and shared memory. Each core has a certain amount of private memory, which is a reserved area within main memory assigned to that core only. This memory is cached in all available caches. The shared memory, on the other hand, is evenly distributed over the four main memory controllers and is either only cached in the L1 cache (using the aforementioned MPBT memory type) or not cached at all.

The SCC provides voltage and clock control with a very high degree of granularity and customisation. The voltage regulator controller (VRC) allows for voltage adjustment in any of the 6 voltage islands (dashed regions in Figure 2.3) individually, or in the entire mesh collectively. The voltage settings can be altered from any core, allowing full application control of the cores' power state, or from the system interface controller (SIF). The SIF is the interface between the mesh and the external controller located on the system board.

Even more granularity is allowed in clock frequency adjustment, as the SCC can control each tile separately. The mesh and its routers, however, all share a single frequency. Each tile uses the mesh clock as input, with a configurable clock divider to arrive at a local clock. The mesh itself can be considered to reside on its own frequency island.

Figure 2.3: SCC Voltage and clocking islands [24].

The SCC can be programmed directly, in so-called baremetal mode, or an operating system can be loaded onto each core, which subsequently runs programs. A version of Linux called SCC Linux is provided for the latter mode. A set of management tools called sccKit is used for externally controlling the SCC via the SIF. These tools can be used to configure and manage the SCC, providing facilities to, e.g., power-cycle, reset and reboot the SCC hardware. SccKit is also used for starting the SCC in one of the preset frequency profiles. The available frequency profiles are listed in Table 2.1. SCC Linux is available as modified source code for recompilation if kernel modification is necessary. Programs for SCC Linux are compiled using standard compilers provided by Intel, such as gcc or icc.

Tile (MHz)   Mesh (MHz)   Memory (MHz)
533          800          800
800          800          800
1066         800          800
800          1600         1066

Table 2.1: Available frequency profiles using Intel sccKit

As previously mentioned, an MPI-like API library called RCCE (pronounced "rocky") [26] exists for the SCC. The library provides three API interfaces: two for message-passing support (a basic and a gory interface) and one for power management. The basic message-passing interface is a simple interface with most implementation details (such as synchronisation) hidden from the programmer. The gory interface exposes more functions and allows for more power and flexibility in implementations.

The programs described in this work are cross-compiled on a management console following the Intel SCC Programmer's Guide [27], and subsequently deployed onto the cores for execution and testing. The gory interface is used in all algorithm implementations. The input/output control towards the processing units is handled over SSH, more specifically with pssh.

3 Preliminary Investigation

There are several issues to be considered for algorithm design and implementation on the Intel SCC.

First, taking a closer look at the multi-core processor and applying classical multiprocessing paradigms, we see that it bears a certain resemblance to a Non-Uniform Memory Access system: there is an interconnect network, its processing units vary in distance to their respective memory controllers, and no cache coherence is provided. Additionally, it is programmed using an SPMD (Single Program, Multiple Data) paradigm and there is an MPI-like library that provides collective communication. These variations are very likely to have an effect on the achieved results, and must be considered. The SCC is flexible in this regard, as the main memory address translation that is performed in hardware near the processing units can be configured using the cores' lookup tables. The amount of available memory, for example, can be changed by modifying this table.

Second, we look at the availability of special SIMD or vector instructions. Unfortunately, no such instructions are available on the Pentium P54C cores. The first Pentium core to feature such instructions is the P55C (Pentium MMX).

Third, we consider the capacity and latency of the interconnection mesh and memory. Intel specifies its bus width as 16 B of data plus 2 B of side band. With a clock of 1600 MHz, the mesh should thus be capable of a throughput of 3052 MiB, or 2.98 GiB, per second, with a specified latency of four cycles, including link traversal [24].

3.1 Main Memory

The memory hierarchy on the SCC, from the point of view of a single tile and core, is not altogether different from that of a uniprocessor system. As previously mentioned, each tile contains two cores, where each core has individual L1 and L2 caches. The L1 caches are 16 KiB instruction and 16 KiB data each, while the L2 caches are 256 KiB unified. Each tile has a local memory area intended as a buffer for messaging, the MPB. This buffer is 16 KiB per tile, which by default is assigned one half per core, so that each core has access to 8 KiB of MPB. Since the SCC consists of 24 tiles, there is a total of 384 KiB of MPB memory.

There are four main memory interface controllers (MICs), attached to the "east" and "west" corner tiles of the 6-by-4 mesh. The controllers each support a maximum of 16 GB of memory, allowing for a total of 64 GB of main memory. The supported memory type is DDR3-800. In the default configuration, this memory is logically divided in a quadrant-wise fashion among the cores on the tiles belonging to each quadrant. Each core in a given quadrant of the SCC is assigned a certain amount of exclusive (private) memory, served by the quadrant-local MIC. This amount naturally depends on the amount of main memory installed, as well as on configuration parameters in the cores' lookup tables (LUTs).

A lookup table is a set of configuration registers used for memory address translation from core addresses to system addresses. Each core has a LUT, and each LUT contains 256 entries. On an L2 cache miss, the top 8 bits of the core physical address are used as an index into the LUT, which provides 22 bits of translation information for these 8 bits. The remaining 24 bits of the core address are appended to the address-extension bits of the LUT entry, resulting in a system address of 34 bits. Most significantly, the LUT entry also contains a destination ID for the mesh router to which the translated system address is to be forwarded. By configuring each core's LUT with a certain exclusive address range and a specific router (where the memory controller is located), cores are provided with core-private memories. This is the default configuration of the LUTs.
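To illustrate this translation, here is a minimal sketch in C; the lut_entry layout (a 10-bit address extension plus routing fields) is an assumption for illustration, not the exact SCC register format:

#include <stdint.h>

/* Hypothetical LUT entry layout: an address extension plus routing
 * information (destination router and sub-destination). */
struct lut_entry {
    uint16_t addr_ext;   /* 10-bit system-address extension (assumed width) */
    uint8_t  dest_id;    /* destination router (tile) on the mesh */
    uint8_t  sub_dest;   /* sub-destination at that router (core, MPB, MC, ...) */
};

/* Translate a 32-bit core physical address into a 34-bit system address:
 * the top 8 bits select one of the 256 LUT entries, whose address extension
 * replaces them; the remaining 24 bits are appended unchanged. */
static uint64_t lut_translate(uint32_t core_addr, const struct lut_entry lut[256])
{
    const struct lut_entry *e = &lut[core_addr >> 24];        /* top 8 bits */
    uint64_t system_addr = ((uint64_t)(e->addr_ext & 0x3FF) << 24)
                         | (core_addr & 0x00FFFFFFu);          /* low 24 bits */
    /* e->dest_id and e->sub_dest would steer the resulting request on the mesh. */
    return system_addr;
}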

In addition to the aforementioned private memory, a certain amount of the total system memory is reserved as shared memory. This memory can be indexed by any core (i.e. the cores have overlapping LUT addresses) and is evenly allocated from the memory attached to the four memory controllers.

The SCC provides no cache coherence mechanisms. In the case of private memory, no cache coherence mechanism is even necessary, as the memory is exclusively mapped to a single core. In this case, both the L1 and L2 caches are active. The shared memory, on the other hand, is not cached in L2. Shared memory is either entirely uncached, with all reads going directly to memory, or cached only in L1 and marked as MPBT memory. As previously mentioned, an instruction was added to clear memory flagged as MPBT from the L1 cache. Furthermore, the P54C already has the capability to reset the L1 cache completely. Presumably, the shared memory is not cached in L2 by default because the P54C is not equipped with any means of clearing or resetting the L2 cache. Activating L2 caching in combination with shared memory makes an implementation of a cache coherence mechanism a requirement. Ultimately, any cache coherence must be handled by the programmer, e.g. by manual cache flushing or a certain pattern of access. The caches are preconfigured as write-back, while the L1 cache can also be configured as write-through.

As memory speeds often have a large impact on the performance of sorting algorithms, we begin by examining the memory performance [21]. This is measured as bandwidth, or bandwidth per core where more than one core is active. We examine variations in bandwidth with an increasing number of cores, as well as with different memory access types: read, write or combined. Since the SCC is capable of clock speed modulation, the effect of the core clock on memory bandwidth is also examined. In these tests, memory and mesh clock speeds are kept constant at 800 MHz, while the core clocks are tested at 533 MHz and 800 MHz respectively.

In order to consider the impact of the cache, we look at two different memory access strides. Since the cache line width is 32 bytes, reading and writing to memory is performed in two different manners: with a stride of 4 and a stride of 32 bytes. A stride of 4 bytes is selected for convenience as it is the size of an integer on this platform, while a stride of 32 bytes is selected as it is the size of a cache line (8 integers). Special care is taken to allocate memory with 32-byte alignment, in order to ascertain that the correct part of the cache line is read or written. The mixed pattern denotes a combination of these two stride patterns. A pseudorandom access pattern is also used, to attempt to circumvent any locality optimisations inherent in the hardware, whether cache effects or memory bank optimisation. This pseudorandom pattern is provided through the function [28] pi(j) = (a · j) mod S for the index j, a large, odd constant a, and S a power of two (see the code example in appendix A.1). The random access pattern also applies the previous strided principle to the index j.
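A minimal sketch of such a pseudorandom strided read is shown below; the constant a is an arbitrary choice for illustration, and the actual benchmark is the one listed in appendix A.1:

#include <stddef.h>

/* Pseudorandom index function pi(j) = (a * j) mod S: with S a power of two
 * and a odd, pi is a permutation of 0..S-1. */
#define A_CONST 0x9E3779B1u   /* arbitrary large odd constant (assumption) */

static unsigned read_random_stride(const int *data, size_t S, size_t stride_ints)
{
    unsigned sum = 0;
    for (size_t j = 0; j < S / stride_ints; j++) {
        size_t idx = (A_CONST * j) & (S - 1);   /* (a*j) mod S, S a power of two */
        idx &= ~(stride_ints - 1);              /* align the index to the chosen stride */
        sum += (unsigned)data[idx];             /* touch one int per stride */
    }
    return sum;   /* returned so that the reads are not optimised away */
}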

In addition to access patterns, we look at read, write and combined access types separately, where combined access refers to simultaneous reading and writing, as well as at the scaling with the number of participating cores, between 3 and 12. Twelve cores is the maximum default private-memory setup per controller.

The experiment is performed using a fixed data set of 200 MiB per participating core. Time is measured from the point when the cores have started the program, throughout the memory operation and until it is finished. This is repeated for 100 attempts, after which the average, standard deviation, minimum and maximum values are collected. The bandwidth per core and the global aggregate bandwidth are measured; both measurements record the number of cores that are active during the measurement. This was achieved by using variations of the code in appendix A.1.

Figure 3.1 shows the total measured read bandwidth presented as a function of the number of cores. We see no surprises here: the 4-byte/1-int stride access achieves the highest throughput for each of the two clock speeds respectively. The lowest performance comes from the random 8-int read pattern, as this type of access is designed to circumvent caches. The same can be said about the results in the diagram for write access in Figure 3.2. The highest total throughput of the 12-core aggregate, 120 MiB per second, is achieved by sequential int writes, which is an excellent example of the effect of the cache. Recall that the L2 cache is write-back on the SCC; it follows that the pattern that results in the fewest cache evictions will achieve the highest performance here. The only patterns that repeatedly write to the same cache line are the 1-int-per-write ones, and they naturally have the highest performance. We see that 1-int random and sequential access have the same performance, since they result in the same number of cache evictions. The weakest performance is shown by 8-int stride random accesses, which not only evict a cache line each time, but are also constructed to avoid any memory optimisations for sequential reading that the memory controller affords. This access pattern is likely to be very close to the lowest possible write performance achievable on the SCC. The same results are presented per core in Figures 3.3 and 3.4.

Since no bandwidth drop with an increasing number of cores is evident, and the aggregate memory bandwidth shown previously rises linearly with the number of cores, a single memory controller cannot be saturated using a maximum of 12 cores. The slight drop in write bandwidth in Fig. 3.4 is attributed to the L1 cache, which is configured as no-write-allocate. This strategy causes a cache line not to be read into the cache on a write miss, i.e. when exclusively writing data, it is likely that the L1 is completely bypassed.

Figure 3.1: Global main memory read bandwidth at 533 and 800 MHz [21].

Figure 3.2: Global main memory write bandwidth at 533 and 800 MHz [21].

Figure 3.3: Strided read memory bandwidth per core at 533 and 800 MHz [21].

Figure 3.4: Strided write memory bandwidth per core at 533 and 800 MHz [21].

Finally, in Figures 3.5 and 3.6, we see that memory locality is a consideration, even for random access. Despite the high performance of the memory controllers, they struggle to serve highly irregular access patterns and perform better with sequential access.

Figure 3.5: Random pattern read memory bandwidth per core at 533 and 800 MHz [21].

Figure 3.6: Random pattern write memory bandwidth per core at 533 and 800 MHz [21].

3.2 Mesh Interconnect

The speed of the mesh and the message-passing buffers is another issue that influences the details of the construction of our algorithm.

The two-dimensional mesh network consists of 24 packet-switched routers, one per tile (Fig. 3.7), organised in the aforementioned 6x4 configuration. The mesh has its own power supply and clock source, in order to improve support for dynamic power management. The flow control in the mesh is credit-based. Each core is connected to the router on its tile through the mesh interface unit, which is responsible for, among other things, packetising/de-packetising data and translating local addresses into system addresses. The MIU has a buffer, the MPB, which is 16 KiB and divided in half, one half per core. The MIU communicates directly with the tile router. Each router has eight credits to give per port and can send a packet to another router only when it has a credit from that router. Credits are returned to the sender once the packet has moved on. Error checking is performed primarily through parity. No error correction is performed.

We are interested in the performance of the mesh, the routers and the mesh interface unit under a high load from the processors [5]. This is evaluated using a test program (a variation of the listing in appendix A.2). The evaluation method consists of investigating latency and throughput by having a single core (core 0) send a specified amount of data to every other core not sharing the same tile, while monitoring the time taken to perform the transfer. The variables of the test are the core distance in hops (Fig. 3.8) and the size of the transferred data. Each test is performed 1000 times and the average is taken as a sample.
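A minimal sketch of such a timed transfer using the basic RCCE interface is shown below (the thesis implementations use the gory interface, and the full benchmark is the listing in appendix A.2; the partner rank, buffer size and repetition count here are illustrative):

#include <stdio.h>
#include <stdlib.h>
#include "RCCE.h"

#define REPS 1000

int RCCE_APP(int argc, char **argv)
{
    RCCE_init(&argc, &argv);

    int me    = RCCE_ue();                /* this core's rank */
    int peer  = atoi(argv[1]);            /* partner core, passed on the command line */
    size_t sz = 32 * 1024;                /* e.g. 8Ki integers = 32 KiB */
    char *buf = malloc(sz);

    double start = RCCE_wtime();
    for (int i = 0; i < REPS; i++) {
        if (me == 0)
            RCCE_send(buf, sz, peer);     /* core 0 pushes the block through the MPB */
        else if (me == peer)
            RCCE_recv(buf, sz, 0);        /* the partner core pulls it */
    }
    double elapsed = RCCE_wtime() - start;

    if (me == 0)
        printf("average transfer time: %f ms\n", 1000.0 * elapsed / REPS);

    free(buf);
    RCCE_finalize();
    return 0;
}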

We do not test data sets larger than the size of the L2 cache. Such sizes would result in frequent main memory access, which in turn generates extra mesh traffic and could naturally introduce undesirable variability into our test. By ensuring that data is exchanged from within the L2 cache only, we avoid any impact on timing that main memory access would have.

Figure 3.7: SCC Tile Level Diagram [24].

The results of the first round of tests are displayed in Figures 3.9 through 3.14, for data sizes of 2, 4, 8, 16, 32 and 64 kibi-integers, or 8, 16, 32, 64, 128 and 256 kibibytes respectively.

First, in Fig. 3.14 we see that the timings for 64 Ki integers are highly inconsistent. This is attributed to memory access. A data set of this size is highly unlikely to fit in L2, even if a single program is running on the processing unit. Other processes, along with the operating system, are assumed to be intruding on the utilisation of L2. It is evident that there is some private memory access in this case, which is influencing the transfer timings.

Second, for data sets of 2-32 kibi-integers (8-128 KiB), we see that the timings roughly double with the doubling of the data size. This indicates again, as in the case of main memory, that the processing units are unable to saturate the mesh. Another representation of the same data is given in Fig. 3.15, where the same numbers can be seen as a function of hop distance. The marginal timing increase is more prominent in this figure, along with the cache limit at 256 KiB.

Figure 3.8: Four different mappings of core pairs with increasing distance (3, 5, 6 and 8 hops between cores) [5].

Figure 3.9: Average transfer time for 2Ki integers/8 KiB

Figure 3.10: Average transfer time for 4Ki integers/16 KiB

Figure 3.11: Average transfer time for 8Ki integers/32 KiB

Figure 3.12: Average transfer time for 16Ki integers/64 KiB

Figure 3.13: Average transfer time for 32Ki integers/128 KiB

Figure 3.14: Average transfer time for 64Ki integers/256 KiB

Finally, a second round of testing is performed. This is done in order to better ascertain the availability of the L2 cache, i.e. to find out the amount of data that can safely be cached before memory access starts to have a significant impact on performance. For this, data sizes of 40, 48 and 56 Ki-ints are selected (160, 192 and 224 KiB respectively). The results of these additional tests can be seen in Figures 3.16 through 3.18.

From the above we see that main memory access interference begins to make itself apparent at a data size of 192 KiB. 160 KiB, in comparison with the results for lower data sizes, looks relatively unaffected. We thus see that, ideally, to avoid added memory access in mesh communication when designing and programming for pipelining (with the current configuration of hardware and software), data sets of 160 KiB should preferably be used, and definitely no more than 192 KiB.

Figure 3.15: Average time to transfer 64, 128 and 256 KiB as a function of the distance between cores [5].

Figure 3.16: Average transfer time for 40Ki integers/160 KiB

Figure 3.17: Average transfer time for 48Ki integers/192 KiB

Figure 3.18: Average transfer time for 56Ki integers/224 KiB

3.3 Conclusions

Tests were performed on the memory and the mesh in order to obtain results relevant to our tailored algorithm design. The following are important considerations to be made during this process:

1. The P54C cores, albeit extended with new features and clocked much higher than their original stock clock, are not on par performance-wise with the rest of the hardware in the SCC. The DDR3 controllers and the mesh are extremely fast and can only be taxed by the P54C cores using a heavy write load. This is not entirely unexpected, as die area is limited and many cores are provided. Our tests show that a single memory controller remains at nearly its maximum performance even when a full quadrant of the SCC is reading from it. Furthermore, a single mesh link cannot be significantly slowed down by communication between any two cores, as long as main memory access is avoided, i.e. for any type of pipelining considerations.

2. Any memory access other than to cache will result in added mesh communication, since the memory is accessed through the mesh itself. The preferred data size for local buffers with pipelining is thus 160 KiB, with no more than 192 KiB used at any time. Ideally, these parameters should be made configurable.

3. Despite the high overall performance of the memory, write bandwidth is comparatively low, and the mesh interconnect is even faster. Combined with the low performance of the processing units, this makes the SCC a good candidate for pipelined sorting.

4 Mergesort Algorithm

4.1 Simple approach

As an initial implementation, we begin by constructing a naïve parallel mergesort algorithm. Each level of the mergesort tree is mapped to a set of cores. This simplification means that we may only use a number of cores that is a power of two, and at most only 32 of the 48 available cores are used. Furthermore, all of these 32 cores are only used in the first round; as the number of sequences to be sorted halves every round, so does the number of participating cores. With a large number of cores idle during the sorting, the efficiency of this algorithm should be extremely low.

4.1.1 Algorithm

The algorithm uses the cores' private memory to store integer blocks and uncached shared memory as a buffer to transfer these blocks between them. Since uncached shared memory is used, no cache coherence mechanism is required. The algorithm is initialised by selecting the number of integers N (the size of the data), the number of participating nodes P, and setting the number of active nodes P_a = P. In step 0, each node pregenerates two pseudorandom nondecreasing sequences of length N/(2P). These sequences simulate the output from the initial sequential round of merging.

The algorithm then enters a sequence of rounds, where each round consists of two phases, sorting and transfer. In the sorting phase, the active nodes in the current round merge two sequences into one of combined length N/P_a. The sorting phase is then completed and the algorithm proceeds to the transfer phase (Fig. 4.2). In the transfer phase, the number of active nodes is integer-divided by two (using a logical right shift) and the nodes that become inactive transfer their sorted sequences to the nodes remaining active. The transfer is performed using buffers in shared memory. During the transfer phase, flags are set in the communicating cores' MPBs for synchronisation.

The round is then completed; the active nodes (the nodes with a rank less than P_a) continue on to the next round while the inactive nodes become idle.

When the last round completes and the algorithm ends, the root node has merged the last two sequences into a single nondecreasing sequence of length N. Figure 4.1 provides an illustration of this simple algorithm.

Figure 4.1: Naïve Parallel Merge: each round, half of the cores become inactive after merging and transferring their assigned sequences [5].
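As a sketch of the round structure just described, the following C fragment assumes hypothetical helpers merge_local, send_block and recv_block that wrap the shared-memory transfer and MPB flag synchronisation; it is an illustration, not the actual implementation:

/* Hypothetical helpers wrapping the shared-memory transfer and MPB flag
 * synchronisation described above. */
void merge_local(int *block, unsigned n);    /* merge the two sorted halves of block[0..n) */
void send_block(int dest, const int *block, unsigned n);
void recv_block(int src, int *block, unsigned n);

/* Naive parallel mergesort round structure (sketch). P must be a power of two;
 * rank is this core's id in 0..P-1; block initially holds two sorted
 * subsequences of n_local/2 integers each. */
void naive_parallel_merge(int rank, int P, int *block, unsigned n_local)
{
    unsigned active = (unsigned)P;              /* P_a, the number of active nodes */

    while (active > 1) {
        /* Sorting phase: merge the two sorted halves of block[0..n_local). */
        merge_local(block, n_local);

        /* Transfer phase: the upper half of the active nodes send their result
         * to a partner in the lower half and become idle. */
        if ((unsigned)rank >= active / 2) {
            send_block(rank - (int)(active / 2), block, n_local);
            return;                             /* this node is now inactive */
        }
        recv_block(rank + (int)(active / 2), block + n_local, n_local);
        n_local *= 2;                           /* the next round merges twice the data */
        active >>= 1;                           /* logical right shift: P_a = P_a / 2 */
    }
    merge_local(block, n_local);                /* the root merges the last two sequences */
    /* rank 0 now holds the complete nondecreasing sequence of length N. */
}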

Three other variants of the above algorithm are implemented. They use the same basic algorithm, but alter it as follows.

Two of the variants rely exclusively on shared instead of private memory. The shared memory variants of the algorithm do not have a transfer phase; by relying directly on shared memory as storage for the input and output blocks, the transfer phase is avoided. That is, there is no copying of data between rounds; all that is required is a synchronisation for each subsequent round to begin. Two cores are assigned a common buffer in shared memory for their exclusive use, where one core is the sender and the other the receiver. Flags are set by the cores in their respective MPBs for synchronisation, i.e. to signal when they are allowed to read or write their assigned buffer. The shared memory mergesort algorithm is implemented in two variants: one cached and one uncached.

The uncached shared memory version uses no caches, but accesses memory directly. No cache coherence mechanism is provided or necessary.

In the cached shared memory version of the algorithm, the L1 and L2 caches are enabled for caching and an explicit cache flush is added in place of the transfer phase. As the SCC has no cache coherence, this is required to maintain main memory consistency for the next round of computation.

The final version uses the MPB as a buffer instead of shared memory, and relies on mesh communication to transfer the data between working cores. Note that this algorithm is still in no way pipelined; the memory blocks are simply transferred from the private memory range of one set of cores to the private memory of another set. This variant should nevertheless reduce the amount of memory access compared to the first version.

Figure 4.2: A transfer phase of the naïve algorithm variants [5].

4.1.2 Experimental Evaluation

The measurements are performed as follows. Each of the initially active nodes generates two pseudorandom nondecreasing integer sequences that are to be merged. Once the starting sequences are randomised, timing and then sorting starts. The sequences local to each core are sorted and the respective algorithm above is followed. When the root task on the root-rank processing unit completes, the timer is stopped and the resulting sequence is verified for correctness. Each measurement is performed in excess of 1000 runs, and the average of these is sampled. The results of the measurements are presented in Figures 4.3 through 4.8. One additional test is performed with constant values for comparison purposes (Fig. 4.9).

Figure 4.3: Merging time using 1 processor [5].

Figure 4.4: Merging time using 2 processors [5].

Figure 4.5: Merging time using 4 processors [5].

Figure 4.6: Merging time using 8 processors [5].

Figure 4.7: Merging time using 16 processors [5].

Figure 4.8: Merging time using 32 processors [5].

The results of the tests show the initial version of the algorithm having the weakest performance in all cases except the single-node one. This is unsurprising, as this version of the algorithm requires the most main memory access. Starting with the 32-node case in Figure 4.8, we see that, for the first algorithm, which uses private memory with shared memory as a buffer, the additional phase of copying to shared memory and back induces a performance penalty of over 60% compared to the same algorithm using mesh communication instead. Recall that the writing to and subsequent reading from shared memory between two rounds of the algorithm are replaced here by transferring the same data between two cores' private memories over the mesh. Very similar results are obtained for descending numbers of cores; the results in Figures 4.8, 4.7 and 4.6 for 32, 16 and 8 cores respectively are nearly the same. The inefficiency of the base algorithm is highly apparent in these, since there is no significant speedup in any of the variants between 8 and 32 cores, despite the quadrupling of the number of working cores. Furthermore, comparing any of the results to its single-core counterpart in Figure 4.3 reveals that there is actually no speedup at all. In the case of the shared memory variants in these three figures, we see that the uncached shared memory offers particularly low performance. Despite the complete lack of a transfer phase here, we still see almost as low performance as in the worst-case variant. Naturally, no cache is used here, so low performance is expected. The best results are achieved with the cached shared memory algorithm, which both takes advantage of caching and avoids extra copying.

Continuing in reverse order, we look at the results for the 4- and 2-node tests (Figures 4.5 and 4.4). We see that, as the number of utilised cores decreases from 8 to 2, the performance of the private memory version of the algorithm with shared memory buffers improves over the uncached shared memory one. This is attributed to the fact that these cases are less parallelised in that there are fewer rounds. As the number of rounds is equal to the base-2 logarithm of the number of nodes, each halving of the number of nodes reduces the number of rounds, and thereby the block transfer operations between rounds, by one.

Ultimately, the results depicted in the figure for a single processor show the best performance for all variants (Fig. 4.3). That is, our naïve attempt at parallelisation of the mergesort algorithm does not yield any advantage over the non-parallel version. Both private memory versions of the algorithm are the same in this special case, and hence perform the same. They perform better than the cached shared version as the memory required grows, since the private memory is always allocated on the closest memory controller. Naturally, uncached shared memory, lacking cache, is again significantly slower. The single-core results confirm our previous memory experiments with regard to private and shared memory speeds.

Figure 4.9: Merging time using 32 processors, using constant values [5].

4.2 Pipelined mergesort

The pipelined parallel mergesort algorithm is a version of the mergesort algorithm. It shares the same basic features as even the simple parallel mergesort described above, but optimises away as much of the memory access as possible, usually trading it for a communication cost. By pipelining the steps of the algorithm, much as a processor pipelines instructions, one constant stream of sorting can be executed, reading unsorted elements from an external location as input while writing their sorted sequence as output. Assuming, again, a tree mapping, the leaves of the tree read the unsorted sequences to be merged, merge a buffer's worth of elements and communicate the subsequence upward in the tree, until the stream reaches the root, which writes a fully sorted sequence. This continues until all the elements are consumed.

There are many variables in designing such an algorithm. A sorting tree depth must be selected that allows for the desired task granularity, but does not introduce additional resource strain. The granularity is typically more than a single task per processing unit. Task assignment onto the processing units of the underlying hardware must be done in a way that optimises their usage. Here, a trade-off must be made between the amount of memory access, communication and computation.

4.2.1 Design

Ordinary sequential merging has a linear computation cost relative to the input size. Due to this, we know that each full level of the merge tree has the same computation cost. Assuming that the root task must be assigned to a single core, one way of partitioning would assume a tree with a depth similar to the number of nodes. For the SCC in particular, the size of this tree would be infeasible, so the number of tasks must be reduced. Instead of a single large tree, we opt for several smaller ones. Since this introduces a second phase of merging, the number of trees must be a power of two to allow a balanced merge phase in the second phase. The number of trees should also divide the total core count evenly, in order to map efficiently onto the SCC's 48 cores. The locality of the memory controllers on the mesh should also be considered.

We opt for a forest of 8 trees with 6 levels each [29] [28], and the top-level view of the algorithm results in the following phases:

• A local mergesort phase, phase 0, is required to obtain the starting subsequences. The leaves of the 8 trees each read their assigned block of input elements to be sorted and merge them in their private memories. After this phase, the pipelined merge phase can begin.

• Phase 1 runs a pipelined parallel merge with 8 6-level trees. This results in 8 sorted subsequences.

• Phase 2 consists of a parallel sample sort algorithm. This is done in order to achieve a higher core utilisation ratio compared to a solution similar to phase 1.

Phase 2 is required to merge the 8 sorted subsequences produced by phase 1. If this phase were mapped to a parallel mergesort in the same manner, there would be a significant number of idle cores, reducing efficiency. Instead, we opt for a parallel sample sort and use all 48 cores even in the second phase.

The task mapping is modelled for the SCC using an integer linear programming (ILP) based method [29] [28]. The models allow for optimisation of either the aggregate overall hop distance between tasks, weighted by inter-task communication volumes, or the aggregate overall hop distance of tasks to their memory controller, weighted by the memory access volumes. The model balances the computational load, in addition to distributing leaf tasks across cores to reduce the running time of phase 0. The linear combination is controlled using weight parameters.

An arbitrary manual task map is also produced, the layer map [29]. The simple layer map is, as the name implies, based on tree levels. As we know that each tree level has the same computation cost, we map each level of a tree to a single core. With 12 cores and two 6-level trees, we have exactly one tree level per core. Since we also know from the previous experiments that the distance to memory has the biggest influence on memory access times, we place the first 6-level tree such that the root node (on level one) is on the core closest to the memory controller, with every subsequent tree level leading away from the MIC in a semi-circular fashion (see Figure 4.10). The lowest-level leaves are thus mapped onto the second-nearest core to the memory controller. The reverse is done with the second 6-level tree.

Figure 4.10: Per-level distribution of the layer map and pipeline data flow.

4.2.2 Algorithm

The inputs to the algorithm and program are:

• A task map file. This file contains the task mapping to the SCC's cores, in a per-quadrant fashion. This mapping is replicated internally relative to the local memory controller.

• A data file containing the integer elements to be sorted.

• The MPB buffer size to be used in communication.

Initially, the task map file is parsed in order to generate an internal representation of the task tree. For simplicity, the map file is represented as a 7-level tree, where the root is ignored, i.e. two 6-level trees. The recursive function generate_subtree is responsible for the task and tree generation, in addition to calculating offset sizes (into the unsorted input array) for the leaves. Each task is represented by the data structure in Listing 4.1. After the task tree is generated, each processing unit uses the mapping and the task tree to determine which tasks it is responsible for executing. These tasks are collected into a local task array.

Listing 4.1: The task data structure representation

struct task
{
    unsigned short id;
    unsigned short local_id;
    struct task *left_child;              /* tree structure */
    struct task *right_child;
    struct task *parent;
    unsigned short cpu_id;                /* the id of the cpu this task is running on */
    unsigned short tree_lvl;              /* the level of the tree the task is on */
    t_vcharp buf_start;                   /* pointer to the start of the buffer in the
                                           * MPB */
    unsigned short buf_sz;                /* the size of the data buffer in 32B lines
                                           * including the header */
    unsigned size;                        /* total number of integers that need to be
                                           * handled by this task */
    unsigned progress;                    /* progress of task, i.e. how much of size
                                           * has been completed. if equal to size
                                           * the task is finished. */
    void (*function)(struct task *task);  /* pointer to the function that will
                                           * run this task */
    leaf_props_t *leaf;                   /* leaf properties */
};
typedef struct task task_t;
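As an illustration of how the recursive generation might look, a minimal sketch is given below; the allocator, the global task counter and the parameterisation are assumptions, and the real generate_subtree additionally fills in the core mapping and the leaf input offsets from the task map file:

#include <stdlib.h>

static unsigned short next_task_id;   /* hypothetical global task counter */

/* Recursively build a binary task (sub)tree of the given depth.
 * size is the number of integers this subtree is responsible for;
 * each child handles half of it. */
static task_t *generate_subtree(task_t *parent, unsigned short level,
                                unsigned short max_level, unsigned size)
{
    task_t *t = calloc(1, sizeof(task_t));   /* zeroed: progress = 0, leaf = NULL */

    t->id       = next_task_id++;
    t->parent   = parent;
    t->tree_lvl = level;
    t->size     = size;

    if (level < max_level) {                 /* branch: two children, half the data each */
        t->left_child  = generate_subtree(t, level + 1, max_level, size / 2);
        t->right_child = generate_subtree(t, level + 1, max_level, size / 2);
    }                                        /* otherwise a leaf; its input offset is set elsewhere */
    return t;
}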

Based on the processor-local task array, the MPB buffer sizes are calculated and allocated (setup_buffers). Since the leaf nodes all read their input from main memory, they do not need local MPB buffers. Instead, they push sorted elements upward in the tree. The branches and the root, however, each have their respective MPB input buffer. The MPB is preallocated and used proportionally, based on local task weighting, as follows:

1. A task weight is assigned to each task, calculated from the level of the binary tree the given task is on. Each task has a weight score of half that of the task directly above it in the task tree, starting with 1 for the root node, i.e. w = 1/2^l, where w is the task weight for a given task and l = 0, ..., 5 its tree depth, starting with 0 for the root. For example, the root task has a score of 1, the branches immediately below the root have a score of 1/2, and so on. The task weight is proportional to the computation cost of a task. As it is simple to calculate, it is not saved in the task structure.

2. Each core gathers its core-local tasks and calculates the sum of their weights. The remaining steps are calculated on a core-local basis.

3. The MPB buffer, whose size is provided as an input variable to the program and is no more than 8128 bytes, is assigned to each task proportionally. The proportion of this buffer that a task receives is equal to the proportion of the task's weight to the core-local task weight sum. Given t = 1, ..., n local tasks on the node, B_tot as the constant total per-core buffer size and the weight w as in step 1, each task's buffer is calculated as follows:

   B_t = B_tot * w_t / (w_1 + w_2 + ... + w_n)

As an example, assume an MPB size of 4000 bytes. Assume further that the mapping is such that the current core executes 3 tasks: the root of the 6-level tree and its two immediate branches. The MPB buffer size assigned to the root task would then be 4000 · 1/(1 + 1/2 + 1/2) = 2000 bytes.
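A minimal sketch of this proportional split is shown below, assuming a hypothetical array holding the tree level of each core-local task (the actual computation is part of setup_buffers):

/* Proportional MPB split (sketch). levels[i] is the tree depth of the i-th
 * core-local task (0 for the root); b_tot is the per-core MPB budget in bytes. */
static void split_mpb(const unsigned short *levels, int n, unsigned b_tot,
                      unsigned *buf_bytes)
{
    double w[64], sum = 0.0;                      /* assumes at most 64 local tasks */

    for (int i = 0; i < n; i++) {
        w[i] = 1.0 / (double)(1u << levels[i]);   /* w = 1/2^l */
        sum += w[i];
    }
    for (int i = 0; i < n; i++)
        buf_bytes[i] = (unsigned)(b_tot * w[i] / sum);   /* B_t = B_tot * w_t / sum */
}

For the example above, levels = {0, 1, 1} and b_tot = 4000 yield buffer sizes of 2000, 1000 and 1000 bytes.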

Next, each node sets up its respective buffers in the MPB. An MPB memory descriptor data type is introduced for keeping track of an MPB buffer (Listing 4.2). This descriptor is used in a similar manner to a protocol header and contains metadata such as the progress of production/consumption of data, etc.

During this process, tasks are also assigned their corresponding task function. There are three types of tasks: root, branch and leaf tasks. Each of these types of tasks performs the same function, but with a different set of parameters. Therefore, a function is implemented for each: run_root, run_branch and run_leaf. The function type is stored as a function pointer in the task tree structure, so that it can be accessed from there directly.

Listing 4.2: The MPB memory descriptor

struct mpb_header
{
    unsigned long seq;            /* the sending task's counter, equal to the progress
                                   * of the task. incremented every time the buffer
                                   * is written to */
    unsigned long ack;            /* the receiving task's counter, set equal to
                                   * seq when the buffer has been received */
    unsigned short start_os;      /* the offset to the first valid integer
                                   * in the buffer (since some may have been consumed
                                   * already) */
    unsigned short int_ct;        /* number of valid (unconsumed) integers
                                   * currently in the data area */
    unsigned short src_task_id;   /* the source task, writes to this buffer */
    unsigned short dst_task_id;   /* the destination task, reads from this buffer */
} __attribute__((aligned(32)));
typedef struct mpb_header mpb_header_t;
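To illustrate how a branch task might consume its input through such a descriptor, here is a minimal sketch; mpb_read and mpb_ack are hypothetical helpers that copy data out of the MPB into cached private memory and update the ack counter, and the actual logic is in run_branch (appendix A.5):

/* Branch task input step (sketch): if the child task has produced new data in
 * this task's MPB buffer, copy it out and acknowledge it so the child may
 * overwrite the buffer with its next chunk. Returns the number of integers read. */
static int consume_input(volatile mpb_header_t *hdr, int *dst)
{
    if (hdr->seq == hdr->ack)
        return 0;                              /* nothing new has been produced yet */

    int n = hdr->int_ct;                       /* number of valid integers available */
    mpb_read(hdr, dst, hdr->start_os, n);      /* hypothetical: copy n ints to private memory */
    mpb_ack(hdr);                              /* hypothetical: set ack = seq, freeing the buffer */
    return n;                                  /* the caller merges these with the other child's data */
}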

Private and cache memory is allocated. The cores that hold leaf or root tasks allocate private memory for reading or writing as needed. The cores holding branch tasks allocate small amounts of private memory that are assumed to remain in cache and are to be used for work. The amount of cache memory allocated is the minimum required for merging the task at hand, depending on the MPB buffer size. Here, a constant is introduced to increase the amount of cache memory for possible performance tuning. This concludes the setup of the algorithm, and sorting can begin.

The phase 0 local merge is initially performed by each leaf task. This is executed by the function sequential_merge (see appendix A.5). An input file containing the elements to be sorted is read, at a certain offset determined by each leaf's position relative to the tree. The elements are copied to memory and a mergesort over the elements is done locally, recursively in place. Since leaves will continue performing a merge operation even in the next phase of the algorithm, the elements are sorted locally into two sorted subsequences, i.e. at the end of phase 0, each leaf has presorted its assigned input sequence into two sorted subsequences that it will merge further in phase 1. An additional N/2 elements of memory is allocated for each N elements for sorting performance. This is due to our implementation of the
