Final thesis
On-chip Pipelined Parallel Mergesort
on the Intel Single-Chip Cloud
Computer
by
Kenan Avdić
LIU-IDA/LITH-EX-A–14/012–SE
October 18, 2014
Supervisor: Nicolas Melot, Christoph Kessler
Examiner: Christoph Kessler
Abstract

With the advent of mass-market consumer multicore processors, the growing trend in the consumer off-the-shelf general purpose processor industry has moved away from increasing clock frequency as the classical approach for achieving higher performance. This is commonly attributed to the well-known problems of power consumption and heat dissipation with high frequencies and voltage.

This paradigm shift has prompted research into a relatively new field of "many-core" processors, such as the Intel Single-chip Cloud Computer. The SCC is a concept vehicle, an experimental homogeneous architecture employing 48 IA32 cores interconnected by a high-speed communication network.

As similar multiprocessor systems, such as the Cell Broadband Engine, demonstrate a significantly higher aggregate bandwidth in the interconnect network than in memory, we examine the viability of a pipelined approach to sorting on the Intel SCC. By tailoring an algorithm to the architecture, we investigate whether this is also the case with the SCC and whether employing a pipelining technique alleviates the classical memory bottleneck problem or provides any performance benefits.

For this purpose, we employ and combine different classic algorithms, most significantly parallel mergesort and samplesort.
Contents

1 Introduction
  1.1 Background
  1.2 Previous work
  1.3 Contributions of this thesis
  1.4 Organisation of the thesis
  1.5 Publications
2 The Intel SCC
3 Preliminary Investigation
  3.1 Main Memory
  3.2 Mesh Interconnect
  3.3 Conclusions
4 Mergesort Algorithm
  4.1 Simple approach
    4.1.1 Algorithm
    4.1.2 Experimental Evaluation
  4.2 Pipelined mergesort
    4.2.1 Design
    4.2.2 Algorithm
    4.2.3 Experimental Evaluation
5 Conclusions and Future Work
A Code Listing
  A.1 mem_sat_test.c
  A.2 mpb_trans.c
  A.3 priv_mem.c
  A.4 pipelined_merge.h
  A.5 pipelined_merge.c
1 Introduction
1.1 Background
The increasingly difficult problems of power consumption and heat dissipation have today all but eliminated the classic means of improving processor performance — increasing its frequency. Instead, to increase performance, technology has moved towards adding more cores to the chip. In combination with redundant processing units and multiple pipelines, this allows varying degrees of support for thread-level parallelism. In turn, software development in general is being forced to adapt to a parallel paradigm in all areas: desktop, entertainment and recently even embedded applications.
The transition towards hetero- and homogeneous multi- and many-core architectures is by no means a simple one. Efficient and effective utilisation of chip resources, which requires parallelisation, becomes more difficult.
The development of new hardware such as processors and memory sizes has, until recently, largely been following the well-known Moore's law. Off-chip memory speeds, however, are lagging behind. As these memories are, in relative terms, orders of magnitude slower than on-chip memory, main memory access becomes prominent as one of the major causes of processor stalls. This bottleneck effect is especially pronounced in memory intensive operations, such as sorting. In order to lessen the impact of the high latencies of main memory, program behaviour can be altered so as to reduce main memory access or avoid accessing main memory altogether. By employing on-chip pipelining, storing intermediate results of sub-tasks in memory can be avoided. These intermediate results can instead be immediately forwarded to the next processing unit. In addition, further performance improvement can be achieved by "parallelising" memory access to either main memory or buffers by making it concurrent with computation. This can be achieved e.g. through asynchronous memory transfers (Direct Memory Access) combined with multi-buffering.
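As a rough illustration of the multi-buffering idea (not code from this thesis; start_async_fetch, wait_fetch and compute_on are hypothetical placeholders for an asynchronous, DMA-like transfer API), a double-buffered processing loop could be structured as follows:

/* Hypothetical asynchronous transfer primitives -- placeholders, not a real API. */
extern void start_async_fetch(int *buffer, int chunk_index); /* begin fetching a chunk into buffer */
extern void wait_fetch(int *buffer);                         /* block until the fetch has completed */
extern void compute_on(int *buffer, int n);                  /* process n integers in the buffer */

enum { CHUNK = 4096 };
static int buf_a[CHUNK], buf_b[CHUNK];

void process_stream(int nchunks)
{
    int *current = buf_a, *next = buf_b;

    start_async_fetch(current, 0);                 /* prefetch the first chunk */
    for (int c = 0; c < nchunks; c++) {
        wait_fetch(current);                       /* data for this iteration has arrived */
        if (c + 1 < nchunks)
            start_async_fetch(next, c + 1);        /* next chunk is fetched in the background */
        compute_on(current, CHUNK);                /* computation overlaps the ongoing transfer */

        int *tmp = current;                        /* swap the two buffers */
        current = next;
        next = tmp;
    }
}

While one buffer is being processed, the transfer of the next one is already in flight, which is exactly the overlap of memory access and computation described above.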
This thesis approaches sorting on the Intel Single-Chip Cloud Computer 48-core concept vehicle as an algorithm engineering problem. The implementation of such an algorithm involves many variables, most significantly load balancing and memory access and communication patterns. A sorting algorithm shares similar requirements with many practical applications, such as image processing, which makes solving such a problem all the more relevant. As a pipelined variant of parallel mergesort [1] has been shown to achieve higher performance on other architectures [2], we focus on this algorithm primarily, but also look at parallel variations of samplesort [3] [4].
Parallel sorting algorithms have been investigated for many years, on many different platforms. Mergesort in particular originated as an external sorting algorithm and combined well with the sequential access requirements of early tape drives. Today, tapes have been replaced by disks or slower off-chip memory, but the sequential nature of mergesort is still highly beneficial due to good synergy with the memory hierarchies available in almost all hardware and the locality effects of such memory accesses.
The mergesort algorithm operates recursively using a divide-and-conquer paradigm. The array to be sorted is split recursively into smaller chunks, until the chunk size is one. The chunks are then merged pairwise in the correct order, until the sequence is again the complete starting length (Fig. 1.1).
Figure 1.1: The mergesort algorithm [5].
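As a concrete illustration of the description above (a textbook sketch in C, not code from the thesis), the split and merge steps of Figure 1.1 can be written as follows:

/* Merge two sorted runs a[0..na) and b[0..nb) into out[0..na+nb). */
void merge(const int *a, int na, const int *b, int nb, int *out)
{
    int i = 0, j = 0, k = 0;
    while (i < na && j < nb)
        out[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
    while (i < na) out[k++] = a[i++];    /* copy the remaining tail of a */
    while (j < nb) out[k++] = b[j++];    /* copy the remaining tail of b */
}

/* Recursive mergesort: split until the chunk size is one, then merge upward.
 * tmp must point to scratch space of at least n integers. */
void mergesort(int *v, int n, int *tmp)
{
    if (n < 2)
        return;
    int half = n / 2;
    mergesort(v, half, tmp);
    mergesort(v + half, n - half, tmp);
    merge(v, half, v + half, n - half, tmp);
    for (int i = 0; i < n; i++)
        v[i] = tmp[i];                   /* copy the merged run back in place */
}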
The split operation has negligible cost and is considered trivial. The
merge tasks are independent of each other and can be performed separately.
This task independence is a natural recursive decomposition of tasks and
allows for their concurrent execution on different processing units, resulting
in a parallel mergesort algorithm. The splitting of the sequence results in a
binary tree, the depth of which can be used as a parameter controlling the granularity of task parallelism. That is, tasks are assigned to processing units
down to a certain tree level, after which each subtree is locally sorted, i.e.
on the mapped processing unit. These lowest level tasks are thus executed
sequentially.
The obvious method of transferring sorted subsequences between the
tasks is for the tasks to write the results into memory, where they are read
by the processing unit that is assigned the next task. This is, however, not always necessary. A subsequent task does not need to wait until the previous task has completed: as soon as a task starts outputting its sorted sequence, the output is fed directly into the next task. This is pipelined parallel mergesort.
In general, memory access cost is traded for a higher communication cost
instead. Such an algorithm is also significantly harder to optimise, as there
are many interdependent variables to consider.
1.2 Previous work
No previous work exists on algorithm performance on the Intel Single-Chip Cloud Computer. However, a SIMD-enabled and/or pipelined approach has been shown to be very effective in the case of sorting on the Cell Broadband Engine processor.
The Cell is a heterogeneous PowerPC-based architecture that consists of a single general purpose core combined with 8 streaming coprocessors [6]. The main core, the Power Processing Element (PPE), is a standard 64-bit in-order dual-issue PowerPC core that supports two-way simultaneous multithreading (SMT) and Single-Instruction Multiple-Data (SIMD) instructions (AltiVec vector instructions). Being a general purpose core, the PPE runs the operating system,
but its main task is controlling the 8 coprocessors, the Synergistic Processing Elements (SPE). The SPEs, in turn, are each comprised of a Synergistic Processing Unit (SPU) and a Memory Flow Controller unit (MFC). The SPU is an in-order, dual-issue processing unit. It contains a large 128-entry 128-bit register file, supports integer and floating-point operations and is SIMD-capable, or rather its processor intrinsics consist of only SIMD instructions. The SPU has no direct access to system memory. Instead, it uses a local store of 256 KiB for both programs and data. The MFC is responsible for translating addresses between the SPUs and the system and for performing DMA transfers to the local stores.
At 3.2 GHz clock speed, the PPE theoretically delivers 25.6 GFLOPS (billion floating-point operations per second) using single precision operations, while each SPE can reach 25.6 GFLOPS.
The PPE, the SPEs, system memory and peripheral input-output interfaces
on Cell communicate via a high-speed bus called the Element Interconnect
Bus (EIB). Typically, separate programs are compiled for the PPE and the
SPEs. The PPE controls the SPEs, initialising and running small programs
there. DMA transfers can be initiated by either the PPE or the SPEs.
Regarding sorting work on the Cell processor, advances in GPGPU programming [7] were recently considered and applied by Inoue et al. [8]. In their work, the authors follow the conclusions made by Furtak et al. [9] on the benefits of exploiting available SIMD streaming instructions and examine the SIMD capabilities of the Cell, attempting to exploit them in a similar way as previously done on GPUs [7] [10] [11]. The result is Aligned-Access
sort, or AA-sort, which is a combination of an improved SIMD-optimised
combsort [12], used in-core, and the odd-even merge algorithm [13], used
out-of-core, both implemented with SIMD instructions. The relative speedup
achieved by AA-sort is 7.87x and 3.33x for the two constituent algorithms
over the same scalar implementation. The algorithm achieves a parallel
speedup of 12.2 with 16 cores when sorting 32-bit integers.
Gedik, Bordawekar and Yu identify similar Cell-specific requirements of
sorting algorithms: SIMD-optimisation of the SPE code, memory transfer
optimisation and effective utilisation of the EIB, but substitute the
odd-even merge algorithm above with two variations of bitonic sort [14]. An SPE-local sort and two distributed variations of bitonic sort, a distributed in-core and a distributed out-of-core sort, are produced. The distributed in-core
sort uses the local sort algorithm and cross-SPE transfers to internally merge
a number of elements up to a size determined by the number of participating
SPEs. For larger sequences, the distributed out-of-core sort is used, which
utilises the in-core algorithm in phases to achieve the final sorted result. The
achieved speedup sorting floats for the in-core and out-of-core sorts, over an
Intel Xeon 3.2GHz, is 21x and 4x respectively.
By employing on-chip pipelining on the Cell, Hultén et al. [2] [15] improve further upon these results and achieve an additional speedup of 70% for the IBM QS20 and 143% for the PlayStation 3 over the AA-sort implementation. This is accomplished by minimising main memory access through on-chip pipelining and asynchronous multi-buffered DMA transfers.
A pipelined on-chip version of the parallel mergesort algorithm is applied
using binary tree task partitioning and subsequently mapped to the SPEs.
Task mapping is optimised by expressing it as an integer linear programming
problem and solving it using an ILP solver.
Scarpazza and Braudaway [16] examine text indexing on the Cell, adapting this specific workload to its hardware. The solution provided affords a 4x performance advantage over a non-SIMD reference implementation running on all four cores of a quad-core Intel Q6600 processor.
Haid et al. leverage Kahn process networks [17] to generalise streaming applications [18], and target the Cell specifically [19], by executing their model using protothreads [20] (for parallelism) and windowed FIFOs (for communication). The parallel speedup achieved here is nearly seven when using seven processors on the PlayStation 3. This is especially interesting due to the generic nature of a KPN application compared to the otherwise required architecture-specific code.
1.3 Contributions of this thesis
The most significant contribution of this thesis is the design and implementation of an on-chip pipelined parallel mergesort algorithm tailored to the unorthodox hardware of the Intel Single-Chip Cloud Computer. Building on known work mentioned in the previous section, we attempt to achieve similar results on the SCC as on the Cell [2] [15]. Due to the lack of SIMD instructions on the SCC hardware, however, no optimisation in that direction is possible, but some other features of the SCC are shown to benefit from on-chip pipelining.
As there is no previous work on sorting on the SCC, an investigation of the memory and mesh interconnect capabilities is performed first. In addition, following the preliminary investigation, a simple naïve implementation is briefly handled and subsequently used for comparison with the final pipelined algorithm.
1.4 Organisation of the thesis
The remainder of this thesis is organised as follows. Chapter 2 gives a relatively high-level overview of the Intel SCC architecture, with the subsequent chapters each adding more detail to its constituent parts as necessary. Chapter 3 deals with the preliminary investigation of the architecture details that are identified to possibly impact the final algorithm design.
Chapter 4 describes the theory behind the mergesort algorithm, a naïve
parallel implementation of such an algorithm on the SCC as well as our
final design, implementation and results of the pipelined parallel mergesort
algorithm. Chapter 5 offers our conclusions on the results from chapter 4,
and future work.
1.5 Publications
Parts of this work have already been published in the following, in chronological order.
• Parallel sorting on Intel Single-Chip Cloud computer [5].
• Investigation of Main Memory Bandwidth on Intel Single-Chip Cloud Computer [21].
• Pipelined Parallel Sorting on the Intel SCC [22].
• Engineering parallel sorting for the Intel SCC [23].
2 The Intel SCC
Figure 2.1: Intel SCC Architecture Top View [24].
The Intel Single-Chip Cloud Computer [25] [24] is a chip multiprocessor. It is comprised of 24 tiles arranged in a 6x4 rectangular grid pattern. The tiles are connected by an on-chip two-dimensional mesh interconnection network. Each of the 24 tiles contains a pair of second generation Intel Pentium IA32 cores (P54C), each in turn with its own L1 and L2 cache. The L1 cache is 32 KiB, split into a 16 KiB data and a 16 KiB instruction cache. The L2 cache is unified and is 256 KiB in size. These caches are write-back, while L1 can be configured as write-through.
The two cores on a tile are joined by a mesh interface unit (MIU) (Fig. 2.2) that has several responsibilities, but its main task is to provide communication between the on-tile resources and the on-tile mesh interface, the router. In addition to the two L2 caches and the mesh router, a 16 KiB message-passing buffer, the MPB, is attached to the MIU. With 24 tiles, the total available MPB memory is thus 384 KiB. Since the IA32 cores on the SCC use local addresses and are not aware of the global chip configuration, the MIU translates core-local addresses using a look-up table (LUT) into non-local accesses, e.g. router, MPB, etc. The MIU is also responsible for the hardware configuration of the cores, using tile configuration registers.

Figure 2.2: An Intel SCC Tile [5].
The architecture supports a special type of data to facilitate message passing, MPBT, which is new to the P54C Pentium cores. This data type bypasses the L2 cache entirely and is only cached in L1. In addition, each line in the L1 cache is extended with a flag which marks whether the line in question holds MPBT data. The IA32 instruction set is further extended with an instruction (CL1INVMB) that invalidates all MPBT-marked data in L1.
Four DDR3 memory controllers are attached evenly to the routers on the
two shorter sides of the mesh rectangle. Each memory controller supports
DDR3-800 DRAM, up to 16GB per channel, allowing for a total of 64GB
memory capacity.
Six tiles are logically grouped into each of four quadrants, and each quadrant uses the closest memory controller. The available memory variants are core-private
memory and shared memory. Each core has a certain amount of private
memory, which is a reserved area within main memory assigned to that core
only. This memory is cached in all available caches. The shared memory on
the other hand is evenly distributed over the four main memory controllers
and is either only cached in the L1 cache (using the aforementioned MPBT
memory type) or not cached at all.
The SCC provides voltage and clock control with a very high degree of granularity and customisation. The voltage regulator controller (VRC) allows for voltage adjustment in any of the 6 voltage islands (dashed regions in Figure 2.3) individually, or in the entire mesh collectively. The voltage settings can be altered from any core, allowing full application control of the cores' power state, or from the system interface controller (SIF). The SIF is the interface between the mesh and the external controller located on the system board.
Even more granularity is allowed in clock frequency adjustment, as the
SCC can control each tile separately. The mesh and its routers, however, all
share a single frequency. Each tile uses the mesh clock as the input with a
configurable clock divider to arrive at a local clock. The mesh itself can be
considered to reside on its own frequency island.

Figure 2.3: SCC Voltage and clocking islands [24].
The SCC can be programmed directly, in so-called baremetal mode, or an operating system can be loaded onto each core that subsequently runs programs. A version of Linux called SCC Linux is provided for the latter mode. A set of management tools called sccKit is used for externally controlling the SCC via the SIF. These tools can be used to configure and manage the SCC, providing facilities to, e.g., power-cycle, reset and reboot the SCC. SccKit is also used for starting the SCC in one of the preset frequency profiles. The available frequency profiles are listed in Table 2.1. SCC Linux is available as modified source code for recompilation if kernel modification is necessary. Programs for SCC Linux are compiled using standard compilers provided by Intel, such as gcc or icc.
Tile (MHz) | Mesh (MHz) | Memory (MHz)
533        | 800        | 800
800        | 800        | 800
1066       | 800        | 1600
800        | 1066       |
Table 2.1: Available frequency profiles using Intel sccKit
As previously mentioned, an MPI-like API library called RCCE (pronounced "rocky") [26] exists for the SCC. The library provides three API interfaces: two for message-passing support (a basic and a gory interface), and one for power management. The basic message-passing API is a simple interface with most implementation details (such as synchronisation) hidden from the programmer. The gory interface exposes more functions and allows for more power and flexibility in implementations.
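For orientation, a minimal RCCE program using the basic message-passing interface might look like the sketch below (illustrative only; the algorithms in this thesis use the gory interface, and the exact calling conventions are those of the RCCE documentation [26]):

#include <stdio.h>
#include "RCCE.h"

int main(int argc, char **argv)
{
    RCCE_init(&argc, &argv);                    /* initialise the RCCE library on this core */

    int me = RCCE_ue();                         /* rank of this core (unit of execution) */
    int ncores = RCCE_num_ues();                /* number of participating cores */
    int payload[8] = { 1, 2, 3, 4, 5, 6, 7, 8 };/* 32 bytes, one cache line */

    if (me == 1) {
        RCCE_send((char *) payload, sizeof payload, 0);  /* core 1 sends through the MPB */
    } else if (me == 0) {
        int received[8];
        RCCE_recv((char *) received, sizeof received, 1);
        printf("core 0 of %d received %d..%d from core 1\n",
               ncores, received[0], received[7]);
    }

    RCCE_finalize();
    return 0;
}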
The programs described in this work are cross-compiled on a management console following the Intel SCC Programmer's guide [27], and subsequently deployed onto the cores for execution and testing. The gory interface is used in all algorithm implementations. The input/output control towards the processing units is handled via SSH, more specifically pssh.
3 Preliminary Investigation
There are several issues to be considered for algorithm design and implementation on the Intel SCC.

First, taking a closer look at the multi-core processor and applying classical multiprocessing paradigms, we see it bears a certain resemblance to a Non-Uniform Memory Access system: there is an interconnect network, its processing units vary in distance to their respective memory controllers, and no cache coherence is provided. Additionally, it is programmed using an SPMD (Single Program, Multiple Data) paradigm and there is an MPI-like library that provides collective communication. These variations are very likely to have an effect on the achieved results, and must be considered. The SCC is flexible in this regard, as the main memory address translation that is performed in hardware near the processing units can be configured using the cores' lookup table. The amount of memory available, for example, can be changed by modifying this table.
Second, we look at the availability of special SIMD or vector instructions.
Unfortunately, no such instructions are available on the Pentium P54C cores.
The first Pentium core that features such instructions is the P55C (Pentium
MMX).
Third, we consider the capacity and latency of the interconnection mesh
and memory. Intel specifies its bus width as 16B data plus 2B side band.
With a clock of 1600MHz, the mesh should thus be capable of a throughput
of 3052 MiB, or 2.98 GiB, per second, with a specified latency of four cycles,
including link traversal [24].
3.1 Main Memory
The memory hierarchy on the SCC from the point of a single tile and core is
not altogether different to a uniprocessor system. As previously mentioned,
each tile contains two cores, where each core has individual L1 and L2 caches.
The L1 caches are 16KiB instruction and 16KiB data each, while the L2
caches are 256KiB unified. Each tile has a local memory area intended as
a buffer for messaging, the MPB. This buffer is 16KiB per tile, which by
default is assigned one half per core, so that each core has access to 8KiB of
MPB. Since the SCC consists of 24 tiles, we have a total of 384KiB of MPB
memory.
There are four main memory interface controllers (MICs) attached to
the “east” and “west” corner tiles of the 6-by-4 mesh. The controllers each
support a maximum of 16 GB memory, allowing for a total of 64 GB main
memory. The supported memory type is DDR3-800. This memory is, in
the default configuration, logically divided in a quadrant-wise fashion to the
cores on the tiles belonging to the quadrant. Each core in a given quadrant of
the SCC is assigned a certain amount of exclusive (private) memory, served
by the quadrant-local MIC. This amount naturally depends on the amount
of main memory installed, as well as configuration parameters in the cores’
lookup tables (LUTs).
A lookup table is a set of configuration registers that are used for memory
address translation from core addresses to system addresses. Each core has
a LUT, and each LUT contains 256 entries. On an L2 cache miss, the top 8 bits of the core physical address are used as an index into the LUT, which for these 8 bits provides 22 bits of system address information. The remaining 24 bits of the core address are then appended, resulting in a system address of 34 bits. Most significantly, this LUT expansion contains a destination ID
for the mesh router where the translated system address is to be forwarded.
By configuring each core’s LUT with a certain exclusive address range and a
specific router (where the memory controller is located), cores are provided
with core-private memories. This is the default configuration of the LUTs.
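The translation step can be illustrated with the following simplified model (illustrative only; the field names and the exact split of the 22 bits of LUT information into address-extension and routing bits are assumptions, not the sccKit definitions):

#include <stdint.h>

typedef struct {
    uint32_t addr_ext;   /* upper bits of the 34-bit system address (assumed field) */
    uint16_t dest_id;    /* mesh destination (router/port) for this range (assumed field) */
} lut_entry_t;

static lut_entry_t lut[256];    /* one entry per 16 MiB slot of the 32-bit core address space */

/* Translate a 32-bit core physical address into a 34-bit system address. */
static uint64_t lut_translate(uint32_t core_addr)
{
    uint32_t slot   = core_addr >> 24;            /* top 8 bits index the LUT */
    uint32_t offset = core_addr & 0x00FFFFFFu;    /* remaining 24 bits are appended */
    return ((uint64_t) lut[slot].addr_ext << 24) | offset;
}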
In addition to the aforementioned private memory, a certain amount of
total system memory is reserved as shared memory. This memory can be
indexed by any core (i.e. the cores have overlapping LUT addresses) and is
evenly allocated from memory attached to the four memory controllers.
The SCC provides no cache coherence mechanisms. In the case of private memory, no cache coherence mechanism is even necessary, as the memory is exclusively mapped to a single core. In this case, both L1 and L2 caches are active. The shared memory, on the other hand, is not cached in L2. Shared memory is either entirely uncached, with all reads going directly to memory, or only cached in L1 and marked as MPBT memory. As previously mentioned, an instruction was also added to clear memory marked with this flag from the L1 cache. Furthermore, the P54C already has the capability to reset the L1 cache completely. Presumably, the shared memory is not cached in L2 by default because the P54C is not equipped with any means of clearing or resetting the L2 cache. Activating L2 in combination with shared memory makes an implementation of a cache coherence mechanism a requirement. Ultimately, any cache coherence must be handled by the programmer, e.g. by manual cache flushing or a particular pattern of access. The caches are preconfigured as write-back, while the L1 cache can also be configured as write-through.
As memory speeds often have a large impact on the performance of sorting algorithms, we begin by examining the memory performance [21]. This is measured as bandwidth, or bandwidth per core where more than one core is active. We examine variations in bandwidth with an increasing number of cores, as well as with different memory access types: read, write or combined. Since the SCC is capable of clock speed modulation, the effect of the core clock on memory bandwidth is also examined. In these tests, memory and mesh clock speeds are kept constant at 800 MHz, while the core clocks are tested at 533 MHz and 800 MHz respectively.
In order to consider the impact of the cache, we look at two different memory access strides. Since the cache line width is 32 bytes, reading and writing to memory is performed in two different manners: stride 4 and stride 32 bytes. Stride 4 bytes is selected for convenience as it is the size of an integer on this platform, while stride 32 is selected as it is the size of a cache line (8 integers). Special care is taken to allocate memory with 32-byte alignment, in order to ascertain that the correct part of the cache line is read or written. The mixed pattern denotes a combination of these two stride patterns. A pseudorandom access pattern is also used to attempt to circumvent any locality optimisations inherent in the hardware, whether it is a cache effect or a memory bank optimisation. This pseudorandom pattern is provided through a function [28] π(j) = (a · j) mod S for the index j, a large, odd constant a and where S is a power of two (see code example in appendix A.1). The random access pattern also applies the previous strided principle to the index j.
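A minimal sketch of such an index permutation follows (the actual constants and the strided variant are in appendix A.1; the values of A and S below are assumptions for illustration):

#define S (1u << 20)        /* number of elements, must be a power of two (assumed value) */
#define A 2654435761u       /* large odd constant (assumed value) */

/* pi(j) = (A * j) mod S; since S is a power of two, the modulo reduces to a mask.
 * With A odd, pi is a bijection on [0, S), so every index is visited exactly once. */
static unsigned int pi(unsigned int j)
{
    return (unsigned int) (((unsigned long long) A * j) & (S - 1u));
}

Iterating j from 0 to S-1 and accessing array[pi(j)] then touches every element exactly once, but in a cache- and bank-unfriendly order.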
In addition to access patterns, we look at read, write and combined access types separately, where combined access refers to simultaneous reading and writing. We also look at the scaling in the number of participating cores, between 3 and 12 cores; 12 cores is the maximum default private memory setup per controller.
The experiment is performed using a fixed data set of 200 MiB per participating core. Time is measured from the point when the cores have started up the program, throughout the memory operation and until finished. This is repeated for 100 attempts, after which the average, standard deviation, minimum and maximum values are collected. The bandwidth per core and the global aggregate bandwidth are measured. Both measurements record the number of cores that are active during the measurement. This was achieved by using variations of the code in appendix A.1.
Figure 3.1 shows the total measured read bandwidth presented as a function of the number of cores. We see no surprises here; the 4-byte/1-int stride access achieves the highest throughput for each of the two different clock speeds respectively. The lowest performance comes from the read random 8-int pattern, as this type of access is designed to circumvent caches. The same can be said about the results in the diagram for write access in Figure 3.2. The highest total throughput of the 12-core aggregate, 120 MiB per second, is achieved by sequential int writes, which is an excellent example of the effect of the cache. Recall that the L2 cache is write-back on the SCC — it follows that the pattern that results in the fewest cache evictions will achieve the highest performance here. The only patterns that repeatedly write to the same cache line are the 1-int per write ones, and they naturally have the highest performance. We see that 1-int random and sequential access have the same performance, since they result in the same amount of cache evictions. The weakest performance is shown by 8-int stride random accesses, which not only evict a cache line each time, but also are constructed to avoid any memory optimisations for sequential reading that the memory controller affords. This access pattern is likely to be very close to the lowest possible write performance achievable on the SCC. These same results are presented per core in Figures 3.3 and 3.4.
Since no bandwidth drop with an increasing number of cores is evident and the aggregate memory bandwidth shown above rises linearly with the number of cores, a single memory controller cannot be saturated using a maximum of 12 cores. The slight drop in write bandwidth in Fig. 3.4 is attributed to the L1 cache, which is configured as no-write-allocate. This strategy causes a cache line not to be read into the cache on a write cache miss, i.e. when exclusively writing data, it is likely that the L1 is completely bypassed.
Figure 3.1: Global main memory read bandwidth at 533 and 800 MHz [21].
Figure 3.2: Global main memory write bandwidth at 533 and 800 MHz [21].
Figure 3.3: Strided read memory bandwidth per core at 533 and 800 MHz [21].
Figure 3.4: Strided write memory bandwidth per core at 533 and 800 MHz [21].
Finally, in Figures 3.5 and 3.6, we see that memory locality is a consideration, even for random access. Despite the high performance of the memory controllers, they struggle to serve highly irregular access patterns and perform better with sequential access.
Figure 3.5: Random pattern read memory bandwidth per core at 533 and 800 MHz [21].
Figure 3.6: Random pattern write memory bandwidth per core at 533 and 800 MHz [21].
3.2 Mesh Interconnect
The speed of the mesh and the message passing buffers is another issue that
influences the details in the construction of our algorithm.
The two-dimensional mesh network consists of 24 packet-switched routers,
or one per tile (Fig. 3.7), organised in the aforementioned 6x4 configuration.
The mesh has its own power supply and clock source, in order to improve
support for dynamic power management. The flow control in the mesh is
credit based. Each core is connected to the router on the tile using the mesh
interface unit, which is responsible for, among other things,
packetising/de-packetising data and translating local addresses into system addresses. The
MIU has a buffer, MPB, which is 16KiB and divided in half for each core.
The MIU communicates directly with the tile router. Each router has eight
credits to give per port and can send a packet to another router only when
it has a credit from that router. Credits are returned to the sender once the
packet has moved on. Error checking is performed primarily through parity.
No error correction is performed.
We are interested in the performance of the mesh, the routers and the mesh interface unit under a high load from the processors [5]. This is evaluated using a test program (a variation of the listing in appendix A.2). The evaluation method consists of investigating latency and throughput by having a single core (core 0) send a specified amount of data to every other core not sharing the same tile, while monitoring the time taken to perform the transfer. The variables of the test are the core distance in hops (Fig. 3.8) and the size of the transferred data. Each test is performed 1000 times and the average is taken as a sample.
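The timing itself follows a simple pattern; the sketch below shows only the general structure, with do_transfer() as a hypothetical placeholder for the actual MPB transfer code in appendix A.2:

#include <stddef.h>
#include <stdio.h>
#include <sys/time.h>

extern void do_transfer(size_t bytes, int remote_core);   /* hypothetical placeholder */

double average_transfer_ms(size_t bytes, int remote_core, int repetitions)
{
    struct timeval start, stop;

    gettimeofday(&start, NULL);
    for (int i = 0; i < repetitions; i++)
        do_transfer(bytes, remote_core);                   /* e.g. repeated 1000 times */
    gettimeofday(&stop, NULL);

    double elapsed_ms = (stop.tv_sec - start.tv_sec) * 1000.0
                      + (stop.tv_usec - start.tv_usec) / 1000.0;
    return elapsed_ms / repetitions;                       /* one sample: the average time */
}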
We do not test data sets larger than the size of the L2 cache. These sizes would result in frequent main memory access, which in turn generates extra mesh traffic and could naturally introduce undesirable variability in our test. By ensuring that data is exchanged from within the L2 cache only, we avoid any impact on timing that main memory access would have.

Figure 3.7: SCC Tile Level Diagram [24].
The results of the first round of tests are displayed in Figures 3.9 through
3.14 for data sizes of 2, 4, 8, 16, 32 and 64 kibi-integers or 8, 16, 32, 64, 128
and 256 kibibytes respectively.
First, in Fig. 3.14 we see that the timings for 64 Ki integers are highly inconsistent. This is attributed to memory access. A data set of this size is highly unlikely to fit in L2, even if a single program is running on the processing unit. Other processes, along with the operating system, are assumed to be intruding on the utilisation of L2. It is evident that there is some private memory access in this case, which is influencing the transfer timings.
Second, for data sets of 2-32 kibi-integers (8-128 KiB), we see that the timings roughly double with the doubling of the data size. This indicates again, as in the case of main memory, that the processing units are unable to saturate the mesh. Another representation of the same data is given in Fig. 3.15, where the same numbers can be seen as a function of hop distance. The marginal timing increase is more prominent in this figure, along with the cache limit at 256 KiB.
Finally, a second round of testing is performed. This is done in order
to better ascertain the availability of L2 cache, i.e. to find out the amount
of data that can safely be cached before memory access starts to have a
significant impact on performance. For this, data sizes of 40, 48 and 56 Ki-ints are selected (160, 192 and 224 KiB respectively). The results of these additional tests can be seen in Figures 3.16 through 3.18.

Figure 3.8: Four different mappings of core pairs with increasing distance [5]: (a) 3 hops, (b) 5 hops, (c) 6 hops and (d) 8 hops between cores.
Figure 3.9: Average transfer time for 2Ki integers/8 KiB.
Figure 3.10: Average transfer time for 4Ki integers/16 KiB.
Figure 3.11: Average transfer time for 8Ki integers/32 KiB.
Figure 3.12: Average transfer time for 16Ki integers/64 KiB.
Figure 3.13: Average transfer time for 32Ki integers/128 KiB.
Figure 3.14: Average transfer time for 64Ki integers/256 KiB.
From the above we see that main memory access interference begins to make itself apparent at a data size of 192 KiB. 160 KiB, in comparison with the lower data set results, looks relatively unaffected. We thus see that, ideally, to avoid added memory access in mesh communication when designing and programming for pipelining (with the current configuration of hardware and software), data sets of 160 KiB should preferably be used, and definitely no more than 192 KiB.
Figure 3.15: Average time to transfer 64, 128 and 256 KiB as a function of the distance between cores [5].
Figure 3.16: Average transfer time for 40Ki integers/160 KiB.
Figure 3.17: Average transfer time for 48Ki integers/192 KiB.
Figure 3.18: Average transfer time for 56Ki integers/224 KiB.
3.3 Conclusions
Tests were performed on the memory and the mesh in order to obtain results relevant to our tailored algorithm design. The following are important considerations to be made during this process:

1. The P54C cores, albeit extended with new features and clocked much higher than their original stock frequency, are not on par performance-wise with the rest of the hardware in the SCC. The DDR3 controllers and the mesh are extremely fast and can only be taxed by the P54C cores using a heavy write load. This is not entirely unexpected, as there is limited die area and many cores are provided. Our tests show that a single memory controller remains at nearly its maximum performance even when a full quadrant of the SCC is reading from it. Furthermore, a single mesh link cannot be significantly slowed down by communication between any two cores, as long as main memory access is avoided, i.e. for any type of pipelining considerations.

2. Any memory access other than cache access will result in added mesh communication, since the memory is accessed through the mesh itself. The preferred data size for local buffers with pipelining is thus 160 KiB, with no more than 192 KiB used at any time. Ideally, these parameters should be made configurable.

3. Despite the high overall performance of the memory, write bandwidth is comparatively low, and the mesh interconnect is even faster. Combined with the low performance of the processing units, this makes the SCC a good candidate for pipelined sorting.
4 Mergesort Algorithm
4.1 Simple approach
As an initial implementation, we begin by constructing a naïve parallel
mergesort algorithm. Each level of the mergesort tree is mapped to a set of
cores. This simplification means that we may only use a number of cores
that is a power of two, and at maximum, only 32 of the 48 available cores
are used. Furthermore, all of these 32 cores are only used in the first round;
as the number of sequences to be sorted halves every round, so does the
number of participating cores. With a large number of cores idle during the
sorting, the efficiency of this algorithm should be extremely low.
4.1.1 Algorithm
The algorithm uses the cores' private memory to store integer blocks and uncached shared memory as a buffer to transfer these blocks between them. Since uncached shared memory is used, no cache coherence mechanism is required. The algorithm is initialised by selecting the number of integers N (the size of the data), the number of participating nodes P and setting the number of active nodes P_a = P. In step 0, each node pregenerates two pseudorandom nondecreasing sequences of length N/(2P). These sequences simulate the output from the initial sequential round of merging.

The algorithm then enters a sequence of rounds, where each round consists of two phases, sorting and transfer. In the sorting phase, the active nodes in the current round merge two sequences into one of combined length N/P_a. The sorting phase is then completed and the algorithm proceeds to the transfer phase (Fig. 4.2). In the transfer phase, the number of active nodes is integer-divided by two (using a logical right shift) and the nodes that are becoming inactive transfer their sorted sequences to the nodes remaining active. The transfer is performed using buffers in shared memory. During the transfer phase, flags are set in the communicating cores' MPBs for synchronisation.

The round is then completed; the active nodes (the nodes with rank less than P_a) continue on to the next round while the inactive nodes become idle.

When the last round completes and the algorithm ends, the root node has merged the last two sequences into a single nondecreasing sequence of length N. Figure 4.1 provides an illustration of this simple algorithm.
Figure 4.1: Naïve Parallel Merge: each round, half of the cores become inactive after merging and transferring their assigned sequences [5].
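The round structure for a single node can be sketched as follows (illustrative pseudo-C, not the implementation in appendix A.3; the pairing of inactive and active nodes and the helper names are assumptions):

extern void merge_local_pair(void);            /* hypothetical helpers, see appendix A.3 */
extern void send_run_to(int rank);
extern void receive_run_from(int rank);

void naive_parallel_merge(int rank, int P)
{
    int active = P;                       /* P_a, the number of active nodes */

    while (rank < active) {
        merge_local_pair();               /* sorting phase: merge two runs into one of length N/active */
        if (active == 1)
            break;                        /* the root has produced the final sequence of length N */

        active >>= 1;                     /* transfer phase: halve the number of active nodes */
        if (rank >= active) {
            send_run_to(rank - active);   /* this node becomes inactive after handing off its run */
            break;
        }
        receive_run_from(rank + active);  /* second input run for the next round */
    }
}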
Three other variants of the algorithm above are implemented. They use
the same basic algorithm as above, but alter it as follows.
Two of the variants rely exclusively on shared memory instead of private memory.
The shared memory variants of the algorithm do not have a transfer phase;
by relying directly on shared memory as storage for input and output blocks,
the transfer phase is avoided. That is, there is no copying of data between
rounds; all that is required is a synchronisation for each subsequent round
to begin. Two cores are assigned a common buffer in shared memory for
their exclusive use, where one core is the sender and the other the receiver.
Flags are set by the cores in their respective MPBs for synchronisation, i.e.
when they are allowed to read or write their assigned buffer. The shared
memory mergesort algorithm is implemented in two variants: one cached
and one uncached.
The uncached shared memory version uses no caches, but accesses memory directly. No cache coherence mechanism is provided or necessary.
In the cached shared memory version of the algorithm, the L1 and L2
caches are enabled for caching and an explicit cache flush is added in place
of the transfer phase. As the SCC has no cache coherence, this is required
to maintain main memory consistency for the next round of computation.
The final version uses the MPB as a buffer instead of shared memory, and
relies on mesh communication to transfer the data between working cores.
Note that this algorithm is still in no way pipelined; the memory blocks are simply transferred from the private memory range of one set of cores to the
private memory of another set. This variant should nevertheless reduce the
amount of memory access compared to the first version.
Figure 4.2: A transfer phase of the naïve algorithm variants [5].
4.1.2 Experimental Evaluation
The measurements are performed as follows. Each of the initially active
nodes generates two pseudorandom nondecreasing integer sequences that
are to be merged. Once the starting sequences are randomised, timing and
then sorting starts. The sequences local to each core are sorted and the
respective algorithm above is followed. When the root task on the root rank
processing unit completes, the timer is stopped and the resulting sequence
is verified for correctness. Each measurement is performed in excess of 1000
runs, and the average of these is sampled. The results of the measurements
are represented in Figures 4.3 through 4.8. One additional test is performed
with constant values for comparison purposes (Fig. 4.9).
The results of the tests show the initial version of the algorithm having
the weakest performance in all cases except the single node one. This is unsurprising, as this version of the algorithm requires the most main memory
access. Starting with the 32 node case in Figure 4.8 we see that, for the first
algorithm that uses private memory with shared memory as a buffer, the
additional phase copying to shared memory and back induces a performance
penalty of over 60% over the same algorithm that instead uses mesh
communication. Recall that the writing and subsequent reading from shared
memory between two rounds of the algorithm are replaced here by instead
transferring the same data between two cores’ private memories using the
mesh. We see that very similar results are obtained for descending numbers of cores; the results in Figures 4.8, 4.7 and 4.6 for 32, 16 and 8 cores
respectively are nearly the same. The inefficiency of the base algorithm is
highly apparent in these, since there is no significant speedup in any of the variants between 8 and 32 cores, despite the quadrupling of the number of working cores.

Figure 4.3: Merging time using 1 processor [5].
Figure 4.4: Merging time using 2 processors [5].
Figure 4.5: Merging time using 4 processors [5].
Figure 4.6: Merging time using 8 processors [5].
Figure 4.7: Merging time using 16 processors [5].
Figure 4.8: Merging time using 32 processors [5].

Furthermore, comparing any of the results to its single core counterpart in Figure 4.3 reveals that there is actually no speedup at all. In
the case of the shared memory variants in these three Figures, we see that
the uncached shared memory offers particularly low performance. Despite
the complete lack of a transfer phase here, we still see almost as low
performance as the worst-case variant. Naturally, there is no cache used here, so
low performance is expected. The best results are achieved with the cached
shared memory algorithm, which both takes advantage of caching as well as
avoids extra copying.
Continuing in the reverse order, we look at the results for 4- and 2-node
tests (Figures 4.5 and 4.4). We see that, as the number of utilised cores
decreases between 8 and 2, the performance of the private memory version
of the algorithm with shared memory buffers improves over the uncached
shared memory one. This is attributed to the fact that these cases are less
parallelised in that there are fewer rounds. As the number of rounds is equal to log2 of the number of nodes, each halving of the nodes reduces the number of rounds, and thereby the block transfer operations between rounds, by one.
Ultimately, the results depicted in the Figure for a single processor show
best performance for all variants (Fig. 4.3). That is, our naïve attempt at
parallelisation of the mergesort algorithm does not yield any advantage over
the non-parallel version. Both private memory versions of the algorithm in
this special case are the same, and hence perform the same. They perform
better than the cached shared version as the memory required grows, since
the private memory is always allocated on the closest memory controller.
Naturally, uncached shared memory is, lacking cache, again significantly
slower. The single-core results confirm our previous memory experiments with regard to private and shared memory speeds.
Figure 4.9: Merging time using 32 processors, using constant values [5].
4.2 Pipelined mergesort
The pipelined parallel mergesort algorithm is a version of the mergesort algorithm. It shares the same basic features as even the simple parallel mergesort described above, but optimises away as much as possible of the memory access, usually trading it off for a communication cost. By pipelining the steps of the algorithm in a way much similar to how a processor pipelines instructions, one constant stream of sorting can be executed, reading, as input, unsorted elements from an external location while writing, as output, their sorted sequence. Assuming, again, a tree mapping, the leaves of the tree read the unsorted sequences to be merged, merge a certain buffer size of elements and communicate the subsequence upward in the tree, until the stream reaches the root, which writes a fully sorted sequence. This continues until all the elements are consumed.
There are many variables in designing such an algorithm. A sorting tree
depth must be selected that allows for a desired task granularity, but does
not introduce additional resource strain. The granularity is typically more
than a single task per processing unit. Task assignment must be done onto
the processing units of the underlying hardware in a way that optimises
its usage. Here, a trade-off must be made between the amount of memory
access, communication and computation.
4.2.1 Design
Ordinary sequential merging has a linear computation cost relative to the input size. Due to this, we know that each full level of the merge tree has the same computation cost. Assuming that a root task must be assigned to a single core, one way of partitioning would assume a tree of depth similar to the number of nodes. For the SCC in particular, the size of this tree would be infeasible, so the number of tasks must be reduced. Instead of a single large tree, we opt for several smaller ones. Since this will introduce a second phase of merging, the number of trees must be a power of two to allow a balanced merge in the second phase. The number of trees should also divide the total core number evenly, in order to map efficiently onto the SCC's 48 cores. The locality of memory controllers on the mesh should also be considered.
We opt for a forest of 8 trees with 6 levels each [29] [28], and the top-level
view of the algorithm results in the following phases:
• A local mergesort phase, phase 0, is required to obtain the starting
subsequences. The leaves of the 8 trees each read their assigned block
of input elements to be sorted and merge them in their private memories. After this phase, the pipelined merge phase can begin.
• Phase 1 runs a pipelined parallel merge with 8 6-level trees. This
results in 8 sorted subsequences.
• Phase 2 consists of a parallel sample sort algorithm. This is done in
order to achieve a higher core utilisation ratio compared to a solution
similar to phase 1.
Phase 2 is required to merge the 8 sorted subsequences produced by phase
1. If this phase is mapped to a parallel mergesort in the same manner, there
would be a significant number of idle cores, reducing efficiency. Instead, we
opt for a parallel sample sort and use all 48 cores even in the second phase.
The task mapping is modelled for the SCC using an integer linear programming (ILP) based method [29] [28]. The models allow for optimisation of either the aggregate overall hop distance between tasks, weighted by inter-task communication volumes, or the aggregate overall hop distance of tasks to their memory controller, weighted by the memory access volumes. The model balances computational load, in addition to distributing leaf tasks across cores to reduce the running time of phase 0. The linear combination is controlled using weight parameters.
An arbitrary manual task map is also produced, the layer map [29]. The
simple layer map is, as the name implies, based on tree levels. As we know
that each tree level has the same computation cost, we map each level of a
tree to a single core. With 12 cores and two 6-level trees, we have exactly
one tree level per core. Since we also know from the previous experiments
that the distance to the memory is the biggest influence on memory access
times, we place the first 6-level tree such that the root node (on level one)
is on a single core closest to the memory controller, with every subsequent
tree level leading away from the MIC in a semi-circular fashion (see Figure
4.10). The lowest level leaves are thus mapped on the second-nearest core
to the memory controller. The reverse is done with the second 6-level tree.
Figure 4.10: Per-level distribution of the layer map and pipeline data flow.
4.2.2 Algorithm
The inputs to the algorithm and program are:
• A task map file. This file contains the task mapping to the SCC's cores, in a per-quadrant fashion. This mapping is replicated internally relative to the local memory controller.
• A data file containing the integer elements to be sorted.
• MPB buffer size to be used in communication.
Initially, the task map file input is parsed in order to generate an internal
representation of the task tree. For simplicity, the map file is represented as
a 7-level tree, where the root is ignored, i.e. two 6-level trees. The recursive
function generate_subtree is responsible for the task and tree generation, in
addition to calculating offset sizes (into the unsorted input array) for leaves.
Each task is represented by the data structure in Listing 4.1. After the task
tree is generated, each processing unit uses the mapping and the task tree
to determine which tasks it is responsible for executing. These tasks are
collected into a local task array.
Listing 4.1: The task data structure representation

struct task
{
    unsigned short id;
    unsigned short local_id;
    struct task *left_child;              /* tree structure */
    struct task *right_child;
    struct task *parent;
    unsigned short cpu_id;                /* the id of the cpu this task is running on */
    unsigned short tree_lvl;              /* the level of the tree the task is on */
    t_vcharp buf_start;                   /* pointer to the start of the buffer in the MPB */
    unsigned short buf_sz;                /* the size of the data buffer in 32B lines,
                                           * including the header */
    unsigned size;                        /* total number of integers that need to be
                                           * handled by this task */
    unsigned progress;                    /* progress of task, i.e. how much of size
                                           * has been completed; if equal to size
                                           * the task is finished */
    void (*function)(struct task *task);  /* pointer to the function that will
                                           * run this task */
    leaf_props_t *leaf;                   /* leaf properties */
};
typedef struct task task_t;
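A hypothetical sketch of the recursive tree generation described before Listing 4.1 is shown below; the allocation strategy, the id assignment and the handling of leaf offsets are simplified assumptions, not the actual generate_subtree implementation:

#include <stdlib.h>

static unsigned short next_task_id = 0;

/* Build a subtree of the given depth; each node handles `count` integers and the
 * two children split that range in half (leaves would additionally record their
 * input offsets in their leaf properties, omitted here). */
task_t *build_subtree(task_t *parent, unsigned short level, unsigned short max_level,
                      unsigned count)
{
    task_t *t = calloc(1, sizeof *t);
    t->id       = next_task_id++;
    t->parent   = parent;
    t->tree_lvl = level;
    t->size     = count;

    if (level < max_level) {
        t->left_child  = build_subtree(t, level + 1, max_level, count / 2);
        t->right_child = build_subtree(t, level + 1, max_level, count - count / 2);
    }
    return t;
}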
Based on the processor-local task array, the MPB buffer sizes are calculated and allocated (setup_buffers). Since the leaf nodes all read their input from main memory, they do not need local MPB buffers. Instead, they push sorted elements upward in the tree. The branches and the root, however, each have their respective MPB input buffer. The MPB is preallocated and used proportionally, based on local task weighting, as follows:
1. A task weight is assigned to each task and calculated based on the level
of the binary tree the given task is on. Each task has a weight score of
half of the task directly above in the task tree, starting with 1 for the
root node, i.e. w = 1/2^l, where w is the task weight for a given task and l = 0, ..., 5 its tree depth, starting with 0 for the root. For example,
the root task has a score of 1, the branches immediately below the
root have the score of 1/2, and so on. The task weight is proportional
to the computation cost of a task. As it is simple to calculate, it is
not saved in the task structure.
2. Each core gathers its core-local tasks, and calculates the sum of their
weights. The remaining steps are calculated on a core-local basis.
3. The MPB buffer, whose size is provided as an input variable to the program and is no more than 8128 bytes, is assigned to each task proportionally. The proportion of this buffer that a task receives is equal to the proportion of the task's weight to the core-local task weight sum. Given t = 1, ..., n local tasks on the node, B_tot as the constant total per-core buffer size and the weight w as in step 1, each task's buffer is calculated as follows:

   B_t = B_tot · w_t / ( Σ_{j=1}^{n} w_j )

As an example, assume an MPB size of 4000 bytes. Assume further that
the mapping is such that the current core executes 3 tasks, the root of the
6-level tree and its immediate branches. The MPB buffer size assigned to
the root task would be 4000 · 1/(1 + 1/2 + 1/2) = 2000 bytes.
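A sketch of this proportional assignment in C follows (illustrative only; the function and variable names are not from the thesis code, and buf_sz is stored in 32-byte lines as in Listing 4.1):

/* Weight of a task at tree depth l: w = 1/2^l, with l = 0 for the root. */
static double task_weight(unsigned short tree_lvl)
{
    return 1.0 / (double) (1u << tree_lvl);
}

/* Divide the per-core MPB budget b_tot (bytes, at most 8128) among the local tasks. */
void assign_mpb_buffers(task_t **local_tasks, int ntasks, unsigned b_tot)
{
    double weight_sum = 0.0;
    for (int i = 0; i < ntasks; i++)
        weight_sum += task_weight(local_tasks[i]->tree_lvl);

    for (int i = 0; i < ntasks; i++) {
        double share = task_weight(local_tasks[i]->tree_lvl) / weight_sum;  /* w_t / sum of w_j */
        unsigned bytes = (unsigned) ((double) b_tot * share);               /* B_t = B_tot * share */
        local_tasks[i]->buf_sz = (unsigned short) (bytes / 32);             /* stored in 32-byte lines */
    }
}

With the 4000-byte example above (root plus two branches), the root's share evaluates to 4000 · 0.5 = 2000 bytes, matching the hand calculation.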
Next, each node sets up its respective buffers in the MPB. An MPB memory descriptor data type is introduced for keeping track of an MPB buffer (Listing 4.2). This descriptor is used in a similar manner as a protocol header and contains metadata such as the progress of production/consumption of data, etc.
During this process, tasks are also assigned their corresponding task
function. There are three types of tasks: root, branch and leaf tasks. Each
of these types of tasks performs the same function, but with a different set
of parameters. Therefore, a function is implemented for each: run_root,
run_branch and run_leaf. The function type is stored as a function pointer
in the task tree structure, so that it can be accessed from there directly.
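A hypothetical scheduling loop built on these function pointers could look like the following (the actual loop in pipelined_merge.c may be organised differently):

/* Drive the core-local tasks round-robin until every task has processed its full size. */
void run_local_tasks(task_t **local_tasks, int ntasks)
{
    int unfinished = ntasks;

    while (unfinished > 0) {
        unfinished = 0;
        for (int i = 0; i < ntasks; i++) {
            task_t *t = local_tasks[i];
            if (t->progress < t->size) {
                t->function(t);              /* run_root, run_branch or run_leaf */
                if (t->progress < t->size)
                    unfinished++;
            }
        }
    }
}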
Listing 4.2: The MPB memory descriptor

struct mpb_header
{
    unsigned long seq;            /* the sending task's counter, equal to the progress
                                   * of the task; incremented every time the buffer
                                   * is written to */
    unsigned long ack;            /* the receiving task's counter, set equal to
                                   * seq when the buffer has been received */
    unsigned short start_os;      /* the offset to the first valid integer in the
                                   * buffer (since some may have been consumed
                                   * already) */
    unsigned short int_ct;        /* number of valid (unconsumed) integers
                                   * currently in the data area */
    unsigned short src_task_id;   /* the source task, writes to this buffer */
    unsigned short dst_task_id;   /* the destination task, reads from this buffer */
} __attribute__((aligned(32)));
typedef struct mpb_header mpb_header_t;