
March 2, 2021

Hardware Architecture Impact on Manycore Programming Model

Erik Stubbfält

Civilingenjörsprogrammet i informationsteknologi


Institutionen för informationsteknologi (Department of Information Technology)

Visiting address: ITC, Polacksbacken, Lägerhyddsvägen 2
Postal address: Box 337, 751 05 Uppsala
Website: http://www.it.uu.se

Hardware Architecture Impact on Manycore Programming Model

Erik Stubbfält

This work investigates how certain processor architectures can affect the implementation and performance of a parallel programming model.

The Ericsson Many-Core Architecture (EMCA) is compared and contrasted to general-purpose multicore processors, highlighting differences in their memory systems and processor cores. A proof-of-concept implementation of the Concurrency Building Blocks (CBB) programming model is developed for x86-64 using MPI. Benchmark tests show how CBB on EMCA handles compute-intensive and memory-intensive scenarios, compared to a high-end x86-64 machine running the proof-of-concept implementation. EMCA shows its strengths in heavy computations while x86-64 performs at its best with high degrees of data reuse. Both systems are able to utilize locality in their memory systems to achieve great performance benefits.

External supervisors: Lars Gelin & Anders Dahlberg, Ericsson
Subject reader: Stefanos Kaxiras

Examiner: Lars-Åke Nordén
ISSN 1401-5749, UPTEC IT 21001

Printed by: Ångströmlaboratoriet, Uppsala universitet


This project investigates how different processor architectures can affect the implementation and performance of a parallel programming model. The Ericsson Many-Core Architecture (EMCA) is analyzed and compared with commercial multicore processors, and differences in their respective memory systems and processor cores are discussed. A prototype Concurrency Building Blocks (CBB) implementation for x86-64 is developed using MPI. Benchmark tests show how CBB together with EMCA handles compute-intensive and memory-intensive scenarios, compared with a modern x86-64 system running the developed prototype. EMCA shows its strengths in heavy computations, and x86-64 performs best when data is reused to a high degree. Both systems use locality in their respective memory systems in a way that greatly benefits performance.


Contents

1 Introduction
2 Background
  2.1 Multicore and manycore processors
  2.2 Parallel computing
    2.2.1 Different types of parallelism
    2.2.2 Parallel programming models
  2.3 Memory systems
    2.3.1 Cache and scratchpad memory
  2.4 Memory models
  2.5 SIMD
  2.6 Prefetching
  2.7 Performance analysis tools
  2.8 The actor model
  2.9 Concurrency Building Blocks
  2.10 The baseband domain
3 Purpose, aims, and motivation
  3.1 Delimitations
4 Methodology
  4.1 Literature study
  4.2 Development
  4.3 Testing
5 Literature study
  5.1 Comparison of architectures
    5.1.1 Memory system
    5.1.2 Processor cores
    5.1.3 SIMD operations
    5.1.4 Memory models
  5.2 Related academic work
    5.2.1 The Art Of Processor Benchmarking: A BDTI White Paper
    5.2.2 A DSP Acceleration Framework For Software-Defined Radios On x86-64
    5.2.3 Friendly Fire: Understanding the Effects of Multiprocessor Prefetches
    5.2.4 Analysis of Scratchpad and Data-Cache Performance Using Statistical Methods
6 Selection of software framework
  6.1 MPI
    6.1.1 Why MPI?
    6.1.2 MPICH
    6.1.3 Open MPI
7 Selection of target platform
8 Evaluation methods
  8.1 Strong scaling and weak scaling
    8.1.1 Compute-intensive benchmark
    8.1.2 Memory-intensive benchmark without reuse
    8.1.3 Memory-intensive benchmark with reuse
    8.1.4 Benchmark tests in summary
  8.2 Collection of performance metrics
  8.3 Systems used for testing
9 Implementation of CBB actors using MPI
  9.1 Sending messages
  9.2 Receiving messages
10 Creating and running benchmark tests
  10.1 MPI for x86-64
  10.2 CBB for EMCA
11 Results and discussion
  11.1 Compute-intensive benchmark
    11.1.1 Was the test not compute-intensive enough for EMCA?
  11.2 Memory-intensive benchmark with no data reuse
  11.3 Memory-intensive benchmark with data reuse
  11.4 Discussion on software complexity and optimizations
12 Conclusions
13 Future work
  13.1 Implement a CBB transform with MPI for x86-64
  13.2 Expand benchmark tests to cover more scenarios
  13.3 Run benchmarks with hardware prefetching turned off
  13.4 Combine MPI processes with OpenMP threads

List of Figures

1 Memory hierarchy of a typical computer system [7].
2 Memory hierarchy and address space for a cache configuration (left) and a scratchpad configuration (right) [2, Figure 1].
3 Main artefacts of the CBB programming model.
4 Conceptual view of a multicore system implementing TSO [7, Figure 4.4 (b)]. Store instructions are issued to a FIFO store buffer before entering the memory system.
5 Categorization of DSP benchmarks from simple (bottom) to complex (top) [5, Figure 1]. The grey area shows examples of benchmarks that BDTI provides.
6 Processor topology of the x86-64 system used for testing.
7 The CBB application used for implementation.
8 Normalized execution times for the compute-intensive benchmark test with weak scaling.
9 Normalized execution times for the compute-intensive benchmark test with strong scaling.
10 Speedup for the compute-intensive benchmark test with strong scaling.
11 Speedup for the compute-intensive benchmark test with strong scaling and 64-bit floating-point addition. Only EMCA was tested.
12 Normalized execution times for the memory-intensive benchmark with no data reuse and weak scaling.
13 Normalized execution times for the memory-intensive benchmark with no data reuse and strong scaling.
14 Speedup for the memory-intensive benchmark with no data reuse and strong scaling.
15 Normalized execution times for the memory-intensive benchmark with data reuse and weak scaling.
17 Normalized execution times for the memory-intensive benchmark with data reuse and strong scaling.
18 Speedup for the memory-intensive benchmark with data reuse and strong scaling.
19 Cache miss ratio for the memory-intensive benchmark with data reuse and strong scaling.

List of Tables

1 Flag synchronization program to motivate why memory models are needed [20, Table 3.1].
2 One possible execution of the program in Table 1 [20, Table 3.2].


1 Introduction

This work is centered around the connections between two areas within computer science, namely hardware architecture and parallel programming. How can a programming model, developed specifically for a certain processor type, be expanded and adapted to run on a completely different hardware architecture? This question, which is a general problem found in many areas of industry and research, is what this thesis revolves around.

The project is conducted in collaboration with the Baseband Infrastructure (BBI) department at Ericsson. They develop low-level software platforms and tools used in baseband software within the Ericsson Radio System product portfolio. This includes the Concurrency Building Blocks (CBB) programming model, which is designed to take full advantage of the Ericsson Many-Core Architecture (EMCA) hardware.

EMCA has a number of characteristics that set it apart from commercial off-the-shelf (COTS) designs like x86-64 and ARMv8. EMCA uses scratchpad memories and simplistic DSP cores instead of the coherent cache systems and out-of-order cores with simultaneous multithreading found in general-purpose hardware. These differences, and more, are investigated in a literature study with a special focus on how they might affect run-time performance.

MPI is used as a tool for developing a working CBB prototype that can run on both x86-64 and ARMv8. This choice is motivated by the many similarities between concepts used in CBB and concepts seen in MPI. Finally, a series of benchmark tests are run with CBB on EMCA and with the CBB prototype on a high-end x86-64 machine. These tests aim to investigate some compute-intensive and memory-intensive scenarios, which are both relevant for actual baseband software. Each test is run with a fixed problem size which is divided equally among the available workers, and also with a problem size that increases linearly with the number of workers. EMCA shows very good performance with the compute-intensive tests. The test (using 16-bit integer addition) is in fact deemed to not be compute-intensive enough to highlight the expected scaling behavior, and a modified benchmark (using 64-bit floating-point addition) is also tested. In the memory-intensive tests, it is shown that x86-64 performs at its best when the degree of data reuse is high and it can hold data in its L1D cache. In this scenario it shows better scaling behavior than EMCA. However, x86-64 takes a much larger performance hit than EMCA when the number of processes exceeds the number of available processor cores.

The rest of this report is structured as follows: Section 2 describes the necessary background theory on the problem at hand. Section 3 discusses the purpose, aims and motivation behind the project, along with some delimitations. Section 4 goes into the methodology used. The literature study is contained in Section 5, and Sections 6 and 7 describe the software and hardware that are used for development. The development of a CBB proof-of-concept and a series of benchmark tests is described in Sections 8, 9 and 10, and the results are presented and discussed in Section 11. Finally, Section 12 contains some conclusions and a summary of the contributions made, and Section 13 describes how this work could be continued in the future.

2 Background

To arrive at a concrete problem description, a bit of background theory on computer architecture and parallel programming is required. The following paragraphs provide details about some of the concepts that will be central later on.

2.1 Multicore and manycore processors

Traditionally, the key to making a computer program run faster was to increase the performance of the processor core that it was running on. The increase in single-core performance over the years was made possible by Moore's law [13, p. 17], which describes how the number of transistors that fit in an integrated circuit of a given size has doubled approximately every two years. However, this rate of progression started to level off in the late 00s, mainly as a consequence of limits in power consumption.

To get more performance per watt, the solution was to add more processor cores and have them collaborate on running program code. This type of construction is commonly referred to as a multicore processor. In cases where the number of cores is especially high, the term manycore processor is often used.

2.2 Parallel computing

The performance observed when running a certain algorithm rarely scales perfectly with the number of processor cores added. Instead, the possible speedup for a fixed problem size is indicated by Amdahl’s law [14], which can be formulated as

$$\text{Speedup} = \frac{1}{(1 - f) + \frac{f}{s}},$$

where f is the fraction of the program that can benefit from additional system resources (in this case processor cores) to get a speedup of s. Another way of looking at this is that s is the number of processor cores used. This means that there is a portion of the program that can be split up into parallel tasks. The fraction of the code that cannot be parallelized, and thus cannot benefit from more cores, is represented by 1 − f. Finding and exploiting parallelism, i.e. maximizing f, is crucial to getting the most out of modern multicore hardware in terms of overall performance. Scaling the number of processors used for a fixed problem size, as described by Amdahl's law, is often referred to as strong scaling [17].

There is also a possibility of increasing the size of the problem along with the number of processor cores. This is called weak scaling. The possible speedup gains for this type of scenario are described by Gustafson's law [12],

$$\text{Speedup} = (1 - f) + f \cdot s,$$

where f and s have the same meaning as previously. We can see that there is no theoretical upper limit to the speedup that can be achieved with weak scaling, since the speedup increases linearly with s. In contrast, the non-parallelizable part 1 − f gives a hard upper limit of 1/(1 − f) on the speedup with strong scaling, even as s approaches infinity.
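As a concrete illustration (the numbers below are chosen for this example and are not taken from the benchmark results), consider a program where f = 0.95 of the work can be parallelized and s = 64 cores are available:

$$\text{Speedup}_{\text{Amdahl}} = \frac{1}{(1 - 0.95) + \frac{0.95}{64}} \approx 15.4, \qquad \text{Speedup}_{\text{Gustafson}} = (1 - 0.95) + 0.95 \cdot 64 \approx 60.9.$$

Even with 64 cores, the 5% serial fraction keeps the strong-scaling speedup well below 64, while the weak-scaling speedup keeps growing with the core count.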

2.2.1 Different types of parallelism

There are many ways to divide parallelism into subgroups. In the context of multicore processors and especially in this project, two of the most important types are:

• Data parallelism: The same set of operations is performed on many pieces of data, for example across items in a data vector, and the iterations are independent from one another. This means that the iterations can be split up beforehand and then be performed in parallel. This is referred to as data parallelism.

• Task parallelism: Different parts of a program are split up into tasks, which are sets of operations that are typically different from one another. A set of tasks can operate on the same or different data. If multiple tasks are completely independent of one another they can be run in parallel, which gives task parallelism.

2.2.2 Parallel programming models

When creating a parallel program, the programmer typically uses a parallel programming model. This is an abstraction of available hardware that gives parallel capabilities to an existing programming language, or in some cases introduces an entirely new programming language. Programming models can for example support features such as thread creation, message passing and synchronization primitives.

2.3 Memory systems

Ideally a processor core would be able to access data items to operate on without any delay, regardless of which piece of data it requests. In reality, memory with short access times is very expensive to manufacture and also does not scale perfectly. This is why modern computer systems have a hierarchy of memory devices attached to them. Figure 1 shows an example of a memory hierarchy, and the technologies typically associated with each level.

Figure 1 Memory hierarchy of a typical computer system [7].

One of the main ideas behind memory hierarchies is to take advantage of the principle of locality, which states that programs tend to reuse data and instructions they have used recently [13, p. 45]. This means that access patterns both in data and in code can often be predicted to a certain extent. Temporal locality refers to a specific item being reused multiple times in a short time period, while spatial locality means that items with adjacent memory addresses are accessed close together in time.
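Both kinds of locality show up already in an ordinary summation loop; the sketch below is written for this text as an illustration and is not part of the thesis code.

    /* Sketch: locality in a simple row-major matrix summation. */
    long sum_matrix(const int a[1000][1000]) {
        long sum = 0;                 /* 'sum' is reused every iteration: temporal locality */
        for (int i = 0; i < 1000; i++)
            for (int j = 0; j < 1000; j++)
                sum += a[i][j];       /* consecutive addresses are read in order: spatial locality */
        return sum;
    }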


Figure 2 Memory hierarchy and address space for a cache configuration (left) and a scratchpad configuration (right) [2, Figure 1].

2.3.1 Cache and scratchpad memory

Cache memory typically sits close to the processor core. Caches are high-speed on-chip memory modules that share the same address space as the underlying main memory.

Data and instructions are automatically brought into the cache, and there is typically a cache coherency mechanism to ensure that all data items get updated accordingly in all levels of the cache system. All this is typically implemented in hardware, and is therefore invisible to the programmer.

Some processor designs have scratchpad memory, which has the same high-speed characteristics that a cache has [2]. The transferring of data to and from a scratchpad memory is typically controlled in software, which makes it different from a cache where this is handled entirely in hardware. Scratchpad memory requires more effort from software developers, but is more predictable since it does not suffer from cache misses. Scratchpad memory also consumes less energy per access than cache memory since it has its own address space, meaning that no address tag lookup mechanism is needed.

Figure 2 shows a schematic view of differences between cache and scratchpad memory.

Important to note is that there are many possible configurations of cache and scratchpad memory within a chip. They may for example be shared across multiple cores or private to a single core, and there may be multiple levels of cache or scratchpad memory where each level has different properties. It is also possible to utilize a scratchpad memory in conjunction with software constructs that make it behave similarly to a cache.


2.4 Memory models

The memory model is an abstract description of the memory ordering properties that a particular system has. Important to note here is that these properties are visible to software threads in a multicore system. Different memory models also give compilers different levels of freedom to do optimizations in the code.

Core 1                    | Core 2                       | Comments
S1: Store data = NEW;     |                              | Initially, data = 0 and flag ≠ SET
S2: Store flag = SET;     | L1: Load r1 = flag;          | L1 and B1 may repeat many times
                          | B1: if (r1 ≠ SET) goto L1;   |
                          | L2: Load r2 = data;          |

Table 1 Flag synchronization program to motivate why memory models are needed [20, Table 3.1].

To understand what a memory model is and why it is needed, look at Table 1. Here, core 2 spins in a loop while waiting for the flag variable to be SET by core 1. The question is, what value will core 2 observe when it loads the data variable in the end?

Without knowing anything about the memory model of the system, this is impossible to answer. The memory model describes how instructions may be reordered at a local core.

Cycle | Core 1                | Core 2              | Coherence state of data | Coherence state of flag
1     | S2: Store flag = SET  |                     | Read-only for C2        | Read-write for C1
2     |                       | L1: Load r1 = flag  | Read-only for C2        | Read-only for C2
3     |                       | L2: Load r2 = data  | Read-only for C2        | Read-only for C2
4     | S1: Store data = NEW  |                     | Read-write for C1       | Read-only for C2

Table 2 One possible execution of the program in Table 1 [20, Table 3.2].

One possible outcome of running the program is shown in Table 2. We can see that a store-store reordering has occurred at core 1, meaning that it has executed instruction S2 before instruction S1 (violating the program order). In this case core 2 would observe that the data variable has the "old" value of 0.

With knowledge about the memory model, the programmer can for example know where to insert memory barriers to ensure correctness in the program.
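As an illustration of the last point, the flag synchronization from Table 1 can be expressed with explicit ordering constraints. The sketch below uses C11 atomics and is written for this text, not taken from the thesis; the release store and acquire load forbid the store-store reordering shown in Table 2, so the consumer is guaranteed to read the new value of data once it observes flag = SET.

    #include <stdatomic.h>

    int data = 0;                               /* initially data = 0    */
    atomic_int flag = ATOMIC_VAR_INIT(0);       /* initially flag != SET */

    void producer(void)                         /* core 1 */
    {
        data = 42;                                              /* S1: Store data = NEW */
        atomic_store_explicit(&flag, 1, memory_order_release);  /* S2: Store flag = SET */
    }

    int consumer(void)                          /* core 2 */
    {
        while (atomic_load_explicit(&flag, memory_order_acquire) != 1)
            ;                                                   /* L1/B1: spin on flag  */
        return data;                                            /* L2: sees data = 42   */
    }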

2.5 SIMD

Processor designs with Single Instruction Multiple Data (SIMD) features offer a way of performing the same calculation on multiple data items in parallel within the same processor pipeline [13, p. 10]. These features can be used to obtain data parallelism of a degree beyond the core count of a system, and are often implemented using wide vector registers and operations on these registers. Since this approach needs to fetch and execute fewer instructions than the number of data items, it is also potentially more power efficient than the conventional Multiple Instruction Multiple Data (MIMD) approach.

2.6 Prefetching

Prefetching is a useful way of hiding memory access latency during program execution.

It involves predicting what data and instructions will be used in the near future, and bringing them into a nearby cache. An ideal prefetching mechanism would accurately predict addresses, make the prefetches at the right time, and place the data in the right place (which might include choosing the right data to replace) [9]. Inaccurate prefetching may pollute the cache system, possibly evicting useful items. Many common prefetching schemes try to detect sequential access patterns (possibly with a constant stride), which can be fairly accurate. Another method is to, for each memory access, bring in a couple of adjacent items from memory instead of just the item referred to in the program. Prefetching can be implemented both in hardware and in software.

2.7 Performance analysis tools

Measuring how a computer system behaves while running a certain application can be done through an automated performance analysis tool. These tools can be divided into two broad categories: static analysis tools rely on source code insertions for collecting data, while dynamic analysis tools make binary-level alterations and procedure calls at run-time [27]. There are also hybrid tools that utilize both techniques.


Most tools use some kind of statistical sampling, where the program flow of the tested application is paused at regular intervals to run a data collection routine. This can for example provide information about the time spent in each function, and how many times each function has been called. Many tools also utilize a feature present in most modern processor chips, namely hardware performance counters. These are special-purpose registers that can be programmed to react whenever a specific event occurs, for example an L1 cache miss or a branch misprediction. This can provide very accurate metrics without inducing any significant overhead.

2.8 The actor model

The actor model is an abstract model for concurrent computation centered around primitives called actors [1]. They are independent entities which can do computations according to a pre-defined behavior. The actor model is built around the concept of asynchronous message passing, in that every actor has the ability to send and receive messages to and from other actors. These messages can be sent and received at any time without coordinating with the actor at the other end of the communication, hence the asynchrony. When receiving a message, an actor has the ability to:

1. Send a finite number of messages to one or many other actors.

2. Create a finite number of new actors.

3. Define what behavior to use when receiving the next message.

Actors have unique identification tags, which are used as "addresses" for all message passing. They can have internal state information which can be changed by local behavior, but there is no ability to directly change the state of other actors.

2.9 Concurrency Building Blocks

CBB is Ericsson's proprietary programming model for baseband functionality development, designed as a high-level domain-specific language (DSL). A CBB application is translated into C code for specific hardware platforms through a process called the CBB transform. This makes for great flexibility, since developers do not need to target one platform specifically when writing baseband software.

Application behavior is defined inside CBB behavior classes (CBCs), seen in the middle of Figure 3. The CBCs are based on the actor model, described in Section 2.8.


Figure 3 Main artefacts of the CBB programming model.

When a CBC handles an incoming message it can initiate an activity, shown on the right in Figure 3. An activity is an arbitrarily complex directed acyclic graph (DAG) of calls to C functions, and can also contain synchronization primitives. Different message types can be mapped to different activities.

The simplest form of a CBC is an actor with a single serializing first-in first-out (FIFO) message queue. It is possible to define CBCs with different queue configurations, but those will not be focused on here.

At the top level, an application is structured inside a CBB structure class (CSC). This CSC can in itself contain instances of CBCs and other CSCs, forming a hierarchy of application components.

2.10 The baseband domain

In a cellular network, the term baseband is used to describe the functionality in between the radio unit and the core network. This is where functionalities from Layer 1 (the physical layer) and Layer 2 (the data link layer) of the OSI model [30] are found. The baseband domain also contains Radio Resource Management (RRM) and a couple of other features.

• Layer 1: Responsible for modulation and demodulation of data streams for downlink and uplink respectively. It also performs link measurements and other tasks. This layer is responsible for approximately 75% of the compute cycles within the baseband domain, since it does a lot of computationally intensive signal processing.

• Layer 2: Handles per-user packet queues and multiplexing of control data and user data, among other tasks. This layer involves a lot of memory-intensive work, and produces around 5% of the compute cycles.

• RRM: The main task of RRM is to schedule data streams on available radio channel resources, which means solving bin-packing problems with a large number of candidates. This produces 15% of the compute cycles within the baseband domain.

3 Purpose, aims, and motivation

The broader purpose of this work is to investigate how hardware can affect software, and more specifically how certain hardware architectures affect the implementation and performance of a parallel programming model. CBB will be at the center of this investigation, and the result will be a proof-of-concept showing how it can be implemented on COTS hardware such as an x86-64 or ARMv8 chip. Differences between the selected architecture and EMCA will be analyzed, including how these differences manifest themselves in performance.

Adapting a programming model to run on new architectures is a general problem that exists in many parts of industry and research. If done successfully it can create entirely new use cases and products. There is also potential to learn how to utilize hardware features that have not previously been considered.

3.1 Delimitations

This work focuses on important aspects of adapting a programming model to new hardware, and not on the actual implementation. Therefore this project does not include a new, fully functional CBB implementation. It instead results in a prototype with sufficient functionality for running performance tests. Section 13 of the report, which describes future work, discusses some necessary steps towards a more complete implementation.

Not all of the aspects described in the comparative part of the literature study will be evaluated with performance tests, since this would require more time and resources than are available. Instead, the evaluation focuses on a few of the most relevant metrics. These are described in Section 8.2.

The project is also not focused on comparing different programming models with each other. This is however a topic that is investigated in another ongoing project within the BBI department at Ericsson.

4 Methodology

The project is divided into three main parts, which will be described in the following sections.

4.1 Literature study

The first part of this work will consist of a literature study. The goal is to identify key features and characteristics to analyze, both of the programming model and of the available hardware. Special emphasis will be put on key differences between different hardware architectures. The literature study can be found in Section 5.

4.2 Development

A proof-of-concept implementation of some parts of CBB on a new hardware architecture will be built. This is described in Section 9. Some of the tools used in this process are provided by the BBI department at Ericsson. This includes EMCA IDE, which is the Eclipse-based Integrated Development Environment (IDE) that is used internally at Ericsson to create CBB applications.

The literature study will be used as a basis when selecting which hardware platform, and which additional technologies, will be used during the implementation phase. This selection process is outlined in Sections 6 and 7.

4.3 Testing

A series of benchmark tests will be created using the previously developed CBB prototype. The same set of tests will also be created using CBB for EMCA. This process is described in detail in Section 10.


A performance analysis tool will be used for gathering performance metrics from the targeted hardware platform. The most important requirement is access to hardware performance counters (see Section 2.7 for more details) for collecting cache performance metrics. The perf tool fits this requirement [18]. It is available in all Linux systems by default, and is therefore the performance tool of choice for this project. Execution time will be measured in code, with built-in timing functions which are described in Section 8.2.

5 Literature study

The literature study is split into three parts. Section 5.1 contains a comparison of the characteristics of three different hardware architectures. Section 2.9 describes the programming model used in this project. Section 5.2 summarizes academic work which may be valuable in later parts of the project.

5.1 Comparison of architectures

This section will compare EMCA to x86-64 and also to ARMv8, and highlight some of the key differences. Intel 64 (Intel's x86-64 implementation) and ARMv8-A (the general-purpose profile of ARMv8) will be used as reference for most of the comparisons.

5.1.1 Memory system

The memory system of EMCA is one of the key characteristics that sets it apart from the typical commercial architecture. Each processor core has a private scratchpad memory for data, and also a private scratchpad memory for program instructions. These memory modules will be referred to as the DSP data scratchpad and the DSP instruction scratchpad throughout the rest of this report. There is some hardware support for loading program instructions into the DSP instruction scratchpad automatically, making it behave similarly to an instruction cache, but for the DSP data scratchpad all data handling has to be done in software.

There is also an on-chip memory module, the shared memory, that all cores can use. It has significantly larger capacity than the scratchpad memories, and it is designed to behave in a predictable way (for example by offering bandwidth guarantees for every access).


One of the main reasons for doing so much of the baseband software development in-house is the memory system of EMCA. Most software is designed to run on a cache-coherent memory model, which is not present in EMCA.

x86-64 designs like Intel's Sunny Cove cores (used within the Ice Lake processor family) have a three-tier cache hierarchy [8]. Each core has a split Level 1 (L1) cache, one for data (L1D) and one for instructions (L1I), and a unified Level 2 (L2) cache which has ∼5-10x the capacity of the combined L1. The architecture features a unified Level 3 (L3) cache which is shared among all cores. It is designed so that each core can use a certain amount of its capacity. Information about the cache coherency protocol used is not publicly available, but earlier Intel designs have been reported to use the MESIF (Modified, Exclusive, Shared, Invalid, Forward) protocol [13, p. 362], which is a snoop-based coherence protocol.

Contemporary ARMv8 designs feature a cache hierarchy of two or more levels [3]. Each core has its own L1D and L1I cache combined with a larger L2 cache, just like x86-64. The cache sizes vary between implementations. It is possible to extend the cache system with an external last-level cache (LLC) that can be shared among a cluster of processor cores, but this depends on the particular implementation. Details about the cache coherency protocol used by ARM are not publicly available.

5.1.2 Processor cores

EMCA is characterized as a manycore design, and it has a higher number of cores than many x86-64 or ARMv8 chips. However, most x86-64 chips and many ARMv8 chips support simultaneous multithreading (SMT), so that each processor core can issue multiple instructions from different software threads simultaneously [28]. This is achieved by duplicating some of the elements of the processor pipeline, and this technique gives the programmer access to a higher number of virtual cores and threads than the actual core count. EMCA does not support SMT.

The processor cores inside EMCA are characterized as Very Long Instruction Word (VLIW) processors. This means that they have wide pipelines with many functional units that can do calculations in parallel, and it is the compiler's job to find instruction-level parallelism (ILP) and put together bundles of instructions (i.e. instruction words) that can be issued simultaneously. The instruction bundles can vary in length depending on what instructions they contain. Each core has an in-order pipeline, as opposed to the out-of-order pipelines found in both x86-64 and ARMv8.

Since EMCA is developed for a certain set of calculations, namely digital signal processing (DSP) algorithms, its processor cores are optimized for this purpose. There is however nothing fundamentally different about how they execute each instruction compared to other architectures.

5.1.3 SIMD operations

The x86-64 architecture features a range of vector instruction sets, of which the latest generation is named Advanced Vector Extensions (AVX) and exists in a couple of different versions. AVX512, introduced by Intel in 2013 [22], features 512-bit wide vector registers which can be used for vector operations. All instructions perform operations on vectors with fixed lengths. AMD processors currently support only AVX2 (with 256 bits as its maximum vector length), while Intel has support for AVX512 in most of its current processors.
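For illustration, a fixed-width SIMD loop on x86-64 might look like the sketch below. It is written for this text, assumes AVX2 support, and assumes the element count is a multiple of 16.

    #include <immintrin.h>
    #include <stdint.h>
    #include <stddef.h>

    /* Add two arrays of 16-bit integers, 16 elements per AVX2 instruction. */
    void add_i16_avx2(const int16_t *a, const int16_t *b, int16_t *out, size_t n)
    {
        for (size_t i = 0; i < n; i += 16) {
            __m256i va = _mm256_loadu_si256((const __m256i *)(a + i));
            __m256i vb = _mm256_loadu_si256((const __m256i *)(b + i));
            _mm256_storeu_si256((__m256i *)(out + i), _mm256_add_epi16(va, vb));
        }
    }

A single 256-bit instruction here does the work of 16 scalar additions, which is the kind of data parallelism described in Section 2.5.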

ARMv8 features the Scalable Vector Extension (SVE) [24]. As the name implies, the size of the vector registers used in this architecture is not fixed. It is instead an implementation choice, where the size can vary from 128 bits to 2048 bits (in 128-bit increments). Writing vectorized code for this architecture is done in the Vector-Length Agnostic (VLA) programming model, which consists of assembly instructions that automatically adapt to whatever vector registers are available at run-time. This means that there is no need to recompile code for different ARM chips to take advantage of vectorization, and also no need to write assembly intrinsics by hand. SVE was first announced in 2017, and details about the latest version (SVE2) were released in 2019 [19].

As of today, only higher-end ARM designs feature SVE. Most designs do however support the older NEON extension, utilizing fixed-size vector registers of up to 128 bits.

The Instruction Set Architecture (ISA) used with EMCA has support for SIMD instructions targeted at certain operations commonly used in DSP applications. One example is multiply-accumulate (MAC), which is accelerated in hardware. Similar instructions are available in AVX and SVE as well.

5.1.4 Memory models

The x86-64 architecture uses Total Store Order (TSO) as its memory model [20, p. 39]. There has been a bit of debate about this statement, but most academic sources claim that this is true and the details are not relevant enough to cover here. With TSO, each core has a FIFO store buffer that ensures that all store instructions from that core are issued in program order. The load instructions are however issued directly to the memory system, meaning that loads can bypass stores. This configuration is shown in Figure 4.

ARM systems use a weakly consistent memory model [6] (also called relaxed consistency).


Figure 4 Conceptual view of a multicore system implementing TSO [7, Figure 4.4 (b)].

Store instructions are issued to a FIFO store buffer before entering the memory system.

This model makes no guarantees at all regarding the observable order of loads and stores. It can do all sorts of reordering: store-store, load-load, load-store and store-load. Writing parallel software for an ARM processor can therefore be more challenging than doing the same for an x86 processor, since weak consistency requires more effort to ensure program correctness (for example by inserting memory barriers/fences where order must be preserved). The upside is that more optimizations can be done both in software and in hardware, giving the weakly consistent system potential to run an instruction stream faster than a TSO system could. Two store instructions can for example be issued in reverse program order, which is not possible under TSO.

The memory model of EMCA does not guarantee a global ordering of instructions, although there are synchronization primitives for enforcing a global order when needed.

Further details on its memory model are not publicly available.

5.2 Related academic work

This section summarizes earlier academic work in related areas which are useful within this project.

5.2.1 The Art Of Processor Benchmarking: A BDTI White Paper

Berkeley Design Technology, Inc. (BDTI) is one of the leading providers of benchmarking suites for DSP applications. This white paper [5] aims to explain the key factors that determine the relevance and quality of a benchmark test. It discusses how to distinguish good benchmarks from bad ones, and when to trust their results.

Figure 5 Categorization of DSP benchmarks from simple (bottom) to complex (top) [5, Figure 1]. The grey area shows examples of benchmarks that BDTI provides.

They argue that a trade-off has to be made between practicality and complexity. Figure 5 shows four different categories of signal processing benchmarks. Simple ones, based for example on additions and multiply-accumulate (MAC) operations, may be easy to design and run, but may not provide very meaningful results for the particular testing purpose. On the other side of the spectrum there are full applications that may provide useful results, but may be unnecessarily complex to implement across many hardware architectures. Somewhere, often in between the two extremes, there is a sweet spot that provides meaningful results without being too specific. A useful benchmark must however perform the same kind of work that will be used in the real-life scenario that the processor is tested for.

Another factor is optimization. The performance-critical sections of embedded signal processing applications are often hand-optimized, sometimes down to assembly level. Different processors support different types of optimizations (for example different SIMD operations), and allowing all these optimizations in a benchmark makes it more complex and hardware-specific but can also expose more of the available performance.

Many benchmarks focus on achieving maximum speed, but other metrics (such as memory use, energy efficiency and cost efficiency) might also be important factors when determining if a particular processor is suitable for the task. It can also be important to reason about the comparability of the results across multiple hardware architectures, instead of only looking at one processor in isolation.


5.2.2 A DSP Acceleration Framework For Software-Defined Radios On x86-64

This article [11] is concerned with the use of COTS devices for implementing baseband functions within Software-Defined Radios (SDR). The goal is to accelerate common DSP operations with the use of SIMD instructions available in modern x86-64 processors.

The OpenAirInterface (OAI), which is an open-source framework for deploying cellular network SDRs on x86 and ARM hardware, is used as a baseline. Some of its existing functions use 128-bit vector instructions. The authors extend OAI with an acceleration and profiling framework using Intel's AVX512 instruction set. They implement a number of common algorithms, targeting massive multiple-input multiple-output (MIMO) use cases.

A speedup of up to 10x is observed for the DSP functions implemented, compared to the previous implementation within OAI. Most previous studies within the field have focused on application-specific processors and architectures. This study highlights some of the potential for using SIMD features in modern x86-64 processors for baseband applications.

5.2.3 Friendly Fire: Understanding the Effects of Multiprocessor Pre- fetches

Prefetching is an important feature of modern computer systems, and its effects are widely understood in single core systems. This article [16] investigates side-effects that different prefetching schemes can cause in multicore systems with cache coherency, and when these can become harmful.

Four prefetching schemes are investigated: sequential prefetching, Content Directed Data Prefetching (CDDP), wrong path prefetching and exclusive prefetching. Measurements are done in a simulator implementing an out-of-order sequentially consistent system using the MOESI cache coherence protocol.

The result is a taxonomy of 29 different prefetch interactions and their effects in a multicore system. The harmful prefetch scenarios are categorized into three groups:

• Local conflicting prefetches: A prefetch in the local core forces an eviction of a useful cache line, which is referenced in the code before the prefetched cache line is.


• Remote harmful prefetches: A prefetch that causes a downgrade in a remote core followed by an upgrade in the same remote core, before the prefetched cache line is referenced locally. This upgrade will evict the cache line in the local core, making it useless.

• Harmful speculation: Prefetching a cache line speculatively, causing unnecessary coherence transactions in other cores. Can for example cause a remote harmful prefetch.

Performance measurements within the different prefetching schemes show that these prefetching effects can be harmful to performance. Some optimizations that can mitigate this effect are also briefly discussed.

5.2.4 Analysis of Scratchpad and Data-Cache Performance Using Statis- tical Methods

Choosing the right memory technology is important to get good performance and energy efficiency out of embedded systems. This study [15] compares how cache memory and scratchpad memory perform in different types of data-heavy application workloads. It is commonly believed that scratchpad memory is better for regular and predictable access patterns, while cache memory is preferable when access patterns are irregular.

The authors use a statistical model involving access probabilities, i.e. the probability that a certain data object is the next to be referenced in the code. They use this to calculate the optimal behavior of a scratchpad memory, and compare it to cache hit ratios. This is done both analytically and empirically. Matrix multiplication is used as an example of a workload with a regular access pattern. Applications involving trees, heaps, graphs and linked lists are seen as having irregular access patterns.

This work proves that scratchpad memory can always outperform cache memory, if an optimal mapping based on access probabilities is used. Increasing the cache associativity is shown not to improve the cache performance significantly.


6 Selection of software framework

6.1 MPI

Message Passing Interface (MPI) is a library standard formed by the MPI Forum, which has a large number of participants (including hardware vendors, research organizations and software developers) [4]. The first version of the standard specification emerged in the mid-90s, and MPI has since then become the de-facto standard for implementing message passing programs for high-performance computing (HPC) applications. MPI 3.1, approved in June 2015, is the latest revision of the standard. MPI can create processes in a computer system, but not threads.

MPI is designed to primarily use the message-passing parallel programming model, where messages are passed by moving data from the address space of one process to the address space of another process. This is done through certain operations that both the sender and the receiver must participate in. One of the main goals of MPI is to offer portability, so that the software developer can use the same code in different systems with varying memory hierarchies and communication interconnects.

Since MPI is a specification rather than a library, there are many different implementations. Some examples are Open MPI, MPICH and Intel MPI. There are differences across implementations that might affect performance, but these differences are not focused on within this project.

6.1.1 Why MPI?

Choosing MPI as a basis for implementing CBB on a new hardware architecture is motivated by how well its concepts map onto the concepts found in the actor model and CBB. This includes:

• Independent execution. An MPI process is an independent unit of computation with its own internal state, just like an actor. It can execute code completely inde- pendent of other processes. A process has its own address space in the computer system.

• Asynchronous message passing. An MPI process can send any number of messages to other processes, using ranks (equivalent to process IDs) as addresses. The operations for sending and receiving messages can be run asynchronously, just like with actors. An MPI process has one FIFO message queue by default.
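A minimal sketch of these two points (written for this text, not part of the CBB prototype) is shown below: two MPI processes behave like two actors, and the rank is the address used for message passing.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, value = 42;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* the rank acts as the process's address */

        if (rank == 0) {
            /* "Actor" 0 sends a message addressed to rank 1. */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* "Actor" 1 receives the message sent by rank 0. */
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }

The program would be launched with something like mpiexec -n 2, giving one process per "actor".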


6.1.2 MPICH

The MPI implementation used within this project is MPICH. It was initially developed along with the original MPI standard in 1992 [25]. It is a portable open-source implementation, and one of the most widely used today. It supports the latest MPI standard and has good user documentation. The goals of the MPICH project, as stated on the project website, are:

• To provide an MPI implementation that efficiently supports different computation and communication platforms including commodity clusters, high-speed networks and proprietary high-end computing systems.

• To enable cutting-edge research in MPI through an easy-to-extend modular framework for other derived implementations.

MPICH has been used as a basis for many other MPI implementations including Intel MPI, Microsoft MPI and MVAPICH2 [29]. Since MPICH is designed to be portable it can run on x86-64, ARMv8 and a variety of other computer systems.

6.1.3 Open MPI

The initial choice of MPI implementation for this project was Open MPI. It is one of the most commonly used implementations and it has excellent user documentation. Open MPI is an open-source implementation developed by a consortium of partners within academic research and the HPC industry [10]. Some of the goals of the Open MPI project are:

• To create a free, open source, peer-reviewed, production-quality complete MPI implementation.

• To directly involve the HPC community with external development and feedback (vendors, 3rd party researchers, users, etc.).

• To provide a stable platform for 3rd party research and commercial development.

Unfortunately there were problems with getting Open MPI to run properly on the x86-64 system used for testing (which is described in Section 8.3). Instead of spending time debugging these problems, Open MPI was replaced with MPICH, which worked without problems.


7 Selection of target platform

As seen in the architecture comparisons in Section 5.1, modern x86-64 and ARMv8 hardware have many similarities. They both incorporate high-performance out-of-order cores with multiple levels of cache, since they are both targeted at general-purpose computing where these features can be beneficial. The differences between EMCA and x86-64 are of the same nature as the differences between EMCA and ARMv8. The decision between x86-64 and ARMv8 is therefore not as significant as the overall question:

What happens when we run CBB applications on modern general-purpose hardware?

The choice of a target platform instead comes down to availability. Ericsson's development environment is run on x86-64 servers, and accessing additional x86-64 hardware for testing within Ericsson has been less difficult than accessing ARMv8 hardware. Doing all development and testing on x86-64 hardware was therefore the natural choice.

As mentioned in Section 6.1.2, the CBB implementation created will work with both hardware platforms since MPICH programs can be compiled and run on both. Porting the CBB implementation from x86-64 to ARMv8 would simply mean moving the source code and recompiling it.

8 Evaluation methods

As described by BDTI [5] (see Section 5.2.1), there has to be a trade-off between practicality and complexity when designing benchmark tests. There is often a sweet spot of tests that provides meaningful results without being too specific. This is the goal of the tests that will be used in this project, which are all described in the following sections.

8.1 Strong scaling and weak scaling

One of the fundamental goals of CBB and EMCA, as described in Section 2.9, is to enable massive parallelism. With this in mind, it is reasonable to test how the degree of parallelism in an application affects performance within the targeted hardware platform.

A simple two-actor application will be used, following this basic structure:

1. Actor A and actor B are initialized.


2. Actor A sends a message to actor B, containing a small piece of data.

3. Actor B receives the message and performs some work (described in Sections 8.1.1, 8.1.2 and 8.1.3).

4. Actor B sends a message back to actor A, acknowledging that all calculations are completed.

5. Both actors terminate and the test is finished.

To test different degrees of parallelism, multiple instances of the two-actor application will be run. These instances are completely independent of each other, which means that the parallelism present in the software (namely the data parallelism) will scale perfectly. See Section 2.2 for more details on parallel computing.

Each benchmark test will be run with four actors (two of actor A, two of actor B), and then the number of actors will be increased before running the test again. This process will be repeated until reaching 1024 actors, which is significantly more than the number of available processor cores in the computer systems used for testing (described in Section 8.3).

Two variations of each benchmark test will be evaluated:

1. Weak scaling: The size of the problem will increase along with the number of program instances.

2. Strong scaling: The size of the problem will be fixed, and split up among all instances of the two-actor application.

8.1.1 Compute-intensive benchmark

As described in Section 2.10, Layer 1 functionality is computationally intensive and is responsible for most of the compute cycles within the baseband domain. This will be simulated by letting actor B perform a large amount of computation after receiving data from actor A. The computation at actor B will consist of addition operations of the following form:

result = result + 10000;

Here, result will be an unsigned 16-bit integer. It will overflow many times during the test, so that the result will always be between 0 and 2^16. The addition operation will be repeated 1 million times by each actor B in the weak scaling scenario, and 1 million/N times by each actor B in the strong scaling scenario with N program instances.
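A sketch of the work loop run by each actor B could look like the following; the function name and the iteration-count argument are illustrative and not taken from the actual benchmark code.

    #include <stdint.h>

    /* Repeated 16-bit addition; unsigned overflow wraps around, so the
     * result always stays in the range [0, 2^16). */
    uint16_t compute_work(long iterations)
    {
        uint16_t result = 0;
        for (long i = 0; i < iterations; i++) {
            result = result + 10000;
        }
        return result;   /* returned so the computation is not removed as dead code */
    }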


8.1.2 Memory-intensive benchmark without reuse

The behavior of Layer 2 applications, as described in Section 2.10, involves memory-intensive work. This will be simulated by letting actor B allocate and loop through a data vector after receiving its message from actor A, touching each element once. Two vector sizes will be tested: 1 MB and 128 kB. In the weak scaling scenario, each actor B will have its own data vector of this size. With strong scaling, each actor B will have a data vector of size (initial vector size)/N with N program instances. The data vectors will be dynamically allocated (on the heap).

8.1.3 Memory-intensive benchmark with reuse

This is a variation of the test described in Section 8.1.2. The only difference is that, in this test, actor B will loop through its own data 1000 times. This means that there will be data reuse, so that the application can make use of locality and caches.
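A sketch of actor B's memory work in both variants could look like this (illustrative code written for this text; reuse_count = 1 corresponds to Section 8.1.2 and reuse_count = 1000 to this section).

    #include <stdlib.h>
    #include <stdint.h>

    uint8_t touch_vector(size_t bytes, int reuse_count)
    {
        uint8_t *data = calloc(bytes, 1);   /* dynamically allocated data vector */
        uint8_t sum = 0;

        for (int r = 0; r < reuse_count; r++) {
            for (size_t i = 0; i < bytes; i++) {
                sum += data[i];             /* touch each element once per sweep */
            }
        }

        free(data);
        return sum;                         /* keep the accesses observable */
    }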

8.1.4 Benchmark tests in summary

To cover all variations of the different benchmarks, 10 individual test cases will be created and run. These are:

1. Compute-intensive test with weak scaling.

2. Compute-intensive test with strong scaling.

3. Memory-intensive test with no data reuse and weak scaling, with a 1 MB vector.

4. Memory-intensive test with no data reuse and weak scaling, with a 128 kB vector.

5. Memory-intensive test with no data reuse and strong scaling, with a 1 MB vector.

6. Memory-intensive test with no data reuse and strong scaling, with a 128 kB vector.

7. Memory-intensive test with data reuse and weak scaling, with a 1 MB vector.

8. Memory-intensive test with data reuse and weak scaling, with a 128 kB vector.

9. Memory-intensive test with data reuse and strong scaling, with a 1 MB vector.

10. Memory-intensive test with data reuse and strong scaling, with a 128 kB vector.


8.2 Collection of performance metrics

At each individual step of the benchmark tests described above, these metrics will be recorded:

• Execution time: This indicates the overall performance, in terms of pure speed. Shorter execution times are better. When running on x86-64, this metric will be measured using the built-in MPI_Wtime function in MPI. A corresponding function available in EMCA systems will be used there. The timing methodology follows this scheme:

1. All actors in the system get initialized, and then synchronize using a barrier or similar.

2. One actor collects the current time.

3. The actors do their work.

4. All actors synchronize again.

5. One actor collects the current time again and subtracts the previously col- lected time.

With this method, all overhead associated with initializing and terminating the execution environment is excluded. The measured time instead only shows how long the actual work within the benchmark tests takes. The timing scheme is sketched in code after this list.

• Cache misses in L1D: This shows us how the cache system and the hardware prefetcher perform. Lower numbers of cache misses are preferable. See Section 2.3 for more details. This metric is available in x86-64 systems but not in EMCA, which does not have caches. The perf tool will be used for this. The command used for running a test and measuring cache behavior is:

$ perf stat -e L1-dcache-loads,L1-dcache-load-misses

This shows the number of loads and misses in the L1D cache, and also the miss ratio in %.
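On x86-64, the timing scheme for the execution-time metric can be sketched as below (a minimal example written for this text, assuming a barrier on MPI_COMM_WORLD is used as the synchronization point).

    #include <mpi.h>
    #include <stdio.h>

    void run_timed(void (*benchmark_work)(void))
    {
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Barrier(MPI_COMM_WORLD);            /* 1. all processes synchronize */
        double start = MPI_Wtime();             /* 2. collect the current time  */

        benchmark_work();                       /* 3. the actual benchmark work */

        MPI_Barrier(MPI_COMM_WORLD);            /* 4. synchronize again         */
        double elapsed = MPI_Wtime() - start;   /* 5. subtract the start time   */

        if (rank == 0)
            printf("elapsed: %f s\n", elapsed);
    }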

All steps of each benchmark test will be run three times, and then the average numbers produced from these three runs will be used as a result. This is to even out the effects of unpredictable factors that might produce noisy results.


8.3 Systems used for testing

The x86-64 system that will be used for running benchmarks is a high-end server machine with AMD processors built on their Zen 2 microarchitecture [23]. Some of its specifications are summarized below.

• 2 x AMD EPYC 7662 processors.

• 128 physical cores in total (64 per chip).

• 256 virtual cores in total (128 per chip) using SMT.

• 32 kB of private L1D cache per physical core.

• 512 kB of private L2 cache per physical core.

• 4 MB of L3 cache per physical core (16 MB shared across four cores in each core complex).

• The system is running Linux Ubuntu 20.04.1 LTS.


Figure 6 Processor topology of the x86-64 system used for testing.

Figure 6 shows the topology of the processor cores and the cache hierarchy in the x86-64 system. This graphical representation was obtained with the hwloc command-line tool in Linux.
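For reference, a topology figure like this can typically be produced with hwloc's lstopo utility; giving an output file name (e.g. topology.png) makes it save the rendering instead of displaying it, with the format inferred from the file extension:

$ lstopo topology.png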


The benchmarks will also be run on a recent iteration of Ericsson's manycore hardware.

As discussed in Section 5.1, it has a number of specialized DSP cores with private scratchpad memories for instructions and data, and also an on-chip shared memory that all cores can use. Detailed specifications of this processor are confidential and cannot be disclosed here.

9 Implementation of CBB actors using MPI

[Figure 7 is a diagram of the CBB application: the outer box sc (the top-level CSC) contains the instances dataActor :dataActor and printerActor :printerActor; dataActor's out port is connected to printerActor's in port, and its dataP ports lead to the edge of the CSC.]

Figure 7 The CBB application used for implementation.

A simple CBB application was created using EMCA IDE, consisting of just two CBCs.

A graphical representation of the application is seen in Figure 7. Here, the outer box labeled sc represents the top-level CSC. There is one CBC instance named dataActor, of the CBC type with the same name, and one instance printerActor, likewise of the type with the same name. The names of the CBCs originate from some of Ericsson's user tutorial material, and are not representative of their behavior.

The out port of dataActor is connected to the in port of printerActor, meaning that they are aware of each other's existence and can send messages to each other. The coloring of the ports in Figure 7 and the "∼" label on one of them symbolize the direction of the communication; certain message types can be sent from dataActor to printerActor, and other message types can be sent in the other direction. The ports named dataP, which connect dataActor to the edge of the CSC, will not be used.

The application was run through the CBB transform targeting EMCA, which generated a number of C source and header files. These files were then used as a basis for creating new C functions targeting the x86-64 platform, with the help of MPI calls. Details about all the MPI routines mentioned in the following sections can be found on the official MPICH documentation webpage [26].


9.1 Sending messages

CBB generates a "send" function for each port of each CBC. The contents of these functions were rewritten to do the following:

1. MPI_Isend is used to post a non-blocking send request. This call returns immediately, without ensuring that the message has been delivered to its destination. The function takes arguments describing the message contents and which process to deliver it to.

2. MPI_Wait will then block execution until MPI_Isend has moved the message out of its send buffer, so that the buffer can be reused.

The reason for using these two MPI calls instead of MPI_Send, which is a single blocking call that performs the same task, is to enable overlap between communication and computation. This could be accomplished by doing some calculations in between the two MPI calls.
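A minimal sketch of what such a rewritten send function might look like is shown below. The function name, argument list and use of MPI_BYTE are illustrative assumptions and do not reflect the actual generated CBB code.

#include <mpi.h>

/* Illustrative sketch of a port "send" function rewritten with MPI. */
void port_send(const void *msg, int size_in_bytes, int dest_rank, int tag)
{
    MPI_Request request;

    /* Post a non-blocking send; this call returns immediately. */
    MPI_Isend(msg, size_in_bytes, MPI_BYTE, dest_rank, tag,
              MPI_COMM_WORLD, &request);

    /* Independent computation could be placed here to overlap
       communication and computation. */

    /* Block until the message has left the send buffer, so that the
       buffer can safely be reused. */
    MPI_Wait(&request, MPI_STATUS_IGNORE);
}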

9.2 Receiving messages

There is a generated “receive” function for each port of each CBC. These were rewritten to have the following behavior:

1. MPI_Probe is a blocking call that checks for incoming messages. When it detects a message, it writes some information to a status variable and returns. The status information includes the tag of the message and the ID of the source process.

2. MPI_Get_count is then used to determine how many bytes of data the message contains.

3. MPI_Recv is called last. This function is a blocking call, but it will not cause a stall in execution since the previous calls have ensured that there actually is an incoming message.

Using these three MPI calls instead of only MPI_Recv allows messages to be received without knowing all their specifics in advance; MPI_Recv requires arguments describing the tag, source and size of the incoming message, which we find out using MPI_Probe and MPI_Get_count. This enables the definition of message types with varying data contents.
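A corresponding sketch of a rewritten receive function is given below. Again, the function name, the use of MPI_BYTE and the dynamic allocation are illustrative assumptions, not the actual generated code.

#include <mpi.h>
#include <stdlib.h>

/* Illustrative sketch of a port "receive" function rewritten with MPI. */
void *port_receive(int *size_out, int *source_out, int *tag_out)
{
    MPI_Status status;
    int count;

    /* Block until a message from any source arrives; its tag and
       source rank are written into the status variable. */
    MPI_Probe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);

    /* Determine how many bytes the pending message contains. */
    MPI_Get_count(&status, MPI_BYTE, &count);

    /* Receive the message; this does not stall, since the probe has
       already confirmed that a message is waiting. */
    void *buffer = malloc(count);
    MPI_Recv(buffer, count, MPI_BYTE, status.MPI_SOURCE, status.MPI_TAG,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    *size_out   = count;
    *source_out = status.MPI_SOURCE;
    *tag_out    = status.MPI_TAG;
    return buffer;
}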


10 Creating and running benchmark tests

10.1 MPI for x86-64

With MPI, there is a need to initialize the execution environment and create the MPI processes corresponding to each CBC. This is done in an additional code file, test_cases.c, which is used to run the actual tests. Each MPI process runs its own copy of the code, which follows this basic structure (a minimal code sketch is given after the list):

1. Initiate the MPI execution environment with MPI_Init and determine the process ID by calling MPI_Comm_rank.

2. Use the process ID to determine which actor type and instance it corresponds to, according to a lookup table or similar. The actor now knows whether it is a dataActor or a printerActor in this case.

3. Run the code corresponding to the current test case. This part differs depending on which kind of test is being run, see Section 8.1. The time measurement, as discussed in Section 8.2, is also a part of this step.

4. Terminate the MPI process and exit the execution environment with MPI_Finalize.
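The following is a minimal sketch of this per-process structure; the rank-to-actor mapping and the test bodies are hypothetical placeholders, not the actual contents of test_cases.c.

#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;

    /* 1. Initiate the execution environment and get the process ID. */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* 2. Map the process ID to an actor type, here simply by parity. */
    int is_data_actor = (rank % 2 == 0);

    /* 3. Run the code for the current test case, including timing. */
    if (is_data_actor) {
        /* run_data_actor_work(rank);     hypothetical test body */
    } else {
        /* run_printer_actor_work(rank);  hypothetical test body */
    }

    /* 4. Exit the execution environment. */
    MPI_Finalize();
    return 0;
}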

All necessary code files and headers are compiled using mpicc, which is a compiler command for MPICH that invokes the system's default C compiler (gcc in this case) along with the additional linkage needed by MPICH. make is used to produce a single binary for the complete application. To run a test case, a variation of the following command is used:

$ mpiexec -n X -bind-to hwthread bin/test

Here X is the total number of actors (MPI processes) that will be present in the system.

Since the actors operate in pairs (with one doing work and the other one just sending messages), X must be an even number. The -bind-to hwthread option makes sure that every MPI process gets associated with one hardware thread (virtual core), which reduces the process management overhead. This is beneficial for performance and also makes the behavior more predictable. bin/test points to the binary generated by mpicc.


10.2 CBB for EMCA

Creating the benchmark tests with CBB is a simpler process. The two-actor application seen in Figure 7 had already been generated using EMCA IDE and the CBB transform (as described in Section 9). The code for doing the actual work (as described in Sections 8.1.1, 8.1.2 and 11.3) is then added inside the code files generated by the CBB transform. Further details about the code structure inside CBB will not be described here.

Since the memory allocation in the DSP data scratchpad and the shared memory has to be done manually on EMCA, and they do not share address spaces, this structure was used for allocating memory in the memory-intensive benchmark tests on EMCA:

if (vector_size < threshold_size)
    // allocate space in DSP data scratchpad
else
    // allocate space in shared memory

Here, the threshold size is used as a cross-over point between the two memory units. It is smaller than the actual size of the DSP data scratchpad memory, which leaves space that could be used for system-internal data.

11 Results and discussion

This section contains the results collected when running the benchmark tests on EMCA and x86-64. To make the results comparable across architectures, all execution times have been normalized. This means that each individual execution time is divided by the first execution time in that series, so that every data series (every line in a graph) starts at 1.0. Hence any value above 1.0 means worse (slower) performance, and any value below 1.0 means better (faster) performance. This makes it possible to focus on the scaling properties in each benchmark test, instead of execution times in absolute terms.

Actual execution times are discussed briefly but not shown in figures.

In the tests that involve strong scaling it is also relevant to look at the speedup, which is the inverse of the normalized execution time. This means that we divide the initial execution time by the current execution time, which shows how many times faster the execution is compared to the first run (represented by 1.0 here as well). Thus, a lower number means worse performance and a higher number means better performance.
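Written out explicitly, with $t_1$ denoting the first execution time in a data series and $t_i$ the current one, the two metrics are

\[ \text{normalized time}_i = \frac{t_i}{t_1}, \qquad \text{speedup}_i = \frac{t_1}{t_i}. \]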


In Section 11.3, covering the memory-intensive benchmark with data reuse, cache miss ratios in x86-64 will also be presented. This is the only test that takes advantage of caches, which is why this metric is relevant here but not in the other tests.

Finally there will be a discussion about complexity and optimizations in the software, and how these factors affect the performance results. This discussion is found in Section 11.4.

11.1 Compute-intensive benchmark

[Figure 8 plots execution time (normalized per data series, y-axis 0–90) against the total number of actors/processes (x-axis 0–1024) for the two data series x86 and EMCA.]

Figure 8 Normalized execution times for the compute-intensive benchmark test with weak scaling.

Figure 8 shows how both systems perform in the compute-intensive test with weak scaling. We can see that the execution time on the EMCA system scales linearly with the number of actors created. This is true even when the number of actors gets significantly larger than the number of processor cores available.

The x86-64 system behaves very differently. It performs well compared to EMCA up until hitting 256 actors, which is also the number of virtual processor cores available
