
Final thesis

Evaluation of Energy-Optimizing

Scheduling Algorithms for Streaming

Computation on Massively Parallel

Architectures

by

Johan Janzén

LITH-IDA-EX-2014/043

2014-xx-xx

Supervisor: Nicolas Melot


Abstract

This report describes an environment to evaluate and compare static schedulers for real pipelined streaming applications on massively parallel architectures, such as the Intel Single-chip Cloud Computer (SCC), MPPA, and Tilera. The framework allows performance comparison of schedulers in terms of the execution time or the energy usage of their static schedules, using energy models and measurements on the real platform.

This thesis focuses on the implementation of a framework for evaluating the energy consumption of such streaming applications on the SCC. The framework can run streaming applications, built as task collections, with static schedules that include dynamic frequency scaling. Streams are handled by the framework with FIFO buffers connected between tasks.

We evaluate the framework by benchmarking a pipelined mergesort implementation with different static schedules. The runtime is compared with that of a previously published task-based optimized mergesort implementation. The results show how much overhead the framework adds to the streaming application. As a demonstration of its capabilities, we schedule and analyse a Fast Fourier Transform application and discuss the results.

Future work includes quantitative comparative studies of a range of different static schedulers. This has, to our knowledge, not been done before.


Contents

1 Introduction . . . 1
    1.1 Project Goal . . . 2
    1.2 Clarifications . . . 2

2 Background . . . 3
    2.1 Introduction . . . 3
    2.2 Energy Consumption Models . . . 3
    2.3 Streaming Computation . . . 3
    2.4 Process Network Models . . . 4
    2.5 Static Scheduling . . . 6
    2.6 Scheduled dynamic voltage and frequency switching . . . 8
    2.7 OCM, a collection of heuristics for the MPTS problem . . . 10
        2.7.1 Crown Scheduling . . . 11
        2.7.2 Sanders and Speck’s Scheduling Algorithm . . . 12
    2.8 StreamIt programming language . . . 12
    2.9 Intel SCC . . . 13
    2.10 RCCE . . . 16
    2.11 Other multicore architectures . . . 17
        2.11.1 Tilera TILEPro and TILE-Gx . . . 17
        2.11.2 Intel XEON . . . 17
        2.11.3 Adapteva Epiphany . . . 18
    2.12 Conclusion . . . 18

3 Related work . . . 19

4 Design . . . 22
    4.1 Schedeval: Framework for runtime schedule evaluation . . . 23
    4.2 Message passing model . . . 28
    4.3 DVFS with schedeval . . . 29
    4.4 Application development with schedeval . . . 29

5 Experimental evaluation . . . 36
    5.1 Pingpong: communication delay and overhead . . . 36
        5.1.1 Experimental setup . . . 36
        5.1.2 Results and discussions . . . 37
    5.2 Parallel Merge Sort . . . 40
        5.2.1 Experimental setup . . . 41
        5.2.2 Results and discussion . . . 44
    5.3 Porting and Scheduling FFT application from StreamIt . . . 49
        5.3.1 Experimental setup . . . 49
        5.3.2 Results and discussion . . . 53
    5.4 Conclusion . . . 57

6 Conclusion and Future Work . . . 58

A Pingpong source code . . . 62

B Mergesort source code . . . 66


Chapter 1

Introduction

With massively deployed embedded systems in various industrial domains such as telephony or distributed sensors, the need for more powerful and power-efficient processing devices is constantly increasing. Performance and energy consumption depend not only on hardware design, but also on how well software is able to exploit hardware capabilities.

Stream processing is a programming paradigm where an algorithm is divided into tasks. Each task produces output packets of data while consuming input packets. This paradigm is interesting today in a many-core context for several reasons.

• It is based on the message passing programming model rather than shared memory, which maps well onto many-core architectures without hardware-controlled memory coherency across cores.

• It facilitates the implementation of on-chip software pipelining, which reduces performance penalties due to the memory wall and decreases energy consumption.

• As long as there are many tasks, a large part of the cores can be utilized simultaneously, which not only increases throughput capabilities, but is also energy efficient.

• Embedded devices are heading towards many-core architectures and are already running stream processing applications, such as signal processing or multimedia.

Today, energy efficiency is largely pursued through the design of efficient and scalable parallel algorithms that provide more throughput without the need to increase clock frequencies. However, good scheduling techniques for multiprocessors are also crucial to achieve both throughput and energy savings.


Streaming pipelines are optimized under a throughput constraint to minimize energy consumption through parallelism and frequency scaling. This problem is recognized as NP-hard. Several energy-aware scheduling techniques are described in the research literature for streaming computation on embedded multiprocessor systems. However, due to differences in assumptions and models, it is difficult to analytically compare different schedulers.

1.1 Project Goal

The main goal of this thesis project is to construct a portable scheduler testing framework for a subset of common architectures in the parallel and embedded systems research community (e.g., Tilera, SCC, MPPA, Epiphany). We intend to use the framework to compare several schedulers regarding the energy consumption of the schedules they produce, by using them to run streaming applications on the actual hardware and measuring the energy they consume.

1.2 Clarifications

The terms malleability and moldability have different definitions in different sources. Unless otherwise specified, the following definition is used throughout this document. A moldable task is, in analogy to metal crafting, a task with a width, the number of cores it runs on, that can be changed before execution. During execution this number remains constant. A malleable task can have its width varied during execution. Usually, an efficiency function is associated with such tasks. Tasks can be defined as partially malleable or moldable, meaning there is a finite number of core counts they can run on. A task that is continuously malleable can run on a fractional number of cores. A moldable task is usually defined as unable to be preempted; consequently, no continuously moldable tasks can exist.
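To make the distinction concrete, here is a minimal C sketch of how the two kinds of tasks could be described; the type and field names are hypothetical and not part of any scheduler or framework discussed later.

    #include <stddef.h>

    /* Hypothetical descriptors illustrating the moldable/malleable distinction. */
    typedef struct {
        size_t width;                  /* number of cores: fixed before execution, constant afterwards */
    } moldable_task_t;

    typedef struct {
        size_t current_width;          /* may be changed by the runtime while the task executes */
        double (*efficiency)(size_t);  /* efficiency function, 0 <= e(p) <= 1 */
    } malleable_task_t;

    /* Time of a task with sequential workload w (seconds) on p cores with efficiency e(p). */
    static double parallel_time(double w, size_t p, double (*e)(size_t))
    {
        return w / ((double)p * e(p));
    }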


Chapter 2

Background

2.1 Introduction

2.2 Energy Consumption Models

2.3 Streaming Computation

Figure 2.1: Example of a task graph. The rings are tasks, the arrows represent communication channels / data dependencies.

Stream processing is a computer programming paradigm where the program flow is data driven. Typically data arrives, gets processed, and leaves the application in one or more streams. A well-known example where stream processing is suitable is video decoding, where data is continuously fed into a decoding application to produce frames. A streaming application can be divided into a series of tasks (sometimes referred to as “kernels”), which apply algorithms to the data packets in the stream(s) to stepwise produce the desired output(s). This type of application can be modelled as a process network. Streaming applications can be executed on single- or multicore CPUs. They can also run on dedicated hardware such as GPUs. When running streaming applications on GPUs, however, the number of tasks (kernels) is generally limited to one or a few.

In addition to dividing the application into tasks, each task may be parallelized. We call a task moldable if it is parallelizable and may run on a different number of cores, decided at scheduling time and remaining constant at run time. A malleable task can additionally have its width vary during execution. Usually, an efficiency function 0 ≤ e ≤ 1 is associated with such tasks to model how well they scale.

2.4 Process Network Models

A task based pipelined application can be modeled as a Kahn process network (KPN). A KPN is a dataflow-oriented computation model, presented by Kahn in 1974 [1]. In this model, a program is divided into a set of processes¹. Processes are connected to each other with unbounded FIFO (First In, First Out) buffers, forming a directed graph. The only way for processes to communicate with each other is by sending discrete packets of data called tokens through these channels. Additionally, there may be external sources providing data input streams. If there are no external sources, the program flow is driven by sources and sinks. In this model, write operations to the channels must be non-blocking while read operations are blocking. This makes the state model of the processes simple; either they are waiting for input data, or they are computing. Additionally, the memory of each process as well as the channels are modelled as infinite.

These rules aim at satisfying an important property of the program: determinism. This allows for formal verification of requirements, such as liveness. Having only deterministic programs and requiring unlimited memory is however quite limiting on the type of programs that can be modelled. Additionally, it is in the general case not possible to create a static schedule for a Kahn process network. Fig. 2.2 shows an example where this is the case. If this network were to be scheduled on a single-processor system, the processes would deadlock no matter what static schedule is used, since all reads are blocking. Using schedule 2.2b, task 1 produces one token and sends it to task 2. Task 2 consumes the token and sends it to task 4. Task 3 is then scheduled and waits forever for a token from task 1. Kahn suggests in the original paper that the model could be extended to suit specific types of systems [1]. Such extensions can include limitations for the network.

¹ Depending on the context, the terms process and task, or algorithm and function, are sometimes used instead.


Figure 2.2: (a) Example of a KPN which cannot be scheduled statically on one processor. (b) A schedule which will deadlock at task 3.

The Synchronous Dataflow (SDF) model [2] is an extension of the Kahn process network. Each time a process is activated (“fired”), it consumes and produces a specified fixed number of tokens on each connected communication channel. This number is specified for each process. In order for a process to fire, it must have at least the number of tokens in its input channels that it needs to consume for the production of the output tokens. It is possible to create a static schedule for any SDF program that has a consistent production/consumption rate between processes [2].
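As a concrete illustration (an invented example, not one from the thesis), consider a single channel from actor A to actor B where A produces 2 tokens per firing and B consumes 3 tokens per firing. The rates are consistent exactly when the balance equation of the channel has a positive integer solution for the number of firings q_A and q_B per schedule period:

    % balance equation for the channel A -> B (production rate 2, consumption rate 3)
    2\, q_A = 3\, q_B
    \quad\Longrightarrow\quad
    (q_A, q_B) = (3, 2) \ \text{(smallest positive solution)}

A valid static schedule then fires A three times and B twice per period, for example A A B A B, after which the channel holds as many tokens as before. If no positive integer solution exists, tokens would accumulate or starve, and no periodic static schedule with bounded buffers exists.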



Figure 2.3: Example of an SDF with a consistent production/consumption rate (all rates are 1). Schedule 2.2b is a valid schedule for this network.

2.5 Static Scheduling

In order to calculate an optimal periodic schedule for a pipelined streaming application we need to consider the constraints as well as the properties we optimize. A traditionally common property is data throughput. This thesis focuses on schedulers optimizing for energy consumption, which is of high importance for low-power embedded devices with demands on high computational capabilities. When optimizing for energy consumption, possible constraints include a required throughput, a maximum response time, and/or thermal constraints. A throughput constraint translates to a maximum makespan (schedule round time), while a response time constraint may put additional constraints on the mapping. The choice of constraints for the optimization and, by extension, the best choice of scheduler depends on the domain of the application and the target architecture.

Figure 2.4: A task graph and a schedule. Task 2 has a data dependency towards task 1, but can be scheduled first because of software pipelining.



In a pipelined streaming application each task, save the initial one, has a data dependency on its preceding tasks. If the application is modelled as an SDF, this translates to a task execution ordering constraint for the schedule. However, software pipelining allows tasks to move across schedule iterations, waiving precedence constraints. In Fig. 2.4, this is demonstrated with two tasks scheduled to run in the opposite order of what is defined by the task graph. Looking at the schedule from a software pipelining perspective, Task 1 produces packets for the next schedule iteration. This is only possible under the assumption that the message delay is short enough to guarantee that data reaches the input buffer of the receiving task by the following schedule iteration.


Figure 2.5: Two schedules for the same streaming application. Schedule (b) uses the moldability of the tasks and scales them to a lower frequency.

When taking software pipelining to its extreme, there are no periodic dependencies, that is, no precedence constraints in the schedule at all. The application can then be viewed as a collection of independent tasks, giving the scheduler a much higher level of flexibility. However, scheduling in this manner has a bad worst case with regard to response time, as each data packet only moves one step ahead in the pipeline every iteration. In order to improve the response time, the static scheduler must optimize the task mapping so that producers finish before consumers begin within the same schedule iteration. However, if the application is not constrained by response time requirements, pipeline delay is not relevant, making a set of independent tasks a good model for statically scheduling streaming applications.

Depending on the model of the task and the target architecture, the scheduler may assign several cores to execute a task in order to reduce the makespan, and may also assign a frequency and/or voltage level at which the cores will operate while that task is scheduled.


2.6 Scheduled dynamic voltage and frequency switching

An energy minimizing scheduler does not only allocate and schedule tasks, but also tries to run the tasks at frequencies as low as the constraints allow. Running a core at a lower frequency may additionally allow for a lower voltage level, which is even more beneficial for the power consumption. Both frequency and voltage switching carry a time penalty however. On the Intel SCC, frequency switching is done in a few clock cycles, but changing the voltage takes 2–10 ms.

To explore this scheduling problem and how it transfers to a schedule evaluation problem, let us look at a simplified example. In this example, there are two voltage settings, high and low, and two speed (frequency) settings, fast and slow. At high voltage the speed may be either fast or slow. At low voltage the speed must be slow.

In the example shown in Fig. 2.6, Task 2 is scheduled to run slowly. Since it is then advantageous to also run at a lower voltage level, the voltage is dropped as well. It is safe to run slowly at high voltage, so the best solution is to let Task 2 start immediately (bottom).

Figure 2.6: top: The scheduler waits for the voltage to settle before changing frequency to low and starting Task 2. bottom: The scheduler does not wait for the voltage to settle.

In the example shown in Fig. 2.7, Task 1 is scheduled to run slow while Task 2 is scheduled to run fast. The voltage must be confirmed high before Task 2 can start running at high speed, or the core may behave unpredictably and even be damaged. An obvious solution is to stall Task 2 until the voltage is at the correct level. A more complex solution is to increase the voltage already during Task 1. Whether this is advantageous depends on the runtimes of Task 1 and Task 2 and is a problem well suited for a static scheduler.


Figure 2.7: top: The scheduler does not wait for the voltage to settle; Task 2 cannot run fast as requested. middle: The scheduler waits for the voltage to settle. bottom: The voltage is raised ahead of time, so there is no need to wait.
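Whether it pays off to raise the voltage ahead of time, as in the bottom schedule of Fig. 2.7, is a small arithmetic exercise. The sketch below compares the two options using a crude energy model and invented constants; it only illustrates the trade-off and is not the power model used later in this thesis.

    #include <stdio.h>

    /* Hypothetical comparison of "stall until the voltage settles" vs "raise the voltage early".
     * All constants are illustrative, not measured SCC values. */
    int main(void)
    {
        const double t1 = 5e-3, t2 = 3e-3;      /* runtimes of Task 1 (slow) and Task 2 (fast), s  */
        const double t_switch = 4e-3;           /* voltage transition time, within the 2-10 ms range */
        const double p_low = 0.4, p_high = 1.0; /* core power at (low f, low V) and (high f, high V), W */
        const double p_slow_highv = 0.6;        /* power when running slow at the high voltage, W  */
        const double p_idle = 0.1;              /* power while stalling, W                         */

        /* Option 1: run Task 1 at low voltage, then stall while the voltage rises. */
        double e_stall = t1 * p_low + t_switch * p_idle + t2 * p_high;
        double time_stall = t1 + t_switch + t2;

        /* Option 2: raise the voltage already when Task 1 starts (assumes t1 >= t_switch). */
        double e_early = t1 * p_slow_highv + t2 * p_high;
        double time_early = t1 + t2;

        printf("stall: %.2f mJ in %.1f ms, raise early: %.2f mJ in %.1f ms\n",
               e_stall * 1e3, time_stall * 1e3, e_early * 1e3, time_early * 1e3);
        return 0;
    }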

Fig. 2.8 demonstrates what happens in the case of a schedule with two very short tasks that run slow and fast respectively. Due to the time it takes to change the voltage, the results can be disastrous. One might argue that this simply is an example of a bad schedule. However, some static schedulers may work with a simplified model of the architecture, taking only frequency scaling into account and not voltage scaling and its time penalty. Having the runtime backend change the voltage anyway would yield results unfair to the static scheduler and possibly break time constraints due to the extra wait times. A good live schedule evaluator should be able to evaluate schedulers using varying properties in their models.


2.7 OCM, a collection of heuristics for the MPTS problem

Fan et al. have developed Optimizations Combined for MPTS (OCM), a collection of heuristics to solve the Malleable Parallel Task Scheduling problem (MPTS) [3], partially by combining and modifying existing algorithms. The MPTS problem is concerned with producing a schedule with as small a makespan as possible, given a set of independent, moldable tasks and a homogeneous architecture. In this context, moldable refers to tasks that can be executed on more than one processor. Depending on the individual task, execution on more processors can increase speed, but decrease efficiency. MPTS does not consider energy efficiency explicitly, only makespan and, by extension, utilization of existing resources. However, a schedule with a small makespan may present good opportunities for a subsequent frequency scaling algorithm to produce an energy efficient schedule.

Figure 2.9: Simplified flowchart for the OCM algorithm

While analysing some existing algorithms, they find two problems important to address when minimizing makespan. They assign high importance to the critical task, the task at the end of the schedule. The runtime of this particular task directly affects the makespan, and since it is the last task in the schedule, increasing its width will not affect the rest of the schedule. They isolate this optimization in a separate step called OT. The second identified problem is malleability. Generally speaking, highly moldable tasks should be assigned a large width and poorly moldable tasks a small one. They argue that different algorithms are needed for the two types of tasks. The authors develop an algorithm that first produces a schedule with all widths fixed at 1 and then incrementally increases the widths, and note that it performs quite well on poorly malleable tasks but not so well when the moldability is high. This algorithm is called OM. Finally, the last part, OB, is an algorithm considering highly moldable tasks. In OB, the entire task set is divided into two subsets, highly and poorly moldable tasks. The poorly moldable set is processed with OM to produce a partial schedule, and the highly malleable set is processed by a separate algorithm they call Multi LayerS. The two partial schedules are then combined into a complete schedule, which is then fed into OT for the final step.

2.7.1 Crown Scheduling

Crown scheduling [4] minimizes the energy consumption of moldable streaming tasks under a throughput constraint. Considering a machine consisting of p identical and individually scalable processors, the algorithm decomposes the scheduling problem into resource allocation, mapping, and discrete voltage/frequency scaling phases. Kessler et al. simplify these steps by partitioning the set of processors into two disjoint groups² of equal size, and recursively decomposing them into subgroups until obtaining groups of 1 processor, making 2p − 1 groups in total. Each task is mapped to exactly one of these groups, and is thereby assigned a width. This mapping constrains each task to a power-of-two number of consecutive processors. Communication is not taken into account in crown scheduling, but the algorithm implicitly waives data dependencies between tasks using the software pipelining method, by delaying non-data-ready tasks to a later schedule iteration.


Figure 2.10: An example of a crown schedule with tasks running at different widths and frequencies. Courtesy of Nicolas Melot.
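For p a power of two, the 2p − 1 crown groups can be enumerated with binary-heap style indexing: group 1 is the root containing all p processors, and groups 2i and 2i + 1 are the two halves of group i. The sketch below is my own illustration of this structure, not code from the crown scheduler.

    #include <stdio.h>

    /* For p a power of two there are 2p-1 crown groups, indexed 1..2p-1 like a binary heap. */
    static void group_range(int p, int group, int *first, int *size)
    {
        int level_size  = p;
        int level_first = 1;                    /* first group index on the current level */
        while (group >= 2 * level_first) {      /* descend one level of the hierarchy     */
            level_first *= 2;
            level_size  /= 2;
        }
        *size  = level_size;
        *first = (group - level_first) * level_size;
    }

    int main(void)
    {
        const int p = 8;                        /* 2p-1 = 15 groups */
        for (int g = 1; g <= 2 * p - 1; g++) {
            int first, size;
            group_range(p, g, &first, &size);
            printf("group %2d: processors %d..%d (width %d)\n", g, first, first + size - 1, size);
        }
        return 0;
    }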

The authors compare the mapping step to the bin packing problem [4], for which efficient heuristics exist [5], although the nested structure of crown groups brings additional constraints. In the paper, both a phase-separated and an integrated approach are presented. Both variants can solve small to medium size problems to optimality with Integer Linear Programming within a few seconds. Locally optimal phases in the separated variant allow problems to be solved faster, but can result in a worse solution, as successive steps fail to compromise for a better overall solution.

² The paper notes that the model allows for more than two groups, but the authors currently see no advantage in doing this.

The authors conclude that the crown constraints might limit the quality of optimally-computed crown schedules. They consider quantitative comparison with other, non-crown schedulers.

2.7.2 Sanders and Speck’s Scheduling Algorithm

Sanders and Speck [6] investigate the problem of mapping continuously malleable tasks with a common deadline and the target function of minimizing energy consumption. They assume convex efficiency functions and continuous, unbounded frequency levels, and propose a near-linear algorithm that solves the allocation, mapping, and frequency scaling in polynomial time, as well as an approximation algorithm for bounded frequency domains.

By examining common speedup functions, they argue that the constraints are reasonable since most real world applications are not limited by them. The results are interesting since their contribution solves the same problem as crown scheduling, at a much higher speed. Simplifying the problem by applying constraints on the input instead of on the solution space is an approach in a quite different direction. It is unknown how the resulting schedules compare against crown schedules since, to our knowledge, no such comparison has yet been made.

2.8 StreamIt programming language

StreamIt [7] is a programming language and a compiler infrastructure designed for streaming applications, originally developed at MIT, Cambridge, USA. Instead of explicitly defining communication channels, a StreamIt application consists only of a combination of four major component types, therefore sometimes referred to as streams. A Filter is a functional unit performing work on a streaming input flow, producing a streaming output. The filter declares how many input packets it consumes and how many output packets it produces each time it is scheduled, and is the only component type that is not composed of other components. Through this declaration a StreamIt application can be modelled as a Synchronous Dataflow network (see Section 2.4). A Pipeline composes other streams in a sequential manner. A SplitJoin defines a fork (and subsequent join) in a stream. The split can either duplicate the incoming packets, or distribute them in a round-robin fashion. The subsequent join is always of round-robin type. Finally, it is also possible to create circular streams with a FeedbackLoop. Each stream can be instantiated several times, optionally using loop constructs, to build a complex streaming network. With the single input, single output component model the language does not allow the construction of general task graphs or trees. The authors defend this design decision with the rationale that allowing arbitrary graphs makes it harder for the programmer to describe their programs and more difficult for the compiler and scheduler to produce good results [8]. An analogy can be made to the classic programming model [7], where programmers have moved from goto statements to more restrictive do-while and subroutine constructs, allowing for more powerful ways of expressing algorithms and the realistic possibility of formal verification.

Figure 2.11: The four component types in StreamIt: (a) Filter, (b) Pipeline, (c) SplitJoin, (d) FeedbackLoop.

The compiler infrastructure can optimize for several types of target architectures, including single core, multicore, and cluster systems. As filters are by design sequential, there are no moldable tasks in StreamIt. Instead, optimization relies on having more tasks than available processors, and welding successive tasks together down to a desirable amount of parallelism. There is also ongoing research on static, dynamic, and mixed scheduling of StreamIt programs [9].

2.9 Intel SCC


The Single-chip Cloud Computer (SCC) is an experimental “concept vehicle” many-core architecture from Intel's Terascale research program [10]. With its tile based architecture it is designed to resemble future many-core processors.

There are 24 tiles on the chip, connected by a mesh network. Through the mesh, the tiles have shared access to four memory controllers. The controllers provide access to the off-chip main memory, which consists of up to 64 GiB of DDR3 memory, up to 16 GiB per controller. Each tile consists of two P54C processors, each with a 16 KiB L1 cache and a 256 KiB L2 cache, a 16 KiB globally shared Message Passing Buffer (8 KiB for each core), two cache controllers, two address lookup tables (LUT), and one mesh interface unit (MIU). The 24 tiles with two cores each make a total of 48 cores on the chip.

The tiles are organised into six voltage islands on which the voltage levels may be adjusted separately. Additionally, the frequency can be set individually for each tile.

Figure 2.13: The 24 tiles grouped together in their six voltage islands.

The P54C is a 32 bit IA-32 processor. It has no SIMD instructions and performs in-order execution. The processors can run at 15 different frequency levels from 100 to 800 MHz with Dynamic Voltage and Frequency Scaling (DVFS). Being a 32 bit architecture, it can map at most 4 GiB of memory. The modified version used on the SCC is extended with a programmable address look-up table (LUT). This table should not be confused with the page table handled by the operating system. The LUT maps the 32-bit address space to the global 36-bit address space with a 16 MiB (2^24 bytes) granularity, i.e. 256 memory chunks. Each chunk is mapped to either one of the four memory controllers or the MPBs. In this way each core can be configured to have some private main memory, some shared main memory, and shared MPB memory, with a total of at most 4 GiB. The default is to assign each core private memory through the topographically closest memory controller, although any core can access memory from any memory controller if the LUT is configured appropriately. The table also maps itself as well as the LUTs belonging to the other cores, making it possible to change the mapping configuration at runtime. This can however be dangerous: since the operating system is not made aware of such changes, changes of the physical location of virtual memory are hard to predict.
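The 16 MiB chunking can be made concrete with a few lines of C. This is a simplified illustration of the address arithmetic only; real LUT entries also encode routing and destination information, and the base value below is made up.

    #include <stdint.h>
    #include <stdio.h>

    /* Simplified illustration of the SCC LUT: the 32-bit core address space is divided into
     * 256 chunks of 16 MiB (2^24 bytes); each chunk is redirected into the 36-bit system
     * address space. Real LUT entries also carry routing information, omitted here. */
    #define CHUNK_BITS 24u
    #define NUM_CHUNKS 256u

    static uint64_t lut_base[NUM_CHUNKS];       /* 36-bit base address per 16 MiB chunk */

    static uint64_t translate(uint32_t core_addr)
    {
        uint32_t chunk  = core_addr >> CHUNK_BITS;              /* which of the 256 LUT entries   */
        uint32_t offset = core_addr & ((1u << CHUNK_BITS) - 1); /* offset within the 16 MiB chunk */
        return lut_base[chunk] | offset;
    }

    int main(void)
    {
        lut_base[0] = 0x2A0000000ull;                           /* hypothetical mapping of chunk 0 */
        printf("0x%08x -> 0x%09llx\n", 0x00123456u,
               (unsigned long long)translate(0x00123456u));
        return 0;
    }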

Figure 2.14: A tile on Intel SCC. “Traffic gen” is for mesh testing purposes.

The on-chip MPB memory is distributed across the tiles and is shared amongst all cores. Accesses to the MPB are cached at the L1 level, but bypass L2. On the cores, the operating system marks data from MPB memory with an MPBT (Message Passing Buffer Type) bit. This flagging works on page level granularity (16 MiB).

When MPBT memory is cached in L1, cache lines must be invalidated before subsequent reads and writes, or the core would just read from the cache regardless of whether any core has written to the MPBT location or not. This is true even when the same core is writing to the memory location, see Listing 2.1. The SCC-modified P54C provides an instruction called CL1INVMB for this purpose. Executing this instruction invalidates all MPBT-flagged memory in the L1 cache. Subsequent reads on invalidated cache lines make the core access the main memory (MPB or off-chip).

    int *a = ...;               // Malloc shared MPB memory
    int private = a[0];         // Puts a[0]...a[7] in L1 cache
    RCCE_invalidate_cache();    // Important! If omitted, the if statement will succeed.
    a[0] = 42;
    a[8] = 1;                   // Force previous line to be committed to MPB.
    if (a[0] != 42)
        printf("You forgot to invalidate the cache\n");

Listing 2.1: Example of C code working on MPBT memory.

When writing this type of memory, the L1 cache displays write-through behaviour. However, the modified P54C includes a buffer which collects all MPBT writes until a full cache line (32 bytes) is written, at which point it flushes the buffer. It will also flush if a write falls outside the cache line currently being collected.

When the P54C accesses non-MPBT memory, the caching works as it would in a normal single-core x86 system, with the L2 cache enabled. The SCC offers no hardware cross-core cache coherency, so accesses to the off-chip shared memory must be done with care. It is possible to disable L2 caching by marking pages with the MPBT flag, even if they are not actual MPB memory. This is a way to disable L2 caching for the off-chip memory, and it enables a shared-memory programming model for the user, at the cost of lower performance.

Totoni et al. [11] compare the power and performance of the SCC to the Intel Core i7, Intel Atom, and Nvidia ION2 GPGPU. In their tests, the SCC did not scale very well in comparison. The authors believe that this is due to limitations in the communication network; the SCC seems to lack the strength to do global many-to-many communication well. Comparing energy efficiency, the SCC did worse than all the other architectures in every test save one, which used very little communication and performed integer calculations. The authors repeatedly stress that the SCC is a concept vehicle and that next generation many-core processors with more powerful cores and a better tuned communication network will likely be much more competitive.

2.10 RCCE

RCCE (pronounced “rocky”) is an MPI-like library providing means to pass messages between cores on the SCC. The message passing interface is implemented using the MPB, and signalling with flags, also in the MPB. The library provides four fundamental functions for this purpose: RCCE_put, RCCE_get, RCCE_set, and RCCE_wait_until. It also provides convenience functions built upon these, like send and receive, which are more flexible regarding the size of data transfers. There are two separate versions of the interface, called gory and non-gory. In the non-gory interface, put and get are not available, as the convenience functions aim at replacing their functionality.

In order to send a message with the gory interface, the programmer needs to put the data into the MPB memory. RCCE handles the cache write behaviour by requiring the size of the data to be sent to be a multiple of 32 bytes and 32-byte aligned. The sender can then notify the receiver with a flag that data is available; the receiver fetches the message from the MPB with get, stores it in private memory, and flags that the data has been received.
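With the non-gory interface, this put/get/flag handshake is hidden behind send and receive. A minimal exchange between two cores could look roughly as follows; the function names and signatures are recalled from the RCCE documentation and should be checked against the actual RCCE headers before use.

    #include <stdio.h>
    #include "RCCE.h"

    /* Minimal two-core exchange with the non-gory RCCE interface: core 0 sends an
     * integer to core 1. RCCE_send/RCCE_recv block until both sides participate. */
    int RCCE_APP(int argc, char **argv)
    {
        RCCE_init(&argc, &argv);
        int me = RCCE_ue();                               /* id of this core (unit of execution) */
        int value = 42;

        if (me == 0)
            RCCE_send((char *)&value, sizeof(value), 1);  /* staged through the MPB internally */
        else if (me == 1) {
            RCCE_recv((char *)&value, sizeof(value), 0);
            printf("core %d received %d\n", me, value);
        }

        RCCE_finalize();
        return 0;
    }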

The RCCE library also includes a power management API. The API provides means to change the frequency and voltage for the cores. It has some limitations:

• Frequency cannot be changed at tile granularity, even though the SCC architecture supports this. The frequency will be the same throughout one voltage island, which is subsequently called a Power Domain by RCCE.

• Only a designated master core can change voltage/frequency on each island, even though SCC allows any core to change voltage/frequency on any island.

2.11 Other multicore architectures

2.11.1 Tilera TILEPro and TILE-Gx

TILEPro, released in 2008, is a tile processor architecture with 32 or 64 cores connected through a mesh network. The cores are simple 64 bit architectures with in-order execution and a short pipeline, running at a maximum of 866 MHz. High performance is achieved through its VLIW instruction set. Each core has two levels of cache, coherent across the processor. There is no physical third cache level; however, data resident in the L2 caches of all cores is de facto treated as a distributed L3 cache, maintained with hardware mechanisms and a separate mesh network dedicated for this purpose. Tilera calls this technique “Distributed Dynamic Cache”. The processor can, through its four memory controllers, be equipped with up to 64 GiB of DDR2 memory. A technique called TileDirect enables transfer of network I/O data directly into the caches, without touching main memory, enabling lower I/O latency. The network is accessed through two 10 gigabit Ethernet interfaces.

Released in 2010, TILE-Gx is the next generation many-core processor architecture family from Tilera, with up to 72 cores (tiles) on a processor. Compared to TILEPro, the core is more advanced, with a SIMD instruction set and a maximum frequency of up to 1.5 GHz. There are five independent mesh networks on the chip. In addition to the benefits of newer technology, its 40 nm manufacturing process makes the processor more energy efficient. Tilera claims that TILE-Gx has the highest performance per watt of all architectures with a complete system-on-chip feature set.

2.11.2 Intel XEON

XEON is a brand of high performance multicore processors designed for server and embedded use. The XEON brand has incorporated a range of processor families since the Pentium II XEON was released in 1998. The families are known for having relatively few, strong cores. The cores are optimized for high performance with primarily sequential, but increasingly also parallel, applications. The later additions to the brand incorporate 512 bit SIMD instructions.

2.11.3 Adapteva Epiphany

Epiphany is a scalable low power MIMD architecture. As of this writing, versions with up to 64 cores are offered as separate devices, but the architecture supports up to 4096 processors on one chip. Adapteva offers the 4096-core version as IP modules.

The cores are simple superscalar RISC architectures operating at 800 MHz. Each core has up to 1 MiB of local on-die memory, but no cache. All cores can read from any of the physical memory locations with regular load/store instructions, enabling a shared memory model with varying but reasonably predictable latency. Epiphany is a low power device, with the 64 core version consuming less than two watts. As the architecture is designed with scalable, extreme energy efficiency in mind, the raw performance of the cores as single-core von Neumann devices is limited. The strength is instead the possibility of performing highly parallel tasks with low latency for shared data and with cache coherence overhead eliminated.

2.12 Conclusion

We have explored the problem of making static schedules for streaming applications. Published papers on this problem use a range of problem definitions and assumptions to propose “optimal” or “first” solutions. Furthermore, to our knowledge, no common assessment tool exists to compare the quality of the produced schedules across different schedulers, as authors tend to parametrize their energy consumption models only with properties taken into account by their own schedulers. A framework that can accommodate different schedulers and assess the quality of the produced schedules, by running the applications with a given schedule on an actual machine and measuring the energy consumption, would enable such a comparative study.


Chapter 3

Related work

Hönig and Schiffmann [12] have contributed a benchmark suite for the scheduling research community. The test suite contains 7200 task graphs. There are several clones of each task graph, varying the number of processors in the target system, adding up to 36000 problems. They choose to use only power-of-two numbers of processors. The benchmark also includes the optimal solutions with respect to computation time. In the paper they note the lack of proper comparisons of schedulers from different research groups and the difficulty in estimating their relative quality. They demonstrate their benchmark by comparing the results from four different schedulers. Results show that even though the schedulers have different strategies, they all exhibit similar strengths and weaknesses. As a motivation for using a large test bench with known solutions, they claim that analysis of the algorithms as a method of comparing relative quality is infeasible, partially due to the problem complexity, and partially due to the fact that many schedulers do not produce optimal schedules in general.

SPLASH-2 [13] is a collection of parallel programs compiled as a benchmark suite. When Fan et al. [3] evaluate their MPTS approximation algorithm OCM, they use all the applications in the SPLASH-2 suite as independent moldable tasks. This corresponds well to the MPTS problem. Being from 1995, the suite contains no streaming applications with communication between tasks. Although the tasks themselves are “real world”, the pressure on the memory controller and the global communication pattern can more accurately be described as random.

PARSEC [14], a newer parallel benchmark suite than SPLASH-2, contains a few applications that are explicitly constructed with the pipelined programming model. These tasks are moldable as well. In their technical report from 2008 [15], the authors state that SPLASH-2 no longer represents the state of the art of parallel workloads and that it does not have pipelined applications, motivating the need for a new benchmark suite. For all included pipelined applications in PARSEC, the task graphs are fairly simple, however.

Application   Number of stages   Comments
dedup         5
ferret        6
x264          configurable       Can have as many pipeline stages as there are frames in input
bodytrack     8 tasks            Not explicitly pipelined; uses a task collection and a thread pool

Table 3.1: Pipelined applications in the PARSEC benchmark suite

Kessler et al. [16] implemented a parallel sorting algorithm for the Intel SCC, optimizing for speed, and did so by investigating what types of algorithms are most suitable on many-core architectures such as the SCC. Mergesort is chosen, which is a typical streaming application. They thus investigated the problem of scheduling and mapping streaming applications on the SCC. The resulting application is a hybrid of the sequential, pipelined, and shared-memory-multicore parallel computing paradigms. Some effort is spent on mapping each “task” to an optimal configuration using integer linear programming. Two mappings are presented: one where each level in the merge tree is mapped to one core, and one where the total communication distance is minimized. The paper reports the total running time for the distance optimized mapping to be lower than for the level based version [16]. However, the dominant contribution to the improvement could be that the first phase uses more cores with the distance optimized mapping. It is unclear whether the pipeline phase performs better or worse for the respective mappings.

Cichowski et al. [17] present a theoretical model for the power consumption of the SCC. They simplify the model to only include the frequency as a parameter, arguing that the voltage is dependent on the frequency and is thus implicitly included. In the model they include the SCC-specific voltage and frequency island constraints. They then derive the constants of the model from measurements. A simplified version of the presented model for power consumption as a function of frequency for this architecture is described by formula 3.1.

    p_n(f_0) + p_m + \sum_{i=1}^{6} 8 \cdot p_c(f_i)    (3.1)

In the paper the formula is expanded into a static and a dynamic part. Note that the sum runs over the six voltage islands of eight cores each (p_n is the network power, p_m the memory controller power, and p_c the per-core power consumption).


The model assumes that the frequency of the network is not scalable and simplifies the memory controller to only consume static power.
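To make the shape of formula 3.1 concrete, the sketch below evaluates it for one set of island frequencies. The constants and the cubic form of p_c are placeholders chosen for illustration, not the values fitted in [17].

    #include <stdio.h>

    /* Evaluation of formula 3.1: P = p_n(f0) + p_m + sum over the 6 voltage islands of 8 * p_c(f_i).
     * All constants below are illustrative placeholders. */
    static double p_core(double f_ghz)
    {
        double p_static  = 0.3;                         /* W, frequency-independent part          */
        double p_dynamic = 1.2 * f_ghz * f_ghz * f_ghz; /* W, grows roughly with f^3 under DVFS   */
        return p_static + p_dynamic;
    }

    int main(void)
    {
        double p_network = 6.0;                  /* p_n(f0): mesh at its fixed frequency, W */
        double p_memctrl = 8.0;                  /* p_m: memory controllers, static, W      */
        double f_island[6] = {0.8, 0.8, 0.533, 0.533, 0.1, 0.1};  /* GHz per voltage island */

        double total = p_network + p_memctrl;
        for (int i = 0; i < 6; i++)
            total += 8.0 * p_core(f_island[i]);  /* eight cores per island */

        printf("estimated chip power: %.1f W\n", total);
        return 0;
    }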


Chapter 4

Design

In this chapter we introduce the static scheduling framework, as well as schedeval, which is the main body of work in this thesis.

Figure 4.1

The purpose of the framework is to evaluate the performance of static schedulers for streaming applications on multi- and many-core architectures. With this system we can measure the execution time and energy footprint of pipelined streaming applications under different schedules.

The framework is designed to work with arbitrary streaming applications; there are no restrictions on the task graph structure other than memory limitations. Autonomous batch processing of all task graph, scheduler, and target architecture combinations, together with the generation of quality reports using various schedule quality assessment tools, makes for a powerful tool chain for the development and comparison of static schedulers.


4.1 Schedeval: Framework for runtime schedule evaluation

Figure 4.2: Flow overview for schedule evaluation with schedeval

Schedeval is a framework for evaluating static schedules from an energy perspective. It provides an architecture independent programming interface in C for the purpose of developing pipelined, task based applications. These applications are defined by source code, a task graph, and a static schedule. The framework consists of two main parts. The entry point for compilation and evaluation, simply called “schedeval”, is responsible for compiling an executable for the target system, issuing commands to run the executable, and collecting the measurements after the run. The second part is the generated executable, target_exec, which when compiled is self-contained with all the necessary code and information to run the evaluation.

Schedeval is invoked with the following command:

    $> schedeval-<target architecture> [options] <path/to/taskgraph> <path/to/schedule>

The available options are described in Table 4.1.

When schedeval is invoked, the following sequence of steps is performed. Some steps may be skipped depending on the command line flags. If any step fails, the sequence is aborted and schedeval exits with a nonzero return value.


1. Load the task graph

2. Load the schedule

3. Cross-compile the Application Under Test. It is assumed that the application developer has added lines to the Makefile, residing in the aut directory, describing how to compile each task.

4. Generate code for the runtime (round-robin) scheduler, with data describing the static schedule for each core.

5. Generate code specifying the communication channels that will be built at load-time

6. Cross-compile the runtime scheduler with the generated code and link with the Application Under Test archive from step 3 to produce the target executable

7. Prepare the target platform and make sure it is ready

8. Run the target executable on the target platform and collect the re-sulting data

Fig. 4.3 describes a simplified structural view of schedeval. Note that most of the system is target independent. To use schedeval for a specific target architecture, the target functions interface must be implemented. Currently, schedeval has been implemented for the Intel SCC and an SCC emulator.

Figure 4.3: Structural view of the program compiling and linking the target executable (modules: schedule_handling, taskgraph_handling, target_functions, code_generation). The parts marked in gray need to be modified when porting to other architectures.

Fig. 4.4 describes the structural view of the target executable program. As with schedeval, much of the program is not dependent on, or aware of, the architecture it is running on. In order to port this program to a specific architecture, the measurement, communication, and power handling modules need to be modified.


Figure 4.4: Structural view of the binary running on the target (modules: communication, measurement, power_handling, the generated task schedule, and the Application Under Test with its generated per-task interface TASKSETUP, TASKPRERUN, TASKRUN, and TASKDESTROY). The parts marked in gray need to be modified when porting to other architectures.

After an initial setup and initialization phase, target_exec's activity is described in Fig. 4.5. The backend calls each task in the order specified in the schedule. For each task, the state machine (Fig. 4.6), the communication channels, and the preferred frequency are maintained.


Figure 4.5: Activity diagram of the main loop of the backend running an application on the target system
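Condensed into self-contained C, the round-robin loop sketched in Fig. 4.5 behaves roughly as below. The helper names and the dummy task are invented for illustration, and the generated scheduler of course differs in detail; TASKSETUP is assumed to have been handled during the initialization phase.

    #include <stdio.h>

    /* Condensed, self-contained sketch of the per-core runtime loop described by Fig. 4.5. */
    typedef enum { T_PRERUN, T_RUN, T_DONE } state_t;

    static int iterations = 0;
    static int dummy_TASKPRERUN(void)  { return 0; }                /* 0: call TASKRUN next time */
    static int dummy_TASKRUN(void)     { return ++iterations < 5; } /* 1: schedule me again      */
    static void dummy_TASKDESTROY(void){ }
    static void set_frequency_mhz(int mhz) { (void)mhz; }           /* stub for the DVFS backend */

    int main(void)
    {
        struct { state_t state; int freq_mhz; } sched[] = { { T_PRERUN, 533 } }; /* one-entry schedule */
        int ntasks = 1, done = 0;

        while (done < ntasks) {
            for (int i = 0; i < ntasks; i++) {            /* round-robin over the static schedule */
                if (sched[i].state == T_DONE) continue;
                set_frequency_mhz(sched[i].freq_mhz);     /* scheduled frequency for this slot    */
                if (sched[i].state == T_PRERUN) {
                    if (dummy_TASKPRERUN() == 0) sched[i].state = T_RUN;
                } else if (dummy_TASKRUN() == 0) {        /* task reports that it is finished     */
                    dummy_TASKDESTROY();
                    sched[i].state = T_DONE;
                    done++;
                }
            }
        }
        printf("task ran %d times before finishing\n", iterations);
        return 0;
    }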


Flag                 Description                                                        Default
-compile             Compile the target_exec only. Do not run.                          off
-run                 Assume the target_exec is compiled, run it.                        off
-compileandrun       Compile and run the target_exec.                                   on
-walltime xx         Run target_exec for a specified time, in milliseconds.             off
                     xx is a double, so the walltime can be specified down to
                     nanosecond granularity.
-stdlog              Redirect standard output to a logfile.                             off
                     Useful for batch processing.
-detailed_timings    Output detailed timings for the tasks to stdout.                   off
                     Not affected by -stdlog.
-dumppower           Output energy measurements to stdout.                              off
                     Not affected by -stdlog.

Table 4.1: Command line options accepted by schedeval


4.2 Message passing model

In order to comply with different programming paradigms, schedeval provides non-blocking First-In-First-Out message passing between tasks. The communication channel between two tasks is implemented as a ring buffer with offset variables indicating how much has been written to and read from the queue. If the communicating tasks are mapped on the same core, the buffer is allocated in private memory. If the tasks are mapped on different cores, the buffer is allocated in the MPB memory located at the receiving end. The available MPB memory is split evenly between the communication channels on each core. For any shared memory location, only one core is allowed to write, so the implementation is completely lock free. Table 4.2 shows the available operations on the communication channels.

Function                               Description
get(DATATYPE)(channel, &var)           Dequeue one element from the channel
put(DATATYPE)(channel, &var)           Enqueue one element to the channel
size_t length(DATATYPE)(channel)       Returns the number of elements currently in the queue
size_t capacity(DATATYPE)(channel)     Returns the maximum number of elements the channel can hold
int is_empty(channel)                  Returns 1 if the channel is empty
int is_full(channel)                   Returns 1 if the channel is full

Table 4.2: Functions available for the provided communication channels
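A minimal single-producer/single-consumer channel of this kind can be sketched in a few lines of C. This is only an illustration of the lock-free idea (each shared variable has exactly one writer); the real schedeval channels additionally deal with MPB allocation, cache invalidation on the SCC, and the generated DATATYPE variants.

    #include <stdint.h>

    /* Minimal lock-free single-producer/single-consumer ring buffer: only the producer
     * writes 'head', only the consumer writes 'tail', so no locks are needed. */
    #define CAP 64                                /* number of elements, power of two */

    typedef struct {
        volatile uint32_t head;                   /* total elements written (producer only) */
        volatile uint32_t tail;                   /* total elements read (consumer only)    */
        float data[CAP];
    } channel_t;

    static int put_float(channel_t *c, const float *v)
    {
        if (c->head - c->tail == CAP) return 0;   /* full: non-blocking, caller retries later */
        c->data[c->head % CAP] = *v;
        c->head++;                                /* publish only after the element is written */
        return 1;
    }

    static int get_float(channel_t *c, float *v)
    {
        if (c->head == c->tail) return 0;         /* empty */
        *v = c->data[c->tail % CAP];
        c->tail++;
        return 1;
    }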


4.3 DVFS with schedeval

As discussed in Section 2.9, the SCC can scale both frequency and voltage explicitly at runtime. The RCCE library provides functionality to scale frequency and voltage together on the power island groups of eight cores. This is too coarse grained for our purposes, so schedeval bypasses the RCCE functions. Schedeval supports changing the frequency for each individual task. As two cores on the same tile must run at the same frequency, tasks scheduled to run simultaneously on the same tile must also have the same scheduled frequency. It is not dangerous to ignore this rule; the chosen frequency for the tile will be subject to a race condition and the actual frequency will be one of the two specified.

Voltage changing is subject to more problematic issues. As discussed in Section 2.6, scheduling voltage changes is not trivial. Additionally, due to the voltage islands, running a schedule which assumes the voltage can be changed per tile can be outright hazardous for the processor. Any schedule specifying voltage must be aware of the power islands. Furthermore, before each voltage change, the power island would need to synchronize: there must be no core left requiring a high voltage before the change. A good schedule may have voltage changes perfectly aligned in a safe way, but variations at runtime may skew the schedule to be hazardous anyway.

We choose to provide explicit frequency scaling only. In order to still benefit from low frequencies, schedeval finds the highest scheduled frequency within every voltage island. At load time, the voltage for each power island is set to the minimum supported voltage for that highest frequency. The voltage then remains constant throughout the program run. In this way, the voltage is guaranteed to be at a safe level at all times, and groups running at low frequencies get the benefit of low voltage.
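The voltage selection itself is then a simple maximum over the island. The sketch below shows the idea; the frequency-to-voltage table is a made-up placeholder, not the values schedeval takes from the SCC documentation.

    #include <stdio.h>

    /* Pick one safe voltage per island: the minimum voltage that supports the highest
     * frequency scheduled anywhere on that island. The voltage table is a placeholder. */
    static double min_voltage_for(int freq_mhz)
    {
        if (freq_mhz <= 400) return 0.8;          /* hypothetical values, volts */
        if (freq_mhz <= 533) return 0.9;
        return 1.1;                               /* up to 800 MHz */
    }

    int main(void)
    {
        int island_core_freq[8] = {533, 533, 400, 400, 800, 100, 100, 100};  /* one island, 8 cores */
        int max_freq = 0;
        for (int i = 0; i < 8; i++)
            if (island_core_freq[i] > max_freq)
                max_freq = island_core_freq[i];

        printf("island voltage fixed at %.1f V for the whole run (max freq %d MHz)\n",
               min_voltage_for(max_freq), max_freq);
        return 0;
    }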

4.4 Application development with schedeval

A streaming application for schedeval consists of the task code written in C, a task graph, and a schedule. We demonstrate how to make a complete application with a synthetic example. Listing 4.1 shows what a typical small task could look like. Compare the code with Fig. 4.6 to get a sense of how the state machine for the task works.

    #define DATATYPE float
    #include "communication_aut.h"

    static channel_t in, out;

    int TASKSETUP(channel_c_t incoming, channel_c_t outgoing, int argc, char *argv[])
    {
        out = get_channel(outgoing, 0);
        in  = get_channel(incoming, 0);
        return 0; // everything ok
    }

    int TASKPRERUN() { return 0; } // Please call TASKRUN next time.

    int TASKRUN()
    {
        float f;
        get(float)(in, &f);
        put(float)(out, &f);

        if (IFeelFinished)
            return 0; // Call TASKDESTROY and then let me be
        else
            return 1; // I want to be scheduled again
    }

    void TASKDESTROY()
    {
        // free your own malloced memory here
    }

Listing 4.1: Example of source code for a small task

Figure 4.6: State diagram for a running task. Each state corresponds to a specific function within the task that will be called when it is scheduled.

We connect tasks into a task graph. Schedeval uses the XML based format GraphML, with some extra attributes defined, as the representation of the graph.


As GraphML is a widely used format, we can choose between drawing the graph by hand using a GraphML compatible graphical tool, such as yEd, or writing directly into the file. Fig. 4.7 shows an example of what a task graph can look like. Schedeval allows arbitrary graphs or forests; the primary limit is the memory constraints of the target machine. In the task graph, each task is mapped to source code through its chosen taskname. A powerful feature of schedeval is the possibility to reuse a task several times within the same graph. This is done by creating nodes with the same taskname. Every task in the graph is uniquely identified by its taskid. When more than one task has the same name, the task code corresponding to the taskname gets compiled several times with the taskid as a C define, creating a separate compilation unit with a separate scope for each instantiated task. The relation between the taskname and taskid attributes can be compared to that between classes and objects in most object oriented languages.

Figure 4.7: Graphical view of a task graph

For the purpose of scheduling, a few more attributes need to exist in the task graph, see Table 4.3. The autname is the name of the Application Under Test we are defining, and is defined once for the whole graph.

Schedeval has some limited support for multicore (“moldable”) tasks. As a task programmer, it is possible to define tasks that can run on more than one core. This is specified with the maxwidth attribute. The relative efficiency as a function of the number of cores a task runs on is defined in the efficiency attribute.

Attribute            Type            Level
autname              string          graph
target_makespan      float or fml    graph
taskname             string          task
taskid               string          task
maxwidth             int or fml      task
efficiency           float or fml    task
workload             float or fml    task

Table 4.3: Attributes in the task graph specification

Some attribute values can be either a numeric value or a formula dependent on variables. As an example, the efficiency for a task as a function of the number of cores p it is running on can be defined as

    fml: p <= 16 ? 5 / (p + 4) : 1E-6

This particular curve, visualized in Fig. 4.8, obeys Amdahl's law up to 16 cores, but is then defined as almost zero. This expressive feature is useful when generating schedules for a task graph.

Figure 4.8: Example of an efficiency curve described by a formula in the task graph.

The workload and the makespan are defined in milliseconds, but are interpreted as a float (or fml) and can thus take any realistic value. Schedeval handles these values with nanosecond accuracy. While the makespan is generally part of the requirement specification of an application, the workload of each task may not be trivial to obtain. As operations take different amounts of time on different architectures, it is generally not possible to accurately define an architecture independent measure of workload. A static analysis of the code may give a somewhat accurate estimate. As no such analysis tool is part of this framework, we suggest that the developer microbenchmark each task with schedeval, using a task graph and schedule containing only one task, and possibly a driver. We run the benchmark application with schedeval for a fixed amount of time and note how many iterations the schedule has performed. After dividing the time by the number of iterations and subtracting the known time for the overhead (see Section 5.1) and the time for the drivers, we get a measured value for the workload of the benchmarked task.
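As a worked example of this procedure, with invented numbers:

    #include <stdio.h>

    /* Worked example of deriving a task's workload from a microbenchmark run.
     * All numbers are invented for illustration. */
    int main(void)
    {
        double walltime_ms = 10000.0;   /* -walltime given to schedeval            */
        long   iterations  = 180000;    /* schedule rounds completed in that time   */
        double overhead_ms = 0.020;     /* framework overhead per round (Sec. 5.1)  */
        double driver_ms   = 0.005;     /* time spent in the driver task per round  */

        double per_round = walltime_ms / (double)iterations;
        double workload  = per_round - overhead_ms - driver_ms;
        printf("measured workload: %.3f ms per activation\n", workload);
        return 0;
    }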

Listing 4.2 shows the textual representation of the task graph for our pipelined application. The only remaining piece before running the application is a schedule. We can either write one by hand or let a scheduler generate one for us. Listing 4.3 shows one possible schedule for the application. Note that we differentiate the target makespan in the task graph from the roundtime in the schedule. Strictly speaking, the makespan is the time from scheduling the first task until the last task of the iteration has finished. Depending on the workload of the tasks, the scheduler may end up with a makespan lower than the target. The schedule attribute roundtime reflects this; slack is added after the actual makespan up to the desired target makespan.

    <?xml version="1.0" encoding="UTF-8"?>
    <graphml xmlns="http://graphml.graphdrawing.org/xmlns"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://graphml.graphdrawing.org/xmlns
                                 http://graphml.graphdrawing.org/xmlns/1.0/graphml.xsd">

      <!-- attribute definitions and default values omitted here -->

      <graph id="G" edgedefault="directed">
        <data key="g_autname">My_streaming_application</data>
        <data key="g_target_makespan">200</data>

        <node id="n0">
          <data key="v_taskname">send_forward</data>
          <data key="v_taskid">first</data>
          <data key="v_workload">1</data>
          <data key="v_max_width">1</data>
          <data key="v_efficiency">fml: p == 1 ? 1 : 1e-06</data>
        </node>

        <node id="n1">
          <data key="v_taskname">send_forward</data>
          <data key="v_taskid">second</data>
          <data key="v_workload">1</data>
          <data key="v_max_width">1</data>
          <data key="v_efficiency">fml: p == 1 ? 1 : 1e-06</data>
        </node>

        <node id="n2">
          <data key="v_taskname">doing_something</data>
          <data key="v_taskid">id_doing_something</data>
          <data key="v_workload">15</data>
          <data key="v_max_width">5</data>
          <data key="v_efficiency">fml: p &lt; 5 ? 1 - 0.1 * p : 1e-06</data>
        </node>

        <node id="n3">
          <data key="v_taskname">doing_something_else</data>
          <data key="v_taskid">id_doing_something_else</data>
          <data key="v_workload">14</data>
          <data key="v_max_width">1</data>
          <data key="v_efficiency">fml: p == 1 ? 1 : 1e-06</data>
        </node>

        <node id="n4">
          <data key="v_taskname">combine_data</data>
          <data key="v_taskid">1</data>
          <data key="v_workload">27</data>
          <data key="v_max_width">1</data>
          <data key="v_efficiency">fml: p == 1 ? 1 : 1e-06</data>
        </node>

        <edge source="n0" target="n1"/>
        <edge source="n1" target="n2"/>
        <edge source="n1" target="n3"/>
        <edge source="n2" target="n4"/>
        <edge source="n3" target="n4"/>
      </graph>
    </graphml>

Listing 4.2: Textual representation of a task graph

    <?xml version="1.0" encoding="UTF-8"?>
    <schedule name="schedule_name" autname="My_streaming_application"
              cores="2" tasks="5" roundtime="200">
      <core coreid="0">
        <task taskid="first" width="1" frequency="800" ordering="0"/>
        <task taskid="second" width="1" frequency="800" ordering="1"/>
        <task taskid="id_doing_something" width="1" frequency="800" ordering="2"/>
      </core>
      <core coreid="1">
        <task taskid="id_doing_something_else" width="1" frequency="800" ordering="0"/>
        <task taskid="1" width="1" frequency="533" ordering="1"/>
      </core>
    </schedule>
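In the schedule above, the roundtime equals the task graph's target makespan of 200 ms. Had the scheduler instead produced an actual makespan of, say, 180 ms, the roundtime would still be 200 ms and the remaining 20 ms would appear as slack at the end of each round.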



4.5 Scheduling and evaluation

The framework supports autonomous batch processing of all task graph, scheduler, and target architecture combinations, and generation of quality reports using various schedule quality assessment tools. [to be completed]


Chapter 5

Experimental evaluation

5.1 Pingpong: communication delay and overhead

Before using schedeval to benchmark applications, we start with an evaluation of the framework itself. Aside from functional verification, documenting how well the framework performs in various situations is important in order to interpret later evaluation results. The backend invariably adds some overhead to the runtime. Additionally, since we do not use the RCCE library functions for our message transfer operations, the message passing delay needs to be measured as well.

5.1.1 Experimental setup

Figure 5.1: Task graph and sequence diagram for the pingpong test application. (a) Task graph; (b) sequence diagram.



As a functional evaluation, we devise a small application called pingpong. Pingpong consists of two tasks, ping and pong, which both know a common constant. An iteration begins with ping sending an integer to pong through a provided communication channel. Pong, waiting for this number, receives it, adds the constant to it, and sends it back to ping. Ping checks that the received number has been incremented by the correct constant. This is one ping-pong round trip. If the number is correct, ping sends a new number, and the loop continues indefinitely.

We schedule ping and pong to run at 800 MHz in three different configurations: on the same core (“local”), on different cores on the same tile (“tile”), and on different cores on adjacent tiles (“remote”). By running the application for extended amounts of time without it failing, we aim to verify that scheduling and message passing function correctly.

Since the tiny ping and pong tasks run for very short durations each time they are scheduled, the round trip time of the local test consists mostly of the overhead of running one iteration of a schedule. The tile and remote versions also include the extra delay of the inter-core data transfers.

To put these values in perspective, we compare the results with the round trip time of a blocking send/receive loop implemented with the synchronous RCCE_send and RCCE_recv functions. Since this RCCE pingpong only sends a value back and forth between cores, with blocking operations, its round trip time can be considered very close to the minimum time required for a data transfer roundtrip.
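For reference, a minimal sketch of such an RCCE pingpong loop is shown below. It relies on the blocking RCCE_send and RCCE_recv calls; the round count, the choice of cores 0 and 1, and the omission of RCCE's buffer alignment and size-granularity details are simplifications for illustration, not the exact benchmark code used in this chapter.

    /* Minimal RCCE pingpong sketch (illustrative, not the exact benchmark).
     * Core 0 (ping) and core 1 (pong) bounce an integer with blocking calls. */
    #include "RCCE.h"

    int RCCE_APP(int argc, char **argv)
    {
        int me, i, value = 0;
        const int rounds = 100000;

        RCCE_init(&argc, &argv);
        me = RCCE_ue();                     /* id of this core (unit of execution) */

        for (i = 0; i < rounds; i++) {
            if (me == 0) {                  /* ping: send, then wait for reply */
                RCCE_send((char *)&value, sizeof(value), 1);
                RCCE_recv((char *)&value, sizeof(value), 1);
            } else if (me == 1) {           /* pong: receive, add constant, return */
                RCCE_recv((char *)&value, sizeof(value), 0);
                value++;
                RCCE_send((char *)&value, sizeof(value), 0);
            }
        }

        RCCE_finalize();
        return 0;
    }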

Finally, we schedule several pingpong pairs in the same application to test how the overhead and communication scale.

5.1.2 Results and discussions

As can be seen in Fig. 5.2, the roundtrip time for the local configuration is 4.2 microseconds. Considering that the application roughly amounts to storing a value in the L1 cache and reading it back, this time is quite high. As the tasks themselves do very little, the timing can roughly be equated to the backend overhead when no inter-core communication occurs. The high value may also partially be attributed to the schedule having very few tasks. The RCCE pingpong reference value of 5.2 µs, on the other hand, represents the minimum possible communication delay, without any framework overhead. The tile and remote configurations, at 8.4 µs, are both much higher; this can be attributed to overhead and communication delay combined.

Figure 5.2: Blue: Time for one ping-pong sequence with different configurations (lower is better). Red: How often the ping task is scheduled but no incoming message from pong is available to process (lower is better).

By counting the number of times ping is scheduled but has no message from pong to consume, we find how often the task was scheduled while not data ready (Fig. 5.2, red bars). This number turns out to be around 45% for the tile and remote configurations¹. As ping finds itself non-dataready, it deschedules itself, and another round of overhead occurs before the next check. Comparing this to the RCCE application, which is busy-polling, gives an idea of why the time is so much higher.

The non-dataready rate is important to address, since schedulers such as Crown Scheduling assume that message delays are hidden because the transfer occurs while other tasks are scheduled. If this assumption does not hold, the runtime of a task depends on whether its pipeline neighbours are scheduled on the same core or not. If it does hold, an application may benefit from having more tasks scheduled, as the non-dataready rate falls and the cpu-time required for each individual task decreases.

Fig. 5.3 displays the results of runs where several independent pingpong applications run in the same schedule. From the total number of round trips, the number of pingpong pairs, and the experiment duration, we compute how much cpu-time is required to perform one roundtrip iteration, as a function of the number of pingpong pairs. This value can be seen as an inverted efficiency metric:

    cputime(pairs) = (pairs · experiment duration) / (total number of roundtrips)

In the remote version all ping tasks are scheduled on a single core, and all pong tasks are scheduled on another. In the local test the ping and pong tasks are interleaved on the same core. Following the theory of message delay hiding, the round trip time for the remote configuration is expected to approach the round trip time of the local configuration once a sufficient number of tasks is scheduled².

¹ The application with the local configuration is naturally always data ready, since no message transfer occurs. RCCE is busy-reading, so the non-dataready rate is here modeled as 99%.

Figure 5.3: (a) cputime of one iteration, or roundtrip time, as a function of the number of pingpong pairs running. (b) Non-dataready rate for the remote configuration as a function of the number of pingpong pairs running.

The results in Fig. 5.3 show a correlation between efficiency and the number of scheduled tasks for the remote configuration: with more tasks scheduled, more data is processed per unit of cpu-time. This can be attributed to a combination of latency hiding and amortization of the backend overhead, the latter also slightly visible in the local test.

However, comparing the round trip times for local and remote in the eight-pair run, an unexplained gap of 0.5-1.0 µs remains. We suggest this is the delay of fetching the channel metadata and payload from the local MPB to L1. It is therefore impossible to completely hide the delay with the above latency-hiding technique.

The pingpong tests show that the overhead is fairly high when the number of tasks is low, but that it is partially amortized as the number of tasks increases. The tests also show that the assumption that latency can be hidden by running several tasks is partially correct for the Intel SCC. The remaining difference in overhead per task should be considered when scheduling, as tasks communicating through the MPB, according to these results, operate at slightly lower efficiency. If it later proves important to remove this difference, a possible solution could be to prefetch data

² This means that the RCCE pingpong delay is not the lower limit; the lower limit consists only of the pingpong execution time and the backend overhead.
