
KTH Royal Institute of Technology
School of Information and Communication Technology
Electronic Systems

Automatic Software Synthesis from High-Level ForSyDe Models Targeting Massively Parallel Processors

Master of Science Thesis in System-on-Chip Design, June 2013
TRITA–ICT–EX–2013:139

Author: George Ungureanu
Examiner: Assoc. Prof. Ingo Sander
Supervisors:


Thesis number: TRITA–ICT–EX–2013:139
Royal Institute of Technology (KTH)
School of Information and Communication Technology (ICT)
Research Unit: Electronic Systems (ES)
Forum 105, 164 40 Kista, Sweden

Copyright © 2013 George Ungureanu. All rights reserved.

This work is licensed under the Creative Commons Attribution-NoDerivs (CC BY-ND) 3.0 License. A copy of the license is found at http://creativecommons.org/licenses/by-nd/3.0/. This document was typeset in LaTeX with kp-fonts as the font package. Most of the figures were


Abstract

In the past decade we have witnessed an abrupt shift to parallel computing, driven by an increasing demand for performance and functionality that can no longer be satisfied by conventional paradigms. As a consequence, the abstraction gap between applications and the underlying hardware has widened, prompting several research directions in both industry and academia.

This thesis project aims at analyzing some of these directions in order to offer a solution for bridging the abstraction gap between the description of a problem at a functional level and its implementation on a heterogeneous parallel platform using ForSyDe, a formal design methodology. The report treats applications employing data-parallel and time-parallel computation, and regards nvidia CUDA-enabled GPGPUs as the main backend platform.

The report proposes a heuristic transformation-and-refinement process based on analysis methods and design decisions to automate and aid in a correct-by-design backend code synthesis.

Its purpose is to identify potential data parallelism and time parallelism in a high-level system. Furthermore, based on a basic platform model, the algorithm load-balances and maps the execution onto the best computation resources in an automated design flow. This design flow will be embedded into an already existing tool, f2cc (ForSyDe-to-CUDA C), and tested for correctness on an industrial-scale image processing application aimed at monitoring inkjet print-head reliability.

Keywords: system design flow, high abstraction-level models, ForSyDe, GPGPU, CUDA, time-parallel, data-parallel

Acknowledgments

In the course of this thesis project several people have helped me accomplish my tasks and contributed in one way or another, and to them I am deeply grateful.

First of all, I would like to thank my supervisors, Hosein and Gabriel, for all their support. Without Hosein's scientific feedback my report would have been much less valuable, and without Gabriel's active involvement in the software tool's development and implementation my progress would have been even further delayed.

Secondly, I would like to thank my mentors, Werner Zapka and Ingo Sander, for investing so much time and trust in my personal and professional development. Without their "leap of faith" regarding my trustworthiness I would never have had the chance to be involved in such exciting projects and work in such an amazing environment.

Thirdly, I would like to thank my colleagues from XaarJet AB, who proved to be not only excellent professionals in their area of research, helping me develop insight into areas I could never have explored otherwise, but great friends as well. I am also grateful to my Master's Program colleagues Marcus Miculcak and Ekrem Altinel. The excellent collaboration between us resulted in great outcomes, and the ideas presented in this report mostly resulted from the interesting debates and discussions I had with them.

And last, but not least, I would like to show my deepest gratitude to my wife, Ana Maria, who has valiantly put up with me during the grim time when I worked on this thesis. Her unconditional support, care and understanding kept me going, and helped me yield results even when the workload was heavy. This thesis is undoubtedly dedicated to her...

George Ungureanu
Stockholm, June 2013


Contents

Contents . . . iv

List of Figures . . . x

List of Tables . . . xi

Listings . . . xiv

List of Abbreviations . . . xvi

1 Introduction . . . 1

1.1 Problem statement . . . 1

1.2 Motivation . . . 2

1.3 Objectives . . . 2

1.4 Document overview . . . 3

1.4.1 Part I . . . 3

1.4.2 Part II . . . 4

1.4.3 Part III . . . 4

1.4.4 Part IV . . . 5

I Understanding the Problem . . . 7

2 ForSyDe . . . 9

2.1 Introduction . . . 9

2.2 The modeling framework . . . 10

2.3 System modeling in ForSyDe-SystemC . . . 12

2.3.1 Signals . . . 13

2.3.2 Processes . . . 13

2.3.3 Testbenches . . . 15

2.3.4 Intermediate XML representation . . . 15


3 Understanding Parallelism . . . 17

3.1 Parallelism in the many-core era . . . 17

3.2 A theoretical framework for parallelism . . . 18

3.2.1 Kleene’s partial recursive functions . . . 19

3.2.2 A functional taxonomy for parallel computation . . . 21

3.3 Parallel applications: the 13 Dwarfs . . . 22

3.4 Berkeley’s view: design methodology . . . 23

3.4.1 Application point of view . . . 24

3.4.2 Software point of view . . . 24

3.4.3 Hardware point of view . . . 27

4 GPGPUs and General Programming with CUDA . . . 29

4.1 Brief introduction to GPGPUs . . . 29

4.2 GPGPU architecture . . . 30

4.3 General programming with CUDA . . . 32

4.4 CUDA streams . . . 35

5 The f2cc Tool . . . 37

5.1 f2cc features . . . 37

5.2 f2cc architecture . . . 38

5.3 Alternatives to f2cc . . . 40

5.3.1 SkelCL . . . 40

5.3.2 SkePU . . . 42

5.3.3 Thrust . . . 43

5.3.4 Obsidian . . . 44

6 Challenges . . . 47

II Development and Implementation . . . 51

7 The Component Framework . . . 53

7.1 The ForSyDe model . . . 53

7.1.1 f2cc approach . . . 53

7.1.2 Model limitations and future improvements . . . 57

7.2 The intermediate model representation . . . 59

7.2.1 f2cc approach . . . 59

7.2.2 Limitations and future improvements . . . 61

7.3 The process function code . . . 61

7.3.1 f2cc approach . . . 62

7.3.2 Future improvements . . . 63

7.4 The GPGPU platform model . . . 64

7.4.1 Computation costs . . . 64

7.4.2 Communication costs . . . 65

7.4.3 Future improvements . . . 66

8 Design Flow and Algorithms . . . 67

8.1 Model modifier algorithms . . . 67

8.1.1 Identifying data-parallel processes . . . 67


8.1.3 Load balancing the process network . . . 73

8.1.4 Pipelined model generation . . . 79

8.1.5 Future development . . . 79

8.2 Synthesizer algorithms . . . 80

8.2.1 Generating sequential code . . . 81

8.2.2 Scheduling and generating CUDA code . . . 83

8.2.3 Future development . . . 87

9 Component Implementation . . . 89

9.1 The ForSyDe model architecture . . . 89

9.2 Module interconnection . . . 90

9.3 Component execution flow . . . 91

III Final Remarks . . . 93

10 Component Evaluation and Limitations . . . 95

10.1 Evaluation . . . 95

10.2 Limitations and future work . . . 97

11 Concluding Remarks . . . 101

IV Appendices . . . 103

A Component documentation . . . 105

A.1 Building . . . 105

A.2 Preparations . . . 106

A.3 Running the tool . . . 107

A.4 Maintenance . . . 107

B Proposing a ForSyDe Design Toolbox . . . 109

B.1 A simple explanatory example . . . 109

B.2 Layered refinements & Refinement loop . . . 110

B.3 Proposed architecture for the design flow tool . . . 113

C Demonstrations . . . 115


List of Figures

2.1 ForSyDe process network . . . 11

2.2 ForSyDe process constructor . . . 11

2.3 ForSyDe MoCs . . . 12

3.1 Kleene’s composition rule . . . 19

3.2 Kleene’s basic forms of composition . . . 20

3.3 Kleene’s primitive recursiveness and minimization . . . 20

4.1 nvidia CUDA architecture . . . 30

4.2 nvidia CUDA thread division . . . 31

4.3 nvidia CUDA streams . . . 35

5.1 f2cc identification pattern . . . 38

5.2 f2cc component connections . . . 39

5.3 f2cc internal model . . . 39

5.4 Obsidian program pattern . . . 44

7.1 f2cc v0.1 internal model . . . 54

7.2 f2cc v0.2 internal model . . . 56

7.3 The ParallelComposite process . . . 57

7.4 f2cc cross-hierarchy connections . . . 58

7.5 Generating variable declaration code . . . 62

7.6 CFunction structure in f2cc v0.1 . . . 62

7.7 CFunction structure in f2cc v0.2 . . . 62

7.8 Extracting variable information . . . 63

8.1 Grouping potentially parallel processes . . . 70

8.2 Building individual data paths . . . 74

8.3 Loop unrolling . . . 74

8.4 Modeling streamed execution . . . 85


9.1 f2cc v0.2 internal model architecture . . . 90

9.2 f2cc execution flow . . . 91

B.1 Simple model after analysis . . . 110

B.2 Simple model after refinements . . . 110

B.3 Model analysis aspects . . . 111

B.4 Hierarchical separation of transformation layers . . . 111

B.5 Refinement loop . . . 113

C.1 Demo example: input model . . . 118

C.2 Demo example: model after flattening . . . 119

C.3 Model after grouping equivalent comb processes . . . 120

C.4 Model after grouping potentially parallel leaf processes . . . 121

C.5 Model after removing redundant zipx and unzipx processes . . . 122

C.6 Model after platform optimization . . . 123

C.7 Model after load balancing . . . 124


List of Tables

3.1 The 13 Dwarfs of parallel computation . . . 23

8.1 Assigning data bursts to streams . . . 84

10.1 The component’s status at the time of writing the report . . . 96


Listings

2.1 ForSyDe-SystemC signal definition . . . 13

2.2 ForSyDe-SystemC leaf process function definition . . . 13

2.3 ForSyDe-SystemC composite process declaration . . . 14

2.4 ForSyDe-SystemC testbench . . . 15

2.5 ForSyDe-SystemC introspection . . . 15

2.6 ForSyDe-SystemC intermediate XML format . . . 16

4.1 Matrix multiplication in C . . . 33

4.2 Matrix multiplication in CUDA - Host . . . 34

4.3 Matrix multiplication in CUDA - Device . . . 34

4.4 Concurrency in CUDA . . . 35

5.1 SkelCL syntax example . . . 41

5.2 SkePU function macros . . . 42

5.3 SkePU syntax example . . . 43

5.4 Thrust syntax example . . . 43

5.5 Obsidian function declaration . . . 44

5.6 Obsidian definition for pure and sync . . . 45

7.1 GraphML port . . . 60

7.2 XML port . . . 60

7.3 Static type name declaration . . . 60

7.4 Result of static type name declaration . . . 61

7.5 Algorithm for parsing ForSyDe function code . . . 63

8.1 Algorithm for identifying data parallel sections . . . 68

8.2 Methods used by data parallel sections identification algorithm . . . 69

8.3 Method used by data parallel sections identification algorithm . . . 71

8.4 Proposed algorithm for identifying data parallel sections . . . 72

8.5 Algorithm for platform optimization . . . 72

8.6 Algorithm for load balancing . . . 73

8.7 Method for data paths extraction, used by load balancing algorithm . . . 74

8.8 Method for extracting and sorting contained sections . . . 77

8.9 Method for splitting the process network into pipeline stages . . . 78

8.10 Algorithm for code synthesis . . . 81


8.11 Method for generating sequential code for composite processes . . . 82

8.12 Top level for method for generating CUDA code . . . 86

A.1 Platform model template . . . 107

C.1 ForSyDe process function code . . . 117

C.2 Extracted C code . . . 117

C.3 Excerpt from the f2cc output logger . . . 126

C.4 Sample sequential code: composite process execution wrapper . . . 127

C.5 Sample parallel code: top level execution code . . . 128


List of Abbreviations

3D three-dimensional

ANSI American National Standards Institute

API Application Program Interface

AST Abstract Syntax Tree

CPU Central Processing Unit

CT Continuous Time (MoC)

CUDA Compute Unified Device Architecture

DE Discrete Event (MoC)

DI Domain Interface

DRAM Dynamic Random-Access Memory

DSL Domain Specific Language

DUT Design Under Test

EDSL Embedded Domain Specific Language

ESL Electronic System Level

f2cc ForSyDe to CUDA C

ForSyDe Formal System Design

GPGPU General Purpose Graphical Processing Unit

GPU Graphical Processing Unit

GraphML Graph Markup Language

GUI Graphical User Interface

HDL Hardware Description Language

ILP Instruction Level Parallelism

IP Intellectual Property

ITRS International Technology Roadmap for Semiconductors

MIMD Multiple Instruction Multiple Data

MoC Model of Computation

OS Operating System

POM Project Object Model

RTTI Run-Time Type Information

SDF Synchronous Data Flow (MoC)


SDK Software Development Kit

SIMD Single Instruction Multiple Data

SIMT Single Instruction Multiple Thread

SM Streaming Multiprocessor

SP Streaming Processor

STL Standard Template Library

SY Synchronous (MoC)

UT Untimed (MoC)


Chapter 1

Introduction

This chapter will present the problem that will be approached throughout this thesis. The problem will be stated prior to a brief motivation for this project in the current industrial context. Afterwards, a set of overarching goals will be enumerated, followed by an overview of this report.

1.1 Problem statement

The current project aims at tackling the problem of mapping intensive parallel computation onto platforms with resource support for data- and time-parallel computation, with special consideration to the leading many-core platform in industry, the General Purpose Graphical Processing Unit (GPGPU, [Kirk and Hwu, 2010]). As a design language for describing systems at a high level of abstraction, ForSyDe [Sander and Jantsch, 2004] will be used. ForSyDe is associated with a formal high-level system design methodology that raises the abstraction level in designing real-time embedded systems in order to aid the mapping on complex heterogeneous platforms through techniques like design space exploration, semantic-preserving transformations, refinement-through-replacement, etc.

The first problem that has to be treated is analyzing whether or not ForSyDe supports the description of parallel computation in harmony with the existing MoC-based framework. In this sense, a deep understanding of parallelism and its principles is necessary. The two main terms introduced in the current contribution, data parallelism and time parallelism, will be presented in the context of parallel computation in Chapter 3.

The second problem that this project must attend to is the implementation of a mapping algorithm from a parallel ForSyDe model to a GPGPU backend. In order to do so, an existing tool called f2cc [Hjort Blindell, 2012]¹ has to be extended to support both the new ForSyDe-SystemC features and the new data-parallel and time-parallel models.

¹ ForSyDe to CUDA C (f2cc) was developed and implemented by Gabriel Hjort Blindell as part of his Master's Thesis in 2012.


The third and final problem treated by this thesis is the validation of the resulting software component against an industrial-scale application provided by XaarJet AB, a printing-oriented company.

1.2 Motivation

In the past decade we witnessed a dramatic shift of computation paradigms into the parallel domain, hence the dawn of the "many-core era". This shift was not so much a result of great innovation as a necessity to cope with the increasing demands for performance and functionality. This is compounded by the increasing complexity of both platforms and applications, which cannot be handled by traditional design methods anymore.

Faced with the "parallel problem", both industry and academia came up with a number of solutions, which will be presented further in Chapter 3, Chapter 4 and Section 5.3. The main issue is that most of these solutions do not rest on a commonly agreed formal basis that could constitute a theoretical foundation for the parallel paradigm, just as Turing's model was the foundation for the sequential paradigm. Furthermore, they represent points of view dispersed among research groups, each of which molds the paradigms to its desired goals (productivity, backward-compatibility, verification, etc.).

Most of the aforementioned solutions treat many-core parallel platforms as means of high-throughput computation. Strangely, one point of view has been ignored until now, especially by high-level programming models: treating many-cores as complex, heterogeneous and analyzable platforms.

Hence, we invoke ForSyDe as a methodology to treat these issues. Due to its inherent formalism, the complexity problem can be properly handled, enabling correct-by-design implementation solutions. Furthermore, the Model-of-Computation-based formalism [Sander and Jantsch, 2004] is a natural framework for expressing parallel computation; consequently it offers a good environment for a foundation for parallelism. The design flows associated with the ForSyDe methodology are based on analysis, design space exploration and semantic-preserving transformations, providing means to take advantage of architectural traits that are hard to explore otherwise.

The platform chosen for analysis is the GPGPU, since it is the most widely used many-core platform in industry. GPGPUs are notoriously difficult to program and verify due to their low-level style of programming based on a sequential model. XaarJet AB provides one application that requires high-throughput computation on a parallel platform and whose development stagnated due to these issues. This application will be implemented in ForSyDe and given as an example for testing the current project.

1.3 Objectives

The main goal of this project is to investigate and offer proper solutions to the problems stated in Section 1.1. This task has been split into the following set of sub-goals:

1. Study:

• The f2cc architecture, tool API, implementation, and the thesis report;

• Relevant material related to parallel computation;

• GPGPUs, their architecture and programming model;

• Alternatives to f2cc.

2. Devise a plan for expanding f2cc’s functionality.

3. Expand f2cc with new features provided by the ForSyDe-SystemC modeling framework, and implement a new frontend.

4. Implement a synthesis flow for pipelined CUDA code synthesis for f2cc.

5. Provide high-quality code documentation.

6. Evaluate the improved f2cc tool with an image processing application provided by XaarJet AB.

As an optional goal, the analysis and proposal of a generic development flow tool for ForSyDe will be presented. This tool should easily embed the implemented synthesis flow but keep a high degree of flexibility in order to enable any type of flow. As it is beyond the scope of an M.Sc. thesis to implement such a fully generic tool, only proposals for future research will be delivered. Another relevant optional goal is the implementation of other types of parallelism, if time permits.

1.4 Document overview

The document is divided into four parts. The first part includes the background study performed in order to understand the full scale of the problem which we will encounter. The second part presents the individual steps of implementing the component. The third part closes the report with some concluding remarks based on evaluation results. The fourth part contains supplementary material that could not be included in the report body. The following sections aim to offer a reading guide for the current report.

1.4.1 Part I

The first part of the document digs into the theoretical problem and tries to analyse it from different perspectives. Its purpose is to provide the reader with enough knowledge to understand the full scale of the problem and the challenges that will arise during the component implementation. Chapter 2 briefly introduces the reader to the ForSyDe methodology and the ForSyDe-SystemC design framework. It presents the basic concepts and usage, focusing on the structures used in this project, and it points to further related material. The reader may skip this chapter provided he or she possesses previous knowledge of ForSyDe.

Chapter 3 may be skimmed, although there exist a few theoretical notions that are defined and referenced in these sections. Still, a future ForSyDe developer is encouraged to read the provided material, since it offers valuable insight into the problems she or he might encounter.

Chapter 4 introduces the reader to the basic concepts of GPGPUs, as they are the main target platform. Material for further reading is referenced, and the basic usage of threads and streams is shown. This chapter constitutes the background for the implementation of the software component's synthesizer module.

Chapter 5 briefly presents the current component that has to be improved, f2cc. It also analyses four alternatives to f2cc for synthesis flows targeting GPGPUs, available in the research community. The reader is strongly encouraged to read this chapter in order to understand the content in Part II.

Chapter 6, the final chapter belonging to this part, lists the main challenges that were identified during Part I as needing to be treated by this project's work efforts, and prioritizes them.

1.4.2 Part II

The second part concerns the development of the software component. An in-depth analysis of the software architecture and the algorithms used is presented. The part closes by putting together the previously presented components in order to depict the proposed design flow. Chapter 7 introduces the reader to the main development framework that had to be improved in order both to deliver the desired goals and to embed this project's design flow into the available design flow. The main features are presented in order to convey the magnitude of the work effort. Apart from the design decisions made, a comprehensive list of future improvements is proposed for a potential developer.

Chapter 8 presents the main theory and concepts behind the above-mentioned software tool. Its algorithms are analyzed for scalability and provided with either optimized alternatives or proposals for future development, since they touch on still young and little-explored issues.

Finally, Chapter 9 binds together the previous two chapters, shortly presents the component's main implementation features, and charts its execution flow.

1.4.3 Part III

The third part closes this report. An evaluation of the current state of the project is offered with regard to the initial goals, along with a list of proposals for future development. Chapter 10 evaluates the current state of the software component while listing and prioritizing proposals for future development, as they emerge from an overview of Part II. Chapter 11 concludes the M.Sc. project and gives a verdict with respect to the delivered versus the initially proposed goals.

1.4.4 Part IV


Part I

Understanding the Problem


Chapter 2

ForSyDe

This chapter will briefly present ForSyDe (Formal System Design), a system design methodology that starts from a high-level formal description. The first section will introduce ForSyDe in the system design environment. The second section will provide a brief overview of the modeling framework, while the third section will show an example of how to model systems using the SystemC implementation. It is out of the scope of this report to provide full comprehensive documentation, which is why the reader is encouraged to consult related documents like [Sander and Jantsch, 1999, Sander, 2003, Sander and Jantsch, 2004, Attarzadeh Niaki et al., 2012, Jakobsen et al., 2011] or tutorials [ForSyDe, 2013].

2.1 Introduction

Keutzer et al. state that "in order to be effective, a design methodology that addresses complex systems must start at high levels of abstraction" [Keutzer et al., 2000]. ForSyDe is one such methodology for Electronic System Level (ESL) design that "raises the abstraction level of the design entry to cope with the increased complexity of (embedded systems)" [Attarzadeh Niaki et al., 2012].

ForSyDe's main objective is to "move the design refinement from the implementation to the functional domain" [Sander and Jantsch, 2004], by capturing a design's functionality inside a specification model. Thus, the designer works on this model, which hides the implementation details, and is able to "focus on what a system is supposed to do rather than how" [Hjort Blindell, 2012].

Working at a high level of abstraction has two main advantages. Firstly, it enables the designer to have an overview of the system and of the data flow. Secondly, it aids the identification of optimizations and of opportunities for better design decisions. One example, which will be extensively treated in this report, is the identification and exploitation of parallel patterns in algorithms described in ForSyDe. These patterns could not be exploited as naturally at compiler level, since the full context for the cause and effect in the execution is lost.

Another key feature of ForSyDe is design transformation and refinement. By applying semantic-preserving transformations to the high-level model, and gradual refinement by adding backend-relevant information, one can achieve a correct implementation through a transparent process optimized for synthesis. Combining refinement with analysis at each design stage, the designer is able to reach an optimal implementation solution for the given problem.

Perhaps the most important feature of ForSyDe is its formalism. Practically, the design starts from a formal model that expresses functionality. This aids in developing a correct-by-design system that can be both tested and validated at all levels of refinement, an especially difficult task to achieve without a formal basis. Also, the computational and structural features can be captured and analyzed formally, dismissing all ambiguities. This eliminates, at least theoretically, the need for post-verification and debugging, which is often the most expensive stage of a product realization process.

2.2 The modeling framework

The following subsection is based on material found in [Attarzadeh Niaki et al., 2012] and [Lee and Sangiovanni-Vincentelli, 1997].

To understand the mechanisms behind ForSyDe one should have a clear picture of its modeling framework, which determines its formal basis. In the following paragraphs, the basic concepts will be explained.

Structure

The system model is structured as a concurrent hierarchical process network. The components of a process network are processes and domain interfaces, connected through signals, as shown in Figure 2.1. The processes are triggered and synchronized only through signals, and the functions encapsulated by them are side-effect free.

Hierarchy can be achieved through composite processes. They are formed by composing either leaf processes (like p1 ... p5 in Figure 2.1) or other composite processes.

Models of Computation

The Models of Computation (MoCs) describe the semantics of concurrency and computation of the processes in the system. Each process belongs to a MoC which explicitly describes its timing behavior. Currently the ForSyDe-SystemC framework supports four MoCs [ForSyDe, 2013], but more are being researched and developed. The supported MoCs are:

Figure 2.1: ForSyDe process network (leaf processes p1 ... p5, domain interfaces di1 and di2, and two composite processes spanning MoC A and MoC B)

• The Discrete Event MoC (DE), where a time quantum is defined. It is suitable for describing test bench systems and modeling the environment.

• The Continuous Time MoC (CT) that describes physical time. It is suitable for modeling analog components and physical processes.

Process Constructors

The process constructors enforce formal restrictions upon the design, to ensure analyzability and an organized structure.

In order to create a leaf process in the model, the designer must choose a process constructor from the defined ForSyDe library. A process constructor takes side-effect-free functions and values as arguments and creates a process. The formal computation and communication semantics are embedded in the model based on the chosen constructor.

Figure 2.2: Example of creating a Mealy process using a ForSyDe process constructor: the mealy constructor takes functions f and g and value v, and produces a mealy process. Source: adapted from [ForSyDe, 2013].

Figure 2.2 illustrates the concept of a process constructor by creating a process that implements a Mealy finite-state machine within the SY MoC. The process constructor defines the Model of Computation, the type of the process (finite-state machine), and the process interface. The functionality of the process is defined by a function f that specifies the calculation of the next state, another function g that specifies the calculation of the output, and a value v that specifies the initial value of the process.
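To make the constructor idea concrete, the sketch below shows how such a Mealy process might be instantiated in ForSyDe-SystemC. It is illustrative only: the make_mealy name and argument order are assumed by analogy with the make_comb2 and make_delay constructors appearing in Listing 2.3, not taken from this chapter.

#include <forsyde.hpp>
using namespace ForSyDe::SY;   // as in Listing 2.3

// Hypothetical next-state function f and output function g, in the style of Listing 2.2.
void f(abst_ext<int>& next_st, const abst_ext<int>& st, const abst_ext<int>& in)
{
    next_st = st.from_abst_ext(0) + in.from_abst_ext(0);  // accumulate the input into the state
}

void g(abst_ext<int>& out, const abst_ext<int>& st, const abst_ext<int>& in)
{
    out = 2 * st.from_abst_ext(0);  // output derived from the current state
}

// Inside an SC_MODULE constructor, with in_sig and out_sig of matching signal types,
// the process would then be created roughly as:
//   make_mealy("mealy1", f, g, abst_ext<int>(0), out_sig, in_sig);
// where f computes the next state, g computes the output, and abst_ext<int>(0) is the initial value v.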

Domain Interfaces and Wrappers

Domain interfaces (DIs) connect processes belonging to different MoCs; the framework provides the DIs shown in Figure 2.3, and other DIs (the dotted lines) are derived by composing existing DIs.

Figure 2.3: ForSyDe MoCs (SDF, SY, DE, CT) and their DIs

The wrappers are special processes which behave similarly to other processes, but which embed external models. They communicate their inputs/outputs to external simulators to co-simulate the model and assure system validation even if not all components are implemented in the ForSyDe framework. It is out of this thesis' scope to study the effects of and offer solutions for DIs and wrappers.

The Synchronous Model of Computation

This report will mainly focus on the SY MoC, since it is the only MoC considered in the design flow associated with this project's software component. The SY MoC describes a timed concurrent system, implying that its events are globally ordered. This means that any two distinct events are either synchronous (they happen at the same moment and are associated with the same tag) or one unambiguously precedes the other [Lee and Sangiovanni-Vincentelli, 1997].

Two signals can be considered synchronous if all events in one signal are synchronous with the events from the other signal and vice-versa. A system is synchronous if every signal in the system is synchronous to every other signal in the system.
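Stated compactly in tag-based notation (our restatement, in the tagged-signal style of [Lee and Sangiovanni-Vincentelli, 1997]): writing each event as a tag-value pair, two events are synchronous exactly when their tags coincide:

$$e_1 = (t_1, v_1),\; e_2 = (t_2, v_2): \qquad e_1 \text{ synchronous with } e_2 \iff t_1 = t_2$$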

Apart from ForSyDe, there are several languages that describe synchronicity, such as Lustre [Halbwachs et al., 1991], Esterel [Berry and Gonthier, 1992] or Argos [Maraninchi, 1991]. These languages describe events tagged as either present (⊤) or absent (⊥). A key property is that the order of these event tags is absolute and unambiguous.

2.3 System modeling in ForSyDe-SystemC

The following section is based on material found in [ForSyDe, 2013]. This report assumes that the reader is familiar with programming in C++, understanding XML, and using the SystemC platform. For a comprehensive SystemC tutorial, the reader is encouraged to consult [ASIC World, 2013].


All ForSyDe-SystemC elements are implemented as classes inside the ForSyDe namespace. Each element belongs to a MoC, which is in fact a sub-namespace of the ForSyDe namespace. For example ForSyDe::SY holds all elements (processes, signals, DIs) belonging to the SY MoC.

2.3.1 Signals

Signals are bound to an input or an output of a ForSyDe process. They are typed and can be defined as belonging to a MoC by using their associated class from the respective MoC namespace. For the SY MoC there is a helper (template) class abst_ext<T> which is used to represent absent-extended values. Absent-extended values can be either absent or present with a value of type T. Listing 2.1 defines a signal of the SY MoC called my_sig which carries tokens of type abst_ext<double>.

ForSyDe::SY::SY2SY<double> my_sig;

Listing 2.1: ForSyDe-SystemC signal definition
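As an illustrative aside (ours, assuming only the abst_ext interface visible in Listing 2.2 below), absent-extended tokens could be created and unwrapped as follows:

using namespace ForSyDe::SY;   // as in Listing 2.3

abst_ext<double> present_tok(3.14);  // a present token carrying the value 3.14
abst_ext<double> absent_tok;         // default-constructed, assumed absent

// from_abst_ext(d) unwraps the carried value, falling back to d when the token is absent:
double v1 = present_tok.from_abst_ext(0.0);  // yields 3.14
double v2 = absent_tok.from_abst_ext(0.0);   // yields the fallback value 0.0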

2.3.2 Processes

Leaf processes are created using process constructors. Process constructors are templates provided by the library that are parameterized in order to create a process. The parameters to a process constructor can be initial values (e.g., initial states) or functions. From the C++ point of view, creating a leaf process out of a process constructor is equivalent to instantiating a C++ class and passing the required parameters to its constructor.

void mul_func(abst_ext<int>& out1,
              const abst_ext<int>& a, const abst_ext<int>& b)
{
    int inp1 = a.from_abst_ext(0);
    int inp2 = b.from_abst_ext(0);

#pragma ForSyDe begin mul_func
    out1 = inp1 * inp2;
#pragma ForSyDe end
}

Listing 2.2: ForSyDe-SystemC leaf process function definition

Listing 2.2 shows an example of defining a process constructor's associated function. It looks like a regular C++ function definition, but there are a few particularities that have to be taken into account:

• The function header contains the function name and the function parameters, in the order defined in the API (please consult the API documentation in [ForSyDe, 2013]). In this example, the function has two inputs, which have to be declared const, and one output.

• The function body, where one can identify two separate parts: the computation part, between the pragmas, which holds the C function that can be analysed or further mapped to a platform; and the protocol part, outside the pragmas, with the sole purpose of wrapping / unwrapping values into / from absent-extended tokens.


A composite process is the result of instantiating other processes and wiring them together using signals. A set of rules should be respected in order to benefit from ForSyDe features such as formal analysis, composability, etc. Otherwise, the system can still be simulated using the SystemC kernel, but will not be able to follow a design flow. These rules are [ForSyDe, 2013]:

• A composite process is in fact a SystemC module derived from the sc_module class.

• A composite process is the result of instantiation and interconnection of other valid ForSyDe processes; no ad-hoc SystemC processes or modules are allowed.

• Ports of all child processes in a composite process are connected together using signals of the SystemC channel type ForSyDe::[MoC]::[signal] (for example ForSyDe::SY::SY2SY).

• A composite process includes zero or more inputs and outputs of the SystemC port types [MoC]_in and [MoC]_out (for example SY_in and SY_out).

• If an input port of a composite process should be connected to several child processes, an additional fanout process (i.e., ForSyDe::SY::fanout) is needed in between.

#ifndef MULACC_HPP
#define MULACC_HPP

#include <forsyde.hpp>
#include "mul.hpp"
#include "add.hpp"

using namespace ForSyDe::SY;

SC_MODULE(mulacc)
{
    SY_in<int> a, b;
    SY_out<int> result;

    SY2SY<int> addi1, addi2, acci;

    SC_CTOR(mulacc)
    {
        make_comb2("mul1", mul_func, addi1, a, b);

        auto add1 = make_comb2("add1", add_func, acci, addi1, addi2);
        add1->oport1(result);

        make_delay("accum", abst_ext<int>(0), addi2, acci);
    }
};

#endif

Listing 2.3: ForSyDe-SystemC composite process declaration


SC_MODULE(top)
{
    SY2SY<int> srca, srcb, result;

    SC_CTOR(top)
    {
        make_constant("constant1", abst_ext<int>(3), 10, srca);

        make_source("siggen1", s_func, abst_ext<int>(1), 10, srcb);

        auto mulacc1 = new mulacc("mulacc1");
        mulacc1->a(srca);
        mulacc1->b(srcb);
        mulacc1->result(result);

        make_sink("report1", report_func, result);
    }
};

Listing 2.4: ForSyDe-SystemC testbench
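For completeness, a testbench like this is typically driven from the standard SystemC entry point; a minimal sketch (ours, not shown in the thesis):

#include <systemc>

int sc_main(int argc, char* argv[])
{
    top top1("top1");     // instantiate the testbench top module from Listing 2.4
    sc_core::sc_start();  // run the simulation until no events remain
    return 0;
}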

2.3.3 Testbenches

There are processes in each MoC that only produce / consume values and can be used for testing purposes. As seen in Listing 2.4, the testbench can be seen as a top module which connects the design under test (DUT, in this case the mulacc composite process) with these source / sink processes.

2.3.4 Intermediate XML representation

ForSyDe's introspection feature enables it to extract structural information from the SystemC project files and encapsulate it in an XML format. The XML files represent an intermediate format that will further be fed to the system design flow, and they capture essential structural information. This information can be easily accessed, analyzed and modified (refined) by an automatic process. To enable introspection, one has to invoke the ForSyDe::XMLExport::traverse function to traverse the DUT's top module at the start of the simulation, and to compile the design with the macro FORSYDE_INTROSPECTION defined. Listing 2.5 shows the syntax to enable the introspection feature, while Listing 2.6 shows an example XML output.

#ifdef FORSYDE_INTROSPECTION
void start_of_simulation()
{
    ForSyDe::XMLExport dumper(" ");
    dumper.traverse(this);
}
#endif

Listing 2.5: ForSyDe-SystemC introspection


<?xml version="1.0"?>
<!-- Automatically generated by ForSyDe -->
<!DOCTYPE process_network SYSTEM "forsyde.dtd">
<process_network name="CombSubMul">
  <port name="port_0" type="int" direction="in" bound_process="sub1" bound_port="port_0"/>
  <port name="port_1" type="int" direction="in" bound_process="y_fanout" bound_port="port_0"/>
  <port name="port_2" type="int" direction="out" bound_process="mul1" bound_port="port_2"/>
  <signal name="fifo_0" moc="sy" type="int" source="sub1" source_port="port_2" target="mul1" target_port="port_0"/>
  <signal name="fifo_1" moc="sy" type="int" source="y_fanout" source_port="port_1" target="sub1" target_port="port_1"/>
  <signal name="fifo_2" moc="sy" type="int" source="y_fanout" source_port="port_1" target="mul1" target_port="port_1"/>
  <leaf_process name="y_fanout">
    <port name="port_0" type="int" direction="in"/>
    <port name="port_1" type="int" direction="out"/>
    <process_constructor name="fanout" moc="sy"/>
  </leaf_process>
  <composite_process name="sub1" component_name="sub">
    <port name="port_0" type="int" direction="in"/>
    <port name="port_1" type="int" direction="in"/>
    <port name="port_2" type="int" direction="out"/>
  </composite_process>
  <composite_process name="mul1" component_name="mul">
    <port name="port_0" type="int" direction="in"/>
    <port name="port_1" type="int" direction="in"/>
    <port name="port_2" type="int" direction="out"/>
  </composite_process>
</process_network>

Listing 2.6: ForSyDe-SystemC intermediate XML format


Chapter 3

Understanding Parallelism

This chapter aims at tackling the problem that has arisen due to the abrupt leap from the industry standard of single-processor sequential computation to many-core parallel computation. First, a short background will attempt to place the current problem which industry is facing in the context of the many-core era. The second section will propose a theoretical framework defining parallelism, starting from Kleene's computational model. The third and fourth sections will describe Berkeley's view of the parallel problems and its views regarding the design of hardware and software systems. The fourth section will also compare Berkeley's proposed methodologies with ForSyDe, and we will argue why ForSyDe is a proper methodology for designing heterogeneous systems embedding massively parallel many-core processors.

3.1 Parallelism in the many-core era

Today industry is facing an abrupt shift to parallel computing, which it is not yet ready to fully embrace. Over the past decades, the main means of pushing the IT industry forward was either increasing the clock frequency or other innovations that were inefficient in terms of transistors and power but kept the sequential programming model (ILP, deep pipelining, cache systems, etc.) [Hennessy and Patterson, 2011].

During this time, there were several attempts to develop parallel computers, like MasPar [Blank, 1990], Kendall Square [Dunigan, 1992] or nCUBE [Hayes et al., 1986], but they failed due to the rapid increase in sequential performance. Indeed, compatibility with legacy programs, like C programs, was more valuable to industry than new innovations, and programmers accustomed to continuous improvement in sequential performance saw little need to explore parallelism.

However, during the last decade industry reached its most important turning point by hitting the power limit a chip is able to dissipate, called "the power wall" in [Hennessy and Patterson, 2011]. As the International Technology Roadmap for Semiconductors (ITRS) was "replotted" during these years [ITRS, 2005, ITRS, 2007, ITRS, 2011], one could see an increasing discrepancy between earlier clock rate predictions (15 GHz in 2010, judging by the 2005 predictions [ITRS, 2005]) and actual processors' sequential performance (currently Intel products are far below even the conservative 2007 predictions [ITRS, 2007]).

This is an understandable phenomenon due to the sudden changes in conventional wisdom that had to be accepted by the industry. A comprehensive list of old versus new conventional wisdoms can be found in [Asanovic et al., 2006]. Apart from the well-known power wall, memory wall and ILP wall, which together constitute "the brick wall", we can point out the tenth conventional wisdom pair. According to it, programmers cannot rely on waiting for sequential performance increases instead of parallelizing their programs, since it will be a much longer wait for a faster sequential computer.

Thus the current leap to parallelism is not based on a breakthrough in programming or architecture, but "(it) is actually a retreat from the more difficult task of building power-efficient, high-clock-rate, single-core chips" [Asanovic et al., 2009]. Indeed, the current solution for general computing is still replicating sequential processors into multi-cores, which has proven to work for a small number of cores (2 to 16) without drastic changes from the sequential paradigms and way of thinking. But this strategy is likely to face diminishing returns once the number of cores increases beyond 32 [Hennessy and Patterson, 2011], stepping into the many-core domain.

Apart from that, the more pessimistic predictions in [ITRS, 2011] show an increasing discrepancy between the performance users require and the performance devices deliver. Faced with this new knowledge and the new eleventh conventional wisdom, stating that "increasing parallelism is the primary way of increasing a processor's performance", industry must adopt new paradigms and new functional models that maximize productivity in environments with thousands of cores. Asanovic et al. state that the only solution lies in the research community, and that "researchers [have to] meet the parallel challenge".

3.2 A theoretical framework for parallelism

As seen in Section 3.1, the difference between multi- and many-processors is qualitative rather than quantitative. While multi-processors can be regarded as multiple sequential machines extended with scheduling constructs, with parallel execution mainly at program level, many-processors have a completely different founding principle.

Sequential processors have a strong foundation in Turing's computational model, which led to the von Neumann machine. Although this model lasted for more than half a century, it no longer expresses execution platforms naturally. Maliţa et al. say that "Turing's model cannot be used directly to found many-processors" [Maliţa and Ştefan, 2008]. Unfortunately, industry is conservative, and many of the available solutions are rather non-formal extensions of available topologies.

This drawback is compounded by the theoretical weakness of the new domain. Parallel computation is still in its infancy and does not have a theoretical framework of its own that is unanimously accepted by both computer scientists and industry. During the past few decades, several groups of researchers have adopted Kleene's model of partial recursive functions.


3.2.1 Kleene's partial recursive functions

The following subsection is based on material found in [Maliţa et al., 2006, Maliţa and Ştefan, 2009, Maliţa and Ştefan, 2008].

In the same year that Turing published his paper, Kleene published the partial recursive functions. He defines computation using basic functions (zero, increment, projection) and rules (composition, primitive recursiveness and minimization). The main rule is composition, and Figure 3.1 depicts a structure which computes Equation 3.1:

$$f(x_0, x_1, \ldots, x_{n-1}) = g(h_0(x_0, x_1, \ldots, x_{n-1}), \ldots, h_{m-1}(x_0, x_1, \ldots, x_{n-1})) \qquad (3.1)$$

where $h_x$ represents an increment function and $g$ represents a projection function. Both the first and the second level of computing are parallel; the only restriction is that $g$ cannot start the computation before all $h$ have finished.

Figure 3.1: The structure associated with the composition rule (inputs $x_0, x_1, \ldots, x_{n-1}$ feed the blocks $h_0 \ldots h_{m-1}$, whose outputs feed $g$)
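A tiny worked instance of Equation 3.1 (our example, with $n = m = 2$): take $h_0(x_0, x_1) = x_0 + 1$ and $h_1(x_0, x_1) = x_1 + 1$ as increment functions and $g(a, b) = a$ as a projection. Then

$$f(x_0, x_1) = g(h_0(x_0, x_1), h_1(x_0, x_1)) = x_0 + 1,$$

where $h_0$ and $h_1$ can run in parallel, and $g$ fires only once both have finished.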

A Universal Turing Machine is a sequential composition of functions (for example $h_i$); thus the parallel aspect of the computation is lost. The Kleene processes, on the other hand, are inherently parallel: the $h_i$ functions are independent, and $g$ can be independent as well if it works in a pipelined fashion on different input data $x_0, x_1, \ldots, x_{n-1}$. Thus, Kleene's model is a natural starting point for a parallel computation model.

From the general form of composition expressed in Equation 3.1 one can express several simplified forms to describe the other rules:

• data-parallel composition, described by Equation 3.2, is a limit case of Equation 3.1 where $n = m$ and $g$ is the identity function:

$$f(x_0, x_1, \ldots, x_{n-1}) = [h_0(x_0), \ldots, h_{n-1}(x_{n-1})] \qquad (3.2)$$

• serial composition, described by Equation 3.3, is defined for $p$ applications of the composition with $m = 1$, where the function is applied on an input stream $\langle x_0, x_1, \ldots, x_{n-1} \rangle$ in a pipelined fashion.


• reduction composition, described by Equation 3.4, is a special case of Equation 3.1 where $h$ is the identity function and the input vector $[x_0, x_1, \ldots, x_{n-1}]$ is reduced to a scalar:

$$f(x_0, x_1, \ldots, x_{n-1}) = g(x_0, x_1, \ldots, x_{n-1}) = out \qquad (3.4)$$

Figure 3.2: The basic forms of composition: (a) parallel composition, (b) serial composition, (c) reduction composition
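To make the three simplified composition forms concrete, here is a small illustrative sketch (ours, not from the referenced material) expressing each of them over integer data in plain C++:

#include <algorithm>
#include <numeric>
#include <vector>

int h(int x) { return x + 1; }  // a basic increment-style function

// Data-parallel composition (Equation 3.2): apply h independently to every component.
std::vector<int> data_parallel(std::vector<int> xs)
{
    std::transform(xs.begin(), xs.end(), xs.begin(), h);
    return xs;  // [h(x0), ..., h(xn-1)]
}

// Serial composition (Equation 3.3): a pipe of p applications of h on one stream element.
int serial(int x, int p)
{
    for (int i = 0; i < p; ++i)
        x = h(x);
    return x;
}

// Reduction composition (Equation 3.4): reduce the whole vector to a scalar, here with a sum as g.
int reduction(const std::vector<int>& xs)
{
    return std::accumulate(xs.begin(), xs.end(), 0);
}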

The composition rule is strong and natural enough to describe almost all types of data-intensive problems and applications, and can be associated with many implementations. The last two rules, primitive recursiveness and minimization, introduce a higher degree of difficulty and are less natural to associate with structural implementations.

Figure 3.3: Structure for two of Kleene's rules: (a) primitive recursion, (b) minimization

Primitive recursion is described by Equation 3.5 and by a structure like the one in Figure 3.3a. This structure is fully parallel since, apart from the serial composition, it supports speculation at each level of computation through a reduction network. The function has an initial value, described by block H, which feeds the infinite pipeline. The reduction network R inputs an infinite vector of {scalar, predicate} pairs, corresponding to the predicated result of each stage. Thus the result will always return the scalar which is paired with the predicate having value 1.


The minimization rule is described by Equation 3.6: it computes the function $f(x)$ as the minimal value of $y$ for which $g(x, y) = 0$. As with the previous rule, the structure depicted in Figure 3.3b is an example of applying minimization while keeping the concept of ideal parallelism by using speculation. Each block $G$ computes the predicated value and returns a pair of the form $\{i,\, g(x, i) == 0\}$, and the reduction network R extracts the first pair having the predicated value 1 (if any).

$$f(x) = \min_y \left[ g(x, y) = 0 \right] \qquad (3.6)$$
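The following sketch (ours, with an example predicate g) shows the sequential meaning of the minimization rule; the structure in Figure 3.3b evaluates the candidate y values speculatively in parallel instead of one by one:

#include <optional>

int g(int x, int y) { return x - y * y; }  // example predicate: zero exactly when y*y == x

// Returns the minimal y in [0, max_y] with g(x, y) == 0, or nothing if no such y exists.
std::optional<int> minimize(int x, int max_y)
{
    for (int y = 0; y <= max_y; ++y)  // each iteration plays the role of a block G_i
        if (g(x, y) == 0)
            return y;                 // the reduction network R picks the first hit
    return std::nullopt;
}

For example, minimize(9, 10) yields 3, the integer square root of 9.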

3.2.2 A functional taxonomy for parallel computation

Since new mathematical models are emerging to describe parallelism, the huge diversity of solutions involved in actual implementations tends to make the classic computer taxonomies [Flynn, 1972, Xavier and Iyengar, 1998] obsolete.

One such taxonomy, introduced in [Flynn, 1972], describes parallel machines from a structural point of view, where parallelism is symmetrically described using a two-dimensional space: data × programs. Current parallel applications cannot fit in just one of Flynn's categories (for example SIMD or MIMD), since they require more than one type of parallelism.

[Maliţa and Ştefan, 2008] and [Ştefan, 2010] propose a new, more consistent functional taxonomy, starting from the way a function is computed, as presented in Subsection 3.2.1. Thus, five types of parallel computation have been emphasized:

• Data-parallel computation, as seen in the data-parallel composition. It is applied on vectors, and each component of the output vector results from the predicated execution of the same program.

• Time-parallel computation, as seen in the serial composition. It applies a pipe of functions on input streams and, according to [Hennessy and Patterson, 2011], it is efficient if the length of the stream is much greater than the pipe's length.

• Speculative-parallel computation, extracted as a functional approach for solving primitive recursion and minimization. This computation can be described by replacing Equation 3.1 with the limit case in Equation 3.7. It usually applies the same variable to slightly different functions.

• Reduction-parallel computation, deduced from the reduction composition. Each vector component is equivalent with respect to the reduction function.

• Thread-parallel computation is not directly presented in Subsection 3.2.1 but can be deduced by replacing Equation 3.1 with the limit case in Equation 3.8. It also describes the timing behaviour of interleaved threads.

$$h_i(x_1, \ldots, x_m) = h_i(x), \quad g(h_1(x), \ldots, h_m(x)) = \{h_1(x), \ldots, h_m(x)\} \qquad (3.7)$$

$$h_i(x_1, \ldots, x_m) = h_i(x_i), \quad g(h_1(x_1), \ldots, h_m(x_m)) = \{h_1(x_1), \ldots, h_m(x_m)\} \qquad (3.8)$$
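The difference between the two limit cases can be illustrated with a small sketch (ours, with three arbitrary functions): speculative-parallel computation feeds the same input to slightly different functions, while thread-parallel computation gives each function its own input:

#include <array>
#include <functional>

using Fn = std::function<int(int)>;

// Speculative-parallel (Equation 3.7): the same x is applied to all functions.
std::array<int, 3> speculative(int x, const std::array<Fn, 3>& h)
{
    return { h[0](x), h[1](x), h[2](x) };  // {h1(x), h2(x), h3(x)}
}

// Thread-parallel (Equation 3.8): each function works on its own independent input.
std::array<int, 3> threads(const std::array<int, 3>& xs, const std::array<Fn, 3>& h)
{
    return { h[0](xs[0]), h[1](xs[1]), h[2](xs[2]) };  // {h1(x1), h2(x2), h3(x3)}
}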

Based on this taxonomy, the types of computation can be separated into two categories:

• complex computation, where parallelism is tightly interleaved, allowing efficient complex computations;

• intensive computation, where parallelism is strongly segregated, allowing large-sized simple computations. It groups together the data-parallel, time-parallel, speculative-parallel and reduction-parallel computations.

[Maliţa et al., 2006] conclude that "any computation, if it is intensive, can be performed efficiently in parallel".

3.3 Parallel applications: the 13 Dwarfs

The parallel problem has been studied intensely in the last decade by numerous research groups focusing on multi- and many-core processing [Georgia Tech, 2008, Habanero, 2013, PPL, 2013, Illinois, 2013, Par Lab, 2013]. One of the groups involved in this research originates from the University of California, Berkeley, and consists of multidisciplinary researchers. In [Asanovic et al., 2006] and [Asanovic et al., 2009] they discuss an application-oriented approach that treats the parallel problem from different perspectives and at different layers of abstraction.

They motivated this approach by examining parallelism at the extremes of the computing spectrum, namely embedded computing and high performance computing. They argue that "these two ends of the computing spectrum have more in common looking forward than they had in the past" [Asanovic et al., 2006]. By studying the success driven by parallelism for many of the applications, it is possible to synthesize feasible and correct solutions based on application requirements. Thus, the main approach is to "mine the parallelism experience" to get a broader view of the computation mechanisms.

Also, since parallelism is not yet clearly described by formal means, benchmarking programs cannot be used as measurements of innovation. Asanovic et al. argue that "there is a need to find a higher level of abstraction for reasoning about parallel application requirements" [Asanovic et al., 2006]. This point is valid, judging from the experience of successfully mapping high-performance scientific applications onto embedded platforms.

Extending the work of Phil Colella [Colella, 2004], the research team grouped similar applications into thirteen "dwarfs": equivalence classes based on similarity of computation and data movement, identified by studying programming patterns. They are presented in Table 3.1.

1. Dense Linear Algebra. Data: dense matrices or vectors; communication pattern: memory strides. Usually vector-vector, matrix-vector and matrix-matrix operations. Application examples: Block Tridiagonal Matrix, Symmetric Gauss-Seidel. Hardware: vector computers, array computers.

2. Sparse Linear Algebra. Data: compressed matrices; communication pattern: indexed loads/stores. Data includes many zero values, compressed for low storage and bandwidth. Application example: Conjugate Gradient. Hardware: vector computers with gather/scatter.

3. Spectral Methods. Data: frequency domain; communication pattern: multiple butterfly patterns. Combination of multiply-add operations and specific data permutations. Application example: Fourier Transform. Hardware: DSPs, Zalink PDSP.

4. N-Body Methods. Data: discrete points; communication pattern: interaction between points. Particle-particle methods, O(N^2); hierarchical particles, O(N log N) or O(N). Application example: Fast Multipole Method.

5. Structured Grids. Data: regular grids; communication pattern: high spatial locality. The grid may be subdivided into finer grids ("Adaptive Mesh Refinement"); transitions between granularities may happen dynamically. Application examples: Multi-Grid, Scalar Pentadiagonal, Hydrodynamics. Hardware: QCDOC, BlueGene/L.

6. Unstructured Grids. Data: irregular grids; communication pattern: multiple levels of memory reference. Location and connectivity determined from neighboring elements. Application example: Unstructured Adaptive. Hardware: Tera Multi Threaded Architecture.

7. MapReduce. Communication pattern: not dominant. Calculations depend on statistical results of repeated random trials; considered embarrassingly parallel. Application examples: Monte Carlo, ray tracer. Hardware: NSF Teragrid.

8. Combinatorial Logic. Data: large amounts of data; communication pattern: bit-level operations. Simple operations on very large amounts of data, often exploiting bit-level parallelism. Application examples: encryption, Cyclic Redundancy Codes, IP NAT. Hardware: hardwired algorithms.

9. Graph Traversal. Data: nodes, objects; communication pattern: many lookups. Algorithms involving many levels of indirection and small amounts of computation. Application examples: route lookup, XML parsing, collision detection. Hardware: Sun Niagara.

10. Dynamic Programming. Solves simpler overlapping sub-problems; used in optimization of problems with many feasible solutions. Application examples: Viterbi decode, variable elimination. Hardware: Dyna.

11. Backtrack and Branch + Bound. Finds optimal solutions by dividing the problem into subdomains and pruning suboptimal sub-problems. Application examples: kernel regression, Network Simplex Algorithm.

12. Graphical Models. Data: nodes. Graphs where random variables are nodes and conditions are edges. Application examples: Bayesian networks, Hidden Markov Models.

13. Finite State Machines. Data: states; communication pattern: transitions. Behavior defined by states, transitions and events. Application examples: PNG, JPEG, MPEG-4, TCP, compilers.

Table 3.1: The 13 Dwarfs of parallel computation

While the first twelve dwarfs show inherent parallelism, the parallelization of the thirteenth constitutes a challenge. The main reason is that it is difficult to split the computation into several parallel finite state machines. Although the Berkeley research group favors excluding the thirteenth dwarf from the parallel paradigm, considering it "embarrassingly sequential", architectures like Revolver [Öberg and Ellervee, 1998], the Integral Parallel Architecture [Ştefan, 2010], or the BEAM [Codreanu and Hobincu, 2010] demonstrate that these problems can successfully be parallelized, as they derive from the complex computation class of Subsection 3.2.2.

3.4 Berkeley's view: design methodology

The following section is based on material found in [Asanovic et al., 2006, Asanovic et al., 2009].

(40)

The following section will presentBerkeley’s view, in comparison to the ForSyDe methodology,

in order to merge these two different schools of thought into an even stronger conceptual foundation. In addition we will try to demonstrate that ForSyDe is a proper methodology to approach even parallel computational problems, not only real-time embedded problems.

3.4.1 Application point of view

Section 3.3 pointed out the need to mine applications that demand more computing power and can absorb the increasing number of cores over the next decades, in order to provide concrete goals and metrics for evaluating progress.

For this purpose, a number of applications are studied and developed based on different criteria: compelling in terms of marketing and social impact, short-term feasibility, longer-term potential, speed-up or efficiency requirements, platform coverage, potential to enable technology for other applications, and involvement in the usage and evaluation of technology. Among these applications are music and hearing, speech understanding, content-based image retrieval, intraoperative risk assessment, parallel browsers, and 3D graphics.

Currently, ForSyDe has a suite of case studies originating in industrial or academic applications; some of them, for example Linescan, focus on industrial control. In the future, its application span could broaden into other user-oriented areas (for example actor-based parallel browsers [Jones et al., 2009]) by studying and following other successful attempts such as Berkeley's. ForSyDe has a good profile for many of the application classes in Table 3.1: since it expresses parallelism inherently, it could fit these classes well.

3.4.2 Software point of view

The Berkeley research group admits that developing a software methodology for bridging the gap between users and the parallel IT industry is the most vexing challenge. One reason is that many programmers are unable to understand parallel software. Another is that both compilers and operating systems have grown so large that they are resistant to change. Moreover, it is not possible to properly measure improvement in parallel languages, since most of them are prototypes that reflect solely the researchers' point of view. Eight main ideas are presented, and they are analysed separately in the following paragraphs.

Idea #1: Architecting parallel software with design patterns, not just parallel programming languages This is the first idea proposed by Asanovic et al. Since "automatic parallelism doesn't work" [Asanovic et al., 2009], they propose to re-architect software through a "design pattern language", explored in earlier works such as [Alexander, 1977, Gamma et al., 1993, Buschmann et al., 2007]. The pattern language is a collection of related and interlocking patterns, constructed such that the patterns flow into each other as the designer solves a design problem. Computational and structural patterns can be composed to create more complex patterns. These are conceptual tools that help a programmer reason about a software project and develop an architecture, but they are not themselves implementation mechanisms for producing code.

In contrast to the pattern-language approach, by using formalism as a starting point in its design methodology, ForSyDe enforces formal restrictions at early design stages. An argument against using a formal model as a starting point is that it is understood only by a narrow group of researchers, and it limits expressiveness in designing solutions, at least for the uninitiated.

We argue that this is a common misconception amongst the research groups and has to be overcome in order to take full advantage of both concepts. Case studies have shown that, given structured thinking, the formal constraints do not limit expressiveness. On the contrary, describing computation through processes and signals aids the designer in keeping a clear picture of the entire system.

It is simpler to mask MoC details under a design pattern than to assure the formal correctness of large pattern-based systems. Through such masking, the designer needs only minimal prior knowledge of the mathematical principles behind MoCs, while still respecting the formalism and taking full advantage of it. Thus, ForSyDe could easily be extended into a pattern framework, since it allows composable patterns of process networks. This subject is treated further in Section 5.3 and could be a relevant point of entry for future research.
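
As a minimal illustration of such composable patterns, the following sketch (ForSyDe-Haskell, shallow library; `stage1`, `stage2` and `pipeline` are hypothetical names) builds a two-stage pipeline purely from the `mapSY` process constructor, with ordinary function composition acting as the pattern-composition mechanism:

```haskell
import ForSyDe.Shallow

-- Two stateless synchronous processes, each created with the
-- mapSY process constructor of the synchronous MoC.
stage1, stage2 :: Signal Int -> Signal Int
stage1 = mapSY (* 2)
stage2 = mapSY (+ 1)

-- Composing processes yields a new, larger pattern: a pipeline.
pipeline :: Signal Int -> Signal Int
pipeline = stage2 . stage1

-- e.g. pipeline (signal [1, 2, 3]) evaluates to {3,5,7}
```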

Idea #2: Split productivity and efficiency layers, not just a single general-purpose layer Productivity, efficiency and correctness are inextricably linked and must be treated together during the system design stages. They are not a single-point solution, however, and must be addressed in separate layers.

The productivity layer uses a common composition and coordination language to glue together the libraries and programming frameworks produced by the efficiency-layer programmers. Implementation details are abstracted away at this layer. Customizations are made only at specified points and do not break the harmony of the design pattern.

The efficiency layer is very close to machine level, allowing the best possible algorithm to be written in the primitives of the layer. This is the working ground for specialist programmers trained in the details of parallel technology.

This concept is powerfully rooted in the parallel programming community. It explains the multitude of template libraries, DSLs and language extensions for specific parallel platforms (e.g. the ones presented in Section 5.3) that appeared during the last decade.

Although ForSyDe is not merely a programming language but rather a system design methodology, it follows this conceptual pattern. While the ForSyDe-Haskell or ForSyDe-SystemC design frameworks can be associated with the productivity layer, the suite of tools for analysis, transformation, refinement and synthesis can be associated with the efficiency layer. The schema proposed in Appendix B extends this seemingly simple but powerful idea.
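
A minimal sketch of the two-layer split, in plain Haskell with hypothetical names: `dotP` stands in for an efficiency-layer primitive (in a real setting it might wrap a tuned CUDA kernel through the FFI), while `cosineSim` lives in the productivity layer and only glues primitives together:

```haskell
-- Efficiency layer: a tuned primitive. Here it is a plain Haskell
-- stand-in; in practice it could dispatch to a hand-optimized
-- parallel kernel.
dotP :: [Double] -> [Double] -> Double
dotP xs ys = sum (zipWith (*) xs ys)

-- Productivity layer: composes primitives without exposing (or
-- needing to know) how they are implemented.
cosineSim :: [Double] -> [Double] -> Double
cosineSim xs ys = dotP xs ys / (sqrt (dotP xs xs) * sqrt (dotP ys ys))
```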

Idea #3: Generating code with search-based autotuners, not compilers Since compilers have grown so large and are resistant to change, one cannot rely on them to identify and optimise parallel applications. Instead, a useful lesson can be learned from autotuners: optimization tools that generate many variants of a kernel and measure each variant by running it on the target platform. Autotuners are built by efficiency-layer programmers.

ForSyDe tool development should be aware of autotuner mechanisms, since it can benefit from hybrid synthesis methods. One best-practice example is narrowing down the design space, then running several analyses in parallel on virtual platforms with different configurations and choosing the best solution.
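
The following is a minimal sketch of the autotuning loop in Haskell (hypothetical names; it assumes a non-empty variant list, and a production autotuner would also generate the variants and execute them on the actual target platform rather than timing CPU work):

```haskell
import Control.DeepSeq (NFData, force)
import Control.Exception (evaluate)
import Data.List (minimumBy)
import Data.Ord (comparing)
import System.CPUTime (getCPUTime)

-- Time a single kernel variant on a representative input,
-- forcing full evaluation so laziness does not skew the result.
timeVariant :: NFData b => (a -> b) -> a -> IO Integer
timeVariant f x = do
  t0 <- getCPUTime
  _  <- evaluate (force (f x))
  t1 <- getCPUTime
  return (t1 - t0)

-- Search: run every variant and keep the name of the fastest one.
autotune :: NFData b => [(String, a -> b)] -> a -> IO String
autotune variants input = do
  timed <- mapM (\(name, f) -> (,) name <$> timeVariant f input) variants
  return (fst (minimumBy (comparing snd) timed))
```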

Idea #4: Synthesis with sketching This idea encourages programmers to write "incomplete sketches" of programs, in which they provide an algorithmic skeleton and let the synthesizer fill in the holes in the sketch.

As presented in Chapter 2, this is one of ForSyDe's ground rules, embodied in the abstraction of design details. Moreover, in earlier ForSyDe publications, process constructors were referred to as skeletons [Sander and Jantsch, 1999], which directly mirrors the concept of a "sketch".
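
A small sketch of this idea in ForSyDe-Haskell terms (hypothetical names, assuming the shallow library's `mooreSY` constructor): the process constructor fixes the structure, i.e. the sketch, while its arguments are the holes to be filled later by the designer or a synthesizer.

```haskell
import ForSyDe.Shallow

-- The "sketch": a synchronous FSM whose structure is fixed, while
-- the next-state function, output function and initial state are
-- left as holes.
fsmSketch :: (s -> a -> s)   -- hole: next-state function
          -> (s -> b)        -- hole: output function
          -> s               -- hole: initial state
          -> Signal a -> Signal b
fsmSketch = mooreSY

-- Filling the holes yields a complete process, e.g. a saturating
-- counter over its integer inputs.
counter :: Signal Int -> Signal Int
counter = fsmSketch (\s x -> min 100 (s + x)) id 0
```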

Idea #5: Verification and testing, not one or the other The research group advocates modular verification and automated unit-test generation through high-level semantic constraints on the behavior of individual modules (such as parallel frameworks and parallel libraries). They identified this as a challenging problem, since most programmers find it convenient to specify local properties using assert statements or static program verification; as a consequence, these programmers would have a hard time adapting to the high-level constructs. Since the ForSyDe methodology starts from a formal, correct-by-design specification and reaches an implementation mostly through semantic-preserving constraints, this problem no longer applies. Validation may be elegantly taken care of by the design's formalism, while early-stage testing can be achieved by executing the model [Attarzadeh Niaki et al., 2012].
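
As a minimal sketch of early-stage testing by model execution (ForSyDe-Haskell, with QuickCheck assumed as one possible test harness; the names are hypothetical), an executable process can be checked against a plain functional reference on arbitrary inputs:

```haskell
import ForSyDe.Shallow
import Test.QuickCheck (quickCheck)

-- The executable model under test: a stateless synchronous process.
scaler :: Signal Int -> Signal Int
scaler = mapSY (* 2)

-- Property: executing the model agrees with a functional reference.
prop_scaler :: [Int] -> Bool
prop_scaler xs = fromSignal (scaler (signal xs)) == map (* 2) xs

main :: IO ()
main = quickCheck prop_scaler
```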

Idea #6: Parallelism for energy efficiency Using multiple cores to complete a task can lower energy consumption compared to running it on a single fast core [Hennessy and Patterson, 2011]. Several mechanisms are recommended, such as task multiplexing, the use of parallel algorithms to amortize instruction delivery, and message passing instead of cache coherency.

A number of projects related to ForSyDe address power estimation in system design [Zhu et al., 2008, Jakobsen et al., 2011]. Since energy is a pressing issue especially in embedded systems, this problem will remain a main topic for future ForSyDe research.

Idea #7: Space-time partitioning for deconstructed operating systems A spatial partition contains the physical resources of a parallel machine. Space-time partitioning virtualizes spatial partitions by time-multiplexing whole partitions onto the available hardware. As seen in [Ştefan, 2010], the partitioning can be done at a low instruction level or, as the Berkeley research group proposes, at a "deconstructed OS" level.

Currently there is no OS support in ForSyDe, but implementing it could be seen as a mapping problem: the temporal dimension has to be described along with the partitioning of tasks onto resources. Therefore, it counts as a design space exploration problem and could be a relevant future research topic.
