
Hardware/Software Partitioning of Dataflow Programs

Rapid Prototyping of Computer Systems in the CAL Actor Language

BORIS TRASKOV

Master’s Thesis at School of ICT, Royal Institute of Technology
Supervisors: Johan Eker and Carl von Platen (Ericsson AB)

Examiner: Ingo Sander (KTH)


Abstract

Dataflow programming is emerging as a promising technology for programming parallel systems, such as multicore CPUs and FPGAs. Development for parallel hardware is very different from development for single-core systems and calls for new tools and languages.

Within the Ptolemy II project a new language - the CAL Actor Language - has been invented for the specification of dataflow programs. Subsequently Xilinx and Ericsson Research have been developing tools for this dataflow language, in particular a compiler and a runtime environment. Meanwhile the language was adopted by the ISO MPEG body as a specification language for video (de)coding algorithms (RVC-CAL).


Contents

Contents iv

List of Figures vii

Nomenclature ix

1 Introduction 1
1.1 Background . . . 1
1.2 Problem Statement . . . 2
1.3 Goals . . . 2
1.4 Use Cases . . . 2
1.5 Methods . . . 3

2 Dataflow Models of Computation 5
2.1 General Terms and Properties . . . 6

2.2 Expressiveness of Dataflow Models . . . 7

2.3 Single-Rate Dataflow . . . 7

2.4 Multi-Rate Dataflow . . . 8

2.4.1 Synchronous Dataflow . . . 8

2.4.2 Cyclo-Static Dataflow . . . 9

2.4.3 Dynamic Dataflow . . . 9

2.5 Kahn Process Networks . . . 11

3 Dataflow Languages 13
3.1 StreamC/KernelC . . . 13

3.2 StreaMIT . . . 14

3.3 CAL Actor Language . . . 15

3.3.1 Actors . . . 16

3.3.2 Channels . . . 17

3.3.3 CAL Syntax . . . 17

3.3.4 Software Code Generation . . . 20

3.3.5 HDL-Code Generation . . . 22

3.3.6 Discussion . . . 24


4 Related Work 27

4.1 Ptolemy Project . . . 27

4.2 ACTORS project . . . 29

4.2.1 CAL Mapping Tools . . . 29

4.2.2 Reconfigurable Video Coding . . . 30

4.3 Cosyma and Vulcan . . . 31

4.4 Formal System Design . . . 32

5 Technical Implementation 33
5.1 Development System . . . 33
5.2 Hardware . . . 33
5.2.1 FPGA Platform . . . 34
5.2.2 CPU/Accelerator Interface . . . 34
5.3 Software . . . 35
5.3.1 Runtime System . . . 35

5.3.2 PLB Bus System Actors . . . 36

5.3.3 UART System Actors . . . 37

5.3.4 Ethernet System Actors . . . 38

5.3.5 Host-Mapped Software . . . 38

5.3.6 Device Drivers . . . 38

5.4 Testbench and System-level Overview . . . 39

6 Codesign Methodology 41
6.1 Existing CAL Workflow . . . 41

6.1.1 System Specification . . . 41

6.1.2 Flattening . . . 41

6.1.3 Constant Propagation and Single Static Assignment Generation . . . 43
6.1.4 Actor Translation . . . 43

6.1.5 Network and Schedule Generation . . . 43

6.2 Extended Codesign Workflow . . . 43

6.2.1 Partitioning . . . 43

6.2.2 Splitting XDF . . . 45

6.3 Comparison to the Methodology proposed in CAL Mapping Tools . . . 47
6.4 File Formats . . . 48

6.4.1 Source files . . . 48

6.4.2 Build Targets . . . 48

6.4.3 Additional Hardware Backend Build Targets . . . 49


8 Proof of Concept 53
8.1 Receiver Demonstrator . . . 53
8.1.1 Functional Description . . . 53
8.2 Throughput Evaluation . . . 58

9 Conclusions 61
9.1 Codesign Methodology . . . 63

9.2 One Language, One Style . . . 63

9.3 State of Tools . . . 63

9.4 Performance and Scalability . . . 64

9.5 Future Work . . . 64

9.5.1 Threaded Actors . . . 64

9.5.2 Non-blocking Access to Accelerator . . . 65

9.5.3 Tool Development: Intermediate Representation . . . 65

9.5.4 Tool Development: VLIW compiler for a CAL-subset . . . . 65

9.5.5 Tool Development: Actor Merging . . . 66

9.5.6 CPU-Accelerator Partitioning: Plausible Partitioning Problems . . . 66
9.5.7 Heterogeneous Partitioning: A New-Old Approach . . . 66

9.5.8 Hardware Partitioning: GALS . . . 66

9.6 Final Remarks . . . 67

Bibliography 69

A Codesign Howto 71


List of Figures

2.1 Example of a Dataflow Network Graph . . . 6

2.2 Expressiveness of Models of Computation . . . 7

2.3 SRDF: Singular production and consumption rates . . . 8

2.4 SDF: Production and Consumption Rates . . . 8

2.5 CSDF: Cyclic token production and consumption rates . . . 9

2.6 Token production and consumption rates in DDF . . . 10

3.1 StreaMIT primitives for system composition . . . 15

3.2 Graphical representation of a CAL Actor . . . 16

3.3 Graphical representation of a CAL Channel . . . 16

3.4 Unicast channel (left) and multicast channel (right) . . . 18

3.5 Finite State Machine for a deterministic execution of actions . . . 20

3.6 Multicast implementation in software . . . 22

3.7 VHDL top-level netlist with port descriptors . . . 23

3.8 Multicast implementation in hardware . . . 23

4.1 A simple SDF system within the graphical environment in Ptolemy II . . . 28
4.2 Example of a RVC-based video-decoder . . . 31

5.1 Overview of the Development System . . . 33

5.2 Overview of the Hardware System . . . 34

5.3 Schematic of the developed master accelerator architecture . . . 35

5.4 Block diagram of the CAL software architecture . . . 35

5.5 FIFO states . . . 37

6.1 Original workflow . . . 42

6.2 Extended Codesign Workflow . . . 44

6.3 Target partitioning . . . 45

6.4 Effects of partitioning . . . 46

8.1 Block diagram of the LTE receiver . . . 54

8.2 Input test vector of 10 samples before and after synchronization . . . 55
8.3 FFT-block output . . . 56


Nomenclature

ACTORS Adaptivity and Control of Resources in Embedded Systems

API Application Programming Interface

ART ACTORS Runtime

ASIC Application Specific Integrated Circuit

BSDL Bitstream Syntax Description Language

CAL CAL Actor Language

CCB Channel Control Block

CPU Central Processing Unit

CSDF Cyclo-Static Dataflow

CT Continuous Time

DDF Dynamic Dataflow

DDL Decoder Description Language

DE Discrete Event

Demux De-Multiplexer

DMA Direct Memory Access

EDSL Embedded Domain-Specific Language

FIFO First-In First-Out Buffer

ForSyDe Formal System Design

FPGA Field Programmable Gate Array

FSM Finite State Machine

FU Functional Unit


HDL Hardware Description Language

HW Hardware

IFAI Input FIFO Accelerator Interface

IFBI Input FIFO Bus Interface

IO Input/Output

KPN Kahn Process Networks

LE Logical Element

LGDF Large-Grain Dataflow

MoC Model of Computation

MP3 MPEG Layer-3

MPEG Moving Pictures Expert Group

MRDF Multi-Rate Dataflow

Mux Multiplexer

NL Network Language

OFAI Output FIFO Accelerator Interface

OFBI Output FIFO Bus Interface

OFDM Orthogonal Frequency-Division Multiplexing

OpenDF Open Dataflow

PCI Peripheral Component Interconnect

PLB Processor Local Bus

PN Process Network

POSIX Portable Operating System Interface

PPC PowerPC Processor

RAM Random Access Memory

RTL Register Transfer Level

RVC Reconfigurable Video Coding


SDF Synchronous Dataflow

SR Synchronous Reactive

SRDF Single-Rate Dataflow

SW Software

UART Universal Asynchronous Receiver and Transmitter

USB Universal Serial Bus

VHDL VHSIC Hardware Description Language


Chapter 1

Introduction

In this thesis a generic framework for the design of computer systems is proposed that uses the CAL Actor Language. This enables rapid prototyping of signal processing systems on platform FPGAs without the necessity of a separate hardware description language.

1.1 Background

The problem of hardware/software codesign can be formulated as an optimization problem with constraints on performance and cost.

Find a computing solution that fulfills the requirements on execution time, response time, throughput, and/or manufacturing cost for a given algorithm or program.

The problem statement is a simple two-liner, but raises three problem areas. The first area concerns the computing solution. The overwhelming span of today’s ready-made microprocessor systems, cache types and sizes, and interconnection fabrics opens up a huge design space, which can be narrowed down by a system designer’s intuition and experience. For multi-core or ASIC-accelerated systems, the interaction, scheduling and partitioning of the algorithm becomes a nontrivial problem that requires deep understanding of every single aspect of the system.

The second problem area is the exact prediction or measurement of the key constraints: execution time, response time, bandwidth and manufacturing cost. Running test cases with predetermined input patterns is a straightforward approach for measuring these parameters. Unfortunately, the problem is only shifted to the definition of the test cases, which may or may not represent a worst-case workload.

Prediction without measurement, on the other hand, avoids the need to run test cases and traces, but requires a very deep understanding of the methods used and their limitations.

The last problem area is the algorithm representation itself. It may be given in the form of a sequential (possibly C/C++) program (or a set thereof), a legacy hardware IP, or a mathematical formula. This work strives to open up the HW/SW codesign field to a new method of defining the initial algorithm - Dataflow Programming.

1.2 Problem Statement

Two tools for C-code generation (OpenDF) and HDL-code generation (OpenForge) from CAL dataflow programs have been developed by Xilinx and are now freely available as open source software. The process of integrating the produced software and hardware is a tedious, manual task and involves a deep understanding of FPGA design methodology. In order to facilitate the use of CAL for codesign experiments, an automated workflow is desirable.

1.3 Goals

The first primary goal of this thesis is the development of a toolchain for HW/SW codesign, based on the CAL Actor Language. It shall leverage the existing tools for pure software and pure hardware synthesis, OpenDF and OpenForge, respectively. The implementation shall involve a customized ACTORS runtime, targeting Xilinx’s ML510 development board, and support a single CPU (PowerPC440).

The second primary goal is to demonstrate the toolchain by using it to synthesize a CAL program. The lessons learned from this example shall be incorporated into the codesign methodology.

Thus, the deliverable of this thesis consists of the codesign toolchain and the report including documentation of all design activities and decisions, such that the results can be reused and the design choices understood.

1.4 Use Cases


1.5 Methods

• First implementation of a simple working prototype methodology.
• Set up of regression tests based on CAL programs.


Chapter 2

Dataflow Models of Computation

The term model of computation (MoC) describes the abstract paradigm that a computer language adheres to. It is easiest to understand the term by example. For this we consider three common models of computation. An imperative program expresses exactly which instructions are to be executed in which order (e.g. C and Java). It is a straightforward cooking recipe. A declarative program approaches the problem from the other side. It provides a description of the desired outcome (e.g. SQL). In a functional program the concept of state is avoided. Computation is described in the form of hierarchical or recursive function calls (e.g. Haskell).

These three sample paradigms provide a feeling for the term model of computation. A multitude of models of computation has been developed. Each model has strengths and weaknesses and thus lends itself to a different area of problems. In general one cannot say which model of computation is best without knowing the problem to be solved. If an algorithm is textually described as a sequence of steps, then an imperative language is best suited. Similarly, if the initial algorithm is a mathematical formula, then a functional language is the obvious choice.

The Dataflow model of computation is particularly well suited if the initial problem statement is given (or can easily be described) as a diagram of “connected boxes”. Such a description is very common in signal processing applications, which makes them natural candidates for the dataflow modelling approach.

A “diagram with connected boxes” by itself has no semantics. Therefore it is mandatory to establish semantics either through textual annotations in each dataflow diagram or by a predefined set of rules, which leaves no margin for interpretation. The following is a synopsis of the general terms and properties in the dataflow model of computation. Since Dataflow comes in a variety of different flavors, each with its own distinctive set of features and limitations, these individual properties are discussed subsequently.


2.1 General Terms and Properties

The diagram in figure 2.1 depicts a dataflow program, also called dataflow network or dataflow graph. This graph consists of nodes which are connected by directed arcs. A node represents a computational kernel - the primitive entity that acts on data. Hence a node is called an Actor. Arcs represent Channels of communication between Actors. The idea behind such a representation is that Actors are to a large extent independent of each other. Thus, one Actor by itself can be described and analyzed independently of any other Actors in the network.


Figure 2.1. Example of a Dataflow Network Graph

Dataflow programs are event-driven. In this case event-driven means that Actors may execute their functionality when they receive data through their inputs. This execution is called a firing of an Actor. Firing eligibility may depend on the availability of input data Tokens - the basic unit of information that is exchanged between two Actors. It may as well depend on a Token’s value, but also on the Actor’s internal state.

Channels function as a buffer between output and input ports. They store incoming Tokens until they are consumed by the receiving Actor when it fires. As a consequence of a firing the Actor may change its internal state and produce Tokens at its output ports.

Some additional definitions will help to describe dataflow models of computation. First, Tokens that are available from the beginning are called initial tokens. Second, the number of tokens that is consumed by a node during one firing is called consumption rate, and the number of tokens produced by a node during one firing is called production rate. Details can be reviewed in [1] and [2].


Figure 2.2. Expressiveness of Models of Computation

can be modelled in lower classes but not vice versa. Therefore, lower classes are less expressive and have more limitations.

2.2 Expressiveness of Dataflow Models

Depending on the assumptions and limitations which a model of computation bears, the expressiveness of the model may vary. Furthermore, this also influences the degree of analyzability. Strict and constraining assumptions tend to yield analyzable models of computation, which in turn are limited in the scope of algorithms they can model.

The following discussion provides an idea of the well-known models of computation with their basic properties. The analysis runs from the least expressive model (heavily constrained and simple to analyze) to the most expressive (hardly constrained and difficult to analyze), as depicted in figure 2.2.

2.3 Single-Rate Dataflow

Single-Rate Dataflow (SRDF) is the most limited class in the range of dataflow


Figure 2.3. SRDF: Singular production and consumption rates

2.4 Multi-Rate Dataflow

Multi-Rate Dataflow (MRDF) relaxes the requirements on production and consumption rates compared to Single-Rate Dataflow. In this family, production and consumption rates can be different for each actor and port. Depending on further rate constraints the model is categorized as Synchronous Dataflow, Cyclo-Static Dataflow or Dynamic Dataflow.

2.4.1 Synchronous Dataflow

Synchronous Dataflow (SDF), as explained in [1], is a dataflow model of computation that can be scheduled statically over one or multiple processors. The word synchronous in the model’s name means that consumption and production rates do not change during the entire runtime, i.e. they are a constant number each time a node fires. Hence, the number of tokens consumed or produced each time is independent of the internal state of the node and of the input tokens’ values.

In addition, a delay d on an arc from node I to node J means that the (n-d)th sample produced by node I will be consumed as the nth sample by node J. In order to realize this delay, initial data needs to be available on the corresponding edge. Two nodes that are connected with a delay element are shown in figure 2.4.


Figure 2.4. SDF: Production and Consumption Rates


In any implementation of dataflow programs the overhead of Actor scheduling and communication via channels is considerable - especially in parallel execution on multiple processors, due to interprocessor communication time. In order to alleviate this overhead, a coarser granularity can be used instead of atomic nodes. Whereas the latter are elementary and indivisible, a coarse-grained node incorporates more functionality within a single node. If an SDF graph is made more coarse-grained, this technique is called Large-Grain Dataflow (LGDF) representation.

SDF is one of the most well-understood dataflow models of computation.

2.4.2 Cyclo-Static Dataflow

Cyclo-Static Dataflow (CSDF) is a model of computation which was proposed for the first time in [2]. Extending SDF, it provides for cyclically changing consumption and production rates. The limitation is that the consumption and production rates must follow a constant, predefined pattern for the entire runtime. Thus, production and consumption rates of an actor are directly dependent on the number of previous firings of this particular actor.


Figure 2.5. CSDF: Cyclic token production and consumption rates

Production and consumption rates of an execution sequence can be formalized as shown in figure 2.5. Here the sequence of production rates of Actor I is denoted by Pi and the sequence of consumption rates of Actor J is denoted by Cj. Pi has the period Q and Cj has the period S. At its nth firing, Actor I thus produces P(n mod Q) tokens and Actor J consumes C(n mod S) tokens, respectively.
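Stated compactly in the notation of the figure (this is only a restatement of the rule above, not an addition):

    P_I(n) = P_(n mod Q),        C_J(n) = C_(n mod S)

where P_I(n) is the number of tokens produced by Actor I and C_J(n) the number of tokens consumed by Actor J at their respective nth firing.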

The motivation for CSDF graphs stems from the fact that in some cases the property of static production and consumption rates yields particularly long periodic admissible schedules [3]. This may be alleviated by transforming the SDF graph into a CSDF graph.

2.4.3 Dynamic Dataflow


state without consuming any Tokens, thus exhibiting non-predictability.


Figure 2.6. Token production and consumption rates in DDF

Because of this data-dependency, there is not enough information available to schedule a DDF model statically. Instead, a dynamic scheduling algorithm must be used for any implementation of this model of computation. Hence, in any implementation of this MoC there is a nontrivial scheduling overhead at runtime. Furthermore, it is usually impossible to calculate minimum buffer sizes that are needed for an infinite number of iterations. Another detrimental effect of data-dependency is that unpredictable deadlocks may occur.

A basic example for this dataflow class is a best-effort multiplexer, which forwards any of its input ports to its output port as soon as data arrives. Even in such a simple Actor the consumption rates are already dynamic and timing-dependent. This illustrates why, despite all these limitations in analyzability, dynamic dataflow is often the only suitable model for many problems, because of the wide range of supported firing conditions.

Scenario-Aware Dataflow

The poor analyzability and thus schedulability of dynamic dataflow has encouraged research towards heuristic scheduling approaches for dataflow models [4]. The main observation that facilitates analysis of dynamic dataflow graphs is that very often the input data streams follow a strictly predefined pattern. Knowing this pattern or scenario, one can (by simulation) find partial firing schedules for subnetworks or even a full firing schedule at compile time.


2.5 Kahn Process Networks

An early model of computation aiming at specifying parallel programs was formulated by Gilles Kahn in [5]. This model, subsequently referred to as Kahn Process Network (KPN), describes an algorithm as a set of processes that communicate with each other via unidirectional Channels. It exhibits dynamic behavior like the dataflow MoC, but is described here in detail due to its wide recognition.

A KPN program consists of three parts:
• Channel declaration and instantiation
• Process declaration
• Process binding

In the channel declaration and instantiation section all Channels are listed with identifier and type, as follows:

<TYPE> channel <ID>

The process declaration section specifies for each parallel process its identifier and signature. The signature features parameters and provides formal (local) identifiers for input and output Channels as well as the associated types:

Process <ID> (<TYPE> <ID>, ..., <TYPE> [in|out] <ID>, ...)

A process’ body constitutes a sequential program with a syntax which is close to the programming language Algol. Local variable declarations and initializations stand at the beginning of the body, followed by an endless loop of statements. Throughout the endless loop a process can invoke blocking reads from any single input channel and non-blocking writes to any output channel.

The last section, process binding, binds parameters and formal identifiers of the processes’ signatures to constants and actual Channel identifiers, respectively.


Still, the historical strength of this model of computation is to abolish global state, i.e. separate processes can communicate exclusively via those channels that connect them.


Chapter 3

Dataflow Languages

A language in general terms is the means by which a computer program is expressed. While most common languages like C or Java are general purpose languages, there are also numerous domain-specific languages, which target a certain area of computational problems. Dataflow languages are developed to facilitate the input of Actor networks by the programmer and to support the programmer by means of tailor-made syntax and semantics. The following is an assembly of three languages that adhere to the dynamic dataflow model (DDF) of computation.

3.1 StreamC/KernelC

StreamC and KernelC have been developed as symbiotic programming languages for the Imagine signal processing platform at Stanford [6]. In [7] it has been demonstrated that StreamC/KernelC can alleviate the high design complexity of FPGA-targeted signal processing applications. For the sake of simplicity both languages will be referred to as StreamC in the following paragraphs.

StreamC is a set of macros and functions that build on top of the C language. The main units of abstraction in StreamC are processes and streams which are roughly equivalent to Actors and Channels in CAL.

A process is a sequential program reading from input streams and writing to output streams. The sequential program’s body is written in C, but exclusively uses the StreamC API to access data streams. Which C features are available in the process’ body depends on whether the process is mapped on a CPU or on FPGA fabric. In the first case standard C functions, e.g. file I/O, are supported, as well as manipulation of global state variables. In the latter case the expressiveness is limited and only local data can be accessed.

A stream is defined as a set of records of the same type. The Imagine system counts the references to these records and frees the corresponding memory space when a record is not referenced anymore.

It is noteworthy that StreamC is - similarly to the well-known SystemC - a mere set of C libraries and macros. Thus the developer deals with the C language directly utilizing predefined functions and macros for the specification of channels and computational kernels.

3.2 StreaMIT

StreaMIT is a language particularly designed with stream processing in mind, such as signal and media processing. This section provides an overview of the language as described in [8] and [9]. For a complete overview the interested reader can refer to the project’s homepage [10]. The naming conventions differ slightly from the previously presented work: the counterparts of Actors and Channels in CAL are filters and bands in StreaMIT.

Filters can have dynamic data consumption and production rates (DDF), even though static rates are typical (SDF). In the latter case the StreaMIT compiler supports the generation of static schedules and static optimizations. Systems that have at least one filter with dynamic data rates require a runtime environment which handles scheduling. The main function of a StreaMIT filter is the work function and looks like this:

work push [number] pop [number] peek [number] { ... }

The function is declared by the keyword work, followed by three parameters. The first two parameters, push and pop, declare the data production and consumption rates, respectively. The third parameter peek declares how many tokens of lookahead are required by the filter. These need to be preloaded without being consumed, hence the name. Peek makes it easy to model a very common feature in streaming applications, namely a sliding window.


Bands (imagined as tapes in a Turing machine or FIFOs) are the dominant, but not the exclusive, means of communication between filters in StreaMIT. A built-in special-purpose message passing interface termed teleport messaging provides communication which is synchronized to data elements in bands. This means that even though teleport messages are not part of the actual data stream, they are timed such that their arrival is tied to a certain data Token. This facilitates the transfer of control information between filters.

StreaMIT supports three primitives for the description of structure, which can be used in a hierarchical way to compose systems as depicted in figure 3.1. The naming of these primitives is straightforward: A pipeline is a sequential composition of filters. The split-join represents concurrent processing of the same input data. Feedback-loops are modelled as a special case of the split-join, in which the positions of split and join are exchanged and the data direction in one branch is reversed.


Figure 3.1. StreaMIT primitives for system composition

3.3 CAL Actor Language


Figure 3.2. Graphical representation of a CAL Actor

Figure 3.3. Graphical representation of a CAL Channel

3.3.1 Actors

An Actor as shown in figure 3.2 is described by the following four qualities that will later be discussed in more detail:

• Number and type of its input and output ports
• Internal state
• Set of actions that can be executed
• Set of rules (priorities, state transition definitions) that govern the execution of Actions.

An Actor’s state is encapsulated entirely within the Actor. In CAL, state can be modelled with:

• Actor-wide variables

• State variables in Finite State Machines (FSM)

CAL offers two methods to model state transformations. Firstly, actions can perform transformations on internal variables of the actor during their firing. Secondly, the firing of an action itself can cause the (optional) internal state machine to transition into a different state.


and peek into the input tokens’ values to prevent an execution of the action even though the consumption rules are satisfied. The third mechanism that steers the firing of Actions is priorities. A priority describes which action shall be fired in case multiple actions are eligible to fire.

During a firing an Action may perform any of the following:

• Consume (read) a number of Tokens from the Actor’s input ports

• Execute transformations on the Actors state

• Produce (write) a number of Tokens at the Actor’s output ports

Actions are atomic, which means that once an Action fires no other action can fire until the previous is finished.

CAL’s strength is to describe concurrent systems only based on their dataflow behavior. The Action firing rules are intentionally made in such a way as to allow an underspecification of a system. This means that the evaluation of Actions’ firing eligibility and the Action firing is to a large extent delegated to the implementation of the CAL compiler and runtime system. The motivation behind this design choice is that overspecification of an algorithm can lead to inefficient computing solutions. In particular, the sequential specification of a C program makes it challenging for the C compiler to exploit the potential parallelism.

3.3.2 Channels

Communication between Actors is modeled by passing messages, so-called Tokens, along unidirectional unbounded Channels. Communication on one particular Channel has a total order as in KPN. This means the exact ordering of messages needs to be kept in applications, which is equivalent to unbounded First-In First-Out buffers (FIFO). An Actor’s signature contains a list of input ports and output ports associated with their corresponding type. All of these ports need to be bound to a counterpart after instantiation of the Actor. As depicted in figure 3.4, output ports can have multiple readers, corresponding to multicast message passing, whereas input ports are limited to a single stream from one output port.

3.3.3 CAL Syntax


Figure 3.4. Unicast channel (left) and multicast channel (right)

 

actor Add () A, B ==> Result :
    action A: [a], B: [b] ==> Result: [a + b] end
end

Listing 3.1. Adder

actor Mux () A, B ==> Result :
    first  : action A: [a] ==> Result: [a] end
    second : action B: [b] ==> Result: [b] end
end

Listing 3.2. Non-deterministic Multiplexer

A Basic CAL Adder

The CAL source code in listing 3.1 is the implementation of a simple Actor class Add. It has two input ports A and B as well as an output port Result. Only one action is defined in the Actor’s body. It consumes one token from each of the ports A and B, which are subsequently addressable via their identifiers a and b. The Action completes with the production of a new token on the Result port which carries the value of the sum of both consumed tokens.

The Actor can fire immediately when its input ports contain at least one token each. The Actor complies with the single-rate dataflow MoC as all of its ports consume or produce exactly one token per firing.

Non-deterministic CAL Multiplexer

The CAL source code in listing 3.2 specifies one of the most basic examples of non-determinism in the language. The Mux Actor class also has two input ports A, B and an output port Result. Observe that two actions named first and second are defined in the Actor’s body. Both actions consume a Token from port A or B, respectively, and produce a new Token with the same value on the Result port.


actor Mux () A, B ==> Result :
    first  : action A: [a] ==> Result: [a] end
    second : action B: [b] ==> Result: [b] end

    schedule fsm S0 :
        S0 (first)  --> S1;
        S1 (second) --> S0;
    end
end

Listing 3.3. Deterministic Multiplexer

This leads to the following observations:

1. Which of the two Actions fires depends on the arrival time of Tokens at its input ports.

2. Since the production and consumption rates at the input ports are not constant (either 0 or 1), the Actor is an MRDF-class Actor (multi-rate dataflow).

3. Because consumption and production rates are non-constant and non-cyclic, the exact subclass is DDF (dynamic dataflow).

This simple example illustrates how easy it is to model non-determinism and timing-dependent behavior in CAL.

Deterministic CAL Multiplexer

The multiplexer from listing 3.2 is extended in listing 3.3 with a finite state machine (FSM), introduced by the keywords schedule fsm. The FSM contains two states S0 and S1, with S0 being the initial state. The two lines after the declaration define the arcs of the state transitions as shown in figure 3.5. The syntax is read as follows:

current-state (fired action) --> next-state

Therefore, at any one time the first Action is only firable when two conditions apply: there must be at least one available token at port A and the Actor must be in state S0.


Figure 3.5. Finite State Machine for a deterministic execution of actions

 

network Add3 () Op1, Op2, Op3 ==> Out :
entities
    add0 = Add();
    add1 = Add();
structure
    Op1 --> add0.A;
    Op2 --> add0.B;
    add0.Result --> add1.A;
    Op3 --> add1.B;
    add1.Result --> Out;
end

Listing 3.4. Network of Two Cascaded Adders

CAL Network

CAL source code listing 3.4 shows how Actors can be instantiated and connected into a network. The network has three input ports Op1, Op2 and Op3, one output port Out, and contains two instances of the Adder, add0 and add1. Since all instantiated Actors are deterministic, the resulting network is also deterministic.

3.3.4 Software Code Generation

The translation of a CAL program into the host language C can be split into three parts, which are described in the following subsections.

System Actors

Platform-specific functions like file system operations and device drivers cannot be modelled in CAL, as the CAL language does not provide access to system calls or memory-mapped devices. Still this kind of functionality is essential for an embedded system as a means of communication with its environment.


from a library. The code in this library includes calls to system functions. The art_ prefix labels an Actor as a System Actor.

Non-System Actors

Most Actors in a CAL program are Non-System Actors, hereafter simply called Actors. An Actor’s C model includes one function for each Action. In this function, though, only the data operations on the input tokens are performed and the Actor’s state is transformed. The evaluation of Token availability is not part of the action function.

An Action Scheduler is the core of each Actor. This function’s task is to choose one action and fire it. Even though CAL allows an underspecification of the algorithm, i.e. multiple actions can be eligible at the same time, the Action Scheduler introduces implementation-dependent deterministic behavior. The selection process is translated into a highly-nested if-then-else tree. The firing conditions are evaluated in the following order, starting from the root and going to the leaves of this if-then-else tree (a sketch of such a generated scheduler is given after this list):

1. Evaluation of state variables and priorities
2. Evaluation of input token availability
3. Evaluation of output buffer space
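To make this evaluation order concrete, the following is a minimal C sketch of what a generated action scheduler for the deterministic Mux of listing 3.3 could look like. The structure follows the order described above; all type and helper names (Fifo, fifo_tokens_available, fifo_space_available, fire_first, fire_second) are illustrative placeholders, not the identifiers actually emitted by OpenDF or used by the ART runtime.

/* Illustrative sketch of a generated action scheduler for the Mux actor of
 * listing 3.3. All identifiers are hypothetical, not actual OpenDF output. */
typedef struct Fifo Fifo;                         /* opaque channel buffer type   */
typedef struct MuxActor MuxActor;

extern int  fifo_tokens_available(const Fifo *f); /* tokens waiting in a channel  */
extern int  fifo_space_available(const Fifo *f);  /* free slots in a channel      */
extern void fire_first(MuxActor *self);           /* copies one token A -> Result */
extern void fire_second(MuxActor *self);          /* copies one token B -> Result */

typedef enum { S0, S1 } MuxState;

struct MuxActor {
    MuxState state;                               /* FSM state variable           */
    Fifo *inA, *inB, *outResult;                  /* bound channels               */
};

/* Returns 1 if an action fired, 0 if nothing was eligible this round. */
static int mux_action_scheduler(MuxActor *self)
{
    if (self->state == S0) {                                  /* 1. state/priorities   */
        if (fifo_tokens_available(self->inA) >= 1 &&          /* 2. input availability */
            fifo_space_available(self->outResult) >= 1) {     /* 3. output space       */
            fire_first(self);
            self->state = S1;                                 /* FSM transition        */
            return 1;
        }
    } else {                                                  /* state S1              */
        if (fifo_tokens_available(self->inB) >= 1 &&
            fifo_space_available(self->outResult) >= 1) {
            fire_second(self);
            self->state = S0;
            return 1;
        }
    }
    return 0;
}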

ACTORS Runtime

The purpose of the ACTORS Runtime is the efficient scheduling of Actors. Depending on the scheduling policies in use, a system can be fine-tuned for low latency or high throughput. The ACTORS Runtime has been developed in the course of the ACTORS project. It has been subject to numerous design iterations and is in continuous development. The runtime environment presented here is a fork of runtime version 1.1 and has been stripped of all Linux and POSIX references and multithreading support. Its main features are support for static FIFO sizes and round-robin scheduling of Actors.

The initNetwork function allocates memory for all FIFOs and Actors and instantiates them. The main function calls initNetwork first and then executeNetwork. executeNetwork creates a single workerThread, which processes the list of all instantiated Actors in the system and calls their action schedulers.
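As a rough illustration of this flow, the sketch below shows the generated top level with the worker-thread body written inline as a plain loop. The identifiers actorList, numActors and actionScheduler are assumptions made for this illustration, not the real ART data structures.

/* Sketch of the generated top level and the worker loop described above. */
typedef struct Actor {
    int (*actionScheduler)(struct Actor *self);   /* fires at most one eligible action */
} Actor;

extern Actor *actorList[];                        /* filled in by initNetwork()        */
extern int    numActors;

extern void initNetwork(void);                    /* allocates FIFOs and Actor instances */

/* Body of the single worker thread: round-robin over all instantiated Actors. */
static void executeNetwork(void)
{
    for (;;) {
        for (int i = 0; i < numActors; i++) {
            actorList[i]->actionScheduler(actorList[i]);
        }
    }
}

int main(void)
{
    initNetwork();        /* instantiate channels and actors */
    executeNetwork();     /* never returns                   */
    return 0;
}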

Memory Footprint


An Actor’s memory footprint depends on the complexity of its actions, its state vector and the complexity of its firing rules.

A Channel’s memory footprint is roughly equivalent to the size of the underlying FIFO - independently of how many readers are connected to the FIFO. The implementation that achieves this is shown in figure 3.6. As depicted, the source’s output port has exclusive permission to write the tail pointer of the Channel Control Blocks (CCB), whereas the sink’s input port has exclusive write permission for the CCB’s head pointer. In this way each input port knows exactly where its next token is located in memory. A small sketch of such a control block is given after figure 3.6.


Figure 3.6. Multicast implementation in software
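The following is a small sketch of such a channel control block, assuming 32-bit tokens, a power-of-two channel capacity and monotonically growing indices; the field and function names are illustrative placeholders, not the actual ART definitions. With multicast, each reading input port owns its own control block: all of them reference the same token buffer, the writer updates the tail of every block, and each reader updates only its own head.

#include <stdint.h>

#define CHANNEL_CAPACITY 1024u        /* tokens; assumed to be a power of two */

/* Sketch of a Channel Control Block (CCB) for one reader of a channel. */
typedef struct {
    int32_t *buffer;                  /* token storage shared by all readers   */
    uint32_t tail;                    /* written only by the producing port    */
    uint32_t head;                    /* written only by this consuming port   */
} ChannelControlBlock;

/* Tokens currently available to this reader (indices grow monotonically). */
static inline uint32_t ccb_tokens_available(const ChannelControlBlock *ccb)
{
    return ccb->tail - ccb->head;
}

/* Consume one token; the caller must have checked availability first. */
static inline int32_t ccb_read(ChannelControlBlock *ccb)
{
    return ccb->buffer[ccb->head++ & (CHANNEL_CAPACITY - 1u)];
}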

3.3.5 HDL-Code Generation

The general principle of HDL-code generation is similar to C code generation. For historic reasons each individual Actor is synthesized to Verilog code, whereas the instantiation of the Verilog Actors and connecting buffers is done in a top-level netlist file in VHDL as depicted in figure 3.7.

Because of the inherent parallelism of the hardware implementation, there is no need for global logic that evaluates firing eligibility. Rather, each Actor individually controls the firing of its actions, solely depending on input Token availability, output space availability, state, and firing rules.

Top-level Interface

The interface of the VHDL top-level netlist is shown in figure 3.7. The implemented protocol is a three-way handshake. In the first step the receiver senses the count signal and asserts the rdy signal if the input requirements are met. Then the sender asserts the data and send signals until the receiver asserts the ack signal.


Figure 3.7. VHDL top-level netlist with port descriptors

action, code is also instantiated for a reset generator.

Resource Usage

Each Actor occupies a certain static amount of Logical Elements on the target FPGA, which depends on the number and complexity of its actions. Channels are implemented using FIFOs that utilize BlockRAM resources. A disadvantage of these FIFOs is that, due to their interface, one FIFO cannot be used as an input for more than one Actor, as is possible in the software case (figure 3.6). Rather, if two input ports are bound to one output port, two hardware FIFOs will be instantiated, as in figure 3.8.

Figure 3.8. Multicast implementation in hardware


3.3.6 Discussion

Tool Support

The supported subset of CAL varies with the compiler (OpenDF, ORCC) and synthesis target. As the OpenDF compiler is the basis of this work, we concentrated on its feature support. Naturally, the supported subset of the codesign methodology is the intersection of the supported subsets of the software backend and the hardware backend. Unfortunately the tool support is not well-documented outside the source code.

As a general design rule, we can state that keeping a CAL program simple and avoiding complex constructs (lambda calculus) is a good guideline to writing programs for both backends.

Coding Style

The avoidance of overspecification of a problem is the great strength of CAL. From a high-level point of view, the actual firing of actions may be unimportant or even unknown both at design time and at compile-time. This leaves a large scope of possible implementations to the designer. Some design-space dimensions are:

• granularity of actors and actions

• granularity of channel read and write accesses
• use of internal state with or without priorities

As a rule of thumb one can say that (for software-mapped Actors), no matter how sophisticated the runtime system is, overhead is inherent in every additional Actor that is introduced. The hardware implementation, on the other side, can leverage a parallel (which often means “fine-grained”) description of an algorithm in terms of throughput, at the cost of gate count and design complexity.

Abstract Model vs. Implementation


Chapter 4

Related Work

This chapter covers the history of CAL, beginning with the language’s specification as part of the Ptolemy project. The subsequent episode is characterized by Xilinx’s involvement in CAL research through the development of the OpenDF toolchain. As of now, the latest research effort has been the European ACTORS project, which concluded in February 2011 - the first month of this work - and set the scene for the presented codesign methodology. A notable development concurrent with the ACTORS project is that CAL was adopted by the Moving Pictures Expert Group (MPEG) as a specification standard. In the subsequent section, Workpackage D1b of the ACTORS project is described, which proposed a codesign methodology based on CAL. After this, two historic codesign approaches, Cosyma and Vulcan, are presented and the differences to our CAL codesign approach are discussed. Finally the chapter closes with a short outline of a totally different approach to computer system design called ForSyDe.

4.1 Ptolemy Project

The Ptolemy project started in 1990 with the purpose of modelling and simulating concurrent systems. A focus has been to combine subsystems which abide by different models of computation and to analyze their behavior during interaction.

The project’s principle is to support a wide variety of both the well-understood dataflow models of computation presented in chapter 2 and other paradigms to model systems. This allows the modelling of heterogeneous systems.

Whereas the first version of Ptolemy (Ptolemy Classic) was implemented in C++, in 1996 the platform’s design language switched to Java in an effort to increase productivity. The resulting framework Ptolemy II is currently in its 8th major revision and still being maintained and extended.

Ptolemy II is based on the concept of actors, with the implication that Actors behave differently depending on which model of computation they pursue. In order to encapsulate this semantic, each Actor has its own director, which implements the behavior of the model of computation according to the Actor’s type [12]. In this way a system in Ptolemy II can consist of a multitude of Actors - each with a differently typed director. The 8th revision of Ptolemy II understands process networks (PN), discrete-events (DE), dataflow, synchronous/reactive (SR), rendezvous-based, and continuous-time models.

Most Actors in Ptolemy feature a state machine, which governs their execution. Modal Actors are an extension of stateful Actors. They allow for the specification of different operating conditions (or modes). Thus faulty Actors can be modelled, which may switch between normal operation and different faulty modes of operation.

Ptolemy II interprets a custom input syntax of Actors for which there is a graphical editor available [13]. An SDF subtractor with a display monitor implemented in Ptolemy II is shown in figure 4.1.

As the number of supported models of computation has increased, a new efficient input format for the specification of Actors was needed. The CAL language was developed to fill this gap.


4.2 ACTORS project

After the definition of CAL as an input language to Ptolemy II, Xilinx developed a set of compilation tools around it in an effort to investigate alternative programming languages for its FPGA line of products. This project concluded with the release of OpenDF and OpenForge, which are C-code and Verilog-code generation tools for the CAL Actor Language.

Starting from this point, the EU-funded ACTORS project - an abbreviation for Adaptivity and Control of Resources in Embedded Systems - investigated CAL-based dataflow programming for control-oriented applications. A major research topic was the investigation of servers for resource allocation and feedback of service level to the dataflow application. Among other subtasks, efficient scheduling and partitioning of CAL programs on multi-core systems as well as Actor merging have been demonstrated. The most prominent contribution has been the earliest deadline first scheduler for the Linux kernel - SCHED_EDF. It has found wide recognition in the Linux kernel community.

The most relevant proposal of the ACTORS project with respect to this thesis is described in the following section, CAL Mapping Tools. In the subsequent section the new MPEG standard RVC-CAL is discussed.

4.2.1 CAL Mapping Tools

In workpackage D1b (CAL Mapping Tools) of the ACTORS project a generic codesign methodology is proposed, which is intended to provide a wide range of possible mapping targets for CAL programs. The aim is to support a variety of communication protocols for CAL channels (PCI, Ethernet, USB).

The methodology leverages the existing CAL compilers, i.e. CAL2C and CAL2HDL, and strives to integrate future developments (CAL2ARM). This required flexibility calls for a layered approach to the problem of mapping CAL Actors and Channels onto a designated platform.

The model consists of three distinct layers:

• CAL layer: The classic CAL program

• Wrapper Generic Layer: This layer handles the encoding and decoding of messages between distinct processing elements. It provides serialization and deserialization capabilities required for bus transfers.


The main tool of this work is the so-called CodesignBuilder, which is unfortunately not entirely specified. The basic idea is that this tool takes annotated CAL source code, in which each actor is assigned to a designer-determined target processing resource. Subsequently it divides the CAL network into subnetworks and launches the corresponding code generators. The generated HDL description of the accelerator-mapped Actor network and prefabricated HDL wrappers are synthesized by Synplify and downloaded to the FPGA board. Finally the generated C code with custom libraries is inserted into a prefabricated Visual Studio project, compiled and executed on the host PC.

Unfortunately, this workflow has never been demonstrated.

4.2.2 Reconfigurable Video Coding

A far more successful development in the CAL ecosystem has been the adoption of the language as a specification language for video coding and decoding algorithms by the MPEG group.

The growing number of use cases for video codecs as well as the range of target architectures and devices have led to a multitude of video codecs. Until now, specifications of those video codecs have been released mainly in the form of C/C++ source code, sometimes also as a textual description. This specification format does not provide for the expression of common subalgorithms (like transforms or predictors) that are similar or even equal between two distinct codecs.

Furthermore, with the advent of mobile devices, hardware accelerators have become a widespread means of delivering video decoding performance while maintaining a reasonable power consumption. Unfortunately sequential C/C++ code is not readily transformable into an efficient parallel implementation on Register Transfer Level (RTL). Rather, a complete analysis of the algorithm and a design from scratch is necessary. This is particularly becoming unacceptable as video codecs become more and more complex, thus making a steadily increasing redesign effort necessary.

In order to remedy these shortcomings and promote the reuse of existing software code and hardware IP, the CAL Actor Language has been adopted by the Moving Pictures Expert Group (MPEG) in their draft proposal for their next generation of video decoder specification, Reconfigurable Video Coding (RVC) [14].

One part of the RVC effort has been the creation of a CAL library (the MPEG Toolbox) of computational kernels, which are common among the MPEG decoders.


from the MPEG Toolbox.


Figure 4.2. Example of a RVC-based video-decoder

The third and final part of RVC is RVC-BSDL (Bitstream Syntax Description Language). Its purpose is to enable the automatic generation of the bitstream parser, which is the unit in a decoder that interprets the fed-in bitstream and distributes tokens to the subsequent actors.

4.3 Cosyma and Vulcan

Two historic systems that demonstrate opposite ways of approaching the hardware/software codesign problem are Vulcan, developed at Stanford, and Cosyma, developed at Technische Universität Braunschweig [15] [16].

Both Vulcan and Cosyma assume a partitioning problem in which an algorithm shall be mapped partly on a CPU and partly on a separate ASIC. This stems largely from the historic notion that an accelerator would always be a discrete chip.

The communication paradigm is also common to both platforms. Exchange of data between the CPU and the accelerator ASIC happened via shared memory and registers that were accessed via a common bus.


The main differences between the two approaches concern, on the one hand, the way a given design is constrained and mapped and, on the other hand, how a mapping’s performance is evaluated.

With Vulcan, initially the entire algorithm (typically a C program) is imagined to be mapped on the accelerator ASIC. In a number of successive steps, parts of the C program are moved to a software implementation to minimize ASIC cost, until the maximum permissible cost is reached.

Cosyma on the other side proceeds on the assumption that the entire algorithm is initially mapped as software on the CPU. In the following mapping step, critical parts of the algorithm are offloaded to the ASIC in order to meet a given performance goal.

The evaluation of system performance also differs between the two methods. Cosyma requires test cases for the determination of worst-case execution times for basic program blocks. Vulcan on the other side calls for an analysis of the control flow inherent to the algorithm.

Several ideas from Vulcan and Cosyma have inspired this work, like the assumption of single-threaded programs and a common bus architecture. The main difference to the approach in this work is, however, that the algorithm is given in a partition-agnostic specification language. Thus the algorithm is not biased towards one or the other domain.

4.4 Formal System Design

This chapter ends with the presentation of an alternative system design technique that has very little in common with dataflow programming. The Formal System Design (ForSyDe) methodology tries to bring system design to a higher level of abstraction by defining a computer system entirely on a functional level through the use of the purely functional programming language Haskell [17] [18].

Similarly to dataflow, in ForSyDe a design is modelled as a network of processes that are connected with signals (like Actors and Channels in dataflow models of computation). This specification can contain processes that adhere to different models of computation, thus allowing the simulation of heterogeneous systems.


Chapter 5

Technical Implementation

This chapter discusses the implementation of the CAL development system from top to bottom. First the required equipment and hardware setup is presented, followed by the software architecture.

5.1 Development System

The development system that was created in this work is depicted in figure 5.1. At its core the development system consists of a ML510 development board. All programming and synthesis tools are run on the development PC, which connects to the ML510 board via a JTAG debugger/programmer. The testbench host computer injects stimulus and collects results from the development board either via Ethernet or RS232.


Figure 5.1. Overview of the Development System

5.2 Hardware

The hardware system is built with Xilinx Platform Studio 12.4 and targets the ML510 development board (revision C). This board is equipped with a Xilinx Virtex-5 FPGA with two PowerPC 440 hard macros. The entire system is built around one of these processors.

5.2.1 FPGA Platform

The codesign platform is based on a shared bus microprocessor system with two bus masters. The first bus master is one PowerPC440 core and the second bus master is the CAL accelerator module. Both bus masters compete for bus time on the processor local bus (PLB).


Figure 5.2. Overview of the Hardware System

5.2.2 CPU/Accelerator Interface


Figure 5.3. Schematic of the developed master accelerator architecture

5.3 Software

The software system consists of three layers on top of the hardware system, as illustrated in figure 5.4. The Actors Runtime is mostly platform-independent and provides all necessary functionality for Actor scheduling and execution. Only cache, timer and interrupt initialization require functions from within the driver layer.

System Actors access hardware functionality exclusively through hardware drivers. Non-System Actors, on the other hand, are not aware of the driver layer.


Figure 5.4. Block diagram of the CAL software architecture

5.3.1 Runtime System

depending solely on this parameter. The benefit of this option, however, is that it allows a system to benefit from both instruction and data caches, because both caches can speed up subsequent memory accesses.

On the other hand, the developer can set the Channel capacity, which determines the number of tokens each channel can hold. This parameter defaults to the equivalent of 1000 Tokens (4kB). A higher value for this parameter in combination with an enabled action scheduler loop increases the benefits of caches, as it allows more consecutive iterations over the same program block. The upper limit for this value is only determined by the available memory, because the memory for all actors is statically allocated during initialization of the CAL program. The minimal value of this parameter depends on the CAL program at hand and requires a careful examination of the data dependency between Actors. In the worst case, if the value of this parameter is too low, the CAL program will deadlock.

5.3.2 PLB Bus System Actors

In order to keep a coherent model of the software system, access to device drivers is encapsulated within special actors, the so-called System Actors. These System Actors are treated by the runtime as normal Actors - once per scheduling round their firing eligibility is evaluated and in case of a positive evaluation they are scheduled for execution. The difference between System Actors and normal Actors is that their C source code has been altered such that they call system functions and can thus access and operate on global data. The following is a functional explanation of the PLB bus System Actors. A complete overview of the communication scheme with cache management and overflow control is available in the appendix as well as on the companion CD.

art_PLB_Bus_SW_Writer

The PLB bus system actors load the DMA unit of the accelerator with the address of the channel control block (CCB) and activate the accelerator's DMA logic. Subsequently the accelerator autonomously loads the channel control block from main memory into its internal register file. Then it performs one or two DMA read transfers, depending on whether the available data range in the FIFO is continuous or wrapped. In the end the accelerator computes the new channel control block and saves it to main memory. The memory management and transaction algorithm is given below.

art_PLB_Bus_SW_Reader


Figure 5.5. FIFO states: a) empty buffer, b) partially full buffer, c) full buffer, d) partially full buffer (wrap)

Algorithm 1 Write to Accelerator

1. Determine memory ranges of available tokens
2. Flush memory ranges to main memory
3. Flush LocalInputPort struct to main memory
4. Load LocalInputPort struct pointer into accelerator
5. Wait for DONE signal
6. Invalidate LocalInputPort struct

Algorithm 2 Read from Accelerator

1. Determine memory ranges of available tokens
2. Flush LocalOutputPort struct to main memory
3. Load LocalOutputPort struct into accelerator
4. Wait for DONE signal
5. Invalidate entire buffer
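The following C sketch makes the write path of Algorithm 1 concrete. It is only an illustration: the LocalInputPort layout, the accelerator register map and the base address are invented here, and the cache and register accessors assume Xilinx-style BSP routines (Xil_DCacheFlushRange, Xil_DCacheInvalidateRange, Xil_Out32, Xil_In32), which may differ from the functions actually used on this platform.

#include "xil_cache.h"   /* assumed cache flush/invalidate routines   */
#include "xil_io.h"      /* assumed memory-mapped register accessors  */

/* Hypothetical layout of the channel control block shared with the accelerator. */
typedef struct {
    unsigned int buffer_addr;   /* physical address of the token buffer */
    unsigned int read_index;    /* consumer position                    */
    unsigned int write_index;   /* producer position                    */
    unsigned int capacity;      /* buffer size in tokens                */
} LocalInputPort;

#define ACCEL_BASEADDR   0x80000000u   /* hypothetical PLB base address of the accelerator */
#define ACCEL_REG_CCB    0x00u         /* hypothetical register taking the CCB pointer     */
#define ACCEL_REG_STATUS 0x04u         /* hypothetical status register                     */
#define ACCEL_DONE       0x1u          /* hypothetical DONE flag                           */

static void write_to_accelerator(LocalInputPort *port,
                                 const void *tokens, unsigned int nbytes)
{
    /* Steps 1-2: flush the token range so the accelerator's DMA reads fresh data. */
    Xil_DCacheFlushRange((unsigned int)tokens, nbytes);

    /* Step 3: flush the channel control block itself. */
    Xil_DCacheFlushRange((unsigned int)port, sizeof(*port));

    /* Step 4: hand the CCB pointer to the accelerator, which starts its DMA logic. */
    Xil_Out32(ACCEL_BASEADDR + ACCEL_REG_CCB, (unsigned int)port);

    /* Step 5: wait for the DONE signal. */
    while ((Xil_In32(ACCEL_BASEADDR + ACCEL_REG_STATUS) & ACCEL_DONE) == 0)
        ;

    /* Step 6: invalidate the CCB so the CPU sees the indices written back by the accelerator. */
    Xil_DCacheInvalidateRange((unsigned int)port, sizeof(*port));
}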

5.3.3 UART System Actors

art_UART_Source_txt

Reads a byte from the platform UART16550 device and outputs it as a token. The read is blocking.

art_UART_Source_int

Reads a byte from the platform UART16550 device and outputs its value diminished by 48 (the ASCII code of '0', so that an ASCII digit is converted to its integer value) as a token. The read is blocking.
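The sketch below illustrates the general shape such a System Actor takes in C. The runtime interface (OutputPort, pinWrite_int32) and the firing function name are hypothetical, and the base address is a placeholder that would normally come from the generated xparameters.h; XUartNs550_RecvByte is the blocking single-byte read assumed from the Xilinx UART16550 driver.

#include "xuartns550_l.h"   /* assumed Xilinx UART16550 low-level driver */

#define UART_BASEADDR 0x83E00000u   /* placeholder; the real value comes from xparameters.h */

/* Hypothetical runtime interface for producing one token on an output port. */
typedef struct OutputPort OutputPort;
extern void pinWrite_int32(OutputPort *port, int value);

/* One firing of the art_UART_Source_int System Actor: block until a byte
 * arrives on the UART, convert the ASCII digit to its integer value and
 * emit it as a token.                                                    */
static void art_UART_Source_int_fire(OutputPort *out)
{
    unsigned char byte = XUartNs550_RecvByte(UART_BASEADDR);
    pinWrite_int32(out, (int)byte - 48);   /* 48 == '0' */
}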

art_UART_Sink_txt


art_UART_Sink_int

Reads a token and writes its value increased by 48 (converting an integer digit back to its ASCII code) as a byte to the platform UART16550 device. The write is blocking.

5.3.4 Ethernet System Actors

art_Eth_Source

This actor establishes a directed stream of tokens from the host machine to the software-mapped partition of the CAL program.

art_Eth_Sink

This actor establishes a directed stream of tokens from the software-mapped partition of the CAL program to the host machine.

5.3.5 Host-Mapped Software

The host partition is necessary to provide file system functionality, which is not available on the development platform. This is particularly useful for file I/O functions in connection with stimulus and result vectors.

source.py

This Python script reads all input stimulus files and opens one TCP connection with the ML510 board for each such file. A handshaking protocol on top of TCP ensures the reliability of the transmission, since the TCP stack on the ML510 board has problems throttling the connection speed.

sink.py

This Python script opens one TCP connection with the ML510 board for each result file to be generated. A handshaking protocol is not necessary in this case, since the host machine can process the datagrams much faster than the ML510 board.

5.3.6 Device Drivers

The CAL language intentionally provides no global address space. Therefore, drivers for memory-mapped devices such as the accelerator at hand must be written in the host language, C.
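In practice such a driver reduces to a few memory-mapped register accesses. The sketch below shows a minimal, hypothetical send-byte routine; the register offsets, flag and base address are invented for illustration and this is not the actual ACT_MST_MasterSendByte implementation, though the accessors Xil_Out32/Xil_In32 are the usual Xilinx register helpers assumed here.

#include "xil_io.h"   /* assumed memory-mapped register accessors */

#define ACC_BASEADDR   0x80000000u   /* hypothetical accelerator base address */
#define ACC_REG_DATA   0x00u         /* hypothetical data register offset     */
#define ACC_REG_STATUS 0x04u         /* hypothetical status register offset   */
#define ACC_READY      0x1u          /* hypothetical "ready for data" flag    */

/* Write one byte to the accelerator through its memory-mapped interface. */
void accel_send_byte(unsigned char value)
{
    /* Spin until the accelerator signals that it can accept data. */
    while ((Xil_In32(ACC_BASEADDR + ACC_REG_STATUS) & ACC_READY) == 0)
        ;
    Xil_Out32(ACC_BASEADDR + ACC_REG_DATA, value);
}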

ACT_MST_MasterSendByte


5.4 Testbench and System-level Overview


Chapter 6

Codesign Methodology

6.1 Existing CAL Workflow

This section deals with the methodology and tool flow of CAL-based software design as depicted in figure 6.1. Major steps in the process are marked with capital letters A-E and described in detail in the following sections. Special emphasis is placed on the intermediate representations, whereas the tools involved are described only cursorily. A comprehensive list of all intermediate formats is provided in section 6.4 at the end of this chapter.

6.1.1 System Specification

System specification is the task of writing the source code in the languages CAL and NL (figure 6.1, task A). At this time there are no tools that assist this process, other than conventional text editors. It is recommended to give the top level program (also referred to as the model) the name top.nl. The model can encompass an arbitrary depth of hierarchical levels represented as subnetworks, which are themselves NL files. Leaf elements in the hierarchy are always CAL files, which represent actor classes.

6.1.2 Flattening

The specification files from step A carry the full semantic weight of CAL and NL. This includes the passing of parameters through the hierarchical levels as well as recursive instantiations of subnetworks and actors. In order to reduce the model's complexity, the input files are compiled into an intermediate representation, the XDF file, in step B (figure 6.1). This compilation starts from the top level and analyses all subnetworks and actors instantiated in that level. This process repeats itself recursively, taking into account one additional level of hierarchy per recursion, until no more levels of hierarchy remain. The resulting XDF file contains the entire information of the original NL and CAL files. Even though the network is flattened, the information about the entire hierarchical structure is kept within the file.


[Figure 6.1: tool flow of the existing CAL software workflow, steps A-E, involving the ELABORATOR, SSA-Generator, XLIM2C, XDF2PAR, SAXON8 (generateConfig.xsl, generateDepend.xsl) and XILINX_GCC tools and the intermediate files <model>.nl, <subnetworks>.nl, *.cal, <model>.xdf, *.xlim, *.c, <model>.c, *.par, <model>.depend, *.o and <model>.elf]


6.1.3 Constant Propagation and Single Static Assignment Generation

Parameters are extracted from the flattened network for each actor instance in step C, figure 6.1. This information, combined with the CAL file of the corresponding actor class, is sufficient to generate an intermediate representation of the actor instance in single static assignment form. This process takes place once for each actor instance, even if two actors share the same parameter sets.
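To make this step tangible, consider a hypothetical actor parameter gain that the network binds to the constant 2 for one particular instance. The C-like sketch below is purely illustrative (the actual intermediate representation is XML-based and not shown): constant propagation replaces the parameter by its bound value, and the single static assignment form gives every assignment its own uniquely named variable.

/* Action body as written in the actor class (parameter still symbolic):
 *     x := input;  x := x * gain;  output := x + 1;                      */

/* After constant propagation (gain = 2 for this instance) and conversion
 * to single static assignment form, each variable is assigned exactly once: */
int actor_instance_action(int input)
{
    int x_1   = input;       /* first definition of x         */
    int x_2   = x_1 * 2;     /* gain folded to the constant 2 */
    int out_1 = x_2 + 1;
    return out_1;
}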

6.1.4 Actor Translation

The intermediate format from step C is translated into C source code for all actors except system actors.

6.1.5 Network and Schedule Generation

The final step E draws all actor instances from the intermediate representation of the network in XDF format and generates C source code for their instantiation, their connection via channels, memory allocation for the channels and scheduling. The scheduling takes place in a round-robin manner, following the order of occurrence in the XDF format. The developer does not have any influence on the scheduling order. This stems from the lack of any guarantee that naming and ordering of actor instances is deterministic in the process of translation from the CAL and NL formats to the XDF format.
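A minimal sketch of the kind of round-robin scheduling loop the generated network code sets up is shown below. The actor table, the firing-function signature and all names are hypothetical and do not mirror the actual generated code; the sketch only illustrates that every actor is offered one chance to fire per scheduling round, in its XDF order of occurrence.

#include <stdbool.h>

/* Each generated actor exposes one entry point that fires as many actions
 * as its input tokens and output space currently allow, and reports
 * whether it made any progress.                                           */
typedef bool (*actor_fire_fn)(void *actor_state);

typedef struct {
    actor_fire_fn fire;
    void         *state;
} actor_entry;

/* Actors appear here in their order of occurrence in the XDF file;
 * the developer has no influence on this order.                    */
extern actor_entry actor_table[];
extern int         actor_count;

static void run_network(void)
{
    for (;;) {
        bool progress = false;
        /* One scheduling round: give every actor a chance to fire. */
        for (int i = 0; i < actor_count; i++)
            progress |= actor_table[i].fire(actor_table[i].state);
        if (!progress)
            break;   /* no actor could fire: the network is idle or deadlocked */
    }
}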

6.2 Extended Codesign Workflow

Building on top of the existing software toolchain, the hardware toolchain was integrated as shown in figure 6.2. All steps A-E from the software workflow remain the same and steps F-I are added. The crucial point for the codesign workflow is the fully compiled XDF file with all annotations.

6.2.1 Partitioning

During the partitioning process the developer decides which actors shall be implemented as software actors and which actors shall be synthesized to dedicated logic in FPGA fabric. The process is assisted by the configuration tool, which extracts all actors' identifiers from the XDF representation and generates a template configuration file. This template configuration file contains the identifiers in the required syntax to force all actors to be compiled as software. The developer can subsequently edit the configuration file in a text editor to force a different partitioning, i.e. to map an arbitrary number of actors to FPGA fabric. Special attention must be given to the following mandatory rules for correct operation of the CAL network.


[Figure 6.2: extended codesign tool flow, combining the software steps A-E with the added tools and steps F-I — Configuration (<model>.conf), XDF Splitter (<model>_powerpc.xdf, <model>_accelerator.xdf), Synthesizer (*.v, <model>.vhd, user_logic.vhd) and Xilinx ISE (<model>.bit)]


2. Second, no new actors may be added to the configuration file.

3. Third, each actor instance may appear exactly once in the configuration file.

In addition to these rules, it is evident that system actors, which are written in the C language and not processed by the elaboration tool, cannot be mapped to hardware. In other words, system actors are fixed to the partition for which they were originally designed.

6.2.2 Splitting XDF

Input to this process is the comprehensive CAL program in its XDF representation and the developer-specified configuration file. The splitter tool reads the configuration file and interprets it as a target partitioning, an example of which is shown in figure 6.3. It then tries to map the XDF representation according to the provided configuration file. In case any of the partitioning rules given in section 6.2.1 are violated, the tool terminates.

Figure 6.3. Target partitioning (actor instances S0-S3 distributed over a software partition and a hardware partition)

Throughout the splitting process, each encountered channel between two actors is examined as to whether it is a boundary channel or a non-boundary channel. A boundary channel is a channel with the property that its source and destination actors are mapped to different partitions.
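The classification the splitter has to perform can be made concrete with a few lines of C; the types and the partition_of lookup below are hypothetical and serve only to illustrate the boundary-channel definition.

typedef enum { PARTITION_SW, PARTITION_HW } partition_t;

typedef struct {
    const char *source_actor;   /* instance id of the producing actor */
    const char *target_actor;   /* instance id of the consuming actor */
} connection_t;

/* Looks up the partition assigned to an actor instance in the
 * developer-edited configuration file (lookup itself not shown). */
extern partition_t partition_of(const char *actor_id);

/* A connection is a boundary channel exactly when its source and
 * destination actors are mapped to different partitions.         */
static int is_boundary_channel(const connection_t *c)
{
    return partition_of(c->source_actor) != partition_of(c->target_actor);
}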
