IMPLEMENTATION OF A SCHEDULING AND ALLOCATION ALGORITHM FOR HARDWARE EVALUATION

Kangmin Chen

LiTH-ISY-EX--05/3754--SE
Linköping 2005


IMPLEMENTATION OF A SCHEDULING AND ALLOCATION ALGORITHM FOR HARDWARE EVALUATION

Master's thesis in Electronics Systems at the Department of Electrical Engineering, Linköping University

by

Kangmin Chen

LiTH-ISY-EX--05/3754--SE

Supervisor: Kenny Johansson

Examiner: Lars Wanhammar


Abstract

In this thesis, an intuitive approach to determine the scheduling and allocation of a behavioral algorithm defined by a netlist is presented. In this approach, scheduling is based on weighted list scheduling, where the operations with the longest critical path are scheduled first. Component allocation relies on the PDCPA algorithm, which focuses on forming efficient and correct clusters for the hardware reuse problem. Several constraints are used in order to ensure the causality of processes and prevent conflicts between hardware components. The approach reports the total number of control steps and the number of registers and multiplexers in detail. Hence, designers obtain useful information from it and can make trade-offs between different resource conditions.

The program is implemented in the MATLAB programming environment and provides parts of behavioral synthesis to facilitate the whole synthesis procedure.


TABLE OF CONTENTS

1 Introduction
   1.1 Motivation
   1.2 Objective
   1.3 Related work
   1.4 Limitations
   1.5 Organization of this thesis
2 Background
   2.1 Behavioral synthesis
   2.2 MATLAB
   2.3 Netlist
   2.4 Scheduling
   2.5 Allocation
   2.6 NP-Complete Problem
   2.7 Partition problem
3 Theoretical study
   3.1 Behavioral synthesis process
   3.2 Scheduling techniques
      3.2.1 ASAP (As Soon As Possible) Scheduling
      3.2.2 ALAP (As Late As Possible) Scheduling
      3.2.3 List Scheduling
      3.2.4 Force-Directed Scheduling
      3.2.5 Integer Linear Programming algorithm
      3.2.6 Classification of scheduling
   3.3 Allocation techniques
      3.3.1 Greedy allocation
      3.3.2 Left Edge (LE) Algorithm
      3.3.3 Clique partitioning
      3.3.4 Classification of allocation algorithms
4 Description of the program
   4.1 Program specification
   4.2 Basic structure of the program
   4.3 DFG Analysis (sorting)
   4.4 Scheduling
      4.4.1 Sub blocks of scheduling
      4.4.2 Scheduling technique selection
      4.4.3 Multicycling in scheduling
      4.4.4 Critical path computation
      4.4.5 List scheduling
   4.5 Allocation
      4.5.1 Sub blocks of allocation
      4.5.2 Allocation technique selection
      4.5.3 PDCPA
      4.5.4 Register allocation
      4.5.5 Functional unit allocation
      4.5.6 Data Path allocation
   4.6 Statistics computation
5 Conclusion and future work
References


LIST OF FIGURES

Figure 3.1: A data flow graph example
Figure 3.2: One possible ASAP solution
Figure 3.3: One possible ALAP solution
Figure 3.4: A possible list scheduling solution
Figure 3.5: Example of FDS
Figure 3.6: Drawback of random allocation
Figure 4.1: Black box diagram of the program
Figure 4.2: Structure view of the program
Figure 4.3: Basic structure of the data path
Figure 4.4: Sub blocks of scheduling
Figure 4.5: Methods for dealing with varying delay
Figure 4.6: A critical path computation example
Figure 4.7: Another critical path computation example
Figure 4.8: Netlist of FIR filter
Figure 4.9: DFG of FIR filter
Figure 4.10: Netlist with critical path length of FIR filter
Figure 4.11: Scheduled DFG of FIR filter
Figure 4.12: Sub blocks of allocation
Figure 4.13: Register initialization for scheduled DFG
Figure 4.14: Registered DFG of FIR filter
Figure 4.15: Textual registered DFG of FIR filter
Figure 4.16: FU allocation example - step one
Figure 4.17: FU allocation example - step two
Figure 4.18: FU allocation example - step three
Figure 4.19: FU allocation example - step four


1 INTRODUCTION

1.1 MOTIVATION

Over the last decades, computer aided design (CAD) has provided more and more help to designers in electrical engineering and has become a subject of great importance. Methods for automatic synthesis at a high level have been invented, the design cycle has improved, and time-to-market has been shortened. But at the moment, all effective tools in this field are developed by companies for commercial purposes. So we decided to implement a related, simple tool to explore different automatic synthesis methods and help people with simple designs.

1.2 OBJECTIVE

Generally, component reuse is an important concern in the electrical field. How to save components while still achieving the same functionality is the main topic of this thesis. In order to assign different operations to functional units, we must divide the operations into clusters that do not have compatibility problems. Each cluster occupies a dedicated hardware component. Our goal is to find the minimum hardware overhead, which leads to minimum cost.

Designers can enter different input values to the developed program, which executes and generates the resource requirements. Hence, designers can make trade-offs early in the design, so that the final solution fits the requirements.


1.3 RELATED WORK

Computer-aided design is a field that has already received considerable attention for decades. Many researchers have presented various methods in this field, and a mass of commercial companies have developed all kinds of powerful CAD tools. But the improvements in this field will never stop. The program we provide focuses on two major parts of the synthesis flow, namely scheduling and allocation.

There are many approaches to scheduling and allocation. Many people have developed algorithms in these fields [1][2][3][4]. Also, people have improved the existing algorithms for better performance [5][6]. Some algorithms covering the whole synthesis flow have also been developed [7][8][9][10][11].

1.4 LIMITATIONS

The program follows some algorithms from previous research and is implemented in the local environment. It is not easy for other people to write self-defined inputs for this program. The algorithm also has some other deficits, because there are many modern methods in the same arena. Moreover, using the most effective algorithm makes the program generate its result only after a long execution time; for large designs this long execution time is a serious defect.

1.5 ORGANIZATION OF THIS THESIS

This chapter has given a brief background and the goal of the work. Some other research is also mentioned here. Chapter 2 gives some basic concepts used in the thesis. An extensive background of the explored and used techniques can be found in Chapter 3. Chapter 4 discusses the design and implementation, illustrated by examples. Finally, Chapter 5 gives the conclusions drawn from this work.


2 BACKGROUND

2.1 BEHAVIORAL SYNTHESIS

Behavioral synthesis is an automated design process that interprets an algorithmic description of a desired behavior and creates hardware that implements that behavior [12].

Usually behavioral synthesis starts from high-level code and then automatically translates the algorithmic description into cycle-by-cycle detail for hardware implementation, mostly at the register-transfer level (RTL). Normally, behavioral synthesis consists of scheduling, resource allocation, module binding, and controller synthesis.

2.2 MATLAB

MATLAB is a high-level technical computing language and interactive environment for algorithm development, data visualization, data analysis, and numerical computation. Using MATLAB, you can solve technical computing problems faster than with traditional programming languages, such as C, C++, and FORTRAN.

MATLAB is the leading software on the scientific market. It offers an easy-to-use environment where problems and solutions are expressed in familiar mathematical notation. From a reuse and collaboration perspective, we use MATLAB as the coding environment here, because several other programs in this field are developed in MATLAB.

2.3 NETLIST

A netlist is a file describing the interconnection information of an electronic circuit. It describes the connectivity between the components in a given electronic system. (Definition from "http://www.coresim.com".)

This thesis is based on the netlist definition defined by Electronics Systems, Linköping University.

Here is an example of a netlist:

[ 1 1 1 NaN NaN;
  1 2 2 NaN NaN;
  1 3 3 NaN NaN;
  2 1 6 NaN NaN;
  2 2 8 NaN NaN;
  3 1 1 2 4;
  3 2 4 3 5;
  3 3 1 3 7;
  5 1 5 6 0.63;
  5 2 7 8 0.32 ];

The first column of the netlist specifies the type of a component: "1", "2", "3", "4", and "5" stand for input, output, adder, subtractor, and multiplier, respectively. The second column is the identifier number within that kind of operation. The remaining columns represent the connection ports of the hardware component. The width of a row can differ according to the operation type; a lattice element left empty because of this difference is filled with "NaN", which means "Not a Number" in MATLAB.

From this netlist, we can obtain the data flow graph (DFG) of the implemented algorithm. All the analyses in the thesis are based on the given netlist. Hence, the netlist must be well defined.
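To make the row format concrete, the following small MATLAB fragment prints a readable view of the example netlist above. It is only an illustrative sketch, not part of the thesis program; the cell array of type names simply mirrors the coding described in the text.

% Print a readable view of the example netlist.
netlist = [ 1 1 1 NaN NaN; 1 2 2 NaN NaN; 1 3 3 NaN NaN;
            2 1 6 NaN NaN; 2 2 8 NaN NaN;
            3 1 1 2 4; 3 2 4 3 5; 3 3 1 3 7;
            5 1 5 6 0.63; 5 2 7 8 0.32 ];
names = {'input', 'output', 'adder', 'subtractor', 'multiplier'};
for r = 1:size(netlist, 1)
    row   = netlist(r, :);
    ports = row(3:end);
    ports = ports(~isnan(ports));     % drop the NaN padding
    fprintf('%s #%d, ports: %s\n', names{row(1)}, row(2), num2str(ports));
end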

2.4 SCHEDULING

Scheduling is one of the most important steps in the behavioral synthesis procedure. Scheduling is based on data flow analysis, which extracts the inherent parallelism of the algorithm, and on the resource restrictions. Scheduling turns the untimed description into detailed timed behavior and specifies what should be done in every clock cycle.

2.5 ALLOCATION

Allocation means assigning operations to specific hardware components. Allocation plays as important a role in behavioral synthesis as scheduling does; scheduling and allocation act as twins and interact with each other. Because one hardware component can be assigned to exactly one operation in each control step, the allocation procedure must make sure that within one control step all operations use different hardware resources. This is the main point of resource constrained allocation.

2.6 NP-COMPLETE PROBLEM

A problem is assigned to the NP (nondeterministic polynomial time) class if it is verifiable in polynomial time by a nondeterministic Turing machine.

A problem is NP-hard if an algorithm for solving it can be translated into one for solving any other NP (nondeterministic polynomial time) problem.

A problem is NP-complete if it is both in NP (verifiable in nondeterministic polynomial time) and NP-hard (any other NP problem can be translated into it).

Usually, an NP-complete problem requires an exhaustive search to find the best answer. But it is not practicable to perform this kind of "brute-force" algorithm. Hence, we have to find a near-optimal solution at a reasonable computational cost, without being able to guarantee feasibility and optimality.

2.7 PARTITION PROBLEM

The partition problem can be described as follows: if two nodes do not conflict with each other on a certain resource, there is an edge between these two nodes. The goal is to find the minimum number of resources needed by grouping compatible nodes together.

The partition problem is also an NP-complete problem. The resource allocation problem can be converted into a partition problem. This thesis explores some graph partitioning algorithms in section 3.3.3.


3 THEORETICAL STUDY

3.1 BEHAVIORAL SYNTHESIS PROCESS

Usually, a behavioral synthesis process can be divided into several steps listed below.

Specification

Define which language to use in the synthesis process. Is it a procedural language?

Find out the parallelism of the applied algorithm.

Obtain the requirement specification, such as timing, power consumption, area, cost, etc.

Data-Flow Analysis

Generally, this step is quite an important pre-process in the synthesis flow. In this step, the characteristics of the high-level language, for example procedure calls and complex operators, should be eliminated. Loops should be unrolled. One important task, which affects the subsequent scheduling step, is parallelism extraction. We should get a clear view of the data flow of the algorithm, understand the data dependencies, and explore the possible parallelism of the code so as to reduce the total execution time (latency) of the algorithm.

Operation Scheduling

This step largely determines the performance of the final design. In this step, we should make a schedule under some constraints. When making the schedule, we should follow some rules (scheduling techniques) in order to achieve better performance. During the scheduling process, we should consider the trade-off between cost and performance; usually the cost is the area overhead and the performance is the latency of the whole block. Also, we can evaluate the performance of the design after the scheduling is finished and derive a clocking strategy to make sure the design works as intended.

Data-Path Allocation

In the data-path allocation step, we decide the basic data-path structure. A function can be implemented by various designs whose overheads are different. At this point we should decide which functional components to use in the design, and then bind every operation to a certain hardware component that we selected. Now the operations of the algorithm have been mapped onto hardware. For the sake of saving hardware components, hardware minimization should be done during the allocation procedure, which gives rise to the hardware reuse problem that we mainly focus on in this work. Data-path allocation includes functional unit allocation, register (memory) allocation, and interconnection allocation.

Control Allocation

The data path is controlled by control signals generated by control units. One has to provide not only the data path of the design but also the control part, so control allocation must be carried out after the data-path allocation is complete. There are several ways to specify the control signals, for example micro-code and programmable logic arrays (PLA). We should choose the style of control and generate the control code.

3.2 SCHEDULING TECHNIQUES

As mentioned before, scheduling is a crucial step in the behavioral synthesis process. The final timing information may be completely different if two different scheduling techniques are used. Here we briefly illustrate some of the scheduling techniques most used nowadays.


3.2.1 ASAP (AS SOON AS POSSIBLE) SCHEDULING

In the data flow graph, nodes stand for operations. An operation can be executed only when it has no unexecuted parents. In other words, the highest nodes (top-most in a DFG diagram) are the operations available for execution at the moment. If an operation is ready to run, a time step is assigned to it as soon as possible. Then the time step is increased and the new highest nodes are assigned. This means that the ASAP scheduling technique assigns all the available operations in each time step. There is no need to care about the data dependency explicitly: since the highest nodes in the DFG are the nodes available for execution, all their preceding operations have already executed, so the data dependency is followed implicitly.

ASAP can be extended to work under limited resource constraints, which are always present in real designs. Not all the available nodes are assigned if there are resource restrictions. Random selection is used, because ASAP has no priority function to decide which operation is more important than the others. Because of this defect, the result is not as good as that of other, heuristic scheduling algorithms.

Here we use a simple example to illustrate the algorithm. The data flow graph is shown in Fig. 3.1. All operations are additions, in order to illustrate the basic concept of the technique. In realistic designs, many kinds of operations are combined, so the resource constraints are defined for several subsets. We assume that two adders are provided here. In the first control step (CS1), three operations are available for scheduling, namely 1, 2, and 3. Because ASAP has no selection criterion, we pick randomly, for example operations 1 and 2 for CS1, and follow the algorithm to make the whole schedule. One possible solution produced by ASAP is shown in Fig. 3.2.
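The ASAP selection rule is simple enough to sketch in a few lines of MATLAB. The fragment below schedules six additions on two adders; the predecessor lists are an assumed reading of Fig. 3.1, and the representation and variable names are illustrative, not taken from the thesis program.

% ASAP scheduling under a resource constraint (two adders).
pred   = {[], [], [], [2], [3], [5]};   % assumed predecessors per operation
nAdder = 2;
cs     = zeros(1, numel(pred));         % 0 = not yet scheduled
step   = 1;
while any(cs == 0)
    used = 0;
    for op = 1:numel(pred)              % "random" pick here: left to right
        done  = cs(pred{op});
        ready = cs(op) == 0 && all(done > 0 & done < step);
        if ready && used < nAdder
            cs(op) = step;              % assign as soon as possible
            used   = used + 1;
        end
    end
    step = step + 1;
end
disp(cs)   % [1 1 2 2 3 4] with this left-to-right pick: four control steps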

3.2.2 ALAP (AS LATE AS POSSIBLE) SCHEDULING

ALAP is similar to the ASAP scheduling algorithm, except that it works from the bottom of the DFG towards the top. Each operation is scheduled at the latest moment at which it must be executed. Because the control steps are assigned counting downwards, the control step numbers are determined only after the whole schedule is finished.


Figure 3.1: A data flow graph example


These two scheduling algorithms share the same drawback.

In Fig. 3.3, one possible ALAP solution for the DFG in Fig. 3.1 is shown.

3.2.3 LIST SCHEDULING

The problem with ASAP and ALAP is that they provide no global view in the operation selection. Some operations are more critical than others, since they may be the predecessors of a mass of operations. We should always find the vital operations and schedule them first, so as to achieve better scheduling solutions. The priority generation can be done by list scheduling, a heuristic scheduling algorithm.

In list scheduling, the critical path is a basic notion for solving the operation selection problem. The longest path from an operation node to a node with no immediate successor is called the critical path, and the length of this path is called the critical path length. From the critical path length of a node, we can tell which operation is more urgent than the others. In other words, the critical path length is the indicator that determines which operation should have higher priority in the selection under resource constraints.

Usually, list scheduling can be based on several priority functions. The critical path length is one of them. Mobility is another often used priority function for list scheduling. Mobility is the number of control steps from the earliest to the latest feasible control step of an operation. The greater the mobility, the smaller the priority. This means that, if there are conflicts between several operations to be scheduled, the operations with larger mobility should be deferred to later control steps, because they can be scheduled into more control steps.
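In symbols (this formulation is our own, reusing the ASAP and ALAP control step assignments $\Omega_{\mathrm{ASAP}}$ and $\Omega_{\mathrm{ALAP}}$ that also appear in the force formula of section 3.2.4):

$$\mathrm{mobility}(O_i) = \Omega_{\mathrm{ALAP}}(O_i) - \Omega_{\mathrm{ASAP}}(O_i)$$

An operation with mobility 0 is fixed to a single control step, which is exactly how additions 1 and 2 behave in the FDS example of section 3.2.4.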

Time constrained list scheduling can be derived from the original resource constrained list scheduling.

Now we return to the data flow graph in Fig. 3.1, using list scheduling. The three available operations are 1, 2, and 3; their critical path lengths to the end of the path are 1, 2, and 3, respectively. Because only two adders are available, the two operations with the highest priority, operation 2 and operation 3, are scheduled in CS1. All the available resources are used in CS1, so all other operations are deferred to CS2. In CS2, operations 1, 4, and 5 are available, with critical path lengths 1, 1, and 2. So operation 5 is compulsory to schedule, and we arbitrarily pick operation 1 as the other operation to schedule at this moment. At last, the remaining two operations are available to be scheduled in CS3. The final solution is shown in Fig. 3.4. In total, there are 3 control steps using list scheduling, while ASAP and ALAP might generate solutions using 4 control steps. Hence, list scheduling considers the global priority of the operations, so it solves the problem more efficiently.

3.2.4 FORCE-DIRECTED SCHEDULING

The force-directed scheduling (FDS) algorithm [2] is a commonly used scheduling algorithm. It uses a heuristic method to find the solution. Force-directed scheduling is usually treated as a time constrained scheduling technique. The main goal of a time constrained scheduling technique is to reduce the total hardware resources used while still meeting the timing constraints.

In FDS, the ASAP and ALAP schedules of the DFG are computed first. After that, a distribution graph is obtained. A distribution graph is a plot of the distribution over the schedule steps for each operation type. From the distribution graph, we can find the possible resource utilization at every control step. We can achieve an even better utilization by pinching the maximum width of the graph; the force is defined for this purpose. It is quite similar to list scheduling using mobility as the priority function. Here the force is computed by

$$\mathrm{Force}(\Omega(O_i) = S_j) = DG(S_j) - \frac{1}{\Delta T(O_i)} \sum_{s=\Omega_{\mathrm{ASAP}}(O_i)}^{\Omega_{\mathrm{ALAP}}(O_i)} DG(s)$$

where $DG(s)$ is the value of the distribution graph at control step $s$ and $\Delta T(O_i)$ is the number of feasible control steps for operation $O_i$. The smaller the force, the better balanced the solution.

In force-directed scheduling, each operator makes its own schedule to balance the total cost of that kind of resource. Here we make a simple example to illustrate force-directed scheduling.

Figure 3.4: A possible list scheduling solution


The data flow graph is shown in Fig. 3.5 (a). Fig. 3.5 (b) and Fig. 3.5 (c) show the ASAP and ALAP solutions of the DFG. The time constraint here is three clock cycles (control steps). In order to find the minimum resource requirement, FDS is used. Fig. 3.5 (d) is the distribution graph of the addition operations. Because the mobility of addition 1 and addition 2 is 0, they must be scheduled in CS1 and CS2, respectively. The only feasible move towards better resource balance is for addition 3, which can be scheduled in CS2 or CS3. Following the formula given above, the force for add3 scheduled in CS2 is

$$1.5 - \tfrac{1}{2}(1.5 + 0.5) = 0.5,$$

while the force for add3 scheduled in CS3 equals

$$0.5 - \tfrac{1}{2}(1.5 + 0.5) = -0.5.$$

So, in order to use less hardware, add3 should be put in CS3, because -0.5 is smaller than the 0.5 for CS2.

Figure 3.5: Example of FDS
(a) DFG, (b) ASAP for the DFG, (c) ALAP for the DFG, and (d) distribution graph of addition

Modifications of the traditional FDS exist; a scheduling technique called force directed list scheduling (FDLS) is one of them. FDLS changes the time constrained FDS into a resource constrained scheduling. In FDLS, operations are first sorted by their data dependencies, as in nearly all scheduling algorithms. After this, the scheduler checks whether there are sufficient resources for scheduling the DFG by data dependency alone. If not, some operations are deferred to later control steps according to the "deferral force", until the hardware resources are sufficient. For more information on FDLS, see [2].
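As a small numerical check of this example, the following MATLAB fragment recomputes the two forces from the distribution graph of Fig. 3.5 (d). It is an illustrative sketch with assumed variable names, not the thesis implementation.

% Force computation for add3, with DG(CS1)=1, DG(CS2)=1.5, DG(CS3)=0.5.
DG   = [1 1.5 0.5];       % expected number of additions per control step
asap = 2; alap = 3;       % feasible control steps for add3
dT   = alap - asap + 1;   % number of feasible control steps
avg  = sum(DG(asap:alap)) / dT;
for s = asap:alap
    fprintf('Force(add3 in CS%d) = %.1f\n', s, DG(s) - avg);
end
% Prints 0.5 for CS2 and -0.5 for CS3, so add3 is put in CS3.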

Improvements on force-directed scheduling, giving better solutions and faster scheduling times, have also been presented [5][6].

3.2.5 INTEGER LINEAR PROGRAMMING ALGORITHM

The integer linear programming (ILP) method tries to find an optimal schedule using a branch-and-bound search algorithm. It also involves some backtracking, i.e., decisions made previously are changed while seeking a better solution. Most ILP algorithms yield the optimum solution, but for longer DFGs with larger numbers of control steps, ILP computes the result much more slowly. For more information, see [14].

3.2.6 CLASSIFICATION OF SCHEDULING

We can divide the scheduling techniques based on different aspects. According to the restrictions, we mainly classify them into time constrained scheduling and resource constrained scheduling. But, as mentioned for some techniques before, some scheduling techniques can solve both time and resource constrained problems.

According to the scheduling method, we can classify them into iterative/constructive scheduling, global scheduling, transformational scheduling, and mathematical programming scheduling. In iterative/constructive scheduling, one operation is assigned to one control step at a time, and this process is iterated from operation to operation (or control step to control step). ASAP, ALAP, and list scheduling belong to this category. In global scheduling, all control steps and operations are considered simultaneously when deciding the schedule. They take the global environment into account, with the effect that a better solution and/or balance is gained. Force-directed scheduling and the integer linear programming algorithm are such techniques. Transformational scheduling starts from an initial schedule, and better solutions are obtained by successive transformations. Force directed list scheduling, which starts from ASAP and makes it feasible for the constraints, is a transformational scheduling technique. ILP is a kind of mathematical programming scheduling technique, which moves the problem into the mathematical domain and solves it there.


3.3 ALLOCATION TECHNIQUES

In the behavioral synthesis procedure, allocation determines the types of resources and their number, and binds each operation to a specific resource. Usually, resources can be divided into three main categories: functional units, storage elements, and buses. For each of them we should make the allocation separately, in order to minimize the need for hardware resources and to maximize their utilization. The basic idea of resource sharing in allocation is that non-concurrent operations can share the same hardware in different control steps. In order to achieve better utilization, we should group the operations together as much as we can, and try to make as few groups as possible, to minimize the total spending. For this reason, many techniques have been invented for the allocation process.

3.3.1 GREEDY ALLOCATION

The greedy algorithm is a kind of iterative/constructive allocation algorithm. It starts with an empty resource set and then adds resources following the scheduling result. If there are not enough resources available, a new resource is simply added so that the functionality can be fulfilled. If there are available resources, one of them is selected to implement the required function. The greedy algorithm does not care about the extra overhead brought by sharing a resource between compatible pairs. For example, suppose the operation shown on the right of Fig. 3.6 is to be allocated, and two multipliers are available at the moment, already bound to the left and middle operations of Fig. 3.6 in a previous control step. Greedy allocation has no associated selection rules, so it performs the allocation randomly. If the operation reuses the same functional unit (FU) as the left one, only one 2-to-1 multiplexer is needed, while more multiplexers are needed if it is bound to the same FU as the middle one.

The name greedy describes the algorithm well: it reuses all the resources at hand without considering any foreseeable conflicts in the future.

3.3.2 LEFT EDGE (LE) ALGORITHM

The left edge algorithm can be used in many arenas, for example the register allocation process and path routing in layout design. It is an intuitive algorithm that still generates good results.

Why is it called left edge? Because every element has a time range describing its lifetime: one value denotes the birth time, and one the death time. The algorithm sorts all lifetimes according to their birth times, in ascending order. If some have the same starting time, they are also sorted by their ending times, earlier ending first. We have then obtained a sorted list showing the lifetimes of all elements, and we can group them together to obtain a high utilization of resources. The grouping follows this rule: assign the first value in the sorted list to a group; scan the sorted list for the next value whose birth time is larger than or equal to the death time of the previous value and assign this element to the current group; scan until no more values can share the group, and then introduce a new group. Follow this rule until there are no more ungrouped elements in the sorted list. Every group stands for a physical hardware component, and all the elements in a group share the same resource. From the algorithm we can see that it tries to utilize every resource at its maximum capability: an ending time becomes the starting time of another element. In a graphical view, it tries to connect the right edge of a line with the left edge of another line, making the combined line longer and seamless. All the operations are based on these edges.
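The rule above is compact enough to sketch directly. The following MATLAB fragment groups a handful of lifetimes using the left edge rule; the data and variable names are assumed for illustration and are not taken from the thesis program.

% Left edge grouping of value lifetimes, one group per register.
life = [1 3; 1 2; 2 4; 3 5; 4 6];          % assumed [birth death] pairs
[sorted, order] = sortrows(life, [1 2]);   % by birth time, then death time
assigned = false(1, size(life, 1));
reg = {};                                  % one cell per register (group)
for i = 1:size(sorted, 1)
    if assigned(i), continue; end
    group = order(i);                      % start a new group
    last  = sorted(i, 2);                  % current right edge
    assigned(i) = true;
    for j = i+1:size(sorted, 1)
        if ~assigned(j) && sorted(j, 1) >= last
            group = [group order(j)];      % compatible: extend the line
            last  = sorted(j, 2);
            assigned(j) = true;
        end
    end
    reg{end+1} = group;
end
fprintf('%d registers needed for %d values\n', numel(reg), size(life, 1));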


The LE algorithm has the same drawback as the greedy algorithm: it does not take the extra cost into account, but simply minimizes the total number of resources needed. It may bind two completely different elements to one resource rather than two almost identical ones. There should be some guidance for selecting the elements to bind.

3.3.3 CLIQUE PARTITIONING

A clique in a graph is a set of vertices that form a complete subgraph. Treat each basic element as a vertex, and connect two vertices by an edge if they do not conflict. Then the reuse problem in allocation turns into a graph partitioning problem: how to partition the given graph into a minimal number of cliques such that each vertex belongs to exactly one clique.

We can illustrate how the graph is built for the different resource allocations. Take functional units as an example: each operation is represented by a vertex, and an edge connects two vertices if the two operations are scheduled into different control steps and there exists a functional unit that can implement both operations. For storage elements, every value that needs to be stored is represented by a vertex, and an edge exists if the occupation times of the two values do not overlap. So we can see how the realistic problems are translated into clique partitioning problems.

In graph theory, the clique partitioning problem is an NP-complete problem. Exhaustive methods are not preferable for computing an NP-complete problem, so a more efficient algorithm should be used for the clique partitioning problem.

A near minimal cluster partitioning algorithm (CPA) was presented by Tseng [1]. It is a polynomial time algorithm based on step-wise grouping, and the results it generates are good. However, CPA does not distinguish the different profits obtained by grouping different vertices together. It treats all edges equally, so the drawback of extra overhead remains in CPA.

For the sake of distinguishing the different gains from different combinations, a profit driven algorithm based on CPA was released: the near minimal profit directed cluster partitioning algorithm (PDCPA). It overcomes the extra-overhead drawback of most allocation algorithms and achieves better solutions with less hardware overhead. This algorithm is specified in detail in section 4.5.3.


3.3.4 CLASSIFICATION OF ALLOCATION ALGORITHMS

We can classify the allocation algorithms into three main groups: constructive, graph-theoretical, and transformational allocation techniques. Constructive algorithms solve the problem fast but mostly not well enough; the greedy algorithm and some rule-based algorithms belong to this group. Graph-theoretical algorithms, including clique partitioning, CPA, and PDCPA, produce good results for the allocation problem but spend a lot of execution time achieving them. Transformational allocation starts from an initial allocation and then prunes it for better solutions.

4 DESCRIPTION OF THE PROGRAM

This program is constructed in MATLAB and uses the netlist format given by Electronics Systems, Linköping University.

Here we focus on several important techniques for analysing the program.

4.1 PROGRAM SPECIFICATION

Every program is developed for some reason. Why do we need such a program? What can the program provide us?

We already have the hardware connections at hand, but the resources are not sufficient for all the hardware components depicted by the netlist. We have to make a trade-off between the total resources used and the performance. In order to make an approximate evaluation of the hardware overhead, a tool is required that automatically performs this evaluation for all kinds of hardware components. That is the requirement for this program.

The tool reads the netlist and some extra information provided by the user, e.g. the numbers of crucial hardware components. Then, according to the predefined algorithm, it generates approximate hardware spending results for all kinds of hardware components.

The basic inputs and outputs are shown in the black box diagram in Fig. 4.1.


4.2 BASIC STRUCTURE OF THE PROGRAM

4.2.1 BLOCK DIAGRAM OF THE PROGRAM

A program must follow some basic guidelines and existing rules for data processing. The computer cannot think by itself to find the way towards the solution; computer intelligence is also created by predefined rules figured out by people. So we should explore the blocks of this program to see how it works.

The program follows certain scheduling and allocation algorithms and generates the result step by step. The basic structure of the program is shown in Fig. 4.2.

Inputs are fed into the program. After all inputs are read by the "Inputs Reading" block, they are saved in memory in a certain format. The "DFG Analysis" block generates the data flow graph from the netlist: the netlist is analyzed and sorted. At this point, we have grasped the DFG. After that, the "Scheduling" block applies the predefined scheduling algorithm to the sorted DFG, following the constraints, to generate all the timing information for the input DFG. The "Allocation" procedure binds the operations to hardware components following the scheduled DFG. "Allocation" uses a certain algorithm to minimize the hardware overhead and generates a format that can be read by the final "Statistics Calculation" block, in which the outputs are given to the users.

For some blocks, it is necessary to divide them into sub blocks. For example, the "Scheduling" block is divided into some preparation steps and the scheduling step, and the "Allocation" block in fact contains procedures for several kinds of resources. This will be introduced later.

4.2.2 HARDWARE STRUCTURE IN THIS PROGRAM

First we clarify two definitions. The total time from when an input changes to when the corresponding output becomes available is called the sample period. To schedule a DFG, operations are put into different times for execution; the unit of time for scheduling is called a control step. In normative terms, a control step is the basic unit of time in a synchronous system and corresponds to one clock cycle.


The basic idea of the program is hardware reuse. A DFG is scheduled into several control steps, and operations in different control steps can share the same hardware component. The hardware is used over and over again, in different control steps and for different operations. Under this circumstance, several principles must be followed: all data should be stored in storage elements; there must be a path from a specific storage element to the bound functional unit that implements the operation; the functional unit must be able to read data from different storage elements; and the computed data should be stored back to a specific storage element. To fulfill all these requirements, a basic hardware structure is derived, as shown in Fig. 4.3. The data is first stored in registers and then sent to the FUs through multiplexers. After the computation is finished, the values are sent back to the registers. All this happens in one control step. After several iterations, the final output is generated and stored in some of the registers. We can see the data path clearly: data runs through Reg–FU–Reg. This model is the basic hardware structure in this program.

All data must first be stored in registers, including the coefficients of multiplication operations. Hence, the netlist has to be changed to fit this structure: a new input for the coefficient is added and the multiplication format is changed.

Table 4.1 shows an example of the multiplication format change, for "V2 = V1 * 0.5". After the change is performed, "0.5" is stored in V20, and the netlist is changed as shown in the table.

Table 4.1: Multiplication format change

           Operation Type   ID in this kind of op.   First Operand   Second Operand   Third Operand
   Before        5                    1                    1               2               0.5
   After         5                    1                    1              20               2

4.3 DFG ANALYSIS (SORTING)

In this block, a raw netlist is analyzed to create a DFG in which the data dependencies can be seen directly. If the provided netlist is not legal, an error is thrown and the program stops. Only a legal netlist can be read into a data flow graph in this block.

ALGORITHM OF SORTING:

Traverse all the rows in the netlist to find the rows for inputs, outputs, and delay elements.

Mark down the input values of input rows and the output values of delay elements, and put them in the variable "Available". These values are available from the beginning; all new values must be generated from the values in "Available". Put the input and delay rows into "SortedSfg", which stores all the sorted operations.

Initialize a variable "RowsToAdd" recording all the rows that have not yet been processed by the algorithm.

Repeat {

Explore all the rows marked in "RowsToAdd". For each row, if all the predecessors are in "Available", add the output of this row to a temporary variable "AvailToAdd" for later appending to "Available", and add the row number to a temporary variable "RowsToRemove" for later deletion from "RowsToAdd".

Remove all the rows already processed in the current loop from "RowsToAdd".

Add all values in “AvailToAdd” to “Available”.

Add all the rows executed in current loop to “SortedSfg”. }

Until (“RowsToAdd” is empty).

If no errors occur, a valid DFG is read into memory (“SortedSfg”).

In fact, the sorting block has already generated an ASAP scheduling result without resource limitations: the DFG shows the data dependencies, and so does an ASAP schedule without resource limitations.
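The loop above maps almost directly to MATLAB. The fragment below runs it on the example netlist of section 2.3 (which has no delay elements, so only the input rows seed "Available"). The variable names echo the description, but the code is an illustrative sketch, not the thesis implementation; in particular, the per-type port handling in the switch is an assumption based on the netlist format described earlier.

% Sort the example netlist into data-dependency order.
Netlist = [ 1 1 1 NaN NaN; 1 2 2 NaN NaN; 1 3 3 NaN NaN;
            2 1 6 NaN NaN; 2 2 8 NaN NaN;
            3 1 1 2 4; 3 2 4 3 5; 3 3 1 3 7;
            5 1 5 6 0.63; 5 2 7 8 0.32 ];
inputRows = find(Netlist(:, 1) == 1).';
Available = Netlist(inputRows, 3).';        % values known from the start
SortedSfg = Netlist(inputRows, :);
RowsToAdd = setdiff(1:size(Netlist, 1), inputRows);
while ~isempty(RowsToAdd)
    RowsToRemove = []; AvailToAdd = [];
    for r = RowsToAdd
        row = Netlist(r, :);
        switch row(1)
            case 2, ins = row(3);   out = [];      % output row
            case 3, ins = row(3:4); out = row(5);  % adder: two inputs
            case 5, ins = row(3);   out = row(4);  % multiplier, coeff. in col. 5
        end
        if all(ismember(ins, Available))           % all predecessors ready?
            AvailToAdd   = [AvailToAdd out];
            RowsToRemove = [RowsToRemove r];
            SortedSfg    = [SortedSfg; row];
        end
    end
    RowsToAdd = setdiff(RowsToAdd, RowsToRemove);
    Available = [Available AvailToAdd];
end
disp(SortedSfg)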

4.4 SCHEDULING

4.4.1 SUB BLOCKS OF SCHEDULING

In the scheduling block, some necessary data must be computed before the actual scheduling: the critical path length of every operation. The scheduling block therefore consists of two sub blocks, namely critical path length computation and list scheduling. The first computes the critical path length for all the operations in the DFG; the critical path length is the deciding value for list scheduling. The second block performs list scheduling of the DFG using the result generated by the first block. Fig. 4.4 shows the sub blocks inside the scheduling block.

4.4.2 SCHEDULING TECHNIQUE SELECTION

We have already introduced several scheduling techniques in section 3.2. We know that, judged by the restrictions, there are resource constrained and time constrained scheduling techniques, and, judged by the scheduling method, iterative/constructive, global, transformational, and mathematical programming scheduling techniques. How do we decide which algorithm should be employed in the program?

The number of each kind of hardware component is provided by the user, so there are certain resource restrictions here, and resource constrained scheduling is preferable. The most intuitive resource constrained scheduling is list scheduling. It provides good solutions without using a complicated algorithm. So in this program we decided to use list scheduling as the scheduling algorithm, with the critical path length as the priority function: operations with longer critical paths have higher priority in the operation selection.

4.4.3 MULTICYCLING IN SCHEDULING

In reality, different functional units have different delays, but two different functional units assigned to the same control step are used at the same time. This means that the clock cycle has to be determined by the slowest functional unit in the design, which reduces the performance. To overcome this problem, some techniques are used. We briefly introduce them:

1. Chaining: two or more operations are allowed to execute sequentially in a single control step.

2. Multicycling: shorten the control step to fit the faster operations, so the slower operations may take more than one control step. One small drawback of multicycling is that the input must be latched during the whole computation time of the slower operations, which may increase the hardware overhead.

3. Pipelining: a slower functional unit is divided into several stages by latches to achieve pipelining. The slower FUs still take more than one clock cycle, but in every clock cycle new data can be fed into the FU, making it work at a higher frequency.

Figure 4.5 illustrates the different solutions to the delay inequality problem: (a) is the original timing issue, where the length of the clock cycle is decided by the slowest FU; (b) shows the chaining solution; (c) shows the multicycling solution, where latches are needed for the multicycle FU; (d) shows the pipelining solution, where data can be input at a higher frequency.

In the program, we adopt the commonly used multicycling technique to deal with this problem. It looks as if some external latches would be required, but because of the Reg–FU–Reg structure we use, as mentioned in section 4.2.2, no extra resources are spent.

Figure 4.5: Methods for dealing with varying delay
(a) original timing issue, (b) chaining solution, (c) multicycling solution, (d) pipelining solution


4.4.4 CRITICAL PATH COMPUTATION

For list scheduling, it is important to determine the critical path length correctly. A critical path is the longest path from an operation node to a node with no immediate successor. So in fact, every operation node has its own critical path length, equal to the sum of its execution time and the execution times along its longest chain of successors. As the complete DFG is available, we can compute the critical path length of each operation before scheduling.

Figure 4.6 shows a simple example of critical path length computation. Assume that the weight of a multiplication operation is 2, while that of an addition is 1. The figure shows the critical path lengths of all the operations. At the beginning, four operations are available for execution, namely Op1, Op2, Op3, and Op4. Op4 has the highest priority, then Op2 and Op3, while Op1 has the lowest priority. If two adders are available in total, then Op4 must be scheduled in control step 1, and one of Op2 and Op3 is randomly selected as the other addition to be scheduled in control step 1. Following this scheduling manner, the final schedule is achieved.

The algorithm for computing the critical path lengths is based on the sorted DFG. Because the sorting procedure is an ASAP schedule without resource limitations, the operations in the sorted DFG are ordered according to the data dependencies. That means that, if you view the operations upside down, treating the DFG as a binary tree (sometimes it is not a binary tree, when an operation result is used by more than one successive operation), the reversed data dependency still holds. So the critical path lengths can be computed backwards from the sorted DFG without inaccurate results. During the computation, a list of the input IDs of all computed operations (format: [InputID | Length]) is maintained, recording the largest critical path length along the successive paths of each value.

This is now illustrated by an example. Assume that the weight of a multiplication operation is two, while that of an addition is one. The sorted DFG is shown in Fig. 4.7.

The netlist is shown as:

SortedDfg = [
1 1 1 NaN NaN % Row 1, an input specification
1 2 2 NaN NaN % Row 2, an input specification
1 3 3 NaN NaN % Row 3, an input specification
1 4 4 NaN NaN % Row 4, an input specification
1 5 5 NaN NaN % Row 5, an input specification
1 6 6 NaN NaN % Row 6, an input specification
1 7 7 NaN NaN % Row 7, an input specification
3 1 2 3 8 % Row 8, an adder specification
3 2 4 5 9 % Row 9, an adder specification
3 3 6 7 10 % Row 10, an adder specification
5 1 8 9 11 % Row 11, a multiplier specification
3 4 9 10 12 % Row 12, an adder specification
2 1 12 NaN NaN % Row 13, an output specification
5 2 1 11 13 % Row 14, a multiplier specification
2 2 13 NaN NaN % Row 15, an output specification
]; % "NaN" stands for "Not a Number"

At first, the list for computing the critical path lengths is empty. Start from the end of the sorted DFG. Inputs, outputs, and delay nodes do not need to be computed. The first row to compute is the operation at row 14. The inputs of this row are 1 and 11, and the operation is a multiplication. At this moment, there is no entry in the list, so the critical path length is 0 + 2, and the two pairs {1,2} and {11,2} are put into the list for later use. Next comes the addition at row 12, whose weight is 1. No value in the list is related to the output (12) of this operation, so {9,1} and {10,1} are put into the list. Next is the multiplication at row 11. Its output (11) is in the list, marked as {11,2}, so the length is 2 + 2 = 4, and {8,4} and {9,4} are added to the list. Note that {9,1} is overwritten by the new entry for input ID 9. Continuing in this manner, all the critical path lengths can be generated.
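The backwards pass is easy to express in MATLAB. The fragment below recomputes the critical path lengths for the sorted DFG listed above; the vector "best" plays the role of the [InputID | Length] list, indexed by value ID. It is an illustrative sketch with assumed names, not the thesis implementation.

% Backwards critical path computation (weights: addition 1, multiplication 2).
SortedDfg = [ 1 1 1 NaN NaN; 1 2 2 NaN NaN; 1 3 3 NaN NaN; 1 4 4 NaN NaN;
              1 5 5 NaN NaN; 1 6 6 NaN NaN; 1 7 7 NaN NaN;
              3 1 2 3 8; 3 2 4 5 9; 3 3 6 7 10;
              5 1 8 9 11; 3 4 9 10 12;
              2 1 12 NaN NaN; 5 2 1 11 13; 2 2 13 NaN NaN ];
weight = [0 0 1 0 2];             % execution time per operation type
best   = zeros(1, 13);            % longest successor path seen per value ID
cpl    = zeros(1, size(SortedDfg, 1));
for r = size(SortedDfg, 1):-1:1   % traverse the sorted DFG backwards
    row = SortedDfg(r, :);
    if row(1) == 3 || row(1) == 5            % adders and multipliers only
        len    = best(row(5)) + weight(row(1));
        cpl(r) = len;
        best(row(3)) = max(best(row(3)), len);   % record for both inputs
        best(row(4)) = max(best(row(4)), len);
    end
end
disp(cpl)   % row 14 gets 2, row 11 gets 4, rows 8 and 9 get 5, ...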


4.4.5 LIST SCHEDULING

ALGORITHM:

ControlStep = 1;
Repeat {
    For each resource type {
        Determine the candidate operations (those that can be executed at this moment).
        If there are available resources, pick the candidates with the highest critical path length for scheduling at step "ControlStep".
    }
    ControlStep = ControlStep + 1;
}
Until (all the operations are scheduled)

Sometimes, several available operations have the same critical path length while there are not enough resources for all of them; then they are picked arbitrarily. For easier tracing, the program provides a parameter as a random selection option: it selects either randomly or from the left-most side, respectively.
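For a single resource type, the loop above can be sketched in a few lines of MATLAB. The fragment reuses the small DFG of section 3.2.3 with two adders; the predecessor encoding is an assumed reading of Fig. 3.1, and the code is illustrative, not the thesis implementation.

% List scheduling of one resource type by critical path length.
pred   = {[], [], [], [2], [3], [5]};   % assumed predecessors per operation
cpl    = [1 2 3 1 2 1];                 % critical path lengths (priorities)
nAdder = 2;
cs     = zeros(1, numel(pred));         % 0 = not yet scheduled
step   = 1;
while any(cs == 0)
    cand = [];                          % candidates for this control step
    for op = 1:numel(pred)
        p = cs(pred{op});
        if cs(op) == 0 && all(p > 0 & p < step)
            cand = [cand op];
        end
    end
    [ignore, idx] = sort(cpl(cand));    % ascending priority order ...
    idx  = fliplr(idx);                 % ... flipped: highest first
    pick = cand(idx(1:min(nAdder, numel(cand))));
    cs(pick) = step;
    step = step + 1;
end
disp(cs)   % three control steps in total (ties are broken arbitrarily)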

EXAMPLE

Here we study an FIR filter as an example to illustrate the scheduling. The netlist of the FIR filter and the data flow graph are shown in Fig. 4.8 and Fig. 4.9, respectively. In Fig. 4.9, an italic number at a multiplier indicates that it is a coefficient. The dashed arcs are also data flow; they are dashed only to make the graph clearer. In fact, the DFG in Fig. 4.9 is already a sorted DFG.

We then try to schedule the sorted DFG using list scheduling.

First, the critical path lengths are computed for all operations. This adds one more column to the sorted DFG; the resulting netlist is shown in Fig. 4.10. Assume that two adders and four multipliers are available. This yields the scheduling shown in Fig. 4.11. The scheduler in this example selects the left-most entries when arbitrary selections are needed.

Figure 4.8: Netlist of FIR filter

The delay elements also need to be scheduled, because they stand for a pure data transfer from one register to another. We must make sure that the value stored in the destination register has ended its lifetime before the new value flows in. This is hard to guarantee, so it is a good idea to make all these assignments at the end of the sample period and let them all transfer to other registers at the same time, so that we do not need to care about their order. If no input of a delay element is generated by an operation in the last control step, all the delay elements can be put into the last control step, which improves performance. If that is not the case, the delay elements must occupy an extra control step, which then becomes the last control step. In this example, the delay elements do not conflict with the operations in the last control step, so they can be put in the last control step.


4.5 ALLOCATION

4.5.1 SUB BLOCKS OF ALLOCATION

After the scheduling process is finished, allocation is performed. Allocation binds all operations to functional units, determines the usage of registers and tries to reuse them as much as possible, and minimizes the overhead of the interconnection wires. It can be divided into three consecutive allocation sub blocks, as illustrated in Fig. 4.12.

4.5.2 ALLOCATION TECHNIQUE SELECTION

In section 3.3, we introduced some of the basic allocation techniques most used nowadays. But which one should we use in this program?

The goal of the program is to evaluate the total resource overhead in detail. The main purpose is to minimize the resources needed, so we should select an algorithm that generates better solutions rather than one that runs faster.


The PDCPA algorithm not only groups the vertices together, generating all the clusters, it also groups them based on a priority function. The results obtained are good compared to other allocation techniques. Based on this analysis, we decided to use PDCPA as the allocation algorithm of the program, seeking better solutions.

4.5.3 PDCPA

A near minimal profit directed cluster partitioning algorithm (PDCPA) was introduced in [1]. Because it is the basic algorithm adopted in our program, we give a detailed description here.

BASIC SETTINGS:

The edges of the original graph G are partitioned into several categories. The least profitable category is labeled 1; with increasing profitability, the number increases. The set of all edges of category m is denoted G(m).

PDCPA:

A Near Minimal Profit Directed Cluster Partitioning Algorithm

Given a graph G and a subgraph G(k) of G that contains the edges with the highest potential profitability, simultaneously reduce G and G(k) by following the instructions below, which determine which nodes should be merged first. The graph under execution is the maximum priority subgraph at the moment, G(k).

1. For each edge, (i,j),

a. Compute the number of common neighbors of i and j. A node r is a common neighbor of i and j if r is connected to i by an edge and r is connected to j by an edge.

b. Compute the number of edges that will be deleted from the graph if i and j are merged into a single node. When i and j are merged, the edge (i,j) will be deleted, along with all edges to nodes r that are neighbors of either i or j but not both. If node r is a common neighbor of both i and j, then one of the edges from r to i and j will be deleted and one will be kept.

2. Start collecting nodes for the next cluster.

a. If there are no edges in the graph, stop. Otherwise, continue.

b. Select an edge (p,q) that has the maximum number of common neighbors. If there is more than one such edge, select one that will result in the fewest edge deletions when p and q are merged. If there is still more than one choice, select an arbitrary representative.

c. Merge p and q into a cluster. We will list the elements of the cluster in order of increasing label and will call the lowest labeled element in the cluster the head of the cluster. At this time, the current cluster contains only the two elements p and q.

d. Call the graph update subroutine, shown at the end of this description, to merge nodes p and q of the graph into a single node. Label the combined node p, where p is the head of the cluster. At this time, p represents the cluster {p, q}.

3. Node p in the graph represents the current cluster. Add an additional node to the current cluster if possible by performing this step.

a. If there are no edges connected to node p, then node p represents the full cluster. Go to step 2 to start the next cluster. Otherwise, continue in order to add another node to the current cluster.

b. Consider the nodes connected to node p. Select an edge (i,p) or (p,i) such that nodes i and p have the most common neighbors. If more than one choice is available, select the node i such that the number of edges removed when nodes i and p are merged is minimal. If there is still more than one choice, arbitrarily select any choice for node i.

c. Add node i to the current cluster. If i<p, then i becomes the new head of the cluster; otherwise, p remains as the cluster head.

d. Call the graph update subroutine to merge nodes i and p.

e. Set p = min(i,p). Node p is the new cluster head.

f. Repeat step 3.

Profit directed graph update subroutine for merging nodes x and y (x < y):

1. Merge nodes x and y into a single node labeled x in both G and G(k).

2. Update the edges of graphs G and G(k) as follows.

a. Delete all edges involving node y in both G and G(k).

b. In graph G, delete any edge between node r and x unless there was also an edge between r and y before the merger. If the deleted edge in G is also in G(k), then delete it from G(k).

c. If there was an edge between r and x and also an edge between r and y in graph G before the merger, then keep the edge in graph G. If the edge was also in G(k) before the merger, then keep the edge in graph G(k). If the edge that is kept in G was not in G(k) before the merger, it may need to be added to G(k): as a result of the merger, the edge may change category, and if the new category of the retained edge in graph G is greater than or equal to k, and the edge was not previously in G(k), then add it to G(k).

d. Recompute the number of common neighbors and number of edges deleted for each edge remaining in graph G(k).

The algorithm above is executed once for every category. When it has finished, the highest category subgraph will eventually be empty. Then a new highest category subgraph G(l) (l < k) can be found in the complete graph G, and G and G(l) are simultaneously reduced using the above algorithm again. Continue in this manner until the lowest category is reached; at that time, G will be equal to G(1). Then we can use CPA to generate the final result.


We will illustrate PDCPA with an example in section 4.5.5.

4.5.4 REGISTER ALLOCATION

Registers are one of the most important factors affecting the overhead of a design. With more registers, data can flow with more freedom. A well-organized assignment of registers leads to less hardware overhead, but makes the data flow harder to manage. An automated design program should therefore provide a good allocation pattern to fulfill this requirement.

REGISTER INITIALIZATION

In the netlist, many values are used, and they all need to be stored in registers according to the hardware structure mentioned in section 4.2.2. The goal of register allocation is to minimize the total number of registers. In order to explore the possibility of register reuse, we have to examine the lifetimes of all the values in the scheduled DFG. If the lifetimes of several values do not overlap, they can share one register.

The register initialization constructs the lifetime table for all the values used in the scheduled DFG. As there are several types of values, we will discuss them separately.

The values stored in the registers may come from external inputs, outputs of delay elements, or midterm results of operations. External input values can be read at any time during the sample period, so there is no need to store them together with the other, compulsory values; we can load them into the registers whenever they are needed. The output of a delay element must be kept until it is no longer used, so values of this kind are stored from the first timeslot of the sample period to the timeslot where the delayed value is consumed. Midterm results are stored from the time they are generated by an operation and are discarded when they are no longer needed.

Here are the concrete instructions for the register initialization. First, we add the time–valueID relationships for the values available at the beginning of the sample period; this only includes the outputs of delay elements. Then, we explore the whole scheduled DFG and add relationships step by step. For each operation, we add time–valueID relationships for each value the operation involves. The output of an operation is simply added with the timeslot in which it should be stored. For an input of an operation, we trace the source of the input value and make continuous relationships from the birth of the value to the time of the current operation; in other words, all input data must be stored somewhere after they are generated, so we make a continuous relationship for each input value. Finally, we make the relationships for the outputs of the DFG: the output values must be kept in registers after they have been computed. The relationship format is defined as [ValueID | TimeID].
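As a sketch of these instructions, the fragment below builds such a table for a toy slice of the example treated next; the matrix layout of sdfg and the helper arrays birth and isExt are assumptions made for this illustration, and the final DFG outputs (which must be kept until the last time slot) are omitted.

% Toy slice: delay outputs 2-7 exist at Time1; one multiplication at
% CS1 (inputs 1 and 21, output 8) and one addition at CS2 (inputs 8
% and 9, output value assumed to be 10).
delayOut = 2:7;
sdfg = [1 5 8 1 21;                   % [cstep opType outVal in1 in2]
        2 3 10 8 9];
birth = zeros(1, 21); birth([8 9]) = 2;      % production slots
isExt = false(1, 21); isExt([1 21]) = true;  % external inputs

life = [delayOut(:), ones(numel(delayOut), 1)];  % [ValueID | TimeID]
for r = 1:size(sdfg, 1)
    cs = sdfg(r, 1);                  % control step of the operation
    for v = sdfg(r, 4:5)              % the two input values
        if isExt(v)
            life = [life; v, cs];     % loaded only when needed
        else
            t = (birth(v):cs)';       % alive from birth to use
            life = [life; repmat(v, numel(t), 1), t];
        end
    end
    life = [life; sdfg(r, 3), cs + 1];  % output stored one slot later
end
life = unique(life, 'rows');          % drop duplicate entries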

Let’s look at Fig. 4.11 on page 35, take the scheduled DFG for example. There are 6 delay elements, so we add { (2,1), (3,1), (4,1), (5,1), (6,1), (7,1) } at the beginning. Then we go on exploring the operations. Note that the coef-ficients for multiplications are already changed to a new external input fol-lowing the Reg–FU–Reg model. For the first multiplication operation at row 2 in Fig. 4.10 on page 34, the input values 1 and 21 are external inputs (value 21 stands for the coefficient of the multiplication). Load them into registers at Time1. The output of this operation is at Time2. So (1,1), (21,1), (8,2) are added to the relationship table. The same things happen for all the operations in CS1. Then it turns to the operations in CS2. For the first operation in CS2, the inputs 8 and 9 are midterm values, when adding the relation, check where these values are generated and add the relationship if it lasts for some time. Continue for all the operations. At last, check the outputs to see where they are found. Make sure they are added for all the timeslots from the birth to the final output timeslot. Finally, you can obtain the same map shown as Fig. 4.13.

COMPATIBILITY GRAPH GENERATION

After the initialization step, generating the compatibility graph is straightforward. We first focus on the connection between two sample periods: the last timeslot of the previous period is the starting Time1 of the consecutive one. It is therefore convenient to discard the last timeslot and add all the outputs of the last timeslot to the first timeslot, Time1.

For example, value 1 occupies Time1 and Time5, while value 2 takes Time1. They overlap, so values 1 and 2 are not compatible. Value 8 occupies only Time2, so values 1 and 8 are compatible, and values 2 and 8 are also compatible. By simply judging the overlaps, a compatibility graph of 166 entries is finally generated.
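Under the same assumed representation, the overlap test can be sketched as follows; applied to the full example, this pairwise test is what yields the 166 compatibility entries mentioned above.

% Two values are compatible when their sets of occupied time slots
% are disjoint; compat is indexed directly by value ID.
vals = unique(life(:, 1))';
compat = false(max(vals));
for a = vals
    ta = life(life(:, 1) == a, 2);    % slots occupied by value a
    for b = vals(vals > a)
        tb = life(life(:, 1) == b, 2);
        if isempty(intersect(ta, tb))        % no overlap
            compat(a, b) = true; compat(b, a) = true;
        end
    end
end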


ALLOCATION OF REGISTERS

With the compatibility graph, we can apply PDCPA to it. But before starting, we must define the categories that distinguish the different profits of grouping different pairs together.

For registers, one important operation type is the delay element. A delay element transfers a value to a different container for storage. It is therefore a good idea to group these two values together to share the same register if possible, so that no extra interconnection from one register to another is needed for the delay statement. Hence, we make the categories as follows:

G(2) For two values connected by a delay element
G(1) For all other compatible nodes

In this example, there are no compatibility pairs in category 2, so we use CPA to compute the clusters. The result of CPA is: ( {1,10,14}, {2}, {3}, {4}, {5}, {6}, {7}, {8,12,17,19,20}, {9,13,18,21}, {11,15,22}, {16,23,25}, {24,26}, {27} ).

Thirteen registers are required. We call each register by the minimum value stored in it, so registers 1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 16, 24, and 27 are used in the solution of the FIR filter.
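For completeness, the category assignment described above can be sketched as below; delayPairs, assumed to list the input and output value of each delay element, is a hypothetical helper structure.

% Category 2 for compatible values connected by a delay element,
% category 1 for all other compatible pairs.
C = double(compat);
for d = 1:size(delayPairs, 1)
    i = delayPairs(d, 1); j = delayPairs(d, 2);
    if compat(i, j)
        C(i, j) = 2; C(j, i) = 2;
    end
end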

The DFG now changes to register form, as shown in Fig. 4.14.

VALUE SCALE TO REGISTER SCALE

As we have generated all the clusters for register allocation, we know which register each value should be stored in. We then transform the scheduled DFG, changing all value markers to the corresponding register markers. We call the transformed DFG the registered DFG. After the transformation, the textual registered DFG is shown in Fig. 4.15. In the table, the first column is the scheduled control step of the operations; from column 2 onwards, the columns are listed in the same order as before.
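A minimal sketch of this relabeling, assuming the cluster list produced by the allocation step and the sdfg column layout used earlier:

% Each value is renamed to the smallest value ID in its cluster,
% which is the register that holds it.
clusters = {[1 10 14], [2], [3], [4], [5], [6], [7], ...
            [8 12 17 19 20], [9 13 18 21], [11 15 22], ...
            [16 23 25], [24 26], [27]};
reg = zeros(1, max([clusters{:}]));   % value ID -> register ID
for c = 1:numel(clusters)
    reg(clusters{c}) = min(clusters{c});
end
rdfg = sdfg;                          % registered DFG
rdfg(:, 3:5) = reg(sdfg(:, 3:5));     % relabel output and input columns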

4.5.5 FUNCTIONAL UNIT ALLOCATION

Three kinds of functional units appear in this program: adders, multipliers, and subtractors. To obtain maximum resource utilization, we apply PDCPA to each of them separately.


Firstly, we construct a compatibility table for PDCPA to operate on. For functional units, the only conflict occurs when two operations run at the same time, which is easily determined after the scheduling procedure. Secondly, we define the categories used in PDCPA for functional units. The functional units in this program are all two-input, one-output operations. Because addition and multiplication are commutative, the sources of their two input ports can be interchanged, while for the noncommutative subtraction the order of the two input ports matters. When checking categories, this must be handled carefully according to the operation type and the three ports. After finishing all the preparation work, PDCPA can be applied.
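The port check that determines the category can be sketched as follows, where we interpret a "compatible" pair of ports as the two operations using the same register on that port; fu_category and the three-element port representation are invented for this sketch, and the categories G(1) to G(4) are defined immediately below.

function k = fu_category(op1, op2, commutative)
% op1 and op2 hold the register IDs [in1 in2 out] of two operations;
% the category is the number of matching port pairs plus one.
same = (op1(1) == op2(1)) + (op1(2) == op2(2)) + (op1(3) == op2(3));
if commutative                        % addition and multiplication
    swapped = (op1(1) == op2(2)) + (op1(2) == op2(1)) ...
              + (op1(3) == op2(3));
    same = max(same, swapped);        % inputs may be interchanged
end
k = same + 1;                         % G(1) none match ... G(4) all match
end

For instance, operations 18 and 21 in the example below both read R8 and R9 and write R8, so fu_category([8 9 8], [8 9 8], true) returns 4.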

The categories are defined as:

G(4) All three pairs of ports are compatible
G(3) Two of the three pairs of ports are compatible
G(2) One of the three pairs of ports is compatible
G(1) None of the three pairs of ports are compatible

For the example in Fig. 4.14, consider the addition operations. In the program, we mark the operations by their row ID in the modified scheduled DFG, shown as underlined numbers in the diagram. That means, as shown in Fig. 4.15, the addition operations are marked 16, 17, 18, 19, 20, and 21 (Op_Type "3", "4", and "5" corresponds to addition, subtraction, and multiplication, respectively). Generating the compatibility graph following the rules mentioned before yields:

(16,18) 3   (16,19) 1   (16,20) 2   (16,21) 3
(17,18) 1   (17,19) 2   (17,20) 2   (17,21) 1
(18,20) 3   (18,21) 4   (19,20) 1   (19,21) 1
(20,21) 3

The graphical representation is shown in Fig. 4.16.

We now illustrate how PDCPA works on this compatibility graph. Some external variables are needed for tracking the ports of the addition resource. The algorithm was presented in section 4.5.3. Initially, it behaves as if six adders are provided, each one acting as a dedicated addition FU for a single operation.

In the compatibility graph, the highest category is initially 4, and there is only one arc of that category. This subgraph contains a single arc connecting two nodes, namely 18 and 21. The number of common neighbors for this arc is zero, and the number of deleted edges after joining is one. Because there are no competitors, we join 18 and 21. Since all three ports of the two operations are the same, the merged FU, still called FU18, has one input port from register 8 (R8), one from R9, and its output to R8. For the compatibility graph, we call the update subroutine with x = 18, y = 21. The subroutine deletes the edges in G and G(4); when deleting edges, it checks whether any category has changed and maintains it, but no category changes in this invocation. After the first invocation of the subroutine, the compatibility graph becomes:

(16,18) 3   (16,19) 1   (16,20) 2   (17,18) 1
(17,19) 2   (17,20) 2   (18,20) 3   (19,20) 1

The graphical presentation is shown in Fig. 4.17.

Now, following the algorithm, we must decide which pair of operations to merge next. We select (16,18) for example. The hardware FU16, binding the previous FU16 and FU18, has one input port for {8}, another input port for {9}, and an output port for {8,11}. The update subroutine proceeds as before, except that the edge (16,20), previously of category 2, is lifted to a higher category by the merger. After the invocation of the subroutine, the compatibility graph becomes:

(16,20) 3   (17,19) 2   (17,20) 2   (19,20) 1

The graphical presentation is shown in Fig. 4.18 (FU allocation example, step three).

Because there is still an edge connected to node 16 in the highest priority subgraph, (16,20) is selected for grouping, following the algorithm. The port situation of FU16 then becomes: one input port {8}, another input port {1,9}, and output port {8,11}. Using the update subroutine, the complete graph is finally obtained, as shown in Fig. 4.19.

CPA is then used to group 17 and 19; after grouping, FU17 has one input {1,16}, another input {11}, and output {9,16}.

Finally, the clusters for this example are ( {16,18,20,21}, {17,19} ). The first cluster gives FU16 with {8} and {9} as its two inputs and {8,11} as its output; the second gives FU17 with {1,16} and {11} as its two inputs and {9,16} as its output.

