Virtual-Channel Based Wormhole NoC on FPGA for ForSyDe/NoC System Generator Tool Suite

DEGREE PROJECT IN INFORMATION AND COMMUNICATION TECHNOLOGY, SECOND CYCLE, 30 CREDITS

STOCKHOLM, SWEDEN 2018

ZHANG RUNZI

KTH ROYAL INSTITUTE OF TECHNOLOGY

SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE


Abstract

Nowadays, the number of processors integrated on a system-on-chip (SoC) is increasing rapidly, making multiprocessor system-on-chip (MPSoC) design a regular feature of embedded systems [25]. To support communication between several homogeneous or heterogeneous processors, a communication infrastructure called Network-on-Chip (NoC) was proposed more than ten years ago.

Decades of research have given the NoC area an increasingly sophisticated body of knowledge, and a wide variety of NoC systems have been developed and evaluated. To simplify the design flow of NoC systems, a NoC System Generator (NGS) tool [23] [25] [26] has been developed by researchers at the Royal Institute of Technology. It aims to generate NoC systems with different topologies, sizes, processing elements, and network protocols on FPGAs. NoC routers are central to modelling NoC communication. In the current version, the Nostrum router is the only supported communication backbone.

This project starts from an analysis of NoC theory and then examines the important aspects of NoC router design in order to implement a register transfer level (RTL) synthesizable wormhole (WH) router in VHDL for the NGS tool. The WH router is a comparatively old but widely used NoC router structure. Virtual-channel (VC) based flow control is adopted in this design to improve performance. After the WH router is implemented, functional tests on a single router and on 2×2 and 3×3 WH router based networks are performed in Modelsim 10.0. VHDL test benches with “textio” operations are prepared to generate a clear report that reveals the transfer details of packets in the network. The router and the 2×2 network are synthesized in Quartus II for the Cyclone IV E FPGA family to compare the area overhead with the Nostrum NoC.


Acknowledgement

I would first like to express my gratitude to my supervisor Johnny Öberg for giving me the chance to work on this project and improve my skills in VHDL design. This thesis could not have been accomplished without his help at every stage of the work. I would also like to thank Kalle, who gave me a lot of support and valuable comments during the project.

Special thanks go to my family and Jin Keyu, who gave me endless encouragement and support while I was in Stockholm, far away from home, this year. I would not have had the opportunity to finish this project without them.


List of Figures

Figure 2.1: NoC research area classification [29]

Figure 2.2: A NoC based MPSoC and three building blocks of a NoC

Figure 2.3: A typical NoC router architecture

Figure 2.4: Link-level and end-to-end flow control [10]

Figure 2.5: Examples of classical direct topology NoC

Figure 2.6: Examples of indirect topology NoC

Figure 2.7: Examples of a deadlock situation for 4 packets/flits

Figure 2.8: Examples of a fixed priority arbitration procedure

Figure 2.9: Examples of a round-robin arbitration procedure [10]

Figure 2.10: Buffered flow control vs. Bufferless flow control [11]

Figure 2.11: Package and Flit

Figure 2.12: VCT vs. WH when sending a packet with 6 flits [10]

Figure 3.1: Overview of platform generation process [23]

Figure 3.2: Topology supported by NGS tool and addressing method

Figure 3.3: Message/Packet/Flit in Nostrum NoC

Figure 3.4: XY routing example and routing path blocked

Figure 3.5: A deflective routing example

Figure 3.6: Block diagram of Nostrum NoC

Figure 4.1: A Canonical model of VC based WH router

Figure 4.2: Packet structure

Figure 4.3: Flit processing flow

Figure 4.4: Flit segmentation

Figure 4.5: Credit-based Flow Control Model

Figure 4.6: Two ways to introduce RC unit

Figure 4.7: State transition diagram of FSM

Figure 5.1: Result of single router experiment 1

Figure 5.5: Result of 2×2 network experiment 1

Figure 5.6: Result of 2×2 network experiment 2

Figure 5.7: Result of 3×3 network experiment


List of Tables

Table 3.1: Routing table of Nostrum NoC [11][24]

Table 4.1: Routing table for RC unit

Table 5.1: Parameters for Experiments

Table 5.2: Configuration of Router

Table 5.3: Area of 2VCs, 4VCs and 8VCs WH router

Table 5.4: VC-based WH router and Nostrum Router


Chapter 1 Introduction

Nowadays, the number of processors in a System-on-Chip (SoC) is increasing rapidly due to dramatic improvements in technology scaling. A reformulated Moore’s law predicts a doubling of the number of cores every eighteen months. Multiprocessor System-on-Chip (MPSoC) design is already a regular feature of embedded systems [25].

MPSoC requires intensive parallel communication, and a bus-based system may not fulfil the bandwidth, latency, and power consumption requirements. A solution to this communication bottleneck is an architecture called Network-on-Chip (NoC), an embedded network that interconnects the processors in an MPSoC [2]. Its inherent redundancy, topology design and packet-based communication [25] provide advantages in reusability, scalability, and reliability [6].

1.1 Background

Decades of research have given the NoC area an increasingly sophisticated body of knowledge. A wide variety of NoC systems have been developed and evaluated. This development gives the term NoC a broader meaning. It can include the hardware communication infrastructure, middleware, operating system communication services, and a design methodology together with tools to map applications onto the NoC. [14]

To simplify the design flow of NoC systems, a NoC System Generator (NGS) tool [23] [25] [26] has been developed by researchers at the Royal Institute of Technology. It aims to generate NoC systems with different topologies, sizes, processing elements, and network protocols on FPGAs. In the current version, the Nostrum router is the only supported communication backbone.

NoC routers are the “heart” components of a NoC. [10] They form the backbone of the network communication. The design space for a NoC router is considerably large, and several types of NoC routers can be implemented using different theories and approaches, such as the routing algorithm, flow control, and buffering strategy, to achieve the required performance and suit diverse NoCs.

Wormhole routing was first introduced by Dally and Seitz in 1986 to build fast single-chip routers [10]. Many improvements have since been made on the basis of this switching strategy. Virtual Channels (VCs) were introduced in wormhole-switched networks in 1992 [7]. Numerous studies on wormhole switching exist in the literature, and some commercial wormhole switches are summarized in [22]. Wormhole switching is a comparatively old but widely used technology.

1.2 Goal

The goal of this project is to implement a Register Transfer Level (RTL) synthesizable Wormhole (WH) router in VHDL for the NGS tool, after a detailed study of NoC theory and especially of NoC router design aspects. The implemented router will be simulated in Modelsim to ensure that it functions properly. It will then be synthesized in Quartus II for the Cyclone IV E FPGA family to compare the area overhead with the Nostrum NoC.

1.3 Outline

• Chapter 2: the basic theory of Network-on-Chip and related work, with the aspects related to NoC router design discussed in depth.

• Chapter 3: an overview of the NGS tool and a detailed description of the characteristics of the Nostrum NoC.

• Chapter 4: the implementation of a virtual-channel based wormhole router in RTL.

• Chapter 5: the simulation and synthesis results of the target router, and a comparison between this router and the Nostrum router.

• Chapter 6: conclusions drawn from the experimental results.

• Chapter 7: future work related to this project.


Chapter 2 Network-on-Chip

The concept of Network-on-Chip was raised as a new paradigm for modular and scalable communication in complex Multiprocessor System-on-Chip (MPSoC) designs, because the growing number of cores places higher demands on intensive parallel communication between them. The NoC structure can provide larger bandwidth, lower latency and lower power consumption than traditional bus-based systems, which leads to high MPSoC performance. [2] [14] After more than ten years of development, it is today one of the basic communication cornerstones of almost all large-scale chips. [21] In this chapter, the main concepts under the label of NoC are introduced briefly to provide an overview of the NoC system. Then, the principles related to the NoC router (an elementary building block of a NoC) are detailed from the perspective of NoC design.

2.1 Research area of NoC

The design space and research area of NoC are considerably large. A typical classification of the NoC research area, taken from [29], is summarized in Figure 2.1.

Figure 2.1: NoC research area classification [29]

This figure provides an overview of the NoC system and can be used as a roadmap to investigate and evaluate almost all aspects of a NoC. In the following sections, the key concepts in this figure are discussed separately, except the system level, which was mentioned briefly in Chapter 1 and is beyond the scope of this thesis. The network adapter is covered in Section 2.2.2 and the link level in Section 2.2.1. The analysis of the network part can be found in Sections 2.3 to 2.5.

2.2 NoC structure

A NoC based MPSoC can be abstracted as in Figure 2.2. It is composed of a set of basic cells, each of which is called a node. The physical medium connecting the nodes is the links. Each node has a Processing Element (PE), sometimes called a Resource. The PEs of the nodes can be all the same or entirely different, such as CPUs, specialized cores, GPUs, etc., together with the associated memory hierarchies and memory controllers. As a result, the system can have either a homogeneous or a heterogeneous architecture. Each PE is connected to the Network Adapter (NA) through the Resource Network Interface (RNI). The RNI plays a significant role in decoupling computation from communication. It also ensures the logical connection between the PEs and the network, especially in a heterogeneous architecture in which each PE may have a distinct interface protocol. The routers are the backbone of the NoC and are responsible for switching the data from the PEs. The links, RNIs, and routers constitute the three elementary building blocks of a NoC structure. [4] [5] [6]

2.2.1 Links

A link in a NoC is a group of wires that directly connects two routers. It is an important part of the NoC. Generally, each link is composed of a pair of opposite, unidirectional, point-to-point physical channels that make a full-duplex connection between two routers. This kind of implementation falls under the category of synchronous protocols between source and target nodes. There are also links that use an asynchronous protocol. One commonly used example is Globally Asynchronous Locally Synchronous (GALS). GALS links have an advantage in power consumption, especially in large chips, and potential performance benefits in forwarding latency. However, they involve some area and dynamic power overhead compared with synchronous ones. [29]

Figure 2.2: A NoC based MPSoC and three building blocks of a NoC

As introduced in Chapter 1, a NoC adopts packet-based communication, and the implementation of the links belongs to the physical layer. In this layer, the packet format shall be defined. A packet can be divided into several smaller units named flits (this concept is illustrated in detail in later sections) to achieve lower overhead and higher transmission efficiency. [8] The flit is the atomic flow control unit defined at the link level.

Furthermore, the minimum amount of data transmitted in one link transaction is a phit (physical unit). In most cases, a flit equals a phit and the flit width matches the link width.

[Figure 2.2 labels: node, PE, RNI, router, link, NoC]

2.2.2 Resource Network Interface (RNI)

The RNI block attaches a PE to the NoC. It makes the computation and communication parts independent of each other and works as an adaptor between different resources and the NoC. The tasks of an RNI can be divided into two aspects. The first is to act as a socket connected to the PE. The second is to decompose and package the data generated by the PE as defined at the link level and inject it into the NoC, and conversely to de-package and assemble the communication units received from the NoC and transmit them to the PE. [4][29]

2.2.3 Routers

Figure 2.3: A typical NoC router architecture

Routers are the central part of the NoC. They undertake the work of switching data from the source node to the destination node. For this reason, routers are also called switches in some NoC research; in this context the two terms share the same meaning. Figure 2.3 illustrates a typical NoC router architecture. A pair of local ports is connected to the PE/Resource and the other ports are connected to other routers in the network. The input and output buffers are optional, because some designs use bufferless flow control (this topic will be discussed in Section 2.4.3). The crossbar/switch matrix, which connects the inputs to the outputs, is implemented by a group of multiplexers controlled by the routing logic. The arbitration logic is another significant module of the NoC router. It resolves the conflicts that arise when more than one input requests the same output at the same time. The control logic sits behind the architecture to ensure that all components in the router operate normally. When modelling the behaviour of a NoC router, some definitions need to be introduced. They define the overall strategy applied in transmitting data in the NoC.

[Figure 2.3 labels: input ports and output ports (local port, port 0, port 1, ..., port N); input buffers; output buffers; crossbar/switch matrix; routing and arbitration]

• Routing Algorithm: the policy that defines the path of a packet moving from its source node to its destination node. Routing algorithms can be categorized in several ways, such as deterministic or non-deterministic, adaptive or non-adaptive, static or dynamic, and minimal or non-minimal. The choice of routing algorithm leads to a trade-off in area and power consumption.

• Arbitration Logic: resolves the conflicts among inputs that request the same output simultaneously, as mentioned above. The arbiters can be built at port level or at router level, which leads to a trade-off between matching rate and latency. A related topic is the arbitration algorithm. The wide variety of practical algorithms can be classified into static and dynamic arbitration.

• Flow Control mechanism: controls the movement of packets in the NoC. [4] It can be classified by the scope of the policy. If the policy is limited to the boundary between two routers, it is called link-level flow control, also known as distributed flow control. If its scope extends from the source to the destination, it is called end-to-end flow control or centralized flow control [10]. Figure 2.4 illustrates the difference between the two graphically. Although a few NoCs use centralized flow control with a Time Division Multiplexing (TDM) approach, most NoC systems apply distributed flow control. Another technology related to flow control is the Virtual Channel (VC) [7]. It separates a physical channel into several logical lanes and multiplexes the logical lanes in a TDM manner. This topic will be analysed comprehensively in Section 2.4.4, since it is a core concept adopted in the router implemented in this thesis. A further issue related to flow control is the buffering strategy, which the flow control policy determines to some extent.

• Buffering strategy: determines how buffers are used in the router in terms of their number, size and location. Using buffers is a way of avoiding congestion. However, previous research reveals that the buffers integrated into the router consume the main part of the router area [29] and power consumption [5]. The buffering strategy shall therefore concentrate on the balance between the expected performance and the power and area overhead. Another effective approach is to level out traffic bursts instead of focusing on the buffers.

The characteristics above sketch a considerably wide design space for a NoC router. By selecting appropriate tactics for different conditions in terms of performance, power consumption, latency, etc., routers with unique features can be implemented. NoCs are application-oriented designs, and the variety of routers makes it possible to build NoCs suitable for various applications.

These approaches will be discussed further in Section 2.4 from the perspective of NoC router design.

Figure 2.4: Link-level and end-to-end flow control [10]

[Figure 2.4 labels: source, routers, destination; link-level flow control (data and notify between adjacent routers); end-to-end flow control (end-to-end data and end-to-end notification)]

2.3 NoC Topologies

The topology of a NoC defines how the routers are connected and how the communication channels are laid out between PEs and routers. Two classifications are used for NoC topologies: direct vs. indirect topologies and regular vs. irregular forms.

2.3.1 Regular and Irregular Topologies

Regular and irregular topologies can be distinguished by the grid type, the k-ary n-cube, where k is the degree of each dimension and n is the number of dimensions. [29]

Currently, most NoCs are constructed with a regular topology, whose advantage is lower design time and cost together with predictable area and power consumption. Regular topologies are proposed for general-purpose platforms. However, irregular topologies are more popular in commercial products, since they are more suitable for heterogeneous PEs.

2.3.2 Direct and Indirect Topologies

In direct topologies, every router is connected to one PE through an RNI, which together compose a node as described in Section 2.2. Every node in the NoC is connected directly to one or more neighbouring nodes. Data transmission in a direct topology NoC goes from the source node to the destination node via several intermediate nodes. In contrast, the concept of a node does not exist in indirect topology NoCs, since the routers in the network are not always connected to a PE. Instead, some routers may be attached only to other routers. In other words, these routers have the single function of transferring data at the network level. While direct topologies benefit from scalability, simpler routing algorithms and higher bandwidth, indirect topologies may suffer less traffic congestion, with less area overhead because the redundant intermediate routers do not have a PE.

A few classical direct topologies of NoC are shown in Figure 2.5 [29].

• (a) A 4×4 mesh topology NoC. The N×M mesh is the simplest and most commonly used topology. All links have the same length, which leads to a comparatively simple physical design. However, congestion may occur in the central part of the structure; a well-designed routing algorithm can relieve this.

• (b) A 4×4 torus topology NoC. It is generated from an N×M mesh by connecting the edge nodes to each other. The added wrap-around links may relieve the congestion of the mesh topology, yet the long wrap-around links bring serious delay together with limited scalability.

• (c) A 2×2×2 mesh topology NoC. Expanding the mesh topology into the 3D domain generates the 3D mesh topology. The 3D structure shows advantages in higher throughput and lower power consumption [27], while [13] mentions drawbacks in chip temperature and fabrication cost under current technology.

• (d) An octagon topology NoC with 8 nodes. This topology makes short-path switching possible with a simple algorithm. Yet as the number of nodes increases, the octagon topology becomes a multidimensional design and the wiring complexity increases substantially.
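The difference between the mesh in (a) and the torus in (b) can be illustrated with a small Python sketch. It is only an illustration under assumptions that are not part of the thesis: nodes are addressed by (x, y) coordinates starting at (0, 0), and the function names are hypothetical.

```python
def mesh_neighbors(x, y, cols, rows):
    """Neighbours of node (x, y) in a cols x rows 2D mesh (no wrap-around links)."""
    candidates = [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
    return [(cx, cy) for cx, cy in candidates if 0 <= cx < cols and 0 <= cy < rows]


def torus_neighbors(x, y, cols, rows):
    """Neighbours of node (x, y) in a 2D torus: edge nodes wrap to the opposite edge."""
    candidates = [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
    wrapped = [(cx % cols, cy % rows) for cx, cy in candidates]
    # dict.fromkeys drops duplicates (possible when a dimension has size 2) and keeps order
    return list(dict.fromkeys(wrapped))


# A corner node of a 4x4 network: two links in the mesh, four in the torus.
corner_mesh = mesh_neighbors(0, 0, 4, 4)
corner_torus = torus_neighbors(0, 0, 4, 4)
```

In the 4×4 case the corner node has only two neighbours in the mesh, while the wrap-around links of the torus give every node four neighbours, which is the extra connectivity (and extra wire length) discussed in (b).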

For comparison, two examples of indirect topology are shown in Figure 2.6. Since indirect topologies are not used further in this thesis, the examples are not analysed in depth. In the rest of this thesis, all NoCs referred to have a regular topology unless stated otherwise.

Figure 2.5: Examples of classical direct topology NoC: (a) 2D mesh topology, (b) 2D torus topology, (c) 3D mesh topology, (d) octagon topology (legend: node in a direct topology NoC)

Figure 2.6: Examples of indirect topology NoC: (a) fat-tree topology, (b) butterfly topology (R: router; PE: processing element with RNI)

2.4 Network Protocol

The network protocol defines how the NoC resources (mainly routers, RNIs and links) are used and which strategies are adopted in transferring data through the network.

2.4.1 Routing Algorithm

The routing algorithm determines the path that a packet/flit takes from the source node to its destination node, or from an input port to an output port within a router. Many routing algorithms can be selected in NoC design, with a trade-off between performance and cost that depends on the application of the NoC. However, all routing algorithms can be classified along three aspects. [11]

• Source and Distributed routing: the difference between source and distributed routing is where in the NoC the routing decision is made. In source routing, the whole path (all the routers it will go through) of a packet/flit is pre-determined by the source node and the routing table is integrated into the header of the packet/flit. Thus, source routing also supports irregular NoC topologies. In distributed routing, the path of a packet/flit is determined on the fly. A routing table is stored in every intermediate node and the routing decision is made locally. As a result, the header of a packet/flit can be shorter.

• Deterministic and Adaptive routing: in deterministic routing, the path of a packet/flit is determined only by its source node and destination node. In other words, packets/flits with the same source and destination always take the same path. Dimensional routing (XY routing, or XYZ routing for 3D topologies) is a typical example of deterministic routing. Under this routing algorithm, a packet/flit always travels along one dimension and then the next until it reaches the destination. Adaptive routing is entirely different. The path of every packet/flit is decided according to the status of the NoC. Even if packets/flits have the same source and destination, they may take different paths when congestion or a broken link occurs. Consequently, adaptive routing is more flexible for NoCs that demand reliability and data security.

• Minimal and Non-Minimal routing: minimal routing ensures that a packet/flit takes the shortest path (the minimum number of intermediate nodes) from its source to its destination. In contrast, non-minimal routing does not provide this guarantee.
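As an illustration of a routing algorithm that is distributed, deterministic and minimal at the same time, XY routing on a 2D mesh can be sketched in Python as follows. The coordinate convention (x grows towards east, y grows towards north), the port names and the function names are assumptions made for this sketch only.

```python
def xy_next_port(cur, dst):
    """Local routing decision of one router: resolve the X offset first, then Y."""
    cx, cy = cur
    dx, dy = dst
    if cx != dx:
        return "E" if dx > cx else "W"
    if cy != dy:
        return "N" if dy > cy else "S"
    return "L"  # the packet has arrived and leaves through the local port


def xy_path(src, dst):
    """Nodes visited from src to dst when every router applies xy_next_port."""
    step = {"E": (1, 0), "W": (-1, 0), "N": (0, 1), "S": (0, -1)}
    path = [src]
    cur = src
    port = xy_next_port(cur, dst)
    while port != "L":
        mx, my = step[port]
        cur = (cur[0] + mx, cur[1] + my)
        path.append(cur)
        port = xy_next_port(cur, dst)
    return path


example_path = xy_path((0, 0), (2, 1))
```

Every router only needs the current and destination coordinates, so no path has to be carried in the header. The same source and destination always produce the same path, and the number of hops equals the Manhattan distance between them.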

Deadlock and livelock are further issues related to the routing algorithm.

• Deadlock in a NoC is a situation in which some packets/flits are blocked forever. [13] Figure 2.7 provides an example of a deadlock scenario. The grid represents a 2D mesh NoC, and the small hollow circles are its nodes. The four hatched shapes are packets/flits routed from their sources and the solid ones are their respective destinations. Packet 1 occupies its channel and requests the channel held by Packet 2, Packet 2 requests the channel occupied by Packet 3, Packet 3 needs the channel belonging to Packet 4, and Packet 4 asks for the channel reserved by Packet 1. As a result, no packet can advance. Recovery and avoidance are the two solutions to deadlock. [8] A deadlock recovery scheme for buffered flow control can be found in [18]. Compared with deadlock avoidance, recovery offers significant savings in buffer space, but the network throughput is noticeably reduced. One approach to deadlock avoidance is bufferless flow control, which is inherently deadlock-free. For buffered flow control, routing algorithms with special policies must be applied to eliminate deadlock. For instance, dimensional routing is a deadlock-free algorithm for NoCs with mesh topology.

• Livelock in a NoC is a situation in which some packets/flits never reach their destination because they keep moving around the network. Some non-minimal algorithms, or algorithms that allow packets/flits to be deflected, may lead to livelock. [29] One way to avoid livelock is to add a hop counter to the header of every packet/flit (a hop is one step of a packet/flit from one node to the next), use it to mark the priority, and prevent packets/flits with high priority from being deflected.

In conclusion, deadlock and livelock both have an adverse effect on data transfer in a NoC and shall be prevented when choosing the routing algorithm. When a routing algorithm that can lead to these situations is selected, additional schemes must be applied so that the NoC still meets its performance constraints.
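The circular wait in the four-packet deadlock example can be checked mechanically. The following Python sketch is not a recovery scheme from the cited literature. It only assumes a hypothetical `wait_for` mapping in which each packet waits for the channel held by at most one other packet, and it searches that mapping for a cycle.

```python
def find_circular_wait(wait_for):
    """Return the packets that form a circular wait, or None if there is none.

    wait_for[p] = q means that packet p requests a channel currently held by q.
    """
    for start in wait_for:
        seen = []
        cur = start
        # Follow the chain of requests until it ends or revisits a packet.
        while cur in wait_for and cur not in seen:
            seen.append(cur)
            cur = wait_for[cur]
        if cur in seen:
            return seen[seen.index(cur):]
    return None


# Packet 1 waits for 2, 2 for 3, 3 for 4, and 4 for 1: nobody can advance.
deadlocked = find_circular_wait({1: 2, 2: 3, 3: 4, 4: 1})
# If Packet 4 is not waiting for anyone, the chain eventually drains.
no_deadlock = find_circular_wait({1: 2, 2: 3, 3: 4})
```

A deadlock-free routing algorithm such as dimensional routing on a mesh guarantees that such a cycle of channel requests can never form.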

2.4.2 Arbitration Logic

Arbitration logic resolves the conflicts among packets/flits that request the same resource simultaneously. At the same time, the arbitration shall ensure that the contenders can use the resources fairly according to their priority. As mentioned above, the arbitration logic can be built in a distributed or centralized way. Centralized arbitration achieves a higher matching rate, but the latency may be longer. Although the arbitration logic can be built in different ways, its kernel is the arbitration algorithm used to determine the priority. Two commonly used arbitration algorithms are presented below. [10]

Figure 2.7: Examples of a deadlock situation for 4 packets/flits (sources S1 to S4, destinations D1 to D4)

• Fixed Priority Arbitration (FPA): this is the simplest arbitration algorithm and a typical static arbitration algorithm. Arbitration logic using this algorithm gives its inputs a static priority. For an N-bit FPA, the request at position 0 has the highest priority and the request at position N-1 has the lowest. For a 5-bit request vector R4R3R2R1R0 = 00110, where “1” means that there is a request and “0” that there is none, the request at position 1 is granted. Although this arbitration logic is easy to build and needs no priority state, its weakness is obvious: it is unfair, since requests from higher-priority inputs may occupy a port for a long time while requests from lower-priority inputs are not served. Figure 2.8 shows the grant procedure in an FPA.

• Round-Robin Arbitration (RRA): this is an example of a dynamic arbitration algorithm. The round-robin arbitration logic goes through all the inputs in a cyclic manner. The scan starts from the input with the highest priority. The first active request is served, after which the granted input gets the lowest priority and the input next to it gets the highest. Figure 2.9 illustrates the priority change in a round-robin arbitration logic. The RRA has the benefits of strong fairness and ease of implementation.
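The two arbitration algorithms can be compared with a short behavioural sketch in Python. It is a software model for illustration only, not the RTL of this thesis. The request vector is assumed to be a list indexed by position (index 0 is position 0), and the names `fixed_priority_grant` and `RoundRobinArbiter` are hypothetical.

```python
def fixed_priority_grant(requests):
    """Fixed priority: grant the lowest-numbered active position, or None."""
    for position, active in enumerate(requests):
        if active:
            return position
    return None


class RoundRobinArbiter:
    """Round-robin: the granted input becomes the lowest priority afterwards."""

    def __init__(self, n):
        self.n = n
        self.highest = 0  # position that currently holds the highest priority

    def grant(self, requests):
        # Scan cyclically, starting from the highest-priority position.
        for offset in range(self.n):
            position = (self.highest + offset) % self.n
            if requests[position]:
                self.highest = (position + 1) % self.n
                return position
        return None


# Positions 1 and 2 request permanently (request vector 00110, position 0 on the right).
requests = [False, True, True, False, False]
fpa_grants = [fixed_priority_grant(requests) for _ in range(4)]
rra = RoundRobinArbiter(5)
rra_grants = [rra.grant(requests) for _ in range(4)]
```

With both requests held permanently, the fixed priority arbiter grants position 1 in every cycle and starves position 2, whereas the round-robin arbiter alternates between the two. The only extra state the round-robin arbiter needs is the pointer to the current highest-priority position.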

[Figure 2.8 labels: requesters A to E; Cycle 0 to Cycle 4; priority from High to Low; Guaranteed]

Figure 2.8: Examples of a fixed priority arbitration procedure

The delay and loss communication models are also part of this topic. The arbitration logic defines which communication model is selected. In a delay model, packets/flits can be delayed but are never dropped. In a loss model, packets/flits may be dropped when congestion occurs, and a retransmission mechanism might be implemented to inform the source of the failed transmission so that the dropped packets/flits can be sent again if needed. [29]

2.4.3 Flow Control Mechanism and Buffers

The flow control mechanism defines the rules for packet/flit transmission in a NoC. It is closely related to the buffering strategy, since the rules include how buffers are used for each packet/flit. From the viewpoint of buffer usage, flow control can be classified as buffered or bufferless. Figure 2.10 compares the two. In (a) buffered flow control, a set of buffers is built into the router. Arriving packets/flits can be stored in the intermediate router and wait to be sent to the next router when congestion occurs or when other packets/flits that arrived earlier have yet to be sent. In (b) bufferless flow control, only one packet/flit is stored per input port in each routing cycle, and after this cycle the packet/flit must leave the current router so that the router can process the next one.

[Figure 2.9 labels: requesters A to E; Cycle 0 to Cycle 4; scan order; Guaranteed; Highest priority position]

Figure 2.9: Examples of a round-robin arbitration procedure [10]

Figure 2.10: Buffered flow control vs. Bufferless flow control [11]

• Bufferless flow control: as the name suggests, no extra buffers are added to the router. There is only one buffer for the packet/flit currently being transferred, so no packets/flits can be stored. All received packets/flits must be switched out to the next router. When conflicts occur, a drop (Section 2.4.2) or deflective routing method shall be used. Deflective routing will be introduced in Section 3.3.3.

• Buffered flow control: it defines the allocation of channels and buffers for packets/flits traversing the network. [19] The buffers built into the router can temporarily store packets/flits when the channels they request are occupied. The three commonly used buffered flow control mechanisms are listed below.

- Store-and-Forward (SAF) flow control: each packet is transferred over the link in one piece. A router waits until the whole packet is stored before forwarding it to the next router. As a result, each router must have enough buffer space to store the largest packet. It is a costly mechanism with considerably high latency, because each router waits for the whole packet to arrive.

- Virtual-Cut-Through (VCT) flow control [15]: the packets are divided into several small parts called flow control digits (flits) (Figure 2.11). When the head flit arrives, the router checks the next router. If the next router has room for the whole packet, the current flit is forwarded immediately without waiting for the whole packet to arrive. This strategy reduces the latency compared with SAF, but the required buffer size is the same.


- Wormhole (WH) flow control [28]: it is a comparatively old but still popular flow control method. This strategy improves on VCT by avoiding large buffer space: the buffer in a router no longer has to hold a whole packet.

Figure 2.12 shows the difference between VCT and WH flow control. In WH flow control, only the head flit holds the destination of the whole packet. If the next router is available, the head flit moves forward and reserves it to prevent other packets from using it. Then the body flits go through the channel established by the head flit. Finally, a tail flit cancels the reservation and resets all the routers along the channel so that they can be used by other packets. This improvement changes the packet-based flow control into a flit-based one.

The buffer size needed is reduced from a packet to a flit. However, the established channel occupies all the routers on it, which may lead to congestion when other packets request them. To solve this problem, a virtual-channel based flow control can be built on top of WH flow control. This issue is discussed in depth in Section 2.4.4.
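The reservation behaviour described above can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the class name `WormholeChannel`, the method `accept` and the string flit types are hypothetical and do not come from the router implementation.

```python
class WormholeChannel:
    """Toy model of one router output channel under wormhole flow control."""

    def __init__(self):
        self.owner = None  # id of the packet currently holding the channel

    def accept(self, packet_id, flit_type):
        """Return True if the flit may use the channel in this step."""
        if flit_type == "head":
            if self.owner is not None:
                return False          # channel is already reserved
            self.owner = packet_id    # head flit reserves the channel
            return True
        if self.owner != packet_id:
            return False              # body/tail flit without a reservation
        if flit_type == "tail":
            self.owner = None         # tail flit cancels the reservation
        return True


channel = WormholeChannel()
trace = [
    channel.accept("A", "head"),  # A reserves the channel
    channel.accept("B", "head"),  # B is blocked while A holds it
    channel.accept("A", "body"),
    channel.accept("A", "tail"),  # A releases the channel
    channel.accept("B", "head"),  # now B can reserve it
]
print(trace)
```

The blocked head flit of packet B illustrates the congestion problem that motivates virtual channels.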

[Figure 2.11 layout: Packet = Head Flit | Body Flit | Body Flit | Tail Flit]

* A packet is divided into one head flit, one tail flit and several body flits. In this example the packet has two body flits.

Figure 2.11: Packet and Flit

Figure 2.12: VCT vs. WH when sending a packet with 6 flits [10]

2.4.4 Virtual-Channel Based Flow Control

In [10], an apt analogy is given to illustrate the meaning of VCs: adding VCs is like adding lanes to a single-lane street, so that cars going in different directions can travel separately instead of waiting in the only lane. For NoCs, however, the lanes are virtual because they do not physically exist. VCs are a way to organize buffers: there is still only one physical channel, and the channel is connected to different buffer sets (VCs) in a time-multiplexed manner. The concept was first introduced to achieve protocol-level deadlock avoidance. It is also a method to relieve network congestion and increase the throughput of WH flow control, as mentioned above. In architectures that support VCs, the head flit of a packet arriving at a router occupies only one VC and requests another VC in the next router instead of blocking the whole router. Thus, router-to-router channels are replaced by VC-to-VC links. While a router is on the path of one packet, other packets can still pass through it using a different VC. However, VC-based flow control needs extra buffers to form the VCs, logic to implement the enhanced flow control, and more arbitration logic, so area and power consumption increase noticeably. Keeping a balance between area, power and performance is a complicated issue.

[Figure 2.12 panels: WH flow control and VCT flow control, each showing flits travelling from source to destination. Legend: no grant received to go to the destination; flit forwarded; free slots to accept the whole packet; flit stopped since the buffer is full.]

2.5 Quality-of-Services

Quality of Service (QoS) is defined “as service quantification that is provided by the network to the demanding core” in [29]. Two basic QoS classes are characterized in [12].

• Best-Effort (BE) services: NoCs with BE services do not offer any commitment.

• Guaranteed Services (GS): NoCs with GS offer a certain level of commitment.

The commitment can be given in three aspects: “(i) correctness of the result, (ii) completion of the transaction, (iii) bounds on performance” [29]. Most NoCs with BE services only try to complete the transaction and are designed for average-case scenarios. In contrast, NoCs with GS take worst-case scenarios into consideration. Extra mechanisms have to be implemented to ensure the committed performance in terms of data correctness, maximum latency or minimum throughput, so the behaviour of the entire system is predictable to some extent. [12]


Chapter 3 NoC System Generator Tool and Nostrum NoC

As the number of cores doubles every eighteen months according to the reformulated Moore’s Law, the complexity of MPSoCs increases sharply, and more than ten years of development have given the term NoC a broader meaning. The hardware communication infrastructure, middleware, operating system communication services and a design methodology together with tools to map applications onto the NoC are all included, and together these elements are called a NoC platform. [14] Implementing, programming, debugging and testing this kind of highly integrated platform can be a challenge. The NoC System Generator (NSG) tool was introduced by KTH to assist the design flow for prototyping a NoC platform on Altera/Xilinx FPGAs. [25] [26] The Nostrum NoC [24] forms the communication infrastructure of the NoC platform generated by the NSG tool.

A range of industry-accepted soft cores can be selected as PEs, with device drivers provided; in addition, a test program is provided as a template for larger programs.

3.1 NoC Platform Generation Process

An overview of the platform generation process is shown in Figure 3.1. [23] Two kinds of files are needed to describe the NoC platform to be generated: a System Description XML (SD-XML) file and at least one Process Description C (PD-C) file. The SD-XML file defines the hardware domain and the PD-C files describe the software domain. The NSG tool translates the XML file into hardware description files that generate the PEs and into the configuration file for the NoC. The PD-C files, on the other hand, are used to generate the software projects that run on the NoC platform.


Figure 3.1: Overview of platform generation process [23]

3.2 Nostrum NoC

The Nostrum NoC [20] was developed by a research group at KTH. The NoC platform generated by the NSG tool can currently only use this NoC structure. The characteristics of the Nostrum NoC are described in this section to give a general view of it.

3.2.1 NoC Topology

The NoC topologies currently supported are 1D/2D/3D mesh and torus. The maximum size supported is 8×8×4, which means 8 rows, 8 columns and 4 layers. The tool uses absolute addressing: every node has a fixed address defined by the Nostrum NoC. The address corresponds to the coordinates of the node, and the southwest corner of the lowest layer is (0,0,0). Figure 3.2 shows the NoC topologies that can be constructed by the NSG tool and the addressing method in a 2D example.

Figure 3.2: Topology supported by NSG tool and addressing method: (a) mesh topology, (b) torus topology

3.2.2 Flow Control and Buffer Strategy

Nostrum NoC adopts bufferless flow control with flit-based communication. A message is divided into several packets and each packet is divided into several flits. Figure 3.3 depicts the message, packet and flit as defined in Nostrum.

• (a) Messages are complete groups of data generated by a PE to be transferred to another node. One message is divided into several packets. The number of packets in a message can be defined in the SD-XML file.

• (b) The packet structure is illustrated here. One packet is segmented into several flits: two functional flits and a number of data flits. The number of flits in a packet can be defined in the SD-XML file.

• (c) Flits are the units actually transferred in the NoC. Even though the flit types differ, all flits are segmented in the same way. Each of them has a 32-bit header and a number of payload bits, which can be defined in the SD-XML file.

- Type: it distinguishes functional flits from data flits, size=2 bits. The first bit is also used as a valid bit; for the second bit, 1=setup flit, 0=data flit.

- Flit ID: it is the serial number of a flit and marks the position of the flit in its packet. Size=log2(number of flits in a packet); the maximum number of flits in a packet is 128.

- PID: it shows the process number of the flit. Size=8 bits.

- HC: it is a hop counter (related to livelock avoidance, see Section 3.2.3).

- NS/EW/UD: it is the address of the destination node. Since the largest NoC currently supported is 8×8×4, size=3 bits/3 bits/2 bits.
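To make the header layout concrete, the following Python sketch packs the fields listed above into one 32-bit word. It rests on several assumptions that are not stated in the text: the Flit ID is given its maximum width of 7 bits (128 flits per packet), the HC field is given the remaining 7 bits so that the fields add up to 32, and the fields are placed from the most significant bit in the order Type, Flit ID, PID, HC, NS, EW, UD. The function names are hypothetical.

```python
# (field name, width in bits); the HC width of 7 is an assumption chosen so
# that all fields together fill the 32-bit header.
FIELDS = [("type", 2), ("flit_id", 7), ("pid", 8), ("hc", 7),
          ("ns", 3), ("ew", 3), ("ud", 2)]


def pack_header(**values):
    """Pack the header fields into one integer, first field in the top bits."""
    word = 0
    for name, width in FIELDS:
        value = values[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word = (word << width) | value
    return word


def unpack_header(word):
    """Inverse of pack_header: split a 32-bit word back into its fields."""
    fields = {}
    for name, width in reversed(FIELDS):
        fields[name] = word & ((1 << width) - 1)
        word >>= width
    return fields


example = dict(type=0b11, flit_id=5, pid=9, hc=0, ns=4, ew=2, ud=1)
header = pack_header(**example)   # type 0b11: valid bit set, setup flit
print(f"{header:032b}")
```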

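Figure 3.3: Message/Packet/Flit in Nostrum NoC. (a) Message: Packet 0, Packet 1, Packet 2, …, Packet N-1. (b) Packet structure: a Set-Up Flit and a Global Clock Flit plus Data Flit 0 to Data Flit M-1. (c) Flit segmentation: 32-bit header (Type | Flit ID | PID | HC | NS | EW | UD) followed by the payload.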

3.2.3 Routing Algorithm and Arbitration Logic

The routing algorithm used in Nostrum NoC is a variant of XY routing. Since the routing decisions in Nostrum NoC are made in the routers, it is a distributed routing algorithm. A study of different kinds of XY routing can be found in [4].

To make this routing algorithm easier to describe, basic XY routing is analyzed first. Basic XY routing (or XYZ routing for a 3D NoC) is a tableless routing technique. [29] The packets/flits are switched first along the X axis of the network, then along the Y axis, and finally along the Z axis if it exists. For example, when a packet is routed from the node with address (1,1) to node (3,4), it goes east for two hops and then north for three hops. Figure 3.4 shows the path of this packet and illustrates the blocked paths. The path for packets with the same source and destination is unique and it is the shortest path; thus XY routing is a deterministic and minimal routing algorithm, which means it is livelock free. Since the channels used by the packets cannot form a cycle due to the blocked paths, it is also deadlock free.
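The example route can be reproduced with a short Python sketch of basic XY routing on a 2D mesh. The function name `xy_path` is hypothetical, and addresses are taken as (x, y) pairs with x growing to the east and y growing to the north, which matches the addressing of Figure 3.2.

```python
def xy_path(src, dst):
    """Return the list of nodes visited by basic XY routing from src to dst."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:                 # first travel along the X axis
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:                 # then travel along the Y axis
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


route = xy_path((1, 1), (3, 4))
print(route)           # two hops east, then three hops north
print(len(route) - 1)  # number of hops
```

Because the path depends only on the source and the destination, calling the function twice with the same arguments always yields the same route, which is the deterministic property mentioned above.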

Figure 3.4: XY routing example and routing path blocked

In Nostrum NoC a deflective routing algorithm named hot-potato routing is combined with XY routing to make it an adaptive routing algorithm suitable for bufferless flow control. In bufferless flow control no flit can be stored in the router, so every arriving flit must be given an output channel; dropping or deflective routing can be used, as mentioned in Section 2.4.3. Nostrum NoC adopts misrouting: when two or more flits arrive at the same router and intend to go in the same direction according to XY routing, only one of them is sent to the desired output and the others are transferred to different outputs, even if this does not follow XY routing. Figure 3.5 shows an example of deflective routing. Flit A travels from node (1,1) to node (2,4) and flit B travels from node (3,1) to node (2,3). When both arrive at node (2,1), a conflict occurs since they request the north output at the same time. In this situation, flit B goes north and flit A goes east. The conflict is resolved at the cost of flit A going in the ‘wrong’ direction during this hop. In this example, both flits eventually reach their destinations. However, when the number of flits becomes large and conflicts happen continuously, a livelock may occur. A hop counter is therefore carried in the header of each flit so that flits with a large hop count get priority in choosing a minimal path, which achieves livelock avoidance. To ensure that every flit arriving at a router gets an output, the worst case must be considered: when designing the routing table, the number of candidate outputs should be equal to the number of input ports of the router. The routing table used in Nostrum NoC is shown in Table 3.1. N, S, W, E, U, D and R represent the seven output ports of a Nostrum router, which are described in the following section.
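The interplay of deflection and the hop counter can be sketched as follows. This is a simplified illustration, not the Nostrum control logic: the function `assign_outputs` is hypothetical, and the hop counts and preference lists in the example are made-up inputs chosen so that both flits prefer the north output. They are not read from the Nostrum routing table.

```python
def assign_outputs(flits, preferences):
    """Give every flit a distinct output port.

    flits:       list of (name, hop_count) tuples
    preferences: dict name -> output ports ordered from most to least preferred
    """
    taken = set()
    assignment = {}
    # Flits with a larger hop count choose first (livelock avoidance).
    for name, _hops in sorted(flits, key=lambda f: f[1], reverse=True):
        for port in preferences[name]:
            if port not in taken:      # first preferred port that is still free
                taken.add(port)
                assignment[name] = port
                break
    return assignment


# Illustrative input: both flits prefer N, and B has travelled further than A.
flits = [("A", 1), ("B", 3)]
preferences = {"A": ["N", "E", "S", "W"], "B": ["N", "W", "S", "E"]}
result = assign_outputs(flits, preferences)
print(result)
```

As long as every preference list contains as many ports as there are arriving flits, each flit receives an output, which is why the number of candidate outputs has to match the number of input ports.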


Table 3.1: Routing table of Nostrum NoC [11][24]

| NS vs. row | EW vs. column | UD > layer | UD < layer | UD = layer |
|---|---|---|---|---|
| NS > row | EW > column | N, E, U, S, W, D | N, E, D, S, W, U | N, E, S, W, U, D |
| NS > row | EW < column | N, W, U, S, E, D | N, W, D, S, E, U | N, W, S, E, U, D |
| NS > row | EW = column | N, U, S, E, W, D | N, D, S, E, W, U | N, S, E, W, U, D |
| NS < row | EW > column | S, E, U, N, W, D | S, E, D, N, W, U | S, E, N, W, U, D |
| NS < row | EW < column | S, W, U, N, E, D | S, W, D, N, E, U | S, W, N, E, U, D |
| NS < row | EW = column | S, U, N, E, W, D | S, D, N, E, W, U | S, N, E, W, U, D |
| NS = row | EW > column | E, U, N, W, S, D | E, D, N, W, S, U | E, N, W, S, U, D |
| NS = row | EW < column | W, U, N, E, S, D | W, D, N, E, S, U | W, U, N, S, U, D |
| NS = row | EW = column | U, N, E, W, S, D | D, N, E, W, S, U | R, N, E, W, S, U, D |

Figure 3.5: A deflective routing example

3.2.4 Router Structure for Nostrum NoC

The router designed for Nostrum NoC follows the flow control method and routing algorithm described in the sections above. It consists of three main components: receivers, transmitters, and a crossbar with control logic. The block diagram is depicted in Figure 3.6. VHDL is used as the hardware description language. Switching a flit from an input port to an output port costs four clock cycles.

Figure 3.6: Block diagram of Nostrum NoC

[Figure 3.6 content: a Nostrum router with seven receivers and seven transmitters (Transmitter 0 to Transmitter 6) serving the Resource, North, South, East, West, Up and Down ports, connected through a crossbar with control logic.]

Chapter 4 Virtual-Channel Based Wormhole Router for NSG Tool

Since the NSG tool is intended to generate diverse types of NoC platforms, there should be a set of NoC components to select from. The current NSG tool supports only the Nostrum NoC, as mentioned above. In this chapter, a synthesizable Virtual-Channel based wormhole (VC-based WH) router for 3D mesh NoCs is described. It is prepared for integration into the NSG tool so that the performance of the Nostrum NoC and the VC-based WH NoC can be compared. The theory of VCs and WH routing has already been presented in Chapter 2; this chapter therefore focuses on the implementation aspects.


4.1 Virtual-Channel Based Wormhole Router Model

Figure 4.1: A canonical model of VC-based WH router

[Figure 4.1 blocks: input ports 0 to N-1, each with virtual channels 0 to v-1 and an FSM; routing logic; switch allocator; credit counter; state update; crossbar; output ports 0 to N-1.]

A canonical Virtual-Channel based wormhole router model [10] [28] is depicted in Figure 4.1. The router has N input ports and N output ports. A port is a Physical Channel (PC) that can accept flits. Every input port includes v VCs, each of which is a set of buffers that stores the accepted flits. The main components of the router are the routing logic, the VC allocator, the switch allocator, a Finite-State Machine (FSM) per VC and two blocks related to flow control, the Credit Counter (CC) and the State Update (SU). The implementation of each component is described separately in the following sections.

In WH flow control, which is flit-based, a packet starts with one head flit, the only flit that holds the destination node of the packet. It is followed by several body flits, which contain the payload of the packet. The number of flits can be customized. The last flit of a packet is a tail flit, which informs the router that the whole packet has been transferred. Figure 4.2 shows the packet structure.

Figure 4.2: Packet structure (Head Flit | Body Flit | Body Flit | … | Body Flit | Tail Flit, with n body flits)

When a packet passes through a VC-based WH router, the head flit arrives first. Since it holds the destination node address, routing calculation (RC) is done first. The routing procedure determines which output port the packet will take, and the routing result is stored in a register. The next stage is VC allocation (VA), which determines the VC the packet will use in the next router. The VC allocator performs this operation: it looks through the SU block, which holds the state of all VCs of the N neighbouring routers, to find an available VC. When several packets request the same output VC, arbitration is needed. If the packet is granted a VC, the selected VC number is stored in a register and the VC is marked as locked to prevent other packets from using it. After VA is done, the flit enters the switch allocation (SA) stage. In this stage, the switch allocator checks the CC block to see whether the requested port has enough buffer space for the flit; when several packets request the same output port, arbitration ensures that only one flit goes through the output port in one cycle. Once SA is granted, the head flit passes through the output port and goes on to another intermediate router or arrives at the PE. This stage is called switch traversal (ST). When a flit leaves the router, the credit counter immediately records the change in the buffer space of the next router.

The body flits follow the head flit. A body flit reads the registers that store the routing result and the VC selected by the head flit, sends this information to the switch allocator and goes directly to SA. If SA is granted, ST follows. The tail flit is handled in the same way as a body flit, but it additionally releases the VC in the next router: it informs the SU to mark that VC as available, and the registers storing the routing result and the VC number are reset. The flit processing flow is summarized in Figure 4.3.
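The per-VC bookkeeping described above can be summarized in a small behavioural sketch. It is written in Python rather than VHDL and only tracks which stages a flit passes and what is kept in the registers. The class `InputVC` and its method `handle` are hypothetical names, and the RC and VA results are passed in as arguments instead of being computed.

```python
class InputVC:
    """Registers kept per input VC while a packet passes through."""

    def __init__(self):
        self.route = None    # output port chosen by RC
        self.out_vc = None   # VC in the next router chosen by VA

    def handle(self, flit_type, route=None, out_vc=None):
        """Return the stages the flit goes through in this router."""
        if flit_type == "head":
            self.route = route      # RC result is stored in a register
            self.out_vc = out_vc    # VA result is stored in a register
            return ["RC", "VA", "SA", "ST"]
        stages = ["SA", "ST"]       # body and tail flits reuse the stored results
        if flit_type == "tail":
            self.route = None       # the tail flit resets both registers
            self.out_vc = None
        return stages


vc = InputVC()
head_stages = vc.handle("head", route="E", out_vc=2)
body_stages = vc.handle("body")
stored = (vc.route, vc.out_vc)      # still set while the packet is in flight
tail_stages = vc.handle("tail")
print(head_stages, body_stages, tail_stages)
```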

Figure 4.3: Flit processing flow (head flit: RC, VA, SA, ST; body or tail flit: SA, ST; the SU and CC blocks are updated along the way)

4.2 Implementation

The implementation details of a synthesizable VC-based WH router, following the model analyzed in the section above, are explained in this section. The router is written in VHDL so that it can be integrated into the NSG tool.

4.2.1 Overview

Before the building blocks of the router are implemented, some configuration parameters and NoC protocols have to be defined. To make the design more general, some parameters can be defined in a configuration package, which is a VHDL file.


• NoC topology: the target NoC topology is a 3D mesh with absolute addressing. The largest size supported by NSG is 8×8×4.

• Number of ports: determined by the topology. Every router has 7 ports, North, South, East, West, Up, Down and Resource, numbered 0 to 6 respectively. To keep the description general, N is used to represent the number of ports, with N=7.

• Number of VCs: this parameter can be defined in the configuration package. To simplify the explanation in the following sections, the number of VCs is denoted v, with v=4.

• Flit segmentation: the flit segmentation for head/body/tail flits is illustrated in Figure 4.4. All flits are of the same size.

- VC_ID: size=log2(v), the number of the VC in which the flit will be stored.

- Valid bit: size=1, ‘1’=valid flit, ‘0’=void flit

- Flit type: size=2, “01”=head flit, “11”=body flit, “10”=tail flit

The address fields exist only in the head flit and give the destination of the packet.

- EW address: size=log2(number of columns)

- NS address: size=log2(number of rows)

- UD address: size=log2(number of layers)

- Unused: to keep all flits the same size, this area shall be all zero

The data area exists only in body and tail flits.

- Data: size can be defined in the configuration package

• Buffer strategy: the buffers are implemented as full-bandwidth FIFOs. The FIFO depth can be defined in the configuration package. The width of each FIFO slot is the flit size minus the VC_ID size, since this field is not passed on to the next router.

• Routing algorithm: basic XYZ routing is used (see Section 3.2.3).

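Figure 4.4: Flit segmentation. Head: VC_ID | Valid Bit | Flit Type: 01 | EW address | NS address | UD address | Unused. Body: VC_ID | Valid Bit | Flit Type: 11 | Data. Tail: VC_ID | Valid Bit | Flit Type: 10 | Data.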

• Arbitration logic: Round-Robin arbitration (see Section 2.4.2) is adopted for its strong fairness. The implementation of the Round-Robin Arbiter (RRA) is described in Section 4.2.4.

• Link-level flow control: the credit-based flow control method [10] is used. It is explained in Section 4.2.3.

• Quality-of-Services: only BE services are supported; the network only tries to complete the transaction.
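The flit segmentation listed above can be illustrated with a behavioural Python sketch. It is not the VHDL configuration package. The data width of 32 bits, the placement of the fields from the most significant bit in the order of Figure 4.4, and all names (`bits_for`, `head_flit`, `data_flit`, `strip_vc_id`) are assumptions made for this example.

```python
V = 4                             # number of VCs, as in the text
COLUMNS, ROWS, LAYERS = 8, 8, 4   # largest supported mesh
DATA_BITS = 32                    # assumed data width (configurable in the design)


def bits_for(n):
    """Number of bits needed to encode n different values (log2, rounded up)."""
    return max(1, (n - 1).bit_length())


VC_BITS = bits_for(V)
EW_BITS, NS_BITS, UD_BITS = bits_for(COLUMNS), bits_for(ROWS), bits_for(LAYERS)
ADDR_BITS = EW_BITS + NS_BITS + UD_BITS
PAYLOAD_BITS = max(ADDR_BITS, DATA_BITS)    # address + unused, or data
FLIT_BITS = VC_BITS + 1 + 2 + PAYLOAD_BITS  # VC_ID + valid + type + payload
HEAD, BODY, TAIL = 0b01, 0b11, 0b10


def _frame(vc_id, flit_type, payload):
    prefix = (vc_id << 3) | (1 << 2) | flit_type   # VC_ID, valid bit '1', type
    return (prefix << PAYLOAD_BITS) | payload


def head_flit(vc_id, ew, ns, ud):
    address = (ew << (NS_BITS + UD_BITS)) | (ns << UD_BITS) | ud
    # The unused area below the address stays all zero.
    return _frame(vc_id, HEAD, address << (PAYLOAD_BITS - ADDR_BITS))


def data_flit(vc_id, flit_type, data):
    return _frame(vc_id, flit_type, data & ((1 << DATA_BITS) - 1))


def strip_vc_id(flit):
    """Drop the VC_ID before the flit is written into its FIFO."""
    return flit & ((1 << (FLIT_BITS - VC_BITS)) - 1)


head = head_flit(vc_id=3, ew=5, ns=2, ud=1)
print(FLIT_BITS, FLIT_BITS - VC_BITS)   # flit width and FIFO slot width
```

With these assumed parameters the FIFO slot is narrower than the flit by the two VC_ID bits, which is the saving referred to in the buffer strategy item.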

4.2.2 Input Port

An input port of the VC-based WH router is built from several FIFOs and a multiplexer. The VC_ID of a flit selects the FIFO in which the flit is stored; this is the only purpose of the VC_ID. After that, the VC_ID can be dropped to reduce the buffer size. When the flit is switched out of the router, a new VC_ID is prepended to the flit so that it goes to the selected VC in the next router.

4.2.3 Credit-Based Flow Control

The abstract link-level flow control model is credit-based flow control (see Figure 4.5). [10] In this model, the credits are the number of free FIFO slots in the neighbouring router. Every VC counts its credits separately. If the number of credits is at least one, the VC can accept a new flit. The CC is placed in the current router and tracks the FIFO state of the next router. When a flit leaves the current router for the next router, the credit count is decreased by one, since the flit will occupy one FIFO slot in the next router. When a flit leaves the next router, the credit count is increased by one.
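A credit counter for one VC can be modelled in a few lines. The sketch below is a behavioural illustration in Python, not the VHDL block. The class and method names are hypothetical, and it assumes that a flit may be sent whenever at least one credit is left.

```python
class CreditCounter:
    """Tracks the free FIFO slots of one VC in the next router."""

    def __init__(self, fifo_depth):
        self.fifo_depth = fifo_depth
        self.credits = fifo_depth     # the downstream FIFO starts empty

    def can_send(self):
        return self.credits > 0

    def flit_sent(self):
        """A flit left the current router and will occupy one downstream slot."""
        if not self.can_send():
            raise RuntimeError("no credit left for this VC")
        self.credits -= 1

    def credit_returned(self):
        """A flit left the next router, so one downstream slot became free."""
        if self.credits >= self.fifo_depth:
            raise RuntimeError("more credits returned than FIFO slots")
        self.credits += 1


cc = CreditCounter(fifo_depth=2)
cc.flit_sent()
cc.flit_sent()
blocked = not cc.can_send()   # both downstream slots are occupied
cc.credit_returned()
print(blocked, cc.credits)
```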

Figure 4.5: Credit-based Flow Control Model

4.2.4 Round-Robin Arbiter

The Round-Robin Arbiter (RRA) is the basic building block of the VC allocator and the switch allocator. The RRA used in this router is based on an algorithm in [17]. It is a generic design with an n-bit input and an n-bit output. The output uses one-hot coding to indicate the granted input.

The algorithm is as follows. The request is “req”, an n-bit vector in which ‘1’ means an active request. The grant result of the previous cycle is “pre_gnt”, also an n-bit vector. The final grant result is “gnt”. Step (1) selects the lowest active request, step (2) masks off all requests at or below the position granted in the previous cycle, step (3) selects the lowest remaining request, and step (4) chooses between the two candidates.

• (1) fir_gnt <= req and (not(req) + 1)

• (2) masked_req <= req and not((pre_gnt - 1) or pre_gnt)

• (3) sec_gnt <= masked_req and (not(masked_req) + 1)

• (4) if masked_req > 0 then gnt <= sec_gnt else gnt <= fir_gnt

For example, in cycle 0, req = “00110001”. According to (1), fir_gnt = “00000001”. Since it is cycle 0, pre_gnt = “00000000”, so (2) gives masked_req = “00000000”. According to (4), the final result is gnt = fir_gnt = “00000001”, which means the request req[0] is granted. In cycle 1, pre_gnt = “00000001”. Assume that req = “00111001”; according to (1) and (2), fir_gnt = “00000001” and masked_req = “00111000”. Since masked_req is larger than 0 in this cycle, the final result is calculated according to (3): gnt = sec_gnt = “00001000”.

Using this algorithm, an n:1 RRA with an n-bit input and an n-bit output can be implemented.
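The four steps can be checked with a short bit-level model. The Python sketch below mirrors the equations on integers instead of VHDL vectors. The function name `rra_grant` is hypothetical, and all intermediate values are truncated to n bits to imitate the wrap-around of n-bit arithmetic.

```python
def rra_grant(req, pre_gnt, n):
    """One arbitration step; req, pre_gnt and the result are n-bit integers."""
    ones = (1 << n) - 1
    fir_gnt = req & ((~req + 1) & ones)                       # (1) lowest request
    masked_req = req & (~((pre_gnt - 1) | pre_gnt) & ones)    # (2) above last grant
    sec_gnt = masked_req & ((~masked_req + 1) & ones)         # (3) lowest masked request
    return sec_gnt if masked_req > 0 else fir_gnt             # (4)


gnt0 = rra_grant(0b00110001, 0b00000000, 8)   # cycle 0 of the example
gnt1 = rra_grant(0b00111001, gnt0, 8)         # cycle 1 of the example
print(f"{gnt0:08b} {gnt1:08b}")
```

When no request lies above the previous grant, the masked vector is empty and the arbiter wraps around to the lowest active request, which is what gives every requester its turn.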

4.2.5 Routing Computation Unit

The RC unit can be instantiated in two ways. The first (Figure 4.6 (a)) is to build an RC unit for every VC; the second (Figure 4.6 (b)) is to build one RC unit per port and let the VCs of the port share it. [10] The first option seems to cost more area; however, the second option needs arbitration logic, so it may not actually save area and it increases latency. In this design the first option is adopted, and an RC unit is attached to every VC. The routing table for each RC unit is shown in Table 4.1. It is basic XYZ routing.

Figure 4.6: Two ways to introduce RC unit: (a) RC per VC, (b) RC per port (shared by VC 0 to VC v-1 through an arbiter)

| EW vs. column | NS vs. row | UD > layer | UD < layer | UD = layer |
|---|---|---|---|---|
| EW > column | NS > row | E | E | E |
| EW > column | NS < row | E | E | E |
| EW > column | NS = row | E | E | E |
| EW < column | NS > row | W | W | W |
| EW < column | NS < row | W | W | W |
| EW < column | NS = row | W | W | W |
| EW = column | NS > row | N | N | N |
| EW = column | NS < row | S | S | S |
| EW = column | NS = row | U | D | R |

Table 4.1: Routing table for RC unit
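Table 4.1 collapses into a short chain of comparisons. The Python sketch below is a behavioural model of that decision, not the VHDL RC unit. The function name `xyz_output_port` and the tuple order (EW, NS, UD) for both addresses are assumptions of this example.

```python
def xyz_output_port(dest, here):
    """Return the output port for a head flit according to basic XYZ routing.

    dest: (EW, NS, UD) address carried by the head flit
    here: (column, row, layer) address of the current router
    """
    ew, ns, ud = dest
    column, row, layer = here
    if ew > column:
        return "E"
    if ew < column:
        return "W"
    if ns > row:        # EW matches, so correct the NS direction next
        return "N"
    if ns < row:
        return "S"
    if ud > layer:      # EW and NS match, only the layer is left
        return "U"
    if ud < layer:
        return "D"
    return "R"          # the flit has arrived: deliver it to the resource


print(xyz_output_port((3, 4, 0), (1, 1, 0)))
```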

4.2.6 VC Allocator Unit

The VC allocator unit is responsible for determining which VC in the next router the flits in the current VC will be allocated to. In other words, it establishes a VC-to-VC channel for a packet. Only head flits take part in VC allocation. In this design, no limitation is placed on the requests of head flits, which means that a head flit can request all VCs in the next router.

The arbitration is performed on two levels.

On the first level, arbitration is performed in every VC separately. An available VC is selected from the candidate VCs that the head flit requests. “Available” here means:

• the VC belongs to the requested input port of the next router, which is determined by the RC result. VCs that do not belong to the requested port are masked off.

• the VC is not occupied by another packet. To check the usage of VCs in the neighbouring router, the VC allocator reads the SU unit. Locked VCs are masked off.

After the available VCs have been identified, one of them is selected in a round-robin manner.

The second-level arbitration is centralized. All VCs in the router (across all ports) send their requests to the respective output VCs. These VCs are located in the next router, but the decision is made locally. In this stage, the requests for the same VC are collected and arbitrated in a round-robin manner. An output VC grants only one input VC, so that a VC-to-VC channel is established.

After that, the VC allocator sends a grant signal to the input VC and the selected output VC number is stored in a register; the input VC is then blocked from requesting VC allocation again. The last task of the VC allocator is to send a message to the SU so that the selected output VC is marked as occupied. At this point, VC allocation is done.

To implement such a unit, a v:1 RRA is needed per VC on the first arbitration level so that each VC can request all VCs of the output port. On the second arbitration level, an N×v:1 RRA is needed per output VC, since any input VC may send a request to a given output VC. [10]
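The two arbitration levels can be sketched as a behavioural model. The Python code below is a simplification and not the VHDL allocator: input VCs are identified by a plain integer index, a single round-robin pointer per level replaces the separate state of every RRA, and the names `rr_pick` and `allocate_vcs` are hypothetical.

```python
def rr_pick(candidates, last):
    """Round-robin choice: lowest candidate above `last`, else wrap to the lowest."""
    if not candidates:
        return None
    later = [c for c in candidates if c > last]
    return min(later) if later else min(candidates)


def allocate_vcs(requests, locked, v, last_vc=-1, last_input=-1):
    """One VC allocation round.

    requests: dict input_vc_index -> output port requested by its head flit
    locked:   set of (port, vc) pairs already occupied (SU state), updated in place
    v:        number of VCs per port
    Returns a dict input_vc_index -> granted (port, vc).
    """
    # Level 1: every input VC picks one free VC on its requested port.
    wanted = {}
    for in_vc, port in requests.items():
        free = [vc for vc in range(v) if (port, vc) not in locked]
        choice = rr_pick(free, last_vc)
        if choice is not None:
            wanted.setdefault((port, choice), []).append(in_vc)
    # Level 2: every requested output VC grants exactly one of its requesters.
    grants = {}
    for target, contenders in wanted.items():
        winner = rr_pick(contenders, last_input)
        grants[winner] = target
        locked.add(target)          # the SU marks the output VC as occupied
    return grants


su_state = set()
first = allocate_vcs({0: 2, 5: 2}, su_state, v=4)   # both head flits want port 2
second = allocate_vcs({5: 2}, su_state, v=4)        # the loser retries
print(first, second)
```

In the first round both input VCs pick the same output VC and only one is granted; in the second round the locked VC is masked off, so the remaining head flit obtains a different VC on the same port.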

4.2.7 Switch Allocator Unit

The switch allocator unit determines which flits can traverse the crossbar. With VC-based flow control, only one flit can leave an input port in one clock cycle, since there is only one physical channel per port. Similarly, only one flit can leave through an output port. The arbitration is again divided into two levels. Every flit, whether head, body or tail, goes through the SA stage.

Unlike in the VC allocator, the first level of arbitration is implemented per input port. However, the request signal is generated by every VC in the port separately. If a VC holds a valid flit and its VA has been granted, it sends a request to the switch allocator together with the requested output port in one-hot coding. The requests are masked by the results of the CC units, since only an output VC with enough credits can accept a new flit. Requests to output VCs without credit are filtered out. To sum up, a valid request to SA must fulfil two conditions:

• the input VC is allocated to an output VC

• the output VC has enough credits
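The request generation described above can be written down as a small behavioural sketch. The Python code below only builds the masked request vectors of one input port and does not perform the arbitration itself. The function name, the dictionary layout of a VC and the use of bit position `out_port` for the one-hot code are assumptions made for this example.

```python
def sa_request_vectors(input_vcs, credits):
    """Build the one-hot output-port request of every VC of one input port.

    input_vcs: list of dicts with the keys
               has_flit (bool), out_port (int), out_vc (int, or None before VA)
    credits:   dict (out_port, out_vc) -> credit count held by the CC units
    """
    vectors = []
    for vc in input_vcs:
        allocated = vc["out_vc"] is not None
        has_credit = allocated and credits.get((vc["out_port"], vc["out_vc"]), 0) > 0
        if vc["has_flit"] and has_credit:
            vectors.append(1 << vc["out_port"])   # one-hot coded output port
        else:
            vectors.append(0)                     # request masked off
    return vectors


port_vcs = [
    {"has_flit": True, "out_port": 2, "out_vc": 1},      # valid request
    {"has_flit": True, "out_port": 4, "out_vc": None},   # VA not granted yet
    {"has_flit": True, "out_port": 0, "out_vc": 3},      # no credit downstream
    {"has_flit": False, "out_port": 6, "out_vc": 0},     # empty VC
]
cc_state = {(2, 1): 2, (0, 3): 0, (6, 0): 4}
requests = sa_request_vectors(port_vcs, cc_state)
print([f"{r:07b}" for r in requests])
```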
