Design and Implementation of AXI-based Network-on-Chip Systems for Flow Regulation


TRITA-ICT-EX-2009:157

Master Thesis in Electronic System Design


Jiayi Zhang

September 2009

Supervisor: Dr. Zhonghai Lu


Abstract

In Network-on-Chip (NoC) design, controlling Quality-of-Service (QoS) is crucial for building predictable systems. In this project, we design and implement an AXI-based system on the Nostrum NoC, which features a 2D mesh topology and deflection routing. The main components we add to the Nostrum NoC are master and slave interfaces. The master interface performs packetization, queuing and multiplexing. The slave interface performs de-packetization, queuing, de-multiplexing and, in particular, reordering of transfers. We also build master and slave modules to serve as traffic generators and sinks. One particular feature of the master module is that it can regulate traffic burstiness while generating traffic. All these models are implemented in VHDL at the register-transfer level (RTL). Masters and slaves use the AXI interface protocol from ARM.

With these components, we designed experiments to show the effect of traffic regulation. Our results show that burstier traffic results in larger transfer delay and larger backlog. We conclude that transfer delay and backlog in a best-effort network can be controlled to some degree by regulating traffic burstiness.


Acknowledgements

I would like to thank my examiner, Professor Axel Jantsch, and Dr. Zhonghai Lu for giving me the opportunity to work on my master thesis in their research group. I want to thank Dr. Zhonghai Lu for his forgiveness and continuous support, and for his patient guidance, from which I have learnt the fundamentals of how to do research. I would also like to thank my parents in China. Without them I would have been able neither to attend the master program at KTH nor to finish this master thesis.


Abbreviations

NoC Network on Chip
AXI Advanced eXtensible Interface
SOC System On Chip
IC Integrated Circuit
QoS Quality of Service
API Application Programming Interface
OEM Original Equipment Manufacturer
IDE Integrated Development Environment
RTOS Real-Time Operating System
ARM Advanced RISC Machines
RISC Reduced Instruction Set Computer
MI Master Interface
SI Slave Interface


1 Introduction

1.1 Background

Network-on-Chip (NoC) has been proposed to address the scalability limits of buses in clock frequency, bandwidth and power consumption. Research on NoC started around the year 2000 and it has been an active research area ever since.

The NoC group at KTH has proposed the Nostrum NoC. The Nostrum NoC is a packet-switched, dropless network that features a 2D mesh topology and deflection routing. As an example, a 2 x 2 network is shown in Figure 1. The routers do not have buffer queues. When contention for a shared link occurs, the router deflects the packets that lose the link arbitration to unfavored links.

Figure 1. An example of Nostrum NoC (Source: The Nostrum homepage)

Quality-of-Service (QoS) has been a main concern in on-chip network research because routing packets may introduce non-deterministic behavior and thus uncertainty in delay and jitter. For real-time applications, which require guarantees even under worst-case conditions, this is not acceptable. At KTH, the concept of flow regulation has been proposed to control QoS by traffic shaping. This concept is based on network calculus. It has been demonstrated in [3] that flow regulation can reduce delay and backlog in a NoC with guaranteed services.


1.2 Project Overview

This project investigates the effect of flow regulation on the Nostrum NoC. To this end, we build a NoC-based system that consists of masters, slaves, master interfaces (MIs), slave interfaces (SIs) and the Nostrum NoC. The MIs and SIs connect the masters and slaves to the Nostrum NoC, respectively. The Nostrum router is already available as a VHDL model at RTL. The masters and slaves are used mainly for testing purposes. We focus on three main tasks:

1. Design and implement the MI and SI
2. Design and implement masters and slaves
3. Experiment on the effect of flow regulation

Since a typical master/slave IP has a standard interface, such as AXI or OCP, we have chosen to use AXI in this project. This means that the MI and SI are specific to the AXI protocol.

1.3 Thesis Structure

Chapter 1 Introduction: This chapter gives a short description of the project, including its background, objectives and design tasks.

Chapter 2 NoC System: We describe the NoC system which has been constructed in the project. This system consists of masters, slaves, and master and slave interfaces connected to the Nostrum network.

Chapter 3 Experiments: Experiments for investigating the effect of flow regulation on the Nostrum network are described and results are analyzed.

Chapter 4 Conclusion: This chapter summarizes the project work, also


2 NoC System Description

This chapter first describes the whole NoC system, and then details the hardware modules.

2.1 System Overview

Figure 2 depicts the NoC system.

Figure 2. Nostrum overview [1]

The Nostrum network has fewer buffers, but it does not guarantee in-order delivery. A sequence of packets sent in the order P1, P2 … PN may be received in a different order, for example P2, P1, P3, P8 … PN. However, a slave module, typically a memory controller plus a memory, is not aware of the packet sequence. This implies that the packets sent to the slave module must be re-ordered into the original sequence P1, P2 … PN before being delivered to the slave.

2.2 AXI Master and Slave

AXI is the next-generation high-performance interconnect interface from ARM. It features five parallel channels: three write channels (AW, W and B) and two read channels (AR and R). AW carries the write address, W the write data and B the write acknowledgement; AR carries the read address and R the read data.

The protocol works in a request-response fashion. A write starts with an AW request, followed by W data from an AXI master, and finishes with a B response from an AXI slave. A read starts with an AR request from an AXI master and finishes with read data on R from an AXI slave.

Figure 3 shows how a read transaction is performed. The master initiates the read transaction by sending the address and control information through the read address channel. When the slave receives this information, it processes the request and returns the corresponding read data via the read data channel. The last read data transfer asserts the last signal, which indicates the end of the read transaction.

Figure 3. Channel architecture of reads

Figure 4 shows how a write transaction is performed. The master initiates the write transaction by sending the address and control information through the write address channel. The master then sends the data through the write data channel and asserts the last signal to indicate the end of the data transfer. When the slave receives the address information, it processes the request and waits for the data. After the slave has received all the data, it sends a write response through the write response channel. The master processes the response from the slave and finalizes the write transaction.

Figure 4. Channel architecture of writes
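The channel ordering of the two transaction types can be summarized in a small behavioural sketch. This is illustrative Python rather than the VHDL used in the project; the function names, the dictionary layout and the use of a plain list as slave memory are assumptions made for the example, and `len` here simply counts transfers.

```python
def write_transaction(txn_id, data):
    """Yield (channel, payload) events for one write burst, in channel order."""
    # Address and control information goes first, on the write address channel.
    yield ("AW", {"id": txn_id, "len": len(data)})
    # Data transfers follow; the final one asserts the last signal.
    for index, value in enumerate(data):
        yield ("W", {"data": value, "last": index == len(data) - 1})
    # The slave closes the transaction on the write response channel.
    yield ("B", {"id": txn_id})


def read_transaction(txn_id, memory, addr, length):
    """Yield (channel, payload) events for one read burst, in channel order."""
    yield ("AR", {"id": txn_id, "addr": addr, "len": length})
    for index in range(length):
        yield ("R", {"id": txn_id,
                     "data": memory[addr + index],
                     "last": index == length - 1})
```

For example, `list(write_transaction(3, [10, 20]))` produces one AW event, two W events and one B event, with `last` set only on the second W event.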

2.3 Master Interface (MI)

The MI performs packetization, multiplexing and queuing. It consists of the master side write channel interface, the master side read channel interface, the master side mux, the master to NoC FIFO and the NoC to master FIFO. Figure 5 shows the structure of the master interface.

[Figure 5: block diagram with the blocks MASTER, WRITE CHANNEL INTERFACE, READ CHANNEL INTERFACE, MUX, two FIFOs and ROUTER, and the channels WRITE ADDRESS, WRITE DATA, WRITE RESPONSE, READ ADDRESS and READ DATA]

Figure 5. Structure of Master Interface

2.3.1 Write Channel Interface

[Figure 6: block diagram with the blocks WRITE ADDRESS INF, WRITE DATA INF, WRITE RESPONSE INF, two PACKET UNITs, one DE-PACKET UNIT and the WRITE TRANSACTION TABLE, and the signals WA SIGS, WD SIGS, WR SIGS, WA REQ, WD REQ and FIFO IN]

Figure 6. Structure of Master Write Channel Interface

The write channel interface module interacts with the AXI master's write channels and packetizes the write requests into network packets. The structure of the master write channel interface is shown in Figure 6. Figure 8 shows the program flow of the write channel interface. Both the write address request and the write data request are initiated by valid signals. When the AXI master asserts the corresponding valid signal, the write channel interface answers with the ready signal. For a write address request, it first creates an entry in the write transaction table to record the characteristics of the request. The structure of the write transaction table is depicted in Figure 7.

VALID ID LEN SIZE BURST CTR

Figure 7. Structure of Write transaction table

The VALID field indicates whether the entry is occupied. The ID field records the transaction ID. The LEN field records the length of the transaction. The SIZE field records the size of the transaction. The BURST field records the burst type of the transaction. The CTR field is the key field for maintaining in-order transfer in the NoC. It records the sequence number of the current transfer within the transaction. When a transfer is packed into a network packet, this field is included in the packet. When the slave side receives the packet, it places the packet in the reorder buffer at the location corresponding to its CTR number.

For a write data request, the interface searches the write transaction table to find the transaction record with the same ID field. It then packs the write data request into a network packet together with the CTR field, which indicates the order of the transfer within the transaction. After packing the write request into a packet, the interface asserts a request signal to the mux. The mux resolves the contention and sends the packets to the FIFO.

When the write channel interface receives the write response packet from the NOC to master FIFO, it will restore the request from the packet. Since the write response means the acknowledgement from the slave, we remove the entry in the write transaction table with the same ID to finish the transaction.
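The bookkeeping described above can be sketched as a small behavioural model. This is a minimal Python illustration, not the VHDL implementation; the class and method names and the packet dictionary are hypothetical, and only the field names VALID, ID, LEN, SIZE, BURST and CTR come from Figure 7.

```python
from dataclasses import dataclass


@dataclass
class WriteEntry:
    valid: bool = False   # entry occupied?
    id: int = 0           # transaction ID
    len: int = 0          # number of transfers in the transaction
    size: int = 0         # transfer size
    burst: int = 0        # burst type
    ctr: int = 0          # sequence number of the next transfer


class WriteTransactionTable:
    def __init__(self, entries):
        self.rows = [WriteEntry() for _ in range(entries)]

    def on_write_address(self, txn_id, length, size, burst):
        """Record a new transaction; return False if the table is full."""
        for row in self.rows:
            if not row.valid:
                row.valid, row.id, row.len = True, txn_id, length
                row.size, row.burst, row.ctr = size, burst, 0
                return True
        return False

    def on_write_data(self, txn_id, data):
        """Pack one write data transfer and stamp it with the current CTR."""
        for row in self.rows:
            if row.valid and row.id == txn_id:
                packet = {"type": "WD", "id": txn_id,
                          "ctr": row.ctr, "data": data}
                row.ctr += 1
                return packet
        raise KeyError(f"no open write transaction with id {txn_id}")

    def on_write_response(self, txn_id):
        """The write response finishes the transaction: free its entry."""
        for row in self.rows:
            if row.valid and row.id == txn_id:
                row.valid = False
                return
```

Because each data packet carries its own CTR value, the receiving side can place it correctly even if the network delivers the packets out of order.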

Figure 8. Pseudo code for the Master Write Channel Interface

    if awvalid = '1' then
        put write address request to table;
        packet write address request;
        assert request to mux;
        wait for ack from mux;
    end if;

    if wvalid = '1' then
        search the table;
        packet write data request;
        assert request to mux;
        wait for ack from mux;
    end if;

    if fifo_in_valid and fifo_in_ready then
        de-packet write response;
        remove corresponding entry;
        assert bvalid to master;
        wait for bready from master;
    end if;

2.3.2 Read Channel Interface

[Figure 9: block diagram with the blocks READ ADDRESS INF, READ DATA INF, PACKET UNIT, DE-PACKET UNIT and READ TRANSACTION REORDER BUFFER, and the signals RA SIGS, RD SIGS, WA REQ and FIFO IN]

Figure 9. Structure of Master Read Channel Interface

Figure 9 and Figure 10 show the structure and the flow of the read channel interface. When the interface receives the ARVALID signal, it creates an entry for the transaction and allocates space in the reorder buffer. The read address request then travels to the slave, and the slave answers the request with the corresponding data. The read data transfers travel through the network and reach the master side. When receiving a packet from the NoC to master FIFO, the interface first unpacks the data and then searches the read transaction table to put the data into the right place in the reorder buffer. The next action for the read interface is to find valid output data to feed the master. Valid data means data that is next in the order of its transaction. Because the transfers can reach the master in arbitrary order, the interface sometimes has to wait for the first transfer of a transaction to arrive even though all the other transfers are already in the reorder buffer.

Figure 10. Pseudo code for the Master Read Interface

    if arvalid = '1' then
        put read address request to table;
        packet read address request;
        assert request to mux;
        wait for ack from mux;
    end if;

    if fifo_in_valid then
        de-packet read data;
        search rid in the table;
        put data in reorder buffer;
    end if;

    if rready = '1' then
        search the reorder buffer for valid output;
        if rlast = '1' then
            clear the entry;
        end if;
    end if;
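The in-order release of read data can be sketched as follows. This is a minimal Python illustration for a single transaction, assuming that every read data packet carries the sequence number of its transfer; the class name `ReorderBuffer` and its methods are hypothetical and not taken from the implementation.

```python
class ReorderBuffer:
    """Holds the transfers of one transaction, indexed by sequence number."""

    def __init__(self, length):
        self.slots = [None] * length   # one slot per expected transfer
        self.next_out = 0              # sequence number the master expects next

    def put(self, ctr, data):
        """Store an arriving transfer at the position given by its number."""
        self.slots[ctr] = data

    def pop_in_order(self):
        """Return (data, last) if the next expected transfer is present.

        Returns None while the expected transfer is still missing, even if
        later transfers have already arrived.
        """
        if self.next_out >= len(self.slots):
            return None
        data = self.slots[self.next_out]
        if data is None:
            return None
        self.next_out += 1
        return data, self.next_out == len(self.slots)
```

If transfers 1 and 2 arrive before transfer 0, `pop_in_order()` keeps returning `None` until transfer 0 has been stored, which is the waiting case described above.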

2.3.3 Master side Mux

The master mux resolves the contention that arises when a write address request, a write data request and a read address request occur at the same time. Since there is only one output port to the network, simultaneous requests must be injected into the network one by one. The master side mux serves the requests with fixed priority. Because address requests should arrive at the slave side first, address requests have higher priority than data requests.
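A fixed-priority selection of this kind can be sketched in a few lines. The Python below is illustrative only; the request keys `WA`, `RA` and `WD` and the function name are assumptions, and the priority order (write address, then read address, then write data) follows the pseudo code of Figure 11.

```python
# Highest priority first: address requests are served before data requests.
PRIORITY = ("WA", "RA", "WD")


def mux_select(requests, fifo_full):
    """Pick one pending request per cycle.

    requests maps a request kind to its packet; returns (kind, packet) for
    the winner, or None if nothing can be forwarded this cycle.
    """
    if fifo_full:
        return None
    for kind in PRIORITY:
        if kind in requests:
            return kind, requests[kind]
    return None
```

With a write data and a read address request pending at the same time, `mux_select` forwards the read address packet first and leaves the write data request for a later cycle.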

Figure 11. Pseudo code for the Master side Mux

    if fifo not full then
        if wa request then
            out to fifo = wa packet;
        elsif ra request then
            out to fifo = ra packet;
        elsif wd request then
            out to fifo = wd packet;
        end if;
    end if;

2.3.4 Master to NoC FIFO

We implement the master to NoC FIFO as a circular buffer. We expect this to reduce the toggling rate of the transistors, so that it consumes less energy than a shift register. The size of the FIFO can be determined when the platform is set up. The FIFO only deals with three kinds of outgoing packets: write address requests, write data requests and read address requests. Packets of other types are discarded directly. Also note that an AXI transaction is finalized by receiving the response from the slave side. So if the master component is configured with 2 outstanding write and 2 outstanding read transactions, it generates at most 4 transactions before the slave responds. In this case, if the FIFO accepts all the packets and cannot send them to the network, the maximum number of items in the FIFO is the total number of transfers of 4 transactions without response. For instance, if each transaction has at most 16 transfers, the FIFO has to buffer 4 address requests (2 read and 2 write) and 16 × 2 write data transfers, which is 36 packets in total. This is the maximum occupation of the FIFO, and we can guarantee that it will not be exceeded.
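The circular-buffer organisation and the sizing argument can be sketched together. The Python below is a behavioural illustration, not the VHDL FIFO; the names `CircularFifo` and `master_to_noc_depth` are hypothetical, and the modulo wrap-around is one common way to write the pointer update.

```python
class CircularFifo:
    """FIFO whose entries stay in place; only two pointers move."""

    def __init__(self, depth):
        self.depth = depth
        self.buf = [None] * depth
        self.write_pointer = 0
        self.read_pointer = 0
        self.item_count = 0

    def push(self, item):
        """Store an item; return False if the FIFO is full."""
        if self.item_count == self.depth:
            return False
        self.buf[self.write_pointer] = item
        self.write_pointer = (self.write_pointer + 1) % self.depth
        self.item_count += 1
        return True

    def pop(self):
        """Return the oldest item, or None if the FIFO is empty."""
        if self.item_count == 0:
            return None
        item = self.buf[self.read_pointer]
        self.read_pointer = (self.read_pointer + 1) % self.depth
        self.item_count -= 1
        return item


def master_to_noc_depth(write_outstanding, read_outstanding, burst_len):
    """Worst-case occupancy: all address requests plus all write data."""
    address_packets = write_outstanding + read_outstanding
    write_data_packets = write_outstanding * burst_len
    return address_packets + write_data_packets
```

For the configuration in the text, `master_to_noc_depth(2, 2, 16)` evaluates to 4 + 32 = 36, so a `CircularFifo(36)` would never reject a packet from this master.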

Figure 12. Pseudo code for the Master to NoC FIFO

    if mux valid then
        if not full then
            fifo[write_pointer] = input;
            item_count ++;
            if write_pointer = depth then
                write_pointer = 0;
            else
                write_pointer ++;
            end if;
        end if;
    end if;

    if network ready then
        if not empty then
            output = fifo[read_pointer];
            item_count --;
            if read_pointer = depth then
                read_pointer = 0;
            else
                read_pointer ++;
            end if;
        end if;
    end if;

2.3.5 NoC to Master FIFO

The NoC to master FIFO only deals with two kinds of packets: read data packets and write response packets. It first reads one item from the FIFO buffer. If its type is write response, the FIFO dispatches it to the write channel interface. If its type is read data, the FIFO dispatches it to the read channel interface. There may be background traffic packets in the network; if the FIFO receives such packets, it discards them directly.

In this case the FIFO size can also be derived from the system configuration. If the master can have 2 write and 2 read transactions outstanding, the maximum occupation of the FIFO is 2 write responses plus 16 × 2 read data transfers, which is 34 in total.
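The dispatch rule and the occupancy bound for the return direction can be sketched as follows. This is illustrative Python; the packet type strings mirror the names used in the pseudo code of Figure 13, while the function names and return values are assumptions made for the example.

```python
def dispatch(packet):
    """Route a packet popped from the NoC to master FIFO by its type."""
    if packet["type"] == "read_data":
        return "read_channel_interface"
    if packet["type"] == "write_response":
        return "write_channel_interface"
    return None   # background traffic or any other type: discard


def noc_to_master_depth(write_outstanding, read_outstanding, burst_len):
    """Worst case: one response per write plus all read data transfers."""
    return write_outstanding + read_outstanding * burst_len
```

With 2 outstanding writes, 2 outstanding reads and 16 transfers per transaction, `noc_to_master_depth(2, 2, 16)` gives 2 + 32 = 34, matching the figure quoted above.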

Figure 13. Pseudo code for NoC to Master FIFO

    if resource in then
        if not full then
            fifo[write_pointer] = input;
            item_count ++;
            if write_pointer = depth then
                write_pointer = 0;
            else
                write_pointer ++;
            end if;
        end if;
    end if;

    if not empty then
        temp = fifo[read_pointer];
        item_count --;
        if read_pointer = depth then
            read_pointer = 0;
        else
            read_pointer ++;
        end if;
        if temp.type = read_data then
            request to read channel interface;
        elsif temp.type = write_response then
            request to write channel interface;
        else
            discard the packet;
        end if;
    end if;

2.4 Slave Interface (SI)

The SI performs de-packetization, re-ordering, de-multiplexing and queuing. The slave interface is similar to the master interface. The slave side mux, the slave to NoC FIFO and the NoC to slave FIFO work in the same way as their master side counterparts. However, the slave side write and read interfaces are different. Figure 14 shows the structure of the slave interface.

[Figure 14: block diagram with the blocks SLAVE, WRITE CHANNEL INTERFACE, READ CHANNEL INTERFACE, MUX, two FIFOs and ROUTER, and the channels WRITE ADDRESS, WRITE DATA, WRITE RESPONSE, READ ADDRESS and READ DATA]

Figure 14. Structure of Slave Interface

2.4.1 Slave side write interface

[Figure 15: block diagram with the blocks WRITE ADDRESS INF, WRITE DATA INF, WRITE RESPONSE INF, two DE-PACKET UNITs, one PACKET UNIT and WRITE TRANSACTION REORDER BUFFER, and the signals WA SIGS, WD SIGS, WR SIGS, WR REQ and two FIFO IN inputs]

Figure 15. Structure of Slave side Write Channel Interface

The slave side write interface behaves like the master side read interface. It unpacks the data packets from the FIFO and then manages the write transaction reorder buffer. Figure 15 and Figure 16 show the structure and flow of the slave side write channel interface.

Figure 16. Pseudo code for Slave side Write Channel Interface

    if fifo_wa_in then
        if new transaction then
            create new entry in the table;
        else
            fill the header;
        end if;
        assert awvalid;
        wait for awready;
    end if;

    if fifo_wd_in then
        fill the reorder buffer;
    end if;

    if wready then
        search the table for valid output;
        assert wvalid;
    end if;

2.4.2 Slave side read channel interface

[Figure 17: block diagram with the blocks READ ADDRESS INF, READ DATA INF, DE-PACKET UNIT, PACKET UNIT and READ TRANSACTION TABLE, and the signals RA SIGS, RD SIGS, RD REQ and FIFO IN]

Figure 17. Structure of Slave side Read Channel Interface

Figure 18. Pseudo code for Slave side Read Channel Interface

    if fifo_ra_in then
        fill the header;
        assert arvalid;
        wait for arready;
    end if;

    if rvalid then
        search the table;
        pack counter to the request;
        request to mux;
        wait for ack;
    end if;

The main problem for SIs is to maintain in-order transmission to the slave. The reordering mechanism is table-based: reordering is done by filling a table. The principle is shown in Figure 19 and Figure 20. The reorder buffer has two parts, the header table and the data array. When a write transaction arrives, the SI fills the header table to record the transaction. The MST_POS field records the position of the master in the NoC, so that the slave side can send data back to that master. The DATA_INDEX field links the header to the data array. The DATA_VALID field indicates whether the address request has been sent to the slave, because the address must be sent to the slave ahead of the data transfers.

VALID ID MST_POS DATA_VALID DATA_INDEX NXT

Figure 19. Structure of the Reorder Header Table

VALID WRITE DATA

Figure 20. Structure of the Reorder Data Table
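The two-part reorder buffer can be sketched as a behavioural model. The Python below is a simplified illustration, not the VHDL design: the class and method names are hypothetical, each transaction is assumed to own a fixed block of `burst_len` slots in the data array, and NXT is assumed here to hold the number of the next transfer to forward, since the text does not define that field. Freeing a header at the end of a transaction is omitted.

```python
class SlaveWriteReorder:
    """Header table plus data array, loosely following Figures 19 and 20."""

    def __init__(self, max_transactions, burst_len):
        self.burst_len = burst_len
        self.headers = [None] * max_transactions
        self.data = [None] * (max_transactions * burst_len)

    def _find(self, txn_id, create=False):
        for header in self.headers:
            if header is not None and header["id"] == txn_id:
                return header
        if not create:
            return None
        for slot, header in enumerate(self.headers):
            if header is None:
                self.headers[slot] = {
                    "id": txn_id, "mst_pos": None, "data_valid": False,
                    "data_index": slot * self.burst_len, "nxt": 0}
                return self.headers[slot]
        raise RuntimeError("header table full")

    def on_write_address(self, txn_id, mst_pos):
        """Address packet arrived: record the master and open the gate."""
        header = self._find(txn_id, create=True)
        header["mst_pos"] = mst_pos
        header["data_valid"] = True   # address has been passed to the slave

    def on_write_data(self, txn_id, ctr, data):
        """Data packet arrived, possibly before its address packet."""
        header = self._find(txn_id, create=True)
        self.data[header["data_index"] + ctr] = data

    def next_for_slave(self, txn_id):
        """Return the next in-order transfer, or None if it must wait."""
        header = self._find(txn_id)
        if header is None or not header["data_valid"]:
            return None
        position = header["data_index"] + header["nxt"]
        data = self.data[position]
        if data is None:
            return None
        self.data[position] = None
        header["nxt"] += 1
        return data
```

In this sketch data that overtakes its address packet in the network is buffered but held back until the address has been recorded, and transfers are then released strictly in sequence order.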

2.5 Flow Regulation

Flows from a master to a slave are regulated according to the concept of the regulation spectrum, which gives the upper and lower limits of regulation. We use the regulation factors σ and ρ to define the characteristics of each flow. We give an example to show the two limits. Figure 21 shows a flow without regulation. If the flow is not regulated, it can generate bursts of any length. Here the flow generates 8 consecutive transfers and then waits for another 32 cycles before transferring again. In a network, this can increase the network traffic suddenly and may induce a high rate of congestion.

Figure 21. Flow without Regulation [4]

Figure 22 depicts the flow with regulation. After regulation, the 8 transfers are evenly distributed along the t axis. In our design, the flow is regulated by the MI. The MI controls the AXI ready signals to let the master's request proceed when it has valid tokens and to stall it otherwise.

Figure 22. Flow with Regulation [4]
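The token mechanism can be illustrated with a small cycle-based model. This Python sketch makes assumptions that the text does not spell out: the regulator is given a burst size and a window length in cycles, it refills its tokens to the burst size at the start of every window, and the master always has a transfer pending. The names `TokenRegulator` and `send_cycles` are hypothetical.

```python
class TokenRegulator:
    """Grants at most `burst` transfers in every window of `period` cycles."""

    def __init__(self, burst, period):
        self.burst = burst
        self.period = period
        self.tokens = 0
        self.cycle = 0

    def step(self, valid):
        """Advance one clock cycle; return True if a transfer is accepted."""
        if self.cycle % self.period == 0:
            self.tokens = self.burst          # refill at the window start
        ready = self.tokens > 0               # ready is gated by the tokens
        accepted = valid and ready
        if accepted:
            self.tokens -= 1
        self.cycle += 1
        return accepted


def send_cycles(burst, period, transfers):
    """Cycles at which a master that always has data gets its transfers out."""
    regulator = TokenRegulator(burst, period)
    sent = []
    cycle = 0
    while len(sent) < transfers:
        if regulator.step(valid=True):
            sent.append(cycle)
        cycle += 1
    return sent
```

Under these assumptions, `send_cycles(8, 40, 16)` yields cycles 0 to 7 and 40 to 47, the bursty pattern of 8 transfers followed by 32 idle cycles, while `send_cycles(1, 5, 8)` yields one transfer every 5 cycles. Both patterns have the same average rate of 0.2 transfers per cycle.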

2.6 System Integration

Figure 23 demonstrates the basic configuration of the experiment platform. The master component is connected to the network router through the master interface, while the slave is connected through the slave interface. The master and slave interfaces function as both the RNI and NI.

Regarding synthesis, the master components use some non-synthesizable features of VHDL, such as file operations and the real data type, so the master cannot be synthesized. The master and slave interfaces, however, are designed to be synthesized to real hardware, and the NoC infrastructure is fully synthesizable with very good results. With proper IP cores serving as master and slave components, we are able to implement the whole system on an FPGA board or even on chip.

[Figure 23: block diagram of the experiment platform: a MASTER with its master interface and a SLAVE with its slave interface, each consisting of WRITE CHANNEL INTERFACE, READ CHANNEL INTERFACE, MUX and two FIFOs attached to a ROUTER]

3 Experiments

This chapter reports experiments and results.

3.1 Experiment Setup

3.1.1 Purpose

The purpose of the experiments is to investigate the effect of flow regulation on the Nostrum NoC. To observe the network behavior closely, we use simple experiments.

The two groups of experiments are designed as follows:

1. 4 Masters and 2 Slaves
2. 8 Masters and 4 Slaves

For each group of experiments, we inject write transactions at different rates and burstiness. We investigate the delay of transfers and the backlog.

3.1.2 Infrastructure of the Experiment

All of the experiments are based on a 4x4 mesh network with deflection routing enabled. At a scale of 4x4, we can explore the influence of deflection routing and keep the simulation time reasonable. Figure 24 shows the topology of the network. According to Erland [5], network traffic affects the inner part of the mesh network the most, so we place the master components in the central part of the network. We index the nodes in the network by their row and column numbers, which start from 1. The upper left node is indexed as (1,1) and the lower right node as (4,4).

[Figure 24: 4x4 grid of routers (R)]

Figure 24. Basic Structure of 4x4 Mesh Network

The first group of experiments is based on the 4x4 mesh network with 4 masters and 2 slaves. Figure 25 depicts the distribution of the 4 masters and 2 slaves in the 4x4 mesh network. To make the effect of network traffic more significant, we make the traffic flows of the 4 masters pass through the bisection of the mesh network. For instance, the master at (2,2) accesses the slave at (4,4). The colored arrow lines show the possible traffic flows of the masters. However, since this is a deflection routing network, the packets can travel through the network along any possible route. Note that only 6 out of 16 nodes in the network generate traffic, so the network is lightly loaded. Most packets will travel to their destination along the shortest path.

[Figure 25: 4x4 grid of routers (R) with 4 masters (MST) and 2 slaves (SLV) attached]

Figure 25. Flow Demonstration of 4x4 Mesh Network with 4M2S
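The node indexing and the notion of a shortest path can be made concrete with a short sketch. The Python below is illustrative; the function names are hypothetical, and the hop count is taken as the number of link traversals on a minimal route, which in a mesh is the Manhattan distance between the two (row, column) indices.

```python
def min_hops(src, dst):
    """Minimal number of link traversals between two mesh nodes."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def neighbours(node, rows=4, cols=4):
    """Nodes directly linked to `node` in a rows x cols mesh, indexed from 1."""
    row, col = node
    candidates = [(row - 1, col), (row + 1, col),
                  (row, col - 1), (row, col + 1)]
    return [(r, c) for r, c in candidates
            if 1 <= r <= rows and 1 <= c <= cols]
```

For the example flow in the text, `min_hops((2, 2), (4, 4))` is 4. A packet that is deflected takes more hops than this minimum, which is why the measured hop count can serve as an indicator of network contention.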

The second group of experiments is based on the 4x4 mesh network with 8 masters and 4 slaves. The distribution is depicted in Figure 26. We place the master components along row 2 and row 3. As Figure 26 shows, the traffic of all masters passes through the bisection of the mesh network. In this distribution, 12 out of 16 nodes generate traffic, which puts higher pressure on the network. We expect to see effects introduced by the deflection routing.

[Figure 26: 4x4 grid of routers (R) with 8 masters (MST) and 4 slaves (SLV) attached]

Figure 26. Flow Demonstration with 8M4S

3.1.3 Evaluation Statistics

We applied 2 groups of regulation parameters on the experiment platform. The first group consists of 5 burstiness levels with ρ=0.2: 1/5, 2/10, 4/20, 8/40 and 16/80. The second group consists of 5 burstiness levels with ρ=0.5: 1/2, 2/4, 4/8, 8/16 and 16/32. Within each group the burstiness increases step by step, which is expected to put different pressure on the network and on the FIFO at the slave side.

During the experiments, we measure the following data: the maximum number of items in the slave side FIFO, the cycles each transfer takes, the cycles each transfer spends in the FIFO, the hops each transfer takes to travel through the network, and the cycles each transaction takes.
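The two parameter groups can be checked with a few lines of arithmetic. The sketch below assumes that a setting written as 8/40 means a burst of 8 transfers in every window of 40 cycles, which matches the unregulated example of 8 transfers followed by 32 idle cycles; the names used are hypothetical.

```python
from fractions import Fraction

# (burst, window) pairs for the two groups of experiments.
RHO_02 = [(1, 5), (2, 10), (4, 20), (8, 40), (16, 80)]
RHO_05 = [(1, 2), (2, 4), (4, 8), (8, 16), (16, 32)]


def rate(burst, window):
    """Average injection rate in transfers per cycle."""
    return Fraction(burst, window)


def idle_cycles(burst, window):
    """Cycles without a transfer in each window."""
    return window - burst
```

Every pair in `RHO_02` has the rate 1/5 and every pair in `RHO_05` has the rate 1/2, so within a group only the burstiness changes: `idle_cycles(1, 5)` is 4, while `idle_cycles(16, 80)` is 64.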

3.2 4 Masters 2 Slaves (4M2S)

3.2.1 Experiment results of ρ = 0.2

This group of experiments is performed with different combinations of m and n such that ρ is 0.2.

[Figure 27: five histogram panels, one per setting 1/5, 2/10, 4/20, 8/40 and 16/80]

Figure 27. Histogram of FIFO occupation in 4M2S ρ=0.2

[Figure 28: bar chart over the settings 1/5, 2/10, 4/20, 8/40 and 16/80]

Figure 28. Maximum FIFO occupation in 4M2S ρ=0.2

Figure 27 shows the histogram of FIFO occupation for ρ=0.2 with different burstiness and Figure 28 shows the maximum FIFO occupation for ρ=0.2 with different burstiness. From these two figures we see that higher burstiness leads to higher FIFO occupation due to contention at the slave side.

Figure 29. Mean transfer cycles in 4M2S ρ=0.2

Figure 30. Maximum transfer cycles in 4M2S ρ=0.2

Figure 30 depicts the maximum number of cycles each transfer takes to go through the network, while Figure 29 depicts the mean number of cycles. Each bar in these figures consists of 3 parts: the total interface delay, the FIFO delay and the network delay. Higher burstiness induces more congestion in the slave side FIFO as well as more packet congestion in the network. We can see that all three parts increase with increasing burstiness.

[Figures 29 and 30: stacked bar charts "4M2S ρ=0.2 Mean Transfer Cycles" and "4M2S ρ=0.2 Max Transfer Cycles" over the settings 1/5, 2/10, 4/20, 8/40 and 16/80; bar segments: Interface delay, FIFO delay, Hopcount]

Figure 31. Mean transaction cycles in 4M2S ρ=0.2

Figure 32. Maximum transaction cycles in 4M2S ρ=0.2

Figure 31 and Figure 32 show the mean and maximum cycles every AXI transaction takes to finish. Higher burstiness shortens the waiting period between two consecutive transfers, and AXI transactions are pipelined. Therefore, the higher the burstiness, the fewer cycles it takes to finish one transaction.

[Figures 31 and 32: bar charts of transaction cycles over the settings 1/5, 2/10, 4/20, 8/40 and 16/80; recovered title: "4M2S ρ=0.2 Mean Transaction Cycles"]

3.2.2 Experiment results of ρ = 0.5

This group of experiments is performed with different combinations of m and n such that ρ is 0.5.

Figure 33 shows the histogram of slave side FIFO occupation and Figure 34 shows the maximum slave side FIFO occupation. Because the slave can only process one transfer every 2 cycles, ρ=0.5 is the limit of the slave's processing capability. In this case, different levels of burstiness induce almost the same FIFO occupation. Note that every AXI transaction is finalized by a response. A master with a fixed outstanding capability therefore cannot generate an unlimited number of outstanding transactions: once it reaches its maximum number of outstanding transactions, it must wait for a response from the slave before generating a new transaction. Consequently, the maximum FIFO occupation will not exceed a certain number, which is related to the burstiness and the maximum outstanding transaction capability.

[Figures 33 and 34: five histogram panels, one per setting 1/2, 2/4, 4/8, 8/16 and 16/32, and a bar chart of maximum FIFO occupation over the same settings]

Figure 35 and Figure 36 depict the mean and maximum cycles it takes to finish one transfer. We can see that higher burstiness brings higher maximum transfer cycles, longer FIFO occupation and a larger hop count.

[Figures 35 and 36: stacked bar charts "4M2S ρ=0.5 Mean Transfer Cycles" and "4M2S ρ=0.5 Max Transfer Cycles" over the settings 1/2, 2/4, 4/8, 8/16 and 16/32; bar segments: Interface delay, FIFO delay, Hopcount]

Figure 37 and Figure 38 depict the mean and maximum cycles it takes to finish one AXI transaction with ρ=0.5. Since the generation rate of the transfers reaches the limit of the slave's processing capability, even lower burstiness does not induce a long extra waiting period.

[Figures 37 and 38: bar charts "4M2S Mean Transaction Cycles" and "4M2S Max Transaction Cycles" over the settings 1/2, 2/4, 4/8, 8/16 and 16/32]

3.2.3 Comparison between ρ= 0.2 and ρ=0.5

Figure 39. Comparison of mean delay in 4M2S

Figure 40. Comparison of maximum delay in 4M2S

Figure 39 and Figure 40 show the mean and maximum transfer delay for both ρ=0.2 and ρ=0.5. ρ=0.2 gives the better result in both the mean and the maximum case; at higher burstiness, however, the results for ρ=0.2 are almost equal to those for ρ=0.5.

[Figures 39 and 40: stacked bar charts "4M2S Mean delay" and "4M2S Max delay ρ=0.2 vs ρ=0.5" over the settings 1/5, 1/2, 2/10, 2/4, 4/20, 4/8, 8/40, 8/16, 16/80 and 16/32; bar segments: Interface delay, FIFO delay, Hopcount]

Figure 41. Comparison of mean transfer delay in 4M2S

Figure 42. Comparison of maximum transfer delay in 4M2S

Figure 41 and Figure 42 depict the individual comparison of transfer delay. We can see that ρ=0.5 leads to much higher transfer delay when burstiness is low.

[Figures 41 and 42: grouped bar charts "4M2S Mean Transfer delay" and "4M2S Max Transfer delay", ρ=0.2 vs ρ=0.5, over burstiness levels 1 to 5]

Figure 43. Comparison of mean FIFO delay in 4M2S

Figure 44. Comparison of maximum FIFO delay in 4M2S

Figure 43 and Figure 44 depict the individual comparison of FIFO delay. We can see that ρ=0.5 induces much higher FIFO delay when burstiness is low. In some cases the FIFO delay is up to 40% larger than with ρ=0.2.

[Figures 43 and 44: grouped bar charts "4M2S Mean FIFO delay" and "4M2S Max FIFO delay", ρ=0.2 vs ρ=0.5, over burstiness levels 1 to 5]

Figure 45. Comparison of mean hop count in 4M2S

Figure 46. Comparison of maximum hop count in 4M2S

Figure 45 and Figure 46 show the individual comparison of network delay. The network delay is influenced by the level of network traffic. Higher burstiness induces more congestion in the network. However, in the highest burstiness case, since the waiting period is longer than the time a packet takes to travel through the network, both settings give the same result.

[Figures 45 and 46: grouped bar charts "4M2S Mean Hopcount" and "4M2S Max Hopcount", ρ=0.2 vs ρ=0.5, over burstiness levels 1 to 5]

3.3 8 Masters 4 Slaves (8M4S)

The 8M4S platform is used to investigate the behavior of regulation under high network traffic. Each master generates 16-transfer AXI transactions with different regulation parameters.

3.3.1 Experiment results of ρ= 0.2

This group of experiments is performed with different combinations of m and n such that ρ is 0.2.

[Figure 47: five histogram panels, one per setting 1/5, 2/10, 4/20, 8/40 and 16/80]

Figure 47 shows the histogram of FIFO occupation and Figure 48 depicts the maximum FIFO occupation. Higher burstiness induces more congestion at the slave side, which leads to higher FIFO occupation.

[Figure 48: bar chart "8M4S ρ=0.2 Max FIFO Items" over the settings 1/5, 2/10, 4/20, 8/40 and 16/80]

Figure 49 depicts the maximum cycles each transfer spends to go through the network, while Figure 50 depicts the mean cycles each transfer spends to go through the network. Each bar in these pictures is formed by 3 parts, the interface delay, the FIFO delay and the network delay. It is easy to understand that higher burstiness would induce more congestion in the slave side FIFO as well as the network packet congestion. We can see that by increasing the burstiness, all of the three parts will increase.

[Figure: stacked bars of Interface delay, FIFO delay and Hopcount, plotted over 1/5, 2/10, 4/20, 8/40 and 16/80]

Figure 51 and Figure 52 depict the mean and maximum number of cycles it takes to finish one AXI transaction. Because the waiting period is inserted between bursts, a larger burst sends more of the transaction back-to-back, so higher burstiness needs less time to finish one transaction.
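The effect of the waiting period on the transaction time can be made concrete with a small calculation. The sketch assumes that a regulated master sends n transfers back-to-back at the start of each m-cycle window (ρ = n/m) and counts only the cycles needed to inject one 16-transfer transaction; network, FIFO and interface delays are ignored, and the function name is hypothetical.

```python
def injection_cycles(n, m, transfers=16):
    """Cycles needed to inject all transfers of one transaction.

    Assumed model: n transfers per m-cycle window, sent in the first
    n cycles of the window.
    """
    full_windows, rest = divmod(transfers, n)
    if rest == 0:
        # The last burst is a full one; no waiting is needed after it.
        return (full_windows - 1) * m + n
    return full_windows * m + rest

for n, m in [(1, 5), (2, 10), (4, 20), (8, 40), (16, 80)]:
    print(f"{n}/{m}: {injection_cycles(n, m)} cycles")
```

Under these assumptions the injection time shrinks from 76 cycles for 1/5 to 16 cycles for 16/80, because the whole transaction fits into a single burst in the latter case.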

3.3.2 Experiment results of ρ=0.5

This group of experiments is performed with different combinations of m and n that all give ρ=0.5.


Figure 53 is the histogram of FIFO occupation and Figure 54 shows the maximum FIFO occupation. We notice that, because ρ=0.5 is a high regulation rate, the combination m=2 and n=1 reaches the maximum FIFO occupation many times, while the other combinations reach it only a few times.

[Figure: histogram panels titled ρ = 1/2, ρ = 2/4, ρ = 4/8, ρ = 8/16 and ρ = 16/32]

Figure 55 and Figure 56 depict the mean and maximum transfer cycles. As with ρ=0.2, higher burstiness introduces congestion and therefore leads to more interface delay, higher FIFO delay and a larger hop count.

[Figure: stacked bars of Interface delay, FIFO delay and Hopcount, plotted over 1/2, 2/4, 4/8, 8/16 and 16/32]

Figure 57 and Figure 58 show the mean and maximum number of cycles it takes to finish one transaction. We notice that the combination m=2 and n=1 differs from the others: it always leads to the maximum number of transaction cycles. This is because it keeps the slave side under maximum pressure.

[Figure: 8M4S ρ=0.5 Mean Transaction Cycles, plotted over 1/2, 2/4, 4/8, 8/16 and 16/32]

3.3.3 Comparison between ρ=0.2 and ρ=0.5

Figure 59. Comparison of mean delay in 8M4S

Figure 60. Comparison of maximum delay in 8M4S

Figure 59 and Figure 60 depict the mean and maximum transfer delay. We can see that ρ=0.2 gives the better result; however, when burstiness is high, ρ=0.2 and ρ=0.5 give almost the same result.


Figure 61. Comparison of mean transaction delay in 8M4S

Figure 62. Comparison of maximum transaction delay in 8M4S

Figure 61 and Figure 62 show the individual comparison of the mean and maximum transfer delay. We notice that the combination m=2 and n=1 gives a very high mean transfer delay, even higher than in the highest burstiness case. We therefore consider that this combination has almost no regulation effect.


Figure 63. Comparison of mean FIFO delay in 8M4S

Figure 64. Comparison of maximum FIFO delay in 8M4S

Figure 63 and Figure 64 show the individual comparison of mean and maximum FIFO delay. From them we can see that ρ=0.5 induces higher congestion and makes each transfer stay in the FIFO for a longer time.


Figure 65. Comparison of mean hop count in 8M4S

Figure 66. Comparison of maximum hop count in 8M4S

Figure 65 and Figure 66 show the individual comparison of network delay. The network delay is influenced by the level of network traffic: higher burstiness induces more congestion in the network. However, in the highest burstiness case, the waiting period is larger than the number of cycles it takes a packet to travel through the network, so ρ=0.2 and ρ=0.5 give the same result.

3.4 Comparison between 4M2S and 8M4S

Here we compare the two network configurations. Since the network traffic pressure differs between 4M2S and 8M4S, we analyze the effect introduced by the network traffic.

3.4.1 Comparison of ρ=0.2

Figure 67. Comparison of mean transfer delay between 4M2S and 8M4S ρ=0.2

Figure 68. Comparison of maximum transfer delay between 4M2S and 8M4S ρ=0.2

Figure 67 and Figure 68 give an overall view of the mean and maximum transfer delay. We can see that the system benefits from lower network traffic: higher network traffic increases the time packets need to travel through the network. We shall analyze the individual comparisons to see which part introduces the difference.

Figure 69. Comparison of mean transfer cycles between 4M2S and 8M4S ρ=0.2

Figure 70. Comparison of maximum transfer cycles between 4M2S and 8M4S ρ=0.2

Figure 69 and Figure 70 show the individual comparison of mean and maximum transfer delay, where we can see that higher network traffic increases the transfer delay.

Figure 71. Comparison of mean FIFO delay between 4M2S and 8M4S ρ=0.2


Figure 72. Comparison of maximum FIFO delay between 4M2S and 8M4S ρ=0.2

Figure 71 and Figure 72 depict the comparison of mean and maximum FIFO delay. The 8M4S platform has a higher FIFO delay because transfers in the FIFO have to wait for others to arrive. The FIFO delay is thus also influenced by the network traffic.

Figure 73. Comparison of mean hop count between 4M2S and 8M4S ρ=0.2


Figure 74. Comparison of maximum hop count between 4M2S and 8M4S ρ=0.2

Figure 73 and Figure 74 show the comparison of mean and maximum network delay. This comparison should reflect the influence of the network directly. From the mean network delay diagram we can see that packets in 8M4S take 1 or 2 more cycles to travel through the network.
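Since Nostrum uses deflective routing on a 2D mesh, extra cycles in the network can be related to deflections. In a mesh every hop changes the Manhattan distance to the destination by exactly one, so each hop that moves a packet away from its destination costs two extra hops in total. The sketch below checks this on a made-up path; the coordinates, the path and the function names are illustrative assumptions and are not taken from the experiments.

```python
def manhattan(a, b):
    """Manhattan distance between two (x, y) mesh coordinates."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def walk(src, dst, moves):
    """Follow unit moves from src; return (hops, deflections)."""
    pos = src
    deflections = 0
    for dx, dy in moves:
        nxt = (pos[0] + dx, pos[1] + dy)
        # A hop that increases the distance to dst counts as a deflection.
        if manhattan(nxt, dst) > manhattan(pos, dst):
            deflections += 1
        pos = nxt
    if pos != dst:
        raise ValueError("path does not end at the destination")
    return len(moves), deflections

src, dst = (1, 1), (3, 2)
path = [(1, 0), (0, -1), (1, 0), (0, 1), (0, 1)]  # second hop is deflected
hops, deflections = walk(src, dst, path)
print(hops, deflections, manhattan(src, dst) + 2 * deflections)
```

Assuming one hop per cycle, one or two extra cycles on average would correspond to roughly one additional deflection per packet or fewer.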

3.4.2 Comparison of ρ=0.5

Figure 75. Comparison of mean transfer delay between 4M2S and 8M4S ρ=0.5

Figure 76. Comparison of maximum transfer delay between 4M2S and 8M4S ρ=0.5

Figure 75 and Figure 76 depict the comparison of mean and maximum transfer delay.

Figure 77. Comparison of mean transfer cycles between 4M2S and 8M4S ρ=0.5

Figure 78. Comparison of maximum transfer cycles between 4M2S and 8M4S ρ=0.5

Figure 77 and Figure 78 show the comparison of mean and maximum cycles it takes for each transfer.

Figure 79. Comparison of mean FIFO delay between 4M2S and 8M4S ρ=0.5


Figure 80. Comparison of maximum FIFO delay between 4M2S and 8M4S ρ=0.5

Figure 79 and Figure 80 show the comparison of mean and maximum FIFO delay. We can see that there is not much difference between the two network configurations.

Figure 81. Comparison of mean hop count between 4M2S and 8M4S ρ=0.5


Figure 82. Comparison of maximum hop count between 4M2S and 8M4S ρ=0.5

Figure 81 and Figure 82 show the comparison of mean and maximum network delay. This comparison should reflect the influence of the network directly. From the mean network delay diagram we can see that packets in 8M4S take 1 or 2 more cycles to travel through the network.

4 Conclusion

4.1 Summary

This report describes the construction of a system based on the Nostrum network. We reuse the existing Nostrum router and build interfaces that wrap the Nostrum network. The two interfaces that have been designed and implemented are the master interface and the slave interface. In particular, the two interfaces realize an industrial interconnect protocol, the AXI protocol.

After constructing the platform, we perform flow regulation experiments on it. With simple but illustrative experiments, we examine the effect of flow regulation on reducing delay and backlog under various traffic scenarios.

4.2 Future Work

As the first step, we have evaluated flow regulation using synthetic traffic flows. In the future, we shall use traffic streams from real applications. This requires integrating real IP modules (masters and slaves) into the NoC system.

As can be observed from the experimental results, regulation has a clear impact on system performance. In general, increasing the regulation strength results in lower transfer delays. However, we also observe exceptions in some cases. This complicated phenomenon is partially due to deflection routing, which is adaptive and non-deterministic, but an in-depth investigation is necessary. We are also aware that, for NoC systems, regulation is better orchestrated globally, since regulation on individual streams results in interference, and its impact needs to be investigated further from a global perspective.

