
Design of a core router
using the SoCBUS on-chip network

Master's thesis in Computer Engineering
performed at Linköping Institute of Technology
by
Jimmy Svensson

Reg nr: LiTH-ISY-EX--04/3562--SE

Supervisor: Daniel Wiklund
Examiner: Dake Liu

Linköping 2004

Division, Department: Institutionen för systemteknik, 581 83 Linköping
Date: 2004-12-02
Language: English
Report category: Examensarbete (Master's thesis)
ISRN: LITH-ISY-EX--04/3562--SE
URL for electronic version: http://www.ep.liu.se/exjobb/isy/2004/3562/
Title: Design of a core router using the SoCBUS on-chip network
Author: Jimmy Svensson

Abstract

The evolving technology has over the past decade contributed to a bandwidth explosion on the Internet. This makes it interesting to look at the development of the workhorses of the Internet, the core routers. The main objective of this project is to develop a 16 port gigabit core router architecture using intellectual property (IP) blocks and a SoCBUS on-chip interconnection network.

The router architecture will be evaluated by making simulations using the SoCBUS simulation environment. Some changes will be made to the current simulator to make the simulations of the core router more realistic. By studying the SoCBUS network load the bottlenecks of the architecture can be found. Changes to the router design and SoCBUS architecture will be made in order to boost the performance of the router.

The router developed in this project can under normal traffic conditions handle a throughput of 16x10Gbit/s without dropping packets. This core router is good enough to compete with the top of the line single-chip core routers on the market today. The advantage of this architecture compared to others is that it is very flexible when it comes to adding new functionality. The general on-chip network also reduces the design time of this system.


Abbreviations

ACL     Access Control List
BGP     Border Gateway Protocol
CAM     Content Addressable Memory
CRC     Cyclic Redundancy Check
FT      Forwarding Table
IPP     Input Packet Processor
IP      Intellectual Property
IP      Internet Protocol
IPv4    Internet Protocol version 4
LAN     Local Area Network
LPM     Longest Prefix Match
MAN     Metropolitan Area Network
MLPS    Mega Lookups Per Second
MPLS    Multi Protocol Label Switching
MPPS    Mega Packets Per Second
MU      Multicast Unit
NP      Network Processor
OC-3    SONET Optical Carrier at 155 Mbit/s
OC-48   SONET Optical Carrier at 2.5 Gbit/s
OC-192  SONET Optical Carrier at 10 Gbit/s
OPP     Output Packet Processor
OSI     Open Systems Interconnection
OSPF    Open Shortest Path First
PB      Packet Buffer
QoS     Quality of Service
SNMP    Simple Network Management Protocol
SONET   Synchronous Optical Network
TCP     Transmission Control Protocol
TTL     Time To Live
UDP     User Datagram Protocol
WAN     Wide Area Network


Contents

1 Introduction
  1.1 Background
  1.2 Objectives
  1.3 Method
  1.4 Thesis outline

2 Computer Networks
  2.1 Protocol Layers
    2.1.1 OSI Reference Model
    2.1.2 Internet Reference Model (TCP/IP)
  2.2 Network entities
    2.2.1 Routers
  2.3 Network Processing Tasks
    2.3.1 Classification
    2.3.2 Lookup
    2.3.3 Computation
    2.3.4 Data manipulation
    2.3.5 Queue management
    2.3.6 Control processing

3 SoCBUS on-chip network
  3.1 SoCBUS overview
  3.2 Packet connected circuit (PCC)
  3.3 Behavioral simulation environment
  3.4 Specification of the current implementation

4 Core router design
  4.1 Design flow
  4.2 Partitioning of functionality
    4.2.1 Input packet processor (IPP)
    4.2.2 Output packet processor (OPP)
    4.2.3 Forwarding table (FT)
    4.2.4 Packet buffer (PB)
    4.2.5 Multicast unit (MU)
    4.2.6 Central processing unit (CPU)
  4.3 On-chip traffic model
    4.3.1 Data path
    4.3.2 Control path
  4.4 Block interconnection using SoCBUS
    4.4.1 Number of IPP/OPPs and PBs
    4.4.2 General SoCBUS network structure
    4.4.3 Motivation of block placement
  4.5 Extraction of execution time for the functional blocks
    4.5.1 Input and Output Packet Processor (IPP/OPP)
    4.5.2 Forwarding Table (FT)
    4.5.3 Packet Buffer (PB)
    4.5.4 Multicast Unit (MU)
    4.5.5 Central Processing Unit (CPU)

5 Traffic simulations of the initial design
  5.1 Internet traffic model
    5.1.1 Minimum size packets
    5.1.2 Evenly distributed (RFC2544)
    5.1.3 Internet Mix
  5.2 Improvements in the SoCBUS simulator
    5.2.1 Implementation of a model for discrete distributions
    5.2.2 Implementation of dependencies between SoCBUS traffic
  5.3 Simulation results
    5.3.1 Throughput
    5.3.2 SoCBUS router lock
    5.3.3 SoCBUS wrapper send lock
    5.3.4 Conclusions

6 Second router design
  6.1 Improvements in the design
    6.1.1 Several forward tables
    6.1.2 More SoCBUS switches
    6.1.3 SoCBUS bus width
  6.2 Complete design after improvements
  6.3 Simulation results
    6.3.1 Throughput
    6.3.2 SoCBUS router lock
    6.3.3 SoCBUS wrapper send lock
    6.3.4 SoCBUS transfer overhead
    6.3.5 Conclusions

7 Final router design
  7.1 Improvements in the design
    7.1.1 More packet buffers and forward tables
    7.1.2 Changes in PCC
  7.2 Complete design after improvements
  7.3 Simulation results
    7.3.1 Throughput
    7.3.2 SoCBUS router lock
    7.3.3 SoCBUS wrapper send lock
    7.3.4 Conclusions
  7.4 New requirements on the functional blocks
    7.4.1 Input and output packet processors (IPP/OPP)
    7.4.2 Packet buffer (PB)
    7.4.3 Forwarding table (FT)

8 Conclusions
  8.1 Results
  8.2 Further work

A Dependency support in the SoCBUS simulator

B Supplementary results
  B.1 Initial design
  B.2 Second design

Chapter 1

Introduction

1.1 Background

The evolving technology has over the past decade contributed to a bandwidth explosion on the Internet. Today the Internet traffic doubles every six months [6]. This rapid change forces the actors in the networking business to act fast and reduce the time to market to be first with a new generation of products. One way of decreasing the time to market is to increase the (re-)use of Intellectual Property (IP) blocks.

SoCBUS is a research project at Linköping University that started in 1999. The aim of this project is to develop a bus system that provides the data and control connections between different IP blocks on one chip. Designing chips in this way, using IP blocks and SoCBUS, gives engineers a tool to further reduce the time to market.

Core routers are among the most important building blocks of the Internet. The bandwidth achieved using fiber optics is today much higher than the speed achieved by routers, so the routers are the limiting factor in the Internet. This makes the development of routers very interesting to look at.

The task of this project is to design a router chip on a system level using IP blocks and a SoCBUS on-chip interconnection network to connect the different blocks. Several benchmarks will be performed to find out the performance of the router and to evaluate the SoCBUS architecture.

1.2 Objectives

The main objective of this thesis is to design and evaluate a 16 port IP version 4 core router architecture developed using IP blocks and the SoCBUS on-chip interconnection network. The router should fulfill the requirements specified in RFC1812 [2]. A number of goals were defined.

• Evaluate different router architectures.


• Divide the router functionality into functional blocks that can be implemented as IP blocks.

• Use SoCBUS to make the interconnection between the different IP blocks.

• Perform benchmarks of the router to find bottlenecks in the design of the router and in the present SoCBUS architecture.

• Refine the router design and SoCBUS architecture to boost the performance of the router in terms of higher throughput and lower latency.

1.3 Method

The first thing to do is to study present router architectures to find an architecture that fits this type of implementation, or come up with a new type of architecture. When the theoretical design of the router is finished it has to be mapped onto SoCBUS. This is done by describing the design in the SoCBUS simulator environment. To make realistic benchmarks of the design, typical Internet traffic is used as input to the simulator. The results from the simulations will be analyzed and used to boost the performance of the router in terms of throughput and latency. This will be achieved by changing the design of the router and by making some changes to the current SoCBUS architecture.

1.4 Thesis outline

Chapters 2 and 3 will give the reader some background information about the technologies used in this project. Chapter 4 will describe the development of the initial router architecture. The results from simulations of this architecture are presented in chapter 5. Chapters 6 and 7 describe the process of refining the router model to improve the performance. Conclusions from the work are drawn in chapter 8.


Chapter 2

Computer Networks

This chapter will give a brief introduction to packet based computer networks. The most common network protocols and network entities will be described briefly. If you seek further knowledge in this field the book Computer Networks [5] is a good starting point.

2.1 Protocol Layers

To make computer networks easier to understand, most networks are organized as a stack of layers. Each layer can only send information to the next higher or lower layer. To exchange information between peer layers at different network nodes a header can be added to the data on the sending side. When the packet is received, the corresponding layer on the receiver side examines the header.

2.1.1 OSI Reference Model

The Open Systems Interconnection (OSI) model is a standardized way of describing protocol layers. It is composed of seven abstract layers as illustrated in figure 2.1.

Figure 2.1. The OSI reference model.

2.1.2 Internet Reference Model (TCP/IP)

Although the OSI model is widely used and often referred to, the explosive development of the Internet has made the TCP/IP protocol stack totally dominant. TCP/IP uses a four layer scheme composed of the layers presented below. Some examples of protocols and networks used in the TCP/IP model are shown in figure 2.2.

Figure 2.2. Protocols and networks in the TCP/IP model.

Link Layer

This layer defines the network hardware and device drivers. In this layer different protocols are used depending on the size of the network. When the size of a computer network is described one often refers to LAN (Local Area Network), MAN (Metropolitan Area Network) and WAN (Wide Area Network). Table 2.1 shows the protocols typically used in the different networks.

Network size   Link layer protocols
LAN            Ethernet
MAN            Ethernet or packet over SONET
WAN            Packet over SONET

Table 2.1. Data link layer protocols used at different network sizes.

Typical speeds for Ethernet are 100 Mbit/s or 1 Gbit/s, while packet over SONET typically runs at 155 Mbit/s (OC-3), 2.5 Gbit/s (OC-48) or 10 Gbit/s (OC-192). These speeds are good to keep in mind because the router design in this project will have to support one or several of these standards.

Network Layer

This layer handles simple communication with neighbors on the local network. IP is a typical network layer protocol that is used in the Internet. This protocol adds the possibility to identify each computer on the network using an arbitrary id called IP address.


Transport Layer

The transport layer handles communication between the actual source and destination even if the computers are not on the same local network. Typical transport layer protocols used on the Internet are TCP and UDP.

Application Layer

This is the end-user application encapsulation. Typical application layer protocols are DNS, HTTP and SMTP.

2.2 Network entities

The entities present on the Internet can be divided into two major groups, routers and terminals. Furthermore, routers that connect other routers with high bandwidth in backbone networks are normally denoted core routers. At the edge of the Internet, edge routers provide an access point between the local network and the Internet. The structure used on the Internet is shown in figure 2.3. Network terminals can be anything from desktop computers to file server systems. Wireless applications like mobile phones are also classified as network terminals.

Figure 2.3. The structure of the Internet: terminals, edge routers, core routers and the Internet core.

2.2.1 Routers

The Internet is a packet-switched communication network based on the IP protocol. This means that no dedicated communication channels from source to destination are created. Each network entity receives, stores, and forwards the packet to the closest entity along the path towards the destination host.

Routers can operate at different layers in the protocol stack. Layer 2 routers are commonly denoted switches and make all routing decisions based on information in the OSI layer 2 protocol header, while layer 3 routers are commonly denoted routers and make all routing decisions based on information in the OSI layer 3 protocol header. The basic functionality of Internet routers can be divided into the following parts.

Packet forwarding

The packet forwarding can be divided into two different parts, unicast and multicast. With multicast the incoming packet is forwarded to one or several output ports, while a unicast packet can only be forwarded to one output port. The packets should be forwarded so that they eventually reach their destination(s). To decide the output port(s) to which the packet will be forwarded the router examines the incoming packet header. By using the destination address as an index into the routing table the router can find the most appropriate output port(s). Because of the time limits of this work the multicast standard will not be discussed in more detail.

Route Processing

Because of the dynamic properties of the Internet, routers implement different routing protocols to share connectivity information and maintain routing tables. This information is needed to make correct decisions when a packet is forwarded.

The routing on the Internet is divided into two layers. On each local network an interior gateway routing protocol is used by the routers to determine the best way to the destination. These routing algorithms can be grouped into two major classes: nonadaptive and adaptive. Nonadaptive algorithms do not base their routing decisions on measurements or estimations of the current traffic and network topology. Instead the choice of route is static and is downloaded to the routers when the network is booted. Adaptive algorithms change their routing decisions to reflect changes in the network and traffic. One of the most widely used adaptive routing algorithms is OSPF (Open Shortest Path First).

Between the different networks, in the core of the Internet, an exterior gateway routing protocol is used. This protocol connects the different service providers using a unique ID for each network called an Autonomous System (AS). Because this traffic involves for example crossing international borders or being forwarded through another service provider, the exterior gateway protocol needs to be very flexible when it comes to routing policies. The protocol used on the Internet today is called BGP (Border Gateway Protocol) and is designed to allow many different kinds of routing policies.


Router special services

Tasks that fall into this category are filtering, traffic prioritizing, authentication and network management using for example SNMP (Simple Network Management Protocol). These services are not critical for the basic router functionality and will not be further discussed in this thesis.

2.3 Network Processing Tasks

Processors used in network entities are generally called network processors. The tasks generally performed by network processors are described below.

2.3.1 Classification

To make a decision on how to process the incoming packets each packet has to be classified. The classification consists of both pattern matching and field value extraction. In routers you typically want to match the destination address against the access control list (ACL) to see if the packet should be forwarded or not. The pattern matching is performed either by calculation or by lookup tables.

2.3.2 Lookup

The lookup consists of looking up data based on a key, but is often used in conjunction with pattern matching to find one unique entry in the table. The most common application of lookup in the network processor domain is the route lookup. Based on data in the packet header the destination port and/or address is calculated. For MPLS and ATM the mapping is often one to one and only one lookup is required, but IPv4 and IPv6 require Longest Prefix Matching (LPM). Tree-like data structures are often used to efficiently store the table and to speed up the lookup.
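As an illustration of longest prefix matching, the Python sketch below stores routes in a simple binary trie and walks it bit by bit, remembering the last prefix that matched. The route entries and next-hop names are made up for the example; a real forwarding table block would rather use a compressed trie or CAM/TCAM hardware.

    # Minimal longest-prefix-match sketch using a binary trie over IPv4 addresses.
    class TrieNode:
        __slots__ = ("children", "next_hop")
        def __init__(self):
            self.children = [None, None]   # 0-bit and 1-bit branches
            self.next_hop = None           # set if a prefix ends at this node

    def insert(root, prefix, length, next_hop):
        node = root
        for i in range(length):
            bit = (prefix >> (31 - i)) & 1
            if node.children[bit] is None:
                node.children[bit] = TrieNode()
            node = node.children[bit]
        node.next_hop = next_hop

    def lookup(root, addr):
        """Walk the trie bit by bit, remembering the last matching prefix."""
        node, best = root, None
        for i in range(32):
            if node.next_hop is not None:
                best = node.next_hop
            node = node.children[(addr >> (31 - i)) & 1]
            if node is None:
                return best
        return node.next_hop if node.next_hop is not None else best

    root = TrieNode()
    insert(root, 0x0A000000, 8, "port 1")    # 10.0.0.0/8
    insert(root, 0x0A010000, 16, "port 2")   # 10.1.0.0/16
    print(lookup(root, 0x0A010203))          # -> "port 2" (the longest match wins)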

2.3.3 Computation

The most common calculations performed in a network processor are the calculation and/or updating of the header checksum or Cyclic Redundancy Check (CRC). With the new support for authentication, encryption and decryption algorithms sometimes need to be applied to the entire packet.
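A minimal Python sketch of the standard Internet checksum (RFC 1071) over an IPv4 header is shown below; the 20-byte example header is hand-built for illustration and is not taken from the thesis.

    def ipv4_checksum(header: bytes) -> int:
        """One's-complement sum of 16-bit words; checksum field assumed zeroed."""
        total = 0
        for i in range(0, len(header), 2):
            total += (header[i] << 8) | header[i + 1]
        while total >> 16:                       # fold carries back in (end-around carry)
            total = (total & 0xFFFF) + (total >> 16)
        return ~total & 0xFFFF

    # Example header: version/IHL, TOS, total length, ID, flags/fragment offset,
    # TTL=64, protocol=UDP(17), checksum=0, source 10.0.0.1, destination 10.1.2.3.
    hdr = bytes([0x45, 0x00, 0x00, 0x3C, 0x1C, 0x46, 0x40, 0x00,
                 0x40, 0x11, 0x00, 0x00, 0x0A, 0x00, 0x00, 0x01,
                 0x0A, 0x01, 0x02, 0x03])
    print(hex(ipv4_checksum(hdr)))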

2.3.4 Data manipulation

Any modification of the packet header is classified as data manipulation. One example is the TTL field in IPv4, which has to be decremented by one at every hop.
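The sketch below shows this per-hop manipulation: the TTL is decremented and the header checksum is patched incrementally in the style of RFC 1624 instead of being recomputed over the whole header. It assumes a plain 20-byte header and a TTL greater than zero.

    def decrement_ttl(header: bytearray) -> None:
        old_word = (header[8] << 8) | header[9]    # TTL and protocol share one 16-bit word
        header[8] -= 1                             # decrement TTL (assumes TTL > 0)
        new_word = (header[8] << 8) | header[9]
        checksum = (header[10] << 8) | header[11]
        # HC' = ~(~HC + ~m + m') in one's-complement arithmetic (RFC 1624, eq. 3)
        total = (~checksum & 0xFFFF) + (~old_word & 0xFFFF) + new_word
        while total >> 16:
            total = (total & 0xFFFF) + (total >> 16)
        checksum = ~total & 0xFFFF
        header[10], header[11] = checksum >> 8, checksum & 0xFF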


2.3.5 Queue management

The queue management is the scheduling and storage of the packets inside the network processor. The queue management kernel is responsible for traffic priorities, traffic shaping and other Quality of Service (QoS) applications.

2.3.6 Control processing

Control processing consists of several tasks, for example synchronization of the different parts of the design, the gathering of statistics and routing table updates. These tasks are generally performed by a general purpose processor.


Chapter 3

SoCBUS on-chip network

This chapter will give an introduction to the SoCBUS on-chip interconnection network developed at Linköping University [11]. To clarify the SoCBUS concept the terms System on Chip (SoC) and Network on Chip (NoC) will first be introduced.

The continuing development in modern electronics enables increasingly larger systems to be integrated on single chips. This is called Systems-on-Chip (SoC). When more and more different functionalities are added to a chip, a flexible on-chip bus system with support for multiple and simultaneous connections becomes vital. One way of achieving this is to add a general network that enables the different functional blocks to communicate with each other. This is called Network-on-Chip (NoC).

3.1 SoCBUS overview

The SoCBUS on-chip network consists of a number of switches that can be connected to each other using any network topology. Each switch is connected through a wrapper to one functional block or IP core, and to an arbitrary number of neighboring switches. Figure 3.1 shows a network connected processing tile and figure 3.2 shows an example of a 2D mesh network.

Figure 3.1. Network connected processing tile.

Figure 3.2. Overview of a SoCBUS network.

The routing decisions are made in the switches and are implemented using a static routing table. Each entry in the table shows the possible outputs that will take the route closer to the destination. This information is combined with the dynamic state of the output ports to select the appropriate output for a routing request.

The wrappers form the interface between the network and the IP blocks. The wrappers handle format conversion, necessary buffering, asynchronous clock domain bridging and network signaling.

3.2 Packet connected circuit (PCC)

The SoCBUS network uses a novel style of circuit switching called Packet Connected Circuit (PCC). When data is sent from one IP block to another a request packet first traverses the network to find the way to the destination. While doing this the packet path is locked and used as a circuit connection for the packet payload. If the route can not be established the request is sent back to the source and all switches in that path will be unlocked. Once the connection is established the data is sent. The last data packet will unlock the switches on the way to the destination. Figure 3.3 shows the PCC connection scheme.

Figure 3.3. PCC transfer: (a) first try successful, (b) second try successful.
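The following Python fragment is a deliberately simplified software model of the PCC handshake described above; the switch representation, the retry signalling and the absence of timing are illustrative assumptions, not the SoCBUS implementation.

    # Toy model of packet connected circuit (PCC) setup: lock every switch on
    # the path; on failure release the partial path and report a retry; on
    # success transfer the payload and then tear the circuit down.
    def pcc_transfer(path, payload):
        locked = []
        for switch in path:                    # request phase: lock switches hop by hop
            if switch["busy"]:
                for s in locked:               # nAck: release the partially locked path
                    s["busy"] = False
                return "retry"
            switch["busy"] = True
            locked.append(switch)
        # ack received: the locked path now acts as a circuit for the payload
        result = f"sent {len(payload)} bytes over {len(path)} switches"
        for s in locked:                       # the last data word unlocks the switches
            s["busy"] = False
        return result

    switches = [{"busy": False} for _ in range(3)]
    print(pcc_transfer(switches, b"\x00" * 40))   # -> "sent 40 bytes over 3 switches"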

3.3 Behavioral simulation environment

A complete simulation environment has been developed for making simulations of the SoCBUS on-chip interconnection network. This simulation environment consists of two parts, a stimuli generator and the actual simulator.

The stimuli generator is a tool for creating interesting traffic patterns used as input to the simulator. Several mathematical models can be used to specify different traffic properties like start time and size of packet. The input to the stimuli generator is described using XML.

The simulator performs the actual simulation of the network. The output of the stimuli generator together with a description of the network structure is given as input. The simulator is event based and all components like routers, links, sources and destinations are implemented as compiled-in behavioral models. The output of the simulator consists of different measurements of the network. This could for example be the lock time of each SoCBUS switch. More information about the simulation environment can be found in the SoCBUS Simulator manual [12]. The general SoCBUS simulation flow is shown in figure 3.4.

Figure 3.4. SoCBUS simulation flow.
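As a conceptual illustration of the event-driven principle only (not the actual SoCBUS simulator, which uses compiled-in behavioral models and a far richer component set), the Python sketch below runs callbacks from a time-ordered queue; the 5 ns delay corresponds to the 6-cycle route setup at 1.2 GHz given in the next section.

    import heapq, itertools

    _seq = itertools.count()            # tie-breaker so same-time events stay ordered

    def schedule(events, time, callback):
        heapq.heappush(events, (time, next(_seq), callback))

    def run(events, until):
        while events:
            time, _, callback = heapq.heappop(events)
            if time > until:
                break
            callback(time, events)      # callbacks may schedule further events

    def source(time, events):
        print(f"{time:.1f} ns: source injects a packet")
        schedule(events, time + 5.0, sink)   # 6 cycles at 1.2 GHz = 5 ns route setup

    def sink(time, events):
        print(f"{time:.1f} ns: sink receives the packet")

    events = []
    schedule(events, 0.0, source)
    schedule(events, 100.0, source)
    run(events, until=1000.0)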

3.4 Specification of the current implementation

The properties of the current simulation environment are based on a real implementation of the switches used in the SoCBUS on-chip network. The bus width is set to 16 bits in each direction and is clocked at 1.2GHz. The latency in the switches is 6 cycles during the route setup and 1 cycle for data transfer.


Chapter 4

Core router design

In this section the initial router design will be described. The different functional blocks used in the design are introduced and one mapping of functional blocks onto a SoCBUS network will be presented.

4.1 Design flow

The design flow used in this project can be described in the following six points.

1. Partition the router functionality into several functional blocks.

2. Write down the specifications of each block and find a way of implementing them that fulfills the requirements on time and functionality.

3. Extract the execution time of each function in each functional block to establish a database for system timing and scheduling.

4. Connect the different blocks using the SoCBUS on-chip network.

5. Make simulations of the SoCBUS network using realistic traffic.

6. Make modifications in the design until the design fulfills the requirements on time and functionality.

Points five and six may have to be iterated several times to fulfill the requirements on performance or to boost the performance beyond the requirements.

4.2 Partitioning of functionality

This router will be implemented as one main chip that performs the actual routing and one or several other chips that will be needed for the physical interfaces and for processing of the media dependent protocols. The latter chips will not be discussed further since this thesis only focuses on the development of a core routing kernel.


To make use of the SoCBUS on-chip network architecture the functionality has to be divided into different functional blocks. A number of functional blocks have been identified and their main functionality is described below.

4.2.1 Input packet processor (IPP)

There is one input packet processor (IPP) for each network input port. The IPP is responsible for the identification and verification of the incoming packet. Each IPP contains an access control list (ACL) which contains rules about which packets are allowed to be forwarded from this input port. Routing or control packets will be directly forwarded to the CPU and multicast packets will be forwarded to the multicast unit. The processing of unicast packets consists of the following parts:

• Receive packet from input network interface.
• Associate the packet with a unique ID (32 bits).
• Validate IPv4 header (TTL and header checksum).
• Filter packets via the Access Control List (ACL).
• Send the destination IP address together with the packet ID to the forward table.
• Send packet data and ID to the corresponding packet buffer.
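A Python sketch of this unicast path is given below. The helper names (acl, send_to_ft, send_to_pb), the assumption of a 20-byte header without options, and the ID counter are illustrative, not part of the actual block design.

    import itertools

    _ids = itertools.count()

    def header_ok(header: bytes) -> bool:
        """Verify the IPv4 header checksum: the one's-complement sum must be 0xFFFF."""
        total = sum((header[i] << 8) | header[i + 1] for i in range(0, 20, 2))
        while total >> 16:
            total = (total & 0xFFFF) + (total >> 16)
        return total == 0xFFFF

    def ipp_process(packet: bytes, acl, send_to_ft, send_to_pb):
        pid = next(_ids) & 0xFFFFFFFF              # associate packet with a unique 32-bit ID
        if packet[8] <= 1 or not header_ok(packet[:20]):
            return "drop"                          # TTL expired or corrupt header
        dst = int.from_bytes(packet[16:20], "big")
        if not acl(dst):
            return "drop"                          # rejected by the access control list
        send_to_ft(dst, pid)                       # lookup request to the forward table
        send_to_pb(packet, pid)                    # payload to the assigned packet buffer
        return pid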

4.2.2 Output packet processor (OPP)

There is one output packet processor (OPP) for each network output port. The OPP is responsible for the updating of packet headers, including calculation of checksum and CRC. The processing of a packet consists of these parts:

• Receive packet from buffer.

• Update IPv4 header (TTL and header checksum).
• Send packet to the output network interface.

4.2.3 Forwarding table (FT)

The forward table (FT) block is responsible for the address lookups. A typical lookup consists of the following parts:

• Receive packet destination IP address and ID from the IPP.
• Perform lookup.

• Send the result of the lookup (next-hop and output port) together with the packet ID to the corresponding buffer.


4.2.4 Packet buffer (PB)

The packet buffer stores the packet until the lookup result from the FT is received and the corresponding output port is available. The general data flow is shown in figure 4.1. Because of the bandwidth limit in the SoCBUS network there will be no central packet buffer but each packet buffer will be responsible for a number of input packet processors. These are the tasks of the PB:

• Receive packet from the IPP.

• Save packet data to the buffer memory.

• When the next-hop address and output port have been received from the FT, the next-hop is bundled with the packet and sent to the OPP which is responsible for the given output port.

Figure 4.1. Traffic flow.

4.2.5 Multicast unit (MU)

The multicast unit (MU) is responsible for delivering multicast packets. These are the tasks of the MU:

• Receive multicast packets from the IPP.
• Send routing requests to the FT.

• When the next-hop addresses and output ports have been received from the FT, the next-hop address(es) are bundled with the packet(s) and sent to the OPP(s) which are responsible for the given output port(s).

4.2.6 Central processing unit (CPU)

The CPU will be responsible for the tasks that have no dedicated block. These are the tasks of the CPU:

• Configure and synchronize the functional blocks.

• Handle routing packets (BGP and OSPF) and distribute route information to the forward table(s).


• Handle control packets dedicated for the router. This could for example be SNMP requests.

• Gather statistics from the different blocks.

4.3 On-chip traffic model

In this section the general SoCBUS traffic flow, independent of the Internet traffic model, will be described. The traffic flow is divided into two different types, data path and control path.

4.3.1 Data path

The data path consists of transferring the incoming packets to the correct output port depending on the packet header. In the typical case the packet is forwarded to only one output port, but it could also be forwarded to several. This feature is called multicast. Multicast packets will not be taken into consideration at this point of the study.

The data path traffic flow is critical and determines the speed of the router. A more detailed description of the data path is given below. Each sending is associated with a traffic number.

1. Send packet payload from IPP-i to PB-j.

2. Send destination IP address and packet ID from IPP-i to FT.
3. Send next-hop address and output port from FT to PB-j.
4. Send packet payload and next-hop from PB-j to OPP-k.

By looking at the description of the data path one realizes that there are dependencies between the different transfers. These dependencies are described below.

• Traffic number 3 depends on the completion of number 2 and that the route lookup is finished.

• Traffic number 4 depends on number 1 and 3.

To simplify the understanding of the traffic flow it can be described using flow graphs. Figure 4.2 describes the data path traffic flow of an 8 port router. The numbers present on the arrows are traffic numbers.

Figure 4.2. Traffic flow, 8 port router.

4.3.2 Control path

Traffic belonging to the control path is not critical for the router functionality in the same direct way as the data path traffic. The control path traffic is given below.

• Routing packets dedicated for the router. These packets are processed by the CPU.
• Routing updates from the CPU to the forward table.

• Configuration of the different blocks. This is done by the CPU.
• Gathering of statistics from the different blocks.

4.4 Block interconnection using SoCBUS

In this section the task of connecting the different functional blocks using the SoCBUS interconnection network will be described.

A general view of the core router SoCBUS network is given in figure 4.3. In this figure there can be several instances of the input/output packet processors and packet buffers.

Figure 4.3. General view of the core router SoCBUS network: IPP, OPP, PB, MU, FT and CPU blocks connected through the SoCBUS interconnect.

4.4.1 Number of IPP/OPPs and PBs

In the current architecture the number of input and output packet processors is simply determined by the number of ports on the actual router. If the line speed of the router gets very high, one possibility to boost the performance is to let several input/output packet processors serve one port. The number of packet buffers needed is more difficult to determine. If one global packet buffer is used one realizes that the throughput in the packet buffer will be equal to the aggregate bandwidth of the router. This would in the case of a 16 port gigabit router be 16 Gbit/s in each direction. This can be compared with the maximum data throughput in one SoCBUS node, which is 19 Gbit/s. This throughput is the theoretical maximum and will in reality never be achieved because of the overhead in the PCC protocol. This will be further discussed in the section covering Internet traffic models. The number of packet buffers in the initial design was determined by simulations to be four.
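The rough arithmetic behind this choice, assuming the 19 Gbit/s figure is the raw 16-bit SoCBUS link capacity at 1.2 GHz before any PCC setup overhead:

    ports, line_rate = 16, 1e9                  # 16 gigabit ports
    aggregate = ports * line_rate               # load on a single global packet buffer
    socbus_raw = 16 * 1.2e9                     # 16-bit links clocked at 1.2 GHz
    print(aggregate / 1e9, socbus_raw / 1e9)    # 16.0 vs 19.2 Gbit/s: too little headroom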

4.4.2 General SoCBUS network structure

In SoCBUS you have the freedom to choose any kind of network structure to connect the different blocks. In this design a 2D mesh network is used to connect the blocks. A 2D network is easy to understand and to get an overview of. It is also easy to physically implement the wiring and switches when using a 2D mesh structure. The SoCBUS network is shown in figure 4.4.

Figure 4.4. SoCBUS network for the initial design.

4.4.3 Motivation of block placement

The current simulation environment does not provide any optimization of block placement, so the task of placement optimization is very much based on “trial and error”. There are however some rules of thumb when designing the network. Blocks that communicate a lot with each other should be placed close to each other. By doing this the transfers finish faster and fewer SoCBUS nodes will be locked during each transfer. Another important thing is to try to distribute the network load over the whole network. Several placement strategies had to be tested in the traffic simulator to find a good one. The final mapping is shown in figure 4.4.

4.5 Extraction of execution time for the functional blocks

This thesis does not focus on the implementation details of the different functional blocks; instead, the modeling of execution times and dependencies between the different blocks is of more interest. Still, the requirements on the blocks have to be realistic. In this section the requirements in terms of execution time on the functional blocks will be discussed. In some cases a reference design will be given to ensure that the requirements are realistic.

4.5.1 Input and Output Packet Processor (IPP/OPP)

To handle a speed of 1 Gbit/s per port the execution time for this block has to be less than 672 ns, assuming the worst case scenario of only minimum size packets. The functionality of the port processors is very much the same as the packet processor for terminals implemented by Ulf Nordqvist [8].
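One way to arrive at the 672 ns figure, assuming standard gigabit Ethernet framing overhead (preamble/SFD and inter-frame gap) around the 64-byte minimum frame:

    wire_bits = (64 + 8 + 12) * 8     # minimum frame + preamble/SFD + inter-frame gap, in bits
    budget_ns = wire_bits / 1.0       # at 1 Gbit/s one bit takes exactly 1 ns
    print(budget_ns)                  # 672.0 ns per packet and per port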

4.5.2 Forwarding Table (FT)

The worst case scenario for this block is also when the input only consists of minimum size packets. Because this block supplies all traffic streams with information about next hop and output port, the execution time for this block has to be 16 times lower than for the input and output port processors, thus 42 ns. This corresponds to a lookup rate of approximately 24 MLPS. In these calculations the routing updates, triggered by the CPU that processes the routing protocol requests, are neglected. These updates take a long time compared to the lookups.

4.5.3 Packet Buffer (PB)

This block consists of a control block, index structure and a buffer memory. In this design four packet buffers will be used to serve the 16 input ports. The total execution time for both saving and fetching data from the memory is approximately 168 ns. In this case it might be more interesting to look at the bandwidth needed to see if any memories fulfill the requirements. The average bandwidth needed for the buffer memory in this block is approximately 8 Gbit/s because data will have to be both saved to the memory and later fetched to be sent to the output port.
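The corresponding back-of-the-envelope figure, assuming each packet buffer serves four gigabit ports and every byte is written once and read once:

    ports_per_pb, line_rate = 4, 1e9             # four gigabit ports per packet buffer
    memory_bw = 2 * ports_per_pb * line_rate     # every byte is written once and read once
    print(memory_bw / 1e9)                       # 8.0 Gbit/s, matching the estimate above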

4.5.4 Multicast Unit (MU)

The multicast unit has basically the same functionality as the packet buffer. The only difference is that the multicast unit forwards the packet to several output packet processors instead of one. For simplicity the other requirements on the multicast unit are the same as for the packet buffer.

4.5.5 Central Processing Unit (CPU)

The CPU is not part of the data path so the functions are not critical for the actual routing. This fact makes the timing constraints of the CPU uninteresting at this moment. A general purpose processor like ARM940 [1] could be used to fulfill the requirements for this block.


Chapter 5

Traffic simulations of the initial design

This chapter will first introduce the reader to different Internet traffic models. To be able to describe these traffic models in the SoCBUS simulation environment some improvements had to be made to the simulator. These changes will be briefly described in this chapter and more details can be found in appendix A. The simulation results will be presented and discussed. A deeper discussion about possible improvements of the design can be found in the next two chapters.

5.1 Internet traffic model

To be able to make realistic simulations of the proposed router architecture it is important to use traffic patterns that reflect the actual traffic in the core of the Internet today. A good way of describing traffic patterns on the Internet is as a discrete distribution of the most common packet sizes. Today this is also the most common way of describing Internet traffic during core router benchmarks. The different traffic distributions used in this project are described below.

5.1.1 Minimum size packets

In general a traffic pattern consisting of only minimum size packets generates the highest load on a router. The reason for this is that for every packet that arrives at the input port of the router a lookup has to be performed to determine the output port to which the packet should be forwarded. In reality this traffic pattern is not very realistic, but it is still a good way of measuring the performance of the router.

For ordinary IP packets over Ethernet the minimum size Ethernet packet is 64 bytes. Because this core router chip only works at OSI protocol layer 3 and higher the actual packet data shrinks to 40 bytes. This is the actual amount of data that will be sent through the SoCBUS network.


5.1.2 Evenly distributed (RFC2544)

RFC 2544 defines a number of tests used to measure the performance of a network device [3]. In this memo different packet distributions for different physical media are proposed. Benchmarks for Ethernet network devices should use packet sizes evenly distributed across 64, 128, 256, 512, 1024, 1280 and 1518 bytes.

5.1.3 Internet Mix

The most realistic way of benchmarking the router would of course be to use real traffic captured on the Internet. Newman [4] observed live packet flows from the core of the Merit network. The result from the observations was a distribution of packet sizes that he called the Internet mix. This traffic pattern has more or less become the standard in benchmarking of core Internet devices. Table 5.1 shows the probability for different packet sizes according to the Internet mix.

Probability   Packet size
56%           40 bytes
23%           1500 bytes
17%           576 bytes
5%            52 bytes

Table 5.1. Distribution of packet sizes using the Internet mix.

5.2 Improvements in the SoCBUS simulator

The current implementation of the SoCBUS simulation environment has no support for describing traffic using a discrete distribution. There is no support for dependencies between different SoCBUS transfers either. To be able to make an accurate description of the traffic flow in the SoCBUS network, and by that get more realistic simulation results, these features were implemented. A short description of the improvements will be given in the following section. More implementation details will be given in appendix A.

5.2.1 Implementation of a model for discrete distributions

The typical way of describing a traffic pattern is to associate different values with different probabilities. The model implemented is called “discrete” and is completely implemented in the stimuli generator. The new model was implemented using the GNU Scientific Library [7]. Figure 5.1 shows how the implemented model could be used when specifying the network stimuli using the XML format. This example defines that the value 40 is associated with the weight 56 and the value 1500 is associated with the weight 23. This is actually the beginning of a definition of the Internet mix described above, where the value specifies the packet size in bytes.

Figure 5.1. Example of discrete mathematical model.
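A Python illustration of what the new discrete model provides is shown below; random.choices merely stands in for the GNU Scientific Library routine used in the actual stimuli generator, and the sizes and weights are those of the Internet mix in table 5.1.

    import random

    sizes   = [40, 1500, 576, 52]     # packet sizes in bytes (Internet mix)
    weights = [56, 23, 17, 5]         # relative probabilities in percent

    def internet_mix(n):
        """Draw n packet sizes according to the weighted discrete distribution."""
        return random.choices(sizes, weights=weights, k=n)

    print(internet_mix(10))           # e.g. [40, 1500, 40, 576, 40, 40, 1500, 40, 52, 576]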

5.2.2 Implementation of dependencies between SoCBUS traffic

To be able to describe the internal traffic as a data flow with dependent traffic, some new functionality had to be added to the SoCBUS simulation environment. Before this change the only way of defining the start of a sending between two blocks in the SoCBUS network was to define the start time. With this new feature one sending can also be triggered by the completion of another transfer. This feature makes it possible to describe the traffic flow shown in figure 4.2.

To make it possible to describe the dependencies in the input to the stimuli generator some changes had to be made to the XML format. A new kind of task that defines the dependencies was added. The new XML format is described in figures 5.2 and 5.3. To get a deeper understanding of the stimuli generator it is recommended to read the thesis by Joakim Wallin [9].

Figure 5.2. Stimuli task working.

Figure 5.3. Stimuli task with dependency.
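Conceptually, the new task type lets a transfer start either at a fixed time or a given delay after the transfer it depends on completes. The Python sketch below models only that scheduling rule; the field names only loosely mirror the XML tags and the example lengths are arbitrary.

    def schedule_transfers(transfers):
        """transfers: name -> {"start": t, "length": l} or {"depends": name, "delay": d, "length": l}."""
        start, finish = {}, {}
        progress = True
        while progress and len(start) < len(transfers):
            progress = False
            for name, t in transfers.items():
                if name in start:
                    continue
                if "start" in t:
                    start[name] = t["start"]
                elif t["depends"] in finish:
                    start[name] = finish[t["depends"]] + t.get("delay", 0)
                else:
                    continue                      # dependency not resolved yet
                finish[name] = start[name] + t["length"]
                progress = True
        return start

    # Traffic numbers as in figure 4.2: transfer 3 starts when 2 is done,
    # transfer 4 starts when 3 is done (the single-dependency case).
    flows = {
        "1": {"start": 0, "length": 20},
        "2": {"start": 0, "length": 4},
        "3": {"depends": "2", "delay": 2, "length": 4},
        "4": {"depends": "3", "length": 20},
    }
    print(schedule_transfers(flows))   # {'1': 0, '2': 0, '3': 6, '4': 10}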

5.3 Simulation results

Now that all the tools for making accurate simulations are available it is time to perform the actual simulations and determine the performance of this router architecture. It is also important to find possible bottlenecks to make it possible to refine the design. First of all the general conditions for the simulations are defined.

The simulation time is set to 1 ms and for consistency this value will be used during all simulations. All simulations made in this chapter will be based on the network shown in figure 4.4 and with the specifications of SoCBUS given in chapter three. The throughput shown in the graphs is the actual throughput of each line port connected to the router. During all simulations the throughput is the same at all input ports. No delays are specified in the different functional blocks, which implies that the latency from input port to output port is only related to the SoCBUS network latency.

5.3.1 Throughput

To determine the maximum throughput the router can handle without dropping or delaying packets too much, a series of simulations is performed at different throughputs. A good way of determining the maximum speed the network can handle is to measure the time it takes for a packet to be sent from the input port to the correct output port. A sudden rise of the network latency shows that the network is getting highly loaded.

Minimum size packets usually generate the highest load on routers, so it is no surprise that the performance in terms of throughput is much lower than for other distributions of packet sizes. Figures 5.4, 5.5 and 5.6 show the SoCBUS network latency for minimum size packets, evenly distributed packets and the Internet mix. The latency is measured from the input packet processor to the correct output packet processor.

Figure 5.4. Latency using minimum size packets.

Figure 5.5. Latency using even distribution.

Figure 5.6. Latency using Internet mix.

5.3.2 SoCBUS router lock

In this section the SoCBUS network activity for different traffic distributions will be presented using different 3D plots. In all these 3D graphs the x axis represents the horizontal count of SoCBUS nodes beginning from the left. The y axis represents the vertical count of SoCBUS nodes starting from the top. This means that the upper left corner of the SoCBUS network will appear down to the left in the 3D graph.

The SoCBUS router lock describes the amount of time that each SoCBUS switch, or node as they sometimes are called, has been locked for transfers. This measure takes all five ports (up, down, left, right and wrapper) into consideration and is calculated as a mean value of the lock time associated with the different ports. Because the network load is very similar for the even distribution and the Internet mix, only the results for the Internet mix will be presented from now on. The rest of the results are given in appendix B. All results in this section are at the maximum throughput that the router can handle using that specific traffic model.

Figures 5.7 and 5.8 show the router lock using minimum size packets and Internet mix at 0.6 Gbit/s and 1.8 Gbit/s respectively. The two graphs look very much the same, though there are some interesting differences to notice. In both cases the traffic is very much concentrated to the center of the network. In the case of minimum size packets the forward table has the highest router lock, while in the case of Internet mix the packet buffers have the highest lock. This is what can be expected because a smaller packet size implies that more packets have to be processed by the forward table at the same speed.

Figure 5.7. Router lock using minimum size packets.

Figure 5.8. Router lock using Internet mix.

5.3.3 SoCBUS wrapper send lock

The wrapper send lock is the amount of time that the wrapper associated with each SoCBUS switch has been locked sending data to the SoCBUS switch. In other words, the 3D graphs presenting the wrapper send lock give an indication of the amount of time that each switch is sending data from the local IP block. The switches that have no value in these graphs do not send anything to the SoCBUS network. This is the case for example for the output packet processors.

Figures 5.9 and 5.10 show the wrapper send lock time for minimum size packets and Internet mix at 0.6 Gbit/s and 1.8 Gbit/s respectively.

By looking at the graphs it is easy to see that the forward table has the highest lock time when using minimum size packets and the packet buffers have the highest lock time when using the Internet mix. One other interesting thing to point out is that the send lock time for the packet buffers is not four times larger than the lock time for each input port even though the packet buffers send exactly 4 times the amount of data that the input ports do. This is because of the overhead in the PCC protocol. The input ports send two packets, one containing data to the packet buffer and one lookup request to the forward table. This means that the input port has to set up two SoCBUS links using the PCC protocol. The packet buffer only sends data to the output port and only has to set up one PCC link.

Figure 5.9. Wrapper send lock using minimum size packets.

Figure 5.10. Wrapper send lock using Internet mix.

5.3.4 Conclusions

It is now time to sum up the results from the initial simulations. The results show that the performance of the SoCBUS network using the Internet mix and the even distribution is quite similar, but that the throughput for minimum size packets is much lower. The reason for this is that the overhead in the PCC protocol for small packets is much larger than for large packets. In other words, when sending small packets a larger part of the time is consumed by handshaking between the different functional blocks. The throughput together with the average latency from IPP to OPP at 1 Gbit/s is shown in table 5.2.

The router lock has shown that the network load looks different depending on the traffic model. For small packets the network is heavily loaded by the forward table, while for the other packet distributions the packet buffers carry the highest load. Generally the network load is higher in the middle of the network.

The wrapper send lock time gives us a clear view of the bottlenecks in the current design. Figure 5.9 shows that the forward table has the highest send lock time for minimum size packets. This means that the links between the forwarding table and the packet buffers are the limiting factor using this particular traffic model. When the Internet mix is used the packet buffers have the highest send lock time. This means that the links between the packet buffers and the output packet processors are the limiting factor. Based on the results from these simulations the SoCBUS architecture and router model will now be refined to boost the performance of the router.

Traffic model       Maximum throughput   Avg latency at 1 Gbit/s
Minimum packets     0.6 Gbit/s           -
Even distribution   2.0 Gbit/s           925 ns
Internet mix        1.8 Gbit/s           822 ns

Table 5.2. Results from the simulations of the initial design.

Chapter 6

Second router design

This and the next chapter will describe the process of refining the router model and SoCBUS architecture to boost the performance of the router. During this process two different designs will be developed and evaluated. First of all the bottlenecks from the initial design will be identified. The router design and SoCBUS architecture will then be changed in a way that hopefully will increase the performance of the router. The new router design will then be simulated to evaluate the changes.

6.1 Improvements in the design

In this section the bottlenecks of the initial design are identified and improvements are made to boost the performance of the router.

6.1.1 Several forward tables

A big bottleneck of the initial design appears when the network is populated with small packets. The throughput achieved using minimum size packets is only one third of the throughput achieved when using either of the two other packet distributions. The bottleneck in this case is the forward table.

To improve the performance for small packets one more forward table is added to the network. Each forward table is now responsible for 8 packet processors instead of 16. By adding one more forward table we of course also add the problem of inconsistency between the different forward tables. The CPU will be responsible for making updates to the forward tables and for keeping the data consistent.

6.1.2 More SoCBUS switches

By looking at the simulation results from the initial design it is clear that the traffic is very much concentrated to the middle of the network. This could lead to congestion that may lead to delayed or even discarded packets. To avoid this problem the SoCBUS network size is increased from 7x6 to 8x7. The most network intensive blocks were also moved apart from each other to further distribute the network load over a larger part of the SoCBUS network.

6.1.3 SoCBUS bus width

When looking at the current implementation of the SoCBUS switches one realizes that the implementation is more or less independent of the bus width. This means that the delay in the switches will remain almost the same independent of the bus width. This is because the delay introduced in the switches mainly comes from the control block that for example determines the route. Because of the limited time of this project the implementation of the switch will not be further described. The details on this implementation can be found in [10].

By increasing the SoCBUS bus width from the current 16 bits to 32 or 64 bits it would be possible to increase the router throughput dramatically. It may even be possible to reach the magic line speed of 10 Gbit/s defined by the packet over SONET standard OC-192. This of course assumes that the IP blocks in the design can operate at the same speed. When deciding on the new bus width, one important thing to take into consideration is that the IP blocks should be able to handle the bus bandwidth. The bus bandwidth assuming a 64 bit bus is 64 bits × 1.2 GHz = 76.8 Gbit/s full duplex. Because of the overhead in the PCC protocol the actual data bandwidth will be much lower. Within a few years, memories fulfilling these speed requirements will be possible to implement. From now on the SoCBUS bus width will be set to 64 bits and the delays in the SoCBUS switch will be the same as for the 16 bit bus.

6.2 Complete design after improvements

Figure 6.1 shows the SoCBUS network for the second design of the router.

6.3 Simulation results

In this section the results from simulations of the second router design will be presented. All simulations are performed under the same conditions, except for the bus width, as described for the initial design. The maximum throughput is determined by looking at the latency from IPP to OPP and the SoCBUS router lock is analyzed to find the bottlenecks of the current design.

Figure 6.1. SoCBUS network for the second design.

6.3.1 Throughput

Figures 6.2 and 6.3 show the latency using minimum size packets and Internet mix respectively. The graph showing the even distribution can be found in appendix B.

Figure 6.2. Latency using minimum size packets.

Figure 6.3. Latency using Internet mix.

6.3.2 SoCBUS router lock

To get a view of the SoCBUS network activity we look at the router lock time. To make it easy to identify the bottlenecks of the network the router lock is shown at the maximum throughput that the design can handle. Figure 6.4 shows the router lock time for minimum size packets at 1.6 Gbit/s and figure 6.5 shows the router lock time for Internet mix at 8 Gbit/s.

Figure 6.4. Router lock using minimum size packets.

Figure 6.5. Router lock using Internet mix.

6.3.3 SoCBUS wrapper send lock

The wrapper send lock is the amount of time that the wrapper has been busy sending data to the current SoCBUS node. By looking at this graph it is easy to find the bottlenecks in the SoCBUS network. Figure 6.6 shows the wrapper send lock for minimum size packets at 1.6 Gbit/s and figure 6.7 shows the lock for Internet mix at 8 Gbit/s. In the case of minimum size packets the forward tables and the packet buffers are approximately evenly loaded, while in the case of Internet mix the packet buffers have the highest load.

Figure 6.6. Wrapper send lock using minimum size packets.

Figure 6.7. Wrapper send lock using Internet mix.

6.3.4 SoCBUS transfer overhead

Because of the big difference in maximum throughput between minimum size packets and the other packet size distributions, some measures concerning overhead have been studied. Figure 6.8 illustrates the ratio between the time that the packet buffers spend sending data and the time spent waiting to send data. The overhead has been measured at the maximum throughput that each packet size distribution can handle.

Figure 6.8. Transfer time versus overhead for minimum size packets, even distribution, and Internet mix.
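
The overhead measure in figure 6.8 is essentially the fraction of a packet buffer's busy time that is spent on PCC route setup rather than on payload transfer. The sketch below shows that ratio; the numbers used are purely illustrative and are not taken from the simulator.

    def transfer_overhead(setup_time_ns, data_time_ns):
        """Fraction of the busy time spent on route setup rather than payload transfer."""
        return setup_time_ns / (setup_time_ns + data_time_ns)

    # Illustrative values only: a short packet spends relatively more of its
    # busy time on route setup, so its overhead fraction is much higher.
    print(transfer_overhead(setup_time_ns=20.0, data_time_ns=5.0))    # 0.8  (short packet)
    print(transfer_overhead(setup_time_ns=20.0, data_time_ns=180.0))  # 0.1  (long packet)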

6.3.5 Conclusions

The simulations of the second design have shown that the changes made to the router design increased the performance of the router in the form of increased throughput and decreased network latency. It is also clear that the changes in network size, the new positions of the IP blocks and the addition of one more forward table have contributed to a more even load over the entire SoCBUS network. The most important measures from the simulations can be found in table 6.1.

Traffic model        Maximum throughput   Avg latency at 1Gbit/s
Minimum packets      1.6 Gbit/s           122 ns
Even distribution    9.5 Gbit/s           280 ns
Internet mix         8.0 Gbit/s           224 ns

Table 6.1. Results from simulation of the second design.


Chapter 7

Final router design

The task of refining this router model to find “the best” design without any automated optimization tools is of course very hard, or even impossible, to accomplish. The router design described in this chapter is the final design developed during this final year project. First the changes made to the second design will be described, then the simulation results will be presented and discussed, and finally the implementation of the functional blocks will be described.

7.1 Improvements in the design

In this section the bottlenecks of the second design are identified and improvements are made to boost the performance of the router.

7.1.1 More packet buffers and forward tables

By looking at the wrapper send lock from the second design it is obvious that the bottlenecks of the design are still the packet buffers and forward tables. Increasing the SoCBUS bus width could be a solution to the problem, but looking at the bandwidth of the packet buffers using, for example, a 128 bit bus, one realizes that this is not possible with today's memory technologies. Another solution is to add more packet buffers and forward tables to distribute the network load better. The problem with this solution is that the buffer memory will not be used efficiently. Despite this problem it was decided to double the number of packet buffers and forward tables. The router now consists of 4 forward tables and 8 packet buffers.

7.1.2 Changes in PCC

By looking at the overhead introduced by the PCC protocol, shown in figure 6.8, it is clear that the protocol overhead accounts for a very large part of the actual router lock. The reason for this overhead is that it takes time to set up the route between two blocks. The reason for the large overhead in the case of minimum size packets is that the ratio between the time it takes to set up the route and the time it takes to send the data is very large.

A new feature in the SoCBUS nodes that will decrease the overhead is proposed. This change affects the way small packets are treated: all packets that are 64 bits or smaller will be sent using speculative sending. This means that no route has to be set up in advance to send the data; instead the data is included in the first request packet.
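
A minimal sketch of how a sending node could choose between speculative sending and the normal PCC route setup is shown below. The 64 bit threshold comes from the text, but the function and method names are hypothetical and do not correspond to the actual SoCBUS node interface.

    SPECULATIVE_LIMIT_BITS = 64  # packets at or below this size ride in the request itself

    def send_packet(node, dest, payload):
        """Illustrative sender: small payloads piggyback on the PCC request,
        larger ones use the normal connection setup before the transfer."""
        payload_bits = len(payload) * 8
        if payload_bits <= SPECULATIVE_LIMIT_BITS:
            # Speculative sending: no route is set up in advance, the data is
            # carried in the first request packet.
            node.send_request(dest, data=payload)
        else:
            # Normal PCC behaviour: set up the route, send the data, release it.
            route = node.setup_route(dest)      # blocks until the route is acknowledged
            node.send_data(route, payload)
            node.release_route(route)

    class StubNode:
        """Stand-in for a SoCBUS wrapper, only here to make the sketch runnable."""
        def send_request(self, dest, data):
            print("speculative send of", len(data) * 8, "bits to node", dest)
        def setup_route(self, dest):
            print("route set up to node", dest)
            return dest
        def send_data(self, route, data):
            print("sent", len(data), "bytes to node", route)
        def release_route(self, route):
            print("released route to node", route)

    send_packet(StubNode(), dest=3, payload=b"\x00" * 8)    # 64 bits -> speculative
    send_packet(StubNode(), dest=3, payload=b"\x00" * 64)   # 512 bits -> normal PCC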

7.2 Complete design after improvements

Figure 7.1 shows the SoCBUS network for the final design of the router. Each packet buffer is responsible for two IPPs and each forward table for four IPPs. The SoCBUS bus width is set to 64 bits.

Figure 7.1. SoCBUS network for the final design.

7.3 Simulation results

In this section the results from simulations of the final router design will be presented. All simulations are performed under the same conditions as described for the second design, except for the short packet PCC implementation. The maximum throughput is determined by looking at the latency from IPP to OPP, and the SoCBUS router lock is analyzed to find the bottlenecks of the current design.



7.3.1 Throughput

Figures 7.2 and 7.3 show the latency using minimum size packets and Internet mix respectively. The graph showing the even distribution can be found in appendix B.

Figure 7.2. Latency using minimum size packets (throughput per port in Gbit/s versus latency in ns).

Figure 7.3. Latency using Internet mix (throughput per port in Gbit/s versus latency in ns).

7.3.2 SoCBUS router lock

Figure 7.4 shows the router lock time using minimum size packets at 2.6Gbit/s and figure 7.5 shows the router lock time using Internet mix at 14Gbit/s.

The router lock graphs show that the network load is much more evenly distributed than before. Looking at the output port processors it is also easy to see that the router lock is higher in the center of the network than along the edge of the network. This is expected because of the randomized properties of the traffic between the packet buffers and the output port processors.

Figure 7.4. Router lock using minimum size packets.

Figure 7.5. Router lock using Internet mix.

7.3.3 SoCBUS wrapper send lock

As described before, the wrapper send lock is the amount of time that the wrapper has been busy sending data to the current SoCBUS node. Figure 7.6 shows the wrapper send lock for minimum size packets at 2.6Gbit/s and figure 7.7 shows the lock for Internet mix at 14Gbit/s. These graphs show that the packet buffers now have the highest wrapper send lock, both for minimum size packets and for Internet mix. It is interesting to note that the wrapper send lock for the packet buffers is smaller in the middle of the network. This is because those packet buffers on average have a shorter path to the output ports.

Figure 7.6. Wrapper send lock using minimum size packets.

Figure 7.7. Wrapper send lock using Internet mix.

7.3.4 Conclusions

The graphs describing the router lock show that the traffic is better distributed over the whole network than before. The wrapper send lock graphs make it easy to see that the current bottleneck is the link between the packet buffers and the output ports. The big difference in network load between minimum size packets and the other packet distributions shows that the overhead is still very large for minimum size packets, despite the change made in the PCC protocol.

These simulations have shown that the SoCBUS network used in this router design can handle speeds up to, and even above, the 10Gbit/s line speed used in packet over SONET OC-192. The SoCBUS network still has problems achieving high speeds when it comes to small packets; some possible solutions to this problem will be described later in this chapter. A summary of the results from the simulations can be found in table 7.1.

Traffic model        Maximum throughput   Avg latency at 1Gbit/s
Minimum packets      2.6 Gbit/s           120 ns
Even distribution    18.0 Gbit/s          280 ns
Internet mix         15.0 Gbit/s          226 ns

Table 7.1. Results from simulation of the final design.

7.4 New requirements on the functional blocks

So far in this iterative process of increasing the performance of the router, the requirements on the functional blocks have not been discussed very much. The initial design and requirements on the functional blocks were defined with Gigabit Ethernet in mind. Now that we have a SoCBUS network that can handle speeds beyond 10Gbit/s it is time to look at the new requirements on the functional blocks. Because of the standards used on the Internet today, the line speed is set to 10Gbit/s. The properties of the IP blocks will now have to be changed to fit this new line speed. The new requirements in terms of execution time and/or bandwidth are described below.

7.4.1 Input and output packet processors (IPP/OPP)

The big difference for the input and output packet processors is the time constraint for processing a packet. The worst case packet rate using SONET OC-192 at 10Gbit/s is 25MPPS, using minimum size packets. This corresponds to a maximum execution time for this block of 40ns.
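
The time budget follows directly from the packet rate, as the small sketch below illustrates; the 25 MPPS worst case figure is the one stated above.

    # Per-packet processing budget for the IPP/OPP blocks.
    # 25 MPPS is the worst case packet rate stated above for SONET OC-192
    # at 10 Gbit/s with minimum size packets.
    WORST_CASE_PACKET_RATE = 25e6  # packets per second

    budget_ns = 1e9 / WORST_CASE_PACKET_RATE
    print("Maximum execution time per packet: %.0f ns" % budget_ns)  # 40 ns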

7.4.2 Packet buffer (PB)

The packet buffer is a critical part of the router. In the final design of the core router each packet buffer is responsible for two IPPs. The maximum bandwidth in the packet buffer will be 20Gbit/s in each direction.

7.4.3 Forwarding table (FT)

In the final design each forward table is responsible for four IPPs. Assuming the worst case consisting of only minimum size packets, the lookup rate at 10Gbit/s will be 100MLPS. With the current design the maximum speed using only minimum size packets is 2.6Gbit/s, which corresponds to a lookup rate of 26MLPS. In these calculations the routing table updates have not been taken into consideration.
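
The required lookup rate scales with the number of ports served by a forwarding table and with the fraction of the 10Gbit/s line speed that is actually sustained. The sketch below reproduces the numbers above under those assumptions; routing table updates are ignored, as in the text.

    # Required lookup rate per forwarding table, ignoring routing table updates.
    # Assumes four IPPs per forwarding table and the worst case rate of
    # 25 MPPS per port at the full 10 Gbit/s line speed (minimum size packets).
    PORTS_PER_FT = 4
    PACKET_RATE_AT_LINE_SPEED = 25e6  # packets/s per port at 10 Gbit/s
    LINE_SPEED_GBPS = 10.0

    def lookup_rate_mlps(per_port_gbps):
        per_port_rate = PACKET_RATE_AT_LINE_SPEED * per_port_gbps / LINE_SPEED_GBPS
        return PORTS_PER_FT * per_port_rate / 1e6

    print("%.1f MLPS at the full 10 Gbit/s line speed" % lookup_rate_mlps(10.0))   # 100.0
    print("%.1f MLPS at the current 2.6 Gbit/s maximum" % lookup_rate_mlps(2.6))   #  26.0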


References
