Design of a Gigabit Router Packet Buffer
using DDR SDRAM Memory

Master's thesis in Computer Engineering
(Examensarbete utfört i datorteknik)

by

Daniel Ferm

LITH-ISY-EX--06/3814--SE

Linköping 2006


Design of a Gigabit Router Packet Buffer
using DDR SDRAM Memory

Master's thesis in Computer Engineering at Linköping Institute of Technology
(Examensarbete utfört i datorteknik vid Linköpings tekniska högskola)

by

Daniel Ferm

LITH-ISY-EX--06/3814--SE

Supervisor: Andreas Ehliar
Examiner: Dake Liu


Division, Department: Institutionen för Systemteknik, 581 83 Linköping
Date: 1 March 2006
Language: English
Report category: Examensarbete (Master's thesis)
ISRN: LITH-ISY-EX--06/3814--SE
URL for electronic version: http://www.ep.liu.se/ (9th March 2006)
Title: Design of a Gigabit Router Packet Buffer using DDR SDRAM Memory
(Swedish title: Design av en Packetbuffer för en Gigabit Router användandes DDR Minne)
Author: Daniel Ferm


Abstract

The computer engineering department at Linköping University has a research project which investigates the use of an on-chip network in a router. There has been an implementation of it in an FPGA, and for this router there is a need for buffer memory. This thesis extends the router design with a DDR memory controller which uses the features provided by the Virtex-II FPGA family.

The thesis shows that by carefully scheduling the DDR SDRAM memory, high volume transfers are possible and the memory can be used quite efficiently despite its rather complex interface.

The DDR memory controller developed is part of a packet buffer module which is integrated and tested with a previous, slightly modified, FPGA based router design. The performance of this router is investigated using real network interfaces, and due to the poor network performance of desktop computers special hardware is developed for this purpose.


Abbreviations

AFIFO: Asynchronous FIFO

ARP: Address Resolution Protocol, protocol for IP to MAC address translation on Ethernet

ASIC: Application Specific Integrated Circuit

CAS: Column Address Strobe, control signal for SDRAM

CPU: Central Processing Unit, processor

CRC: Cyclic Redundancy Check, an algorithm used for error detection

CS: Chip Select, control signal for SDRAM

DCM: Digital Clock Manager, clock-manipulating hardware in the FPGA

DDR: Double Data Rate

DRAM: Dynamic RAM, a type of memory

FIFO: First In First Out, a type of buffer

FPGA: Field Programmable Gate Array, a type of programmable circuit

IOB: Input Output Block, part of the FPGA that handles communication outside of it

IP: Internet Protocol, network protocol used on the Internet

IPv4: Internet Protocol version 4

MAC address: Hardware address used in Ethernet

MTU: Maximum Transmission Unit, the largest packet a network can transfer

RAM: Random Access Memory

RAS: Row Address Strobe, control signal for SDRAM

SDRAM: Synchronous DRAM, clocked DRAM

WE: Write Enable, control signal for SDRAM


Contents

1 Introduction
  1.1 Background
  1.2 Objective
    1.2.1 Primary Requirements
    1.2.2 Secondary Requirements
  1.3 Reading Instructions
    1.3.1 Thesis Outline
  1.4 Method

2 Memories
  2.1 SRAM
  2.2 DRAM
    2.2.1 SDRAM
    2.2.2 DDR SDRAM

3 Virtex-II FPGAs
  3.1 DCMs
    3.1.1 Clock De-skew
    3.1.2 Variable Phase Shift
    3.1.3 Statically Phase Shifted Clock Outputs
    3.1.4 Frequency Altered Outputs
  3.2 DDR IOBs
  3.3 Global Clock Network
  3.4 Block RAMs
    3.4.1 Asynchronous FIFOs

4 Router Design
  4.1 Packet Path: Old Router
  4.2 Packet Path: New Router
  4.3 Router Blocks
    4.3.1 Input Module
    4.3.2 Output Module
    4.3.3 Routing Table
    4.3.4 Packet Buffer
    4.3.5 Socbus

5 Router Memory Usage
  5.1 Packet Identifiers
  5.2 Memory Storage Scheme
  5.3 Packet to Memory Mapping
  5.4 Packet Identifier Format

6 Packet Buffer Design
  6.1 DDR Controller Selection
    6.1.1 Available Controllers
    6.1.2 Custom Controller
  6.2 The Controller
    6.2.1 Startup Controller
    6.2.2 Primary Controller
    6.2.3 Secondary Controllers
  6.3 Memory Interface
  6.4 Router Interface
    6.4.1 Input Connections
    6.4.2 Output Connections
    6.4.3 Route Connection
  6.5 Buffer Memory
    6.5.1 Buffer Memory Usage

7 Verification, Testing and Debugging
  7.1 Simulation
  7.2 Logic Analyzer Hardware Debugging
  7.3 Packet Buffer Verification and Speed
    7.3.1 Packet Generators
    7.3.2 Route Generator
  7.4 Status Registers/Counters
    7.4.1 Serial Connection
  7.5 Testing in Real Network
    7.5.1 Dedicated Hardware Packet Generators

8 Results
  8.1 Controller Efficiency
    8.1.1 Different Buffer Sizes
    8.1.2 Final Packet Buffer Design
    8.1.3 Theoretical Ethernet Maximal Throughput
    8.1.4 Packet Buffer Limits
  8.2 Router Performance
    8.2.1 Manual Tests
    8.2.3 Automated Tests, Different Packet Sizes
  8.3 FPGA Utilization
  8.4 Requirements

9 Problems
  9.1 Meeting Controller Timing
  9.2 Synthesizer Bug
  9.3 Memory Read Timing
  9.4 Controller Buffer Usage

10 Conclusions and Further Work
  10.1 Controller Improvements
    10.1.1 Small Packets
    10.1.2 Large Packets
    10.1.3 Buffer Memory
    10.1.4 Controller Command FIFOs
  10.2 Router Improvements
    10.2.1 Socbus


List of Tables

5.1 Internet Mix
5.2 Relative Block Sizes
8.1 8k Buffers @115MHz with Internet Mix
8.2 16k Buffers @115MHz with Internet Mix
8.3 Final Design, 8k Buffers @115MHz with Internet Mix
8.4 Final Design, 8k Buffers @115MHz with 40 byte Packets
8.5 Final Design, 8k Buffers @115MHz with 1500 byte Packets
8.6 Packet Generator Measurements at Full Duplex
8.7 FPGA Utilization, Router
8.8 FPGA Utilization, Modules


List of Figures

2.1 Organization of SDRAM with 4 Banks
2.2 SDR SDRAM READ Timing Diagram
2.3 Strobe to Data Timing Relationship
2.4 DDR SDRAM READ Timing Diagram
3.1 DCM
3.2 DCM Outputs
3.3 DDR IOB
3.4 Asynchronous FIFO
4.1 Successful Socbus Connection
4.2 Old Router Socbus Network Configuration
4.3 New Router Socbus Network Configuration
5.1 Block Allocation per Row
5.2 Packet Identifier Format
6.1 Packet Buffer Overview
6.2 Controller
6.3 Socbus Connections Buffer Sharing
8.1 Loss for Packets in Range 48 to 288 bytes
8.2 Loss with Different Packet Sizes, 400-1500
8.3 Loss with Different Packet Sizes, 1500-1500


Chapter 1

Introduction

1.1 Background

The basis for this project is two other master's thesis projects that have been carried out at the department.

The first was a feasibility study of an Internet core router design using an on-chip network [4]. This design targeted an ASIC, which is a much higher performing circuit than what was actually used during this and the second project.

The second project was based on the feasibility study and implemented a gigabit Ethernet router [1]; now, however, an FPGA was used, as in this project. During the second project the number of ports on the router was severely limited (to 2 ports) due to the available hardware, and this restriction was also in effect during this project. The department is working on new expansion cards with 4 gigabit Ethernet interfaces, enabling a total of 8 ports to be connected to the router, but these were not finished at the time of writing.

With this much increased capacity comes a need for more buffer memory to store the packets while processing them in the router. The available memory in the FPGA is limited, so to facilitate the increased capacity it was decided to use the DDR memory available on the FPGA development boards. In order to use the DDR memory it needs a controller, and this is the idea behind this project: make a DDR memory controller and integrate it into a packet buffer of the FPGA router.

1.2 Objective

The project objective was to design a working DDR memory controller for the Avnet Kokerboom development board featuring a Xilinx Virtex-II XC2V4000 FPGA and to test it with the router.

A number of requirements were formed at the beginning of the project.


They were divided into two groups, primary and secondary requirements, where the primary were to be completed and the secondary only if time allowed.

1.2.1 Primary Requirements

• Implement a working DDR memory controller for the Kokerboom development board with a Xilinx Virtex-II XC2V4000 FPGA and a 128MB SODIMM memory based on the Micron MT46V8M16-75 chip.

• Design a packet buffer using the memory controller with socbus interface(s).

• The packet buffer should handle 4 gigabit in/out port pairs.

• Document how the hardware works.

1.2.2 Secondary Requirements

• Support for 8 gigabit in/out port pairs.

• Test the packet buffer in the previous router design with real network interfaces.

1.3 Reading Instructions

In order to understand the contents of this thesis a basic understanding of computer networks is required; in particular, a notion of what a router is and what it does is important.

Precise knowledge should not be required, and a short introduction to the subject can be found in [1]. That report can also be interesting in order to get some background and a better understanding of the router design.

1.3.1 Thesis Outline

Chapters 2 and 3 start off by describing the technology used in this project: the memories and the FPGA.

Chapter 4 goes on to describe the overall router design to put the packet buffer into context and give the reader an understanding of what tasks it should perform.

The core of the thesis lies in Chapters 5 and 6, where Chapter 5 deals with the packet buffer's use in the overall router design and introduces some design decisions regarding how the DDR controller will be used. Chapter 6 then goes into detail on the DDR controller/packet buffer design and shows how they are implemented.


Chapters 7 and 8 describe the methods and results of the evaluation of the packet buffer/router, and Chapter 9 deals with problems throughout the project.

Chapter 10 sums up the project and deals with what can be done in the future.

1.4 Method

The project started off with a collection of information on DDR SDRAM memories, router memory management and the FPGA in question. The details of how DDR SDRAM memories work were studied first, and later what the available FPGA could do to solve the different issues was investigated.

Using this information a first controller was constructed, which was later transformed into the final packet buffer design. This was integrated into the previous router design, which also needed modification due to changed behaviour between the new and old packet buffers.

Throughout the development process testing of the evolving design was performed and bugs were investigated and corrected as they were found.


Chapter 2

Memories

There are several different types of memories. This chapter describes two main types of memories, both of which are used in this project.

2.1 SRAM

SRAM stands for Static Random Access Memory. Static means the memory retains its contents as long as power is supplied.

SRAM is constructed from cells consisting of a number of transistors; each bit is stored in one of these SRAM cells. Using a memory based on this principle is easy: there is normally an address bus, a data bus and some control signals to control read/write operations.

The Virtex-II FPGA used in this project is an SRAM device; this means that all configuration information for the different parts of the FPGA is stored in SRAM cells. The FPGA also has built-in memories called block RAMs. These also work according to this principle and have the simple address-data-control-signal interface; for more detailed information on the block RAMs see Section 3.4.

2.2 DRAM

DRAM stands for Dynamic Random Access Memory and, unlike SRAM, it does not retain its data just because it has power; DRAM needs to be refreshed in order to keep its data. The reason for this lies in the way a DRAM cell is constructed. The SRAM cell has several transistors where the DRAM cell only has one. Instead of keeping its information in a looped-back transistor configuration, the DRAM cell uses a small capacitor to keep information. Basically, when the capacitor is charged the bit is a 1; when it is not, it is a 0.

This is where the refresh comes in: a capacitor leaks current, thereby draining it of energy. So in order to keep the information one must at regular intervals read out the information from the cell and then write it back in. This makes using DRAM a much more tedious task than SRAM, but in return DRAM is much cheaper.

2.2.1 SDRAM

SDRAM stands for Synchronous Dynamic Random Access Memory. The meaning of synchronous is that all accesses to the memory are clocked. This section describes what is sometimes called SDR SDRAM (Single Data Rate). SDR SDRAM memory is organized in a 3-dimensional array: there are rows, columns and banks. The banks are few, usually 2 or 4, while the numbers of rows and columns are much larger. Figure 2.1 shows the memory organization.

Figure 2.1: Organization of SDRAM with 4 Banks

During startup the memory needs to go through an initialization procedure in order to place it in a correct operating state. In this procedure all of the banks are put into an inactive state and the settings for the memory are configured. All of the memory settings are in the so-called Mode Register, and to change it a load mode register command is sent with the relevant settings on the address bus. One important setting is the burst length, which is the amount of data that a read/write operation works on (in multiples of the data width). For example, having a burst length of 4 means that a read will return data in four consecutive clock cycles. There are more settings, some of which are mentioned below.

Once the startup procedure has been executed the memory is ready for use. To access it the memory needs to be addressed with the 3 parts of the address: bank, row and column. During the access the bank used cannot be used for any other accesses.
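The bank/row/column split above can be sketched as a small address decomposition. This is a hypothetical illustration: the geometry (4 banks, 4096 rows, 512 columns) is assumed for the sake of the example and must in practice be taken from the memory data sheet.

```python
# Hypothetical SDRAM geometry, assumed for illustration only.
BANKS, ROWS, COLS = 4, 4096, 512

def split_address(addr):
    """Split a flat word address into the (bank, row, column) triple
    that an SDRAM controller presents to the memory."""
    col = addr % COLS
    row = (addr // COLS) % ROWS
    bank = (addr // (COLS * ROWS)) % BANKS
    return bank, row, col
```

With this particular mapping, consecutive addresses stay inside one row; real controllers often permute the bits so that consecutive bursts land in different banks, allowing accesses to overlap.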

The first step in an access is to activate the row of interest (in the bank); this is done by asserting the RAS (Row Address Strobe) signal while having the


row and bank addresses on the address bus of the memory. After the command has been sent it takes some time for it to complete, and in the meantime other operations can be performed on the memory (directed at other banks). When the row is being activated the memory contents of that row are read out so that they can be used.

With the row active, read and write operations can be performed on it. By asserting the CAS (Column Address Strobe) and/or WE (Write Enable) signals while having the column and bank addresses on the address bus, a read or write is initiated. During a write there is one additional signal involved in addition to the data lines: the data mask signal. This signal is used to indicate which parts of the data are valid; the invalid parts of the data are not written to the memory. A read has no use for this signal, as unwanted data is simply discarded by the controller, but there is another problem involved with reads. When reading, the requested data is not immediately transferred over the data lines; it takes some time for the correct data to be selected based on the column address. This extra time needed is called CAS- or read-latency, and it is a multiple of the clock cycle. Typically the CAS-latency for an SDR SDRAM is 2 or 3 clock cycles and it is specified during memory initialization. Figure 2.2 shows a read cycle with CAS-latency 2 and a burst length of 4.

Figure 2.2: SDR SDRAM READ Timing Diagram
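The timing relationship in the figure can be written down as a toy model: after a READ command is issued, the first data word appears CAS-latency cycles later, and the burst then delivers one word per cycle.

```python
def sdr_read_data_cycles(issue_cycle, cas_latency, burst_length):
    """Clock cycles (relative to the command clock) in which read data
    is valid for an SDR SDRAM read."""
    first = issue_cycle + cas_latency
    return list(range(first, first + burst_length))

# A read issued in cycle 0 with CAS-latency 2 and burst length 4
# delivers data in cycles 2, 3, 4 and 5, matching Figure 2.2.
```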

After an access is done another access can be performed in the same way; otherwise the row needs to be deactivated. Deactivating a row is done with a so-called precharge command¹, and by doing this the possibly changed data of the row is written back into the memory. Once this is done the bank is no longer active and can be used for new accesses.

As earlier mentioned, DRAM memories also need to be refreshed. This means that at regular intervals the memory needs to be issued a refresh command. Each row needs to be refreshed every 64ms, so on average refreshes can be no more than 64/rows ms apart². To give a refresh command all of the banks have to be in the precharged (inactive) state.

¹There is also the possibility to tell the memory to do an auto precharge when issuing a read or write command; by doing so the memory will, at the first possible time after the operation completes, initiate the precharge.
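The 64/rows figure works out as follows; the row counts below (4096, 8192) are assumed for illustration, since the actual count comes from the chip geometry in the data sheet.

```python
def avg_refresh_interval_us(rows, retention_ms=64.0):
    """Average time budget between refresh commands, in microseconds,
    given that every row must be refreshed within retention_ms."""
    return retention_ms * 1000.0 / rows

# 4096 rows -> one refresh roughly every 15.6 us on average;
# 8192 rows -> roughly every 7.8 us.
```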

2.2.2 DDR SDRAM

When moving from SDRAM to DDR SDRAM (Double Data Rate Synchronous Dynamic Random Access Memory) some things changed in order to increase speed. This section describes these differences.

First and foremost, data is transferred on both clock edges in DDR, hence the name.

In order to handle this increased speed other changes had to be made. When clocking data on both clock edges a good clock signal is essential. To improve the accuracy of the clock it is differential in the DDR standard. A differential clock has two lines, one like the normal clock and one that is the inverse of the first. The clock edges are then defined as the times when the signals cross; if the first changes from high to low (and the second from low to high) then it is defined as the falling edge of the differential clock, and vice versa.

Another of the changes is the addition of data strobe lines. Data strobe lines work like a separate clock for the data, and there are several of them, one per byte (8-bit) or nibble (4-bit) depending on the total data width of the memory. The strobes are, like the data lines, tristate signals and they are driven by either the memory controller or the memory itself depending on whether the operation is a read or a write.

When writing, the strobe (and data) lines are driven by the controller and the strobe edge should be aligned with the center of the data. The reason for this is that the memory should be able to use the strobe as a clock signal and therefore it should have an edge when the data is stable; this is shown in Figure 2.3(a).

During a read it is the other way around, except for the strobe alignment. It is no longer aligned with the center of the data; instead it changes in sync with the data, see Figure 2.3(b). The reason for this difference is that a delay circuit is complex, and instead of having one in each memory it only needs to be implemented once in the controller.

As can be seen in the figures, the strobe signal is also a DDR signal (it changes just as often as the data). In addition to these signals there is one more signal that is DDR: the data mask signal. It works in exactly the same way as it does for SDR SDRAM, but since data comes on both clock edges the data mask must also do so.

These are the basic changes (differential clock, DDR data, strobe and data mask) made to the signaling interface of the DDR memory compared to SDR SDRAM. In addition to the signaling changes there are also some new/changed settings. The burst length for DDR memories has to be a multiple of 2, since every clock cycle can deliver two chunks of data; typical values are 2, 4 and 8.

Figure 2.3: Strobe to Data Timing Relationship

Also, the CAS-latency no longer needs to be a multiple of the clock cycle; half clock cycles are also available, which means that the first data can arrive in between commands. Typical CAS-latencies for DDR memories are 2, 2.5 and 3. Figure 2.4 shows a read cycle with CAS-latency 2.5 and a burst length of 4; compare with an SDR burst of length 4 (Figure 2.2).

Figure 2.4: DDR SDRAM READ Timing Diagram
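The DDR counterpart of the earlier SDR timing sketch counts time in half clock cycles: the CAS-latency may be a half-cycle multiple (2.5 in the figure) and the burst delivers one chunk of data per half cycle.

```python
def ddr_read_data_half_cycles(issue_cycle, cas_latency, burst_length):
    """Half-cycles in which read data is valid for a DDR read, with the
    command issued on a full clock cycle boundary."""
    first = int(2 * (issue_cycle + cas_latency))
    return list(range(first, first + burst_length))

# A read issued in cycle 0 with CAS-latency 2.5 and burst length 4
# delivers data in half-cycles 5 through 8, i.e. starting between
# command cycles 2 and 3, as in Figure 2.4.
```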

In addition to the already mentioned changes there is one more change compared to SDR: the startup procedure. The DDR SDRAM's startup procedure is more complex than for an SDR SDRAM. The procedure however follows the same basic principle as described in Section 2.2.1, but some more steps are required for correct initialisation. For an exact description of the requirements see the DDR SDRAM standard [3].


Chapter 3

Virtex-II FPGAs

This chapter describes some of the important features of Virtex-II FPGAs which are used in the design.

3.1 DCMs

When creating high speed designs it is important to be able to handle clock signals in different ways; this is what Digital Clock Managers are for. The Virtex-II DCMs can do a multitude of things to clock signals; changing their frequency and delaying them are the most important for this project.

This section covers the features of the DCM used in this project, for a more detailed and complete description see the Virtex-II User Guide [8].

Figure 3.1 shows all the signals a DCM has, and as can be seen there are lots of them. Still, almost all of the signals (all except CLK2X180, CLKFX180 and STATUS) have been used in the project.

Figure 3.1: DCM (inputs: CLKIN, CLKFB, RST, PSINCDEC, PSEN, PSCLK; outputs: CLK0, CLK90, CLK180, CLK270, CLK2X, CLK2X180, CLKDV, CLKFX, CLKFX180, LOCKED, STATUS, PSDONE)



3.1.1 Clock De-skew

In normal operation the CLK0 output is the same as the CLKIN input clock. When the output clocks are used in the design there is a certain propagation delay from the DCM. To compensate for this, the CLK0 (or CLK2X) signal should be connected to the feedback input of the DCM, CLKFB. By doing so the CLKIN signal can be internally delayed until the rising edges of the CLKIN and CLKFB signals match. When this happens those clock signals are 360° out of phase with each other, which means they are in phase. By having the CLKIN and CLKFB signals in phase, the propagation delay for the clock signals is effectively zero¹.

When the clocks go in phase with each other the LOCKED signal of the DCM is asserted to signal that the design can start working; before this happens the output clocks should not be used, as they are not stable.

3.1.2 Variable Phase Shift

Most features of a DCM are decided at design time through attributes in the HDL code and cannot be changed at runtime. The only thing that can be changed at runtime is the phase shift for all clock outputs (provided the correct attributes are applied). This is what the PS* signals are for. By phase shifting a DCM, the delay circuit used to align CLKIN and CLKFB is modified to insert a constant delay, the phase shift. Thereby all DCM clock outputs are delayed, and by using the PS* signals the delay can be changed, effectively allowing variable phase shifting.

The phase shift interface is very simple; it runs in its own clock domain with PSCLK as clock signal. If during one cycle the PSEN signal is active, the DCM checks the PSINCDEC signal to see if an increase or decrease in phase shift is wanted and then performs the change in delay. Every change in delay is 1/256 of the clock cycle time. Once the delay change has happened, PSDONE is asserted for one cycle to inform the design that another phase shift change can be performed. The reason changing the delay takes some time is that it is changed slowly so that the DCM lock is not lost (the LOCKED signal never needs to change).

Depending on the frequency, the DCM might not be able to achieve exactly the delay requested, because the delay is achieved by including a discrete number of delay elements in a delay line for the CLKIN signal. In this case, however, the DCM selects the closest number of delay elements.

3.1.3 Statically Phase Shifted Clock Outputs

Other than the CLK0 signal there are also a number of clock outputs that are always phase shifted in relation to it. CLK90, CLK180 and CLK270 all run at the same frequency as CLK0 but are shifted 1/4, 1/2 and 3/4 of the clock cycle time, respectively. There are also two clock outputs running at twice the speed of CLKIN: CLK2X and CLK2X180. Figure 3.2 shows how the different clocks relate to each other.

Figure 3.2: DCM Outputs

¹this is zero propagation delay to the global clock network, see Section 3.3, from there

3.1.4 Frequency Altered Outputs

The remaining three clock outputs of the DCM, CLKDV, CLKFX and CLKFX180, can output clocks which are of a different frequency than CLKIN.

CLKDV is the simplest of the three and only produces frequencies lower than CLKIN. By setting a DCM attribute the divider can be chosen; the only limitation is that the divider has to be from a predefined set of values².

The CLKFX output is more flexible; its output has a frequency of (M/D)×CLKIN. M is the multiplier and can have any integer value in the range 2-32; likewise D is the divider, also an integer, but the allowed range is 1-32.

CLKFX180, like the other clock outputs named 180, has the same output as CLKFX, only phase shifted 180° in relation to it.

There are limitations on how fast clocks the CLKFX outputs can produce³. What the limitation is however depends on the FPGA model and what speed grade it is; for the FPGA used in this project CLKFX can output a maximal frequency of 210MHz, which is more than enough for the design. All limitations are available in the Virtex-II Complete Data Sheet [7].

²1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5, 5.5, 6, 6.5, 7, 7.5, 8, 9, 10, 11, 12, 13, 14, 15 and 16

³there are limitations on the other output clocks too, but CLKFX is the most restrictive
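Choosing CLKFX settings amounts to searching the M and D ranges above for the pair that best approximates a wanted output frequency. A minimal sketch, using the 2-32 / 1-32 ranges and the 210MHz ceiling quoted above:

```python
def best_clkfx(clkin_mhz, target_mhz, fmax_mhz=210.0):
    """Brute-force the CLKFX multiplier/divider pair (M, D) whose output
    frequency (M/D) * CLKIN comes closest to target_mhz without
    exceeding the output frequency ceiling."""
    best = None
    for m in range(2, 33):          # multiplier range 2-32
        for d in range(1, 33):      # divider range 1-32
            fout = clkin_mhz * m / d
            if fout > fmax_mhz:
                continue
            if best is None or abs(fout - target_mhz) < abs(best[2] - target_mhz):
                best = (m, d, fout)
    return best

# e.g. best_clkfx(50, 115) finds M=23, D=10 for an exact 115 MHz output.
```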



3.2 DDR IOBs

One of the points of double data rate signals is to increase the data transferred per wire without increasing the clock frequency. This means for example that the FPGA internally can work at 100MHz and still have data transfers at 200MHz. But in order to work with an external unit using DDR (for example a memory), the FPGA needs a facility to interface with DDR signals; this is what DDR IOBs (Double Data Rate Input Output Blocks) are for.
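The doubling can be made concrete as a peak throughput figure: two transfers per clock cycle across the bus width. The 100MHz / 64-bit numbers below are illustrative, not the thesis design parameters.

```python
def ddr_peak_mbytes_per_s(clock_mhz, bus_width_bits):
    """Peak throughput in MB/s for a DDR bus: two transfers per cycle."""
    return 2 * clock_mhz * bus_width_bits / 8

# A 64-bit DDR bus clocked at 100 MHz peaks at 1600 MB/s.
```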

The element that makes all this possible is the DDR mux. It is driven by two flip-flops running on clocks phase shifted 180° from each other, and on each rising clock edge the DDR mux output changes.

A single DDR IOB contains two DDR mux constructions, one for data and one for tristate control. All four of these flip-flops have to be driven by the same two clocks because of how the FPGA is constructed internally. In addition to this there are also two input flip-flops, otherwise it wouldn't be much of an IOB (Input Output Block). These flip-flops also have to be driven by clocks phase shifted 180° from each other; however, they do not have to be the same clocks as the other four flip-flops use.

Figure 3.3 shows the complete IOB with flip-flops, DDR muxes and tristate buffer. The signals in the figure are the four clocks, the in and out data and the output enable (tristate) control.

Figure 3.3: DDR IOB



3.3 Global Clock Network

In Virtex-II devices there is something called the global clock network. This is a number of dedicated clock lines that distribute clock signals to different parts of the design. The clock network is designed in such a way as to minimize clock skew, which is the difference in arrival time of a clock signal to different parts of the system. In order to achieve this low skew, all clock signals on the global clock network are routed into the center of the FPGA, and from there they are distributed. This serves to make the distance each clock signal travels before reaching its destinations as equal as possible.

In addition to the low skew property of the global clock network, it also contains clock buffers. These, as the name implies, buffer the clock signals. This is done in order to make them strong enough to drive a large number of synchronous elements.

These are both important features of the global clock network; however, the important thing about it, in the context of this project, is that the number of clock lines is limited. This means care must be taken in order to make sure the design fits. If the number of clock signals exceeds the available clock lines, some of them will have to be distributed through general routing, where delay and thereby skew can be significant. The number of clock lines available in the FPGA in question, the Xilinx Virtex-II XC2V4000, is 16 in total with a maximum of 8 of those mapped into each quadrant⁴.

3.4 Block RAMs

As mentioned in Chapter 2, block RAMs work like SRAM memories. However, they are a little more complex: block RAMs are synchronous dual port devices, meaning they can perform two operations at the same time and all operations are clocked.

Each of the two ports has a clock, enable, write enable, address, data-in and data-out port. The address, data-in and data-out ports are busses and their widths depend on how the block RAM is configured. A block RAM has an 18kbit memory array which can be configured for different depth/width ratios. 2kbit of the memory is however only available when using the wider configurations. The different data widths available are 1, 2, 4, (8+1), (16+2), (32+4), and the resulting depths are as needed to access the full memory array. The two ports can also be configured independently, giving access to the memory in two different ways simultaneously.
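The depth/width trade-off above can be tabulated: the 16kbit of addressable data bits are divided by the data portion of the chosen width (the +1, +2, +4 parts are the parity bits only available in the wider modes). A sketch:

```python
DATA_BITS = 16 * 1024   # 16 kbit of data; the extra 2 kbit is parity

# Map total port width (data + parity) to its data-only width.
DATA_WIDTH = {1: 1, 2: 2, 4: 4, 9: 8, 18: 16, 36: 32}

def depth_for_width(total_width):
    """Depth of one block RAM port configured for the given total width."""
    return DATA_BITS // DATA_WIDTH[total_width]

# e.g. the x36 mode gives 512 words of 36 bits; the x1 mode 16384 x 1.
```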

One important feature of the block RAMs is that the two ports can work completely independently of each other. This is almost true: some care must be taken so that a write on a specific address does not conflict with another operation (using the same address) on the other port.

⁴The Virtex-II FPGAs are divided into 4 different quadrants: North-East, North-West, South-East and South-West.

The ports being completely independent means that even the clocks need no relationship to each other; they can be asynchronous, which enables some important designs to be built, namely the asynchronous FIFO (First In First Out).

3.4.1 Asynchronous FIFOs

The asynchronous FIFO is a construction which allows data to be transferred between clock domains while at the same time providing some buffer capacity. Data entered at the write side of the FIFO will with some delay end up on the read side, and all data will be read in the order in which they were written⁵, assuming no data was lost due to lack of buffer space. Figure 3.4 shows the basic interface of an asynchronous FIFO.

[Figure: write side with wr_clk, wr_en and data in; read side with rd_clk, rd_en and data out.]

Figure 3.4: Asynchronous FIFO

Asynchronous FIFOs are used in a lot of different places in this project because there are a number of different clock domains that need to communicate. There are also synchronous FIFOs, which are like the asynchronous FIFOs except that the read and write sides both reside in the same clock domain. This type of FIFO is also used in this project where there is a need to buffer data in a FIFO fashion.
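A common way to build such a FIFO (used by many implementations, though not necessarily by the Xilinx primitives) is to pass the read and write pointers between the clock domains in Gray code, so that only one bit changes per increment and a synchronizer can never sample a wildly wrong pointer value. A small sketch of the conversions involved:

```python
def bin_to_gray(b):
    """Binary to Gray code: adjacent values differ in exactly one bit."""
    return b ^ (b >> 1)

def gray_to_bin(g):
    """Inverse conversion: XOR of all right-shifted copies of g."""
    b = 0
    while g:
        b ^= g
        g >>= 1
    return b

def hamming(a, b):
    """Number of differing bits between two words."""
    return bin(a ^ b).count("1")
```

Because `hamming(bin_to_gray(i), bin_to_gray(i + 1))` is always 1, a pointer synchronized mid-transition is either the old or the new value, never garbage; this is what makes the full/empty comparison safe across clock domains.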


Chapter 4

Router Design

The router consists of a number of different blocks. These blocks are briefly described in Section 4.3.

All of the blocks existed in the old router but only one remains relatively intact.

In order to better understand the differences between the old and the new router designs, the following two sections describe what happens to a packet on its way through the router.

4.1 Packet Path: Old Router

When a packet arrives at one of the in-ports it is read in and classified by the input module. The input module then sends the packet onwards to the packet buffer.

In the packet buffer the destination IP address is extracted from the packet and sent to the routing table.

The routing table looks up which destination (output port and destination MAC address) the packet should go to and the information is again transferred to the packet buffer.

Here the destination is translated into source and destination MAC addresses and the packet is forwarded to the correct output module, which in turn outputs it onto the network.

4.2 Packet Path: New Router

When a packet arrives at one of the in-ports it is read in and classified by the input module. The input module drops unwanted (non-IPv4) packets. If the packet was indeed an IPv4 packet it is sent to the packet buffer, and while doing so the input module extracts the destination IP address. Once the packet has been transferred the input module sends the destination IP address to the routing table.


The packet buffer stores the packet in memory and awaits the route lookup from the routing table.

Once the routing table finds the correct destination it forwards the information to the packet buffer.

Here the destination is again translated into a destination MAC address (but not a source address) and the packet is forwarded to the correct output module, which outputs it onto the network.

4.3 Router Blocks

4.3.1 Input Module

The input module for this project is a slightly modified version of the input module from the previous router. The changes made were to move parts of the functionality to a more suitable place in the router. In the old router the destination IP address was extracted in the packet buffer and from there sent to the routing table. The new design moves this functionality to the input module so that the packet buffer and routing table can process requests for the same packet in parallel.

The new input module was also modified so that it drops non-IPv4 packets, something the old version did not do. The reason for this is that the router as a whole only handles IPv4 traffic, and allowing anything else to enter the router would result in incorrect behaviour: the packets would be routed as if they were IP packets even though they are not.

4.3.2 Output Module

The output module handles outputting packets onto the network. The old output module wanted full Ethernet packets, requiring the packet buffer to send both sender and receiver MAC addresses to the output module. In the redesign it was decided it would be better to add the sender MAC address in the output module instead, since it will always be the same.

Because of this change, and the fact that interfacing with the Ethernet physical interface is a relatively simple task, it was decided to rewrite the entire module. In this process it was discovered that the old module did not correctly handle packets smaller than the minimum Ethernet packet length, something that had to be fixed.

4.3.3 Routing Table

In the old router the routing table was just a small lookup table mapping a few IP addresses to different output ports. The original idea was for it to use a special pipelined routing table provided by the department.

The lookup table accepted a route lookup request on its socbus connection and then sent the reply to the packet buffer, again over socbus. The problem with this is that every lookup results in a new socbus connection, which incurs a large overhead.

To handle this problem the new routing table can send the results of several route lookups to the packet buffer in the same socbus transmission, if several are available. Also, the department's routing table was integrated into the routing table module.

4.3.4 Packet Buer

Clearly this part of the router is where the most change was made, since it now includes a DDR controller. The old packet buffer also performed some tasks that were moved elsewhere in the router design, i.e. extracting the destination IP address (moved to the input module) and adding the source Ethernet address (moved to the output module).

For a more thorough description of the packet buffer see Chapter 6. What has not changed is that the packet buffer still uses two socbus connections, one for packet data and one for route information¹. In the old router the route interface was used both for sending lookup requests and for receiving the destinations; the new router only uses it for the latter.

4.3.5 Socbus

All of the different blocks are connected with an on-chip network called socbus [5]. A socbus network consists of socbus routers and connections between them. To each socbus router a block can also be connected; in the case of the router the different modules are connected to the socbus routers.

When modules want to communicate a packet-connected circuit is established. The circuit is unidirectional and is set up by the sender when it makes a connection request to the receiver. The request is routed through the socbus network and the resources needed along the path are reserved. If the receiver or any of the routers along the path cannot accept the connection (because a link is already in use) the connection attempt has failed and the sender will have to try again later.

Each socbus link² is bidirectional in the form of two unidirectional links. This allows a module to both send and receive data at the same time. All the unidirectional links consist of a number of control and data signals. There are a total of four control signals: strobe, qual, ack and cancel. Two of these, strobe and qual, are used in the forward direction (from sender to receiver), while the other two are used in the reverse direction (from receiver to sender).

The strobe signal is used to control connections: a transition from low to high signals a connection request while a high to low transition signals a connection tear-down. The two reverse direction signals are used to accept/reject connections; as might be guessed from the names, ack is used to accept and cancel to reject. The last control signal, qual, is used to indicate valid data once the circuit is up and running. This means that pauses in data transfers are possible; the feature is however not used in this design.

¹The packet buffer described later actually has several more connections for data, but with the limited number of in/output modules one is enough.

Figure 4.1 shows the life of a successful socbus connection; it also shows a part of the connection setup which hasn't been discussed yet. In order for the socbus network to know where to route a request the target address is needed. This is transferred over the data wires during the connection setup and the data sent is called req0 and req1. The address is contained in req0 along with some information on the type of connection needed; here however only one connection type is implemented and used. The second request word, req1, contains unspecified data and can be used for whatever purpose the application deems useful³.

[Figure: waveform of clk, strobe, qual, data (req0, req1, then data words), cancel and ack.]

Figure 4.1: Successful Socbus Connection

In both the old and new routers the modules are connected in a 3 × 3 grid; however, the placement of them differs. This is because of the changed behaviour of the input module. Since it now communicates with both the routing table and the packet buffer a different configuration was preferred. The old and new socbus network configurations can be seen in Figures 4.2 and 4.3 respectively.

Another difference between the two socbus configurations is the width of the data bus: in the old it was 36 bits wide, which was used in some special cases. The new design however does not use these extra bits, so the bus width was decreased to 32 bits, the width at which packet data is transferred.

³It is used in this project to send packet identifiers; what those are will be presented later.


[Figures: 3 × 3 socbus grids with the input modules (IN), output modules (OUT), packet buffer (PB) and routing table (RT) attached to different socbus routers.]

Figure 4.2: Old Router Socbus Network Configuration

Figure 4.3: New Router Socbus Network Configuration

Chapter 5

Router Memory Usage

Before designing the packet buffer and memory controller some idea of how it will be used is required. Just having a memory controller is not enough to get good performance; knowledge about the access patterns used gives vital information for designing a good controller for the task. This chapter describes how the packet buffer memory will be used and in what way the router will keep track of the packets while in the router.

5.1 Packet Identiers

In the feasibility study the designed router marked all packets with a 32-bit identifier upon entering the input modules. These identifiers would then be transferred with the packet to the different places in the router: the packet buffer and routing table.

In the previous implementation this feature was removed but here it is reintroduced. The feasibility study however did not specify a good way to generate the identifiers, so this needs to be taken care of.

The requirement on the identifiers is that they should identify the packet uniquely and thereby give all the information needed about the packet. So what extra information is needed? In the router the packet needs to be allocated some memory space in the packet buffer memory, and it has a size. Sometime during its passage through the router it also needs to get some kind of destination. Also, the input module might want to classify the packet, so some information on what type of packet it is would also be interesting.

When the identifier is given not all of this information is available: the size and classification are, the destination however is not. The location in the packet buffer memory is the most complex part of the information. Some kind of storage scheme is needed; see Section 5.2 for information on this.

Regardless of the storage scheme there still has to be a mapping between storage space and packets, and this can be decided upon at any time before the packet is written into the memory, even as early as when the packet enters the router and the identifier is given. This has the advantage that all the information about the packet, except the destination, can be associated with the identifier right away, and the packet buffer and routing table (which work in parallel on the packet) can have the same information associated with the identifier. This in turn means the associated information can be encoded into the identifier right from the start, requiring only the destination to be added later on.

5.2 Memory Storage Scheme

The choice of storage scheme will depend on a number of factors such as the size of the available memory, the book-keeping needed and the type of access pattern it would incur. A number of different schemes have been used in routers and some important types are [2]:

1. Fixed size blocks. The approach is very simple: the memory is divided into blocks large enough to fit any packet. A variation on this is to divide the memory into fixed sized blocks but allow a number of different block sizes. This gives the advantage that less memory will be wasted on small packets because a smaller block size can be used for those packets. Either way the advantage is that packets will be stored in a contiguous address space and book-keeping can be kept minimal because of the simple scheme.

2. Variable sized blocks. This is a complex solution; start and end addresses (or length) of each block need to be stored. Clearly very high utilization can be achieved because no memory at all needs to go to waste, but it has a potential for fragmentation, which could be very hard to handle. Still, the memory would be allocated in a contiguous fashion, which is good, while book-keeping becomes significantly more complex.

3. Linked list of small blocks. The last version is a combination of the good parts of the previous two. The fixed sized blocks enable the simpler version of book-keeping while the small blocks allow for a higher memory utilization. The packets would however not be stored in a contiguous fashion, and information about which block is next would need to be stored.

Choosing between the three is simple. DDR SDRAMs are good at pushing lots of data when it is stored contiguously in the memory. The third version could have this advantage too with a good enough controller and allocation, however the increased complexity is not worth it compared to the alternative. In addition, both the variable and linked list versions are methods which increase memory utilization, and they are good when memory size is an issue.


Here however there is 128MB of memory available, so why not use this abundance and get a simpler usage, and thereby a simpler allocation scheme and controller.

This leaves the question of one or several different block sizes. The simple solution is to use one single block size. Since the router only has Ethernet connections and the Ethernet MTU (Maximum Transmission Unit) is 1500 bytes¹, this, or rather 1536 bytes (1536 will be explained below), would be the size to select.

However, it is clear that a higher memory utilization can be reached with several block sizes, and in fact the packet sizes in an Internet core router application are rather well suited for this type of memory division. The so-called Internet mix is important here: it is an observed packet size distribution from a real core network which has become more or less a standard in benchmarking of Internet core applications [4]. Table 5.1 shows the packet size distribution of the Internet mix; it also shows how much of all data can be expected to come in a specific packet size.

Table 5.1: Internet Mix

Packet Size   Probability   Relative Size
40 bytes      56%           5%
1500 bytes    23%           74%
576 bytes     17%           21%
52 bytes      5%            <1%

Given the Internet mix, a division into the 3 different block sizes 64, 576 and 1536 bytes covers the packet sizes quite well. All these sizes are divisible by 64, which will be explained below. The allocation of blocks in the memory should be similar to the relative size column in Table 5.1, and a simple way to do that is to simply allocate the first 5-6% of the memory to 64 byte blocks, and so on. However there is another way to allocate memory which better suits the memory in question, thereby resulting in a simpler controller.
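Choosing among the three block sizes for an incoming packet is then a simple comparison; a minimal sketch (the helper name is ours):

```python
BLOCK_SIZES = (64, 576, 1536)  # bytes, the three sizes chosen above

def block_size_for(packet_len):
    """Return the smallest of the three block sizes that fits the packet."""
    for size in BLOCK_SIZES:
        if packet_len <= size:
            return size
    raise ValueError("packet exceeds the supported Ethernet MTU")
```

The Internet-mix sizes map as expected: 40 and 52 byte packets land in 64 byte blocks, 576 byte packets fit a 576 byte block exactly, and 1500 byte packets use a 1536 byte block.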

5.3 Packet to Memory Mapping

In both schemes above blocks of some size would be used, and it is desirable to have these blocks allocated to the memory in such a way as to allow rapid reading and writing of data. Regardless of the block size it is also good if as few memory resources as possible are involved in a read/write. Other than the actual memory used (which has already been considered above) the DDR SDRAM has a few other resources: the data bus, the command bus (control and address lines) and the banks.

¹Ethernet jumbo frames of 9000 bytes will not be considered, for reasons explained in Section 5.3.

The data bus usage could be optimized by only transferring exactly as much data as needed and then breaking off the read/write. This is called a burst terminate, and the only place where it could be used is on the last read/write operation for a packet, since this is the only place where there could be a mismatch between the burst length and the packet size. This method will not be considered in this project as the burst terminate command disallows using read/write commands with auto-precharge.

To use less of the command bus would mean using fewer commands per read/write; this is hard to do and, as will be seen in Chapter 6, it is also unnecessary.

This just leaves the memory banks to optimize on. There are a number of them and a packet could be spread across all of them, but it could just as well be contained within just one bank, which would be the resource optimal solution.

The 128MB SODIMM memory used in this project is a double-sided memory and has a total of eight Micron MT46V8M16-75 chips, four on each side. Double-sided means that it is actually two 64MB memories that share the same address, data and command bus, all except for the chip select signal. This means that a command can be sent to either side of the memory (or to both, if both should get the same command) as long as only one uses the data bus; in other words, commands other than read and write commands can work in parallel.

Each of the four chips on a side has a data width of 16 bits, giving a total of 64 bits = 8 bytes that can be transferred at a time. With the maximal DDR SDRAM burst length of 8 this gives a total of 64 bytes transferred per read/write command. This is the reason why the block sizes should be divisible by 64. The reason for using the maximal burst length will become apparent in Chapter 6.

A burst must be contained within the same bank, and since a bank can only be activated with one row at a time this implies a burst must be contained within a single row. In fact a DDR SDRAM burst is contained within a block the size of the burst; changing the lower bits in the address only affects the order, the same data is still used. What this means is that blocks should be allocated so that they are contained within a row, and they need to start on an address that is an even multiple of 64.

The row size in the memory used is 4kB. This makes jumbo frames impractical to handle since they would have to extend over three different rows, and therefore it was decided to skip such support. With just one block size (of 1536 bytes = 1.5kB) two such blocks fit in a row, leaving 1kB of memory unused. Since the rest of the memory cannot be used for any new block the blocks might as well be made 2kB in size.

If however the three block sizes (64, 576 and 1536) are used a more efficient allocation can be made. Based on the relative size numbers of Table 5.1 it can be seen that about 75% of the memory should go to 1536 byte blocks; 75% of 4kB is 3kB, which fits nicely with two 1536 byte blocks. This leaves 1kB for the other two block sizes, and since only one 576 byte block can fit into this space there can naturally only be one, leaving the remaining 448 bytes for seven 64 byte blocks. Figure 5.1(a) shows the resulting block allocation for a single block size and Figure 5.1(b) shows it for the 64, 576, 1536 allocation.

[Figure: (a) one 4kB row holding two 2048 byte blocks; (b) one 4kB row holding two 1536 byte blocks, one 576 byte block and seven 64 byte blocks.]

Figure 5.1: Block Allocation per Row
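A quick arithmetic check of allocation (b) confirms that the row is filled exactly and that every block starts on a 64-byte boundary, as the burst alignment requires:

```python
ROW_BYTES = 4 * 1024

# Allocation (b): two 1536 byte blocks, one 576 byte block and
# seven 64 byte blocks per 4kB row.
layout = [1536, 1536, 576] + [64] * 7

assert sum(layout) == ROW_BYTES          # the row is filled exactly
assert all(s % 64 == 0 for s in layout)  # every block is a whole number of bursts

# Start address of each block within the row (cumulative offsets):
starts = [sum(layout[:i]) for i in range(len(layout))]
assert all(a % 64 == 0 for a in starts)  # every block is 64-byte aligned
```

The 576 byte block starting at offset 3072 and the seven small blocks from 3648 onwards all land on multiples of 64, so every block can be covered by back-to-back 64 byte bursts.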

Table 5.2 shows the relative sizes for the second allocation, and compared to Table 5.1 it can be seen that there is too little space for 576 byte blocks and about twice as much for 64 byte blocks. However the resulting allocation is close enough and gives a single allocation that can be used across all rows.

Table 5.2: Relative Block Sizes

Block Size    Relative Size
64 bytes      11%
576 bytes     14%
1536 bytes    75%

The only issue not addressed is how to actually map the packets to these blocks. As indicated in Section 5.1 this can be done already in the input module by letting each input module manage its own part of the memory. In the requirements for the project a maximum of eight gigabit connections would be connected to the router, resulting in eight input modules. As it turns out each side of the memory has four banks, and with two sides this gives a total of eight banks. Allocating each input module to its own bank has the advantage that one input module cannot block another by using the same bank all the time. Also, since the memory transfer speed is much higher than that of a single gigabit port (it has to be, to be able to handle 4, or even 8 ports) it is very likely the bank will be available again when the next packet from the same port arrives for reading/writing.

What all this boils down to is that each input port is allocated 1/8 of the memory, 16MB. When a new packet arrives it is given the next available block of the correct size (if only one block size exists the choice is simple). To minimize book-keeping a very simple way to allocate blocks is to have a counter per block size; when space for a packet is allocated the counter is increased by one.
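The counter scheme can be sketched as follows for the single-block-size (2kB) case. The class and names are illustrative, and the wrap-around reuse of blocks is a simplification: the real design must ensure a block is free again before its address comes around:

```python
class PortAllocator:
    """Per-input-port block allocator: one counter per block size,
    wrapping within the port's 16MB share of the memory (a sketch of
    the counter scheme described above)."""
    PORT_MEM = 16 * 1024 * 1024        # 1/8 of the 128MB memory
    BLOCK = 2048                       # single-block-size case
    NUM_BLOCKS = PORT_MEM // BLOCK     # 8192 blocks per port

    def __init__(self, port):
        self.base = port * self.PORT_MEM   # each port owns its own bank/region
        self.counter = 0

    def allocate(self):
        """Return the byte address of the next block for this port."""
        addr = self.base + self.counter * self.BLOCK
        self.counter = (self.counter + 1) % self.NUM_BLOCKS  # wrap around
        return addr
```

With three block sizes the same idea applies, just with one counter per size and the address computed from the row layout of Figure 5.1(b).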

5.4 Packet Identier Format

As discussed in Section 5.1 the packet identifier can have almost all the information needed about the packet encoded into it. Here the exact format for the identifier will be given.

The three parts of the identifier (packet size, packet type and storage location) are all available when the identifier is given.

A packet can be up to 1500 bytes long, which means ⌈log₂ 1500⌉ = 11 bits (range 0 to 2047) are needed to store the size of the packet with byte precision.

Here a little trick can be used which simplifies some hardware later on. By storing the packet size minus one it is simple to determine how many blocks of size 2ⁿ would be needed to store the packet. For example, with n = 3 a packet of size 107 bytes (107 − 1 = 106 = 1101010₂) can be stored in 1101₂ = 13 + 1 = 14 blocks of size 2³ = 8 bytes (14 × 8 = 112 ≥ 107). The n lower order bits in the length-minus-one value are simply discarded to get the number of blocks (minus one). This helps the controller when it needs to determine how many 64 byte blocks are needed when storing the packet in the DDR SDRAM.

The packet type could potentially need lots of bits, but to make things simple, allocating whatever is left over after the other two get their share of the 32 bit identifier, and hoping it is enough, will do for now².

This leaves the storage location. Its size will of course depend on the number of block sizes that are used, but it must at least include 3 bits to differentiate between the 8 input modules.

For the simple case of 2kB blocks there is a total of 16MB / 4kB × 2 = 8192 blocks available for allocation. This requires ⌈log₂ 8192⌉ = 13 bits and leaves 5 bits for the packet type. Figure 5.2(a) shows the packet identifier for this case.

With the three different block sizes the bit allocation becomes a little more complex. Now the packet size can be stored in different sized fields depending on the block size, which enables some compression to be made. Doing too much compression however will make the hardware more complex and that can be a problem³. To differentiate between the block sizes 2 bits are required, 3 bits are still needed to differentiate the input modules and, at least for the 1536 byte blocks, 11 bits are needed for the packet length. The number of rows available to an input module is 16MB / 4kB = 4096, resulting in ⌈log₂ 4096⌉ = 12 bits needed for row identification, and there are two 1536 byte blocks per row, which gives 1 additional bit. All this adds up to 29 bits for 1536 byte packets, leaving 3 for packet type. For 576 byte packets the length field required by the 1536 byte packets can be reused, and since there is only one such slot per row no additional data is needed for that. For the 64 byte packets however there are 7 slots per row, which requires 3 bits to differentiate among them. These 3 bits could be allocated in the most significant part of the packet length, since such small packets will always have zero there, however that makes the hardware more complex, so it is better to instead add 2 bits (and use the 1 bit from the 1536 byte packets) to select the correct slot. This leaves the packet type with only 1 bit. Figure 5.2(b) shows what this allocation looks like.

²The feature of letting the input module classify the packets was almost not used at all in the previous project [1] and this project makes no attempt to extend it.

(a)  bits 31-27: type [5]; bits 26-11: location [16]; bits 10-0: length [11]

(b)  bit 31: type [1]; bits 30-29: block size [2] (64, 576 or 1536); bits 28-26: slot [3] (0-6 for 64 byte blocks, 0-1 for 1536 byte blocks); bits 25-11: input module & row [15]; bits 10-0: length [11]

Figure 5.2: Packet Identifier Format
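Format (a) can be packed and unpacked with simple shifts and masks. The helpers below are an illustrative sketch (the function names are ours); they also apply the length-minus-one encoding described above:

```python
def make_identifier(ptype, location, length):
    """Pack format (a): type[5] in bits 31-27, location[16] in bits 26-11,
    length-minus-one[11] in bits 10-0."""
    assert 0 <= ptype < 2**5 and 0 <= location < 2**16
    assert 1 <= length <= 1500          # Ethernet MTU, jumbo frames excluded
    return (ptype << 27) | (location << 11) | (length - 1)

def split_identifier(ident):
    """Recover (type, location, length) from a format (a) identifier."""
    ptype = ident >> 27
    location = (ident >> 11) & 0xFFFF
    length = (ident & 0x7FF) + 1
    return ptype, location, length
```

A round trip recovers the original fields, and the all-zero identifier corresponds to type 0, location 0 and a 1-byte packet.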

Both of these formats are viable; even though the second only has a single bit for packet type, the current input module makes no real use of it. Both have been implemented, and in the end it turned out the increased hardware size of the second version was too much of a problem without providing any real gain; there is enough memory that wasting some isn't a problem.


Chapter 6

Packet Buer Design

This chapter describes the design of the packet buffer. Figure 6.1 shows an overview of the design and the following sections will go into detail on the different parts. The grey part of the figure shows the clock domain division of the packet buffer.

[Figure: the packet buffer, with the memory interface on one side, the socbus interface on the other, and the controller with in and out buffers in between.]

Figure 6.1: Packet Buffer Overview

6.1 DDR Controller Selection

When starting the project a choice had to be made on which DDR controller should be used, or whether a completely new one should be designed. As can be seen later in this chapter the latter approach was taken, and the sections below will explain why, and from where the basis of the selected controller came.


To better understand the requirements on the controller the first source of information was the DDR SDRAM standard document [3]. This gave some insight into what modes of operation would be required from a controller to achieve high speed data transfers.

6.1.1 Available Controllers

The first step in the process of finding a good controller design was of course to look at what ready-to-use solutions were available.

Opencores¹ has a DDR controller available which is designed for Virtex-II devices. It is free to use but not particularly configurable. In particular it is designed to perform bursts of length 2 only, the shortest burst available. After reading the standard it was apparent that such a solution would probably waste too much time controlling the memory to achieve good transfer speeds.

Xilinx, the Virtex-II manufacturer, has a number of different memory controllers available for their FPGAs. For the Virtex-II there exist basically two reference designs that handle DDR SDRAM memory. These are described in Xilinx application notes XAPP253 [6] and XAPP688.

Application note XAPP688 is not free; a short document describing the basic design method is available but the code costs money, thereby disqualifying it for usage in this project. Also, the design includes delaying the strobe signal from the memory within the FPGA using a rather sophisticated method. This method needs some sort of control for the delay circuit, which depends on the supply voltage and temperature of the FPGA, and figuring out a solution to do this from scratch seemed like a bit too much work.

XAPP253 on the other hand has code available and the design is simpler than that of XAPP688. There are however some problems with it and Xilinx does not recommend its usage. Internally it works with a special construction which uses clock signals as input to combinatorial logic. This construct requires special constraints on the design, and to use it the data bus would have to be extended from 16 bits to 64 bits, which is complicated because of this construct. The design is not a complete controller either: it does not handle refreshing the memory and does not allow read and write commands to be issued back-to-back, which would hinder long transfers. It is also made for a single memory, which the memory used in this project is not (see Section 5.3). Some of the ideas from XAPP253 can however be applied to building a custom controller.

6.1.2 Custom Controller

A DDR controller implemented in an ASIC design is normally run at twice the clock frequency of the memory. Compared to an FPGA an ASIC is much faster, which means this is no problem. By doing so the ASIC requires no special DDR IOBs and doesn't have to deal with half clock cycles internally. This type of design would be simpler to implement from a pure code perspective; however, getting such a design to meet timing at the speeds required is not feasible with the FPGA.

The XAPP253 controller uses the DDR IOBs and DCMs described in Chapter 3 to achieve its goals. It uses DCMs to phase-shift clock signals so that the generation of data and strobe signals fits the memory timing model. It also shows how to configure the FPGA to use the I/O standard used by DDR SDRAMs. Starting with this information a custom controller that doesn't need to run at twice the memory speed can be constructed by using the special FPGA resources. Designing the controller from scratch gives complete control over how the memory should be used and allows for a solution that fits the specific needs of the application. It also gives promise of a solution that integrates nicely into the overall design of the router.

In the next section the exact design of the controller is presented; Chapter 5 gives some background on how it will be used and, as a consequence, some of the reasons why it is constructed the way it is.

6.2 The Controller

The controller is at the heart of the packet buffer. It controls the memory by sending commands and generating all the control signals needed for the different elements in the data path. It also decides how to schedule all reads, writes and refreshes during operation, and provides a correct startup procedure. To simplify the implementation the different tasks of the controller have been divided among several smaller controllers. This helps because smaller state machines are easier to place and route so that timing is met. Also, with a good division the parallel processing simplifies the problem of issuing commands to the memory in an efficient way.

Initially a controller incorporating all the different tasks into a single sequential state machine was designed. This first controller managed to do reads and writes, but only at a low frequency because of the delay between memory and FPGA during read operations; see Section 9.3 for more information on this. With compensation for the delay a higher frequency was achievable, but in order to get really good performance memory commands cannot be issued in the straightforward sequential fashion of first activating the row, then doing the read/write and precharging the row.

The reason things cannot be done sequentially is the delays: after a row has been activated some time needs to pass before it can be used, therefore it is preferred to do the activation during another read/write operation. In this way the activation time is hidden during another operation and no time is lost due to it. The same goes for precharging the row; the time needed can be hidden during another operation.

In addition to the active/precharge and read/write operations there is really only one more operation needed: refresh. During refresh the memory cannot be used for anything else, but since it is a double-sided memory, while one side refreshes the other can still perform operations.

Getting maximal performance out of the memory is the goal and this means getting as much data in and out of it as possible. In other words the data bus should be used as much as possible, so any command that is not a read or write² should be performed while a read/write operation is using the data bus. Of course this is not always possible, but when there are no read/write operations available there isn't really any need to hide the commands, because there is time for them anyway.
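A quick sanity check shows how much of the command bandwidth the read/write traffic can occupy at most, using the full burst length of 8 that the design works with:

```python
# Worst-case fraction of command slots consumed by read/write commands.
# A DDR bus moves two data words per clock cycle, so a burst of length 8
# keeps the data bus busy for 8 / 2 = 4 clock cycles, and a READ/WRITE
# command is needed at most every 4th cycle.
burst_length = 8
transfers_per_cycle = 2                         # DDR: both clock edges
cycles_per_burst = burst_length // transfers_per_cycle
rw_slot_share = 1 / cycles_per_burst            # command slots used by R/W
other_slot_share = 1 - rw_slot_share            # left for ACTIVE/REFRESH
print(cycles_per_burst, other_slot_share)
```

This is the source of the 3/4 figure used later when arguing that the secondary controllers cannot starve.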

So, since there can only be one read/write command in progress at any one time, it makes sense to allocate these to a single controller. Of the other operations needed, the precharge is easiest to allocate because precharge commands can be grouped with the read/write commands in the form of read/write commands with auto-precharge. This leaves the active and refresh commands, both of which should be executed in parallel with the read/write commands. In other words they should be allocated to a separate controller, and since the different sides of the memory can work independently a separate controller for each side might be a good idea.

This just leaves the startup procedure. Both sides of the memory have the same configuration, so it is possible to send the same sequence of commands to both sides at the same time and both will be initialized in the same way. Also, the startup procedure is in a way more complex than the tasks needed during operation, but it is still very straightforward. So it makes sense to put this in a separate controller, since that removes complexity from the other controllers.

With these three different controllers, all of which can send commands to the memory, a way to give them different priorities is required. The priorities must be given in such a way that no starvation occurs. The controller handling the memory startup procedure, from now on called the startup controller, will never work at the same time as the others, so there will never be any conflict. The other controllers, the primary controller handling read/write commands and the two secondary controllers (handling active and refresh commands, one for each side of the memory), all compete for the command interface of the memory.

Before going into exactly how to prioritise among these three controllers, some information is needed on how read and write operations get to the controller. When a packet has arrived that should be read or written, a command is sent to one of the secondary controllers (depending on which side of the memory the command is destined for) through an asynchronous FIFO³. Once the secondary controller has read the command and activated the row on which the operation should be performed, it forwards the command to the primary controller so that it in turn can perform the read or write. Since there are two secondary controllers there might be a conflict over which command the primary controller chooses.

² Read and write operations are the only ones that use the data bus.

³ It needs to pass through an asynchronous FIFO because the socbus input ports and the controller run in different clock domains.

Also, the primary controller could already be handling a command, meaning it cannot accept another. In the case of a conflict a priority is needed which does not starve either of the two secondary controllers, and a simple way to do this is to let the primary controller accept the command from the secondary controller which it is not currently serving. The resulting connections between the different elements of the controller can be seen in Figure 6.2.

[Figure 6.2: Controller. Block diagram of the two secondary controllers, the primary controller and the startup controller, with asynchronous FIFOs (AFIFOs), connected to the memory.]
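The hand-over rule described above, where the primary controller accepts the next command from whichever secondary controller it is not currently serving, can be sketched as a small behavioural model. This is an illustration with made-up names, not the actual HDL:

```python
from collections import deque

class PrimaryController:
    """Behavioural sketch: alternate between the two secondary
    controllers so that neither is starved. Names are illustrative."""
    def __init__(self):
        self.queues = [deque(), deque()]  # one per secondary controller
        self.last_served = 1              # so side 0 is tried first

    def forward(self, side, command):
        # A secondary controller forwards a command after it has
        # activated the row the command will operate on.
        self.queues[side].append(command)

    def next_command(self):
        # Prefer the side that was not served last; fall back to the
        # other side if that side's queue is empty.
        first = 1 - self.last_served
        for side in (first, 1 - first):
            if self.queues[side]:
                self.last_served = side
                return side, self.queues[side].popleft()
        return None
```

With commands pending on both sides, the model alternates between them; with only one side busy, it drains that side without blocking.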

Now back to the priority of the memory command interface. The primary controller has control of the data bus, and getting maximal utilization is the goal, so naturally this should in some way be prioritised. With only the two secondary controllers to compete with, a rather simple scheme can fulfill the requirements of giving the primary controller priority while neither of the secondary controllers suffers from starvation. The reason this works is that the primary controller only needs to do read and write operations with the full burst length. The full burst length is 8, which means 4 clock cycles because it is a DDR memory. In other words the worst case scenario is that every 4th cycle is used by the primary controller, leaving the two secondary controllers with 3/4 of all command opportunities to the memory. So the simple scheme is: if the primary controller wants to send a command to the memory, let it. If the primary controller has no command to send, a selection between the secondary controllers might be needed. If only one of the secondary controllers wants to send a command there is no choice, but if both do, a number of selection methods would be possible, for example at random, one always having priority over the other, or whichever did not get to send last. All of these would probably work out fairly well, but since implementing the one that did not send last is simple and gives a fair balance, this was the choice.
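The resulting priority scheme can be condensed into a small arbiter model. Again this is a behavioural sketch with assumed names, not the real implementation:

```python
def arbitrate(primary_req, sec_reqs, last_sec):
    """Pick which controller drives the memory command bus this cycle.

    primary_req: True if the primary controller has a command.
    sec_reqs:    (bool, bool) requests from the two secondary controllers.
    last_sec:    index (0 or 1) of the secondary controller that sent last.
    Returns ("primary" | "sec0" | "sec1" | None, updated last_sec).
    """
    if primary_req:
        return "primary", last_sec           # primary always wins
    first = 1 - last_sec                     # least-recently-served first
    for side in (first, 1 - first):
        if sec_reqs[side]:
            return f"sec{side}", side
    return None, last_sec                    # command bus idle this cycle
```

Because the primary controller occupies at most every 4th command slot, giving it absolute priority still leaves the secondary controllers enough slots that neither can starve under this rotation.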
