
Analysis and Optimisation of Communication Links for Signal Processing Applications

ANDREAS ÖDLING

Examensarbete inom elektronik- och datorsystem, avancerad nivå, 30 hp
Degree Project in Electronic and Computer Systems, second level, 30 credits

School of Information and Communication Technology (ICT)
Royal Institute of Technology, KTH

Supervisor: Johnny Öberg
Examiner: Ingo Sander
Stockholm, November 12, 2012


Abstract

There are many communication links and standards currently being employed to build systems. Many of these methods are standardised, but far from all of them. The trick is to select the communication method that best suits your needs. There is also currently a trend that things have to be cheaper and have a shorter time to market. That leads to more Commercial Off-The-Shelf (COTS) systems being built using commodity components.

As one part of this work, Gigabit Ethernet is evaluated as a COTS solution for building large, high-end systems.

The computers used run Windows, and the protocols used over Ethernet are both TCP and UDP. In this work, an attempt is also made to evaluate one of the non-standard protocols, the Link Port protocol for the TigerSHARC 20X series, which is a narrow-bus, double-data-rate protocol able to provide multi-gigabit-per-second performance.

The studies have shown several interesting results: for example, that using a standard desktop computer and network card, the theoretical throughput of TCP over Gigabit Ethernet can almost be met, reaching well over 900 Mbps. UDP performance, on the other hand, gives rise to a series of new questions about how to achieve good performance in a Windows environment, since it is consistently outperformed by the TCP connections.

For the Link Port assessment, a custom-built IP block that supports the protocol at full speed is created, using a Xilinx Virtex 6 FPGA. The IP block is verified through simulation against a model of the Link Port protocol. It is also shown that the transmitter of the IP block is able to send successfully to the receiver IP block. The created IP block is evaluated against some competing multi-gigabit protocols for comparison; it is a rather small IP block, capable of handling all transactions.


Referat

Today there are many different kinds of communication links, both standardised and not. In addition, demands for shorter time to market have in many cases led to more and more systems being built from ready-made components that are connected together into complete systems. As part of this, well-proven techniques that are known to work are often used.

As a part of this work, the performance of Gigabit Ethernet is evaluated on ordinary personal computers running Windows, using the TCP and UDP protocols. The machines are equipped with low-cost standard network cards, and the study investigates whether these cards and computers can be used to build high-performance systems. In addition, a non-standardised protocol, the Link Port protocol for the TigerSHARC 20X series, a protocol that supports several Gbps, is evaluated for performance.

The study of TCP and UDP led to very interesting results. Among other things, it showed that TCP communication between two personal computers can come within a few Mbps of the theoretical maximum, and transfer rates well above 900 Mbps were measured for TCP. UDP, in turn, raised more questions than it answered, and it consistently performed worse than the TCP tests. This suggests that, when writing programs for ordinary personal computers, there is nothing to be gained from using UDP; rather the opposite.

For the Link Port study, an IP block was created that can send and receive data at the highest rate specified in the protocol description, four gigabits per second. The block was verified through simulation and by letting the transmitter send data that the receiver successfully received. Finally, the Link Port was compared against other protocols with similar characteristics, and the comparison presents the created IP block as a good alternative to the other protocols, largely owing to its simplicity.


Contents

Abstract
Referat
Contents
List of Figures
List of Tables
Listings
Definitions

I Prelude

1 Introduction
   1.1 Purpose
   1.2 Goals
   1.3 Motivations for This Work
   1.4 Limitations for This Work
   1.5 Layout for the Report

2 Background and Related Work
   2.1 History of Radar Systems
      2.1.1 Radar Construction Basics
      2.1.2 A Probable Future
   2.2 An Example System
      2.2.1 Conceptual Radar System
      2.2.2 Data Transfers in the Conceptual Radar System
   2.3 A Background to Physical Signalling
   2.4 Multi-Gigabit Transceivers
   2.5 The Link Port Protocol
      2.5.1 Some Link Port Characteristics
      2.5.2 Previous Work on Link Ports
   2.6 Communication Protocols
   2.7 Previous Work on Protocol Comparison
      2.7.1 TCP and UDP Performance Over Ethernet
      2.7.2 RapidIO Analysis
      2.7.3 PCI Express Evaluation
      2.7.4 USB Experiments
      2.7.5 Infiniband Studies
      2.7.6 Intel Thunderbolt
   2.8 Data Acquisition Networks
      2.8.1 Common Features for DAQ Networks

II Contributions

3 Methods
   3.1 Link Port
   3.2 Gigabit Ethernet
      3.2.1 Setup for the Experiment
   3.3 Other High-Speed Protocols

4 Ethernet On Windows Computers
   4.1 Hardware and Software Setup
      4.1.1 Offloading Checksum Calculations
      4.1.2 Increasing the Transfer Buffers
      4.1.3 Increasing the Receiver Buffers
      4.1.4 Increasing the Ethernet Frame Size
      4.1.5 Control the Interrupt Rate
   4.2 Evaluating the Performance
      4.2.1 The Measurement Environment
   4.3 TCP Specifics
      4.3.1 TCP and IP Checksum Offloading
      4.3.2 Effects from Interrupt Moderation
      4.3.3 Changing the Ethernet Frame Size
      4.3.4 Variable Buffer Size
   4.4 TCP Evaluation and Summary
   4.5 UDP Specifics
      4.5.1 Interrupt Moderation Effects
      4.5.2 Buffer Size Exploration
      4.5.3 Does Frame Size Affect UDP Performance?
   4.6 Analysis of UDP Performance
   4.7 Summary of Ethernet Performance
   4.8 Which Settings to Choose

5 Creating a Link Port IP Block
   5.1 Link Port Implementation Idea
      5.1.1 Key Coding Considerations
   5.2 Link Port Transmitter
      5.2.1 Transmitter Clocking
      5.2.2 Transmitter State Machine
      5.2.3 Transmitter LVDS Outputs
      5.2.4 The Data Path and Memory Design
      5.2.5 Controlling the Transmitter
      5.2.6 Checksum Calculator
      5.2.7 The Implementation of Block Complete
   5.3 Link Port Receiver
      5.3.1 Receiver Finite State Machine
      5.3.2 Controlling the Receiver
      5.3.3 The Deserialisation of Incoming Data
      5.3.4 Receiver LVDS Inputs
      5.3.5 Getting the Receiver Through Timing
   5.4 Testing and Verification
   5.5 IP Block Restrictions
   5.6 IP Block Metrics
   5.7 Link Port Implementation Time
   5.8 This Link Port Implementation's Contributions
   5.9 Comments and Analysis of the Link Port IP Block

6 Comparison of Communication Techniques
   6.1 Hard Facts
   6.2 Making a Choice

7 Goal Follow-Up and Conclusions

8 Future Work

Bibliography

III Appendices

A Abbreviations
B A Selection of Used Xilinx Primitives
C Selection of Needed Constraints
D The OSI Model
   D.1 Physical Layer
   D.2 Data Link Layer
   D.3 Network Layer
   D.4 Transport Layer
   D.5 Session Layer
   D.6 Presentation Layer
   D.7 Application Layer
E PCI Express
   E.1 Associated Overhead
F Gigabit Ethernet
   F.1 Real-Time Ethernet
   F.2 Efficiency of Gigabit Ethernet
G TCP/IP Protocol Suite
   G.1 The Internet Protocol Version 4
      G.1.1 Efficiency of the Internet Protocol Datagrams
   G.2 The User Datagram Protocol
   G.3 The Transmission Control Protocol
      G.3.1 Socket Buffer Size
      G.3.2 Different TCP Implementations
      G.3.3 TCP Offload Engine
      G.3.4 RDMA-Enhanced TCP Decoding
      G.3.5 TCP Efficiency Over Ethernet
H Link Port for the TS20X Series
   H.1 Performance of Link Ports
   H.2 Uses of Link Ports
I RapidIO
   I.1 The Logical Layer
   I.2 Transaction Layer
   I.3 Physical Layers
      I.3.1 Serial RapidIO
      I.3.2 Parallel RapidIO
J USB
K Infiniband
L 8B/10B Encoding
M Case Study: The ATLAS TDAQ System
   M.1 The Communication Protocols in ATLAS
   M.2 The Physical Interconnects and Software of ATLAS TDAQ

List of Figures

2.1 Radar PPI
2.2 Example of partitioned radar system
2.3 Example data processing flow
2.4 Example Radar System
2.5 Differential Signalling
2.6 Multi-Gigabit Transceiver placements
2.7 Link Port Back-to-back transmissions
2.8 Link Port Checksum Transmission
2.9 Link Port Start and Stop of Transmission
4.1 Flowchart description of Ethernet measurement
4.2 TCP Checksum Offloading Effects
4.3 More Checksum Offloading Examples
4.4 Interrupt Moderation Effects On TCP Performance
4.5 Throughput of 4088 B Jumbo Frames
4.6 Throughput of 9018 B Jumbo Frames
4.7 TCP Throughput With Variable Sender Buffer Size
4.8 TCP Throughput With Variable Receive Buffer Size
4.9 UDP Performance With Varying Interrupt Moderation
4.10 Packet Loss With Different Interrupt Moderation Settings
4.11 UDP Throughput With Variable Buffer Size
4.12 Packet Loss For UDP with Variable Buffer Size
4.13 UDP Throughput at Different Frame Sizes
4.14 Comparing Received and Sent Bytes per Second for UDP
4.15 Linear Approximation of Measured UDP Throughput
5.1 Original Link Port Receiver
5.2 Original Link Port Transmitter
5.3 Link Port Transmitter Block Diagram
5.4 Transmitter Clocking Relationships
5.5 Transmitter FSM Chart
5.6 Link Port Transmitter Enable Schematic
5.7 Output Clock of Link Port Transmitter
5.8 Writable Control Registers
5.9 Readable Status Register
5.10 Link Port Receiver Block Schematic
5.11 Receiver FSM Flowchart
5.12 Link Port receiver timing start
5.13 Link Port receiver timing end
5.14 Link Port receiver timing with CoreGenerator
5.15 Link Port receiver first schematic
5.16 Receiver LVDS Inputs
5.17 Input Clocking of Link Port Receiver
5.18 Input Logic With Clocking Net Shown
5.19 Link Port Receiver Clock Crossing
D.1 OSI reference model
E.1 PCI Express packet
F.1 An overview of the layers in Gigabit Ethernet
F.2 Ethernet MAC Frame
F.3 Theoretical Ethernet Throughput
G.1 An IPv4 Packet Header with optional options field following it
G.2 IP over Ethernet maximum throughput
G.3 UDP Packet Outline
G.4 UDP over Ethernet theoretical throughput
G.5 TCP Packet Outline
G.6 TCP over Ethernet theoretical throughput
I.1 RapidIO to OSI Mapping
I.2 The layout of a serial RapidIO packet of arbitrary size
M.1 The concept of the original ATLAS network
M.2 ATLAS Split Network

List of Tables

1 Definitions
2 Special Text Decorations
2.1 Link Port Inputs/Outputs
4.1 Computer Setup in Ethernet Test
5.1 Resource usage for IP blocks
5.2 Time Spent On IP Block Creation
6.1 IP Block Resources Comparison
B.1 Summary of Xilinx Primitives

Listings

4.1 Ethernet Setup Message
4.2 Client Program Pseudocode
C.1 Multicycle Checksum Constraints
C.2 Link Port Input Constraints

Definitions

Byte: Eight bits, equal to an Octet
Half Word: Two Octets (16 bits)
Octet: Eight bits, equal to a Byte
Packet: A unit of transmission of a protocol
Quad Word: Four Words (128 bits)
Quartet: Four bits
Word: Four Octets (32 bits)

Table 1. A table of some common definitions that will be used throughout this report.

Some definitions that will be used throughout the report are given in Table 1. In addition, some special text decorations will be used; these are specified in Table 2.

PRIMITIVES: Primitives are written with capital letters in a typewriter font.
Signal: Signals used are written in bold letters.
1 and 0: Logical one and zero are written in typewriter font as 0 and 1.

Table 2. A table summarising the font decorations of special words.


Part I

Prelude


Chapter 1

Introduction

The industrial revolution created something humans had thus far never dreamt of: standards for components. This has arguably been a great improvement and helped make the technical revolution of the last century possible. However, it has not applied to everything. In the computer world a lot of things are standardised, especially in the personal computer domain, but on the industrial side there are more non-standardised solutions. This lack of standards increases time to market, because communication interfaces have to be created and thoroughly tested before the product can ship.

In the light of harder competition, and the need to sell products that are already developed rather than concepts to be developed after purchase, the use of pretested and verified techniques is inevitable. In many areas commodity standards are already used, e.g. for processors and memory modules. In communications, however, this transition is still taking place.

For those reasons, a comprehensive overview of the communication standards available for large-scale embedded systems is needed, and that is why this work was initiated. The main target application is embedded radar systems of varying sizes, with applications in civil security as well as the military field.

1.1 Purpose

The purpose of this work is to evaluate the effect of different communication links in embedded systems for radar applications. The communication is divided into three categories:

• Inter-chip communication. The communication between chips on the same printed circuit board (PCB). Examples are RocketIO, the Multi-Gigabit Transceivers (MGTs) on Xilinx FPGAs, and the Link Port, a Low Voltage Differential Signalling (LVDS) communication protocol by Analog Devices.


• Inter-board communication. The communication between different PCBs in the same system. Examples are again RocketIO, the Multi-Gigabit Transceivers (MGTs) on Xilinx FPGAs [1], and the Link Port, an LVDS-based communication protocol.

• System-to-host communication. The communication between the host and the rest of the system. Examples are Gigabit Ethernet (GbE), USB 3.0, Thunderbolt, Infiniband (IB) and possibly some other techniques.

Some of these techniques will be studied separately, since they all pose different demands on the communication in terms of reliability, throughput and latency. Some of them will only be compared theoretically, while others are tested or simulated in order to obtain their performance.

1.2 Goals

The goals of this work are to:

• Create a VHDL implementation of a Link Port [2] for a TigerSHARC (TS20X) processor, to provide a communication interface between an FPGA and a DSP. This model should be verified for functionality through simulation and, in that simulation, tested for maximum throughput, latency, area and power.

• Investigate how the transfer speed of the TCP and UDP protocols over GbE between two units is affected by altering the maximum payload of the Ethernet frame (jumbo frames), as well as the buffer sizes and the interrupt settings of the network cards. From the results, draw conclusions on how to best utilise GbE in embedded system design.

• Collect research results regarding some high-speed protocols supported by the multi-gigabit transceivers (MGTs) inside a Xilinx FPGA, as well as USB 3.0 and Thunderbolt. Compare the protocols to show the benefits and drawbacks of each, together with recommendations of when to use which protocol.

• Examine the latest research results to try to predict the future standards and trends of digital communication within embedded systems.

1.3 Motivations for This Work

The Gigabit Ethernet part of this work will focus on TCP/IP protocols over Ethernet. In contrast to most previous work, which has been done in Linux environments, this study looks into the TCP/IP protocols in a Windows environment and how to optimise their networking performance. Furthermore, this work tries to specify how a certain traffic pattern associated with radar signal processing affects the performance of such interconnected machines, instead of optimising for an arbitrary traffic pattern.

For the Link Port implementation, this work contributes an FPGA interface for communicating at gigabit speeds with the DSPs in a cluster. Furthermore, if this implementation is successful, it could also be used for lower-end FPGA-to-FPGA communication, bringing this relatively low-speed link to FPGAs without any gigabit transceivers. This implementation will then be compared to other, standardised communication techniques in order to examine which method is the most beneficial.

The studies concerning the other protocols will be helpful when selecting which communication standard or standards to implement on which link when constructing a high-end communications network for radar data processing. Since every technique has different characteristics and is optimised for different traffic patterns, several different communication techniques may be chosen in order to best serve the traffic pattern of the selected application. In this part, the evaluation of future communication techniques will also be included to some extent, since the future standards emerge from the most recent research trends.

1.4 Limitations for This Work

The aim of this work is to examine how different techniques and protocols are best utilised, and to give some guidelines on which to choose when implementing a system.

However, it is beyond the scope of this work to set up environments in which to actually test all of these protocols. The aim is to take a theoretical approach and evolve it into guidelines for selecting the most appropriate communication protocol.

Two techniques will be studied more deeply: the Link Port, and TCP and UDP over GbE. The Link Port protocol will be examined by creating an IP block in order to simulate and measure its characteristics. The TCP and UDP study will consist of evaluating the achievable throughput over GbE links while using COTS components.

1.5 Layout for the Report

In this chapter the topic is introduced and the purpose of this work is explained.

It covers some details about radar systems which may be superfluous; however, a concept system is introduced and examined.

In chapter 2, an attempt is made to lay some groundwork for readers new to this subject of communications, with a focus on radar and embedded systems. It also summarises some of the contemporary research in these areas.

Chapter 3 explains which methods have been used in this project to reach the goals.


Chapter 4 describes how Ethernet was tested, along with the results, the conclusions drawn and an analysis of the obtained results.

Chapter 5 explains the creation of the Link Port IP blocks as well as their evaluation. All metrics are presented and all parts of the IP blocks are specified.

Chapter 6 compares the different techniques that have been evaluated, not only the two that were tested but also some that have only been studied in theory.

In chapter 7 some final conclusions are drawn regarding the work that has been carried out.

Finally, chapter 8 suggests improvements and future work needed to straighten out some of the question marks raised in this work.

As an aid to the reader, all (or most) of the abbreviations used are listed in Appendix A, and the Xilinx primitives used to describe the FPGA part are listed in Appendix B.

The appendices also present some in-depth material for readers who wish to gain a deeper understanding of certain areas, even though that content is in no way necessary for the results.


Chapter 2

Background and Related Work

In this chapter, the focus is on introducing some concepts of radar and computer communication. It starts with some history of radar, then moves on to an example radar system setup and discusses the system and its data flow. It then describes which techniques can be used and what is currently in use. Finally, the chapter finishes with a walk-through of some comparisons of common protocols in systems with requirements similar to those of radar systems.

2.1 History of Radar Systems

The development of radar began when Heinrich Hertz, in the late 19th century, verified a prediction of Maxwell's electromagnetic field theory [3]. In doing so, he used an apparatus whose functionality resembled pulsed radar. This work was later continued and built upon by Hülsmeyer, who created a radar that he wanted to mount on ships in order to monitor other vessels and thus avoid collisions at sea.

During the Second World War, a lot of radar development was carried out [3]. All participating forces developed their own radars, including America, Great Britain, Germany, the Soviet Union, Italy, Japan, France and the Netherlands. The radars they developed were both land-based and ship-borne, some with long range and others with shorter range, but their main task was to search the airspace.

After the Second World War, the Moving Target Indicator (MTI) was invented to find moving objects in the analysed radar echo, exploiting the Doppler effect. Later in radar development, radars travelled into space, for surveillance of our planet and the exploration of the universe [3].

Another application, first theorised in 1951 but still sparsely used to its full extent, is SAR (Synthetic Aperture Radar). This technology has been difficult to realise in real time, due to the large amounts of data that need to be processed continuously. But thanks to the ever-shrinking size and increasing performance of microcontrollers and integrated circuits, more and more SAR systems are seen today. For example, modern aircraft carry SAR systems in order to map the surrounding terrain [4].

2.1.1 Radar Construction Basics

Historically, when building a radar system, all components were custom-built, but in recent years, as prices have fallen and cost savings have become a reality for developers, many radar systems are made from COTS (Commercial Off-The-Shelf) components [3]. Still, the front end, with antenna, transmitter and receiver, is created specifically for each kind of radar. The changes have mostly occurred further back in the data processing line, in the signal processing and detection parts (see Figure 2.4).

The signal processing in a radar system mostly operates on the I and Q components of the received signal, i.e. its real and imaginary parts [3]. The objective of the processing is to remove clutter as well as unwanted noise and jamming signals. To remove clutter, different filters are applied to the incoming signal to extract data from it, e.g. MTI (Moving Target Indicator) and MTD (Moving Target Detection) filters that detect non-stationary objects in range. Here, too, the trend has been to move from very specialised hardware to COTS components.
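To make the filtering step concrete: a classic MTI stage can be as simple as a two-pulse canceller, which subtracts successive echoes so that returns from stationary clutter cancel while moving targets survive. The sketch below is an illustration with synthetic numbers, not code from any system cited here.

```python
# Minimal two-pulse MTI canceller sketch (synthetic, illustrative data).
# Echoes from stationary objects are identical from pulse to pulse and
# cancel; a moving target changes between pulses and survives the filter.

pulse_n  = [0.9, 0.9, 0.1, 0.9]   # received amplitudes, pulse n
pulse_n1 = [0.9, 0.9, 0.7, 0.9]   # pulse n+1: only range bin 2 changed

mti_out = [b - a for a, b in zip(pulse_n, pulse_n1)]
print([round(v, 2) for v in mti_out])   # [0.0, 0.0, 0.6, 0.0]
```

Real MTI filters operate on the complex I/Q samples, so that the Doppler phase shift of a mover is what survives the subtraction; the principle is the same.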

Figure 2.1. An example of a PPI radar image, common in surveillance radars. Used with permission of Christian Wolff © at www.radartutorial.eu

After the filters have been applied to the signal, often several detections are recorded, and they need to be filtered further in order to determine how many real objects have been detected, and where [3]. When the number and location of targets have been decided, the data may be displayed on a monitor in the shape of a plan position indicator (PPI, see Figure 2.1), a common display in surveillance radar applications which also indicates the underlying terrain. On this display, the objects detected by the different filters are shown, together with some additional information, such as an aircraft's calling code if it is a friendly aircraft in a military system.

This is of course only one example of what a radar can do; there are several other uses as well. In some applications a monitor showing data is unnecessary, e.g. in traffic cameras measuring speed. In such an application it is sufficient to determine the speed of moving objects in order to decide whether to photograph them or not.

2.1.2 A Probable Future

The era of custom-built components is coming to an end for most radar systems [3]. Instead, new systems have to be cheap enough and have a short time to market. To enable this, the use of COTS components is crucial, and thus the future designers of such systems need to be able to choose the correct components to build these high-end systems.

In those systems, data transfers will be of crucial importance, since large quantities of data need to be transferred quickly. This is the main focus of this report: the data links connecting the COTS components.

2.2 An Example System

For purposes of discussion further on in this report, an example radar system will be shown. The system is pipelined, with different components continuously performing their specific tasks. The problem in such a system is that data has to be transferred between the different stages of the pipeline, and in the case of a radar system the amounts of data are large. To address this, there is a substantial need for high-throughput, low-latency links.

2.2.1 Conceptual Radar System

A very basic diagram of a radar system is shown in Figure 2.2, where the basic macro-components and their interconnects are visible. In brief, an operator watches the radar screen with a PPI image on it, and is also able to set some parameters in order to control what output the system gives and how responsive it is. The transmitter sends a signal to the antenna, which radiates it, and the receiver then receives the echo of that radar signal. The received echo is sent into the signal processing of the radar system.

Figure 2.2. An example radar system partitioning where the major parts are shown.

When data arrives at the signal processing, it is often raw data in large amounts. Signal processing and detection may be viewed as one step, since both involve computations on the raw radar data in an often sequential manner. The data that arrives has to be moved through the signal processing and detection system in a timely manner, and for radar applications there are often hard real-time deadlines that need to be met. This is basically because new data continuously arrives from the next scan, and all data needs processing in order to create the radar images, target indication, target tracking, ground maps etc. [4]. These techniques require the movement of very large quantities of data [5], all with hard real-time requirements, even though the transfer characteristics of different kinds of algorithms may differ a lot [5].

Since this work targets the data transfers from the point where data enters the front end of the signal processing until it exits the back end, a deeper understanding of these data transfers is sought in the next subchapter.

2.2.2 Data Transfers in the Conceptual Radar System

Figure 2.3. Example processing flow where the input data passes through five filters on its way towards the output. It is visible that filters 3 and 4 may run in parallel, but apart from that the processing needs to be done sequentially.

In order to extract all the valuable information from the received radar data, the data needs to be processed. The processing consists of a number of filters, such as FIR filters, FFTs (Fast Fourier Transforms) and other digital filters with specific purposes [5]. Since the processing is in many ways a series of sequential calculations [5] (see Figure 2.3), where in most cases each calculation needs to be completed before the next can start, there are some ways to cope with this.

The two most obvious solutions are either to pipeline the processing, so that the later filters do not have to wait for data except in the start-up phase, or to use processors powerful enough to complete all processing between the arrival of two consecutive datasets. Pipelining gives a higher latency for the operation, probably with the benefit of less strict timing restrictions on each filter. If the latency when pipelining is low enough, it is a feasible solution; if not, a more powerful processing solution must be created.
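As a rough model of this trade-off, a pipeline's dataset interval is set by its slowest stage, while its latency is the sum of all stages; the single-processor alternative must fit the whole sum between two arrivals. The stage times below are hypothetical examples, not measurements from this work.

```python
# Toy model of the pipelining trade-off described above.
# Stage processing times (ms) are hypothetical examples.
stage_times = [2.0, 5.0, 3.0, 1.5]    # e.g. ADC, front end, back end, video

latency  = sum(stage_times)   # time for one dataset to traverse the pipeline
interval = max(stage_times)   # a new dataset may enter every 'interval' ms

print(f"pipeline latency : {latency:.1f} ms")
print(f"dataset interval : {interval:.1f} ms (set by the slowest stage)")

# The non-pipelined alternative needs one processor fast enough to finish
# all stages between two consecutive data arrivals:
print(f"single-processor deadline: {latency:.1f} ms between arrivals")
```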

Some exploration of this subject is presented in Bueno et al. [6] and in Bueno, Conger and George [5], where a space-based radar system for Synthetic Aperture Radar (SAR) and Ground Moving Target Indicator (GMTI) processing is implemented. To do so, a network of processing cards is built, where the network should be fast enough to finish the processing of the data between two data arrivals. They also experiment with a pipelined solution where new data is fed into the system while results are computed from the old data, in order to improve performance. Their system puts high demands on throughput inside the network for efficient partitioning of data. The problem with transferring dataset N+1 into the system while calculating set N is that the interconnection network may be congested with data, lowering the performance of the Nth calculation. However, they found that this pipelining was beneficial for the performance of the GMTI algorithm, while it was more difficult when implementing SAR.

To solve the data transfer problem when implementing the radar system ([5] and [6]), a Parallel RapidIO solution was chosen (see Appendix I for an overview of RapidIO). By interconnecting the units with Parallel RapidIO, an FPGA-supported industrial standard was chosen, one that may deliver data rates over serial lines from one up to five gigabits per second [7]. The benefit of using an open, industrial-standard interconnect such as RapidIO is that there is no single-vendor dependency: as long as the components support the standard, they may be connected to each other.

An alternative to having a network that computes every sample between data releases is to pipeline the events. This is done in [8], where a hardware SAR processor is built which pipelines the calculations into several parts, naturally improving throughput.

The downside of pipelining is the increased latency associated with it, since everything cannot proceed at full speed. However, if implemented in a smart manner, the penalties of pipelining might be very small, or in some cases even negligible, and pipelining may then increase throughput without increasing latency so much that the timing deadlines are violated.

Figure 2.4. A layout of an example radar signal processing system. The letters i, M, N are arbitrarily chosen to make a scalable system with correct characteristics.

Figure 2.4 shows an example of a pipelined signal processing and detection system in a radar, consisting of four stages. In the first stage, data is read into the system from a number of Analog-to-Digital Converters (ADCs). The data is then passed on to front ends, which initiate the calculations and probably try to reduce the dataset and remove unnecessary data before moving it down the line. In the back end, which is the third pipeline stage, the data is further processed in order to extract the wanted information, e.g. moving targets in an MTI or GMTI. This data is finally sent to video processing for visualisation, and possibly to data storage for off-line processing at a later stage. If we compare Figure 2.3 with Figure 2.4, we may map the filters in Figure 2.3 onto the stages in Figure 2.4: filter 1 would go into the ADC step, filter 2 into the front-end signal processor, filters 3 and 4 into the back-end signal processor, whereas filter 5 is implemented in the video processing processor. By partitioning the system like this, and often reducing the amount of data between steps, an effective pipeline is created.

However, between the components there are lots of data transfers that need to take place, often in a very short time. This requires the systems to have high throughput in order to provide sufficient data transfer capability. There are several techniques for transferring the data, but the question is which components are to be used, and how?

2.3 A Background to Physical Signalling

Firstly, an introduction to different physical signalling techniques will be presented. In the field of computer communications, as in all other computer fields, there is a desire to increase the speed and throughput of data. In the past, systems were often made of chips interconnected with multi-drop buses. The solution when more data had to be sent was then to either increase the bus width or the bus frequency, thus increasing the total throughput of the system.

Figure 2.5. The idea of differential signalling is to take one input, create both that signal and its complement, and send both. By doing so, noise immunity improves and emitted noise is lowered; furthermore, the voltage swing of the difference between V+ and V− is double their individual swings.

Recently the trend has changed. The problems associated with multi-drop buses include, among others, skew and increased pin count. These become serious when the frequency is increased along with the width, causing routing problems on the boards. Such problems have driven the trend towards high-speed serial communications. The benefit of serial communication is that little or no skew occurs, depending on the layout of the serial bus. If the clock is embedded into the data stream, the data transfer may happen over a single line, and with only one line, skew is zero.

The serial link has the advantage of using fewer pins, giving a smaller I/O footprint in a design. However, many of the high-speed links use differential signalling for their communication, which doubles the number of pins compared to single-ended signals. This might seem like a problem, but in most cases it is not, since differential signalling has other large benefits. The idea is to use two pins instead of one: one pin (p) carries the positive signal, i.e. it is at high voltage for a logical 1 and low voltage for a logical 0, while its negative complement is put on the n-pin, which is at high voltage for 0 and low voltage for 1 [9].

By subtracting these differential signals from each other, the difference V+ − V− has double the swing of V+ alone. This means that the voltage swing on each line only needs to be half of what it would have to be with single-ended signalling; thus the rise and fall times on the differential pins are shortened, enabling higher frequencies [9]. To increase noise immunity, the differential pairs should be routed tightly together, since an electrical field is emitted between the conductors. This way they emit less noise, and they are almost identically affected by external noise. The noise immunity comes from both lines picking up the same noise: in the voltage subtraction, noise that is equal on both lines cancels, and the resulting difference stays the same.
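To make the common-mode rejection argument concrete, the sketch below applies the same noise sample to both lines of a differential pair and shows that the receiver's subtraction removes the noise while doubling the swing. All voltages are made-up illustrative values.

```python
# Illustrative sketch: common-mode noise cancels in a differential pair.
signal  = [1, 0, 1, 1, 0]                    # bits to transmit
noise   = [0.30, -0.20, 0.10, 0.25, -0.15]   # common-mode noise per sample
v_swing = 0.5                                # single-ended swing per line (V)

for bit, n in zip(signal, noise):
    v_p = ( v_swing if bit else -v_swing) + n   # positive line plus noise
    v_n = (-v_swing if bit else  v_swing) + n   # complement line, same noise
    diff = v_p - v_n                            # receiver subtracts the lines
    # The noise term cancels and the difference has double the swing:
    assert abs(diff - (2 * v_swing if bit else -2 * v_swing)) < 1e-9
    print(f"bit={bit} v_p={v_p:+.2f} v_n={v_n:+.2f} diff={diff:+.2f}")
```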

The only standardised differential signalling technique is Low-Voltage Differential Signalling (LVDS) [9], [10]. It is a very energy-efficient way of signalling, with raw data rates up to 3.125 Gbps. An optional encoding may be included in order to provide good signal integrity; if such encoding is present, the effective data rate will be lower than the raw signalling speed. LVDS is used in several serial communication standards, even though not all of them use the standardised LVDS signalling; one example is Serial RapidIO [7], which uses LVDS signalling in order to increase data integrity.

The second alternative for differential signalling is Emitter-Coupled Logic (ECL). ECL is the oldest of the differential techniques and is today widely used in military applications, mainly due to its ability to work across all temperature ranges [9]. The main drawback of ECL is that it operates at negative voltages. This is a problem since supplying negative voltages to chips is uncommon; designers tend to use mostly positive voltages.

The last physical technique is Current Mode Logic (CML). CML is a kind of ECL, but with some differing characteristics. The biggest difference is the transistor circuitry, which gives CML a higher common-mode output voltage [9]. This structure makes CML the fastest choice when creating a differential link, with transfer speeds exceeding the LVDS standards. However, CML links are restricted in length due to their high transfer speed and may almost exclusively be used for chip-to-chip communication on a single board. Furthermore, CML is far more power-consuming than LVDS at a given bit rate.

2.4 Multi-Gigabit Transceivers

To enable high bit rates between devices, the shift has been from wide parallel buses to multi-lane independent serial lines with point-to-point connections, as discussed in the previous section. With these serial connections, circuits need to be able to transfer single bits at several gigabits per second. This requires specially built chips, which can be external to the processing element (PE), as in Figure 2.6 a), or an integral part of an FPGA or embedded processor of some sort, as in Figure 2.6 b). These components are sometimes referred to as Multi-Gigabit Transceivers (MGTs) and are used to generate very high speed signals.

Figure 2.6 b) shows a typical layout of an FPGA with embedded processing elements and integrated MGTs. This is the implementation used by both Altera and Xilinx, the two major FPGA producers. By integrating these MGTs into different IP cores supplied with the FPGAs, the system developer has most of the common high-speed serial communication protocols readily available. Some protocols supported by both Altera and Xilinx high-end FPGAs are PCI Express, Serial RapidIO, XAUI (part of the 10GbE standard) and SATA, but many more are supported [11] and [12].

The ability to use these standards ensures high transfer speeds between chips, boards and chassis. Their availability inside FPGAs makes it a lot easier for developers to use these high-speed interconnect technologies, compared to having to add an external card handling the serial transfer, since everything is handled on-chip and may thus be thoroughly tested and simulated inside the FPGA development environment.


Figure 2.6. Different placements of MGTs, either as a separate chip, as in a), or as an integral part of e.g. an FPGA, as in b).

2.5 The Link Port Protocol

First out of the studied protocols is the Link Port protocol, since it is a key part of this project. The Link Port protocol exists in several versions, but the one examined here is the one for the TigerSHARC TS20x series, specified in [13]. The Link Port is specified as a differential data bus and an associated source-synchronous clock, with two additional control signals: acknowledgement and block complete.

The idea of the Link Port protocol is for the TigerSHARC DSP to be able to interface to other components through a multi-gigabit per second serial interface.

2.5.1 Some Link Port Characteristics

The Link Port protocol exists in many versions, for different processor families from Analog Devices. The protocol of interest here is the one targeting the TigerSHARC 20x processor series. The Link Port is a DDR (double data rate) protocol, meaning that two data items are presented each clock cycle. The protocol targets point-to-point connections only: each link has exactly one sender and one receiver. Several sender and receiver circuits can coexist on the same device, however, enabling the creation of processing clusters with many devices.

The Link Ports have a specific set of ports, listed in Table 2.1. Of these four ports, the two most crucial and fast-switching ones are differential, while the two control signals (Ack and n_BCMP) are ordinary single-ended signals.

Port name | Width (data/physical) | Description
Data | 4/8 | Four differential data pairs. Outputs from the sender.
Clk | 1/2 | Differential clock pair clocking the data. Outputs from the sender.
Ack | 1/1 | Acknowledgement sent by the receiver, indicating that it may receive data. Output from the receiver.
n_BCMP | 1/1 | Block Complete, used to signal the last quad-word of a transmission and to set up the link after reset. Output from the sender.

Table 2.1. The inputs and outputs of the Link Port. Their respective transfer origins are specified in the table; naturally, they are inputs on the side which is not the sender.

The start-up of a Link Port transmission is done by the transmitter setting n_BCMP to logic 1 (deasserting it). The receiver then knows that the transmitter is present, and may indicate the possibility to receive data by asserting Ack. Data may then be transmitted until Ack is deasserted again.
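A schematic sketch of this start-up ordering, using the signal names from Table 2.1, is shown below. It is a toy model of the handshake as described above, not cycle-accurate and not derived from the implementation in chapter 5.

```python
# Schematic model of the Link Port start-up handshake (ordering only).
class LinkPortModel:
    def __init__(self):
        self.n_bcmp = 0   # driven by the transmitter (0 = asserted)
        self.ack = 0      # driven by the receiver

    def transmitter_powers_up(self):
        self.n_bcmp = 1                # deassert n_BCMP: transmitter present

    def receiver_ready(self):
        if self.n_bcmp == 1:           # receiver has seen the transmitter
            self.ack = 1               # assert Ack: ready to receive

    def can_transmit(self):
        return self.ack == 1           # data flows until Ack is deasserted

link = LinkPortModel()
link.transmitter_powers_up()
link.receiver_ready()
print(link.can_transmit())             # True: data may now be sent
```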

Due to the inner workings of the TigerSHARC processor and its data bus being 128 bits wide, that is also the transmission unit size of the Link Port. This means that data is sent in chunks of 128 bits, with a checksum option available which sends an additional 16 bits for increased data integrity. The checksum is one byte long and is sent after the data; after the checksum byte, a dummy byte is also sent before the transfer of the next data begins. An example of the end of a transmission with the checksum enabled is presented in Figure 2.8.

The Link Port protocol specifies a discontinuous clock for the data. The first input data is clocked in at the first rising edge of the input clock, and the last received data arrives at the last falling edge of the clock. The clock output is low when no data transmission is taking place. The start and end of a transmission are shown in Figure 2.9, where we see that the clock is driven low when no transaction takes place.

When transmitting more than a single quad-word, there is no need for a gap between the words; the next transmission starts on the rising edge following the last falling edge of the previous word's clock, see Figure 2.7.

Figure 2.7. The transmission of two back-to-back quad-words has no gap between them.

Figure 2.8. The end of a transmission with the checksum option enabled.

Figure 2.9. The start and end of a transmission, showing the discontinuous clock at both times.

The clock signal of the Link Port protocol may run at up to 500 MHz, and the data arrives at double data rate on a 4-bit-wide bus. This means that the port can receive 1 byte per clock cycle, or 500 MB per second. The unit of transfer is either 128 bits or 144 bits, depending on whether the checksum is enabled or not. This means that transfers of small quantities of data are rather inefficient, but if lots of data are sent, the throughput is either the full 500 MB/s or, with the checksum enabled, 128/144 × 500 MB/s ≈ 444 MB/s.
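These figures follow directly from the protocol parameters; as a quick sanity check, the arithmetic can be written out:

```python
# Link Port throughput arithmetic from the parameters given above.
clock_hz   = 500e6   # maximum clock rate
bus_bits   = 4       # data bus width
ddr_factor = 2       # double data rate: data on both clock edges

raw_rate = clock_hz * bus_bits * ddr_factor / 8      # bytes per second
print(f"raw link rate : {raw_rate/1e6:.0f} MB/s")    # 500 MB/s

# With the checksum option, each 128-bit quad-word is followed by 16 extra
# bits (checksum byte + dummy byte), so 128 of every 144 bits are payload.
payload_rate = raw_rate * 128 / 144
print(f"with checksum : {payload_rate/1e6:.0f} MB/s")  # ~444 MB/s
```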

For more information on the Link Port protocol, see either Appendix H, or read on to chapter 5 where details are explained along the way.

2.5.2 Previous Work on Link Ports

In the area of Link Ports, comparative studies are less common than for standard protocols. However, this does not mean that the performance of Link Ports has not been tested. The Link Port is a point-to-point technology which connects two units (DSPs) for inter-chip communication. The fact that they are associated with DSPs means that Link Ports are often used in computationally intensive applications, e.g. radar and image processing.

The use of Link Ports in the literature often concerns the creation of real-time processing systems. In both [14] and [15], Link Ports are used to create pipelined radar processing systems. The first work uses a pipelined version of the radar system, where several DSPs have different tasks and data is passed down through the pipeline. The other approach is more of a brute-force one, where the nodes in a cluster of DSPs are connected to an FPGA and communicate through the Link Port protocol.

A third work which uses Link Ports is presented in [16]. This design uses Link Ports to interconnect DSP clusters with each other and with FPGAs, as well as to interconnect several FPGAs. The article discusses some design problems when creating FPGA link ports, such as receiver clocking and input design. The biggest design challenge in their Link Port design was the receiver input clocking. In their Virtex 5 FPGA, they used a global clock buffer to obtain an equal clock delay to all the input clocking components. They also used primitives similar to ISERDES to deserialise the incoming data.


2.6 Communication Protocols

Having summarised the most common differential signalling techniques and how they are implemented in FPGA solutions, a summary of different communication protocols will now be presented. Their common characteristic is, in some sense, that all of the protocols specify everything from a physical layer upwards to a layer where application data may be transferred.

RapidIO One common feature of all the protocols that will be studied is that they are almost exclusively serial in nature, with one exception: the RapidIO link [7] has both a parallel and a serial physical interface. RapidIO is a fairly new communication protocol which targets embedded high-end systems with requirements on high transfer speeds [9] and high connectivity. Since it provides both a serial and a parallel interface, together with the feature that a switch should be able to handle both parallel and serial modes [7], high interoperability is achievable. The serial RapidIO links operate at effective speeds ranging from one to five Gbps. On top of that, there is an overhead of 12 to 20 bytes for payloads of up to 256 bytes, giving a maximum effective throughput of 95.5% of the link speed (256 / (256 + 12) ≈ 95.5%). RapidIO is further described in Appendix I.
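As a sanity check on that figure, the efficiency for a few payload sizes can be computed from the stated 12 to 20 bytes of overhead; the payload sizes below are illustrative choices.

```python
# Serial RapidIO payload efficiency from the overhead figures above.
for payload in (32, 64, 128, 256):
    for overhead in (12, 20):
        eff = payload / (payload + overhead)
        print(f"payload {payload:3d} B, overhead {overhead} B: {eff:6.1%}")
# Best case, 256 B payload with 12 B overhead: 256/268 = ~95.5%.
```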

PCI Express Another technology of interest to this study is PCI Express (PCIe) [17], which is explored in depth in Appendix E. PCI Express is the evolution of the legacy PCI multidrop bus into a switched-fabric, point-to-point topology. PCI Express has the disadvantage that it has to be backwards compatible with PCI, to maintain operability with older operating systems. However, if implemented in a completely new system, it has potential as an embedded interconnect with low overhead and low latency. Given the correct conditions, it may achieve 99.5% efficiency over its links (Appendix E).

Ethernet One of the most common interconnect technologies today is Ethernet (see Appendix F); almost every new PC is sold with an Ethernet interface. Ethernet is an old interconnect technology which has seen many improvements since it was first introduced. By now, Ethernet has evolved from a half-duplex, sub-10 Mbps system into a full-duplex 100 Gbps system. This makes Ethernet one of the most popular networks in existence, and it is the interconnect technology in several high-performance computers today [18]. The main advantages of Ethernet are its low price/performance ratio and the number of people with knowledge about it.

TCP/IP Ethernet, however, is only standardised up to Layer 2 in the OSI model (Appendix D); above that, other protocols are commonly implemented in order to ensure reliability. The most well-known protocol framework is the TCP/IP suite, which contains several protocols, including TCP, UDP and IP, all explained in depth in Appendix G. Together these serve as the backbone of the most well-known network of all, the Internet. The most well-known protocols in the suite are TCP, a reliable end-to-end protocol that guarantees delivery; UDP, a connectionless protocol with very little overhead; and IP, which takes care of routing both TCP and UDP packets. TCP guarantees delivery of packets in sequential order [19], which can be very beneficial. The TCP protocol has little overhead in bytes, but providing its guarantees may incur a larger processing overhead.
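For readers unfamiliar with the practical difference, the snippet below contrasts the two socket types. This is a generic Python illustration, not the measurement code used in chapter 4; the address and port are placeholders, and the TCP connect requires a listening peer.

```python
# Generic contrast of TCP (reliable, connection-oriented) and UDP
# (connectionless) sockets. Host and port are placeholders.
import socket

HOST, PORT = "192.0.2.1", 5000   # documentation address, hypothetical port

# TCP: connect first; the stack guarantees ordered, complete delivery.
tcp = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
tcp.connect((HOST, PORT))
tcp.sendall(b"payload")          # retransmitted by TCP if frames are lost
tcp.close()

# UDP: no connection and no delivery guarantee, but very little overhead.
udp = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
udp.sendto(b"payload", (HOST, PORT))   # fire and forget
udp.close()
```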

Infiniband The other popular interconnect for high-performance computers, besides Ethernet, is Infiniband [18]. Infiniband is less mainstream than Ethernet but delivers higher performance in aspects such as latency. Furthermore, it has several different transmission techniques, in some sense like the TCP/IP suite. The bandwidth of Infiniband is scalable, and it is possible to scale up by increasing the number of parallel lanes. A more in-depth explanation is given in Appendix K.

USB A very common and mainstream interconnect is USB [20, 21]. It has been released in three specifications, each of which has increased the bandwidth by a great margin. The current USB 3.0 standard specifies a full-duplex, 5 Gbps connection. However, USB is not as easy to interpret in terms of communication speed as several other protocols. It has different transaction types, the ability to reserve bandwidth (up to 80% for USB 3.0) and so forth. This makes the link usable for real-time purposes, but not to a full extent. Furthermore, the specified rate is the raw bit speed, meaning that penalties for encoding must be accounted for. For a more comprehensive summary of USB, please refer to Appendix J.
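Combining the two caveats above, the 80% reservation cap and the encoding penalty (USB 3.0 uses 8b/10b line coding, the scheme summarised in Appendix L), gives a rough ceiling on usable bandwidth:

```python
# Rough USB 3.0 bandwidth ceiling from the figures discussed above.
line_rate_bps = 5e9                     # raw 5 Gbps line rate
data_rate_bps = line_rate_bps * 8 / 10  # 8b/10b: 8 data bits per 10 line bits
reservable    = data_rate_bps * 0.80    # at most 80% may be reserved

print(f"post-encoding data rate: {data_rate_bps/1e9:.1f} Gbps")  # 4.0 Gbps
print(f"reservable bandwidth   : {reservable/1e9:.1f} Gbps")     # 3.2 Gbps
```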

Thunderbolt A more recent technique is Thunderbolt [22], released by Intel, a technology that encapsulates both PCI Express and DisplayPort communication in an external cable [22]. The specification for this standard is only available under non-disclosure agreements. The technology supports a bi-directional, full-duplex, 10 Gbps channel for inter-chassis communication. The link itself supports both isochronous data transfers and burst transfers, although the amount of bandwidth that may be reserved is not clear from the source.

2.7 Previous Work on Protocol Comparison

A lot of work has been done to evaluate these different communication methods, and a summary of that work is attempted below. The authors of [23, 24] state that there are three main backbone architectures for embedded systems: Ethernet with TCP or some other upper-level protocol, PCI Express, and RapidIO. These three are considered by them to be the backbone architectures best suited for embedded systems. In addition to those three, Infiniband, USB and the newly developed Thunderbolt will be reviewed.


2.7.1 TCP and UDP Performance Over Ethernet

In [25], one of the few comparisons between Linux and Windows TCP performance is carried out. Furthermore, they investigate how the performance varies with different NICs, different internal bus widths and payloads (MTUs). They show that the performance of a card installed directly out of the box is often far below that of its optimum configuration. However, they find it easier to improve performance in a Linux environment than in Windows. Some factors they find can improve performance are an increased MTU, increased socket buffers and a reduced interrupt rate. The conclusion, however, is not that increasing everything gives the best results, but that correctly tuning a NIC improves performance the most.
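Of the settings listed, the socket buffers are the one knob that is set from application code; MTU and interrupt moderation are NIC or driver settings. Below is a minimal sketch of the buffer part, with an arbitrary example size rather than a value recommended in [25].

```python
# Enlarging the socket buffers, one of the tuning knobs listed above.
import socket

BUF_SIZE = 1 << 20   # 1 MiB; an arbitrary example size

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, BUF_SIZE)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, BUF_SIZE)

# The OS may round or cap the request; read back what was actually granted.
print("send buffer:", sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF))
print("recv buffer:", sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))
```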

In [26], the authors likewise investigate how buffer sizes and MTUs affect performance over long transmission lines with TCP/IP over 10GbE. They show that increased buffer size and MTU size improve communication performance when transmitting over long distances.
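The standard reasoning behind this, not spelled out in the source, is the bandwidth-delay product: a TCP sender can have at most one window of unacknowledged data in flight, so the socket buffers must cover the link rate times the round-trip time. The numbers below are illustrative, not taken from [26].

```python
# Bandwidth-delay product: buffering needed to keep a long link full.
link_rate_bps = 10e9   # 10GbE
rtt_s         = 0.1    # 100 ms round-trip time, e.g. a long WAN path

bdp_bytes = link_rate_bps * rtt_s / 8
print(f"bandwidth-delay product: {bdp_bytes/1e6:.0f} MB")   # 125 MB

# A TCP connection whose send/receive buffers are smaller than this value
# cannot keep the pipe full, regardless of what the NIC can do.
```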

In [27], the objective is to compare Fedora Linux with Windows XP and Server 2003. The experiment examines the operating systems' ability to forward packets by trying to send as many packets as possible through a PC. This work, however, does not examine how to improve performance; it only measures how fast the operating systems forward packets in user and kernel space.

In [28], an attempt is made to monitor the different delays in the Windows and Linux UDP stacks. The study indicates, as the previous studies have also suggested, that the processing time in Linux is shorter than in Windows. However, their tests were conducted using the minimal Ethernet packet size, where the relative overhead is at its maximum. In contrast, this study will look at larger packets.

That article also discusses some performance enhancements and their impact on real-time behaviour. The setting that limits real-time performance the most in the Windows case is interrupt moderation, i.e. waiting for more packets before issuing an interrupt, thus reducing the number of interrupts. Tests showed that if this setting was configured improperly, the system exhibited very poor performance in terms of latency.

In addition to these works, a lot of work concerning Ethernet performance was done when building the ATLAS detector at CERN [29]. A thorough examination is available in Appendix M, but a brief summary is given here. The decision made when building the data acquisition for ATLAS was to use Ethernet as the backbone communication methodology, in order to do the real-time filtering of data from approximately 60 TB/s when captured down to 300 MB/s when stored to disk [30]. The initial thought was to go with Fast Ethernet at 100 Mbps, but as technology evolved the chosen technology became a combination of Gigabit and 10 Gigabit Ethernet [31].

As a communication protocol for ATLAS, TCP/IP was considered, but was later abandoned due to its non-real-time effects [32]. Since the application layer already had timeouts that were much more predictable for ensuring real-time behaviour, the use of TCP was seen as a performance risk rather than a benefit, mainly due to over-occupation of the buffers, pollution of the network with acknowledgements and the potential sending of unwanted data. However, for non-real-time data TCP is regarded as a good option, since it relieves the application of having to guarantee delivery itself, as it would have to with raw Ethernet or UDP packets.

2.7.2 RapidIO Analysis

RapidIO is a rather new standard with both a serial and a parallel interface; a more thorough explanation may be found in Appendix I. The idea is to provide a high-speed interconnect for embedded systems with low overhead and high throughput. In contrast to PCI Express, it does not need to be compatible with the old PCI bus, which removes many inherited design constraints [9]. Unlike PCI Express, the main area of use for RapidIO is as an embedded-system interconnect.

Some implementations using RapidIO have been analysed in the research community. One of the most interesting for this thesis is the work at the University of Florida [5, 6], where a distributed signal processing system for radar applications was implemented with parallel RapidIO. The authors conclude that RapidIO is able to meet the communication demands of their real-time processing.

In [24], the claim is made that RapidIO is the best of the three most popular embedded interconnect architectures (Ethernet, RapidIO, PCI Express), in the sense that it combines the strengths of the other two in a single solution; for example, of the three only RapidIO can send both unreliable and reliable transactions. This is exemplified in an experiment where a highly interconnected system with very high bandwidth is created in the form of a dual star.

In [33], an application is built on Serial RapidIO (SRIO), but no performance metrics are measured beyond the statement that the system is scalable.

In [34], however, measurements are made of the efficiency of SRIO, with results of up to 86% of the link utilised for payload data. It is not entirely clear which settings were used when creating the packets, or whether the maximum payload was used, but this is not very far from the maximum theoretical utilisation of just over 92% (see Appendix I and [7] part 6).
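As a rough illustration of where that ceiling comes from (the exact packet formats are given in Appendix I and [7]; the figures below are assumptions made purely for the arithmetic), assume the maximum 256-byte payload and roughly 22 bytes of per-packet overhead for header, CRC and framing:

\[ \eta_{\max} = \frac{256}{256 + 22} \approx 0.921 \]

that is, just over 92%. A measured 86% is then plausibly explained by smaller payloads or additional control traffic on the link.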

Another interesting performance evaluation of RapidIO, in its parallel form, is made in [35], where latency and link utilisation at saturation are measured. The authors conclude that, with a single switch between two end-nodes, a 64-bit read may complete in fewer than 100 ns, showing that latency-sensitive data may be sent over a RapidIO link.

2.7.3 PCI Express Evaluation

In [36], a data acquisition system is built into a PC using a PCI Express interface. The study involves measuring the maximum throughput of the link, which in most cases was 75% of the theoretical maximum, even though their calculations estimated that, with their transfer characteristics, the theoretical maximum would be 85% (see Appendix E for more details on why 85%). However, their study also shows that for 99.998% of the time, throughput was over 50%, and in all cases it exceeded 45%.
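A back-of-the-envelope calculation hints at how such a theoretical ceiling arises (the actual derivation is in Appendix E; the packet parameters here are assumptions). With 128-byte Transaction Layer Packet payloads and roughly 20 to 24 bytes of per-packet overhead (framing, sequence number, header and LCRC), the payload fraction of the raw link becomes

\[ \frac{128}{128 + 24} \approx 0.84 \quad \text{to} \quad \frac{128}{128 + 20} \approx 0.87, \]

which brackets the 85% estimate quoted above.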

In [37], a study is presented where a COTS PCIe-enabled motherboard is used to speed up calculations in a parallel benchmark. The authors also show differences between PCIe and PCI-X, where PCIe has the advantages of lower latency and higher bandwidth.

2.7.4 USB Experiments

A pseudo-real-time USB application for a Windows 7 environment is created in [38]. This work only uses a full-speed (12 Mbps) USB connection, but it does achieve timely behaviour in the sense that it performs one read exactly every 10 ms, which is its deadline, thus demonstrating the possibility of using USB for real-time applications.

In [39], an FPGA implementation of a USB device is created in two modes, a slower and a faster one, both operating as high-speed USB. The transfer limitations lie in the underlying architecture on the FPGA, where the target rates are approximately 100 and 400 Mbps. These rates are met and exceeded in testing, indicating that their solution is able to utilise high-speed USB almost to its full extent.

2.7.5 Infiniband Studies

Infiniband (IB) is of high interest to high-energy physics [40] for data acquisition, which is very similar to the work conducted here. The appealing factors of Infiniband are its high throughput and low latency. It is also very important in the field of High Performance Computing (HPC), where high throughput and low latency are likewise critical [41].

In [42], an assessment of Internet Protocol (IP) performance over Infiniband is performed, and the evaluation finds that IB is a competitor to 10GbE since it delivers very high throughput, in this particular study up to 11 Gbps. Adding the results in [43], which show that the latency of IB is lower than that of Ethernet, the conclusion must be that IB is a serious competitor to Ethernet.

The results in [42] also compare two modes of IB, connected and unreliable. The authors find that, by using connected mode and thus not needing a TCP layer on top, the system is faster than the usual IP defragmentation algorithms. Hence, a speedup is gained when IB performs the fragmentation and defragmentation instead of the UDP stack.

Further research in [40] indicates that the choice between Infiniband and 10G Ethernet depends on the expected packet size, since Ethernet outperforms Infiniband for small packets, and vice versa for large packets. However, the authors note that these tests were only conducted for a point-to-point case and might not be valid for another setup with several machines over another type of network.


2.7.6 Intel Thunderbolt

In terms of research on Thunderbolt, there is little to find. Some work has been done, though, e.g. an attempt to interconnect several PCs running Windows Server 2008 R2 and make them communicate with each other over Thunderbolt [44], or Light Peak as it was called prior to public release. The results suggest that Thunderbolt may be used to create data clusters in the future, since the prototype achieved well over 50% utilisation of the buses.

2.8 Data Acquisition Networks

Since the aim here is to evaluate performance in radar signal processing hardware, which is characterised by large amounts of data transfers, this subchapter will look into systems with similar requirements, called Data Acquisition systems (DAQs). There are many similarities, since these systems are created to transfer data at very high rates from their input to storage or post-processing at the back end.

According to [45], DAQ systems may be divided into three categories: PC-based, embedded and FPGA systems. PC-based systems use a PC to visualise the data in some way; whether connected to an internal bus or to an external connector, these are standard PCs with an extension that captures data. For transmission of data, some techniques are USB, FireWire, RS232, Ethernet and so on; an internal alternative is to connect to a PCI or PCI Express bus.

Embedded DAQ systems are found in cars, aeroplanes, medical equipment and several other applications [45]. These are often fast, high-performance systems, but they have a fixed architecture and are not hardware-reconfigurable after they are built. This is in contrast to the FPGA solution, which may be hardware-reconfigured after the system has been deployed [45].

Following this definition, the main focus of the work conducted in this thesis will be a hybrid between FPGA-based and PC-based. The theoretical setup, shown earlier in Chapter 1 in Figure 2.2, has a PC at the end where the user gets to see the output data. However, prior to visualisation the data passes through some non-PC components, e.g. COTS components or custom-made components, which may very well be implemented in an FPGA.

In general, many systems are in some sense PC-based; they differ only in the amount of processing done in the PC. Two examples which use a PCI-bus-based card for capture and then do the processing on the CPU are [46, 47]. This is done in order to use commodity computers for the processing.

Another common approach is to use high-end computers in the middle, between the data collection and the visualisation PC. This is done, for example, in the ATLAS experiment (see Appendix M for a case study) as well as in other experiments such as the Daya Bay Neutrino DAQ [48, 49] and the KM3NeT Detector [50, 51]. All of these use different sizes of their computational clusters,
