
Network Processor specific Multithreading tradeoffs

Victor Boivie

Reg nr: LiTH-ISY-EX--05/3687--SE, Linköping 2005


Network Processor specific Multithreading tradeoffs

Master's thesis in Computer Engineering (Datorteknik) performed at Linköpings Tekniska Högskola

by Victor Boivie

Reg nr: LiTH-ISY-EX--05/3687--SE

Supervisors: Andreas Ehliar and Ulf Nordqvist
Examiner: Dake Liu

Presentation date: 2005-06-03
Electronic publication date: 2005-06-09
Department: Institutionen för systemteknik, 581 81 Linköping
Report number: LITH-ISY-EX--05/3687--SE
Language: English
URL for electronic version: http://www.ep.liu.se/exjobb/isy/2005/3687/
Title: Network Processor specific Multithreading tradeoffs
Author: Victor Boivie


Abstract

Multithreading is a processor technique that can effectively hide long latencies caused by, for example, memory accesses and coprocessor operations. While this looks promising, there is an additional hardware cost that varies with, among other things, the number of contexts to switch between and the switching technique used, and this cost might limit the possible gain of multithreading.

Network processors are traditionally multiprocessor systems that share a lot of common resources, such as memories and coprocessors, so the potential gain of multithreading could be high for these applications. On the other hand, the relative hardware cost is high since the rest of the processor is fairly small; instead of making a processor multithreaded, higher performance might be achieved by simply using more processors.

As a solution, a simulator was built in which a system can be modelled efficiently and whose simulation results can hint at the optimal configuration during the early design phase of a network processor system. A theoretical background to multithreading, network processors and related topics is also provided in the thesis.


Acknowledgements

First I would like to thank my supervisors, Andreas Ehliar at the university, and Ulf Nordqvist at Infineon Technologies, for the opportunity to work with this interesting and challenging project and for guiding me throughout the project.

I also feel gratitude towards the people I worked with at Infineon Technologies in Munich. Thank you Xiaoning Nie, Jinan Lin and Benedikt Geukes for all your help.

I would also like to thank my examiner, Dake Liu, professor at the Computer Engineering Division, for offering me the opportunity to work on this project.

My opponent, David Bäckström, should also be mentioned here for giving me valuable feedback which helped me improve the thesis.

Last, but absolutely not least, I would like to thank all of my friends for being there all the time. We really had a lot of good times together.


Acronyms

Context - All the information associated with a processor thread, for example all registers, the program counter, flags and more.
Thread - A lightweight process.
ILP - Instruction level parallelism.
TLP - Thread level parallelism.
CAM - Content Addressable Memory; an associative memory.
TCAM - Ternary CAM.
GPP - General-purpose processor.
ASIC - Application Specific Integrated Circuit. Dedicated hardware for a specific function or algorithm.
ASIP - Application Specific Instruction-Set Processor. A processor whose instruction set has been adapted to a certain application.
NP - Network Processor. A programmable hardware device designed to process packets at high speed. Network processors can perform protocol processing quickly.
PP - Protocol processor. A processor specialised for protocol processing.
Protocol - A set of message formats and the rules that must be followed to exchange those messages.
OSI - The interconnection of open systems in accordance with standards of the International Organization for Standardization (ISO) for the exchange of information.
NIC - Network Interface Controller.
Ingress - Traffic which comes from the network into the network controller; incoming traffic.
Egress - Traffic which comes from the network controller and is destined for the network; outgoing traffic.
IP - Internet Protocol, the network layer protocol of the TCP/IP suite. A connectionless, best-effort, packet-switching protocol.
TTL - Time To Live (field in the IP header). Defines how many router hops a packet can take before it is discarded.
MAC - Media Access Control.
ATM - Asynchronous Transfer Mode. A high speed network protocol which uses 53-byte "cells".
AAL5 - ATM Adaptation Layer Five. Used predominantly for the transfer of classic IP over ATM.
WAN - Wide Area Network.
LAN - Local Area Network.
HPA - Header Processing Applications.
PPA - Payload Processing Applications.
ISA - Instruction Set Architecture.
NAT - Network Address Translation.
PP32 - Infineon's 32-bit Packet Processor.


Contents

1 Introduction
  1.1 Background
  1.2 Objectives
  1.3 Methods
  1.4 Time
2 Background
  2.1 Computer Networks
  2.2 Network Processors
  2.3 Multithreading
  2.4 System Level Methodology
  2.5 Area Efficiency
3 Previous Work
  3.1 Spade
  3.2 StepNP
  3.3 NP-Click
4 Proposed Solution - EASY
  4.1 Difference from Previous Work
  4.2 System Modelling
  4.3 Memories
  4.4 Coprocessors
  4.5 Processors
5 Simulation 1 - assemc
  5.1 Introduction
  5.2 Application modelling
  5.3 Architecture modelling
  5.4 The simulation goal
  5.5 Parameters
  5.6 The simulation
  5.7 Results
  5.8 Conclusions
6 Simulation 2 - NAT
  6.1 Introduction
  6.2 Network address translation
  6.3 Partitioning
  6.4 Cost Functions
  6.5 The goal
  6.6 Parameters
  6.7 Possible settings and questions
  6.8 Packet lifetime
  6.9 Simulation results
7 Conclusions
8 Future Work
  8.1 Introduction
  8.2 Suggested features
A Modeling Language Microinstructions
  A.1 Executing commands
  A.2 Statistics instructions
  A.3 Program flow instructions
  A.4 Interrupt instructions
  A.5 Switch instructions
  A.6 Semaphore instructions
B Architecture File
  B.1 Introduction
  B.2 Semaphores declarations
  B.3 Memory declarations
  B.4 Coprocessor declarations
  B.5 Processor declarations
  B.6 Dump statement
  B.7 Queue statement
  B.8 Penalty statement
C Simulator Features


Chapter 1

Introduction

1.1 Background

Network processor systems are traditionally multiprocessor systems that share some common resources, such as memories, coprocessors and more. When a processor wants to use a resource that is currently busy with a previous request, it has to wait until the resource is free before it can issue its own request. While waiting, it normally does nothing useful, but if the processor supports multithreading it can start working on something else and later, when the resource is free, continue where it left off.

This processor feature is called multithreading and can be implemented in many different ways, depending for example on how many threads the processor can work on at the same time (though not necessarily simultaneously) and on the length of the delay required to switch from one thread to another. The different techniques differ a lot in hardware cost, which occupies expensive chip area, so finding an optimal technique for a given system can lead to high performance while still keeping the area low.


Figure 1.1: Different architectural choices for a fixed chip area: three single-threaded processors versus two multithreaded processors with three contexts each.

For a network processor system, the individual processors traditionally have a small chip area, so introducing multithreading is relatively expensive. Instead of having one processor with a high number of threads, the same area could be used to deploy two single-threaded processors, or a similar arrangement, as shown in figure 1.1. Finding the optimal solution by hand will be very difficult, or even impossible, for a large system.

Processor design decisions must be made early in the design phase to cut down the amount of time spent on development, so a methodology is needed that produces early numbers which, even though they might be uncertain and approximate, give a hint on how the next development step should be taken. At this stage, there are no instruction set architecture simulators and no definitive target applications.

A solution to this problem is the system level design exploration simulator for multithreaded network processor systems that was created for Infineon Technologies by the thesis author. It allows an architecture and an application to be described at a high level in a short amount of time, while still generating important information on how the system behaves with respect to multithreading. The architecture and application can then be changed or refined to find an optimal solution.

1.2 Objectives

The main objective of this thesis is to find the optimal multithreading solution for a given network processor system and application, and to be able to easily change the architecture or application to reflect the uncertainty of the early design phase. More specifically, the objectives are:

• To design a simulator that can generate performance numbers suitable for decision making early in the design phase

• To simulate a benchmark from Infineon Technologies using the proposed simulator and extract interesting results.

• To design a larger benchmark to see if it is possible to model a complex network processor system with many processors and shared resources


1.3 Methods

First, a theoretical background in network processors, processor architecture and specifically multithreading was collected from various papers, literature and other publications. Once this material had been thoroughly studied, different benchmarks and other documentation were studied at Infineon to get a good overview of how their typical network systems are designed. This was later used to design and implement a high level simulator, which was written in C++ due to the author's preference.

1.4 Time

This thesis work was performed during a period of 20 weeks in accordance with the requirements at Linköping University.


Chapter 2

Background

2.1 Computer Networks

2.1.1 Introduction

This is only a short introduction to explain some of the network terms that will be used throughout this report.

2.1.2 Layered communication

The end-to-end communication over a network can be broken down into several layers. Every layer only passes information up to the layer above it or down to the layer below it, and from a layer's point of view it communicates directly with the corresponding layer at the other endpoint, since the other layers make the rest of the communication transparent.

Layer models

The Open Systems Interconnection (OSI) reference model divides communication into seven distinct layers, seen in figure 2.1. By far the most common model on the internet is the TCP/IP model, which divides the communication into four layers:

Application layer

This layer corresponds to the application that the user is running and that tries to communicate. This could, for example, be a web browser (using the HTTP protocol) or an e-mail client (using the POP/SMTP protocols).

Figure 2.1: Layer models and protocol examples (OSI layers versus TCP/IP layers, with example protocols such as FTP/HTTP/SMTP, TCP/UDP, IP, ARP and Ethernet).

Transport layer

Here, TCP and UDP are the most common protocols. TCP creates a reliable stream channel between two endpoints, while UDP creates an unreliable packet-based channel. Most communication on the internet is packet based, and TCP must convert the unreliable packet-based communication into reliable streams.

Network layer

Almost all communication over the internet is done using the IP protocol, which is found at this layer. It is a packet-based, unreliable, best-effort service. All hosts on the internet (with the exception of hosts behind a NAT server) can be reached by a globally unique address, called an IP address. IP packets cannot be arbitrarily large, so when the transport layer wants to send something larger than the maximum transfer unit (MTU), IP breaks apart (fragments) the information into smaller packets, which it sends and later rebuilds (reassembles) at the receiving side.

Link layer

This layer represents the network card, the physical wires and so on, and its main task is to transfer the actual data to the receiver. This might include splitting the data into smaller portions, called frames, and adding error checking and similar mechanisms.

The layers do not have to know much about the layers below or above them. The network layer only has to know that it must give the link layer an IP packet, and that it will receive an IP packet at the receiving side. The same goes for the rest of the layers. However, a protocol at the application layer can, for example, decide whether it wants to use TCP or UDP, which are in a layer below it.

2.2 Network Processors

2.2.1 Architecture

Introduction

The Internet has significantly influenced the way we communicate and build our infrastructure, and new applications that make use of the Internet in new ways are invented every day. Its rapid growth has made it possible for a lot of users to benefit from it all the time. These are two reasons why it is difficult to develop good network system architectures: the requirements change often and the speed requirements increase rapidly. Like early computer designers, the builders of network processor architectures have experimented a lot with architectural aspects, such as functional units, interconnection and other strategies, since there is a lack of experience and no proven correct solution. Because of this, there is no standard architecture, as there is for most general purpose computing.

The First Network Processing Systems

In the 1960s and 1970s, the first network systems were introduced, and at that time the CPU of a standard computer was fast enough to keep up with the relatively low speeds that the networks operated at. In the following years, CPU speeds increased at a higher rate than network speeds, and a small computer could take care of more and more complicated tasks, such as IP forwarding.

Increasing Network Speeds

Nowadays, network speeds are increasing much faster than processor performance. Only a few years ago, 100Base-T replaced the old 10Base-T networks at companies, and now many are moving to 1000Base-T.

             Small packets (µs)   Large packets (µs)
10Base-T     51.20                1214.40
100Base-T     5.12                 121.44
OC-3          3.29                  78.09
OC-12         0.82                  19.52
1000Base-T    0.51                  12.14
OC-48         0.21                   4.88
OC-192        0.05                   1.22
OC-768        0.01                   0.31

Table 2.1: Network speeds and the time available for small and large packets.

As seen in table 2.1 (and with some math), a single processor running at 1 GHz, which is considered a high frequency, only has time to execute around 500 cycles for every small packet when working with 1000Base-T. This is a very small number and insufficient for most processing. A router with 16 interfaces at this speed can only spend around 31 cycles on each packet, which is far from enough. In other words, the increasing network speeds are a big problem for network processing.
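As a back-of-the-envelope check of these figures (assuming a 1 GHz clock and the 0.51 µs small-packet time from table 2.1; exact numbers depend on the assumed minimum frame size and overhead), the per-packet cycle budget is:

\begin{align*}
\text{cycles per packet} &= f_{clk} \cdot t_{packet} = 10^{9}\,\mathrm{Hz} \cdot 0.51 \cdot 10^{-6}\,\mathrm{s} \approx 510 \\
\text{cycles per packet, 16 interfaces} &\approx 510 / 16 \approx 32
\end{align*}

which matches the roughly 500 and 31 cycles quoted above.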

Different Architectures

Since single general-purpose CPUs are insufficient for processing packets at a high rate, many alternative architectural designs have been explored[2] to overcome the performance problem. These techniques have previously proven to increase performance for certain types of computation-heavy processing:

• Fine-grained parallelism (Instruction level parallelism)

• Symmetric coarse-grained parallelism (Symmetric multiprocessors)

• Asymmetric coarse-grained parallelism

• Special-purpose coprocessors (ASIC)

• NICs with onboard stacks

• Pipelined system

All these solutions have their advantages and disadvantages, and a trade-off has to be made when choosing among them, unless a completely different solution is selected instead. This is because network processing is a collective name for many different tasks, and their diversity makes it impossible to find a generally ideal solution.

Fine-grained Parallelism (Multi-issue)

This is a widely used technique for high performance systems. The idea is that a processor performing operations in parallel should be able to process much more data at a time. In a normal program, some instructions can be run at the same time, and the detection of parallelism can either be performed at compile time (as for VLIW processors) or on the fly while the program is executing, as for superscalar processors.

It has been shown that for network systems, instruction-level parallelism does not achieve significantly higher performance[16] and the architectural costs are high.

Symmetric Coarse-grained Parallelism (GPP)

Instead of using instruction-level parallelism (instructions that can be run in parallel), thread level parallelism (TLP) can be exploited. This requires that the program is structured in such a way that portions of the code (threads) can be run independently of each other. TLP makes it possible to run an application on multiple processors at the same time, and this has a few advantages, one being that no modification has to be made to the processor and normal small processors can be used.


It is also fairly easy to scale up the system (just increase the number of processors), but unfortunately the performance will not scale as easily. One major bottleneck is that the processors often are connected to the same memory, which they must share. The same is true for other shared resources such as buffers, coprocessors and more. In the worst case, in a system of N processors, a processor might have to wait for as many as N − 1 other processors before it can communicate with a shared resource.

Since normal general purpose processors are used, the same amount of processing per packet is still required, but since many more packets can be worked on simultaneously, the overall performance will increase.

Special-purpose Coprocessors (GPP+ASIC)

Another solution is to have a general-purpose processor together with a special-purpose coprocessor (ASIC) that can perform some specific operations very fast. The coprocessors are very simple in design (they do not have to do anything except what they are built for) and can be controlled by the general purpose processor. If the operations that account for most of the processing of a packet are performed by the ASIC, the performance of the whole system will improve significantly.

It is also possible to have an ASIP (Application Specific Instruction-set Processor) together with coprocessors, which can give even higher performance gains, but the system becomes slightly less flexible.

Application Specific Instruction-set Processors (ASIP)

Another possibility to increase performance is to use specialised processors. Each processor is very good at performing its own task, for example IP fragmentation, while another takes care of another layer or similar. This can increase the performance a lot, but there are drawbacks: such a processor is more difficult to program than a general processor, it is only good at what it was intended for, and if the requirements change, the performance can degrade significantly. Another big disadvantage is that specialised processors are expensive to design and build.

A trade-off can be made in how far the processor is specialised. There are some tasks common to all protocol processing applications[11], such as bit and port operations, so if these are optimised, the performance of many applications, even ones not yet known, will increase, and if the optimisations are not as extreme, the processor will also be more flexible.

Pure ASIC Implementation

For ultimate performance, but at the cost of the least flexibility, specialised hardware can be used directly. Designing an ASIC is expensive, and since it cannot do anything other than what it was designed for, its use is limited in most applications.


Pipelined Architecture

Instead of processing an entire packet, or a certain layer of a packet, in one place, the processing can be broken into smaller operations that are performed sequentially by different processing elements. For example, when an IP packet is forwarded, first the checksum is verified, then the TTL (Time To Live) field is decremented, then the destination address is looked up in a table, and so on.

One processing element could verify the checksum and then pass the packet to the next processing element, which decrements the TTL, while at the same time a new packet is fetched by the checksum verification element. This is repeated for all remaining operations that have to be done for a packet. Even though each packet takes as long to process as on a simple processor, more packets are processed simultaneously, thus increasing the total throughput of the system. The hardware requirement for each stage is low since it only has to do a small task, but balancing the stages is difficult (the chain will not be faster than its slowest link), and the system will be difficult to program.

2.2.2 Packet Processing Tasks

Introduction

To be able to understand how the hardware architecture for network processors should be built, it is essential to know what a packet processor actually does.

A network processor can work in very different areas and at very different layers (see section 2.1), and because of the diversity among these tasks it is very difficult to categorise them, to compare them, and to understand why a specific architecture was chosen. A network processor that mainly works on the link layer (Ethernet frames, for example) but also processes packets at higher layers is usually optimised for the low layer processing, even though it performs a lot of higher layer functions.

The functionality of network processing is often divided into these categories:

• Address lookup and packet forwarding

• Error detection and correction

• Fragmentation, segmentation and reassembly

• Frame and protocol demultiplexing

• Packet classification

• Queuing and packet discard

• Security: authentication and privacy

• Traffic measurement and policing


Address lookup and packet forwarding

Address lookup occurs frequently at several layers. In Ethernet switching, the MAC address is looked up in a table to decide where to forward the frame. At the IP level, the IP address is looked up during routing to decide where to forward the packet, and so on. There are many more cases, and in all of them the system maintains a table and uses it to perform the lookup. The lookup can be more or less advanced and the complexity differs a lot. While Ethernet switching lookups are fairly easy (they look for an exact match of a MAC address in a fairly small table), IP lookups can be more advanced, requiring a partial match (longest prefix matching) in a table of up to 80,000 entries. In high performance solutions, dedicated hardware for table lookups and maintenance is required, and content addressable memories (CAMs) are common.
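To make the difference between exact-match and longest-prefix lookups concrete, here is a minimal sketch in C++ (the table contents and function names are hypothetical, and a linear scan is used only for clarity; real routers use tries, compressed tables or TCAMs):

    #include <cstdint>
    #include <cstdio>
    #include <vector>

    struct Route {
        uint32_t prefix;    // network prefix (host byte order)
        int      length;    // prefix length in bits (0-32)
        int      next_hop;  // illustrative outgoing interface number
    };

    // Longest-prefix match by linear scan: among all routes whose prefix
    // matches the address, pick the one with the longest prefix.
    int lookup(const std::vector<Route>& table, uint32_t addr) {
        int best_len = -1, best_hop = -1;
        for (const Route& r : table) {
            uint32_t mask = (r.length == 0) ? 0 : ~0u << (32 - r.length);
            if ((addr & mask) == r.prefix && r.length > best_len) {
                best_len = r.length;
                best_hop = r.next_hop;
            }
        }
        return best_hop;  // -1 means no matching route
    }

    int main() {
        // Hypothetical table: 10.0.0.0/8 via interface 1, 10.1.0.0/16 via 2.
        std::vector<Route> table = {{0x0A000000u, 8, 1}, {0x0A010000u, 16, 2}};
        std::printf("next hop: %d\n", lookup(table, 0x0A010203u));  // 10.1.2.3 -> 2
    }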

Error detection and correction

Error detection is a very common and heavily analysed part of protocol processing. The need for it is clear: bit errors occur and make packets corrupt, either because of signalling problems when transferring the packet or because of faulty hardware or software. Error detection is present in many protocols, e.g. as checksums, and the computational power required to verify or calculate a checksum can be large compared to the rest of the processing at that layer (e.g. the CRC in Ethernet).

In most network system solutions, dedicated hardware for calculating Ethernet CRC checksums is present, since a checksum must be calculated for every incoming and outgoing Ethernet frame. For higher level protocols, checksum calculation is often performed in software.
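For the higher layer case, the checksum used by IP, TCP and UDP is a 16-bit one's-complement sum, which is simple to compute in software; a straightforward C++ version (in the style of RFC 1071) looks roughly like this:

    #include <cstddef>
    #include <cstdint>

    // Internet checksum: one's-complement sum of the data viewed as 16-bit
    // big-endian words, with the final sum inverted.
    uint16_t internet_checksum(const uint8_t* data, size_t len) {
        uint32_t sum = 0;
        for (size_t i = 0; i + 1 < len; i += 2)
            sum += (uint32_t(data[i]) << 8) | data[i + 1];
        if (len & 1)                       // odd length: pad the last byte
            sum += uint32_t(data[len - 1]) << 8;
        while (sum >> 16)                  // fold carries into the low 16 bits
            sum = (sum & 0xFFFF) + (sum >> 16);
        return uint16_t(~sum);
    }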

Error correction is not very common in today's systems, since the layered OSI model puts most of the responsibility on the lower layers (error correction would have to be implemented there), and the additional data and processing required for correcting bit errors is not insignificant.

Fragmentation, segmentation and reassembly

Fragmentation is what IP does to split large higher layer packets into smaller chunks that fit inside an IP packet, and segmentation is the corresponding operation for splitting large AAL5 packets into ATM cells. Fragmentation and segmentation are fairly straightforward, while reassembly can be more complex: the length of the full packet is not known in advance, the packets can be delayed or lost, and they can even arrive in the wrong order. This makes reassembly both computationally complex and demanding of extra resources such as large memories and timers.
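The bookkeeping for the straightforward direction can be illustrated with a small C++ sketch (the helper and field names are invented here, not taken from the thesis); it splits a payload so that every fragment except the last carries a multiple of eight bytes, since the IP fragment offset field counts 8-byte units:

    #include <algorithm>
    #include <cstdio>
    #include <vector>

    struct Fragment {
        int  offset;          // payload byte offset of this fragment
        int  length;          // payload bytes carried by this fragment
        bool more_fragments;  // MF flag: more fragments follow
    };

    // Split `payload` bytes into fragments that fit a link with the given MTU.
    std::vector<Fragment> fragment(int payload, int mtu, int ip_header = 20) {
        int max_data = ((mtu - ip_header) / 8) * 8;  // multiple of 8 bytes
        std::vector<Fragment> out;
        for (int off = 0; off < payload; off += max_data) {
            int len = std::min(max_data, payload - off);
            out.push_back({off, len, off + len < payload});
        }
        return out;
    }

    int main() {
        // Example: 4000 payload bytes over a 1500-byte MTU give fragments of
        // 1480, 1480 and 1040 bytes at offsets 0, 1480 and 2960.
        for (const Fragment& f : fragment(4000, 1500))
            std::printf("offset=%d length=%d MF=%d\n", f.offset, f.length, f.more_fragments);
    }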

Frame and protocol demultiplexing

Demultiplexing is an important concept in the OSI layer model. For example, when a frame arrives, the frame type is used to decide to which upper layer protocol it should be passed, for example IP or ARP. This is used throughout the layers.


Packet classification

Classification means mapping a packet to a flow or category, which is a very broad concept. For example, a flow can be defined as:

• A frame containing an IP datagram that carries a TCP segment
• A frame containing an IP datagram that carries a UDP datagram
• A frame that contains something other than the above

These flows are static and decided before any packets arrive. It is also possible to have dynamic flows that are created during processing; an example would be mapping a certain IP source address to a flow for extra treatment. Classification can work with data from several layers and can, in contrast to demultiplexing, be stateful. Looking up a packet among several flows might require a partial match search using a ternary CAM (TCAM).
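A minimal sketch of the three static flows listed above, in C++ (the pre-parsed field names are hypothetical; a real classifier parses the headers itself and typically uses a hash table or TCAM):

    #include <cstdint>

    // Hypothetical pre-parsed header fields of a received frame.
    struct FrameInfo {
        uint16_t ether_type;   // 0x0800 means the frame carries an IPv4 datagram
        uint8_t  ip_protocol;  // 6 = TCP, 17 = UDP
    };

    enum class Flow { TcpOverIp, UdpOverIp, Other };

    // Map a frame to one of the three static flows defined above.
    Flow classify(const FrameInfo& f) {
        if (f.ether_type == 0x0800) {
            if (f.ip_protocol == 6)  return Flow::TcpOverIp;
            if (f.ip_protocol == 17) return Flow::UdpOverIp;
        }
        return Flow::Other;
    }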

Queuing and packet discard

Packet processing is characterised as store-and-forward, since packets are normally stored in memory while they are waiting to be processed. This is called queuing and can be more or less complicated. A simple example is a standard FIFO, which guarantees that packets are processed in the order they arrived, but in more advanced situations it might be necessary to introduce priorities among the queues to allow packets from a certain flow to be processed more often than those from the remaining flows.

Security: Authentication and privacy

Some protocols provide authentication and privacy, which both rely on encryption, and network systems will in some cases have to process these packets. The additional processing required for authentication or privacy covers the entire packet and is very computation intensive, so if high performance is essential, dedicated hardware is required.

Traffic measurement and policing

Traffic analysers perform traffic measurement to gather statistical information about what type of data flows through the system. This requires that all frames are analysed and that their contents are inspected to examine upper layer header fields and payloads. This information might be used for billing or similar purposes.

Traffic policing is similar, but is used to restrict access in some way. For example, for a customer who has bought a connection with limited bandwidth, traffic policing is used to drop packets that exceed this rate.
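One common way to implement such a policer (the thesis does not prescribe a particular mechanism) is a token bucket, sketched here in C++ with invented parameter names:

    #include <cstdint>

    // Token bucket policer: tokens accumulate at `rate` bytes per second up to
    // `burst` bytes; a packet conforms only if enough tokens are available.
    class TokenBucket {
    public:
        TokenBucket(double rate, double burst)
            : rate_(rate), burst_(burst), tokens_(burst) {}

        // Returns true if the packet may be forwarded, false if it should be
        // dropped. `now` is the current time in seconds.
        bool conforms(double now, uint32_t packet_bytes) {
            tokens_ += (now - last_) * rate_;        // refill since last packet
            if (tokens_ > burst_) tokens_ = burst_;  // cap at the burst size
            last_ = now;
            if (tokens_ >= packet_bytes) {
                tokens_ -= packet_bytes;
                return true;
            }
            return false;
        }

    private:
        double rate_, burst_, tokens_;
        double last_ = 0.0;
    };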

Traffic shaping

Traffic shaping is similar to traffic policing, but enforces softer limits and attempts to smooth the traffic without dropping packets too drastically. This often requires large buffers and good timer management.

2.2.3 A Typical Network Processor System

Figure 2.2: The IXP1200 architecture, simplified (six microengines, a StrongARM core with cache memories, coprocessors, FIFOs, scratchpad SRAM, the FBI engine, and SRAM, SDRAM and PCI controllers).

Introduction

The title is misleading: there is no typical network system that is representative of all protocol processing. In the last few years, however, some vendors have created architectures that are composed quite similarly. The Intel IXP architecture, one of Intel's recent network processor product families, is one example and will be studied briefly here. A simplified overview can be found in figure 2.2.

Fast path and slow path

The IXP performs both control plane and data plane processing, which can be called slow path and fast path processing respectively.

Architecture

Host processor

Traditionally, there is at least one embedded RISC processor, or another type of GPP, that takes care of some slow path processing, higher level protocols and other administrative tasks, such as handling exceptions and updating routing tables. In the IXP network processor, this is a StrongARM.


Packet processors

For the fast path, there are a number of small packet processors with a specialised instruction set that is more limited than that of a traditional RISC. In the IXP network processors these are called microengines; they have multithreading support for four threads, which all share the same program memory. The special instructions of these ASIPs are optimised for packet processing, and the processors can run at a high clock frequency. The processors do not run an operating system.

Coprocessors and other functional units

To offload some of the tasks that require high computational power, there are coprocessors on the chip. They perform common tasks in protocol processing, such as computing checksums, looking up values in tables (CAMs) and more. Other functions, such as timers, can also be provided by coprocessors to make life easier for the programmer.

Memories

The packet processors have access both to private, fast, small memories and to larger shared memories. A large off-chip SDRAM, a faster external SRAM, internal scratchpad memories and local register files for all microengines can be used for storing data. Each of these memories is under the full control of the programmer, and there is no hardware support for caching of data, except for the StrongARM. There are also a number of transmit and receive FIFOs to store packet information in.

2.3 Multithreading

2.3.1 Introduction

Multithreading is a processor feature that reduces the inefficiency caused by long instruction latencies and other events that prevent the processor from fully utilising its resources[15]. When, for example, a thread is blocked by a multi-cycle operation, the processor can perform a context switch and resume the operation of another thread. If this is handled fast enough, the amount of time the processor spends idle waiting for an internal or external event is reduced, making it more efficient. In the past, multithreading was considered too expensive a technique for achieving higher performance because the required hardware changes were too large. In later years this view has changed, for the following reasons.

The first reason is the ever increasing memory gap. Over time, processor speed has steadily increased between generations, and since processors can perform computations faster, they also need data faster. The problem is that memory speed has not increased at the same pace as processor speed. The result is that the processor has to wait for the memory to complete its request, spending cycles waiting when it could instead perform useful computation. Multiprocessor systems that communicate with a shared remote memory have this problem in particular. Remote memories, often off-chip, are generally large, which results in a longer access time, and shared memories can often only serve one request at a time, which can lead to collisions requiring requests to be queued, delaying them even further.

Another reason is the increased use of accelerators, or dedicated coprocessors, that perform some computation for a host processor. In high performance systems, it is common for a host processor to offload some of its time-consuming and common tasks to dedicated hardware that can perform the computation much faster. During this time, the processor generally has to wait until the coprocessor has finished its computation.

2.3.2 Context Switch Overhead

The context switch overhead is the time it takes to switch from one thread to another. This usually means replacing the processor's internal state, such as registers, flags, the program counter and other data associated with a certain thread. The overhead depends on the technique used to save the context and on how much state must be saved. If the context switch overhead is larger than the latency that triggered the switch, the processor performance will of course degrade. For short latencies, a fast switch technique is required, which often requires quite complex hardware.

These are some techniques that are common in multithreaded solutions:

• All contexts kept in memory and no hardware support
• All contexts kept in memory but with hardware support
• All contexts kept in dedicated hardware registers
• Some contexts kept in hardware and some in memory


All contexts kept in memory and no hardware support

This is often the case for common general purpose processors (GPPs), where software running on the processor takes care of the context switching. In most cases an operating system is responsible for changing the currently active thread, and this is typically done by saving all registers of one thread to one location in memory and then loading another thread's registers from another location. There is little or no additional hardware required, but the switch overhead is high. It depends on the number of registers in the processor, the memory access time and how efficiently the code can save the registers of the old thread and load the registers of the new thread. This normally takes a few hundred cycles, so short latencies (on-chip memory or fast coprocessor accesses) cannot be hidden using this technique.

Operating systems normally handle multithreading by switching between the different threads at a fixed interval, called a time slice. The time slice is often very small, so that the user does not notice this and perceives all threads as running at the same time. By assigning different priorities to the threads, the length of the time slice can be varied so that a certain thread gets more CPU time than another. This technique is called preemptive multithreading and differs from cooperative multithreading, which was more common earlier and relied on each program voluntarily telling the OS when it was finished. If the programs did not behave as expected, this could lead to dangerous results, but the complexity of the system was smaller. For hard real-time systems, cooperative scheduling is still used when the tasks can guarantee that they finish within their designated time slots, since the complete execution is then deterministic (static scheduling).

All contexts are kept in memory but with hardware support

In this case, dedicated hardware takes care of storing and loading the registers' contents from a high performance memory. The hardware required is fairly small, and the context switch overhead depends on the number of registers and how fast the memory is. An estimate is around tens of cycles.

All contexts are kept in hardware

This requires multiple copies of the register file, flags, program counter and all other internal state. It can lead to very fast switch times, since no data has to be saved to or loaded from memory. To be able to switch in one cycle (or zero, depending on how you count), copies of the pipeline also have to be kept for every thread. If the pipeline is not copied, it has to be flushed whenever a context switch happens, which costs as many cycles as the number of pipeline stages before the execute stage. This technique, especially if the pipeline is replicated, requires quite a lot of additional hardware, but it allows very fast context switches. It requires no software support, although software can be used, but the hardware logic for the context switching will be fairly advanced.

Some contexts kept in hardware and some in a fast memory

This can be a reasonable trade-off if the number of threads to run is high. In this case, some thread contexts are kept in hardware and the rest in memory. With the help of dedicated hardware, a context that is not currently executing can be replaced by another thread while the processor is executing. This can lead to a fast context switch, but in the worst case the delay will be high. The hardware required is moderate compared to having all contexts in hardware, and the performance can vary between moderate and good depending on how much effort is spent on preloading and on what the target application looks like.

2.3.3 Multithreading Techniques

Explicit multithreading techniques are divided into two groups[1]: those that issue instructions from a single thread every cycle and those that issue from multiple threads every cycle, as can be seen in figure 2.3.

Figure 2.3: Multithreading techniques. Explicit multithreading is divided into techniques issuing from a single thread per cycle (Interleaved Multithreading, IMT, and Blocked Multithreading, BMT) and techniques issuing from multiple threads per cycle (Simultaneous Multithreading, SMT).

The techniques that only issue from one thread per cycle are most efficient when applied to RISC or VLIW processors, and are further divided into two groups:

• Interleaved multithreading (IMT). In every cycle, a new instruction is fetched from a thread different from the one currently running.

• Blocked multithreading (BMT). A specific thread runs until an event occurs that forces the processor to wait for the result of that event, for example a latency due to a memory access. When this happens, another thread is invoked through a context switch.

Figure 2.4: Interleaved multithreading handling a long latency (three threads interleaved cycle by cycle; short instructions, long latency instructions and the active thread are indicated).

Interleaved multithreading, also called fine-grained multithreading, means that the processor performs a context switch on every cycle. One gain of doing this is that control and data dependencies between instructions in the pipeline can be eliminated[17]. This can remove a lot of hazard-handling processor logic, such as the forwarding that resolves true data dependency hazards.

Figure 2.5: Interleaved multithreading eliminating branch penalties (pipeline stages FI, DI, EX and WB over cycles 1-9, with instructions from four threads interleaved).

As can be seen in figure 2.5, branch penalties due to mispredicted target addresses can also be avoided, since the processor does not need to fetch instructions from the same thread until the branch condition has been evaluated. In the figure, instruction 'A' is a conditional jump whose condition cannot be determined until the execute stage (EX). It is fetched at cycle 1, but in the next cycle an instruction from another thread is fetched. At cycle 3 the 'A' instruction is executed, and now the jump destination is known. At cycle 5 the next instruction from the first thread is fetched, and since we know exactly which instruction to fetch, there is no branch penalty.

This technique requires at least as many threads as there are pipeline stages in the processor. By not issuing instructions from a thread that is blocked on a long latency operation, longer latencies can also be hidden, as seen in figure 2.4.

Blocked Multithreading

Blocked multithreading, or coarse-grained multithreading, means that the processor executes instructions from one thread until an event occurs that causes latency, forcing the processor to be idle. This triggers a context switch, and the processor executes cycles from another thread instead. Compared to IMT, a smaller number of threads is needed and a thread can continue at full speed until it is interrupted[17]. An example can be seen in figure 2.6.

Figure 2.6: Blocked multithreading with a context switch overhead of one cycle (two threads alternating on long latency memory and coprocessor accesses).

Figure 2.7: Blocked multithreading models: static (explicit switch, implicit switch) and dynamic (switch on cache miss, switch on signal, switch on use, conditional switch).

Blocked multithreading is classified by the event that triggers a context switch, and the models can be divided as seen in figure 2.7.

(37)

Static models

In this case, a context switch is invoked by an instruction. This gives the programmer full control of the context switching, and if the instruction format is simple, these instructions can be identified very early in the pipeline. This leads to a low overhead, since the processor can fetch instructions from another thread on the next cycle and the pipeline does not have to be flushed to remove old instructions from the previous thread. There are two cases of static models:

• Explicit switch. In this case there is a specific switch instruction that forces the processor to perform a context switch. Since this instruction does nothing else useful, the overhead is one additional cycle if it is detected at the first pipeline stage, and more if it is detected later.

• Implicit switch. In this case the instruction does something useful, for example loads from memory or branches to a target address, but also performs a context switch so that the next instruction fetched comes from another thread. "Switch on branch" can avoid branch prediction and speculative execution if the instruction is identified early enough in the pipeline. This results in an additional overhead of zero cycles if it is detected at the first pipeline stage, but there are drawbacks as well. For example, with "switch on load" there will be very many switches during the execution of the program, which requires a very fast context switch, preferably in zero cycles. Some architectures reduce this by only switching when communicating with an external memory and not when reading from a local memory.

Dynamic models

Dynamic models are those where a context switch is triggered by a dynamic event. In these models the decision to perform a context switch is made later in the pipeline, which either requires a pipeline flush or multiple copies of the pipeline, as discussed in section 2.3.2.

Since the context switch triggering is dynamic and handled by the processor, the programmer does not have to consider it when programming, and unnecessary or very frequent context switches can be avoided since the processor has more knowledge of its current internal state and its resources.

• Switch on cache miss. This model switches to another thread if a load or store instruction results in a cache miss. The gain of this method is that a context switch only occurs when a long latency is expected, but it can add overhead since checking the cache memory takes some time.

• Switch on signal. This model switches to another thread when, for example, an interrupt, trap or message arrives, which is often triggered by an external event.

• Switch on use. The switch on use model switches when an instruction tries to use a value that has not yet been fetched from memory (for example due to a cache miss) or is otherwise unavailable. A compiler or programmer with knowledge of this can take advantage of it and try to load the requested values as early as possible, before they are needed. To implement this, a "valid" bit is added to each register; it is cleared on a memory load and set when the value is available.

• Conditional switch. This leads to a context switch only when a condition is fulfilled. The condition can, for example, be defined as whether a group of load/store instructions resulted in any cache miss. If all load/store instructions hit in the cache, the thread continues its execution, but if any of them missed, the processor performs a context switch after the last load/store instruction. When control returns to the thread, all values requested prior to the context switch are hopefully available.

Figure 2.8 shows possible places for a context switch using three different models.

load r1, [mem]
load r2, [mem]
load r3, [mem]
add  r4, r1, r2
load r5, [mem]
add  r6, r3, r5
load r7, [mem]
add  r8, r7, r7
add  r9, r4, r6
mul  r10, r8, r9

Figure 2.8: Possible context switch points in the instruction sequence above for the switch on load, explicit switch and switch on use models.

2.3.4 Multithreaded Multi-issue Processors

Combining multithreading with multi-issue processors such as superscalar or VLIW processors can also be a very efficient design solution[16].

The efficiency problem with multi-issue processors is that they are limited by instruction dependencies (i.e. the available instruction level parallelism, ILP) and by long latency operations. The effects are called horizontal waste (issue slots not being filled because of low ILP) and vertical waste[16] (due to e.g. long latency operations). This can be seen in figure 2.9.

Figure 2.9: Horizontal and vertical waste (issue slots over cycles; in the example, horizontal waste accounts for 11 slots and vertical waste for 15 slots).

Multithreading, when used on single-issue processors, can only attack vertical waste (there is no horizontal waste on single-issue processors), while on multi-issue processors both vertical and horizontal waste can be reduced.

There are a number of possible design choices for multithreading together with multi-issue processors:

• Fine-grain multithreading, where only one thread can issue instructions in any given cycle but can use the entire issue width of the processor. This is normal multithreading as described earlier and will effectively reduce vertical waste but not horizontal waste.

• Simultaneous multithreading with full simultaneous issue. This is the least realistic model. In simultaneous multithreading, all hardware threads are active simultaneously and compete for access to all available hardware resources. When one thread has filled its issue slots for a certain cycle, the next thread can fill the remaining slots. This is repeated for all threads until there are no more issue slots available (i.e. they have all been filled) or there are no more threads. The order in which the threads are allowed to fill the slots can be decided by different thread priorities, or cycled using for example round robin scheduling to give a fair distribution.

• Simultaneous multithreading with single issue, dual issue or four issue. In these cases, the number of instructions every thread can issue per cycle is limited. With dual issue, a thread can issue two instructions per cycle, so filling an eight-issue processor requires at least four threads.


• Simultaneous multithreading with limited connection. In this case, a hardware resource can process instructions from a limited number of threads. For example, if there are eight threads and four integer units, every unit is connected to two threads. The functional units are still shared, but the complexity is reduced.

Using simultaneous multithreading, not only can cycles otherwise lost to latencies be filled with instructions from other threads, but also unused issue slots in every cycle, thus exploiting both instruction level parallelism (ILP) and thread level parallelism (TLP), as can be seen in figure 2.10. A well-known implementation of this is the Hyper-Threading feature of modern Pentium 4 processors.

Figure 2.10: Simultaneous multithreading filling issue slots (issue width over cycles) with instructions from threads 1-3.

2.3.5 Multiprocessor Systems

Multithreading can be seen as a type of parallel hardware that exploits program parallelism by overlapping the execution of multiple threads. Multiprocessors and multithreading both exploit parallelism in the application code, but they do it somewhat differently: multiprocessors execute multiple threads in parallel on multiple hardware resources, such as processing elements and caches, while multithreading executes multiple threads on the same hardware resources. The performance of a multiprocessor system will always beat that of multithreading when there are as many threads as processors, but this is not a fair comparison, since the hardware cost is larger for the multiprocessor system, where everything must be replicated (functional units, decoder logic, caches etc). Multithreading, on the other hand, overlaps execution instead of performing it in parallel, which only requires a part of the processor to grow with the number of threads.

Communication between processors is also a non-trivial problem compared to the multithreaded case. Multiprocessor systems have to synchronise threads and communicate using a bus or a shared resource, such as an external memory, and there is no simple way of signalling between processors except hard-wired interrupts. Multithreaded systems can communicate faster and more easily using internally shared resources, such as internal memories or shared registers, and internal exceptions can be used for signalling.

2.3.6 Analytically Studying Multithreading

It would be an advantage if the gains of multithreading for network processors could be analysed analytically. This could, for example, be used for guaranteeing a certain line rate when a worst-case execution time is known, or for an initial estimation of whether multithreading is worth investigating further. Unfortunately, the complexity of real-world systems makes it practically impossible to come up with an exact answer, and the uncertainty of the surrounding system, such as the packet distribution, complicates this further and forces us to simplify the description and constraints of the systems.

One possible estimate is to statistically calculate how high the utilisation of a system will be given different multithreading parameters, as described in [14].

Given the total numbers of cycles spent on memory and coprocessor accesses for the entire program, MemC and CopC, the total amount of time a thread has to wait is, under the assumption that the resource is always available (which is a large limitation):

K = MemC + CopC    (2.1)

If the total number of cycles required to execute the program is T (including the time the thread will have to wait for an external resource), the probability p that a thread is waiting for a latency source is:

p = K/T (2.2)

If the processor has n threads, the probability that all of them are waiting at a certain time is p^n, and the probability that not all threads are waiting, i.e. that at least one thread on the processor is doing something, is 1 − p^n.

When a thread is doing something, it is either doing useful work (computation) or performing a context switch, which takes time. The probability that a given thread is in the switch state, that is the probability that it is blocked times the probability that any other thread can run, is p · (1 − p^n). Using this, the probability q that the processor is in a thread-switch state can be derived, given that the time to perform a thread switch is C and the total number of latency sources is L:

q = (L · C / K) · p · (1 − p^n) = (L · C / T) · (1 − p^n)    (2.3)

The processor efficiency, that is the fraction of cycles in which the processor is doing something useful, can then be expressed as:

U = (1 − p^n) − q
  = (1 − p^n) − (L · C / T) · (1 − p^n)
  = (1 − p^n) · (1 − L · C / T)
  = (1 − (K/T)^n) · (1 − L · C / T)    (2.4)

This is an optimistic approximation, since thread and resource dependencies are neglected as described earlier. The characteristics of the latency resources are also ignored, which can have a great impact on the real utilisation of the system. However, it is a good estimate considering how little information has to be provided.
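Under the same assumptions, equation (2.4) is easy to evaluate programmatically; the small C++ helper below (names chosen here, not from the thesis) can be used to sweep the number of threads n for a given workload:

    #include <cmath>
    #include <cstdio>

    // Utilisation estimate from equation (2.4):
    //   p = K/T,  U = (1 - p^n) * (1 - L*C/T)
    // K: total latency cycles, T: total cycles including waiting,
    // n: number of threads, L: number of latency sources,
    // C: context switch overhead in cycles.
    double utilisation(double K, double T, int n, double L, double C) {
        double p = K / T;
        return (1.0 - std::pow(p, n)) * (1.0 - L * C / T);
    }

    int main() {
        // Example workload: K/T = 0.4, 20 latency sources, 2-cycle switches.
        for (int n = 1; n <= 8; ++n)
            std::printf("n=%d  U=%.3f\n", n, utilisation(400, 1000, n, 20, 2));
    }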

Another approach was taken in [9], where the worst case execution time was computed analytically using integer linear programming. It requires a thorough investigation of how the program works and has some major limitations that are needed to make it mathematically solvable: the application may not use recursion or unbounded loops (which is not common anyway), the scheduling must be strict round-robin (which is also very common), and only one latency source, shared by all threads, is supported. The computed WCET (Worst Case Execution Time) figures differ from simulated WCET by 40-400% depending on the application.

As shown above, only very small systems can be described analytically, and only with many limitations, which leads to the conclusion that, for now, simulation is still the fastest way to estimate the impact of multithreading on large systems.

2.3.7 The Downside

When multiple threads are running on the same processor, memory accesses become more frequent and the number of cache accesses also increases. When the threads are running in different portions of the code, many cache accesses will result in cache misses, and this reduces the performance of the system.

A simple solution for the instruction cache is to have a private cache per thread. This can also be considered for data caches, but it imposes a few problems. If the same process is parallelised over multiple threads on the same processor, the memory accesses will often fall in the same memory area. With private caches it is important to maintain cache coherency so that no stale data is present. False sharing can happen, as in multiprocessor systems, unless all caches are updated (either flushed or updated with the new value, which then has to be propagated to all caches) when one cache is written to. Since this is difficult to implement efficiently, private caches are generally not used in multithreaded architectures[4].

An intuitive solution is to instead have a shared cache for all threads in the processor. This, however, often leads to an increased cache miss rate, especially for threads that are running different processes which operate in different memory areas but whose locations in the cache memory are the same and therefore collide. Since the increased cache miss rate is the largest downside for multithreaded systems compared to multiprocessor systems, the cache memory size and implementation can be changed to improve performance. Comparing a multithreaded processor with n threads and m kb of cache memory to a multiprocessor system with n processors, each with m kb of cache memory, is an unfair comparison. [3] and [15] have instead increased the cache memory size of the multithreaded processor to n · m kb and compared it to the multiprocessor system. Both systems then have the same total amount of cache memory, and in most cases this led to 33%-94% fewer cache misses on average for the multithreaded system. This is very application dependent, and in some cases where the cache miss rate was already low, it did not lead to large improvements. In some cases the performance even degraded.

Another solution is to increase the set-associativity to n while keeping the total cache memory size at n · m kb. According to [3], this leads to a 3.5%-88% reduction in cache misses compared to the previous solution with the same cache memory size.
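To make the conflict mechanism concrete, the following Python sketch models n threads with small, disjoint working sets that happen to map onto the same sets of a shared cache, and compares a direct-mapped cache with an n-way set-associative cache of the same total size. The cache geometry, access pattern and LRU replacement policy are invented for illustration and are not taken from [3] or [15].

from collections import OrderedDict

def miss_rate(n_threads, total_lines, ways, line_size=32, accesses=20_000):
    sets = total_lines // ways
    cache = [OrderedDict() for _ in range(sets)]   # per-set LRU list of tags
    misses = 0
    stride = sets * line_size                      # thread bases collide on the same sets
    for i in range(accesses):
        t = i % n_threads                          # round-robin between threads
        k = (i // n_threads) % 4                   # 4 hot lines per thread
        line = (t * stride + k * line_size) // line_size
        s, tag = line % sets, line // sets
        if tag in cache[s]:
            cache[s].move_to_end(tag)              # hit: refresh LRU position
        else:
            misses += 1
            cache[s][tag] = None
            if len(cache[s]) > ways:
                cache[s].popitem(last=False)       # evict the least recently used line
    return misses / accesses

n = 4
print("shared, direct-mapped :", miss_rate(n, total_lines=256, ways=1))
print(f"shared, {n}-way assoc.  :", miss_rate(n, total_lines=256, ways=n))

With this deliberately bad access pattern the direct-mapped cache misses on essentially every access, while the n-way cache only takes the initial compulsory misses; real applications fall somewhere in between, as the figures above indicate.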

2.3.8

Other Techniques for Hiding Latencies

Caches

Memory latencies have traditionally been hidden by caches, and a lot of research has been done in this area on cache schemes and sizes. Caches, however, do not actually hide memory latencies[1]; instead they try to eliminate as many of the long latencies as possible and to minimise the remaining memory accesses that do not result in a cache hit. If the code has low temporal and spatial locality, it will benefit little from caching and this technique is no longer usable.

Prefetching

If the program knows that it will use data stored in a given location in the future, it might be a good idea to introduce memory prefetching to the architecture. If the data is prefetched long enough before it has to be used, it will have been fetched from the memory network in time for the processor to use it, resulting in no penalty. This is particularly useful for complex interconnection networks with high latencies. For less regular code, such as traversal of data structures (trees, linked lists etc.), the location cannot be predicted early enough and prefetching is not possible.
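The rule of thumb can be captured in a back-of-the-envelope timing model, assuming an in-order core, a fixed memory latency and a regular loop (all numbers below are invented): the prefetch has to be issued at least latency/work iterations ahead of the use for the stall to disappear completely.

def loop_cycles(n_iters, work=10, latency=50, prefetch_distance=0):
    """Cycles for a loop where each iteration does 'work' cycles of computation
    and needs one memory line; the fetch for iteration i is issued
    'prefetch_distance' iterations earlier."""
    stall = max(0, latency - prefetch_distance * work)   # remaining stall per iteration
    return n_iters * (work + stall)

for d in (0, 2, 5, 8):
    print(f"prefetch distance {d}: {loop_cycles(1000, prefetch_distance=d)} cycles")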


2.3.9

Multithreading for Network Processors

Introduction

Deciding what each thread does on a network processor can be done in numerous ways, each with its advantages and disadvantages:

• One thread for each layer

• One thread for each protocol

• Multiple threads for each protocol

• Threads for protocols plus management tasks

• One thread for each packet

One thread for each layer

Dividing the protocol processing operation into separate threads for each layer has a few advantages. First of all, the code will be smaller and simpler, but since all protocols for a certain layer have to be covered, the code will still be large for multiprotocol systems. It is also possible to assign different priorities to different layers, so that lower layer protocols, e.g. layer two, have higher priority than higher layer protocols, for example layer three.

A disadvantage is that each thread must handle both incoming and outgoing packets, which complicates the code. The way packets are passed between the layers, i.e. between threads, also introduces a lot of overhead. A packet will normally (depending on the application, of course) be processed in many layers, and when the packet is handed over from one thread to another, the threads either have to be synchronised or buffers have to be placed between them. In either case, this takes additional time.

One thread per protocol

To make the code even smaller and simpler, it is possible to give each protocol its own thread. The code will be easier to understand, and it will be possible to assign priorities so that UDP processing has a higher priority than TCP processing, since UDP packets normally require lower latencies, e.g. for VoIP (Voice over IP) or similar. The disadvantages of the previous approach also apply here.

Multiple threads per protocol

This is often used to split the processing of a protocol by direction. The designer can then assign higher priority to outgoing packets in order to avoid congestion in the system.


Protocol processing plus management tasks

There are some common time-critical operations that have to be performed, especially for higher layer protocols, such as retransmission timers, reassembly and route updates. A dedicated thread for these operations can significantly simplify the design and programming, and collecting all timer management for all layers in a special thread might simplify the design even further. However, since the rates of timer events differ greatly among the protocols, ranging from minutes between route updates to seconds between packet timeouts, this might be difficult.

One thread per packet

The disadvantages described earlier for layer and protocol threads, and especially the overhead from thread switching when passing packets between layers or protocols, can be avoided by partitioning the threads in another way. If a thread works with a packet for its entire lifetime in the system, these overheads are avoided, but it also requires all protocol processing code for every layer and every protocol to be available to the packet threads. This is probably the most common technique in network processors.
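A minimal sketch of this run-to-completion model is given below, with placeholder layer handlers and a plain work queue standing in for the hardware packet dispatcher; none of the names are taken from a real network processor.

import queue, threading

def layer2(pkt): pkt["l2"] = "link layer parsed"; return pkt
def layer3(pkt): pkt["l3"] = "ip parsed"; return pkt
def layer4(pkt): pkt["l4"] = "udp/tcp parsed"; return pkt

PIPELINE = (layer2, layer3, layer4)      # every packet thread has all protocol code
packets = queue.Queue()

def packet_thread():
    while True:
        pkt = packets.get()
        if pkt is None:                  # shutdown marker
            break
        for handler in PIPELINE:         # the packet is processed to completion
            pkt = handler(pkt)           # by the same thread, no hand-over overhead
        print("done:", pkt["id"])

workers = [threading.Thread(target=packet_thread) for _ in range(4)]
for w in workers: w.start()
for i in range(8): packets.put({"id": i})
for _ in workers: packets.put(None)
for w in workers: w.join()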


2.4

System Level Methodology

2.4.1

Introduction

Modern embedded systems, such as signalling and network processing systems, are becoming increasingly complex, and together with future, unknown requirements this makes programmability an unavoidable requirement[8]. However, performance, cost and power requirements are also very important, which implies that parts of the solution must be built with dedicated hardware components that offer better performance and power but less flexibility. As a result, more and more systems are heterogeneous, i.e. they consist of both programmable and dedicated components. The increasing complexity of these systems requires that tools suitable for modelling, simulation and benchmarking are available to cut down the development time. The less time that has to be spent at the early stages (at system level), the shorter the time to market.

2.4.2

Application modelling

At the beginning of the development of a new system, detailed knowledge of the application or the architecture is very rare. No compilers or simulators exist, making it difficult to benchmark a program, and in many cases no definitive input data or application exists either, making it impossible to generate a program trace that can be used for simulation. When an application does exist prior to the development of the system, the system has traditionally been constructed around that application. The result is that the system performs very well for the given application, but if the application changes later on, or more applications are introduced on the same architecture, the new performance results can be a lot worse or even unusable.

2.4.3

Architecture modelling

In the past, embedded system developers have worked almost exclusively with VHDL models, which offer only a few abstraction levels to explore and limited design opportunities[18]. In order to cut down time-to-market, an exploration methodology should be used.

Many system level design exploration tools make it possible to efficiently explore different architectures by starting with an abstract yet executable model and iteratively refining it to find optimal solutions. The tools often make it possible to test many different applications on a given architecture, and also to test the applications on different architectures, which are often heterogeneous, without having to completely rewrite either the architectures or the applications.


2.4.4

Design Space Exploration Models

Y-chart Model

Figure 2.11: The Y-chart model for design space exploration. (Components: Application, Architecture, Mapping, Simulation, Results.)

At system level, a common and well proven methodology is the Y-chart approach[7][18], which allows reusability for efficient design space exploration. In this model, the architecture and applications are modelled separately, and multiple target applications can be mapped, one after another, onto the available architecture components. The result of the performance analysis should be used to refine the architecture, the applications and the mapping to fulfil the requirements of the system. The Y-chart model can be seen in figure 2.11.

Design Space Exploration Pyramid

Figure 2.12: The design space exploration pyramid. (Axes: abstraction level, cost of modelling (time), and design space (alternative realisations) with opportunities; the levels range from back-of-the-envelope estimates via abstract executable models and cycle-accurate models down to synthesisable VHDL.)

Another model is the design space exploration pyramid[18], which shows how the design space can be explored iteratively in order to find an optimal solution. It can be seen in figure 2.12. High up in the pyramid, at high modelling abstraction, the costs are small (it takes little time to model a system) and the design space is large. After iterative tests, going further down the pyramid to lower abstraction levels, the design space is narrowed and finally an optimal solution should be found.


2.5

Area Efficiency

2.5.1

Chip Area Estimation

The chip area equation for a large network processor is expressed in [5] as:

area_np = s(io) + Σ_{j=1}^{m} ( s(mch_j) + s(p_{j,k}, t) + s(ci_{j,k}) + s(cd_{j,k}) )    (2.5)

where s(io) is the area for I/O (common for all processors), m is the number of processor clusters, s(mch) is the area of a memory channel, s(p) is the size of the processors and s(c) is the size of the caches. Only the largest area contributions are included.

For a multithreaded protocol processor, the area can be divided into two parts: one part that depends on the number of hardware contexts and one part that is independent of it.

• The base processor logic, such as processor control, processing units, branch prediction etc., is constant and independent of the number of threads; this can be called p_basis.

• The second component is the hardware contexts (all registers, flags and similar) and the other logic associated with a thread, called p_thread. For a small number of threads, and for a simple thread scheduling and control policy, this can be approximated as increasing linearly with the number of threads. This will be inaccurate for exotic solutions such as multi-issue multithreaded processors, but in the general case this is not an issue.

This leads to the following simplified area description of a single protocol processor:

s(t) = s(p_basis) + t · s(p_thread)    (2.6)

where s() is an area function and t is the number of threads. For the Infineon PP32 and the ARM7, the two area components have been acquired, but due to the confidential nature of these numbers, they can unfortunately not be published here.

2.5.2

Defining Area Efficiency

For on-chip solutions in embedded systems, the best performance is not always the final goal. Due to the limited area on the chip, the final solution must be fast relative to the area it consumes. With a good performance per area, it is possible, to some extent, to later scale up the performance by running multiple systems in parallel. We call this number the area efficiency, and maximising it for a variable number of threads is the goal of the first simulation later on.

One possibility is to define the area efficiency as IPS per area unit. If the processor utilisation is ρ_p and its clock frequency is clk_p, this area efficiency can then be defined as:

λ_ae = (ρ_p · clk_p) / s(t)
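As a minimal sketch, the area model of equation (2.6) can be combined with the utilisation estimate of equation (2.4) to see where the area efficiency peaks. All numbers below (base area, per-thread area, clock frequency and latency profile) are invented, since the real PP32 and ARM7 figures are confidential, and the use of s(t) as the area in the denominator follows the per-processor definition above.

def utilisation(K, T, n, L, C):
    p = K / T
    return (1 - p**n) * (1 - L * C / T)              # eq. (2.4)

def area(t, p_basis=1.0, p_thread=0.15):
    return p_basis + t * p_thread                    # eq. (2.6), arbitrary area units

def area_efficiency(t, clk=400e6, **lat):
    return utilisation(n=t, **lat) * clk / area(t)   # IPS per area unit

lat = dict(K=400, T=1000, L=20, C=2)                 # invented latency profile
for t in range(1, 9):
    print(f"{t} thread(s): {area_efficiency(t, **lat) / 1e6:6.1f} MIPS per area unit")

With these made-up numbers the area efficiency peaks at two to three threads; locating this kind of sweet spot for a real system is exactly what the simulations later on are meant to do.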


Chapter 3

Previous Work

3.1

Spade

SPADE[8] (System level Performance Analysis and Design space Evaluation) is a tool that can be used for architecture exploration of heterogeneous signal processing systems and follows the Y-chart paradigm, which represents a general scheme for design of heterogeneous systems.

The application, which is described in C or C++, is transformed into a deterministic Kahn Process Network (KPN) by structuring the code using YAPI, which is a simple API. The designer must identify computational blocks and surround them with the API functions. Examples of such blocks are the DCT and other well-known signal processing algorithms. The application is then run on real-life data and a trace is generated that represents the communication and computational workload. SPADE then uses that trace in its simulation, since it uses a trace-driven simulation technique.

In the architecture model, you have to estimate how much time the computational blocks (applications) require on the architecture you model. The architecture file contains all processors you can select from and the time each task takes on them. You also specify the interfaces between the blocks, for example whether they should be point-to-point or via a shared memory. The bus widths, buffer sizes and memory latencies must also be defined.

With SPADE, the mapping of the application onto the architecture is performed one-to-one or many-to-one. Since one-to-many mappings are not supported, a designer who wishes to distribute a computational workload over several computational resources has to rewrite the application so that the workload is split into two processes.
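The trace-driven principle can be illustrated with a small sketch: two processes connected by a FIFO record their communication and computation events while running on sample input. The function names and trace format below are invented stand-ins and not the real YAPI interface; the processes are also run sequentially rather than concurrently, which is enough for this simple chain.

from collections import deque

fifo, trace = deque(), []

def producer(samples):
    for x in samples:
        trace.append(("producer", "execute", "generate"))   # computational block
        trace.append(("producer", "write", "fifo0"))         # communication event
        fifo.append(x * x)

def consumer(count):
    total = 0
    for _ in range(count):
        trace.append(("consumer", "read", "fifo0"))
        x = fifo.popleft()
        trace.append(("consumer", "execute", "accumulate"))
        total += x
    return total

producer(range(4))          # run the network on representative input data
result = consumer(4)
for event in trace:         # this trace is what a trace-driven simulator replays,
    print(event)            # attaching architecture-dependent delays to each event
print("result:", result)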

3.2

StepNP

StepNP[13] (SysTem-level Exploration Platform for Network Processing) is a tool that can be used for simulating network processor systems at an early stage. It uses ISA simulators together with a system framework written in SystemC. Models are available for some common processors, such as ARM and PowerPC, and multithreading capabilities can be implemented in these models.
