
DEGREE PROJECT IN INFORMATION AND COMMUNICATION TECHNOLOGY,
SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2016

Characterization of Partial and Run-Time Reconfigurable FPGAs

EMILIO FAZZOLETTO

KTH ROYAL INSTITUTE OF TECHNOLOGY


Abstract

FPGA-based systems have been heavily used to prototype and test Application Specific Integrated Circuit (ASIC) designs with much lower costs and development times than hardwired prototypes. In recent years, thanks both to the latest technology nodes and to a change in the architecture of reconfigurable integrated circuits (from traditional Complex Programmable Logic Devices (CPLD) to full-CMOS FPGAs), FPGAs have become more popular in embedded systems, both as main computation resources and as hardware accelerators. A new era is beginning for FPGA-based systems: the partial run-time reconfiguration of a FPGA is a feature now available in products already on the market, and hardware designers and software developers have to exploit this capability. Previous work shows that, when designed properly, a system can improve both its power efficiency and its performance by taking advantage of a partial run-time reconfigurable architecture. Unfortunately, taking advantage of run-time reconfigurable hardware is very challenging and there are several problems to face: the reconfiguration overhead is not negligible compared to the performance of today's CPUs, the reconfiguration time is not easily predictable, and the software has to be rethought to work with a time-evolving platform.

This thesis project investigates the performance of a modern run-time reconfigurable SoC (a Xilinx Zynq 7020), focusing on the reconfiguration overhead and its predictability, on the achievable speedup, and on the trade-offs and limits of this kind of platform. Since it is not always obvious when an application (especially a real-time one) can really take advantage of a partial run-time reconfigurable platform, the data collected during this project can be a valid aid for hardware designers who use reconfigurable computing.

Keywords: FPGA, Reconfigurable Computing, Partial Reconfiguration, Hardware Architectures, Embedded Systems, Computation Efficiency, Xilinx


Sammanfattning

FPGA-baserade system har tidigare främst använts för snabb och kostnadseffektiv konstruktion av prototyper vid framtagandet av applikationsspecifika integrerade kretsar (ASIC). På senare år har användandet av FPGA:er i inbyggda system för implementation av hårdvaruacceleratorer såväl som huvudsaklig beräkningsenhet ökat. Denna ökning har möjliggjorts mycket tack vare den utveckling som har skett av rekonfigurerbara integrerade kretsar: från de mer traditionella Complex Programmable Logic Devices (CPLD) till helt CMOS-baserade FPGA:er. Nu inleds en ny era för FPGA-baserade system tack vare möjligheten att under körning rekonfigurera delar av FPGA:n genom så kallad partial run-time reconfiguration (RTR) - en teknik som redan idag finns tillgänglig i produkter på marknaden. Tidigare forskning visar att användandet av en RTR-baserad hårdvaruarkitektur kan ha en positiv effekt med avseende på prestanda såväl som strömförbrukning. Att använda RTR-baserad hårdvara innebär dock flera utmaningar: En ej försumbar rekonfigurationstid måste tas i beaktning, så även den icke-deterministiska exekveringstiden som en rekonfiguration kan innebära. Vidare måste anpassningar av mjukvaran göras för att fungera med en hårdvaruplattform som förändras över tid.

Denna uppsats syftar till att undersöka prestandan hos ett modernt RTR-baserat SoC (Xilinx Zynq 7020) med fokus på rekonfigurationstider och dess förutsägbarhet, prestandaökning, begränsningar samt nödvändiga kompromisser som denna arkitektur innebär. Huruvida en applikation kan dra nytta av en RTR-baserad arkitektur eller inte kan vara svårt att avgöra.

Den insamlade datan som presenteras i denna rapport kan dock fungera som stöd för hårdvarukonstruktörer som önskar använda en RTR-baserad plattform.

Nyckelord: FPGA, Rekonfigurerbar hårdvaruarkitektur, Partiell rekonfiguration, Hårdvaruarkitektur, Inbyggda system, Beräkningseffektivitet, Xilinx


Acknowledgements

During this thesis, several people have provided irreplaceable help; without them, I would probably have learnt less and I would have lost a core element of the whole experience: my personal and professional growth.

I would like to thank my examiner, Prof. Ingo Sander, who has always given me new input to widen the contribution of my work. He has been an example to follow, inspiring me with his passion and bringing me closer to academic research.

I also wish to thank my supervisor, George Ungureanu, who has been extremely patient and meticulous, always finding the time to answer any of my questions.

I am also grateful for the support given by my supervisor from Polytechnic of Turin, Prof. Luciano Lavagno, who has helped me with extremely accurate and productive criticisms.

Finally, I owe a debt of gratitude to all my colleagues, both PhD and Master's Thesis students, who have shared their experience and their thoughts with me. A special thanks to Youssef Zaki and Elisa Orso, who have spent a lot of time and effort in helping me with final corrections.

Emilio Fazzoletto Stockholm, June 2016


Contents

Abstract
Sammanfattning
Acknowledgements
Contents
List of Figures
List of Tables
List of Listings

1 Introduction
1.1 Motivation
1.2 Objectives
1.3 Strategy

2 Background and Literature Study
2.1 RTR FPGAs
2.1.1 Advantages and Fields of Application
2.1.2 Challenges
2.1.3 Limits
2.2 Related Work
2.2.1 Using Xilinx FPGAs to Build a RISP
2.2.2 Run-Time Partial Reconfiguration Speed Investigation
2.2.3 Scheduling, Hardware Prefetching and Anti-Fragmentation Techniques on RTR Systems
2.2.4 ZyCAP: Efficient Partial Reconfiguration Management on the Xilinx Zynq
2.2.5 Other Techniques to Reduce the Reconfiguration Overhead

3 Initial Design Development and Analysis
3.1 Concept Idea
3.2 The Xilinx Toolchain
3.3 Adding Custom RTR Peripherals to the Zynq
3.4 First Performance Estimation

4 Test Platforms and Collected Data
4.1 Behavior of the Reconfiguration Time
4.2 Move the Workload to the FPGA: Is It Worth It?
4.2.1 ARM Cortex A9 vs HW IP Cores: 32-bit Integer Operations
4.2.2 2D Convolution: a Favorable Computation Scenario for FPGAs
4.2.3 Energy Efficiency
4.3 Reducing the Reconfiguration Overhead
4.4 A Trade-Off Case Study: FIR Filters
4.4.1 N-TAP Fully-Parallelized FIR Filters
4.4.2 N-TAP Single MAC FIR Filters
4.4.3 Power Consumption Comparisons
4.4.4 Speed Up vs Data Set Size: a Simple Model

5 Final Remarks
5.1 Conclusions
5.2 Future Work
5.2.1 Characterize Other Reconfiguration Peripherals
5.2.2 Investigate About High-Level Synthesis and Reconfiguration Overhead
5.2.3 Accelerate Real Applications
5.2.4 Develop High-Efficiency Prefetching Strategies
5.2.5 Study Instantaneous Power Consumption
5.2.6 Model the Trade-Off Between RTR Architectures and Multiplexed Ones
5.2.7 The Final Goal: a RTR-Aware Scheduler

Appendices
A How to Create a Simple RTR Architecture
A.1 Technical Background
A.1.1 Vivado Partial Reconfiguration Steps, from UG-909
A.1.2 Partial Reconfiguration Commands
A.1.3 Partial Reconfiguration Constraints and Properties
A.1.4 Partial Reconfiguration Software Flow
A.2 Connecting a Custom Peripheral to the Zynq PS
A.2.1 How to Connect a Simple Memory Mapped Peripheral
A.2.2 Generating an AXI4-Lite Peripheral Using Vivado HLS
A.2.3 Connecting an AXI4-Lite Peripheral in the PL with the Zynq PS
A.2.4 A Simple Application to Use the AXI4-Lite Interface
A.2.5 Increase the I/O Bandwidth: Generating an AXI4-Stream IP Core
A.2.6 Connecting an AXI4-Stream Peripheral in the PL with the Zynq PS
A.2.7 A Simple Application to Use the AXI4-Stream
A.3 Adding a Run-Time Reconfigurable IP core to the Zynq PS
A.3.1 Introduction
A.3.2 Building the Static Block Design
A.3.3 Generation of the Reconfigurable Modules
A.3.4 Creating the Reconfigurable Partition
A.3.5 Implementing all the Possible Configurations
A.3.6 Validating Consistency Between Configurations and Generating the Bitstreams
A.3.7 Developing a RTR-Capable Application
A.3.8 PCAP-Based Partial Reconfiguration Performance


List of Figures

2.1 Typical topology of a Triple Modular Redundancy (TMR) architecture
2.2 Trend of the leakage current vs technology scaling. Data from the International Technology Roadmap for Semiconductors (2007)
2.3 Parallel processor pipelines with reconfigurable datapath and integrated configuration interface [1]
2.4 Structure of the Xilinx Internal Configuration Access Port (ICAP) designs, from [4]
2.5 Structure of the custom DMA HWICAP design proposed in [4]
2.6 Structure of MST ICAP and BRAM ICAP, from [4]
2.7 Virtual reconfigurations on single-context FPGAs
3.1 Simplified block diagram of the Xilinx Zynq 7020
3.2 First reconfigurable architecture developed to demonstrate the feasibility of the concept idea
4.1 Graph of the reconfiguration throughput versus the size of the reconfigurable partition
4.2 Graph of the reconfiguration time versus the size of the reconfigurable partition
4.3 Graph of the execution time for 1 million 32-bit integer multiplication/addition
4.4 Graph of the execution time for the FIR filter and the Prime Number Checker
4.5 Structure of the fully-parallelized 21-tap FIR filter
4.6 Resource utilization and performance comparison between the FIR filter discussed in 4.2.1 (on the right) and a fully-parallelized version of it (on the left)
4.7 3x3 edge detector filter applied to a test image
4.8 Graph of the execution time for the 2D convolution
4.9 2D convolution power consumption graph: Run-Time Reconfigurable (RTR) hardware vs software implementation
4.10 High-level synthesized IP core implementing a FIR filter equipped with AXI4-Lite, AXI4-Stream and an interrupt line
4.11 On the left the polling-based approach, on the right the dedicated interrupt one
4.12 Example of Reconfigurable Partition (RP) erasing to save static power
4.13 2D Convolution reconfigurable system capable of prefetching the Reconfigurable Module (RM)s
4.14 Implementation of the architecture shown in Figure 4.13
4.15 Scheduling of two 2D convolutions using two image processors: on the left using only one RP, on the right using two RPs
4.16 Static architecture used to provide three different Finite Impulse Response (FIR) filters to the Zynq Processing System (PS)
4.17 N-TAP fully-parallelized FIR filters resource utilization: multiplexed architecture vs RTR one
4.18 Performance comparison between the N-TAP fully-parallelized FIR filters and the respective software implementations
4.19 Performance comparison between the RTR N-TAP fully-parallelized FIR filters and the respective software implementations
4.20 Impact of the reconfiguration time on the total execution time using fully-parallelized FIR filters
4.21 N-TAP 1-MAC FIR filters resource utilization: multiplexed architecture vs RTR one
4.22 Performance comparison between the N-TAP 1-MAC FIR filters and the respective software implementations
4.23 Performance comparison between the RTR N-TAP 1-MAC FIR filters and the respective software implementations
4.24 Impact of the reconfiguration time on the total execution time using 1-Multiply and ACcumulate (MAC) FIR filters
4.25 7-TAP FIR filter performance comparison: software implementation vs RTR fully-parallelized hardware accelerator
4.26 3-TAP FIR filter performance comparison: software implementation vs RTR fully-parallelized hardware accelerator
4.27 8-bit gray-scale image convolution: software implementation vs RTR fully-parallelized hardware accelerator
A.1 Synthesis performance report of the AXI4-Lite 32-bit multiplier
A.2 Synthesis utilization report of the AXI4-Lite 32-bit multiplier
A.3 Synthesis interface report of the AXI4-Lite 32-bit multiplier
A.4 Hybrid block design: the custom hardware peripheral calc 0 is connected to the Zynq PS
A.5 The block design created to allow the communication between the Zynq PS and the Advanced eXtensible Interface (AXI)4-Stream peripheral using Direct Memory Access (DMA) transfers
A.6 Suggested working directory organization for a Partial Reconfiguration (PR)-based project
A.7 Functional flow chart of the software application used to test the partially reconfigurable system


List of Tables

3.1 Average performance of the Device Configuration Interface (DevC)-Processor Configuration Access Port (PCAP) peripherals
4.1 Behavior of the reconfiguration time with different RMs and RPs
4.2 Performance comparison between a single ARM Cortex A9 core running at 666 MHz and two high-level-synthesized IP cores in 32-bit integer addition/multiplication
4.3 Performance Comparison: FIR Filter and Prime Number Checker
4.4 Performance and reconfiguration overhead comparison between the two versions of FIR filters
4.5 Performance comparison between the ARM Cortex A9 and the HW IP core in 2D convolution
4.6 2D convolution power consumption for both hardware and software implementations
4.7 Performance Comparison: 1 RP vs 2 RPs
4.8 Reconfiguration overhead scaling versus the image size
A.1 Average performance of the PCAP


Listings

A.1 Source code for an AXI4-Lite 32-bit multiplier using Vivado HLS
A.2 C testbench to test the AXI4-Lite multiplier
A.3 Script to add an AXI4-Lite peripheral to the Zynq PS
A.4 Test application to use the AXI4-Lite IP core
A.5 Source code of an AXI4-Stream 32-bit multiplier
A.6 C testbench to test the AXI4-Stream IP core
A.7 Script to connect the AXI4-Stream IP core to the Zynq PS
A.8 Test application used to manage the AXI4-Stream peripheral
A.9 Synthesize the reconfigurable modules
A.10 Create the reconfigurable partition
A.11 Implement one of the two possible configurations
A.12 Lock the static routing and load the new RM inside the RP
A.13 Implement the empty configuration
A.14 Verify the consistency between the different configurations
A.15 Generate partial and total bitstreams
A.16 Test application for the discussed partial reconfigurable architecture


Acronyms

AMBA Advanced Microcontroller Bus Architecture.

ASIC Application Specific Integrated Circuit.

AXI Advanced eXtensible Interface.

BRAM Block RAM.

CPLD Complex Programmable Logic Device.

DevC Device Configuration Interface.

DMA Direct Memory Access.

DSP Digital Signal Processor.

FF Flip Flop.

FIR Finite Impulse Response.

FPGA Field Programmable Gate Array.

FSBL First Stage Boot Loader.

FSM Finite State Machine.

IC Integrated Circuit.

ICAP Internal Configuration Access Port.

ILP Integer Linear Programming.

ISR Interrupt Service Routine.

LUT Look-Up Table.

MAC Multiply and ACcumulate.


OPB On-Chip Peripheral Bus.

PCAP Processor Configuration Access Port.

PE Processing Element.

PL Programmable Logic.

PoR Power-on-Reset.

PR Partial Reconfiguration.

PS Processing System.

RISP Reconfigurable Instruction Set Processor.

RM Reconfigurable Module.

RP Reconfigurable Partition.

RTR Run-Time Reconfigurable.

SoC System on Chip.

SRAM Static Random Access Memory.

TMR Triple Modular Redundancy.

VLIW Very Long Instruction Word.


Chapter 1

Introduction

1.1 Motivation

Field Programmable Gate Array (FPGA) technology is evolving year by year and has reached a turning point with respect to the solutions available only a few years ago. The first FPGAs became popular for building prototypes, offering a much cheaper solution than Application Specific Integrated Circuit (ASIC)-based prototyping, and later for final products, as they became increasingly convenient in terms of performance, design time and cost.

Nowadays, FPGAs are not only reprogrammable in the field (a FPGA-based system can be upgraded with a new hardware architecture without replacing any physical component) but also dynamically and partially reconfigurable: it is possible to reconfigure part of the architecture loaded into a FPGA while the rest of the system continues to operate normally. In this way, the hardware architecture can be customized to provide the system with specific computing resources on-demand. These capabilities give the designer the possibility to reach a new level of power efficiency at fixed performance (saving silicon area to decrease static power consumption and using special-purpose hardware accelerators to reduce dynamic power) or to increase the performance given a fixed die size [5] [6] [7].

The work presented in this thesis is an investigation of the potential of Partial Run-Time Reconfigurable (RTR) FPGAs to improve the efficiency of a system (both in terms of performance and energy), despite the challenges that have to be faced while working with RTR devices, and a comparison of such solutions with more standard approaches. The final aim is to demonstrate the real capabilities of a Partial RTR FPGA, focusing on its advantages and limits, and to better understand in which application fields it can be competitive and why. In order to compare a Partial RTR architecture with classic approaches in a real-world scenario, a modern ARM-based hybrid System on Chip (SoC) was chosen as the core of the development platform for collecting meaningful data. In particular, a ZedBoard [24], equipped with a Xilinx Zynq 7020 SoC, has been used, offering a dual-core ARM Cortex A9 hardwired CPU that, together with its peripherals, constitutes the Processing System (PS). Besides this, some Programmable Logic (PL) is placed on the same die and can be programmed and used together with the PS or as a standalone FPGA unit. The PL of the Zynq can be partially reconfigured at run-time by itself or by the PS, allowing maximum flexibility at design time.

1.2 Objectives

The main goal of this master's thesis is to explore the benefits, peculiarities and challenges of a modern RTR FPGA and the related design flow. Specifically, the test platform will be based on a hybrid SoC, the Xilinx Zynq 7020, making it possible not only to compare a RTR platform against static hardware architectures but also to conduct performance comparisons between hardware and software implementations. The experiments will be designed to investigate the critical aspects of RTR FPGAs, such as the reconfiguration time and the reconfiguration overhead. Together with the needed literature and background study, these are the mandatory goals for this thesis project, fully described below.

• Literature study of a RTR FPGA platform: advantages, trade-offs, and challenges.

• Development of an initial test platform using the Xilinx Zynq 7020 and the Xilinx toolchain (Vivado HLS, Vivado Suite, Xilinx SDK).

• Measurement and characterization of the reconfiguration time.

• Overhead estimation between a full software, a static hardware, and a RTR solution.

• Speed up measurement between a RTR hardware solution and a full software implementation given a specific algorithm.

These key points represent the core of the presented work and can contribute to research in this field by providing a better characterization of phenomena like the reconfiguration process. This knowledge is particularly important for real-time applications, where the predictability of the reconfiguration time is an essential requirement.

In addition to the mandatory goals, some optional goals, listed below, are considered to widen the contribution of the thesis project if time permits.

• Investigate how to reduce the reconfiguration overhead.


• Develop experiments to test the energy efficiency of a RTR implementation compared to a full software one.

• Discuss complex RTR architectures, using more than one reconfigurable slot (referred to as Reconfigurable Partition (RP) in the Xilinx documentation).

1.3 Strategy

Research methods and methodologies are very important for good research and thesis projects. It is important to choose a proper research method and the related methodology ahead of the research work, in order to ensure the quality of the results in terms of validity, reliability, replicability and ethics. Not only is good knowledge of the subject needed, but a suitable background is also required to work properly in a research field. For these reasons, a literature study was conducted before the development of the experiments.

Given the experiment-driven nature of the work, a quantitative research method will be used. The research is semi-empirical: it gains knowledge from evidence collected through experiments and observations, measuring variables to verify or falsify theories and hypotheses.

The expectation is to collect meaningful data from real testbeds and to use that data to guide the design and to develop theoretical models.

All the technical procedures will be described in their entirety in order to allow other researchers, students and scientists to replicate the presented experiments. An additional tutorial (attached as an appendix) describes, step-by-step, the development flow used to build the initial test platform discussed in Chapter 3.


Chapter 2

Background and Literature Study

This chapter provides an overview of the state of the art in the reconfigurable computing area. The focus is on some critical aspects of RTR FPGAs, such as the reconfiguration overhead (both in terms of hardware resources and reconfiguration time) and reconfiguration strategies.

2.1 RTR FPGAs

RTR FPGAs entered the market with a huge potential to improve system performance, but they also brought many challenges that have to be faced in order to use reconfigurable hardware in real products. For this reason, it is very important to know the strengths and the drawbacks of these solutions in order to exploit, when possible, all the benefits of a RTR platform.

2.1.1 Advantages and Fields of Application

The possibility to partially reconfigure the hardware at run-time greatly widens the design space but, over the past years, some fields of application have been found to be more suitable than others for RTR FPGAs. More specifically, these devices have been largely used in high-reliability critical systems [11]. These include all electronic systems whose reliability is heavily stressed, typically by extreme operating conditions, and in which a failure of an electronic system is critical or fatal for the whole device. In spaceborne systems, the electronic boards are subjected to high levels of radiation: since FPGAs are very sensitive to radiation, it has always been difficult to use them in space without reliability issues. However, using RTR FPGAs, it is possible to build self-healing devices which are capable of detecting soft errors (by comparing the status of the configuration memory with a reference) and restoring the correct configuration. Techniques like Triple Modular Redundancy (TMR) are used successfully not only to keep the system working correctly even if part of the hardware is compromised, but also to restore one of the system instances after a soft failure while the system itself keeps working correctly [12]. Figure 2.1 shows the topology of a TMR architecture with a 2-out-of-3 voter, which guarantees the correctness of the output of the system even if one of the redundant modules suffers a failure (a failure that, in case of a soft error, can be fixed thanks to the capability of the platform to reconfigure at run-time a part of the FPGA, in this case the faulty module).

Figure 2.1: Typical topology of a TMR architecture
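The 2-out-of-3 voting logic described above can be modeled in a few lines. The following Python sketch is only an illustration of the voting function (a real TMR voter is, of course, implemented as logic on the FPGA):

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """2-out-of-3 bitwise majority: each output bit matches
    at least two of the three redundant module outputs."""
    return (a & b) | (b & c) | (a & c)

# A soft error flips bits in one module; the voter masks the fault.
correct = 0b1011
faulty = correct ^ 0b0100        # one module corrupted by a bit flip
assert majority_vote(correct, correct, faulty) == correct
```

Because the output stays correct while one module is faulty, the platform has time to reconfigure the faulty module in the background without interrupting the service.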

Other fields of application for RTR FPGAs are low-power devices and cost-sensitive systems. Regarding the power consumption of modern reconfigurable devices [19], the latest silicon technology nodes show a clear trend (see Figure 2.2): static power increases due to growing leakage currents (both gate-oxide and sub-threshold) caused by the shrinking of the transistor gate length. Although the latest silicon technologies (like FinFET [26]) allow technology scaling while limiting the leakage current, many technology-independent techniques can be used to contribute to the reduction of the power consumption. In electronic systems based on reconfigurable hardware, one useful strategy to reduce the static component of the total power consumption is to erase a RP when it is not used, instead of keeping an idle Reconfigurable Module (RM) inside it. In this way, it is possible to reduce the active area of the die and decrease the equivalent number of powered transistors, switching a portion of the die itself off [13].

Figure 2.2: Trend of the leakage current vs technology scaling. Data from the International Technology Roadmap for Semiconductors (2007)

Finally, considering cost-sensitive applications, it has been proven that RTR architectures can improve the utilization of the area of a FPGA, reducing the resources required to implement a given system or increasing the explorable design space for a specific FPGA model [14]. Loading a RM only when needed and reusing the same part of the FPGA for different processing elements in different time slices can greatly reduce the needed resources and thus the equivalent FPGA die size. Since the die cost is proportional to the fourth power of the die size [15], RTR FPGAs have to be taken into account in applications in which many tasks are performed in separate time intervals (exploiting a time-multiplexing of the available hardware resources).
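Because of the fourth-power relationship cited above, even a modest area saving from time-multiplexing RMs translates into a large cost reduction. A minimal sketch of this rule of thumb (illustrative arithmetic only, not a cost model from this thesis):

```python
def relative_die_cost(area_ratio: float) -> float:
    """Relative die cost under the cost ~ (die size)^4 rule of thumb [15].
    area_ratio is the new die area divided by the original die area."""
    return area_ratio ** 4

# If time-multiplexing processing elements in one RP lets the design
# fit in half the area, the die-cost term drops by a factor of 16.
print(relative_die_cost(0.5))    # 0.0625, i.e. 1/16 of the original cost
```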

In addition to the scenarios just presented, RTR technology is also being included in hybrid SoCs, such as the Xilinx Zynq 7020 used for the experiments: as previously discussed, this type of Integrated Circuit (IC) offers a hardwired processor and some dynamically reconfigurable PL on the same die, allowing the designer to accelerate the computation (heavy workloads can be offloaded from the CPU to the FPGA part of the chip), increase the parallelism and add custom features, at the hardware level, to the standard hardwired PS.


2.1.2 Challenges

Using a RTR FPGA implies facing a number of challenges intrinsically related to these devices. The reconfiguration overhead is probably the most obvious one.

The reconfiguration overhead arises from the reconfiguration time, i.e. the time interval added to the normal computation time by the loading of a new RM inside a RP. During this period, the RP cannot be used and any stream of data to and from it has to be stopped to avoid coherency problems (the outputs of the RP are not defined while it is reconfiguring). Even if modern FPGAs allow significantly high reconfiguration throughput (almost half a GB/s [2]), the reconfiguration overhead is still not negligible due to the size of the partial configuration bitstream files (used to reconfigure a RP) and the execution speed of today's logic, capable of executing hundreds of operations every microsecond. Some milliseconds may be needed to reconfigure a medium-size RP at a decent reconfiguration throughput (for example, 2.5 ms are needed to load a 32-bit multiplier into a Xilinx Zynq 7020 at a reconfiguration throughput of 130 MB/s, as in Section 3.4), and it is thus one of the designer's duties to cope with these relatively large reconfiguration overheads. Typically, the effect of the reconfiguration overhead may be mitigated with one or more of the following strategies.

• Increasing the reconfiguration throughput: this is a direct way to decrease the reconfiguration time and, consequently, the reconfiguration overhead. The reconfiguration throughput can be increased by overclocking the reconfiguration peripheral [10], compressing the partial bitstream files [9], or developing and implementing a faster reconfiguration device [2].

• Implementing a hardware prefetching technique [3] [8]: in this way a RM can be loaded as soon as possible inside a RP. In some cases, this can completely hide the reconfiguration overhead.

• Minimizing the number of reconfiguration events: compared to the execution time of a single 32-bit multiplication, the reconfiguration time needed to load a 32-bit multiplier can be 5 or 6 orders of magnitude higher. Nevertheless, if billions of multiplications are going to be performed after the 32-bit multiplier is loaded, then the reconfiguration overhead becomes relatively much smaller (see Section 4.2.1).

• Working with computation-intensive workloads: computation-intensive workloads, such as image processing ones, need long computation times and can be accelerated by 5-15 times using a hardware implementation [16]. In these cases, the reconfiguration overhead is usually smaller relative to the long computation time (especially for high-resolution images and videos), and the speedup obtainable with a hardware implementation helps to further balance the reconfiguration overhead.
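The trade-off behind these strategies can be captured by a simple back-of-the-envelope model. The Python sketch below is illustrative only: the bitstream size (~325 KB) is chosen to be consistent with the 2.5 ms at 130 MB/s example above, while the per-operation times are hypothetical placeholders:

```python
def reconfig_time_s(bitstream_bytes: int, throughput_bps: float) -> float:
    """Reconfiguration time = partial bitstream size / reconfiguration throughput."""
    return bitstream_bytes / throughput_bps

def break_even_ops(t_reconf_s: float, t_sw_op_s: float, t_hw_op_s: float) -> float:
    """Number of operations n at which t_reconf + n*t_hw equals n*t_sw,
    i.e. the point where the reconfiguration cost is amortized."""
    return t_reconf_s / (t_sw_op_s - t_hw_op_s)

# ~325 KB partial bitstream at 130 MB/s gives the 2.5 ms quoted above.
t_r = reconfig_time_s(325_000, 130e6)
# Hypothetical per-operation times: 30 ns in software vs 10 ns in hardware.
n = break_even_ops(t_r, 30e-9, 10e-9)
print(f"reconfiguration: {t_r * 1e3:.1f} ms, break-even after {n:,.0f} ops")
```

With these numbers the accelerator pays for its own reconfiguration only after about 125,000 operations, which is why amortization over large workloads (the third and fourth strategies above) matters so much.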

In addition to these considerations, the reconfiguration time not only affects the performance of the system, causing an overhead in the computation time, but also complicates the development of real-time systems. In fact, the reconfiguration time is difficult to predict exactly and may depend on many factors: the size of the RP, the architecture of a RM (especially if a bitstream compression technique is used), the reconfiguration peripheral, the device technology, etc. In real-time systems, where task scheduling is of critical importance, the reconfiguration time adds a variability factor that complicates the system development.

To address these challenges, a system built on top of a RTR hardware platform should be aware of the nature of its hardware architecture: using a reconfiguration-aware scheduler [18] or a reconfiguration-aware operating system [17], the full potential of the RTR platform can be exploited without undesired effects on the system behavior. Using a proper scheduler on top of a RTR architecture, it is possible to schedule both hardware and software tasks (in the case of a hybrid system like the Xilinx Zynq 7020), migrate them (from a Processing Element (PE) to another PE or to a general-purpose processor) and reuse the already loaded hardware to reduce the reconfiguration overhead.
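The "reuse the already loaded hardware" idea can be sketched as a toy dispatcher. The following Python model is purely illustrative (the class, its names and the string-based RM identifiers are invented for this sketch; it is not one of the schedulers cited above): it triggers a reconfiguration only when the requested RM differs from the one currently loaded in the RP:

```python
class ReconfigAwareDispatcher:
    """Toy model of a reconfiguration-aware dispatcher for a single RP:
    a hardware task runs only after the right RM is loaded, and an
    already-loaded RM is reused to skip the reconfiguration overhead."""

    def __init__(self, load_bitstream):
        self.loaded_rm = None                  # RM currently in the RP
        self.load_bitstream = load_bitstream   # stand-in for a PCAP/ICAP driver call

    def run(self, task_rm: str, payload):
        if self.loaded_rm != task_rm:          # reconfigure only on an RM miss
            self.load_bitstream(task_rm)
            self.loaded_rm = task_rm
        return f"{task_rm}({payload})"         # stand-in for the hardware execution

reconfigs = []
d = ReconfigAwareDispatcher(load_bitstream=reconfigs.append)
d.run("fir_21tap", "block0")
d.run("fir_21tap", "block1")   # RM hit: no reconfiguration
d.run("conv2d", "image0")      # RM miss: one reconfiguration
print(reconfigs)               # ['fir_21tap', 'conv2d']
```

Grouping tasks that share an RM, as in the second call above, is exactly the kind of decision a reconfiguration-aware scheduler can make to minimize reconfiguration events.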

2.1.3 Limits

Although a dynamically reconfigurable hardware platform can bring many advantages compared to a static hardware architecture, RTR FPGAs come with some intrinsic limits too. An overview of some of them is provided in the list below.

• The partial bitstream files must be stored in a memory, which means either occupying extra FPGA resources or increasing the cost of the system (adding, for example, a non-volatile external memory). Moreover, with modern, larger FPGAs and complex RMs, the size of a partial bitstream file can reach several MB, which is a massive amount of memory for an embedded system.

• The I/O pins of a RM have to be designed carefully: to be accommodated in the same RP, different RMs have to be developed with the same boundary pins and interfaces, in order to match the static logic outside the RP after every partial reconfiguration [20].

• RTR hardware is based on FPGA technology, so it is still slower and less dense than ASICs.


2.2 Related Work

Many research groups all over the world have put great effort into developing the technology needed to exploit the potential of RTR devices: in this section, some interesting publications have been selected to provide a better background in this research field and to motivate the research work of this thesis project.

2.2.1 Using Xilinx FPGAs to Build a RISP

When using Xilinx FPGAs for partial run-time reconfiguration, one of the obvious choices to reconfigure selected partitions of the FPGA itself is to use the Internal Configuration Access Port (ICAP). A very important point in determining the performance of a system and the needed resources is how the ICAP is used, how it is connected to the control system (which manages the partial reconfiguration), and how a RP is connected to the datapath. The aim is to reach the bandwidth required by the design specifications while minimizing the resource utilization, in terms of both logic elements and routing resources. Paper [1] proposes to manage the ICAP port as an additional computational pipeline in a super-scalar processor (such as a Very Long Instruction Word (VLIW) microprocessor) in order to add specific computational capabilities on-demand. The result is a Reconfigurable Instruction Set Processor (RISP) architecture, shown in Figure 2.3, that can significantly improve the system efficiency.

Figure 2.3: Parallel processor pipelines with reconfigurable datapath and integrated configuration interface [1]


In addition to this architectural configuration, some advice is given to minimize the routing overhead of a RM (it has to be connected in some way to the rest of the system). Since the ICAP is capable both of writing the configuration memory of the PL to reconfigure it (loading partial bitstream files) and of reading the same configuration memory to check its status, the ICAP itself can be used to move data from and to a RM. For example, the ICAP can first re-program a specific RM and then read the processed data from a Block RAM (BRAM) inside the RM, moving them to the main memory. In this way, it is possible to have a very tight coupling between the static part of the system and the RM with a low routing overhead, with no need for a dedicated interface to move data from and to a RP.

2.2.2 Run-Time Partial Reconfiguration Speed Investigation

Each FPGA vendor has its own approach to give the user the possibility to load partial bitstream files into the configuration memory of a RTR FPGA. Focusing on Xilinx FPGAs, one of the most used reconfiguration peripherals is the ICAP. With a theoretical physical bandwidth of 400MB/s, it should not be a bottleneck in many scenarios. Nevertheless, some applications require a high reconfiguration throughput (a measure of how quickly a RP can be switched from one RM to another) to sustain huge streams of I/O data (to cite one example, this is the case of high-resolution video decoders). In these cases, the bottleneck may be not the ICAP itself but the speed of the static logic (a processor, a Finite State Machine (FSM), etc.) in providing data to the ICAP.

Xilinx provides two different ICAP-based designs, shown in Figure 2.4: the OPB HWICAP and the XPS HWICAP [4]. They are both designed for the low-performance On-Chip Peripheral Bus (OPB) and they both integrate a control state machine but, while the OPB HWICAP uses a BRAM as buffer, the XPS HWICAP uses a Read/Write FIFO buffer.


Figure 2.4: Structure of the Xilinx ICAP designs, from [4]

In the tests conducted by the authors of [4], these designs are capable of a maximum reconfiguration speed of 11.1MB/s and 22.9MB/s (for the OPB HWICAP and the XPS HWICAP respectively) with different resource utilization (since one of them uses BRAM and the other one uses 4-input Look-Up Tables (LUTs) to implement shift registers).

After executing the performance tests on the Xilinx standard reconfiguration devices, the authors of [4] propose a custom reconfiguration peripheral with a built-in Direct Memory Access (DMA) controller, shown in Figure 2.5. This solution offers advantages in terms of CPU utilization (thanks to the DMA) and reaches a peak bandwidth of about 82.6MB/s: however, it is so complex that it requires 25% of the total number of 4-input LUTs of the Virtex-4 FPGA used [4].


Figure 2.5: Structure of the custom DMA HWICAP design proposed in [4]

In order to reach the maximum physical limit of the ICAP, two additional solutions are proposed in [4] and shown in Figure 2.6: the former, called MST HWICAP, is based on an optimized version of the DMA HWICAP; the latter, called BRAM HWICAP, uses a block RAM as cache memory (for whole partial bitstreams). The BRAM based solution offers the highest performance, with a peak of 371.4MB/s (reaching almost the limit of the ICAP interface) but, since the whole partial bitstream has to be loaded into the BRAM, it uses almost half of the BRAM available on the Virtex-4 FPGA. Considering all the previous observations, the authors of [4] identify the MST HWICAP design as the best trade-off between complexity and performance, with a quite low resource utilization (comparable with the Xilinx standard designs) and a peak bandwidth of 235.5MB/s, which is an order of magnitude greater than that of the Xilinx solutions.


Figure 2.6: Structure of MST ICAP and BRAM ICAP, from [4]

2.2.3 Scheduling, Hardware Prefetching and Anti-Fragmentation Techniques on RTR Systems

When reconfigurable computing is involved, it is mandatory to focus not only on the hardware architecture but also on software and hardware tasks and on how they can be efficiently scheduled on one of the possible hardware configurations. Using a task graph to describe the system and assuming that the FPGA is not big enough to contain all the necessary hardware modules, the authors of [3] argue that the scheduler of a reconfigurable system has to be aware of the reconfiguration capability of the hardware platform alongside its limitations, such as the reconfiguration time of each hardware task and the FPGA area constraint. A reconfiguration-aware scheduler should be able to exploit all the capabilities of a RTR FPGA, implementing techniques like configuration prefetching and module reuse. This type of scheduler can easily outperform a classic one.

Module reuse and configuration prefetching are often ignored when using a RTR FPGA, but they can bring a significant increase in both the performance and the efficiency of the system. Module reuse is a technique that reuses an already loaded hardware module for two slightly different tasks (which can both be executed on it), avoiding a RM swap in a RP and thus hiding the reconfiguration time. Configuration prefetching, on the other hand, loads a RM into the FPGA as soon as possible in order to virtually reduce the reconfiguration time penalty. These two techniques, together with anti-fragmentation capability (to efficiently use the FPGA area, avoiding non-optimal hardware mapping), allow the user to mitigate the typical drawbacks of partial RTR hardware architectures.
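As a minimal sketch (the names and the cost model below are illustrative, not taken from [3]), the module-reuse decision amounts to checking whether the requested RM is already loaded in the RP before paying for a reconfiguration:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical state of one reconfigurable partition. */
typedef struct {
    const char *loaded_rm;   /* RM currently in the RP, NULL if empty  */
    double      reconfig_us; /* time to load any RM into this RP       */
} rp_state_t;

/* Returns the reconfiguration overhead actually paid for running `rm`
 * in the partition: zero when the loaded module can be reused.
 * A prefetching scheduler would call this as early as possible, so the
 * returned overhead overlaps with other work instead of stalling it. */
static double schedule_hw_task(rp_state_t *rp, const char *rm)
{
    if (rp->loaded_rm && strcmp(rp->loaded_rm, rm) == 0)
        return 0.0;          /* module reuse: no reconfiguration */
    rp->loaded_rm = rm;      /* swap the RM in the partition     */
    return rp->reconfig_us;
}
```

Running the same hardware task twice in a row charges the reconfiguration time only once, which is exactly the saving the reuse strategy targets.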

Considering each task as a node of a graph with an associated partial bitstream (the file that describes the corresponding RM), an area requirement, an execution time and a reconfiguration time, the scheduler has to manage these tasks, providing a solution in terms of:

• RM-to-RP loading;

• optimal RM-to-RP mapping;

• system integrity maintenance.

In [3] a reconfiguration-aware scheduler is proposed, capable of using both module reuse and configuration prefetching techniques, as well as an anti-fragmentation one. Comparing the results obtained in some self-developed tests, this scheduler can perform up to 40% better than another HW/SW co-design based scheduler which does not use any module reuse strategy. However, it is important to note that the specific test used for this reconfiguration-aware scheduler gives no information about its behavior in real-world applications. The tasks considered in [3] are very small and do not represent a real-world scenario. To cope with the complexity of real applications, the authors suggest developing a heuristic approach instead of the Integer Linear Programming (ILP) one used, while retaining the benefits achievable with the discussed techniques.

2.2.4 ZyCAP: Efficient Partial Reconfiguration Management on the Xilinx Zynq

In recent years, after partial RTR FPGAs became popular, another kind of IC became even more attractive: hybrid SoCs. A hybrid SoC, like the Xilinx Zynq 7020, is basically an IC which integrates both a hardwired processor (in the Zynq case, a dual-core ARM Cortex A9) and some programmable logic. The trend, thus, is moving from self-reconfiguring programmable logic (the standard partially RTR FPGA approach) to hybrid systems in which a processor can dynamically load hardware accelerators to offload complex computations. However, an inefficient Partial Reconfiguration (PR) management can frustrate any hardware acceleration advantage.

For Xilinx Zynq SoCs, one of the available peripherals dedicated to the PL reconfiguration is the AXI HWICAP, based on the classic Xilinx ICAP controller largely used inside Xilinx partially RTR FPGAs. Since this interface does not use a DMA to transfer bitstream files, its maximum throughput is less than 20MB/s. Fortunately, Xilinx embedded in the Zynq PS a new reconfiguration device called Device Configuration Interface (DevC), which uses a dedicated DMA controller to transfer bitstream files to the Processor Configuration Access Port (PCAP) (connected to the configuration memory of the PL). In this way, it is possible for the PS to reconfigure the PL with a peak throughput of almost 130MB/s without using any additional resources in the PL (both the DevC and the PCAP are hardwired in the PS part of the Zynq).

Although some of the already available PR peripherals have a few limitations (like the PCAP), they all share a common weak point: they block the processor during the reconfiguration, preventing the PS from processing data in the meantime. For this reason, the authors of [2] developed a new open source configuration controller together with its driver. It is called ZyCAP and its first target is efficiency, providing high performance, overlapped computation/reconfiguration capability and high-level management functions (written to be used in a bare-metal software application). ZyCAP uses an Advanced eXtensible Interface (AXI)4-Lite interface to configure a DMA controller (with the starting address and the size of the selected bitstream file) and an AXI4 High Performance port in burst mode to move bitstream files. Using the ICAP interface together with the custom designed DMA controller, it almost reaches the physical limit of 400MB/s and it allows the PS to execute other tasks during the reconfiguration process.

In computation intensive tasks and multi-threaded scenarios (for example, high-resolution image processing mixed with other tasks), ZyCAP can show its potential. On the other hand, its main limitations are its drivers (which are only available for bare-metal applications) and its resource utilization, since ZyCAP uses almost twice the hardware of a standard Xilinx ICAP based controller (while the DevC-PCAP does not need any resources in the PL).

2.2.5 Other Techniques to Reduce the Reconfiguration Overhead

The literature provides many other works which propose different strategies to reduce the reconfiguration overhead of a RTR system. Many of the proposed techniques are technology- and vendor-dependent and, as a consequence, it is the designer's responsibility to choose a development platform not only for its features and its performance but also considering how much can be done to exploit these optimization techniques using that specific product.

Regarding vendor-dependent techniques, two optimization strategies are proposed in [10] and [9].


• The authors of [10] used a recent Xilinx FPGA to prove that it is possible, with that specific product, to achieve higher performance than the officially advertised one. The experimental results show that, by overclocking the ICAP controller to twice its original maximum frequency (200MHz instead of 100MHz), the throughput reaches 800MB/s. In addition to the ICAP overclock, the authors also implemented a bitstream compression technique that exploits the redundancy in bitstream files to reduce their size.

• In [9] the authors propose another very effective bitstream compression strategy, capable of exploiting redundancies both within a single bitstream file and between the bitstream files of multiple configurations. This solution can perform up to 75% better than previously proposed ones, but it requires a customization of the configuration and readback circuits of the Xilinx FPGA used (in particular, a decoder circuit is needed for the decompression of the configuration bitstream files).

On the contrary, one of the techniques discussed in [8] is technology-independent and can be exploited on any RTR capable device. In this work the technique is called Virtual Configuration and it is basically a hardware prefetching technique.

Figure 2.7: Virtual reconfigurations on single-context FPGAs

Figure 2.7 shows the working principle for a single-context FPGA: the Virtual Configuration is implemented by reserving a second RP of the same size as the original one (considering a system with only one RP). In this way, while a RM is used in the foreground (active) RP, the background RP (temporarily unused) can be reconfigured to host the RM that will be required after the one currently in use: multiplexers are used for the context switch. This technique comes with a substantial resource overhead but allows a throughput improvement of almost 30% in stream-based applications, since the reconfiguration overhead can be partially hidden.
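A functional sketch of this double-buffering scheme is given below; all names are illustrative (on a real device the switching is done by multiplexers in the fabric, not by software):

```c
#include <assert.h>

/* Two contexts of identical size: one works in the foreground while
 * the other is being reconfigured in the background. */
typedef struct {
    const char *rm[2];  /* RM loaded in each context                    */
    int foreground;     /* index of the active context ("mux select")   */
} virtual_cfg_t;

/* Load the next RM into the hidden context; in hardware this overlaps
 * with the computation running in the foreground context. */
static void prefetch_background(virtual_cfg_t *v, const char *next_rm)
{
    v->rm[1 - v->foreground] = next_rm;
}

/* Context switch: just flip the multiplexer select, with no
 * reconfiguration wait; returns the newly active RM. */
static const char *activate_background(virtual_cfg_t *v)
{
    v->foreground = 1 - v->foreground;
    return v->rm[v->foreground];
}
```

The reconfiguration latency is hidden because `prefetch_background` runs concurrently with useful work, and `activate_background` costs only a context switch.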


Chapter 3

Initial Design Development and Analysis

The following chapter will introduce the reader to the initial test platform developed during the thesis project to perform the very first tests. The development board used for the entire thesis project was a ZedBoard [24], and the initial hardware architecture was built to dynamically add, at run-time, one Advanced eXtensible Interface (AXI) based Reconfigurable Module (RM) to the Xilinx Zynq 7020 Processing System (PS).

3.1 Concept Idea

The first detail which has guided the development of the testing platform is that the Zynq 7020 is not an FPGA [21]. It is a hybrid System on Chip (SoC) made up of two different parts:

• a hardwired PS, based on a dual-core ARM Cortex A9 CPU (with its caches, built-in accelerators and peripherals), capable of running software applications;

• the Programmable Logic (PL), which supports partial Run-Time Reconfigurable (RTR) architectures and is accessible from the PS through a multi-layered ARM Advanced Microcontroller Bus Architecture (AMBA) AXI interconnect.


Figure 3.1: Simplified block diagram of the Xilinx Zynq 7020

Due to the hybrid nature of the chosen SoC (see Figure 3.1), the developed test platform is not fully FPGA-based: it is a mixed architecture in which the hardwired ARM CPU can dynamically load into the PL one custom designed peripheral, choosing between two available ones.

Even though this first architecture was developed for demonstrative purposes only, this kind of reconfigurable architecture has great potential. General purpose processors have always provided more flexibility and shorter development time compared to full hardware solutions: in addition, nowadays they also offer impressive performance/Watt ratios, especially considering ARM-based microprocessors. For these reasons, the market is moving from partial RTR FPGAs to hybrid RTR SoCs: the Xilinx Zynq 7020 is only the first-born of a family for Xilinx and, considering competitors, Intel has already planned to include FPGAs in its future server Xeon CPUs. Following this market and research trend, this thesis project is focused on the idea of providing dynamically reconfigurable hardware accelerators on-demand to a hardwired general purpose processor in order to offload heavy workloads, both to lower the CPU usage and to improve the energy efficiency of the system.

3.2 The Xilinx Toolchain

In order to develop a system with the topology described in section 3.1, different development tools are required. In the Xilinx case, the toolchain includes three tools which cover the whole design flow.


• Vivado HLS provides a high-level synthesizer capable of generating and testing an IP core, starting from a functional description (in this thesis project, the generated IP cores have been described in C).

• Vivado Suite is the tool used to configure the PS, build the block design, implement the hardware architecture in the PL and generate a description of the whole customized platform (needed in Xilinx SDK).

• Xilinx SDK is an Eclipse based Integrated Development Environment (IDE) capable of automatically generating the software drivers for the whole platform: it can be used to develop and debug a software application run by the Zynq PS or by a soft processor loaded inside the PL (like the Xilinx MicroBlaze [27]).

Since one of the goals of this thesis is to collect experimental data to investigate benefits and limits of a RTR architecture, and not to investigate low-level implementation details and optimizations, the whole work flow has been optimized to minimize the design time, focusing the effort on architectural investigations and performance measurements. Using a full custom approach to develop a RM would imply spending much more time developing and debugging the system than when Vivado HLS is used to automatically generate IP cores with standard I/O interfaces (using, for example, an AXI4-Lite protocol). In addition, using a full custom RM would make it impossible to automatically generate the drivers required by the software application.

3.3 Adding Custom RTR Peripherals to the Zynq

The initial test platform has been developed following the concept idea discussed in section 3.1. Vivado HLS has been used to generate two slightly different versions of the same IP core with the following features:

• an AXI4-Lite interface used to write the control registers and to read the status registers;

• an AXI4-Stream interface to receive the input data stream and to supply the output one (32-bit integer numbers);

• pipelined architecture to increase the throughput;

• a dedicated interrupt line used to signal the end of the computation;

• 100MHz clock signal.


The difference between the two generated IP cores is that the first one multiplies the stream of input data by a programmable gain (a 32-bit integer value), while the second one adds a programmable offset (again, a 32-bit integer value) to the same stream.
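Functionally, the two RMs compute the following on their input stream. This is a software model of the cores' behavior, not the HLS source used in the project:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of the first RM: multiply each 32-bit sample by a gain. */
static void rm_multiplier(const int32_t *in, int32_t *out, size_t n,
                          int32_t gain)
{
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] * gain;
}

/* Model of the second RM: add an offset to each 32-bit sample. */
static void rm_adder(const int32_t *in, int32_t *out, size_t n,
                     int32_t offset)
{
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] + offset;
}
```

Since the two cores share the same AXI4-Lite/AXI4-Stream boundary and differ only in this inner operation, they can be swapped in the same RP without touching the static logic.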

The user can interact with the system through a terminal window on a PC (using a virtual serial connection over a USB cable), requesting the software application to load a specific RM, described by a partial bitstream file generated with Vivado Suite, into the Reconfigurable Partition (RP) (writing the partial bitstream file into its configuration memory) and to start a test calculation. The architecture topology is shown in Figure 3.2.

Figure 3.2: First reconfigurable architecture developed to demonstrate the feasibility of the concept idea

After power-on, the system boots from a SD card, following the sequence listed below:

• a First Stage Boot Loader (FSBL) is used to start up the PS;

• a total bitstream file is used to configure the whole PL with an initial configuration;

• the software application is loaded into the DDR3 memory.

Once the software application runs, the PS copies the partial bitstream files (one for each loadable RM) from the SD card into the faster DDR3 memory in order to improve the reconfiguration throughput. When a reconfiguration of the RP is requested (in this case by the user), the PS uses the Device Configuration Interface (DevC)-Processor Configuration Access Port (PCAP) peripherals to load a partial bitstream file from the DDR3 memory into the FPGA configuration memory. The PCAP was chosen among all the available reconfiguration peripherals for the following reasons:

• Xilinx designed it for this specific scenario, that is reconfiguring the PL using the PS;


• it has no resource overhead in the PL since it is already built in the hardwired part of the Zynq 7020;

• it is the reconfiguration interface with the highest bandwidth provided by Xilinx (about 130MB/s): to get closer to the theoretical limit of 400MB/s, a custom reconfiguration controller is needed (like the ones discussed in [4] and [2]), requiring extra design effort and a higher resource utilization.

Regarding the I/O data streams, a standard AXI Direct Memory Access (DMA) IP core, provided by Xilinx, was used to move data from/to the DDR3 memory to/from the loaded RM with high bandwidth (400MB/s from the DDR3 to an AXI4-Stream based IP core, 300MB/s in the opposite direction [23]) and with low CPU overhead.

3.4 First Performance Estimation

Even if the discussed partially reconfigurable system was developed for demonstrative purposes only and without any optimization, it is still interesting to study its performance and the behavior of the PCAP interface used to reconfigure the RP. The test application previously discussed manages the partial bitstreams in the following way:

• when the application starts, all the partial bitstream files (one for the multiplier, one for the adder and a blank one to erase the RP) are transferred from the SD card to the DDR3 memory: in this way it will be quicker to move them to the configuration memory of the RP later on;

• when required, the PCAP is used to move a partial bitstream file from the DDR3 memory to the configuration memory of the RP with a DMA transfer;

• the DevC-PCAP DMA transfer is checked via polling to know when the partial reconfiguration ends.

The time measurements needed to obtain the reconfiguration time are performed using the Global Timer, a 64-bit counter which is incremented every two clock cycles: thus, knowing the processor clock frequency, it is possible to calculate a time interval between two counter values with high resolution (about 3 ns). Furthermore, the size of the partial bitstream files is also known, since it is one of the parameters required by the software function used for the Partial Reconfiguration (PR) DMA transfer. Thanks to these data, an average reconfiguration throughput was calculated; it is shown in Table 3.1.
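The tick-to-time arithmetic can be sketched as follows. The clock constant is an assumption based on the 666 MHz figure used in this thesis; on the real platform the counter itself would be read through the Xilinx xtime_l.h helpers:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed CPU clock of the Zynq PS (about 666 MHz). The Global Timer
 * advances once every two CPU cycles, so one tick is about 3 ns. */
#define CPU_CLK_HZ 666666666ULL

/* Convert a Global Timer tick difference into nanoseconds.
 * (64-bit intermediate: fine for intervals up to a few seconds.) */
static uint64_t ticks_to_ns(uint64_t ticks)
{
    return (ticks * 2ULL * 1000000000ULL) / CPU_CLK_HZ;
}
```

One tick maps to 3 ns, which matches the resolution quoted above, and 333,333,333 ticks correspond to one second.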


                      Average Throughput
PCAP, D-Cache Off         124 MB/s
PCAP, D-Cache ON          130 MB/s

Table 3.1: Average performance of the DevC-PCAP peripherals

It is interesting to notice that, in this specific scenario, enabling the processor data cache provides almost a 5% gain in reconfiguration throughput (even if the system probably draws more power). Since the size of the partial bitstream files is about 300KB for all the RMs used in this demo, the reconfiguration time is about 2.4 ms. In addition to these values, there is another behavior worth discussing: when loading each of the three different RMs (2 different IP cores and 1 empty configuration) into the RP, the reconfiguration time is substantially constant.

This happens because, even if the three partial bitstream files describe three different RMs, every bitstream file describes the same partition of the FPGA, hence exactly the same area and reconfigurable resources. For this reason, all the partial bitstream files have exactly the same size and take about the same time to be loaded into the RP (since no compression algorithm is applied to these configuration files, the reconfiguration time is not affected by any internal redundancy in the RM architecture).


Chapter 4

Test Platforms and Collected Data

This chapter will present the results collected during several experiments, characterizing, by deduction from the measured data, the Run-Time Reconfigurable (RTR) platforms used in this thesis project.

4.1 Behavior of the Reconfiguration Time

During each of the many experiments with the different RTR architectures developed during the thesis project, the reconfiguration time has been measured. It is intended as the time overhead needed to load a Reconfigurable Module (RM) into a Reconfigurable Partition (RP), moving its partial bitstream file from a memory (in particular, the DDR3 memory of the ZedBoard was used) to the configuration memory of the FPGA.

As described in Section 3.3, the Processor Configuration Access Port (PCAP) has been used to reconfigure the Programmable Logic (PL) from a software application running on the Processing System (PS): in particular, the Direct Memory Access (DMA) of the Device Configuration Interface (DevC) controller has been used to move the partial bitstream files from the main memory to the PCAP peripheral. Given this reconfiguration method, the reconfiguration time has been defined as the time interval that starts at the beginning of the DMA transaction from the DDR3 and ends when the DevC DMA completes writing into the FPGA configuration memory and the RM is thus ready to be used. To properly measure this interval, one of the hardware timers of the ARM Cortex CPU included in the PS was used: specifically, the Global Timer, a 64-bit counter incremented every two ticks of the CPU clock signal. With a resolution of about 3 ns, the Global Timer is more than capable of providing a reliable reconfiguration time measurement.


All the results presented in this section have been collected on RTR systems in which the partial bitstream files were stored in the DDR3 memory: the partial bitstream files are initially stored on a SD card (so that they are not lost after a Power-on-Reset (PoR)) but, after the initial boot of the system, they are moved to the DDR3 memory by the software application running on the PS in order to increase the reconfiguration throughput. For the same reason, the Data Cache of the ARM Cortex A9 CPU has been activated in all the test platforms (unless differently specified) since, as shown in Section 3.4, it increases the reconfiguration throughput by about 5%. Table 4.1 shows the behavior of the reconfiguration time for different RP sizes and different RMs.

Reconfigurable Module               RP size [kB]   Reconfig. Time [µs]   Reconfig. Throughput [MB/s]
Blank Configuration                      313            2408.60                   130.0
AXIStream Adder                          313            2408.47                   130.0
AXIStream Multiplier                     313            2408.90                   129.9
Blank Configuration                      636            4894.05                   130.0
AXIStream Prime Number Calculator        636            4894.19                   130.0
AXIStream Multiplier                     636            4893.54                   130.0
Blank Configuration                      530            4082.61                   129.82
AXIStream FIR Filter                     530            4082.84                   129.81
AXIStream Multiplier                     530            4082.99                   129.81
AXIStream Image Processor                559            4307.29                   129.78
Blank Configuration                      559            4306.48                   129.80

Table 4.1: Behavior of the reconfiguration time with different RMs and RPs

Data in Table 4.1, Figure 4.1 and Figure 4.2 show that the reconfiguration time on the Xilinx Zynq 7020, with the PCAP/DMA-based approach, is independent of the RM and depends only on the size of the RP. The throughput is substantially constant and thus the reconfiguration time is linearly proportional to the size of the partial bitstream file. The content of the bitstream file itself, that is the specific RM, does not matter because, regardless of the complexity of the RM, the RP is fully reconfigured every time. In addition, since the partial bitstream files were not compressed, any redundancy in the architecture of a given RM is not exploited to reduce the size of the reconfiguration packet sent to the configuration memory of the PL. These observations, the linearity of the reconfiguration time and the exclusion of additional compression methods


Figure 4.1: Graph of the reconfiguration throughput versus the size of the reconfigurable partition

Figure 4.2: Graph of the reconfiguration time versus the size of the reconfigurable partition

for the bitstream redundancy, allow the developer to estimate the reconfiguration time, making it possible to use these RTR systems even for real-time applications.
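Given the essentially constant ~130 MB/s throughput, a developer can estimate the reconfiguration time from the RP size alone. A sketch of this estimator, using decimal units (kB = 1000 bytes) as in Table 4.1:

```c
#include <assert.h>

/* Measured average DevC/PCAP throughput; 130 MB/s equals 130 bytes/µs. */
#define PCAP_THROUGHPUT_MBS 130.0

/* Estimated reconfiguration time in microseconds for a RP whose
 * partial bitstream size is rp_kb kilobytes. */
static double estimate_reconfig_us(double rp_kb)
{
    return (rp_kb * 1000.0) / PCAP_THROUGHPUT_MBS;
}
```

For the 313 kB RP this predicts about 2408 µs, within 0.1% of the values measured in Table 4.1, which is the kind of predictability a real-time schedulability analysis needs.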


4.2 Move the Workload to the FPGA: Is It Worth It?

Nowadays, embedded processors have reached power efficiency levels that were unthinkable just 10 years ago: a typical 1 W smartphone System on Chip (SoC), like the Qualcomm Snapdragon 810, packs 8 64-bit ARM cores and a 4K-ready GPU. As a basis of comparison, a modern NVIDIA Tegra X1 mobile SoC offers 1 TeraFLOPS with FP16 operations, the same computational power as the fastest supercomputer up to the year 2000. Due to the extreme competitiveness of the latest ARM-based devices, it is not so straightforward that FPGAs can reach better performance within a similar power envelope.

4.2.1 ARM Cortex A9 vs HW IP Cores: 32-bit Integer Operations

32-bit Integer Multiplication/Addition

To investigate the competitiveness of the hardwired processor included in the PS with respect to a reconfigurable hardware approach, the same test platform described in Section 3.3 was used both to measure the reconfiguration time and as a benchmark to compare the performance of a single ARM Cortex A9 core (running at 666 MHz) with the performance achievable using high-level-synthesized hardware accelerators implemented in the PL. Both the previously discussed Advanced eXtensible Interface (AXI)4-Stream based 32-bit integer multiplier and the 32-bit integer adder have been used to process the same dataset. Their performance is reported in Table 4.2 together with the performance of the PS (running a single-thread application on only one Cortex A9 core). The measurements of the computation time of the IP cores also include the communication overhead due to the AXI4 interfaces, in order to account for the data movement that actually takes place when the computation is moved outside of the hardwired processor.


                                   SW Mul.      SW Adder     AXIStream      AXIStream
                                                             32-bit Mul.    32-bit Adder
Reconfig. Time [µs]                   /            /          2408.517       2408.369
Exec. Time - 1M operations [µs]   27072.393    25571.078     27039.621      27078.928
HW/SW Speed Up - No Overhead          /            /          1.001x         0.944x
HW/SW Speed Up                        /            /          0.919x         0.867x

Table 4.2: Performance comparison between a single ARM Cortex A9 core running at 666 MHz and two high-level-synthesized IP cores in 32-bit integer addition/multiplication
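The two speed-up rows of Table 4.2 follow from a simple model (a reconstruction of the arithmetic, not the benchmark code itself): the raw speed-up ignores the reconfiguration time, while the effective one charges it to the hardware run.

```c
#include <assert.h>

/* HW/SW speed-up ignoring the reconfiguration overhead. */
static double speedup_raw(double t_sw_us, double t_hw_us)
{
    return t_sw_us / t_hw_us;
}

/* Effective HW/SW speed-up when the RP must be reconfigured first. */
static double speedup_with_reconfig(double t_sw_us, double t_hw_us,
                                    double t_reconfig_us)
{
    return t_sw_us / (t_hw_us + t_reconfig_us);
}
```

Plugging in the multiplier figures from Table 4.2 (27072.393 µs software, 27039.621 µs hardware, 2408.517 µs reconfiguration) reproduces the 1.001x and 0.919x entries.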

[Bar graph: Mul.: SW 27.07 ms, HW 27.04 ms, HW incl. reconfig. overhead 29.45 ms; Add.: SW 25.57 ms, HW 27.08 ms, HW incl. reconfig. overhead 29.49 ms]

Figure 4.3: Execution time for 1 million 32-bit integer multiplications/additions

Table 4.2 and Figure 4.3 report very significant data: the two IP cores, generated with Vivado HLS (forcing the pipelining of their datapaths), are not faster than a single ARM Cortex A9 core clocked at 666 MHz. They reach the same throughput, but when the reconfiguration time is included in the total execution time, the global throughput drops below that of a single-threaded software application running on the PS. Moreover, this test is not representative of real-world applications: performing 1 million multiplications or additions in a row is not a common situation, and even in this case the reconfiguration overhead is not negligible. The reconfiguration time is actually 5 orders of magnitude greater than the time needed to perform one 32-bit multiplication, and only a very heavy workload could hide it. Of course, the FPGA may offer better performance at a higher clock frequency (the two IP cores used for the comparison run at only 100 MHz), but the ARM core could also run at a higher frequency and could be optimized to perform better (only one of the two cores of the PS was used, and no optimization was enabled in the compiler options).
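How heavy such a workload would have to be can be estimated with back-of-the-envelope arithmetic from the Table 4.2 timings (a sketch, not a measured figure): the reconfiguration pays off only once the accumulated per-operation saving exceeds the one-off reconfiguration time.

```python
T_RECONF_US = 2408.517               # multiplier RM reconfiguration time (Table 4.2)
T_SW_US_PER_OP = 27072.393 / 1e6     # software time per multiplication
T_HW_US_PER_OP = 27039.621 / 1e6     # hardware time per multiplication

# Operations needed before the tiny per-op saving amortizes the reconfiguration.
saving_per_op = T_SW_US_PER_OP - T_HW_US_PER_OP
break_even_ops = T_RECONF_US / saving_per_op   # on the order of 73 million ops
```

For the adder the hardware path is slower per operation than the software one, so no workload size can amortize the reconfiguration at all.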

32-bit FIR Filter/Prime Number Checker

To confirm the competitiveness of the ARM Cortex A9 core against automatically generated IP cores for 32-bit arithmetic operations, two other examples have been investigated: a Finite Impulse Response (FIR) filter and a prime number checker. Both IP cores use AXI4-Stream as the data interface and AXI4-Lite for control and status registers. Internally, both datapaths use 32-bit floating point operations, so as to test not only the integer performance but also the floating point capability of the Zynq 7020 PL.

The FIR filter works on a stream of 1000 32-bit integer numbers, applying 21-tap low-pass filtering and generating the output stream in a pipelined fashion. Its datapath uses two Multiply and ACcumulate (MAC) units to implement the 21-tap filtering in 11 clock cycles.
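As a functional reference for what the accelerator computes, a direct-form software FIR has the following shape (a minimal sketch; the actual filter coefficients used in the thesis are not reproduced, so `coeffs` below is a placeholder moving-average design):

```python
def fir_filter(samples, coeffs):
    """Direct-form FIR: each output is the dot product of the last
    len(coeffs) input samples with the coefficient vector."""
    taps = len(coeffs)
    history = [0] * taps                 # zero-initialized delay line
    out = []
    for x in samples:
        history = [x] + history[:-1]     # shift the new sample in
        out.append(sum(c * h for c, h in zip(coeffs, history)))
    return out

# 21 placeholder taps; a real low-pass design would come from a filter tool.
coeffs = [1.0 / 21] * 21
```

A sequential CPU executes 21 MACs per output sample; scheduling them on the accelerator's two MAC units matches the reported 11 cycles per output (ceil(21 / 2) = 11).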

The prime number checker, on the other hand, reads a stream of 1000 32-bit integer numbers and checks whether each of them is a prime number, generating an output stream of 0s and 1s to signal the result of each check (if the input value[i] is prime, then the output value[i] will be 1).
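Functionally, the checker implements the mapping below (a software sketch using simple trial division; the actual datapath of the thesis IP core is not reproduced here):

```python
def is_prime(n):
    """Trial division up to sqrt(n); adequate as a functional reference."""
    if n < 2:
        return 0
    d = 2
    while d * d <= n:
        if n % d == 0:
            return 0
        d += 1
    return 1

def prime_stream(values):
    """Map an input stream to the 0/1 output stream the IP core produces."""
    return [is_prime(v) for v in values]
```

Note that the trip count of the division loop is data-dependent, which limits how deeply such a kernel can be pipelined; this is one plausible reason why the hardware version struggles to beat the CPU.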

A test platform similar to the one used in Paragraph 4.2.1 was used to compare the performance of these two IP cores (running at 100 MHz) with that achieved by a software application running on one core of the PS at 666 MHz. The measurement of the execution time starts when the first sample is moved from the DDR3 memory to the selected RM and finishes when the last output value is written back to the main memory. Consequently, the performance comparison includes the communication overhead of moving data to and from the loaded IP core. The test results are reported in Table 4.3.

                        HW/SW Speed Up    HW/SW Speed Up
                                          with Reconfig. Ov.
FIR Filter                  7.277x            0.193x
Prime Number Checker        0.230x            0.202x

Table 4.3: Performance Comparison: FIR Filter and Prime Number Checker


[Bar graph: FIR: SW 0.97 ms, HW 0.13 ms, HW incl. reconfig. overhead 5.03 ms; Prime N. Checker: SW 7.93 ms, HW 34.35 ms, HW incl. reconfig. overhead 39.35 ms]

Figure 4.4: Graph of the execution time for the FIR filter and the Prime Number Checker

Table 4.3 and Figure 4.4 show that the ARM Cortex A9 core outperforms the hardware prime number checker, whether or not the reconfiguration time is considered part of the hardware computation time.

The FIR filter, on the other hand, is a completely different scenario. A hardware implementation makes it possible to parallelize the MAC operations needed to implement the 21-tap low-pass filter: using just two MAC units, the tested accelerator achieves more than 7x the performance of the ARM A9 core. Nevertheless, including the reconfiguration time in the total execution time makes the RTR hardware solution slower than the software implementation. This is the typical scenario in which the RTR solution may or may not be convenient, depending on how long the RM is used after it is loaded into the RP. Loading the RTR FIR filter to process only one stream of 1000 32-bit integer samples degrades performance instead of providing hardware acceleration. On the contrary, loading the RTR FIR filter and using it to process a large set of data would amortize the reconfiguration time, providing an actual performance improvement.
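This trade-off can be made concrete with the measured FIR timings (from Figure 4.4: roughly 0.97 ms in software and 0.13 ms in hardware per 1000-sample stream, with about 4.9 ms of one-off reconfiguration overhead). A sketch of the effective speedup as a function of how many streams are processed after a single reconfiguration, using those rounded figures:

```python
T_SW_MS = 0.97       # software time per 1000-sample stream (Figure 4.4)
T_HW_MS = 0.13       # hardware time per stream, communication included
T_RECONF_MS = 4.90   # one-off reconfiguration overhead (about 5.03 - 0.13)

def effective_speedup(n_streams):
    """Speedup over software when n_streams are filtered after one reload."""
    return (n_streams * T_SW_MS) / (T_RECONF_MS + n_streams * T_HW_MS)

# One stream: ~0.19x, matching the 0.193x of Table 4.3; around six streams
# the hardware path breaks even, and for large workloads the ratio tends to
# T_SW_MS / T_HW_MS, i.e. the 7.277x of Table 4.3 up to figure rounding.
```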

In any case, even when offloading the workload to hardware accelerators in the PL does not increase the system's peak performance, it can still reduce CPU utilization, which may improve efficiency and increase execution parallelism (the CPU can work on something else while the workload runs on the FPGA).
