Design and Implementation of a Hardened Reconfiguration Controller for Self‐Healing Systems on SRAM‐Based FPGAs

A Master Thesis Presented to Politecnico di Milano

By: Naser Derakhshan

Examiner: Axel Jantsch

Supervisor: Cristiana Bolchini

Spring 2013

IN THE NAME OF GOD


Abstract

As digital systems become larger and more complex, their dependability becomes more important, particularly in mission‐critical and safety‐critical applications. Among the platforms available for implementing a digital system, SRAM‐based Field Programmable Gate Arrays (FPGAs) are increasingly adopted in embedded systems because they flexibly meet multiple requirements, such as low cost, high performance, and fast turnaround time, compared to fixed Application Specific Integrated Circuits (ASICs). The most attractive feature of SRAM‐based FPGAs is the possibility of re‐programming¹ the device in a few clock cycles. This feature is further enhanced by Partial Dynamic Reconfiguration (PDR), which allows part of the device to be reconfigured on the fly while the rest keeps operating.

Nevertheless, SRAM‐based FPGAs are more susceptible to faults than other types of FPGAs and ASICs. One such fault, which occurs mostly at high altitude², is a bit flip in the configuration memory caused by ionizing radiation. If the bit flip alters critical bits³ in the configuration memory, the function of the design can be corrupted. Thus, appropriate hardening techniques should be used to increase device dependability.

In general, fault‐tolerant techniques are mostly based on spatial redundancy. However, these techniques can be combined with the FPGA’s reconfiguration capability for recovery. Since system complexity is increasing and hardening techniques demand more resources, a single FPGA may not suffice to contain the whole system. In this case, multi‐FPGA platforms are taken into account.

In this thesis, a hardened generic reconfiguration controller that manages the occurrence of soft errors in self‐healing systems implemented on SRAM‐based FPGAs is demonstrated and analyzed. The controller is able to correct single event upsets (SEUs) in the configuration memory, in both static and partially reconfigurable regions, by means of the Xilinx PDR capability. Moreover, the controller itself is hardened with fault‐tolerant techniques and is able to detect and mask its own errors. The developed controller is compared with similar approaches based on a microcontroller inside the FPGA. Finally, the presented structure is shown to be fully functional on the XUPV5‐LX110T evaluation board.

1 Re‐configuring

2 40000 feet and above

3 critical bits are those bits that cause functional failure if they change state


Preface

This report is submitted as a master thesis in fulfillment of the requirements for the master degree in the System on Chip program at the ICT School of the Royal Institute of Technology (KTH). The thesis was carried out in spring 2012 at Politecnico di Milano during an exchange study.

I would like to take this opportunity to express my sincere appreciation to Prof. Cristiana Bolchini, my supervisor at Politecnico di Milano, for her constant support, motivation and guidance during this project. Further, I would like to thank Dr. Antonio Miele, Dr. Chiara Sandionigi and Matteo Carminati for their practical advice, and all MicroLAB students for their kind support during this thesis. I would also like to express my sincere gratitude to all the KTH and Politecnico staff whose names I might not remember but who helped me a lot to finish my master thesis.

Last but foremost, I wish to thank my parents, Akbar Derakhshan and Tooran Hamedmoghadam; nothing can compare with their dedication, spiritual support and encouragement throughout my life. Moreover, I wish to thank my lovely wife, Zeinab Hassani, who interrupted her studies in Iran to accompany me during my studies abroad. I could not have finished my master studies without her support.


Table of Contents

1 Introduction

2 Background and Related Work

2.1 Motivation

2.2 Working scenario

2.3 Adopted Fault Model

2.4 Self‐Healing System Architecture

2.5 SEU Mitigation Schemes

2.6 Summary

3 Proposed Controller Architecture

3.1.1 Implemented design in the Master side

3.1.2 Implemented design in the slave side

3.2 Summary

4 Design Hardening

4.1 State Machine Encoding

4.2 Internal Signal Hardening

4.3 Interface Hardening

4.4 Bitstream Memory Protection

5 Test Results

6 Conclusion and Future Works

7 Glossary

8 Works Cited

9 Appendices

9.1 Appendix A: Bitstream Scrubbing and Readback

9.2 Appendix B: Redundancy

9.3 Appendix C: Xilinx Virtex‐5 overview

9.4 Appendix D: Configuration modes in Virtex 5

9.4.1 Configuration Modes and Pins in Virtex 5 [31]

9.4.2 Serial Configuration Interface [31]


LIST OF FIGURES

FIGURE 1 BASIC PREMISE OF PARTIAL RECONFIGURATION

FIGURE 2 FT SYSTEM ON MULTI‐FPGA PLATFORM. DISTRIBUTED SOLUTION (LEFT); CENTRALIZED SOLUTION (RIGHT)

FIGURE 3 A CONFIGURATION CONTROLLER BLOCK‐DIAGRAM BASED ON MICROBLAZE

FIGURE 4 RECONFIGURATION CONTROLLER BLOCK DIAGRAM

FIGURE 5 SLAVE FPGA (LEFT) AND MASTER FPGA (RIGHT)

FIGURE 6 CONFIGURATION CONTROLLER BLOCK DIAGRAM

FIGURE 7 BLOCK DIAGRAM OF THE MASTER SIDE AND THE TOP MODULE SIGNALS

FIGURE 8 PR CONTROLLER INTERFACE

FIGURE 9 MODULES INSIDE THE TOP (MASTER SIDE)

FIGURE 10 FAULT‐CLASSIFIER INTERFACE

FIGURE 11 FAULT CLASSIFIER FINITE STATE MACHINE DIAGRAM

FIGURE 12 PR CONTROLLER INTERFACE

FIGURE 13 PR CONTROLLER FINITE STATE MACHINE DIAGRAM

FIGURE 14 COMPLETE BLOCK DIAGRAM

FIGURE 15 FULL CONFIGURATION CONTROLLER INTERFACE

FIGURE 16 FULL CONFIGURATION CONTROLLER FINITE STATE MACHINE

FIGURE 17 THE IMPLEMENTED DESIGN WITH AN EXTERNAL MEMORY FOR STORING PARTIAL BIT‐STREAM FILES

FIGURE 18 IMPLEMENTED DESIGN ‐ SLAVE SIDE

FIGURE 19 DIFFERENTIAL INPUT BUFFER PRIMITIVE (IBUFDS)

FIGURE 20 THE CONNECTION BETWEEN TWO EVALUATION BOARDS

FIGURE 21 GENERATED PR REGIONS ON THE FPGA FABRIC

FIGURE 22 A SCHEMATIC FPGA STRUCTURE. TAKEN FROM [8]

FIGURE 23 TMR BASIC PRINCIPLE

FIGURE 24 TMR ‐ DEVICE LEVEL

FIGURE 25 XILINX VIRTEX‐5 XC5VLX110T DEVICE. TAKEN FROM [44]

FIGURE 26 XILINX XUPV5‐LX110T EVALUATION PLATFORM. TAKEN FROM [46]

FIGURE 27 VIRTEX‐5 FPGA SERIAL CONFIGURATION INTERFACE. TAKEN FROM [31]

FIGURE 28 SERIAL CONFIGURATION CLOCKING SEQUENCE. TAKEN FROM [31]

FIGURE 29 MASTER SERIAL MODE CONFIGURATION. TAKEN FROM [31]


LIST OF TABLES

TABLE 1 FPGA VS. ASIC DESIGN ADVANTAGES. TAKEN FROM [10]

TABLE 2 TOP MODULE (MASTER SIDE) INTERFACE PINS

TABLE 3 FAULT‐CLASSIFIER INTERFACE PINS

TABLE 4 PR CONTROLLER PIN DESCRIPTION

TABLE 5 FULL CONFIGURATION CONTROLLER. PIN DESCRIPTION

TABLE 6 BIT ORDERING FOR ICAP 8‐BIT MODE

TABLE 7 BIT ORDERING

TABLE 8 DEVICE UTILIZATION SUMMARY FOR CONFIGURATION CONTROLLER (EXCLUDING BITSTREAM MODULE)

TABLE 9 CONFIGURATION TIMES FOR DIFFERENT PARTIAL BITSTREAMS

TABLE 10 RESOURCE UTILIZATION OF ICAP CONTROLLER

TABLE 11 VIRTEX‐5 DEVICE FRAME COUNT, FRAME LENGTH, OVERHEAD, AND BITSTREAM SIZE [31]

TABLE 12 PERFORMANCE OVERVIEW OF MITIGATION SCHEMES. PART OF THE TABLE IS TAKEN FROM [12]

TABLE 13 VIRTEX‐5 (LX110T) DEVICE SPECIFICATION TAKEN FROM [43]

TABLE 14 VIRTEX‐5 CONFIGURATION MODES

TABLE 15 VIRTEX‐5 FPGA SERIAL CONFIGURATION INTERFACE PINS


1 Introduction

As digital systems become larger and more complex, their dependability becomes more important, particularly in mission‐critical and safety‐critical applications. Among the platforms available for implementing a digital system, SRAM‐based Field Programmable Gate Arrays (FPGAs) are increasingly adopted in embedded systems because they flexibly meet multiple requirements, such as low cost, high performance, and fast turnaround time, compared to fixed Application Specific Integrated Circuits (ASICs). The most attractive feature of SRAM‐based FPGAs is the possibility of re‐programming⁴ the device in a few clock cycles, which allows the system implemented on the FPGA to be updated during its lifetime. This is one of the reasons why SRAM‐based FPGAs are considered for mission‐critical applications where direct maintenance is difficult. The feature is further enhanced by Partial Dynamic Reconfiguration (PDR), which allows part of the device to be reconfigured on the fly while the rest keeps operating. Some advantages of using SRAM‐based FPGAs in space applications are discussed in [1], [2].

Nevertheless, SRAM‐based FPGAs are more susceptible to faults than other types of FPGAs and ASICs. One such fault, which occurs mostly at high altitude⁵, is a bit flip in the configuration memory caused by ionizing radiation [3], [4], [5]. Ionizing radiation (such as neutrons, or alpha particles emitted by natural radioactive isotopes present in device packaging) can induce undesired single event effects (SEEs) in most silicon devices. SEEs that result in temporary damage to the device are called soft errors. Soft errors in FPGAs often show up as bit flips in user flip‐flops, internal block memory and configuration memory. Bit flips within the configuration memory are especially challenging. If these bit flips alter the critical bits (those that cause functional failure if they change state) in the configuration memory, the function of the design can be corrupted. This is clearly unacceptable for mission‐ or safety‐critical applications. Thus, appropriate hardening techniques should be used before such devices can be deployed.

In general, fault‐tolerant techniques are mostly based on spatial redundancy. However, these techniques can be combined with the FPGA’s reconfiguration capability for recovery. Since the complexity of modern systems is increasing and hardening techniques demand more resources, a single FPGA may not suffice to contain the whole system. In this case, multi‐FPGA platforms are taken into account.

4 Re‐configuring

5 40000 feet and above

In this thesis, a generic dynamic partial reconfiguration controller for a fault‐tolerant design based on a multi‐FPGA platform is proposed. The final goal is a dependable controller that is able to recover from all recoverable faults⁶ by exploiting the reconfiguration capability of the FPGAs. The controller is able to correct SEUs in the configuration memory of the neighbor FPGA by means of the Xilinx PDR⁷ capability. It can correct and classify soft errors in the configuration memory, in both static and partially reconfigurable regions. Moreover, the controller itself is hardened and is able to detect and mask its own errors.

Modern fault‐tolerant architectures using PDR often rely on microprocessors such as PowerPC or MicroBlaze, embedded in the FPGA, as the main processing unit of the configuration controller, like the ones presented in [6], [7]. The innovative contribution of this thesis is the implementation of all necessary units and components of the FT⁸ configuration controller generically on the FPGA fabric. Moreover, this thesis focuses on multi‐FPGA platforms, which are less discussed in the literature. We propose a distributed solution in which each FPGA on the multi‐FPGA platform is responsible for monitoring the neighbor FPGA and, in case of faults, recovering it. This method, which is discussed in [8], increases overall reliability compared to a centralized solution. In addition, the proposed solution is able to correct single or multiple faults (assuming the faults are detected) inside the FPGA.

The rest of this thesis is organized as follows. Chapter 2 briefly introduces the preliminary aspects of the problem and the background needed to understand the rest of the thesis. It also discusses other SEU mitigation schemes and introduces the self‐healing system architecture on which our controller is based. Chapter 3 describes the proposed controller architecture. Chapter 4 presents the design hardening of the implemented controller. Chapter 5 presents the test results. Finally, Chapter 6 draws conclusions and outlines possible future research directions.

6 Recoverable faults are a kind of faults that do not cause permanent damage to the FPGA fabric

7 Partial Dynamic Reconfiguration

8 Fault Tolerance


2 Background and Related Work

In this thesis, we propose a dependable reconfiguration controller for embedded systems on multi‐FPGA platforms. Our aim is to increase the overall reliability of the system by means of the PDR capability. The chapter is structured as follows: Section 2.1 presents the motivation for the proposed work and introduces the background needed to understand the rest of the thesis. Section 2.2 discusses the working scenario of this thesis and its characteristics. In Section 2.3, we explain the adopted fault model. Section 2.4 presents the self‐healing system architecture, which we follow in the rest of the thesis. Other mitigation schemes are discussed in Section 2.5. Finally, Section 2.6 summarizes the chapter.

2.1 Motivation

Occasionally, electronic devices show erroneous behavior for no apparent reason. Through experiments and statistical analysis, scientists and engineers discovered that background radiation is the cause. These failures are generally rare and can be ignored in common applications. However, for many applications, such as mission‐critical and safety‐critical ones, it is important to consider the role of radiation in system reliability. Reliability problems due to radiation most commonly fall into the category termed single event effects (SEEs) and show up as a type of soft error called single event upsets (SEUs) [9].

Among the platforms available for implementing a digital system, SRAM‐based Field Programmable Gate Arrays (FPGAs) are increasingly adopted in embedded systems because they flexibly meet multiple requirements, such as low cost, high performance, and fast turnaround time, compared to fixed Application Specific Integrated Circuits (ASICs). Table 1 compares FPGAs with ASICs in several respects.

Table 1 FPGA vs. ASIC Design Advantages. Taken from [10]

FPGA Design

Advantage | Benefit
Faster time‐to‐market | No layout, masks or other manufacturing steps are needed
No upfront non‐recurring expenses (NRE) | Costs typically associated with an ASIC design
Simpler design cycle | Due to software that handles much of the routing, placement, and timing
More predictable project cycle | Due to elimination of potential re‐spins, wafer capacities, etc.
Field reprogrammability | A new bitstream can be uploaded remotely

ASIC Design

Advantage | Benefit
Full custom capability | For design since device is manufactured to design specs
Lower unit costs | For very high volume designs
Smaller form factor | Since device is manufactured to design specs


FPGA designs offer faster time to market and lower non‐recurring expenses (NRE). They also have a simpler design cycle than ASICs. However, in general, FPGA designs perform worse than ASICs in terms of logic density, circuit speed, and power consumption. In [11], the authors present empirical measurements quantifying the gap between 90 nm CMOS FPGAs and 90 nm CMOS standard‐cell ASICs. They observed that, for circuits implemented entirely using LUTs and flip‐flops (logic‐only), an FPGA is on average 40 times larger and 3.2 times slower than a standard‐cell implementation. An FPGA also consumes on average 12 times more dynamic power than an equivalent ASIC.

“Although FPGAs used to be selected for lower speed, complexity, volume designs in the past, today’s FPGAs easily push the 500 MHz⁹ performance barrier. With unprecedented logic density increases and a host of other features, such as embedded processors, DSP blocks, clocking, and high‐speed serial at ever‐lower price points, FPGAs are a compelling proposition for almost any type of design” [10]. The most attractive feature of SRAM‐based FPGAs is the possibility of re‐programming¹⁰ the device in a few clock cycles, which allows the system implemented on the FPGA to be updated during its lifetime. This is one of the reasons why SRAM‐based FPGAs are considered for mission‐critical applications where direct maintenance is difficult. The feature is further enhanced by Partial Dynamic Reconfiguration (PDR), which allows part of the device to be reconfigured on the fly while the rest keeps operating.

In this thesis, we focus on SRAM‐based FPGAs in multi‐FPGA platforms. In an SRAM‐based FPGA, the combinational and sequential logic is implemented in configurable logic blocks (CLBs), which are customized by loading configuration data (the bitstream) into the SRAM cells of the program memory [12]. Since the functionality of an SRAM‐based FPGA is determined by the configuration memory, any bit flip that alters critical bits¹¹ in the configuration memory corrupts the function of the design. Thus, to obtain a dependable system, especially in a harsh environment, the system on the chip should be hardened using suitable FT techniques.

2.2 Working scenario

The working scenario of this thesis is space applications, where SEUs are caused by secondary particles. According to [9], these are “secondary particles liberated by the collision of a neutron with a silicon atom or from a contaminant emitting an alpha particle in an electronic device. The neutrons are generated when cosmic rays and protons from space interact with the atmosphere. The cosmic rays are from both inside (the sun) and outside (novas and supernovas) of the solar system. The neutrons range in energy from below 1 million electron volts (MeV) to more than 1,000 MeV.”

Although it is possible to protect electronic equipment against these high‐energy neutrons by means of shielding, this is not practical for most applications because the amount of material required for the shield is prohibitive (e.g., as much as 30 meters of water for high‐energy neutrons) [9]. In addition to neutron effects, an SEU can be caused by alpha particles emitted by natural radioactive isotopes present in device material and packaging [9].

9 Xilinx Zynq‐7000 technology has already passed 800 MHz

10 Re‐configuring

11 critical bits are those bits that cause functional failure if they change state

2.3 Adopted Fault Model

We can organize the effects of ionizing radiation into three main categories: transient current pulses, changes in memory values (such as bit flips or SEUs), and latch‐up. The first two categories result in recoverable (or soft) faults, while latch‐up, which can result in severe overheating, melting, or vaporization, can damage the FPGA fabric and result in non‐recoverable (or hard) faults. Due to the difficulty of maintenance in mission‐critical applications, we have to add aging effects to the above‐mentioned categories. Aging effects can also end in non‐recoverable faults. Since the primary concern for FPGAs is soft faults, we expand on the first two categories in this section:

1‐ Transient current pulses may change the values of internal signals or may strike the clock line. Their effect may be transient and vanish after a short time, or they may propagate to flip‐flop inputs and get registered. In both cases, they can cause an erroneous value that leads to an incorrect result at the output. A suitable error detection and masking technique is necessary to prevent an incorrect result from propagating to other modules. Such an approach is discussed in [13], [14]. The fault can then be recovered by performing a reset.

2‐ The second type of recoverable fault is a change in memory values. SRAM‐based FPGAs have two types of memory: the user registers and block RAMs, which store the user data, and the static configuration memory, which stores the configuration bitstream. Any change in the configuration memory modifies the functionality of the system implemented inside the FPGA. The only way to recover the configuration memory is to overwrite its corrupted portion with the correct portion of the bitstream. In this work, we concentrate on hardening the design implemented inside the FPGA against upsets in the configuration memory.

The proposed controller is able to correct single‐ or multiple‐bit upsets (MBUs) in the configuration memory by partially reconfiguring the corrupted portion of the memory or, in the worst case, reconfiguring the whole FPGA.
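The recovery choices described in this section can be summarized in a short sketch. It is only an illustration under stated assumptions: the names `Recovery` and `choose_recovery`, and the rule of escalating to full reconfiguration after a fixed number of failed partial attempts, are hypothetical and are not taken from the implemented controller.

```python
from enum import Enum


class Recovery(Enum):
    RESET = "reset"                      # transient pulse: reset the affected logic
    PARTIAL = "partial reconfiguration"  # rewrite only the corrupted portion
    FULL = "full reconfiguration"        # worst case: reload the whole FPGA


def choose_recovery(config_memory_upset: bool,
                    failed_partial_attempts: int = 0,
                    max_partial_attempts: int = 1) -> Recovery:
    """Pick a recovery action for a recoverable (soft) fault.

    Assumption: escalation to full reconfiguration happens once
    `max_partial_attempts` partial reconfigurations have failed.
    """
    if not config_memory_upset:
        # Transient fault outside the configuration memory.
        return Recovery.RESET
    if failed_partial_attempts >= max_partial_attempts:
        return Recovery.FULL
    return Recovery.PARTIAL
```

Hard faults such as latch‐up or aging are outside this sketch, since they cannot be recovered by reconfiguration.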

2.4 Self‐Healing System Architecture

We applied a hybrid fault‐tolerant technique to our multi‐FPGA architecture. In this architecture, each FPGA hosts a portion of the design. The portion on each FPGA is hardened with hardware redundancy techniques and distributed among the available partially reconfigurable regions (PRR‐1 to PRR‐n).

The partitioning of the system into portions, and then into n PR regions, is not discussed here since the proposed architecture does not depend on it. The hardware redundancy techniques implemented in this scenario are able to detect, locate and mask faults and to report them to the reconfiguration controller of the neighbor FPGA for recovery¹².

Partial Reconfiguration is the modification of an operating FPGA design by loading a partial configuration file [15]. According to the Xilinx Partial Reconfiguration User Guide, “FPGA technology provides the flexibility of on‐site programming and re‐programming without going through re‐fabrication with a modified design. Partial Reconfiguration (PR) takes this flexibility one‐step further, allowing the modification of an operating FPGA design by loading a partial configuration file, usually a partial bit file. After a full bit file configures the FPGA, partial BIT files can be downloaded to modify reconfigurable regions in the FPGA without compromising the integrity of the applications running on those parts of the device that are not being reconfigured.” [15] The basic block diagram of Partial Reconfiguration is illustrated in Figure 1.

Figure 1 Basic Premise of Partial Reconfiguration

In this scenario, the FPGAs are structured into two separate types of regions: a static region and several partially reconfigurable (PR) regions. The portion of the system that is implemented in the PR regions can be modified by means of the partial reconfiguration controller. The reconfigurable logic is replaced by the contents of the partial bit file. The static logic remains functioning and is completely unaffected by the loading of a partial bit file. The static region contains the other parts of the design, which cannot (or should not) be reconfigured.

The partial BIT files (PR_Bit_x.bit) should be calculated offline prior to the FPGA design; however, they may be updated later during the design lifetime. As shown in Figure 1, each PR module can be modified by downloading one of several available partial bit files, PR_Bit_A.bit to PR_Bit_D.bit. These bit files can be stored in an external memory.

If these partial bit files are stored in a protected memory¹³, partial reconfiguration can improve FPGA fault tolerance by reconfiguring the faulty portion of the FPGA with a correct partial bitstream while the other logic remains functioning and completely unaffected.

Partial reconfiguration can be done via JTAG, SelectMAP, Master‐Serial, or ICAP. ICAP is a comprehensive solution for PR designs because it supports readback and verification of the design after reconfiguration. There are status registers in ICAP that indicate an error if the partial reconfiguration of a block has not succeeded. Furthermore, it is possible to implement a CRC checker in the PR controller to check the CRC of the received file before forwarding it to the ICAP. By using these two techniques (monitoring the ICAP registers and CRC checking), we can be sure that the target FPGA is partially reconfigured correctly.

12 A brief introduction to hardware redundancy is available in Appendix B.

13 Protected against radiation
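The CRC check performed before a partial bit file is forwarded to the ICAP can be sketched in software as follows. This is a minimal illustration, not the hardware checker of the thesis: it assumes the CRC‐32 provided by Python's `zlib` as a stand‐in for whatever CRC the controller uses, and `forward_if_valid` and the `write_to_icap` callback are hypothetical names.

```python
import zlib


def crc_ok(bitstream: bytes, expected_crc: int) -> bool:
    """Return True if the CRC-32 of the received file matches the expected value."""
    return (zlib.crc32(bitstream) & 0xFFFFFFFF) == expected_crc


def forward_if_valid(bitstream: bytes, expected_crc: int, write_to_icap) -> bool:
    """Forward the partial bit file only when its CRC is correct.

    `write_to_icap` is a hypothetical callback standing in for the
    interface that actually drives the ICAP.
    """
    if not crc_ok(bitstream, expected_crc):
        return False  # corrupted file: do not reconfigure with it
    write_to_icap(bitstream)
    return True
```

A corrupted file is rejected before it reaches the configuration memory; the ICAP status registers then cover errors that occur during the reconfiguration itself.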

Using the PR approach has some advantages and disadvantages. These include:

Advantages:

• Partial BIT files are calculated offline and stored in the FPGA prior to the FPGA design. Therefore, the controller needed for partial reconfiguration can be smaller than with other methods.

• BIT files can be updated later during the design lifetime

• The PR flow is straightforward and can be done from beginning to end in the Xilinx PlanAhead™ software

• The function of each partially reconfigurable region can be changed completely by using a different BIT file (ability to time‐multiplex hardware dynamically)

• Many interfaces exist to perform partial reconfiguration from outside

• There is no need to know the memory address of the PR modules

Disadvantages:

• Extra memory is needed to store both full configuration and partial reconfiguration BIT files

• Not all implementation options are available to the PR flow (e.g. techniques that perform optimization across the entire design) [15]

• A PR design affects performance. In general, one should expect 10% degradation in clock frequency, and should not expect to exceed 80% slice packing density. [15]

• Routing challenges may occur if the reconfigurable region is too small or is constructed of non‐rectangular shapes. [15]

We considered a distributed solution for this multi‐FPGA design, in which each FPGA is responsible for monitoring its neighbor FPGA and, in case of a fault, recovering the neighbor FPGA to a correct state¹⁴. Another approach would be a centralized solution, in which a rad‐hard FPGA monitors all other FPGAs in the design. The main advantage of the distributed over the centralized solution is that the controller does not need to reside in a separate device. It can be implemented alongside the main system on the same FPGAs [8]. Moreover, the distributed solution is independent of the number of FPGAs, whereas in the centralized solution the number of FPGAs must be defined prior to the design. In both scenarios, the original configuration bitstreams should be protected against SEUs. We discuss this issue in Section 4.4. Figure 2 illustrates the basic principle of the distributed and centralized solutions.
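One way to picture the distributed scheme is as a monitoring assignment between FPGAs. The sketch below assumes a simple ring, in which FPGA i monitors FPGA (i + 1) mod n; the ring arrangement and the name `neighbor_map` are illustrative assumptions, since the text only states that each FPGA monitors its neighbor.

```python
def neighbor_map(n: int) -> dict:
    """Map each FPGA index to the neighbor it monitors (assumed ring order)."""
    if n < 2:
        raise ValueError("a distributed scheme needs at least two FPGAs")
    return {i: (i + 1) % n for i in range(n)}
```

With this assignment every FPGA is monitored by exactly one other FPGA, and the same rule applies for any n, which reflects the independence from the number of FPGAs noted above.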

14 By means of a reconfiguration controller

Figure 2 FT system on Multi‐FPGA platform. Distributed solution (left); Centralized solution (right)

Like any other digital design, there is a tradeoff between software‐based and hardware‐based implementation of the above‐mentioned architecture. In a software‐based implementation, a soft or hard processor (such as MicroBlaze, PowerPC or ARM) must be embedded in the design. The processor, as shown in Figure 3, can then manage the reconfiguration process by setting the required registers for read/write operations of the Xilinx XPS HWICAP core [16]. Although a software‐based solution gives better flexibility to the user, the processor itself is a point of failure and should be hardened. The only method for hardening a soft processor involves triplication, which can be very costly in terms of resource utilization.

Figure 3 A configuration controller block diagram based on MicroBlaze

In our proposed hardware‐based solution, the controller is implemented purely in hardware without any processor. Implementing it this way lets the designer apply any available FT technique to the controller. In addition, a hardware implementation can be optimized for space and speed.

2.5 SEU Mitigation Schemes

Any time the FPGA is powered up, all its configuration contents are refreshed. Therefore, the simplest way to recover the FPGA to a correct condition is to power cycle it. However, this method is not applicable to many applications because it causes the FPGA to stop functioning for several seconds. This duration is not tolerable for many applications. In these applications, other mitigation techniques should be deployed. Moreover, the state of the FPGA will be lost and a synchronization technique should be deployed to synchronize the FPGA with the other processing elements in the design.

Another mitigation scheme is “bitstream scrubbing and readback” (or simply scrubbing), which means reading back the configuration bitstream stored in the configuration memory, comparing it with the original one and correcting any affected configuration bits. The process is performed continuously, independently of the occurrence of a soft error. Such an approach is discussed in [17], [18]. Since this approach is blind, it introduces latency in detecting a fault and may cause much more overhead than the other approaches because of the continuous readback and checking¹⁵. Some recent work aims to make scrubbing faster and on demand. In [19], the author proposes a constraint‐driven re‐placement method to reduce the number of sensitive configuration frames and, consequently, the scrubbing time.
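A single scrubbing pass can be sketched as a loop over configuration frames. This is an illustrative software model only: `read_frame` and `write_frame` are hypothetical callbacks standing in for readback and frame rewrite, and `golden_frames` is assumed to be a protected copy of the original bitstream split into frames.

```python
def scrub(read_frame, write_frame, golden_frames):
    """Run one scrubbing pass; return the addresses of corrected frames."""
    corrected = []
    for addr, golden in enumerate(golden_frames):
        # Readback and comparison happen for every frame, faulty or not,
        # which is why blind scrubbing adds latency and overhead.
        if read_frame(addr) != golden:
            write_frame(addr, golden)
            corrected.append(addr)
    return corrected
```

In a real scrubber this pass repeats continuously, so the worst‐case detection latency is roughly the time of one full pass over all frames.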

A faster, on‐demand solution is the modification of an operating FPGA design by loading a partial configuration bitstream. Partial reconfiguration is only a recovery technique, which means that soft errors must be detected (and located) before they can be repaired. Detection and masking can be performed by well‐known hardware redundancy techniques, either triple modular redundancy (TMR) [20], [21], [22], [23] or duplication with comparison (DWC) combined with concurrent error detection (CED) [24].
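The detection and masking step can be illustrated with a software model of the two redundancy schemes. This is a generic sketch of TMR voting and DWC comparison on integer‐encoded outputs, not the hardened logic of the thesis; the function names are hypothetical.

```python
def tmr_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-out-of-3 majority of three replica outputs (masks one fault)."""
    return (a & b) | (a & c) | (b & c)


def locate_faulty(a: int, b: int, c: int):
    """Return the index of the single disagreeing replica.

    Returns None when all replicas agree, and also when all three
    differ, in which case the faulty replica cannot be identified.
    """
    if a == b == c:
        return None
    if b == c:
        return 0
    if a == c:
        return 1
    if a == b:
        return 2
    return None


def dwc_error(a: int, b: int) -> bool:
    """Duplication with comparison: detects a mismatch but cannot mask it."""
    return a != b
```

TMR both masks the error and points to the replica that needs reconfiguration, whereas DWC only signals that one of the two copies is wrong.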

A first implementation of this kind of reconfiguration controller is presented in [25]. The author of that paper proposes a distributed mesh topology in which each FPGA monitors the neighbor FPGA in a multi‐FPGA platform and triggers the reconfiguration of the faulty portion of the neighbor FPGA. However, the solution proposed there for hardening the reconfiguration controller is based on blind readback and checking, which may introduce delay in recovery. Another work is presented in [16], where the author compares different software‐based solutions for the reconfiguration controller to achieve the minimum reconfiguration time. However, since the reconfiguration controller is implemented in an embedded processor, hardening the controller is very difficult. The latest study of this kind is presented in [26], where the author implements a hardware‐based ICAP controller for partial reconfiguration. We compare these approaches with our proposed controller, in terms of speed and resource utilization, later in this thesis.

2.6 Summary

In this chapter, we presented the necessary requirements for a multi‐FPGA system in a mission‐critical application. We discussed the importance of SRAM‐based FPGAs and introduced their limitations in different environments. We also included a brief comparison of performance, consumption, cost, and flexibility between SRAM‐based FPGAs and similar embedded processing units. Then, we showed that although SRAM‐based FPGAs are attractive not only in commercial markets but also in mission‐critical and safety‐critical applications, special hardening techniques must be used in a harsh environment. Moreover, we described our working scenario and its characteristics. Next, we covered the main types of fault that threaten electronic devices in this environment, and we discussed radiation and its effects on electronic devices in general and on SRAM‐based FPGAs in particular.

15 For more information regarding scrubbing and Xilinx SEM controller please refer to Appendix A.

Furthermore, we introduced the self‐healing system architecture on which our controller is based. We also discussed other possible approaches for increasing the reliability of SRAM‐based FPGA designs, and we performed a brief literature analysis of similar approaches.

In the next chapter, we will discuss our proposed solution for this scenario.


3 Proposed Controller Architecture

The main problems in a fault‐tolerant system are, first, to detect an error during system operation; then, to locate the error as fast as possible; next, to recover the system to a normal condition; and last, to bring the system back to the correct state. Error detection and localization can be done by means of online checkers like the one presented in [27], in which the author presents an on‐line testing technique for TMR.

Another approach is to combine 2‐rail logic and self‐checking to have a concurrent error detection technique like the one presented in [24].
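To illustrate the 2‐rail idea, the sketch below models the classic two‐rail checker cell: it merges two two‐rail pairs into one, and its output is a valid (complementary) pair only if both inputs are valid. This is the textbook cell, given here under the assumption that it is representative; it is not claimed to be the exact circuit used in [24]. The function names are hypothetical.

```python
from typing import Tuple

Pair = Tuple[int, int]


def is_valid(pair: Pair) -> bool:
    """A two-rail pair is valid when its two rails are complementary."""
    return pair[0] != pair[1]


def two_rail_checker(pair_a: Pair, pair_b: Pair) -> Pair:
    """Classic two-rail checker cell.

    With valid inputs the outputs are (a XOR b, a XNOR b), which are
    complementary. Any invalid input pair (0,0) or (1,1) forces the
    output rails to be equal, so the error propagates to the output.
    """
    a, a_n = pair_a
    b, b_n = pair_b
    f = (a & b_n) | (a_n & b)
    g = (a & b) | (a_n & b_n)
    return f, g
```

Cells of this kind can be cascaded in a tree to compress many two‐rail error indications into a single two‐rail error signal.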

In this thesis, we focus only on fault recovery by means of the PDR capability. Our proposed solution is based on the design methodology presented in [8]. As shown in Figure 2, each FPGA (FPGA_i) in our architecture hosts a reconfiguration controller. The main responsibilities of these controllers are as follows:

1‐ The controller has to monitor the error signals of the PR regions, the static region, and the reconfiguration controller of the next FPGA (FPGA_i+1) in the proposed mesh topology.

2‐ In case of any error in FPGA_i+1, the controller should take appropriate action to recover FPGA_i+1 to a correct condition by means of reconfiguration.

3‐ The controller itself should be hardened so that, if a fault occurs in the controller, it detects, locates, and masks the fault and informs the reconfiguration controller in FPGA_i‐1, which performs the recovery.
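The neighbor relationship implied by these responsibilities can be sketched as index arithmetic. The wrap‐around (the last FPGA monitoring the first) is an assumption made here so that every FPGA has both a monitor and a monitored neighbor; the text only states that FPGA_i monitors FPGA_i+1 and is recovered by FPGA_i‐1. All function names are hypothetical.

```python
from typing import Dict, Iterable


def monitors(i: int, n: int) -> int:
    """Index of the FPGA whose error signals FPGA_i watches (FPGA_i+1)."""
    return (i + 1) % n   # wrap-around is an assumption, see lead-in


def monitored_by(i: int, n: int) -> int:
    """Index of the FPGA responsible for recovering FPGA_i (FPGA_i-1)."""
    return (i - 1) % n


def recovery_plan(n: int, faulty: Iterable[int]) -> Dict[int, int]:
    """Map each faulty FPGA to the FPGA that should reconfigure it."""
    return {f: monitored_by(f, n) for f in sorted(faulty)}
```

For example, with two FPGAs each one monitors the other, and with four FPGAs a fault in FPGA 2 would be handled by FPGA 1.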

By considering these responsibilities, the controller can be organized into four main parts: Fault Classifier, Partial Reconfiguration (PR) Engine, Full Reconfiguration Engine, and Bitstream Module. The main block diagram of the controller is illustrated in Figure 4.


The fault classifier has to monitor the error signals, which are encoded with the two‐rail coding (TRC) technique [28], [29]. If an error is detected on a PR region, the Fault Classifier initiates the PR Engine with the address of the relevant partial bitstream. Then, it monitors the error signals again to see whether the error is corrected or not. If the error is detected inside the static region or the PR controller of FPGA_i+1, the Fault Classifier initiates the Full Reconfiguration Engine. The Full Reconfiguration Engine may also be initiated if the PR Engine could not fix an error in a PR region after a specific number of tries.

The original bitstreams in our design are stored in a rad‐hard external memory. The Bitstream Module is responsible for providing the necessary protocol for communication with this memory at the maximum possible speed. To achieve the maximum speed for reconfiguration, partial reconfiguration is done via the Internal Configuration Access Port (ICAP) at 3.2 Gbps. However, the ICAP could not be used for full reconfiguration; for this reason, the full reconfiguration is done via master‐serial configuration mode at 10 Mbps.

The controller in this thesis is implemented and tested on two FPGA platforms (Figure 5); however, it can be extended to any number of FPGAs, since the implemented solution is independent of the number of FPGAs.

In the implemented solution, the configuration controller resides in one FPGA (we call this FPGA the Master) and the system which should be hardened by means of PDR resides in another FPGA (we call this FPGA the slave).

In this section, we describe the implemented controller (Figure 6) in detail. We start by explaining the components on the master side and then the component on the slave side.

Figure 4 Reconfiguration Controller block diagram
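The decision flow of the Fault Classifier (partial reconfiguration first, escalation to full reconfiguration) can be sketched as below. This is a behavioural model under stated assumptions, not the implemented hardware: an error is assumed to be signalled by a non‐complementary two‐rail pair, the retry limit of 3 is hypothetical (the text only says "a specific number"), and all names are invented for illustration. The two data rates are the ones quoted in the text.

```python
from enum import Enum
from typing import Callable, Tuple

ICAP_RATE_BPS = 3.2e9           # partial reconfiguration via ICAP (from the text)
MASTER_SERIAL_RATE_BPS = 10e6   # full reconfiguration, master-serial (from the text)
MAX_PR_TRIES = 3                # assumption: not specified in the text


class Source(Enum):
    PR_REGION = "pr_region"
    STATIC_REGION = "static_region"
    PR_CONTROLLER = "pr_controller"


class Action(Enum):
    NONE = "none"
    PARTIAL = "partial"
    FULL = "full"


def has_error(signal: Tuple[int, int]) -> bool:
    """Assumed convention: equal rails, (0,0) or (1,1), indicate an error."""
    return signal[0] == signal[1]


def classify(source: Source,
             read_signal: Callable[[], Tuple[int, int]],
             run_pr_engine: Callable[[], None],
             max_tries: int = MAX_PR_TRIES) -> Action:
    """Return the action that resolved the error, or the escalation needed."""
    if not has_error(read_signal()):
        return Action.NONE
    if source is not Source.PR_REGION:
        return Action.FULL              # static region or PR controller
    for _ in range(max_tries):
        run_pr_engine()                 # load the relevant partial bitstream
        if not has_error(read_signal()):
            return Action.PARTIAL       # error cleared after re-checking
    return Action.FULL                  # PR Engine could not fix the error


def transfer_time_s(bitstream_bits: int, rate_bps: float) -> float:
    """Ideal transfer time, ignoring protocol and memory overhead."""
    return bitstream_bits / rate_bps


class FakeRegion:
    """Test double: a region whose error clears after N reconfigurations."""

    def __init__(self, tries_needed: int) -> None:
        self.remaining = tries_needed

    def read_signal(self) -> Tuple[int, int]:
        return (1, 1) if self.remaining > 0 else (0, 1)

    def reconfigure(self) -> None:
        self.remaining -= 1
```

At the quoted rates the ICAP path is 320 times faster per bit than the master‐serial path, which is why full reconfiguration is kept as the fallback rather than the default.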


Figure 5 Slave FPGA (left) and master FPGA (right)

Figure 6 Configuration Controller Block Diagram

[Labels in the diagram. Virtex‐5 Evaluation board (Slave Side): FPGA‐2 (Target) with RM1, RM2, RM3, Static Parts, and ICAP; CPLD‐2 (routing full configuration interface to the dedicated FPGA‐2 configuration pins); LED1, LED2, LED3. Virtex‐5 Evaluation board (Master Side): FPGA‐1 (Master) with PR Controller (Master Side), Full configuration controller, Fault classifier, and BRAM (Bitstream‐1, Bitstream‐2, Bitstream‐3); CPLD‐1 (routing the platform flash to the FPGA‐1); Platform Flash; LED1, LED2, LED3. Connections: Partial Reconfiguration interface, Error Signals, Full configuration of FPGA‐2 via master serial.]