Design and Verification of SOPC FDP2009 and Research of Reconfigurable Applications By Zhang Fanjiong


Supervisor: Prof. Tong Jiarong

Thesis Period: Aug, 2009 ~ Jul, 2010

Department of Microelectronics and Information Technology (IMIT), School of Information and Communication Technology (ICT),

Royal Institute of Technology (KTH), Stockholm, Sweden


Content

ABSTRACT
CHAPTER 1. INTRODUCTION
1.1 RESEARCH BACKGROUND
1.2 ORGANIZATION OF THE THESIS
CHAPTER 2. DESIGN AND IMPLEMENTATION OF FDP2009 CHIP
2.1 THE OVERALL ARCHITECTURE OF FDP2009 SOPC
2.2 DESIGN OF THE EXTERNAL BUS INTERFACE (EBI)
2.3 DESIGN FLOW OF FDP2009
2.4 VERIFICATION OF FDP2009
2.5 FDP2009 CHIP RESULT AND PLATFORM TEST RESULTS
2.6 THE BUG CORRECTION OF EBI DESIGN
2.7 SUMMARY OF THE FDP2009 SOPC DESIGN AND VERIFICATION
CHAPTER 3. SOFTWARE DESIGN BASED ON FDP2009 CHIP
3.1 COMPILER AND LINKER
3.2 BOOTLOADER AND INTERRUPT HANDLING
3.3 SOPC TEST SUITE PROGRAM DESIGN
3.4 SOFTWARE DEBUG USING RISCWATCH
3.5 SUMMARY OF THE SOFTWARE DESIGN BASED ON FDP2009 PLATFORM
CHAPTER 4. RECONFIGURABLE IMAGE NOISE FILTER APPLICATION
4.1 IMAGE NOISE THEORIES
4.2 IMAGE NOISE REDUCTION RESEARCH PROGRESSES AND THEIR SHORTCOMINGS
4.3 THE DESIGN OF THE BUS MACRO
4.4 THE RECONFIGURABLE SYSTEM IMPLEMENTATION BASED ON THE BUS MACRO DESIGN
4.5 RECONFIGURABLE IMAGE NOISE REDUCTION FILTER DESIGN
4.6 THE PARTIAL RECONFIGURABLE FILTER TEST PLATFORM INTRODUCTION
4.7 RECONFIGURABLE IMAGE FILTER TEST RESULTS
4.8 RECONFIGURATION TIME TEST AND ANALYSIS
4.9 CONCLUSION OF THE RECONFIGURABLE APPLICATION DESIGN
CHAPTER 5. CONCLUSION AND RESEARCH OUTLOOK
5.1 CONCLUSION
5.2 RESEARCH OUTLOOK


FIGURES

FIG 1.1 TRADITIONAL FPGA ARCHITECTURE
FIG 1.2 THE CONTENT OF SOPC
FIG 1.3 IMPLEMENTATION METHODS OF SOPC
FIG 1.4 DYNAMIC RECONFIGURATION MAKES HIGHER USAGE OF THE HARDWARE RESOURCES
FIG 2.1 OVERVIEW OF FDP-FPGA SOPC ARCHITECTURE
FIG 2.2 OVERVIEW OF FDP CONFIGURATION COMPONENT
FIG 2.3 WRITE-TO AND READ-FROM SEQUENCE BETWEEN EBI AND 16-BIT MEMORY DEVICE
FIG 2.4 THE BLOCK-LEVEL INTERCONNECT AND STRUCTURE OF EBI
FIG 2.5 EBI CONFIGURATION REGISTERS' USAGE
FIG 2.6 32-BIT SLAVE CONNECTION SCHEMATIC TO 64-BIT PLB BUS
FIG 2.7 PLB 4-WORD LINE TRANSFER TIMING DIAGRAM
FIG 2.8 PLB WORD READ OPERATION TIMING DIAGRAM
FIG 2.9 PLB WORD WRITE OPERATION TIMING DIAGRAM
FIG 2.10 FLASH READ TIMING
FIG 2.11 FLASH PROGRAM TIMING
FIG 2.12 FLASH ERASE TIMING
FIG 2.13 SRAM READ TIMING
FIG 2.14 SRAM WRITE TIMING
FIG 2.15 EBI'S BLOCK-LEVEL TESTBENCH ARCHITECTURE
FIG 2.16 SIMULATION WAVEFORM OF EBI'S LINE TRANSFER (PART 1)
FIG 2.17 SIMULATION WAVEFORM OF EBI'S LINE TRANSFER (PART 2)
FIG 2.18 SIMULATION WAVEFORM OF EBI'S WORD TRANSFER
FIG 2.19 FDP-SOPC DESIGN FLOW
FIG 2.20 DIAGRAM OF POWERPC CPU BLOCK VERIFICATION
FIG 2.21 THE IMPLEMENTATION OF REGRESSION AUTOMATION SYSTEM
FIG 2.22 TEST-BOARD OF FDP-SOPC CHIP
FIG 2.23 FDP2009 SOPC CHIP OVERALL LAYOUT AND FLOORPLAN
FIG 2.24 THE PREVIOUS BACK-TO-BACK WRITE OPERATIONS OF EBI
FIG 2.25 THE IMPROVED BACK-TO-BACK WRITE OPERATIONS OF EBI
FIG 3.1 REGISTER USAGE OF POWERPC EABI
FIG 3.2 STACK FRAME STRUCTURE OF POWERPC EABI
FIG 3.3 THE WHOLE SOPC SOFTWARE INITIALIZATION SEQUENCE
FIG 3.4 FETCHING THE PROGRAM CODE AND DATA FROM FLASH TO SRAM AFTER BOOT-UP
FIG 3.5 GUI OF THE RISCWATCH SOFTWARE AND DEBUGGING INTERFACE
FIG 4.1 THE PROPOSED BUS MACRO STRUCTURE SCHEMATIC
FIG 4.2 THE BUS MACROS SEPARATE THE FPGA RESOURCES INTO STATIC REGIONS AND PARTIAL RECONFIGURABLE REGIONS
FIG 4.3 THE SYSTEM DIAGRAM OF THE PARTIAL RECONFIGURABLE FILTER
FIG 4.4 THE CONFIGURATION PROCEDURE OF THE FILTER SYSTEM
FIG 4.5 IMAGE PIXEL PROCESSING ORDERING IN THE PROPOSED RECONFIGURABLE FILTER
FIG 4.6 THE DETAILED IMPLEMENTATION PROCESS OF THE PR DESIGNS
FIG 4.7 THE OVERALL STRUCTURE DIAGRAM OF FDP300K FPGA
FIG 4.8 THE BLOCK DIAGRAM OF THE RECONFIGURABLE FILTER HARDWARE SYSTEM
FIG 4.9 FDP300K CHIP AND ITS PACKAGE
FIG 4.10 THE FDP300K TEST BOARD AND TEST INSTRUMENTS
FIG 4.11 THE INPUT “LENA” IMAGE WHICH CONTAINS BOTH GAUSSIAN AND SALT & PEPPER NOISE
FIG 4.12 THE INTERMEDIATE RESULT OF THE RECONFIGURABLE FILTER
FIG 4.13 THE FINAL RESULT OF THE RECONFIGURABLE FILTER
FIG 4.14 THE RESULT “LENA” IMAGE OF THE RECONFIGURABLE FILTER PROPOSED IN THE THESIS
FIG 4.15 THE RESULT “LENA” IMAGE OF THE CONVENTIONAL MEDIAN FILTER
FIG 4.16 THE RESULT “LENA” IMAGE OF THE CONVENTIONAL AVERAGE FILTER
FIG 4.17 THE “BARBARA” IMAGE CONTAINING 20DB GAUSSIAN NOISE AND SALT & PEPPER NOISE
FIG 4.18 THE RESULT “BARBARA” IMAGE OF THE RECONFIGURABLE FILTER PROPOSED IN THE THESIS
FIG 4.19 THE RESULT “BARBARA” IMAGE OF THE CONVENTIONAL MEDIAN FILTER
FIG 4.20 THE RESULT “BARBARA” IMAGE OF THE CONVENTIONAL AVERAGE FILTER


TABLES

TAB 2.1 POWERPC PLB ADDRESS ALLOCATIONS
TAB 2.2 DCR ADDRESS ALLOCATIONS
TAB 2.3 THE SUPPORTED MEMORY SIZES AND THE EBI DCR REGISTER CONTROLLING BITS
TAB 2.4 SIMULATION TIME COMPARISON BETWEEN THE PROPOSED VERIFICATION METHOD AND THE CONVENTIONAL METHOD
TAB 2.5 FDP-SOPC CHIP PARAMETERS
TAB 3.1 FDP2009 SOPC TEST-SUITE CATEGORIES
TAB 4.1 THE FILTER PERFORMANCE COMPARISON BETWEEN THE FILTER PROPOSED IN THIS THESIS AND THE CONVENTIONAL FILTERS
TAB 4.2 THE PARTIAL RECONFIGURING SPEED BETWEEN FDP300K AND XILINX


Abstract

In recent years, reconfigurable devices have developed rapidly because of their flexibility and lower development cost. However, their intrinsic shortcomings, such as high power consumption and low speed, make complex designs difficult to realize. This has led to the idea of combining an ASIC (Application-Specific Integrated Circuit) and a reconfigurable device on a single chip, known as SOPC (System on Programmable Chip). SOPC not only decreases development risk and time to market, but can also serve different applications, especially products whose requirements keep changing, such as communication and network products.

Dynamic reconfiguration means that the reconfigurable device on the chip can be reconfigured repeatedly, so that it performs different functions at different times. Compared with static reconfiguration, dynamic reconfiguration uses the reconfigurable device more thoroughly. It is an active research topic worldwide, especially in reconfigurable computing.

This thesis summarizes my research work on reconfigurable SOPC in three major parts: hardware, software and application. The following work and innovations were completed:

1. SOPC hardware system architecture design and discussion: helping to define the system architecture and design goals, designing the EBI controller used in the SOPC, and integrating the blocks in the system.

2. The building-up of the SOPC system-level and block-level verification environments, the set-up of the hardware-software co-simulation environment, and the post-layout simulation and formal verification tasks. We propose an innovative automated regression system. The system achieves the same simulation coverage (95%) while the total simulation time is reduced by approximately 30%.

3. SOPC software design, including OS kernel porting, driver design and application design: the design of the PowerPC initialization program, the UART (Universal Asynchronous Receiver/Transmitter) driver and the reconfiguration communication driver, and the writing of test cases specialized for system verification and hardware testing.


reconfigurable logic core. And we realize the whole reconfigurable system based on this bus macro.

5. The reconfigurable application research based on the Reconfigurable Logic Core. A reconfigurable image filter is designed and implemented on the FDP300K Reconfigurable Logic Core device, using the self-designed internal bus macro to implement the partially reconfigurable system. The test results show that the reconfigurable filter offers fast configuration speed and good output image quality.

Keywords: Dynamic reconfiguration, partial reconfiguration, configurable computing, image noise reduction algorithm


Chapter 1. Introduction

1.1 Research Background

Integrated Circuits (ICs) can be categorized by design methodology: full-custom, standard-cell based and reconfigurable hardware. While many coarse-grained devices have been developed to meet the needs of specific applications, the FPGA (Field Programmable Gate Array) is still the most common and widely used fine-grained reconfigurable device. Over the past two decades, ASICs (Application-Specific Integrated Circuits) and FPGAs have remained the mainstream electronic design technologies. For a considerable period, their features made them popular in different market areas: ASICs are used to build specialized products whose very high volume spreads the high engineering expense over individual chips; FPGAs are more expensive per chip but gain their popularity through flexibility in comparatively low-volume products. In recent years, FPGAs have developed swiftly and begun to gain an advantage over ASICs.[1] Because of advances in semiconductor manufacturing technology, growing design complexity, higher NRE (Non-Recurring Engineering) and tool costs, and ever more intense time-to-market pressure, ASIC designs are becoming unaffordable for many designers. On the other hand, the high development expense of an FPGA can be shared, since each type of FPGA is sold in high volume. As a result, FPGAs are gradually replacing ASICs in some areas of the design market.

The classical structure of an FPGA is shown in Fig 1.1. An FPGA is composed of CLBs (Configurable Logic Blocks), CBs (Connection Boxes), SBs (Switch Boxes) and the metal interconnects between them, surrounded by configurable I/O blocks. [2] In the field of reconfigurable computing, the traditional FPGA architecture is commonly referred to as “fine-grained”. Meanwhile, some “coarse-grained” reconfigurable devices are widely adopted by academia and industry. Coarse-grained devices consist of bigger configurable blocks, which alleviates the pressure on the global interconnect, and they are commonly specific to a certain application. [3]


Fig 1.1 Traditional FPGA architecture

In the 1980s and 1990s, FPGAs had a scale of about tens of thousands of logic gates and were mainly used as glue logic to connect the various parts of a PCB (Printed Circuit Board). Since the 1990s, and especially in the 21st century, FPGAs have advanced greatly in process technology as well as in programming and testing methods. The integration scale has increased to millions of gates and the clock frequency to several hundred megahertz, so a system-level application can be implemented in a single FPGA chip. In recent years, the main FPGA vendors have been shipping new high-performance FPGA products. Altera Stratix IV GX FPGAs, which use 40-nm technology, achieve the highest density with up to 820K logic elements (LEs), 23.1 Mbits of embedded memory, and up to 1,288 18x18 multipliers. Built on a 40-nm state-of-the-art copper process technology, the Xilinx Virtex-6 FPGA deploys up to 566K logic cells and 6 Mbit of BRAM running at a maximum clock speed of 600 MHz.[5] Because system requirements, protocols and standards now change very frequently, FPGAs are used to build end products rather than only prototype systems. The reconfigurability of the FPGA makes it a core component of modern network, consumer electronics, spacecraft and defense applications.

The flexibility of FPGAs also brings trade-offs such as redundancy in logic area, increased delay and higher power consumption. It is estimated that the area efficiency of FPGAs is about 1/20 to 1/10 that of ASIC chips, and the maximum speed is about 1/5 to 1/3 that of ASICs. [6] Because of these drawbacks of the traditional FPGA, designers seek to combine the FPGA's reconfigurability with the high performance and area efficiency of ASIC design. The fixed logic of the system is implemented in ASIC, while the parts that users need to alter are implemented in FPGA. This combination gives a single chip the merits of both FPGAs and ASICs and is often called 'SOPC' (System on Programmable Chip). Run-time reconfigurability lets a single SOPC adapt to a wide variety of applications, especially products that are under development or must support multiple standards. For example, third-generation wireless communication (3G) has three standards in China: WCDMA, CDMA2000 and TD-SCDMA. At the early stage of design, a communication device maker cannot be sure which standard its product will finally adopt. SOPC is a cross-disciplinary concept realized by embedded computing technology together with reconfigurable hardware technology. An SOPC is generally divided into three major parts: hardware, software and application, as shown in Fig 1.2:

[Diagram labels: SOPC comprising Application, Software and Hardware]

Fig 1.2 The content of SOPC

There are two ways to build an SOPC: embedding FPGAs into ASICs and embedding ASICs into FPGAs (Fig 1.3). The former is commonly adopted by research institutions to build hardware platforms specialized for a certain kind of application. Universities and research institutes have developed numerous prototypes, such as Berkeley's GARP[7] and SCORE[8] and the University of Toronto's ONECHIP[9]. The latter is often used by FPGA vendors. In recent years, world-leading FPGA companies have been embedding microprocessors, memory devices, mixed-signal circuits and other full-custom logic cells in their FPGA products. Including processors in the FPGA is one of the most representative methods of this kind. For instance, Xilinx's Virtex-5 FXT devices have embedded PowerPC 440 processor cores that can run at 1,100 DMIPS. In addition, combining an FPGA with fixed logic yields higher area efficiency in actual applications. FPGA IP core design is now also a very active topic in the FPGA research field, and our lab is working on a national research project on FPGA IPs.

Fig 1.3 Implementation methods of SOPC

If a system can be reconfigured at run time and performs different functions during different periods of time, it is a dynamically reconfigurable system. Dynamic reconfiguration introduces the idea of virtual hardware: the hardware is implemented only when it is needed, which enables time-division multiplexing of the device. The hardware resource sharing of dynamic reconfiguration is shown in Fig 1.4. Dynamic reconfiguration has become a development trend of FPGA technology and reconfigurable computing. The development of digital logic systems used to focus only on enlarging chip scale; now reconfiguration and chip resource sharing are increasingly taken into consideration during the design of digital systems. Dynamic reconfiguration is useful in a wide range of applications: self-repairing and adaptive digital systems, reconfigurable computing, high-performance digital filter designs, evolvable hardware, etc. The SOPC incorporates hardware, software and applications together. My thesis and lab work emphasize hardware research on reconfigurable systems, but software and application are also covered, since without them the hardware performance and innovation cannot be verified.

Fig 1.4 Dynamic reconfiguration makes higher usage of the hardware resources

SOPC has a wide range of application fields. In this thesis, I focus on image processing and image noise reduction and introduce the research results of the reconfigurable image-processing application. Many researchers have devoted much effort to this field and published many papers. Partially dynamically reconfigurable FIR (Finite Impulse Response) filter systems [11] [12] and dynamically reconfigurable dynamic-image filtering systems [13] are both active topics in this research field. However, most FPGA-based reconfigurable systems are implemented on commercial Xilinx platforms. Moreover, there is not yet a good reconfigurable solution for image noise reduction applications. Therefore, in this thesis, I propose system and application designs based on our FDP FPGAs, whose hardware structure is optimized for such reconfigurable applications. The application design focuses on image noise filters.

In conclusion, the SOPC, which combines ASIC and FPGA technology, exploits the flexibility of the FPGA and the high performance of the ASIC. Combining the strengths of the two, the reconfigurable SOPC has become one of the most promising development trends of FPGA technology and is the basis and starting point of this thesis. The hardware reconfigurability of the FPGA IP, combined with the software running on the embedded processor or DSP core, makes the SOPC an ideal platform for 3G hardware. Moreover, SOPC and reconfigurable hardware are also an ideal platform for image processing and image noise reduction, which are introduced in detail in this thesis.

1.2 Organization of the Thesis

The thesis is organized as follows: Chapter 1 introduces the meaning and background of this research. Chapter 2 describes the design and implementation of our reconfigurable hardware platform, FDP2009. Chapter 3 covers the software design based on the FDP2009 platform and the design methodologies. Chapter 4 focuses on the novel Bus Macro design, which is used in the reconfigurable system-level interconnect, and describes the application of the reconfigurable hardware platform. In this thesis, the application of SOPC is mainly image processing and image noise reduction. Finally, Chapter 5 concludes the thesis and gives an outlook on further research.

Chapter 2. Design and implementation of FDP2009 Chip

2.1 The overall architecture of FDP2009 SOPC

Fig 2.1 shows the reconfigurable System on Programmable Chip (SOPC) architecture implemented in the FDP2009 chip. The proposed system is designed using the mature IBM CoreConnect architecture.[14] There are three different buses which perform distinct tasks in the system: the high-speed Processor Local Bus (PLB), the low-speed On-chip Peripheral Bus (OPB) and the Device Control Register (DCR) bus, which only supports register reads and writes. The user application programs and operating system are stored in external Flash memory and SRAM controlled by the External Bus Interface (EBI). A PowerPC 405 RISC processor runs the algorithms in the system. The FDP FPGA core plays a key role by enabling hardware reconfiguration of the system. Reconfigurable functions are mapped to the FDP-FPGA using either the external download port of the FPGA or an algorithm running on the processor. A specific interface, called “FDP_IF”, was designed to connect the FDP-FPGA seamlessly with the processor and other modules. The FDP-FPGA core consists of 288 basic cells (16×18) with 80K equivalent system gates. The following section describes the structure of the configuration component and of FDP_IF.

Fig 2.1 Overview of FDP2009 SOPC Architecture

The PowerPC 405 processor core fetches its first instruction from address 0xFFFF_FFFC during boot-up and begins the whole-chip initialization sequence. We therefore map the system flash to the highest address space of the PLB. The PLB system address allocation is given in Tab 2.1. The space is separated into three parts: memory interface, FPGA interface and OPB devices. Three memory devices are used in the system: the system flash is 256KB, while the data flash and SRAM are both 16MB. The FPGA ports are divided into four individual sub-ports. FDP-IF is a set of control registers combined with the configuration block control interfaces. PORT1 to PORT3 are three 64-KB memory spaces used respectively as a shared memory, a Reconfigurable Logic Core configuration register and a debug port. All the control registers can be accessed directly by the PPC405 over the DCR bus. The DCR bus address allocation is given in Tab 2.2.

Interconnect Type | Functionality | Start Address | End Address | Size
EBI | System FLASH | 0xFFFD_0000 | 0xFFFF_FFFF | 256KB
EBI | Data FLASH | 0xFD00_0000 | 0xFDFF_FFFF | 16MB
EBI | SRAM | 0xFE00_0000 | 0xFEFF_FFFF | 16MB
Reconfigurable Logic Core Ports | FDP-IF REG | 0xF100_0000 | 0xF100_03FF | 1KB
Reconfigurable Logic Core Ports | Reconfigurable Logic Core PORT1 | 0xF101_0000 | 0xF101_FFFF | 64KB
Reconfigurable Logic Core Ports | Reconfigurable Logic Core PORT2 | 0xF102_0000 | 0xF102_FFFF | 64KB
Reconfigurable Logic Core Ports | Reconfigurable Logic Core PORT3 | 0xF103_0000 | 0xF103_FFFF | 64KB
PLB-to-OPB Bridge | PLB-OPB Interconnect | 0xF400_0000 | 0xF4FF_FFFF | 16MB
OPB Peripherals | UART0 | 0xF400_0000 | 0xF400_00FF | 256B
OPB Peripherals | UART1 | 0xF410_0000 | 0xF410_00FF | 256B
OPB Peripherals | GPIO | 0xF420_0000 | 0xF420_00FF | 256B

Tab 2.1 PowerPC PLB Address Allocations
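The address map in Tab 2.1 can be checked mechanically. The following Python sketch is an illustration only, not part of the FDP2009 software: the names `PLB_MAP`, `decode` and `size_mismatches` are hypothetical, and the entries are copied from the table as printed. It decodes an address to its region and compares each printed range with its stated size. Under this check the boot vector 0xFFFF_FFFC falls inside the System FLASH window, and the printed System FLASH range spans 192 KB rather than the stated 256 KB (a 256 KB window ending at 0xFFFF_FFFF would start at 0xFFFC_0000); the table alone does not say which of the two values is intended.

```python
KB, MB = 1024, 1024 * 1024

# (region name, start address, end address, stated size in bytes), as printed in Tab 2.1
PLB_MAP = [
    ("System FLASH",         0xFFFD_0000, 0xFFFF_FFFF, 256 * KB),
    ("Data FLASH",           0xFD00_0000, 0xFDFF_FFFF, 16 * MB),
    ("SRAM",                 0xFE00_0000, 0xFEFF_FFFF, 16 * MB),
    ("FDP-IF REG",           0xF100_0000, 0xF100_03FF, 1 * KB),
    ("PORT1",                0xF101_0000, 0xF101_FFFF, 64 * KB),
    ("PORT2",                0xF102_0000, 0xF102_FFFF, 64 * KB),
    ("PORT3",                0xF103_0000, 0xF103_FFFF, 64 * KB),
    ("PLB-OPB Interconnect", 0xF400_0000, 0xF4FF_FFFF, 16 * MB),
]


def span_bytes(start, end):
    """Number of bytes covered by an inclusive address range."""
    return end - start + 1


def decode(addr):
    """Return the name of the region containing addr, or None if unmapped."""
    for name, start, end, _ in PLB_MAP:
        if start <= addr <= end:
            return name
    return None


def size_mismatches():
    """List (name, actual span, stated size) for rows whose range and size disagree."""
    return [
        (name, span_bytes(start, end), stated)
        for name, start, end, stated in PLB_MAP
        if span_bytes(start, end) != stated
    ]
```

For example, `decode(0xF410_0000)` returns the bridge window, since the OPB peripherals sit behind the PLB-to-OPB bridge.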

Block | Start Address | End Address | Size
PLB Arbiter | 0x090 | 0x09F | 16B
External Interrupt Controller | 0x020 | 0x028 | 9B
EBI | 0x010 | 0x01F | 16B

Tab 2.2 DCR Address Allocations

Fig 2.2 Overview of FDP Configuration component

As shown in Fig 2.2, the FDP configuration structure in the FDP core has four components: the Address Decoder, the Addressable Configuration Register, the Partial Configuration Registers and the Configuration Control State Machine. It differs from the conventional configuration architecture of a Reconfigurable Logic Core, which reloads the whole frame into shift registers each time.[15] Our configuration architecture replaces the shift register with an addressable configuration register and adds an internal frame decoder that fixes the address of every 32-bit word within each frame.[15] In this architecture, the entire configuration data of the reconfigurable hardware is partitioned vertically and horizontally into 32-bit words, each of which is the minimal configurable cell that can be accessed with a single configuration command, instead of a whole frame. It is therefore possible to alter part of the configuration without stopping the operations running on the reconfigurable hardware.
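To make the benefit of word-level addressing concrete, the Python sketch below counts how many 32-bit configuration words must be written for a small partial update under the two schemes. It is a simplified model under stated assumptions: the frame length of 16 words and both function names are hypothetical and are not FDP2009 parameters.

```python
FRAME_WORDS = 16  # assumed number of 32-bit words per frame (illustrative only)


def words_written_frame_based(changed):
    """Frame-based scheme: every frame that contains a changed word is reloaded whole.

    changed is an iterable of (frame_index, word_index) pairs.
    """
    touched_frames = {frame for frame, _ in changed}
    return len(touched_frames) * FRAME_WORDS


def words_written_addressable(changed):
    """Word-addressable scheme: only the changed 32-bit words are written."""
    return len(set(changed))
```

With three changed words spread over two frames, the frame-based model writes 32 words while the addressable model writes 3, which is the saving the addressable configuration register is meant to provide.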

2.2 Design of the External Bus Interface (EBI)

The EBI (External Bus Interface) is the interface between the PLB and the off-chip memories. We use the PLB v4.0 protocol in our system, so the block is named “EBI2PLB4”. It functions as a memory controller and data buffer for communication between the CPU and the Flash/SRAM devices. We use the NOR Flash AM29LV160B (AMD) as the system boot-ROM flash, the S29GL128 (Spansion) as the data flash and the asynchronous SRAM IS64LV51216 (ISSI). Since the PLB master, the PowerPC405, is a 32-bit processor, we design the EBI as a 32-bit PLB slave for maximum utilization of the bus signals. The memory device data buses are all 16-bit, so each 32-bit access is converted to and from two 16-bit accesses. The read and write sequences are shown in the following diagram:


[Diagram labels: READ and WRITE paths; 16-bit accesses at ADDR and ADDR+1; upper 16 bits and lower 16 bits of the 32-bit word]

Fig 2.3 Write-to and read-from sequence between EBI and 16-bit memory device
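The width conversion in Fig 2.3 can be sketched in a few lines of Python. This is a behavioral illustration rather than the EBI's RTL: the function names are hypothetical, the memory is modeled as a dictionary of 16-bit locations, and placing the upper half at ADDR and the lower half at ADDR+1 is an assumption consistent with the big-endian PowerPC, since the figure itself is not reproduced here.

```python
def split_word(word32):
    """Split a 32-bit word into its (upper, lower) 16-bit halves."""
    word32 &= 0xFFFF_FFFF
    return (word32 >> 16) & 0xFFFF, word32 & 0xFFFF


def merge_halfwords(upper, lower):
    """Combine two 16-bit halves into one 32-bit word."""
    return ((upper & 0xFFFF) << 16) | (lower & 0xFFFF)


def write_word(mem, addr, word32):
    """One 32-bit write becomes two 16-bit writes (assumed order: upper half first)."""
    upper, lower = split_word(word32)
    mem[addr] = upper
    mem[addr + 1] = lower


def read_word(mem, addr):
    """One 32-bit read is assembled from two 16-bit reads."""
    return merge_halfwords(mem[addr], mem[addr + 1])
```

Writing the test pattern 0xDEADBEEF at a halfword address and reading it back returns the same word, with 0xDEAD and 0xBEEF stored in consecutive 16-bit locations.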

Fig 2.4 shows that the EBI consists of a complex FSM (finite state machine), several data transmission buses and a DCR bus configuration interface. Because our SOPC chip has no PLLs, it is quite hard to control DRAM interface timing, so the EBI does not currently support external DRAM devices.

[Diagram blocks: EBI2PLB4 containing PLB Interface, Read Port, Write Port, Decode, DCR Port, Flash Interface, SRAM Interface and Main State Machine; connected to the PLB, the DCR Bus, and through the External Memory Bus to the External Flash Memory and External SRAM]

Fig 2.4 The block-level interconnect and structure of EBI

The address bus and data bus of the external memory bus are time-multiplexed among the memory devices to minimize the number of I/O pins of the chip. Only the chip-enable (CE) signals are independent for each memory device. To support different types of external Flash or SRAM devices, DCR configuration registers are used to configure the timing and size of the four memory ports. The registers are named “B1CR” to “B4CR”. Fig 2.5 shows the bit-field allocation of the configuration registers:

Fig 2.5 EBI configuration registers' usage

In principle, the EBI can support any Flash or SRAM device through software control of these DCR registers. The memory addressing range is also controlled by bits 20 to 22 of the configuration registers. The maximum supported size is 128 Mbit, or 16 Mbytes. The relationship between the value of BxCR (x is a number from 1 to 4) and the supported external memory size is listed below:

BxCR[20] | BxCR[21] | BxCR[22] | Size
0 | 0 | 0 | 4 Mbit
0 | 0 | 1 | 8 Mbit
0 | 1 | 0 | 16 Mbit
0 | 1 | 1 | 32 Mbit
1 | 0 | 0 | 64 Mbit
1 | 0 | 1 | 128 Mbit
1 | 1 | 0 | n/a
1 | 1 | 1 | n/a

Tab 2.3 The supported memory sizes and the EBI DCR register controlling bits
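The encoding in Tab 2.3 doubles the size for each increment of the three-bit field, starting from 4 Mbit. The Python sketch below decodes it; the function name is hypothetical, and the bit position is an assumption: it takes the registers to be 32 bits wide with IBM-style numbering (bit 0 is the most significant bit), so that BxCR[22] lands 9 places above the least significant bit. The thesis does not state the numbering convention explicitly.

```python
# Assumption: 32-bit register, IBM bit numbering (bit 0 = MSB), so bit 22 sits at shift 31 - 22.
SIZE_FIELD_SHIFT = 31 - 22
SIZE_FIELD_MASK = 0b111


def memory_size_mbit(bxcr):
    """Decode BxCR[20:22] into the memory size in Mbit, or None for the n/a codes."""
    code = (bxcr >> SIZE_FIELD_SHIFT) & SIZE_FIELD_MASK
    if code > 0b101:
        return None  # codes 110 and 111 are listed as n/a in Tab 2.3
    return 4 << code  # 4, 8, 16, 32, 64, 128 Mbit
```

The largest valid code, 101, gives 128 Mbit, which corresponds to the 16 Mbyte maximum stated above.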

The EBI is a 32-bit PLB slave device. Because our PowerPC405 processor is a 64-bit PLB master, additional bus multiplexers must be inserted to make bus communication possible. The connection is shown in Fig 2.6.

Fig 2.6 32-bit slave connection schematic to 64-bit PLB bus

Currently, there are two types of PLB transfer that are supported by EBI: word transfer and line transfer. The line transfer corresponds to the cache flush and cache fill operations of the processor while the word transfer is the atomic read/write of the processor. The bus timing diagram of line transfer is shown in Fig 2.7 and the diagram of word transfer is shown in Fig 2.8 and Fig 2.9. [14]

Fig 2.7 PLB 4-word line transfer timing diagram

Fig 2.8 PLB word read operation timing diagram

Fig 2.9 PLB word write operation timing diagram

In the EBI's default configuration, three memory devices are supported:

I. Am29LV160B 1 M x 16-Bit CMOS 3.0 Volt-only Boot Sector Flash Memory

II. S29GL128 MirrorBit™ Flash with Alternative BGA Pinout, 128 Megabit, 3.0 Volt-only Page Mode Flash Memory

III. IS61WV102416ALL 1Mx16 High-Speed Asynchronous CMOS Static RAM with 3.3V Supply

Type I, the boot sector flash, holds the boot-up program and is faster than the data flash, which holds other programs and data. The EBI supports SRAM read/write operations and Flash read/erase/program operations. The timing diagrams of these operations are shown in Fig 2.10, Fig 2.11, Fig 2.12, Fig 2.13 and Fig 2.14.


Fig 2.10 Flash read timing

Fig 2.11 Flash program timing

Fig 2.12 Flash erase timing

Fig 2.13 SRAM read timing

Fig 2.14 SRAM write timing

A simple simulation environment is used for block-level simulation of the proposed EBI design. The Flash and SRAM Verilog timing models are provided by the memory vendors. All PLB bus transactions are encapsulated in Verilog tasks and functions. The simulation environment is shown in Fig 2.15.

[Diagram blocks: PLB_LIB (PLB transaction tasks) and PLB_Assert (PLB timing and protocol assertions) on the PLB; EBI (DUT); External Flash Memory Model; External SRAM Model]

Fig 2.15 EBI's block-level testbench architecture

Here are some of the simulation results. The transferred data is 0xDEADBEEF.

Fig 2.16 Simulation waveform of EBI's line transfer (part 1)

When the processor sends out valid qualifier signals, the EBI recognizes them and responds with a rising edge on the signal “Sl_Addrack”. After that, a sequence of data is sent from the processor to the EBI controller, as can be seen in Fig 2.16.

Fig 2.17 Simulation waveform of EBI's line transfer (part 2)

When all the data have been sent, the EBI responds with a WrAck or RdAck signal, which tells the processor to finish the current operation. The sequence is shown in Fig 2.17.

Similarly, we have also obtained simulation results for the word transfer:

Fig 2.18 Simulation waveform of EBI's word transfer

2.3 Design flow of FDP2009

[Flow diagram blocks: Specification & Architecture Design; Full-custom FPGA Design & Modeling; FDP FPGA Library; RISC IP; RTL Design; Algorithm & Software Design; Co-Verification; Co-Emulation; Back-end Design; Formal Verification; Static Timing Analysis; Pre-silicon Verification; Tape-out]

Fig 2.19 FDP-SOPC Design Flow

The FDP-SOPC system development flow is illustrated in Fig 2.19. The full-custom Reconfigurable Logic Core is built using custom layout design and then converted into a Verilog behavioral model for later use in simulation.

The FDP2009 chip is implemented in SMIC 0.13μm technology. The Reconfigurable Logic Core is implemented using full-custom layout design, while the PPC SoC logic and the Reconfigurable Logic Core configuration blocks are implemented using the ASIC design flow provided by Synopsys. We use VCS for logic simulation, Design Compiler for synthesis, JupiterXT for floorplanning and Astro for P&R and layout generation. The layouts of the two parts are connected manually during the final layout and DRC (Design Rule Check) step of the design.

2.4 Verification of FDP2009

Algorithms written in C were compiled and prototyped in co-verification during the design steps. Hardware-software co-simulation makes it possible to evaluate the performance of the whole system at an early stage of the design process. The adaptive testbench makes it easy to reuse the test cases and test programs at various stages of the design. In all design steps, we use formal verification (Synopsys Formality), which is fast and efficient, to check the validity of our system. Formal verification helps us find bugs at early design stages.

To verify the functional correctness of the PPC405 IP before integrating it into the whole system, we use a dedicated CPU verification platform during the IP verification process.

[Diagram blocks: PowerPC 405 Core (RTL Verilog) with ICU and DCU; PLB Monitor; PLB Slave; Backend Control; PLB User IP; PLB Arbiter; Processor Local Bus (PLB); On-Chip Peripheral Bus (OPB); OPB User IP; Interrupt Ctrl Model; JTAG Model; DCR Model; JTAG Interface; DCR Interface; Interrupt Pins; DCR Monitor; Clock & Reset; ISOCM Model (memory); DSOCM Model (memory). Legend: Vera Models, Verilog Source Code]

Fig 2.20 Diagram of PowerPC CPU block verification

The PowerPC 405 core verification environment is indicated in Fig 2.20. The grey blocks are behavior models written in OpenVera language while the white ones are verilog RTL source blocks. Both black-box and white-box verification methodologies are used in the PowerPC block verification. The Vera monitors are a great advantage when the simulations situations are considerably complex and the waveform is too long to make eye-checks. The OpenVera has a good feature for assertion-based verification and is also included in the Vera monitors and models. The monitors will do transaction-level checking as well as detailed signal timing checking.

In order to accelerating our verification testcase development, 105 assembly test programs which are released with the IP core kit are modified and contained in our block-level and chip-level verification process. The set of test programs are specially designed to verify the pipeline, execution unit, MMU, cache control unit, exceptional control and processor SPRs(Special Purpose Registers).

We have taken a great emphasis on the test of clocking and reset control blocks to ensure the processor will work during boot-up sequence and will wake up from the sleeping mode while the instruction pipeline is filled with new instructions or there is a hardware interrupt.

The chip verification is accomplished by developing a complete verification automation system, which is essential to reduce debug time and achieve high functional coverage. [17] This system includes the following tasks:

1) Defining a test plan based on the function of design

2) Writing the test bench environment using SystemVerilog, Vera and C

3) Generating test vectors

4) Checking results

5) Measuring progress against the test plan


Blocks shown: Test plan (Interconnect test, IP Function test, Communication test); Testcase 1, Testcase 2, ..., Testcase n; Input Vector; DUT; Golden Model; Checker; Monitor; Collecting_Result

Fig 2.21 The implementation of regression automation system

Fig 2.21 shows how the SOPC verification environment is implemented. According to the design specifications, the test plan covers three aspects: 1) interconnect test, 2) IP functional test, 3) communication test between IPs. A Monitor is designed to collect and verify the bus protocols and to observe state transitions and data integrity. The Checker module takes data from both the Golden Model and the DUT (design under test) and compares them. [18] If there is a mismatch, error messages are produced and the simulation is stopped. The functional coverage is collected by the Collecting Result module, and new tests are then designed to fill the coverage holes. With these verification platforms and techniques, we achieved a code coverage of 100% and a functional coverage of 95.1% while cutting simulation time from 474.7 min to 365.3 min, about 23% less (put differently, the conventional method takes about 30% longer). The following table compares the simulation time of the conventional vector-directed method with that of our improved method:

Method                                Simulation Vectors   Simulation Time (min)   Coverage
Our proposed method                   181                  365.3                   95.1%
Conventional vector-directed method   545                  474.7                   95.2%


Tab 2.4 Simulation time comparison between the proposed verification method and the conventional method
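The Checker behaviour described above (compare the DUT output with the Golden Model output and stop at the first mismatch) can be sketched as follows. This is only a minimal Python illustration, not the SystemVerilog/Vera testbench of the thesis; `golden_model`, `dut_model` and `run_checker` are hypothetical names, and the "DUT" is a stand-in function with a deliberately injected bug.

```python
def golden_model(vector):
    """Reference behaviour: a hypothetical 8-bit adder (assumption)."""
    a, b = vector
    return (a + b) & 0xFF


def dut_model(vector):
    """Stand-in for the DUT; deliberately wrong for one input pair."""
    a, b = vector
    if (a, b) == (200, 100):
        return 0  # injected bug, for illustration only
    return (a + b) & 0xFF


def run_checker(vectors, dut, golden):
    """Compare DUT and golden outputs; stop at the first mismatch.

    Returns (number of vectors that passed, error message or None).
    """
    passed = 0
    for index, vector in enumerate(vectors):
        expected = golden(vector)
        actual = dut(vector)
        if actual != expected:
            message = f"vector {index} {vector}: expected {expected}, got {actual}"
            return passed, message
        passed += 1
    return passed, None


vectors = [(1, 2), (255, 1), (200, 100), (7, 7)]
passed, error = run_checker(vectors, dut_model, golden_model)
```

Stopping at the first mismatch mirrors the text: the error message identifies the failing vector, and the remaining vectors are not simulated.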

2.5 FDP2009 Chip Result and Platform Test Results

Fig 2.22 Test-board of FDP-SOPC chip

Our reconfigurable SOPC chip and its test-board are illustrated in Fig 2.22. Tab 2.5 shows the SOPC chip's parameters.

Process                             0.13um CMOS
FDP-Reconfigurable Logic Core       4.5mm×4.0mm
RISC Processor & SoC Peripheral     4.15×1.73mm
Configuration                       0.33×1.73mm
Chip total Size                     4.5mm×6.3mm
Core Power Supply                   1.2V
Pad Power Supply                    3.3V
Frequency of Power PC processor     100MHz
Package                             QFP208

Tab 2.5 FDP-SOPC Chip Parameters


Floorplan labels: Reconfigurable Logic Core, CPU, FPGA Config (dimensions 4.6mm, 4.4mm, 1.8mm)

results showed this chip could work correctly and quickly. The detailed results are presented in a published paper. [15]

As shown in Fig 2.23A, the overall floorplan of the chip is divided into 3 parts: the PPC405 logic, the FPGA core and the Reconfigurable Logic Core configuration control block. Fig 2.23B is the physical layout of the FDP2009 chip. Fig 2.23C is the P&R (Place and Route) result of the PPC405 block, in which the caches and TLBs (Translation Look-aside Buffers) of the PowerPC 405 CPU and the RAM containing data for the Reconfigurable Logic Core configuration are placed around the periphery of the block.

A. The whole chip floorplan

B. The whole layout


Fig 2.23 FDP2009 SOPC chip overall layout and floorplan

2.6 The bug correction of EBI design

In the initial EBI design, bugs were found during system testing. The memory timing bug is illustrated by the following waveforms:

Signals shown: ADDR (ADDR1, ADDR2, ADDR3), CE_n, WE_n, DQ (DATA1, DATA2, DATA3)

Fig 2.24 The previous back-to-back write operations of EBI

Fig 2.24 shows the timing sequence of back-to-back write operations in the previous version of the EBI. In the timing diagram, the chip select signal “CE_n” and the write indicator “WE_n” stay low during the whole process. This violates the address setup time “Tsa” of the SRAM device. [19] This design mistake causes errors during sequential back-to-back EBI write operations to the SRAM device.

The bug-fixed timing diagram is shown in Fig 2.25, where CE_n is deasserted (driven high) after each read/write operation. This avoids the address setup time problem. It is realized by adding “idle” states to the finite state machine of the EBI main control block.

Signals shown: ADDR (ADDR1, ADDR2, ADDR3), CE_n, WE_n, DQ (DATA1, DATA2, DATA3)

Fig 2.25 The improved back-to-back write operations of EBI
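The effect of the extra “idle” state can be illustrated with a small cycle-level model. This is a sketch under simplifying assumptions, not the EBI RTL: one clock cycle per state, active-low strobes, and no sub-cycle timing, so it only shows whether the address ever changes while CE_n is still low. The function names are hypothetical.

```python
def ebi_write_sequence(addresses, insert_idle):
    """Return a per-cycle list of (addr, ce_n, we_n) for back-to-back writes.

    With insert_idle=False the strobes stay low across address changes
    (the original behaviour). With insert_idle=True an IDLE cycle with
    CE_n and WE_n high follows every write (the corrected behaviour).
    """
    cycles = []
    for addr in addresses:
        cycles.append((addr, 0, 0))      # WRITE state: strobes active (low)
        if insert_idle:
            cycles.append((addr, 1, 1))  # IDLE state: strobes deasserted
    return cycles


def address_changes_while_selected(cycles):
    """Count address changes where CE_n is low both before and after."""
    count = 0
    for prev, cur in zip(cycles, cycles[1:]):
        if prev[0] != cur[0] and prev[1] == 0 and cur[1] == 0:
            count += 1
    return count


old_cycles = ebi_write_sequence([0x10, 0x14, 0x18], insert_idle=False)
new_cycles = ebi_write_sequence([0x10, 0x14, 0x18], insert_idle=True)
```

In this model the original sequence changes the address twice while the SRAM is still selected, whereas the corrected sequence never does, at the cost of one extra cycle per write.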

A temporary workaround using an external CPLD was applied during the board test, and we successfully brought up the SOPC chip with it. The external CPLD logic is shown below:

Blocks shown: FDP2009 (EBI), CPLD, External Bus, SRAM, FLASH

Fig 2.26 Hacking method during board test

In this method, instead of connecting the EBI pins directly to the external memory bus, the pins are first connected to a CPLD device which performs timing repair to correct the “CE_n” and “WE_n” signal timings. During the board test, this method allowed the EBI read/write operations to succeed and the PowerPC processor to run programs normally.

2.7 Summary of the FDP2009 SOPC design and verification

In summary, we have introduced the design and bug-fixing of the EBI, the verification flow and methodology of the SOPC system, and the simulation and board test results.

The EBI aims at high-performance memory read/write operations for the PowerPC-based SOPC system. We found timing errors and memory malfunction during the test, devised a temporary correction, and planned the bug fix for the next design version.

We use an innovative verification flow and achieved hardware-software co-simulation. The self-checking testcases and the simulation coverage control feature of the testbench give us more efficiency by reducing the total RTL and post-layout netlist simulation run-time.


the mother board and the sub board. It has a high-speed USB interface and a test interface which can be connected to a logic analyzer.


Chapter 3. Software Design based on FDP2009 Chip

3.1 Compiler and Linker

For the PowerPC applications and testcases, the GNU cross-compiler GCC-EABI is used to compile and link the assembly and C-language programs. The compiler complies with the Embedded Application Binary Interface (EABI) and is well suited to compiling our testcases. The EABI is a set of conventions for embedded applications and development tools designed to ensure compatibility for a described set of functionality. Following the EABI convention, the compiler always uses the same stack frame structure and maintains the same processor register usage [20], which are shown in Fig 3.1 and Fig 3.2. The EABI makes it possible to port applications from other processor architectures to our system with minor alterations.


Fig 3.2 Stack frame structure of PowerPC EABI

The unified interface is especially important for mixed assembly-C programming. While most parts of our programs are written in C, the bootloader and initialization programs are necessarily written in assembly. The interrupt vector table is also built in assembly language, which makes mixed-language programming a distinguishing feature of our software design.

3.2 Bootloader and Interrupt Handling

A custom bootloader performs the initialization tasks during the boot-up of our SOPC system. The bootloader initializes the PowerPC processor, then the FDP Reconfigurable Logic Core and all the other peripherals. The bootloader flow diagram is shown below:


1) Initial instruction fetch from 0xFFFFFFFC; branch to initialization code

2) Configure guarded attribute (SGR) for performance

3) Configure endianness and compression

4) Invalidate the instruction/data cache and enable cacheability

5) Initialize vector table and interrupt handlers

6) Initialize and configure timer facilities

7) Initialize Machine State Register (MSR)

8) Initialize FPGA Interrupt and Interface

9) Initialize EBI, UART and external interrupt controller (UIC) if necessary

10) Branch to operating system or application code

Fig 3.3 The whole SOPC software initialization sequence

A good deal of setup and initialization is needed before the operating system or application programs can run. First of all, the instruction at address 0xFFFFFFFC is fetched and executed. [22] In our system, this first instruction resides in the boot-ROM Flash and is a branch to the initialization code, which is also in the boot-ROM Flash. Then we configure the Special Purpose Registers (SPRs) named SGR and CR0 for the guarded attribute (whether speculative accesses to a memory region are prevented), endianness (little endian or big endian) and compression (whether the program is compressed or not). The next step is to invalidate and initialize the Instruction Cache Unit (ICU) and Data Cache Unit (DCU). Because the PowerPC 405 is a Harvard-architecture processor, the instruction cache and data cache are separate, and both need to be invalidated and enabled before use. After that, we set up all the timers in the processor. There are 3 kinds of timers in the PowerPC 405: the PIT, the FIT and the Watchdog Timer. The FIT's interval cannot be changed while running, whereas the PIT's interval is programmable at all times. Here, the watchdog timer is used to recover the program from deadlocks or endless loops. The interrupt vector table is built in the Flash using assembly. In the table, each kind of interrupt jumps to its corresponding handler entry address by looking it up in a C “array of structs” data structure named “ExceptionHandler”. The interrupt table's segment address must be stored in the Exception Vector Prefix Register (EVPR) to take effect. Finally, we initialize the Reconfigurable Logic Core interface and the other peripherals and launch the application program.
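The table lookup described above can be modelled in a few lines. The sketch below is a Python illustration, not the assembly/C code of the thesis: the field names of `ExceptionHandler`, the handler functions and the EVPR value are hypothetical, and the vector offsets (0x0500 external, 0x1000 PIT, 0x1020 watchdog) are quoted from memory of the PowerPC 405 documentation and should be checked against the user's manual.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExceptionHandler:
    """One entry of the handler table (field names are assumptions)."""
    offset: int                  # vector offset within the 64 KB segment
    name: str
    handler: Callable[[], str]


def make_table():
    return [
        ExceptionHandler(0x0500, "external", lambda: "external handled"),
        ExceptionHandler(0x1000, "pit", lambda: "pit handled"),
        ExceptionHandler(0x1020, "watchdog", lambda: "watchdog handled"),
    ]


def vector_address(evpr, offset):
    """Upper 16 bits come from EVPR, lower 16 bits from the vector offset."""
    return (evpr & 0xFFFF0000) | (offset & 0xFFFF)


def dispatch(table, evpr, address):
    """Find the entry whose vector address matches and run its handler."""
    for entry in table:
        if vector_address(evpr, entry.offset) == address:
            return entry.handler()
    return "unhandled"


EVPR = 0xFFF00000  # hypothetical segment address of the vector table
table = make_table()
result = dispatch(table, EVPR, 0xFFF00500)
```

Keeping the offsets in a table rather than in the branch code makes it straightforward to add a handler without touching the dispatch logic.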

Blocks shown: PowerPC 405, PLB, EBI, Flash, SRAM

Fig 3.4 Fetching the program code and data from FLASH to SRAM after boot-up

In the software design, we use a technique that is widely used in embedded systems to accelerate software execution: copying the main program from FLASH to SRAM and running it from SRAM. Since the SRAM is much faster than the FLASH, this technique greatly increases our system's performance. The copy procedure begins right after the initialization process, as indicated in Fig 3.4. Here, the main program refers to the operating system, the test suite program or the application program.
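As a rough illustration of the copy step, the sketch below models FLASH and SRAM as byte arrays and copies an image word by word, then verifies it. The function name, the addresses and the 4-byte word size are assumptions chosen for the example; the real bootloader performs this in assembly on the target.

```python
def copy_image(flash, sram, src, dst, length):
    """Copy `length` bytes from flash[src:] to sram[dst:], 4 bytes at a time.

    Returns True if the copied region matches the source afterwards.
    """
    if src + length > len(flash) or dst + length > len(sram):
        raise ValueError("image does not fit in the given memories")
    for offset in range(0, length, 4):
        n = min(4, length - offset)  # the last word may be partial
        sram[dst + offset:dst + offset + n] = flash[src + offset:src + offset + n]
    return sram[dst:dst + length] == flash[src:src + length]


flash = bytearray(range(32))   # stand-in for the program image in FLASH
sram = bytearray(64)           # stand-in for the external SRAM
ok = copy_image(flash, sram, src=8, dst=16, length=10)
```

After a successful copy the bootloader would branch to the SRAM copy of the program instead of the FLASH original.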

The interrupt handling program responds to hardware interrupts from the Reconfigurable Logic Core, UART or PLB arbiter and performs the corresponding handling routines. The interrupt handler should first mask the appropriate control bit in the Universal Interrupt Controller (UIC) to prevent further occurrences of the same kind of interrupt, which would lead to interrupt errors.
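The mask-first rule can be sketched with a toy interrupt controller. This is a simplified Python model, not the UIC register interface: the class, its two fields (`enable`, `status`) and the bit numbering are assumptions made for illustration.

```python
class InterruptController:
    """Toy model with an enable register and a status register."""

    def __init__(self):
        self.enable = 0
        self.status = 0

    def raise_irq(self, bit):
        self.status |= 1 << bit

    def pending(self):
        """Interrupts that are both raised and enabled."""
        return self.status & self.enable


def handle_interrupt(uic, bit, service):
    mask = 1 << bit
    uic.enable &= ~mask   # 1) mask the source before anything else
    uic.status &= ~mask   # 2) acknowledge the interrupt being served
    service()             # 3) run the handling routine
    uic.enable |= mask    # 4) unmask so later interrupts are seen again


uic = InterruptController()
uic.enable = 0b0101
uic.raise_irq(2)
seen_during_service = []


def service():
    uic.raise_irq(2)                           # same source fires again
    seen_during_service.append(uic.pending())  # not visible while masked


handle_interrupt(uic, 2, service)
```

In this model an interrupt that fires again during servicing stays hidden until the handler finishes, and then becomes pending once the source is unmasked.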

3.3 SOPC test suite program design

A complete set of test suite programs was developed to meet the demands of both SOPC verification and SOPC test. Using the same test suites for both ensures consistent verification coverage and test coverage of the SOPC chip, and reusing the programs helps to reduce the total design and testing time of the SOPC. The test suite programs fall into 6 categories, shown in the following table:

Test Suite Types                                              Verification Result   Test Result
CPU Test (30 cases)                                           PASSED                PASSED
DCR Register Map Test (8 cases)                               PASSED                PASSED
EBI Testcase (5 cases)                                        PASSED                PASSED
GPIO Testcase (3 cases)                                       PASSED                PASSED
UART communication Tests (5 cases)                            PASSED                PASSED
Reconfigurable Logic Core Configuration Testcases (6 cases)   PASSED                PASSED

Tab 3.1 FDP2009 SOPC test-suite categories

From the table above, we can see that all the testcases in the suites have been run and produced the expected results, both in simulation during design and in the system test after chip tape-out. The CPU tests check the basic functions of the embedded PowerPC processor, including the instruction pipeline, ALU operations, program branches, exceptions and processor internal register manipulation.


The DCR register map test suites are programs which configure all the control registers in the system other than those of the PowerPC and the Reconfigurable Logic Core, in order to validate the DCR registers. The EBI testcases validate the functionality of the EBI and test all of its supported word transfer and line transfer protocols. The GPIO and UART testcases test the chip's parallel and serial external interfaces. The “stdio.h” standard C library is also modified so that standard functions like “printf” and “scanf” can transfer data or strings through the GPIO or UART. The FPGA configuration test programs validate correct data transfer between the PLB and the Reconfigurable Logic Core. There are currently 2 Reconfigurable Logic Core-PLB interconnection ports. One is a two-way asynchronous FIFO and the other is a small SRAM buffer shared by the PowerPC and the Reconfigurable Logic Core. Different system functions are built to control the data transfer on each of these two ports.

3.4 Software Debug Using RiscWatch

During software debugging, we used the RISCWatch processor on-line debugger and obtained very good results.

RISCWatch is a hardware and software development tool for the PowerPC 600/700/900 family of microprocessors and the PowerPC 400 Series. [23] The source-level debugger and processor-control features provide developers with the tools needed to develop and debug hardware and software quickly and efficiently. Developers who take advantage of RISCWatch are provided a wealth of advanced debug capabilities. Among the advanced features of this full-functioned debugger are real-time trace (on supported processors), an Ethernet hardware interface, C/C++ support, extensive command file support and on-chip debug support. Debugging of multi-core and multi-processor PowerPC systems is also supported. The debugger supports both XCOFF and the Embedded ABI for PowerPC industry standard.

We use RISCWatch to debug the bootloader and test suite programs beforehand, before using them to test our target hardware and systems. The powerful features of RISCWatch give us great ease and freedom in testing the software and help us find bugs at a very early stage.


Trace port and the software, the RISCWatch software debugger and simulator. To use the RISCWatch on-line tracer, we had to add 3 additional PowerPC JTAG signals at the chip-level I/Os, which are reserved for RISCWatch. As our testing and debugging work continued, we found that the acceleration of software development is much more important than the limited cost in hardware packaging.

RISCWatch fully supports code debug at both the C/C++ source and assembler levels. Run control functions allow stopping/starting the program and the ability to restart the program while retaining the setting of current break-points and watch points. The program can be single-stepped by assembler or C/C++ source line. Function calls can either be stepped into or over as desired. Breakpoints take full advantage of the debug capability in the PowerPC processor. In addition to standard trap-based software instruction breakpoints, hardware assisted instruction and data breakpoints are also available.

Fig 3.5 GUI of the RISCWATCH software and debugging interface

In Fig 3.5, we can see the user interface of the RISCWatch software. The software is especially useful in our mixed C-assembly programming environment. Managing the many types and methods of setting breakpoints is made easy through a single breakpoint control screen. Assembler level debug is supported in a number of ways. The C/C++ source screen provides a mixed source/assembler mode that shows each source line and its associated lines of assembler code. An assembler-only screen can also be used to provide actual memory disassembly of the code and the ability to change it dynamically. Other screens provide the ability to easily navigate through the program during the debug session. A caller's screen allows the program context to be switched between the various levels of the call chain. Files and functions screens provide the ability to decide what files/functions appear in the source window. The functions screen also allows breakpoints to be set or cleared at the beginning of functions. Local and global variable screens not only allow variables to be displayed and updated, but they also provide extensive control over what gets shown and when. On-chip debugging is accomplished via the IEEE 1149.1 (JTAG) interface, which allows access to the debug logic built into the PowerPC processors. Since the debug logic is separate from the rest of the processor logic, access to processor resources is possible even if the processor is in an error state. Low-level processor controlling functions allow the developer complete control of the processor. Processor control features include run, start, step, set break-points, reset and initialize the processor. Low-level processor watching functions include displaying and modifying memory, registers and cache. Memory can also be loaded and disassembled.

3.5 Summary of the software design based on FDP2009 platform

In our SOPC software design, both the programs which act as testcases and the application programs are developed. We have ported the uC/OS-II kernel and designed hardware drivers such as USB and JTAG.

We have used some common embedded software design techniques in our development. We use the IBM RiscWatch hardware tracer and instruction set simulator (ISS) to help us debug the programs.


Chapter 4. Reconfigurable Image Noise Filter Application

4.1 Image noise theories

Image noise is the random variation of brightness or color information in images produced by the sensor and circuitry of a scanner or digital camera. Image noise can also originate in film grain and in the unavoidable shot noise of an ideal photon detector. Image noise is generally regarded as an undesirable by-product of image capture. Although these unwanted fluctuations became known as "noise" by analogy with unwanted sound, they are inaudible and actually beneficial in some applications, such as dithering. Here are several kinds of the most common image noise.

(1) Gaussian noise (Amplifier noise)

The standard model of amplifier noise is additive, Gaussian, independent at each pixel and independent of the signal intensity, caused primarily by Johnson–Nyquist noise (thermal noise), including that which comes from the reset noise of capacitors ("kTC noise"). [24] In color cameras where more amplification is used in the blue color channel than in the green or red channel, there can be more noise in the blue channel. [25] Amplifier noise is a major part of the "read noise" of an image sensor, that is, of the constant noise level in dark areas of the image. [26]

(2) Salt-and-pepper noise (Spike noise or impulsive noise)

Salt and pepper noise is sometimes called impulsive noise or spike noise. [27] An image containing salt-and-pepper noise will have dark pixels in bright regions and bright pixels in dark regions. This type of noise can be caused by dead pixels, analog-to-digital converter errors, bit errors in transmission, etc. [28] [29] This kind of noise can be eliminated in large part by using dark frame subtraction and by interpolating around dark/bright pixels.

In our image noise reduction filter designs, we mainly focus on the Gaussian noise and salt-and-pepper noise and use them as examples to show the performance of the filters when faced with different noise types or different noise sources at the same time.

(3) Image noise modeling

In general, a noisy image can be modeled as the noise-free image pixel plus an additive signal-dependent random noise term and an additive signal-independent random noise term. This is the theoretical basis of multi-stage image noise filtering, which eliminates the different kinds of noise step by step.
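To make the two noise types used in this thesis concrete, the sketch below corrupts a small grey-level image first with Gaussian noise and then with salt-and-pepper noise. It is an illustrative Python model under stated assumptions: 8-bit pixels, a zero-mean Gaussian with standard deviation `sigma`, and salt-and-pepper noise modelled as replacing a fraction `density` of pixels by 0 or 255 (a replacement rather than a strictly additive term). The function names are hypothetical.

```python
import random


def add_gaussian_noise(image, sigma, rng):
    """Add zero-mean Gaussian noise to every pixel and clip to 0..255."""
    return [
        [min(255, max(0, round(p + rng.gauss(0.0, sigma)))) for p in row]
        for row in image
    ]


def add_salt_and_pepper(image, density, rng):
    """Replace about `density` of the pixels by 0 (pepper) or 255 (salt)."""
    noisy = []
    for row in image:
        new_row = []
        for p in row:
            r = rng.random()
            if r < density / 2:
                new_row.append(0)
            elif r < density:
                new_row.append(255)
            else:
                new_row.append(p)
        noisy.append(new_row)
    return noisy


rng = random.Random(0)                    # fixed seed for repeatability
clean = [[100] * 4 for _ in range(3)]     # tiny flat test image
noisy = add_salt_and_pepper(add_gaussian_noise(clean, 10.0, rng), 0.05, rng)
```

An image corrupted this way contains both noise sources at once, which is the situation a multi-stage filter is meant to handle.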

(4) Image noise reduction

Most algorithms for converting image sensor data to an image, whether in-camera or on a computer, involve some form of noise reduction. There are many procedures for this, but all attempt to determine whether differences in pixel values constitute noise or real photographic detail, and average out the former. However, no algorithm can make this judgment perfectly, so there is often a tradeoff made between noise removal and preservation of fine, low-contrast detail that may have characteristics similar to noise. Many cameras have a setting to control the aggressiveness of the in-camera noise reduction.

This decision can be assisted by knowing the characteristics of the source image and of human vision. Most noise reduction algorithms perform much more aggressive chroma noise reduction, since there is little important fine chroma detail that one risks losing. Furthermore, many people find luminance noise less objectionable to the eye, since its textured appearance mimics the appearance of film grain.

The high sensitivity image quality of a given camera (or RAW development workflow) may depend greatly on the quality of the algorithm used for noise reduction. Since noise levels increase as ISO sensitivity is increased, most camera manufacturers increase the noise reduction aggressiveness automatically at higher sensitivities. This leads to a breakdown of image quality at higher sensitivities in two ways: noise levels increase and fine detail is smoothed out by the more aggressive noise reduction.

4.2 Image noise reduction research progresses and their short-comings

Some new image noise reduction techniques and algorithms have been proposed in the past few years. However, these algorithms are quite complex and slow and are not suitable for hardware implementation, which greatly limits their use in real high-performance image processing systems.

Many researchers are now working on reconfigurable image processing, and a large number of papers have been published in recent years, such as [31], [32] and [33]. However, applications and innovations in reconfigurable image noise reduction are rarely seen in the literature. In [34] and [35], the image noise reduction algorithms are only suited to eliminating one particular kind of image noise, and they become ineffective on images with mixed sources of noise. In [36] and [37], the algorithms are designed for software implementation and are neither suitable nor optimized for hardware implementation. If these algorithms were implemented in hardware, a lot of hardware resources would be used while the gain in performance would be rather limited. The paper [38] proposes good techniques for denoising images containing both Gaussian noise and salt-and-pepper noise, but the technique requires super-resolution of the original image and is therefore very complicated to implement.

In [39] and [40], the researchers implement a virtual ICAP port using the external SelectMAP port of Xilinx Spartan-3 FPGAs and try to push the speed limit of the external configuration port of Xilinx FPGAs. The paper [41] proposes a high-performance internal reconfiguration interface based on the ICAP interface of Xilinx Virtex-4 devices. The authors propose several reconfiguration architectures and different choices of the memories used to store the partial bitstream during dynamic reconfiguration. These papers give detailed data on the configuration speed and performance of Xilinx FPGAs. They are also good references to compare against when we measure the configuration speed and time of our own designs and systems.

In this thesis, the image noise reduction and removal algorithms are combined with partial reconfiguration technologies to realize better performance and flexibility. Not only are the image results better, but the configuration time and overall system running speed are optimized as well.

4.3 The Design of the Bus Macro

In partially reconfigurable systems, the interconnect and communication between the fixed logic and the partial reconfigurable regions (PRRs) has always been the key point of reconfiguration. Here, we propose a bus macro structure for reconfigurable implementations based on the vertical CLB structure of the FDP300K Reconfigurable Logic Core. [21] The image filter applications in the thesis are based on the FDP300K device, whose CLB units and column interconnects are optimized for reconfiguration. The bus macro design is built on CLB resources and aims at physically separating the static and dynamic blocks.

The bus macro design is mainly implemented on the SLICEs, which consist of LUTs (Look-up Tables), and on local interconnect wires inside the FDP300K Reconfigurable Logic Core. Two columns of dedicated CLBs (Configurable Logic Blocks, the basic elements of the Reconfigurable Logic Core) and local interconnects are used to separate the circuits vertically. The dedicated resources are the input CLB columns, the IMUX, the fixed wires, the OMUX and the output CLB columns. The used resources are shown in red in Fig 4.1. The other logic resources are shown in black and are available to other user logic. We can regard the bus macro as interconnection wires with enable switches that connect the bus wires and signals of one block to their counterparts in the other block.

Labels shown: IMUX, OMUX, GRM, SLICE0, SLICE1, CLB, bus

Fig 4.1 The proposed bus macro structure schematic

Because of its economical use of FDP Reconfigurable Logic Core resources and the vertical SLICE structure, the bus macro design proposed in this thesis is easy to use and very area-efficient. The delay of the bus macro is also well suited to critical, high-performance applications, as demonstrated in our image filter application examples.

4.4 The reconfigurable system implementation based on the bus macro design

The bus macros physically define the boundaries of the logic blocks in the system. In our designs the bus macros are usually stripe-shaped, as can be seen in Fig 4.2. When one block is under reconfiguration, the bus macro blocks its internal signals to avoid erroneous data transmission.

Fig 4.2 shows the diagram of the whole reconfigurable system and how the bus macros are used. The logic blocks are classified as fixed blocks or dynamic blocks according to whether they are altered while the system is running. The reconfigurable hardware platform includes the fixed logic blocks, the partial reconfigurable blocks and the bus macros.


Blocks shown: Fixed Logic, Bus Macro (two columns), Reconfigurable Logic_1, Reconfigurable Logic_2, IOB, Reconfigurable Logic Core, Partial Bitstream

Fig 4.2 The bus macros separate the Reconfigurable Logic Core resources into static regions and partial reconfigurable regions

From the chip floorplan point of view, these stripe-shaped bus macro columns make the signaling between static and dynamic blocks possible. This is the basis of reconfiguration system implementation.

4.5 Reconfigurable image noise reduction filter design

The partial dynamic reconfigurable filter consists of 2 main parts: a fixed-logic part and the partial reconfigurable block (PR in Fig 4.3). The JTAG interface controlled by the PC is used to transfer the partial bitstream to the Reconfigurable Logic Core device. The static logic block includes:

(1) main_fsm: the main finite state machine, which controls the interaction and the bitstream downloading.

(2) pixel_fsm: the finite state machine which controls the image pixel ordering and the pixel address decoding in the SRAM device. The pixel_fsm also controls the order of the output image pixels.

(3) sram_mc: the SRAM controller that implements the read/write timing of the external SRAM device (the SRAM in Fig 4.3). The SRAM device is used as the intermediate buffer as well as the image source container of the image filters.

The dynamic reconfigurable part is named “PR” and implements a different algorithm depending on which bitstream is downloaded. The overall system diagram of the reconfigurable image filter is shown below:

Blocks shown: PC, JTAG, Main_fsm, Pixel_fsm, Sram_mc, SRAM, static, PR

Fig 4.3 The system diagram of the partial reconfigurable filter

The top-level signal “pr_download” is the major signal controlled by the PC over JTAG. Its rising edge indicates the start of partial bitstream downloading, and its falling edge is the start signal for the PR and the whole filtering process. The state machine “main_fsm” detects the falling edge of “pr_download” and starts the PR when the valid edge is detected. Different partial bitstreams correspond to different partial configurations and realize time-multiplexed use of the Reconfigurable Logic Core resources (including LUTs, flip-flops and interconnects) in the same PR region. This helps to exploit the efficiency and flexibility of the system. The following chart explains the configuration of the PR at different moments.


Labels shown: static, PR, Bus Macro; PR configuration of median filter; PR configuration of average filter; moment 1; moment 2

Fig 4.4 The configuration procedure of the filter system
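The edge-triggered handshake on “pr_download” described above can be sketched as a simple edge detector. This is a Python illustration of the behaviour, not the Verilog of “main_fsm”; the function name and the event labels are hypothetical, and the signal is assumed to be sampled once per clock cycle.

```python
def main_fsm_events(samples):
    """Detect edges on a sampled pr_download signal.

    A rising edge marks the start of partial bitstream download;
    a falling edge marks the start of the PR block and of filtering.
    Returns a list of (cycle, event) tuples.
    """
    events = []
    if not samples:
        return events
    previous = samples[0]
    for cycle, current in enumerate(samples[1:], start=1):
        if previous == 0 and current == 1:
            events.append((cycle, "download_start"))
        elif previous == 1 and current == 0:
            events.append((cycle, "pr_start"))
        previous = current
    return events


events = main_fsm_events([0, 0, 1, 1, 1, 0, 0])
```

Using the falling edge as the start signal means the PR block is only started after the download window has closed.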

At the same time, we optimize the algorithms that specialize in removing salt & pepper noise, which improves the noise reduction performance. Conventional hardware filter designs commonly use the median filter, which samples a window from the original image and uses the median value of the elements in the window as the output value. Although it works well on weak salt & pepper noise, it gives poor results on strong salt & pepper noise. We therefore extend the median filtering algorithm to median filtering with soft thresholds. The soft threshold helps to distinguish noise pixels from normal pixels so as to preserve as much image detail as possible while eliminating noise. This algorithmic improvement also contributes to the performance of the entire system.
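One simple way to realize a threshold-guarded median filter is sketched below for a single image row. The exact soft-threshold rule of the thesis is not spelled out in this section, so the rule used here (replace a pixel by the window median only when it differs from the median by more than `threshold`) is an assumption, as are the function name, the window size and the truncated windows at the row ends.

```python
def threshold_median_1d(row, window, threshold):
    """Median filter that only replaces pixels far from the local median."""
    half = window // 2
    out = []
    for i, pixel in enumerate(row):
        lo = max(0, i - half)
        hi = min(len(row), i + half + 1)
        neighbourhood = sorted(row[lo:hi])
        med = neighbourhood[len(neighbourhood) // 2]
        # Pixels close to the median are treated as real detail and kept.
        out.append(med if abs(pixel - med) > threshold else pixel)
    return out


row = [100, 102, 255, 101, 103, 0, 102]   # 255 and 0 are impulse noise
filtered = threshold_median_1d(row, window=3, threshold=40)
plain = threshold_median_1d(row, window=3, threshold=-1)  # always replace
```

With the threshold, only the two impulses are replaced and the small variations between 100 and 103 survive, whereas the unconditional median also alters the clean pixels.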

Because images are 2-dimensional data sources, the image processing must be carried out in both the X direction and the Y direction. In our reconfigurable filter design, the images are processed in the X direction first and then in the Y direction. The pixel processing order is shown in Fig 4.5.


Fig 4.5 Image pixel processing ordering in the proposed reconfigurable filter

The SRAM controller performs the pixel scan pattern described above with the help of the pixel FSM. The fetched image pixel data are stored in shift registers which are used as image buffers. The pixels are scanned row-wise first and then column-wise. The pixels at the image boundaries undergo boundary extension before being sent to the filter. In the current design, the boundary extension method is symmetric extension, which is well suited to image processing. The main FSM treats the data from the shift registers as the main input to the filter circuit.
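The row-then-column scan with symmetric boundary extension can be sketched as follows. This Python model makes several assumptions: the extension mirrors the samples including the edge sample (other symmetric variants exist), the padding does not exceed the row length, and a simple window mean stands in for the filter kernel. All function names are hypothetical.

```python
def symmetric_extend(row, pad):
    """Mirror `pad` samples at each end, repeating the edge sample."""
    left = row[:pad][::-1]
    right = row[len(row) - pad:][::-1]
    return left + row + right


def window_mean(window):
    """Placeholder kernel: integer mean of the window."""
    return sum(window) // len(window)


def filter_rows(image, pad, kernel):
    """Apply a 1-D kernel of width 2*pad+1 along every row."""
    out = []
    for row in image:
        ext = symmetric_extend(row, pad)
        out.append([kernel(ext[i:i + 2 * pad + 1]) for i in range(len(row))])
    return out


def transpose(image):
    return [list(col) for col in zip(*image)]


def filter_x_then_y(image, pad, kernel):
    """Process in the X direction first, then in the Y direction."""
    rows_done = filter_rows(image, pad, kernel)
    return transpose(filter_rows(transpose(rows_done), pad, kernel))


flat = [[5, 5, 5, 5] for _ in range(3)]
result = filter_x_then_y(flat, 1, window_mean)
```

Processing the two directions separately lets the same 1-D datapath be reused for both passes, with the transpose standing in for the column-wise address pattern generated by the pixel FSM.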

In the filter design, we use a custom design flow for partial reconfigurable (PR) designs. We use Synplify as the synthesis tool to generate the netlists of the static and PR blocks from Verilog HDL designs. Packing, place & route and bitstream generation are all performed with the FDE software, the Reconfigurable Logic Core EDA software package developed by Fudan University. The tool used to generate the bitstream is called FDE BitGen; it generates bitstreams by consulting a specially designed ARCH file (Reconfigurable Logic Core architecture file). The detailed flow is shown below. Some steps of the design process still require manual work and lack automation, so a lot of further work can be done to improve the PR design flow.


1) Generating netlists for static and PR blocks using Synthesis Tool

2) Writing the physical location constraints (LOC) file

3) Manually configure the LUT units of the bus macros and constrain bus macro locations in LOC files

4) Do the packing and place & route procedure and generate global and partial P&R result files

5) Using BitGen software to generate global and partial dynamic bitstreams