School of Innovation, Design and
ABB AB Force Measurement
MASTER THESIS IN COMPUTER SCIENCE
High Performance FPGA-Based Computation and Simulation for MIMO
Measurement and Control Systems
Author: Johan Palm, email@example.com
Examiner: Prof. Lennart Lindh, firstname.lastname@example.org
Academic Supervisor: Prof. Lennart Lindh, email@example.com
Industrial Supervisor: Dr. George Fodor
The Stressometer system is a measurement and control system used in cold rolling to improve the flatness of a metal strip. To achieve this goal the system employs a multiple input multiple output (MIMO) control system with a considerable number of sensors and actuators. As a consequence the computational load on the Stressometer control system becomes very high if overly advanced functions are used. At the same time, advances in rolling mill mechanical design make it necessary to implement more complex functions for the Stressometer system to stay competitive. Most industrial players in this market consider improved computational power, for measurement, control and modeling applications, to be a key competitive factor. Accordingly there is a need to improve the computational power of the Stressometer system. Several different approaches towards this objective have been identified, e.g. exploiting hardware parallelism in modern general purpose and graphics processors.
Another approach is to implement different applications in FPGA-based hardware, either tailored to a specific problem or as a part of hardware/software co-design. Through the use of a hardware/software co-design approach the efficiency of the Stressometer system can be increased, lowering the overall demand for processing power since the available resources can be exploited more fully. Hardware accelerated platforms can be used to increase the computational power of the Stressometer control system without the need for major changes in the existing hardware. Thus hardware upgrades can be as simple as connecting a cable to an accelerator platform, while hardware/software co-design is used to find a suitable hardware/software partition, moving applications between software and hardware.
In order to determine whether this hardware/software co-design approach is realistic, the feasibility of implementing simulator, computational and control applications in FPGA-based hardware needs to be determined. This is accomplished by selecting two specific applications for a closer study: a Stressometer measuring roll simulator and a parallel Cholesky algorithm, both implemented in FPGA-based hardware.
Based on these studies this work has determined that the FPGA device technology is well suited for implementing both simulator and computational applications. The Stressometer measuring roll simulator was able to approximate the force and pulse signals of the Stressometer measuring roll at a relatively modest resource consumption, using only 1747 slices and eight DSP slices. The parallel FPGA-based Cholesky component, in turn, is able to provide performance in the GFLOP/s range, exceeding the performance of the personal computer used for comparison in several simulations, although at a very high resource consumption. The result of this thesis, based on the two feasibility studies, indicates that it is possible to increase the processing power of the Stressometer control system using the FPGA device technology.
1 Introduction 3
1.1 Background and Motivation . . . 4
1.2 Problem Formulation . . . 6
1.3 Delimitations . . . 8
1.4 Contributions . . . 9
1.5 Chapter Summary . . . 10
2 Digital System Design 15
2.1 Device Technologies . . . 15
2.1.1 Field Programmable Gate Arrays (FPGAs) . . . 18
2.2 Hardware Descriptive Languages (HDL) . . . 25
2.2.1 Very High Speed Integrated Circuit Hardware Description Language (VHDL) . . . 27
2.2.2 Testbench . . . 28
2.2.3 Higher Level Abstraction . . . 28
2.3 IP Components . . . 29
2.4 Hardware/Software Co-Design . . . 30
2.5 System on a Chip (SoC) . . . 32
2.6 Busses . . . 33
2.6.1 IBM’s CoreConnect Bus . . . 34
2.7 Network on a Chip (NoC) . . . 36
3 Arithmetics in Digital Systems 39
3.1 Binary Number Systems . . . 39
3.1.1 Arithmetics and FPGAs . . . 41
3.2 Variable Wordlength and Scaling . . . 42
3.3 Binary Arithmetics . . . 42
The Stressometer System 45
4 The Stressometer Measuring Roll 49
4.1 Slot and Wrap Angles . . . 51
4.2 Pressductor Technology . . . 52
4.3 Force Signals . . . 52
4.3.1 Force Signal Measurement . . . 52
4.3.2 Force Signal vs. Carrier Wave . . . 56
4.3.3 Discussion: Force Signals . . . 58
4.4 Pulse Signals . . . 59
5 A Case Study of PFSK187 61
5.1 PFSK187 Power Supply . . . 61
5.2 Hardware Resources . . . 61
5.2.1 Xilinx Virtex-5 FPGA . . . 62
5.2.2 SDRAM . . . 62
5.2.3 Flash: Intel StrataFlash P33, 512Mbit . . . 62
5.2.4 Fiber Optic Interface . . . 63
5.2.5 Coaxial Interface . . . 63
5.2.6 Ethernet Transceiver (LAN8700i) . . . 63
5.2.7 DA and AD Converters . . . 63
5.2.8 Voltage Controlled Crystal Oscillator (VCXO) . . . 64
5.2.9 Multiplexers . . . 64
5.2.10 JTAG / Config . . . 65
5.2.11 Light Emitting Diodes (LED) . . . 65
5.2.12 Pin-List . . . 65
5.2.13 Additional IO Ports . . . 65
5.2.14 Power On Reset / Reset . . . 67
5.2.15 Clock Generation . . . 67
5.2.16 Phase Recovery . . . 67
5.3 FPGA Hardware Functions . . . 67
5.3.1 Receiver . . . 68
5.3.2 Force Signals and Calibration . . . 73
5.3.3 Clock Management . . . 74
5.3.4 PI-Regulation of PFSA140 . . . 74
5.3.5 System Status and Warnings . . . 75
5.3.6 Leakage Level . . . 77
5.3.7 Parameters, Hardware Images and Diagnostic Information . . . 77
The Stressometer Measuring Roll Simulator 79
6 Specification 83
6.1 Specification: Measuring Roll Simulator . . . 83
7 Prototype Design 93
7.1 Roll Position . . . 96
7.2 Pulse Signals . . . 98
7.3 Carrier Wave . . . 100
7.4 System Clock Re-synchronization . . . 102
7.5 Force Signals . . . 102
7.5.1 Wordlength and Scaling . . . 105
7.5.2 Force Signal Generation Timing Requirements . . . 106
7.5.3 Force Generator Component . . . 107
7.5.4 Amplitude Modulation Component . . . 109
7.5.5 Multipliers . . . 110
7.5.6 Shift Registers . . . 110
7.6 Force Sample Storage . . . 111
7.7 DA Controller . . . 111
7.8 Force Signal Modification . . . 111
7.9 Internal Variables . . . 112
7.9.1 Constants . . . 112
7.9.2 Registers . . . 115
7.9.3 Force Memory . . . 120
7.9.4 Force Transition Point Memory . . . 121
7.9.5 Carrier Wave Lookup Table (ROM) . . . 121
7.9.6 Valid Internal Variable Values . . . 121
7.9.7 Internal Variable Update Synchronization . . . 124
7.10 Communications Interface . . . 125
7.10.1 Variable Memory . . . 125
7.10.2 Communication Registers . . . 126
7.10.3 External Communications Signals . . . 128
7.10.4 Service Requests . . . 131
7.10.5 Internal Variable Update . . . 134
7.10.6 Diagnostic . . . 137
7.11 Peripheral Devices . . . 138
7.11.1 Clock Generation and Synchronization . . . 138
7.11.2 Roll Simulator Controller . . . 140
7.11.3 Constant Signals . . . 142
8 Implementation and Verification 143
8.1 Implementation . . . 143
8.2 Verification . . . 150
8.2.1 Force Signal Generation Timing . . . 150
8.2.2 Signal Generator Verification . . . 151
8.2.3 Roll Controller and Communications Interface Verification . . . 156
8.2.4 Verification of Peripheral Devices . . . 158
9 Conclusions and Further Work 161
9.1 Conclusions . . . 161
9.2 Further Work . . . 162
Feasibility Study of Parallel FPGA-Based Cholesky 167
10 Literature Study 171
10.1 Serial Cholesky Decomposition . . . 171
10.2 Parallel Algorithms . . . 172
10.2.1 Parallel Platforms . . . 173
10.2.2 Parallel Algorithm Design . . . 174
10.3 Scientific Papers . . . 176
10.3.1 Cholesky Decomposition using Fused Datapath Synthesis . . . 176
10.3.2 Implementation of Cholesky LLT-Decomposition Algorithm in FPGA-Based Rational Fraction Parallel Processor . . . 178
10.3.3 Solving Systems of Linear Equations on the CELL Processor Using Cholesky Factorization . . . 179
11 Parallel FPGA-based Cholesky Decomposition 181
11.1 Cholesky Algorithm Dependencies and Interactions . . . 181
11.2 Prototype Design . . . 186
11.2.1 Arithmetics Unit . . . 187
11.2.2 Memory System . . . 189
11.2.3 Control Structure . . . 191
11.2.4 FIFO Components . . . 192
11.3 Prototype Design Simulation . . . 192
11.3.1 Method . . . 193
11.3.2 Results . . . 194
11.4 Suggestions for a Final Implementation . . . 197
11.4.1 Revised Algorithm . . . 197
11.4.2 Matrix Input Control . . . 197
11.4.3 Clock Frequency and Latency . . . 198
11.4.4 Memory Component . . . 198
11.4.5 Memory Sizes . . . 198
11.4.6 Control Structure . . . 199
11.4.7 Variable Matrix Sizes . . . 199
11.4.8 Reduced Arithmetic Unit . . . 200
11.4.9 Structural Parallel Arithmetic Unit . . . 203
12 Conclusions and Further Work 205
12.1 Conclusions . . . 205
Conclusions and Further Work
13 Conclusions 211
14 Further Work 213
A PFSK187 FPGA Pin Out 224
B Roll Simulator - Internal Components Signal Interfaces 235
B.1 Components . . . 235
B.2 Memories . . . 245
B.3 Multipliers . . . 248
B.4 Shift Registers . . . 250
C Roll Controller - Instruction Generation 251
D Roll Control - Instruction Verification 257
List of Figures
1.1 Hardware/Software Co-Design . . . 4
1.2 Stressometer System . . . 5
2.1 Programmable Logic Device . . . 17
2.2 Lookup Table (LUT) . . . 19
2.3 Logic Block . . . 20
2.4 Configurable Logic Block . . . 20
2.5 Testbench . . . 28
2.6 System on a Chip . . . 33
2.7 Network on a Chip . . . 37
3.1 Binary Format: Integer . . . 40
3.2 Binary Format: Floating Point (Single Precision) . . . 40
3.3 Binary Format: Floating Point (Double Precision) . . . 40
3.4 Binary Format: Fixed Point . . . 41
3.5 Stressometer System . . . 47
3.6 Flatness System . . . 48
4.1 The Stressometer Measuring Roll . . . 49
4.2 Serially Connected Sensors in one Zone . . . 50
4.3 Wrap Angle . . . 51
4.4 Slot and Wrap Angle . . . 51
4.5 Pressductor Transducer . . . 52
4.6 Sampled Amplitude Modulated Force Signal . . . 53
4.7 Amplitude Frequency Spectra of Sampled Force Signal . . . 54
4.8 Demodulated and Filtered Force Signal . . . 55
4.9 Mean Centered Demodulated Force Signal . . . 56
4.10 Ratio: Slot Angle vs. Wrap Angle . . . 57
4.11 Ratio: Slot Angle Time vs. Carrier Wave Period Time . . . 57
4.12 Ratio: Wrap Angle Time vs. Carrier Wave Period Time . . . 58
4.13 Pulse-Signals . . . 59
4.14 Sync-pulse vs. Slot Angle . . . 60
5.1 PFSK187 FPGA Block Diagram . . . 69
5.3 Sample Period . . . 71
5.4 Force Signal Calibration and DA Conversion . . . 73
5.5 PI-Regulation of PFSA140 . . . 75
5.6 Memory Storage System Block Diagram . . . 77
7.1 Roll Simulator Component Block Diagram . . . 94
7.2 Sensor A Offset vs. Roll Position . . . 99
7.3 Carrier Wave Samples . . . 101
7.4 Force Signal Transition Points . . . 102
7.5 Force Signal Computation Block Diagram . . . 105
7.6 Force Signal Generation and Sampling Synchronization . . . 107
7.7 Force Generator State Machine . . . 107
7.8 Internal Variable Update State Machine . . . 135
7.9 Rotational Velocity Change Computation Block Diagram . . . 137
7.10 Roll Simulator Clock Generation Block Diagram . . . 139
8.1 Signal Generation Verification: Force Signal . . . 156
8.2 Communications Interface Testbench . . . 157
8.3 Measured Force Signal . . . 160
8.4 Measured Force Signal (Rotational Velocity Change) . . . 160
10.1 Cholesky Decomposition Block Diagram: Fused Datapath Synthesis . . . 177
10.2 Cholesky Decomposition Block Diagram: Rational Fraction Arithmetics . . . 179
11.1 Cholesky Decomposition Interaction Graphs . . . 183
11.2 Cholesky Decomposition Dependency Graphs . . . 184
11.3 Cholesky Decomposition Memory Access Pattern (Row-wise) . . . 185
11.4 Cholesky Decomposition Block Diagram . . . 186
11.5 Cholesky Arithmetic Unit Block Diagram . . . 188
11.6 Cholesky Memory Component . . . 190
11.7 Cholesky Control Structure . . . 191
List of Tables
2.1 LVTTL Voltage Characteristics . . . 25
4.1 Electrical Characteristics: Pulse Signals . . . 60
5.1 Xilinx Virtex-5 LX30T FPGA Resources . . . 62
5.2 JTAG Pin-List Description . . . 65
5.3 Electrical Characteristics: Digital Status Output . . . 66
5.4 Electrical Characteristics: Digital Status Input . . . 66
5.5 Electrical Characteristics: PFSA140 Regulation Input . . . 66
5.6 Electrical Characteristics: BMS Analog Leakage Level Input . . . 67
5.7 Electrical Characteristics: BMS Analog Force Inputs . . . 67
5.8 Frame Identifiers . . . 71
5.9 Service Frame Types . . . 72
6.1 Specification: Roll Simulator (a) . . . 87
6.2 Specification: Roll Simulator (b) . . . 88
6.3 Specification: Peripheral Devices . . . 91
7.1 Signal Interface: Stressometer Roll Simulator . . . 95
7.2 Carrier Wave Lookup Table . . . 101
7.3 Matlab Simulation: Wordlength and Scaling . . . 106
7.4 Variable Memory Address Map . . . 126
7.5 Version Register . . . 126
7.6 Status Register . . . 127
7.7 Service Request Register . . . 127
7.8 Service Request Response Register . . . 128
7.9 Register: Active / Inactive Zones . . . 128
7.10 Signal Interface: Communication Interface (External Communications Signals) . . . 129
7.11 Address Mapping . . . 130
7.12 Service Request Register: Soft Reset . . . 132
7.13 Service Request Response Register: Soft Reset . . . 132
7.14 Service Request Register: Resume Operation . . . 132
7.15 Service Request Response Register: Resume Operation . . . 132
7.16 Service Request Register: Update Internal Variables . . . 132
7.18 Service Request Register: Activate / Deactivate Zone . . . 133
7.19 Service Request Response Register: Activate / Deactivate Zone . . . 133
7.20 Service Request Register: Revolution Count Information . . . 133
7.21 Service Request Response Register: Revolution Count Information . . . 133
7.22 Service Request Register: Re-Synchronization Count Information . . . 134
7.23 Service Request Response Register: Re-Synchronization Count Information . . . 134
7.24 Service Request Register: Diagnostic Information . . . 134
7.25 Service Request Response Register: Diagnostic Information . . . 134
7.26 Clock Factors . . . 140
7.27 Roll Controller: Instruction Format . . . 140
7.28 Signal Interface: Stressometer Roll Simulator Controller . . . 141
7.29 Roll Controller: Instruction Address Range . . . 141
8.1 Roll Simulator Pin Out in PFSK187 . . . 144
8.2 Force Signal Generation: Worst Case Timing . . . 150
8.3 Signal Generator: Test Configurations . . . 152
8.4 Signal Generation Verification Results . . . 155
8.5 Hardware Test Configurations . . . 159
11.1 Floating Point Operator Properties . . . 189
11.2 Cholesky Memory Component Sizes . . . 190
11.3 Parallel FPGA-Based Cholesky Decomposition Simulation Results . . . 194
11.4 Cholesky Decomposition Benchmark . . . 195
11.5 Cholesky Decomposition Resource Consumption . . . 196
A.1 FPGA Pin-out: Switched Power Supply . . . 224
A.2 FPGA Pin-out: FPGA-FPGA . . . 224
A.3 FPGA Pin-out: Clock Distribution . . . 225
A.4 FPGA Pin-out: Clock Generator . . . 225
A.5 FPGA Pin-out: Flash Memory . . . 226
A.6 FPGA Pin-out: SDRAM . . . 227
A.7 FPGA Pin-out: JTAG Device . . . 227
A.8 FPGA Pin-out: Config Jumpers . . . 228
A.9 FPGA Pin-out: Power-On Reset . . . 228
A.10 FPGA Pin-out: Fiber Optic Communication . . . 228
A.11 FPGA Pin-out: Coaxial Communication . . . 228
A.12 FPGA Pin-out: Leakage Level . . . 229
A.13 FPGA Pin-out: Phased-Locked Loop . . . 229
A.14 FPGA Pin-out: Phase Recovery . . . 229
A.15 FPGA Pin-out: Light Emitting Diodes . . . 230
A.16 FPGA Pin-out: On-Board Monitoring . . . 230
A.17 FPGA Pin-out: Ethernet . . . 231
A.18 FPGA Pin-out: PFSA140 Regulator . . . 231
A.20 FPGA Pin-out: Force Calibrator . . . 232
A.21 FPGA Pin-out: Pin List . . . 233
A.22 FPGA Pin-out: Status Signals . . . 233
B.1 Signal Interface: Roll Position Component . . . 236
B.2 Signal Interface: Pulse Generator Component . . . 237
B.3 Signal Interface: Carrier Wave Generator Component . . . 238
B.4 Signal Interface: Force Generator Component . . . 239
B.5 Signal Interface: Amplitude Modulation Component . . . 240
B.6 Signal Interface: Force Sample Storage Component . . . 241
B.7 Signal Interface: DA Controller Component . . . 241
B.8 Signal Interface: Signal Modification Component (Unmodulated) . . . 242
B.9 Signal Interface: Signal Modification Component (Amplitude Modulated) . . 243
B.10 Signal Interface: Communication Interface (Control Signals) . . . 244
B.11 Signal Interface: Force Memory . . . 245
B.12 Signal Interface: Force Transition Point Memory . . . 246
B.13 Signal Interface: Variable Memory . . . 247
B.14 Signal Interface: Force Generator Multiplier . . . 248
B.15 Signal Interface: Amplitude Modulation Multiplier . . . 248
B.16 Signal Interface: Rotational Velocity Change Multiplier . . . 249
B.17 Signal Interface: Force Generator Shift Register . . . 250
List of Algorithms
7.1 Roll Sample Position Counter (CCW) . . . 97
7.2 Roll Sample Position Counter (CW) . . . 97
10.1 Row Wise Cholesky Decomposition . . . 172
10.2 Column Wise Cholesky Decomposition . . . 173
11.1 Row Wise Cholesky Algorithm Interaction . . . 182
11.2 Column Wise Cholesky Algorithm Interaction . . . 182
11.3 Modified Row Wise Cholesky Decomposition . . . 187
Introduction
In 1965 Moore [MOO65] predicted that the number of components that could be placed on a single die would double every year for at least one decade. This prediction, later revised to a doubling roughly every two years and popularly quoted as a doubling every eighteen months, became known as Moore’s law and is still valid to this day. People generally associate Moore’s law with an increase in the performance of personal computers every eighteen months, usually as a doubling of the clock frequency. [UCB09]
However, since around 2003 these performance increases could no longer be realized as increases in the clock frequency due to heat dissipation issues. There are simply no longer any viable options for cooling processors, in personal computers, with a higher clock frequency than what is used today. Instead the industry has been forced to find other ways in which the additional transistors can be exploited to increase performance. One trend that can be observed is the increase of hardware parallelism through the use of multiple cores in a single processor. [UCB09]
This approach has already been applied successfully for some time in graphics processors, which today can include as many as 240 parallel cores. Thus modern graphics processors are highly parallel and programmable, with high performance arithmetics and bandwidth. With these capabilities graphics processors make a very interesting alternative to general purpose processors when performing demanding computations. This is reflected in the rapidly expanding field of research known as Graphics Processing Unit (GPU) computing, see e.g. Owens et al. [OWE08].
One issue with this approach is the relatively small amount of research that has been performed on the subject of parallel algorithms. As a consequence not all applications may be able to exploit the hardware parallelism fully. However, applications that are particularly suitable for parallel implementation can be found in areas such as graphics rendering, video playback, control systems, modeling and scientific computing [GRA03, OWE08]. In order for these applications to be able to use the hardware parallelism found in modern processors, software and compilers need to be developed with this in mind.
A competitive alternative approach when developing embedded systems is to carefully select a suitable system architecture using a hardware/software co-design development method. Here the hardware and software are designed simultaneously in an effort to exploit the synergism between them in order to meet the system objectives [MIC97]. Through the use of co-design the efficiency of a system can be improved, which in turn lowers the overall demand on performance since the available resources can be exploited more fully. The co-design approach is especially interesting if reconfigurable hardware, such as the FPGA device technology, is used, since functions that are suitable for parallel operation can then more easily be implemented as parallel hardware. With advances in FPGA technology, modern FPGAs are now able to perform dynamic and partial reconfiguration, making it possible to carry out hardware/software repartitioning even while a system is operational.
Figure 1.1: This figure illustrates how a basic system can be structured when exploiting hardware/software co-design in order to find suitable hardware/software partitions.
One way in which hardware/software co-design can be applied is illustrated in Figure 1.1, where physically separate hardware accelerators based on the FPGA device technology are used. Even though the illustrated system seems relatively fixed, co-design can be used to find a suitable hardware/software partition. A middleware provides a well defined interface between the accelerators and the hardware/software platform, thus facilitating hardware/software partitioning. In addition, if technologies such as dynamic and partial reconfiguration are used, the middleware can handle the related issues (e.g. hardware/software partitioning, hardware scheduling, etc.). This approach might be suitable if an existing system is to be upgraded, since accelerator platforms can be added as needed while the hardware platform that executes the software is left unchanged. As a result applications can be moved from software to hardware while the physical hardware upgrades can be as simple as connecting a cable to an accelerator platform.
Background and Motivation
The Stressometer system, a product of ABB Force Measurement, is a measurement and control system used in cold rolling to improve the flatness of a metal strip. This system is a multiple input multiple output (MIMO) control system with a considerable number of sensors and actuators. As a consequence of this, in combination with time constraints, the computational load on the system becomes very high if overly advanced functions are implemented. At the same time advances in rolling mill mechanical design, which require non-linear dynamic systems and higher order computations, make it necessary to implement more complex functionality in order for the Stressometer system to stay competitive. Thus there is an interest in finding ways in which the computational power of the Stressometer system can be increased. With this in mind, a number of different approaches to increasing the processing power of the system have been identified.
Figure 1.2: This figure illustrates a rolling mill that uses the Stressometer System. [ABB05]
• Exploiting hardware parallelism in existing general purpose processors.
• Using hardware/software co-design based on von Neumann type processors and the FPGA device technology.
• Using digital systems based on the FPGA device technology to implement hardware tailored to solve specific problems.
• Exploiting the massive parallelism that exists in graphics processors through GPU computing.
However, in order to be able to use the FPGA-based approach to increase the computational power of the Stressometer system, the FPGA device technology and related issues need to be investigated further. There are two critical issues that need to be addressed: appropriate algorithms for hardware implementation need to be identified, and a suitable system architecture needs to be determined. A first step in an effort to address these issues is to determine whether it is feasible to implement efficient simulator, computational and control applications as FPGA-based hardware. Advantages that stand to be gained from such an investigation are discussed briefly below.
• Reduced Hardware Cost:
It might be possible to reduce the hardware cost by moving functions from specialized computers to an FPGA-based technology. In addition, increases in computational requirements on the Stressometer control system can be met by moving applications to hardware accelerators. This would reduce the frequency with which the Stressometer control system hardware, used to execute software applications, needs to be upgraded.
• Reduced Production Time:
The use of efficient simulators can serve to reduce the time spent on product tests during production of the Stressometer system. For example a measuring roll simulator could be used instead of physically attaching a real Stressometer measuring roll when testing the Stressometer control system.
• Increased Accuracy in Simulations:
It might be possible to develop simulation tools based on the FPGA technology that are more realistic and accurate. This in turn can lead to a deeper understanding of various aspects of cold rolling using the Stressometer system, which improves the ability to find better measurement and regulation methods.
• Increased Knowledge of FPGA-based Digital Systems:
An improved understanding of the potential of digital systems can lead to additional ideas that in turn might improve the Stressometer system. This also includes an increased understanding of the design and verification methods used when developing digital systems.
Problem Formulation
In order to determine whether an FPGA-based hardware/software co-design approach, or an approach where the hardware is tailored to a specific problem, can be applied successfully to increase the processing power of the Stressometer control system, the FPGA device technology needs to be investigated further. Thus the aim of this thesis is to investigate whether it is feasible to implement simulator and computational applications in hardware based on the FPGA device technology.
This goal can be reached through a thorough investigation of both a simulator application and a computational application. Specifically this thesis will investigate the feasibility of implementing a Stressometer measuring roll simulator and a parallel Cholesky decomposition in hardware based on the FPGA device technology. In addition to this, if it is determined to be feasible, a prototype of the Stressometer measuring roll simulator is to be implemented. Finally subjects relating to hardware design and the FPGA device technology shall be surveyed in an effort to increase the understanding of them.
Accordingly the thesis objective can be separated into three distinct parts, which will be discussed further in the following sections.
Theoretical Survey of Hardware Design and the FPGA Technology
This part of the thesis shall describe various aspects of hardware design and the FPGA device technology. The survey shall start with a description of the FPGA device technology and its internal structure, discussing similarities and differences to alternative target technologies such as ASICs and PLDs. Then different ways of describing hardware shall be investigated, including hardware description languages, high level design tools and IP components. Finally more advanced topics, such as hardware/software co-design, may be covered.
The survey should try to highlight how different choices affect properties such as design complexity, costs, performance, time to market and resource consumption. In addition, the reader shall be given references to related material throughout the survey when appropriate.
Stressometer Measuring Roll Simulator
The aim of this part of the thesis is to investigate whether it is feasible to implement a simulator application, specifically a Stressometer measuring roll simulator, in hardware based on the FPGA device technology. The Stressometer measuring roll was selected as the simulator application since it is necessary to test the Stressometer electronics and control system during production [FOD07]. Through the use of a simulator, the time spent on the production tests can be reduced since the need to connect a real Stressometer measuring roll is removed. Another possible application of the Stressometer measuring roll simulator is as a part of a larger rolling mill simulator or a Stressometer system simulator, where the flatness measurement and control can be run in a closed loop [FOD07]. Accordingly the Stressometer measuring roll simulator needs to be able to simulate the behavior of the actual Stressometer measuring roll and approximate its signals.
Thus if it is feasible to implement an FPGA-based Stressometer measuring roll simulator, a prototype shall be specified, designed, implemented and verified. A first step towards this end is to survey the Stressometer system, concentrating on the Stressometer measuring roll and PFSK187 (a component of the Stressometer DTU). Based on this survey a specification of the roll simulator shall be written, detailing different functions that are desired or needed for a good approximation of the Stressometer measuring roll.
Based on this specification a design shall be written that details how a prototype is to be implemented. Somewhere in between writing the specification and verifying the correct operation of the prototype, the feasibility of implementing a simulator application in FPGA-based hardware will be determined. Of course if it is determined at some point that it is not feasible, this part of the thesis will end at that point. The specification, design, implementation and verification of the roll simulator shall also be described in detail.
Parallel Cholesky Decomposition
The aim of this part of the thesis is to investigate whether it is feasible to implement a computationally heavy application, specifically Cholesky decomposition, in hardware based on the FPGA device technology. The Cholesky decomposition was selected since matrix-based decompositions, such as linear least squares and singular value decomposition, are at the core of the computations that are performed in the Stressometer system [FOD07]. These decompositions are in turn based on Cholesky, LQ and LDL decompositions [FOD07].
A first step towards this objective is to perform a survey of existing literature relating to parallel computation in general and parallel Cholesky decomposition in particular. Of special interest is literature concerning parallel implementations of Cholesky decomposition based on the FPGA device technology. Based on the literature study a suitable parallel structure of the Cholesky decomposition algorithm shall be determined.
When determining the feasibility of an FPGA-based Cholesky decomposition it might be necessary to implement a parallel Cholesky decomposition algorithm in hardware, in order to facilitate a closer examination of such an implementation. If it is determined to be feasible, the thesis shall also include suggestions of how a prototype can be implemented and how different choices affect properties such as performance and resource consumption.
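As a point of reference for the algorithm under study, a minimal software sketch of a row-wise Cholesky decomposition is given below. It is an illustration only: it is neither the FPGA design developed in this thesis nor a verbatim copy of its algorithm listings, Python is used purely for compactness, and the function name `cholesky_rowwise` and the 3x3 test matrix are assumptions introduced for this example.

```python
import math


def cholesky_rowwise(a):
    """Return the lower-triangular L with L * L^T = a.

    `a` is a symmetric positive definite matrix given as a list of rows.
    Elements of L are computed one row at a time, left to right.
    """
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            # Dot product of the already computed parts of rows i and j.
            s = sum(l[i][k] * l[j][k] for k in range(j))
            if i == j:
                d = a[i][i] - s
                if d <= 0.0:
                    raise ValueError("matrix is not positive definite")
                l[i][j] = math.sqrt(d)  # diagonal element
            else:
                l[i][j] = (a[i][j] - s) / l[j][j]  # off-diagonal element
    return l


# Illustrative symmetric positive definite test matrix (hypothetical data).
A = [[4.0, 12.0, -16.0],
     [12.0, 37.0, -43.0],
     [-16.0, -43.0, 98.0]]
L = cholesky_rowwise(A)
```

The inner sum is a dot product over elements that have already been computed, and each off-diagonal element additionally depends on the diagonal element of its column. Dependencies of this kind are what a parallel structure has to respect when the multiplications are spread over several arithmetic units.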
Delimitations
In addition to the scope of the thesis, as outlined in the problem formulation, additional delimitations are discussed in this section. Notice that more detailed discussions of the scope of the various investigations performed in this thesis can be found, when suitable, throughout the report. This can include information concerning the relevant literature, software and hardware tools, performance measures, etc. that are used for a specific investigation.
One of the limiting factors is the choice of a target FPGA, which in this thesis is restricted to Xilinx’s Virtex models of FPGAs (Virtex-5 and Virtex-6). The main reason for this restriction is that the available resources, such as information, applications and IP components, are biased towards Xilinx’s FPGAs. Of course this restriction also implies that any IP component included in a design must be suited for implementation in Xilinx’s FPGAs.
Since Xilinx’s FPGAs are targeted, Xilinx’s ISE application will be used when implementing the designs and Xilinx’s ISIM application when simulating them. Furthermore the hardware will be described using the Very High Speed Integrated Circuit Hardware Description Language (VHDL). Matlab will be used for simpler software tasks such as generating testvectors and plotting graphs. If it is necessary to perform more complex software tasks, these applications shall be written in ANSI C using Microsoft’s Visual Studio 6.
Due to the applications used, time constraints and the size of the designs, there is also a limit to the effort that can be put into the simulations and place and routes that are performed on the various developed designs.
Some of the main contributions of this thesis are discussed briefly in the following sections.
Survey of the Stressometer Measuring Roll Force Signals
Different properties of the force signals generated by the Stressometer measuring roll are investigated and presented in Chapter 4. This includes an investigation of the relation between the carrier wave period time and the wrap and slot angles for different rotational velocities and angles. In addition to this, a sampled force signal is demodulated and examined for properties such as amplitude offset, rotational velocity and the general shape of the signal.
Prototype Implementation of the Stressometer Measuring Roll Simulator
An FPGA-based Stressometer measuring roll simulator prototype is developed and detailed documentation of the specification, design, implementation and verification is presented in Chapters 6, 7, 8 and 9. The verification is in part performed through simulations using testbenches developed specifically for the roll simulator. These testbenches include a number of Matlab functions that can be used to generate test vectors and to evaluate the results.
A roll controller and other peripheral components, needed for the correct operation of the roll simulator, are also developed. The roll controller supplies instructions and data, which are generated before synthesis using accompanying Matlab functions, to the roll simulator.
Parallel FPGA-Based Cholesky Component Prototype
Different properties of the Cholesky algorithm are investigated in an effort to find a suitable parallel FPGA-based design. This prototype design is evaluated through simulations using a testbench developed for this purpose. The testbench also includes a Matlab function that can be used to generate test vectors for the simulations. Improvements and modifications that should be included in a final implementation are also proposed. The investigation, design, evaluation and suggested improvements are presented in Chapters 11 and 12.
This section gives the reader an outline of the thesis and an introduction to the various subjects covered, by summarizing the different parts and chapters that are included.
Part I: Background
This part of the thesis provides background information concerning the thesis and digital system design, with a focus on the FPGA technology and arithmetic. This includes different target hardware technologies, architectural approaches, binary number systems, hardware description languages and higher level abstractions that can be used to describe digital circuits. These aspects of digital systems are discussed by emphasizing the differences in design, implementation, validation and system complexity, development and production costs and efficiency of the final implementation.
Chapter 2: Digital System Design
The aim of the ‘Digital System Design’ chapter is to provide the reader with an introduction to some of the more central aspects of digital system design, including advantages and disadvantages that may exist. It covers subjects such as the different device technologies, hardware description languages and architectural approaches, with a focus on the FPGA device technology.
For further reading regarding these subjects, see e.g. Maxfield (2004) [MFI04], Chu (2006) [CHU06] and Wiklund (2005) [WIK05].
Chapter 3: Arithmetics in Digital Systems
This chapter deals with different issues that arise when performing arithmetic in digital systems. More specifically it introduces different binary number systems, such as the two’s complement, the fixed-point and the floating point binary number systems, that can be used when performing calculations. These binary number systems are needed since digital circuits use high and low voltage levels, which correspond to a one and a zero, when representing information.
For more information concerning this subject, see e.g. Cook (2004) [COO04] and Patterson and Hennessy (2005) [PAT05].
Part II: The Stressometer System
This part of the thesis provides the information concerning the Stressometer system that is necessary in order to design and implement the Stressometer measuring roll simulator prototype. Thus it provides a short general introduction to the system and a more detailed description of the Stressometer measuring roll and the PFSK187 component. This includes a detailed description of the analog and digital signals used to convey the force measures from the measuring roll, the available resources in PFSK187 and the current digital design used in the FPGA that resides in PFSK187.
Chapter 4: The Stressometer Measuring Roll
This chapter provides an introduction to the functionality and technical specification of the Stressometer measuring roll, including a description of the force and pulse signals that it generates. The chapter also includes an investigation of the force signals and how they are affected by different wrap angles, slot angles and rotational velocities.
Chapter 5: A Case Study of PFSK187
The ‘A Case Study of PFSK187’ chapter provides a description of PFSK187 based on the documentation of the prototype available during this thesis. This includes a survey of the electrical components available on PFSK187’s circuit board, the digital circuits that are implemented in its FPGA and the communications protocols used.
Part III: The Stressometer Measuring Roll Simulator
This part of the thesis covers the specification, design, implementation and validation of a simulator based on the FPGA device technology. More precisely, it covers a Stressometer measuring roll simulator prototype that is able to approximate the signals that are generated by the actual Stressometer measuring roll, based on a number of variables, such as the rotational velocity and applied forces, that are supplied to it.
Thus this part of the thesis serves two distinct purposes, one of which is to investigate whether it is feasible to implement simulators as digital systems based on the FPGA device technology. The second purpose is the development of a prototype that can lead to a final product. This final product could be used e.g. during production testing of the Stressometer control functions (i.e. the Stressometer cubicle) or as a part of a larger simulation system.
Chapter 6: Specification
This chapter outlines the different requirements of the Stressometer measuring roll simulator. This includes general design and documentation guidelines, variables and signals that need to be supplied or are generated and the different functions that must be performed.
Chapter 7: Prototype Design
This chapter contains a detailed description of the Stressometer measuring roll simulator prototype design. The chapter starts with a description of the various components used to approximate the force and pulse signals of the Stressometer measuring roll. The chapter continues by discussing the user supplied and internal variables, including their formats and how they can be determined. The communications interface is then discussed, detailing the signal interface and the communications protocol used to allow external components to control the operation of the roll simulator and to update the user supplied variables. Finally the design of a number of peripheral components, external to the roll simulator, is discussed. These components are needed for the correct operation of the roll simulator and include logic that generates the various clock signals and a roll controller. The roll controller is able to update the user variables and control the operation of the roll simulator based on instructions that are stored in a ROM.
Chapter 8: Implementation and Verification
The ‘Implementation and Verification’ chapter discusses details concerning the implementation and verification of the Stressometer measuring roll simulator prototype. This includes a pin-out specification when implemented in PFSK187 and verification through simulation and hardware tests.
Chapter 9: Conclusions and Further Work
This chapter discusses the conclusions that can be drawn from the design and implementation of the Stressometer measuring roll simulator prototype. Suggestions are also made for how the prototype simulator can be improved further.
Part IV: Feasibility Study of Parallel FPGA-Based Cholesky Decomposition
The aim of this part of the thesis is to investigate the feasibility of implementing a parallel FPGA-based Cholesky algorithm in hardware. This is done in an effort to determine if it is feasible in general to implement computationally heavy algorithms, suited for parallel processing, as FPGA-based parallel hardware. If feasible, parallel FPGA-based implementations of computationally heavy algorithms could serve to reduce the computational load on the Stressometer control system. This is of interest for several reasons, e.g. to allow more functionality to be implemented or to reduce the need for hardware upgrades in the existing Stressometer control system.
In order to accomplish this, a brief survey is performed of literature relating to parallel processing in general and parallel FPGA-based implementations of the Cholesky algorithm in particular. In addition to this, a close examination of the Cholesky algorithm is performed, studying properties that can be exploited in a parallel implementation and simulating a parallel prototype design. Finally the result of the feasibility study and suggestions of improvements are presented.
Chapter 10: Literature Study
The Cholesky literature study chapter provides a basic introduction to parallel implementation of algorithms, specifically the Cholesky algorithm when targeting the FPGA device technology. It first describes the serial Cholesky algorithm and then gives a short description of parallel computation, highlighting issues that relate to the FPGA device technology. Finally three papers, covering the subject of parallel Cholesky decomposition, are reviewed.
For more information concerning parallel computation, see e.g. Grama (2003) [GRA03], and regarding Cholesky decomposition see e.g. Higham (2002) [HIG02], Björk (1996) [BJO96] or Golub (1996) [GOL96]. The reviewed papers are Demirsoy (2009) [DEM09], Maslennikow (2007) [MAS07] and Kurzak (2008) [KUR08].
Chapter 11: Parallel FPGA-based Cholesky Decomposition
This chapter investigates the feasibility of implementing a parallel FPGA-based Cholesky decomposition. It first investigates the Cholesky decomposition in an effort to find properties, such as task interaction and dependencies, that can be exploited in a parallel implementation. In a second step this information is used to develop a prototype design that can be used to investigate the behavior, through simulations, of a parallel FPGA-based Cholesky implementation, in order to determine the feasibility and the performance of such an implementation. Suggestions of how the prototype design can be improved, for a final implementation, are also presented.
Chapter 12: Conclusions and Further Work
This chapter discusses the conclusions that can be drawn from the feasibility study performed on the subject of parallel FPGA-based Cholesky decomposition. Suggestions of further work concerning the parallel Cholesky component, in addition to the suggested improvements of the prototype design in the previous chapter, are also presented.
Part V: Conclusions and Further Work
This part of the thesis discusses the more general conclusions that can be drawn from this thesis. This includes a summary of the conclusions drawn in Chapter 9, concerning the Stressometer measuring roll simulator prototype, and Chapter 12, concerning the feasibility study of parallel FPGA-based Cholesky decomposition. In addition to this, general suggestions of further work on this thesis are also discussed.
Chapter 13: Conclusions
While Chapters 9 and 12 discuss conclusions that relate specifically to the Stressometer measuring roll simulator prototype and the feasibility study of parallel FPGA-based Cholesky decomposition, this chapter focuses on the more general conclusions that can be drawn from this thesis, in an effort to put the findings of this thesis into a broader perspective.
Chapter 14: Further Work
Similar to the previous chapter, these suggestions of further work focus on a broader perspective. This is in contrast to the previous suggestions of further work in Chapters 9 and 12, which focus more specifically on the Stressometer measuring roll simulator prototype and the parallel FPGA-based Cholesky decomposition.
Part VI: Appendixes
This part of the thesis contains the appendixes, which include PFSK187’s FPGA pin-out specification and the signal interfaces of the Stressometer measuring roll simulator prototype’s internal components. It also contains a description of the Matlab functions that can be used to generate instructions for the roll controller component.
Appendix A: PFSK187 FPGA Pin-Out Specification
This appendix contains the pin-out specification of PFSK187’s FPGA, as it stood on the 27th of May 2008.
Appendix B: Roll Simulator - Internal Component Signal Interfaces
This appendix contains all the signal interfaces of the roll simulator’s internal components, memories, multipliers and shift registers.
Appendix C: Roll Controller - Instruction Generation
This appendix contains a description of a number of Matlab files used to generate the roll controller instructions memory initialization file. In addition to this, these files are used to generate test vectors for the testbenches, making it possible to simulate the behavior of the roll simulator and to evaluate its performance.
Appendix D: Roll Control - Instruction Verification
This appendix gives a list of the roll controller instructions used, in the correct order, when generating the test vectors for communications interface and roll controller verification.
Digital System Design
The aim of this chapter is to provide a brief introduction to some of the more central aspects of digital system design, covering subjects such as device technologies, hardware description languages and different architectural approaches (e.g. the use of IP components, SoCs and NoCs), with a focus on the FPGA device technology. The goal is to give a basic understanding of the different subjects, including advantages and disadvantages that may exist.
There exists a wide range of device technologies that can be used when implementing digital systems, some of which can be seen in the list below. Most of these devices can be placed in one of two categories: application specific integrated circuits (ASICs) and programmable logic devices (PLDs). The field programmable gate arrays (FPGAs) are placed in their own category since they have properties common to both ASICs and PLDs.
• Application Specific Integrated Circuit (ASIC)
– Gate Arrays
– Structured ASICs
– Standard Cell
– Full Custom
• Field Programmable Gate Array (FPGA)
• Programmable Logic Device (PLD)
– Simple Programmable Logic Device (SPLD)
∗ Programmable Read Only Memory (PROM)
∗ Programmable Logic Array (PLA)
∗ Programmable Array Logic (PAL)
∗ Generic Array Logic (GAL)
– Complex Programmable Logic Device (CPLD)
One issue that is critical during the design process of a digital system is the selection of a suitable device technology [CHU06]. This can be accomplished by comparing the area consumption, power consumption, speed and cost of a digital system when implemented using several different device technologies [CHU06]. A short summary of these measures can be seen in the list below. The same measures can also be used when comparing different designs and design methods.
• Area Consumption:
This measure refers to the resource utilization, i.e. how much of the available logic in the target device is used. When comparing implementations of the same digital system in different device technologies, the area consumption can be seen as a measure of how efficiently a design can be mapped to a certain device technology. This is of interest since the mapping process, in combination with the granularity of the technology (i.e. if the digital system is mapped to transistors, logic elements, gates, etc.), often leads to a waste of resources. [CHU06]
• Power Consumption:
The power consumption of a digital system is closely linked with the area consumption, since implementations that use less area can also be realized in physically smaller ICs, which consume less power. [CHU06]
• Speed:
The speed is usually measured using the worst-case propagation delay between the input and output signals. The propagation delay denotes the time that passes from when an input signal changes until the output signals reflect this change. [CHU06]
• Cost:
The cost of a digital system includes the development, production and time-to-market costs. The time-to-market cost denotes a loss of revenue that may occur as a result of long development times. For example, an implementation with a short time-to-market is able to generate profit (i.e. can be sold) while an implementation with a long time-to-market is still being developed. [CHU06]
The application specific integrated circuits (ASICs) are designed and produced to perform specific tasks. This is accomplished by supplying a number of so called masks to a foundry, which uses them during the production of the physical ICs [CHU06]. In the case of full custom ASICs, these masks describe the transistors used, how they are placed and how they are connected to each other [CHU06, MFI04]. This means that the ASICs obtain their functionality during the production of the ICs and can thus not be programmed or modified afterwards [CHU06].
The general concept of using masks to produce and configure the ICs is also used for the other, previously mentioned, ASIC technologies [CHU06, MFI04]. Suggested reading for more information regarding the ASIC device technologies and how the different ASICs are produced can be seen at the end of this section.
In general ASICs consume less power and area, while providing higher speeds, than FPGAs and PLDs [CHU06]. The drawback is that the design process is very complex and time consuming, resulting in high development and time-to-market costs [CHU06]. At the same time the production cost per unit is lower than for FPGAs and PLDs, making the use of ASICs cost efficient when producing larger volumes [CHU06]. In addition to this, the ASIC technology can be used to implement very large and complex systems [MFI04].
In 1970 the first programmable logic device (PLD) arrived in the form of a programmable read only memory (PROM), which from the start was intended to be used as a computer memory. However, designers also found that the PROMs could be used to implement simple logic functions. After the arrival of the PROM based PLD, a number of similar device technologies also emerged, including the programmable logic arrays (PLAs) and the programmable array logics (PALs). This group of device technologies is commonly referred to as simple programmable logic devices (SPLDs). [MFI04]
Usually these devices are constructed as a two level array, one level corresponding to an and-plane and the other to an or-plane. Different logic functions can be implemented by programming the interconnect within each of these planes, as illustrated in Figure 2.1. In addition to this, different vendors often include support for different features, such as registered outputs, configurable pins (i.e. either as input or output) and invertible outputs. This means that the designer needs to select a suitable SPLD, which has the necessary features, for an implementation of a digital system. [MFI04]
Figure 2.1: This figure illustrates the and-plane and or-plane in a programmable logic device. The points at which the different paths cross each other can be open or closed, thus implementing a logic function. If x marks a closed connection, w = (a and !b) or (!a and b). [MFI04]
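The two level structure described above can be sketched in software. The following Python fragment is only an illustrative model, not part of the thesis design: the names `pla_output`, `AND_PLANE` and `OR_PLANE` are hypothetical, and the closed connections are chosen to reproduce the function w = (a and !b) or (!a and b) given in the caption of Figure 2.1.

```python
def pla_output(inputs, and_plane, or_plane):
    """Evaluate one output of a two level and/or array (illustrative model).

    inputs    : dict mapping an input name to its boolean value
    and_plane : list of product terms; each term maps an input name to the
                polarity that is connected (True = input, False = inverted input)
    or_plane  : indices of the product terms connected to the output
    """
    # And-plane: a product term is true when every connected literal is true.
    terms = [
        all(inputs[name] == polarity for name, polarity in term.items())
        for term in and_plane
    ]
    # Or-plane: the output is true when any connected product term is true.
    return any(terms[index] for index in or_plane)


# Assumed programming of the planes for w = (a and !b) or (!a and b).
AND_PLANE = [
    {"a": True, "b": False},   # a and !b
    {"a": False, "b": True},   # !a and b
]
OR_PLANE = [0, 1]              # both product terms feed the output w


def w(a, b):
    """Output w of the sketched array for the inputs a and b."""
    return pla_output({"a": a, "b": b}, AND_PLANE, OR_PLANE)
```

Changing the entries of `AND_PLANE` and `OR_PLANE` corresponds to opening or closing crossing points in the two planes, which is how a different logic function would be programmed into the same array.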
At the end of the 1970s more complex devices, known as complex programmable logic devices (CPLDs), arrived on the scene. In essence the CPLDs include several SPLDs on the same physical chip, connecting them through their inputs and outputs using a programmable interconnect matrix. A very important property of this interconnect matrix is that the SPLD blocks are connected with less than 100% connectivity, thus giving the CPLDs a more scalable power consumption, speed and cost. This comes at the cost of more complex software design tools. [MFI04]
Advantages of the PLDs over the ASICs include both lower development and time-to-market costs [CHU06]. In addition to this, the PLDs can be programmed, in some cases also reprogrammed, after the physical chips have been manufactured [CHU06, MFI04]. However, the PLDs also consume more power and area, while at the same time operating at lower speeds, than the ASICs [CHU06]. An additional disadvantage is that the PLDs can only be used to implement relatively simple designs [MFI04].
For further reading regarding ASIC, PLD and FPGA device technologies, including their properties and how they are configured, see e.g. Maxfield (2004) [MFI04] and Chu (2006) [CHU06].
Field Programmable Gate Arrays (FPGAs)
Xilinx introduced the first field programmable gate array (FPGA) in 1984, although it was not until the early 1990s that the FPGAs achieved widespread use [MFI04]. The FPGA technology served to fill a gap that existed between the programmable logic devices (PLDs) and the ASICs [MFI04], since FPGAs can be used to implement more complex designs than the PLDs, while still having lower development and time-to-market costs than the ASICs [MFI04]. Similar to the PLDs, the FPGAs are also programmed after the physical chips have been manufactured [MFI04]. One drawback with the FPGA device technologies is that they consume more power and area, while at the same time operating at lower speeds, than the ASIC device technologies [CHU06].
FPGAs can be based on either a non-volatile anti-fuse technology or a volatile SRAM technology. While the non-volatile FPGAs are one-time programmable (OTP), the more common SRAM based FPGAs can be reconfigured repeatedly while resident in a system (in-system programmable, ISP). The ability to reconfigure the FPGAs is very useful since it facilitates hardware verification during development and hardware upgrades after the release of a product. A drawback with the volatile reconfigurable FPGAs is that they lose their configuration each time the system is powered down. This means that the FPGAs must be configured, using a stored configuration file, each time the system is powered up. [MFI04]
Modern FPGAs, based on the SRAM technology, might also be able to perform dynamic and partial reconfigurations. This means that pre-defined areas in the FPGA can be reconfigured during run-time while at the same time leaving unaffected areas operational. Dynamic and partial reconfiguration of FPGAs can be used to implement reconfigurable computing, which allows the hardware to be adapted during run-time to the changing needs of a system. [MFI04]
The FPGAs are based on a large number of small logic blocks that are connected to each other through a programmable interconnect. These logic blocks, which are also known as logic cells or logic elements, often include elements such as lookup-tables (LUTs), registers, multiplexers, etc. The logic blocks can be configured to perform simple logic functions and by connecting several of these blocks together, more complex functions can be derived. In addition to these logic blocks, the FPGAs can also contain dedicated circuitry that performs specific functions (i.e. it cannot be configured to perform different tasks). The dedicated circuitry operates at greater speeds and with lower resource consumption than if the function had been implemented using logic blocks. [MFI04]
The internal architecture of the FPGAs, including functions commonly found as dedicated circuitry, is described more thoroughly in the following sections. For more information concerning the FPGA device technology, see e.g. Maxfield (2004) [MFI04] and Chu (2006) [CHU06].
Most FPGA devices use logic blocks based on lookup tables (LUTs) when implementing simple logic functions, although other techniques, such as MUX based logic blocks, also exist. As illustrated in Figure 2.2, logic functions can be described using truth tables that list all the possible combinations of input values with their corresponding output value. The basic idea of a LUT based approach is to use the different combinations of input values as addresses, letting them point to different elements in the LUT. A logic function can thus be implemented by storing each output value in the LUT element pointed to by the corresponding combination of input values. [MFI04]
Figure 2.2: This figure illustrates how LUTs can be used to implement logic functions. Each output value of the logic function is stored in an element of the LUT that can be accessed using the corresponding combination of input values as an address. [MFI04]
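The addressing idea described above can be sketched as follows. This is a minimal software model under stated assumptions, not the thesis design: the helper names `build_lut` and `lut_read` are hypothetical, the first input is assumed to be the most significant address bit, and the three-input example function is chosen only for illustration and is not taken from Figure 2.2.

```python
def build_lut(func, num_inputs):
    """Fill a LUT by evaluating func for every combination of input bits.

    The element at address k stores the output for the input combination
    whose bits, read from the first input to the last, spell out k in binary.
    """
    lut = []
    for address in range(2 ** num_inputs):
        # Split the address into individual input bits, first input first.
        bits = [(address >> shift) & 1 for shift in reversed(range(num_inputs))]
        lut.append(func(*bits) & 1)
    return lut


def lut_read(lut, *bits):
    """Evaluate the logic function by using the input bits as an address."""
    address = 0
    for bit in bits:
        address = (address << 1) | (bit & 1)
    return lut[address]


# Hypothetical example function y = (a and b) or c, stored in a 3-input LUT.
EXAMPLE_LUT = build_lut(lambda a, b, c: (a & b) | c, 3)
```

Once the table has been filled, evaluating the function is a single memory read, e.g. `lut_read(EXAMPLE_LUT, 1, 1, 0)`. Reconfiguring the logic function amounts to rewriting the stored table, which is why the same physical LUT can implement any function of its inputs.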
As illustrated in Figure 2.3 the logic blocks are not only composed of a LUT, but also of other components such as registers and multiplexers. Also notice that the number of inputs to each LUT can vary between different vendors and FPGA models. A very useful property of the SRAM based FPGAs, which use the LUT based logic blocks, is that some vendors (e.g. Xilinx) also allow the LUTs to be used as distributed memories and shift registers.
These logic blocks are then grouped together to form slices, which in turn are grouped together into configurable logic blocks (CLBs), also known as logic array blocks (LABs) [MFI04]. This structure is illustrated for a Virtex-5 FPGA in Figure 2.4, where each CLB contains four slices and each slice contains two logic blocks [XIL09a]. Notice that the number of logic blocks in each slice and the number of slices in each CLB also vary between different vendors and FPGA models [MFI04].
Figure 2.3: This illustration shows a simplified version of a logic block. The basic components are a lookup-table, a register and a multiplexer. [MFI04, XIL09a]
Figure 2.4: This figure illustrates how the configurable logic blocks (CLBs), slices and logic blocks relate to each other. [MFI04, XIL09a]
The logic blocks in a single slice share the same clock, clock enable and reset signals, while the data input and output signals, used to access the different LUTs, are exclusive. The logic blocks within a CLB can communicate with each other using a fast programmable interconnect, which reduces communication delays compared to those that occur when using the general programmable interconnect between different CLBs. [MFI04]
In addition to the more general resources described in the previous section, the FPGAs usually also includes a variety of different functions as dedicated circuitry. A brief description of the more commonly occurring functions can be seen below.
• Memories:
In addition to the distributed memory that some FPGA models provide, memories are usually also available as dedicated circuitry, in the form of a number of fixed size blocks that can be combined in order to create larger memories. These blocks can often be configured as single port memories, dual port memories or first-in-first-out (FIFO) memories. [MFI04]
• Microprocessors:
As discussed in Sections 2.4 and 2.5 there are several advantages, such as reduced complexity and time-to-market costs, of using a hardware/software co-design approach for some types of applications. Of course this implies the use of a microprocessor, e.g. in a system-on-a-chip (SoC). A microprocessor can be included in the design in several different ways: as an external component, using the logic blocks of the FPGA or as dedicated circuitry [MFI04].
Microprocessors implemented in dedicated circuitry (i.e. hard) are both faster and more complex (i.e. provide more features) compared to microprocessors implemented using the logic blocks (i.e. soft). However, one advantage of microprocessors implemented using logic blocks is that they can be added and removed as required. Using external microprocessors can be disadvantageous since this increases design complexity and development costs due to the need for additional chips (i.e. the microprocessor), more complex circuit board designs, etc. [MFI04]
• Multipliers and Adders:
Operations such as multiplication and addition can be found in a variety of different applications, e.g. when performing digital signal processing. The trouble with these kinds of operations is that they are usually slow and consume a lot of resources when implemented using logic blocks. For this reason it is common to find multipliers, adders and multiply-and-accumulators (MACs) as dedicated circuitry in the FPGAs. [MFI04]
• Clock Managers:
The FPGAs are usually supplied with a clock from an external source, which is used to drive the different synchronous components of the digital system. This clock signal is distributed to the different components using dedicated paths known as clock trees, which are separated from the general programmable interconnect in order to reduce clock skew. Usually there are several clock trees present in an FPGA, which facilitates the use of different clock domains. [MFI04]
In addition to the clock trees, the FPGAs may also contain clock managers which can be used for a number of different purposes related to the clocks. This includes frequency synthesis (i.e. generating one or several daughter clocks based on an input clock), removal of jitter, phase shifting and the creation of phase-locked loops. [MFI04]
• General purpose I/O:
Today’s FPGAs usually include thousands of I/O pins which can be used for communication with the surrounding environment, e.g. with DA converters, physical Ethernet connections, etc. When communicating in digital electronics, a signaling standard must be used in order to be able to discern the different bits, the ones and zeroes, from each other. Thus the signaling standard describes the electrical characteristics of the communication, such as the voltage levels for the ones and the zeroes. [MFI04]
There exist many different signaling standards that can be used, such as the Low Voltage Transistor Transistor Logic (LVTTL) and the Low Voltage Differential Signaling (LVDS) standards. In order to facilitate the use of different signaling standards, the I/O pins are divided into a number of I/O banks, each with its own independent voltage supply (i.e. the voltage supply can vary between the different banks). Thus the I/O pins can be configured to use different signaling standards, as long as they are supported by the FPGA and the voltage supply requirements are fulfilled by the bank to which the I/O pin in question belongs. Since the different I/O banks have their own power supplies, the internal logic of the FPGAs can use a signaling standard, based on the core voltage supply, which is independent of those used by the I/O pins. This enables the manufacturers of the FPGA chips to reduce the power consumption through the use of a carefully selected internal signaling standard. [MFI04]
In addition to the use of different signaling standards, each I/O pin usually also has a configurable impedance. This impedance is used to reduce noise by preventing signals from reflecting back to the FPGA. [MFI04]
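As a small illustration of the multiply-and-accumulate (MAC) operation mentioned in the list above, the following Python sketch computes a dot product by repeating one MAC step per element pair, which is the kind of operation that dominates many digital signal processing applications. The function names `mac` and `dot_product` are hypothetical, and the sketch models only the arithmetic, not the timing or word lengths of a dedicated hardware MAC.

```python
def mac(accumulator, a, b):
    """One multiply-and-accumulate step: add the product a*b to the accumulator."""
    return accumulator + a * b


def dot_product(xs, ys):
    """Compute a dot product as a sequence of MAC steps, one per element pair."""
    if len(xs) != len(ys):
        raise ValueError("both vectors must have the same length")
    accumulator = 0
    for x, y in zip(xs, ys):
        accumulator = mac(accumulator, x, y)
    return accumulator
```

In software each step is executed sequentially, whereas several dedicated MAC blocks in an FPGA could process different element pairs, or different dot products, at the same time.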
When the FPGA technology first appeared in the mid 1980s it was mainly used to implement relatively simple digital electronics, such as glue logic and state machines. As time passed and the FPGA technology evolved, so did the complexity of the digital systems that were implemented using FPGAs. Thus in the early 1990s the FPGA technology could be used to implement communications electronics (e.g. networking and telecommunications) and as prototypes for ASIC designs. By the early 2000s the FPGA technology had evolved to a point where it could be used to implement a wide variety of applications in several different areas. Five of the major areas are discussed briefly below. [MFI04]
• Replacing ASICs:
As the FPGA technology evolves and becomes more powerful, an increasing number of applications that previously would have been implemented using ASICs are now being implemented using FPGAs. [MFI04]
• Systems on a Chip:
Historically, embedded systems were based on a microcontroller, which usually has on-chip program memory, I/O resources, etc. Special purpose hardware could also be attached externally (i.e. off-chip) to the microcontroller. However, as the FPGA technology evolved, it became possible to implement these systems on a single FPGA. Using the FPGA technology gives the advantage of being able to tailor a system to a specific application, including adding special purpose hardware on the same chip. These systems are known as systems-on-a-chip (SoC, see Section 2.5). [MFI04]
• Digital signal processing:
A large amount of dedicated circuitry, such as multipliers and memories, is commonly found in today’s FPGAs. This, in combination with the potential for parallel processing that the digital hardware provides, has led to an increase in the use of FPGA technology for digital signal processing applications. [MFI04]
• Physical Layer Network:
FPGAs have often been used in network communication applications, more specifically as a bridge between the low level physical devices (i.e. the physical layer of the well known OSI model) and the higher layer network protocols. Today’s FPGAs often contain dedicated circuitry capable of high speed communications, usually providing both the physical layer and bridge (i.e. the media access layer) functionality. Thus it is possible to implement network capable devices in a single FPGA chip. [MFI04]
• Reconfigurable Computers:
A fairly new field of application, known as reconfigurable computing, takes advantage of the FPGAs’ ability to be reconfigured [MFI04]. Advances in FPGA technology, such as dynamic and partial reconfiguration, facilitate development in this area of application [COZ09]. Through the use of reconfigurable computing, the configuration of an FPGA can be adapted as the hardware needs of a system change [COZ09]. This allows the available resources to be exploited more efficiently, reducing resource and power consumption while still providing the necessary performance [COZ09].
However, there are also drawbacks with reconfigurable computing, such as reconfiguration overhead and increased design complexity. The reconfiguration overhead can be attributed to e.g. the time it takes to physically reconfigure the device and the blocking of functions that depend on the hardware being reconfigured [MFI04]. The increase in design complexity can be attributed to, for example, the additional control logic needed to perform dynamic and partial reconfiguration (i.e. deciding when and what to reconfigure during run-time).
Reconfigurable computing can be used for a number of different tasks, such as providing increased reliability, failure redundancy (e.g. through a safe hardware mode) and hardware accelerators for software algorithms (e.g. having a hardware component library of various mathematical functions) [MFI04, BEC06].
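The trade-off between reconfiguration overhead and hardware acceleration discussed above can be made concrete with a simple break-even estimate. The sketch below is a simplified cost model assumed for illustration, not a method taken from the cited sources: a function costs t_sw per call in software and t_hw per call in hardware, and loading the accelerator costs a one-time t_reconfig. All names and figures are hypothetical.

```python
import math


def break_even_calls(t_sw, t_hw, t_reconfig):
    """Smallest call count n with t_reconfig + n * t_hw < n * t_sw.

    All times must use the same unit. Returns None when the hardware
    version is not faster per call, so the overhead is never recovered.
    """
    if t_hw >= t_sw:
        return None
    # n must be strictly greater than t_reconfig / (t_sw - t_hw).
    return math.floor(t_reconfig / (t_sw - t_hw)) + 1


# Hypothetical figures in microseconds: 100 per call in software,
# 20 per call in hardware and 4000 to reconfigure the device.
calls_needed = break_even_calls(100, 20, 4000)
```

With these hypothetical figures the accelerator has to be called 51 times before the reconfiguration time has been recovered; for fewer calls the software version is faster overall.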
Xilinx’s Virtex-5 FPGAs
This section briefly describes Xilinx’s Virtex-5 FPGAs, which can be divided into five different platforms: LX (high-performance logic applications), LXT (high-performance logic applications with serial connectivity), SXT (digital signal processing applications), FXT (embedded system applications) and TXT (high bandwidth applications). [XIL06]
The general purpose logic is implemented using the configurable logic blocks (CLBs), each of which contains two slices. In turn, each slice contains four logic cells that are based on a six-input lookup table (LUT). In addition to the general purpose logic, the Virtex-5 FPGAs also contain dedicated circuitry for a number of different functions, a few of which are described briefly below. [XIL09a]
• Block RAM:
In addition to being able to use some of the logic cells as distributed RAMs or shift registers, Virtex-5 also includes a number of 36 Kbit true dual-port block RAMs. These block RAMs can be combined to create larger memories, used as two separate 18 Kbit memories or configured as FIFOs. [XIL09b]
• DSP48E Slices:
The DSP48E slices contain a 25x18 bit two’s complement multiplier and a 48 bit adder, which can also act as a subtractor or an accumulator (i.e. for multiply-and-accumulate (MAC) operations). They can also perform complex-multiply and bitwise logical operations. [XIL09b]
• PowerPC 440:
The FXT platform Virtex-5 FPGAs also contain embedded PowerPC 440 microprocessors, which have a seven-stage pipeline and instruction and data caches, and which support multiple instructions per cycle and out-of-order execution. They can operate at frequencies up to 550 MHz and are able to connect to a 128 bit wide processor local bus (PLB), which is a part of IBM’s CoreConnect bus architecture (see Section 2.6.1). In addition to the PLB interface, they also have dedicated interfaces for a DDR2 memory controller and an auxiliary processor unit (APU). [XIL09b]
• Clock Management Tiles:
The Virtex-5 family is equipped with clock management tiles, each containing two digital clock managers (DCMs) and one phase-locked loop (PLL), which provide functionality to manage the clocks in a system. The DCMs can be used to create delay-locked loops and to phase shift clocks, while the PLLs can be used to create phase-locked loops and remove jitter from a source clock (e.g. an externally supplied clock). The DCMs and PLLs can also be used to synthesize daughter clocks from a source clock. [XIL09a]
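Returning to the DSP48E slices described above, the multiply-and-accumulate operation can be sketched as a small behavioral model in Python. The model only mirrors the operand widths given in the text (25 and 18 bit signed operands, a 48 bit result); the function names are hypothetical, and the sketch ignores pipelining and all other operating modes of the slice.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    if value & (1 << (bits - 1)):
        value -= 1 << bits
    return value


def mac(acc, a, b):
    """One multiply-and-accumulate step: acc + a * b, wrapped to 48 bits.

    a is truncated to 25 bits and b to 18 bits before multiplying.
    """
    product = to_signed(a, 25) * to_signed(b, 18)
    return to_signed(acc + product, 48)


def dot(xs, ys):
    """Dot product computed as a chain of MAC steps on one accumulator."""
    acc = 0
    for x, y in zip(xs, ys):
        acc = mac(acc, x, y)
    return acc
```

A dot product, the core of FIR filters and matrix-vector products, maps onto this structure as one MAC step per pair of elements. In this model an accumulator that exceeds 48 bits wraps around rather than saturating.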