Adaptation of an ARM compatible System on chip as an IP-module in a FPGA


IT 14 008

Degree project, 30 credits

January 2014


Faculty of Science and Technology, UTH unit. Visiting address: Ångströmlaboratoriet, Lägerhyddsvägen 1, Hus 4, Plan 0. Postal address: Box 536, 751 21 Uppsala. Telephone: 018 – 471 30 03. Fax: 018 – 471 30 00. Website: http://www.teknat.uu.se/student

Abstract


Emanuel Wahlqvist


Acknowledgements

I would like to thank:

• Stig Silver at Syntronic for trusting in me to solve this problem.

• Lars Johansson, my supervisor at Syntronic, for guidance, knowledge and help along the way.

• Robert Adenmark who, despite being on parental leave, showed me in the right direction on more FPGA detailed issues.

• Leif Gustafsson, my supervisor at Uppsala University, for reading, correcting and giving knowledgeable input to this report and for directing my attention to related studies in this subject.

And all others at Syntronic who in some way aided me in this work.


4.8. SPI
4.8.1. SPI protocol
4.8.2. SPI controller
4.9. UART
4.9.1. UART protocol
4.9.2. UART controller
4.10. Ethernet MAC
4.11. GPIO
4.11.1. GPIO controller
4.12. Timers
4.12.1. Registers
4.12.2. Setting up a timer
4.13. Interrupt controller
4.13.1. Registers
4.14. Test module
4.15. Verilog test bench
4.15.1. UART
4.15.2. SPI
4.15.3. I2C
4.15.4. GPIO

5. Configuration
5.1. Parameters
5.2. Adding or removing a peripheral

6. Tools
6.1. Xilinx ISE 14.5
6.1.1. Synthesis
6.1.2. Simulation
6.1.3. Bulk simulation
6.1.4. Debug switches
6.2. Sourcery CodeBench for ARM processors
6.3. Amber specific tools
6.3.1. amber-elfsplitter
6.3.2. amber-elfsplitter-memcontents
6.3.3. check mem size
6.4. Installation

7. Testing
7.1. Assembler tests
7.1.1. SPI test (spi.S)
7.1.2. I2C test (i2c.S)
7.1.3. UART test (uart tx.S)
7.2. C tests
7.2.1. Libraries
7.2.2. boot-loader-serial
7.2.3. dhry
7.2.4. hello-world
7.2.5. spi-timer
7.3. Linux test

8. Result

9. Conclusion
9.1. Specification
9.2. Implementation
9.2.1. Target independence
9.2.2. Peripheral integration
9.3. Documentation
9.3.1. Peripheral controllers
9.4. Testing
9.5. Compiler optimization issue

10. Discussion
10.1. Pros and cons with the Amber SoC
10.2. Peripherals

11. Future work

12. Bibliography

A. I2C test output


List of Figures

1.1. Simple sketch of FPGA layout
4.1. Diagram showing the complete system design
4.2. Overview of the a23 Verilog structure
4.3. Example of pipelined execution
4.4. Example of control hazard handling
4.5. Wishbone handshake
4.6. Wishbone single read cycle
4.7. Wishbone single write cycle
4.8. Wishbone synchronous burst cycle
4.9. Schematic of the Wishbone demultiplexer
4.10. Tri-state buffers on SDA and SCL
4.11. I2C transfer
4.12. SPI transfer timing diagram
4.13. UART in half duplex mode
4.14. UART in full duplex mode with RTS and CTS
4.15. A UART transfer
4.16. Structural schematic of the UART controller
4.17. GPIO pin tri-state buffer connection
4.18. Interrupt vectors and masks
4.19. Fast interrupt vectors and masks
6.1. Xilinx ISE design flow
6.2. Simulation script organization
7.1. SPI transfer of first word
7.2. SPI transfer of second word
7.3. Start condition and sending slave address plus write bit (0x20)
7.4. Sending register address 0x01
7.5. Sending data 0xa5
7.6. Sending data 0x5a and a stop condition
7.7. Start condition and sending slave address plus write bit (0x20)
7.8. Sending register address 0x01
7.9. Start condition and sending slave address plus read bit (0x21)
7.10. Reading data 0xa5
7.11. Reading data 0x5a and stop condition
7.13. Sending invalid register address 0x10 and receiving NACK
7.14. Send character "H"
7.15. Send character "i", receive character "H"
7.16. Send character "!", receive character "i"
7.17. Send character " ", receive character "!"
7.18. Pins [8:1] is "0xDA" and mirrored on pins [16:9]


List of Tables

3.1. Comparison between ARM cores
4.1. ARMv2a instructions supported by the Amber core
4.2. Some of the control signals for the execute stage
4.3. Wishbone signals, direction is seen from a master perspective
4.4. Slave numbering in the Wishbone demultiplexer
4.5. Coprocessor registers. All registers are 32 bits wide
4.6. Layout of coprocessor register CR0
4.7. I2C registers. All registers are 8 bits wide
4.8. SPI modes
4.9. SPI core registers. All registers are 32 bits wide
4.10. UART core registers. All registers are 8 bits wide
4.11. Flag register bits. Bits 2 and 1 are always high
4.12. GPIO core registers
4.13. Timer core registers
4.14. Control register bits. Unused bits are always low
4.15. Timer prescaler value
4.16. Interrupt vector outline. The unused bits (NA) are initialized to zero
4.17. Timer core registers
4.18. Test module registers
5.1. Parameters to configure the system
6.1. Simulation script options
6.2. Simulation script options
6.3. Required environmental variables


Abbreviations

CISC Complex Instruction Set Computer

CLB Configurable Logic Block

CTS Clear To Send

DMIPS Dhrystone Million Instructions Per Second

DSP Digital Signal Processors

ELF Executable and Linkable Format

FIFO First In First Out

FIRQ Fast Interrupt ReQuest

GPL General Public License

GUI Graphical User Interface

HDL Hardware Description Language

I2C Inter-Integrated Circuit

IP Intellectual Property

IRQ Interrupt ReQuest

ISA Instruction Set Architecture

LGPL Lesser (or Library if old) General Public License

LSB Least Significant Byte

LUT Look-up table

MAC Media Access Control

PC Program Counter

PCB Printed Circuit Board

PLL Phase Locked Loop


RISC Reduced Instruction Set Computer

RTS Ready To Send

Rx Receive

SPI Serial Peripheral Interface

Tx Transmit


1. Introduction

Syntronic is a global consultancy working in product development, testing and maintenance. The company is active in several areas, such as telecommunication, medical and automotive. Its idea is to cover all stages from a design idea to a finished product applied in the field. In doing this, it sees an advantage in keeping a design flow that not only gets products to market quickly but also makes them easy to maintain and upgrade. As a step in optimizing that design flow, Syntronic wants to take a closer look at soft processors. The reason is that several of its earlier designs have involved both FPGAs and microprocessors. Integrating the microprocessor into the FPGA has great potential to reduce development time while making the system easier to tailor to future needs.


1.1. FPGA fundamentals

There are several manufacturers of FPGAs that all use their own architecture, but the main structure is very similar. A general FPGA mainly consists of Configurable Logic Blocks (CLBs) but also contains memory and DSP blocks. All blocks are connected through a configurable routing net that can connect any blocks with each other regardless of their physical location on the FPGA, as shown in Figure 1.1. This makes it possible to create any logic function, ranging from a simple AND gate to extremely complex digital circuits such as processors. Historically these functions were described by creating a schematic on a drawing board. When the designs grew in size and complexity, the use of a Hardware Description Language (HDL), such as VHDL or Verilog, followed by a synthesis process became common.

Figure 1.1.: Simple sketch of FPGA layout

1.1.1. CLB

The CLB consists of several Look-Up Tables (LUTs) that work as logic function generators, and at least one flip-flop per LUT. A typical LUT has four, five or six inputs, one or two outputs and contains 2^n bits (where n is the number of inputs). The CLB usually also contains multiplexers and additional flip-flops or latches. The flip-flops are used to synchronize an output from the LUT with a clock signal.
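To illustrate how a LUT acts as a function generator, a minimal Verilog sketch of a 4-input LUT followed by a flip-flop is given here. The module name lut4, the parameter INIT and the port names are assumptions made for illustration and are not vendor primitives. The 2^4 = 16 configuration bits form a truth table that is simply indexed by the inputs.

```verilog
// Illustrative model of a 4-input LUT followed by a flip-flop.
// All names (lut4, INIT, sel, ...) are hypothetical; real devices
// provide vendor specific primitives for this.
module lut4 #(
    parameter [15:0] INIT = 16'h8000   // truth table; 16'h8000 = 4-input AND
) (
    input  wire       clk,
    input  wire [3:0] sel,       // the four LUT inputs
    output wire       comb_out,  // combinational LUT output
    output reg        reg_out    // same value synchronized to clk
);
    // The 2^4 = 16 stored configuration bits.
    wire [15:0] table_bits = INIT;

    // The inputs select one of the stored bits.
    assign comb_out = table_bits[sel];

    // Flip-flop that synchronizes the LUT output with the clock.
    always @(posedge clk)
        reg_out <= comb_out;
endmodule
```

With the default INIT value only the input combination 1111 selects a stored 1, so the LUT behaves as a 4-input AND gate. Any other four-input function is obtained by changing the 16 stored bits.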


memory type is often referred to as Distributed Random Access Memory (Distributed RAM).

1.1.2. RAM blocks

Another type of FPGA memory is the RAM block. It interfaces with the rest of the FPGA through input and output data buses, an address bus, write enable inputs, a clock input and a reset input. The internal memory array can be very large, up to at least 36 Kbit. Since it needs no multiplexers and relatively few flip-flops compared to distributed RAM, it is the preferred way of implementing large memories as it uses less FPGA area.
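As a sketch of how such a memory is typically described in Verilog, a single-port synchronous RAM is shown here. The module name, port names and sizes are assumptions chosen for illustration. Whether a given synthesis tool maps this description to a RAM block depends on the tool and the target device.

```verilog
// Sketch of a single-port synchronous RAM.
// Names and sizes are illustrative assumptions.
module sync_ram #(
    parameter ADDR_WIDTH = 10,   // 1024 words
    parameter DATA_WIDTH = 32    // 1024 x 32 bits = 32 Kbit
) (
    input  wire                  clk,
    input  wire                  we,    // write enable
    input  wire [ADDR_WIDTH-1:0] addr,  // address bus
    input  wire [DATA_WIDTH-1:0] din,   // input data bus
    output reg  [DATA_WIDTH-1:0] dout   // output data bus
);
    // The internal memory array.
    reg [DATA_WIDTH-1:0] mem [0:(1<<ADDR_WIDTH)-1];

    // Both the write and the read are registered on the clock edge.
    always @(posedge clk) begin
        if (we)
            mem[addr] <= din;
        dout <= mem[addr];   // on a simultaneous write the old value is read
    end
endmodule
```

The registered read is the important design choice: a description with an asynchronous read generally has to be built from LUTs and flip-flops, in other words as distributed RAM.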

1.1.3. Routing net

The routing nets cover the whole FPGA and can be configured to connect all blocks in many different ways. At every point where nets cross each other there is a configurable switch matrix that is used to connect the nets with each other. There is also a special type of net, called a clock net, that is used to distribute clock signals through the FPGA with minimal delay and skew.

1.2. FPGAs and processors

The fact that an FPGA can be programmed to perform any number of tasks in parallel (limited only by the size of the device) makes it very suitable for digital signal processing. Earlier, FPGAs were often coupled with a separate microprocessor that took care of communication interfaces, task management and other small organizational tasks. This has led to FPGAs with an integrated hard processor. Examples are the Xilinx Zynq[1] and Altera SoC[2] product lines, which combine different FPGAs with an ARM Cortex-A9 processor, and Microsemi's SmartFusion[3], which uses an ARM Cortex-M3 processor. This is a solution for those who need to combine a high capacity FPGA with a very capable processor. Compared to the solution with a separate processor this has the following benefits:

• Smaller total Printed Circuit Board (PCB) footprint

• No hardware interface needed between the processor and FPGA modules

For someone with lower demands on performance, this is probably not the optimal solution. Since the cheaper FPGAs have also grown in size, it is possible to implement a soft processor core inside these FPGAs along with the desired parallel logic. This has several benefits over the hard core solution, such as:

• Possibility to change/upgrade the processor in the finished product

• Companies can hide their designs better


The fastest way to implement a soft processor core in an FPGA is to use one of the vendor specific cores. Altera's processor is called NIOS II[4], while Xilinx calls theirs MicroBlaze[5]. Both are 32 bit Reduced Instruction Set Computer (RISC) processors with a variety of configuration parameters. They are, however, not the only players in the market. There are several RISC processors in the open source community to choose from, along with ARM's own proprietary soft processor called Cortex M1.

1.3. Why ARM

Since ARM has seized a firm grip on the embedded processor market and is likely to keep its position, companies, including Syntronic, see an advantage in learning and using processors based on its architectures. Even though it is not a very big step to move from ARM to a NIOS or MicroBlaze processor, Syntronic wanted to investigate the possibilities of using a soft ARM core in an FPGA.

1.4. HDL design with Verilog

To understand the description of the final system, no deep knowledge of HDLs is needed. It is, however, necessary to be familiar with the basic structure of a Verilog design. The following concepts are enough to follow the reasoning:

• Module

• Top level module

• Port

• Wire

Module A module is a block of logic that can be used once or several times in a design.

Top level module The top level module has the same code structure as a regular module. But the top level module is where all the regular modules are instantiated and organized to create the final design. Thus there can only be one top level module in a design.

Port The port is always defined at the beginning of each module and contains the interface of the module, in other words, the module's inputs and outputs.


1.4.1. Example design

An example of a module implementing a simple AND gate is shown below. Everything between the two keywords "module" and "endmodule" defines the content of the module, while "and_gate" is the name of the module, which will be used to reference it later. The code between the parentheses is the port, in this case two inputs and one output, and the implementation is written between the port and the "endmodule" keyword.
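A minimal sketch of a module matching this description could look as follows. The port names a, b and result are assumptions made for illustration.

```verilog
// Minimal sketch of an AND gate module.
// The port names a, b and result are illustrative assumptions.
module and_gate (
    input  a,
    input  b,
    output result
);
    // Implementation: the output is the logical AND of the two inputs.
    assign result = a & b;
endmodule
```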


To put several modules together one can instantiate the desired modules and wire them together as shown in the example below. There, two instances of the "and_gate" module shown above are used together with an OR gate. It is worth noting that there can also be logic in the top level module, and in some cases a design consists of only a top level module.

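/*
 * Make the and_gate module available to this file
 */
`include "and_gate.v"

/*
 * The port of the top level module
 */
module top_level (
    input  a,
    input  b,
    input  c,
    input  d,
    output result
);

    /*
     * To connect the result with the outcome of the two AND gates
     * one can use a "wire"
     */
    wire a_AND_b;
    wire c_AND_d;

    /*
     * Instantiate the two AND gates (port order assumed to be
     * input, input, output) and OR their outputs together
     */
    and_gate and_1 (a, b, a_AND_b);
    and_gate and_2 (c, d, c_AND_d);

    assign result = a_AND_b | c_AND_d;

endmodule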


2. Related work

The idea of using a soft core processor in an FPGA is not new. One example is the free LEON processor, written in VHDL and based on the SPARC architecture. The development of LEON started in 1997[6] and the first version was released under the Library General Public License (LGPL) in 2000[7] by the European Space Agency. After that the second version, called LEON2, has had several successful implementations in space[8]. For the third version the development was moved to the Swedish company Aeroflex Gaisler, and the processor is now on its fourth version. Another example of an open source processor is the OpenRISC 1000, or OR1K, which was released in 2001[9] and marked the beginning of the OpenCores community. The OR1K grew along with the community and now has a complete toolchain, several compatible operating systems and at least two available SoCs developed around it[10]. A commercial example is the NIOS processor developed by Altera, which was released in 2001[11].

2.1. Optimizations

Since the subject has become very popular there have been several studies with the goal of making soft processors more efficient in area utilization and better in performance. Sheldon et al.[12] analyse the sharing of resources such as floating point units and multipliers between soft cores. They managed to decrease the area utilization of a dual-core platform by 16% while only introducing a cycle count overhead of 1%. Another interesting article was written by Lysecky and Vahid[13], in which a so-called warp processor based on a MicroBlaze soft core is implemented. The warp processor analyses the software at runtime and uses a dynamic partitioning scheme to implement important software functions as circuits in the FPGA at runtime. When comparing the warp processor to a fully equipped MicroBlaze processor they find that the performance increased 5.8 times while the power consumption decreased by 57%. Another approach is to optimize the processor for specific software before synthesis. In an article by Sheldon et al.[14] a method for this, based on a MicroBlaze processor, is shown. They gain a speedup of at most 200%, and a 20% speedup when using tight size constraints. Yet another approach to application specific optimization is taken by Yiannacouras, Steffan, and Rose[15]. They use a Verilog generating software called SPREE to generate application specific processors. By first optimizing away unused features and then removing unused parts of the instruction set, they achieve a performance per area increase of 25% compared to a NIOS II processor.


3. Specification

A big task of the system will be to read data from different sensors and control a variety of chips. This is done through different communication protocols, of which the most commonly used today are SPI and I2C. For communicating with a PC in a simple way, UARTs have been used for a very long time and one will also be included. In discussion with the supervisor about Syntronic's needs we agreed to also include a GPIO controller and a flash memory. A complete list of the system's peripherals is shown below.

• I2C controller

• SPI controller

• UART controller

• GPIO

• Flash memory for storage

• Main memory

• Boot mechanism

• At least one user interrupt


3.1. Core alternatives

There are several processor cores available for developers to choose from. As mentioned earlier, ARM has released its own FPGA targeted design called Cortex M1. The major benefit of using this processor is that ARM has verified its function, and along with the core you get their warranty. The downside is of course the cost. A free evaluation version exists, but with a fixed configuration and no visibility into the code. Alternatives to the Cortex M1 can be found on the OpenCores website www.opencores.org. For this thesis, two projects have been considered, Amber and Storm SoC. Table 3.1 shows a comparison of the cores.[16][17][18]

Core                   Amber a23/a25¹   Cortex M1             Storm SoC
Pipeline stages        3/5              3                     8
Cache size (kB)        8-32             D=0-1, I=0-1          D=1, I=1
Interrupts             16               1-32                  32
Frequency (MHz)        40-80            70-200                80
DMIPS/MHz              0.75/1.05        0.8                   NA
Occupied area (LUTs)   9000²            2600³                 9000
License                LGPL             Commercial            GPL
Cost                   Free             1$/unit, min 1000$    Free

Table 3.1.: Comparison between ARM cores

3.1.1. Considerations

The most important properties to consider are licensing, cost, performance and area utilization. Cortex M1 is the most expensive option while also providing the highest performance. However, if there are high demands on system performance one should instead consider the MicroBlaze and NIOS II mentioned earlier, since they provide more features and higher performance at a lower cost[19]. The Xilinx Zynq and Altera SoC are other high performance options, as mentioned earlier. That leaves the two open source projects. Apart from the cost, the major benefit these have over the Cortex M1 is that they are already complete systems. With the Cortex one has to add a bus architecture, find suitable peripherals for that bus and create an arbitration scheme between them, and that takes time. When looking at the included peripherals, the STORM SoC has everything listed in the beginning of this chapter. It would then seem to be the best choice for this project. However, since the core is to be used in commercial applications, the biggest difference between the two open source cores is the license. The General Public License (GPL) states that any products containing GPL licensed software needs to be shipped

¹ The Amber project includes two different cores called 23 and 25
² Core 23 and 16KB cache
³ Minimal config, no cache

4. HDL Design

In this chapter the Amber project is presented in more detail, along with the changes that were made to it in order to obtain the system specified in Chapter 3. An overview of the complete design can be seen below in Figure 4.1, which shows how the core connects to the peripherals over the Wishbone bus[20]. The Wishbone bus is a competitor to ARM's open bus standard AMBA, and how it works is shown in more detail in Section 4.5. The peripherals that came with the Amber project (UART, interrupt controller, timer controller and test module) are not documented in the Amber user guide[21]. The information presented about them was obtained by analyzing the code and simulating it. All the configuration parameters mentioned in this chapter are detailed in Table 5.1.


4.1. Target

Different FPGAs were discussed as a platform for the project. The supervisor suggested a Xilinx FPGA since one was going to be used in another project and the hardware could then be shared. Unfortunately the business arrangements were not completed in time, so there was no hardware available for actual testing. For simulation and synthesis the FPGA targeted by the Amber system, a Spartan 6 LX45T, was used. The synthesis results were only verified by reading the synthesis reports. These do not replace hardware testing, but we considered them sufficient for device utilization and basic timing analysis.

4.2. Amber project

The Amber project was designed by Conor Santifort, a member at opencores. It was tested by him on a Xilinx SP605 development board[21]. The complete specification of the system is shown in the list below.

• ARMv2a compatible core

• Configurable cache size

• 8kB boot memory

• Two UART controllers

• Ethernet MAC

• Interrupts

• Timers

• Spartan-6 DDR3 memory controller

The peripherals connect to the core over a Wishbone bus interface. The boot memory contains a boot loader that uses one of the UART ports to receive programs to be run on the system. The project also contains an extensive suite of hardware test programs written in assembler, along with a bootable Linux image that can be run in a simulator.

4.3. Clock and reset manager


remaining PLL uses a differential clock input of 200 MHz to generate an 800 MHz clock. This clock is then divided by the AMBER_CLOCK_DIVIDER parameter to get the system clock.

4.4. The Amber core

In the Amber project there are two different cores available, called a23 and a25. They are fully software compatible but have some differences, as shown in Table 3.1. Since low area utilization is prioritized over performance, the a23 core will be used. Figure 4.2 below shows an overview of the core's Verilog structure. The picture is taken from the Amber core specification[22], where it is called Figure 5.

Figure 4.2.: Overview of the a23 verilog structure.


can be found in the Amber core specification[22].

4.4.1. ARMv2a Instruction Set Architecture

The core was built to be compatible with the ARMv2a Instruction Set Architecture (ISA), which consists of a small set of 32-bit wide instructions. The ones supported by the Amber core can be divided into categories depending on their purpose, as shown in Table 4.1. For descriptions of the individual instructions' syntax and operation see Table 4 in the Amber core specification[22].

Category                    Instructions                     Description
Data processing             ADC, ADD, AND, BIC, CMN, CMP,    Performs operations on data already in registers
                            EOR, MOV, MVN, ORR, RSB, RSC,
                            SBC, SUB, TEQ, TST
Multiply                    MLA, MUL                         Used to perform multiplying operations
Single data swap            SWP, SWPB                        Swaps data in a register with data in memory
Single data transfer        LDR, LDRB, STR, STRB             Used to move data between memory and registers
Block data transfer         LDM, STM                         Moves a series of words between memory and registers
Branch                      B, BL                            Branches the execution to other places in the program
Coprocessor data transfer   MCR, MRC                         Used to move data to and from a coprocessor register
Software interrupt          SWI                              Used to throw a software interrupt exception

Table 4.1.: ARMv2a instructions supported by the Amber core.

Registers and modes

The ARMv2a ISA is a load/store architecture, which means that all operations on data occur in the processor's internal registers. In the Amber core, and in ARM cores in general, there are 16 internal registers of which 13 can be utilized for data operations. Which registers are accessible depends on which mode the processor is in. For the Amber core four different modes are available:


IRQ Privileged mode that the processor enters when an interrupt occurs

FIRQ Privileged mode that the processor enters when a fast interrupt occurs

Supervisor Privileged mode that the processor enters when a software interrupt occurs

The current mode is indicated in the two least significant bits of register 15. This register is referred to as the program counter, or PC, as it also points to where in the program the processor is, or more correctly, to the next instruction it will execute. The other reserved registers are registers 14 and 13. Register 14 is called the "Link register" and contains the address the processor will jump to when the current function call is completed. Register 13 is called the "Stack pointer", or SP, and is used as a pointer to the end of the stack. A graphical overview of the registers in the respective modes is shown in Tables 14 and 15 in the Amber core specification[22].
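How register 15 combines the program counter with status information can be sketched in Verilog as follows. The module and port names are illustrative and are not taken from the Amber source. Apart from the mode bits, the field positions and the mode encoding are not given in the text above. They are stated here from the general 26-bit ARM architecture and should be checked against the Amber core specification[22].

```verilog
// Sketch: splitting the ARMv2a register 15 into its fields.
// Module and port names are illustrative, not taken from the Amber source.
// Field positions follow the 26-bit ARM architecture (an assumption here).
module r15_fields (
    input  wire [31:0] r15,
    output wire [3:0]  flags,      // N, Z, C, V condition flags
    output wire        irq_mask,   // I bit: interrupts disabled when set
    output wire        firq_mask,  // F bit: fast interrupts disabled when set
    output wire [23:0] pc_word,    // word address of the next instruction
    output wire [1:0]  mode,       // processor mode
    output wire        privileged  // high in every mode except user mode
);
    // Mode encoding of the 26-bit ARM architecture.
    localparam [1:0] USR  = 2'b00,
                     FIRQ = 2'b01,
                     IRQ  = 2'b10,
                     SVC  = 2'b11;

    assign flags      = r15[31:28];
    assign irq_mask   = r15[27];
    assign firq_mask  = r15[26];
    assign pc_word    = r15[25:2];
    assign mode       = r15[1:0];
    assign privileged = (mode != USR);
endmodule
```

Since instructions are 32 bits wide and word aligned, the two lowest address bits are always zero, which is why they can be reused to hold the mode.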

Comparison with other ISAs


4.4.2. Pipeline

Using a pipeline means that the execution of an instruction is divided into smaller steps, much like the famous assembly line invented by Henry Ford. In the a23 core the execution is divided into three steps, or stages, called fetch, decode and execute. In Figure 4.3 an example of the execution on a pipelined processor is compared with a processor without a pipeline. The example is only for basic understanding and does not take into account the hazards discussed later in this section or other delays that occur when dividing the execution into several stages.

Figure 4.3.: Example of pipelined execution.

Fetch


Decode

This is the most complicated pipeline stage in the core. Here, the fetched instruction is decoded according to the tables in Chapter 4 of the Amber core specification[22]. The decoded instruction is converted into control signals for the execute stage. Some examples of these control signals are presented in Table 4.2.

Name                  Size (b)   Description
instruction_execute   1          If cleared the instruction passes through the execute stage without
                                 being executed. See Section 4.4.3 for why this is necessary.
rn_sel                4          Selects which of the 15 CPU registers is used as the rn register
                                 in the current instruction.
rm_sel                4          Selects which of the 15 CPU registers is used as the rm register
                                 in the current instruction.
rds_sel               4          Selects which of the 15 CPU registers is used as the rd and rs
                                 register in the current instruction.
status_bits_mode      2          Shows what mode the processor is in and thus which registers are
                                 to be accessed.

Table 4.2.: Some of the control signals for the execute stage.

Execute

In this stage the control signals from the decode stage are registered and combined with data from the fetch stage. The data passes through the ALU and the result is written back to the cache. Additionally, the next address for the fetch stage is generated.

4.4.3. Pipeline hazards

A pipeline generally improves the performance of a processor but it also introduces some problems, called hazards. The first arises when two subsequent instructions access the same register and the first is a write instruction. This is often referred to as a "data hazard". Another problem occurs when an instruction is executed only if a certain condition is met. This condition is determined by the execution of an earlier instruction, but by then the conditional instruction is already in the fetch or decode stage, scheduled for execution. This is often referred to as a "control hazard" or, in the case of a conditional branch instruction, a "branch hazard".

Data hazard

In the a23 core this is dealt with by keeping the second instruction in the decode stage for two extra cycles and preventing the execute stage from executing it during those cycles. Two examples of this are shown in Section 2.2 of the Amber core specification[22]. The method of disabling the execute stage can be compared with the method where the compiler inserts NOP instructions in the code to avoid this type of hazard. However, the decode stage in the a23 core stores the "stalled" instruction in a register so that it can be decoded directly after the execution is resumed. This saves one clock cycle, so where the NOP method would waste three clock cycles the a23 core only wastes two.

Control hazard

This problem is not documented in the specification, but simulations show that it is solved in a similar manner. The condition field of the instruction is compared with the status bits of the Program Counter (PC) in the decode stage. If the condition is not met, the execute stage is disabled when that instruction passes through, as shown in the following example.

Consider the following assembler code snippet:

mov  r0, #0x0      @ Loading value 0x0 into register 0
mov  r1, #0x1      @ Loading value 0x1 into register 1
subs r2, r1, r0    @ Compare the registers and update condition flags
beq  1f            @ This branch will not execute, r1 != r0

Figure 4.4 shows what happens in the pipeline, which is also described below, tick by tick.

Figure 4.4.: Example of control hazard handling.

1. "mov r0" enters the fetch stage

2. "mov r0" is decoded and "mov r1" enters the fetch stage

3. "mov r0" is executed, "mov r1" is decoded and the "subs" instruction enters the fetch stage

4. "mov r1" is executed, "subs" is decoded and the "beq" instruction enters the fetch stage

5. "subs" is executed and "beq" is decoded. The decode stage detects a conditional execution and starts to read the status flags of the program counter. The execute stage updates these flags after the execution is done.
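The condition check performed for a conditional instruction can be sketched as a small combinational Verilog module. This is an illustration of the standard ARM condition codes, not the code used in the a23 core, and the module and port names are assumptions.

```verilog
// Sketch of ARM condition code evaluation.
// Module and port names are illustrative assumptions.
module condition_check (
    input  wire [3:0] cond,    // bits [31:28] of the instruction
    input  wire [3:0] flags,   // {N, Z, C, V} status flags
    output reg        execute  // 1: execute, 0: pass through without executing
);
    wire n = flags[3];
    wire z = flags[2];
    wire c = flags[1];
    wire v = flags[0];

    always @* begin
        case (cond)
            4'h0:    execute = z;               // EQ, equal
            4'h1:    execute = ~z;              // NE, not equal
            4'h2:    execute = c;               // CS, carry set
            4'h3:    execute = ~c;              // CC, carry clear
            4'h4:    execute = n;               // MI, negative
            4'h5:    execute = ~n;              // PL, positive or zero
            4'h6:    execute = v;               // VS, overflow
            4'h7:    execute = ~v;              // VC, no overflow
            4'h8:    execute = c & ~z;          // HI, unsigned higher
            4'h9:    execute = ~c | z;          // LS, unsigned lower or same
            4'hA:    execute = (n == v);        // GE, signed greater or equal
            4'hB:    execute = (n != v);        // LT, signed less than
            4'hC:    execute = ~z & (n == v);   // GT, signed greater than
            4'hD:    execute = z | (n != v);    // LE, signed less or equal
            4'hE:    execute = 1'b1;            // AL, always
            default: execute = 1'b0;            // NV, never
        endcase
    end
endmodule
```

In the example above the "subs" instruction computes 1 - 0 = 1, which clears the Z flag. The "beq" instruction carries the condition code EQ (0000), so the check yields 0 and the branch passes through the execute stage without being executed.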


4.5. Wishbone bus

The Wishbone bus is an open standard designed for interconnection between Intellectual Property (IP) cores. It is widely used in the open source community and is the official interconnect fabric for the cores at opencores.org. In a comparison with the AMBA bus used by ARM, Wishbone is praised for its simplicity and ease of use[23]. There are several different options for implementing the Wishbone bus. For example, the topology can be implemented in four different ways:

Point-to-point Connects one master to one slave

Pipelined The IP-cores are connected sequentially and thus act as both master and slave, forwarding the data

Shared bus Connects one or several masters to one or several slaves with a common bus medium. An arbiter is needed to direct all data traffic

Crossbar switch Similar to the shared bus with the addition that several masters can communicate at the same time, as long as they do not try to access the same slave.

There are also two different bus cycle definitions, called classic and registered feedback. Registered feedback includes the classic cycle but adds improvements for sending data in bursts. This improvement comes at the cost of a more complicated interface and the need for three additional signals. The Amber 23 system uses a 32 bit wide classic Wishbone bus with the standard protocol and a shared bus topology. The only exception is that the reset signal RST is not used. The bus supports the classic read and write cycles along with the simplest burst type, called "Synchronous cycle terminated burst". According to the standard there would be a performance gain in implementing another burst type, for example the "Advanced synchronous cycle terminated burst", which is also mentioned in Chapter 11, Future work.

4.5.1. Wishbone signals

ADR_O  Target address for the current bus cycle.

SEL_O  Four bit signal that indicates which bytes of the 32 bit data are valid for the current cycle.

WE_O   Indicates if the current cycle is a write or a read. 1 indicates a write.

DAT_I  32 bit wide data input line.

DAT_O  32 bit wide data output line.

CYC_O  Asserted at the slave targeted by the transfer. If not asserted, other signals are not valid.

STB_O  Strobe line. Asserted to the slave targeted in the current bus cycle.

ACK_I  Set by the slave to indicate that the strobe and cyc signals are detected. In the case of a read cycle the data must be available at the next positive edge of the Wishbone clock after ack is asserted.

ERR_I  Indicates that the slave cannot perform the requested action. No error handling is implemented in the Amber core but the signal is still present.

Table 4.3.: Wishbone signals, direction is seen from a master perspective

4.5.2. Protocol

The handshake between master and slave is shown in Illustration 3-3 of the Wishbone standard [20], which is reproduced below in Figure 4.5. The master asserts CYC_O and STB_O at the positive edge of CLK_I. When the slave is ready to respond it asserts ACK_I at a following positive clock edge. The master terminates the cycle by deasserting CYC_O and STB_O.

Figure 4.5.: Wishbone handshake

Figure 4.6 and Figure 4.7 below show a single read cycle and a single write cycle respectively. The pictures are taken from the Wishbone standard [20], where they are named Illustration 3-5 and Illustration 3-7.

Figure 4.6.: Wishbone single read cycle

The write cycle is similar, with the difference that the WE_O signal is asserted to indicate a write and the data is presented on DAT_O at the first clock edge. The slave still asserts its acknowledge output (ACK_I as seen from the master) when it is ready, which in a write cycle is in most cases the next clock edge.

Figure 4.7.: Wishbone single write cycle

Figure 4.8.: Wishbone synchronous burst cycle

4.5.3. De-multiplexer (Demux)

The Wishbone demux connects the Wishbone master (the Amber core) to the slaves (peripherals). It treats the Wishbone signals ADR_O, DAT_O and SEL_O as general and branches them out to all slaves, while the other signals are directed only to the currently addressed slave. It determines which slave is currently addressed from the base address, which is converted to a slave number as shown in Table 4.4. A schematic of the demux is shown in Figure 4.9. The Verilog file is named wishbone_arbiter.v although it is actually a demux. The name is inherited from the original Amber project, where this component also arbitrated between two Wishbone masters. To not lose the reference to the original code the filename has been kept, but for correctness the component is referred to here as a demux.

Number  Base address (hex)  Slave
0       2000                I2C
1       NA [footnote 2]     Boot memory
2       NA [footnote 3]     Main memory
3       1600                UART0
4       1700                UART1
5       F000                Test module
6       1300                Timer module
7       1400                Interrupt controller
8       1800                SPI Controller
9       1900                GPIO

Table 4.4.: Slave numbering in the Wishbone demultiplexer.
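The address decoding in Table 4.4 can be illustrated with a small C model. This is only a sketch of the lookup, not the Verilog in wishbone_arbiter.v: the function and enumerator names are hypothetical, it assumes that the base address is the upper 16 bits of the 32 bit address, and it leaves out the boot and main memory because their ranges depend on parameters.

```c
#include <stdint.h>

/* Slave numbers from Table 4.4. The enumerator names are illustrative only. */
enum wb_slave {
    WB_SLAVE_NONE  = -1,  /* not a fixed-base peripheral (e.g. boot or main memory) */
    WB_SLAVE_I2C   = 0,
    WB_SLAVE_UART0 = 3,
    WB_SLAVE_UART1 = 4,
    WB_SLAVE_TEST  = 5,
    WB_SLAVE_TIMER = 6,
    WB_SLAVE_IRQ   = 7,
    WB_SLAVE_SPI   = 8,
    WB_SLAVE_GPIO  = 9
};

/* Map a 32 bit bus address to a slave number using its upper 16 bits. */
int wb_slave_number(uint32_t addr)
{
    switch (addr >> 16) {
    case 0x2000: return WB_SLAVE_I2C;
    case 0x1600: return WB_SLAVE_UART0;
    case 0x1700: return WB_SLAVE_UART1;
    case 0xF000: return WB_SLAVE_TEST;
    case 0x1300: return WB_SLAVE_TIMER;
    case 0x1400: return WB_SLAVE_IRQ;
    case 0x1800: return WB_SLAVE_SPI;
    case 0x1900: return WB_SLAVE_GPIO;
    default:     return WB_SLAVE_NONE;
    }
}
```

For example, an access to the I2C command register at 0x20000010 maps to slave 0 and an access to 0x17000018 maps to slave 4.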

2 Depends on the BOOT_MSB parameter, see Section 4.6.2

3

4.6. Memory

The ARM architecture uses byte addressing, meaning that the smallest addressable unit in any memory is one byte. ARMv2a also supports word access, meaning that a chunk of four bytes is addressed at the same time. An example of this is the ”str” and ”strb” assembler instructions, which store a word or a byte from a register to memory respectively. A couple of different memories were supplied with the Amber project; they are listed below.

• Generic RAM with variable size, byte-wide write enable
  – Used as boot and cache data memory
  – Synthesizes as flip-flops

• Generic RAM with variable size, line-wide write enable
  – Used as cache tag memory
  – Synthesizes as RAM blocks in Spartan 6

• Spartan 6 specific block RAM implementations of different fixed sizes
  – Used as boot and cache (data and tag) memory
  – Synthesizes as RAM blocks
  – Sizes: 256x21, 256x32, 256x128, 512x128, 1024x128, 2048x32 and 4096x32
  – Useful only on 6 series FPGAs

• Wishbone to Spartan 6 memory controller bridge with DDR3 model
  – Used as main memory along with a DDR3 model generated by Coregen
  – Useful only on Spartan 6 designs

• A non-synthesizable memory model of variable size, 32 and 128 MB
  – Used as main memory in simulations only

4.6.1. Cache

The system uses a unified cache, meaning that data and instructions share cache space. Its size is configurable through the parameter A23_CACHE_WAYS and can be either 2, 4, 8 or 16 kB. In the FPGA the cache is built from two different RAMs, one for the tags and one for the actual data. The cache is controlled by a coprocessor, which in turn is controlled by the registers shown in Table 4.5. These registers are accessed with the assembler instructions mcr and mrc.

Name  Access  Description
CR0   R       ID register
CR2   R/W     Cache control register
CR3   R/W     Cachable area register
CR4   R/W     Updateable area register
CR5   R/W     Disruptive area
CR6   R       Fault status register
CR7   R       Fault address register

Table 4.5.: Coprocessor registers. All registers are 32 bits wide.

ID register (CR0)

This register returns an ID tag of the processor. It has the layout shown in Table 4.6.

Bit 31:24 23:16 15:8 7:0

Name Company ID Manufacturer ID Part type Revision

Value (hex) 41 56 03 00

Table 4.6.: Layout of coprocessor register CR0

Cache control register (CR2)

This register is used to enable and disable the cache memory. By setting bit 0 the cache is enabled, otherwise it is disabled. The other bits are unused.

Cachable area register (CR3)

The areas of the boot and main memory that can be cached are defined in this register. Every bit represents a 2 MB region, where bit 0 represents the lowest 2 MB.
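As a hedged illustration of the 2 MB granularity, the sketch below computes which CR3 bit covers a given address and which mask covers an address range. The function names are hypothetical and the code only performs the arithmetic; it does not access the coprocessor.

```c
#include <assert.h>
#include <stdint.h>

#define REGION_SHIFT 21u  /* 2 MB = 2^21 bytes per register bit */

/* Bit index in CR3 that covers a byte address.
 * A 32 bit register with 2 MB per bit can only describe the lowest 64 MB. */
unsigned cr3_bit_for_address(uint32_t addr)
{
    assert((addr >> REGION_SHIFT) < 32u);
    return (unsigned)(addr >> REGION_SHIFT);
}

/* Mask with one bit set for every 2 MB region touched by [first, last]. */
uint32_t cr3_mask_for_range(uint32_t first, uint32_t last)
{
    uint32_t mask = 0;
    unsigned bit;

    assert(first <= last);
    for (bit = cr3_bit_for_address(first); bit <= cr3_bit_for_address(last); ++bit)
        mask |= (uint32_t)1 << bit;
    return mask;
}
```

With these definitions, making the lowest 6 MB cacheable corresponds to the mask 0x7.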

Updateable area register (CR4)

Disruptive area (CR5)

Writing to areas marked by this register flushes the cache. Bit 0 represents the lowest 2 MB.

Fault status register (CR6)

If a cache miss occurs, a Fault status can be read from this register.

Fault address register (CR7)

If a cache miss occurs the faulty address is stored in this register.

4.6.2. Boot memory

The boot memory uses a per byte write enable and is variable in size. The size is changed with the parameter BOOT_MSB and the resulting size can be calculated using equation 4.1.

\[ \mathrm{Size(b)} = 2^{\mathrm{BOOT\_MSB}+1} \qquad (4.1) \]

The boot address space starts at address 0 and the highest address is found by subtracting 1 from the result of equation 4.1.

Originally the Amber system loaded the boot memory content in the test bench. At synthesis, the specified block RAM component was loaded through the makefile. Since all Xilinx specific code was removed this is no longer possible. Instead the Verilog system task $readmemh is used as shown below. It loads content into the boot memory during both simulation and synthesis, using a file generated by the amber-elfsplitter-memcontents tool described in Chapter 6.

    initial begin
        $readmemh("boot_mem_contents.data", mem, 0, 2**(ADDRESS_WIDTH-2)-1);
    end

The command takes as arguments a file containing the data, the array that is to be loaded and the index boundaries of the array. The file ”boot_mem_contents.data” is extracted from an elf [footnote 4] file using the tool amber-elfsplitter-memcontents described in Section 6.3.2 and contains only data values.

4.6.3. Main

The memory is implemented as a 32 bit wide array that is variable in depth by changing the parameter MAIN_MSB. The size (in bytes) of the memory can be calculated with equation 4.2.

4

\[ \mathrm{Size(b)} = 2^{\mathrm{MAIN\_MSB}+1} \qquad (4.2) \]

The memory address space starts where the boot memory ends, which means that the lowest address can be calculated with equation 4.1. The highest address is calculated by summing the base and the size and subtracting one, as shown in equation 4.3.

\[ \mathrm{Address} = 2^{\mathrm{BOOT\_MSB}+1} + 2^{\mathrm{MAIN\_MSB}+1} - 1 \qquad (4.3) \]
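Equations 4.1 to 4.3 can be checked with a few lines of C. The sketch below is only a worked calculation; the struct and function names are hypothetical, and the parameter values in the usage note are chosen for illustration rather than taken from the system configuration.

```c
#include <assert.h>
#include <stdint.h>

/* Address ranges of the boot and main memories (inclusive bounds). */
struct mem_map {
    uint32_t boot_base, boot_top;
    uint32_t main_base, main_top;
};

/* Derive the memory map from the BOOT_MSB and MAIN_MSB parameters. */
struct mem_map amber_mem_map(unsigned boot_msb, unsigned main_msb)
{
    struct mem_map m;
    uint32_t boot_size, main_size;

    assert(boot_msb < 31u && main_msb < 31u);
    boot_size = (uint32_t)1 << (boot_msb + 1u);   /* equation 4.1 */
    main_size = (uint32_t)1 << (main_msb + 1u);   /* equation 4.2 */

    m.boot_base = 0;
    m.boot_top  = boot_size - 1u;
    m.main_base = boot_size;                      /* main starts where boot ends */
    m.main_top  = boot_size + main_size - 1u;     /* equation 4.3 */
    return m;
}
```

As an example, BOOT_MSB = 13 and MAIN_MSB = 15 give a 16 kB boot memory at 0x0000 to 0x3FFF and a 64 kB main memory at 0x4000 to 0x13FFF.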

In the original memory controllers there was a signal called i_mem_ctrl which, when set, wrapped the memory address at bit 24. The purpose was to simulate a 32 MB memory even if the physical memory was bigger, like the 128 MB RAM on the SP605 development board. In the current implementation it has been left out since the size is variable through the MAIN_MSB parameter.

4.6.4. Flash memory

4.7. I2C

I2C is a very common serial communication protocol. In this section the protocol is first described, followed by an introduction to the core that was used in this system.

4.7.1. I2C protocol

I2C uses two signals for communication:

SCL Serial Clock

SDA Serial Data

SCL is a clock line that determines the speed of the transfer. This is controlled by the master but the slave can force it low to pause the transfer temporarily (this is called clock stretching). SDA is the data line and the control is shared between the master and the slave. In order to share a common line there has to be tri-state buffers in both ends along with an output enable (oe) signal as shown in Figure 4.10.

Figure 4.10.: Tri-state buffers on SDA and SCL.

A typical I2C transfer is shown in Figure 4.11 below. The picture is taken from the I2C controller specification[25].

Figure 4.11.: I2C transfer.

1. The master generates a start condition (the SDA line is pulled low before SCL). All slaves start to listen.

2. The master sends the address to the desired slave.

3. The master tristates the SDA and the addressed slave confirms that it detected the ”call” by pulling SDA low. This is called ACK.

4. The master sends the data.

5. The slave confirms the data by setting the SDA high. This is called a NACK.

6. The master generates a stop condition (SCL is pulled high before SDA).

4.7.2. I2C controller

The I2C controller used in this project was written by Richard Herveille and published on opencores in 2001[26]. The version used in this system was uploaded to opencores on the 6th of June 2010. Some of the key features of the core are:

• Multi master operation

• Clock stretching and wait generation

• Supports 7 and 10 bit addressing mode

• Arbitration lost interrupt

• 8 bit wishbone interface

Since the wishbone interface is only 8 bits it only supports byte access. To avoid any unpredictable behavior caused by undefined values the Least Significant Byte (LSB) of the output signal DAT O is wired to the controller while the other bytes are set to zero in system.v. All the other signals are also wired to the controller with the LSB. The core is configured and controlled by a set of registers shown in table 4.7.

Name    Address     Access  Description
PRERlo  0x20000000  R/W     Low byte of clock prescaler
PRERhi  0x20000004  R/W     High byte of clock prescaler
CTR     0x20000008  R/W     Control register
TXR     0x2000000C  W       Transmit register
RXR     0x2000000C  R       Receive register
CR      0x20000010  W       Command register
SR      0x20000010  R       Status register

Table 4.7.: I2C registers. All registers are 8 bits wide.

For information how to set up the registers see the I2

4.8. SPI

SPI is a full duplex capable serial protocol used in a wide variety of applications ranging from small sensors to transfers of large amounts of data. In this section the SPI protocol is described followed by an introduction to the core used in the system.

4.8.1. SPI protocol

SPI uses four signals for communication:

SS Slave Select. It is used to select the slave the master currently wants to address. This eliminates the need to send an address as in I2C, at the cost of some extra hardware.

SCK Serial Clock. This is controlled completely by the master and sets the speed of the transfer

MISO Master In Slave Out. Data line from slave to master.

MOSI Master Out Slave In. Data line from master to slave.

There are four different modes of SPI communication called 0,1,2 and 3 as shown in table 4.8. The parameters that determine the modes are:

CPOL Level of SCK in idle state. CPOL = 0 means SCK = low.

CPHA Phase of SCK. Determines whether the data is sampled on the leading (first) or trailing (second) edge of SCK. CPHA = 0 means sample on the leading edge, which is the rising edge when CPOL = 0.

Mode CPOL CPHA

0 0 0

1 0 1

2 1 0

3 1 1

Table 4.8.: SPI modes
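Table 4.8 amounts to treating the mode number as a two bit value with CPOL as the upper bit and CPHA as the lower bit. A minimal sketch of that mapping follows; the type and function names are hypothetical and are not part of the SPI core.

```c
#include <assert.h>

/* Clock polarity and phase for one SPI mode, as in Table 4.8. */
struct spi_mode {
    unsigned cpol;  /* idle level of SCK */
    unsigned cpha;  /* sampling edge selection */
};

/* Split a mode number (0 to 3) into CPOL and CPHA. */
struct spi_mode spi_mode_decode(unsigned mode)
{
    struct spi_mode m;

    assert(mode < 4u);
    m.cpol = (mode >> 1) & 1u;
    m.cpha = mode & 1u;
    return m;
}

/* Combine CPOL and CPHA into the mode number. */
unsigned spi_mode_encode(unsigned cpol, unsigned cpha)
{
    return ((cpol & 1u) << 1) | (cpha & 1u);
}
```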

Figure 4.12.: SPI transfer timing diagram.

4.8.2. SPI controller

The SPI core used in this system was written by Simon Srot and published on opencores in 2002[28]. The version used in this project was uploaded on the 10th of March 2009. Some of the key features of the core are:

• Full duplex

• Variable length of transfer word up to 128 bits

• MSB or LSB first data transfer

• Supports mode 0 and 1

• Eight slave select lines

• 32 bit Wishbone slave interface

Name     Address     Access  Description
RX0      0x18000000  R       Receive register bits [31:0]
RX1      0x18000004  R       Receive register bits [63:32]
RX2      0x18000008  R       Receive register bits [95:64]
RX3      0x1800000c  R       Receive register bits [127:96]
TX0      0x18000000  R/W     Transmit register bits [31:0]
TX1      0x18000004  R/W     Transmit register bits [63:32]
TX2      0x18000008  R/W     Transmit register bits [95:64]
TX3      0x1800000c  R/W     Transmit register bits [127:96]
CTRL     0x18000010  R/W     Control and status register
DIVIDER  0x18000014  R/W     SCK divider value
SS       0x18000018  R/W     Slave select register

Table 4.9.: SPI core registers. All registers are 32 bits wide.

4.9. UART

UART is one of the most common serial protocols used to interface between different systems. Its simplicity makes it ideal for sending commands and instructions from a computer or terminal to a smaller system such as this. It is also used to convert parallel data transmissions to serial or to interface with RS-232 and RS-485 drivers. The Amber project had two UART controllers already implemented and both of them were kept. One of them is used by the included boot-loader to interface with a computer and initialize program downloads.

4.9.1. UART protocol

UART is a point to point transmission and can be used in either simplex, half duplex or full duplex mode. The transmission speed is called the baud rate and is configured separately at both ends. There is therefore no need for a clock signal, and in simplex mode only a single data line is needed. In half duplex mode the data line is shared but an additional signal controls the direction of the data, as shown in Figure 4.13. The direction signal is controlled by one of the UART controllers and is usually called Ready To Send (RTS) or Clear To Send (CTS).

Figure 4.13.: UART in half duplex mode.

is all that is needed. However, since UART controllers usually use small First In First Out (FIFO) buffers, they often implement two more signals, RTS and CTS. They are used to tell the other end that the receive buffer is full so that it can pause the transmission. The RTS signal from one controller is wired to the CTS input on the other, as shown in Figure 4.14.

Figure 4.14.: UART in full duplex mode with RTS and CTS.

A transfer with the UART protocol is a bit flexible. It always starts with a start bit but the data that follows can range from 5 to 9 bits. If the data is 8 bits or smaller it is then followed by an optional parity bit and the transmission ends with one or two stop bits. The start bit is low, the data is always sent LSB first and the stop bits are high. A schematic of this is shown in Figure 4.15.

Figure 4.15.: A UART transfer.
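The frame layout described above can be sketched in C for the common case of 8 data bits, no parity and one stop bit. The function name is hypothetical; the code only builds the sequence of line levels and does not model timing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fill bits[] with the line levels of one frame with 8 data bits,
 * no parity and one stop bit. Returns the number of bits in the frame. */
size_t uart_frame_8n1(uint8_t data, uint8_t bits[10])
{
    size_t n = 0;
    unsigned i;

    assert(bits != NULL);
    bits[n++] = 0;                        /* start bit is low */
    for (i = 0; i < 8u; ++i)              /* data bits, LSB first */
        bits[n++] = (uint8_t)((data >> i) & 1u);
    bits[n++] = 1;                        /* stop bit is high */
    return n;
}
```

For the byte 0x41 the frame is 0 1 0 0 0 0 0 1 0 1 in transmission order: a low start bit, the data bits starting with the least significant one, and a high stop bit.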

4.9.2. UART controller

The controller was included in the Amber project and the main features of the controller are:

• Fixed setting of:
  – 8 data bits
  – No parity bit
  – 1 stop bit

• Hardware configurable baudrate, synchronous with system clock

• 1 byte buffer or 16 byte FIFO, enabled in software

• Transmit and receive interrupts

the system clock means that the baud rate is not exact. However, it is well within the 10% offset allowed in the standard according to the author of the controller[29]. In Figure 4.16 the structure of the UART controller is shown. The data register is drawn with dotted lines since it is not actually a register, just a common address to the transmit and receive FIFOs. In the system two UARTs are instantiated, called UART0 and UART1. They are differentiated by the first 16 bits of the address, where UART0 has 0x1600 and UART1 has 0x1700. These are shown as XXXX in Table 4.10, where the UART configuration registers are listed.
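To give a feeling for the size of the baud rate offset, the sketch below assumes a generic scheme in which one bit lasts a whole number of system clock cycles. This is an assumption made for illustration; the actual divider logic of the Amber UART is not described here, and the function names and the 40 MHz clock in the usage note are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Whole number of system clock cycles per bit, rounded to nearest. */
uint32_t uart_clocks_per_bit(uint32_t clk_hz, uint32_t baud)
{
    assert(baud > 0u && clk_hz >= baud);
    return (clk_hz + baud / 2u) / baud;
}

/* Relative deviation in percent between the achieved and requested baud rate. */
double uart_baud_error_percent(uint32_t clk_hz, uint32_t baud)
{
    uint32_t div = uart_clocks_per_bit(clk_hz, baud);
    double actual = (double)clk_hz / (double)div;

    return 100.0 * (actual - (double)baud) / (double)baud;
}
```

With an assumed 40 MHz system clock and 115200 baud, this model gives 347 clock cycles per bit and an offset of roughly 0.06%.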

Figure 4.16.: Structural schematic of the UART controller.

Interrupts

The transmit and receive interrupts share the same output. Thus when an interrupt occurs one needs to read the interrupt status register to determine which kind of interrupt occurred.

Receive interrupt If the FIFO is enabled the receive interrupt will trigger when there are 8 bytes or more in the FIFO. Thus it can be cleared by reading bytes through the data register until fewer than 8 bytes remain. If the FIFO is disabled the interrupt will trigger when there is a byte ready in the transmission buffer and reset when the byte is read.

the data register until there are more than 8 bytes in the FIFO. If the FIFO is disabled the interrupt will trigger when the transmission buffer is empty and reset when a byte is pushed to it. The transmit interrupt can also be cleared by a write to the interrupt clear register.

Configuration registers

Name  Address     Access  Description
PID0  0xXXXX0fe0  R       Constant value of 0x00000010
PID1  0xXXXX0fe4  R       Constant value of 0x00000010
PID2  0xXXXX0fe8  R       Constant value of 0x00000004
PID3  0xXXXX0fec  R       Constant value of 0x00000000
CID0  0xXXXX0ff0  R       Constant value of 0x0000000d
CID1  0xXXXX0ff4  R       Constant value of 0x000000f0
CID2  0xXXXX0ff8  R       Constant value of 0x00000005
CID3  0xXXXX0ffc  R       Constant value of 0x000000b1
DR    0xXXXX0000  R/W     Data register
RSR   0xXXXX0004  R/W     Receive Status Register
LCRH  0xXXXX0008  R/W     Line Control Register High
LCRM  0xXXXX000c  R/W     Line Control Register Middle
LCRL  0xXXXX0010  R/W     Line Control Register Low
CR    0xXXXX0014  R/W     Control Register
FR    0xXXXX0018  R       Flag Register
IIR   0xXXXX001c  R       Interrupt status register
ICR   0xXXXX001c  W       Interrupt Clear Register

Table 4.10.: UART core registers. All registers are 8 bits wide.

DR, Data register

A write to this register pushes a byte into the FIFO if it is enabled, otherwise the byte is put directly in the 1 byte buffer for transmission. When reading this register either the oldest byte from the FIFO or the byte in the transmission buffer is retrieved. The controller will initiate a transmission as soon as there is data in the FIFO or transmission buffer and the receiving UART signals that it is ready by pulling the CTS input low. Thus, a write to this register will implicitly start a transmission.

RSR, Receive status register

Line Control Registers

The Line Control Register consists of three bytes, called H (High), M(Middle) and L(Low). Of these only bit 4 in the high byte is used which will enable the FIFOs when set.

Control Register

In this register bit 4 is used to enable the receive interrupt and bit 5 to enable the transmit interrupt (high = enabled). The other bits are unused.

FR (Flag Register)

The 8 bits of the flag register are used as shown in Table 4.11.

Bit 7 6 5 4 3 2 1 0

Name TxE RxF TxF RxE Busy Not used Not used CTS

Table 4.11.: Flag register bits. Bits 2 and 1 are always high.

TxE Transmit FIFO empty. When no data is present in the FIFO or the buffer is empty this bit is high.

RxF Receive FIFO full. When the FIFO is full or there is data in the buffer this bit is high.

TxF Transmit FIFO full. When the FIFO is full or there is data in the buffer this bit is high.

RxE Receive FIFO empty. When no data is present in the FIFO or the buffer is empty this bit is high.

Busy UART busy flag. When there is data in the buffer or FIFO, this bit is high.

CTS Clear To Send. When the device the UART is communicating with is ready to receive data this bit is high.
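The bit layout of Table 4.11 translates directly into masks that software can test after reading the flag register. The sketch below uses hypothetical constant and function names and operates on a value that has already been read; it does not perform the register access itself.

```c
#include <stdbool.h>
#include <stdint.h>

/* Flag register bit masks following Table 4.11. Names are illustrative. */
enum {
    UART_FR_TXE  = 1u << 7,  /* transmit FIFO or buffer empty */
    UART_FR_RXF  = 1u << 6,  /* receive FIFO full or data in buffer */
    UART_FR_TXF  = 1u << 5,  /* transmit FIFO full or data in buffer */
    UART_FR_RXE  = 1u << 4,  /* receive FIFO or buffer empty */
    UART_FR_BUSY = 1u << 3,  /* data present in buffer or FIFO */
    UART_FR_CTS  = 1u << 0   /* other end ready to receive */
};

/* True when another byte can be written to the data register. */
bool uart_can_write(uint8_t fr)
{
    return (fr & UART_FR_TXF) == 0;
}

/* True when at least one received byte is waiting to be read. */
bool uart_has_data(uint8_t fr)
{
    return (fr & UART_FR_RXE) == 0;
}
```

A polling transmit routine would, under these assumptions, read FR repeatedly until uart_can_write() returns true and then write the byte to DR.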

IIR Interrupt status register

This register is used to read the status of the interrupts. Bit 2 is the transmit interrupt status and bit 1 the receive interrupt status (high means interrupt active).

ICR Interrupt clear register

4.10. Ethernet MAC

The Ethernet Media Access Control (MAC) provided in the Amber project was removed. The decision was based upon the fact that it represented around 25% of the system’s FPGA utilization while not implementing a function Syntronic saw any future need of in this system. If this function is needed sometime in the future, the work needed to reinsert the controller is not overwhelming, see Section 5.2 for further information. Before doing that though, one should also consider the use of an external controller since that would further simplify the implementation and save a considerable amount of FPGA resources.

4.11. GPIO

GPIOs are exactly what the name states: a set of pins that can be configured individually in software to act as either inputs or outputs. They are very useful for reading buttons or driving LEDs but can also be used to implement communication protocols such as the I2C and SPI presented earlier.

4.11.1. GPIO controller

The GPIO controller used in this project was written by Richard Herveille and uploaded to opencores in 2002[30]. The version used in this system was uploaded on the 10th of March 2009. The original version had support for 8 GPIO pins and an 8 bit Wishbone interface. If more pins were needed, several components could be instantiated. Since that solution would be a bit cumbersome in this case, we modified the controller to support up to 32 GPIO pins and a 32 bit Wishbone interface in one instance. The number of usable pins is configured with the GPIO_PINS parameter. Since a pin can be used as both input and output it has to utilize a tri-state buffer, as shown in Figure 4.17. The registers CTRL, WRITE and READ are explained below.

Figure 4.17.: GPIO pin tri-state buffer connection.

Configuration registers

GPIO_PINS parameter and every bit in the registers represents a pin. Due to an issue when accessing the line register from a C program, it has been split into two addresses in register_addresses.h, called WRITE and READ. The software registers are presented in Table 4.12. More details about the access issue are given in Section 9.5.

Name   Address     Access  Description
CTRL   0x19000000  R/W     Control the direction of the pins
WRITE  0x19000004  R/W     Write output pin values, points to the LINE register
READ   0x19000014  R/W     Read input pin values, points to the LINE register

Table 4.12.: GPIO core registers.

CTRL This register controls whether the respective pin is used as output or input. By setting a bit in this register to 1 its respective pin is used as an output.

WRITE This register is used to set a value on the output pins. There is nothing in the hardware that will prevent a read access, but using this register to read input pins might return faulty values, so use the READ register instead. Writing to an input pin will have no effect.

READ This register is used to read the values from the input pins. It will also report the state of the output pins, an effect of them sharing a hardware register.
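A possible way to use the three registers from C is sketched below. The struct, the function names and the shadow copy of the output values are assumptions made for this example; they are not taken from register_addresses.h. The registers are passed in as pointers so that the same code can be pointed at the addresses in Table 4.12 on the target or at ordinary variables on a host.

```c
#include <assert.h>
#include <stdint.h>

/* Handle for one GPIO controller instance. On the target the pointers would
 * refer to CTRL (0x19000000), WRITE (0x19000004) and READ (0x19000014). */
struct gpio {
    volatile uint32_t *ctrl;     /* pin direction, 1 = output */
    volatile uint32_t *out_reg;  /* WRITE register */
    volatile uint32_t *in_reg;   /* READ register */
    uint32_t out_shadow;         /* last value written, avoids reading WRITE */
};

/* Configure a pin as output by setting its bit in CTRL. */
void gpio_make_output(struct gpio *g, unsigned pin)
{
    assert(pin < 32u);
    *g->ctrl |= (uint32_t)1 << pin;
}

/* Drive an output pin high or low using the shadow copy. */
void gpio_write_pin(struct gpio *g, unsigned pin, int level)
{
    assert(pin < 32u);
    if (level)
        g->out_shadow |= (uint32_t)1 << pin;
    else
        g->out_shadow &= ~((uint32_t)1 << pin);
    *g->out_reg = g->out_shadow;
}

/* Read the current level of a pin through the READ register. */
int gpio_read_pin(const struct gpio *g, unsigned pin)
{
    assert(pin < 32u);
    return (int)((*g->in_reg >> pin) & 1u);
}
```

Keeping a shadow copy is one way to follow the advice above of not reading back through the WRITE register.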

4.12. Timers

The timer core was included in the Amber project. There are three timers available that are configurable through a set of registers shown in Table 4.13. The timers are identical and have the following features:

• Individual interrupts

• Either periodic or one-shot

• Three different prescalers

Name          Address     Access  Description
TIMER0_LOAD   0x13000000  R/W     Value set to timer 0
TIMER0_VALUE  0x13000004  R       Current value of timer 0
TIMER0_CTRL   0x13000008  R/W     Timer 0 control register
TIMER0_CLR    0x1300000c  W       Timer 0 interrupt clear
TIMER1_LOAD   0x13000010  R/W     Value set to timer 1
TIMER1_VALUE  0x13000014  R       Current value of timer 1
TIMER1_CTRL   0x13000018  R/W     Timer 1 control register
TIMER1_CLR    0x1300001c  W       Timer 1 interrupt clear
TIMER2_LOAD   0x13000020  R/W     Value set to timer 2
TIMER2_VALUE  0x13000024  R       Current value of timer 2
TIMER2_CTRL   0x13000028  R/W     Timer 2 control register
TIMER2_CLR    0x1300002c  W       Timer 2 interrupt clear

Table 4.13.: Timer core registers.

LOAD

The LOAD register is used to load a value that the timer will count down from. The register is two bytes wide and thus the biggest value that can be loaded is 0xFFFF. The value is stored in the timer until it is overwritten.

VALUE

This register holds the current value of the timer and can be read. A write to this register has no effect.

CTRL

The control register is 8 bits and they are used as shown in Table 4.14.

Bit 7 6 5:4 3:2 1:0

Name Enable Mode Not used Prescaler Not used

Table 4.14.: Control register bits. Unused bits are always low.

Enable When set the timer is enabled.

Prescaler The prescaler bits determine how much the timer counts on a system tick. How much the prescaler affects the counting is shown in Table 4.15.

CTRL[3:2] PS value

00 1

01 16

10 256

11 Not used

Table 4.15.: Timer prescaler value

CLR

Writing any value to this register clears the timer interrupt.

4.12.2. Setting up a timer

The timer is set up by first writing a value to the LOAD register and then writing the control register with the enable bit set and the desired prescaler and mode. The time in seconds that the timer will count is given by equation 4.4, where LOAD is the value in the LOAD register and PS value is the value of the prescaler as shown in Table 4.15.

\[ \mathrm{Time(s)} = \frac{\mathrm{LOAD} \cdot \mathrm{PS\ value}}{\mathrm{Freq(Hz)}} \qquad (4.4) \]

When the time expires the timer will fire an interrupt and restart.
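Equation 4.4 can be solved for LOAD to find the register value for a desired period. The sketch below does this and also assembles a control register value from the bit positions in Table 4.14. The names are hypothetical, the mode bit is passed through unchanged because its encoding is not repeated here, and the 40 MHz clock in the usage note is an assumed value.

```c
#include <assert.h>
#include <stdint.h>

/* Prescaler field encodings from Table 4.15. */
enum timer_ps { TIMER_PS_1 = 0, TIMER_PS_16 = 1, TIMER_PS_256 = 2 };

static uint32_t timer_ps_value(enum timer_ps ps)
{
    return ps == TIMER_PS_1 ? 1u : ps == TIMER_PS_16 ? 16u : 256u;
}

/* LOAD value for a period in microseconds (equation 4.4 solved for LOAD).
 * Returns 0 if the result does not fit in the 16 bit LOAD register. */
uint16_t timer_load_for(uint32_t period_us, uint32_t clk_hz, enum timer_ps ps)
{
    uint64_t ticks;

    assert(clk_hz > 0u);
    ticks = (uint64_t)period_us * clk_hz / 1000000u / timer_ps_value(ps);
    return ticks <= 0xFFFFu ? (uint16_t)ticks : 0u;
}

/* Control register value: enable in bit 7, mode in bit 6, prescaler in bits 3:2. */
uint8_t timer_ctrl_value(int enable, int mode_bit, enum timer_ps ps)
{
    return (uint8_t)(((enable ? 1u : 0u) << 7) |
                     ((mode_bit ? 1u : 0u) << 6) |
                     ((unsigned)ps << 2));
}
```

With an assumed 40 MHz clock, a 10 ms period with the prescaler set to 16 gives LOAD = 25000, while a 1 s period without prescaling does not fit in the 16 bit register.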

4.13. Interrupt controller

Figure 4.18.: Interrupt vectors and masks.

Figure 4.19.: Fast interrupt vectors and masks.

4.13.1. Registers

The registers in Table 4.17 handle the different vectors and masks shown in Figure 4.18 and Figure 4.19 above. The vectors that include the software interrupt are laid out as presented in Table 4.16. For the others the software interrupt bit is unused.

Bit 31:10 9 8 7 6 5 4:3 2 1 0

Name NA SPI I2C Timer2 Timer1 Timer0 NA UART1 UART0 SW
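The bit assignment above can be written as a set of masks. The sketch below also shows a simplified model of how a masked status relates to the raw vector and the enable mask; the names are hypothetical and the model is an illustration of the idea, not the controller's Verilog.

```c
#include <stdint.h>

/* Interrupt vector bit masks following the layout above. Names are illustrative. */
enum {
    IRQ_SW     = 1u << 0,
    IRQ_UART0  = 1u << 1,
    IRQ_UART1  = 1u << 2,
    IRQ_TIMER0 = 1u << 5,
    IRQ_TIMER1 = 1u << 6,
    IRQ_TIMER2 = 1u << 7,
    IRQ_I2C    = 1u << 8,
    IRQ_SPI    = 1u << 9
};

/* All bits that have a source assigned (bits 4:3 and 31:10 are unused). */
#define IRQ_VALID_MASK 0x3E7u

/* Simplified model: a source is reported as pending only when it is both
 * raised in the raw vector and enabled in the mask. */
uint32_t irq_masked_status(uint32_t raw, uint32_t enable_mask)
{
    return raw & enable_mask & IRQ_VALID_MASK;
}
```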

Name             Address     Access  Description
IRQ0_STATUS      0x14000000  R       IRQ0 status
IRQ0_RAWSTAT     0x14000004  R       IRQ0 hw IRQ status
IRQ0_ENABLESET   0x14000008  R/W     IRQ0 mask enable
IRQ0_ENABLECLR   0x1400000c  W       IRQ0 mask disable
INT_SOFTSET_0    0x14000010  R/W     Set software interrupt 0
INT_SOFTCLEAR_0  0x14000014  R/W     Clear software interrupt 0
FIRQ0_STATUS     0x14000020  R       FIRQ0 status
FIRQ0_RAWSTAT    0x14000024  R       FIRQ0 hw IRQ status
FIRQ0_ENABLESET  0x14000028  R/W     FIRQ0 mask enable
FIRQ0_ENABLECLR  0x1400002c  W       FIRQ0 mask disable
IRQ1_STATUS      0x14000040  R       IRQ1 status
IRQ1_RAWSTAT     0x14000044  R       IRQ1 hw IRQ status
IRQ1_ENABLESET   0x14000048  R/W     IRQ1 mask enable
IRQ1_ENABLECLR   0x1400004c  W       IRQ1 mask disable
INT_SOFTSET_1    0x14000050  R/W     Set software interrupt 1
INT_SOFTCLEAR_1  0x14000054  R/W     Clear software interrupt 1
FIRQ1_STATUS     0x14000060  R       FIRQ1 status
FIRQ1_RAWSTAT    0x14000064  R       FIRQ1 hw IRQ status
FIRQ1_ENABLESET  0x14000068  R/W     FIRQ1 mask enable
FIRQ1_ENABLECLR  0x1400006c  W       FIRQ1 mask disable
INT_SOFTSET_2    0x14000090  None    Defined but unused
INT_SOFTCLEAR_2  0x14000094  None    Defined but unused
INT_SOFTSET_3    0x140000d0  None    Defined but unused
INT_SOFTCLEAR_3  0x140000d4  None    Defined but unused

Table 4.17.: Interrupt controller registers.

In Table 4.17 above there are six types of registers: STATUS, RAWSTAT, ENABLESET, ENABLECLR, SOFTSET and SOFTCLEAR. They are divided between the six different interrupt types IRQ0, IRQ1, FIRQ0, FIRQ1, SOFT0 and SOFT1. The registers have exactly the same function for their respective interrupts and masks, so only a general description of them follows.

STATUS Reading from this register will return the value of the masked interrupt vector.

RAWSTAT Reading from this register will return the value of the (unmasked) hardware interrupt vector.

ENABLECLR Writing 1 to bits in this register will disable the corresponding interrupt.

SOFTSET Writing 1 to bit zero of this register will set the corresponding software interrupt. Reading from it will return the software interrupt status.

4.14. Test module

This module was included in the Amber project and is used to interface with the Verilog test bench, to test the interrupt functionality, to control the test bench UART and to provide a set of random numbers to the system. It is controlled by a set of registers shown in Table 4.18.

Name          Address     Access  Description
STATUS        0xf0000000  R/W     Test status register
FIRQ_TIMER    0xf0000004  R/W     FIRQ interrupt test register
IRQ_TIMER     0xf0000008  R/W     IRQ interrupt test register
UART_CONTROL  0xf0000010  R/W     Controls the test bench UART
UART_STATUS   0xf0000014  R       Test bench UART status register
UART_TXD      0xf0000018  R/W     Test bench UART data feed
SIM_CTRL      0xf000001c  R       Simulation register
MEM_CTRL      0xf0000020  R/W     Not used
CYCLES        0xf0000024  R       Counts system ticks
LED           0xf0000028  R/W     Control LEDs on SP605 board
PHY_RST       0xf000002c  R/W     Not used
RANDOM_NUM    0xf0000100  R/W     Provides a random number
RANDOM_NUM00  0xf0000100  R/W     Provides a random number
RANDOM_NUM01  0xf0000104  R/W     Provides a random number
RANDOM_NUM02  0xf0000108  R/W     Provides a random number
RANDOM_NUM03  0xf000010c  R/W     Provides a random number
RANDOM_NUM04  0xf0000110  R/W     Provides a random number
RANDOM_NUM05  0xf0000114  R/W     Provides a random number
RANDOM_NUM06  0xf0000118  R/W     Provides a random number
RANDOM_NUM07  0xf000011c  R/W     Provides a random number
RANDOM_NUM08  0xf0000120  R/W     Provides a random number
RANDOM_NUM09  0xf0000124  R/W     Provides a random number
RANDOM_NUM10  0xf0000128  R/W     Provides a random number
RANDOM_NUM11  0xf000012c  R/W     Provides a random number
RANDOM_NUM12  0xf0000130  R/W     Provides a random number
RANDOM_NUM13  0xf0000134  R/W     Provides a random number
RANDOM_NUM14  0xf0000138  R/W     Provides a random number
RANDOM_NUM15  0xf000013c  R/W     Provides a random number

Table 4.18.: Test module registers.

to or greater than ”32’h8000” the message will be: ”Failed ’testname’ - with error 0x’data’” otherwise it will be: ”Failed ’testname’ - with error on line ’data’”. Reading will return the value written to it or zeroes. This is useful in hardware and only if the testpass() or testfail() functions are not used, since they put the processor in an infinite loop.

FIRQ_TIMER and IRQ_TIMER Used to test the interrupt functionality of the processor core. When writing to these registers only the LSB is used and the data has the following effect:

8’h00 Clears the interrupt

8’h01 Sets the interrupt

8’h02 to 8’hff Initiate a countdown that decreases every system clock tick. When it reaches 8’h01 it fires the interrupt and stops.

Note that these interrupts will not be shown in any of the interrupt controller vectors. However, reading from these registers returns the interrupt timer value and can therefore be used to check if an interrupt has been set from here.
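The write semantics of the interrupt test registers can be modelled in a few lines of C. This is a behavioural sketch based only on the description above; the struct and function names are hypothetical and it is not a translation of the Verilog.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Behavioural model of one interrupt test register (FIRQ_TIMER or IRQ_TIMER). */
struct irq_test_timer {
    uint8_t value;  /* 0 = cleared, 1 = interrupt set, 2..255 = counting down */
};

/* A write uses only the least significant byte of the data. */
void irq_test_timer_write(struct irq_test_timer *t, uint32_t data)
{
    assert(t != NULL);
    t->value = (uint8_t)(data & 0xFFu);
}

/* One system clock tick: count down until 1 is reached, then stop. */
void irq_test_timer_tick(struct irq_test_timer *t)
{
    assert(t != NULL);
    if (t->value > 1u)
        t->value--;
}

/* The interrupt is active while the value is 1. */
int irq_test_timer_active(const struct irq_test_timer *t)
{
    assert(t != NULL);
    return t->value == 1u;
}
```

Writing 3 to the model makes the interrupt fire after two ticks, and it stays active until 0 is written.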

UART_CONTROL This register controls the UART interface in the Verilog test bench. Only the two lowest bits are used. They have the following effect:

Bit 0 When set it enables transmission in the test bench UART.

Bit 1 When set the test bench UART is in loopback mode.

UART_STATUS Returns the status of the test bench UART. Only bits 1 and 0 are used and they have the following meaning:

Bit 0 High if the UART transmit FIFO is empty.

Bit 1 High if the UART transmit FIFO is full.

UART_TXD This register is used to push a byte into the test bench UART's transmit FIFO if it is not in loopback mode. If the FIFO is full the byte will be discarded and a warning message generated (in simulation).

SIM_CTRL This register is used by software to determine if it runs in simulation or in hardware. If the register is zero then it is running on the FPGA, otherwise it is a simulation. This register is controlled by the run.sh script and a define in the code.

CYCLES This 32 bit register stores the number of system ticks since startup.

LED This register is used to control the LEDs on the SP605 development board. For this bit 3:0 is used.

PHY_RST This register is not used any more. Its purpose was to reset the Ethernet controller on the SP605 development board.

RANDOM_NUM registers These are a set of one byte wide registers containing a random number. A new random number is retrieved by reading the LSB from any of these registers. Writing to any of these registers will give the generator a new seed.

4.15. Verilog test bench

The test bench is the top level entity when running simulations and was included in the Amber project. It instantiates the whole system along with slave modules for the following functions:

• UART

• SPI

• I2C

• GPIO


----------------------------------------------------------------------------
Amber Core
           User        FIRQ        IRQ         > SVC
r0         0x20000010
r1         0x000000c1
r2         0x00000002
r3         0x00000080
r4         0xdeadbeef
r5         0x000000a5
r6         0x0000005a
r7         0xdeadbeef
r8         0xdeadbeef  0xdeadbeef
r9         0xdeadbeef  0xdeadbeef
r10        0x00000011  0xdeadbeef
r11        0xf0000000  0xdeadbeef
r12        0xdeadbeef  0xdeadbeef
r13        0xdeadbeef  0xdeadbeef  0xdeadbeef  0xdeadbeef
r14 (lr)   0xdeadbeef  0xdeadbeef  0xdeadbeef  0xdeadbeef
r15 (pc)   0x00000268

Status Bits: N=0, Z=0, C=1, V=0, IRQ Mask 1, FIRQ Mask 1, Mode = Supervisor
----------------------------------------------------------------------------

++++++++++++++++++++
Passed i2c 12364 ticks
++++++++++++++++++++

Stopped at time : 309577500 ps : File "/home/emanuel/workspace/amber_SoC/trunk/hw/vlog/tb/tb.v" Line 462

4.15.1. UART

The test bench UART has two modes, loopback and transmission. In loopback mode a received byte is put in the transmission buffer and sent back. In transmission mode it utilises a 16-byte transmission buffer that can be filled with data using the TXD register. The registers are controlled from the test module described in Section 4.14.
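The interaction between the two modes, the 16-byte transmission buffer and the test module's UART registers can be sketched as a behavioural model in C. This is a minimal sketch under stated assumptions, not the test bench source: all names (`tb_uart_t`, `tb_uart_write_txd` and so on) are hypothetical, and it is assumed that a byte received outside loopback mode is simply ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bit positions as described for the control and status registers. */
#define TB_UART_CTRL_TX_ENABLE (1u << 0) /* enables transmission */
#define TB_UART_CTRL_LOOPBACK  (1u << 1) /* selects loopback mode */
#define TB_UART_STAT_TX_EMPTY  (1u << 0) /* transmit FIFO is empty */
#define TB_UART_STAT_TX_FULL   (1u << 1) /* transmit FIFO is full */
#define TB_UART_FIFO_DEPTH     16u       /* 16-byte transmission buffer */

/* Hypothetical model state: a ring buffer plus the control register. */
typedef struct {
    uint8_t  fifo[TB_UART_FIFO_DEPTH];
    unsigned head;    /* index of the oldest queued byte */
    unsigned count;   /* number of queued bytes */
    uint32_t control;
} tb_uart_t;

/* Queue a byte; returns false (byte discarded) when the FIFO is full. */
static bool tb_uart_enqueue(tb_uart_t *u, uint8_t byte)
{
    assert(u != NULL);
    if (u->count == TB_UART_FIFO_DEPTH) {
        return false;
    }
    u->fifo[(u->head + u->count) % TB_UART_FIFO_DEPTH] = byte;
    u->count++;
    return true;
}

/* Value software would read from the status register. */
static uint32_t tb_uart_status(const tb_uart_t *u)
{
    uint32_t status = 0;
    if (u->count == 0) status |= TB_UART_STAT_TX_EMPTY;
    if (u->count == TB_UART_FIFO_DEPTH) status |= TB_UART_STAT_TX_FULL;
    return status;
}

/* Write to the TXD register: only accepted outside loopback mode. */
static bool tb_uart_write_txd(tb_uart_t *u, uint8_t byte)
{
    if (u->control & TB_UART_CTRL_LOOPBACK) return false;
    return tb_uart_enqueue(u, byte);
}

/* Byte received on the serial line: echoed only in loopback mode
 * (ignoring it otherwise is an assumption of this sketch). */
static bool tb_uart_receive(tb_uart_t *u, uint8_t byte)
{
    if (!(u->control & TB_UART_CTRL_LOOPBACK)) return false;
    return tb_uart_enqueue(u, byte);
}

/* Take the oldest byte out of the buffer for transmission. */
static bool tb_uart_pop(tb_uart_t *u, uint8_t *byte)
{
    if (u->count == 0) return false;
    *byte = u->fifo[u->head];
    u->head = (u->head + 1u) % TB_UART_FIFO_DEPTH;
    u->count--;
    return true;
}
```

Software driving the real test bench would typically poll the full flag before writing TXD, since a write to a full FIFO is discarded.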

4.15.2. SPI

The SPI slave model is a simple loopback model that was included in the SPI project[28], where it was part of that system's test bench. The only modification made was to read the CTRL register in order to automatically set the same mode as the Amber SPI controller.

4.15.3. I2C

This I2C slave was included in the I2C project where it was part of its test bench. It is interfaced as a real I2C


both read and write access. Since both the slave and the actual controller tri-state SDA and SCL, the two lines are pulled up in the test bench top level with the Verilog "pullup" primitive.
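The reason the pull-ups are needed can be illustrated with a small C model of one shared line. On an I2C bus a device only ever pulls a line low or releases it, so without a pull-up a line released by every device would float. The sketch below is an illustration of that wired-AND behaviour, not part of the test bench; the names `line_drive_t` and `pulled_up_line_level` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Each device either releases the line (high impedance) or pulls it low. */
typedef enum { LINE_RELEASED = 0, LINE_DRIVE_LOW = 1 } line_drive_t;

/* Resolved level of a pulled-up line such as SDA or SCL:
 * low if any device pulls it low, otherwise the pull-up holds it high.
 * Returns true for a high level. */
static bool pulled_up_line_level(const line_drive_t *drivers, size_t n)
{
    assert(drivers != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (drivers[i] == LINE_DRIVE_LOW) {
            return false;
        }
    }
    return true;
}
```

With the controller and the slave both releasing the line the function returns high, which is the idle state of the bus; as soon as either of them pulls low, the line reads low for both.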

4.15.4. GPIO


5. Configuration

In this chapter the parameters for configuring the system are presented. The steps that were taken to remove and add modules to the system are also shown.

5.1. Parameters

The parameters for configuring the system are presented in Table 5.1.

Parameter               Location                  Description
----------------------  ------------------------  ------------------------------------------------------------------------------------
A23_CACHE_WAYS          a23_config_defines        Defines the size of the cache. The size can be either 2, 4, 8 or 16 kB.
A23_RAM_REGISTER_BANK   a23_config_defines        If set, the register bank is implemented in a RAM block, otherwise in flip-flops.
AMBER_CLOCK_DIVIDER     system_config_defines     The PLL output is divided by this value to get the system clock.
AMBER_UART_BAUD         system_config_defines     Specifies the baud rate for both UARTs.
BOOT_MSB                memory_configuration.v    Specifies the size of the boot memory.
GPIO_PINS               system_config_defines.v   The number of available GPIO pins. Any number from 1 to 32 is valid.
MAIN_MSB                memory_configuration.v    Specifies the size of the main memory.
SPI_DIVIDER_LEN         spi_defines.v             Sets the bit length for the SPI clock divider.
SPI_MAX_CHAR            spi_defines.v             Sets the maximum transmission data block size.
SPI_SS_NB               spi_defines.v             Sets the number of slave select signals.
WB_SLAVES               system.v                  Sets the number of slaves on the Wishbone bus.