Design and Implementation of a DMA Controller for Digital Signal Processor


Department of Electrical Engineering (Institutionen för systemteknik)

Master's Thesis

Design and Implementation of a DMA Controller for Digital Signal Processor

Master's thesis carried out in Computer Engineering at the Institute of Technology, Linköping University

by

Guoyou Jiang

LiTH-ISY-EX--10/4244--SE

Linköping 2010

Department of Electrical Engineering, Linköpings tekniska högskola, Linköpings universitet


Supervisor: Dake Liu, ISY, Linköpings universitet

Examiner: Dake Liu, ISY, Linköpings universitet


Division, Department: Division of Computer Engineering, Department of Electrical Engineering, Linköpings universitet, SE-581 83 Linköping, Sweden

Date: 2010-08-12

URL for electronic version: http://www.da.isy.liu.se, http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-58868

ISRN: LiTH-ISY-EX--10/4244--SE

Title: Design and Implementation of a DMA Controller for Digital Signal Processor

Author: Guoyou Jiang

Abstract

This thesis work was conducted in the Division of Computer Engineering at the Department of Electrical Engineering, Linköping University. During the thesis work, a configurable Direct Memory Access (DMA) controller was designed and implemented. The DMA controller runs at 200 MHz in 65 nm digital CMOS technology. The estimated gate count is 26595.

The DMA controller has two address generators and can provide two clock sources. It can thus handle data reads and writes simultaneously. There are 16 channels built into the DMA controller, and the data width can be 16, 32 or 64 bits. The DMA controller supports 2D data access by configuring its intelligent linking table. The DMA controller is designed for advanced DSP applications; it is not dedicated to a cache, which has a fixed priority.

Keywords: direct memory access, DMA, digital signal processing, DSP, linking table, processor, peripherals, scalability, testbench, verification


Acknowledgments

This thesis is the result of master's thesis work carried out from spring 2009 to spring 2010 at Linköping University.

First of all, I would like to thank my supervisor and examiner, Professor Dake Liu, who gave me the great opportunity to do this final-year project. The thesis would not have been possible to complete without his experience and support.

Second, I would like to express my gratitude to the Ph.D. students in the Division of Computer Engineering. Their experience in digital signal processor design helped me a lot. Jian Wang helped me with some key issues in the design of the behavioral model. Di Wu introduced me to this topic. Olof Kraigher helped me solve some programming problems in the C++ model. I also want to thank He Zhang, who helped me by discussing some example applications of the design.

I also want to thank Thomas Österholm, who helped me integrate my design into the complete DSP system, and Andreas Ehliar and Johan Eilert, who gave me a lot of advice while I implemented my design as an ASIC.

Last but not least, I want to express my appreciation to my parents in my hometown, Shanghai. Their love and support have been unlimited throughout my entire academic career far away from home.


Contents

1 Introduction
  1.1 Scope
  1.2 Method
  1.3 Thesis Overview
  1.4 Notations
  1.5 Abbreviations
2 Background
  2.1 DMA Basics
  2.2 DMA Operations
    2.2.1 Normal DMA Operation
    2.2.2 Chain Operation
    2.2.3 Linking Table Operation
3 Application Requirements
  3.1 Application Analysis
  3.2 Requirement Specification
4 Interfaces
  4.1 Host Interface
    4.1.1 Main Status Register
    4.1.2 Main Control Register
    4.1.3 Special Memory Control Register
  4.2 Memory Interface
  4.3 Behavior model of I/O
  4.4 Task Packet Specification
5 DMA Hardware
  5.1 Host Interface
    5.1.1 Block Diagram
    5.1.2 Interface
  5.2 Source Address Generator
    5.2.1 Block Diagram
    5.2.2 Interface
  5.3 Destination Address Generator
    5.3.1 Block Diagram
    5.3.2 Interface
  5.4 Source Decoder
    5.4.1 Block Diagram
    5.4.2 Interface
  5.5 Destination Decoder
    5.5.1 Block Diagram
    5.5.2 Interface
  5.6 Transaction FSM
    5.6.1 Interface
6 Integration
  6.1 Hardware Integration
  6.2 Software Integration
  6.3 DMA Programming
    6.3.1 Initialize the DMA Controller
    6.3.2 Poll the DMA Controller
    6.3.3 Handle the DMA Interrupt
7 Verification
  7.1 Functional Verification
  7.2 Hardware Implementation
8 Conclusion
  8.1 Achieved Results
    8.1.1 DMA Benchmark
    8.1.2 Comparison
    8.1.3 Conclusion
  8.2 Future Work
Bibliography
A DMA Simulator C++ Header
B DMA Simulator C++ Code

List of Figures

1.1 DIT butterfly of Radix-2 FFT
2.1 System overview
2.2 Basic DMA operation to save processor run time
2.3 DMA Chain operation example
2.4 An example of DMA linking table operation
3.1 Matrix Transposition
3.2 Transfer decomposition of Example 3.1
3.3 Transfer decomposition of Example 3.2
3.4 Neighbor Searching in Motion Estimation
3.5 Transfer decomposition of Example 3.3
4.1 DMA configuration
5.1 DMA Hardware architecture
5.2 DMA Controller Block Diagram
5.3 Block diagram of Host Interface Module
5.4 Block diagram of Source address generator
5.5 Block diagram of Destination address generator
5.6 Block diagram of Source decoder
5.7 Block diagram of Destination decoder
5.8 Finite State Machine of the control logic
7.1 DMA Functional Verification Flow
8.1 Timing diagram of basic DMA operation
8.2 Timing diagram of linking table operation

List of Tables

3.1 Preparing DMA for Motion Estimation
3.2 Requirement Specification
4.1 Host Interface
4.2 DMA Registers specification
4.3 Main status register specification
4.4 Main control register specification
4.5 Special memory control register specification
4.6 Memory Interface
4.7 Task packet specification
4.8 Control Vector 1
4.9 Control Vector 2
4.10 Control Vector 3
4.11 Control Vector 4 & 5
4.12 Control Vector 6 & 7 & 8
5.1 Interface of Host Interface Module
5.2 Interface of Source address generator
5.3 Interface of Destination address generator
5.4 Interface of Source decoder
5.5 Interface of Destination decoder
5.6 Interface of Transaction FSM
8.2 Results Comparison with and without DMA


Chapter 1

Introduction

Today, as technology evolves, many DSP applications are emerging. The demand for rich multimedia content such as HDTV and 3D displays is huge. Behind these demands there are technologies that drive the need for a better experience of electronic products. One of them is digital signal processing. DSP techniques have provided improvements in traditional signal processing applications such as audio, visual, radar, and communications [9, p.1].

The component that performs digital signal processing is called a digital signal processor. A specially designed peripheral can help the processor with accessing memories. That peripheral is called a DMA controller.

With the help of a DMA controller, the processor can spend more of its time on computation while data transfers are in progress. Since most memory accesses are hidden in DSP algorithms, it is important to expose these hidden memory accesses [6]. A DMA controller can be a great help in terms of both power consumption and performance. For example, a DIT butterfly, which is the basis of the FFT algorithm, is shown in Figure 1.1 and can be divided into the following steps:

1. Load two complex operands;

2. Load one complex coefficient and perform one complex multiplication;

3. Perform two complex additions;

4. Store two complex results.

This is a simple example of the memory accesses hidden in basic DSP algorithms. A more detailed discussion is presented in Chapter 3.


Figure 1.1. DIT butterfly of Radix-2 FFT
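The four butterfly steps listed above can be sketched in C++ as follows. This is only an illustrative sketch and not code from the thesis: the function name dit_butterfly, the alias Cplx and the use of std::complex<double> are assumptions made here. The comments mark where the loads and stores that a DMA controller could prepare in advance take place.

```cpp
#include <cassert>
#include <complex>

// Illustrative type alias; the thesis does not prescribe a number format.
using Cplx = std::complex<double>;

// One radix-2 DIT butterfly, computed in place.
// a, b: the two complex operands (step 1: two loads)
// w:    the complex coefficient   (step 2: one load)
inline void dit_butterfly(Cplx& a, Cplx& b, const Cplx& w) {
    const Cplx t = b * w;      // step 2: one complex multiplication
    const Cplx upper = a + t;  // step 3: first complex addition
    const Cplx lower = a - t;  // step 3: second complex addition (a subtraction)
    a = upper;                 // step 4: store first result
    b = lower;                 // step 4: store second result
}
```

Even in this small kernel there are three complex loads and two complex stores for a single complex multiplication and two additions, which is why moving data ahead of time matters.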

1.1 Scope

The scope of this thesis work is to design and implement a DMA peripheral for Senior, a DSP processor developed at the Division of Computer Engineering at Linköping University.

The interface between the DMA controller and the DSP core had already been developed in another project [7, p.53]. The design work started from the definition of the DMA specification.

For many DSP applications, it is desirable to use a technique called linking table to accelerate the processing of two-dimensional arrays [6, p.584]. The linking table is thus supported in the current DMA design.

To make sure the design is correct, a test bench was also developed to verify the functionality of the designed modules. Since the DMA controller should work with the Senior DSP, the test bench was written on the basis of the Senior test bench.

1.2 Method

To design the DMA module, the specification should be defined based on the requirements of the applications. Since the DMA module is designed to meet the needs of the Senior DSP, a behavioral model of the DMA module should also be added to the existing Senior instruction set simulator. Developing the behavioral model is important because it can be used not only to obtain performance benchmarks of the hardware, but also as a reference against which the actual hardware is verified.

Once the behavioral model is done, the RTL implementation translates the behavioral model into an RTL language such as Verilog. After the RTL implementation is complete, the behavioral model is used as a golden reference to verify the RTL module. If they produce the same results, the RTL implementation is believed to be correct.


1.3 Thesis Overview

Chapter 1 gives a brief introduction to let the reader know what this thesis is about. Basic background knowledge and the operations of DMA are then discussed in Chapter 2.

In Chapter 3, some applications are analyzed first, and the requirement specification is then discussed based on that analysis.

The designed DMA controller should work together with its host, the Senior DSP. Chapter 4 describes the interfaces and registers of the DMA controller along with the DMA task, so that the user of Senior will have an idea of how the DMA controller works with the Senior DSP.

After the requirement specification and the host interface have been discussed, Chapter 5 describes the hardware architecture of the designed DMA controller in detail, including the micro-architecture of each block.

Once the DMA controller hardware is completed, it needs to be integrated into the Senior system. Chapter 6 discusses the integration of the DMA controller from both the hardware and the software perspective. The DMA controller behavioral model is also discussed.

Chapter 7 discusses the verification of the implemented hardware.

Chapter 8 summarizes the results obtained, together with the conclusions and future work.

1.4 Notations

To make the thesis easier to understand, there are some notations the reader should keep in mind while going through the text.

A $ or 0x before a number means that the number is hexadecimal. A number without any prefix is a decimal number. For example, "0x64" means decimal value 100, while "64" means decimal value 64.

When discussing specific bits of a word, Verilog syntax is used as far as possible. Three zeros followed by three ones is written as 6'b000111, where 6 denotes the total number of bits and b indicates a binary value. status[10:5] means bits 10 to 5 of the register status, and bit 3 alone is written as status[3].
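For readers more used to software than to Verilog, the bit-range notation can be mirrored by a small C++ helper. The sketch below is an assumption made for illustration only: the function name bits and the choice of a 16-bit word are not taken from the thesis.

```cpp
#include <cassert>
#include <cstdint>

// Returns word[hi:lo] in the Verilog sense: bits hi down to lo, right-aligned.
// Requires 15 >= hi >= lo.
inline std::uint16_t bits(std::uint16_t word, unsigned hi, unsigned lo) {
    assert(hi < 16 && lo <= hi);
    const unsigned width = hi - lo + 1;
    const std::uint32_t mask = (1u << width) - 1u;  // width <= 16, so no overflow
    return static_cast<std::uint16_t>((word >> lo) & mask);
}
```

With this helper, bits(status, 10, 5) corresponds to status[10:5] and bits(status, 3, 3) corresponds to status[3].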


1.5 Abbreviations

3D 3 Dimensional

AGU Address Generation Unit

ASIC Application Specific Integrated Circuit

ASIP Application Specific Instruction set Processor

DCT Discrete Cosine Transform

DDR Double Data Rate

DIT Decimation In Time

DM Data Memory

DMA Direct Memory Access

DRAM Dynamic Random Access Memory

DSP Digital Signal Processor

FFT Fast Fourier Transform

FIFO First In First Out

FPGA Field Programmable Gate Array

FSM Finite State Machine

GIO General I/O

HDTV High Definition Television

I/O Input/Output

ISR Interrupt Service Routine

IP Intellectual Property

JPEG Joint Photographic Experts Group

LSB Least Significant Bit

MB Macro Block

MP3 MPEG 1 Layer 3

MSB Most Significant Bit

MUX Multiplexer

PC Program Counter

PM Program Memory

RTL Register Transfer Level


Chapter 2

Background

With the help of pipelining, the processor core can execute one operation per cycle, including calculation, data load and data store. In reality, however, optimal application performance cannot be achieved if the processor core has to do the data transfers itself [4, p.75]. This is where the DMA controller can be used to relieve the core of data movement.

2.1 DMA Basics

DMA stands for Direct Memory Access. It is a technique for transferring data blocks directly between memories without using the processor for data access [6, p.535] [5]. Since the DSP is designed for computationally intensive work, in most cases a separate peripheral should access the processor memories on behalf of the processor core. While the peripheral is handling memory transactions, the processor can perform other operations unrelated to those transfers.

A DMA module, or DMA controller, is by definition a peripheral module of a processor core for direct memory access. The basic work flow of a DMA transaction is as follows. The core or another data unit prepares and sends a DMA request to the DMA controller when it wants to transfer a large amount of data. The DMA controller prepares and transfers the data while the core does other work. The core may poll the status of the DMA controller to see whether the transfer has completed, or the DMA controller sends an interrupt to the core or other data units when the transaction is finished. The processor core can then decide whether to continue processing the data.

A DMA subsystem can consist of a processor core, a DMA module and several memory modules connected to both the processor core and the DMA module.

The DMA module can provide DMA transfers between two memory interfaces. Transfers can also be performed between memories and high-speed I/Os. Figure 2.1 shows a typical DSP sub-system with the DMA module inside.

Figure 2.1. System overview

In this DSP sub-system, the DSP core acts as the system master, and the DMA module is the slave of the DSP core. On the other hand, the DMA module is the master of its connected memory modules, high-speed I/Os, etc. Both the DSP core and the DMA module can access the memory modules, but care must be taken since a memory cannot be accessed by both at the same time.

From the DMA controller's point of view, the master DSP core configures the data format of the transaction and requests the DMA controller to do the data transfer. The configuration is called a DMA channel, which consists of the task priority, the source port and destination port of the transfer, the start addresses of both ports, the data packet size, etc.

2.2 DMA Operations

Usually, the DMA controller should support more than one operation, since different DSP algorithms have quite different access patterns. This section illustrates several transfer options and how they operate.


2.2.1 Normal DMA Operation

This is a simple DMA operation in which the DMA controller performs a block copy from one location to another, either on the same interface or on different interfaces. The software running on the processor core is responsible for limiting the access time. Figure 2.2 shows the basic DMA operation performed by the DMA controller.

Figure 2.2. Basic DMA operation to save processor run time.

As we can see from Figure 2.2, the processor core is responsible for the DMA transaction. Once the processor needs data, it prepares a DMA request which specifies some basic parameters of the transfer. The processor then sends the request through the general I/O to the DMA controller. The DMA controller transfers the corresponding data from memory location 1 to memory location 2 based on the request. When the transfer is finished, the processor finds out either by checking the status register of the DMA controller or by receiving an interrupt. Once the processor knows that the transfer is done, it can use the data provided by the DMA controller. Thus, while the DMA controller is doing the data transfer, the processor can do other things rather than transferring the data itself, and run time is saved.

2.2.2 Chain Operation

In this operation, a contiguous set of elements can be transferred when a synchronous event occurs [1] [8]. The DMA controller is used to transfer a chain of data elements that are equally spaced. Once the DMA controller gets the task, it sets up the proper parameters and transfers each element in that chain. Figure 2.3 illustrates this operation.

As we can see in Figure 2.3, the data elements are separated by a fixed stride. After transferring the first data element, the DMA controller can transfer the next element as if the data elements were chained together. With this operation, the extra time for channel configuration is saved.


Figure 2.3. DMA Chain operation example.
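The chain operation can be modelled in software as a strided gather. The following C++ sketch is an illustration under stated assumptions, not the behavioral model of the thesis: the descriptor ChainTask, its field names and the function run_chain are hypothetical, and memories are modelled as vectors of 16-bit words.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical descriptor of one chain transfer.
struct ChainTask {
    std::size_t src_start;   // index of the first element in the source memory
    std::size_t src_stride;  // fixed distance between consecutive source elements
    std::size_t dst_start;   // index of the first element in the destination memory
    std::size_t count;       // number of elements in the chain
};

// Copies `count` equally spaced source elements to consecutive destination words.
inline void run_chain(const std::vector<std::uint16_t>& src,
                      std::vector<std::uint16_t>& dst,
                      const ChainTask& t) {
    for (std::size_t i = 0; i < t.count; ++i) {
        const std::size_t s = t.src_start + i * t.src_stride;
        const std::size_t d = t.dst_start + i;
        assert(s < src.size() && d < dst.size());
        dst[d] = src[s];
    }
}
```

The whole chain is described by four numbers, so the core configures the channel once instead of once per element.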

2.2.3 Linking Table Operation

In this operation, multiple data blocks are merged into one large data block of a single DMA transaction. Some DSP algorithms require data blocks at different locations in the main memory; with the help of a linking table, multiple data blocks can be loaded sequentially by one DMA transaction. For example, in a video CODEC application, it is often desired to compare data from different reference frames [6]. A linking table concatenates several data blocks into one DMA transaction. Figure 2.4 gives an example of a linking table.

Figure 2.4. An example of DMA linking table operation.

The first data block starts at the physical address 0x2000, and its length is 256 data words. Once the first data block has been loaded, the loading of the second data block, which has block number 2, follows at once. As we can see from Figure 2.4, its start address is 0x4000 and its length is 128. After data block 2 has been loaded, the loading of data block 3 is activated immediately. The start address of data block 3 is 0x8000 and its length is 512. When link=0 is reached at the end of data block 3, the DMA transaction is finished. Using a linking table, the transfers of three non-contiguous data blocks are merged into one single DMA transaction.

Actually, linking table operation is a more flexible form of chain operation. Since the distance between data elements is not fixed, we need another parameter to determine the length of each data element. Table 4.7 gives a detailed configuration of the linking table.
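The linking table example of Figure 2.4 can be traced with a short C++ sketch. It is a simplified illustration and makes several assumptions: the struct Link, its field names and the function linked_addresses are hypothetical, entry 0 of the table is left unused so that a link value of 0 can mean "stop", and the table is assumed to contain no cycles. The actual task packet layout is the one given in Table 4.7.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical linking-table entry.
struct Link {
    std::uint32_t start;   // start address of the block, in data words
    std::uint32_t length;  // block length, in data words
    unsigned next;         // number of the next block, 0 ends the transaction
};

// Walks the table from block `first` and returns every source address
// in the order in which one merged DMA transaction would read them.
inline std::vector<std::uint32_t> linked_addresses(const std::vector<Link>& table,
                                                   unsigned first) {
    std::vector<std::uint32_t> addrs;
    for (unsigned blk = first; blk != 0; blk = table.at(blk).next) {
        const Link& l = table.at(blk);
        for (std::uint32_t i = 0; i < l.length; ++i) {
            addrs.push_back(l.start + i);
        }
    }
    return addrs;
}
```

For the three blocks of Figure 2.4 (0x2000 with 256 words, 0x4000 with 128 words, 0x8000 with 512 words) the walk yields 256 + 128 + 512 = 896 addresses in a single transaction.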


Chapter 3

Application Requirements

In this chapter, several application examples are described and analyzed, and the requirement specification is then proposed based on the analysis of these examples.

3.1 Application Analysis

First of all, let us take several application examples into consideration.

Example 3.1: Matrix Transposition

Suppose we want to transpose a matrix.

Original matrix:
0x0 0x1 0x2 0x3
0x4 0x5 0x6 0x7
0x8 0x9 0xA 0xB
0xC 0xD 0xE 0xF

Memory contents (address: data): 0: 0x0, 1: 0x1, 2: 0x2, 3: 0x3, 4: 0x4, 5: 0x5, ..., 14: 0xE, 15: 0xF

Transposed matrix:
0x0 0x4 0x8 0xC
0x1 0x5 0x9 0xD
0x2 0x6 0xA 0xE
0x3 0x7 0xB 0xF

Figure 3.1. Matrix Transposition

The matrix may be stored consecutively in memory, as shown in Figure 3.1. To transpose the matrix, we can simply move the data from the original addresses to the desired positions. This can be expressed with the chain operation discussed in Section 2.2.


Figure 3.2. Transfer decomposition of Example 3.1

The data transfer is represented in Figure 3.2: we can split the whole transfer into four chained transfers. In this example, the source addresses are discrete with a stride of 4 data words, while the destination addresses are contiguous. This is only a simple example because the matrix is small. In more complicated applications the matrix could be very large, but the basic principle still holds.
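The decomposition into four chained transfers can be checked with a few lines of C++. The sketch below is an illustration only: the constant kDim, the function name transpose_by_chains and the row-major std::array storage are assumptions, chosen to match the 4 × 4 matrix of Example 3.1.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t kDim = 4;  // matrix dimension used in Example 3.1

// Transposes a row-major kDim x kDim matrix with kDim chained transfers.
// Chain c reads column c of the source (start c, stride kDim) and writes
// it to row c of the destination (start c * kDim, stride 1).
inline std::array<int, kDim * kDim> transpose_by_chains(
        const std::array<int, kDim * kDim>& src) {
    std::array<int, kDim * kDim> dst{};
    for (std::size_t c = 0; c < kDim; ++c) {      // one chained transfer per column
        for (std::size_t i = 0; i < kDim; ++i) {  // kDim elements per chain
            dst[c * kDim + i] = src[c + i * kDim];
        }
    }
    return dst;
}
```

Only the chain start addresses change between the four transfers; the source stride of 4 and the contiguous destination stay the same.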

Example 3.2: Create a large Matrix

Suppose we want to create a large matrix with 4096 elements, where every element has the same value, 0 or 1. This case is quite common in matrix manipulation in both communication algorithms and video processing algorithms. It is possible to create such a matrix by writing zeros or ones to a series of consecutive addresses. But doing so wastes quite a lot of precious core cycles, which prevents the core from doing more useful tasks.

In this case, we can simply use the DMA controller to create the matrix. Suppose the matrix should be created in DM1. First we use the core to write one element of the matrix to DM0, then we use the DMA controller to transfer that same content repeatedly to DM1.

As we can see from Figure 3.3, the transfer is quite simple. The source address is fixed, while the destination addresses are contiguous. The amount of data to be transferred equals the size of the matrix.


Figure 3.3. Transfer decomposition of Example 3.2
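The access pattern of Example 3.2, a fixed source address and contiguous destination addresses, can be sketched in C++ as below. The function name fill_from_fixed_source and the modelling of DM0 and DM1 as vectors of 16-bit words are assumptions made for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Copies `count` words from one fixed source address to consecutive
// destination addresses.
inline void fill_from_fixed_source(const std::vector<std::uint16_t>& dm0,
                                   std::size_t src_addr,
                                   std::vector<std::uint16_t>& dm1,
                                   std::size_t dst_start,
                                   std::size_t count) {
    assert(src_addr < dm0.size() && dst_start + count <= dm1.size());
    for (std::size_t i = 0; i < count; ++i) {
        dm1[dst_start + i] = dm0[src_addr];  // the source address never advances
    }
}
```

The core writes a single word to DM0 and then issues one request for 4096 words, instead of executing 4096 store operations itself.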

Let us look at a more complicated and realistic example based on motion estimation algorithms [6, p.585].

Example 3.3: Motion Estimation

In the motion estimation algorithm, each macro block (usually 16 × 16 = 256 pixels) in the current frame is compared against the neighboring area of the reference frame.

Figure 3.4. Neighbor Searching in Motion Estimation

Suppose we divide the picture into 8 × 8 = 64 macro blocks, each containing 256 pixels. We want to estimate the motion vector of macro block 27 in the current frame. Based on the algorithm, we need to search the neighboring macro blocks in the reference frame: macro blocks 18, 19, 20, 26, 28, 34, 35 and 36 in the reference frame are going to be compared. Usually the data memory of the processor core is not large enough to hold the whole picture, so we need to transfer the desired data from main memory to the data memory of the processor core. The processor can then run the algorithm on the data.

Let's say the segment address of the current frame in main memory is 32768 and the address of the reference frame is 32768 + (8 × 8) × (16 × 16) = 49152. Thus, we can specify the data blocks to be transferred as in Table 3.1.

Specification              Value                      Comment
DMA task ID                1                          The identification of transaction
Task priority              1                          The priority of the transaction
Number of links            5
Source port                Main memory
Destination port           DM0
Destination start address  0
Link 1 start address       32768 + 26 × 256 = 39424   Block 27 in current frame
Link 1 length              256
Link 2 start address       49152 + 17 × 256 = 53504   Block 18 in reference frame
Link 2 length              768                        3 blocks in row
Link 3 start address       49152 + 25 × 256 = 55552   Block 26 in reference frame
Link 3 length              256
Link 4 start address       49152 + 27 × 256 = 56064   Block 28 in reference frame
Link 4 length              256
Link 5 start address       49152 + 33 × 256 = 57600   Block 34 in reference frame
Link 5 length              768                        3 blocks in row

Table 3.1. Preparing DMA for Motion Estimation
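The start addresses in Table 3.1 follow one simple rule, which the C++ sketch below reproduces. The constant and function names are assumptions made here, and the sketch assumes, as the table does, one address per pixel and macro blocks numbered from 1 and stored one after another.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint32_t kPixelsPerBlock = 16 * 16;  // 256 pixels per macro block
constexpr std::uint32_t kBlocksPerFrame = 8 * 8;    // 64 macro blocks per frame
constexpr std::uint32_t kCurrentFrame   = 32768;    // segment address of the current frame
constexpr std::uint32_t kReferenceFrame =
    kCurrentFrame + kBlocksPerFrame * kPixelsPerBlock;

// Start address of macro block `n` (numbered from 1) in a frame at `frame_base`.
constexpr std::uint32_t block_address(std::uint32_t frame_base, std::uint32_t n) {
    return frame_base + (n - 1) * kPixelsPerBlock;
}
```

Links 2 and 5 each cover three neighboring blocks in a row, so their length is 3 × 256 = 768 while their start address is that of the first block in the row.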

Based on the data block specification in Table 3.1, we can draw the transfer decomposition shown in Figure 3.5.

Figure 3.5. Transfer decomposition of Example 3.3

3.2 Requirement Specification

As described in Chapter 2, the DSP core is responsible for configuring the DMA controller. So we need to specify the parameters of the memory transfer.

As configured by the DSP core, the DMA controller connects a source port and a destination port. Here, a port is either a data source supplying data or a data sink consuming data. In most cases, a port is a memory location or a data buffer. A DMA data transaction moves data from the source port to the destination port as configured by the DMA task from the master DSP core.

In order to design the DMA controller, we need to specify the following parameters of the memory transaction.

• Number of ports supported by the DMA controller

This specifies the number of channels that can be connected by the DMA controller.

• Address Generator Unit (AGU)

Two AGUs are needed: one to provide the source address and the other to provide the destination address of a data transfer.

• Data Width

Since the DMA controller should support different memory modules, the width of the data path should be configurable. We need to specify the data widths supported by the DMA module.

• Memory Organization

There are two different ways to store words in a byte-addressed memory. Storing the least significant byte at the lowest address is called “little endian”, while storing the least significant byte at the highest address is called “big endian” [3]. There is no specific reason to choose one over the other, but we still need to specify the format supported during the data transfer.

• Linking Table support

As described in Chapter 2, the linking table saves the extra cost of configuring several separate data blocks by concatenating them into one transaction. On the other hand, it also costs extra hardware to keep track of several different data blocks [2]. Thus we need to specify the length of the linking table.
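To make the memory organization item above concrete, the following C++ sketch stores the same 32-bit word in both byte orders. It is an illustration only; the function names are assumptions and the sketch does not describe how the DMA hardware performs the conversion.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Little endian: least significant byte at the lowest address (index 0).
inline std::array<std::uint8_t, 4> to_little_endian(std::uint32_t w) {
    return {{static_cast<std::uint8_t>(w & 0xFF),
             static_cast<std::uint8_t>((w >> 8) & 0xFF),
             static_cast<std::uint8_t>((w >> 16) & 0xFF),
             static_cast<std::uint8_t>((w >> 24) & 0xFF)}};
}

// Big endian: most significant byte at the lowest address (index 0).
inline std::array<std::uint8_t, 4> to_big_endian(std::uint32_t w) {
    return {{static_cast<std::uint8_t>((w >> 24) & 0xFF),
             static_cast<std::uint8_t>((w >> 16) & 0xFF),
             static_cast<std::uint8_t>((w >> 8) & 0xFF),
             static_cast<std::uint8_t>(w & 0xFF)}};
}
```

For the word 0x11223344, the little endian byte sequence is 44 33 22 11 and the big endian sequence is 11 22 33 44, so a transfer between memories of different organization has to reverse the byte order within each word.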

Table 3.2 shows the requirement specification of the DMA controller to be designed for the Senior system.


No.  Description
1    16 Source ports:
     8 on-chip memory, 1 off-chip memory, 1 high-speed I/O, others reserved.
2    16 Destination ports:
     8 on-chip memory, 1 off-chip memory, 1 high-speed I/O, others reserved.
3    Address Generator Unit (AGU):
     1 for Source port, 1 for Destination port, each has 32b address space.
4    Clock Generator: supplies clock signal for memory (I/O), Source:Destination
     1:1, 1:1/2, 1:1/4, 1:1/8, 1:1/16, 1:1/32, 1:1/64; 1:2, 1:4, 1:8, 1:16, 1:32, 1:64.
5    Data width:
     Source port: 8 bits, 16 bits, 32 bits. (64 bits not implemented in Senior.)
     Destination port: 8 bits, 16 bits, 32 bits. (64 bits not implemented in Senior.)
6    Memory organization:
     The DMA controller should support both big endian and little endian data.
7    Linking Table supported, the maximum length of linking table is 64.

Table 3.2. Requirement Specification


Chapter 4

Interfaces

The DMA module is controlled by the Senior core. Thus, when configuring the DMA module, the Senior core uses its I/O instructions in and out to read and write the registers of the DMA module.

4.1 Host Interface

The host interface of the DMA module conforms to the standard Senior I/O and should be connected through the general I/O of the Senior processor. The interface between the DSP core and the DMA module is shown in Table 4.1. The data buses from and to the DMA module are 32 bits wide. Only the 16 LSBs are used for the current DMA configuration.

Name         Width  DIR  Description
clk_i        1      In   System clock.
rst_i        1      In   System reset, active low.
addr_i       16     In   Address input (from DSP core).
data_i       32     In   Data input (from DSP core).
rd_strobe_i  1      In   Read strobe signal.
wr_strobe_i  1      In   Write strobe signal.
data_o       32     Out  Data output (to DSP core).

Table 4.1. Host Interface

Table 4.2 gives an overview of the DMA register specification. Reference [7, p.53] gives more detailed information about how to connect a peripheral to the Senior I/O.

4.1.1 Main Status Register

The status register shows the status of DMA transactions. Firmware developers can use this register to handle the DMA transactions.


Name         Addr  Width  Written by  Description
Status       00    16     DMA         Shows the status of the DMA. Further details can be found in Table 4.3.
Control      01    16     Senior      Used for configuring and controlling the DMA; details can be found in Table 4.4.
Output Data  10    16                 Not used in current implementation.
Input Data   11    16     Senior      DSP core writes task packet to this port to configure the DMA channel.

Table 4.2. DMA Registers specification

Bits    Specification
[0]     Idling or busy: Idle = 0, Busy = 1.
[1]     When 1, a channel can be configured. When 0, no channel is available.
[2]     When 1, running task is finished.
[3]     When 1, an exception has occurred.
[4]     When 1, task queue is full.
[15:5]  Reserved

Table 4.3. Main status register specification
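Firmware that polls the DMA controller has to pick individual bits out of the status word. The C++ sketch below decodes a status value according to the bit positions in Table 4.3. The struct DmaStatus, its field names and the helper transfer_done are assumptions made for illustration; how the status word is actually read through the Senior general I/O is not shown here.

```cpp
#include <cassert>
#include <cstdint>

// Decoded view of the main status register (bit positions from Table 4.3).
struct DmaStatus {
    bool busy;               // bit [0]: 0 = idle, 1 = busy
    bool channel_available;  // bit [1]: 1 = a channel can be configured
    bool task_finished;      // bit [2]: 1 = running task is finished
    bool exception;          // bit [3]: 1 = an exception has occurred
    bool queue_full;         // bit [4]: 1 = task queue is full
};

inline DmaStatus decode_status(std::uint16_t status) {
    // Bits [15:5] are reserved and are ignored in this sketch.
    return DmaStatus{
        (status & (1u << 0)) != 0,
        (status & (1u << 1)) != 0,
        (status & (1u << 2)) != 0,
        (status & (1u << 3)) != 0,
        (status & (1u << 4)) != 0,
    };
}

// Polling helper: would be called with each value read from the status
// register until it returns true.
inline bool transfer_done(std::uint16_t status) {
    return decode_status(status).task_finished;
}
```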

4.1.2 Main Control Register

The control register, as the name suggests, is used to control a DMA transaction.

Bits     Specification
[0]=1    Reset DMA, flush the current task.
[1]=1    Shutdown DMA.
[2]=1    Data rate: always using DMA clock.
[9:3]    Reserved
[10]=1   Activate a task (channel) which is specified in task ID.
[14:11]  DMA task ID
[15]=1   Ask for a channel configuration.

Table 4.4. Main control register specification
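The control word layout above can be exercised with a short C++ sketch. The function names are hypothetical, and the sketch only packs and unpacks the fields of the main control register; it makes no claim about the order in which the Senior firmware writes these words.

```cpp
#include <cassert>
#include <cstdint>

// Control word that activates a task: bit [10] = 1, task ID in bits [14:11].
inline std::uint16_t activate_task_word(unsigned task_id) {
    assert(task_id < 16);  // the task ID field is 4 bits wide
    return static_cast<std::uint16_t>((1u << 10) | ((task_id & 0xFu) << 11));
}

// Control word that asks for a channel configuration: bit [15] = 1.
inline std::uint16_t request_config_word() {
    return static_cast<std::uint16_t>(1u << 15);
}

// Extracts the task ID field, control[14:11].
inline unsigned task_id_of(std::uint16_t control) {
    return (control >> 11) & 0xFu;
}
```

For example, activating task 1 gives the control word 0x0C00, since bit 10 contributes 0x0400 and the task ID 1 in bits [14:11] contributes 0x0800.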


4.1.3 Special Memory Control Register

This register does not belong to the general I/O of the Senior core. It is a special-purpose register, which is written by the DMA controller and read by the Senior core. By setting the corresponding bit in the register, the DMA controller notifies the Senior core which memory is currently being accessed.

Bits    Specification
[0]=1   The DMA controller is accessing DM 0.
[1]=1   The DMA controller is accessing DM 1.
[2]=1   The DMA controller is accessing PM.
[15:3]  Reserved

Table 4.5. Special memory control register specification

4.2

Memory Interface

The memory interface connects the DMA module to its slaves. Since the DMA module supports 16 input (source) ports and 16 output (destination) ports, 32 ports are needed in all. Table 4.6 shows the details of the memory interface needed for the DMA module.

Name         Width  DIR  Description
src0_data_i  32     I    Data input for Source Port 0.
src0_addr_o  16     O    Address output for Source Port 0.
src0_csn_o   1      O    Memory chip select enable for Source Port 0, active low.
src0_oe_o    1      O    Memory output enable for Source Port 0, active low.
src1                     Interfaces for Source Port 1.
...
src15                    Interfaces for Source Port 15.
dst0_data_o  32     O    Data output for Destination Port 0.
dst0_addr_o  16     O    Address output for Destination Port 0.
dst0_csn_o   1      O    Memory chip select enable for Destination Port 0, active low.
dst0_we_o    1      O    Memory write enable for Destination Port 0, active low.
dst1                     Interfaces for Destination Port 1.
...
dst15                    Interfaces for Destination Port 15.

Table 4.6. Memory interface of the DMA module

4.3

Behavior model of I/O

Since only one data I/O port is used both for configuring the DMA module and for writing DMA tasks, a protocol is needed to distinguish DMA configuration from task reception. Figure 4.1 illustrates the configuration flow of the DMA module.

Figure 4.1. DMA configuration

Here, PREAMBLE denotes the first control vector sent to the control register of the DMA module. Chapter 6 shows several examples of how to program the DMA controller.

4.4

Task Packet Specification

The task packet is used to set up a DMA transfer channel, both for normal DMA operation and for multiple transactions described by a linking table. Since the DSP core has a general I/O with a 16-bit data width, each data word of the task packet is also 16 bits wide.

A transaction is specified by configuring a channel. The configuration covers the source, the destination and the transaction itself. Generally, a basic channel configuration includes the following parameters:

• Task priority.

• Data size: the length of the data block.

• Data from: the name of the source port.

• Data to: the name of the destination port.

• The physical start address of the source port.

• The physical start address of the destination port.

• The endian behavior of the source port: big or little endian.

Besides the software configuration of the DMA transaction, DMA designers and DMA users also need to know the hardware specifications of a transaction:

• The maximum source clock rate.

• The maximum destination clock rate.

• Data width of the source port: 8 bits, 16 bits, 32 bits or 64 bits.

• Data width of the destination port: 8 bits, 16 bits, 32 bits or 64 bits.

• Data protocol of the source port: error check or not.

Table 4.7 shows a task packet that consists of 2 links, and Tables 4.8 to 4.12 explain each control vector. The length of the task packet depends on the number of links in the linking table.

Word  Fields (bit widths)
1     Number of Links (8b) | Task Priority (4b) | Task ID (4b)
2     SRC width (2b) | DST width (2b) | SRC proc (1b) | DST proc (1b) | SRC endian (1b) | DST endian (1b) | SRC rate (4b) | DST rate (4b)
3     Reserved (6b) | Source Port (5b) | Destination Port (5b)
4     Destination Address low part (16b)
5     Destination Address high part (16b)
6     Source Address 1 low part (16b)
7     Source Address 1 high part (16b)
8     Length of Link 1 (16b)
9     Source Address 2 low part (16b)
10    Source Address 2 high part (16b)
11    Length of Link 2 (16b)
...

Table 4.7. Task packet with 2 links

Name             Bits    Description
Number of Links  [15:8]  Specify the total number of links, up to 64.
Task Priority    [7:4]   Specify the priority of the task. (Not yet implemented)
Task ID          [3:0]   Specify Task ID.

Table 4.8. Control Vector 1

Name        Bits     Description
SRC width   [15:14]  Data width of the source port: 2'b00: 8 bits; 2'b01: 16 bits; 2'b10: 32 bits; 2'b11: 64 bits.
DST width   [13:12]  Data width of the destination port: 2'b00: 8 bits; 2'b01: 16 bits; 2'b10: 32 bits; 2'b11: 64 bits.
SRC proc    [11]     Whether the source port uses parity check: 1'b0: don't use; 1'b1: use.
DST proc    [10]     Whether the destination port uses parity check: 1'b0: don't use; 1'b1: use.
SRC endian  [9]      Endianness of the source port: 1'b0: little endian; 1'b1: big endian.
DST endian  [8]      Endianness of the destination port: 1'b0: little endian; 1'b1: big endian.
SRC rate    [7:4]    Clock rate of the source port: 4'b0000: clk; 4'b0001: clk/2; 4'b0010: clk/4; 4'b0011: clk/8; 4'b0100: clk/16; 4'b0101: clk/32; 4'b0110: clk/64.
DST rate    [3:0]    Clock rate of the destination port: 4'b0000: clk; 4'b0001: clk/2; 4'b0010: clk/4; 4'b0011: clk/8; 4'b0100: clk/16; 4'b0101: clk/32; 4'b0110: clk/64.

Table 4.9. Control Vector 2

Name              Bits     Description
Reserved          [15:10]  Reserved for future use.
Source Port       [9:5]    Specify the source port number.
Destination Port  [4:0]    Specify the destination port number.

Table 4.10. Control Vector 3

Name                           Bits    Description
Destination Address low part   [15:0]  Low 16-bit part of the destination address.
Destination Address high part  [15:0]  High 16-bit part of the destination address.

Table 4.11. Control Vector 4 & 5

Name                        Bits    Description
Source Address 1 low part   [15:0]  Specify the low 16-bit part of source address 1.
Source Address 1 high part  [15:0]  Specify the high 16-bit part of source address 1.
Length of Link 1            [15:0]  Specify the length of Link 1.

Table 4.12. Control Vector 6, 7 & 8
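The control vectors described above can be assembled in software before they are written to the Input Data port. The following sketch packs the fields in the order given by the task packet layout. The types `Link` and `TaskConfig` and the function `buildTaskPacket` are hypothetical names introduced for illustration, and the 0xFFFF start symbol used in Example 6.3 is deliberately not included.

```cpp
#include <cstdint>
#include <vector>

// One entry of the linking table (illustrative type).
struct Link {
    std::uint32_t srcAddr;  // 32-bit source start address
    std::uint16_t length;   // length of the link
};

// Channel parameters, using the field encodings of the control vectors.
struct TaskConfig {
    unsigned taskId = 0, priority = 0;          // 4 bits each
    unsigned srcWidth = 0, dstWidth = 0;        // 2-bit width codes
    bool srcParity = false, dstParity = false;
    bool srcBigEndian = false, dstBigEndian = false;
    unsigned srcRate = 0, dstRate = 0;          // 4-bit rate codes
    unsigned srcPort = 0, dstPort = 0;          // 5-bit port numbers
    std::uint32_t dstAddr = 0;
    std::vector<Link> links;
};

// Hypothetical helper: serialize a configuration into 16-bit packet words.
std::vector<std::uint16_t> buildTaskPacket(const TaskConfig& c) {
    std::vector<std::uint16_t> words;
    auto push = [&words](std::uint32_t v) {
        words.push_back(static_cast<std::uint16_t>(v & 0xFFFFu));
    };
    const std::uint32_t linkCount = static_cast<std::uint32_t>(c.links.size());
    // Control vector 1: number of links, priority, task ID.
    push(((linkCount & 0xFFu) << 8) | ((c.priority & 0xFu) << 4) | (c.taskId & 0xFu));
    // Control vector 2: widths, parity, endianness, clock rates.
    push(((c.srcWidth & 0x3u) << 14) | ((c.dstWidth & 0x3u) << 12) |
         (static_cast<unsigned>(c.srcParity) << 11) |
         (static_cast<unsigned>(c.dstParity) << 10) |
         (static_cast<unsigned>(c.srcBigEndian) << 9) |
         (static_cast<unsigned>(c.dstBigEndian) << 8) |
         ((c.srcRate & 0xFu) << 4) | (c.dstRate & 0xFu));
    // Control vector 3: source and destination port numbers.
    push(((c.srcPort & 0x1Fu) << 5) | (c.dstPort & 0x1Fu));
    // Control vectors 4 and 5: destination address, low part first.
    push(c.dstAddr);
    push(c.dstAddr >> 16);
    // Three words per link: source address low, high, then length.
    for (const Link& l : c.links) {
        push(l.srcAddr);
        push(l.srcAddr >> 16);
        push(l.length);
    }
    return words;
}
```

For a configuration with three links, 16-bit source and destination widths, source port 3 and destination port 4, the first three words come out as 0x0301, 0x5000 and 0x0064, matching the values loaded into r10, r11 and r12 in Example 6.3.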


Chapter 5

DMA Hardware

Generally, the DMA controller hardware can be divided into data path and control path [6, p.572]. Figure 5.1 shows the basic architecture of the DMA module.

Figure 5.1. DMA Hardware architecture

The DMA data path reads data from the source port using the source address generator, and stores data to the destination port using the destination address generator. In order to handle data with different data rates and formats, source decoding and destination decoding modules are also needed.

The DMA control path consists of the channel configuration FSM (Finite State Machine) and the transaction FSM. The DSP core can request the configuration of a channel. When the DMA is idle, the channel configuration FSM issues the channel to the transaction FSM module. The transaction FSM is responsible for controlling the data path. When the block has been transmitted, the channel configuration FSM generates an interrupt to the DSP core.

The following sections give more detailed information about the sub blocks of the DMA controller. Figure 5.2 shows the block diagram of the DMA controller with its main inputs and outputs.

Figure 5.2. DMA Controller Block Diagram

5.1

Host Interface

This is the interface between the Senior DSP core and the DMA controller. It stores the control vectors sent by the DSP core in registers inside the DMA controller and updates the status register, which can be read by the Senior DSP core.

5.1.1

Block Diagram

Figure 5.3 shows the block diagram of the Host Interface.

Figure 5.3. Block diagram of Host Interface Module

The input MUX selects the input I/O data based on the input I/O address. The task FIFO holds the task packet, which will be used by the transaction FSM. The output MUX outputs the desired data based on the I/O address.

5.1.2

Interface

Name                  Width  DIR  Description
clk_i                 1      I    Clock input.
rst_i                 1      I    Synchronous reset, active low.
io_data_i             16     I    Data input from Host interface.
io_addr_i             16     I    Address input from Host interface.
io_rd_strobe_i        1      I    Read strobe from Host interface.
io_wr_strobe_i        1      I    Write strobe from Host interface.
io_data_o             16     O    Data output to Host interface. (Reserved)
config_reg_addr_i     8      I    Read address for Task queue.
config_reg_addr_en_i  1      I    Read enable signal for Task queue.
config_reg_data_o     16     O    Task queue data output.
contrl_reg_o          16     O    DMA control register, output to transaction FSM.
status_reg_i          16     I    DMA status register, input from transaction FSM.

Table 5.1. Interface of Host Interface Module

5.2

Source Address Generator

This module generates the address for the source port. It is controlled by the transaction FSM.

5.2.1

Block Diagram

Figure 5.4 shows the block diagram of the source address generator.

Figure 5.4. Block diagram of Source address generator

Once the transaction FSM has decoded the task packet parameters into several control signals, it sends these signals to the source address generator. As shown in Figure 5.4, an adder inside the source address generator produces the output source port address. Two counters count how many words and how many links have been transferred, so that the end-of-link or end-of-transfer signal is asserted once the link or the whole transfer is finished.
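The interplay of the adder and the two counters can be illustrated with a small behavioral model. This is only a sketch under assumptions: the struct name, the method names and the exact cycle at which the end signals are raised are hypothetical and do not reproduce the RTL.

```cpp
#include <cstdint>

// Simplified behavioral model of an address generator with word and link counters.
struct SourceAddressGenerator {
    std::uint32_t addr = 0;       // current source address (adder output)
    std::uint16_t wordsLeft = 0;  // word counter for the current link
    std::uint8_t linksLeft = 0;   // link counter for the whole transfer
    bool endLink = false;         // models end_link_o
    bool endTransfer = false;     // models end_transfer_o

    // Load the total number of links at the start of a transfer.
    void startTransfer(std::uint8_t linkCount) {
        linksLeft = linkCount;
        endLink = false;
        endTransfer = (linkCount == 0);
    }

    // Load start address and length of the next link (set_addr_i asserted).
    void loadLink(std::uint32_t start, std::uint16_t length) {
        addr = start;
        wordsLeft = length;
        endLink = false;
    }

    // One enabled cycle: return the address for this word, then advance by step.
    std::uint32_t tick(std::uint8_t step) {
        const std::uint32_t current = addr;
        if (wordsLeft == 0) {
            return current;  // nothing left to transfer in this link
        }
        addr += step;
        if (--wordsLeft == 0) {
            endLink = true;
            if (linksLeft > 0 && --linksLeft == 0) {
                endTransfer = true;
            }
        }
        return current;
    }
};
```

In this model, a transfer with two links raises `endLink` after the last word of each link, and `endTransfer` only after the last word of the second link.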

5.2.2

Interface

Table 5.2 gives the interface detail of source address generator.

Name               Width  DIR  Description
clk_i              1      I    Clock input.
step_i             2      I    Address increment step.
enable_i           1      I    Enable address increment.
set_addr_i         1      I    Set start address.
end_link_o         1      O    Indicate the end of one link.
end_transfer_o     1      O    Indicate the end of transfer.
src_addr_i         32     I    Start address of the transfer.
src_length_i       16     I    Transfer length.
src_link_number_i  8      I    Total number of links.
src_addr_o         32     O    Source address output.

Table 5.2. Interface of Source address generator

5.3

Destination Address Generator

This module generates the address for the destination port. Its control signals are provided by the transaction FSM.

5.3.1

Block Diagram

Figure 5.5 shows the block diagram of the destination address generator.

Figure 5.5. Block diagram of Destination address generator

This module has the same structure as the source address generator. The only difference is that it does not need the counters for transferred words or links.

5.3.2

Interface

Table 5.3 gives the detailed interface description of destination address generator.

Name       Width  DIR  Description
clk_i      1      I    Clock input.
step_i     2      I    Address increment step.
enable_i   1      I    Enable address increment.
setaddr_i  1      I    Set start address.
addr_i     32     I    Start address of the transfer.
addr_o     32     O    Address output.

Table 5.3. Interface of Destination address generator

5.4

Source Decoder

This module decodes the incoming data based on the task packet provided by the transaction FSM. It will adapt the data into the internal data format which can be transferred through the channel.

5.4.1

Block Diagram

Figure 5.6 shows the block diagram of the source decoder.

Figure 5.6. Block diagram of Source decoder

The source decoder consists of several MUXs that decode the incoming data based on control signals provided by the transaction FSM. First, the input data is split into 8 bytes, then the MUXs select the right combination of data bytes to form the internal data format.
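The byte selection performed by the MUXs can be sketched as a function on a 64-bit word. The internal data format is not specified in detail here, so the sketch assumes it is little endian with unused upper bytes cleared, and it leaves out parity handling. The function name `decodeSource` is hypothetical.

```cpp
#include <cstdint>

// Width codes follow the SRC width field: 0 = 8 bits, 1 = 16, 2 = 32, 3 = 64.
// Assumption: the internal format is little endian, so big-endian input has its
// active bytes reversed, and bytes above the port width are cleared.
inline std::uint64_t decodeSource(std::uint64_t raw, unsigned widthCode, bool bigEndian) {
    const unsigned bytes = 1u << (widthCode & 3u);  // 1, 2, 4 or 8 active bytes
    std::uint64_t out = 0;
    for (unsigned i = 0; i < bytes; ++i) {
        // Pick the input byte lane that feeds output lane i.
        const unsigned srcLane = bigEndian ? (bytes - 1u - i) : i;
        const std::uint64_t byte = (raw >> (8u * srcLane)) & 0xFFu;
        out |= byte << (8u * i);
    }
    return out;
}
```

Each loop iteration plays the role of one byte-lane MUX: the select input is derived from the width and endian controls, and the data input is one of the 8 bytes of the incoming word.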

5.4.2

Interface

Table 5.4 gives the interface details of the Source decoder.

Name          Width  DIR  Description
clk           1      I    Clock input.
rst           1      I    Synchronous reset, active low.
src_width     2      I    Source data width.
src_parity    1      I    Source parity check.
src_endian    1      I    Source endian.
channel_din   64     I    Data input from source port.
channel_dout  64     O    Data output to channel FIFO.

Table 5.4. Interface of Source decoder

5.5

Destination Decoder

This module packages the internal data format into the data format specified by the task packet.

5.5.1

Block Diagram

Figure 5.7. Block diagram of Destination decoder

The destination decoder has a structure similar to that of the source decoder. The output MUX combines the internal data into the desired data format based on control signals provided by the transaction FSM.

5.5.2

Interface

Table 5.5 gives the detailed interface description of the Destination decoder.

Name          Width  DIR  Description
clk           1      I    Clock input.
rst           1      I    Synchronous reset, active low.
dest_width    2      I    Destination data width.
dest_parity   1      I    Destination parity check.
dest_endian   1      I    Destination endian.
channel_din   64     I    Data input from channel FIFO.
channel_dout  64     O    Data output to destination port.

Table 5.5. Interface of Destination decoder

5.6

Transaction FSM

This FSM controls all transactions based on the task packet provided by the DSP core. It receives the incoming task packet and saves it into the DMA internal registers. The transaction FSM then decodes the task packet based on the specification in Table 3.2 and issues control signals to the different sub blocks of the DMA controller to complete the DMA transaction. Figure 5.8 shows the finite state machine of the control logic.

Figure 5.8. Finite State Machine of the control logic

There are eight states in the transaction FSM in the current design. IDLE is the default state after the DMA controller is reset. Once the Senior core requests to configure the DMA controller, state CONFIG1 is entered, and the transaction FSM decodes the incoming common control vectors until it has finished the first 5 common control vectors. States CONFIG2_1, CONFIG2_2 and CONFIG2_3 then configure the source address and link length of the linking table. Once the channel is configured, state TRANS is entered and the DMA controller starts the data transfer. When the FSM receives the "end of link" signal, state WAIT is entered to wait for the configuration of the next transfer in the linking table. The FSM then repeats states CONFIG2_1, CONFIG2_2 and CONFIG2_3 to configure the channel. Once the "end of transfer" signal is detected, state FINISH is entered, the interrupt signal is sent to the Senior core and the status register is updated. The DMA controller then waits for the Senior core to respond either to the status register or to the interrupt signal.
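The state sequence described above can be written down as a next-state function. This is a sketch, not the RTL: the `Inputs` struct and its member names are hypothetical, each CONFIG2 state is assumed to take one step, and the return from FINISH to IDLE after the core responds is an assumption.

```cpp
// The eight states named in the text.
enum class State { IDLE, CONFIG1, CONFIG2_1, CONFIG2_2, CONFIG2_3, TRANS, WAIT, FINISH };

// Hypothetical condensed view of the conditions the FSM reacts to.
struct Inputs {
    bool configRequest = false;  // Senior core requests a configuration
    bool commonDone = false;     // the 5 common control vectors are decoded
    bool endLink = false;        // "end of link" from the address generator
    bool endTransfer = false;    // "end of transfer" from the address generator
    bool acknowledged = false;   // core responded to status or interrupt
};

// Next-state function following the prose description.
inline State nextState(State s, const Inputs& in) {
    switch (s) {
    case State::IDLE:      return in.configRequest ? State::CONFIG1 : State::IDLE;
    case State::CONFIG1:   return in.commonDone ? State::CONFIG2_1 : State::CONFIG1;
    case State::CONFIG2_1: return State::CONFIG2_2;  // source address, low part
    case State::CONFIG2_2: return State::CONFIG2_3;  // source address, high part
    case State::CONFIG2_3: return State::TRANS;      // link length, then transfer
    case State::TRANS:
        if (in.endTransfer) return State::FINISH;
        if (in.endLink) return State::WAIT;
        return State::TRANS;
    case State::WAIT:      return State::CONFIG2_1;  // configure the next link
    case State::FINISH:    return in.acknowledged ? State::IDLE : State::FINISH;
    }
    return State::IDLE;
}
```

In this sketch "end of transfer" takes priority over "end of link", since both are raised when the last word of the last link has been moved.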

5.6.1

Interface

Table 5.6 gives the detailed interface description of Transaction FSM.

Name               Width  DIR  Description
clk_i              1      I    Clock input.
rst_i              1      I    Synchronous reset, active low.
src_port_o         5      O    Source port number.
dst_port_o         5      O    Destination port number.
config_reg_data_i  16     I    Task packet data input.
contrl_reg_i       16     I    Control register data input.
config_reg_addr_o  8      O    Task packet read address.
config_addr_en_o   1      O    Task packet read enable.
status_reg_o       16     O    Status register data output.
src_addr_o         32     O    Start address of source port.
src_addr_en_o      1      O    Enable source port start address.
src_addr_incr      2      O    Increment step of source port.
enable_src_gen_o   1      O    Source address generator enable signal.
link_length_o      16     O    Length of current transfer link.
link_num_o         8      O    Total link number.
end_link_i         1      I    End of current link.
end_transfer_i     1      I    End of current transfer.
dst_addr_o         32     O    Start address of destination port.
dst_addr_en_o      1      O    Enable destination port start address.
dst_addr_incr      2      O    Increment step of destination port.
enable_dst_gen_o   1      O    Destination address generator enable signal.
src_rate_o         4      O    Source port data rate.
src_parity_o       1      O    Source port parity check.
src_endian_o       1      O    Source port endian.
dst_rate_o         4      O    Destination port data rate.
dst_parity_o       1      O    Destination port parity check.
dst_endian_o       1      O    Destination port endian.
src_csn_o          1      O    Source port chip select enable, active low.
src_oe_o           1      O    Source port output enable, active low.
dst_csn_o          1      O    Destination port chip select enable, active low.
dst_we_o           1      O    Destination port write enable, active low.

Table 5.6. Interface of Transaction FSM

Chapter 6

Integration

Since the DMA controller has to work together with the Senior DSP core, it needs to be integrated into the processor core. This chapter introduces the basic flow, which includes hardware integration and software integration.

6.1

Hardware Integration

The DMA controller works as a peripheral of the Senior DSP core. As introduced in Chapter 4 and Reference [7], the peripheral can be connected to any available GIO. In the following piece of code, the DMA controller is connected to I/O number 5. The Senior DSP system has other peripherals connected such as timer and interrupt controller.

The memory interface of the DMA controller should also be connected to the current Senior memory sub-system. The processor needs to know which memory is being accessed by the DMA controller, so that the processor core does not access the same memory module. Therefore, the Special Memory Control Register of the DMA controller should also be connected to the Senior core.

6.2

Software Integration

In order to make the verification of the DMA controller easier, a behavioral model of the DMA controller was also developed. It is therefore necessary to integrate the behavioral model into the simulator.

The behavioral model is written in C++. At first, the behavioral model was not exactly cycle accurate. After the simulation of the hardware implementation, the behavioral model was further tuned to meet the timing specification of the actual hardware.

The behavioral model should be compiled together with the Senior simulator. The DMA controller should be instantiated in the header file of the simulator, as in Example 6.1.

Example 6.1: Create DMA Behavioral Model in simulator

class Senior {
public:
    ...
    // ---
    // DMA
    // ---
    DMAController dma_controller;
    ...
};

In the Senior simulator, the DMA controller should be connected to the program memory and data memory in the constructor of Senior. It should be connected to a specific I/O address as well. The code is shown in Example 6.2.

Example 6.2: Connect DMA Controller

Senior::Senior() {
    ...
    dma_controller.cycle = &cycle;
    dma_controller.peripherals = &peripherals;
    dma_controller.pm[0] = &pm[0];
    dma_controller.pm[1] = &pm[1];
    for (int i = 0; i < 4; i++) {
        dma_controller.dm[i] = &dm[i];
    }
    ...
}

int SrSim::srmain(int argc, char** argv) {
    ...
    // Add DMA peripheral at I/O address 5
    fprintf(stdout, "Loading DMA peripheral at address 5.\n");
    addPeripheral(&(dma_controller), 5);
    ...
}

6.3

DMA Programming

This section lists sample code with which the Senior DSP core can program the DMA controller.

6.3.1

Initialize the DMA Controller

In Example 6.3, the Senior core configures the DMA controller with a task packet that contains 3 links, using its I/O instructions.

Example 6.3: Configure the DMA Controller

;; Define the address of DMA registers
;; DMA is connected to I/O 5
#define DMA_STATUS   0x05
#define DMA_CONTRL   0x45
#define DMA_OUT_DATA 0x85
#define DMA_IN_DATA  0xC5

.code

;;; DMA task 1
set r9,$FFFF ; start symbol, task package preamble

;;; number of link = 3, priority = 0, task ID = 1
set r10,$0301

;;; width = 16bit, endian = 0, src / dst rate = 1

set r11,$5000

;;; src port = 3, dst port = 4

set r12,$0064

out DMA_IN_DATA,r9

out DMA_IN_DATA,r10 ; write config vector to config fifo

out DMA_IN_DATA,r11

out DMA_IN_DATA,r12

set r10,$0010 ; dst addr low part

set r11,$0000 ; dst addr high part

out DMA_IN_DATA,r10

out DMA_IN_DATA,r11

;; link 1

set r10,$0000 ; src addr low part

set r11,$0000 ; src addr high part

set r12,32 ; link length = 32

out DMA_IN_DATA,r10

out DMA_IN_DATA,r11

out DMA_IN_DATA,r12

;; link 2

set r10,$0030 ; src addr low part

set r11,$0000 ; src addr high part

set r12,16 ; link length = 16

;; link 3


set r14,$0000 ; src addr high part

set r15,$40 ; link length = 64

out DMA_IN_DATA,r10
out DMA_IN_DATA,r11
out DMA_IN_DATA,r12
out DMA_IN_DATA,r13
out DMA_IN_DATA,r14
out DMA_IN_DATA,r15

;;; wait for channel configuration
task1_channel_config
in r1,DMA_STATUS
nop
and r1,$0002
sub r1,$0002
jump.ne task1_channel_config

;; start DMA task 1

;; write control register, start config channel ;; and start DMA transfer

set r1,0x8000 ; config a channel

set r2,0x0400 ; start DMA

out DMA_CONTRL,r1

out DMA_CONTRL,r2
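The four register addresses defined at the top of Example 6.3 follow a simple pattern: the simulator code in Appendix B selects the register from bits [7:6] of the I/O address, and the defines place the I/O number 5 in the lower bits. The sketch below makes that composition explicit. The names `DmaReg`, `ioAddress` and `regOf` are hypothetical, and treating the low six bits as the port number is an inference from the defines rather than a documented rule.

```cpp
#include <cstdint>

// Register select codes, matching the Addr column of Table 4.2.
enum class DmaReg : unsigned { Status = 0, Control = 1, OutData = 2, InData = 3 };

// Compose an 8-bit I/O address: register select in bits [7:6], port in [5:0].
constexpr std::uint8_t ioAddress(DmaReg reg, unsigned port) {
    return static_cast<std::uint8_t>((static_cast<unsigned>(reg) << 6) | (port & 0x3Fu));
}

// Recover the register select from an I/O address (same bits as gv(addr_in,6,2)).
constexpr DmaReg regOf(unsigned addr) {
    return static_cast<DmaReg>((addr >> 6) & 0x3u);
}
```

With the DMA controller on I/O number 5, `ioAddress(DmaReg::Control, 5)` gives 0x45 and `ioAddress(DmaReg::InData, 5)` gives 0xC5, the values of DMA_CONTRL and DMA_IN_DATA.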

6.3.2

Poll the DMA Controller

In Example 6.4, the Senior core will poll the status register of the DMA controller to check if the transfer is completed. If the transaction is done, the processor will go out of the loop and continue to do the other things.

Example 6.4: Poll the status of DMA Controller

;;; wait for DMA task 1 finish
task1_done
in r1,DMA_STATUS
nop
and r1,$0006
sub r1,$0006
jump.ne task1_done

;;; Start to do other things

6.3.3

Handle the DMA Interrupt

Example 6.4 shows a big disadvantage of polling the DMA controller: the processor cannot do anything but wait for the DMA controller to complete the transfer. Thus, it is necessary to handle the interrupt so that the processor core can do other things while the DMA controller is performing the transfer. In Example 6.5, the entry for the interrupt service routine (ISR) should be set according to the actual hardware connection. The flow of the interrupt can be described as:

[Interrupt Received] → [Push Flags] → [Push PC] → [PC = DM1[SPR(intaddr)]] → [Interrupt service routine] → [Instruction = RETI] → [Pop PC and Start Jump] → [Pop Flags]

Example 6.5: Handle the DMA Interrupt

.code

set sp, 0x7000 ; set the stack pointer

set intaddr, 0x0000 ; set interrupt BASE address (DM1)

set r0, INTERRUPT_ROUTINE

nop

st1 (0x0008), r0 ; store interrupt address 4 at BASE+8

jump MAIN_PROGRAM

INTERRUPT_ROUTINE

;;; Here is the interrupt service routine

reti

MAIN_PROGRAM


Chapter 7

Verification

After the hardware is completed, it is always important to verify the correctness of the designed hardware. In the semiconductor industry, it is critical to make sure the design is bug-free before tape-out, since the non-recurring engineering (NRE) cost of a tape-out in 0.13µm technology was more than 1 million USD in 2004 [10]. More recent technologies have even higher NRE costs.

7.1

Functional Verification

The functional verification of the DMA controller is based on the test bench of the Senior processor. The basic principle is to compare the output of the behavioral model of the DMA controller with the output of the RTL code simulation. If the results match, the designed hardware is considered correct. Otherwise the design has to be debugged.

Figure 7.1 shows the functional verification flow.

Figure 7.1. DMA Functional Verification Flow

Several test cases have been developed to increase the code coverage of the design. Currently, normal DMA operation, linking table operation and large block transfer with interrupt have been tested. The code coverage is 91.7%.

7.2

Hardware Implementation

For a hardware design, it is always exciting to implement the design in real hardware, either on FPGA or on ASIC. It is an honor that Professor Liu offered me the opportunity to turn my design into real hardware.

The FPGA implementation was targeted on Xilinx Virtex 4 FPGA while the ASIC implementation was targeted on Infineon 65nm CMOS technology.

The implementation was straightforward. The logic synthesizer translates the RTL code into a netlist based on the specific technology, either CMOS standard cells or FPGA cells. The backend tool then produces the layout based on the floorplan and the synthesized netlist. Some optimization is performed, during which the design hierarchy might be broken. Since the implementation covered the whole Senior system, only the results of the DMA module are discussed in Chapter 8.

Chapter 8

Conclusion

8.1

Achieved Results

8.1.1

DMA Benchmark

From the RTL simulation, we can see the timing diagram of the DMA controller when it is performing the transaction. The timing diagram is drawn in Figure 8.1 and Figure 8.2, respectively.

Note that the extra 4 cycles in Figure 8.2 between 2 links are used to configure the corresponding transfer parameter for the second link.

Figure 8.1. Timing diagram of basic DMA operation.

Figure 8.2. Timing diagram of linking table operation.

The DMA controller has also been synthesized in 65nm digital CMOS technology and implemented in a Xilinx Virtex 4 FPGA. Table 8.1 shows the results.

                        ST             Infineon      Xilinx
Target Technology       65nm CMOS      65nm CMOS     FPGA Virtex 4
                        (without mem)  (with mem)
Working Frequency       200 MHz        200 MHz       88 MHz
Estimated Gate Count    26595          18000         -
Number of Flip Flops    -              -             504
Number of 4 input LUTs  -              -             694
Estimated Power         4.18 mW        2.48 mW       Not Available

Table 8.1. Synthesis Result of DMA controller

Table 8.1 shows that the estimated gate count for the 65nm CMOS technology is relatively high. This is because a dual-port RAM with a depth of 256 words and a word width of 16 bits is used as the control FIFO in the DMA controller, and this memory was not optimized in this implementation but synthesized directly. If a memory cell is used in the synthesis, the gate count is 18000. The synthesis result for the FPGA implementation is quite comparable to the ASIC implementation without memory.

8.1.2

Comparison

Theoretically, the DMA controller should improve the efficiency of memory transfers, since it can read and write memory in a pipelined fashion, as shown in Figure 8.1. The processor core can also pipeline its memory reads and writes, but this costs extra registers and programming tricks. Moreover, when the transfer is large, such as tens of kilobytes, the limited number of available registers means that it can only be partially pipelined.

To compare the efficiency of the memory transfer, Table 8.2 lists the clock cycles the Senior core spends transferring a given set of data blocks.

Test case 1 includes three memory transfer tasks from and to different parts of the memory sub-system. Task 1 contains three links with 32, 16 and 64 data words respectively. The transfer is from memory port 3 to port 4. Task 2 is almost the same as task 1, except that the destination is memory port 5. Task 3 transfers 32 data words from memory port 4 to memory port 3.

Results            Without DMA,     Without DMA,       With DMA
                   no optimization  with optimization
Clock Cycle        1055             543                466
Code Size (Bytes)  212              548                488

Table 8.2. Results Comparison with and without DMA

The reader should keep in mind that the benchmark is only a way to estimate the actual performance. Performance benchmarks should always be collected on real-life applications such as FFT or DCT kernels, or even more complicated applications such as a complete JPEG decoder or MP3 decoder.
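The figures in Table 8.2 can be turned into ratios. The following worked calculation uses only the numbers from the table; the symbols S (cycle-count speedup of the DMA version) and R (code-size ratio of the DMA version) are introduced here for illustration.

```latex
\begin{align*}
S_{\text{no opt}} &= \frac{1055}{466} \approx 2.26, &
S_{\text{opt}} &= \frac{543}{466} \approx 1.17, \\
R_{\text{no opt}} &= \frac{488}{212} \approx 2.30, &
R_{\text{opt}} &= \frac{488}{548} \approx 0.89.
\end{align*}
```

For this test case, the DMA version is therefore about 2.3 times faster than the unoptimized software transfer and about 1.2 times faster than the optimized one, while its code is about 2.3 times larger than the unoptimized version and slightly smaller than the optimized one.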

8.1.3

Conclusion

The DMA controller can improve the memory transfer efficiency and makes it possible for the processor to do other things while the data transfer is being performed. There is no free lunch, however: this improvement is paid for with extra hardware cost and extra code size.

For some timing critical applications, it is almost impossible for the processor core to do both data calculation and data transfer. Thus, the DMA technique is preferred.

8.2

Future Work

As discussed in section 8.1.2, the actual improvement brought by the DMA controller should be measured on more complicated applications such as baseband kernel algorithms or multimedia kernel algorithms. This means that the DMA controller, together with the Senior processor core, should be implemented on either FPGA or ASIC to make a chip, and the whole application should be developed on that platform.

In order to support off-chip memory modules, an external memory interface should also be developed. That would possibly include the commonly used DDR DRAM interface and NAND Flash memory interface.

The behavioral model of the DMA controller is currently statically compiled into the Senior simulator. In order to protect the Intellectual Property (IP) and technical details of the Senior core, it would be better to compile it dynamically.

Bibliography

[1] TMS320C6000 DSP Enhanced Direct Memory Access (EDMA) Controller Reference Guide, March 2005. Literature Number:SPRU234B.

[2] Dave Comisky, Sanjive Agarwala, and Charles Fuoco. A Scalable High-Performance DMA Architecture for DSP Applications. In International Conference on Computer Design, pages 414–419, 2000.

[3] Steve Furber. ARM System-on-Chip Architecture. Addison-Wesley Professional, 2nd edition, August 2000.

[4] David J.Katz and Rick Gentile. Embedded Media Processing. Elsevier, September 2005.

[5] Phil Lapsley, Jeff Bier, Amit Shoham, and Edward A. Lee. DSP Processor Fundamentals: Architectures and Features. Wiley-IEEE Press, February 1997.

[6] Dake Liu. Embedded DSP Processor Design, Volume 2: Application Specific Instruction set Processors (Systems on Silicon). Morgan Kaufmann, June 2008.

[7] Markus Svensson and Thomas Österholm. Optimization and Verification of an Integrated DSP. Master's thesis, Linköping University, December 2008.

[8] Tongtong Wang. Design of High-performance DMA Controller for Multi-core Platform. Master's thesis, Linköping University, May 2006.

[9] Lars Wanhammar. DSP Integrated Circuits. Academic Press, 1st edition, May 1999.

[10] Kun-Cheng Wu and Yu-Wen Tsai. Structured ASIC, evolution or revolution? In Proceedings of the 2004 international symposium on Physical design, pages 103–106. ACM, 2004.


Appendix A

DMA Simulator C++

Header

#ifndef DMA_CONTROLLER_HPP
#define DMA_CONTROLLER_HPP

#include "support.hpp"
#include "peripheral.hpp"
#include "memory.hpp"
#include "data_memory.hpp"
#include <map>
#include <queue>
#include <stdlib.h>
#include <stdint.h>

#define DMA_LINK_NUM 64 // DMA linking table number
#define DMA_TASKQ_SIZE 3
#define DMA_PM1 0
#define DMA_PM2 1
#define DMA_DM0_1 2
#define DMA_DM0_2 3
#define DMA_DM1_1 4
#define DMA_DM1_2 5

struct Links_t {
    uint16_t srcAddrL;
    uint16_t srcAddrH;
    uint16_t length;
};

struct DMATask_t {
    uint8_t linkNumber;
    uint8_t taskPriority;
    uint8_t taskID;
    uint8_t srcWidth;
    uint8_t dstWidth;
    bool srcProtocol;
    bool dstProtocol;
    bool srcEndian;
    bool dstEndian;
    uint8_t srcRate;
    uint8_t dstRate;
    uint8_t srcPort;
    uint8_t dstPort;
    uint16_t dstAddrL;
    uint16_t dstAddrH;
    Links_t links[DMA_LINK_NUM];
};

struct DMAStatus_t {
    bool busy;
    bool chReady;
    bool finish;
    bool exception;
    bool queueFull;
};

struct DMAControl_t {
    bool reset;
    bool shutdown;
    bool dmaClock;
    bool start;
    uint8_t taskID;
    bool reqChConf;
};

class DMAController : public Peripheral {
public:
    cycle_T* cycle;
    std::map<unsigned int, Peripheral*>* peripherals; // connect to peripheral IO

    DMAController(void);
    ~DMAController(void);

    long ioCommunicate(unsigned int, unsigned long, unsigned long,
                       unsigned int, unsigned long);
    int GetInterrupt();
    int Step();

    // Program memory
    Memory *pm[2];
    // Data memory
    DataMemory *dm[4];

    unsigned long clockTag;

    void start(unsigned long cycle);
    void configChannel(unsigned long cycle);
    uint16_t getStatusReg(unsigned long cycle);
    uint16_t getControlReg();
    void setControlReg(uint16_t data);
    void putTaskQueue(uint16_t data, unsigned long cycle);
    void shutDown();
    void reset();

private:
    // DMA config
    DMAStatus_t _status;
    DMAControl_t _control;
    DMATask_t _task;
    uint16_t _statusReg;
    uint16_t _controlReg;
    uint16_t _taskQueue[DMA_TASKQ_SIZE][198]; // DMA task queue
    uint16_t _queuePtr;
    uint16_t _nextQueuePtr;
    uint16_t _taskPtr;
    uint32_t _taskRegAddr;
    std::queue<uint16_t> _taskQ;

    // Task queue function
    void _setTaskReg(uint32_t queID, uint32_t addr, uint16_t data);
    uint16_t _getTaskReg(uint32_t queID, uint32_t addr);

    // DMA data transfer function
    void _trans();
    uint32_t _transCycle();
    void _syncReg();
    void _syncTask();
};

#endif


Appendix B

DMA Simulator C++ Code

#include "dma_controller.hpp"
#include <stdio.h>
#include <stdlib.h>

// Extract a bit field of width 'bits' starting at bit position 'bitpos'.
static inline int gv(unsigned int insn, int bitpos, int bits) {
    return ((insn >> bitpos) & ((1<<bits)-1));
}

//---
// DMA peripheral I/O
//---
long DMAController::ioCommunicate(unsigned int addr_in, unsigned long data_in,
                                  unsigned long data_in2, unsigned int read_write,
                                  unsigned long cycle) {
    if (read_write == 1) {
        // Reading
        switch(gv(addr_in,6,2)) {
        case 0: // Status register
            return getStatusReg(cycle);
        case 1: // Control register
            return getControlReg();
        case 2: // Out port data to DSP core
            fprintf(stderr, "Warning: No data written to DSP core.\n");
            return -1;
        case 3: // In port from DSP core
            fprintf(stderr, "Warning: Can't read In port data.\n");
            return -1;
        default: // Unknown operation
            fprintf(stderr, "Warning: Unknown operation.\n");
            return -1;
        }
    } else if (read_write == 2) {
        // Writing
        switch(gv(addr_in,6,2)) {
        case 0: // Status register
            fprintf(stderr, "Warning: Trying to write read-only status register.\n");
            return -1;
        case 1: // Control register
            setControlReg((uint16_t)data_in);
            printf("DMA: Cycle(%lu), write DMA_CONTROL, value = 0x%04x.\n",cycle,(uint16_t)data_in);
            if (gv(data_in,0,1)) {
