
Acceleration and Integration of Sound Decoding in FPGA


Academic year: 2021


Department of Electrical Engineering

Master's thesis

Acceleration and Integration of Sound Decoding in FPGA

Master's thesis carried out in Electronics Systems at the Institute of Technology, Linköping University

by

Jesper Eriksson, Johan Holmér

LiTH-ISY-EX--11/4471--SE

Linköping 2011

Department of Electrical Engineering, Linköpings universitet, SE-581 83 Linköping, Sweden


Supervisor: Erik Lindahl, Actiwave AB

Examiner: Kent Palmkvist, ISY, Linköpings universitet

Division, Department: Division of Electronics Systems, Department of Electrical Engineering, Linköpings universitet, SE-581 83 Linköping, Sweden

Date: 2011-05-20

URL for electronic version: http://www.es.isy.liu.se, http://www.ep.liu.se

ISRN: LiTH-ISY-EX--11/4471--SE

Title: Accelerering och integrering av ljudavkodning i FPGA (Acceleration and Integration of Sound Decoding in FPGA)

Author: Jesper Eriksson, Johan Holmér

Abstract

The task has been to develop a network media renderer on an embedded Linux system running on a Spartan 6 FPGA. One of the challenges has been to make the best use of the limited FPGA area. MP3 has been the prioritised format. To achieve fast MP3 decoding, a MicroBlaze soft processor has been configured for speed with regard to the small area available. The software MP3 decoding process has also been accelerated with hardware. MP3 files at full quality (320 kbit/s) can be decoded in real time. Sound interface hardware has been designed to handle the decoded sound samples and convert them to the S/PDIF standard interface. UPnP commands have also been implemented in the MP3 player software to complete the renderer’s network functionality.

Keywords: hardware acceleration, digital signal processing, embedded systems, sound encoding

Summary

The task has been to develop a media player with network support on an embedded Linux system running on a Spartan 6 FPGA. One of the challenges has been to use the limited FPGA area efficiently. MP3 has been the main format in the work. To achieve fast MP3 decoding, a soft processor, MicroBlaze, has been configured for high performance with regard to the limited area. To speed up decoding further, hardware accelerators for MP3 decoding have been designed. MP3 at the highest quality (320 kbit/s) can be decoded in real time. Sound handling hardware has been designed that processes decoded sound samples and converts them to the S/PDIF standard. UPnP commands have also been implemented for the media player’s network support.


Acknowledgments

We would like to thank Actiwave for all their support and input to our work. They have been very helpful in all areas of this wide project. We also had a very good start to the work with a thorough introduction by the previous thesis workers, Axel Wiksten Färnström and Karl-Rikard Ländell. We would especially like to thank our main supervisor Erik Lindahl, although everyone at Actiwave has been involved in our thesis and deserves credit too. Thanks also to Kent Palmkvist for all his support and for always being available for supervision.

This has been a very interesting thesis work which has taught us a lot about sound encoding and complete system-on-chip design, from hardware design to computer architecture and software development.

Jesper Eriksson, Johan Holmér

Linköping 2011-05-20


Contents

1 Introduction
1.1 Background
1.2 Goal
1.3 Tasks

2 FPGA Development Kit
2.1 Overview
2.2 Spartan 6
2.3 Development Tools
2.3.1 Xilinx Platform Studio (XPS)
2.3.2 ISE
2.4 Board Bring Up

3 Hardware Platform
3.1 MicroBlaze Processor
3.1.1 Cache Memory
3.1.2 Barrel Shifter
3.1.3 Floating Point Unit (FPU)
3.1.4 Memory Management Unit (MMU)
3.1.5 Integer Multiplier
3.1.6 Integer Divider
3.2 Buses
3.2.1 Processor Local Bus (PLB)
3.2.2 Fast Simplex Link (FSL)
3.3 System Peripherals
3.3.1 Ethernet
3.3.2 Universal Asynchronous Receiver/Transmitter (UART)
3.3.3 Sound interface
3.3.4 Accelerator Unit I
3.3.5 Accelerator Unit II
3.3.6 Register space

4 Software Platform
4.1 PetaLinux
4.1.1 PetaLinux Configuration
4.1.2 Applications
4.2 Peripheral Software
4.2.1 FSL Software
4.2.2 PLB Software

5 MPEG-1/MPEG-2 Audio Layer III Decoding
5.1 Software Decoder
5.2 Decoder Profiling

6 Sound Interface
6.1 Pulse-code Modulation (PCM)
6.2 Sample buffer
6.3 Sony Philips Digital Interface (S/PDIF)
6.3.1 Biphase Mark Code
6.4 Sample Frequency Select

7 Accelerator Unit I
7.1 Overview
7.2 MicroBlaze Communication
7.3 FSL Interface Control
7.4 The Accelerator
7.5 FPGA Usage
7.6 Performance
7.7 Problems
7.7.1 Case 1
7.7.2 Case 2
7.7.3 Case 3
7.7.4 Summary and conclusion

8 Accelerator Unit II
8.1 Overview
8.2 MicroBlaze Communication
8.2.1 Bus Control and Registers
8.3 Interface
8.4 Hardware Implementation
8.5 Performance

9 Register Space
9.1 Purpose

10 Universal Plug and Play
10.1 Digital Living Network Alliance
10.2 Interprocess Communication
10.3 Application Usage

11 Results
11.1 Goal Fulfilment
11.2 Conclusion

12 Further Development and Improvements
12.1 Format Decoding
12.2 Accelerator I
12.3 Accelerator II
12.4 Software Efficiency
12.5 UPnP Compatibility
12.6 Wireless LAN

Bibliography

A Glossary

Chapter 1

Introduction

1.1 Background

DLNA is an industry standard for sharing media files over both wired and wireless LAN. Speaker systems designed by Actiwave today support DLNA for playback of sound data (MP3, WMA, FLAC etc.). The current system uses an external module for receiving and decoding DLNA data. This is an expensive module which requires a lot of area. This is why the possibility of replacing the module with an FPGA design is investigated.

In a previous master's thesis work performed at Actiwave, a Linux system was set up on an FPGA development kit. The system is capable of receiving DLNA data and decoding MP3-encoded sound data. Everything is built around a soft processor running embedded Linux. This system, however, is not fast enough to decode all sound data in real time. It also lacks an interface for passing sound data to other FPGA functional blocks.

Integrating the current functionality with hardware acceleration and a sound interface is the main objective of the thesis, allowing the system to decode sound data in real time and also send the data to other blocks before it eventually reaches the speaker.

1.2 Goal

The main goal of the thesis is to have a working DLNA Media Renderer connected to wired LAN with real-time decoding of sound data, implemented on a Xilinx Spartan 6 FPGA development kit.

1.3 Tasks

• Suggest parts of the decoding algorithms that are well suited for hardware acceleration

• Implement the hardware acceleration on the FPGA

• Modify the software decoding to make use of the hardware acceleration

• Integrate the system with other parts of the FPGA, i.e. feed decoded PCM samples to other FPGA functional blocks.

If time remains after the main tasks are fulfilled, there is also an extra assignment:

• Integrate the system with a Wireless LAN module and open source software drivers

Chapter 2

FPGA Development Kit

2.1 Overview

The hardware used in this thesis is a Spartan 6 FPGA SP601 evaluation kit. The SP601 is Xilinx’s base platform for developing FPGA designs on Spartan 6. Some of the SP601 base board features are:

• Spartan 6 FPGA

• Quad SPI Flash

• DDR2 Memory

• USB JTAG Download Port

• Serial UART USB Port

• 4 LEDs

• 4 Push Buttons

• 4 DIP Switches

• GPIO connector

• RJ45 Ethernet connector

Not all of these features are used. The JTAG connection is used to program the FPGA, the UART port is used for sending software to the board, and the GPIO pins are used for sending sound signals to devices outside the board. The LEDs are used for debugging purposes. The system uses the DDR2 memory as main memory, and this is where the operating system image is uploaded via UART. [17]

2.2 Spartan 6

The Spartan 6 series is built on a 45 nm low-power copper process technology. Spartan 6 is available in 13 different variants with a wide range of capacities. The FPGA used on the SP601 board is the XC6SLX16, one of the smaller variants. See Table 2.1 for a short feature summary.

Device     Logic Cells   Slices   Flip-Flops   DSP48A1 Slices   BRAMs
XC6SLX16   14,579        2,278    18,224       32               32

Table 2.1. FPGA resources for Spartan 6 model XC6SLX16

Logic Cells: The basic logic unit that the FPGA structure is based upon.

Slices: A slice consists of a few logic cells. Each Spartan-6 FPGA slice contains four LUTs (lookup tables) and eight flip-flops.

DSP48A1 Slices: These are special slices used for implementing parts of designs that use heavy arithmetic operations. Each DSP48A1 slice contains an 18 × 18 multiplier, an adder, and an accumulator.

BRAM: Block RAMs are memory units on the FPGA. Each Block RAM is fundamentally 18 Kb in size and the total amount of memory of all BRAMs is 32 × 18 = 576 Kb. It is possible to use each BRAM as two independent 9 Kb blocks. [16]

2.3 Development Tools

2.3.1 Xilinx Platform Studio (XPS)

XPS is a software tool used to configure and build the hardware specification of the embedded system. With XPS you can configure the processor core, buses, memory controller, peripherals, etc. Chapter 3 contains a description of the hardware configuration used in this thesis. With XPS you can import designs from other development tools and add them to the system. XPS is also the program used to download the complete system to the FPGA. In this thesis XPS 11.4 is used.

Figure 2.1 shows the software interface after the Base System Builder wizard has been completed. The Base System Builder is an easy way to set up the standard MicroBlaze environment with some basic peripherals of choice.

Figure 2.1. Xilinx Platform Studio

2.3.2 ISE

ISE is a software tool provided by Xilinx. It is used for synthesis and analysis of HDL designs. All designs in this thesis are implemented and analysed in ISE and then exported to XPS for integration. ISE includes a simulation program called ISIM that is used to run simulations of the designs. These simulations can be made with a behavioural model or a post route model. The post route model uses timing models based on the synthesis of the design. In this thesis ISE 11.4 with ISIM 11.4 is used.

2.4 Board Bring Up

To set up the operating system on the development platform, the operating system has to be integrated with the hardware devices. This requires many configurations and settings and can be difficult to achieve without a proper guide. PetaLogix, the company that develops the operating system, has provided a good tutorial for starting up a basic system on Xilinx FPGA development kits. Since PetaLogix products support Xilinx platforms, there is an easy way to make a simple system work without any prior knowledge of embedded systems. You just follow the guide step by step according to your current system. [5]

Chapter 3

Hardware Platform

3.1 MicroBlaze Processor

The processor used in the system is called MicroBlaze. The MicroBlaze processor is a 32-bit RISC Harvard architecture soft processor, provided by Xilinx. The processor is highly flexible for specific needs and FPGA architectures. Simply by enabling a checkbox you can add a whole new hardware module, which enables more assembler instructions for the compiler to use.

Figure 3.1 shows the Xilinx Platform Studio MicroBlaze configuration window. As you can see it is very easy to customize the processor to your preferences. All decisions are a matter of trading area for performance. [18]

3.1.1 Cache Memory

Since the processor is of Harvard architecture type it has separate instruction and data memory. That means the cache is divided into an instruction cache and a data cache. Both instruction and data memory are divided into two parts, one which is cacheable and one which is not. The size of the cache memory is configurable by the user when generating the processor hardware. In the system used in this thesis 8 kB is used for instruction cache and 8 kB is used for data cache. The cache is implemented in block RAMs and increases performance significantly. [15]

For best performance, it is recommended to use as much cache as possible once the amount of block RAM needed by the rest of the system has been determined.

3.1.2 Barrel Shifter

The barrel shifter is a computing unit that performs shifting operations. It can shift a 32-bit data word by a specified number of bits. The barrel shifter instruction takes 2 clock cycles regardless of the number of bit positions to shift. It is optional to include the barrel shifter in the MicroBlaze processor, but this unit does not use up much of the FPGA resources and it significantly improves the performance. Therefore it is recommended that the barrel shifter is used. [15]

Figure 3.1. MicroBlaze processor configuration

3.1.3 Floating Point Unit (FPU)

The MicroBlaze floating point unit is a single precision unit based on the IEEE 754 standard. This includes definitions for infinity, zero and not-a-number (NaN). Operations supported are addition, subtraction, multiplication, division, comparison, conversion and square root. Rounding is always done to the nearest representable value, and status bits for underflow, overflow, divide-by-zero and invalid operation are available. [15]

Floating point numbers are calculated using Equation (3.1). The single precision standard uses 32 bits for a floating point number, mapped according to Figure 3.2. The base (B) is implicit in the 32-bit word and set to 2 in this standard. The fraction (M) in reality consists of 24 bits, but the first bit is an implied '1'. The other 23 bits represent the bits to the right of the binary point ('1.M' is 24 bits in total). The exponent (E) is represented in biased form, meaning that 127 is added to the actual exponent to get the 8-bit binary representation. The remaining bit decides the sign. [9]

$$\text{floating point number} = M \cdot B^{E} \tag{3.1}$$

Figure 3.2. Floating point representation
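The bit layout described above can be tried out with a short C sketch. It is only an illustration under stated assumptions: the names f32_fields, f32_split and f32_value are hypothetical, the host is assumed to store float as an IEEE 754 single precision word, and only normalised numbers are handled (not zero, infinity, NaN or denormals).

```c
#include <stdint.h>
#include <string.h>

/* Fields of a 32-bit single precision word (hypothetical helper type). */
typedef struct {
    unsigned sign;      /* 1 bit: 0 = positive, 1 = negative */
    unsigned exponent;  /* 8 bits, stored with a bias of 127 */
    uint32_t fraction;  /* 23 stored bits to the right of the binary point */
} f32_fields;

/* Split a float into sign, biased exponent and stored fraction bits. */
f32_fields f32_split(float x)
{
    uint32_t w;
    f32_fields f;
    memcpy(&w, &x, sizeof w);        /* reinterpret the bits without aliasing issues */
    f.sign = (unsigned)(w >> 31);
    f.exponent = (unsigned)((w >> 23) & 0xFFu);
    f.fraction = w & 0x7FFFFFu;
    return f;
}

/* Rebuild the value of a normalised number as 1.M * 2^(E - 127), with sign. */
double f32_value(f32_fields f)
{
    double v = 1.0 + (double)f.fraction / 8388608.0;  /* implied '1' plus M / 2^23 */
    int e = (int)f.exponent - 127;                    /* remove the bias */
    while (e > 0) { v *= 2.0; e--; }
    while (e < 0) { v *= 0.5; e++; }
    return f.sign ? -v : v;
}
```

For example, -6.5 is -1.625 · 2^2, so f32_split(-6.5f) gives sign 1, a stored exponent of 2 + 127 = 129 and fraction bits 0x500000.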

The floating point unit adds significant area to the processor, but can dramatically increase the performance of floating point intense applications. For complete details of the simplifications and specifications of the processor FPU, see the MicroBlaze reference guide. [15]

Even though as little area as possible should be used by the system, it is recommended that this unit is used for best performance. The system uses the basic version of the Xilinx FPU intellectual property core (IP core).

3.1.4 Memory Management Unit (MMU)

In the architecture used, the MicroBlaze processor is built with a memory management unit in virtual mode. This means that all addresses are translated by the virtual memory management hardware to a physical address. By using the MMU, programs and data can be relocated anywhere in the physical address space. For example, inactive programs can be moved out of the physical address space when space is required by active programs. This gives the impression that there is more memory available than is actually implemented.

The virtual memory management also provides control over memory protection, enabling memory to be protected from unauthorized access. Protection and relocation enable the operating system to support multitasking, which is essential in the system. The MMU must be enabled in virtual mode to be able to run PetaLinux, and this requires the processor to be optimized for performance, leading to greater area used by the system. [15]

3.1.5 Integer Multiplier

This unit provides hardware support for integer multiplication. With this unit enabled, more assembler instructions are available for multiplication operations, which speeds up multiplications. It can be set to 64-bit or 32-bit operation. The 32-bit option is used since the system rarely uses large numbers. When enabled, this unit uses DSP48 slices. Since there is no shortage of these slices and the unit boosts performance, it is recommended that it is enabled.

3.1.6 Integer Divider

This unit provides hardware support for integer division. With this unit enabled, more assembler instructions are available for division operations, which speeds up divisions. When enabled, the MicroBlaze uses slightly more area, but it is recommended that this unit is enabled because of the performance increase.

3.2 Buses

Below is a description of the two bus systems used in the system.

3.2.1 Processor Local Bus (PLB)

The PLB is the standard MicroBlaze bus in the system and is used by all peripherals except the two accelerators and the sound interface. It is a standard address bus with support for 32-bit or 64-bit data width. The system uses 32-bit data width because MicroBlaze is a 32-bit processor. The PLB consists of bus control and gating logic, a central bus arbiter and bus OR/MUX structures. The arbiter controls all bus operations and decides which unit has access to the bus. This is done by comparing the priority of the request signals of the different PLB masters competing for the bus. The PLB bus system supports up to 16 masters and 16 slaves. [14] [12]

3.2.2 Fast Simplex Link (FSL)

The FSL is an alternative bus to the PLB and is used by the two accelerator peripherals and the sound interface. The FSL is a 32-bit wide direct link between registers inside the MicroBlaze processor and other hardware on the FPGA. An FSL connection consists of a FIFO, a slave interface and a master interface. The MicroBlaze supports up to 16 FSLs and the FIFO depth can be specified between 1 and 8k. The FSL also has an optional extra bit called the control bit that can be used for multiple purposes. If enabled, it can be used as an extra data bit, as a status flag, etc. The FSL interface has status signals indicating whether there is data in the FIFO and whether the FIFO is full. If you read from the FIFO when it is empty, the value read is undefined. The FSL does not use addresses like the PLB, and the MicroBlaze registers are accessed with dedicated instructions. The FSL is often used for high-speed communication because it is fast and simple. Since it does not use addresses and there is no waiting for the bus to become available, as can be the case with the PLB, the FSL is often faster. The FSL also supports an asynchronous FIFO mode, which means that the slave side and the master side can be clocked at different rates. The FSL link can be implemented in either LUTs or BRAMs. [13] [15]

3.3 System Peripherals

Below is a list of the peripheral hardware used by the system.

3.3.1 Ethernet

The Ethernet module provides the network access used by the Universal Plug and Play (UPnP) protocol (see Chapter 10). It is intended to be replaced by a wireless network connection in the final system, so it is primarily used for development purposes.

3.3.2 Universal Asynchronous Receiver/Transmitter (UART)

The UART is used for operating system image transfer. In other words, it is used only for development purposes and not for the media renderer system functionality.

3.3.3 Sound interface

To be able to monitor the sound with a sound device (sound card), the sound needs to be fed out of the system. Since the software produces PCM samples (see Section 6.1), the S/PDIF format is used (see Section 6.3). This module has a sample buffer, written to by the software. The S/PDIF module reads a pair of samples from the buffer every sample clock. It is of great importance that the software keeps the buffer from running empty, which would cause a glitch in the output sound.

3.3.4 Accelerator Unit I

Accelerator hardware for the software function imdct_l. For more detailed information see Chapter 7.

3.3.5 Accelerator Unit II

Accelerator hardware for the software function sub_dct. For more detailed information see Chapter 8.

3.3.6 Register space

The register space is attached to the MicroBlaze Processor Local Bus and the physical address is mapped to a pointer in the Linux environment. It contains various status registers used by the system, for example the sample rate divider used by the S/PDIF unit. The divider value is determined with respect to the sample rate of the sound currently played and can be set to correspond to 48000 Hz, 44100 Hz or 32000 Hz. For more detailed information see Chapter 9.

Chapter 4

Software Platform

4.1 PetaLinux

PetaLinux is an operating system designed for embedded systems and it is targeted towards MicroBlaze and PowerPC processor architectures. The main advantage of using the system is the support for many types of Xilinx IP cores. [7]

4.1.1 PetaLinux Configuration

PetaLinux and the Linux kernel can be customized to the needs of the specific applications and hardware architectures. This has many advantages, one of them being the possibility to select the minimum set of modules, hardware support and drivers needed. This leads to a minimal operating system and saved storage space. The platform configuration is done in four areas:

• Vendor/Product Settings

• Kernel Settings

• Vendor/User Settings

• System Settings

Product settings means choosing a hardware platform configuration. After building the hardware, a script is run that copies the configuration data of the hardware to a platform setting, which is then selected in product settings. This is how, among other things, the PetaLinux build process knows what instruction set it has available for building the operating system. It also learns the physical addresses of the peripherals present. It is important to repeat this every time a significant change in the hardware platform is made. The MicroBlaze processor is highly configurable, leading to a widely varying set of available instructions. Every peripheral added to the PLB also needs to be known by the software. This configuration window is shown in Figure 4.1.

Figure 4.1. Main configuration

Kernel settings is where kernel modules and drivers for inclusion in the system are chosen. In the decoding system, MMU support and drivers for Ethernet networking have to be included to make the base system work, but otherwise there is a lot more to choose from. In future development (see Chapter 12) the system will need a wireless module connected to it and possibly other external peripheral components. For these purposes support for SPI, USB, parallel port, IEEE 1394 (FireWire), memory cards (MMC/SD) and much more can be included. File system support is also available under kernel settings. This configuration window is shown in Figure 4.2.

User settings is where the shell commands and applications to include in the system are added. Applications include file system applications, network applications and miscellaneous applications. Miscellaneous applications is where the MP3 decoding application is found. There is also a wide range of other applications, not only for sound decoding. The main interest for the future is the applications for other formats included in the operating system. System settings include boot image build settings and network address settings, which are essential for the media renderer since all sound data is transferred over network protocols. The system is set to obtain an IP address automatically. This configuration window is shown in Figure 4.3.

Figure 4.2. Kernel configuration

In addition to all included applications and kernel modules, custom made C/C++ applications can also be installed. There is a dedicated directory where all applications’ source code is put. A script creates the folder structure and a makefile template, which is then completed with the necessary information. After the make command has executed correctly, run make romfs to install the application in the current build. It is all quite simple, and the provided scripts are very useful when configuring a tailored system. [6]

4.1.2 Applications

Previous thesis work has provided a UPnP application that is installed as a user application outside of the PetaLinux configuration. [6]

4.2 Peripheral Software

Below is a description of the software required to communicate with the accelerators and with the sound interface.

4.2.1 FSL Software

To use the accelerators and the sample buffer with the complete Linux system it is necessary to be able to put data into the slave FSL and get data from the master FSL in C code. This is done with the in-line assembler instructions put and get:

    asm volatile("put %0,rfslX" : : "r"(putvar));
    asm volatile("get %0,rfslX" : "=r"(getvar) : );

The variables given as arguments are either written to the FSL (putvar) or receive data from the FSL (getvar). X indicates which MicroBlaze FSL port the instruction operates on. The assembly instructions used are blocking, which means that the processor will stall if a write cannot be performed due to a full FSL FIFO or if a read cannot be performed due to an empty FSL FIFO. It is therefore necessary to make sure that a complete accelerator cycle always consumes the required amount of data without stalling and that the accelerator always writes the expected amount of data back. Otherwise the processor will be stalled permanently and the system will stop working. [15]
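A minimal sketch of how the blocking put and get instructions might be wrapped for one accelerator cycle is shown below. Everything except the two instructions is an assumption made for illustration: the wrapper names fsl0_put and fsl0_get, the function accel_run, the choice of FSL port 0 and the word limit ACCEL_MAX_WORDS are hypothetical, and the sketch assumes the compiler defines __MICROBLAZE__ when targeting the processor. When built on another machine, a small loop-back FIFO that doubles every word stands in for the accelerator so that the calling pattern can be tried off-target.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical limit: words per cycle must fit in the FSL FIFO. */
#define ACCEL_MAX_WORDS 16

#if defined(__MICROBLAZE__)
/* Blocking FSL access on port 0, using the instructions from the text. */
static void fsl0_put(uint32_t v)
{
    __asm__ volatile ("put %0, rfsl0" : : "r"(v));
}
static uint32_t fsl0_get(void)
{
    uint32_t v;
    __asm__ volatile ("get %0, rfsl0" : "=r"(v));
    return v;
}
#else
/* Host stand-in (assumption): a loop-back FIFO that doubles each word. */
static uint32_t fifo[ACCEL_MAX_WORDS];
static size_t fifo_head, fifo_tail;
static void fsl0_put(uint32_t v)
{
    fifo[fifo_head++ % ACCEL_MAX_WORDS] = 2u * v;
}
static uint32_t fsl0_get(void)
{
    return fifo[fifo_tail++ % ACCEL_MAX_WORDS];
}
#endif

/* One accelerator cycle: write n words, then read exactly n results.
 * Returns 0 on success, -1 if n exceeds the limit. Reading more words
 * than the hardware produces would stall a blocking get forever. */
int accel_run(const uint32_t *in, uint32_t *out, size_t n)
{
    size_t i;
    if (n > ACCEL_MAX_WORDS)
        return -1;
    for (i = 0; i < n; i++)
        fsl0_put(in[i]);
    for (i = 0; i < n; i++)
        out[i] = fsl0_get();
    return 0;
}
```

Keeping the number of words written and read in one place, as accel_run does, is one way to make it harder to leave the processor waiting on an empty or full FIFO.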

4.2.2 PLB Software

The S/PDIF module in the sound interface supports different sample rates. The dividing factor used to achieve the correct sample rate is sent via the PLB and stored in a register space that the S/PDIF module can access. To send data over the PLB, a pointer has to be mapped to the address that the register space uses. To do this a function called map_peripherals was written. This function uses the C function mmap. mmap takes several arguments concerning read/write access etc., but most importantly it takes the physical base address and the memory size. The physical base address and the memory size can be found in XPS. mmap uses these to map the pointer’s virtual address to the real physical address of the register. With this done it is possible to write data using the pointer in C, and the MMU will handle the addresses so that the real physical address is accessed correctly.
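The mapping step can be sketched as follows. This is not the thesis implementation: only the name map_peripherals and the use of mmap come from the text, while the signature, the use of a device file such as /dev/mem and the error handling are assumptions. The device path is passed as a parameter so that the sketch can also be tried against an ordinary file.

```c
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Map `size` bytes starting at offset `base` of the device `dev`
 * (assumed to be "/dev/mem" on the target, with `base` the page-aligned
 * physical base address taken from XPS). Returns a pointer to 32-bit
 * registers, or NULL on failure. */
volatile uint32_t *map_peripherals(const char *dev, off_t base, size_t size)
{
    void *p;
    int fd = open(dev, O_RDWR | O_SYNC);
    if (fd < 0)
        return NULL;
    p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, base);
    close(fd);  /* the mapping stays valid after the descriptor is closed */
    if (p == MAP_FAILED)
        return NULL;
    return (volatile uint32_t *)p;
}
```

On the target, a call such as map_peripherals("/dev/mem", REG_BASE, REG_SIZE), with REG_BASE and REG_SIZE as placeholder names for the values read from XPS, would return a pointer through which a register such as the sample rate divider can be written like an ordinary array element. The volatile qualifier keeps the compiler from optimising the register accesses away.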

Chapter 5

MPEG-1/MPEG-2 Audio Layer III Decoding

The MPEG-1/MPEG-2 Audio Layer III (MP3) is a popular digital audio lossy compression format which has become a de facto standard. It is able to greatly reduce the information needed to reproduce, to most listeners, a faithful copy of the original recording. Considerable quality is reached at 1/11 of the raw sound data size. The format became an ISO/IEC standard in 1991. Some of the techniques used in compression are:

• Huffman coding

• Quantization

• Modified discrete cosine transform (MDCT)

For decompression/decoding the inverse procedure is done, so the most complicated and computationally intensive part of the decoding is expected to be the inverse MDCT (IMDCT). Huffman coding and quantization are less suitable for hardware acceleration. These techniques use many table values, and storing these in FPGA hardware would take too much area. It is better for the system to keep them in the DDR2 memory than to create the tables in LUTs or BRAMs, which can be used for better purposes.

There is a set of fixed bitrates available in the MP3 standard, where 320 kbit/s is the highest data rate. In other words, if the system is able to decode sound data at 320 kbit/s it will maintain real-time playback for all MP3 files. It is also important to consider the network overhead when streaming data over the UPnP protocol. In the initial investigation the goal is to speed up the decoding process as much as possible to increase the time slack available. [4]
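The real-time requirement can be put into numbers with a small C sketch. It assumes MPEG-1 Layer III framing (1152 samples per frame and a frame length of 144 · bitrate / sample rate bytes, ignoring the padding byte) and a 44.1 kHz stereo stream, none of which is stated in the text above; the function names are hypothetical.

```c
/* MPEG-1 Layer III: every frame decodes to 1152 samples per channel. */
enum { SAMPLES_PER_FRAME = 1152 };

/* Playback time of one frame in microseconds at sample rate fs (Hz).
 * The decoder must finish a frame, on average, within this time. */
double frame_time_us(unsigned fs)
{
    return 1e6 * SAMPLES_PER_FRAME / fs;
}

/* Size of one frame in bytes at the given bit rate (bit/s), without padding. */
unsigned frame_bytes(unsigned bitrate, unsigned fs)
{
    return 144u * bitrate / fs;
}

/* Bit rate of uncompressed PCM, for comparison with the MP3 bit rate. */
unsigned pcm_bitrate(unsigned fs, unsigned bits, unsigned channels)
{
    return fs * bits * channels;
}
```

Under these assumptions a 320 kbit/s stream at 44.1 kHz delivers frames of 1044 bytes, each of which has to be decoded within about 26.1 ms. Uncompressed 16-bit stereo at 44.1 kHz is 1411.2 kbit/s, so the ratio of about 1/11 mentioned earlier is consistent with a 128 kbit/s stream, although the text does not name that bit rate.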

5.1 Software Decoder

The software decoder used is provided as an application in the PetaLinux release. The work was started by investigating other open source decoding applications to find the best one with respect to compatibility and platform optimization. At first, about 10 different open source decoders were looked at, tested and the source code read. The decoders tested were Amp-0.7.6, Amp-1.1, FreeAmp, mp3play, Cool Edit MP3 Decoding Filter, mpg123, LAME, Maplay 1.2+ and Mpeg3Play. Many open source code libraries could be found at mp3-tech.org. [10]

In the end the application called mp3play, included in the operating system release, was chosen, mainly because:

• The application is bundled with the operating system, meaning it is compatible and well tested in the actual environment

• It is written with respect to embedded systems, making the most use of available hardware

• The library consists of well separated decoding functions, making acceleration of single functions simpler

• The output is written as PCM samples, which is exactly the type of data the system is to forward to other units

• The application can play sound data pointed to by a URL, an essential feature in a network based system

Basically, this application has all features necessary to be able to complete the whole media renderer chain, from network to speakers.

5.2 Decoder Profiling

To find out which parts of the software are especially demanding for the processor, it is useful to profile the software. Profiling measures the time spent in software functions and subroutines, which makes it possible to see which parts consume the largest amount of time. However, the Xilinx profiling tools cannot be used under an operating system, so an alternative way to analyse software performance is necessary. Since the system has real-time requirements it was decided to time the functions using the value of the system clock with a precision down to 1 µs. The internal function time is accumulated over the playback of an entire MP3 file. It is then compared to the total time measured without the internal timers present. This is because the timer value handling takes up a lot of time in the decoding application and would give a significantly larger total value, which is misleading. The internal values are assumed to be correct because all accumulation and variable declarations are placed outside the interval between the timestamps being analysed.
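The timing approach described above could look roughly like the following C sketch. It is an illustration rather than the code used in the thesis: gettimeofday is assumed as the 1 µs clock source, and the names prof_t, prof_call, prof_percent and demo_work are hypothetical. The accumulation is done after the second timestamp has been taken, so it is not included in the measured interval.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/time.h>

/* Accumulated time and call count for one profiled function. */
typedef struct {
    uint64_t total_us;
    unsigned long calls;
} prof_t;

/* Current system time in microseconds. */
static uint64_t now_us(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (uint64_t)tv.tv_sec * 1000000u + (uint64_t)tv.tv_usec;
}

/* Run fn(arg) and add the elapsed time to p. */
void prof_call(prof_t *p, void (*fn)(void *), void *arg)
{
    uint64_t t0 = now_us();
    uint64_t t1;
    fn(arg);
    t1 = now_us();
    p->total_us += t1 - t0;  /* bookkeeping happens outside the timed interval */
    p->calls++;
}

/* Share of a separately measured total run time, in percent. */
double prof_percent(const prof_t *p, uint64_t run_total_us)
{
    if (run_total_us == 0)
        return 0.0;
    return 100.0 * (double)p->total_us / (double)run_total_us;
}

/* Placeholder for a decoder function (hypothetical): counts its calls. */
void demo_work(void *arg)
{
    ++*(int *)arg;
}
```

The total passed to prof_percent would be taken from a run without the timers, in line with the comparison described above.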


It is also important to know that the results differ considerably depending on which platform the profiling is performed on. The same type of profiling was done on the development board with the MicroBlaze system and then on the workstation computers, with very different results. The two most demanding functions (the same for both tests) switched order between the platforms, showing that the platform's architecture has a large impact on which type of calculation is more efficient. It is also necessary to make sure that the measured function does not contain calls to other functions. If it does, the measurements will not show the ’leaf’ function times, which are the interesting ones. A time consuming function may in fact be a very simple one that calls other, more demanding subroutines.

| Function | Time coverage |
| IMDCT long | 39 % |
| Sub DCT | 25 % |
| Huffman | 13 % |
| Dequantization | 6 % |
| Antialias | 4 % |
| IMDCT short | 1 % |
| Reorder | 0.1 % |
| Total coverage | 88.1 % |

Table 5.1. Profiling results at 320 kbit/s

| Function | Time coverage |
| IMDCT long | 37 % |
| Sub DCT | 27 % |
| Huffman | 10 % |
| Dequantization | 6 % |
| Antialias | 4 % |
| IMDCT short | 0 % |
| Reorder | 0 % |
| Total coverage | 84 % |

Table 5.2. Profiling results at 192 kbit/s

The results from the profiling are shown in Tables 5.1 and 5.2. The system used is the one with the best performance achievable on the targeted FPGA. This is motivated by the fact that the highest achievable overall performance is preferred in the final system, even with the accelerators present. Note that a full performance analysis of all functions was not done, only of the library functions believed to take a lot of time. It would have been inefficient to put counter variables in all decode library functions, so some ’smart’ choices were made by investigating the source code. If a big function was missed, it would not make a big difference. Two of the analysed functions still make up the main part of the total time, so accelerating these properly has a large impact on the total decoding time. In addition, the system has area requirements that prevent implementing too much custom hardware. Accelerating the top two functions is therefore good enough.

Once the most time consuming functions and the proportions between them are known, the next step is to analyse whether it is more efficient to compute the data in custom hardware. Just because a software routine consumes a lot of time does not mean that it is suitable for hardware acceleration. It was discovered, however, that both of the two top functions contain many multiplications and additions with cosine and sine values. Both are related to parts of the IMDCT, so it was decided that custom designed logic had the potential to speed up the decoding, since the functions contain many arithmetic operations on a lot of different data.


Chapter 6

Sound Interface

6.1 Pulse-code Modulation (PCM)

PCM is a method used to represent sampled analog signals digitally. Every sample is given a value that is proportional to its analog value. It is standard in many applications such as CDs and digital telephone systems. In the MP3 decoder used in this system every sample value is represented by a 16-bit two's complement value. Figure 6.1 shows an analog signal converted to PCM data with 4-bit resolution.

Figure 6.1. PCM with 4-bit two's complement

6.2 Sample buffer

The system’s sample buffer consists of an FSL bus connected to a port on the MicroBlaze processor. Samples are written to the buffer one channel at a time, so a pair of stereo samples is written in two assembler instructions. If the sound is mono, the mono sample is written twice so that it reaches both sound channels. In other words, it is assumed that the receiving module always interprets the data as stereo samples. The FSL control bit is set high to indicate a right channel sample and low to indicate a left channel sample.
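The write order described above can be illustrated with a small Python sketch. The function name `fsl_words` is hypothetical, and the choice of writing the left sample before the right one is an assumption; the text only states that the control bit distinguishes the channels.

```python
def fsl_words(samples, stereo=True):
    """Yield (data, control) pairs in the order they would be written to the FSL.

    control = 0 marks a left-channel sample, control = 1 a right-channel sample.
    For stereo input, samples is a list of (left, right) tuples; for mono input
    it is a list of single values, each of which is written twice.
    """
    for s in samples:
        left, right = s if stereo else (s, s)
        yield (left & 0xFFFF, 0)   # 16-bit two's complement, left channel (assumed first)
        yield (right & 0xFFFF, 1)  # same format, right channel
```

For example, one stereo pair `(1, -1)` becomes the two words `(0x0001, 0)` and `(0xFFFF, 1)`, and a mono sample `5` becomes `(5, 0)` followed by `(5, 1)`.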

For testing purposes, an S/PDIF (see Section 6.3) module is used which feeds the signal to a GPIO pin. This pin is connected to an external computer sound card, so that the sound data can be validated by listening to the sound. It is important that the PCM sound samples are correct after the custom hardware has been connected, so the sound is compared with the sound from an unmodified system. Logic has been added that puts two samples in registers, one for the left and one for the right sample. These registers are loaded by the S/PDIF module every time the sample enable signal goes high, which it does for one clock cycle at the rate of the specified sample frequency.

Since this is a real time system the buffer can not be empty at any time during playback. An empty buffer would cause the playback of the sound to stop. If the decoding system is slightly too slow the output sound will experience glitches caused by the DC level output by the S/PDIF module when no new samples are registered. The decoding process is done in blocks of 1152 samples, which are then written to any type of sound device as PCM samples. This means that a total of 1152 · 2 = 2304 data words are to be written to the buffer in every decoding block. The maximal time for the decoding is determined by Equation (6.1).

$$t_{max} = \frac{2304}{f_s} \qquad (6.1)$$

If the buffer does not contain this amount of samples the system will fail to maintain real time decoding. It is important to write all decoded samples before continuing to process the next decoding block, so that the buffer can feed data to the sound interface at the same time as the application decodes the next block. If the buffer is full the system stalls, but as long as the system is able to decode a block in less time than it takes the S/PDIF module to read 2304 samples, the playback speed requirement can be met.

An FSL FIFO depth of 3072 samples (512 · 6) is used. It is implemented in BRAMs, which is done by ticking a checkbox in the FSL configuration in Platform Studio. Otherwise the buffer would be far too area consuming to be synthesized in LUTs.

6.3 Sony Philips Digital Interface (S/PDIF)

S/PDIF is a development of the AES3 standard. AES3 is a digital audio standard that is often called AES/EBU. It was developed by the Audio Engineering Society (AES) and the European Broadcasting Union (EBU). AES3 is mostly used in professional applications, and S/PDIF is a commercial version of the AES3 standard. Sony and Philips were the primary designers, and the standard is very common for sending digital audio in consumer products. Table 6.1 shows some specifications for AES3 and S/PDIF.


| | AES3 | S/PDIF |
| Interface | Balanced | Unbalanced |
| Connector | XLR3 | RCA |
| Impedance | 110 Ω | 75 Ω |
| Output Level | 2-7 Vp-p | 0.5 Vp-p |
| Max Current | 64 mA | 8 mA |
| Min Input | 0.2 V | 0.2 V |
| Cable | STP | COAX |
| Max Distance | 100 m | 10 m |

Table 6.1. AES3 and S/PDIF specifications.

Table 6.1 makes it clear that S/PDIF is a commercial variant of AES3. AES3 uses XLR3 connectors, which are often used in professional sound equipment, while S/PDIF uses standard RCA connectors for home electronics. The voltage levels for S/PDIF are lower and the maximum distance is 10 m, which is enough for most consumers but not for larger professional installations. AES3 also uses a balanced interface, which is good for rejection of external noise but not very important in a small system. [8]

6.3.1 Biphase Mark Code

In the S/PDIF standard the sound samples are sent serially using biphase mark code. S/PDIF supports up to 24 bits per channel. The system used in this thesis uses two channels. The biphase mark code works as follows: First a preamble is sent which consists of three logical ones in a row. Then, for sending a logical 1, the signal switches value every clock phase. When sending a logical 0 the signal switches every second clock phase, see Figure 6.2. One advantage with this channel

Figure 6.2. Channel coding with biphase mark code


The sample frequency can be derived from the signal by the unit that receives it. S/PDIF supports many different sample frequencies but through the biphase mark code all information about the frequency is in the signal’s switching activity. [11]
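The channel coding of the data bits can be sketched in Python as below. This is a minimal illustration of biphase mark coding in general, not the thesis's S/PDIF module: preambles and the rest of the S/PDIF frame structure are left out, and the function names are hypothetical.

```python
def biphase_mark_encode(bits, level=0):
    """Encode data bits as biphase mark code.

    Each data bit becomes two half-bit cells. The line level always toggles
    at the start of a bit; for a logical 1 it toggles again in the middle.
    Returns the list of half-cell levels (0 or 1).
    """
    cells = []
    for bit in bits:
        level ^= 1          # transition at every bit boundary
        cells.append(level)
        if bit:
            level ^= 1      # extra mid-bit transition encodes a 1
        cells.append(level)
    return cells


def biphase_mark_decode(cells):
    """Recover data bits: a bit is 1 when its two half-cells differ."""
    return [int(cells[i] != cells[i + 1]) for i in range(0, len(cells), 2)]
```

Because there is a transition at every bit boundary regardless of the data, a receiver can recover the bit clock, and thereby the sample frequency, from the signal's switching activity alone.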

6.4 Sample Frequency Select

The sample frequency select is a peripheral connected to the PLB of the system’s MicroBlaze processor. The S/PDIF module has an input for a clock divider. This input is wired to a register in the sample frequency select peripheral to set the sample rate. For more information see Section 9.2.


Chapter 7

Accelerator Unit I

7.1 Overview

The first accelerator unit replaces the calculations done by the subroutine imdct_l in the software. This function takes 296 32-bit input values. This subroutine is doing a part of the calculations included in the IMDCT. It basically calculates the sum of many products and shifts the result. This is done 18 times with two output values calculated each round making it a total of 36 output values.

7.2 MicroBlaze Communication

The accelerator communicates with the MicroBlaze processor with two FSL links, one with MicroBlaze as master and one with the accelerator as master. An overview of this can be seen in Figure 7.1. The output FSL link is 36 words deep so that it can contain all the output values. The input FIFO depth is 256 due to reasons explained in section 7.7, especially 7.7.3.

Figure 7.1. Graph illustrating the communication flow


7.3 FSL Interface Control

The accelerator uses 296 input values, which is quite a lot. It would take a long time to first fill the input FSL FIFO with 296 values and only then begin the computation. To speed up the process, the accelerator works continuously while the values are being sent from the MicroBlaze, so the calculations run in parallel with the MicroBlaze putting values into the input FSL FIFO. This is done by using a clock enable signal for the accelerator, so that it only works when there is data in the input FIFO. When the last value is sent from the MicroBlaze the computations are almost done. Just a few clock cycles later the last output is calculated and all the output values are available in the output FSL FIFO. For the input FSL the control bit is used as a start flag: whenever an FSL control bit is detected, the computation starts. For the output FSL link the control bit is not used. The FSL interfaces are controlled by a simple state machine with two states, idle and active. In the active state data is read from the input FSL, data is written to the output FSL and the clock enable signal is set when there is data to work with. The state machine switches to the active state when a control bit is detected and returns to the idle state when all output values have been written to the output, see Figure 7.2.

Figure 7.2. Graph of the FSM controlling the FSL communications

7.4 The Accelerator

The accelerator is basically a MAC unit with a shifting function. The calculations are done serially, with one input value clocked in at a time. Each input value is multiplied with a constant and the product is added to, or subtracted from, the result from the previous cycle. So every cycle one product is added to the accumulator, see Equation (7.1).

new_result = last_result ± (input × constant) (7.1)

It is also possible to multiply the input with the result from the last cycle, and the result of each cycle can be right-shifted. These functions are needed to perform the second stage of the calculations. Each of the 18 rounds of calculations is different. The constants are different, and whether products are added to, or subtracted from, the result differs each round. The calculations done are described in Equations (7.2), (7.3) and (7.4).

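$$temp = ((X_0 \times C_0 \pm X_1 \times C_1 \pm \dots \pm X_{11} \times C_{11}) \pm X_{12}) \gg 14 \qquad (7.2)$$

$$O_0 = (temp \times X_{13}) \gg 14 \pm X_{14} \qquad (7.3)$$

$$O_1 = (temp \times X_{15}) \gg 14 \pm X_{16} \qquad (7.4)$$

where X are input values, C are constants and O are output values.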

As can be seen in Equations (7.3) and (7.4), the value temp is used to calculate two output values. When calculating an output the value of temp is lost, since the accumulator only contains the latest value each cycle. Calculating temp twice would take a long time. To speed up the calculations the value of temp is stored in a small memory so that it is available for the calculation of the second output value. The memory, processing elements and control logic are managed by a control unit. This unit is a state machine with 296 states, each with a set of control signals to the processing elements and control logic. This unit can be made smaller and less resource demanding, see Section 12.2.
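One round of the calculation in Equations (7.2) to (7.4) can be sketched in Python as below. This is an illustration under stated assumptions, not the hardware design: the per-round signs and constants are not given in the text, so `signs` and `c` are placeholders, and the shift in (7.3) and (7.4) is assumed to be applied to the product before the last term is added. The name `mac_round` is hypothetical.

```python
def mac_round(x, c, signs, shift=14):
    """Compute one round (two outputs) following Equations (7.2)-(7.4).

    x     : 17 integer inputs X0..X16
    c     : 12 integer constants C0..C11
    signs : dict of +1/-1 choices (placeholders, the real signs differ per round)
    """
    acc = 0
    for k in range(12):                              # serial multiply-accumulate
        acc += signs["mac"][k] * x[k] * c[k]
    temp = (acc + signs["x12"] * x[12]) >> shift     # kept in memory for reuse
    o0 = ((temp * x[13]) >> shift) + signs["x14"] * x[14]
    o1 = ((temp * x[15]) >> shift) + signs["x16"] * x[16]
    return o0, o1
```

Storing `temp` once and reusing it for both outputs mirrors the small memory described above; Python's `>>` on integers is an arithmetic shift, so negative intermediate values keep their sign.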

7.5 FPGA Usage

The accelerator is designed to use only a small part of the FPGA's resources, since the MicroBlaze processor itself uses a lot of the FPGA. The accelerator consists of one adder, one multiplier, one shifter, a small memory and a control unit. The most resource demanding part of the calculations is the multiplication. The MicroBlaze processor only uses 4 of the 32 available DSP48 slices. By synthesizing the multiplier needed for this accelerator in DSP slices many LUTs are saved. The input FSL FIFO is synthesized in BRAMs to save LUTs, and the output FIFO is synthesized in LUTs since it only has a depth of 36.

7.6 Performance

The decoding time when using the accelerator is the same as when decoding using only the software. Surprisingly, there is no performance increase even though the theoretical performance increase is huge.

The software decodes an MP3 file in 80 % of the allowed time for a file with full quality (320 kbit/s). A song with a length of 165 seconds is thus decoded in 165 × 0.8 = 132 seconds. Of all the time spent decoding, about 38 % is spent in the imdct_l function. This means that 132 × 0.38 ≈ 50 seconds are spent in this function. In an MP3 file with a length of 165 seconds the imdct_l function is called 623810 times. This means that every run of imdct_l takes 50/623810 s ≈ 80 µs.


The hardware only takes 336 clock cycles to perform this function: 300 cycles for sending and calculating in parallel and 36 for sending back the data after the calculations are done. With the system running at 83 MHz, 336 clock cycles take 336 × 1/83000000 s ≈ 4 µs.

This is many times faster than the software. It is, however, only possible to reach 336 clock cycles if the accelerator is given input data every clock cycle, since it stalls when there is no input data. According to the MicroBlaze reference document [15], the instruction that puts data in the FSL and the instruction that gets data from the FSL each take one clock cycle. However, in-line assembly is used and it is not known exactly how the compiler handles that. For some reason the data transfer speed is many times slower than optimal, which slows down the accelerator. One input value is sent approximately once every 20th clock cycle. This may be due to interrupts, from Ethernet and timer, occurring during the subroutine. It is the software that is the bottleneck in this design. For ideas about improving the performance see Section 12.4.
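The arithmetic in this section can be reproduced with a short Python calculation. All numbers are taken from the text; the last estimate, which multiplies the 296 input values by the observed rate of roughly one value per 20 clock cycles, is a back-of-the-envelope addition and not a measurement from the thesis.

```python
CLOCK_HZ = 83e6            # system clock frequency
SONG_SECONDS = 165         # length of the test song
DECODE_FRACTION = 0.80     # share of the allowed time used by the software decoder
IMDCT_SHARE = 0.38         # share of the decoding time spent in imdct_l
IMDCT_CALLS = 623810       # number of calls to imdct_l for the test song
INPUT_WORDS = 296          # input values per call
OBSERVED_CYCLES_PER_WORD = 20

decode_time = SONG_SECONDS * DECODE_FRACTION          # 132 s
sw_per_call = decode_time * IMDCT_SHARE / IMDCT_CALLS  # software time per call
hw_ideal = (300 + 36) / CLOCK_HZ                       # one input word per cycle
hw_observed = INPUT_WORDS * OBSERVED_CYCLES_PER_WORD / CLOCK_HZ

print(f"software: {sw_per_call * 1e6:.1f} us per call")
print(f"hardware, ideal transfer: {hw_ideal * 1e6:.2f} us per call")
print(f"hardware, observed transfer rate: {hw_observed * 1e6:.1f} us per call")
```

With the observed transfer rate, the estimate lands close to the software time per call, which is consistent with the measured lack of speed-up.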

7.7 Problems

The accelerator is not working exactly as intended. In the first round after every system start, all 36 calculated output values are correct. In all following rounds, the last 12 output values contain errors. The problem can have many causes, but the most probable are:

1. The accelerator itself has a design flaw that causes this problem.

2. The clock enable signal is not working properly, causing undefined or old values to be read from the input FIFO.

3. The FSL link is faulty or does not behave as expected.

7.7.1 Case 1

The accelerator's functionality has been tested in simulations, both behavioural and post-route. In the simulations everything works perfectly for many sequential runs. It is hard to imagine anything from a previous run interfering with a following one, since the accelerator only has one memory cell. This cell, however, is used in the calculation of every output value. If this memory cell were not reset properly every round, all values in the following round would be faulty. This is not the case, since the first 24 of the 36 values are correct.

7.7.2 Case 2

This part of the design, the communication with the FSL input FIFO and the clock enable signal, has been tested in simulations as well. The clock enable signal is set according to the FSL status signals. It is, however, harder to test this part since the whole link, from the MicroBlaze master interface to the accelerator slave interface, can not be simulated. A test bench has been used which simulates the behaviour of the FSL control signals. This test bench has been compared to the FSL data sheet [13] and all signals seem to behave correctly. In simulations the clock enable signal works fine and no undefined values are read from the input FIFO.

Another way to test whether the clock enable signal is the problem is to remove the use of the signal. To do this, the input FIFO is first filled with all input values and then the accelerator is started. This tests the functionality of the accelerator when the clock enable signal never goes low, since there is never a need to wait for data. The FSL control bit can not be used as a start flag in this case, because as soon as it is detected by the slave interface the accelerator starts running, and the purpose of the test is to wait. A software register was used as start flag instead, and it was written to via the PLB after the FIFO was filled. This test never gave any useful results, however. All values were wrong and it never worked properly. It is hard to know why, but it may have something to do with the newly added register used as start flag. Maybe the accelerator got out of phase with the input data. Many tests were made where the data was shifted one step forward or back to get in phase, but it never worked properly.

7.7.3 Case 3

It is hard to imagine what could be wrong with the FSL link. The FSL interface is a part of the MicroBlaze processor and is a verified IP from Xilinx. The problem can not be caused by data left in the FIFO between rounds. Since it is a FIFO any remaining data from a round would be clocked out next round corrupting the values of the next round. The first 24 values however, are correct so this is not the case. It is hard to test the FSL link.

The FIFOs can be synthesized in BRAMs or LUTs. The functionality is the same, but they are synthesized in different places of the FPGA. When these different options are tested on the input FIFO, however, the results differ. If LUTs are used all values are wrong, and if BRAMs are used the problem is as stated before: some values are wrong when running many rounds of the calculations. For the output FIFO there is no difference when changing the synthesis options. Maybe this error is caused by a defective BRAM unit. This could be investigated by testing the system on another board.

The FIFO depth also causes some problems. The depth of the input FIFO should not matter. As soon as there is data in the FIFO it is clocked out by the slave interface. The slave interface can clock out one value each clock cycle and the instruction that puts values in the FIFO from the MicroBlaze takes 1 clock cycle [15]. This means that the FIFO is never full, so a depth of 1 should be enough. A few different depths have been tested and the results differ. With a depth of 8 all values are wrong, and with a depth of 256 the problem is as stated before, with errors when running many subsequent rounds.

7.7.4 Summary and conclusion

It is hard to determine where the problem lies. The complete system goes through the synthesis process without any errors or serious warnings and all timing requirements are met.

It is unlikely that the problem is in the accelerator itself. It consists of basic calculating elements and control logic and has been tested thoroughly in simulations. This unit is also the easiest one to make a good test bench for, since it has few input signals and these are easy to simulate realistically.

Some things point to the FSL link being the problem, but other units in the design use the FSL as well. The sound interface uses an FSL link with a FIFO depth of 3072 without problems, and accelerator II uses FSL links without any problems. Still, the FSL is behaving very strangely, maybe due to hardware errors.

The difference between accelerator I and the other units using FSL links is that it uses a clock enable signal. The clock enable functionality may be the cause of the trouble, since it is hard to simulate and to verify off-chip that the clock enable signal works correctly together with the FSL control signals. Also, the test without the clock enable never worked properly, so this part is not as thoroughly tested as the accelerator itself. The FSL, however, could have been the reason why these tests were unsuccessful.


Chapter 8

Accelerator Unit II

8.1 Overview

The second accelerator unit replaces the calculations performed in the subroutine sub_dct. The input data consists of 16 words (32-bit). Resulting output data is also 16 words, which are written back to the same memory location as the input data. Input data is processed in four computational stages in series. Each stage consists of 8 additions, 8 subtractions and 8 multiplications computed in parallel.

8.2 MicroBlaze™ Communication

The accelerator unit communicates with the MicroBlaze processor using two instances of the Fast Simplex Link (FSL), as shown in Figure 8.1. One instance operates with the processor as master and the accelerator as slave, and the other instance goes in the reverse direction. Each of the FSL buses has a 16 words deep FIFO in order to be able to contain all input/output data without depending on the other unit reading, and thus emptying, the FIFO to make room for more data. This eliminates the risk of the FSL being a bottleneck in the design. Since this is a fairly small amount of data, the FIFOs are synthesized using LUTs instead of BRAMs to make better use of the memory resources in the FPGA. To further reduce the area the FSL control bit is disabled in the instances, since this accelerator does not use it.

Figure 8.1. Xilinx Platform Studio - System block schedule

Figure 8.2. Schematic for accelerator communication

8.2.1 Bus Control and Registers

The bus control and intermediate registers act as a bridge between the FSL and the accelerator, as shown in Figure 8.2. The registers are called ’intermediate’ because all intermediate results between the computational stages are stored in them. Both the accelerator and the FSL have read and write access to the registers, but the FSL has higher priority for writing. The registers are also used to store the serial input data from the FSL in parallel.

This is done because the accelerator demands a data rate of 1 word/clock, so all data must have arrived before the accelerator is started. The bus control expects 16 words every run, and they are not required to arrive at any specific rate. After all required data has been received from the FSL, the start signal is automatically sent to the accelerator block. When the accelerator has finished processing it raises a status signal to the bus control. Then all data is written back to the processor via the FSL.

8.3 Interface

The interface to the accelerator consists of a serial interface for input and output data. The reading and writing to intermediate registers is controlled by the select signals rd_sel and wr_sel. The select signals are represented in one-hot encoding. This means that only one bit at a time is ’1’ and the other bits are ’0’. To be able to control all 16 registers the select signals have to be 16 bits wide. The status flag finished and the control signal start are active high signals. A complete description of the signal interface is shown in Table 8.1.


| Signal name | Direction | Type | MSB | LSB |
| clk | in | std_logic | | |
| reset | in | std_logic | | |
| data_in | in | std_logic_vector | 31 | 0 |
| data_out | out | std_logic_vector | 31 | 0 |
| rd_sel | out | std_logic_vector | 15 | 0 |
| wr_sel | out | std_logic_vector | 15 | 0 |
| start | in | std_logic | | |
| finished | out | std_logic | | |

Table 8.1. Interface for Accelerator Unit II
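The one-hot encoding of the 16-bit select signals can be illustrated with a short Python sketch. The helper names `one_hot` and `selected_register` are hypothetical and only model the encoding, not the VHDL implementation.

```python
def one_hot(index, width=16):
    """Return the select word with only bit `index` set."""
    if not 0 <= index < width:
        raise ValueError("register index out of range")
    return 1 << index


def selected_register(sel, width=16):
    """Return the register index of a valid one-hot select word."""
    if sel == 0 or sel & (sel - 1) or sel >> width:
        raise ValueError("select word is not one-hot")
    return sel.bit_length() - 1
```

For example, selecting register 4 gives the word 0x0010, and a word with two bits set is rejected as invalid.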

8.4 Hardware Implementation

To save area the accelerator is based on a simple computational element which is used to serially compute all output values. The computational element takes two 32-bit words as input arguments, $I_k$, and produces two 32-bit word outputs, $O_k$. The expressions for calculating the output values are shown in Equations (8.1) through (8.3). In order to complete all calculations needed, this unit is used for 32 separate pairs of input data divided into the 4 computational stages mentioned in Section 8.1. Indexes i and j are unique for each calculation.

$$O_0 = I_0 + I_1 \qquad (8.1)$$

$$O_1 = \cos_{ij} \cdot (I_0 - I_1) \qquad (8.2)$$

$$\cos_{ij} = \frac{1}{2 \cdot \cos\left(\frac{j}{i} \cdot \pi\right)} \qquad (8.3)$$

The flow of the computation is controlled by a microprogrammed control unit, which feeds the multiplication coefficient and multiplexer control signals to the computational chain. The control unit also handles all intermediate register reading and writing. Program flow is shown in Figure 8.3.

The most demanding operation in this computation is the multiplication. When this is synthesized in the FPGA it is preferred to utilize the DSP slices for maximum performance. The MicroBlaze processor only uses 4 out of the total 32 DSP blocks found on the Spartan 6 FPGA, which leaves plenty of DSP resources for the accelerators without consuming area that could be needed elsewhere. Since the DSP blocks are 18x18 multipliers and the accelerator signals are 32 bits wide, this leads to a cascade of DSP blocks in the synthesized design.

To reduce the critical path of the operation, Xilinx Core Generator is used to generate the multiplier. This allows the design to have pipeline registers before, after and in-between DSP instances, providing a significant improvement in speed at the cost of increased latency in clock cycles. This accelerator uses a total latency of 3 clock cycles and consists of 4 cascaded DSP instances. This means there will be a maximum of 2 instances in series in the critical path.

Figure 8.3. Graph illustrating hardware data flow

The processor FPU can not be used for floating point computations, since the accelerator is custom logic outside of the processor. This means that multiplications have to be performed with a fixed precision in the hardware. A fixed point multiplication with 16 fractional bits was chosen. This is the same precision as used in the software decoding, meaning the data does not lose any precision compared to the original decoding. The input data is 32 bits wide and the multiplication coefficient is 20 bits wide to be able to represent all coefficients correctly. The coefficients are computed according to Equation (8.3) and then shifted left by 16 bits for the fixed point conversion. The result after the multiplication is 32 + 20 = 52 bits wide, and it is shifted down by 16 bits (arithmetic right shift) to get the correct result since the coefficient was shifted left, see Equation (8.4).

$$(x \cdot (\cos_{ij} \cdot 2^{16})) \cdot 2^{-16} = x \cdot \cos_{ij} \qquad (8.4)$$

In reality the shift after the multiplication is not performed. Instead the custom multiplier just picks the wanted 32 bits from the 52 bit result, which circumvents the work of doing a shift and then also converting output data to correct data width. The coefficients are stored as shifted numbers, which eliminates the need of extra hardware for this computation since it is not needed to know the original coefficient values.
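The fixed point scheme can be sketched in Python as below. This is a minimal illustration, assuming that Equation (8.3) reads as 1/(2·cos(j/i·π)) and that coefficients are rounded to the nearest integer after scaling; the function names are hypothetical. In the hardware the shift is replaced by selecting the wanted bits of the product, which gives the same value as the arithmetic right shift used here.

```python
import math

FRAC_BITS = 16  # number of fractional bits in the fixed point format


def coefficient_q16(j, i):
    """Coefficient of Equation (8.3), scaled by 2**16 and rounded (rounding is an assumption)."""
    value = 1.0 / (2.0 * math.cos(j / i * math.pi))
    return round(value * (1 << FRAC_BITS))


def butterfly(i0, i1, coef_q16):
    """Equations (8.1) and (8.2) with a fixed point coefficient."""
    o0 = i0 + i1
    o1 = ((i0 - i1) * coef_q16) >> FRAC_BITS  # arithmetic shift undoes the 2**16 scaling
    return o0, o1
```

For j = 1 and i = 3 the coefficient is 1/(2·cos(π/3)) = 1, which is stored as 65536, so `butterfly(5, 3, 65536)` returns the plain sum and difference `(8, 2)`.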

Pipelining has a very good impact on overall performance since the computational pipeline is always full during operation, and the critical path is reduced significantly, leading to a higher maximum system clock frequency. When the pipeline is always full the system still reaches approximately 1 operation/cycle, but at a much higher frequency. If the pipeline had to be emptied during operation, or if it were not possible to feed the input data at a rate of 1 word/cycle, the performance would degrade and it would be necessary to investigate which amount of pipelining is optimal.

| Function time without acceleration | Function time with acceleration | Total decoding time |
| 9.038317 s | 6.464558 s | 38.72463 s |

Table 8.2. Performance measure at test sound decoding

8.5 Performance

On the intended platform the accelerator reduces the total decoding time by about 6.6%. The measured performance is shown in Table 8.2. If the total time is cut by 6.6% and the decoding function represents 25% of the total decoding time, this implies that the function itself is accelerated by approximately 27%. This is not as good as expected from some simple approximate calculations. It is therefore useful to estimate the total accelerator time including bus overhead, assuming that the FSL put instruction is performed in one clock cycle according to [15], and keeping in mind that the system is not optimized for area.

The accelerator computing flow takes no more than 90 clock cycles. If the time for writing input data and reading output data is added (16 FSL instructions each), the total is 16 + 90 + 16 = 122 cycles. For a system clock frequency of 83 MHz this corresponds to a time of 122/(83·10^6) s ≈ 1.5 µs. When the real time spent in the subroutine is measured it is above 20 µs. This means that there is some bottleneck still present in the system. One reason could be that instructions are added when the system is built. For example, the put instruction needs the value to be present in one of the processor's general purpose registers.
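The figures in this section can be checked with a few lines of Python. The numbers come from the text and from Table 8.2; treating the total decoding time in the table as the reference for the 6.6 % saving is an assumption, since the table does not state which run it belongs to.

```python
CLOCK_HZ = 83e6                                 # system clock frequency

cycles = 16 + 90 + 16                           # write inputs + compute + read outputs
ideal_time = cycles / CLOCK_HZ                  # estimated time per call

func_before = 9.038317                          # s, function time without acceleration
func_after = 6.464558                           # s, function time with acceleration
total = 38.72463                                # s, total decoding time (assumed reference)

saved = func_before - func_after
total_saving = saved / total                    # share of total decoding time saved
function_saving = saved / func_before           # share of the function's own time saved

print(f"ideal time per call: {ideal_time * 1e6:.2f} us")
print(f"total saving: {total_saving * 100:.1f} %")
print(f"function saving: {function_saving * 100:.1f} %")
```

Computed directly from the table, the function's own time drops by about 28 %, close to the roughly 27 % derived in the text from the rounded 25 % share.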

When the data is supplied it is a variable in the C code. This leads to a couple of extra cycles for getting the variable into one of the processor registers. The routine time is also measured in real time, mainly because it is a real time system, but this means that time spent handling interrupts is counted as time in the measured routine. It could be that timer and Ethernet interrupts are very time consuming, so that a cut in subroutine time does not affect the total time that much.


Chapter 9

Register Space

In addition to all accelerators and communication peripherals a register space is attached to the Processor Local Bus (PLB) of the MicroBlaze. It can contain an arbitrary number of 32-bit registers (as many as the area constraint allows). The register space is used in software by mapping the system physical address to a C pointer. Then you can read or write values to the current address or increment the pointer starting from the base address in order to reach all the registers present.

9.1 Purpose

The purpose of this unit is to have a space accessible from Linux that you can read system states from, or write control signals to. Since the peripheral is using the PLB, this unit is rather slow and should not be used for high speed communication purposes. In that case it is better to switch to the FSL interface. The main advantage with this method is the ability to have multiple system state variables stored in memory and have them all individually wired to fixed locations in other units. This is done by adding ports to the peripherals in the VHDL code and then connecting wires between the ports in Xilinx Platform Studio. The advantage over FSL is that you do not remove the values by reading them. As long as the values are not overwritten they remain registered in memory.

You could say that all peripherals should have their status and control registers in their own address space and that it would be unnecessary to make a new module just for this. But the FSL interface is only used for custom accelerators so there is no support for this kind of registers. All data fed into the link is lost after reading once and it would be too time consuming to write all control signals every time the accelerators are run.


9.2 Sample Rate Divider

The four least significant bits of the sample rate divider register are wired to the divider input of the S/PDIF module. The system clock frequency and the divider value together determine the playback rate of the output sound. The divider constant is given by Equation (9.1), where $d$ is the divider, $f_{clk}$ is the system clock frequency and $f_s$ is the sample frequency.

$$d = \frac{\dfrac{f_{clk}}{f_s} - 128}{128} \tag{9.1}$$
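As a worked example of Equation (9.1), the sketch below computes the divider for a given clock and sample rate, and inverts the equation to give the rate that a divider actually produces, $f_s = f_{clk} / (128 (d + 1))$. Rounding to the nearest integer and clamping to the 4-bit range are assumptions made for this illustration, and the function names are hypothetical.

```c
#include <stdint.h>

/* Divider from Equation (9.1): d = (f_clk / f_s - 128) / 128, which is
 * the same as f_clk / (128 * f_s) - 1. Rounded to the nearest integer
 * and clamped to 0..15, since only four bits reach the S/PDIF module. */
uint32_t spdif_divider(uint32_t f_clk_hz, uint32_t f_s_hz)
{
    uint64_t ticks = 128u * (uint64_t)f_s_hz; /* 128 * f_s */
    if (ticks == 0u)
        return 15u; /* no valid rate: fall back to the slowest setting */

    uint64_t ratio = (f_clk_hz + ticks / 2u) / ticks; /* round(f_clk / (128 f_s)) */
    if (ratio < 1u)
        ratio = 1u;

    uint64_t d = ratio - 1u;
    return (d > 15u) ? 15u : (uint32_t)d;
}

/* Sample rate obtained with divider d, from inverting Equation (9.1). */
double spdif_actual_rate(uint32_t f_clk_hz, uint32_t d)
{
    return (double)f_clk_hz / (128.0 * ((double)d + 1.0));
}

/* Only the four least significant bits of the register are used. */
uint32_t spdif_divider_field(uint32_t reg_value)
{
    return reg_value & 0xFu;
}
```

With a hypothetical 50 MHz system clock and a 44.1 kHz source, the divider becomes 8 and the resulting rate is about 43.4 kHz, which shows how coarse an integer divider of this size is.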

Setting the rate through the divider leads to some inaccuracy in the playback frequency, since the divider is limited to a 4-bit integer. A better way to derive a sample enable signal would be to count system clock cycles and choose a reset count at which the sample enable is set high for one cycle. The sample period then deviates by at most one system clock cycle, since the count is either one cycle too short or one cycle too long:

$$\frac{1}{f_{s,out}} = \frac{1}{f_{s,in}} \pm \frac{1}{f_{clk}} \tag{9.2}$$

Since the system clock frequency is far higher than the sound sample rate, this has very little effect on the output rate, and the human ear will not be able to hear the difference. The most important thing, though, is to have the samples as PCM data and to have a way of determining the original sample rate.
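The counter-based sample enable suggested above could be modelled in software as follows. This is only a sketch of the idea in C, not the hardware design: the type `sample_enable_t`, the function names and the choice of rounding the reset count to the nearest integer are assumptions made for illustration.

```c
#include <stdint.h>

/* Reset count: system clock cycles per sample, rounded to the nearest
 * integer. Returns 0 if f_s_hz is 0. */
uint32_t sample_enable_period(uint32_t f_clk_hz, uint32_t f_s_hz)
{
    if (f_s_hz == 0u)
        return 0u;
    return (uint32_t)(((uint64_t)f_clk_hz + f_s_hz / 2u) / f_s_hz);
}

/* State of the cycle counter. */
typedef struct {
    uint32_t count;  /* cycles counted since the last sample enable */
    uint32_t period; /* reset count */
} sample_enable_t;

/* Model of one system clock cycle. Returns 1 on the cycle in which the
 * sample enable would be high, otherwise 0. */
int sample_enable_tick(sample_enable_t *se)
{
    se->count++;
    if (se->count >= se->period) {
        se->count = 0u;
        return 1;
    }
    return 0;
}
```

For a hypothetical 50 MHz system clock and a 44.1 kHz source, the reset count is 1134 cycles, which gives an output rate of roughly 44092 Hz.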
