Power Optimization of Image Filtering with FPGA


ISRN UTH-INGUTB-EX-E-2018/05-SE

Degree project, 15 credits, 15 June 2018

Power Optimization of Image Filtering With FPGA

Sebastian Götbring

Faculty of Science and Technology, UTH unit

Visiting address: Ångströmlaboratoriet, Lägerhyddsvägen 1, Hus 4, Plan 0

Postal address: Box 536, 751 21 Uppsala

Telephone: 018 – 471 30 03

Fax: 018 – 471 30 00

Website: http://www.teknat.uu.se/student

Abstract

Power Optimization of Image Filtering With FPGA

Sebastian Götbring

High-speed real-time video processing places heavy demands on hardware, and Field Programmable Gate Arrays (FPGAs) are becoming more popular for this task. What makes them interesting in this field is their inherent concurrency, which makes them ideal for high-speed applications. Growing demand for energy-efficient solutions requires the designer to know how different implementations on the FPGA affect the power consumption.

Therefore, a study on power consumption for image filtering with FPGA was conducted.

Two image filtering algorithms were implemented on an FPGA with the goal of reducing the power consumption of real-time image filtering by optimising the implementations on the FPGA.

To reduce the power consumption, three main areas were examined: optimising the algorithm, using the different hardware capabilities that come with FPGAs, and working with different clock speeds.

The different approaches were simulated in a power estimator to evaluate the effects on the power consumption before implementing them on an FPGA and measuring the results.

In this project it was determined that lowering the frequency and utilizing the resources to the full extent can have a positive impact on the power consumption. However, the measured differences were smaller than the accuracy of the ammeter used, so no firm conclusions could be drawn. Larger systems with multiple FPGAs might show more noticeable power savings. More knowledge of Hardware Description Language (HDL) programming and resource management could lead to even lower power consumption.

ISRN UTH-INGUTB-EX-E-2018/05-SE

Examiner: Tomas Nyberg

Subject reader: Steffi Knorn

Supervisor: Gunnar Stjernberg

Summary

High-speed, real-time video processing places great demands on the hardware, and Field Programmable Gate Arrays (FPGAs) are becoming an increasingly popular solution. The parallel nature of FPGAs makes them ideal for this type of high-speed application. With greater interest in energy-efficient solutions come greater demands on understanding how different implementations affect the power consumption. Against this background, a study was carried out on power consumption for image filtering with FPGAs.

Two image filters were implemented on an FPGA with the goal of reducing the power consumption of real-time image filtering by optimising the implementations.

To reduce the power consumption, three main areas were studied: optimising the algorithms, using the different hardware resources that come with FPGAs, and working with different clock frequencies.

The different proposals were simulated in a power estimator to give an idea of whether any savings would arise, before implementation on an FPGA and measurement of the actual power consumption.

The tests showed a trend of lower power consumption, as expected, when using more efficient hardware and lower frequencies. However, the measurement inaccuracy of the ammeter was too large to give any certain answers. Larger and more complex systems spanning several FPGAs should give clearer power savings.

Table of contents

1 Introduction
1.1 Project description
1.2 Motivation
2 Theory
2.1 Digital image filtering
2.2 Edge detection
2.3 Median filter
2.4 Digital imaging and yuv422
2.5 Field-programmable gate array (FPGA)
2.6 Very High Speed Integrated Circuit Hardware Description Language (VHDL)
2.7 Time multiplexing
2.8 Complementary Metal Oxide Semiconductors (CMOS) Power Consumption
2.9 Sorting networks
3 Analysis and Simulations
3.1 Analysis
3.2 Power analyser
4 Implementation
4.1 Sobel filter
4.2 Sobel filter with lower clock speed
4.3 Median filter
5 Measurements
5.1 Protocol and test specifications
5.2 Test specifications
5.3 Results
6 Discussion and conclusion
6.1 Test 1: SRL and BRAM
6.2 Test 2: SRL 74.25MHz and SRL 67MHz
6.3 Test 3: Median filter
7 Recommendations for future improvements
7.1 Frequency reduction
7.2 Sorting
8 Ref
9 Figures


1 Introduction

1.1 Project description

This project aimed to find ways of reducing the power consumption of image filtering on FPGAs. The study was kept general in order to find solutions that could be implemented in future designs and developed further.

The first goal of the project was to implement two filters and test their power consumption. This was followed by attempts to reduce the power consumption, to see how different implementations in the FPGA affect it. No specific target was set for the reduction.

To reduce the power consumption, the project experimented with different types of memory, special arithmetic Digital Signal Processing blocks (DSP-blocks) and time multiplexing of the DSP-blocks, different clock speeds, and optimisation of the filtering algorithm.

The platform was a Xilinx Zynq UltraScale+ ZU3EG FPGA set up to receive a video stream via HDMI at 720p 60 Hz. The platform was prepared beforehand so that only the image filtering had to be programmed, using the Xilinx Vivado Design Suite. A computer fed the video stream to the FPGA and the resulting image could be seen on a separate screen. The power consumption was measured using an ammeter.

The work included:

• Researching the hardware and its capabilities as well as image filtering and possible solutions to reduce the power.

• Simulations of different solutions in a power estimator for the hardware to gauge which solutions could have a positive impact on power consumption.

• Designing the filter in Very High Speed Integrated Circuit Hardware Description Language (VHDL).

• Implementing the filter on the hardware and measuring of the power consumption.

1.2 Motivation

An FPGA is an alternative for handling and processing digital data. In recent years the cost, performance and power consumption of FPGAs have improved considerably. In some fields FPGAs can outperform other options, such as Central Processing Units (CPU) and General-Purpose computing on Graphics Processing Units (GPGPU), by 10 to 100 times [1.1].

With the increasing use of image processing and the need for high-performance, low-energy solutions, FPGAs are becoming more attractive. For instance, autonomous electric vehicles need to process video streams at high speed in real time to operate safely with a limited power supply. Most previous studies on image filtering with FPGAs focus on minimising area and maximising speed within the FPGA. Hence, a study on power consumption was highly relevant.


2 Theory

It is important to understand the different parts of image filtering and FPGA for this project.

This section therefore explains the most important aspects of the different topics covered in this thesis.

2.1 Digital image filtering

Digital images allow a computer to process the image using algorithms. Filtering is often used as a first step, so that more complex algorithms such as classification, feature extraction, multi-scale analysis, pattern recognition and projections can be applied to the filtered result.

Different filters require different solutions but often involve convolutions or correlations [2.1].

2.2 Edge detection

An edge detection filter detects rapid changes in light intensity which often occur around edges.

The Sobel filter is a discrete differential operator that detects the change in light intensity by convolving the picture with two 3x3 kernels, one for the horizontal edges and one for the vertical edges, and then combining the results into one picture [2.2].

Figure 2.1: Nine pixels for use in the convolution.

Figure 2.2: The kernel for vertical edges.

Figure 2.3: The kernel for horizontal edges.

The kernels are centred over the desired pixel, and the horizontal and vertical changes in light intensity between the directly surrounding pixels are calculated using (1) and (2).

Vertical edges:

$G_x = P_3 + 2 P_6 + P_9 - (P_1 + 2 P_4 + P_7)$ (1)

Horizontal edges:

$G_y = P_7 + 2 P_8 + P_9 - (P_1 + 2 P_2 + P_3)$ (2)

The result is a grayscale picture with the edges enhanced, calculated using (3).

$G = \sqrt{G_x^2 + G_y^2}$ (3)
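As an illustration of (1) to (3), the following Python sketch evaluates the Sobel operator for a single 3x3 neighbourhood. It assumes that P1 to P9 are numbered row by row from the top left, which is consistent with the columns used in (1); the function name sobel_pixel is chosen for this example only.

```python
import math


def sobel_pixel(window):
    """Evaluate (1)-(3) for one 3x3 window.

    window is [[P1, P2, P3], [P4, P5, P6], [P7, P8, P9]] (assumed row-major order).
    """
    (p1, p2, p3), (p4, _p5, p6), (p7, p8, p9) = window
    gx = p3 + 2 * p6 + p9 - (p1 + 2 * p4 + p7)  # equation (1): right column minus left column
    gy = p7 + 2 * p8 + p9 - (p1 + 2 * p2 + p3)  # equation (2): bottom row minus top row
    g = math.sqrt(gx * gx + gy * gy)            # equation (3)
    return gx, gy, g


# A sharp vertical edge: dark on the left, bright on the right.
vertical_edge = [[0, 0, 255],
                 [0, 0, 255],
                 [0, 0, 255]]
print(sobel_pixel(vertical_edge))
```

For the vertical edge in the example only the first component is non-zero, which is why (1) is labelled as detecting vertical edges.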

2.3 Median filter

The median filter is a nonlinear spatial low-pass filter that reduces noise but keeps the edges in the picture. A symmetric kernel is placed on top of each pixel and the median of the pixels covered by the kernel is used as the output pixel [2.3].
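A minimal software sketch of this idea is given below in Python. It is not the hardware implementation used in this project: the image is a plain list of rows, border pixels are simply copied unchanged (an assumption made to keep the example short), and the name median_filter is chosen for illustration.

```python
from statistics import median_low


def median_filter(image, k=3):
    """Replace each interior pixel by the median of its k x k neighbourhood."""
    half = k // 2
    height, width = len(image), len(image[0])
    out = [row[:] for row in image]  # border pixels keep their original value
    for y in range(half, height - half):
        for x in range(half, width - half):
            window = [image[y + dy][x + dx]
                      for dy in range(-half, half + 1)
                      for dx in range(-half, half + 1)]
            out[y][x] = median_low(window)
    return out


# A single bright noise pixel is removed ...
noisy = [[10, 10, 10, 10],
         [10, 255, 10, 10],
         [10, 10, 10, 10]]
print(median_filter(noisy)[1])

# ... while a sharp edge between two flat regions is kept.
edge = [[0, 0, 200, 200]] * 3
print(median_filter(edge)[1])
```

The second example shows the property that distinguishes the median filter from a linear low-pass filter: the step between the two regions is not smeared out.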

2.4 Digital imaging and yuv422

YUV422 is a colour encoding system that represents a picture by its light intensity (Y = luma) and its colour (UV = chroma), and that takes human colour perception into account to reduce bandwidth.

In YUV422 the luma is present in every pixel, but each pixel carries only one of the two chroma components, alternating between U and V. Compared with storing all three components in every pixel, this reduces the bandwidth by 33% with little noticeable reduction in image quality [2.4].

The chroma ranges between -0.5 and 0.5 and the luma between 0 and 1. By setting the chroma to zero and only showing the luma, a grayscale image is obtained.

In the digital representation each value ranges between 0 and 255, since 8 bits are used for luma and 8 bits for chroma, giving 16 bits of data per pixel [2.5].

The image is transmitted one pixel at a time, each pixel containing luma and chroma. The pixels are ordered in lines from the top left to the bottom right.

Typical signals in digital imagery are:

• A pixel clock.

• Data.

• Data enable - signals that data is being sent.

• Horizontal blanking - Tells the system that the end of a line is reached and to reposition to the left.

• Vertical blanking - Tells the system that the end of the frame is reached and to reposition to the top left.

The image is often transmitted with a portion of valid data, which is the visible image, and two sections of invalid data. The invalid sections exist for older monitors that need time to reposition the electron beam, and are signalled with the vertical and horizontal blanking [2.6].

With digital monitors this portion is unnecessary but is kept for backwards compatibility.

Figure 2.4: Visualization of valid data and invalid data (blanking).

2.5 Field-programmable gate array (FPGA)

FPGA is an integrated circuit consisting of an array of programmable logic blocks. This allows for reconfiguration using a Hardware Description Language (HDL) and is used where complex and flexible circuitry is needed [2.7].

It contains:

Look Up Table (LUT)

A Look Up Table is a small Read Only Memory (ROM) block, usually with 2 to 6 inputs and one output, that can be configured as the desired combinatorial circuit. This is done by storing the desired truth table in the ROM and using the inputs to address the data. The LUT is the basic building block of the FPGA [2.8].

Slice

A slice contains several LUTs, flip-flops and other components such as memory depending on the manufacturer.

The platform used in this project contained slices composed of four 6-input LUTs and eight storage elements [2.8].

Configurable Logical Block (CLB)

A CLB contains a pair of slices [2.8].

Digital Signal Processing blocks (DSP-block)

DSP-blocks are specialised for making multiplications fast and more power efficient since it is usually resource heavy to make multiplications with logic blocks [2.9].

Distributed RAM

Small memory together with the LUT gives the user the ability to create small memory portions distributed over the FPGA [2.8].

Block RAM (BRAM)

Fixed RAM blocks in the FPGA that can be accessed for storing medium-sized amounts of data. These allow the data to be handled quickly and power-efficiently [2.10].

Shift Register Logic (SRL)

The LUT is used as a shift register that can delay data up to 32 clock cycles for a combined 128 clock cycles in a single Slice [2.8].
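As a rough software analogy for the LUT and SRL described above, the following Python sketch models a LUT as a stored truth table addressed by its inputs, and a shift register as a fixed delay line. The names make_lut and srl_delay are invented for this illustration and do not correspond to any Xilinx primitive or tool.

```python
from collections import deque


def make_lut(func, n_inputs):
    """Precompute the truth table of func, then answer by table lookup only."""
    table = []
    for address in range(2 ** n_inputs):
        bits = [(address >> i) & 1 for i in range(n_inputs)]
        table.append(func(*bits) & 1)

    def lut(*bits):
        address = sum(bit << i for i, bit in enumerate(bits))
        return table[address]

    return lut


def srl_delay(samples, depth, fill=0):
    """Delay a stream by depth clock cycles, as a shift register does."""
    register = deque([fill] * depth, maxlen=depth)
    delayed = []
    for sample in samples:
        delayed.append(register[0])  # the oldest value leaves the register
        register.append(sample)      # the new value enters, the oldest is dropped
    return delayed


xor3 = make_lut(lambda a, b, c: a ^ b ^ c, 3)  # a 3-input LUT configured as XOR
print(xor3(1, 0, 1))
print(srl_delay([1, 2, 3, 4, 5], depth=2))
```

Changing the function passed to make_lut corresponds to reconfiguring the LUT: the lookup mechanism stays the same and only the stored table changes.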

2.6 Very High Speed Integrated Circuit Hardware Description Language (VHDL)

VHDL is a text-based Hardware Description Language (HDL) used to describe digital circuitry, for example for implementation on an FPGA.

It resembles conventional programming languages such as C, but with two major differences: it is time dependent, and, being a dataflow language, it allows for concurrent programming. This reflects the fact that electronics is by nature time dependent and parallel [2.11].

2.7 Time multiplexing

Time multiplexing allows multiple sets of data to be transmitted over one signal lane by multiplexing the data at a higher clock rate than that of the incoming data and transmitting it over a fast lane. The data is then demultiplexed back onto separate signal lanes at the original clock rate [2.12].

2.8 Complementary Metal Oxide Semiconductors (CMOS) Power Consumption

An FPGA consists of Complementary Metal Oxide Semiconductor (CMOS) circuits [2.13]. The power consumption of a CMOS circuit is made up of two components:

• Static power consumption

• Dynamic power consumption

CMOS devices have a low leakage current when held at a constant value and therefore a small static power consumption.

Switching between states increases the power consumption, and at high frequencies this can be the dominant part of the total power consumption.

STATIC

The leakage current ($I_{lkg}$) is calculated using (4).

$I_{lkg} = i_s (e^{qV/kT} - 1)$ (4)

Where:

$i_s$ = reverse saturation current

$V$ = diode voltage

$k$ = Boltzmann's constant (1.38 × 10^-23 J/K)

$q$ = electronic charge (1.602 × 10^-19 C)

$T$ = temperature

The total static power consumption $P_s$ is given by (5).

$P_s = \sum (\text{leakage current} \cdot \text{supply voltage})$ (5)

DYNAMIC

The dynamic power consists of transient power ($P_T$) and capacitive-load power ($P_L$).

Transient Power Consumption:

Transient power is consumed only when the transistors are switching between logic states (6).

$P_T = C_{pd} \cdot V_{cc}^2 \cdot f_i \cdot N_{SW}$ (6)

Where:

$P_T$ = transient power consumption

$V_{cc}$ = supply voltage

$f_i$ = input signal frequency

$N_{SW}$ = number of bits switching

$C_{pd}$ = dynamic power-dissipation capacitance

Capacitive-Load Power Consumption:

Charging external load capacitances depends on the switching frequency and adds further power consumption according to (7).

$P_L = \sum (C_{Ln} \cdot f_{On}) \cdot V_{cc}^2$ (7)

Where:

$\sum$ = sum over the n different frequencies and loads at the n different outputs

$f_{On}$ = the output frequency at each output, numbered 1 through n (Hz)

$V_{cc}$ = supply voltage (V)

$C_{Ln}$ = the load capacitance at each output, numbered 1 through n

The total dynamic power consumption $P_D$ and the total power consumption $P_{tot}$ are then given by (8) and (9) [2.14] [2.15].

$P_D = P_T + P_L$ (8)

$P_{tot} = P_s + P_D$ (9)
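To make the frequency dependence in (6) to (8) concrete, the sketch below evaluates the dynamic power model in Python for two clock frequencies. All component values are hypothetical placeholders chosen for this example, not datasheet figures for the platform, and the model assumes that every switching node follows the same clock.

```python
def transient_power(c_pd, v_cc, f_i, n_sw):
    """Equation (6)."""
    return c_pd * v_cc ** 2 * f_i * n_sw


def load_power(loads, v_cc):
    """Equation (7); loads is a list of (C_L, f_O) pairs, one per output."""
    return sum(c_l * f_o for c_l, f_o in loads) * v_cc ** 2


def dynamic_power(c_pd, v_cc, f_i, n_sw, loads):
    """Equation (8)."""
    return transient_power(c_pd, v_cc, f_i, n_sw) + load_power(loads, v_cc)


# Hypothetical example values, not taken from any datasheet.
C_PD, V_CC, N_SW, C_L = 20e-12, 0.85, 16, 5e-12

fast = dynamic_power(C_PD, V_CC, 74.25e6, N_SW, [(C_L, 74.25e6)])
slow = dynamic_power(C_PD, V_CC, 67e6, N_SW, [(C_L, 67e6)])
print(f"relative saving in dynamic power: {1 - slow / fast:.1%}")
```

Because both terms are proportional to frequency, the relative saving in this model depends only on the ratio of the two clocks and not on the placeholder capacitances.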


2.9 Sorting networks

Sorting networks are used to sort data. The size of the network is the number of comparators used and the depth is the maximum number of steps a single element can travel, correlating to the speed of the sorting [2.16].

Brick sort/odd-even sort

Brick sort is a simple sorting algorithm for parallel processing. It takes an array, compares neighbouring elements pairwise and swaps them if necessary, alternating between odd and even elements in each loop. For an even number of elements n, the first loop starts with the odd elements, comparing a1 < a2, a3 < a4, …, an-1 < an. The next loop starts with the even elements, comparing a2 < a3, a4 < a5, …, an-2 < an-1. The third loop again starts with the odd elements (Figure 2.5).

Figure 2.5: Visualization of the odd-even sorting network. Data flows from left to right. The vertical lines connecting the horizontal lines are comparators.

It has size:

𝑆𝑖𝑧𝑒 = 𝑛(𝑛 − 1)/2 (10)

and depth:

𝑑𝑒𝑝𝑡ℎ = 𝑛 (11)

[2.17]
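The structure of the brick sort network and its size and depth in (10) and (11) can be checked with a short Python sketch. This is a software model written for illustration, with invented function names; in hardware, all comparators within one layer would operate in parallel.

```python
def brick_sort_network(n):
    """Return the comparator layers of an odd-even transposition network."""
    layers = []
    for step in range(n):
        start = step % 2  # alternate between pairs (a1,a2),... and (a2,a3),...
        layers.append([(i, i + 1) for i in range(start, n - 1, 2)])
    return layers


def apply_network(layers, values):
    """Run the values through every comparator, layer by layer."""
    data = list(values)
    for layer in layers:
        for i, j in layer:  # comparators in one layer touch disjoint elements
            if data[i] > data[j]:
                data[i], data[j] = data[j], data[i]
    return data


def network_size(layers):
    return sum(len(layer) for layer in layers)


layers = brick_sort_network(8)
print(apply_network(layers, [5, 1, 7, 3, 8, 2, 6, 4]))
print(network_size(layers), len(layers))  # size n(n-1)/2 and depth n
```

For n = 8 the network has 28 comparators in 8 layers, in agreement with (10) and (11).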

Batcher odd-even mergesort

This sorting network works on the same principle as the odd-even sort but merges two sorted parts into a fully sorted sequence. The first part of the network partially sorts the elements; the second part merges the parts and sorts them into one sorted sequence (Figure 2.6). It is significantly faster, with a size of approximately:

$size = n \cdot (\log_2 n)^2$ (12)

and a depth of approximately:

$depth = (\log_2 n)^2$ (13)

[2.18] [2.19] [2.20]

Figure 2.6: Visualization of the Batcher odd-even merge sorting network. Data flows from left to right. The vertical lines connecting the horizontal lines are comparators.


3 Analysis and Simulations

Proper analysis and simulation of the problem indicate which solutions might be beneficial.

This section explains what options were examined before moving on to the implementation.

3.1 Analysis

To filter an image with a convolution matrix the following steps have to be performed:

• Buffer previous lines from the source for use in the convolution.

• Extract a small number of pixels for the calculation.

• Calculate the convolution for the desired pixel.

• Send the new pixel to the output.

Buffering lines requires memory. Two full lines contain 2560 pixels, which is considered a large amount of data for an FPGA to buffer. The dynamic power consumption described in section 2.8 indicates that block RAM should be a more energy-efficient solution than distributed RAM or shift registers, which require a lot of switching logic (6).

Besides summation, the equations often involve multiplication and square roots.

Calculating square roots is resource heavy and introduces a lot of latency in an FPGA [3.1].

It should therefore be avoided if possible, and (3) can be approximated by:

|𝐺| = |𝐺_𝑥| + |𝐺_𝑦| (14)

Since only large changes in the values are of interest, the effect on the resulting image is negligible.
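The size of the error introduced by (14) can be examined with a small Python comparison against (3). The sum of absolute values never underestimates the exact magnitude and overestimates it by at most a factor of √2, reached when |Gx| = |Gy|. The function names are chosen for this example.

```python
import math


def gradient_exact(gx, gy):
    """Equation (3)."""
    return math.sqrt(gx * gx + gy * gy)


def gradient_approx(gx, gy):
    """Equation (14): no multiplication or square root needed."""
    return abs(gx) + abs(gy)


for gx, gy in [(1020, 0), (300, 400), (-500, 500)]:
    exact = gradient_exact(gx, gy)
    approx = gradient_approx(gx, gy)
    print(gx, gy, round(exact), approx, round(approx / exact, 2))
```

Strong edges therefore remain strong and weak edges remain weak, which is what matters for an edge image.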

Multiplication is also a resource heavy operation but can be done with special DSP-blocks that are designed for fast and energy efficient multiplications.

Time multiplexing of DSP-blocks saves resources but requires a higher clock frequency.

Since the power consumption is linear in the frequency according to (6) and (7), the dynamic power consumption should be the same in both cases. Any difference comes from the leakage of the additional DSP-blocks in the design that is not time multiplexed.

Operating at high frequencies generally results in higher power consumption, because of the linear relation between power consumption and frequency in (6) and (7).

Lowering the operating frequency of the filtering process could therefore reduce the power consumption. The incoming pixel clock for a 720p 60 Hz video stream is 74.25 MHz, and by utilizing the invalid data region that comes with horizontal blanking (Figure 2.4), a slower filter clock can be used. The valid data area is 1280 pixels wide and the horizontal blanking space is 370 pixels wide, resulting in a total line length of 1280 + 370 = 1650 pixels [3.2].

This allows the clock to be lowered by 22%:

$1 - \frac{1280}{1650} \approx 0.224 \approx 22\%$ (15)
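The reasoning behind (15) is that the filter only has to process the valid pixels of a line, but may spread that work over the whole line time including blanking. A short Python sketch of this calculation follows; the function name is chosen for illustration and the calculation ignores any margin needed for clock drift.

```python
def slowest_filter_clock(pixel_clock_mhz, active_pixels, blanking_pixels):
    """Lowest clock that still processes all active pixels within one line time."""
    total_pixels = active_pixels + blanking_pixels
    return pixel_clock_mhz * active_pixels / total_pixels


clock_720p = slowest_filter_clock(74.25, 1280, 370)
reduction = 1 - clock_720p / 74.25
print(f"{clock_720p:.2f} MHz, {reduction:.1%} lower")
```

For the 720p format this gives a filter clock of about 57.6 MHz, which corresponds to the 22% reduction in (15).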

The power consumption in a sorting network consists mainly of the dynamic consumption from switching elements. The size of the network is therefore more important than the depth, which correlates with the static power consumption. For the odd-even sorting network with n = 81, (10) and (11) give:

Comparators: $81 \cdot 80 / 2 = 3240$ (16)

Latency: $81$ (17)

For Batcher odd-even mergesort with n = 81, (12) and (13) give:

Comparators: $81 \cdot (\log_2 81)^2 \approx 3256$ (18)

Latency: $(\log_2 81)^2 \approx 40$ (19)

3.2 Power analyser

In the simulation for time-multiplexing of DSP-blocks, 40 DSP-blocks were simulated at 100 MHz and 10 DSP-blocks at 400MHz. According to the power analyser this did not change the power consumption at all.
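Under the linear model of (6), this outcome is what one would expect: the product of the number of blocks and the clock frequency is the same in both configurations. The Python sketch below illustrates this with hypothetical per-block coefficients in arbitrary units, assuming that every DSP-block does the same amount of switching per clock cycle. Only the static term differs between the two cases.

```python
def estimated_power(n_blocks, clock_mhz, dynamic_per_block_mhz, leakage_per_block):
    """Return (dynamic, static) power in arbitrary units for a group of DSP-blocks."""
    dynamic = n_blocks * clock_mhz * dynamic_per_block_mhz  # linear in f, as in (6)
    static = n_blocks * leakage_per_block                   # independent of f
    return dynamic, static


# Hypothetical coefficients, chosen only to illustrate the scaling.
parallel = estimated_power(40, 100, 1.0, 0.5)     # 40 DSP-blocks at 100 MHz
multiplexed = estimated_power(10, 400, 1.0, 0.5)  # 10 DSP-blocks at 400 MHz
print(parallel, multiplexed)
```

With these assumptions the dynamic terms are identical, so a power analyser would only report a difference if the leakage of the 30 extra blocks were large enough to register.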

Buffers on the input and the output, allowing the actual filter to run on a slower clock, were also simulated. The simulation showed a possibility of lowering the power consumption (Figures 3.1 and 3.2).

Figure 3.1: Estimation of power consumption for a filter running with a 74.25MHz clock.

Figure 3.2: Estimation of power consumption for a filter running with a 66 MHz clock.


4 Implementation

This section discusses how the filters were implemented on the FPGA and is divided into three sections. Each section revolves around one of the main areas investigated: using different memory types, lowering the frequency and sorting.

4.1 Sobel filter

The input signals to the filter were:

• Data_in

• Data_enable

• Blanking

• Clock

These signals are provided by the computer. The clock was used to run the logic, data_in contained the video stream, data_enable and blanking were used internally to control the filter.

The output was:

• Data_out

Data_out contained the filtered video stream and was sent to the monitor.

The incoming data_in contained 16 bits of information and was split into two blocks of 8 bits.

The first 8 bits contained the chroma. These bits were set to “01111111”, approximately the midpoint of the 8-bit range, to get a grayscale image.

The next 8 bits contained the luma and were converted to an integer value ranging from 0 to 255 for use in the algorithm.
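The splitting of the pixel word can be sketched in Python as follows. The sketch assumes that the "first" 8 bits are the most significant byte of the 16-bit word, which the text does not state explicitly; the names GREY_CHROMA and split_pixel are invented for this example.

```python
GREY_CHROMA = 0b01111111  # the constant chroma value used for a grayscale image


def split_pixel(word):
    """Split a 16-bit pixel word into a grayscale output word and the luma value.

    Assumption: chroma occupies the upper byte and luma the lower byte.
    """
    luma = word & 0xFF                    # integer in the range 0..255
    grey_word = (GREY_CHROMA << 8) | luma  # incoming chroma is discarded
    return grey_word, luma


grey_word, luma = split_pixel(0xAB40)
print(hex(grey_word), luma)
```

If the byte order on the actual platform is the opposite, only the shift and mask in split_pixel would need to be swapped.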

To perform the convolution, two lines were stored in buffers.

The current line was taken directly from data_in and the two previous lines were stored in two shift registers configured as First In First Out (FIFO), called line buffers. Nine signal buffers were also used to buffer the pixels used for the calculations (Figure 4.1).

Figure 4.1: Buffering of the incoming lines and pixels for calculation. Line 0 is the current line. Line 1 and 2 are the two previous lines. P1 to P9 are buffers for use in the calculations.
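The buffering scheme in Figure 4.1 can be modelled in software as two chained FIFOs, each one line long, feeding three 3-pixel shift registers. The Python sketch below is a behavioural model written for illustration, not a translation of the VHDL; it assumes pixels arrive in raster order without blanking and that the buffers initially contain zeros.

```python
from collections import deque


def stream_windows(pixels, width):
    """Yield (x, y, window) for each pixel position whose 3x3 window is complete."""
    line1 = deque([0] * width, maxlen=width)  # previous line (Line 1)
    line2 = deque([0] * width, maxlen=width)  # line before that (Line 2)
    rows = [[0, 0, 0] for _ in range(3)]      # rows[0] = oldest line, rows[2] = current line
    for index, pixel in enumerate(pixels):
        y, x = divmod(index, width)
        above = line1[0]        # pixel directly above the incoming one
        two_above = line2[0]    # pixel two lines above the incoming one
        line2.append(above)     # the output of Line 1 feeds Line 2
        line1.append(pixel)     # the incoming pixel feeds Line 1
        for row, value in zip(rows, (two_above, above, pixel)):
            row.pop(0)          # shift the 3-pixel register by one position
            row.append(value)
        if y >= 2 and x >= 2:   # window is centred on pixel (x - 1, y - 1)
            yield x - 1, y - 1, [row[:] for row in rows]


# A 4x3 test image whose pixel values equal their position in the stream.
for x, y, window in stream_windows(range(12), width=4):
    print(x, y, window)
```

Each yielded window is laid out as [[P1, P2, P3], [P4, P5, P6], [P7, P8, P9]], so only two full lines ever need to be stored regardless of the image height.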


Figure 4.2: Flowchart of the calculations in the Sobel filter.

The values from eight of the nine buffers, all except the centre buffer, were used in the calculation of the convolution (Figure 4.2).

The algorithm calculated the horizontal change in luminance Gx (vertical edges) using (1) and the vertical change in luminance Gy (horizontal edges) using (2), resulting in Gx and Gy ranging from -1020 to 1020.

Gx and Gy were divided by four and their absolute values were summed using (14), forming the gradient G.

The gradient G was divided by two to give a value between 0 and 255 and then clamped to an upper bound of 235.

The gradient was converted to an 8-bit binary value, concatenated with the chroma and sent to the output.
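The scaling steps above can be followed in a few lines of Python. The sketch assumes integer division that discards the remainder, which is an assumption about the implementation rather than something stated in the text, and the function name is chosen for illustration.

```python
def sobel_output(gx, gy):
    """Scale Gx and Gy (each in -1020..1020) to an 8-bit gradient value."""
    g = abs(gx) // 4 + abs(gy) // 4  # (14) applied to the values divided by four: 0..510
    g //= 2                          # divided by two: 0..255
    return min(g, 235)               # clamped to the upper bound


print(sobel_output(1020, 0))     # strongest possible edge in one direction
print(sobel_output(1020, 1020))  # strongest possible edge in both directions
print(sobel_output(12, -8))      # weak gradient
```

Using only divisions by powers of two keeps the scaling cheap in hardware, since these reduce to dropping the lowest bits.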

SRL

When the line buffers were coded as traditional shift registers, Vivado's synthesis tool implemented them as SRL32 logic. The amount of logic used in the FPGA can be seen in Figure 4.3.

Figure 4.3: The amount of logic used for the Sobel filter using SRL for linebuffers can be seen highlighted in blue.

BRAM

When the line buffers were coded as BRAM, Vivado's synthesis tool implemented them as RAMB18E2. The amount of logic used in the FPGA can be seen in Figure 4.4.

Figure 4.4: The amount of logic used for the Sobel filter using BRAM for linebuffers can be seen highlighted in blue.

4.2 Sobel filter with lower clock speed.

The same filter as before was used, and two dual-port FIFOs were added to the design, one for storing the incoming data and one for storing the outgoing data.

Each FIFO required a separate clock and enable signal for its input side and its output side.

Figure 4.5: Top hierarchy overview of the block design. Data flows from the left to the right.

The input FIFO was connected directly to the incoming data on the input side with the incoming data_enable as write_enable.

The output side of the FIFO was connected to the filter with the separate slower clock and a read_enable signal coming from the filter (Figure 4.5).

The output FIFO was connected to the filter on the input side with the slower clock and a write_enable signal coming from the filter (the same as read_enable to the input FIFO).

The output side of the FIFO was connected to the HDMI out using the faster clock and the incoming data enable as read enable but delayed, compensating for the longer filtering process.

The two blanking signals were also delayed by the block called “SHIFT” (Figure 4.5).

The delay in data_enable and blanking was implemented using a shift register and was needed due to the slower filter clock (Figure 4.6).

Figure 4.6: Relation between the three different data enable used in the system. Top line is incoming data_enable, middle line is data_enable for the filter, bottom line is data_enable for the output. The colored dashed lines show the timing between the signals.

In between the FIFOs the filter was used the same way as before but with a modification to the internal signals and a data_enable signal to control the reading from the input FIFO (read_enable) and the writing to the output FIFO (write_enable).

A picture format of 1280*768 was selected. To save power, the filter ran on a 67 MHz clock, about 10% slower than the 74.25 MHz input clock.

The format had a horizontal blanking space of 160 pixels, giving the theoretical slowest clock of:

$(1280 + 160)/1280 = 1.125$ (20)

$74.25\,\text{MHz} / 1.125 = 66.00\,\text{MHz}$ (21)

$100\% - (66.00 \cdot 100 / 74.25) \approx 11\%$ (22)

To allow for drift between the clocks, a slightly smaller clock ratio of 1.11 was used:

$1419/1280 \approx 1.11$ (23)

$74.25\,\text{MHz} / 1.11 \approx 67\,\text{MHz}$ (24)

4.3 Median filter

A 9x9 kernel was used in the median filter.

Eight line buffers were implemented with BRAMs and the ninth line was taken from the current video stream, as in the Sobel filter.

Nine pixels from each line were stored in buffers and placed into an array of 81 elements.

The array was sorted using a parallel brick sort and the median was used as the output pixel.

The clock running the logic was controlled by an internal clock_enable signal, allowing the logic to stop during blanking.


5 Measurements

Test specifications and how the measurements were made are presented in this section. The results from the measurements are presented combined with the calculated uncertainty.

5.1 Protocol and test specifications

The current used by the FPGA was measured using a Yoctopuce ammeter, the ‘Yocto-Amp’ [5.1], connected to a PC via USB.

Manufacturer product number    YAMPMK01
Accuracy (DC)                  2 mA, 1%
Sensitivity                    2 mA, 0.3%
Max current (continuous)       10 A
Max current (peak)             17 A

To maximize the power consumption and thereby improve the measurements, 30 identical filters were implemented on a single FPGA, except for the median filter, where only one filter could fit. The Vivado software usually removes unused elements, so the additional filters had to be connected to Integrated Logic Analyzers (ILA) to prevent the software from doing so. The ILAs consumed only a small amount of current in relation to the system, and this was constant for all the tests. Hence, only the relative changes between the different filters were measured.

The uncertainty for the measurements was calculated using (25) [5.2].

$\text{uncertainty (mA)} = \frac{\text{measured value (mA)}}{100} + 2\,\text{(mA)}$ (25)

For the tests a standard test video shown in Figure 5.1 was used.

Figure 5.1: Test image used for the system during power measurements {Web address 1}.

The video ran at 720p 60 Hz.

The system ran for 20 minutes before starting the measurement to allow for the hardware to heat up.

The tests ran for 5 min each where 6 measurements were sampled one minute apart.


5.2 Test specifications

Test 1: Sobel filter running at 74.25 MHz

• Using SRL for the line buffers.

• Using BRAM for the line buffers.

Goal: To see if there is any noticeable difference using different memory resources.

Test 2: Sobel filter running at 74.25 and 67 MHz

• Using SRL as line buffers and running the filter at 74.25 MHz.

• Using SRL as line buffers and running the filter at 67 MHz.

Goal: To see if there is any noticeable difference using different clock frequencies.

Test 3: Median filter running at 74.25 MHz

• Running the filter constantly.

• Turning off the filtering during blanking.

• Turning off the clock during blanking.

Goal: To see if there is any noticeable difference between running the filter constantly, stopping the filter during blanking and completely shutting off the clock during blanking. The filtering process is the sorting in this case.


5.3 Results

When running the test video on the FPGA the total current consumption was measured. The consumption can be seen in Tables 1 to 3.

Table 1: Test 1 - Sobel filter. Current (mA).

Minute    SRL        BRAM
0         1072±13    1048±12
1         1072±13    1048±12
2         1072±13    1049±12
3         1072±13    1048±12
4         1073±13    1048±12
5         1072±13    1048±12

Running 30 Sobel filters with SRL on an FPGA resulted in a median current consumption of 1072 mA. The uncertainty when using 1072 mA as the median is:

$2\,\text{mA} + \frac{1072\,\text{mA}}{100} \approx 13\,\text{mA}$ (26)

Changing the SRL to BRAM resulted in a median current consumption of 1048 mA. The uncertainty when using 1048 mA as the median is:

$2\,\text{mA} + \frac{1048\,\text{mA}}{100} \approx 12\,\text{mA}$ (27)

The difference between the two tests is:

Best case: $(1072 + 13)\,\text{mA} - (1048 - 12)\,\text{mA} = 49\,\text{mA}$ (28)

Worst case: $(1072 - 13)\,\text{mA} - (1048 + 12)\,\text{mA} = -1\,\text{mA}$ (29)
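The uncertainty rule in (25) and the best-case and worst-case differences can be reproduced with the short Python sketch below. The function names are chosen for this example, and the uncertainty is rounded to whole milliamperes as in the tables.

```python
def uncertainty_ma(measured_ma):
    """Equation (25): 1% of the reading plus 2 mA, rounded to whole mA."""
    return round(measured_ma / 100 + 2)


def difference_bounds(first_ma, second_ma):
    """Largest and smallest possible saving when going from first to second."""
    u1, u2 = uncertainty_ma(first_ma), uncertainty_ma(second_ma)
    best = (first_ma + u1) - (second_ma - u2)
    worst = (first_ma - u1) - (second_ma + u2)
    return best, worst


best, worst = difference_bounds(1072, 1048)  # SRL and BRAM medians from Table 1
conclusive = worst > 0                       # True only if every case shows a saving
print(best, worst, conclusive)
```

Since the worst case is negative, the interval of possible differences includes zero, so the measurement alone cannot confirm a saving.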

Table 2: Test 2 - Sobel filter. Current (mA).

Minute    Running at 74.25 MHz    Running at 67 MHz
0         1062±13                 1054±13
1         1062±13                 1054±13
2         1062±13                 1054±13
3         1062±13                 1054±13
4         1062±13                 1054±13
5         1062±13                 1054±13

Running 30 Sobel filters with SRL logic on an FPGA at 74.25 MHz resulted in a median current consumption of 1062 mA. The uncertainty when using 1062 mA as the median is:

$2\,\text{mA} + \frac{1062\,\text{mA}}{100} \approx 13\,\text{mA}$ (30)

Running 30 Sobel filters with SRL logic on an FPGA at 67 MHz resulted in a median current consumption of 1054 mA. The uncertainty when using 1054 mA as the median is:

$2\,\text{mA} + \frac{1054\,\text{mA}}{100} \approx 13\,\text{mA}$ (31)

The difference between the two tests is:

Best case: $(1062 + 13)\,\text{mA} - (1054 - 13)\,\text{mA} = 34\,\text{mA}$ (32)

Worst case: $(1062 - 13)\,\text{mA} - (1054 + 13)\,\text{mA} = -18\,\text{mA}$ (33)

The resulting picture from the Sobel filter can be seen in Figure 5.2.

Figure 5.2: Test screen after applying the Sobel filter.

Table 3: Test 3 - Median filter. Current (mA).

Minute    Constant filtering    Stopping the filtering during blanking    Stopping the clock during blanking
0         1077±13               1077±13                                   1075±13
1         1078±13               1077±13                                   1076±13
2         1078±13               1078±13                                   1075±13
3         1077±13               1077±13                                   1074±13
4         1077±13               1077±13                                   1075±13
5         1077±13               1078±13                                   1075±13

Running a single median filter on an FPGA resulted in a median current consumption of 1077 mA. The uncertainty when using 1077 mA as the median of the measurements is:

$2\,\text{mA} + \frac{1077\,\text{mA}}{100} \approx 13\,\text{mA}$ (34)

Stopping the sorting process of the median filter during blanking resulted in a median current consumption of 1077 mA. The uncertainty when using 1077 mA as the median is:

$2\,\text{mA} + \frac{1077\,\text{mA}}{100} \approx 13\,\text{mA}$ (35)

Stopping the clock driving all the logic in the median filter during blanking resulted in a median current consumption of 1075 mA. The uncertainty when using 1075 mA as the median is:

$2\,\text{mA} + \frac{1075\,\text{mA}}{100} \approx 13\,\text{mA}$ (36)

The difference between running the filter constantly and turning off the clock during blanking is:

Best case: $(1078 + 13)\,\text{mA} - (1074 - 13)\,\text{mA} = 30\,\text{mA}$ (37)

Worst case: $(1077 - 13)\,\text{mA} - (1075 + 13)\,\text{mA} = -24\,\text{mA}$ (38)

Demonstrations of the median filter applied to noisy images can be seen in Figures 5.3 to 5.8.

Figure 5.3: Noisy image {Web address 2}.


Figure 5.4: After applying the median filter to Figure 5.3 the noise is reduced.


Figure 5.8: After applying the median filter to the Figure 5.7 the noise is reduced.


6 Discussion and conclusion

This chapter presents the conclusions from the different parts of the project and aims at discussing the outcomes.

6.1 Test 1: SRL and BRAM

The test indicated an improvement in the power consumption, as expected. This was probably due to the amount of logic needed by the SRL (Figure 4.3) compared to a single block of BRAM (Figure 4.4). However, (28) and (29) show that the inaccuracy of the ammeter was too large to give any certain results.

BRAM would still be recommended over SRL for buffering data the size of an entire image line, since BRAM could save power, resources and compilation time. More logic takes more time to compile. The compilation time is not reported here but was noticed to be a lot shorter when using BRAM during the tests.

The mathematical operations for the Sobel filter were too few and simple to have a major impact on the power consumption. Therefore, no tests on this area were performed.

In the simulation of time-multiplexing the DSP blocks, the power analyser reported no change in power consumption. This was probably due to the linear relation between power consumption and frequency; the leakage current was probably too small for the analyser to calculate. Therefore, this solution was not pursued further.

6.2 Test 2: SRL 74.25MHz and SRL 67MHz

The test indicates a small reduction in power, but (32) shows that the uncertainty was too large for the measured difference to give any definite answer.

Decreasing the frequency of the filter indicates small power reductions but requires more logic so the overall power consumption might increase. This approach might be more interesting for filters requiring a lot of logic.

Different picture formats have blanking windows of different sizes, and lowering the frequency might be more attractive for a format with a large blanking window such as 1280*720, which allows for a clock frequency about 22% slower (15).

A resolution of 1280*720 was planned for the project, but because the computer did not detect the correct resolution it had to be changed to 1280*768. Unfortunately, this format has a shorter horizontal blanking period, resulting in a smaller reduction of the filter clock.

6.3 Test 3: Median filter

Tests 3.1 and 3.2 showed no difference when stopping the sorting process during blanking. This might be because the input data_in was constant during blanking, so no switching of logic occurred. According to (6), that would explain why the dynamic power consumption did not change.

Test 3.3 was supposed to target even more logic by shutting down the entire filter and not only the sorting process. This yielded no improvement, again probably because most of the logic was static during blanking.

Stopping a clock usually does not require a lot of logic and is often used in more complex systems. Further analysis with larger networks of higher complexity would be of interest to see what impact it has on the power consumption. This was outside the scope of this project and could not be investigated further.

Considerable power can be saved by reducing the amount of logic used. If larger amounts of data need to be sorted, a different sorting network could achieve this, such as Batcher's odd-even mergesort. In this case the number of comparators was similar for the two sorting networks, 3240 (16) and 3256 (18). The latency was halved with Batcher's odd-even mergesort, but the speed of the network was not of interest in this project; only the complexity has a big impact on the power consumption. Therefore, this was not an interesting solution. Reducing the number of comparators in other ways proved difficult, and no solution with a noticeable impact was found.
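As a rough check of the two comparator counts, the Python sketch below reproduces them under stated assumptions: n = 81 inputs, n(n − 1)/2 comparators for the odd-even network, and the estimate n·(log2 n)² for Batcher's odd-even mergesort. Equations (16) and (18) are not repeated in this chapter, so both the input count and the formulas are assumptions chosen because they match the quoted figures.

```python
import math


def odd_even_comparators(n):
    """Comparators in an n-input odd-even (transposition) network: n(n-1)/2."""
    return n * (n - 1) // 2


def batcher_comparator_estimate(n):
    """Order-of-magnitude estimate n * (log2 n)^2 for Batcher's network."""
    return n * math.log2(n) ** 2


n = 81  # assumed number of inputs, not stated in this chapter
print(odd_even_comparators(n))
print(round(batcher_comparator_estimate(n)))
```

With these assumptions the two networks differ by well under 1% in comparator count, which is in line with the conclusion that switching networks would not noticeably change the amount of logic.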

7 Recommendations for future improvements

This chapter highlights some of the topics in the research and gives some recommendations for future implementations.

7.1 Frequency reduction

Using true 720p 16:9 60 Hz (1280*720) gives a horizontal blanking interval twice as wide as that of the format used in this report. This would allow the clock to be reduced by 22% (15) instead of 11%, improving the power saving further.

A video format of 1920*1080 for 1080p 60 Hz has a pixel clock of 148.5 MHz. This allows for a clock roughly 12-13% slower, comparable to the reduction for the format used in this report. Equations (39), (40) and (41) give a guideline for the potential clock reduction for true 1080p format. This format was not supported by the hardware used in this project, so it could not be tested.

2200 / 1920 ≈ 1.1458 (39)

148.5 MHz / 1.1458 ≈ 129.6 MHz (40)

100% − (129.6 * 100 / 148.5) ≈ 12.7% (41)
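The calculation in (39) to (41) generalises to any video format. The Python sketch below is a minimal illustration, assuming the filter clock can be slowed by the ratio of total to active pixels per line; the function name is hypothetical.

```python
def reduced_filter_clock(pixel_clock_mhz, total_per_line, active_per_line):
    """Return (reduced clock in MHz, reduction in percent).

    The filter only has to process the active pixels within the time of a
    full line (active + blanking), so the clock can be scaled by
    active / total.
    """
    ratio = total_per_line / active_per_line           # as in (39)
    reduced_mhz = pixel_clock_mhz / ratio              # as in (40)
    reduction_pct = 100.0 - reduced_mhz * 100.0 / pixel_clock_mhz  # as in (41)
    return reduced_mhz, reduction_pct


# 1080p60 values taken from (39) and (40): 2200 total, 1920 active pixels.
clock, saving = reduced_filter_clock(148.5, 2200, 1920)
print(f"{clock:.1f} MHz, {saving:.1f}% slower")
```

The same function can be applied to other formats by substituting their line timing. Note that a lower clock frequency is only a guideline for power savings, since the extra logic needed for the clock domain crossing also consumes power.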

7.2 Sorting

When sorting data, the sorting network can have a big impact on the required logic. Keeping the complexity of the network in mind can save a lot of power as well as resources on the FPGA.

8 References

[1.1]

Asano, Shuichi & Maruyama, Tsutomu & Yamaguchi, Yoshiki (2009). In: Performance comparison of FPGA, GPU and CPU in image processing. FPL 09: 19th International Conference on Field Programmable Logic and Applications, page 126 - 131. DOI: 10.1109/FPL.2009.5272532.

[2.1]

Rafael C. Gonzalez, Richard E. Woods (2008). In: Digital Image Processing (3rd edition), page 1 - 2. Pearson Prentice Hall.

[2.2]

Rafael C. Gonzalez, Richard E. Woods (2008). In: Digital Image Processing (3rd edition), page 166. Pearson Prentice Hall.

[2.3]

Rafael C. Gonzalez, Richard E. Woods (2008). In: Digital Image Processing (3rd edition), page 156 - 157. Pearson Prentice Hall.

[2.4]

Poynton, Charles (2012). In: Digital Video and HD – Algorithms and Interface (2nd edition), page 121 - 126. Elsevier.

[2.5]

[2.6]

[2.7]

Wayne Wolf (2002). In: Modern VLSI Design (3rd edition), page 351 - 352. Prentice Hall.

[2.8]

Data Sheet, page 15 - 44.

https://www.xilinx.com/support/documentation/user_guides/ug474_7Series_CLB.pdf Verified 2018-05-05.

[2.9]

Data Sheet.

https://www.xilinx.com/support/documentation/user_guides/ug579-ultrascale-dsp.pdf Verified 2018-05-05.

[2.10]

Data Sheet.

https://www.xilinx.com/support/documentation/user_guides/ug473_7Series_Memory_Resources.pdf

Verified 2018-05-05.

[2.11]

Wayne Wolf (2002). In: Modern VLSI Design (3rd edition), page 400 - 401. Prentice Hall.

[2.12]

Lars-Hugo Hemert (2013). In: Digitala kretsar (3rd edition), page 17. Holmbergs.

[2.13]

Wayne Wolf (2002). In: Modern VLSI Design (3rd edition), page 352. Prentice Hall.

[2.14]

Wai-Kai Chen (2005). In: The Electrical Engineering Handbook, page 266 - 272. Elsevier.

[2.15]

Page 1 - 5.

http://www.ti.com/lit/an/scaa035b/scaa035b.pdf Verified 2018-05-05.

[2.16]

Al-Haj Baddar S.W., Batcher K.E. (2011). In: Designing Sorting Networks, page 57 - 59. Springer.

[2.17]

Knuth, Donald Ervin (1998). In: The Art of Computer Programming, Vol. 3: Sorting and Searching (2nd edition), page 241. Addison-Wesley.

[2.18]

Al-Haj Baddar S.W., Batcher K.E. (2011). In: Designing Sorting Networks, page 1 - 7. Springer.

[2.19]

Knuth, Donald Ervin (1998). In: The Art of Computer Programming, Vol. 3: Sorting and Searching (2nd edition), page 224 - 226. Addison-Wesley.

[2.20]

Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, Clifford Stein (2009). In: Introduction to Algorithms (2nd edition), page 11. MIT Press.

[3.1]

Web page.

https://www.xilinx.com/html_docs/xilinx2017_4/sdsoc_doc/unp1504034287150.html Verified 2018-05-05.

[3.2]

Poynton, Charles (2012). In: Digital Video and HD – Algorithms and Interface (2nd edition), page 467. Elsevier.

[5.1]

Data Sheet

https://www.yoctopuce.com/EN/products/yocto-amp/doc/YAMPMK01.usermanual.html Verified 2018-05-05.

[5.2]

Alan S. Morris, Reza Langari (2016). In: Measurements and Instrumentation, Theory and Application (2nd edition), page 19 - 20. Elsevier.

Web address 1:

https://www.youtube.com/watch?v=ekthcIHDt3I Verified 2018-05-05

Web address 2:

http://pixelsciences.blogspot.se/2017/07/adaptive-median-filter-for-image-corrupted-by-salt-and-pepper-noise.html

Verified 2018-05-05

Web address 3:

http://www.fit.vutbr.cz/~vasicek/imagedb/?lev=60 Verified 2018-05-05

Web address 4:

https://stackoverflow.com/questions/18427031/median-filter-with-python-and-opencv

Verified 2018-05-05

9 Figures

Figure 2.1: Nine pixels for use in the convolution.

Figure 2.2: The kernel for vertical edges.

Figure 2.3: The kernel for horizontal edges.

Figure 2.4: Visualization of valid data and invalid data (blanking).

Figure 2.5: Visualization of the odd-even sorting network. Data flows from left to right. The vertical lines connecting the horizontal lines are comparators.

Figure 2.6: Visualization of the Batcher odd-even merge sorting network. Data flows from left to right. The vertical lines connecting the horizontal lines are comparators.

Figure 3.1: Estimation of power consumption for 74.25MHz clock.

Figure 3.2: Estimation of power consumption for 66 MHz clock.

Figure 4.1: Buffering of the incoming lines and pixels for calculation.

Figure 4.2: Flowchart of the calculations in the Sobel filter.

Figure 4.3: The amount of logic used for the Sobel filter using SRL for linebuffers can be seen highlighted in blue.

Figure 4.4: The amount of logic used for the Sobel filter using BRAM for linebuffers can be seen highlighted in blue.

Figure 4.5: Top hierarchy overview of the block design.

Figure 4.6: Relation between the three different data enable signals used in the system. The top line is the incoming data_en, the middle line is data_en for the filter and the bottom line is data_en for the output. The colored dashed lines show the timing between the signals.

Figure 5.1: Test screen used for the measurements. https://www.youtube.com/watch?v=ekthcIHDt3I

Figure 5.2: Test screen after applying the Sobel filter.

Figure 5.3: Noisy image. http://pixelsciences.blogspot.se/2017/07/adaptive-median-filter-for-image-corrupted-by-salt-and-pepper-noise.html

Figure 5.5: Noisy image. http://www.fit.vutbr.cz/~vasicek/imagedb/?lev=60

Figure 5.7: Noisy image. https://stackoverflow.com/questions/18427031/median-filter-with-python-and-opencv

Figure 5.8: After applying the median filter to Figure 5.7 the noise is reduced.