
TVE-F 18036

Degree project 15 credits (Examensarbete 15 hp), July 2018

Power Efficiency of Radar Signal Processing on Embedded Graphics Processing Units (GPUs)

Simon Blomberg


Teknisk-naturvetenskaplig fakultet, UTH-enheten (Faculty of Science and Technology, UTH Division)

Visiting address: Ångströmlaboratoriet, Lägerhyddsvägen 1, Hus 4, Plan 0
Postal address: Box 536, 751 21 Uppsala
Telephone: 018 – 471 30 03
Fax: 018 – 471 30 00
Website: http://www.teknat.uu.se/student

Abstract

Power Efficiency of Radar Signal Processing on Embedded Graphics Processing Units (GPUs)

Simon Blomberg

In recent years the use of graphics processing units for general purpose computation has been increasing. This provides a relatively cheap and easy way of optimizing computation-intensive tasks. Although a lot of research has been done on the subject, the power consumption aspect remains unclear. This thesis treats the implementation and benchmarking of three radar signal processing algorithms on the CPU and GPU of the Jetson Tegra X2 module. The objective was to measure the power consumption and speed of the GPU versus the CPU implementations.

All three algorithms executed most efficiently on the GPU, both in terms of power consumption and speed. The Space Time Adaptive Processing algorithm showed the biggest speedup and the Corner Turn the smallest. It was found that both the compute and power efficiency of the GPU implementations were lower for sufficiently small input matrices.

Subject reader (Ämnesgranskare): Huan Wang. Supervisor (Handledare): Jimmy Pettersson.


Popular science summary (Populärvetenskaplig sammanfattning)

This report covers a study measuring the performance differences between the graphics card and the processor for some common radar algorithms. Since the graphics card executes code in a more parallelized way, it is expected to be faster for large data sets. The energy consumption of the algorithms has also been measured and compared between the graphics card and the processor. The experiments were carried out on the Jetson Tegra X2 computing platform, which was developed precisely with energy consumption combined with high performance in mind. To compare the two computation units, three radar signal processing algorithms were programmed for both the graphics card and the ordinary CPU: Time Domain Finite Impulse Response (TDFIR), Space Time Adaptive Processing (STAP) and Corner Turn (CT). All of these algorithms are used extensively in the signal processing industry, and STAP in particular is used a lot for calibration of radar systems. In many signal processing systems it is a high priority that the algorithms can run fast, and for some applications also with low energy consumption.

For all implemented algorithms the graphics card performed many times faster: everything from the CT algorithm, which was 10-60 times faster, to STAP, which was up to 12000 times faster. This was also reflected in the energy consumption, which was likewise much lower for the graphics card implementations than for the ordinary processor.


Contents

1 Background
  1.1 The Jetson Tegra X2 module
  1.2 General Purpose Graphics Programming
2 Objective
  2.1 Hypothesis
3 Profiling methods
4 Overview of the algorithms
  4.1 Time domain finite impulse response - TDFIR
  4.2 Corner turn - CT
  4.3 Space Time Adaptive Processing - STAP
5 Implementation
  5.1 Time domain finite impulse response - TDFIR
  5.2 Corner turn - CT
  5.3 Space Time Adaptive Processing - STAP
  5.4 The HPEC benchmarking suite
6 Results and Discussion
  6.1 Time domain finite impulse response - TDFIR
    6.1.1 One dimensional output
    6.1.2 Two dimensional output
    6.1.3 Three dimensional output
    6.1.4 Summary
    6.1.5 Discussion
  6.2 Space Time Adaptive Processing - STAP
    6.2.1 Discussion
  6.3 Corner turn - CT
    6.3.1 Discussion
7 Conclusions


1 Background

The demand for high performance numerical computing has been increasing for the past 20 years, and processor manufacturers are constantly trying to optimize their architectures to achieve faster computing. This has led to the development of very high performance Central Processing Units (CPUs) with billions of transistors. [1] In recent years, however, this development has somewhat stalled, mostly due to the physical limits of transistor scaling. [2]

To circumvent these problems, some companies have resorted to highly parallelized computing methods. Graphics card manufacturers have developed easy methods to use the heavily parallelized GPU for numerical tasks other than graphics processing. This approach is called general purpose graphics programming (GPGPU). In 2007 NVIDIA released their framework for GPGPU programming called CUDA. Since then, much research has been done on the effectiveness of this method versus normal CPU methods, and it is now used extensively in many high performance applications. [3] [4]

One aspect of this which has become more relevant in recent years is the use of GPGPU methods in embedded applications with high demands on both power consumption and speed. In 2008 NVIDIA released a line of devices called Tegra which addressed this aspect and was intended for use in smart phones and embedded applications. As of today, one of the newest devices in this series is the Tegra X2, on which the benchmarks of this study are run.

1.1 The Jetson Tegra X2 module

In 2017 NVIDIA released the Jetson Tegra X2 module for use in embedded applications with high demands on power efficiency. The Tegra X2 combines the Pascal graphics architecture with an ARM CPU. It was developed for advanced data processing such as artificial intelligence and signal processing applications. The Tegra module shares its memory between the GPU and CPU, removing the need to copy data back and forth before and after executing GPU programs. [5]

1.2 General Purpose Graphics Programming

General purpose graphics programming, or GPGPU, is the use of the Graphics Processing Unit (GPU) to handle tasks traditionally handled by the CPU. This includes many signal processing tasks such as convolving two vectors or transposing a big matrix. Many tasks that handle big data sets are suitable for GPGPU, since this method is heavily parallelized and much of the data can be processed simultaneously, leading to a big speedup compared to iterative CPU execution. [6]

There are multiple programming languages and frameworks available for general purpose programming on the GPU. The one used in this report is the CUDA language developed by NVIDIA, since the target platform is also developed by NVIDIA and is optimized for this framework. CUDA is a language very similar to C/C++, and one can mix the host C/C++ code with the device CUDA code in the same file. This file is then compiled by the NVIDIA compiler nvcc to produce an executable containing both the GPU and CPU machine code. The executable runs on the host, and the GPU part, called the kernel, is sent to the GPU to be executed there.

Since the Jetson Tegra device shares its memory between the GPU and CPU, no data needs to be transferred to and from the device. For other architectures where this is needed, these memory copies can be more or less overlapped with kernel execution, so they only introduce a small delay before the first kernel execution. [5]
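As a concrete illustration of the two paragraphs above, here is a minimal sketch (not taken from the thesis; the kernel name and sizes are illustrative) of a single-file CUDA program that mixes host and device code and uses CUDA managed memory, which on the Tegra's physically unified memory involves no hidden host-device copies:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Device code: the kernel, executed by many threads in parallel on the GPU.
__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)
        data[i] *= factor;
}

// Host code: compiled by nvcc into the same executable as the kernel.
int main() {
    const int n = 1 << 20;
    float *data;
    // Managed memory is visible to both CPU and GPU; on the Tegra the two
    // share the same physical memory, so no copies happen behind the scenes.
    cudaMallocManaged(&data, n * sizeof(float));
    for (int i = 0; i < n; ++i) data[i] = 1.0f;

    scale<<<(n + 255) / 256, 256>>>(data, 2.0f, n);  // launch the kernel
    cudaDeviceSynchronize();                         // wait for it to finish

    printf("data[0] = %f\n", data[0]);               // prints 2.000000
    cudaFree(data);
    return 0;
}
```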

2 Objective

The objective of this study is to measure the power and compute efficiency of a few algorithms commonly used in radar signal processing. This provides insight into when the use of GPGPU programming is desirable on low power embedded systems. Some of the algorithms were implemented in stages and measured before and after certain optimizations, to determine if and how a given optimization affects latency and power consumption. This can be used to build intuition for whether a given optimization is worth doing. A lot of research has been done on the performance considerations of GPGPU, but the embedded power aspect of this area is still quite unexplored. [7] [3]

2.1 Hypothesis

Since the GPU is able to execute many instructions simultaneously, and these kinds of large radar signal processing algorithms are well suited for that, the results are expected to confirm previous research in that the GPU should outperform the CPU in terms of speed. Since an effective implementation leads to less energy being consumed per unit of computation, the power consumption is also expected to be lower for the GPU implementations.


3 Profiling methods

Three main metrics of the GPU and CPU executions were measured:

• Compute efficiency in terms of floating point operations per second (GFLOP/s) or bandwidth (GB/s) depending on the algorithm.

• Power efficiency in terms of performance per watt (GFLOPS/W or GB/s per watt), depending on the algorithm.

• Speedup (CPU Execution time / GPU Execution time)

To measure the computation time of the GPU kernels, the nvprof profiler provided by NVIDIA was used. A Python script was written to parse the output of this program for different input sizes, and the results were saved to a file.

The CPU execution time was measured with the C standard library timing functions and written to standard output.

The power measurements were done by reading the sysfs nodes provided by the drivers for the Tegra device, both for the CPU and GPU implementations. All data processing and visualization was done with a combination of Python [8] and Matlab [9] scripts.
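For reference, a minimal host-side sketch of this measurement approach: CPU timing via the C standard library and reading an instantaneous power sample from a sysfs node. The node path is taken from the command line because the exact INA3221-based paths on the Tegra X2 depend on the L4T release; the milliwatt unit is an assumption to verify on the device.

```cuda
#include <cstdio>
#include <ctime>
#include <fstream>
#include <string>

// Read one instantaneous power sample from a sysfs node. The node exposes a
// single integer (assumed milliwatts on the Tegra's INA3221 rails; verify
// the unit and the exact path on the actual device).
static long readPowerSample(const std::string &node) {
    std::ifstream f(node);
    long value = -1;
    f >> value;
    return value;
}

int main(int argc, char **argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s <sysfs-power-node>\n", argv[0]);
        return 1;
    }

    timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    // ... run the implementation under test here ...
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double seconds = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    printf("elapsed: %.6f s, power sample: %ld\n",
           seconds, readPowerSample(argv[1]));
    return 0;
}
```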

4 Overview of the algorithms

4.1 Time domain finite impulse response - TDFIR

Figure 1: Diagram of the one dimensional TDFIR algorithm.


The TDFIR algorithm is used heavily in signal processing and consists of a number of scalar multiplications of a filter against subsections of a larger input vector. For a given input vector $x_n$ with $N$ elements and a filter $f_k$ with $K$ elements, the output $y_n$ is defined as

$$y_n = \sum_{k=0}^{K} x_{n-k} f_k \quad \text{for } n \in N \tag{1}$$

In this report, three different versions of this algorithm were implemented. The first operates on an input vector and a filter vector as in the definition above. The second and third implementations are versions of this, with the exception that the inputs may be matrices; the algorithm then iterates over the rows of the matrix and performs the above algorithm for each row.

4.2 Corner turn - CT

The corner turn or CT algorithm is a simple matrix transpose, which swaps the two indices of the input matrix. Mathematically this is represented by the following equation

$$N_{ij} = M_{ji} \tag{2}$$

for an input matrix $M$ and an output matrix $N$.

4.3 Space Time Adaptive Processing - STAP

In radar signal processing the Space Time Adaptive Processing algorithm is often used to filter noise out of data in high-interference applications. The input to the algorithm is a three dimensional matrix $M$ of size $r \times d \times b$, where the dimensions are called range, doppler and beam. It produces an output 3D matrix $N$ of size $3b \times 3b \times (d-2)$ that can be used to solve a linear system of equations producing weights for noise filtering. The input matrix is iterated over the $d$ dimension, and three beams at a time are used to form a 2D covariance matrix for each doppler. These matrices are then concatenated into the 3D output.

5 Implementation

5.1 Time domain finite impulse response - TDFIR

Three versions of the finite impulse response filter have been implemented with different characteristics. The first implementation is a simple one that takes as input a one dimensional vector A of size a and a filter vector F of size f. The filter is loaded into the GPU's constant memory before the kernel is launched. The goal is then to load the data needed for one output block into shared memory, to be quickly accessible. This consists of loading BS + (f − 1) entries into shared memory for each block. The output is then calculated by accumulating reads from the shared buffer and constant memory in a GPU register.

The second implementation takes as input a vector A of size a and a P × Q filter matrix F consisting of several filters collected in the rows of the matrix. Much as in the above implementation, four rows of the filter matrix are loaded into constant memory to be quickly accessible, and BS + (Q − 1) floats are loaded into shared memory for each block. The calculation itself is then performed with the four loaded filters at once to maximize the compute efficiency of the algorithm. The kernel is then re-run with four new filters loaded into constant memory. The output of this implementation becomes a two dimensional matrix of size Q × a, with each row representing the corresponding filter applied to A.

The last and most optimized version of TDFIR takes an M × N matrix A and a P × Q filter F as input and applies each row of F to every row of A. This implementation also loads four rows of F into constant memory before launching the kernel. The kernel then loads a two dimensional block of size BS × (BS + (Q − 1)) from A into shared memory. Each thread in the block then reads one value from shared memory and an element from each of the four filters. These are multiplied together and accumulated in registers. This implementation only reads once from shared memory; the computation is then done completely in registers to minimize shared memory utilization. The output of this algorithm is an M × N × Q matrix where the depth represents which filter was applied and the row which input vector it was applied to.

5.2 Corner turn - CT

The implementation of this algorithm is quite straightforward. It loads a block of 32×32 values into shared memory and writes them directly to global memory with the indices reversed.
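A sketch of such a tiled transpose kernel is shown below (the thesis text only states the 32×32 staging; the +1 tile padding, a standard trick to avoid shared-memory bank conflicts, is an addition here):

```cuda
// Tiled corner turn: a 32x32 block is staged in shared memory, then written
// back transposed so that both the global read and write are coalesced.
#define TILE 32

__global__ void cornerTurn(const float *in, float *out, int rows, int cols) {
    __shared__ float tile[TILE][TILE + 1];  // +1 avoids bank conflicts

    int x = blockIdx.x * TILE + threadIdx.x;   // column in the input
    int y = blockIdx.y * TILE + threadIdx.y;   // row in the input
    if (x < cols && y < rows)
        tile[threadIdx.y][threadIdx.x] = in[y * cols + x];
    __syncthreads();

    int tx = blockIdx.y * TILE + threadIdx.x;  // column in the output
    int ty = blockIdx.x * TILE + threadIdx.y;  // row in the output
    if (tx < rows && ty < cols)
        out[ty * rows + tx] = tile[threadIdx.x][threadIdx.y];  // N_ij = M_ji
    }
```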


5.3 Space Time Adaptive Processing - STAP

The STAP algorithm was implemented by iterating over the range and loading 96 values, or three beams of data, into registers for each block. Then another subsection of the matrix, representing the data to be multiplied with, is loaded into other registers. The result is then calculated in registers only, with FMA operations. This register heavy implementation does not make use of shared memory and minimizes the use of global memory. This increases efficiency, since these kinds of memory operations are very time consuming compared to register loads and stores. Since this implementation uses 64 registers for each thread, only 32 blocks can be executed simultaneously. This may be a performance constraint that could be optimized away by using fewer registers. [10]
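The register-accumulation-with-FMA pattern can be illustrated with a deliberately simplified covariance accumulation. This is not the thesis kernel (real STAP covariance matrices are built from complex-valued, three-beam stacked data); it only shows the general technique of keeping the running sum in a register and updating it with fused multiply-adds:

```cuda
// Illustrative sketch: each thread accumulates one element of a real-valued
// covariance matrix R = sum_r x_r x_r^T in a register using FMA. The input x
// holds n snapshot vectors of length m, row-major.
__global__ void covarianceNaive(const float *x, float *R, int n, int m) {
    int i = blockIdx.y * blockDim.y + threadIdx.y;  // row of R
    int j = blockIdx.x * blockDim.x + threadIdx.x;  // column of R
    if (i >= m || j >= m) return;

    float acc = 0.0f;                               // register accumulator
    for (int r = 0; r < n; ++r)
        acc = fmaf(x[r * m + i], x[r * m + j], acc);  // FMA, no shared memory
    R[i * m + j] = acc;
}
```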

5.4 The HPEC benchmarking suite

The HPEC benchmarking suite is a set of algorithms and implementations for the CPU intended to measure the performance of embedded processing units. The suite consists of several algorithms commonly used in signal processing, and in this report it is used to compare GPU performance with the CPU equivalent. The STAP algorithm is not present in the HPEC suite, but a CPU version was written that should represent a standard implementation.

6 Results and Discussion

This part of the report presents the results from the benchmarks, along with some discussion of why the results look the way they do, both in terms of power consumption and computational efficiency.

6.1 Time domain finite impulse response - TDFIR

6.1.1 One dimensional output

The first and simplest implementation of the TDFIR algorithm, which operates on a one dimensional filter and input vector, shows an improvement over the CPU in terms of GFLOP/s by a factor of 80 - 100. There is still much room for optimization of this algorithm, since the GPU kernel only reaches about a tenth of the theoretical compute bound of the Tegra device. The increase in speed on the GPU compared to the CPU is more apparent when the input data and filter size are big. The power consumption of the GPU kernel is somewhat larger than that of the CPU program, but the total energy consumed by the CPU is still larger due to the longer execution time.

Figure 2: Speedup of version one compared to the CPU implementation for different input vector sizes.

6.1.2 Two dimensional output

The HPEC suite of algorithms for the CPU provides an implementation of TDFIR that operates on a single one dimensional input vector with many filters. To correctly compare the GPU kernels with HPEC, a similar GPU version was implemented. When run on the relatively small data sets provided by HPEC, the kernel performed poorly and reached an efficiency of a few percent of the theoretical compute bound of the Tegra. However, it still achieves a speedup compared to the CPU of about 270 for input sizes of 4096, as can be seen in Figure 3. For larger input sizes the speedup drops, as the CPU implementation from HPEC performs better at these sizes.

Figure 3: The speedup of FIR with two dimensional output compared to the CPU (peak of about 269 at input size 4096).

6.1.3 Three dimensional output

To simulate more realistic conditions, a version of TDFIR with three dimensional output was created that operates on an input matrix and a filter matrix and produces one matrix per filter. This implementation produced a significantly higher efficiency, both in terms of computation and power consumption. As seen in figure 4, this is most evident with large input vector sizes. However, for inputs bigger than 64 (the block size used) the compute efficiency appears to be quite constant, and even declines for sufficiently big input sizes. The speedup compared to the CPU program behaves the same way, increasing rapidly for small input sizes while staying more or less constant for sizes bigger than the block size. Compared to the CPU program, the kernel executed significantly faster for big input sizes. This is illustrated by Figure 5, where the speedup reaches around 180.

Figure 4: Floating point operations per second of version 5 for different input vector sizes with filter size 128, 1024 input vectors and 64 filters.

Figure 5: Speedup for input vector sizes 128, 512 and 1024 with filter size 128.

The power consumed by the GPU is practically constant for input sizes smaller than the block size, while it increases rapidly for larger inputs. Figure 7 illustrates the power consumption of the different components of the Tegra device for different sizes of the input vector. Compared to the CPU, the power consumed per FLOP is very low, as can be seen in figure 6, which presents the power efficiency for different inputs.

Figure 6: Floating point operations per watt of version 5 for varying input sizes.

Figure 7: Power consumption of some components of the Tegra while executing the version 5 kernel.

The performance of the implementation is also heavily dependent on the size of the filters applied. Figure 8 illustrates that the maximum efficiency in terms of floating point operations per second is achieved at a filter size of 128. For bigger sizes the computational efficiency drops rapidly, and because of this the speedup also drops as the filter size grows beyond 128. However, with a filter size of 512 one still obtains a speedup of around 70 over the CPU. This is evident in figure 9. The power consumed by the GPU algorithm is significantly smaller than that of the CPU version for all filter sizes. As the filter size grows this difference gets bigger, and eventually, for filters of size 64 and 128, the CPU consumes over 150 times more energy per GFLOP than the GPU. This is illustrated by figure 10.

Figure 8: Computational efficiency of the final FIR version for some different filter sizes.

Figure 9: Speedup of the final FIR version for some different filter sizes.

Figure 10: Power efficiency for filter sizes 64 and 128 with an input vector size of 1024.

6.1.4 Summary

All three GPU versions of the TDFIR algorithm performed better than the corresponding CPU programs, both in terms of speed and power efficiency. In all implementations the speedup decreases for smaller input and filter sizes. The energy efficiency of all three kernels was observed to be larger for big input data sizes. Both the GPU and CPU programs performed more efficiently when acting on two dimensional input and producing three dimensional output. The speedup was biggest in the first version, even though the compute efficiency of that kernel was notably smaller than that of version 5.

6.1.5 Discussion

The GPU clearly outperformed the CPU on this algorithm, both in terms of speed and power efficiency. It can be observed that in all three versions the GPU kernel is more effective when applied to large input data, both in terms of filter size and input vector size. This is probably due to the heavy parallelization of the GPU: when the input is large enough, the GPU can utilize more of its resources simultaneously. The CPU, on the other hand, has greater power efficiency on smaller inputs. This is most likely due to an increase in cache misses on the larger data sets, resulting in more energy consumed and a lower GFLOP/s rate.

In figure 4 one can see that the peak computation speed is achieved at a vector size of 64. After this the computation rate falls off slightly and approaches 120 GFLOP/s. The explanation of this behavior is not obvious. It seems that the implementation is bounded by something which becomes increasingly relevant at big input vector sizes; what this bound is has not been studied here.

To explain why the speedup is larger for version 1 than for version 5, even though the compute efficiency is remarkably higher for the latter kernel, we need to look at the computation speed of the CPU version. In figure 11 we notice that the computation efficiency of the CPU is greater for version 5, which explains why the speedup drops.

Figure 11: Compute efficiency of versions 1 and 5 on the CPU.


6.2 Space Time Adaptive Processing - STAP

The GPU implementation of the STAP algorithm reached a maximum computation efficiency of 240 GFLOP/s, which is remarkably high compared to the algorithm above. For range sizes below 260 the efficiency increased exponentially, as seen in figure 12. It can clearly be seen that the algorithm performs fastest with a range size of around 600; at bigger range sizes the efficiency drops to around 233 GFLOP/s. Compared to the CPU program the GPU performed extremely well, as can be seen from the speedup in figure 13.

Figure 12: Computation efficiency of the STAP GPU implementation with different input ranges.

In terms of power consumption the algorithm is also very efficient compared to the CPU implementation. For a range of 960 and a doppler size of 32, the power efficiency of the GPU program is 84850 times better. For smaller doppler sizes the advantage gets smaller, but the kernel still outperforms the CPU program by a factor of at least 28240 at a doppler size of 4. As can be seen in figure 14, the power efficiency is at its highest around range sizes of 300. For larger and smaller ranges the power efficiency decreases, but it still remains over 18.6 GFLOPS/W for all tested inputs.

Figure 13: Speedup (×10^4) of the GPU STAP kernel compared to the CPU equivalent for different range sizes.

Figure 14: Power efficiency of the GPU STAP implementation for different range sizes.

6.2.1 Discussion

The advantage of using GPGPU for the STAP algorithm seems obvious, as the power consumption per GFLOP/s is significantly lower for the GPU implementation. Even for relatively small doppler sizes the CPU consumes much more power. The computation efficiency is also notably higher on the GPU, especially for big data inputs. Since this algorithm is computationally heavy, the execution is not bound by memory accesses; it is therefore very suitable for execution on the GPU. This particular implementation utilizes local thread registers to maximize throughput, further raising the point at which memory accesses become the bottleneck.

Table 1: Number of blocks executed for some different input parameters of the STAP implementation.

Range size   Doppler size   Nr of blocks
50           16             42
100          32             90
200          64             192
400          128            378
800          256            762

As figures 12 and 14 show, the compute and power efficiency fall off at the highest range sizes. Since this kernel uses a lot of registers per thread, fewer blocks can be executed by each SM at once. This is also the case for smaller input sizes, but the performance effect may be more evident when many blocks need to be executed. For small ranges the implementation needs to execute fewer blocks, and the SMs may be able to execute them simultaneously, hiding this latency. Table 1 shows how many blocks need to be executed for different ranges. The current implementation allows 32 blocks to be executed at once.
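As a rough illustration of why the register budget limits concurrency (assuming the usual Pascal figure of 65,536 32-bit registers per SM, which the thesis does not state), the register-limited thread count per SM is

$$\left\lfloor \frac{65536}{64} \right\rfloor = 1024,$$

so at 64 registers per thread each SM can keep at most 1,024 threads resident, regardless of block size.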

6.3 Corner turn - CT

The corner turn algorithm is very bandwidth bound compared to the other algorithms, since it does no computations apart from index calculations for memory operations. Therefore the main measure of performance is bandwidth rather than compute efficiency, that is, the number of bytes loaded and stored to memory per second. In figure 15 one can clearly see that the GPU version has a lower bandwidth for smaller matrices. The CPU version behaves the opposite way and performs worse for bigger input data. This is also reflected in the power efficiency presented in figure 16, since the power consumption is more or less constant for all inputs; this can be seen by noting that the graphs are nearly identical. In terms of speed the GPU version outperformed the CPU quite clearly, with a speedup of around 45 for large enough input matrices, as seen in figure 18.

Figure 15: Bandwidth of the corner turn implementation for different matrix heights.

Figure 16: Power efficiency of the corner turn implementation for different matrix heights.

Figure 17: Power consumed by the GPU for different matrix heights.

Figure 18: Speedup of the corner turn implementation for matrix heights 1024, 2048 and 4096.

6.3.1 Discussion

Compared to the previously discussed algorithms, the CT GPU implementation did not achieve as much speedup over the CPU. For the smallest input matrix the GPU only outperformed the CPU by a factor of about 2.5, which is relatively small in the context of GPGPU. The difference may be explained by the fact that, as opposed to the other algorithms, CT does not perform any floating point computations. This may be used as an argument that bandwidth bound algorithms are not as suitable for GPGPU implementation as compute bound ones. Although this may be the case, the GPGPU implementation still achieves a large speedup compared to the CPU for most common matrix sizes used in signal processing.

One interesting thing to note is the decline in power efficiency for input sizes larger than 1024, as seen in figure 16. This is due to the increased power consumption of the GPU implementation for larger sizes, as seen in figure 17.


7 Conclusions

Three algorithms have been benchmarked on the CPU and GPU, and for all tests the GPU implementations outperformed the CPU both in terms of power efficiency and speed. The first and most significant conclusion must be that these kinds of algorithms are very appropriate for GPU implementation. Further, one can establish that the algorithm best suited for the GPU was STAP, since this algorithm achieved the biggest speedup compared to the CPU. This is most likely due to the fact that the algorithm is very compute bound. The CT algorithm did not perform as well on the GPU, since its execution time is more bound by memory speed. For all algorithms the biggest speedup was reached for bigger input data sizes, as expected. The CPU implementations all performed better at smaller sizes, most likely due to efficient caching of coalesced memory accesses. When the input size increases beyond what the CPU cache can handle, the CPU's performance drops rapidly and the speedup increases.

Moreover, one can establish that the power efficiency of the implementations is mostly dependent on the computation efficiency, since the actual power consumption is fairly constant when varying the input, as opposed to the computation speed. This means that the most effective way to make an algorithm more power efficient is to optimize it to run faster, thereby decreasing the total amount of energy consumed. A direct effect of this is that, for the tested implementations, the GPU was more power efficient than the CPU since it consistently outperformed the CPU in terms of computation speed.


References

[1] W. D. Nordhaus, "Two centuries of productivity growth in computing," 2007. [Online]. Available: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.330.1871&rep=rep1&type=pdf

[2] S. Kumar, "Fundamental limits to Moore's law," 2015.

[3] J. Pettersson and I. Wainwright, "Radar signal processing with graphics processors (GPUs)," Uppsala Universitet, 2010. [Online]. Available: https://www.diva-portal.org/smash/get/diva2:292558/FULLTEXT01.pdf

[4] R. C. B. Jr., "Using state-of-the-art GPGPUs for molecular simulation: Optimizing massively parallelized n-body programs using NVIDIA Tesla," Uppsala Universitet, 2009. [Online]. Available: http://uu.diva-portal.org/smash/get/diva2:278869/FULLTEXT01.pdf

[5] NVIDIA, "Jetson Tegra hardware specifications," 2018. [Online]. Available: https://www.nvidia.com/en-us/autonomous-machines/embedded-systems-dev-kits-modules/ (visited on 05/17/2018).

[6] D. M. Chitty, "A data parallel approach to genetic programming using programmable graphics hardware," 2007.

[7] J. Jansson, "Integrated GPUs: How useful are they in HPC?" Uppsala Universitet, 2013. [Online]. Available: http://uu.diva-portal.org/smash/get/diva2:675727/FULLTEXT01.pdf

[8] Python Software Foundation, "Python official website," 2018. [Online]. Available: https://www.python.org/ (visited on 06/05/2018).

[9] MathWorks, "MATLAB official website," 2018. [Online]. Available: https://se.mathworks.com/products/matlab.html (visited on 06/05/2018).

[10] J. Pettersson and I. Wainwright, "Radar signal processing on graphics processors (NVIDIA/CUDA)," 2010.
