Department of Electrical Engineering (Institutionen för systemteknik)

Master's thesis

Evaluation of computer vision algorithms optimized for embedded GPU:s

Master's thesis carried out in Computer Vision at the Institute of Technology at Linköping University

by

Mattias Nilsson

LiTH-ISY-EX--14/4816--SE

Linköping 2014

Department of Electrical Engineering, Linköpings tekniska högskola, Linköpings universitet

Evaluation of computer vision algorithms optimized for embedded GPU:s

Master's thesis carried out in Computer Vision at the Institute of Technology at Linköping University

by

Mattias Nilsson

LiTH-ISY-EX--14/4816--SE

Supervisors: Erik Ringaby, ISY, Linköpings universitet
             Johan Pettersson, SICK IVP

Examiner: Klas Nordberg, ISY, Linköpings universitet


Division, Department (Avdelning, Institution): Computer Vision Laboratory, Department of Electrical Engineering, SE-581 83 Linköping

Date (Datum): 2014-05-20

Language (Språk): English

Report category (Rapporttyp): Examensarbete (Master's thesis)

URL for electronic version: http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-XXXXX

ISBN: —

ISRN: LiTH-ISY-EX--14/4816--SE

Title of series, numbering (Serietitel och serienummer): —

ISSN: —

Title (Titel): Utvärdering av bildbehandlingsalgoritmer optimerade för inbyggda GPU:er. (Evaluation of computer vision algorithms optimized for embedded GPU:s)

Author (Författare): Mattias Nilsson

Abstract (Sammanfattning)

The interest in using GPU:s as general processing units for heavy computations (GPGPU) has increased in the last couple of years. Manufacturers such as Nvidia and AMD make GPU:s powerful enough to outrun CPU:s by one order of magnitude for suitable algorithms. For embedded systems, GPU:s are not as popular yet. The embedded GPU:s available on the market have often not been able to justify hardware changes from the current systems (CPU:s and FPGA:s) to systems using embedded GPU:s. They have been too hard to get, too energy consuming and not suitable for some algorithms. At SICK IVP, advanced computer vision algorithms run on FPGA:s. This master thesis optimizes two such algorithms for embedded GPU:s and evaluates the result. It also evaluates the status of the embedded GPU:s on the market today. The results indicate that embedded GPU:s perform well enough to run the evaluated algorithms as fast as needed. The implementations are also easy to understand compared to implementations for FPGA:s, which are competing hardware.

Keywords (Nyckelord): —


Abstract

The interest in using GPU:s as general processing units for heavy computations (GPGPU) has increased in the last couple of years. Manufacturers such as Nvidia and AMD make GPU:s powerful enough to outrun CPU:s by one order of magnitude for suitable algorithms. For embedded systems, GPU:s are not as popular yet. The embedded GPU:s available on the market have often not been able to justify hardware changes from the current systems (CPU:s and FPGA:s) to systems using embedded GPU:s. They have been too hard to get, too energy consuming and not suitable for some algorithms. At SICK IVP, advanced computer vision algorithms run on FPGA:s. This master thesis optimizes two such algorithms for embedded GPU:s and evaluates the result. It also evaluates the status of the embedded GPU:s on the market today. The results indicate that embedded GPU:s perform well enough to run the evaluated algorithms as fast as needed. The implementations are also easy to understand compared to implementations for FPGA:s, which are competing hardware.


Acknowledgments

This project could not have been executed without the help from Johan Pettersson and Johan Hedborg. Thank you very much! I would also like to thank Erik Ringaby and Klas Nordberg from CVL for their help.

Linköping, June 2014 Mattias Nilsson


Contents

Notation xi

1 Introduction 1

1.1 Background . . . 1

1.2 Purpose and goal . . . 2

1.3 Delimitations . . . 2

1.4 Hardware . . . 3

2 Sequential Algorithms 5

2.1 Rectification of images . . . 5

2.2 Pattern recognition . . . 6

2.2.1 Normalized cross correlation . . . 7

2.2.2 Scaling and rotation . . . 7

2.2.3 Complexity . . . 8

2.2.4 Sequential implementation . . . 8

2.2.5 Pyramid image representation . . . 9

2.2.6 Non maxima suppression . . . 10

3 Parallel programming in theory and practice 13

3.1 GPU-programming . . . 13

3.1.1 Memory latency . . . 13

3.1.2 Implementation . . . 16

3.2 Parallel programming metrics . . . 18

3.2.1 Parallel time . . . 18

3.2.2 Parallel speed-up . . . 18

3.2.3 Parallel Efficiency . . . 19

3.2.4 Parallel Cost . . . 19

3.2.5 Parallel work . . . 19

3.2.6 Memory transfer vs. Kernel execution . . . 19

3.2.7 Performance compared to bandwidth . . . 19

3.3 Related Work . . . 20

4 Method 21

4.1 Initial phase . . . 21

4.2 Parallelization . . . 22

4.3 Theoretical evaluation . . . 22

4.4 Implementation . . . 22

4.5 Evaluation . . . 22

4.6 Alternative methods . . . 23

4.6.1 Theoretical method . . . 23

4.6.2 One algorithm . . . 23

4.6.3 Conclusions . . . 23

5 Rectification of images 25

5.1 Generating test data . . . 25

5.2 Theoretical parallelization . . . 28

5.3 Theoretical evaluation . . . 29

5.4 Implementation . . . 29

5.4.1 Initial implementation . . . 29

5.4.2 General problems . . . 30

5.4.3 Texture memory . . . 31

5.4.4 Constant memory . . . 31

5.5 Results . . . 31

5.5.1 Memory transfer . . . 32

5.5.2 Kernel execution . . . 32

5.5.3 Memory access performance . . . 33

5.6 Discussion . . . 34

5.6.1 Performance . . . 34

5.6.2 Memory transfer . . . 35

5.6.3 Complexity of the software . . . 35

5.6.4 Compatibility and Scalability . . . 35

5.7 Conclusions . . . 35

6 Pattern Recognition 37

6.1 Sequential Implementation . . . 37

6.2 Generating test data . . . 37

6.3 Assuring the correctness of results . . . 38

6.4 Theoretical parallelization . . . 39

6.4.1 Pyramid image representation . . . 40

6.4.2 Parallelizing using reduction . . . 40

6.5 Theoretical evaluation . . . 41

6.5.1 Searching intuitive in full scale . . . 41

6.5.2 Trace searching intuitive . . . 41

6.5.3 Search using reduction . . . 42

6.5.4 Memory transfer vs. kernel execution . . . 42

6.5.5 PMPS . . . 43

6.6 Implementation . . . 43

6.6.1 Implementation of reduction in general . . . 44


6.6.3 Implementation of non maxima suppression . . . 46

6.7 Results . . . 46

6.7.1 Kernel performance . . . 46

6.7.2 Performance of algorithm . . . 46

6.7.3 PMPS . . . 47

6.7.4 Memory access performance and bandwidth . . . 48

6.8 Discussion . . . 49

6.8.1 Intuitive implementation . . . 49

6.8.2 Reduction implementation . . . 50

6.8.3 CPU . . . 51

6.9 Conclusions . . . 51

7 Conclusions 53

7.1 Overall Conclusions . . . 53

7.1.1 Recommendation about hardware . . . 54

7.2 Future . . . 55

7.2.1 Architecture . . . 55

7.2.2 Implementation . . . 55

7.3 Evaluation of method . . . 56

7.4 Work in a broader context . . . 56


Notation

GPU-architecture

Notation Meaning

SM Streaming multiprocessor, a main processor in charge of a number of cores.

Warp Smallest number of threads performing the same operations, often 32.

Compute-capability A number describing which generation of Nvidia GPU-architecture the GPU is built according to. Higher compute capability supports more features.

Kepler The Nvidia GPU-architecture used in the master thesis project, with compute capabilities of 3.0 or 3.2.

Fermi The Nvidia GPU-architecture with compute-capability 2.0-2.9.

CUDA

Notation Meaning

Kernel A CUDA-function written for a GPU.

Thread Each kernel runs a number of parallel threads.

Block A block consists of a number of threads indexed in up to 3 dimensions.

Grid A grid consists of a number of blocks indexed in up to 3 dimensions.

1 Introduction

1.1 Background

The interest in using GPU:s as general processing units for heavy computations (GPGPU) has increased in the last couple of years. Manufacturers such as Nvidia and AMD make GPU:s powerful enough to outrun CPU:s by one order of magnitude for suitable algorithms.

Embedded GPU:s are small GPU:s built into SoC:s (Systems on Chip). SoC:s are integrated circuits where several processor and function blocks are built into one chip. SoC:s are used in embedded systems such as mobile phones. The interest in using embedded GPU:s as general processing units has not been nearly as high as for regular GPU:s yet. The embedded GPU:s available on the market have often not been able to justify hardware changes from the current systems (CPU:s and FPGA:s) to systems using embedded GPU:s. They have been hard to get since few models have been available on the market, their energy consumption has been too high and they have not been suitable for some algorithms. However, the performance of embedded GPU:s improves all the time and it is very likely that their performance will be sufficient in the foreseeable future.

At SICK IVP, advanced computer vision algorithms are accelerated on FPGA:s. Accelerating the algorithms on embedded GPU:s instead might be preferred for several reasons. Apart from possibly being faster, GPU:s are also in general easier to program than FPGA:s. This is because the programming model of a GPU is much more similar to that of a CPU than the programming model of an FPGA is.

1.2 Purpose and goal

The goal of the master thesis is to analyse how well some of the computer vision algorithms that SICK IVP today run on FPGA:s would instead suit running on GPU:s. Critical factors in the analysis are theoretical parallelization, memory access pattern, memory choice and how good the performance is in practice.

Another goal is to determine whether the embedded GPU:s available today are good enough to be considered in computer vision products. The results from the algorithms relate to this question in several important ways by answering the following questions.

• How well is the algorithm parallelized?

• What is the performance of the implemented algorithms compared to what was theoretically expected?

• How device specific are the implementations, i.e. how portable are they?

• Is the performance sufficient?

• Is the code hard to understand, compared to a CPU implementation and compared to an FPGA implementation?

When the algorithms were implemented and evaluated, so that the previous questions could be answered, a recommendation about hardware was made for SICK IVP based on the answers.

1.3 Delimitations

To define the project and to scale it down to a reasonable size some delimitations were made. The delimitations regard implementation, hardware, the number of algorithms and how the result of the project should be interpreted.

To get a perfect idea of the performance of computer vision algorithms on embedded GPU:s, a large number of algorithms could be analysed and implemented. In this project only two algorithms were analysed.

In this project only Nvidia GPU:s were used so that the CUDA programming language could be used. CUDA is a modern GPGPU programming language that is easy to set up and use compared to other GPGPU programming languages. For more information about the hardware choice, see section 1.4.

In GPU-programming a concept called multiple streams exists. Multiple streams are explained in section 3.1.1 and are of interest for the two different algorithms implemented. However, multiple streams are only discussed theoretically and are not implemented.

The recommendation about embedded GPU:s in products, mentioned in section 1.2, is only based on the questions of the same section. Other factors that could be interesting for a hardware choice are not considered.

1.4 Hardware

Nvidia Tegra is Nvidia's product series of SoC:s. They are embedded devices with both CPU:s and GPU:s on the same chip. Three different hardware set-ups were used during the project. Most of the development was performed on a desktop computer featuring a GTX 680 GPU. At the start of the master thesis project there were no devices or test boards on the market that ran embedded GPU:s with a unified shader architecture. In a unified shader architecture all streaming multiprocessors (SM:s) can be used for GPGPU operations, but in a non-unified shader architecture some SM:s are reserved for specific graphics operations. Devices without a unified shader architecture can therefore not be utilized to their full capacity by GPGPU operations. To simulate an embedded device with unified shader architecture, tests were run on a test board featuring an Nvidia Tegra 3 with a separate Geforce GT 640 GPU. Nvidia calls this combination Kayla [Nvidia, 2013]. The separate GPU is there to simulate future devices with a unified shader architecture. The Nvidia Tegra K1 has a GPU based on the Kepler architecture, which includes unified shaders. There are some differences between the Kayla platform and the K1 though. A big performance difference is that the Tegra K1 only has one SM, while the separate GPU of Kayla has two SM:s. Another difference is that the K1 has a GPU and a CPU with a shared memory pool. This kind of memory drastically reduces the transfer time between the CPU and GPU. A third important difference is that the memory bandwidth is higher on the GPU of Kayla, making memory accesses faster. At the end of the project all tests were run on a test board called Jetson TK1 featuring a K1 SoC. All GPU:s used in the project are based on the Kepler architecture, which is the architecture of a specific generation of Nvidia GPU:s.

Some important specifications of the GTX 680, the GT 640 and the Tegra K1 are listed in table 1.1. Since accessing the global memory is a typical bottleneck of a GPU kernel, the memory bandwidth is very important. The core speed is important to be able to make computations as fast as possible. For all GPU:s built on the Kepler architecture an SM contains 192 cores. Therefore the number of SM:s determines the total number of cores.

Table 1.1: Specifications of the GTX 680, the GT 640 and the Tegra K1.

Feature             GTX 680    GT 640     Tegra K1
Memory bandwidth    192 GB/s   29 GB/s    17 GB/s
Core speed          1053 MHz   1033 MHz   950 MHz
Number of SM:s      8          2          1

2 Sequential Algorithms

Many computer vision algorithms are suited for running on GPU:s, and the algorithms chosen for this master thesis project were:

• Rectification of images

• Pattern recognition using normalized cross correlation

The purpose of the first algorithm is to extract a geometrical plane from an image while compensating for the distortion of the camera lens. It was chosen for the project since it is of low complexity. The second algorithm tries to find an object in an image using the intensity of the pixels. It was chosen for the project since it is a common computer vision algorithm of high complexity and because it is, in contrast to the first algorithm, not intuitively well suited for a GPU.

2.1 Rectification of images

The rectification algorithm applies when a camera is installed to capture a planar surface. Although the desired image is one where the camera is placed above the surface pointing straight down at it, see figure 2.1, it is often not desirable to install the camera that way, e.g. because the camera may cast a shadow on the surface. The camera is often placed at around 45 degrees to the surface, see figure 2.1. The purpose of the algorithm is to extract parts of the image using a given homography and lens distortion parameters. Given the homography between the image plane and the surface it is possible to transform the image to be placed in the image plane. Let X be a coordinate in the original image, H the homography between the surface and the image plane and Y the coordinate in the transformed image. X and Y are written in homogeneous form.

Figure 2.1: Camera placed orthogonal to the surface to the left and at approximately 45 degrees to the right.

Y ∼ H X (2.1)

This transformation is not sufficient since the camera is using a lens that has lens distortion. The most significant lens distortion is the radial distortion [Janez Pers, 2002]. Radial distortion can be corrected according to equation 2.2.

x_{corrected} = x(1 + k_1 r^2 + k_2 r^4 + k_3 r^6 + \dots)
y_{corrected} = y(1 + k_1 r^2 + k_2 r^4 + k_3 r^6 + \dots)    (2.2)

where k_n is the nth radial distortion parameter, r is the radial distance from the optical centre of the picture, x and y are the original coordinates and x_{corrected} and y_{corrected} are the corrected coordinates. If the result is not good enough it is possible to also add tangential distortion to the model. For most applications it is sufficient to correct for radial distortion.

The work flow of the algorithm is to go through all pixels in the output image and make the transformation and lens correction backwards to find the correct place in the input image. Interpolation between the neighbouring pixels in the input image is then performed to get subpixel accuracy.

2.2 Pattern recognition

Pattern recognition, also known as template matching [Lewis, 1995], is an algorithm that aims to find occurrences of a known pattern in an image using the pixel values. It searches by moving the pattern through the search image and calculating a match-value pixel by pixel, see figure 2.2. The match-value can be calculated in different ways. In this project normalized cross correlation (NCC) is used.

Figure 2.2: Performing NCC on each pixel of the image.

2.2.1 Normalized cross correlation

Let P be a template image with width w and height h. Let S_a be the overlapping part of the search image S when placing P around a certain pixel a. \bar{S}_a and \bar{P} are the mean values of the search and template image around a. The normalized cross-correlation, C, for that pixel is defined in equation 2.3 [Lewis, 1995].

C_a = 1 - \frac{\sum_{i=1,j=1}^{w,h} (S_a(i,j) - \bar{S}_a)(P(i,j) - \bar{P})}{\sqrt{\sum_{i=1,j=1}^{w,h} (S_a(i,j) - \bar{S}_a)^2 \sum_{i=1,j=1}^{w,h} (P(i,j) - \bar{P})^2}}    (2.3)

Since NCC takes the mean value and standard deviation of the image into account, it is not as sensitive to illumination differences as it would be if regular correlation was used.

When NCC is calculated for all coordinates in the image, the result is a new image consisting of NCC-values.

2.2.2 Scaling and rotation

The algorithm can be constructed to be rotation and scale invariant by using transformations. In the project a rotation invariant version was implemented.

In the rotation invariant version a number of angles are chosen. For each angle that is chosen, a separate image of NCC-values is calculated to find occurrences of the pattern rotated by that angle. When the NCC is calculated for a specific pixel at a specific angle, every coordinate in the pattern is rotated using a rotation matrix based on that angle, see equation 2.4. Each coordinate is then added to the coordinate of the pixel for which the NCC is calculated. The result of the addition is the coordinate in the search image that should be compared with the original coordinate of the pattern. The rotation often results in floating point indexes. An interpolation between the closest pixels in the search image around the indexes is therefore required. Bilinear interpolation is often chosen for this type of interpolation. To make the image rotate around its optical center instead of its upper left corner, the coordinates of the pattern are in the ranges [−w/2, w/2] and [−h/2, h/2].

The rotation invariant version searches for matches by rotating the pattern by a chosen number of angles. It then calculates one image of NCC-values per angle. The index in the pattern is rotated by the transformation in equation 2.4, where x_{S_a}, y_{S_a} are coordinates in the local search image and x_P, y_P are coordinates in the pattern.

\begin{pmatrix} x_{S_a} \\ y_{S_a} \end{pmatrix} = \begin{pmatrix} \cos(\theta) & -\sin(\theta) \\ \sin(\theta) & \cos(\theta) \end{pmatrix} \cdot \begin{pmatrix} x_P \\ y_P \end{pmatrix}    (2.4)

2.2.3 Complexity

The rotation variant and scale variant algorithm is of complexity O(w_P h_P w_S h_S) [James Maclean, 2008], where w_P and h_P are the width and height of the pattern and w_S and h_S are the width and height of the search image. Making it scale invariant and rotation invariant increases the complexity to O(r s w_P h_P w_S h_S), where r is the number of rotations and s is the number of scales. These calculations assume that the mean and standard deviation of the images are already known. Calculating the mean of the pattern has complexity O(w_P h_P) and can be disregarded. However, calculating the mean of all local search images is of complexity O(w_P h_P w_S h_S) and can not be disregarded.

2.2.4 Sequential implementation

When running the algorithm on a CPU or GPU it would be better to be able to calculate the sums in equation 2.3 and the mean of S_a in the same loop instead of in two separate loops. The mathematical complexity does not differ, but the overhead of running a loop on a computer makes it faster to perform several operations in the same loop than to perform one operation in several loops. By rewriting the sum \sum_{i=1}^{n} (a_i - \bar{a})^2, where a_i are pixel values and \bar{a} is the mean of picture a, it is possible to separate the pixel values and the mean.

\sum_{i=1}^{n} (a_i - \bar{a})^2 = \sum_{i=1}^{n} (a_i^2 - 2 a_i \bar{a} + \bar{a}^2) = \sum_{i=1}^{n} a_i^2 + \sum_{i=1}^{n} \bar{a}^2 - 2 \sum_{i=1}^{n} a_i \bar{a} = \sum_{i=1}^{n} a_i^2 + n\bar{a}^2 - 2\bar{a} \sum_{i=1}^{n} a_i = \sum_{i=1}^{n} a_i^2 + n\bar{a}^2 - 2n\bar{a}^2 = \sum_{i=1}^{n} a_i^2 - n\bar{a}^2

In the same way it is possible to rewrite the sum \sum_{i=1}^{n} (a_i - \bar{a})(b_i - \bar{b}), where b is a second picture with mean \bar{b}:

\sum_{i=1}^{n} (a_i - \bar{a})(b_i - \bar{b}) = \sum_{i=1}^{n} (a_i b_i - \bar{a} b_i - a_i \bar{b} + \bar{a}\bar{b}) = \sum_{i=1}^{n} a_i b_i - \sum_{i=1}^{n} \bar{a} b_i - \sum_{i=1}^{n} a_i \bar{b} + \sum_{i=1}^{n} \bar{a}\bar{b} = \sum_{i=1}^{n} a_i b_i - n\bar{a}\bar{b} - n\bar{a}\bar{b} + n\bar{a}\bar{b} = \sum_{i=1}^{n} a_i b_i - n\bar{a}\bar{b}

By using the rewritten sums, the mean of the overlapping image S_a can be calculated simultaneously with the other sums, thereby reducing the number of loops to 1 instead of 2. The average of the pattern and the square sum of the pattern are calculated offline since they will be the same for every NCC-pixel. Three sums are calculated online for each NCC-pixel. The sums that are calculated are:

• \sum S_i — to be able to calculate \bar{S}

• \sum S_i^2 — for the denominator in NCC

• \sum S_i P_i — to be able to calculate \sum (S_i - \bar{S})(P_i - \bar{P})

When the sums are calculated they are converted to the sums in equation 2.3. The NCC-value is calculated and the value of the pixel in the result image is set. These operations are repeated for all pixels so that the result image is filled with NCC-values.
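As an illustration of the single-loop formulation, the sketch below computes the correlation quotient of equation 2.3 for one pixel from the three online sums. It is a minimal sequential C++ version; the function name, argument layout and the precomputed pattern statistics are illustrative assumptions, not the actual SICK IVP implementation.

#include <cmath>

// Correlation quotient for one pixel position (x0, y0), with the sums over
// S_i, S_i^2 and S_i*P_i accumulated in a single loop. patMean and patSqSum
// (the sum of (P_i - patMean)^2) are assumed to be precomputed offline.
float computeNCC(const float* search, int searchWidth,
                 const float* pattern, int w, int h,
                 int x0, int y0, float patMean, float patSqSum)
{
    float sumS = 0.0f, sumS2 = 0.0f, sumSP = 0.0f;
    const int n = w * h;

    for (int j = 0; j < h; ++j) {
        for (int i = 0; i < w; ++i) {
            float s = search[(y0 + j) * searchWidth + (x0 + i)];
            float p = pattern[j * w + i];
            sumS  += s;          // to obtain the local mean of S
            sumS2 += s * s;      // for the denominator
            sumSP += s * p;      // for the numerator cross term
        }
    }

    float meanS = sumS / n;
    // Rewritten sums: sum((S_i - meanS)^2)            = sumS2 - n*meanS^2
    //                 sum((S_i - meanS)(P_i - patMean)) = sumSP - n*meanS*patMean
    float varS  = sumS2 - n * meanS * meanS;
    float cross = sumSP - n * meanS * patMean;
    float denom = std::sqrt(varS * patSqSum);

    // Equation 2.3 in the text reports 1 minus this quotient.
    return (denom > 0.0f) ? cross / denom : 0.0f;
}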

2.2.5 Pyramid image representation

In this project, full scale pattern recognition is defined as calculating the NCC for all pixels in the search image. Because of the high complexity of the full scale pattern recognition algorithm, an image pyramid representation is often used to reduce the complexity, see [James Maclean, 2008]. The original images are downsampled to create an image pyramid of a desired number of levels. The full scale pattern recognition is only performed on the coarsest level. When matches are found on the coarsest level, the matching coordinates are changed to match an image with a finer scale. In the larger image the NCC is only calculated for the resulting pixel from the previous image and its neighbouring pixels. If any of the neighbouring pixels has a better correlation, the index will be changed to the better match. The search in the finer images is in this report called trace search. The total number of operations performed when using an image pyramid is significantly lower than when performing a full scale pattern search on the original image. The number of operations O is illustrated in equation 2.5.

O = \frac{r s w_P h_P w_S h_S}{16^k} + \sum_{i=1}^{k} \frac{m w_P h_P}{16^{k-i}}    (2.5)

In equation 2.5, k is the number of times the pattern and search image are downsampled and m is the number of matches that are saved from the original search image. The coefficient in the denominator is 16 because when the width and height of both the search image and the pattern are downscaled by 2, the total scale factor will be 2^4 = 16. The single term in equation 2.5 is the number of operations in the full scale search on the coarsest image. The sum is the total number of operations for scaling up coordinates to fit finer images and calculating the matches of the neighbourhoods. Using initial matches to go from coarser images to finer images is hereby called trace search.

2.2.6 Non maxima suppression

Non maxima suppression is used to suppress all image values where a neighbouring value is higher than the current value. Applying non maxima suppression to the NCC-images will make regions of high correlation values result in only one high value. Since very few pixels are examined at finer levels it is very important that only one pixel per correct match is saved. Non maximum suppression, see algorithm 1, is performed on all NCC-images on the coarsest level to get unique results to use in the trace search. A rough description of the complete algorithm can be seen in algorithm 2.

Data: image
forall pixels (x,y) in image do
    if max(neighbouring pixels) > pixel then
        pixel = 0;
    end
end

Algorithm 1: Non maximum suppression of an image. If any of the neighbouring pixels has a higher value than the current one, its value is set to zero.
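A CUDA version of algorithm 1 could look like the sketch below: each thread handles one pixel and compares it to its eight neighbours. The kernel name, the border clamping and the use of separate input and output buffers are assumptions made for this example, not details from the project's code.

// One thread per pixel: set the pixel to zero if any 3x3 neighbour is larger.
// Reads from "in" and writes to "out" to avoid races between threads.
__global__ void nonMaximumSuppress(const float* in, float* out,
                                   int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    float v = in[y * width + x];
    float maxNeighbour = 0.0f;

    for (int dy = -1; dy <= 1; ++dy) {
        for (int dx = -1; dx <= 1; ++dx) {
            if (dx == 0 && dy == 0) continue;
            int nx = min(max(x + dx, 0), width - 1);   // clamp at the borders
            int ny = min(max(y + dy, 0), height - 1);
            maxNeighbour = fmaxf(maxNeighbour, in[ny * width + nx]);
        }
    }
    out[y * width + x] = (maxNeighbour > v) ? 0.0f : v;
}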

When using a large number of different angles in the search there is a risk that several angles of the same pixel will produce high NCC-values. A better non maxima suppression would suppress not only in the x and y dimensions, but also for the closest different angles. This kind of suppression was not implemented in the project, mainly because in practice the angles searched for are often so few that only the best angle will produce an NCC-value high enough to be considered a match. A larger area to examine when suppressing could also be used, but suppressing only according to the neighbouring pixels was found sufficient for the project.

Data: patternPyramid, searchImagePyramid, minSimilarity
Result: bestMatches
forall angles do
    image = performNCC(coarsestImages, angle);
    image = nonMaximumSuppress(image);
    bestMatches = findBestMatches(image, bestMatches);
end
forall larger images do
    upScaleIndexes(bestMatches);
    traceSearch(currentImageSize, bestMatches);
    removeBadMatches(bestMatches, similarity);
end

Algorithm 2: Pseudo code for finding a pattern in an image. The images in the image pyramids differ by a factor of 2 in width and height.

W. James MacLean and John K. Tsotsos proposed a similar algorithm [James Maclean, 2008].

3 Parallel programming in theory and practice

3.1 GPU-programming

When programming a GPU there are a number of features that need to be considered. GPU:s use a SIMD architecture (Single Instruction Multiple Data). SIMD means that there are many cores running and they all perform the same operations; the only difference between them is that they take different data as input. The bottleneck of the performance when programming GPU:s is often the bandwidth of the different memories, see section 3.1.1.

Applications written in the CUDA programming language manage the SIMD architecture in an efficient way. A grid of 1, 2 or 3 dimensions is used to index the running threads of a function. The grid is divided into blocks whose computations are spread over several streaming multiprocessors (SM:s). An important note is that in the CUDA programming language functions are called kernels; they should not be mistaken for cores or processors.
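As a minimal illustration of the grid, block and thread indexing described above, the sketch below scales every pixel of an image by a constant. The kernel name, block size and launch helper are illustrative assumptions only.

#include <cuda_runtime.h>

// Each thread processes one pixel, indexed through its block and thread indices.
__global__ void scaleImage(float* img, int width, int height, float factor)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        img[y * width + x] *= factor;
}

void launchScale(float* d_img, int width, int height)
{
    dim3 block(16, 16);                        // 256 threads per block
    dim3 grid((width  + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);
    scaleImage<<<grid, block>>>(d_img, width, height, 2.0f);
}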

3.1.1 Memory latency

A typical problem when programming a GPU is that the transfer between different memories is a bottleneck. There are different kinds of memory transfers and they are often slow if they are not chosen and implemented with care. The most important memories in GPU:s are:

• Global memory

• Constant memory

• Texture memory

• Shared memory

When computations are running on a GPU they first need to fetch the input data from the memory of the CPU. This is the slowest type of memory transfer in the work flow of a GPU and it is often one order of magnitude slower than accessing the regular GPU memory, called global memory. When the computations are done the output data is transferred back to the CPU. That transfer is as slow as the first transfer. This problem is hard to avoid when the time of the kernel is short compared to the amount of data that needs to be transferred to the GPU. Multiple streams, [Jason Sanders, 2010b], can sometimes reduce the problem. A stream is the flow of transferring data to the GPU, computing a kernel and transferring the resulting data back to the CPU. The purpose of multiple streams is that as soon as the data for a kernel has been transferred to the GPU, the transfer of data for the next kernel is started, so that when the first kernel is finished, the data for the second kernel has already been transferred. The second kernel can then start its computations at once, see figure 3.1. The benefit of using multiple streams is greatest when the runtime of the kernel is about as long as the transfer time. If the kernel is shorter than the transfer time, the transfer time can not be hidden, and if the kernel time is much longer than the transfer time the gained performance will be negligible. Another thing that often increases the transfer speed is to use pinned memory instead of pageable memory. Pinned memory is, in contrast to pageable memory, locked to a certain address in CPU memory.

Figure 3.1: Advantage of using multiple streams.
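A sketch of how two CUDA streams can overlap transfers and kernel execution, in the spirit of figure 3.1, is shown below. It assumes the host buffers are pinned (allocated with cudaHostAlloc) and uses a placeholder kernel; the names and the split into two halves are assumptions, not code from the project.

#include <cuda_runtime.h>

__global__ void process(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;            // placeholder computation
}

// h_in and h_out are assumed to be pinned host memory so that the
// asynchronous copies can overlap with kernel execution in the other stream.
void runInTwoStreams(const float* h_in, float* h_out, int n)
{
    float *d_in[2], *d_out[2];
    cudaStream_t stream[2];
    int half = n / 2;

    for (int s = 0; s < 2; ++s) {
        cudaStreamCreate(&stream[s]);
        cudaMalloc(&d_in[s],  half * sizeof(float));
        cudaMalloc(&d_out[s], half * sizeof(float));
    }

    for (int s = 0; s < 2; ++s) {
        // While stream 0 runs its kernel, stream 1 can already transfer its input.
        cudaMemcpyAsync(d_in[s], h_in + s * half, half * sizeof(float),
                        cudaMemcpyHostToDevice, stream[s]);
        process<<<(half + 255) / 256, 256, 0, stream[s]>>>(d_in[s], d_out[s], half);
        cudaMemcpyAsync(h_out + s * half, d_out[s], half * sizeof(float),
                        cudaMemcpyDeviceToHost, stream[s]);
    }
    cudaDeviceSynchronize();
}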

For the Tegra K1, the memory latency caused by transferring data from the CPU memory to the GPU memory is not as important as for desktop GPU:s. This is because the GPU and CPU share a unified memory pool that stores data that is accessed by both the GPU and the CPU, i.e. no transfer between them is needed [Harris, 2014b]. The shared memory pool currently supports only regular memory; other types of memory, such as texture and constant memory, described later in section 3.1.1, are not supported.

Coalescing memory accesses

In regular memory the data is stored in horizontal lines. Since all threads that run on an SM are performing the same tasks, they will access the global memory at approximately the same time. If neighbouring threads access neighbouring data in the memory several threads can get their desired data on the same read from the global memory. In this way, the number of accesses to the global memory can be reduced dramatically, see figure 3.2.

Figure 3.2: Perfect coalescing in the upper image and a bad memory access pattern below.
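The difference between a coalesced and a non-coalesced access pattern can be illustrated with the two copy kernels below. The kernel names and the transposed read in the second kernel are assumptions made for illustration.

// Coalesced: neighbouring threads read neighbouring addresses, so a warp
// is served by one (or a few) contiguous memory transactions.
__global__ void copyCoalesced(const float* in, float* out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        out[y * width + x] = in[y * width + x];
}

// Non-coalesced: neighbouring threads read values one column apart in a
// row-major image, so every thread in a warp triggers its own transaction.
__global__ void copyStrided(const float* in, float* out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        out[y * width + x] = in[x * height + y];
}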

Shared vs. global memory

In addition to the global regular memory, each SM has a local memory called shared memory which is shared by all threads in a block. A typical GPU data transfer bottleneck is accessing the global memory of the GPU from a thread. The shared memory is smaller than the regular memory and accessing it is faster. If a kernel makes many accesses to the global memory, the values can be stored in the shared memory to reduce the accesses to the global memory [Jason Sanders, 2010c]. If a value is read from memory several times it is always beneficial to read it once from the global memory and the rest of the times from the shared memory. When the value has been read from the global memory it should be stored in the shared memory so it can be reached from there for future readings.
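A common way to apply this is to stage a tile of the image in shared memory once and let every thread in the block reuse it. The sketch below does this for a simple 3-pixel horizontal average; the tile size, kernel name and the assumed block shape (TILE x 1 threads) are illustrative assumptions.

#define TILE 256

// Each block loads TILE pixels of one row (plus a one-pixel border) into
// shared memory once; every thread then reads its neighbours from shared
// memory instead of from global memory. Assumes blockDim = (TILE, 1).
__global__ void horizontalAverage(const float* in, float* out, int width, int height)
{
    __shared__ float tile[TILE + 2];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y;
    if (y >= height) return;

    int cx = min(max(x, 0), width - 1);                 // clamp at the borders
    tile[threadIdx.x + 1] = in[y * width + cx];
    if (threadIdx.x == 0) {
        tile[0]        = in[y * width + min(max(x - 1, 0), width - 1)];
        tile[TILE + 1] = in[y * width + min(blockIdx.x * TILE + TILE, width - 1)];
    }
    __syncthreads();

    if (x < width)
        out[y * width + x] = (tile[threadIdx.x] + tile[threadIdx.x + 1] +
                              tile[threadIdx.x + 2]) / 3.0f;
}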

On newer GPU:s, based on the Fermi or Kepler architecture, it is not as crucial to use shared memory as on earlier architectures. This is because Fermi introduced built-in caches for each SM. The cache uses the shared memory to store the values. Consideration of shared memory is still important for maximum performance [Ragnemalm, 2013].

Constant memory

If a value is read by many different threads in a kernel it is preferable to store it in the constant memory to increase the performance. The constant memory has a fast cache accessible from the whole GPU [Jason Sanders, 2010a]. Global memory is only cached per SM or block since it uses the shared memory. Values that are read from all blocks will therefore require less memory bandwidth when placed in the constant memory. It is constant because it can only be read from the GPU; it is set during a memory transfer from the CPU.
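For parameters that every thread reads, such as the homography and lens distortion parameters used in chapter 5, the pattern looks roughly as below. This is a minimal sketch with assumed symbol names, not the project's actual code.

#include <cuda_runtime.h>

// Parameters read by every thread are placed in constant memory,
// which is cached for the whole GPU.
__constant__ float c_distortion[2];   // k1, k2
__constant__ float c_homography[9];   // 3x3 homography, row major

void uploadParameters(const float k[2], const float H[9])
{
    // Constant memory is set from the CPU with a dedicated copy call.
    cudaMemcpyToSymbol(c_distortion, k, 2 * sizeof(float));
    cudaMemcpyToSymbol(c_homography, H, 9 * sizeof(float));
}

__global__ void useParameters(float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = c_distortion[0] * c_homography[0];  // placeholder read of the constants
}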

Texture memory

When the access pattern is not horizontal and is hard to predict, it is often good to use the texture memory. Texture memory stores data in a different way than the other types of memories used in CUDA. It stores data in square areas instead of in horizontal rows as the global memory does. It also has a cache that fetches a number of these areas rather than lines. Since the cache stores 2-dimensional data, memory accesses are fast not only for horizontally proximate values but also for vertically proximate values. This is called 2D locality. There is also built-in interpolation, so that accesses with a 2D floating point index only require one memory access [Wilt, 2012]. When normally stored memory is used, all 4 values neighbouring the index need to be fetched from the memory to perform the interpolation. The difference between using texture memory and regular memory is that the interpolation is performed before the transfer for the texture memory and after the transfer for the regular memory.

Figure 3.3: Upper image shows regular memory storing order and lower image shows texture memory storing order.
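A sketch of reading an image through the texture path with hardware bilinear interpolation is shown below. It uses the CUDA texture object API; the function names, the rotated sampling pattern and the clamp addressing mode are assumptions made for this example.

#include <cuda_runtime.h>

// Fetch with hardware bilinear interpolation: one tex2D call replaces
// four global memory reads plus a manual interpolation.
__global__ void sampleRotated(cudaTextureObject_t tex, float* out,
                              int width, int height, float angle)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    float xr = cosf(angle) * x - sinf(angle) * y;
    float yr = sinf(angle) * x + cosf(angle) * y;
    out[y * width + x] = tex2D<float>(tex, xr + 0.5f, yr + 0.5f);
}

cudaTextureObject_t createImageTexture(cudaArray_t array)
{
    cudaResourceDesc res = {};
    res.resType = cudaResourceTypeArray;
    res.res.array.array = array;                  // image previously copied into a cudaArray

    cudaTextureDesc desc = {};
    desc.addressMode[0] = cudaAddressModeClamp;
    desc.addressMode[1] = cudaAddressModeClamp;
    desc.filterMode     = cudaFilterModeLinear;   // built-in bilinear interpolation
    desc.readMode       = cudaReadModeElementType;

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &res, &desc, nullptr);
    return tex;
}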

3.1.2 Implementation

This section covers what to consider besides memories when writing software for GPU:s.

Block size

Choosing a correct block size increases the performance of CUDA kernels. There are several factors to consider when choosing block size. The first thing to consider is how many SM:s the GPU running the kernel has. The workload should be divided into at least as many blocks as there are SM:s so that all SM:s will be busy. Another important thing when choosing block size is that the block size is a multiple of the warp size. The warp size is the smallest number of threads performing the same operation. The GPU is always running a multiple of warps doing the same thing [Jason Sanders, 2010a]. If a block has fewer threads it will be rounded up to the next multiple of the warp size and those resources will be wasted. So if the warp size is 32 and 33 threads are chosen for a kernel, 64 threads will be used and 31 of them will idle. The wasted resources can be calculated according to:

r_{wasted} = \frac{w - b \bmod w}{w \lceil b/w \rceil}    (3.1)

where r_{wasted} is the fraction of wasted resources in [0, 1], w is the warp size, b is the block size and mod (%) is the remainder of a division. Note that no resources are wasted if the block size is a multiple of the warp size; in that case the equation is not valid.
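A small helper following equation 3.1 can be used when picking a block size. The function below is a sketch written for this text, not part of the thesis; it reproduces the 33-thread example from the paragraph above.

#include <stdio.h>

// Fraction of wasted thread resources for block size b and warp size w,
// according to equation 3.1. Returns 0 when b is a multiple of w, where the
// equation itself is not valid.
double wastedResources(int b, int w)
{
    if (b % w == 0)
        return 0.0;
    int warpsPerBlock = (b + w - 1) / w;   // ceil(b / w)
    return (double)(w - b % w) / (double)(w * warpsPerBlock);
}

int main(void)
{
    // 33 threads with warp size 32: 31 of the 64 allocated threads idle.
    printf("wasted = %.3f\n", wastedResources(33, 32));   // prints 0.484
    return 0;
}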

Partition work between CPU and GPU

As mentioned in section 1.1, GPU:s outrun CPU:s by one order of magnitude for many algorithms. There are also many algorithms where parts of, or the whole, algorithm run faster on a CPU than on a GPU, especially parts that are not parallelizable at all. Therefore it is important to evaluate which parts of an algorithm might run faster on a CPU. Since the Tegra K1 has a shared memory pool between the CPU and the GPU, the overhead of switching from CPU to GPU is reduced, resulting in more situations where it is favourable to switch between CPU and GPU.

Shuffling

A new feature of the Kepler architecture is that it is possible to share data between different threads in a warp without using shared memory. When a variable is read using shuffle, all threads read the value of the variable in a neighbouring thread, one or several steps away, instead of in the local thread. A shuffle of one step will read the variable of thread 0 in thread 1, etc. This way of reading data is even faster than using shared memory since only one read operation is required; shared memory needs a write, a synchronization and a read. Another benefit of using shuffling compared to using shared memory is that the size of the shared memory is small. The joint size of all the registers of the threads is bigger than the shared memory [Goss, 2013].
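The warp-level sum below illustrates the shuffle feature: each step reads a register value from a thread a number of lanes away, without touching shared memory. __shfl_down is the Kepler-era intrinsic (later replaced by __shfl_down_sync); the kernel and helper names are assumptions for this example.

// Sum one value per thread across a warp of 32 threads using shuffle.
__device__ float warpSum(float v)
{
    // Each iteration adds the value held by the thread "offset" lanes away.
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down(v, offset);
    return v;   // lane 0 holds the sum of all 32 values
}

__global__ void sumPerWarp(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (i < n) ? in[i] : 0.0f;
    float s = warpSum(v);
    if ((threadIdx.x & 31) == 0 && i < n)   // first lane of each warp writes its sum
        out[i / 32] = s;
}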

Grid stride loops

In GPU-computing the number of threads is often adapted to the number of elements in the array being processed. This is not convenient for all algorithms, e.g. if all elements in an array of 33 elements are multiplied by 2, the number of threads should intuitively be 33. But since the warp size of the GPU is often 32, 33 threads will make 31 cores of the GPU idle, see section 3.1.2. It is common that a specific number of threads results in a simpler implementation and a higher performance. Grid stride loops can then be used [Harris, 2013] to avoid adapting the number of threads to the array size. The purpose of a grid stride loop is to be able to read a larger number of elements into a fixed lower number of threads in a coalesced way. In each thread the reading of values is performed in a loop. The first thread is assigned to read the first element in the memory and the next thread is assigned to read the second element in the memory, etc. When there are no threads left there will still be elements left to read from the memory. The first thread is then assigned to the first non-assigned element and the second thread to the second non-assigned element, etc. This assignment lasts until all elements are assigned and read. This is done technically according to algorithm 3.

Data: Array, N
sum = 0;
for i = threadId; i < N; i += threadWidth do
    sum += Array[i];
end

Algorithm 3: Grid stride loop performed in a thread, with threadWidth the total number of threads and N the number of elements in the array.
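In CUDA, algorithm 3 corresponds to a kernel like the sketch below, where a fixed launch configuration can process an array of any length with coalesced reads. The kernel name and the example launch are assumptions for illustration.

// Grid stride loop: the fixed set of threads strides over the whole array,
// so the launch configuration does not have to match the array length.
__global__ void multiplyByTwo(float* data, int n)
{
    int stride = blockDim.x * gridDim.x;               // total number of threads
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        data[i] *= 2.0f;
}

// Example launch: a block size that is a multiple of the warp size,
// independent of the 33-element array from the example above.
// multiplyByTwo<<<4, 128>>>(d_data, 33);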

3.2 Parallel programming metrics

When comparing the performance of sequential algorithms, time complexity is often used. For parallel algorithms there are other metrics that also show how well parallelized the algorithm is. In this section the metrics that are used for analysing the algorithms are presented. Note that the unit of both time and operations is clock cycles, which makes some calculations a bit confusing.

3.2.1 Parallel time

Parallel time T_p is the time it takes for a parallel implementation to run on p processors. T_p is measured in clock cycles.

3.2.2 Parallel speed-up

The parallel speed-up S_p measures how much faster the parallel implementation is compared to the sequential implementation.

S_p = \frac{T}{T_p}    (3.2)

T is the time of running a sequential implementation. The parallel speed-up has an optimal value of p.

3.2.3 Parallel Efficiency

Parallel efficiency E_p measures how well an implementation scales, independent of the value of p.

E_p = \frac{S_p}{p}    (3.3)

where the optimal scaling of an algorithm is 1. E_p is S_p normalized over the number of processors.

3.2.4 Parallel Cost

Parallel cost measures whether resources are wasted when running a parallel algorithm.

C_p = p T_p    (3.4)

Consider the total number of clock cycles passed on a system using p processors during a parallel time T_p, i.e. p T_p. If the passed clock cycles are more than the total number of clock cycles passed when running the algorithm sequentially on one processor, the parallel implementation is wasting resources. An algorithm is therefore cost optimal if C_p = T.

3.2.5 Parallel work

The work W is the total number of operations that are performed on all processors. If more operations are performed than the operations performed on one processor in the sequential algorithm, the parallel algorithm is doing more work than the sequential algorithm. An algorithm is work optimal if the number of operations performed by the parallel algorithm is equal to the operations performed by the sequential algorithm, W = T.

3.2.6 Memory transfer vs. Kernel execution

A crucial part of running computations on a GPU is transferring data between the CPU and the GPU. Sending the input data to the GPU before the computations and the result back to the CPU after the computations is time consuming. An analysis of an algorithm must consider the transfer time of the data. Dividing the size of the data by the transfer speed of the device gives the transfer time. An interesting metric is the kernel execution time compared to the memory transfer time.

3.2.7 Performance compared to bandwidth

The performance of a kernel can be evaluated by comparing its average memory bandwidth to the memory bandwidth of the GPU. The memory bandwidth can be estimated by running a kernel that only copies the values from one array to another. The quotient between the average memory access speed and the memory bandwidth is in this report called memory access performance. By dividing the size of the data accessed by the kernel with the kernel time, the average memory access speed for the kernel can be calculated.

v_m = \frac{w h s n}{t_k}    (3.5)

v_m is the memory access speed, w and h are the width and height of the image, s is the size of one pixel in the image, n is the number of times each value is read from or written to the memory and t_k is the measured time of the kernel. A kernel with an optimal access pattern has a speed very close to that of the copy kernel.
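The copy-kernel reference speed of equation 3.5 can be estimated with CUDA events, as in the sketch below. The function names, sizes and launch configuration are assumptions for this example, not measurement code from the thesis.

#include <cuda_runtime.h>
#include <cstdio>

__global__ void copyKernel(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Times the copy kernel and prints its memory access speed (equation 3.5
// with n = 2: one read and one write per element).
void measureCopyBandwidth(int width, int height)
{
    int n = width * height;
    float *d_in, *d_out;
    cudaMalloc(&d_in,  n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    copyKernel<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    double bytes = (double)n * sizeof(float) * 2.0;    // one read + one write
    printf("copy kernel: %.1f GB/s\n", bytes / (ms * 1e-3) / 1e9);

    cudaFree(d_in);
    cudaFree(d_out);
}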

3.3 Related Work

Egil Fykse wrote a thesis [Fykse, 2013] comparing the performance of computer vision algorithms running on GPU:s in embedded systems and on FPGA:s. His benchmark algorithms were similar to the algorithms used in this thesis, although his focus lay on implementing FPGA versions of the algorithms. The hardware used for his GPU implementations is not an embedded GPU, but the predecessor of the Kayla platform, CARMA. Egil's conclusion is that his results are slightly faster for the GPU but that the FPGA is more energy efficient. The Tegra K1 should be much more power efficient than CARMA since CARMA features a desktop GPU, but this project does not examine the power usage of any devices.

4 Method

The project was performed according to a method that analysed the algorithms in steps which are described in this chapter. The parallelization, the theoretical evaluation and the implementation were executed iteratively to be able to test new ideas. The steps were:

• Initial phase

• Parallelization

• Theoretical evaluation

• Implementation

• Evaluation

4.1 Initial phase

In the initial phase the sequential version of the algorithm was analysed theoretically, by calculating its complexity, and implemented. Artificial test data was generated using Matlab. The test data was in general as simple as possible. The purpose of the project was not to test the accuracy of the algorithms but to optimize the already known algorithms. Imprecise test data could result in problems where it would be hard to know whether undesired results were caused by the accuracy of the algorithm or by bugs in the implementation.

4.2 Parallelization

The parallelization was about making parallel versions of the algorithm and determining which of the versions should be implemented and further analysed. The list below describes on what premises the implementations were optimized.

• Different algorithm variants

• Memory choice

• Memory access pattern

• Partitioning between CPU and GPU • Shuffling

• Grid stride loops

For a description of what the items in the list mean, see section 3.1.

4.3 Theoretical evaluation

A theoretical evaluation is a good way of determining how parallelizable an algorithm is. The parallel performance metrics presented in section 3.2 are used for the theoretical evaluation. Since not all algorithms perform well on GPU:s, the result of the theoretical evaluation may differ from the result of later steps of the method.

4.4 Implementation

The implementation was about implementing the different versions of the algorithm proposed in the parallelization phase as efficiently as possible. Measuring performance was an important part of the implementation phase. Profiling tools make it possible to show the performance of the different parts of the running kernels. For this project the Nvidia Visual Profiler was used.

4.5 Evaluation

In the evaluation, the results from the theoretical evaluation and the implementation were compared to make a conclusion supported in several ways. There were important questions that needed to be answered to be able to make a conclusion about embedded GPU:s after the two algorithms had been analysed.

• Was the performance as expected?

• Is the performance sufficient?

• How portable is the code?

• Is further optimization possible?

• What are the bottlenecks?

When all the algorithms had been analysed, the possible conclusions about the performance of embedded GPU:s in general and about the algorithms were made.

4.6 Alternative methods

The method described above is a combination of practical and theoretical work. Algorithms are analysed theoretically, implemented and evaluated using the results. Other approaches could either be more theoretical or place more focus on one single algorithm.

4.6.1 Theoretical method

A theoretical method would analyse algorithms only theoretically. By using this method the analysis of an algorithm would take less time, so the project could cover more algorithms. The benefit of more algorithms is that it would give a larger picture of how well computer vision algorithms are suited for embedded GPU:s. However, implementations often reveal problems that are easily missed when doing a theoretical evaluation. An implementation is a certification that something actually works and an evaluation of how well it works.

4.6.2 One algorithm

Another type of method could spend more time implementing and optimizing one single algorithm. Even better results could be achieved for the chosen algorithm by spending more time on it. However, many of the optimizations regarding one algorithm are specific to that algorithm and do not say much about the performance of embedded GPU:s in general. This method would not give a good picture of computer vision on embedded GPU:s in general.

4.6.3 Conclusions

Given the projected outcomes of the alternative methods described above the originally proposed method was used.

5 Rectification of images

5.1 Generating test data

Synthetic test data was generated in Matlab. The first step was to create two images where one had a rectangle located in the image plane, see figure 5.1. The other picture was a geometrical object simulating a rectangle seen from another view, see figure 5.2. By using the edges of the rectangles and equation 2.1, a homography between the rectangles could be calculated. The homography parameters were stored to use as input to the program running the algorithm.

Figure 5.1: Rectangle in the image plane.

Figure 5.2: Rectangle from another view.

Lens distortion was also simulated. An image distorted by specific lens distortion parameters can not be calculated analytically since equation 2.2 has no closed-form solution for obtaining x from x_c. An iterative numerical solution according to Newton's method, equation 5.1, was implemented in Matlab to simulate image distortion. The pseudo code for the generation of lens distortion is displayed in algorithm 4. Only radial distortion was simulated.

x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}    (5.1)

Data: image, maxdiff, param
Result: Distorted image
forall pixels (x,y) in image do
    Convert x and y to normalized coordinates;
    x_i, y_i = x, y;
    while |correct(x_i, param) − x| + |correct(y_i, param) − y| > maxdiff do
        x_i = x_i − (correct(x_i, param) − x) / correct'(x_i, param);
        y_i = y_i − (correct(y_i, param) − y) / correct'(y_i, param);
    end
    Convert x_i and y_i to pixel range;
    outimage(x, y) = interpolation(image(x_i, y_i));
end

Algorithm 4: Pseudo code for distorting an image. correct is the lens correction formula. The interpolation is bilinear, maxdiff is the maximum tolerated error and param are the distortion parameters.

The test input image, see figure 5.3, was created by transforming the image with a rectangle in the image plane and then applying lens distortion to that image. To get a reference result for the GPU rectifications, the sequential rectification algorithm was applied to the test input image, see figure 5.4. Note that some parts of the original image are missing. This is not an error but due to the fact that some parts of the original image do not fit into the input image in figure 5.3.

The parameters of the homography and the lens distortion parameters affect the performance of the algorithm. If the homography makes the algorithm fetch values from a smaller rectangle there will be fewer cache misses, resulting in higher performance. The reason that there will be fewer cache misses is that a smaller rectangle has fewer pixels, so a bigger percentage of the pixels can be stored in the cache. However, when the algorithm is used in reality the rectangle will always be as large as possible while still fitting the sensor. Therefore the input data is also constructed this way. If the lens distortion parameters are smaller the access pattern will be more linear, which also results in fewer cache misses. Reasonable sizes are therefore chosen for the lens distortion parameters. The lens distortion assumes an image with normalized coordinates in [−1, 1]. It is therefore important to transform the pixel coordinates into normalized coordinates to get a correct result in terms of lens correction. The lens distortion parameters, described in equation 2.2, used in this project are shown in equation 5.2; note that k_3 was not used.

k_1 = 0.04, k_2 = 0.008    (5.2)

Figure 5.4: Reference result for tests.

5.2 Theoretical parallelization

The rectification algorithm is very suitable for parallelization since the lens correction and homography transformation can be performed independently for each pixel. The pseudo code for the parallelized algorithm for n pixels on n processors is shown in algorithm 5.

Data: image, H
Result: rectified image
forall pixels (x,y) in parallel do
    r = sqrt(x^2 + y^2);
    x_c = x(1 + k_1 r^2 + k_2 r^4);
    y_c = y(1 + k_1 r^2 + k_2 r^4);
    (x_h, y_h, 1)^T ∼ H · (x_c, y_c, 1)^T;
    outImage(x, y) = interpolation(image(x_h, y_h));
end

Algorithm 5: Pseudo code for parallel rectification. Assumes normalized coordinates.

The interpolation used in the master thesis project is bilinear and it is needed since (x_h, y_h) are typically not integers.
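A CUDA kernel corresponding to algorithm 5 might look like the sketch below. It assumes the input image is bound to a texture object with hardware bilinear filtering (section 3.1.1) and that the homography and distortion parameters sit in constant memory; the names and the coordinate normalization details are assumptions, not the thesis implementation.

__constant__ float c_H[9];      // homography, row major
__constant__ float c_k[2];      // radial distortion parameters k1, k2

// One thread per output pixel: apply radial distortion and the homography
// backwards, then fetch the input image with hardware bilinear interpolation.
__global__ void rectify(cudaTextureObject_t input, float* output,
                        int width, int height)
{
    int px = blockIdx.x * blockDim.x + threadIdx.x;
    int py = blockIdx.y * blockDim.y + threadIdx.y;
    if (px >= width || py >= height) return;

    // Normalized coordinates in [-1, 1] as assumed by the distortion model.
    float x = 2.0f * px / width  - 1.0f;
    float y = 2.0f * py / height - 1.0f;

    float r2 = x * x + y * y;
    float d  = 1.0f + c_k[0] * r2 + c_k[1] * r2 * r2;
    float xc = x * d;
    float yc = y * d;

    // Homogeneous transformation (x_h, y_h, w) ~ H * (x_c, y_c, 1).
    float xh = c_H[0] * xc + c_H[1] * yc + c_H[2];
    float yh = c_H[3] * xc + c_H[4] * yc + c_H[5];
    float w  = c_H[6] * xc + c_H[7] * yc + c_H[8];
    xh /= w;  yh /= w;

    // Back to pixel coordinates; tex2D performs the bilinear interpolation.
    float sx = (xh + 1.0f) * 0.5f * width;
    float sy = (yh + 1.0f) * 0.5f * height;
    output[py * width + px] = tex2D<float>(input, sx + 0.5f, sy + 0.5f);
}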

5.3 Theoretical evaluation

In this section the algorithm is theoretically evaluated according to section 3.2. As section 5.2 states, the algorithm is very suitable for parallelization. The parallel time T_p for p processors and n pixels is of order n/p. The parallel speed-up S_p increases proportionally as p increases. This gives a parallel efficiency E_p of 1. The parallel cost C_p is of order p · n/p = n.

Since the sequential time is of order n, the algorithm is cost optimal. The parallel work for p processors is also of order n. The algorithm is therefore also work optimal, see section 3.2.5.

The slow part of a kernel is the global memory accesses. In this kernel, there will be a maximum of 5 global memory accesses per thread: 4 accesses for fetching the neighbouring pixel values to interpolate between and one access to write the result to the global memory. But since the GPU uses the shared memory as a cache, see section 3.1.1, there will most likely be fewer global memory accesses depending on the access pattern. The memory bandwidth between the CPU and GPU can be measured, but it is in general at least 10 times slower than the global memory bandwidth; the factor is 24 for the GTX 680 GPU. Even if all global memory accesses were cache misses, the kernel would still be a lot faster than the memory transfer. Equation 5.3 aims to illustrate that the kernel will be faster, with M_DtH defined as the memory transfer time from device to host and M_HtD as the transfer time from host to device.

M_DtH + M_HtD >> 5 · GlobMemAccess · Pixels    (5.3)

For the GTX 680 GPU the memory transfer should be around 485 times slower than the kernel. Since the kernel is so much faster than the memory transfer, multiple streams would not increase the performance in any substantial way. Multiple streams are described in section 3.1.1.

5.4 Implementation

The implementation was done in steps to be able to determine how much each step affected the performance.

5.4.1 Initial implementation

The first implementation of the rectification was simple and intuitive. Global memory was used for all memory accesses. As mentioned in section 5.2, the rectification algorithm is easy to parallelize. For an Nvidia GPU from the Fermi generation or newer, the naive implementation is quite good since the shared memory is used as a cache. But for an older GPU without use of the cache the implementation would be slow.

Figure 5.5: Access pattern on the input image in rectification.

5.4.2 General problems

There are two main problems regarding kernel speed when implementing a GPU kernel for the rectification algorithm. The first problem is that the homography part of the algorithm may make the access pattern in the image non-horizontal, see figure 5.5. The reason that the access pattern can become non-horizontal is that it is hard to install a real camera perfectly straight compared to the observed plane. If the camera is leaning slightly to the right or left, the access pattern will be non-horizontal. Since the image is stored as one array with each row placed after the previous row, a non-horizontal access pattern will put the memory accesses of two neighbouring threads on different rows in the memory, i.e. the access pattern will not be coalesced.

The second problem is that the lens correction makes the access pattern non-linear. Instead of being aligned, the access pattern will be concave. The larger the distortion parameters are, the more concave the access pattern will be, see equation 2.2. According to equation 2.2 the access pattern will be very dense in the middle of the image and more sparse further out from the middle. In the sparse areas the memory accesses will be far away from each other. It is not intuitive how to use the shared memory in an efficient way for that access pattern. In this project the problem was solved by disregarding manual use of the shared memory and instead using it as a cache.

5.4.3 Texture memory

When the access pattern is irregular the performance is often increased by using texture memory instead of global memory. Interpolations are also performed very fast using the texture memory, see section 3.1.1. The access pattern of the rectification algorithm fits well into that description and the performance was clearly increased by loading the input image into the texture memory instead of the global memory.

5.4.4 Constant memory

The input parameters are the same for every pixel in the image and they are read once for every thread. The performance of the implementation is drastically increased when reading them from constant memory compared to reading them from global memory, see section 3.1.1.

5.5 Results

The resulting images from running the rectification on a GPU were very similar to those from running it in Matlab. A slight difference occurred near all edges on the chess board since Matlab used 64-bit precision for its floating point values while 32-bit precision was used in CUDA, resulting in a slightly worse interpolation; see the difference between the resulting images from Matlab and CUDA in figure 5.6. The slightly bent white line in the image occurs because of the indexing difference in Matlab compared to most programming languages: the indexing starts at 1 and not 0.

Figure 5.6: Absolute difference between the Matlab and GPU results, values in range [0, 1].

The results of the different steps of the optimization are all presented below to be able to evaluate them. All results are averages over 5 runs.

5.5.1 Memory transfer

As explained in section 3.1.1, transferring data from the CPU to the GPU is often a bottleneck when running smaller kernels. The difference between using pageable and pinned memory, see section 3.1.1, is displayed in table 5.1. The time of the memory transfer does not affect the kernel time.

Table 5.1: Transferring an image of 1024x1024 pixels of 32-bit floating points between CPU and GPU using pageable vs pinned memory (µs).

Task                   GTX 680   GT 640
Pageable CPU -> GPU    703       13113
Pageable GPU -> CPU    649       17810
Pinned CPU -> GPU      694       12908
Pinned GPU -> CPU      649       8799

On Tegra K1 no regular memory needs to be transferred between the CPU and GPU because of the unified memory pool. Data lying in the texture memory needs to be transferred though. The transfer of a 1024x1024 image of 32 bit floating points to the texture memory takes about 1.1 ms.

5.5.2 Kernel execution

The results of running a rectification implementation on a 1024x1024 pixel image using a naive approach (only global memory), a constant memory approach, and a texture memory and constant memory approach on the GTX 680 and GT 640 are displayed in table 5.2.

Table 5.2: Performance of different optimization steps (µs). The implementation using texture memory also uses constant memory.

Task              GTX 680   GT 640
Naive                 392     1334
Constant memory       280      960
Texture memory         88      747

On Tegra K1 it is not as obvious what the best way of optimizing the kernel is. The texture memory can not be used in the unified memory pool. This means that if the texture memory is used, more transfer between GPU and CPU is needed. If the performance gained in the kernel is smaller than the time lost on memory transfer, it is not beneficial to use the texture memory. The results of running the algorithm on K1 using texture memory and unified memory are illustrated in table 5.3.

Table 5.3: Performance of using texture memory and global unified memory on K1 (ms).

Task             Tegra K1
Texture memory        1.8
Global unified        4.1

Since the memory transfer time to the texture memory is 1.1 ms and the kernel time reduction from using the texture memory is 2.3 ms, it is preferable to use the texture memory. Since the memory transfer time is shorter than the kernel time, the latency can theoretically also be hidden by using multiple streams. The resulting images would then be received with a constant delay of 1.1 ms, but new results would be received every 1.8 ms.
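A rough sketch of such a double-buffered pipeline is given below. For brevity it copies into an ordinary device buffer with cudaMemcpyAsync; in the real case the upload would go into the CUDA array backing the texture (for example with cudaMemcpy2DToArrayAsync). The buffer names and the frame count are hypothetical:

```cuda
#include <cuda_runtime.h>

// Placeholder kernel standing in for the rectification kernel.
__global__ void rectifyKernel(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Double buffering with two streams: while the kernel processes frame i in one
// stream, the upload of frame i+1 proceeds in the other, so the 1.1 ms transfer
// hides behind the 1.8 ms kernel and results arrive every kernel period.
void pipeline(float* pinnedFrames[], float* d_in[2], float* d_out[2],
              int n, int nFrames)
{
    cudaStream_t streams[2];
    cudaStreamCreate(&streams[0]);
    cudaStreamCreate(&streams[1]);

    for (int i = 0; i < nFrames; ++i) {
        int b = i % 2;                       // which buffer/stream pair to use
        cudaMemcpyAsync(d_in[b], pinnedFrames[i], n * sizeof(float),
                        cudaMemcpyHostToDevice, streams[b]);
        rectifyKernel<<<(n + 255) / 256, 256, 0, streams[b]>>>(d_in[b], d_out[b], n);
    }
    cudaDeviceSynchronize();

    cudaStreamDestroy(streams[0]);
    cudaStreamDestroy(streams[1]);
}
```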

Table 5.2 shows that choosing texture memory instead of global memory is preferred for a desktop GPU running the rectification algorithm. The data set used for that test is an optimal data set for the global memory. In table 5.4 the plane that is to be extracted from the input image is rotated 90 degrees relative to the input image, making the access pattern in the input image very bad for the global memory, as discussed in section 5.4.2. The tests are run with a 1024x1024 image size. The results show that for this kind of data set the advantage of the texture memory is even larger than for the previous data.

Table 5.4: Results for the hard data set using texture and global memory (µs).

Task                          GTX 680   GT 640
Kernel using texture memory        89      775
Kernel using global memory        659     3123

5.5.3 Memory access performance

The performance of a kernel can be evaluated by comparing its average memory access speed to the memory access speed of a copy kernel, see section 3.2.7. The memory access performance was quite good for the rectification algorithm, but it differed between the GTX 680 and the other two GPU:s. The size of the input data was 1024x1024 and the size of each element was 4 bytes (32-bit floating points). In the kernel code there is one read and one write, making n = 2. The memory access speed on the Kayla platform is then:

1024 · 1024 · 4 · 2 / (747 · 10^-6) ≈ 11 GB/s.    (5.4)


The memory access speed of a copy kernel on Kayla was 27 GB/s, making the memory access performance 0.4.

The memory access speed of the Tegra K1 was:

1024 · 1024 · 4 · 2 / (1.8 · 10^-3) ≈ 4.7 GB/s.    (5.5)

The memory bandwidth of the Tegra K1 was 11.7 GB/s, making the memory access performance 0.4.

The memory access speed of the GTX 680 was:

1024 · 1024 · 4 · 2 / (88 · 10^-6) ≈ 95 GB/s.    (5.6)

The memory bandwidth of the desktop GPU was 147 GB/s, making the memory access performance 0.64.

5.6 Discussion

As section 5.3 states, the algorithm is very suitable for parallelization. The practical results also indicate that its performance running on an actual GPU is good. As mentioned in section 3.1.1, the bottleneck of running algorithms on GPU:s is often the number of accesses to the global memory. When using texture memory, the rectification algorithm only performs two memory accesses per thread, apart from the input parameters, which are all read once for every thread.

As the results show, the most important differences in performance for this algorithm depend on how the different memories are used. Using the constant memory for the parameters of the lens distortion and for the homography is crucial to get a good result. Depending on the camera installation, it is also important to use texture memory, see section 5.4.2 and table 5.4. The texture memory also makes the installation of the camera much easier, since a non-horizontal access pattern will not decrease the performance.

5.6.1 Performance

The memory access performance of the algorithm is not very close to optimal. The lens correction part of the algorithm makes it almost impossible to avoid cache misses. It is possible that using the shared memory manually could make the memory access performance even higher, but manual shared memory has not been used in this project. Something that is interesting for the project is to use the performance to calculate how much of the GPU computing time is used by the rectification. The performance goal of the rectification is that other algorithms should be able to run simultaneously on the GPU, keeping their performance. Given that 25 frames per second (fps) are needed for the other algorithm and the rectification kernel takes 1.8 ms on K1, the fraction of GPU time used by the rectification is approximately 1.8 ms · 25 ≈ 0.05, or 5%. The performance goal is therefore considered fulfilled.

5.6.2 Memory transfer

For a classic computer architecture with separate memories for the GPU and the CPU, the kernel is very fast, but the transfer time of data from the CPU to the GPU is slow in comparison. This memory latency can not be hidden by using multiple streams, i.e. no matter how fast the kernel is, the number of kernels that can be run per second is restricted by the memory transfer time. The conclusion from this is that for a classic computer architecture, it is better to include the rectification part in another algorithm than to use it separately, since the memory transfer from CPU to GPU can then be avoided and the memory latency will be possible to hide by using multiple streams.

The unified memory on Tegra K1, see section 3.1.1, results in several benefits. The obvious one is that slow CPU-GPU memory transfers are removed and there will be less memory latency for the kernels. Another benefit is that the host code will be easier to read and understand because less code will be about memory transfers.
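A sketch of what the host code can reduce to with unified memory (cudaMallocManaged); the kernel name and launch configuration are placeholders:

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel declaration standing in for the rectification kernel.
__global__ void rectifyKernel(const float* in, float* out, int width, int height);

// Host side with unified memory: no explicit cudaMemcpy calls, the CPU and the
// GPU dereference the same pointers.
void runWithUnifiedMemory(int width, int height)
{
    float *image = nullptr, *result = nullptr;
    cudaMallocManaged(&image,  width * height * sizeof(float));
    cudaMallocManaged(&result, width * height * sizeof(float));

    // ... the CPU writes the captured frame directly into image ...

    dim3 block(32, 8);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    rectifyKernel<<<grid, block>>>(image, result, width, height);
    cudaDeviceSynchronize();     // make the result visible to the CPU

    // ... the CPU reads result directly ...

    cudaFree(image);
    cudaFree(result);
}
```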

5.6.3 Complexity of the software

The software written for the rectification algorithm is short and easy to read. It is, however, harder to read than software doing the same thing on a CPU. The main difference in readability is that management of threads using the combination of grids, blocks and threads is more complex than management of threads in CPU code, where one-dimensional indexes are used.
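The sketch below illustrates this index bookkeeping; the kernel name is illustrative and the body is omitted:

```cuda
// Each thread rebuilds its pixel coordinate from block and thread indices, and
// threads outside the image must be masked off, whereas a CPU implementation
// would simply loop over one pixel index.
__global__ void rectifyPixel(const float* in, float* out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;   // the grid may be slightly larger than the image

    // ... lens correction and homography for pixel (x, y) ...
    out[y * width + x] = in[y * width + x];  // placeholder for the real computation
}
```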

5.6.4 Compatibility and Scalability

One important task of this project was to determine the portability of the software written for a specific GPU. For the rectification algorithm there is no obvious need to change the code depending on which GPU is used. Both the desktop GPU (Geforce GTX 680) and the Kayla platform (Geforce GT 640) run the same software. They also perform best for approximately the same block size. The code must be changed a bit to use unified memory on K1, though, but the code that uses unified memory is shorter and easier to understand because of the absence of memory transfers.

5.7 Conclusions

The rectification algorithm works well on a GPU, and especially well on an embedded GPU. The reason why it works better on an embedded GPU is the avoidance of memory transfer latency when using the unified memory pool. The performance is very high, giving the GPU a chance to perform other tasks along with the rectification.


The algorithm is completely parallelizable, which makes it computationally light for a GPU. The fact that the memory access pattern is not coalesced slows down the result, though.

The software is easy to understand and is compatible with different GPU:s featuring the Kepler architecture. However, to get maximum performance, a suitable image size should be selected: the number of pixels should be a multiple of the warp size, see section 3.1.2.

An obvious benefit of using embedded GPU:s compared to other hardware is that it makes the installation easier for customers, since perfect alignment of the camera is not necessary to keep the performance up, see section 5.4.2 and figure 5.4. When the direction of the camera lens becomes more horizontal, the resulting image from the rectification gets blurrier, though. Solutions similar to the texture memory could be implemented on other hardware, but with a high developer effort.


6 Pattern Recognition

6.1 Sequential Implementation

Before any parallel pattern recognition algorithm was implemented, a sequential implementation was made. The purpose of the sequential implementation was to get a deeper understanding of the algorithm. Since it is very hard to debug parallel code, and especially GPU code, it is very convenient to rely on a verified CPU implementation when making a GPU implementation. The performance of the sequential CPU implementation should not be compared to the GPU implementations. Such a comparison would not be fair since no greater effort has been made to optimize the performance of the sequential algorithm. The sequential implementation was done according to algorithm 2.

6.2 Generating test data

The test data for the pattern recognition algorithm was mainly synthetic. To ensure that the images were not too noisy to get a good result, synthetic search images were made by pasting rotated pattern images, see figure 6.1, into a larger image, see figure 6.2. However, the fact that the synthetic data was noise free was not exploited to make a faster implementation that would fail for a noisy data set.
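As an illustration of how such a search image can be generated, a host-side sketch with nearest-neighbour sampling is given below; it is not the code used in the thesis, and the angle, position and sizes are arbitrary test parameters:

```cuda
#include <cmath>
#include <vector>

// Paste a pattern, rotated by angleRad around its centre, into the search image
// with its centre at (posX, posY). Images are stored row major as float arrays.
void pastePattern(std::vector<float>& search, int sw, int sh,
                  const std::vector<float>& pattern, int pw, int ph,
                  float angleRad, int posX, int posY)
{
    const float c = std::cos(angleRad), s = std::sin(angleRad);
    const float cx = pw / 2.0f, cy = ph / 2.0f;
    for (int y = 0; y < sh; ++y) {
        for (int x = 0; x < sw; ++x) {
            // Rotate the destination pixel back into pattern coordinates.
            float dx = x - posX, dy = y - posY;
            int px = static_cast<int>(std::lround( c * dx + s * dy + cx));
            int py = static_cast<int>(std::lround(-s * dx + c * dy + cy));
            if (px >= 0 && px < pw && py >= 0 && py < ph)
                search[y * sw + x] = pattern[py * pw + px];
        }
    }
}
```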
