
Bachelor Thesis, 15 hp

REAL-TIME HYPERSPECTRAL IMAGE ANALYSIS ON GPU HARDWARE

Performance impact of different GPU architecture implementations

Simon Bonér


Abstract

This paper examines the optimization possibilities of using different GPU memory types for a hyperspectral imaging algorithm. It focuses on the global memory, the shared memory, and the constant memory of the GPU. Three versions of the hyperspectral imaging algorithm are implemented, utilizing the GPU's global memory, shared memory, and constant memory, respectively. The algorithm consists of 4 steps: 3 pre-processing steps and 1 prediction step. The pre-processing steps comprise calculating the absorption, centering the image, and normalizing using the standard normal variate. Lastly, there is a prediction step using a matrix-vector multiplication. The implementations are then tested on their performance in processing an image. We also investigate how coalescing images in the different implementations can speed up the processing and what kind of extra latency it adds to the processing of an image.


Contents

Introduction
Background
GPU Architecture
    GPU Multiprocessors
    GPU Memory
        2.2.1 Global Memory
        2.2.2 Shared Memory
        2.2.3 Constant Memory
Method
    Absorption
    Centering
    SNV
    Prediction
Result
Discussion
    Interpretation of Results
        5.1.1 Different Memories
        5.1.2 Number of Coalesced Frames
Conclusion
Acknowledgement
References


Introduction

This bachelor's thesis investigates the impact different GPU hardware implementations have on a real-time hyperspectral imaging analysis algorithm. Hyperspectral imaging is a type of imaging that collects information from the electromagnetic spectrum [1].

The spectrum of the pixels is then processed to extract useful information that can, for example, be used to identify materials. This is reminiscent of the way human eyesight works. We as humans process visible light in the electromagnetic spectrum in 3 bands: red, green, and blue.

But whereas human eyes can only see colors in the visible light, hyperspectral imaging allows for analyzing more of the electromagnetic spectrum, extending outside visible light. It can also divide the visible light spectrum into much finer bands.

Hyperspectral imaging cannot be achieved using regular cameras, though. Regular cameras produce images with a height and a width. Each pixel in that image then has 3 different colors, most often red, green, and blue, which can be combined into any color for the pixel. This gives an image from a regular camera 3 dimensions: the width, the height, and the colors (e.g. red, green, and blue). The hyperspectral images from the hyperspectral imaging cameras that we investigate, on the other hand, are 2-dimensional. Instead of having a height, they are only 1 pixel high. That is, they consist of a single row of pixels, each capturing a set of spectral bands. Such a hyperspectral imaging camera could, for example, capture 256 different bands instead of only the 3 of a regular camera.

Therefore, the cameras are not especially useful for taking images of static objects. Instead, they can be used in conjunction with production lines or moving vehicles to scan objects as they move past the camera. The lines of images can then be stitched together to create a regular, 3-dimensional image. Using all this, hyperspectral imaging can, for example, be used to make predictions about the materials of objects that are traveling along a production line. These predictions can then be used to sort the objects depending on their material composition.

Making these predictions comes at a cost, though. Since the hyperspectral imaging cameras only take 1-pixel-high images, they must operate at a higher frame rate to match the speed of the moving objects. This, combined with the fact that each pixel generally consists of many more than 3 bands, results in a high data throughput rate. If this data is to be analyzed and used in real-time decision making, then it needs to be computed efficiently.

The algorithm that is used in this paper to make predictions regarding the hyperspectral images is split into 4 steps, of which 3 are pre-processing steps. In the first step we calculate the absorption spectrum and adjust the image. Then the image is centered. Thirdly, we normalize the image using the standard normal variate. Finally, the prediction regarding the image is made. Further information regarding this algorithm can be found in the Method chapter.

This algorithm can be implemented in more or less computationally efficient ways. But it can also be implemented on more or less specialized hardware. There are ways to implement it on hardware that allows for highly parallel execution of the algorithm, which would allow for short execution times as well as processing on lighter-weight hardware. Therefore, the aim of this report is to investigate the best way to implement this algorithm on highly parallel and abundant hardware: GPUs. The focus of the implementation is on finding the best memory architecture for these computations, as GPUs support several different memory architectures.


Background

The GPU is relatively new hardware, especially as a general-purpose compute unit. In the early 1990s, graphical operating systems had become widespread. This in turn led to game developers creating new and graphically demanding games, something the CPUs of that era were not built to handle. The new games required a lot of floating-point operations for the rotations and projections needed to render a game. This could be accelerated with a special compute unit called a floating-point unit (FPU) that had to be placed on the chip. At first, FPUs started appearing on regular CPU chips, but it quickly became apparent that the ever-increasing demand for more graphically intensive games could not be met by just the FPUs on the CPU. This gave rise to a new plug-in card called the Graphics Processing Unit (GPU), specialized in doing floating-point operations. With the GPU being specialized in floating-point operations and the CPU able to offload the rendering to the GPU, there was suddenly a massive increase in performance.

In 2001 the first steps, from a parallel computing standpoint, were made for GPUs. Nvidia released a GPU with programmable vertex computing and pixel shading. This was the first time programmers had control over the computations made on their GPUs. Even though these new tools allowed for programming parts of the graphics pipeline, the GPUs were still only meant for graphics processing. This meant that researchers who wanted to use their GPU hardware to accelerate their algorithms had to make their non-rendering tasks seem like standard rendering. This was a convoluted process but showed promising initial results for GPU computing.

The limitations of tricking the GPU into computing non-rendering tasks limited the use cases and adoption of GPUs as general-purpose compute units. But in 2006 Nvidia released the first GPU built with their new CUDA architecture. This architecture included new components and instructions specifically built for GPU computing. This new GPU allowed for both rendering and general-purpose computing. However, an API for accessing this specialized general-purpose computing architecture was not yet available. Therefore, Nvidia also launched the CUDA C compiler, which allowed for pure general-purpose computing without having to trick the GPU into rendering the result. This is now called general-purpose computing on graphics processing units (GPGPU), and many applications have since been found that can be accelerated with these types of computations.

As is apparent, GPUs are not fundamentally very different from regular CPUs. Instead, they are very specialized compute units meant to offload certain types of computations from the CPU. Since they were built to render images, by rotating and projecting, they are very good at doing specific calculations massively in parallel. Algorithms where the same operation is applied to different data elements at the same time are called Single Instruction Multiple Data (SIMD) algorithms. Rendering algorithms like rotation, scaling, and projection are SIMD algorithms and therefore scale very well on GPUs. But many other algorithms, not related to rendering, are also SIMD and likewise scale very well on GPU hardware. GPUs are not limited to SIMD algorithms, but those are great examples of algorithms that scale well on GPUs.

There are several different application areas for GPU algorithms today. One of the earliest and most prevalent application areas is image processing. These algorithms are often SIMD algorithms that can easily be implemented on the GPU. Examples are image flipping, applying a black-and-white filter, and applying a Gaussian filter (a blurring method) [2].

Another common use case is summarizing large amounts of data. This is regularly done using histograms, which once again can be implemented as a SIMD algorithm [3]. GPUs have also helped with heavy simulations that require a lot of calculations, like the n-body problem [4].

Furthermore, in recent times GPUs have been increasing the performance of neural networks and deep learning [5]. GPUs see many more application areas than these. Most algorithms working with


a lot of data that can be parallelized in some way can then also be accelerated by GPU hardware. Nvidia even provides an entire library just for accelerating linear algebra operations, as these often fit the requirements of large datasets and parallelizable algorithms [6].


GPU Architecture

The GPU architecture is very different from the one used in CPUs. In designing the GPU processor, Nvidia sacrificed both the generality and the fast sequential performance of the CPU in return for massively parallel performance. This in turn forces the programmer to adjust their algorithms to the parallel nature of the GPU. To understand the motivations behind certain algorithmic choices made, we first must explain the architecture behind the GPU that constrains us.

There are a few keywords that are essential and recurring throughout the following sections. As these are important to understand, we have added a summary of them below and then further explain them in the following sections.

• Kernel – The function(s) that are executed on the GPU

• Single Instruction Multiple Data (SIMD) – A type of algorithm design that performs the same operation on every piece of data, with only the data varying, like scalar multiplication of a vector.

• Block – A group of threads, often computing a separate part of the data

• Streaming Multiprocessor (SM) – A set of cores that share specific hardware. Blocks are always executed on a specific SM.

• Warp – A set of 32 threads executing the same block on the SM; they execute simultaneously.

GPU Multiprocessors

First, the kernel is the code that is executed on an individual core on the GPU. In contrast to parallel CPU algorithms, the kernel is almost always Single Instruction Multiple Data (SIMD). This is because such algorithms often scale easily to massive amounts of cores. And if the kernel does not scale to a large set of cores, it is almost always better to perform the computation on the CPU, since it is much faster in single-threaded performance. This might even apply if the algorithm scales but does not do so very well. An exception to this might be, e.g., when transferring data back and forth between the GPU and CPU is slower than doing a slow computation on the GPU. This is because transferring data between the CPU and GPU is a relatively slow process, and the overhead is at times larger than doing slow, serial computations on the GPU.

Another difference in developing for the GPU is the instantiation of a kernel. When launching the kernel there are additional parameters that must be specified: the block and thread counts. A block is a set of threads that is executed on a Streaming Multiprocessor (SM). Looking at Figure 1, which shows an SM, this SM will be assigned a block and will then schedule the block's threads to execute on its cores. The thread parameter in the instantiation is the number of threads that are executed on the cores in the SM for that block. Therefore, a 2-block, 4-thread instantiation would use 2 different SMs, and each SM would launch 4 threads, for a total of 8 threads executing for that kernel. This is one of the ways GPU architecture differs from CPU architecture. A minimal sketch of such a launch is shown below.
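To make this launch configuration concrete, below is a minimal CUDA sketch of a 2-block, 4-thread instantiation. The kernel, buffer names, and sizes are illustrative assumptions and not taken from the thesis code.

#include <cuda_runtime.h>
#include <cstdio>

// Each thread computes its global index from its block and thread coordinates.
__global__ void writeIndex(int *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = i;
}

int main()
{
    const int blocks = 2, threadsPerBlock = 4;   // 2 blocks of 4 threads = 8 threads in total
    int *d_out;
    cudaMalloc(&d_out, blocks * threadsPerBlock * sizeof(int));

    // The <<<blocks, threads>>> launch configuration described in the text:
    // each block is assigned to an SM, which schedules the block's threads on its cores.
    writeIndex<<<blocks, threadsPerBlock>>>(d_out);
    cudaDeviceSynchronize();

    int h_out[8];
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    for (int i = 0; i < 8; ++i) printf("%d ", h_out[i]);
    printf("\n");

    cudaFree(d_out);
    return 0;
}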

A Pascal architecture SM is shown in Figure 1. Though this is the GP100 instead of the model used in this report, the GTX 1080, it is close enough to describe the general architecture. The Pascal architecture processors are split into several SMs [7]. An SM contains 2 warps. Inside a warp, there are 32 regular cores as well as some Double Precision Units (DP Unit), Load and Store units (LD/ST), and Special Function Units (SFU). These are what execute a set of the threads from the block. Without going into too much detail, it is important to note that these might limit the performance of an application. In the case of the GP100, the ratio of regular cores to DP Units is 2:1, but in the GTX 1080 the ratio is 32:1, which drastically impacts the performance of double-precision calculations. A similar argument applies to the SFUs, which perform logarithmic calculations and the like.

An SM also includes an Instruction Cache and Shared Memory; this hardware is shared by the entire SM. So, when launching a block of threads, this is all the hardware that the block has available. And since there exist several SMs on a GPU, the kernel needs to be launched with several blocks to fully utilize the parallelism of the GPU. It should also be noted that an SM can have several blocks scheduled, which can increase the performance of that SM since it is, e.g., able to compute another block while waiting for the first block to fetch data. Though, the limitations of the hardware in an SM also limit the number of blocks that can be scheduled on that SM; e.g., if a block utilizes more than half of shared memory, no more blocks can be scheduled on that SM until that block is finished. As a last note on the GPU hardware, we can often see a performance increase by using more threads per block than there are cores in the SM, due to the GPU's excellent thread switching.

Figure 1: The Streaming Multiprocessor of a Pascal GP100

GPU Memory

The Nvidia Pascal GPU architecture contains several different types of memory, and some memory types even have several cache layers. This results in a complex memory structure, and even though much of it is hardware-controlled, many optimizations can be had by choosing the correct memory structure. The memory types we are examining are Global Memory, Shared Memory, and Constant Memory. There also exists a Texture Memory, but this is optimized for an access pattern of physically close elements in a 2D sense, something these algorithms do not use.

2.2.1 Global Memory

Global memory is the main and largest of the GPU's memory types. The GTX 1080 GPU has a global memory of 8 GB. The bandwidth of this GPU's memory is 320 GB/s. Though this connection might seem sufficient, there is still a possibility of saturating the connection between the cores and the global memory. The reason for this is the vast number of cores in GPUs; for example, the GTX 1080 has 2560 cores.

To counteract these limitations the GPU architecture also utilizes an L2 cache. Contrary to CPU caches, the GPU L2 cache is coherent, meaning it presents the same memory addresses to every core. It should also be noted that GPUs of the Pascal architecture do not have an L3 cache; therefore, the L2 cache is what is called the Last Level Cache (LLC). Though this memory is not large, 2048 KB for the GTX 1080, it is specifically designed to distribute data to a lot of cores efficiently. This, combined with the fact that cores often work with physically close data, allows the L2 cache to increase performance. It should also be noted that the individual cores in each SM have a shared L1 cache, meaning the L1 cache is not coherent throughout the GPU but is throughout the SM. Though, there is no way for a programmer to decide what data goes in or out of these caches, since they are what is called hardware-controlled caches. This means that we lose performance whenever there is a cache miss, that is, when the required memory is not in cache and needs to be fetched from global memory. Therefore, we as programmers have access to another type of memory that allows for the same efficiencies as the caches for the global memory, but instead of the content being decided by an algorithm, we can specify exactly what information we want stored there. This memory is called shared memory.

2.2.2 Shared Memory

Shared memory is what is called a software-controlled cache. This means that the programmer can almost exclusively decide what goes into this memory. This memory is very similar to the L1 cache of the cores, since it is not coherent between all cores but only between the cores in the same SM. Neither is it very large; in the Pascal architecture it is only 96 KB, which drastically limits how much data can be put into shared memory. It should also be noted that for maximum performance, the GPU would like to schedule several blocks onto an SM. But if a kernel needs more than half of the shared memory allocated to execute a block, the GPU is limited to one block per SM, resulting in inactive time, e.g., when one core needs to fetch data from memory. Therefore, a more appropriate maximum of allocated shared memory per block is 48 KB. Shared memory can drastically speed up kernels that have small data sets that are repeatedly used by a block or that do coordinated calculations inside a block. Though, it should also be noted that since the data is not coherent between SMs, the data has to be copied back to global memory if it is needed for further calculations.

2.2.3 Constant Memory

The last memory architecture that is examined in this report is constant memory. This memory, as the name implies, consists only of constant, immutable values. It is also very limited in size; on Pascal the size is only 64 KB. In contrast to shared memory, constant memory is not local to an SM; instead it is shared across the entire GPU. The architecture of this memory is, however, very different from other GPU memory when it comes to fetching data from it. Instead of having a 1-to-1 communication between core and memory, like every other memory, the constant memory broadcasts the data to a warp, either 1-to-16 or 1-to-32. This means that if several cores want to access the same memory address, this can drastically decrease the memory bandwidth utilization, since the access can be done as a broadcast instead. Though, it should also be noted that fetching data from constant memory at different memory locations will hurt performance. This is due to the constant memory only being able to broadcast data to one warp at a time. Therefore, if the broadcasting capabilities are not utilized, the memory accesses will be serial instead of parallel, as they are with global memory. One extra feature of constant memory is that, since it is constant, the GPU can cache it more aggressively, which in turn increases performance if the data is read more than once.


Method

The algorithm is implemented in 4 parts: 3 pre-processing parts and then 1 prediction part. The input is a raw image, in the case of this report 640 pixels wide and 256 wavelengths deep. This is a total of 163,840 data points, each represented as a float (4 bytes), resulting in images of 655,360 bytes, or 640 KB. The image is processed through the 3 pre-processing steps before it is used to make predictions. The prediction uses a 256-wavelength by 8-feature matrix that is multiplied with the image matrix. The output is a 640-pixel by 8-feature matrix that can be used in post-processing to make decisions about the material each pixel consists of.

It is also important to note that even though the algorithm below handles images of several pixels, these pixels are treated as isolated, single pixels. This means that no synchronization or communication is needed between pixel calculations; hence every pixel can be launched as an individual block. As a result, the algorithm is SIMD over all pixels. Therefore, the following descriptions focus only on the wavelengths of a pixel, as the calculations are exactly the same for every pixel. For clarification, we should also note that every pixel contains several wavelengths; this is represented as wl in the formulas used later. When we write, e.g., Img(wl) := 1, we mean that, for a given pixel in the image, every wavelength wl is set to 1.

There were two possible directions for designing the algorithms: either use libraries or implement them ourselves. Though there are many reasons for using libraries, especially in a production setting, a library limits one to that exact implementation, which does not always fit the problem perfectly. Another reason is that the algorithms in libraries often already use the optimal hardware architecture for that implementation. This would limit us in comparing the impacts of different hardware architectures, as it introduces an extra parameter, the implementation.

The parts of the algorithm are implemented as individual kernels and are also tested individually. Therefore, some memory variants do not apply to every kernel, since the kernels are very different in structure and the same optimizations cannot be applied everywhere. These limitations are further discussed in the individual sections for the kernels.

It also has to be noted that some of these kernels require so few reads from memory that it might invalidate the use of specialized memories like shared or constant memory. Therefore, to increase the performance and to better understand the impact of the different memory types, we investigate a varying number of coalesced images. Coalescing means that data that is static between images can be reused within the same kernel launch, since it is also needed for the next coalesced image. Without coalescing, we might miss valuable effects; e.g., the performance benefits of shared memory might not be realized, as the overhead of copying data to shared memory might be larger than the benefits it provides. A sketch of this idea is given below.
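As a sketch of the coalescing idea, the following kernel reuses a static per-wavelength vector across several frames in one launch. The names, the 256-wavelength size, and the frame layout are assumptions, not the thesis code.

#include <cuda_runtime.h>

#define WAVELENGTHS 256

// One block per pixel, one thread per wavelength; numFrames frames are coalesced
// into a single kernel launch so the static data is fetched only once per block.
__global__ void coalescedCenteringKernel(float *img, const float *staticVec, int numFrames)
{
    int wl = threadIdx.x;

    // Copy the static per-wavelength data into shared memory once per block ...
    __shared__ float s_static[WAVELENGTHS];
    s_static[wl] = staticVec[wl];
    __syncthreads();

    // ... and reuse it for every coalesced frame, amortizing the copy overhead.
    for (int frame = 0; frame < numFrames; ++frame) {
        int idx = (frame * gridDim.x + blockIdx.x) * WAVELENGTHS + wl;
        img[idx] = img[idx] - s_static[wl];
    }
}

// Example launch: 640 pixels per frame, 10 coalesced frames:
// coalescedCenteringKernel<<<640, WAVELENGTHS>>>(d_img, d_staticVec, 10);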

Absorption

The first kernel calculates the absorption of the image. The algorithm first calculates the reflection of every wavelength; recall that this is done for every individual pixel of the raw image (Raw). This is done using black and white reference images (DarkRef and WhiteRef), which are static images gathered at initialization. The formula for the reflection looks as follows:

\[ \mathit{Refl}(wl) := \frac{\mathit{Raw}(wl) - \mathit{DarkRef}(wl)}{\mathit{WhiteRef}(wl) - \mathit{DarkRef}(wl)}. \]

To formulate an algorithm that computes the reflection for each wavelength in a pixel, we can precompute the static divisor

\[ \mathit{Div}(wl) := \frac{1}{\mathit{WhiteRef}(wl) - \mathit{DarkRef}(wl)}. \]

This means that the resulting reflection calculation looks as follows:

\[ \mathit{Refl}(wl) := (\mathit{Raw}(wl) - \mathit{DarkRef}(wl)) \cdot \mathit{Div}(wl). \]

Then, using the reflection, the algorithm calculates the absorption of every wavelength in the pixels. This calculation looks as follows:

\[ \mathit{Img}(wl) := -\log_{10}\!\left(\frac{1}{\mathit{Refl}(wl)}\right). \]

Using this information, we end up with three vectors that need to be transferred to the GPU: the black reference (DarkRef), the divisor (Div), and the raw pixel (Raw). Since two of these are static, we can load them at initialization instead of at runtime. Therefore, the only data that needs to be transferred per frame is the raw image.

The two static vectors can be placed in global, shared, or constant memory. Storing them in shared memory for the duration of the processing means two extra copies from global memory, copying DarkRef and Div. Though, we only need to copy the part that is used by this block of computation. This might yield a better result, especially at high coalescing, since shared memory is faster than regular global memory if we do not account for caches.

Storing them in constant memory, on the other hand, is not as attractive an option. Judging from the access pattern of the algorithm, this would probably hurt performance, since no threads access the same memory locations; instead they request different areas of the vector. But the bigger concern here is the size limitation of constant memory. Since constant memory is only 64 KB and the DarkRef and Div images are of the same size as the input image, 640 KB, there is no way to store them efficiently in constant memory. A minimal sketch of the global-memory variant of this kernel is shown below.
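The following is a minimal sketch of the global-memory variant of the absorption kernel, assuming one block per pixel, one thread per wavelength, and DarkRef and Div already resident in global memory from initialization. Names and layout are illustrative, not taken from the thesis code.

#include <cuda_runtime.h>

#define WAVELENGTHS 256

// raw, darkRef, and div are full-frame buffers in global memory; img receives the absorption.
__global__ void absorptionKernel(float *img, const float *raw,
                                 const float *darkRef, const float *div)
{
    int wl  = threadIdx.x;
    int idx = blockIdx.x * WAVELENGTHS + wl;   // this block handles one pixel

    // Refl(wl) := (Raw(wl) - DarkRef(wl)) * Div(wl)
    float refl = (raw[idx] - darkRef[idx]) * div[idx];

    // Img(wl) := -log10(1 / Refl(wl))
    img[idx] = -log10f(1.0f / refl);
}

// Example launch, one block per pixel:
// absorptionKernel<<<640, WAVELENGTHS>>>(d_img, d_raw, d_darkRef, d_div);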

Centering

The second pre-processing kernel, centering, centers the wavelengths of each pixel. This is done using a centering vector (Center) that is identical for each pixel and image. The calculation looks as follows:

\[ \mathit{Img}(wl) := \mathit{Img}(wl) - \mathit{Center}(wl). \]

Since the centering vector is static over images, we can load it into memory during initialization. The image is already in memory from the absorption step, so there is no need to load any data into memory.

This calculation does, however, allow for possible memory optimizations. Just like in the absorption kernel, there is the overhead of copying the data to shared memory if we want to use it, but at high enough coalescing we might still see performance benefits. Also, this is one copy fewer, as only one vector is being copied.

Constant memory, on the other hand, allows for storing the entire centering vector. Though, there are some questions as to how efficiently the algorithm can utilize the constant memory's broadcast capabilities, since the individual threads request different memory addresses, even though they are close. A sketch of the constant-memory variant is given below.
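The following is a minimal sketch of the constant-memory variant of the centering kernel. The names and the cudaMemcpyToSymbol initialization step are illustrative assumptions, not the thesis code.

#include <cuda_runtime.h>

#define WAVELENGTHS 256

// The static centering vector: 256 floats = 1 KB, well within the 64 KB of constant memory.
__constant__ float c_center[WAVELENGTHS];

__global__ void centeringKernel(float *img)
{
    int wl  = threadIdx.x;
    int idx = blockIdx.x * WAVELENGTHS + wl;
    img[idx] = img[idx] - c_center[wl];        // Img(wl) := Img(wl) - Center(wl)
}

// At initialization the host copies the centering vector into constant memory once:
// cudaMemcpyToSymbol(c_center, h_center, WAVELENGTHS * sizeof(float));
// centeringKernel<<<640, WAVELENGTHS>>>(d_img);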

SNV

The last pre-processing step, SNV (Standard Normal Variate), calculates the variance of the wavelengths and then normalizes them using the variance. This results in the following calculation:


\[ \mathit{variance} := \frac{\sum_{wl=0}^{\mathit{wavelengths}} (\mathit{mean} - \mathit{Img}(wl))^2}{\mathit{wavelengths} - 1}. \]

This calculation requires that the mean is known, which is calculated in the following way:

\[ \mathit{mean} := \frac{\sum_{wl=0}^{\mathit{wavelengths}} \mathit{Img}(wl)}{\mathit{wavelengths}}. \]

Neither of these calculations is inherently parallel, since each step depends on the previous result. Though, since addition is associative and commutative, the summation can be split into several separate partial sums and then combined using a reduction.

Lastly, using the mean and variance of the pixel's wavelengths, each element is normalized. This calculation, on the other hand, is fully parallelizable, since it only operates on a single element. It is done as follows:

\[ \mathit{Img}(wl) := \frac{\mathit{Img}(wl) - \mathit{mean}}{\sqrt{\mathit{variance}}}. \]

This kernel is quite different from the others, since it only operates on the image data and has to do coordinated calculations between threads. Therefore, there are a few memory tweaks that can be made. First, all memory used in this calculation is modified, which rules out constant memory as an option. But using shared memory instead of global memory, especially in the calculation and reduction of the mean and variance, could provide some performance benefits. Since every pixel is assigned to a block and every pixel can be calculated independently, there is no need to coordinate between SMs, and the entire mean and variance calculation can be done in shared memory. This is something we cannot guarantee will happen with the L1 cache, as it is hardware-controlled. A sketch of the shared-memory reductions is given below.
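The following is a minimal sketch of the shared-memory reductions for the mean and variance, assuming one block per pixel, one thread per wavelength, and a power-of-two wavelength count. All names are illustrative, not taken from the thesis code.

#include <cuda_runtime.h>

#define WAVELENGTHS 256

__global__ void snvKernel(float *img)
{
    __shared__ float s_data[WAVELENGTHS];          // per-block scratch space for the reductions
    int wl  = threadIdx.x;
    int idx = blockIdx.x * WAVELENGTHS + wl;       // this block handles one pixel
    float v = img[idx];

    // Reduction 1: sum over the wavelengths -> mean
    s_data[wl] = v;
    __syncthreads();
    for (int stride = WAVELENGTHS / 2; stride > 0; stride /= 2) {
        if (wl < stride) s_data[wl] += s_data[wl + stride];
        __syncthreads();
    }
    float mean = s_data[0] / WAVELENGTHS;
    __syncthreads();                               // everyone reads s_data[0] before it is reused

    // Reduction 2: sum of squared deviations -> variance
    float d = mean - v;
    s_data[wl] = d * d;
    __syncthreads();
    for (int stride = WAVELENGTHS / 2; stride > 0; stride /= 2) {
        if (wl < stride) s_data[wl] += s_data[wl + stride];
        __syncthreads();
    }
    float variance = s_data[0] / (WAVELENGTHS - 1);

    // Normalization: fully parallel, one element per thread
    img[idx] = (v - mean) / sqrtf(variance);
}

// Example launch, one block per pixel: snvKernel<<<640, WAVELENGTHS>>>(d_img);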

Prediction

The last step, which produces the result, is the prediction. This is done using a prediction matrix (PredMat) based on the wavelengths and the features we are trying to predict. A matrix-vector multiplication is performed between an image pixel vector and the prediction matrix. The resulting vector of features can then be used to make decisions about the pixel. The matrix multiplication for a specific feature can be written as follows:

\[ \mathit{Result}(\mathit{feat}) := \sum_{wl=0}^{\mathit{wavelengths}} \mathit{PredMat}(wl, \mathit{feat}) \cdot \mathit{Img}(wl). \]

Using this formula we can create several different algorithms for the matrix multiplication. But based on the facts that the image always acts as a vector and that the feature count is low, some options could be rejected. The most obvious method, letting each thread do the entire vector multiplication for one feature, would not utilize the performance benefits of the massively parallel architecture of the GPU, since the feature count is low. Therefore, the calculation was instead split over the wavelengths, resulting in more coordination but better utilization of the GPU cores. This was achieved using an atomic addition operation: every thread was assigned a wavelength, did the calculations for every feature at that wavelength, and atomically added the results to the resulting feature vector.

Then there are the memory optimizations. First, the prediction matrix is static and can therefore be loaded into memory at an initialization step. Though there are possibilities to load it into constant memory, this would limit the feature set to a very low number: since constant memory is 64 KB and the data type is floats, we end up with about 2000 elements in the matrix. Looking at the intended use case for constant memory, we can also see that it does not align with these types of operations. No data points are repeatedly used, so the aggressive caching is not utilized. Nor is the same data used on different threads, so the broadcasting of data to warps is not utilized either.

The shared memory, on the other hand, could increase performance. Since a block is executed on a single SM, we could have the GPU compute the matrix operation on data stored in shared memory and then copy it back to global memory when it has finished computing. This could result in better utilization of cached data, but it increases the coordination needed between threads and introduces an extra copy operation. A sketch of the global-memory variant using atomic additions is shown below.
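The following is a minimal sketch of the global-memory prediction kernel using atomic additions, assuming one block per pixel, one thread per wavelength, and a row-major PredMat layout. All names and the layout are assumptions, not the thesis code.

#include <cuda_runtime.h>

#define WAVELENGTHS 256
#define FEATURES 8

__global__ void predictionKernel(const float *img, const float *predMat, float *result)
{
    int wl    = threadIdx.x;
    int pixel = blockIdx.x;
    float v   = img[pixel * WAVELENGTHS + wl];

    // Each thread handles one wavelength and loops over the (few) features,
    // atomically accumulating into the pixel's feature vector.
    for (int feat = 0; feat < FEATURES; ++feat) {
        atomicAdd(&result[pixel * FEATURES + feat], predMat[wl * FEATURES + feat] * v);
    }
}

// The result buffer must be zeroed before the launch (e.g. with cudaMemset), then:
// predictionKernel<<<640, WAVELENGTHS>>>(d_img, d_predMat, d_result);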


Result

There are several ways to measure performance on GPUs; Nvidia provides tools such as Nvidia Nsight Systems [8]. These are heavyweight tools for analyzing the execution of a kernel and finding its bottlenecks. Whereas these tools are great for developing full-scale applications, they are more complex than necessary for measuring and comparing different implementations of the same algorithm. Conveniently, Nvidia also provides CUDA events in their API. These events can be used to measure everything from single API calls to entire sets of kernels with very precise timing, down to about half a microsecond. This is exactly the feature set we are looking for, since it allows us to compare different aspects of the program granularly as well as benchmark kernels precisely.

Using CUDA events we are mainly going to look at benchmarks of entire kernels, as that allows us to fully see the performance impact different implementations have; some implementations might use faster memory but then require an extra copy to global memory. It should be stated that the initializations are not benchmarked, as this step does not impact the real-time performance of the application. A minimal sketch of the timing setup is given below.
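The following is a minimal sketch of how a kernel can be timed with CUDA events. The kernel and launch configuration are placeholders rather than the thesis code.

#include <cuda_runtime.h>
#include <cstdio>

__global__ void exampleKernel(float *data)
{
    data[blockIdx.x * blockDim.x + threadIdx.x] *= 2.0f;   // placeholder work
}

int main()
{
    const int pixels = 640, wavelengths = 256;
    float *d_data;
    cudaMalloc(&d_data, pixels * wavelengths * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);                    // enqueue a timestamp before the kernel
    exampleKernel<<<pixels, wavelengths>>>(d_data);
    cudaEventRecord(stop);                     // enqueue a timestamp after the kernel
    cudaEventSynchronize(stop);                // wait until the kernel and the stop event are done

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);    // elapsed GPU time in milliseconds
    printf("Kernel time: %f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}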

Figures 2-5 show comparisons between the different GPU memory architectures for the individual kernels. The measurements are of the time that kernel took to process one frame. These comparisons are also divided into segments defined by the number of coalesced frames. Lastly, the total latency of the algorithm is shown in Figure 6; this is the time it took for one image from starting to be processed to finishing being processed. Note that in the constant memory case the only kernel using constant memory is the centering kernel, while the other 3 kernels use global memory.

Figure 2: Absorption kernel. Average execution time per frame, in 100 ns, over increasing numbers of coalesced frames.

Coalesced frames    1      2      5      10     20     50     100
Shared              9.86   7.38   4.63   3.38   2.94   2.78   2.45
Global              7.51   6.10   4.20   2.76   2.68   2.30   2.25


Figure 3: Centering kernel. Average execution time per frame, in 100 ns, over increasing numbers of coalesced frames.

Coalesced frames    1      2      5      10     20     50     100
Shared              7.32   4.76   2.83   2.22   1.61   1.44   1.40
Global              5.31   3.75   2.42   1.82   1.41   1.26   1.27
Const               5.38   3.64   2.44   1.91   1.56   1.33   1.30

Figure 4: SNV kernel. Average execution time per frame, in 100 ns, over increasing numbers of coalesced frames.

Coalesced frames    1      2      5      10     20     50     100
Shared              29.93  27.29  26.05  27.44  25.73  25.11  24.79
Global              36.84  33.36  31.58  30.38  30.23  29.92  29.73


Figure 5: Prediction kernel. Average execution time per frame, in 100 ns, over increasing numbers of coalesced frames.

Coalesced frames    1       2       5       10      20      50      100
Shared              168.21  164.68  158.78  155.48  159.32  158.87  158.08
Global              26.44   25.95   22.89   22.62   21.08   19.97   19.42

Figure 6: Total latency. Average latency for a frame from starting to finished being processed, in 100 ns, over increasing numbers of coalesced frames.

Coalesced frames    1       2       5       10       20       50       100
Shared              215.32  408.19  961.44  1885.24  3792.36  9412.14  18672.78
Global              76.10   138.32  305.41  575.82   1108.17  2672.83  5268.63
Const               76.17   138.10  305.52  576.74   1111.16  2676.07  5271.11


Discussion

Interpretation of Results

5.1.1. Different Memories

The results of the absorption kernel, see Figure 2, were not far from expected. Looking at Figure 2, we can see that the Global Memory kernel performs better than the Shared Memory kernel. This is most likely due to the extra copy that is needed from global to shared memory in the Shared Memory kernel. In theory, this copy overhead should be amortized as the number of coalesced frames per kernel increases, eventually resulting in better performance for the Shared Memory kernel. But that is only the case if we exclude the different cache layers of the global memory. Most likely the Global Memory kernel automatically caches the values used repeatedly, thereby negating the possible performance advantages that copying to shared memory could have given.

The centering kernel, see Figure 3, has results similar to the absorption kernel. The performance difference between the Shared Memory kernel and the Global Memory kernel is most likely due to the extra copy to shared memory. Also, the benefit of the shared memory's fast access is most likely negated by the caching of global memory. As for the Constant Memory kernel, it is, just like the Global Memory kernel, probably fast due to the aggressive caching. The reason it is slower in most cases is probably that not all cores request the same data. They request different elements of the vector, and even if the constant memory serves cache lines, these are most likely not the entire size of the centering vector. Therefore, the constant memory has to queue several requests, since it is only able to serve one request at a time. Global memory, in contrast, can serve requests for different elements at the same time, as long as the memory interface is not saturated.

The SNV kernel, see Figure 4, is the only kernel that overall performs better using specialized memory. This is probably because it is the only kernel with a memory access pattern that benefits from shared memory. The difference between this kernel and the earlier ones is that the specialized memory in the earlier kernels is only read from. The shared memory in this kernel is used to store the intermediate results of different calculations, like calculating the mean and variance. Therefore, it most likely cannot be as aggressively cached by the Global Memory kernel, at least not in the L1 cache, which is not coherent. As a result, the Shared Memory kernel is faster, since shared memory is faster than the L2 cache and global memory.

Lastly, there is the Prediction kernel, see Figure 5. The difference between the Shared Memory kernel and the Global Memory kernel is large. The most likely reason for this is that the Shared Memory kernel requires an extra synchronization and copy. The Shared Memory kernel has to synchronize all threads once the prediction is finished and then copy that prediction back into global memory before going to the next frame. The Global Memory kernel, on the other hand, can skip this extra synchronization and copy since it is already operating on global memory.

Many of the benefits available with specialized memory are in these cases negated by the hardware-controlled caching in global memory. There is also the fact that the algorithm is likely not memory-intensive enough to saturate the memory interface, which also negates some of the possible performance benefits of specialized memory.


5.1.2. Number of Coalesced Frames

Looking at the performance difference over the increasing number of coalesced frames, we can see the performance improvements the caching gives, at least for absorption and centering. These two show a clear decrease in execution time per frame when increasing the number of coalesced frames, though both even out towards a high number of coalesced frames, since the overhead of fetching data from global memory and moving it into cached memory is a one-time cost. Therefore, the execution time we are approaching is most likely closer to the actual execution time of the computation, not bound by memory fetches. This should also decrease the discrepancies between shared memory and global memory, as we can clearly see towards the end. Lastly, it should be noted that some of the performance benefits here could also be attributed to the decreased number of launched kernels, in turn decreasing the overhead from launching kernels.

As for increasing the number of coalesced frames in the SNV and Prediction kernels, we do not seem to gain any significant performance. This is most likely because both kernels do not re-use data as efficiently as the previous kernels. The SNV kernel only keeps the mean and variance calculations in shared memory, and both these values are individual to each frame, resulting in no re-use of them. The reason we can see a slight performance increase in the Global Memory kernel is most likely that the arrays used for the mean and variance calculations are moved into the L2 cache. As for the Prediction kernel: this kernel does re-use data, but not the exact same elements. Instead, it uses a set of elements, every feature's wavelength prediction value. This probably means that the cached values from the first feature's wavelength predictions are evicted from the cache by the time the kernel returns to them in the next frame. Also, as for the previous kernels, the decreased number of launched kernels should decrease the overhead from launching kernels.

As can also be seen in the results, Figure 6, the latency cost of increasing the number of coalesced frames is high, especially for the SNV and Prediction kernels, since these did not scale very well in execution time per frame. These two are also especially important to keep in mind when comparing the performance gain from increasing coalesced frames to the latency increase, as they have much longer execution times. This results in small performance gains from coalescing frames at a high latency cost for the overall algorithm.

Conclusion

As stated before, there is a large latency increase for only small performance gains in the execution time per frame. Therefore, unless performance is of the highest importance and latency is of no concern, the recommended choice would be to coalesce at most 2 frames. Coalescing frames also introduces extra complexity in the program, so keeping to no coalescing at all might be a good idea.

There is also the question of which memory type to use. In these cases, we show that there are only small improvements possible with shared memory, and only in the SNV kernel. Otherwise, the results point to global memory being the best solution. This argument alone is enough to recommend global memory, but there is also the fact that the specialized memories are much more storage-bound. Global memory is much larger than the specialized memories and has kept expanding with every generation of graphics cards. Therefore, if the algorithms were to run on higher-specification cameras in the future, requiring larger images, you might run out of specialized memory, while global memory is much less likely to run out due to its larger size.

Further research into different input image specifications and different implementations is of interest, as well as research into other GPU architectures, as this report only investigates the Pascal architecture, model GTX 1080. Other interesting aspects to investigate could be the implications of different data transfer methods, like direct memory access or streams, as these could impact the latency of the overall program [2].


Acknowledgement

I wish to express my sincere thanks to my supervisor, Eddie Wadbro, for giving constructive critique regarding both the content and the spelling, as well as for his guidance through each stage of the process.


References

[1] "Wikipedia," [Online]. Available:

https://en.wikipedia.org/w/index.php?title=Hyperspectral_imaging&oldid=954036402.

[Accessed 5 Juni 2020].

[2] T. Soyata, GPU Parallel Program Development using CUDA, CRC Press, 2018.

[3] E. K. Jason Sanders, CUDA by Example: an introduction to general-purpose GPU programming, Pearson Education, Inc., 2010.

[4] N. Wilt, The CUDA Handbook: A Comprehensive Guide to GPU Programming, Pearson Education, Inc., 2013.

[5] Corporation, NVIDIA, "Nvidia," 1 November 2018. [Online]. Available:

https://docs.nvidia.com/deeplearning/sdk/introduction/index.html. [Accessed 5 Juni 2020].

[6] NVIDIA Corporation, "Nvidia," 28 November 2019. [Online]. Available:

https://docs.nvidia.com/cuda/cublas/index.html. [Accessed 5 Juni 2020].

[7] NVIDIA Corporation, "Pascal Architecture Whitepaper v1.2," 2017.

[8] NVIDIA Corporation, "Nvidia," [Online]. Available: https://developer.nvidia.com/nsight- systems. [Accessed 5 Juni 2020].
