GPU Accelerated Light Field Compression

Department of Science and Technology
Linköping University
SE-601 74 Norrköping, Sweden

LiU-ITN-TEK-A--18/032--SE

GPU Accelerated Light Field Compression

Gabriel Baravdish

2018-06-18

LiU-ITN-TEK-A--18/032--SE

GPU Accelerated Light Field Compression

Degree project carried out in Media Technology at the Institute of Technology, Linköping University

Gabriel Baravdish

Supervisor: Patric Ljung
Examiner: Jonas Unger

Norrköping, 2018-06-18


Copyright

The publishers will keep this document online on the Internet - or its possible

replacement - for a considerable time from the date of publication barring

exceptional circumstances.

The online availability of the document implies a permanent permission for

anyone to read, to download, to print out single copies for your own use and to

use it unchanged for any non-commercial research and educational purpose.

Subsequent transfers of copyright cannot revoke this permission. All other uses

of the document are conditional on the consent of the copyright owner. The

publisher has taken technical and administrative measures to assure authenticity,

security and accessibility.

According to intellectual property law the author has the right to be

mentioned when his/her work is accessed as described above and to be protected

against infringement.

For additional information about the Linköping University Electronic Press

and its procedures for publication and for assurance of document integrity,

please refer to its WWW home page:

http://www.ep.liu.se/

Linköpings universitet, SE-581 83 Linköping

Linköping University | Department of Science and Technology
Master's thesis, 30 ECTS | Media Technology and Engineering
2018 | LIU-ITN/LITH-EX-A--18/000--SE

GPU Accelerated Light Field Compression
GPU-Accelererad Kompression av Ljusfält

Gabriel Baravdish

Supervisor: Ehsan Miandji
Examiner: Jonas Unger

GPU Accelerated Light Field Compression

Degree project carried out in Media Technology at the Institute of Technology, Linköping University

by

Gabriel Baravdish
LiTH-ITN-EX--YY/NNNN--SE

Supervisor: Ehsan Miandji, ITN, Linköping University
Examiner: Jonas Unger, ITN, Linköping University

Norrköping, 1 June 2018

Abstract

This thesis presents a GPU-accelerated method for compressing light fields and light field videos. The implementation is based on earlier work on a complete light field compression framework. The large amount of data produced when capturing light fields makes compression challenging, and we seek to accelerate the encoding part. We compress by projecting each data point onto a set of dictionaries and searching for the sparsest representation with the least error. A greedy algorithm optimized for computation on the GPU is presented. We exploit the algorithm outline by encoding the data in parallel chunks, which increases computation speed while maintaining quality. The results show a significantly faster encoding time compared to previous results in the same research field. We conclude that further improvements can increase the speed, and that interactive compression speed is therefore not far away.


Acknowledgments

I would like to thank my supervisor Ehsan Miandji for being inspiring and for many helpful discussions that often led to new ideas.

I also want to thank my examiner Docent Jonas Unger for giving me the opportunity to do this thesis and for his encouragement and support during the work. I would also like to thank Saghi Hajisharif for many fruitful discussions.

Big thanks to the Computer Graphics and Image Processing group at MIT for inviting me to a great working environment and making me feel welcome. Last but not least, a lot of love and gratitude goes to my family for both constant support and encouragement.


Contents

Notation
1 Introduction
2 Background theory
  2.1 Tensor
    2.1.1 Tensor-matrix product
    2.1.2 Unfolding and folding a tensor
    2.1.3 HOSVD
  2.2 Light field
    2.2.1 The plenoptic function
    2.2.2 Light field representation
  2.3 The light field compression framework
    2.3.1 Light field data set
    2.3.2 Compression
    2.3.3 The multidimensional dictionary ensemble - MDE
    2.3.4 Encoding
    2.3.5 Decoding
  2.4 CUDA
    2.4.1 Architecture
    2.4.2 Main architecture components
3 Implementation
  3.1 Encoding - Algorithm outline
    3.1.1 Data points computed in chunks in parallel
    3.1.2 Memory - transaction and structure
  3.2 The n-mode product on GPU
    3.2.1 Implicit folding
  3.3 Sorting
  3.4 Optimization of the sparse n-mode product
4 Result
  4.1 Synthetic data set
  4.2 Randomly generated data set
5 Discussion
  5.1 Implementation
    5.1.1 Design
    5.1.2 N-mode product
    5.1.3 Sorting
  5.2 Result
6 Conclusion
  6.1 Future Work
Bibliography

Notation

Notation                                Meaning
N                                       Set of natural numbers
R                                       Set of real numbers
C                                       Set of complex numbers
α, a                                    Scalar
v                                       Vector
U                                       Matrix
X                                       Tensor
{X^(i)}_{i=1}^{N}                       A collection of N tensors
U^(n)                                   The n-th matrix in a sequence
{U^(1,k), ..., U^(n,k)}_{k=1}^{K}       A collection of K sequences, each sequence containing n matrices
(i_1, i_2, i_3)                         Element at position (i_1, i_2, i_3)
(:, i)                                  All rows of the i-th column

1 Introduction

Light field imaging has been an ongoing trend in many research areas during the last decade. New techniques to capture, sample, and display light fields have been developed and may become even more common in the near future. Light fields can be captured in different ways. For a long time, the most common capture methods have been a camera mounted on a moving gantry or an array of cameras [Wilburn et al., 2005]. A light field captures the radiance of a set of rays from a scene in various directions. This enables post-processing of a captured scene with fascinating features. The ever-increasing computational power of hardware has made it possible for researchers to develop new techniques in the field of light field imaging. These techniques are used in applications such as refocusing [Ng, 2005], large-scale 3D reconstruction [Levoy, 2001], 3D display systems [Jones et al., 2016] and depth estimation [Williem and Park, 2016], to name a few. Some of the techniques that have already been introduced are now being optimized and refined, and some of them have the potential to reach the consumer market. One or more of them may come to interact with your daily life. Imagine a future where you can capture a light field with your phone and refocus the photo of a car you just took, or change the viewpoint to look past objects that are in the way. The possibilities are endless.

The main challenges for most of these applications are the high quality requirements and the large amount of data storage required. Therefore, light field compression is vital for both the use and the expansion of light field imaging.

In parallel to the development of light field imaging, another research field has emerged, one with great progress and strong theoretical results in signal theory: Compressed Sensing (CS) [Donoho, 2006]. The underlying theory states that if a signal has a sparse representation under a basis, it can be fully recovered under certain conditions [Candès and Tao, 2006]. The theory of compressed sensing is the basis for the work of [Miandji et al.], which is a framework for compression of light fields and light field videos.

The purpose of this thesis is to accelerate the encoding scheme presented in [Miandji et al.] and optimize the routine such that the computations suit the Graphics Processing Unit (GPU). We aim to implement an encoding routine that is both expandable to other applications and scalable for multidimensional data. Furthermore, we seek a flexible implementation for different types of data with different dimensionality.

2 Background theory

This chapter covers the necessary background theory used in this work. The reader will become familiar with tensor theory, light field representation, the compression framework and background theory about the hardware.

2.1 Tensor

A tensor is commonly interpreted as a multidimensional matrix, but it is defined as any multidimensional array of order n (or n-mode), meaning that the number of dimensions of the multidimensional array is n. To summarize: a first-order tensor is a vector v defined in R^{I_1}, and a second-order tensor is a matrix U defined in R^{I_1 × I_2}. If the order is three or higher, the tensor is called a higher-order tensor. More about tensors and their properties can be found in [Kolda and Bader, 2009].

A tensor of order N is defined as:

\[
\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \dots \times I_N} \tag{2.1}
\]

An illustration of a third-order tensor can be seen in Figure 2.1.

Fibers and slices can be extracted from tensors. Fibers, where every index but one is fixed, can be seen in Figure 2.2, and slices, where all but two indices are fixed, can be seen in Figure 2.3.

The Frobenius norm of the tensor in Equation 2.1 is the square root of the sum of the squares of all its elements,

\[
\|\mathcal{X}\|_F = \sqrt{\sum_{i_1=1}^{I_1} \sum_{i_2=1}^{I_2} \cdots \sum_{i_N=1}^{I_N} x_{i_1 i_2 \dots i_N}^2} \tag{2.2}
\]

which is analogous to the Frobenius norm of a matrix, denoted ‖X‖_F.

Figure 2.1: A tensor of third order, X ∈ R^{I_1 × I_2 × I_3}.

Figure 2.2: Extracting the fibers of a third-order tensor in mode-1 (upper left) as X_{:,i_2,i_3}, mode-2 (upper right) as X_{i_1,:,i_3} and mode-3 (bottom) as X_{i_1,i_2,:}. This can be seen as a cut along the first, second and third dimension.

Figure 2.3: Extracting the slices of a third-order tensor in horizontal slices (upper left) as X_{i_1,:,:}, in lateral slices (upper right) as X_{:,i_2,:} and in frontal slices (bottom) as X_{:,:,i_3}.

The inner product of two same-sized tensors X, Y ∈ R^{I_1 × I_2 × ... × I_N} is the sum of the products of their entries. It is defined as

\[
\langle \mathcal{X}, \mathcal{Y} \rangle = \sum_{i_1=1}^{I_1} \sum_{i_2=1}^{I_2} \cdots \sum_{i_N=1}^{I_N} x_{i_1 i_2 \dots i_N}\, y_{i_1 i_2 \dots i_N} \tag{2.3}
\]

It follows from Equation 2.2 that ⟨X, X⟩ = ‖X‖²_F.

2.1.1 Tensor-matrix product

There are many interesting operations and properties when working with tensors; some of them were useful for this work and will be explained further in this chapter. The first one is the tensor-matrix product, also called the n-mode product. This operation is denoted ×_n. The matrix-by-matrix multiplication is well known: it is the inner product of the rows in the first matrix with the columns in the second matrix. With tensors it is not clear along which dimension the inner product operates, because a tensor can have an arbitrary number of dimensions. The index n in the ×_n symbol indicates a specific mode; it tells us along which dimension the inner product occurs. If we continue from Equation 2.1 and multiply by a matrix U ∈ R^{J × I_n}, we get

\[
\mathcal{Y} = \mathcal{X} \times_n U \tag{2.4}
\]

where Y is a tensor with dimensions I_1 × I_2 × ... × I_{n-1} × J × I_{n+1} × ... × I_N, and the entries are given by

\[
\mathcal{Y}_{i_1,\dots,i_{n-1},j,i_{n+1},\dots,i_N} = (\mathcal{X} \times_n U)_{i_1,\dots,i_{n-1},j,i_{n+1},\dots,i_N} = \sum_{i_n=1}^{I_n} x_{i_1 i_2 \dots i_N}\, u_{j,i_n} \tag{2.5}
\]
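To make the operation concrete, below is a minimal host-side CUDA C++ sketch of Equation 2.5 for a third-order tensor and n = 1, assuming column-major storage of the flat arrays; the function and variable names are illustrative and not taken from the thesis implementation.

```cuda
// Minimal sketch of Equation 2.5 for a third-order tensor and n = 1.
// X is I1 x I2 x I3, U is J x I1; both stored column-major in flat arrays.
#include <vector>
#include <cstddef>

std::vector<float> mode1Product(const std::vector<float>& X,
                                const std::vector<float>& U,
                                std::size_t I1, std::size_t I2, std::size_t I3,
                                std::size_t J)
{
    std::vector<float> Y(J * I2 * I3, 0.0f);
    for (std::size_t i3 = 0; i3 < I3; ++i3)
        for (std::size_t i2 = 0; i2 < I2; ++i2)
            for (std::size_t j = 0; j < J; ++j) {
                float sum = 0.0f;
                // Inner product of the mode-1 fiber X(:, i2, i3) with row j of U.
                for (std::size_t i1 = 0; i1 < I1; ++i1)
                    sum += X[i1 + I1 * (i2 + I2 * i3)] * U[j + J * i1];
                Y[j + J * (i2 + I2 * i3)] = sum;   // column-major index of (j, i2, i3)
            }
    return Y;
}
```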

2.1.2 Unfolding and folding a tensor

Unfolding is the process of transforming a tensor into a matrix, which can in turn be unfolded into a vector. In general this process transforms a tensor of order n into a tensor of order k, where k < n. Note that unfolding a matrix (a tensor of second order) into a vector (a tensor of first order) is the same concept as unfolding, say, a fifth-order tensor into a second-order tensor, i.e. a matrix. The unfolding of X ∈ R^{I_1 × I_2 × ... × I_N} along the n-mode is denoted by the matrix X_(n) ∈ R^{I_n × I_{n+1} I_{n+2} ... I_N I_1 I_2 ... I_{n-1}}. This transformation arranges the n-mode fibers to be the columns of the resulting matrix, or the single column if unfolded into a vector. Folding is the process of transforming a lower-order tensor into a higher-order tensor; in other words, it is the reverse of the unfold operation. In this work the process of folding a first-order tensor into a higher-order tensor is of special interest, because the tensor is initially stored as a first-order tensor in memory. The inverse of the unfolding operation along the n-mode is

\[
\mathrm{fold}_n(\mathrm{unfold}_n(\mathcal{X})) = \mathrm{fold}_n(X_{(n)}) = \mathcal{X} \tag{2.6}
\]

With Equation 2.6 we can now rewrite the n-mode tensor product as

\[
\mathcal{X} \times_n U = \mathrm{fold}_n(U\,\mathrm{unfold}_n(\mathcal{X})) = \mathrm{fold}_n(U X_{(n)}) \tag{2.7}
\]

Figure 2.4: An illustration of 1-mode unfolding.

Below is an example of how the unfolding process is done. Let the tensor in Figure 2.1 be given by its two frontal slices

\[
X_{:,:,1} = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{bmatrix}, \qquad
X_{:,:,2} = \begin{bmatrix} 10 & 11 & 12 \\ 13 & 14 & 15 \\ 16 & 17 & 18 \end{bmatrix}
\]

By unfolding along the three different modes, we get the three following matrices:

\[
X_{(1)} = \begin{bmatrix} 1 & 2 & 3 & 10 & 11 & 12 \\ 4 & 5 & 6 & 13 & 14 & 15 \\ 7 & 8 & 9 & 16 & 17 & 18 \end{bmatrix}, \quad
X_{(2)} = \begin{bmatrix} 1 & 4 & 7 & 10 & 13 & 16 \\ 2 & 5 & 8 & 11 & 14 & 17 \\ 3 & 6 & 9 & 12 & 15 & 18 \end{bmatrix}, \quad
X_{(3)} = \begin{bmatrix} 1 & 4 & 7 & 2 & 5 & 8 & 3 & 6 & 9 \\ 10 & 13 & 16 & 11 & 14 & 17 & 12 & 15 & 18 \end{bmatrix}
\]

Lastly, the folding is illustrated in Figure 2.5. This can be seen as a real scenario, where a vector stored in memory is folded into a tensor and unfolded back into a vector again. In order to traverse an unfolded tensor, we can use the notation that maps a tensor element at position (i_1, i_2, ..., i_N) to a matrix element (i_n, j):

\[
j = 1 + \sum_{\substack{k=1 \\ k \neq n}}^{N} (i_k - 1) J_k \quad \text{with} \quad J_k = \prod_{\substack{m=1 \\ m \neq n}}^{k-1} I_m \tag{2.8}
\]

We can use the same concept to map a tensor element to a vector element. This is easily derived from Equation 2.8 by omitting the constraint that excludes the n-th dimension. The notation for this is

\[
j = 1 + \sum_{k=1}^{N} (i_k - 1) J_{k-1} \quad \text{with} \quad J_k = \prod_{m=1}^{k} I_m \ \text{and} \ J_0 = 1 \tag{2.9}
\]

This is a process that maps subscripts to a linear index j. In practice this is often referred to as subscripts-to-index.
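As a concrete illustration of Equation 2.9, the following CUDA C++ helper maps (0-based) column-major subscripts to a linear index. It is a sketch written for this report under the stated storage assumption, not code from the thesis implementation.

```cuda
// Map N-dimensional subscripts (0-based, unlike the 1-based math above) to a
// linear index in a column-major flat array, following Equation 2.9.
__host__ __device__
inline size_t subToIndex(const size_t* subs, const size_t* dims, int N)
{
    size_t index  = 0;
    size_t stride = 1;              // J_0 = 1
    for (int k = 0; k < N; ++k) {
        index  += subs[k] * stride; // (i_k - 1) * J_{k-1} in 1-based notation
        stride *= dims[k];          // J_k = J_{k-1} * I_k
    }
    return index;
}
```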

Figure 2.5: An illustration of the folding process. Here a vector, as it could be stored in memory, is folded into a tensor. The operation is reversible.

2.1.3 HOSVD

One of the most common rank-revealing matrix decompositions is the singular value decomposition (SVD). However, this decomposition can be insufficient for higher orders. A generalization of this decomposition is called HOSVD, the higher-order singular value decomposition. It is presented in [Lathauwer et al., 2000] and has its origin in the research field of Multilinear Principal Component Analysis (M-PCA). The idea is to decompose a tensor into a core tensor multiplied by a matrix along each mode. Any n-order tensor X ∈ R^{I_1 × I_2 × ... × I_N} can be written as the product

\[
\mathcal{X} = \mathcal{S} \times_1 U^{(1)} \times_2 U^{(2)} \cdots \times_N U^{(N)} \tag{2.10}
\]

where S is the core tensor, with the same dimensions as X, and U^(1), ..., U^(N) are orthogonal matrices. S can be computed as

\[
\mathcal{S} = \mathcal{X} \times_1 (U^{(1)})^T \times_2 (U^{(2)})^T \cdots \times_N (U^{(N)})^T \tag{2.11}
\]

A visualization of the singular value decomposition can be seen in Figure 2.6 and Figure 2.7 below.

Figure 2.6: An illustration of the SVD of a matrix.

Figure 2.7: An illustration of the HOSVD of a tensor.

2.2 Light field

2.2.1 The plenoptic function

A simplified way to visualize the light filling a region of space is to capture the light rays through a pinhole camera, project the rays onto an image plane and then render the result as a 2D image. The image is discretized and defined as a 2D matrix where each element, or pixel at coordinate (x, y) on the image plane, represents the intensity of the projected light rays. This representation has to be extended in order to describe real human vision. The work of [Landy and Movshon, 1991] explains the 7D plenoptic function, P(V_x, V_y, V_z, θ, φ, λ, t), which describes a model for capturing light as data from a real-world environment using a camera. The parameters can be explained as follows. To begin with, a pinhole camera captures a black and white photograph. The taken image represents the light rays captured at a specific time, viewed from a single viewpoint, with a specific wavelength in the visible spectrum. The result is a collection of intensities P seen through the light from a region of space. These intensities are distributed over the captured picture and can be parameterized either by the familiar Cartesian coordinates P(x, y) or by the spherical coordinates P(θ, φ). The next step is to add color. A color photograph represents intensities with varying wavelengths λ, so a color image is represented by P(θ, φ, λ). A simple extension of a colored image is a collection of colored images that, in a sequence with time parameter t, can be seen as a video. The video is represented by P(θ, φ, λ, t). To extend this even further, one can watch this video and observe the incoming light not only from the seat where the observer sits, but from every viewing position (V_x, V_y, V_z). The final representation is therefore P(θ, φ, λ, t, V_x, V_y, V_z), or as it is more commonly written, P(V_x, V_y, V_z, θ, φ, λ, t).

A human observer can use five of the seven parameters of the plenoptic function. The x and y coordinates are captured on the eye's retina. The λ is captured by the three types of cones inside the eye. The time t is processed by temporal filters. V_x is sampled by the observer's two eyes. To get a sample of V_y, the observer would need a third eye placed vertically above or below the eyes. Finally, V_z could be captured by having another eye placed forward or backward. The observer can simulate this over time by lifting the head or moving forward, but not from a stationary position.

2.2.2 Light field representation

The light field is a limited version of the plenoptic function. The restrictions that define the light field function are the following. The first restriction limits the time variable to a fixed time; this results in a capture of a static object or static scene. The second restriction prevents the radiance of each ray in empty space from changing, which makes it constant. Lastly, only radiance leaving the convex hull of an object is included.

With these restrictions, the plenoptic function is reduced to a 4D function. The 4D set of rays can be parameterized in different ways. As [Gortler et al., 1996] introduce, one way is to use the two-plane parameterization, shown in Figure 2.8. However, this parameterization cannot represent all rays; for example, rays that are parallel to the planes cannot be sampled. The two-plane light field can be seen as a collection of images of the s-t plane, where each image is taken from an observer position on the u-v plane. The light field can therefore be interpreted as a 2D collection of 2D images, where each image is taken from a different observer position.

Figure 2.8: The relationship between an image and the parameterization of a light field.

2.3 The light field compression framework

2.3.1 Light field data set

The data used in this work is represented by the light field function l(x_i, y_j, u_α, v_β, λ_γ), where (x_i, y_j) are the spatial locations, (u_α, v_β) the angular locations and λ_γ a spectral parameter. The light field is partitioned into smaller data points. We set each data point to be of size m_1 × m_2 × m_3 × m_4 × m_5, i.e. it has 5 dimensions in total. To extend this further, a light field video is represented by l(x_i, y_j, u_α, v_β, λ_γ, t), where t is the time parameter; such a data point has 6 dimensions.

2.3.2 Compression

[Miandji et al.] have presented a complete compression framework for light fields and light field videos. The framework is the basis for this thesis. The compression consists of two parts: a training phase and a testing phase. Since the dictionary ensemble is created by a learning process, there are two different sets of data. The first one is smaller and is called the training set. It is used to construct the dictionary ensemble. A dictionary is a set of atoms, and each atom is a basis function that represents the data points. The other data set is called the testing set; it is an unknown data set and the one that will be compressed. The training set is represented as {L^(i)}_{i=1}^{N_l}, where L^(i) ∈ R^{m_1 × m_2 × ... × m_n}, i is the current data point and N_l is the total number of data points in the learning set. The testing set {T^(i)}_{i=1}^{N_t} uses the same notation, where N_t is the total number of data points. The main advantage of this model is that the training phase only needs to be executed once. The exception is when the data set is changed to have other dimensions, in which case it has to be performed again.

In the training phase, there are K orthonormal dictionaries of order n, where K ≪ N_l. The dictionaries are denoted as {U^(1,k), ..., U^(n,k)}_{k=1}^{K}. One of the dictionaries in this set represents a data point L^(i). It is derived from

\[
\mathcal{L}^{(i)} = \mathcal{S}^{(i)} \times_1 U^{(1,k)} \cdots \times_n U^{(n,k)} \tag{2.12}
\]

Here, S^(i) ∈ R^{m_1 × m_2 × ... × m_n} is a sparse coefficient tensor. The goal is to have each dictionary in the ensemble represent a collection of data points as a sparse linear combination. The dimensionality of the dictionaries matches the small data point size, leading to efficient computations.

2.3.3 The multidimensional dictionary ensemble - MDE

We aim to construct the dictionaries such that they give a sparse representation. [Miandji et al.] formulate this problem as

\[
\begin{aligned}
\min_{U^{(j,k)},\, \mathcal{S}^{(i,k)},\, M_{i,k}} \quad & \sum_{i=1}^{N_l} \sum_{k=1}^{K} M_{i,k} \left\| \mathcal{L}^{(i)} - \mathcal{S}^{(i,k)} \times_1 U^{(1,k)} \cdots \times_n U^{(n,k)} \right\|_F^2 \\
\text{subject to} \quad & (U^{(j,k)})^T U^{(j,k)} = I, \quad \forall k = 1,\dots,K, \ \forall j = 1,\dots,n, \\
& \left\| \mathcal{S}^{(i,k)} \right\|_0 \le \tau_l, \\
& \sum_{k=1}^{K} M_{i,k} = 1, \quad \forall i = 1,\dots,N_l
\end{aligned} \tag{2.13}
\]

Solving Equation 2.13 results in an ensemble of orthonormal dictionaries, where S^(i,k) contains the coefficients of data point i after it has been projected onto the k-th dictionary. A dictionary consists of a number of matrices, and each of these matrices operates along a particular mode. There are in total K dictionaries in the ensemble. The user-defined constant τ_l is a sparsity parameter. M ∈ R^{N_l × K} is the membership matrix; it is binary and associates each data point L^(i) with its corresponding dictionary. M can be seen as a matrix that performs clustering: it groups data points with the dictionary in which they are best represented.

To solve Equation 2.13, the matrices {U^(1,k), ..., U^(n,k)}_{k=1}^{K} are initialized with random orthonormal matrices. This is done by computing the higher-order singular value decomposition, HOSVD, of a tensor L^(i) with uniform random elements. The elements of M are initialized to 1/K. After this initialization, an iterative scheme is computed to obtain the dictionary ensemble {U^(1,k), ..., U^(n,k)}_{k=1}^{K}. The complete iterative scheme can be seen in [Miandji et al.] and is denoted shortly as

\[
\{U^{(1,k)}, \dots, U^{(n,k)}\}_{k=1}^{K} = \mathrm{Train}(\{\mathcal{L}^{(i)}\}_{i=1}^{N_l},\ K,\ \tau_l) \tag{2.14}
\]

We also observe that the computed {S^(i)}_{i=1}^{N_l} in Equation 2.13 act on the training set, and not on the unknown data set that will later be compressed. These have to be computed separately, see Section 2.3.4.

2.3.4 Encoding

The next task in the compression framework is to perform the encoding routine. This is done by computing the coefficients that give the sparsest representation. One has to remember that these coefficients are represented in one of the dictionaries. Thus, in order to decode the data, we need both the coefficients and the corresponding dictionary onto which these coefficients have been projected. The goal is to find which dictionary represents a data point as sparsely as possible, with the least error. This routine is expressed as

\[
[\{\mathcal{S}^{(i)}\}_{i=1}^{N_t},\ \mathbf{m},\ \mathbf{z}] = \mathrm{Test}(\{\mathcal{T}^{(i)}\}_{i=1}^{N_t},\ \{U^{(1,k)}, \dots, U^{(n,k)}\}_{k=1}^{K},\ \epsilon,\ \tau_t) \tag{2.15}
\]

Here {S^(i)}_{i=1}^{N_t} is a collection of sparse coefficients and {T^(i)}_{i=1}^{N_t} is the testing set that will be compressed. m ∈ R^{N_t} is the membership index vector; it tells us which dictionary has the best representation for a data point T^(i). The membership index for one data point is denoted m_i ∈ [1, ..., K]. The vector z ∈ R^{N_t} stores the number of nonzero coefficients of each S^(i), ε is a user-defined threshold value for the error and τ_t is a user-defined threshold for the sparsity. Note that τ_t sets the upper limit for the maximum number of coefficients that will be stored in {S^(i)}_{i=1}^{N_t}; this simply means that the number of nonzero elements of a tensor S^(i) will not exceed τ_t. The main contributor to the sparsity, meaning keeping the number of coefficients in {S^(i)}_{i=1}^{N_t} as low as possible, is the error parameter ε. In this way, the user can control the quality and at the same time adjust for a sparser representation. As mentioned earlier, {U^(1,k), ..., U^(n,k)}_{k=1}^{K} is the ensemble of dictionaries.

[Miandji et al.] use a greedy algorithm to perform the encoding routine of Equation 2.15. This algorithm can be seen in Algorithm 1.

To summarize the encoding process: the algorithm takes the data points and an ensemble of dictionaries as input parameters, along with the threshold parameters, and compresses the data set by initially projecting each data point onto every dictionary. Then, iteratively, it finds the sparsest and most accurate representation over the dictionaries. The output parameters are the nonzero coefficients {S^(i)}_{i=1}^{N_t}, the sparsity vector z, which indicates the number of nonzero coefficients, and the membership vector m, whose indices point to the dictionary with the best sparse representation.

2.3.5 Decoding

A data point can be reconstructed using Equation 2.12, the computed coefficients in S^(i) and the membership indices in m. It is denoted as

\[
\mathcal{T}^{(i)} = \mathcal{S}^{(i)} \times_1 U^{(1,m_i)} \cdots \times_n U^{(n,m_i)} \tag{2.16}
\]

Every element at location (x_1, x_2, ..., x_n) of a data point T^(i) ∈ R^{m_1 × m_2 × ... × m_n} is efficiently computed by a single sum:

\[
\mathcal{T}^{(i)}_{x_1, x_2, \dots, x_n} = \sum_{z=1}^{z_i} \mathcal{S}^{(i)}_{l_1^z, l_2^z, \dots, l_n^z}\; U^{(1,m_i)}_{x_1, l_1^z}\, U^{(2,m_i)}_{x_2, l_2^z} \cdots U^{(n,m_i)}_{x_n, l_n^z} \tag{2.17}
\]

The location of the z-th nonzero element in S^(i) is given by the index tuple (l_1^z, l_2^z, ..., l_n^z).
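As an illustration of Equation 2.17, the sketch below reconstructs a single element of a data point from its sparse coefficients. It is a hypothetical host-side CUDA C++ example with assumed (column-major, 0-based) data layouts, not the framework's actual decoder.

```cuda
// Sketch of Equation 2.17: reconstruct one element of a data point.
// nnz        : number of nonzero coefficients z_i
// coeff[z]   : value of the z-th nonzero coefficient of S^(i)
// loc[z*n+j] : 0-based subscript l_j^z of that coefficient along mode j
// U[j]       : dictionary matrix U^(j, m_i), size m_j x m_j, column-major
// x[j]       : 0-based subscript of the element to reconstruct along mode j
float decodeElement(int n, int nnz,
                    const float* coeff, const int* loc,
                    const float* const* U, const int* m, const int* x)
{
    float value = 0.0f;
    for (int z = 0; z < nnz; ++z) {
        float term = coeff[z];
        for (int j = 0; j < n; ++j) {
            // U^(j, m_i)(x_j, l_j^z), stored column-major as U[j][row + m_j * col]
            term *= U[j][x[j] + m[j] * loc[z * n + j]];
        }
        value += term;
    }
    return value;
}
```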

Algorithm 1 Compute the coefficients and the membership index. The test routine on the CPU implements [S^(i), a] = Test(T^(i), {U^(1,k), ..., U^(n,k)}_{k=1}^{K}, τ_t, ε)

 1: e ∈ R^K ← ∞ and z ∈ R^K ← 1
 2: for k = 1 ... K do                                        ▷ For every dictionary
 3:     X^(i) ← T^(i) ×_1 (U^(1,k))^T ··· ×_n (U^(n,k))^T
 4:     while z_k ≤ τ_t and e_k > ε do
 5:         Y ← nullify the (∏_{j=1}^{n} m_j) − z_k smallest elements of X^(i)
 6:         e_k ← ‖T^(i) − Y ×_1 U^(1,k) ··· ×_n U^(n,k)‖²_F
 7:         z_k ← z_k + 1
 8:     end while
 9:     X^(k) ← Y                                             ▷ Store the nonzero coefficients
10: end for
11: a ← index of min(z)
12: if z_a = τ_t then
13:     a ← index of min(e)
14: end if
15: S^(i) ← X^(a)

2.4 CUDA

The CPU has been the main platform for scientific computations for a long time. With the increased demand for raw computational power and higher performance in video games, the development of the graphics processing unit (GPU) has rapidly reached new levels, to the degree that it has become the preferred platform for scientific computations in some areas. The idea is to take advantage of the highly parallel architecture offered by the GPU. Such an architecture was developed by Nvidia and launched in 2006: CUDA, which stands for Compute Unified Device Architecture. CUDA is a general-purpose parallel computing architecture that allows implementations of general applications, not just graphics applications. With this architecture, General Purpose computing on the GPU (GPGPU) has become increasingly popular for solving scientific computations that are too heavy for the CPU.

The technical differences between the GPU and the CPU limit direct code conversion. An application implemented for the CPU may not suit the GPU and may therefore run slower on the GPU. The hardware differences force an awareness of naive implementations. On the other hand, this awareness can lead to implementations that benefit a thousandfold from the parallel architecture. The underlying concept of the GPU is to run many parallel cores slowly rather than one core very fast.

2.4.1 Architecture

To fully utilize CUDA it is necessary to have knowledge of its architecture. Every CUDA-enabled GPU has similar architecture components. The two main sides are the host, which represents the CPU, and the device, which represents the GPU. Figure 2.9 shows how these components interact with each other.

Figure 2.9: The memory architecture in CUDA.

2.4.2 Main architecture components

One of the most important components of the CUDA architecture is the streaming multiprocessor (SM). There are several streaming multiprocessors on the GPU, and in each of them resides a number of cores. These cores are called Stream Processors (SP), sometimes also CUDA cores. In CUDA, all parallel instructions use the Single Instruction Multiple Data (SIMD) architecture, which means that every stream processor in each streaming multiprocessor executes the same instructions, but on different data.

The other important components of the architecture are covered in the list below; a minimal kernel example is given after the list.

• Thread - A thread is a lightweight process that executes its own instructions in the kernel. There are different types of memory that can be accessed by a thread. Every launched thread in a kernel function is given an ID, or index, by CUDA.

• Block - A block is a group of threads. When a block is created, its threads are mapped to a streaming multiprocessor. The streaming multiprocessors are responsible for activating blocks at launch and deactivating them at the end. Each block has access to a specific type of memory called shared memory. A block can be created as a three-dimensional structure of threads.

• Grid - The grid is the container for a group of blocks. As there are groups of threads in a block, there are groups of blocks inside a grid.

• Warp - A warp is a group of threads, but unlike a block it is the minimum number of threads that can be launched at once. For example, if one wants to launch only one thread, a whole warp will be launched by the streaming multiprocessor, and the rest of the threads in the warp will be set as inactive. For an optimal launch structure, one seeks to launch a warp-wise number of threads. For most CUDA versions, the number of threads per warp is 32; therefore, the optimal launch pattern is to launch a number of threads that is a multiple of 32.

• Kernel function - The kernel function is the parent function that runs on the GPU. Before it is launched, the grid and block structure to be used during execution is specified. The kernel is always launched from the host and executed on the device. On recent hardware, it is possible to run multiple kernel functions in parallel.

• Device function - A device function behaves as a normal function on the CPU. It is called from the device and executed on the device. This means that a kernel function can call a device function to perform a subroutine task.

• Device stream - A stream is a queue of device work that is executed on the GPU. The host places work in the queue and continues immediately, regardless of whether the device work has finished or not. The device schedules work from streams once resources are free. Many CUDA operations are placed in a stream by default, such as kernel functions, memory copies and synchronizations. Operations within the same stream are ordered and cannot overlap, while operations in different streams are unordered and can overlap. Whether there is one or several GPUs, streams can run CUDA operations independently in parallel. This means that, for example, kernel functions can be executed on different streams in parallel, leading to an even higher level of parallelism.
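To tie threads, blocks, the grid and the kernel function together, here is a minimal, self-contained CUDA example (a generic illustration, not part of the thesis implementation): a kernel that scales an array, launched with a block size that is a multiple of the warp size.

```cuda
#include <cuda_runtime.h>

// Kernel: every thread scales one element. blockIdx, blockDim and threadIdx
// together give each thread a unique global index.
__global__ void scaleKernel(float* data, float alpha, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                  // guard: the grid may cover more threads than n
        data[i] *= alpha;
}

int main()
{
    const int n = 1000;
    float* d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));      // allocate global memory on the device

    const int blockSize = 128;                   // multiple of the warp size (32)
    const int gridSize  = (n + blockSize - 1) / blockSize;
    scaleKernel<<<gridSize, blockSize>>>(d_data, 2.0f, n);
    cudaDeviceSynchronize();                     // wait for the kernel to finish

    cudaFree(d_data);
    return 0;
}
```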

Memory components

The main memory components are explained below.

• Global memory - The global memory is the largest memory on the GPU and is typically of the type Dynamic Random Access Memory (DRAM). However, the size comes with a drawback: it is the slowest memory on the GPU. The latency for memory access is high and the memory is not cached. This type of memory is the only one that can be accessed from the host and by all threads on the device, as seen in Figure 2.9. Memory transfers to and from the host go through the PCIe bus. The access times for the threads can be significantly shortened if the memory accesses are coalesced. The lifetime of global memory is the lifetime of the running application; when the application ends, the memory is freed.

• Local memory - Local memory is physically the global memory, but logically it is only accessible by each individual thread. It lives and dies within the thread's scope.

• Shared memory - Shared memory, sometimes called cache memory, is a very fast on-chip memory. Its main advantage is its very low latency. This type of memory is shared between all threads within a block, and it is often used to temporarily store part of a heavier computation. The lifetime of shared memory is the lifetime of a kernel launch; thus, it is freed once the kernel has finished.

• Register memory - Register memory is the fastest memory on the GPU and consumes, if there are no obstacles, close to zero clock cycles per instruction. It is private to the thread's local scope and typically offers tiny storage compared to other memory types. The register memory is organized as register banks, where a specific number of bits fit in each bank. The programmer has no control over when a thread uses register memory. If too much memory is used, the excess is spilled to global memory; this is called register spilling. When and how this happens is decided at hardware level by the CUDA scheduler, which distributes the memory allocation among the threads.

• Constant memory - Constant memory is very similar to global memory, but it can only be written from the host. It is cached and accessible to all threads.

• Texture memory - Texture memory is cached and allows hardware interpolation. It is accessible to all threads.

More information about the architecture can be found in the Nvidia CUDA C Programming Guide [Nvidia, 2018b].

3 Implementation

The implementation in this thesis is based on Algorithm 1. Unfortunately, Algorithm 1 does not suit computations on the GPU very well, and there is no simple conversion. There are several bottlenecks for the GPU, and the most dominant one is that all of the lines in Algorithm 1 are computed sequentially. In order to increase the level of parallelism we have to optimize certain steps in the algorithm. One seeks globally parallel steps, meaning that the lines can be computed independently; for example, that line 6 could be computed before line 5. This is a very rare situation in numerical computations. The typical case is that the first value is needed in order to compute the second value before the next iteration starts. We can hold back our expectations and instead look for locally parallel computations, meaning that a specific line is parallelizable. For example, in this work we will show that one tensor-matrix product operation on line 3 in Algorithm 1 is indeed parallelizable.

3.1 Encoding - Algorithm outline

There are many different constellations and setups one can use when optimizing the iterative routine in Algorithm 1 to be more GPU-friendly. Not only is it difficult to come up with an algorithm with as few bottlenecks as possible, it is also difficult to take into account the large variety of input data that could affect the process in different ways. Therefore, we limit ourselves to one of many cases, and this is done to get as high performance as possible. The two limitations are described below.

• The size of each dictionary and the size of every slice of each data point are relatively small. The size of the dictionaries is based on the compression method described in [Miandji et al.], where only small dictionaries are needed. This implementation has not taken into account other compression methods in similar research fields that use a large dictionary size.

• The number of data points is so large that a one-pass run on the GPU is not possible. This means that one operation in the algorithm will need all of the available global memory.

The GPU-optimized algorithm outline can be seen in Algorithm 2 and the main parts will be covered in this chapter.

Algorithm 2 Compute the coefficients and the membership index. The algorithm on the GPU implements [S^(p), a] = Test(T^(p), {U^(1,k), ..., U^(n,k)}_{k=1}^{K}, τ_t, ε)

 1: e ∈ R^K ← ‖T^(p)‖²_F and z ∈ R^K ← 1
 2: for k = 1 ... K do
 3:     X^(k) ← T^(p) ×_1 (U^(1,k))^T ··· ×_n (U^(n,k))^T
 4:     I^(k) ← index(X^(k))                                  ▷ Linear indexing
 5:     I^(k)_sorted ← sort(|X^(k)|, I^(k))                   ▷ Sort descending
 6:     Y ← nullify(X^(k))
 7:     while z_k ≤ τ_t and e_k > ε do
 8:         l ← I^(k)_{z_k, sorted}                           ▷ Index of the z_k-th largest element
 9:         γ ← X^(k)_l                                       ▷ The z_k-th largest element
10:         e_k ← e_k − γ²
11:         Y_l ← X^(k)_l
12:         z_k ← z_k + 1
13:     end while
14:     X^(k) ← Y                                             ▷ Store the nonzero coefficients
15: end for
16: a ← index of min(z)
17: if z_a = τ_t then
18:     a ← index of min(e)
19: end if
20: S^(p) ← X^(a)

3.1.1 Data points computed in chunks in parallel

Algorithm 2 is computed segmentally in parallel. This means it can run many data points T^(p) in parallel in a chunk-wise fashion. The only limitation is the amount of global memory on the GPU, which is a hardware limitation and independent of the algorithm. One can interpret every line in Algorithm 2 as: compute this line segmentally for p data points in parallel. The data points are computed independently, and each thread computes and accesses data independently. This minimizes synchronization between threads and avoids bank conflicts and race conditions.

The three main parts this thesis has focused on are the tensor-matrix product, the nullification of the data point and the norm of the tensor-matrix product; see lines 3, 5 and 6 in Algorithm 1. The computation on line 6 in Algorithm 1 is especially exhausting for the threads on the GPU. We seek to replace heavy computations with lighter ones. Many small computations are preferable to one heavy computation, because every heavy computation in an iterative process, i.e. a loop, has a negative impact on the load balance of each thread.

3.1.2 Memory - transaction and structure

One bottleneck to be aware of when optimizing an algorithm for the GPU is memory transactions. If we optimize Algorithm 1 such that there is no heavy computation for a single thread, we force a light computation model. In such a model, where each thread has only simple instructions to execute, every memory call becomes expensive relative to the computation. With this in mind, every part of the algorithm was implemented on the GPU, because we want to minimize memory transactions from host to device.

To solve this transaction problem, a large memory buffer is created in global memory to store everything that is needed to run Algorithm 2. We do this to keep every executed instruction on the device, and we only copy data back to the host once the application has finished. The memory buffer is as large as the global memory on the GPU. We want to store and run as many data points as possible, so the only limitation is the hardware memory. In other words, a GPU with twice the memory size can compute twice as many data points in parallel. The buffer is of size 4N + C, where N is the amount of memory needed to store a segment, or chunk, of data points denoted T^(p), and C is the complementary storage, for example for the vectors z and e. Since C ≪ N, we can regard the buffer as 4N. Figure 3.1 shows how this resides on the GPU.
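As an illustration of this buffer-sizing strategy, the following host-side sketch queries the free global memory with cudaMemGetInfo and estimates how many data points fit in one chunk, assuming the 4N + C layout described above. The function name and the exact byte accounting are assumptions made for this example, not the thesis code.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Estimate how many data points fit in one chunk, assuming the buffer layout
// 4N + C described above, where N is the memory for one chunk of data points.
std::size_t pointsPerChunk(std::size_t bytesPerDataPoint, std::size_t complementaryBytes)
{
    std::size_t freeBytes = 0, totalBytes = 0;
    cudaMemGetInfo(&freeBytes, &totalBytes);        // query free/total global memory

    if (freeBytes <= complementaryBytes) return 0;
    std::size_t usable = freeBytes - complementaryBytes;  // subtract C
    return usable / (4 * bytesPerDataPoint);              // the buffer holds 4N
}
```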

Figure 3.1: An illustration of the device memory buffer, which is as large as the global memory of the GPU.

3.2 The n-mode product on GPU

The heaviest computation in Algorithm 1 is the tensor-matrix product, where each data point is iteratively projected onto every dictionary. The work of [Austin et al., 2016] computes tensor decompositions with this operation in a massively parallel framework with over 50 000 CPU cores, i.e. a super-computer. However, computations on the GPU are neither the same as, nor as scalable as, those on a very large number of CPU cores; obstacles such as cache misses, load balance and race conditions are punished much harder on the GPU. In this section we present a way to compute the n-mode product on the GPU.

The most important insight about the n-mode product is to recognize it as an inner product between every fiber of a tensor and every row of a dictionary matrix. This awareness is the basis of the implementation. The n-mode product between a single fiber of a given tensor and a matrix can be seen as a matrix-vector product. With this realization, the task is to translate the n-mode product into a matrix-vector operation. The matrix is already given as the dictionary U. What remains is to extract the vector, which is a fiber of a given tensor X.

3.2.1 Implicit folding

To extract a fiber we need to traverse along the n-mode of X. In practice, the tensor is stored in column-major order in memory, unfolded as a vector, see Figure 3.2, where X ∈ R^{I_1 × I_2 × I_3 × I_4}, in this example X ∈ R^{2×3×3×2}. Thus, our task is to fold the unfolded vector into a tensor and then extract the fiber we seek. The folding is done in place and the tensor is not stored explicitly.

Figure 3.2: An example of how a tensor is stored in memory. This unfolded 4D tensor is stored as a vector in column-major order. See Figure 3.4 for how a thread works on this structure.

Every thread in a kernel function is given an ID by CUDA to distinguish the threads. We launch as many threads as there are fibers along the n-mode stored in the buffer, which is practical because it gives us a 1:m_n thread-per-element ratio, where m_n is the number of elements along the n-th dimension. This also means that every thread is given one fiber for which to compute the matrix-vector product. Continuing the example in Figure 3.2, the 1-mode product kernel function launches I_2 I_3 I_4 = 3 × 3 × 2 = 18 threads, each of which operates along the first dimension independently. See Figure 2.2 for how the fibers assigned to each thread look.

Recall Equation 2.9, where a mapping from several subscripts to a linear index is performed. Since we are given a linear thread index by CUDA and want to convert it to subscript form, we have to invert that function. The idea is to find the subscripts (i_1, i_2, ..., i_n) for a linear index j such that a thread can iterate through a fiber as X(k, i_2, ..., i_n), where k = 1, 2, ..., m_n, or along the last dimension as X(i_1, i_2, ..., i_{n-1}, k). In memory this translates to jumping a certain number of elements when traversing the n-mode fiber. This function is defined by the recursive series

\[
b_k = f(b_{k-1}) = \frac{b_{k-1} - i_{k-1}}{I_{k-1}}, \qquad i_k = b_k \bmod I_k, \qquad \forall k = 1, \dots, n, \quad \text{with } b_0 = j,\ i_0 = 0 \text{ and } I_0 = 1 \tag{3.1}
\]

When traversing along the n-mode, each thread stores m_n elements from global memory in either shared memory or registers. Depending on the dimensionality of the data point, one memory type can be faster than the other. Note that m_n is small. See Figure 3.3 for how this process is done on the GPU.

The logical process performed by a single thread can be seen in Figure 3.4.
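The following CUDA kernel sketch illustrates the idea described in this section: each thread converts its linear ID to subscripts in the spirit of Equation 3.1, strides through its mode-n fiber in the flat column-major array and multiplies it with the dictionary matrix. It is a simplified illustration under assumed data layouts (square dictionaries, a single data point, registers for the fiber) and not the thesis implementation, which also handles chunking and shared-memory staging.

```cuda
#define MAX_FIBER 32   // illustrative upper bound on m_n

// Each thread handles one mode-n fiber: it loads the fiber from the flat
// column-major array X, multiplies it with the dictionary U (J x I_n,
// column-major) and writes the resulting fiber to Y.
__global__ void nModeProductKernel(const float* X, const float* U, float* Y,
                                   const int* dims, int order, int n,  // n is the 0-based mode
                                   int J, int numFibers)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= numFibers) return;

    // Stride between consecutive elements of a mode-n fiber in column-major order.
    long stride = 1;
    for (int m = 0; m < n; ++m) stride *= dims[m];

    // Convert the thread ID to subscripts over all modes except n (cf. Equation 3.1)
    // and accumulate the linear offset of the fiber's first element.
    long b = t, base = 1, offset = 0;
    for (int k = 0; k < order; ++k) {
        if (k != n) {
            int ik = b % dims[k];
            b /= dims[k];
            offset += (long)ik * base;
        }
        base *= dims[k];
    }

    // Load the fiber into registers (m_n is small).
    int In = dims[n];
    float fiber[MAX_FIBER];
    for (int i = 0; i < In; ++i)
        fiber[i] = X[offset + i * stride];

    // Matrix-vector product: one inner product per row of U.
    for (int j = 0; j < J; ++j) {
        float sum = 0.0f;
        for (int i = 0; i < In; ++i)
            sum += U[j + J * i] * fiber[i];   // U(j, i), column-major
        Y[offset + j * stride] = sum;         // assumes J == I_n so Y keeps X's layout
    }
}
```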

3.3 Sorting

The next part we optimize is line 5 in Algorithm 1. We aim to store the elements of X^(k) with the highest absolute values. In Algorithm 1 this is done by iteratively nullifying the (∏_{j=1}^{n} m_j) − z_k elements of X^(k) with the smallest absolute values and stopping once e_k < ε. By doing so we stop as soon as we have enough elements to approximate the projection of T^(i). This process is very exhausting for the threads on the GPU. Instead, we sort the elements of each data point |X^(k)| segmentally in parallel, see line 5 in Algorithm 2. We do this to minimize the number of instructions that have to be computed in the iteration on line 7 of Algorithm 2. When the sorting is done we store the linear indices of the sorted data points. We can then, with a single memory call, access the τ_t-th largest absolute value of the elements in X^(k). This pays off even more when the number of iterations on line 7 in Algorithm 2 is large. The sorting method is a segmented pair-wise radix sort, which suits large data sets on the GPU. To access an element and extract a fiber of X^(k) we use the function in Equation 3.1.
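As one possible realization of a segmented key-value sort on the GPU, the sketch below uses CUB's DeviceSegmentedRadixSort to sort the absolute coefficients of each data point in descending order, carrying their linear indices along as values. The thesis only states that a segmented pair-wise radix sort is used, so this is an assumed usage example rather than the actual implementation; d_offsets is assumed to hold numSegments + 1 segment boundaries.

```cuda
#include <cub/cub.cuh>

// Sort |X| within each data-point segment in descending order, carrying the
// linear indices along as values. Illustrative sketch.
void segmentedSortDescending(const float* d_absKeysIn, float* d_absKeysOut,
                             const int* d_indicesIn, int* d_indicesOut,
                             int numItems, int numSegments, const int* d_offsets)
{
    void*  d_temp    = nullptr;
    size_t tempBytes = 0;

    // First call only queries the required temporary storage size.
    cub::DeviceSegmentedRadixSort::SortPairsDescending(
        d_temp, tempBytes,
        d_absKeysIn, d_absKeysOut, d_indicesIn, d_indicesOut,
        numItems, numSegments, d_offsets, d_offsets + 1);

    cudaMalloc(&d_temp, tempBytes);

    // Second call performs the actual segmented descending sort.
    cub::DeviceSegmentedRadixSort::SortPairsDescending(
        d_temp, tempBytes,
        d_absKeysIn, d_absKeysOut, d_indicesIn, d_indicesOut,
        numItems, numSegments, d_offsets, d_offsets + 1);

    cudaFree(d_temp);
}
```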

Figure 3.3: The n-mode product on the GPU. Here T is a 4D tensor and the first thread traverses along the fourth dimension. Note that this is only an illustration of the concept; the data points are not explicitly stored as tensors in global memory.

Figure 3.4: An illustration of the logical process. A thread has to obtain the subscripts in order to correctly jump through the tensor, which is stored as a single long vector in memory.

3.4 Optimization of the sparse n-mode product

The last step is to optimize line 6 in Algorithm 1. This step is the heaviest computation for the GPU to perform iteratively. We aim to either transform the heavy computation into a simpler one or move it outside of the iteration. As with the computation of Y, we were able to move the computation on line 5 in Algorithm 1 outside the iterative routine starting on line 7. In this section we cover how to transform the norm of the n-mode product of a sparse tensor into almost a single instruction. Specifically, we will show that the norm on line 6 in Algorithm 1 can be reduced to line 10 in Algorithm 2.

Let the norm of a tensor be defined as in Equation 2.2, and let

\[
\mathcal{T} = \mathcal{X} \times_1 U^{(1,k)} \cdots \times_n U^{(n,k)}, \qquad \mathcal{X} = \mathcal{T} \times_1 (U^{(1,k)})^T \cdots \times_n (U^{(n,k)})^T \tag{3.2}
\]

and

\[
\tilde{\mathcal{T}} = \mathcal{Y} \times_1 U^{(1,k)} \cdots \times_n U^{(n,k)}, \qquad \mathcal{V} = \mathcal{X} - \mathcal{Y}, \qquad \langle \mathcal{Y}, \mathcal{V} \rangle = 0 \tag{3.3}
\]

such that Y is a sparse version of X, {U^(1,k), ..., U^(n,k)} form orthonormal bases, and V is the complementary dense version of X with respect to Y. We will use the orthogonal invariance property of {U^(1,k), ..., U^(n,k)}, i.e. that projection onto an orthonormal basis preserves the norm. It follows from

24 3 Implementation preserves the norm. It follows from

\[
\|\mathcal{X}\|^2 = \langle \mathcal{X}, \mathcal{X} \rangle = \langle \mathcal{X},\ \mathcal{T} \times_1 (U^{(1,k)})^T \cdots \times_n (U^{(n,k)})^T \rangle = \langle \mathcal{X} \times_1 U^{(1,k)} \cdots \times_n U^{(n,k)},\ \mathcal{T} \rangle = \langle \mathcal{T}, \mathcal{T} \rangle = \|\mathcal{T}\|^2 \tag{3.4}
\]

Using Equation 3.2 and Equation 3.3 we can now express the error term on line 6 in Algorithm 1 as

\[
\|\mathcal{T} - \tilde{\mathcal{T}}\|^2 = \|\mathcal{T}\|^2 - 2\langle \mathcal{T}, \tilde{\mathcal{T}} \rangle + \|\tilde{\mathcal{T}}\|^2 \tag{3.5}
\]

and for the inner product we have

\[
\langle \mathcal{T}, \tilde{\mathcal{T}} \rangle = \langle \mathcal{T},\ \mathcal{Y} \times_1 U^{(1,k)} \cdots \times_n U^{(n,k)} \rangle = \langle \mathcal{T} \times_1 (U^{(1,k)})^T \cdots \times_n (U^{(n,k)})^T,\ \mathcal{Y} \rangle = \langle \mathcal{X}, \mathcal{Y} \rangle = \langle \mathcal{V} + \mathcal{Y}, \mathcal{Y} \rangle = \langle \mathcal{V}, \mathcal{Y} \rangle + \langle \mathcal{Y}, \mathcal{Y} \rangle = 0 + \langle \mathcal{Y}, \mathcal{Y} \rangle = \|\mathcal{Y}\|^2 \tag{3.6}
\]

With the orthogonality property in Equation 3.4 and Equation 3.6 we can rewrite Equation 3.5 as

\[
\|\mathcal{T} - \tilde{\mathcal{T}}\|^2 = \|\mathcal{T}\|^2 - 2\|\mathcal{Y}\|^2 + \|\mathcal{Y}\|^2 = \|\mathcal{T}\|^2 - \|\mathcal{Y}\|^2 \tag{3.7}
\]

The next task is to take advantage of the result in Equation 3.7 in our iterative routine. Since Y is sparse and we iteratively include one element at a time from X, we can break this down into an element-wise update. From Equation 2.2, with a linear index j and P as the total number of elements, we have

\[
\|\mathcal{T}\|_F^2 = \sum_{i_1=1}^{I_1} \sum_{i_2=1}^{I_2} \cdots \sum_{i_N=1}^{I_N} t_{i_1 i_2 \dots i_N}^2 = \sum_{j=1}^{P} t_j^2 \tag{3.8}
\]

Putting Equation 3.7 and Equation 3.8 together we have

\[
\|\mathcal{T}\|_F^2 - \|\mathcal{Y}\|_F^2 = \sum_{j=1}^{P} t_j^2 - \sum_{l=1}^{P} y_l^2 \tag{3.9}
\]

We exploit the sparse structure of Y and iteratively add one element at a time from X. Let 0 < M < τ_t, where M is the number of nonzero coefficients in the sparse tensor Y at a specific iteration. Then, from Equation 3.7 and the formulation in Equation 3.9, with the index l giving the locations of the nonzero coefficients, we finally have

\[
\|\mathcal{T}\|_F^2 - \|\mathcal{Y}\|_F^2 = \sum_{j=1}^{P} t_j^2 - \sum_{l=1}^{M} y_l^2 \tag{3.10}
\]

To conclude the process: we iteratively access an element of the tensor X via Equation 3.1 and update the threshold value e_k as in Equation 3.10. The computation from line 6 in Algorithm 1 is now optimized and, as line 10 in Algorithm 2, it is among the simplest computations, second only to line 12.
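Put together, the inner loop of Algorithm 2 (lines 7-13) reduces to the following sketch for a single data point and dictionary, written here as plain CUDA C++ with illustrative names; e is initialized to ‖T^(p)‖²_F as on line 1 of Algorithm 2, and the loop indices are 0-based.

```cuda
// Sketch of lines 7-13 of Algorithm 2 for one data point and one dictionary.
// X         : projected coefficients (flat array)
// sortedIdx : linear indices of X sorted by descending absolute value
// Y         : output sparse coefficients, pre-initialized to zero
// e         : starts at ||T||_F^2 and is decreased per included coefficient (Eq. 3.10)
// z         : counts the included nonzero coefficients
void greedyInnerLoop(const float* X, const int* sortedIdx, float* Y,
                     int tau_t, float epsilon, float& e, int& z)
{
    z = 0;
    while (z < tau_t && e > epsilon) {
        int   l     = sortedIdx[z];   // index of the (z+1)-th largest |X| element
        float gamma = X[l];           // its value
        e -= gamma * gamma;           // incremental error update, Equation 3.10
        Y[l] = gamma;                 // keep the coefficient
        ++z;
    }
}
```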

4 Result

In this chapter we present the results for compression quality and speed. The data sets used were both synthetic images and randomly generated data. The synthetic data set was used to measure the compression quality, and the randomly generated data sets were used to measure the speed for various dimensionalities. The results were obtained with an Nvidia GeForce GTX Titan XP and an Intel Xeon CPU W3670 at 3.2 GHz. Since the computations were performed on the GPU, only one CPU core was used.

4.1 Synthetic data set

Figure 4.1 and Figure 4.2 show the data set Rotating Bunnies [Wetzstein et al., 2012], with dimensions m_1 = 8, m_2 = 8, m_3 = 3, m_4 = 9, m_5 = 3, m_6 = 3 for each data point. There are 6930 data points in 3 frames and 13 860 data points in 6 frames, see Table 4.1 and Table 4.2. The computation speed for the Rotating Bunnies data set is 2.45 data points per ms over K = 32 dictionaries for τ_t = 770 and ε = 10^-5, or 2450 data points per second. On average, the result is a little more than 1 light field frame per second. By normalizing the result to K = 1 dictionary, i.e. one iteration of Algorithm 2, we get 34 (33.9) frames per second and dictionary. We compare with the result of [Miandji et al.], which has a total encoding time of 1910 seconds over 89 frames and K = 64 dictionaries. Normalizing this, we get on average almost 3 (2.98) frames per second for one iteration, or K = 1 dictionary. Thus, our speedup ratio is 11.37, or about 11 times faster.

Table 4.1: The measured results for the Rotating Bunnies, where K is the number of dictionaries, N is the number of data points and τ_t is the sparsity, i.e. the number of nonzero coefficients included. N/s is data points per second and dictionary; it measures how many data points per second Algorithm 2 can compute during one iteration, where s = 1 second and K = 1 dictionary.

Data set   K    N        time (s)   PSNR    MSE         τ_t    N/s
Bunnies    32   13 860    5.44      31.67   6.87e-04     256   81 529
Bunnies    32   13 860    5.50      33.94   4.05e-04     512   80 576
Bunnies    32   13 860    5.59      35.66   2.82e-04     770   79 264
Bunnies    32   13 860    5.61      36.95   8.45e-05    1024   79 104
Bunnies    64   13 860   10.92      31.89   6.30e-04     256   81 230
Bunnies    64   13 860   11.10      34.26   3.74e-04     512   79 913
Bunnies    64   13 860   11.14      35.91   2.57e-04     770   79 555
Bunnies    64   13 860   11.30      37.28   4.677e-05   1024   78 499

4.2 Randomly generated data set

The RNG (randomly generated) data sets in Table 4.2 were created with a uniform distribution. The purpose was to measure the time and speed for different, or arbitrary, dimensionalities of a data point. We aim to see how the dimensionality affects the computation time.

The dictionaries were created by first generating random uniform matrices and then performing a QR decomposition, A = QR, on each matrix along each mode, where Q is a random orthogonal matrix and R is an upper triangular matrix. The data points in each data set have 3 color channels, m_5 = 3. The last dimension is set to m_6 = 3 frames. RNG3 has the dimensions m_1 = 3, m_2 = 3, m_3 = 3, m_4 = 3, m_5 = 3, m_6 = 3. RNG5 has m_1 = 5, m_2 = 5, m_3 = 5, m_4 = 5, m_5 = 3, m_6 = 3. RNG8 has m_1 = 8, m_2 = 8, m_3 = 8, m_4 = 8, m_5 = 3, m_6 = 3, and RNG10 has m_1 = 10, m_2 = 10, m_3 = 10, m_4 = 10, m_5 = 3, m_6 = 3. Lastly, the RNG Painter data set is set to have the same dimensions as the Painter data set used by [Miandji et al.], with 4 frames. It has the dimensions m_1 = 10, m_2 = 10, m_3 = 4, m_4 = 4, m_5 = 3, m_6 = 4.

The encoding time reported by [Miandji et al.] for the Painter data set [Sabater et al., 2017] over K = 128 dictionaries and 101 frames is a total of 12 436 seconds. This was computed with four Intel Xeon E7-4870 at 2.4 GHz, a total of 40 CPU cores and 80 threads. On average, this corresponds to 123 seconds per frame. For one iteration, i.e. K = 1 dictionary, the speed is almost 1 (0.96) frame per second and dictionary. The result in this thesis, seen in Table 4.3, is 374 ms for 4 frames when normalized to K = 1 dictionary. Thus, the average speed is 10.9 frames per second and dictionary. We get an almost 11 times faster encoding time with the same data point dimensionality.

Figure 4.1: The data set Rotating Bunnies compressed with different sparsity threshold values τ_t: (a) original data, (b) compressed with τ_t = 1024, (c) compressed with τ_t = 256. Here, ε = 10^-5 is fixed.

Figure 4.2: A closer look at the Rotating Bunnies in Figure 4.1: (a) original data, (b) compressed with τ_t = 1024, (c) compressed with τ_t = 256. The high frequencies are the most difficult to deal with.

Table 4.2: Data points with different dimensionality. We run as many data points as fit in memory and compute the average speed, i.e. the number of data points per second during one dictionary iteration, where K = 1. Each data point has N_e elements, and the sparsity τ_t was set to 5 % of N_e. The threshold ε = 10^-5 for all data sets. N/s is data points per second.

Data set                           N/s        N_e
RNG3 (3x3x3x3x3x3)                 546 448       729
RNG5 (5x5x5x5x3x3)                 211 864     5 625
RNG8 (8x8x8x8x3x3)                  37 735    36 864
RNG10 (10x10x10x10x3x3)             16 000    90 000
Rotating Bunnies (8x8x3x9x3x3)      79 264    15 552
RNG Painter (10x10x4x4x3x4)         61 497    19 200

Figure 4.3: The data set Rotating Bunnies compressed with different error threshold values ε: (a) original data, (b) compressed with ε = 10^-2, (c) compressed with ε = 10^-4. Here, τ_t = 1024 is fixed.

Table 4.3: Different data sets run with one dictionary iteration. The measured times for the functions in Algorithm 2 are: Norm on line 1, n-mode product on line 3 and Sorting on line 5. The timings are in milliseconds (ms), with each function's share of the total in parentheses. The RNG Painter data set requires four times more memory than the Bunnies, and therefore around four times longer computation time.

Data set                             Norm            N-mode            Sorting           Total
RNG3 (3x3x3x3x3x3 x 100K)            3.82 (2.08%)    19.06 (10.39%)    160.6 (87.53%)    183.48 (100%)
RNG5 (5x5x5x5x3x3 x 25K)             6.86 (5.82%)    32.44 (27.52%)     78.59 (66.66%)   117.89 (100%)
RNG8 (8x8x8x8x3x3 x 6000)           10.71 (6.75%)    47.42 (29.89%)    100.5 (63.35%)    158.63 (100%)
RNG10 (10x10x10x10x3x3 x 4000)      17.35 (6.94%)    83.84 (33.53%)    148.82 (59.53%)   250.01 (100%)
Bunnies (8x8x3x9x3x3 x 6930)         5.18 (5.47%)    27.06 (28.59%)     62.4 (65.93%)     94.64 (100%)
RNG Painter (10x10x4x4x3x4 x 23K)   24.12 (6.45%)   104.69 (27.99%)    245.2 (65.56%)    374.01 (100%)

5 Discussion

In this chapter we discuss and analyze the implementation, certain limitations and the results.

5.1 Implementation

5.1.1 Design

The design choices turned out to be very limited. Since we assume that the number of data points is very large, like an incoming stream, there are limited choices for the algorithm outline. One could argue that projecting the data points onto all dictionaries at once in parallel, and in that way removing the iteration on line 2 in Algorithm 2, is a better design choice. But projecting data points onto all dictionaries in parallel would require K times more memory and would make no difference from a time-per-data-point performance perspective. For a faster GPU with the same amount of global memory, this method would be worse.

Another aspect of the design is the segmented, or chunk-wise, computation in parallel. This seemed to be the best choice to minimize memory transactions between the host and device. The alternative would be to compute one line in Algorithm 2 for all data points and move to the next line when finished. This would raise two problems: the first is the need for the data points at the next line, and the second is never-ending computations. To give a simple example for the first problem, assume the incoming number of data points is N = 10^9 and assume that the implementation is done in such a way that the n-mode product on line 3 in Algorithm 2 is computed until all data points are finished. The data points would not fit in global memory and would have to be written to disk, and read again when line 4 starts. This would create unnecessary read-write operations to and from the host. For the second problem, assume there is an infinite number of data points, or a never-ending stream of data points. This would simply make the program stuck on the first line and it would never end.

5.1.2 N-mode product

The n-mode product is a very general numerical computation viewed from a data perspective. The tensor could be large or small, which would require two separate implementations; an implementation optimized for large tensors does not necessarily benefit smaller tensors, and vice versa. This is shown in the implementation of this thesis. The second point of view is the number of dictionaries, or the number of dimensions of the tensor. If the number of dictionaries is very large, one has to be careful with the post-processing of an n-mode product. For example, unfolding a tensor to a matrix many times is an expensive operation and not desirable. If the number of dictionaries is very small then there are several options, such as in-place or explicit unfolding/folding. A small test suite, not included in this report, showed that for larger dictionaries it is faster to unfold the tensor and then multiply with a matrix-matrix product. This was done using the cuBLAS library, which is a BLAS (Basic Linear Algebra Subprograms) library optimized for CUDA. However, this was not the case for the smaller dictionaries used in this thesis; instead, the implementation described in Chapter 3 was faster. More about the cuBLAS library can be found in [Nvidia, 2018a].

The limitation of this work is mostly memory-bound, as almost all matrix-product implementations are, but it is also bounded by the performance of each individual thread. The memory limitation comes from the global memory and the shared memory: more data points can be computed in parallel with access to a larger global memory, and bigger dictionaries can be used with a larger shared memory. However, this does not scale beyond the workload limit of a thread. Each thread performs a matrix-vector product, which means a thread has to iterate over every column of the matrix and compute the sum for every row of the vector. If the vector is large, meaning the number of elements along a dimension of the tensor is large, the computation will be exhausting for the thread. The implementation is therefore suited to smaller tensors and smaller dictionaries.

5.1.3 Sorting

The sorting process consumes a large portion of the computation time, see Table 4.3. The advantage of this type of general segmented sorting is that it requires roughly the same amount of time for much larger data sets. But it comes with the downside of not being fully optimized for this case, where each data point is small from a sorting perspective, which causes a lot of overhead. One can argue that sorting all elements in each data point when only the τ_t-th largest element is needed is excessive. One way to counter this is to have two different implementations: the program would at run time check how large the value τ_t is and choose a more optimized technique, especially for smaller values of τ_t. On
