
Linköping University | Department of Computer and Information Science
Master's thesis, 30 ECTS | Datateknik
2019 | LIU-IDA/LITH-EX-A--19/070--SE

Implementation of Principal Component Analysis For Use in Anomaly Detection Using CUDA

Implementation av principialkomponentanalys för användning inom anomalidetektion med hjälp av CUDA

Joakim Bertils

Supervisor: Rouhollah Mahfouzi
Examiner: Christoph Kessler
External supervisor: David Khelil

Linköpings universitet, SE-581 83 Linköping


Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for a period of 25 years starting from the date of publication, barring exceptional circumstances.

The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purposes. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.


Abstract

As more and more systems are connected, there is a large benefit in being able to find and predict problems in the monitored process. By analyzing the data in real time, feedback can be generated to the operators or to the process itself, allowing the process to correct itself.

This thesis implements and evaluates three CUDA GPU implementations of principal component analysis, used for dimensionality reduction of multivariate data sets in real time, to explore the trade-offs of the algorithm implementations in terms of speed, energy and accuracy. The GPU implementations are compared to reference implementations on the CPU.

The study finds that the covariance based method is the fastest of the implementations for the tested configurations, but the iterative NIPALS implementation has some interesting optimization opportunities that are explored. For large enough data sets, a speedup of around 100 compared to the 8-virtual-core CPU is obtained for the GPU implementations, making the GPU implementations an option to investigate for problems requiring real-time computation of principal components.


Contents

Abstract
Contents
List of Figures
List of Tables
List of Listings

1 Introduction
   1.1 Motivation
   1.2 Aim
   1.3 Research questions
   1.4 Delimitations

2 Background and theory
   2.1 Principal component analysis
   2.2 Covariance based method
   2.3 Singular value decomposition based method
   2.4 NIPALS
   2.5 Relative error
   2.6 Dimensionality reduction
   2.7 Anomaly detection
   2.8 General purpose GPU computing

3 Related work
   3.1 Principal component analysis on alternate hardware
   3.2 Principal component analysis use cases
   3.3 Variations of principal component analysis

4 Method
   4.1 General system layout
   4.2 Data format
   4.3 GPU processing
   4.4 Latency
   4.5 Evaluation
   4.6 Implementations

5 Results
   5.1 Implementation
   5.2 Performance
   5.3 NIPALS tradeoff
   5.4 Real-time switching

6 Discussion
   6.1 Results
   6.2 Method

7 Conclusion
   7.1 Summary
   7.2 Future work

Bibliography

A NIPALS implementation


List of Figures

2.1 Randomly generated data points
2.2 Randomly generated data points with trend line
2.3 NIPALS Loading Regression
2.4 NIPALS Score Regression
2.5 CUDA Grid of Thread Blocks
4.1 Client-Server relation
4.2 Client flow diagram
4.3 Server flow diagram
5.1 1 Sensor 44100 Hz Time Consumption
5.2 1 Sensor 88200 Hz Time Consumption
5.3 1 Sensor 96000 Hz Time Consumption
5.4 1 Sensor 192000 Hz Time Consumption
5.5 2 Sensors 44100 Hz Time Consumption
5.6 2 Sensors 88200 Hz Time Consumption
5.7 2 Sensors 96000 Hz Time Consumption
5.8 2 Sensors 192000 Hz Time Consumption
5.9 3 Sensors 44100 Hz Time Consumption
5.10 3 Sensors 88200 Hz Time Consumption
5.11 3 Sensors 96000 Hz Time Consumption
5.12 3 Sensors 192000 Hz Time Consumption
5.13 4 Sensors 44100 Hz Time Consumption
5.14 4 Sensors 88200 Hz Time Consumption
5.15 4 Sensors 96000 Hz Time Consumption
5.16 4 Sensors 192000 Hz Time Consumption
5.17 1 Sensor 44100 Hz Energy Consumption
5.18 1 Sensor 88200 Hz Energy Consumption
5.19 1 Sensor 96000 Hz Energy Consumption
5.20 1 Sensor 192000 Hz Energy Consumption
5.21 2 Sensors 44100 Hz Energy Consumption
5.22 2 Sensors 88200 Hz Energy Consumption
5.23 2 Sensors 96000 Hz Energy Consumption
5.24 2 Sensors 192000 Hz Energy Consumption
5.25 3 Sensors 44100 Hz Energy Consumption
5.26 3 Sensors 88200 Hz Energy Consumption
5.27 3 Sensors 96000 Hz Energy Consumption
5.28 3 Sensors 192000 Hz Energy Consumption
5.29 4 Sensors 44100 Hz Energy Consumption
5.30 4 Sensors 88200 Hz Energy Consumption
5.31 4 Sensors 96000 Hz Energy Consumption
5.32 4 Sensors 192000 Hz Energy Consumption
5.33 NIPALS number of components


List of Tables


List of Listings

2.1 Covariance method pseudocode
2.2 SVD method pseudocode
2.3 Simple CUDA kernel
2.4 Advanced CUDA kernel
2.5 Simple host program
4.1 SVD implementation
4.2 Covariance implementation
4.3 Real-time switching

1 Introduction

This chapter describes the motivation, aim, research questions and delimitations for the thesis.

1.1 Motivation

With more and more production systems being connected together, there is a large benefit in being able to find and predict where and when problems in the process are likely to arise. In a factory, for example, unexpected stops in the production line can cause large costs, and small disturbances in the process have a large influence on whether the whole system is working properly or not. With online analysis of the data from the systems in a factory, feedback can be generated to the affected system in real time, and the system can be adjusted either by an operator or by some self-adjusting mechanism. In many cases, such as Internet of Things (IoT) systems, a lot of data is collected spanning many dimensions, which makes it hard to draw any conclusions from the data regarding the classification of the current state of the system. By finding a representation of the data expressed in a space of largely reduced dimensionality, still containing as much information as possible, subsequent data processing steps have much less data to handle. This can be solved by principal component analysis, which transforms a set of observations into principal components ordered according to their variance, which in turn can be used to find a subspace minimizing information loss.

1.2 Aim

The aim of this thesis is to find if it is feasible to implement a principal component analysis algorithm for performing the dimensionality reduction of the data working in real time using a graphics processing unit (GPU) with CUDA compute capabilities. If it is feasible, the minimum performance needed may be investigated. Otherwise, optimizations and bottlenecks of the system will be investigated. Multiple implementations of the principal component analysis are implemented, tested and compared in terms of performance and real-time feasibility. The implementations will be compared in terms of computation time for the GPU algorithm, accuracy of the output data compared to the input data, power consumption for the system, and latency from the sampled data to the calculated result.

1.3 Research questions

The thesis aims to find an answer to the following questions:

• Which of the developed implementations of the dimensionality reduction algorithm is most suited for monitoring processes consisting of IoT sensors in real time?

• How can a real-time implementation of principal component analysis be optimized for handling data from IoT units?

• What trade-offs exist in performance compared to data accuracy and power consumption? Is it possible to switch algorithms at run-time depending on the input data?

1.4 Delimitations

For the thesis, there exist some delimitations that reduce the problem to a manageable size. The delimitations are described below.

The algorithms will only be implemented using the NVIDIA CUDA general purpose graphics processor compute architecture and framework. A multi-threaded CPU implementation may be considered as reference. No other means of parallel computing will be considered, such as a cloud-based solution.

No other algorithms suited for the problem, such as factor analysis, will be implemented and evaluated for the results of the thesis.

Only the dimensionality reduction part of the process monitoring system and a generic streaming mechanism feeding the algorithms with well-formed data will be implemented. The system should make no attempt to classify and draw any conclusions regarding the data.

The input data will only consist of sound data sampled from a set of microphones that may be placed in proximity of the system being monitored. No other input data will be tested and evaluated in the thesis.

2 Background and theory

This chapter presents the necessary background and theory for the thesis.

2.1 Principal component analysis

Principal component analysis is a widely used statistical procedure for reducing the dimensionality of a problem in order to be able to analyze the data. The procedure is common in exploratory data analysis, which is an approach to summarizing and visualizing the main features of a data set. It is used in many variations to solve different types of problems where large amounts of data are collected from the processes in focus.

Principal component analysis generates a new set of linearly uncorrelated variables, called principal components, for the input data, ordered so that the first variable accounts for the largest amount of variance in the data set. Each subsequent variable accounts for the largest amount of the remaining variance while being orthogonal to the previous variables. By choosing a subset $\vec{A}_1, \ldots, \vec{A}_k$ of the principal components $\vec{A}_1, \ldots, \vec{A}_n$, where $k \le n$, the projection from the original $n$-dimensional space to the lower $k$-dimensional space which preserves the most variance is obtained. In general, $k$ can be chosen to a low value while the procedure still preserves a high percentage of the total variance of the information contained in the data set. This allows for minimal loss of information in the projection to the lower-dimensional space, which allows the data to be processed by other algorithms in a more efficient manner while still maintaining as much information as possible. In addition, the vectors spanning the principal component space are orthogonal, which allows for optimizations in the algorithms after the dimensionality reduction [19], [26].

By tracking the principal components of a multivariate data set from process monitoring sensors, such as sound sensors, a lower-dimensional model of the system can be obtained, which makes subsequent estimations of the system state easier and more convenient to work with while maintaining minimal loss of the contained information.


There exist many variations of this algorithm, each designed to efficiently solve a different problem under a certain set of conditions and properties of the input data. Chapter 3 briefly presents related work using principal component analysis, together with the problems those variations are designed to solve.

Visualization

For illustration, consider the two-dimensional plot in Figure 2.1.


Figure 2.1: Randomly generated data points

The plot contains randomly generated points of a generic data set. The points represent individual observations from a process sampled in two dimensions, represented by x and y respectively. Although the points are random, a clear correlation between the two dimensions can be seen. By inserting a linear trend line, as seen in Figure 2.2, this correlation can be easily visualized.


Figure 2.2: Randomly generated data points with trend line

The trend line constitutes the first principal component, as projecting all points on this line will yield the lowest squared error sum. This is the result of the principal component analysis procedure. By subtracting the projection of each point onto this trend line from the point, a new set of points is obtained, for which the procedure may be repeated to obtain the following principal components.

Formal definition and derivation

This section describes a derivation of the principal component analysis method based on the derivation presented by Jolliffe [19]. Formally, the principal component analysis can be described as finding the linear functions $\vec{A}_k'\vec{X}$ of the elements of $\vec{X}$ such that the transforms $\vec{A}_k'\vec{X}$ maximize the variance, where the variance is defined as in equation 2.1, ordered in such a way that the first principal component accounts for the largest amount of variance possible. Subsequent principal components account for the largest variance under the constraint that the principal component is orthogonal and uncorrelated to all previous principal components.

$$\mathrm{Var}(\vec{X}) = E[(\vec{X} - \vec{\mu})^2] \quad (2.1)$$

$\vec{\mu}$ is the expected value of $\vec{X}$, as described in equation 2.2.

$$\vec{\mu} = E[\vec{X}] \quad (2.2)$$


The variance of $\vec{X}$ may also be written as in equation 2.3, where Cov is the covariance of the data and $\Sigma$ is a symmetric matrix.

$$\mathrm{Var}(\vec{X}) = \mathrm{Cov}(\vec{X}, \vec{X}) = \vec{X}'\Sigma\vec{X} \quad (2.3)$$

The linear function $\vec{A}_1'\vec{X}$ can be written as

$$\vec{A}_1'\vec{X} = a_{11}x_1 + a_{12}x_2 + \ldots + a_{1n}x_n = \sum_{j=1}^{n} a_{1j}x_j. \quad (2.4)$$

To derive the principal component analysis procedure, consider the fact that the first principal component $\vec{A}_1$ must maximize the variance as described above. Using equation 2.3, this may be described as

$$\mathrm{Var}(\vec{A}_1'\vec{X}) = \vec{A}_1'\Sigma\vec{A}_1. \quad (2.5)$$

It is trivial to find that the maximum for this relation is non-finite, so a constraint for the normalization of the function is introduced, stating that the sum of all squared elements must be unity, as described in

$$\vec{A}_1'\vec{A}_1 = 1. \quad (2.6)$$

To find the maximum of equation 2.5, subject to the normalization constraint described in equation 2.6, the method of Lagrange multipliers may be used. After introducing a Lagrange multiplier for the normalization constraint, the expression to maximize is described in equation 2.7.

$$\vec{A}_1'\Sigma\vec{A}_1 - \lambda(\vec{A}_1'\vec{A}_1 - 1) \quad (2.7)$$

$\lambda$ is a Lagrange multiplier [3]. The Lagrange multiplier technique states that the first derivative with respect to the variables should be zero, which gives the equation

$$\Sigma\vec{A}_1 - \lambda\vec{A}_1 = \vec{0}, \quad (2.8)$$

which is equivalent to

$$(\Sigma - \lambda I_n)\vec{A}_1 = \vec{0}, \quad (2.9)$$

where $I_n$ is the unit matrix of size $n \times n$.

From equation 2.9, it can be concluded that $\lambda$ is an eigenvalue of $\Sigma$ and thus $\vec{A}_1$ must be the corresponding eigenvector for that eigenvalue. To find which of the $n$ eigenvalues maximizes equation 2.7, recall that the quantity to maximize is the quantity described in equation 2.5, which can be written as

$$\vec{A}_1'\Sigma\vec{A}_1 = \vec{A}_1'\lambda\vec{A}_1 = \lambda\vec{A}_1'\vec{A}_1 = \lambda \quad (2.10)$$

using the fact that $\lambda$ is an eigenvalue of $\Sigma$ and the normalization constraint. As the quantity should be maximized, it follows that $\lambda$ should be the largest of the eigenvalues and thus $\vec{A}_1$ should be the eigenvector corresponding to that eigenvalue.


The second principal component, $\vec{A}_2'\vec{X}$, can be derived in a similar manner. The principal component should maximize the expression

$$\vec{A}_2'\Sigma\vec{A}_2, \quad (2.11)$$

subject to being uncorrelated with the first principal component, which is equivalent to

$$\mathrm{Cov}(\vec{A}_1'\vec{X}, \vec{A}_2'\vec{X}) = 0, \quad (2.12)$$

where $\mathrm{Cov}(X, Y)$ is the covariance between the two random variables $X$ and $Y$. Using the Lagrange multipliers method, the target function becomes

$$\vec{A}_2'\Sigma\vec{A}_2 - \lambda(\vec{A}_2'\vec{A}_2 - 1) - \phi\vec{A}_2'\vec{A}_1, \quad (2.13)$$

where $\lambda$ and $\phi$ are Lagrange multipliers for the normalization constraint and the uncorrelation constraint, respectively. Proceeding in the same manner as above, the expression differentiated with respect to $\vec{A}_2$ should be zero, which yields equation 2.14.

$$\Sigma\vec{A}_2 - \lambda\vec{A}_2 - \phi\vec{A}_1 = 0 \quad (2.14)$$

Multiplying both sides of equation 2.14 with $\vec{A}_1'$ from the left yields equation 2.15.

$$\vec{A}_1'\Sigma\vec{A}_2 - \vec{A}_1'\lambda\vec{A}_2 - \vec{A}_1'\phi\vec{A}_1 = 0 \quad (2.15)$$

Due to the uncorrelation constraint, the first two terms of the above equation are equal to zero, which yields

$$\phi\vec{A}_1'\vec{A}_1 = 0. \quad (2.16)$$

However, $\vec{A}_1$ is normalized, which gives that for the above equation to hold, $\phi$ must be zero.

Equation 2.14 is thus reduced to

$$\Sigma\vec{A}_2 - \lambda\vec{A}_2 = 0, \quad (2.17)$$

which can be rewritten as

$$(\Sigma - \lambda I_n)\vec{A}_2 = 0. \quad (2.18)$$

Just as for the first principal component, $\lambda$ must be an eigenvalue of $\Sigma$, and by repeating the above argument, it can be concluded that it must be the second largest eigenvalue. It cannot be the largest eigenvalue, as that would imply that

$$\vec{A}_1 = \vec{A}_2, \quad (2.19)$$

which would violate the constraint that the principal components are uncorrelated. The same procedure can be repeated to prove that the $p$-th principal component is given by the $p$-th largest eigenvalue and corresponding eigenvector of the covariance matrix $\mathrm{Cov}(X, X)$ for the data $X$, for all $p \in [1, n]$.

2.2 Covariance based method

As can be seen from the derivation in section 2.1, finding the principal components for a data set can be performed by finding the eigenvalues and eigenvectors for the covariance matrix for the data matrix $X$ of size $n \times m$, consisting of $m$ observations of $n$ variables. Some properties of the data are required for the method to give the correct result. The data must be centered, meaning that the mean of the data should be zero. The data must also be normalized in the sense that the standard deviation of the data must be one [19].

The covariance matrix of this data matrix is given by

$$\Sigma = \frac{1}{n - 1} X^T X. \quad (2.20)$$

Solving for the eigenvalues of this matrix gives

$$\Sigma = V D V^{-1} = V D V^T, \quad (2.21)$$

where $D$ is a diagonal matrix containing the eigenvalues of $\Sigma$ and $V$ is a unitary matrix whose basis vectors are the corresponding eigenvectors. Since the principal components should be ordered according to variance influence, the eigenvalues must be sorted in descending order and the eigenvectors must be reordered accordingly. Listing 2.1 outlines how this can be done [19].

Sigma = 1 / (n - 1) * Transpose(A) * A

S, V = EigenValueDecomposition(Sigma)

Order = Sort(S)

Reorder(V, Order)

Listing 2.1: Covariance method pseudocode
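For reference, the covariance method can be sketched on the CPU with the Eigen3 library, which the thesis also uses for its CPU reference implementations. The snippet below is only a minimal illustration of the pseudocode in Listing 2.1, not the thesis code; it assumes that each row of A is one observation and that the data has already been centered and scaled.

// Minimal CPU sketch of the covariance method in Listing 2.1 using Eigen3
// (illustrative only, not the thesis implementation). Rows of A are assumed
// to be observations that are already centered and scaled.
#include <Eigen/Dense>

void covariance_pca(const Eigen::MatrixXf &A,
                    Eigen::VectorXf &eigenvalues,   // descending order
                    Eigen::MatrixXf &eigenvectors)  // matching columns
{
    const float n = static_cast<float>(A.rows());

    // Sigma = 1 / (n - 1) * A^T * A
    Eigen::MatrixXf Sigma = (A.transpose() * A) / (n - 1.0f);

    // Eigen returns the eigenvalues of a symmetric matrix in ascending order,
    // so reverse them (and the corresponding eigenvectors) to obtain the
    // descending order required by the PCA procedure.
    Eigen::SelfAdjointEigenSolver<Eigen::MatrixXf> solver(Sigma);
    eigenvalues  = solver.eigenvalues().reverse();
    eigenvectors = solver.eigenvectors().rowwise().reverse();
}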

2.3 Singular value decomposition based method

Consider the singular value decomposition of the data matrix $X$ of size $n \times m$, such that

$$X = U S W^T, \quad (2.22)$$

where $S$ is an $n \times m$ diagonal matrix containing the singular values of $X$, $U$ is an $n \times n$ unitary matrix and $W$ is an $m \times m$ orthogonal matrix.

The covariance matrix $X^T X$ can be written as

$$X^T X = W S^T U^T U S W^T = W S^T S W^T = W S^2 W^T, \quad (2.23)$$

which, by comparing to the eigenvalue decomposition of the covariance of $X$, gives that the right singular vectors of $X$ are equivalent to the eigenvectors of $X^T X$ and that the eigenvalues of $X^T X$ are given by the squared singular values of $X$.


This method of computing the singular values does not require calculating the covariance matrix $X^T X$. The singular value decomposition also places the singular values in descending order, avoiding a potentially time-consuming sorting step in the algorithm. Listing 2.2 shows how this can be done [19].

A_scaled = 1 / (n - 1) * A

U, S, VT = SingularValueDecomposition(A_scaled)

S = MultiplyElementWise(S, S)

Listing 2.2: SVD method pseudocode
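As with the covariance method, the SVD route can be sketched on the CPU with Eigen3. The snippet below is illustrative only and not the thesis implementation; it deviates slightly from the pseudocode by scaling the matrix with 1/sqrt(n - 1) so that the squared singular values directly equal the eigenvalues of the covariance matrix, and it assumes rows of A are centered, scaled observations.

// Minimal CPU sketch of the SVD method in Listing 2.2 using Eigen3
// (illustrative only, not the thesis implementation).
#include <Eigen/Dense>
#include <cmath>

void svd_pca(const Eigen::MatrixXf &A,
             Eigen::VectorXf &eigenvalues,     // descending order
             Eigen::MatrixXf &principal_axes)  // right singular vectors
{
    const float n = static_cast<float>(A.rows());
    Eigen::MatrixXf A_scaled = A / std::sqrt(n - 1.0f);

    // The singular values come out in descending order, so no sorting is needed.
    Eigen::JacobiSVD<Eigen::MatrixXf> svd(A_scaled,
                                          Eigen::ComputeThinU | Eigen::ComputeThinV);

    // Squared singular values of A/sqrt(n-1) are the covariance eigenvalues.
    eigenvalues    = svd.singularValues().cwiseProduct(svd.singularValues());
    principal_axes = svd.matrixV();
}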

2.4 NIPALS

NIPALS, which is short for Nonlinear Iterative PArtial Least Squares, is an iterative algorithm for finding the first few principal components by decomposing the matrix $X$ into two matrices $T$ and $P$ as described in [2]:

$$X = T P^T. \quad (2.24)$$

The columns of the matrix $T$ are called scores and the columns of the matrix $P$ are called loadings. Just as for the other implementations, it is assumed that the data matrix is pre-processed so that it is scaled to achieve a mean of zero and a standard deviation of one for all observations constituting the data matrix. In the beginning, a variable $h$ is set to 1 and $X_h = X$. The algorithm then follows these steps for each desired principal component:

1. Choose $t_h$ as the first column of $X_h$.

2. Compute the loadings $p_h$ given by $p_h = X_h^T t_h / (t_h^T t_h)$.

3. Normalize the loadings according to $p_h = p_h / \|p_h\|$.

4. Compute the scores $t_h$ given by $t_h = X_h p_h / (p_h^T p_h)$.

To visualize the algorithm, the steps are described below with figures. Step 2 is a regression of each column $X_k$ of $X_h$ onto the initial column $t_h$, and the regression coefficient of each regression is stored in the corresponding place in $p_h$, as can be seen in Figure 2.3.


Figure 2.3: NIPALS Loading Regression

The regression in this case is an ordinary least squares regression, which is given by

$$\beta = \frac{y^T x}{x^T x}. \quad (2.25)$$

This gives the formula used in step 2.

Step 4 is a regression of every row of the data matrix $X_h$ onto the normalized vector of loadings. Each regression coefficient now represents the score for the corresponding row, as can be seen in Figure 2.4.

Figure 2.4: NIPALS Score Regression

Once again the least squares method is used to perform the regression which, just as in the loading regression, gives the formula used in step 4.

Steps 2, 3 and 4 should be repeated until the scores converge, which in this case means that the difference between two iterations should be less than some parameter giving the tolerance of the algorithm. Increasing the tolerance decreases the run time of the algorithm but also decreases the accuracy of the results.

Finally, let $X_{h+1} = X_h - t_h p_h^T$. The eigenvalue is given by $\lambda_h = t_h^T t_h$. Increase $h$ by one and repeat the process until all the desired principal components have been calculated [2, 27].
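As a concrete illustration of the steps above, the following is a minimal CPU sketch of the NIPALS iteration using Eigen3. It is not the thesis implementation (which is listed in Appendix A); the tolerance and iteration limit are arbitrary example values, and X is assumed to be centered and scaled with observations as rows.

// Minimal NIPALS sketch using Eigen3 (illustrative only, not the thesis code).
#include <Eigen/Dense>
#include <vector>

std::vector<float> nipals(Eigen::MatrixXf X, int n_components,
                          float tolerance = 1e-6f, int max_iterations = 100)
{
    std::vector<float> eigenvalues;

    for (int h = 0; h < n_components; ++h)
    {
        Eigen::VectorXf t = X.col(0);                        // Step 1: initial scores
        Eigen::VectorXf p;

        for (int it = 0; it < max_iterations; ++it)
        {
            p = X.transpose() * t / t.squaredNorm();         // Step 2: loadings
            p.normalize();                                   // Step 3: normalize
            Eigen::VectorXf t_new = X * p / p.squaredNorm(); // Step 4: scores

            if ((t_new - t).norm() < tolerance)              // convergence check
            {
                t = t_new;
                break;
            }
            t = t_new;
        }

        X -= t * p.transpose();                 // deflate: X_{h+1} = X_h - t_h p_h^T
        eigenvalues.push_back(t.squaredNorm()); // lambda_h = t_h^T t_h
    }
    return eigenvalues;
}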

2.5 Relative error

The precision of the algorithms can be measured by using a reference algorithm implementation that is assumed to be correct. The relative error is defined as in equation 2.26 and describes how much a quantity differs from the true value in proportion to the size of the value. The largest deviation for matrices and vectors is taken as the largest deviating element of that matrix or vector. The maximum norm is used to find the largest error in a matrix, which is interesting due to the fact that the accuracy of the algorithms is often given in terms of a tolerance [12].

$$\delta_x = \max_{i,j} \left| \frac{X_{ij} - X^{\mathrm{ref}}_{ij}}{X_{ij}} \right| \quad (2.26)$$
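A minimal sketch of this measure, assuming the two matrices are stored as dense Eigen matrices of the same size (this is not the thesis code), could look as follows.

// Illustrative sketch of the relative error in equation 2.26 (not thesis code).
// Elements of X equal to zero would need special handling in practice.
#include <Eigen/Dense>
#include <algorithm>
#include <cmath>

float relative_error(const Eigen::MatrixXf &X, const Eigen::MatrixXf &X_ref)
{
    float max_error = 0.0f;
    for (int j = 0; j < X.cols(); ++j)
        for (int i = 0; i < X.rows(); ++i)
        {
            float err = std::fabs((X(i, j) - X_ref(i, j)) / X(i, j));
            max_error = std::max(max_error, err);
        }
    return max_error;
}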

2.6 Dimensionality reduction

Principal component analysis is often used to find a representation of the data expressed in a highly reduced dimensional space of dimensionality $p$ compared to the $n$-dimensional space of the original data. Geometrically, this can be thought of as fitting a $p$-dimensional ellipsoid to the data points in $n$-dimensional space. To construct a transformation from the original $n$-dimensional space to a lower $p$-dimensional space, a subset consisting of the first $p$ columns of $V$ given by the eigenvalue decomposition described by equation 2.21 is saved to a new matrix $K$ such that

$$K_{kl} = V_{kl} \quad \text{for } k = 1, \ldots, n \text{ and } l = 1, \ldots, p, \quad (2.27)$$

given that

$$1 \le p \le n. \quad (2.28)$$

It is of great interest to have a measurement of the amount of information preserved in the dimensionality reduction. Therefore, the notion of cumulative energy is introduced. The cumulative energy content of the eigenvectors gives an estimation of how much information is saved in a subset of the original data after the dimensionality reduction has been performed. The cumulative energy content of a subset consisting of $j$ eigenvectors is given by

$$g_j = \sum_{k=1}^{j} D_{kk}, \quad (2.29)$$

where $D$ is the diagonal $n \times n$ matrix of eigenvalues of the covariance matrix [24].
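The two definitions above can be combined into a small helper, sketched here with Eigen3 under the assumption that the eigenvalues are already sorted in descending order with matching eigenvector columns (illustrative only, not the thesis code).

// Illustrative sketch (not thesis code): select the first p eigenvectors as the
// projection matrix K from equation 2.27 and report the fraction of the total
// cumulative energy (equation 2.29) retained by those p components.
#include <Eigen/Dense>

float retained_energy(const Eigen::VectorXf &eigenvalues, // descending order
                      const Eigen::MatrixXf &V,           // matching eigenvectors
                      int p,
                      Eigen::MatrixXf &K)
{
    K = V.leftCols(p);                                     // equation 2.27
    return eigenvalues.head(p).sum() / eigenvalues.sum();  // g_p / g_n
}

A centered data matrix X with observations as rows could then be projected onto the reduced space as X * K.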

2.7 Anomaly detection

While the implementation of the anomaly detection step is not the main focus of this thesis, a solid foundation of the anomaly detection theory is still of interest to reason about the impact on the performance of the full application chain when performing the dimensionality reduction step. There exist a number of different ways to implement anomaly detection, and one of the simpler methods is k-Nearest Neighbour, which is described below.


k-Nearest Neighbour

k-Nearest Neighbour is an algorithm used to classify data. The input is generally a set of $n$ data points in $N$ dimensions and the output is a class membership for each of the points. The classification is performed by finding the $k$ closest points for each point. The distance between the points $\vec{p}$ and $\vec{q}$ is determined by using the $N$-dimensional Euclidean norm expressed in equation 2.30.

$$d(\vec{p}, \vec{q}) = \sqrt{\sum_{i=1}^{N} (p_i - q_i)^2} \quad (2.30)$$

The class membership can be determined in a number of different ways, but the most common is by a majority vote between the k neighbours of the data point, which means that the data point is assigned the class that most of the data point's neighbours are classified as.

The model is typically trained with labeled data. The training data consists of N-dimensional data points that already have a predetermined class membership, and the only step in the training phase is storing the feature vectors. For classifying the data, the set of feature vectors has to be iterated over and the distance to each point has to be calculated, which yields a time complexity of O(nNk), where n is the size of the prediction data set, N is the number of dimensions and k is the hyper-parameter for the algorithm specifying how many of the neighbouring points to consider in the algorithm [1].
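A minimal brute-force sketch of this classification step is given below. It is not the thesis code (anomaly detection is outside the thesis scope); the label representation and the majority vote are simplified, and k is assumed to be no larger than the training set.

// Illustrative brute-force k-NN classification sketch (not thesis code).
// Each training point carries a label; a query point is assigned the label
// that occurs most often among its k nearest training points.
#include <algorithm>
#include <map>
#include <utility>
#include <vector>

struct LabeledPoint {
    std::vector<float> features; // N-dimensional feature vector
    int label;
};

int knn_classify(const std::vector<LabeledPoint> &training,
                 const std::vector<float> &query, int k)
{
    // Squared Euclidean distance to every training point (O(nN) work).
    std::vector<std::pair<float, int>> dist_label;
    for (const LabeledPoint &p : training)
    {
        float d = 0.0f;
        for (size_t i = 0; i < query.size(); ++i)
        {
            float diff = p.features[i] - query[i];
            d += diff * diff;
        }
        dist_label.push_back({d, p.label});
    }

    // Keep only the k nearest neighbours (assumes k <= training.size()).
    std::partial_sort(dist_label.begin(), dist_label.begin() + k, dist_label.end());

    // Majority vote among the k neighbours.
    std::map<int, int> votes;
    for (int i = 0; i < k; ++i)
        votes[dist_label[i].second]++;

    int best_label = -1, best_count = 0;
    for (const auto &v : votes)
        if (v.second > best_count) { best_count = v.second; best_label = v.first; }
    return best_label;
}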

2.8 General purpose GPU computing

The GPU, or graphics processing unit, was originally, as the name suggests, developed and optimized for handling computations related to graphics. However, it turns out that GPUs may also be used for other problems, which has spawned a new branch of computing named general purpose GPU computing, or GPGPU computing for short. Since graphics computing is in general very parallel, GPUs have been designed to handle much data at the same time, provided that the same operations are applied to all the data. A large speedup compared to CPU implementations may be achieved even for other problems that are easy to parallelize [25].

CUDA

The speed of the program is only increased if the program can exploit the parallelism in the problem and efficiently use the parallelism provided by the underlying hardware. CUDA, which stands for Compute Unified Device Architecture, is a platform for parallel computing and a programming model designed by NVIDIA for use with NVIDIA graphics cards, allowing developers to access many of the GPU computing features and the virtual instruction set for use in general purpose computing. The CUDA environment comes with a special compiler that is able to translate C-like code to GPU instructions [23].


Control Architecture

The CUDA GPU is a single instruction, multiple thread (SIMT) computing device. That means that a thread is the fundamental unit in the CUDA programming model. Each thread computes the same instructions with a specific subset of the data. The instructions to be executed constitute a kernel, which in loose terms is a function that is run on the GPU. The threads in the CUDA model are hierarchically organized in a grid. The grid consists of a 3-dimensional array of blocks and each block consists of a 3-dimensional array of threads. Each thread is given a unique thread ID and a block ID so that the thread can be differentiated from other threads. The maximum number of threads and blocks is given by the hardware used, and may be queried at run-time for optimized performance. An illustration, taken from [8], of the grid, blocks and threads can be seen in Figure 2.5; however, it should be noted that it is possible to make use of 3-dimensional blocks and grids [23].


Figure 2.5: CUDA Grid of Thread Blocks

Kernel

A simple CUDA kernel is listed in code listing 2.3. The kernel uses the thread index threadIdx to access an element in the array of floating point numbers named in. The function func is applied to the element and the result is stored in the array named out at the same index, forming a simple map-kernel. When the kernel is launched, the developer selects how many threads should be spawned and the layout of the threads in the 3-dimensional block. In this case, only the x-dimension of the thread index is used, meaning that the device will access the memory as if the memory is a one-dimensional array. The __global__ keyword indicates to the NVIDIA preprocessor and compiler that the function is a kernel: it is compiled as device code but can be launched from host code [23].


__global__
void simple_kernel(float *in, float *out)
{
    float element = in[threadIdx.x];
    out[threadIdx.x] = func(element);
}

Listing 2.3: Simple CUDA kernel

A more advanced CUDA kernel is listed in code listing 2.4. The kernel is using a two-dimensional thread index to access elements from row-major matrices. The kernel adds the matrices A and B, and the result is stored in matrix C. The kernel makes use of multiple blocks to be able to access very large matrices. The size of the blocks and the number of blocks to use is specified by the developer when launching the kernel. The current size is given in the special variable blockDim, giving the size in each dimension of the blocks. The special variable blockIdx gives the current block index for the spawned thread.

__global__
void matrix_add(float *A, float *B, float *C,
                int sizeX, int sizeY)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    if (x >= sizeX || y >= sizeY)
        return;

    int offset = x + y * sizeX;

    C[offset] = A[offset] + B[offset];
}

Listing 2.4: Advanced CUDA kernel

Memory Architecture

A solid understanding of the CUDA memory model is needed to fully utilize the performance of the graphics processing unit. The host and the device do not share memory, so a transfer from and to the device memory is needed each time a computation is to be made. There are two main levels of memory that are discussed here: the global memory and the shared memory. The global memory is typically large but slow, with access times of hundreds of cycles. The shared memory is smaller but typically much faster. The shared memory is local to each block in the grid, meaning that the memory can be shared between threads in the same block. This provides a large optimization opportunity for programs where each thread has many memory accesses with high spatial locality.
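To illustrate the idea, the kernel below stages a tile of the input in shared memory before using it. It is a generic example and not taken from the thesis implementation; the tile size and the 3-point average it computes are arbitrary choices.

// Illustrative example of using shared memory (not from the thesis): each
// block stages a tile of the input in fast shared memory, synchronizes, and
// then computes a simple 3-point average that reuses neighbouring elements.
// Assumes the kernel is launched with at most 256 threads per block.
__global__
void smooth_kernel(const float *in, float *out, int n)
{
    __shared__ float tile[256];

    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Stage one element per thread in shared memory (guarded for the last block).
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    if (i >= n)
        return;

    // Neighbouring reads now hit fast shared memory instead of global memory.
    // For simplicity, tile and array edges just reuse the centre element.
    float left  = (threadIdx.x > 0) ? tile[threadIdx.x - 1] : tile[threadIdx.x];
    float right = (threadIdx.x + 1 < blockDim.x && i + 1 < n)
                      ? tile[threadIdx.x + 1] : tile[threadIdx.x];

    out[i] = (left + tile[threadIdx.x] + right) / 3.0f;
}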


GPU work flow

In a typical program utilizing the CUDA functionality, the following work flow is used:

1. Allocate device memory on the GPU
2. Copy the data from the CPU to the GPU
3. Execute the kernel
4. Copy the data from the GPU to the CPU
5. Free the allocated device memory

If the program is intended to run multiple times, step 2 to step 4 may be looped, executing step 1 and step 5 only once for the entire program to increase the performance of the application.

Listing 2.5 lists a simple host program highlighting all the above steps in the typical workflow. The host program is the program running on the CPU that handles the control flow of the GPU interface, such as memory transfers and kernel executions. Special functions are used to allocate and free device memory as well as handling the data transfer from and to the device. In this case, the simple kernel from Listing 2.3 was launched with a single block and 256 threads.


#include <stdlib.h>
#include <cuda_runtime.h>

int main()
{
    int N = 256;
    float *in, *out, *d_in, *d_out;
    in = (float *)malloc(N * sizeof(float));
    out = (float *)malloc(N * sizeof(float));

    // 1. Allocate device memory
    cudaMalloc((void **)&d_in, N * sizeof(float));
    cudaMalloc((void **)&d_out, N * sizeof(float));

    // Fill out the values somehow
    fill_values(N, in);

    // 2. Copy the input data
    cudaMemcpy(d_in, in, N * sizeof(float), cudaMemcpyHostToDevice);

    // 3. Launch the kernel
    simple_kernel<<<1, N>>>(d_in, d_out);

    // 4. Copy the output data
    cudaMemcpy(out, d_out, N * sizeof(float), cudaMemcpyDeviceToHost);

    // 5. Free the allocated device memory
    cudaFree(d_in);
    cudaFree(d_out);

    free(in);
    free(out);
}

Listing 2.5: Simple host program

3 Related work

This chapter presents related work implementing variations of the principal component analysis.

3.1 Principal component analysis on alternate hardware

Zhang and Lim [29] implement the three methods SVD, covariance and NIPALS in NVIDIA CUDA for hyperspectral images and compare the performance of the algorithms. It is concluded that the covariance method of implementing PCA in CUDA has great potential in reaching the required real-time performance for the given data, which consists of large amounts of images. The hardware used was an NVIDIA GTX580 graphics processor. The authors identify two bottlenecks which limit the performance of the algorithm: the data transfer to the GPU and the size of the GPU memory. Both these bottlenecks are limited by hardware, but can be mitigated by alternative representations of the data or implementations of the algorithm that better utilize the graphics memory and the memory bus. The presented solution is also limited to single GPU jobs due to the early version of CUDA used, which leads to another improvement opportunity. No means of gathering the data was discussed, which also may affect the performance of the system competing for processing time.

An optimized real-time implementation of the principal component analysis algorithm on a floating point DSP processor for processing of spatial and temporal data is described in [16]. The correctness and the performance of the algorithm is shown by implementing sinusoidal separation and blind source separation. The algorithm is intended to run at each frame, recalculating the principal components each time. This is an alternate hardware not discussed in this thesis, but it may still be interesting to see how the algorithm may be implemented on different types of hardware.


3.2 Principal component analysis use cases

A method for using principal component analysis to monitor the respiratory motion with the help of a color and depth camera is outlined in [28]. It is done by sampling depth images with 640x480 pixels resolution at 6.7 Hz of the patient for 100 frames, and then applying the PCA algorithm to analyze the data. The experimental analysis showed that an accurate model for the respiratory motion can be found using only the first 3 principal components. The subsequent measurements are projected onto the principal component space, which gives a much more convenient data size to work with. The PCA algorithm is only applied on the data from the first 100 frames. The data from the subsequent frames are then projected onto the principal component space due to the assumption that the normal operation of the monitored system does not vary significantly over time. This may be an interesting approach for speeding up the program as for some types of data, the algorithm for finding a new transform may not need to be performed at every iteration.

[6] describes a method for controlling combustion in real time in a diesel engine with reconstructed in-cylinder pressure traces. Using this method, only five principal components were needed to reconstruct the data with less than 1% root mean squared percent error, and the control performance increased with decreased variance in the control variables. This shows that it is possible to reduce the dimensionality of the data set drastically and still maintain a high amount of information.

3.3 Variations of principal component analysis

Principal component analysis is widely used in face recognition algorithms due to its ability to track important features in the data set. A method for using a variation of PCA to improve both recognition rate and image reconstruction performance is presented in [30]. The article [21] presents an alternate method which also builds on principal component analysis to improve the amount of usable information available for face recognition. The article [15] outlines multiple methods for using principal component analysis with reference images of lower quality, large differences in lighting and other hindering properties, with reasonable performance on a graphics card using CUDA.

Chen et al. [5] describe an implementation of a variation of principal component analysis called deep principal component analysis, which is used to detect faults and diagnose electrical drive systems in high speed trains. PCA divides the original data space into two orthogonal spaces, which are the principal component space (PCS) and the residual space (RS), where the dimensionality of the spaces depends on how much dimensionality reduction is desired for the given problem. Deep PCA further divides the obtained PCSs and RSs into principal component spaces and residual spaces. This means that the original data is divided into $2^j$ subspaces, where $j$ is the order of the deep principal component analysis algorithm. Increasing the order of the algorithm allows weaker fault signals to be extracted properly. Fault detection is performed by developing a statistical null hypothesis which defines an acceptance region for normal system behaviour. The hypothesis can then be tested to determine whether a fault has occurred and, in that case, the cause of the fault.


[10] presents a way of tracking dynamic time-varying signals with a variation of the principal component analysis that can generate a number of time series that maximize the correlation with past data. The paper describes how this dynamic method can detect faults during process monitoring by comparing the data from the process with the model prediction. An alternate way of solving the same problem is described in [18]. This paper instead uses independent component analysis, which has proven to be useful when dealing with data from processes that are non-Gaussian.

[31] presents a method for integrating data sets from a chemical process where the data have different sampling rates. This is a common case since modern chemical processes require many different kinds of sensors to measure all the relevant variables in the system. The method is called multiple probability principal component analysis. The paper also outlines how to set up statistics for a system with multiple sampling rates to be able to detect and predict fault conditions in the process monitoring scenario.

Principal component analysis performs poorly in terms of precision if the anomalies are large compared to the normal data [4]. The article describes a variation of the PCA algorithm called Robust PCA which assumes that the data matrix containing the data is a superposition of a low-rank matrix and a sparse matrix. In this case the faults can be arbitrarily large, for example in the case of a sensor failure, with minimal effect on the prediction accuracy. [14] presents a method based on matrix factorization which can efficiently deal with large amounts of data using this robust PCA method.

Gajjar et al. [13] describe how principal component analysis can be used to compress information from sensors deployed in factories with minimum information loss by introducing sparse principal component analysis. Sparse principal component analysis is a variation of principal component analysis where some of the principal component loadings, which are the weights of the projection, are restricted to zero. This allows for easier interpretation of the principal components while minimizing the loss in variance. The paper introduces a method for specifying the number of non-zero loadings.

Not all of these variations will be studied in this thesis; however, they are still relevant if the work is to be continued for a specific type of data with special properties.

4 Method

This chapter presents the system setup and what methods have been used to implement and evaluate the system.

4.1 General system layout

To be able to test the full system in a realistic fashion, a system has been developed to give a realistic representation and mock-up model of the data collection and streaming performed by Cybercom's Internet of Things sensors and data management systems. The developed system consists of two distinct programs, separated by a network connection as illustrated in Figure 4.1. One of the programs acts as the client and the other program acts as the server.


Figure 4.1: Client-Server relation

The client program is responsible for the sound sampling from the set of microphones. Figure 4.2 shows the flow diagram of the client process. The sampling is performed with the help of OpenAL [17], which is a specification for input and output of sound implemented on many systems. The client is also responsible for normalizing the data, making sure that the input data to the algorithm implementations is formatted correctly and that the mean is zero and the standard deviation is one. This is very advantageous from a performance perspective, since the sound sampling is performed asynchronously in the client system, which means that the client is idle during a large portion of the time. When the data is normalized, the data is packaged into matrices which are then compressed using numerical floating point compression methods from the ZFP package and sent over the network to the server.

Sampling → Pre-processing → Compression

Figure 4.2: Client flow diagram

The server program is responsible for the GPU processing of the data. Figure 4.3 shows the flow diagram for the server process. When the data reaches the server, the data is first unpacked and decompressed. It is then fed to the chosen algorithm. The desired implementation is selected by passing a flag to the server program. If no flag is set, the program changes the implementation at run-time to find the optimal implementation for the current data. The outputs of the system are the principal components and the transform for the projection to the lower-dimensional data. The projection may then be used to project the original data to the lower-dimensional representation and perform the anomaly detection using that data.

Decompression → GPU Processing

Figure 4.3: Server flow diagram

4.2 Data format

When the data is sampled by OpenAL, the library converts the samples to signed two-byte integers representing the waveform of the sound signal. The samples are converted to floating point values, and during the conversion the data is also normalized to a range of -1 to 1. A set of samples is then packed into a frame, which is essentially a small slice of the sound wave. The number of samples per frame is a configuration parameter that can be tuned for optimal performance. When a number of frames have been gathered, the frames are packed into a matrix, which in later steps is referenced as the data matrix. This is the matrix that is compressed and sent over the network to the server. The application is optimized to handle data packed in this format; however, any similar data format should be handled well by the application.

All subsequent steps are performed using matrices and vectors of floating point numbers representing both the data matrix and the outputs of the algorithms. Due to the libraries used, the matrices are represented as column-major, meaning that all elements that belong to the same column are stored in contiguous memory.
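A minimal sketch of this conversion and packing step is shown below. It is not the thesis implementation: the divisor used for normalization and the frame-per-column layout are assumptions made for illustration.

// Illustrative sketch (not thesis code) of converting signed 16-bit samples to
// normalized floats and packing them into a column-major data matrix, with one
// frame of samples per column. Eigen matrices are column-major by default,
// matching the layout described above.
#include <Eigen/Dense>
#include <cstdint>
#include <vector>

Eigen::MatrixXf pack_frames(const std::vector<int16_t> &samples,
                            int samples_per_frame)
{
    int frames = static_cast<int>(samples.size()) / samples_per_frame;
    Eigen::MatrixXf data(samples_per_frame, frames);

    for (int f = 0; f < frames; ++f)
        for (int s = 0; s < samples_per_frame; ++s)
            // Scale the 16-bit range to approximately [-1, 1].
            data(s, f) = samples[f * samples_per_frame + s] / 32768.0f;

    return data;
}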

4.3 GPU processing

Just as described in Section 2.8, kernels are needed to process the data on the GPU. A kernel is a program running on the GPU instead of the CPU. The CPU prepares the data and puts the data in special memory buffers designed to communicate between the GPU memory and the CPU memory. The kernel is then invoked using the desired thread count and block count parameters. All this is abstracted into a data type representing a matrix. All GPU algorithms are then implemented using this matrix data type as a foundation. The cuSOLVER library has been adopted for the more computationally expensive operations, such as singular value decomposition and eigenvalue decomposition. This library contains optimized kernels for these operations.

4.4 Latency

The latency of the implementation is defined as the time from the time point of the first sample of an iteration to the time point when the algorithm has finished processing all data from that iteration. The latency can be divided into three terms,

$$T = T_s + T_c + T_r, \quad (4.1)$$

where $T$ is the total latency of the system. $T_s$ is the total sample time of the iteration, which depends on how large the data matrix is and on the sampling frequency. $T_c$ is the latency of the communication channel used for sending the data from the sampling client to the processing server, which depends on the distance between the nodes and the means of communication. $T_r$ is the run time, which depends on the implementation that is currently used.
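As a rough, illustrative calculation (under the assumption, not stated in the thesis, that one iteration with a single sensor fills an $N \times N$ data matrix with $N^2$ consecutive samples taken at sampling frequency $f_s$), the sampling term alone would be

$$T_s = \frac{N^2}{f_s},$$

so for $N = 512$ and $f_s = 44100\ \mathrm{Hz}$ this already amounts to roughly $262144 / 44100 \approx 5.9$ seconds, suggesting that for large matrices the sampling time rather than the GPU run time may dominate the total latency.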

4.5 Evaluation

The computer that is used for the testing is equipped with an NVIDIA GTX1080 graphics card and an Intel Core i7-6700K CPU, which is a 4-core processor with hyper-threading running at 4 GHz. Both the client and the server are running on the same computer, using separate processes.

To assert the correctness of the implementations, a reference implementation of the algorithms in Matlab has been developed. This reference implementation can save all matrices to files that can then be loaded into the developed CUDA program for testing and comparison. When the correctness of the implementation has been verified using the reference implementation, a real-time performance test is performed with the developed test bench, which is a script launching the setup for different configurations and reporting the time and energy consumption. The time and power consumption of each implementation is tested while varying a set of parameters. The set of parameters consists of the size of the matrices, the number of sensors and the sampling frequency of the sound data that constitutes the input data. The speed, energy consumption, accuracy and latency of the system vary with the size of the data set and the algorithm used. The application usually has to choose a few of these parameters to prefer, weighing the various parameters that affect the performance of the application and thus trading off some of the parameters for others. This introduces the idea of switching the algorithm used at run-time by defining what parameters to prefer. This is implemented with a scheme testing all implementations available and then choosing the best fit implementation according to some function, which in this case is simply the fastest implementation. Other functions may be of interest to some types of applications. This is not studied further in this thesis.

Additional parameters have been studied for the NIPALS algorithm: the number of components extracted and the accumulated variability of the principal components. As the NIPALS approach is iterative, any given number of components can be extracted, giving the application the ability to scale the performance of the implementation.

All implementations have corresponding CPU implementations for comparison, which are tested using the same tests as the GPU implementations in terms of speed and energy for consistency in the results.
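As a side note on measuring the GPU portion of the run time, CUDA events are a common generic mechanism for this; the fragment below is only an illustration and is not how the thesis test bench performs its measurements, which are done with MeterPU for both time and energy.

// Generic sketch of timing one GPU iteration with CUDA events (illustrative
// only; the thesis uses MeterPU). Assumes cuda_runtime.h is included.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
// ... launch the PCA kernels for one iteration here ...
cudaEventRecord(stop);

cudaEventSynchronize(stop);
float milliseconds = 0.0f;
cudaEventElapsedTime(&milliseconds, start, stop);

cudaEventDestroy(start);
cudaEventDestroy(stop);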

4.6 Implementations

The implementations are built using abstractions for the common and basic operations, which in turn launch the appropriate kernels on the GPU. The GPU implementations of the SVD and the covariance method are described in detail below. The code for the NIPALS procedure is moved to Appendix A due to code size. The code for the GPU implementations uses a data structure called CudaMatrix, which abstracts the copying of the data to and from the GPU memory and the calls to the underlying API functions for performing the matrix operations. These operations employ a copy-when-needed strategy for minimizing the number of needed copy operations. It is also possible to explicitly issue a copy command to and from the device memory using the cuda_matrix_copy_to_device() and cuda_matrix_copy_to_host() routines.

SVD

The data is copied to the matrix host vector, which is then synchronized to the device memory. The factor multiplication is then performed, multiplying the matrix by $\frac{1}{N-1}$, which has been precomputed and placed in the factor scalar variable. The SVD decomposition is then performed, and to obtain the principal components, each singular value must be squared. The relevant host code is listed in code listing 4.1. The multiplication is performed with a pre-allocated work matrix W, which is an optimization to avoid copying as much data as possible to and from the GPU.


CudaMatrix *A = cuda_matrix_create(nx, ny);   // Input matrix
CudaMatrix *Ain = cuda_matrix_create(nx, ny); // Intermediate matrix
CudaMatrix *W = cuda_matrix_create(nx, ny);   // Work matrix for the matrix multiplication
CudaMatrix *S = cuda_matrix_create(nx, 1);    // SVD Sigma values
CudaMatrix *U = cuda_matrix_create(nx, nx);   // SVD U matrix
CudaMatrix *VT = cuda_matrix_create(ny, ny);  // SVD V^T matrix

// Copy the packet data to the A matrix
memcpy(A->host_v, data, nx * ny * sizeof(float));
cuda_matrix_copy_to_device(A);

// Perform the scalar multiplication
cuda_matrix_mult_s_w(A, factor, Ain, W);

// Perform the SVD
cuda_matrix_svd(Ain, S, U, VT);

// Square the singular values on the GPU
cuda_matrix_mult_elem(S, S);

Listing 4.1: SVD implementation

Covariance

Just like in the SVD implementation, the data is copied and synchronized to the device. The matrix multiplication $AA^T$ is then performed and the result is stored in Aevd. The eigenvalue decomposition is calculated for the Aevd matrix. The eigenvalues must be sorted in descending order according to the principal component analysis procedure, and the matrix of eigenvectors must be rearranged to the same order as the eigenvalues. The sorting is here performed on the CPU, as the functionality of sorting the numbers and at the same time extracting the order was not implemented for the CudaMatrix type. The order is a vector of initially ordered indices that are permuted in the same way as the eigenvalue vector. The eigenvector matrix may then use this vector of indices to reorder the matrix by simply swapping columns of the matrix according to the order vector. The relevant host code for the covariance implementation is listed in code listing 4.2.


// Copy the packet data to the A matrix
memcpy(A->host_v, data, nx * ny * sizeof(float));
cuda_matrix_copy_to_device(A);

// Aevd = A * A^T
cuda_matrix_mult_m(A, A, Aevd, false, true);

cuda_matrix_evd(Aevd, S);

cuda_matrix_copy_to_host(S);

sort(S->host_v, S->host_v + nx, order);

cuda_matrix_copy_to_device(order);

// Reorder the transform to match the order of the eigenvalues
cuda_matrix_reorder(Aevd, order);

Listing 4.2: Covariance implementation

NIPALS

The code for the NIPALS implementation is listed in Appendix A. The code first creates a few intermediate matrices and floats that are to be used for the algorithm. Then, for each of the needed principal components, the regression steps are performed. The comments describe what high level operation is performed by the function calls. Each step of the inner iteration performs the regression to compute the new loadings and then the regression to compute the new scores. The new scores are then compared to the old scores to find if the convergence criterion is fulfilled. When the implementation has determined that the difference between the old scores and the new scores is sufficiently small, or that a certain number of iterations have been performed, the iteration for that particular principal component stops. The algorithm then updates the source matrix and stores the variance as a result for that iteration and, if more principal components are requested, continues with the next component.

Real-time switching

The implementation of the real-time switching between the algorithms is very simple and intended as a proof of concept for applications that are interested in always finding the fastest algorithm for the current input data. The code is listed in Listing 4.3. The switching will loop through each of the currently registered implementations every 1000 iterations and find the implementation that has the fastest time. That implementation will then be executed 1000 times and a new decision will be made. This will continue until the program has been requested to exit. This assumes, however, that the performance for each iteration does not change fast, as it takes 1000 iterations until a new method is evaluated.


while (!shouldExit)
{
    float fastest_time = std::numeric_limits<float>::max();
    int fastest_impl = -1;

    // Loop through each implementation
    for (int i = 0; i < N_IMPLEMENTATIONS; ++i)
    {
        // Find the run-time of the implementation
        float impl_time = run_once(impl[i], data);

        // Store the current implementation if it was faster than all the previous
        if (impl_time < fastest_time)
        {
            fastest_time = impl_time;
            fastest_impl = i;
        }
    }

    // Make sure that we have found an implementation
    assert(fastest_impl >= 0);

    // Run the found implementation for 1000 iterations.
    run_N(impl[fastest_impl], data, 1000);
}

Listing 4.3: Real-time switching

5 Results

This chapter presents the results that have been obtained.

5.1 Implementation

Three implementations of the principal component analysis have been developed for the GPU: SVD-based PCA, covariance-based PCA and NIPALS-based PCA. CPU implementation variants have been developed for all three GPU implementations as reference. All implementations have been designed to interface with Cybercom’s data sampling and streaming system as described in Section 4.1. The GPU implementations have been developed in CUDA with the help of cuBLAS and cuSOLVER, which are CUDA libraries providing basic linear algebra operations and decomposition solvers. The CPU implementations have been developed with the help of the Eigen3 library, which provides sparse and dense vector and matrix operations for the CPU [7, 9, 11].
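As an illustration of how cuSOLVER enters the GPU implementations, the sketch below computes the singular values of a column-major m × n matrix (with m ≥ n) using cusolverDnSgesvd. It is a simplified stand-alone example under those assumptions and not the thesis code, which additionally handles centering, projection onto the principal components and error checking.

#include <cuda_runtime.h>
#include <cusolverDn.h>
#include <vector>

// Sketch: full SVD of a column-major m x n matrix with cuSOLVER, returning
// the singular values. Error checking is omitted for brevity.
std::vector<float> gpu_singular_values(const std::vector<float> &hostA, int m, int n)
{
    cusolverDnHandle_t handle;
    cusolverDnCreate(&handle);

    float *dA, *dS, *dU, *dVT, *dWork;
    int *devInfo;
    cudaMalloc((void **)&dA, sizeof(float) * m * n);
    cudaMalloc((void **)&dS, sizeof(float) * n);
    cudaMalloc((void **)&dU, sizeof(float) * m * m);
    cudaMalloc((void **)&dVT, sizeof(float) * n * n);
    cudaMalloc((void **)&devInfo, sizeof(int));
    cudaMemcpy(dA, hostA.data(), sizeof(float) * m * n, cudaMemcpyHostToDevice);

    int lwork = 0;
    cusolverDnSgesvd_bufferSize(handle, m, n, &lwork);
    cudaMalloc((void **)&dWork, sizeof(float) * lwork);

    // 'A' requests the full U and V^T factors; the singular values end up in dS.
    cusolverDnSgesvd(handle, 'A', 'A', m, n, dA, m, dS, dU, m, dVT, n,
                     dWork, lwork, nullptr, devInfo);

    std::vector<float> S(n);
    cudaMemcpy(S.data(), dS, sizeof(float) * n, cudaMemcpyDeviceToHost);

    cudaFree(dA); cudaFree(dS); cudaFree(dU); cudaFree(dVT);
    cudaFree(dWork); cudaFree(devInfo);
    cusolverDnDestroy(handle);
    return S;
}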

5.2 Performance

The time and energy consumption per iteration have been measured for sampling frequencies of 44100, 88200, 96000 and 192000 Hz, which are all common standardized sampling frequencies for different applications and sound qualities. The tests have been performed with 1, 2, 3 and 4 sensors, which yields data matrices of size N × (N · Ns), where N is the base size of the matrix in the current test and Ns is the number of sensors used in the test. The base size of the matrices in the tests consists of all powers of two from 32 to 512 inclusive. Each test has been run 10 times and the average run-time for each data point has been used as the result. The time and energy consumption were measured using MeterPU [20].
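The averaging over repeated runs follows a standard pattern. The sketch below illustrates it with std::chrono as a stand-in timer; the thesis itself uses MeterPU for both the time and the energy measurements, so the helper and its names here are only an assumed illustration of the methodology, not the measurement code that produced the results.

#include <chrono>
#include <functional>

// Stand-in helper (assumed, not the thesis code): run one PCA iteration
// 'runs' times and return the average wall-clock time per iteration in
// microseconds.
double average_time_us(const std::function<void()> &run_iteration, int runs = 10)
{
    using clock = std::chrono::steady_clock;

    const auto start = clock::now();
    for (int i = 0; i < runs; ++i)
    {
        run_iteration();   // one full PCA computation on one data matrix
    }
    const auto stop = clock::now();

    const std::chrono::duration<double, std::micro> elapsed = stop - start;
    return elapsed.count() / runs;
}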


Time

Figures 5.1, 5.2, 5.3 and 5.4 show the time consumption of the GPU and the CPU implementations of the principal component algorithms running with one sensor and a sampling frequency of 44100, 88200, 96000 and 192000 Hz respectively.

[Plot: Time [µs] versus N, with a Deadline reference line, for the GPU implementations (SVD, Covariance, NIPALS) in one panel and the CPU implementations (SVD(CPU), Covariance(CPU), NIPALS(CPU)) in the other.]

Figure 5.1: 1 Sensor 44100 Hz Time Consumption

Figure 5.2: 1 Sensor 88200 Hz Time Consumption


Figure 5.3: 1 Sensor 96000 Hz Time Consumption


Figure 5.4: 1 Sensor 192000 Hz Time Consumption

Figures 5.5, 5.6, 5.7 and 5.8 show the time consumption of the GPU and the CPU implementations of the principal component algorithms running with two sensors and a sampling frequency of 44100, 88200, 96000 and 192000 Hz respectively.


Figure 5.5: 2 Sensors 44100 Hz Time Consumption


Figure 5.6: 2 Sensors 88200 Hz Time Consumption

Figure 5.7: 2 Sensors 96000 Hz Time Consumption


Figure 5.8: 2 Sensors 192000 Hz Time Consumption

Figures 5.9, 5.10, 5.11 and 5.12 show the time consumption of the GPU and the CPU implementations of the principal component algorithms running with three sensors and a sampling frequency of 44100, 88200, 96000 and 192000 Hz respectively.

Figure 5.9: 3 Sensors 44100 Hz Time Consumption


Figure 5.10: 3 Sensors 88200 Hz Time Consumption


Figure 5.11: 3 Sensors 96000 Hz Time Consumption

Figure 5.12: 3 Sensors 192000 Hz Time Consumption


Figures 5.13, 5.14, 5.15 and 5.16 show the time consumption of the GPU and the CPU implementations of the principal component algorithms running with four sensors and a sampling frequency of 44100, 88200, 96000 and 192000 Hz respectively.


Figure 5.13: 4 Sensors 44100 Hz Time Consumption

Figure 5.14: 4 Sensors 88200 Hz Time Consumption


Figure 5.15: 4 Sensors 96000 Hz Time Consumption


Figure 5.16: 4 Sensors 192000 Hz Time Consumption

Energy

Figures 5.17, 5.18, 5.19 and 5.20 show the energy consumption of the GPU and the CPU implementations of the principal component algorithms running with one sensor and a sampling frequency of 44100, 88200, 96000 and 192000 Hz respectively.


[Plot: Energy [mJ] versus N for the GPU implementations (SVD, Covariance, NIPALS) in one panel and the CPU implementations (SVD(CPU), Covariance(CPU), NIPALS(CPU)) in the other.]

Figure 5.17: 1 Sensor 44100 Hz Energy Consumption


Figure 5.18: 1 Sensor 88200 Hz Energy Consumption

Figure 5.19: 1 Sensor 96000 Hz Energy Consumption


Figure 5.20: 1 Sensor 192000 Hz Energy Consumption

Figures 5.21, 5.22, 5.23 and 5.24 show the energy consumption of the GPU and the CPU implementations of the principal component algorithms running with two sensors and a sampling frequency of 44100, 88200, 96000 and 192000 Hz respectively.

Figure 5.21: 2 Sensors 44100 Hz Energy Consumption


Figure 5.22: 2 Sensors 88200 Hz Energy Consumption


Figure 5.23: 2 Sensors 96000 Hz Energy Consumption

Figure 5.24: 2 Sensors 192000 Hz Energy Consumption


Figures 5.25, 5.26, 5.27 and 5.28 show the energy consumption of the GPU and the CPU implementations of the principal component algorithms running with three sensors and a sampling frequency of 44100, 88200, 96000 and 192000 Hz respectively.


Figure 5.25: 3 Sensors 44100 Hz Energy Consumption

Figure 5.26: 3 Sensors 88200 Hz Energy Consumption


Figure 5.27: 3 Sensors 96000 Hz Energy Consumption


Figure 5.28: 3 Sensors 192000 Hz Energy Consumption

Figures 5.29, 5.30, 5.31 and 5.32 show the energy consumption of the GPU and the CPU implementations of the principal component algorithms running with four sensors and a sampling frequency of 44100, 88200, 96000 and 192000 Hz respectively.


Figure 5.29: 4 Sensors 44100 Hz Energy Consumption


Figure 5.30: 4 Sensors 88200 Hz Energy Consumption

Figure 5.31: 4 Sensors 96000 Hz Energy Consumption

