Matrix Multiplications on Apache Spark through GPUs

ARASH SAFARI

Master in Computer Science
Date: June 28, 2017
Supervisor: Per Austrin
Examiner: Hedvig Kjellström
Swedish title: Matrismultiplikationer på Apache Spark med GPU

KTH ROYAL INSTITUTE OF TECHNOLOGY
School of Computer Science and Communication


Abstract

In this report, we consider the distribution of large-scale matrix multiplications across a group of systems through Apache Spark, where each individual system utilizes Graphical Processing Units (GPUs) to perform the matrix multiplication. The purpose of this thesis is to research whether the GPU's advantage in performing parallel work can be applied to a distributed environment, and whether it scales noticeably better than a CPU implementation in a distributed environment.

This question was answered by benchmarking the different implementations under their most favourable configurations. Based on these benchmarks, it was concluded that GPUs do indeed perform better, as long as single precision support is available in the distributed environment. When single precision operations are not supported, GPUs perform much worse due to the low double precision performance of most GPU devices.


Sammanfattning

In this report, we consider the distribution of large-scale matrix multiplications across an Apache Spark cluster, where each system in the cluster delegates the computations to Graphics Processing Units (GPUs). The purpose of this thesis is to examine whether the GPU's advantage in parallel work can be applied to a distributed environment, and whether it scales noticeably better than a CPU implementation in a distributed environment.

This was done by testing the different implementations in an environment where optimal performance could be expected. Based on the results of these tests, it was concluded that GPU devices perform better than CPU devices as long as the framework supports single precision computations. When this is not the case, most GPU devices perform considerably worse due to their low double precision performance.


Contents

1 Introduction
1.1 Motivation and Aim
1.2 Environmental and ethical concerns
1.3 Problem Definition
1.4 Previous Studies
1.4.1 GPU computing
1.4.2 Spark & GPU
1.4.3 Delimitation
1.5 Problem Statement

2 Background
2.1 Linear Algebra
2.1.1 Matrix Multiplications
2.1.2 Partitioned Matrix Multiplication
2.1.3 Parallel matrix multiplication
2.1.4 BLAS library
2.2 Graphical Processing Units
2.2.1 GPU architecture
2.2.2 CUDA
2.2.3 GPU Limitations
2.3 Spark
2.3.1 Spark data management
2.3.2 Spark Resource management
2.3.3 MLlib
2.4 Miscellaneous
2.4.1 Netlib
2.4.2 Native BLAS Libraries
2.4.3 Garbage collection
2.5 Performance Optimization

3 Methodology
3.1 Testing environment
3.1.1 Setup
3.2 Optimization Testing
3.2.1 Partition Testing
3.2.2 Executor testing
3.2.3 Memory Management Testing
3.2.4 Garbage Collection
3.3 Scalability Testing
3.4 Spark & Single Precision Operations

4 Results
4.1 Optimization Test Results
4.1.1 Data Partitioning
4.1.2 Cores & Executors
4.1.3 Memory Management
4.1.4 JVM options
4.2 Scalability Testing
4.2.1 Optimal Environment Evaluation
4.2.2 OpenBLAS Scaling
4.2.3 NVBLAS Scaling
4.2.4 Comparison Results

5 Discussion
5.1 Speculations and Conclusions
5.1.1 Performance
5.1.2 Cluster Scaling Comparison
5.1.3 Conclusion Summary
5.2 Resolving Research Questions
5.3 Methodology and Results Discussion
5.4 Future work
5.5 Summary

Bibliography

Appendices
A Installation instructions
B Local Single vs Double Precision


1 Introduction

Matrix multiplications are linear algebra computations that are frequently used behind the scenes in many fields. Unfortunately, they are computationally heavy, and can take an unreasonable amount of time to complete for large datasets.

The solution to this problem lies in the parallel nature of matrix multiplications. The values of different cells in the resulting matrix can be computed independently of each other. It is this parallel nature that is exploited by Graphical Processing Units (GPUs). Due to matrix multiplications being heavily used in computer graphics [1], GPUs have been optimized to perform these types of operations extremely efficiently when compared to CPUs [2].

However, while GPUs are superior at performing the multiplications, they are much slower when it comes to accessing the main memory, which sometimes offsets the advantage that utilization of a GPU device brings. Further problems also arise as the size of the matrices grows large and enters the “big data” realm. When data gets too big for a single system to handle in a reasonable time, it is often distributed across a “cluster” of systems with the help of frameworks such as Apache Spark. However, this distribution comes with significant overhead costs. Additionally, Spark does not currently have any support for utilization of GPU devices. Therefore, workarounds such as wrappers and interception of calls have to be utilized if one wishes to use GPUs for large scale matrix multiplications in clusters.


1.1 Motivation and Aim

Matrix multiplications are widely used in many industries, such as the previously mentioned graphics industry. While many of these fields usually deal with relatively small matrices, some of them deal with matrices large enough for distribution to be helpful. Machine learning and data querying are examples of areas where large-scale matrix multiplications are of use.

Data can, for example, be queried from a database with the help of matrix multiplications by representing the entire data set as one matrix, and a query as another. The resulting matrix of the multiplication between these two matrices would indicate the results of the query. Unfortunately, due to the computational complexity of matrix multiplications, these operations can lead to unreasonable running times for a single query on large datasets.
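As a toy illustration of this idea (a hypothetical encoding, added here for concreteness): let each row of $D$ mark which terms a record contains, and let $q$ mark the terms of interest. Then $Dq$ scores each record by how many query terms it matches:

$$D = \begin{pmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \end{pmatrix}, \qquad q = \begin{pmatrix} 1 \\ 0 \\ 1 \end{pmatrix}, \qquad Dq = \begin{pmatrix} 2 \\ 1 \end{pmatrix},$$

so the first record matches both query terms while the second matches one.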

The aim of this thesis is to find out whether utilization of GPUs could prove useful in speeding this process up and yield more reasonable running times.

1.2 Environmental and ethical concerns

Shorter running times on a distributed system have positive environmental effects through lower resource consumption. Even if running times are already acceptable, the usage of more efficient hardware could lower the number of nodes needed in a cluster. This would ultimately lower energy consumption both during use and by reducing demand for the production of additional hardware. However, GPU devices require a considerable amount of electricity in order to be kept cool during continuous use. So even if GPUs prove to multiply matrices faster, it is unlikely that this improved speed would come with an overall reduction in energy consumption.

Furthermore, areas that would benefit from this report, such as machine learning and big data processing, are areas where ethical practice is a topic of ongoing conversation. In the case of machine learning, the prospect of smarter and more capable machines is exciting to some due to its great potential for enhancing our day-to-day lives. At the same time, it is concerning to others, who worry about the consequences of such a change, such as the prospect of mass unemployment caused by machines replacing human workers.

In the case of big data processing, there have recently been many instances of large corporations gathering and processing large amounts of personal data from users of their services in hopes of either providing better service or increasing their ad revenue. The general public is mostly less than pleased about databases containing and processing their personal data, while simultaneously enjoying the fruits of this labour, such as personalized Google search results.

In summary, the environmental effects of this thesis are marginal and its social effects controversial.

1.3 Problem Definition

The purpose of this thesis is to figure out whether the delegation of distributed matrix multiplications to the GPU scales well despite the penalties that come with the usage of wrappers, interceptors, and the distribution framework. This is done by measuring the running time of distributed matrix multiplications for matrices and clusters of varying sizes. These measurements are made for multiplications performed both on the GPU and the CPU, in order for comparisons to be possible.

1.4 Previous Studies

In this section, we mention a few previous studies related to this subject, and the insight they have provided going into this project.

1.4.1 GPU computing

General-purpose computing on graphics processing units (GPGPU) has been a phenomenon since the early 2000s. The idea is to utilize the massive parallel capacity of the GPU to speed up certain aspects of applications. There have been a high number of studies claiming a significant speedup when utilizing GPUs rather than CPUs [3, 4, 5, 6].

This notion has, however, been challenged and claimed to be exaggerated by Intel. In a 2011 paper titled “Debunking the 100X GPU vs. CPU myth” [7], Intel claims that many of the studies compare optimized GPU implementations to unoptimized CPU implementations. It also points out the importance of taking the cost of transferring data from host memory to device memory into account when making comparisons.

The paper was in turn criticized by, among others, Nvidia for using a previous generation GPU and a current generation CPU in its measurements [8]. Nevertheless, the points that the Intel paper brought up against previous studies are still important and valid. In this thesis, we therefore try to optimize our implementations of both the GPU and CPU solutions, while also making sure that data transfer costs are taken into consideration.

1.4.2 Spark & GPU

Apache Spark is not GPU-aware, meaning that it does not attempt to utilize any GPU devices on the cluster. A number of studies, however, have tried to sidestep this limitation. Li et al. [9] proposed one such solution named “HeteroSpark”. This rather messy solution requires GPU implementations of methods to have been pre-compiled and made available on a device with a GPU. The Spark applications then utilize this precompiled code through a combination of Java Remote Method Invocation (RMI), the Java Native Interface (JNI) and a Java wrapper for the precompiled GPU code. In late 2016, Yuan et al. managed to achieve a 4.83x speedup in the performance of SQL queries by utilizing this solution [10].

Zadeh et al. [11] have also presented a study in which optimized matrix multiplications through Spark’s linear algebra library were benchmarked. Matrix multiplication through the GPU was one of the approaches tested and benchmarked by the study. However, the tests were not run on a distributed cluster, but only on a single, powerful node. The study found that, with their hardware, the CPU implementations were superior to the GPU implementations for matrices with dimensions up to 10000 × 10000. After that point, the GPU implementations take the lead.

But because only a single node was utilized, the matrices were not distributed and the multiplications were performed locally. These results therefore do not reflect the actual performance of distributed matrix multiplications, but rather the local performance of the Spark engine on a single node.

However, an important takeaway from [11] is the manner in which the GPU was utilized. Instead of the complicated solution proposed by HeteroSpark, a simple wrapper developed by Nvidia was utilized. This solution exploits the fact that Spark performs linear algebra computations by calling a native library on the system. These calls to the system’s native library are simply intercepted by the Nvidia wrapper and rerouted to the GPU. This solution, unlike HeteroSpark, only allows the GPU to be utilized for linear algebra operations. However, this is all that we require for this thesis. We therefore utilize this approach in our implementation.


1.4.3 Delimitation

The limiting factors in this thesis are the capabilities of the Spark engine, the limitations of GPU devices and libraries, and the available resources. Due to a lack of a sparse matrix multiplication library for the GPU that is compatible with Spark, the project is limited to dense matrices, and the number of nodes in our cluster is limited to 1 master node and 3 slave nodes due to hardware resource constraints.

1.5 Problem Statement

The main question this thesis attempts to answer is: “How do distributed matrix multiplications performed on Apache Spark scale (with regard to variables such as running time, different input sizes and cluster size) if the multiplications are performed by GPU devices rather than CPU devices?”

In order to be able to answer this question fairly and accurately however, we need Spark to perform at its peak when evaluating both the CPU and the GPU performance. This leads us to the prerequisite question of “How can Spark be configured to run matrix multiplications as efficiently as possible?”


2 Background

This chapter introduces the concepts and techniques that are used in this report. We first cover matrix multiplications and the challenges they bring. Then, we cover computations through GPUs, and how we can distribute large workloads between multiple systems in order to speed up the multiplications.

2.1 Linear Algebra

A matrix is a two-dimensional data structure containing numbers in fixed rows and columns. Matrices are often used as a mathematical representation of some concept or object. Such representations are prevalent in many fields. One such field is digital graphics, where any given object viewed from a given perspective is represented by a matrix [1]. The advantage of such representations is that changes in the perspective from which the viewer views the objects can be simulated efficiently with the help of simple linear algebra calculations such as linear transformations, rotations, scaling, projections, and the like [1].

These types of calculations rely heavily on matrix multiplications, which is what this report is focused on.

2.1.1 Matrix Multiplications

Matrix multiplications are computationally heavy work. When multiplying two matrices, A and B, each cell in a resulting matrix C consists of the sum of a series of multiplications between the entries of a row from matrix A and a column from matrix B. Figure 2.1 illustrates this process, which is then to be repeated for all the cells in the matrix C.
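To make this concrete, here is the standard entry-wise definition of the product (restated for reference, with $A$ of size $m \times n$ and $B$ of size $n \times p$):

$$c_{ij} = \sum_{k=1}^{n} a_{ik}\, b_{kj}, \qquad 1 \le i \le m, \quad 1 \le j \le p.$$

Each cell costs $n$ multiplications, and there are $m \cdot p$ cells to populate, which is where the cubic cost discussed below comes from.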


Figure 2.1: Illustration of matrix multiplication. The figure depicts the pattern followed when populating matrix C (blue) as a product of matrices A (red) and B (green).

In fact, the time complexity of performing a matrix multiplication naively is an impractical O(n^3). Even when utilizing more advanced algorithms, the results do not get much better. The Strassen algorithm, for example, runs in O(n^2.8) time [12]. And we have yet to even take constant factors into consideration.

In practice, this means that matrix multiplications can take hours for even moderately sized matrices, which is not acceptable in many areas.

The running time can, however, be shortened by exploiting a few properties of the matrix multiplication process, namely its decomposable and parallel nature. These properties allow us to employ the so-called “divide and conquer” and “parallel and distributed” strategies described in the coming sections.

2.1.2 Partitioned Matrix Multiplication

Matrix multiplications can be decomposed into smaller tasks through what is often referred to as “Block Partitioning” [13]. This strategy takes advantage of the associative and distributive properties of matrix multiplications in order to divide the two big matrices into groups of smaller sub-matrices. These sub-matrices are then multiplied together in an appropriate manner, before the results of these sub-calculations are finally compiled together. See Figure 2.2 for an illustration.
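In block form (standard block-matrix algebra, stated here for reference): if $A$ is partitioned into sub-matrices $A_{ik}$ and $B$ into $B_{kj}$ with conforming block dimensions, then block $(i, j)$ of the product is

$$C_{ij} = \sum_{k} A_{ik} B_{kj},$$

which mirrors the entry-wise formula, except that each term is now a small matrix multiplication that can be computed independently of the others.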

This process introduces some additional workload in the form of the initial partitioning, and then later reassembling the pieces. However, it allows us to distribute the sub-multiplications between different systems working in parallel, which is a popular way of dealing with problems arising from too large workloads.


Figure 2.2: Block Matrix Partitioning. The figure illustrates how the 3 by 3 matrix A and the 3 by 2 matrix B can be multiplied together by dividing the original matrices into sub-matrices no bigger than 2 by 2.

2.1.3 Parallel matrix multiplication

Another property of matrix multiplications that can be taken advantage of when attempting to speed up the process is that the computations for different cells of the result matrix are independent of each other, as showcased by Figure 2.3. This means that the workload of populating the resulting matrix can be divided amongst several entities, such as processor cores.


Figure 2.3: Depiction of an instance of matrix multiplication in process. The figure illustrates the fact that the computation of each individual cell is not dependent on the others.

2.1.4 BLAS library

No matter which strategy one takes towards solving matrix multiplications, it is important that it is implemented efficiently, especially when performing large matrix multiplications on computers, where memory management is a key issue.

The BLAS (Basic Linear Algebra Subprograms) library was initially developed in 1979 using Fortran. This library contains efficient implementations of linear algebra subroutines that have been maintained for the last 35 years, and it is implemented by many high-profile vendors and libraries such as Intel’s Math Kernel Library [14], Nvidia’s cuBLAS library [15], and the Netlib BLAS and CBLAS projects [16]. What makes these libraries special is the level of efficiency that they provide in all aspects. Aside from being written in low-level languages and utilizing Single Instruction Multiple Data (SIMD) strategies, their main strength over a typical implementation of linear algebra subroutines is their cache efficiency [17].

BLAS libraries partition matrices (in the manner described in Section 2.1.2) using block sizes that fit perfectly inside the processor’s cache memory. By doing this, halting the multiplication process for the sake of retrieving data from the main memory can be minimized. In order to be able to do this, however, the user needs to compile the library locally. This results in a native, system-specific BLAS implementation optimized for the end user’s system [17].
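As a minimal sketch of what calling such a routine looks like through the netlib-java interface (the same interface Spark’s MLlib uses internally), assuming the com.github.fommil.netlib artifact is on the classpath; the class name GemmExample and all values are illustrative:

    import com.github.fommil.netlib.BLAS

    object GemmExample {
      def main(args: Array[String]): Unit = {
        val n = 2
        // BLAS expects column-major storage.
        val a = Array(1.0, 3.0, 2.0, 4.0) // A = [[1, 2], [3, 4]]
        val b = Array(5.0, 7.0, 6.0, 8.0) // B = [[5, 6], [7, 8]]
        val c = new Array[Double](n * n)

        // C := 1.0 * A * B + 0.0 * C; "N" means "do not transpose".
        BLAS.getInstance().dgemm("N", "N", n, n, n, 1.0, a, n, b, n, 0.0, c, n)

        println(c.mkString(", ")) // 19.0, 43.0, 22.0, 50.0 (column-major)
      }
    }

Whether this call lands in the reference Fortran implementation, an optimized native library, or on the GPU depends entirely on which native BLAS is linked at run time, a property that is exploited later in this thesis.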


2.2 Graphical Processing Units

A Graphical Processing Unit (GPU) is a special-purpose processor originally designed and optimized specifically for processing 3-dimensional images more efficiently than other forms of processors (such as a CPU). The type of tasks this might entail was briefly touched upon in Section 2.1.1. Simply put, the tasks that a GPU device specializes in are tasks where a great amount of parallel workload needs to be performed with high throughput. A GPU device therefore possesses a massively parallel architecture consisting of thousands of small but efficient cores designed for handling multiple tasks simultaneously [2]. Figure 2.4 illustrates the difference between the architecture of a typical CPU device and that of a GPU.

Figure 2.4: Depiction illustrating the difference between the architecture of a CPU and a GPU.

2.2.1 GPU architecture

An important distinction between CPUs and GPUs is the architectural hierarchy of GPU devices. A simple program that is intended to run on a CPU typically contains a main function that runs serially on a single thread from start to finish. Many threads executing different tasks or programs are juggled by a CPU core simultaneously. Modern CPUs contain a number of cores splitting up the workload.

A typical GPU program, however, consists of a piece of code called a Kernel. A Kernel is similar to a main function in many ways. The big distinction, however, is that while only one instance of the main function is executed by a typical CPU program, there are usually hundreds of instances of Kernels executed when starting up a single GPU program. Each instance runs on a different thread.

Threads running on the GPU are grouped into what are referred to as blocks. A block can contain up to 512 or 1024 different threads depending on the GPU device. The significance of blocks is that threads within the same block communicate through what is called the shared memory, while communication between blocks must be performed through the global memory. The difference between the two is that the shared memory consists of 16 KB of memory, but can be accessed faster than the global memory, which usually consists of several GB of memory. The number of threads per block, and the number of blocks, can be configured. They should be chosen carefully in order for the program to run optimally [18].

2.2.2 CUDA

CUDA (Compute Unified Device Architecture) is a driver API that allows end users to write GPU-executable code. Using the CUDA API, a programmer can write a Kernel code, which is executed simultaneously by all GPU threads.

Nvidia has itself developed a couple of libraries using this API [19]. The most important ones for our purposes are named cuBLAS and NVBLAS.

As described in the last section, on top of writing efficient kernel code, several other factors such as the thread count and block size must be taken into consideration in order to produce a fast application. However, when it comes to Basic Linear Algebra Subprograms (BLAS), such as matrix multiplication, programmers can simply use the Nvidia-developed cuBLAS library. cuBLAS is an implementation of BLAS (see Section 2.1.4) on top of the CUDA runtime. It contains system-optimized BLAS routines, and comes with a simple interface [15]. NVBLAS, on the other hand, is an even more context-aware library built on top of cuBLAS. It is intended to replace native BLAS libraries by intercepting computationally heavy BLAS calls to the CPU and redirecting them to GPUs that are present in the system. NVBLAS can be further configured to let a certain percentage of the calls slip through to the CPU, effectively sharing the workload between the CPU and the GPU [20].
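For illustration, a minimal nvblas.conf in the spirit of the sample file that ships with CUDA is sketched below; the key names follow Nvidia’s sample configuration, but the exact names, defaults and paths should be verified against the installed CUDA version:

    # Native CPU BLAS used for calls NVBLAS does not intercept (path is illustrative).
    NVBLAS_CPU_BLAS_LIB /usr/lib64/libopenblas.so

    # GPU devices to use (device 0 only here).
    NVBLAS_GPU_LIST 0

    # Tile dimension used when splitting large GEMMs.
    NVBLAS_TILE_DIM 2048

    # Optional per-routine ratio of work kept on the CPU,
    # e.g. keep 10% of DGEMM work on the CPU.
    NVBLAS_CPU_RATIO_DGEMM 0.1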

2.2.3 GPU Limitations

The major drawback of GPUs is their slow access to the main memory. This introduces a significant penalty to their running time that is only offset once the matrices reach a certain size, where the GPU’s fast computation speed can make up for the lost time.

Another major drawback of most GPU devices is that their fast computation speed only applies to single precision operations. This increased single precision performance comes at a great cost to their double precision operation speed. As an example, by utilizing the benchmarking application provided by [11] in their report, it was established that the Nvidia Quadro K620 (the device used in this thesis) has 30 times slower double precision performance compared to its single precision performance. Detailed results can be seen in Appendix B.

2.3 Spark

One way of tackling large workloads which consist of smaller, independent tasks is to distribute these smaller tasks amongst a number of separate systems. These systems perform their individual tasks and report the results back to some central entity which oversees the process. This type of configuration is commonly referred to as a cluster, and the individual devices that form the cluster are referred to as nodes.

Clusters are often a cost-efficient way of increasing performance, with higher availability than a single system offering comparable results [21]. However, they are accompanied by certain complications. Transferring data between the individual nodes, for example, is a necessity that is not trivial to accomplish efficiently. Distributing the tasks among the nodes is yet another non-trivial task. But perhaps one of the complications hardest to deal with is the prospect of failure or delays on a given individual node. The cluster needs to be able to detect a failing or straggling node, redirect the workload, and perhaps recalculate data that was lost on the failing node.

Apache Spark is an open-source cluster-computing framework which handles all these complications in the background, and instead offers a simple interface for programming the entire cluster. Spark provides implicit data parallelism and fault tolerance, meaning it automatically distributes tasks and data to nodes where optimal performance is expected, while also being able to handle failures or delays at any given node by rescheduling [22].

Spark clusters usually consist of one master node, and one or more slave nodes. The Master controls resources such as memory or processors available in the worker nodes. When a task is initiated in the cluster, a driver process is created. This Driver service splits the bigger tasks into smaller ones, and delegates these to the slave nodes [23].

Data is primarily stored in the main memory of the nodes in a special data structure called Resilient Distributed Dataset, or RDD for short. RDDs can efficiently be stored on disk or transferred between nodes when the need arises. They are fault-tolerant, parallel data structures that let users explicitly persist intermediate results in memory, control their partitioning to optimize data placement, and manipulate them using a rich set of operators [24].

2.3.1 Spark data management

One of the main use cases of Spark, or clusters in general, is processing large amounts of data (“big data”). What is considered “large” can be subjective, but generally, the concept applies to any set of data too big for one system to handle efficiently. This subsection attempts to provide a brief overview of how Spark handles data of different sizes [25].

Let us consider a data set small enough to fit in the main memory of a single system. In this case, distributing the task is not necessary as far as memory management is concerned (although one still might want to do so in order to distribute the workload).

Next, consider a data set too big for the main memory of one system, but small enough to fit in the main memory of several systems in the cluster. If we choose not to distribute the data, and instead process it on one system, we would be forced to keep some of the data on the hard drive and read from or write to disk when the need arises. If we choose to distribute the data, however, parts of the data will reside inside the main memory of the different systems. When one node requires data that it does not possess, it can request it from the node that does, or alternatively, have the other node execute the task for it. Through this approach, we avoid any transactions between disk and main memory.

In the third case, we consider data too big for the entire main memory of the cluster combined. In this situation, even when we distribute the data across several systems, we are still forced to store some of the data on disk. This causes additional problems when system A requires data that system B possesses, but has currently placed on the disk. In this situation, system B might need to evict some of the data it is using to the disk in order to be able to process system A’s request. This is a worst case scenario that sometimes cannot be avoided, but the damage can be controlled through Spark’s storage level option, which dictates what Spark will do when data becomes too big for the main memory [25]. The options are:

• MEMORY_ONLY:

This storage level dictates that the data is to be stored as plain Java objects in the JVM. Should the system run out of memory, data is dropped from memory entirely, and recalculated anew on the fly later, should the need arise. This is the default storage level.

• MEMORY_ONLY_SER:

This storage level functions like the previous one, in the sense that data is dropped and then recalculated when needed. But the difference between the two is that this option stores data in serialized form. This is generally more space-efficient than storing non-serialized objects, but more CPU-intensive.

• MEMORY_AND_DISK:

This storage level functions like the MEMORY_ONLY storage level, in the sense that data is stored in a non-serialized state. However, when data gets too big for the main memory, it is written to disk, rather than being dropped entirely.

• MEMORY_AND_DISK_SER:

This storage level functions like a mix between MEMORY_ONLY_SER and MEMORY_AND_DISK storage levels. Data is stored in a serialized form, and when it gets too big for the main memory, it is written to disk.
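As a minimal sketch of selecting one of these storage levels from application code (the RDD here is a stand-in for real data; the StorageLevel constants are part of Spark’s public API):

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.{SparkConf, SparkContext}

    object StorageLevelExample {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("storage-level-demo"))

        // A toy RDD standing in for the blocks of a large matrix.
        val blocks = sc.parallelize(1 to 1000000)

        // Serialize in memory, and spill to disk when memory runs out.
        blocks.persist(StorageLevel.MEMORY_AND_DISK_SER)

        println(blocks.count())
        sc.stop()
      }
    }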

2.3.2 Spark Resource management

The memory management of Spark is not as straightforward as alluded to in the previous section. Different nodes in the cluster may or may not possess different amounts of resources. Having the same task carried out on nodes with different amounts of resources might be problematic and cause complications. In order to avoid this, Spark performs its tasks through what it calls executors.


An executor is a virtual slave instance hosted by a node. Many instances of executors can be hosted by the same node, but all executors across all nodes possess the same amount of resources, such as RAM and CPU cores. The exact amount is specified by the user during execution. By default, each node hosts as many executors as possible, striving for an identical running environment for all executors while still taking advantage of as much of the node’s resources as possible [25].

Figure 2.5 illustrates an example where a cluster consists of two nodes, one of them possessing 4 CPU cores and 16 GB of memory, the other 4 CPU cores and 12 GB of memory. By specifying that each executor is to have 2 CPU cores and 8 GB of memory, the user would receive three executors in total: two from the first node, one from the second node. However, 2 CPU cores and 4 GB of memory from the second node would go unused. It is therefore important to be careful when selecting the executor properties, or large parts of the cluster could end up being left unused.

Figure 2.5: Figure describing the process of generating homogeneous executors from the resources available on a cluster.

There are some trade-offs to be made when deciding on the size of executors. For example, creating many small executors creates more opportunity for parallelism, but each executor would be weaker and not able to handle as large tasks as a larger executor could. Additionally, each executor comes with a certain amount of overhead, leaving even less resources for the actual tasks.

When deciding on the resource division among executors, it is commonly recommended by the Spark community to leave a minimum of 1 core and 1 GB of RAM untouched by Spark, in order to leave some resources behind for the OS and vital background processes [26].
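As a concrete example (the flags are standard spark-submit options; the master URL, class name and jar are placeholders): on the 4-core, 16 GB nodes used later in this thesis, the following submission follows that recommendation by leaving one core and some memory for the OS, yielding one 3-core, 12 GB executor per node:

    spark-submit \
      --master spark://master:7077 \
      --executor-cores 3 \
      --executor-memory 12G \
      --class MatrixBench \
      matrix-bench.jar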

Once an executor has received a certain amount of memory, it splits it up into different sections for different purposes. The two sections we are interested in are the storage memory and execution memory. Storage memory is used for caching of data that is to be used, while execution memory is used for the actual computation. The main memory that was referred to in the previous section is in actuality only the storage memory portion of the memory that is allocated to each executor [25].

2.3.3 MLlib

When distributing data across a cluster, the data structure that is used is of importance. It is often a challenge to partition and distribute the data in an efficient manner. In our case, however, we can simply utilize standard data structures present in Spark’s Machine Learning library.

Spark’s Machine Learning Library (MLlib) contains a vast array of procedures and data structures that are often used in machine learning contexts. In this project, we utilize the library’s implementation of matrix representations, and its matrix multiplication interface.

MLlib possesses several data structures to represent matrices with. Here, we briefly go over those that are designed for distributional purposes [27].

The simplest distributed matrix representations are RowMatrix and IndexedRowMatrix. These are simple collections of rows, where each row is represented by a standard vector. The difference between the two representations is that IndexedRowMatrix has meaningful row indices, while RowMatrix does not. Since the rows are represented by a standard vector, the number of columns is limited by the Integer range.

A more niche matrix representation is that of CoordinateMatrix. A CoordinateMatrix is a sequence of entries, where each entry consists of the row and column indices (i, j) and a double value. This representation is intended for matrices that are huge and very sparse.

The final matrix data structure is called BlockMatrix. BlockMatrices are representations of the partitioned matrices mentioned in Section 2.1.2. A BlockMatrix consists of a series of blocks, each of which in turn consists of a block index (i, j) and a local matrix.
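A minimal sketch of building and multiplying two such BlockMatrices through MLlib’s public API (toy 4 × 4 matrices with 2 × 2 blocks; the values are random placeholders):

    import org.apache.spark.mllib.linalg.distributed.{BlockMatrix, CoordinateMatrix, MatrixEntry}
    import org.apache.spark.{SparkConf, SparkContext}

    object BlockMatrixExample {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("blockmatrix-demo"))

        // Build a small matrix from (row, column, value) entries.
        def randomEntries() = sc.parallelize(for {
          i <- 0 until 4
          j <- 0 until 4
        } yield MatrixEntry(i, j, scala.util.Random.nextDouble()))

        // Convert to BlockMatrices partitioned into 2 x 2 blocks.
        val a = new CoordinateMatrix(randomEntries(), 4, 4).toBlockMatrix(2, 2).cache()
        val b = new CoordinateMatrix(randomEntries(), 4, 4).toBlockMatrix(2, 2).cache()

        // Distributed block-partitioned multiplication, i.e. the operation
        // whose running time is measured throughout this thesis.
        val c = a.multiply(b)
        println(c.toLocalMatrix())

        sc.stop()
      }
    }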

Figure 2.6 demonstrates how the different types of matrices can be partitioned. Each gray block represents a portion of the matrix that can be assigned to a partition. Since a local matrix consists of one big block, it cannot be distributed between partitions. A row matrix, however, has its rows distributed amongst the partitions, while the BlockMatrix has each block distributed. Of course, the partitioning and distribution of these matrices come at the price of additional overhead.

Figure 2.6: An illustration of how a matrix is stored by Spark’s different data structures. On the left, the matrix is not partitioned and is saved as a giant self-contained unit. In the middle, each row of the matrix is stored separately. On the right, the matrix is divided and stored as uniformly smaller blocks. Source: [26]

The trade-off between these different matrix representations boils down to weighing the increased opportunity for parallelism against the additional overhead that comes with separating the matrix into smaller pieces. Ideally, one wants to avoid a situation where some nodes are idle due to insufficient partitioning, while also avoiding a queue of smaller-than-necessary tasks on every node due to too much partitioning and overhead. Other factors, such as the memory limits of the nodes and the data transfer speed, should also be considered when deciding on how to best distribute a matrix.

A detail about the MLlib library that is particularly important in our case is that MLlib has no support for single precision data structures at all. All the data structures mentioned above are designed to hold double precision values exclusively.


2.4 Miscellaneous

There are a few miscellaneous libraries and techniques that can be utilized in order to optimize Spark jobs. This section contains a brief rundown on these subjects.

2.4.1 Netlib

The matrix multiplication routine of MLlib is not quite straightforward, due to the involvement of many different libraries and the licensing issues that are brought about as a result. The multiplications are handled by the netlib-java library, which primarily attempts to perform them through a system-specific library such as OpenBLAS, Intel MKL or Nvidia’s cuBLAS. Should that fail, a built-in native reference implementation written in Fortran is used. Should both of these approaches fail, a pure-Java fallback implementation is used, at a great cost to performance [28].

The system-specific libraries, as well as the Fortran Netlib implementation, are not included in Spark’s MLlib due to licensing issues, and have to be downloaded, compiled and linked by the end user manually.

The advantage of using these libraries is their high level of optimization. “Optimized” here refers to specialist assembly instructions being combined with compile-time profiling and the selection of array alignments for the kernel and CPU combination [29]. All in all, the library is optimized for the machine that it is running on, rather than being generically tuned.

2.4.2 Native BLAS Libraries

As mentioned in Section 2.1.4, the BLAS library has been implemented by a series of optimized, native math libraries. Here, we quickly recap the two libraries most prominently used in this thesis.

The cuBLAS library is a device-specific GPU implementation of the BLAS routines. It is available free of charge as a part of the CUDA driver API provided by Nvidia. The NVBLAS library is built on top of the cuBLAS library and dynamically reroutes certain BLAS calls to the GPU devices of the system.

The OpenBLAS library is a native CPU implementation of the BLAS library, compiled by the end user, resulting in a system-optimized library. Being an open-source project, it does not require a license and is compatible with any processor.


2.4.3 Garbage collection

Spark is a Java based framework, so naturally it is impacted by different JVM settings and environment variables. The settings we are particularly interested in are the garbage collection settings.

By default, the heap space of a Java application is divided into three generations: the young, old and permanent generations. The young generation is in turn divided into three subsections named Eden, Survivor 1 and Survivor 2. When new data is created, it is initially placed in the Eden section. Every time a minor garbage collection is performed, data from Eden, together with any surviving data from Survivor 1, is moved to Survivor 2. The Survivor 1 and Survivor 2 regions are then switched. Data that has survived a number of minor garbage collections is moved to the old generation section. Once the old generation section is completely full, all threads of the application are suspended in order for a major garbage collection to take place.

Figure 2.7: Illustration of the different sections that the heap space of a Java application is divided into. Source: [30]

G1GC is a newer, alternative approach to garbage collection. In this approach, the heap is divided into equally sized heap regions. The regions are initially assigned similar roles as in the traditional approach, but the difference is that the sizes of the different sections are not fixed; they change dynamically.

When data is created, it is allocated in an available Eden region. When a minor garbage collection occurs, live objects are copied from one or more regions of the heap to a single region, and new free regions are assigned as Eden regions. A full GC occurs only when all regions hold live objects and no fully empty region can be found.
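For reference, switching a Spark job’s executors to G1GC is done through the standard spark.executor.extraJavaOptions configuration key, for example (the class name and jar below are placeholders):

    spark-submit \
      --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC" \
      --class MatrixBench \
      matrix-bench.jar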


Figure 2.8: Illustration of how the G1GC garbage collection strategy divides the heap space into different sections. Source: [30]

2.5 Performance Optimization

As mentioned earlier, what we are interested in finding out in this report is how distributed matrix multiplications performed by the GPU scale compared to when they are performed by the CPU. As described in the previous chapter, when performing multiplications in a distributed environment, the actual multiplication is only a part of the workload. A speedup in the multiplication portion of the workload might not be accurately portrayed if the rest of the workload is performed inefficiently, causing noise in our measurements.

In order to accurately measure the improvement that is brought about by the usage of GPUs, we therefore need to optimize the rest of the workload as much as possible.

Our application consists of three main components: the cluster that is managed by Spark, the processing unit that is performing the calculation, and the linear algebra libraries containing the implementation of the multiplications. The processing units, as well as the libraries that are to be used in the project, do not require any tuning, since they are already optimized out of the box.

When it comes to Spark, however, the list of parameters that need tuning is rather long, and setting these parameters incorrectly can have a negative impact for the reasons explained in the previous sections. Here is a quick summary of the parameters that need tuning:

• Number of executors per node:

As explained in Section 2.3.2, Spark performs its tasks through executors, an executor being a virtual worker residing inside a node. All executors on all nodes need to possess identical resources, and the user must decide whether to create many small executors, or fewer but bigger ones. Smaller executors have the potential of speeding up the execution by providing more opportunity for parallelization, but come at the price of additional overhead and fewer capabilities per executor.

• Number of cores per executor:

This parameter controls the number of cores that are supplied to each executor. With more cores, individual tasks have the potential of being completed faster. However, giving more cores to each executor limits the total number of executors that can be hosted by a single node.

Additionally, the usage of CPU cores works slightly differently when the multiplications are being performed by a GPU device.

Since Spark is not aware of any GPU devices, it assumes that each CPU core is performing the multiplications assigned to it independently of the other ones. But in actuality, all CPU cores are delegating their matrix multiplication tasks to a single GPU device. This might lead to a queue of tasks incoming to the GPU device, and ultimately means that Spark is making decisions regarding the delegation of workload and the capabilities of its resources based on false assumptions.

Finally, the usage of additional cores comes with additional overhead in the heap.

• Size of each partition:

As explained in Sections 2.3.1 and 2.3.3, data needs to be partitioned and distributed across the cluster in specific manners in order for Spark to be able to process it. The size of these partitions is chosen by the user, and has an impact on the performance of the program. Larger partitions mean less overhead but also less parallelism.

• Data management strategy:

As we multiply bigger and bigger matrices, we expect them to become too big for the main memory of the nodes at some point. As described in Section 2.3.1, Spark offers several alternatives for handling these scenarios:

Keep data in main memory only, drop some data when memory is full.

Keep data in main memory only, but in a serialized form in order to allow for more data to fit. Drop some data when memory is full.

Keep data in main memory, spill to disk when main memory is full.


Keep data in main memory, but in serialized form. Spill to disk when main memory is full.

Different options are expected to have different peaks and valleys when it comes to performance, depending on the size of the matrix.

• Memory fraction ratio:

This parameter specifies how much of the main memory is set aside for internal metadata, user data structures, and imprecise size estimation in the case of sparse, unusually large records [25].

• JVM parameters:

Since Spark is a Java-based application, JVM options such as the garbage collection strategy, the amount of heap space, etc. have an impact on performance.


3 Methodology

The goal of this project is to compare the scaling of GPU-centric distributed matrix multiplication to the traditional CPU-centric approach. As described in the last chapter, there are many factors other than the multiplication speed that affect the running time when using Spark. If these other factors are sub-optimally configured, we might get inaccurate data when evaluating the scalability.

In order to avoid this, our testing portion of the project is divided into two sections. In the first section, we attempt to evaluate the impact of these additional variables in order to find the optimal settings for our purposes. In the second portion, we use the optimal settings found in the first portion to measure the scaling of matrix multiplication more accurately.

In this chapter, we first describe the hardware and software used in our testing, before describing the testing process itself.

3.1 Testing environment

The cluster used in our tests consists of 4 identical nodes. One of them is used as the master node in the Spark cluster, while the other 3 act as slaves. The hardware specs of these machines are as follows:

• 16 GB of RAM

• Seagate Desktop HDD 1TB 64MB Cache SATA 6.0Gb/s

• Nvidia Quadro K620, released July 2014 (the master node does not require a GPU device)


• Intel(R) Core(TM) i5-4460 CPU @ 3.20GHz, 4 cores, released May 2014

All three slave nodes run CentOS 6, but the testing environment has also been successfully installed on Fedora 25 and Ubuntu 16 systems according to the instructions given in Appendix A. Additionally, all nodes are connected to the same router, minimizing network latency.

3.1.1 Setup

In this section, we go through the preparatory work required to successfully run the tests in the coming sections. While most of these steps might seem straightforward, complications and irregularities should be expected, and there is a lack of detailed instructions on these subjects elsewhere online. Detailed instructions on how to replicate the author’s testing environment can therefore be found in Appendix A for those interested in replicating the experiments.

Below is an ordered list of the actions required for setting up the necessary native libraries and Nvidia drivers. They need to be performed on all slave nodes.

• Install the latest Nvidia CUDA Driver. Full instructions can be found on Nvidia’s webpage.

• Download the source code for the CBLAS and native BLAS libraries from netlib.org.

• Compile into a shared library (.so) using GCC version 4.8 or higher.

• Install the latest liblapack and OpenBLAS libraries, either through a package manager or by compiling from source.

• Link the installed libraries together.

• Configure NVBLAS to use the correct native BLAS library as fallback.

Spark also needs to be compiled from source; this needs to be done on all nodes.

• Clone the source code for Apache Spark from the official repository.

• Using Maven, build the Spark engine from the source code using the -Pnetlib-lgpl flag, as shown below.
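Concretely, the build step would look something like the following, run from the root of the Spark source tree (the exact goals may vary slightly between Spark versions, but -Pnetlib-lgpl is the flag referred to above):

    ./build/mvn -Pnetlib-lgpl -DskipTests clean package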

Once this has been done, Spark jobs can be submitted as usual. Refer to the “Spark Quick Start” guide for further instructions.


3.2 Optimization Testing

This section describes the tests performed in order to evaluate the impact that different variables have on the running time of our matrix multiplications. The purpose of these tests is to figure out the optimal environment for matrix multiplication jobs. The tests are run with different parameters, such as dimensions and block size, but the general structure of the tests is always as follows:

• Each test case consists of a number of iterations.

• In each iteration, two matrices are created and filled with randomized values.

• In the first iteration, the matrices consist of a given number of blocks. The number, as well as the block size, is specified through a parameter.

• In each following iteration, the dimensions of the matrix grows by a constant margin specified through a parameter.

• The test continues until either a single iteration takes longer than 20 minutes, or the program crashes due to hardware limitations (Java heap size).


Algorithm 1: Square Matrix Multiplication

procedure MULTIPLICATION(blockSize, initialSize, interval)
    rows ← initialSize / blockSize
    cols ← rows
    r ← 0
    while r < 20 minutes do
        MatrixA ← BlockMatrix(blockSize, rows, cols)
        MatrixB ← BlockMatrix(blockSize, rows, cols)
        MatrixA.fill(Float.Random())
        MatrixB.fill(Float.Random())
        r ← 0
        for i ← 1 to 10 do
            StartTime ← CurrentTime
            MatrixA.multiply(MatrixB)
            RunTime ← CurrentTime − StartTime
            r ← r + RunTime
        r ← r / 10
        Print("Runtime for size " + rows + " : " + r)
        rows ← rows + interval
        cols ← cols + interval

3.2.1 Partition Testing

The first variable whose impact needs to be tested is the size of the partitions. It is suspected that multiplications of matrices consisting of, for example, 4 blocks of size 20×20 will have a different running time than 16 blocks of size 10×10. In order to test this hypothesis, we run the test programs described in the previous section. Block sizes of 500, 1000, 2000, 4000 and 8000 will be used when testing GPU multiplication, and 50, 250, 500, 1000, 2000, 4000 and 8000 when testing CPU multiplication.

The reasoning behind the choice of the block sizes for the GPU is that the minimum block size deemed big enough for NVBLAS to intercept is 400×400, and the default block size is 1000. The remaining block sizes were picked as they are round multiples on both sides of the default value and allow for easy comparisons.

The Spark executor settings used in these tests are 12 GB of memory and 3 cores per executor, resulting in 1 executor per node for a total of 3 executors. The storage level used is MEMORY_ONLY. The effect of these additional settings is assumed to be equalized between the different tests, as their effect on the running times is assumed to be similar.

3.2.2 Executor testing

There are two additional variables related to executors that are suspected to affect both the running time and, perhaps, the memory usage of the program.

• The number of cores per executor. The more cores an executor has, the more work it can perform in parallel through the CPUs. However, when performing multiplications through the GPU, the main parallel workload of the application is performed by one single GPU device, which is shared between all cores and executors. We should therefore benefit less from an increased number of cores in our case, than in most typical CPU-centric programs. The purpose of this test is to find out whether the diminished efficiency boost potential that additional CPU cores bring in our case is worth the additional overhead of working with several cores.

• The number of executors per worker. The more executors per node, the more potential for parallelization. However, creating more executors per slave node would require the resources of the worker to be divided among these executors, which might lead to problems with for example memory size. Additionally, there being only 1 GPU device shared among all executors is also a factor to consider.

In order to test the impact of these variables, we run the standard test program described in Section 3.2 with the following executor configurations. We use square matrices, with the default blocksize of 1000.

• 1 executor per node, 1 core and 12 GB of memory per executor

• 1 executor per node, 2 cores and 12 GB of memory per executor

• 1 executor per node, 3 cores and 12 GB of memory per executor

• 3 executors per node, 1 core and 4 GB of memory per executor

3.2.3 Memory Management Testing

There are two additional Spark configuration options related to memory management that we suspect affect not only the running time, but also the maximum matrix size we can multiply.


• The storage level. As described in Section 2.3.1, different storage levels cause memory spilling to be handled with different strategies. In this test, we attempt to deduce the impact of these different storage level options on both running time, and the size of matrices that we can multiply by running the previously mentioned test with the default blocksize of 1000. Data gathered from this test is expected to showcase the effect of the different storage levels as the matrices grow too large for the main memory.

• Spark.memory.fraction value. This value decides the fraction of memory used for execution and storage. The lower it is, the less memory is set aside for these purposes, and as a result, spills and data evictions will be more frequent. The purpose of this configuration option is to set aside memory for internal metadata, user data structures, and imprecise size estimation in the case of sparse, unusually large records [25]. A lower value is expected to cause longer running times, but allow bigger matrices to be multiplied. The impact of this variable, and the trade-off made between running time and matrix dimensions, is explored for the preferred storage level of the previous test.

3.2.4 Garbage Collection

The final configuration that we are concerned with is the JVM garbage collection strategy. The two garbage collection strategies described in Section 2.4.3 are tested on square matrices using the optimal settings established so far.

3.3 Scalability Testing

Once the tests described in the previous section have been performed, we evaluate their results and establish the optimal settings and environment for the matrix multiplication jobs.

Once the optimal settings have been established, we evaluate the scalability of both the CPU and GPU implementations. This is done by performing multiplications of different sizes with optimal settings. This series of tests is run twice, once with only 2 nodes in the cluster, and once with 3 nodes, giving us data to evaluate the scalability of the application both in terms of input size and cluster size. More details about this phase of the testing are given in Section 4.2.


3.4 Spark & Single Precision Operations

As mentioned in Section 2.2.3, most GPU devices have very poor double precision performance. Additionally, at the time of writing this report, Spark does not have any single precision support for its distributed matrix data structures.

This makes GPU devices perform very badly with an out-of-the-box MLlib implementation.

Spark is an open-source framework, however, so a custom version of Spark’s MLlib library with single precision support has been created for the purpose of our tests. The source code can be found in the following repository: https://github.com/Arash-s/Spark-Single-precision-LinAlg.


4 Results

In this chapter, the results of the tests described in the previous chapter are presented. The first section contains the results of the environment optimization tests, while the second section contains the scalability test results.

4.1 Optimization Test Results

The results of the optimization tests are grouped into four sections and presented below. Section 4.1.1 contains the results of the data partitioning tests, which tested the optimal sizes for partitions. Section 4.1.2 contains the results of the core and executor tests, which tested the optimal hardware division between the executors. Section 4.1.3 contains the results of Spark’s memory management tests, which indicated the optimal storage level and memory fraction values. Finally, Section 4.1.4 contains the results of the garbage collection tests, which compared two different garbage collection strategies and their impacts on performance.

4.1.1 Data Partitioning

As indicated by Figure 4.1, our tests seem to suggest that larger block sizes are considerably faster compared to smaller block sizes. This applies to both GPU-based multiplications utilizing NVBLAS and CPU-based multiplications utilizing OpenBLAS. However, as can be seen from the running time of the 8000-unit block size, a block size too large to allow proper partitioning causes severe running time penalties. This can be seen best when trying to multiply the 16000-unit matrix using a block size of 8000, preventing proper partitioning between our three nodes.


Figure 4.1: Graph depicting the results of the partition tests of the OpenBLAS implementation (bottom) and the NVBLAS implementation (top). The different categories represent the different block sizes used when partitioning the matrices.


4.1.2 Cores & Executors

Our tests indicate that additional CPU cores speed up the running time of both Spark’s GPU- and CPU-backed matrix multiplication jobs by approximately 10% (although this value fluctuates rather heavily, as can be seen in Figure 4.2). This speedup comes at the price of a significant heap overhead. As can be seen in Figure 4.2 below, each additional core that is added causes the program to crash earlier with a Java heap space exception.

Figure 4.2: Graph depicting the results of the CPU core tests on the OpenBLAS implementation (bottom) and the NVBLAS implementation (top). Each category shows the running time when utilizing a certain number of CPU cores.


Furthermore, our tests indicate a further speed increase when utilizing a higher number of smaller executors rather than fewer but larger ones. As indicated by Figure 4.3, 3 executors utilizing 1 core and 4 GB of memory each perform around 5% faster than a single executor with 3 cores and 12 GB of memory. This speed increase, once again, comes at the price of additional overhead, which causes the program to crash earlier.
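Sketched as Spark configuration, the two layouts compared above look roughly as follows (the instance count assumes a resource manager, such as YARN, that honours spark.executor.instances):

    import org.apache.spark.SparkConf

    // Three small executors: 1 core and 4 GB each
    val manySmall = new SparkConf()
      .set("spark.executor.instances", "3")
      .set("spark.executor.cores", "1")
      .set("spark.executor.memory", "4g")

    // One large executor: 3 cores and 12 GB
    val oneLarge = new SparkConf()
      .set("spark.executor.instances", "1")
      .set("spark.executor.cores", "3")
      .set("spark.executor.memory", "12g")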

Figure 4.3: Graph depicting the results of the Spark Executor tests on the OpenBLAS implementation (left) and the NVBLAS implementation (right). A comparison between utilizing a single executor with 3 cores and 12GB of RAM, and 3 separate executors with 1 core and 4GB of RAM.


4.1.3 Memory Management

The results of our storage level tests, depicted in Figure 4.4, show similar behaviour for the CPU and the GPU. As can be expected, both implementations show a gradual shift from the MEMORY_ONLY option being preferable at lower matrix sizes, to MEMORY_ONLY_SER being optimal somewhere in the middle, and MEMORY_AND_DISK_SER being optimal at the highest matrix sizes we could test.
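In code, the storage level is chosen when persisting the RDD of matrix blocks; a sketch, assuming blocked is an existing BlockMatrix:

    import org.apache.spark.storage.StorageLevel

    // Small matrices: keep blocks deserialized in memory (fastest access)
    blocked.blocks.persist(StorageLevel.MEMORY_ONLY)

    // Mid-range: serialized in memory (slower access, smaller footprint)
    // blocked.blocks.persist(StorageLevel.MEMORY_ONLY_SER)

    // Largest matrices: spill serialized blocks to disk when memory runs out
    // blocked.blocks.persist(StorageLevel.MEMORY_AND_DISK_SER)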

Figure 4.4: Graph depicting the results of the storage level tests on the OpenBLAS implementation (bottom) and the NVBLAS implementation (top).

When examining the different memory fraction levels, the results of the GPU and CPU implementations were also quite similar. In both cases, the lower fraction levels allowed the program to process much bigger matrices: the matrix size could be increased from 30000² to 40000² when going from the highest tested fraction level to the lowest.

When it comes to the running time, however, the tests showed only weak effects of the different fraction levels on the lower dimension multiplications. Once the dimension of the matrices reached a high enough point, a clear relation could be seen between the running time and the chosen fraction level. This is best observed in the last three categories of Figure 4.5; the turning point seems to be around the point where the higher fraction levels start to fail.
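The fraction is controlled through a single configuration value; a sketch, where 0.6 is the default in the Spark versions we are aware of and 0.3 is merely an example of a lowered value:

    import org.apache.spark.SparkConf

    // Lowering spark.memory.fraction leaves more of the heap for user data
    // structures, allowing larger matrices at some cost in running time.
    val conf = new SparkConf()
      .set("spark.memory.fraction", "0.3")  // default is 0.6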

Figure 4.5: Graph depicting the results of the memory fraction value tests on the OpenBLAS implementation (bottom) and the NVBLAS implementation (top).


4.1.4 JVM options

Our garbage collection tests indicate that the default garbage collection strategy of the JVM is far superior to the G1 GC strategy for both the CPU and the GPU implementation. Additionally, the results indicate that the negative impact of using G1 GC is far greater on the CPU than on the GPU. As can be seen in Figure 4.6, the running time of the final category of the CPU implementation test almost doubled when switching from the default strategy to G1 GC.

Figure 4.6: Graph depicting the results of the Garbage Collection tests on the OpenBLAS implementation (left) and the NVBLAS implementation (right).


4.2 Scalability Testing

In this section, we first evaluate the results of the previous tests and describe how the scaling test is to be performed based on these evaluations. We then present the results of the scalability testing when running with these settings.

4.2.1 Optimal Environment Evaluation

The following is a summary of our findings during the Optimization tests:

• Garbage Collection

The results of this test were quite clear: the default garbage collection strategy is preferred by both of our implementations.

• Memory Fraction

Our tests indicate that the memory fraction value did not meaningfully impact the smaller matrix multiplications in either the GPU or the CPU implementation. However, it was discovered that the limit of what a cluster can handle, in terms of matrix dimensions, can be pushed further by lowering the memory fraction value at a small cost to running time.

• Storage Level

The tests indicated that the storage level should gradually be changed from MEMORY_ONLY, to MEMORY_ONLY_SER, to MEMORY_AND_DISK_SER as the matrix size grows.

• CPU cores and executors

The tests indicated that each additional core and executor marginally increases the speed at which the matrices can be multiplied, but drastically lowers the maximum size of matrices the cluster can handle.

• Block Size

The tests clearly indicated that a large block size is advantageous over smaller ones. The block size should therefore be selected to be as large as possible without leaving nodes idle.

The scalability test is conducted by multiplying matrices as small as 5000×5000, with dimensions incremented by 5000 until further increments are no longer possible. The cluster is configured to run the multiplication as fast as possible based on the observations listed above, meaning that the multiplications are attempted with the configurations yielding the fastest results. If such configurations lead to a crash, we tone down the variables that cause early crashes, such as the number of CPU cores or the memory fraction level.

When toning down variables that cause crashes, it seems reasonable to start with the number of CPU cores, then the memory fraction level, and finally the block size. This is because our results indicate that a higher number of CPU cores causes crashes at much earlier stages. The block sizes are lowered only when absolutely necessary, since our results indicate that they have the greatest impact on the running time of the application.

4.2.2 OpenBLAS Scaling

Figure 4.7 depicts the results of the scaling test performed on the OpenBLAS implementation. Using the configurations described in Figure 4.8, we were able to multiply matrices of size 40000² with 2 nodes, and 45000² with 3 nodes.

Figure 4.7: The graph showcases the results of the scalability test on clusters consisting of two and three slave nodes utilizing OpenBLAS.


Figure 4.8: Table describing the Spark configuration used when running the scalability test on two (top) and three (bottom) nodes.

The results seem to indicate that the multiplication speed scales linearly with minimal overhead cost. Figure 4.9 visualizes a comparison between the measured running time for the 3-node cluster and the expected running time if the two-node running time scaled perfectly, with no overhead cost, when expanding to three nodes. The expected running time for three nodes is calculated as 2/3 of the time taken by the two-node cluster.
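Expressed as a formula, the projection plotted in Figure 4.9 is simply:

    T_{\mathrm{projected}}(3) = \frac{2}{3} \, T_{\mathrm{measured}}(2)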

Additionally, as Figure 4.7 depicts, the addition of a third node allows the application to process larger matrices, of size 45000².


Figure 4.9: Graph depicting the similarity between the running time of the scalability test when utilizing three nodes and the running time projected from the results of the tests run on the two-node cluster.

4.2.3 NVBLAS Scaling

Figure 4.10 depicts the results of the scaling tests performed on the NVBLAS implementation. The configurations used can be found in Figure 4.11.

Figure 4.10: The graph showcases the results of the scaling test on clusters consisting of 2 and 3 slave nodes utilizing NVBLAS.


Figure 4.11: Table describing the Spark configuration used when running the scalability test on two (left) and three (right) nodes.

Similar to the results of the OpenBLAS test, the NVBLAS implementation's results showcase the ability to multiply larger matrices, and linear scaling with minimal overhead cost when expanding the cluster from two to three nodes. The projection was calculated in the same manner as described in the previous section, and the results are visualized in Figure 4.12.


Figure 4.12: Graph depicting the similarity between the running time of the scalability test when utilizing three nodes and the running time projected from the results of the tests run on the two-node cluster.

4.2.4 Comparison Results

Based on the results described in Sections 4.2.2 and 4.2.3, the following graph comparing the performance of the two implementations can be produced.

Figure 4.13: Graph comparing the performance of the CPU and the GPU implementations.


5 Discussion

In this chapter, the report is concluded by first discussing the results from the previous chapter. We then resolve our research questions and, finally, look back at our methodology and discuss potential areas of interest for future work.

5.1 Speculations and Conclusions

Based on our test results described in Chapter 4, we can make some speculations and draw some conclusions about the behaviour and scalability of both GPU- and CPU-based matrix multiplication on Spark clusters.

5.1.1 Performance

The two processing units used in our tests, the Nvidia Quadro K620 and the Intel i5-4460, were released at roughly the same point in time with similar price tags. A comparison between the performance of these two devices is therefore reasonable.

As Figure 4.13 illustrates, the GPU implementation's running time grew at almost half the rate of the CPU implementation's. This indicates that the superior calculation speed that GPUs showcase in local environments extends to a distributed environment as well, despite the fact that utilizing GPUs with Spark requires wrappers.

Furthermore, our results in Section 4.1.2 showed that utilizing the CPU's full capabilities, by using all of its cores, severely lowers the maximum size of matrices that Spark can handle.

