IT 20 008
Degree project 30 credits (Examensarbete 30 hp)
February 2020

Evaluating the Importance of Disk-locality for Data Analytics Workloads



Abstract

Evaluating the Importance of Disk-locality for Data Analytics Workloads

Christian Törnqvist

On-premise hardware platforms for big data analytics should be designed in a way that allows the available resources to be scaled both up and down depending on future needs.

Two of the main components of an analytics cluster are the data storage and the computational part.

Separating those two components yields great value but can come at the price of performance loss if not set up properly.

The objective of this thesis is to examine how much performance is impacted when the computational and storage parts are placed on different hardware nodes. To get data on how well this separation could be done, several tests were conducted on different hardware setups.

These tests included real-world workloads run on configurations where both the storage and the computation took place on the same nodes and on configurations where these components were separated.

While those tests were done on a smaller scale with only three compute nodes in parallel, tests with similar workloads were also conducted on a larger scale with up to 32 computational nodes.

The tests revealed that separating compute from storage on a smaller scale could be done without any significant performance drawbacks. However, when the computational components grew large enough, bottlenecks in the storage cluster surfaced.

While the results on a smaller scale were satisfactory, further improvements could be made for the larger-scale tests.


Table of Contents

1 Introduction
2 Related work
2.1 Making Sense of Performance in Data Analytics Frameworks
2.2 SparkBench
2.3 Unique Angle of this Thesis
3 Background
3.1 Apache Spark
3.2 Collectl
3.3 Docker and Kubernetes
3.4 Scaling Methods
3.4.1 Weak Scaling
3.4.2 Strong Scaling
4 Method
4.1 Hardware Setup
4.2 System Architecture
4.3 Method Used for Testing
4.4 Workloads
4.4.1 GraphX workloads
4.4.2 MLlib workloads
4.4.3 Terasort
4.5 Workflow improvement
5 Cluster Performance
5.1 Disk throughput
5.2 Storage cluster throughput
5.2.1 Test cluster
5.2.2 Production cluster
5.3 Network transfer throughput
6 Results and Discussion
6.1 Common bottlenecks found
6.2 Performance Comparison Between Hardware Setups
6.2.1 Logistic Regression
6.2.2 PCA
6.2.3 Terasort
6.2.4 SVD++
6.3 Weak Scaling Tests
6.3.1 Logistic Regression
6.3.2 PCA
6.3.3 Terasort
6.4 Strong Scaling Tests
6.4.1 SVD++
6.4.2 Connected Component
6.4.3 Terasort
7 Conclusion
8 Future Work
A Appendix


1 Introduction

Traditionally, the models for on-premises deployment of analytics infrastructure have been single dedicated bare-metal clusters with direct-attached storage, aiming to serve many different users and use-cases. However, as enterprises grow their on-premises analytics infrastructure, they often find themselves deploying multiple clusters with multiple dedicated bare-metal servers to support a wide variety of use-cases and new versions of innovative frameworks. This task very quickly becomes both complicated and expensive. For instance, if more computational power is required, but the current level of storage is sufficient, a company is unlikely to want to invest in additional storage nodes.

An arising question is thus: can we separate the computational nodes from the storage nodes without incurring any performance loss? Another driving factor behind this question is the emerging container-based cluster management platforms such as Kubernetes [1]. In Kubernetes, the storage cluster will also typically run detached from the computational nodes.

One existing article that highlights this question is Disk-Locality in Datacenter Computing Considered Irrelevant [2] by Ghodsi et al. The article takes the stance that disk-locality is "irrelevant" in modern clusters due to the faster networking solutions that are becoming commercially available [2]. It also states that many other systems [3][4][5] use disk-locality as one of their primary metrics for evaluating efficiency, while claiming that earlier work and its "quest for disk-locality" were based on the following assumptions:

1. Disk I/O is faster than network bandwidth

2. Disk I/O constitutes a significant fraction of a task’s lifetime.

Other arguments the article makes concern how modern frameworks compress and decompress data when interacting with the disks. Decompression demands more CPU utilization while at the same time demanding less from the network, which is an additional argument for the reduced need for disk-locality. Also, older data can have as few as one replica. In scenarios where the replica counts are low, it is improbable that the data is local in any case as the number of worker nodes increases.

This project examines how much impact separating storage from the computing nodes has on performance in modern high-performance computing (HPC) clusters with fast internal networking. The thesis is written in collaboration with Ericsson and a project which provides a Kubernetes-based on-premise platform for data analytics. For this Ericsson project, the computing nodes are entirely separated from the storage nodes, which directly corresponds to what is examined in this thesis.


2 Related work

This section presents two additional articles that have been a great inspiration for the work in this thesis.

2.1 Making Sense of Performance in Data Analytics Frameworks

Finding bottlenecks when running a workload on any given setup is vital when attempting to optimize performance. The article Making Sense of Performance in Data Analytics Frameworks [6] by Ousterhout et al. introduces methods for examining bottlenecks in analytics workloads.

Ousterhout et al. also mention that related work in the literature has put too much focus on the belief that network I/O, disk I/O, and straggler tasks constitute the majority of bottlenecks. A straggler task arises when, e.g., multiple processes work in parallel to finish a job and one process ends later than the others, leaving the remaining processes idle while waiting for it to complete. The authors introduce a concept called blocked time-analysis to enable programmers to identify bottlenecks more quickly. This method uses white-box logging to measure how long each task spends blocked on a given resource.

Figure 1 displays an example of how blocked time-analysis can be used to figure out how much of a bottleneck the network was. The leftmost graph in the figure displays the time the system spent on various resources. The black boxes represent the time the system was blocked on network I/O. If the time spent blocked on network I/O were removed, the job would finish faster, which the right graph demonstrates. This method of "removing" the time spent on a specific resource can reveal whether that particular resource is a bottleneck in the system.

Figure 1: Blocked time-analysis example from the article Making Sense of Performance in Data Analytics Frameworks [6]
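The idea behind blocked time-analysis can be illustrated with a small sketch. The code below is not the tooling used by Ousterhout et al.; it is a minimal illustration with hypothetical per-task numbers, assuming that per-task run-time and per-task blocked time on one resource have already been measured, and using a simple greedy re-scheduling as a stand-in for the real scheduler.

```python
# Minimal illustration of blocked time-analysis (not the implementation from
# Ousterhout et al.): given per-task run-times and the time each task spent
# blocked on a resource (e.g. network I/O), estimate the best-case job
# completion time if that blocked time were eliminated.

def best_case_completion(tasks, n_slots):
    """tasks: list of (runtime_s, blocked_s); n_slots: parallel task slots."""
    # Remove the blocked time from every task ...
    reduced = sorted((rt - bl for rt, bl in tasks), reverse=True)
    # ... and greedily re-schedule the tasks on the available slots.
    slots = [0.0] * n_slots
    for t in reduced:
        slots[slots.index(min(slots))] += t
    return max(slots)

# Hypothetical (runtime, network-blocked) measurements for four tasks.
tasks = [(10.0, 3.0), (12.0, 1.0), (8.0, 4.0), (11.0, 0.5)]
original = best_case_completion([(rt, 0.0) for rt, _ in tasks], n_slots=2)
no_network = best_case_completion(tasks, n_slots=2)
print(f"original: {original:.1f}s, without network blocked time: {no_network:.1f}s")
```

Comparing the two numbers gives an upper bound on how much faster the job could have completed if the network were infinitely fast, which is the kind of bound reported in the article.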

Using the collected run-time data, the authors experimented with eliminating the elapsed time of disk and network I/O, respectively. After doing this, the authors concluded that when removing elapsed time for disk I/O, the maximum improvement to job completion time could not exceed 19%. They observed that the average CPU utilization was close to 100%, while median disk utilization was at 25%. They attributed the high CPU usage and low disk utilization to compression, decompression, and deserialization of data (rebuilding stored data into objects in, e.g., the Java virtual machine (JVM)). Due to the low disk utilization, they believed that upgrading hard disk drives (HDD) to solid-state drives (SSD) would not have a significant impact on performance.

For the networking part, they found that job completion time could only be improved by up to 2% by making the network infinitely fast. The authors concluded that one reason for this low number is that data transferred over the network is a subset of the data transferred to disk, which means that jobs are more prone to be delayed by disk interaction than by network interaction.

In past work, optimizing the network has led to more significant performance improvements. One of the main reasons for this is that previous efforts have focused on workloads where input data is equal to shuffle data (shuffling represents moving data between nodes), which they claim is not representative of typical workloads [6].

They also studied the impact stragglers have on performance. By eliminating the stragglers, the performance could improve by 23-41%. For some stragglers, they were unable to find the root cause. Some example solutions to stragglers which they discovered were: allocating fewer objects (to target garbage collection stragglers) or reducing the number of files for map output (to target shuffle write stragglers).

All their tests ran on a 20 node cluster. For scaling testing purposes, an experiment was also conducted with 60 machines, using the same TPC-DS workload but with three times larger input data. The authors found that testing on a larger scale did not yield any different conclusions in terms of performance improvements when removing stragglers and time spent on disk and network I/O.

Ousterhout et al. conclude their article by stating that their findings should not be taken as facts for every workload and hardware setup. Instead, they emphasize the importance of running blocked time-analysis when attempting to optimize analytics frameworks on a given system.

2.2 SparkBench

The article SparkBench: A Comprehensive Benchmarking Suite For In-Memory Data Analytic Platform Spark [8] introduced a suite for benchmarking workloads on Apache Spark.


The benchmarking suite SparkBench aids the user in both generating data for the workloads and running them on top of Apache Spark. According to the authors, although the community has developed a rich ecosystem around Spark, no Spark-specific benchmark existed in the literature. Because of this, they introduced SparkBench, which covers four representative categories of workloads. These include graph computation applications supported by Spark's GraphX [10], machine learning supported by MLlib [11], SQL applications supported by the native Spark SQL service and Hive [12] on top of Spark, as well as streaming, which is supported by DStream [13].

The article also presents some example runs of the available workloads, while also displaying CPU, memory, disk, and network I/O in performance graphs. Figure 2 demonstrates an example of how the article presents those graphs visually.

Figure 2: Excerpt from the SparkBench article [8] of resource consumption graphs when running an SVM workload

After presenting the performance results, the authors analyze them to identify any performance bottlenecks. Based on the observations of the graphs, each workload is then characterized based on its data access pattern, resource consumption, and job execution time.

2.3 Unique Angle of this Thesis

This thesis presents the results of running workloads on various hardware setups. Tests are carried out both on setups where the computing nodes also run the storage cluster and on setups where the two are separated. One of the primary tools in the workflow was SparkBench, since it allows for both generating input data for and running multiple relevant workloads.

The blocked time-analysis method was, for some workloads, used in a similar way to what Figure 1 presented, e.g., by removing the time spent idling. Spark's built-in interface also allows the user to view different Spark jobs and observe different stages (subparts of a Spark job) in the built-in Spark history server. When viewing a Spark stage in the history server, it is possible to see how much time each node spent on sending/receiving data over the network and on actual computation.



One workload in SparkBench that can provide a lot of insight into the performance of a cluster is Terasort [14]. In simple terms, this algorithm reads a large input dataset, sorts it, and then writes it back to the storage medium. In this scenario, up to 32 workers read simultaneously from the storage cluster. Having 32 nodes accessing a single resource in the cluster in parallel lays an excellent foundation for finding potential bottlenecks, especially when comparing the results with those of running with fewer processing elements. Storage cluster related bottlenecks are one of the main focuses of this thesis. Combining this focus with the exploration of separating compute from storage gives this thesis a valuable angle when looking at the performance of HPC clusters.

3 Background

This section explains the different technologies and scaling methods used in this thesis.

3.1 Apache Spark

The open-source data analytics frameworks Hadoop MapReduce [15] and Apache Spark [9] are both designed to be capable of handling datasets more substantial than what can typically be handled by a single node. They do this while automatically providing load-balancing and scheduling in a distributed dataflow system.

In the traditional Map-Reduce model [4], as implemented in Hadoop MapReduce, the data flow graphs are limited to being acyclic, which prevents efficient use of the iterative algorithms that, e.g., machine learning requires. The method Map-Reduce uses for resiliency is data redundancy. Spark, on the other hand, uses resilient distributed datasets (RDDs) [16] for resiliency, which carry enough information (lineage) to be able to recompute their contents. Spark is also not limited to acyclic dataflow graphs; instead, it makes it possible to express iterative algorithms [9]. Hadoop MapReduce also had the limitation that data needed to be written to hard drives after a Map-Reduce operation had finished. Apache Spark has the advantage of being able to store data in memory between operations, which provides a large performance advantage.



Figure 3: Basic Apache Spark architecture

RDDs represent a read-only, resilient collection of objects that typically is too large to be stored on a single node. The resilient distributed datasets are partitioned across the executors of a Spark cluster. The RDDs can be created either from a file residing in a distributed file system or from an existing collection created in the driver program. The programmer can apply two different types of operations on the RDDs: transformations (for instance, map) and actions (for instance, reduce). Actions produce non-RDD values, which can be returned to the driver program or written to a storage medium, while transformations transform an existing parent RDD into a new child RDD.

Three units of work exist in Spark, where a job is the biggest one, and it consists of one or more stages. A stage gets created when an operation launches on an RDD. In turn, the stages consist of several smaller tasks, which is the smallest unit of work in Spark.

Each partition in the RDD gets associated with a task during an action or transformation. The Spark executors on which these partitions reside are responsible for running these tasks. Once the task has finished, these partitions may move to a different node, depending on whether the operation is a transformation with a wide dependency. There are two types of dependencies [17] for the transformations, narrow and wide. In a narrow transformation, all elements required to calculate a particular partition in a child RDD already reside in the corresponding partition of the parent RDD. Simple functions such as map and filter both generate narrow dependencies. Wide dependencies stem from, for instance, join and groupByKey transformations, where multiple child partitions may depend on the same partition of the parent RDD. Wide transformations can thus involve data shuffling across the network, which can be a bottleneck for Spark applications [17].

In a shuffle write, the intermediate data produced by a stage gets written to secondary storage (for fault-tolerance purposes [16]). A shuffle read then involves the total records and bytes read, both locally and from remote executors.

3.2 Collectl

Collectl [18] is a tool that can collect timeline performance data on a single node and store it in a file. The type of data collectl gathers can be, for instance: CPU wait/utilization, local disk throughput, network throughput, and memory usage, along with a large number of other metrics.

3.3 Docker and Kubernetes

Docker [19] is an open-source project that provides the Docker Engine, which can run applications inside containers that are isolated from other processes in the system. Unlike virtual machines [20], containers share the same kernel space as the host OS. This makes containers very light-weight and quick to launch. To boot a container inside the Docker engine, an image is required. Images can easily be created and tailored exactly to the developer's needs. Ideally, the size of the images should be kept as small as possible to allow for even quicker container launch times. Once an image has been created, it can be shared with others by uploading it to a repository available to other users, who will be able to make use of the containerized app on their platforms. If a user wants to run Docker containers in a distributed setup where both storage and networking needs can be satisfied, the open-source container platform Kubernetes can be used.

Kubernetes [1] (K8s) is an open-source container platform used for managing services and containerized workloads. With K8s, developers can rapidly deploy their applications inside a Kubernetes cluster, where they are automatically distributed over different nodes. These deployments can be configured so that all launched containers have access to the same persistent storage. A common way is to have this persistent storage running on nodes separated from the "compute nodes" on which the Docker containers will launch.

3.4 Scaling Methods

Two different methods were utilized for testing how well the system scaled with an increasing number of workers and increasing input data.

3.4.1 Weak Scaling

Weak scaling [21] measures how the run-time changes when the problem size grows in proportion to the number of processing elements. Thus, when performing weak scaling tests, the input size increases by the same factor as the number of workers. For example, a job that starts with 1GB input data and one worker would in the next step run with two workers and 2GB input data. In ideal scenarios, the run-time stays constant as the number of workers goes up. One way to measure weak scaling is Weak Scaling Efficiency (WSE). It is calculated as WSE = t1/tn, i.e., the ratio of the run-time with one worker to the run-time with n workers.
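As a small worked example of the WSE formula, the snippet below computes t1/tn for a series of weak scaling runs. The run-times are made-up numbers for illustration, not measurements from this thesis.

```python
# Weak Scaling Efficiency: WSE = t1 / tn, where t1 is the run-time with one
# worker and tn the run-time with n workers (input grows with the workers).
# The run-times below are illustrative numbers only.
runtimes = {1: 139.0, 2: 139.0, 4: 137.0, 8: 147.0, 16: 149.0, 32: 161.0}

t1 = runtimes[1]
for n, tn in sorted(runtimes.items()):
    print(f"{n:>2} workers: run-time {tn:6.1f}s  WSE = {t1 / tn:.2f}")
```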


Figure 4: How the input size varies while increasing workers in weak scaling tests

3.4.2 Strong Scaling

An additional way to test the parallel scalability is to perform strong scaling [21] tests. Unlike weak scaling, the problem size stays fixed while the number of processing elements increases. Thus, when doubling the number of workers, the run-time is halved when linear scaling is present. Figure 5 shows how the fixed input size is divided when doubling the number of workers.

The formula used for calculating strong scaling efficiency (SSE) is SSE = t1/(N * tn), where t1 is the run-time with one worker, N is the number of workers, and tn is the run-time with N workers. Optimal linear strong scaling is achieved when the SSE stays at 1.0 as the number of workers increases.
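A corresponding worked example for strong scaling efficiency, again with made-up run-times, is shown below.

```python
# Strong Scaling Efficiency: SSE = t1 / (N * tN), where the problem size is
# fixed while the number of workers N increases. Illustrative numbers only.
runtimes = {1: 800.0, 2: 410.0, 4: 215.0, 8: 120.0}

t1 = runtimes[1]
for n, tn in sorted(runtimes.items()):
    sse = t1 / (n * tn)
    print(f"{n} workers: run-time {tn:6.1f}s  SSE = {sse:.2f}")
```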


Figure 5: How the input size varies while increasing workers in strong scaling tests

4 Method

This section outlines the method used for studying the relevancy of disk-locality.

4.1 Hardware Setup

Several benchmarking tests got executed on different hardware setups. Most setups resided inside a test cluster wherein a maximum of three storage nodes and three workers performed the tests. Benchmarks were also executed on a much larger cluster, the “production cluster,” with up to 32 workers and five storage nodes.

The tests ran on two different types of nodes. The hardware of these nodes is the following:

1. HPE Apollo 4200 Gen9. These nodes have an Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00GHz processor with two CPU sockets, each with 14 CPU cores running one hardware thread each, thus totaling 28 cores. Available RAM is 128GB, and the available disk write speed to the local SSD is around 1GB/s. For the storage, each HPE 4200 node has five spinning disks inside the test cluster and seven spinning disks inside the production cluster. An individual spinning disk can achieve a read and write throughput of up to 200MB/s.

2. HPE 700 nodes with an Intel(R) Xeon(R) CPU E3-1585L v5 @ 3.00GHz, 4 cores per node and a total of 8 virtual cores. Each node has 64GB of RAM. NVMe SSDs are available on these nodes, which are typically faster than regular SSDs since they connect through the PCI Express bus instead of via the SATA port. For both read and write, the achievable throughput is around 2GB/s.

The HPE 700 nodes connect through two switches inside the Moonshot chassis. These two Moonshot switches have a capacity of 320 GigE each. The HPE 4200 nodes connect to an Arista switch with 32 ports, each with a capacity of 100 GigE. The Arista switch is then directly connected to the switches of the Moonshot chassis.

The Moonshot chassis has a capacity of 45 HPE 700 nodes. Since the HPE 700 nodes have, in theory, the capacity to send and receive data with a total throughput of 20Gb/s each, the Moonshot switches, which have a total capacity of 2 * 320Gb/s, are a theoretical bottleneck since 20 * 45 > 2 * 320, i.e., 900 > 640.

4.2 System Architecture

The hardware introduced in the previous section is divided into several different cluster configurations. For the mentioned test cluster, there are three different setups. The production cluster also had a similar setup employed for contrasting the performance results against the test cluster. Figure 6 displays an overview of the different hardware setups.


Figure 6: The different hardware setups used in this project

• The setup “HPE 4200 Remote” runs on three separate HPE 4200 nodes while the storage cluster still runs on a different set of three HPE 4200 nodes in the same cluster.

• Another one is “HPE 4200 Local,” for which the storage cluster runs on the same nodes as the computation occurs.

• The setup "HPE 700 Test" is very similar to "HPE 4200 Remote," except that the worker nodes run on HPE 700 nodes.

• The last setup, "HPE 700 Prod.," runs in the production cluster, where the Spark workers run inside Docker containers rather than directly on bare metal; some performance drawback is therefore expected in this cluster, but on a very small scale.

The storage nodes were unable to serve as compute nodes in the production cluster, so comparing the two scenarios of local vs. non-local storage was not possible in the production cluster. This type of comparison was only possible in the smaller test cluster, where the storage nodes could run both the storage cluster and the Spark workers.

4.3 Method Used for Testing

Several Spark workloads (which are explained later in Section 4.4) were benchmarked in the test cluster to see how well separating compute from storage would pan out. For every test, each workload ran at least four times to get the average run-time performance. Each workload also had its own set of input data generated, and that input data got reused for every hardware setup.

Regarding the fairness of the different hardware setups, the number of available cores is higher for the HPE 4200 nodes in comparison to the HPE 700 ones. The HPE 700 nodes have only eight cores each, while the HPE 4200 has 28 available cores. Since the purpose of the thesis is to compare having the workload running locally vs. non-locally, the number of Spark executors (utilized cores) is limited to eight for the case when the HPE 4200 nodes serve as workers. This way, the number of executors is the same for both the local and the non-local setup, resulting in a fairer comparison between the two. An acknowledged limitation of this method is that, e.g., eight utilized cores on the HPE 4200 may perform better than eight utilized cores on the HPE 700, or vice versa, depending on the workload. Thus, to better understand what may be a bottleneck for an individual workload on a specific hardware setup, the statistics involving CPU usage and memory usage have been interpolated to emulate all nodes having the same hardware for the tests where local vs. non-local storage is compared. This method is a trade-off that can lead to some small inconsistencies when displaying, e.g., CPU utilization. An example of this is that the average CPU utilization can go above 100% when applying this interpolation method.
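The thesis does not spell out the exact interpolation. The sketch below shows one plausible interpretation (an assumption, not necessarily the method used): utilization measured over all 28 cores of an HPE 4200 node is rescaled to the eight cores the Spark executors were limited to, which also explains how interpolated averages above 100% can appear.

```python
# Hedged sketch of the interpolation described above: rescale CPU utilization
# measured across all physical cores of an HPE 4200 node (28 cores) so that it
# is expressed relative to the 8 cores the Spark executors were limited to.
# This is one plausible interpretation, not necessarily the exact method used.

HPE4200_CORES = 28
EXECUTOR_CORES = 8

def interpolate_cpu_util(measured_util_percent):
    """measured_util_percent: average utilization over all 28 cores (0-100)."""
    busy_cores = measured_util_percent / 100.0 * HPE4200_CORES
    return busy_cores / EXECUTOR_CORES * 100.0   # may exceed 100%

# e.g. 33% utilization over 28 cores corresponds to ~115% of 8 cores
print(f"{interpolate_cpu_util(33):.0f}%")
```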

In the article Making Sense of Performance in Data Analytics Frameworks [6], the authors verified their results by running the workloads on a larger scale, where they increased the number of workers from 20 to 60. This thesis takes a similar approach for verifying the results by running SparkBench workloads inside the production cluster while performing both weak and strong scaling efficiency [21] tests. Also, similarly to the blocked time-analysis introduced in Section 2.1, the time the executors spent idling or waiting on network I/O was examined for some of the workloads.


4.4 Workloads

This section presents the Apache Spark workloads that got executed in the benchmarking tests.

4.4.1 GraphX workloads

GraphX [10, 22] is an embedded, general graph processing framework built on top of Apache Spark. It provides composable graph abstractions that are sufficient to express existing graph APIs. The workloads from this framework used in the tests are Connected components and SVD++:

• Singular-value decomposition (SVD) [23] is, in linear algebra, the factorization of a complex or real matrix. The Spark implementation, SVD++, can be used for improving recommendations for users based on feedback these users provide [8]. It creates a collaborative filtering model [24], meaning that the feedback of multiple users can, for instance, be used to recommend an item to a user based on similarities in what different users have in common.

• A connected component [25] is, in graph theory, a subgraph of an undirected supergraph where all vertices in the subgraph are connected through some path. For instance, in a supergraph where two subgraphs are not connected, there are two connected components.

The connected components algorithm puts a label on each vertex in a connected component with the ID of its lowest-numbered vertex. One use case for this is approximating clusters in social networks.

4.4.2 MLlib workloads

MLlib [11] is the machine learning library for Apache Spark. It is written in Scala but also uses native C++-based linear algebra libraries. The workloads used from this framework are logistic regression, singular value decomposition (referred to as SVD++), and principal component analysis (PCA). Below is a more detailed description of each workload:

• Logistic regression [26], when used as a machine learning classifier, predicts categorical or continuous data. For instance, one use case is to detect if a patient has cancer given features such as age, gender, and various blood tests. To train its classification model, it uses stochastic gradient descent. Input data is kept in memory through RDD abstractions, and a parameter vector is computed, updated, and broadcast in each iteration [8].


4.4.3 Terasort

One of the workloads, Terasort [14], does not belong to a specific framework. It is a benchmark that typically measures the amount of time required to sort one terabyte of randomly distributed data. For this experiment, the input data size may vary depending on the size of the cluster.

In Spark, the Terasort workload has four stages: a read stage, which reads the data from storage; a sort stage; a partition stage, which uses the first seven bytes of the byte array to partition the keyspace evenly; and finally a write stage, which writes the sorted and partitioned data back to storage.
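The SparkBench implementation is not reproduced here, but a highly simplified PySpark sketch of the same read-sort-write structure could look as follows; the paths, the key prefix length, and the partition count are placeholders for illustration only.

```python
# Highly simplified sketch of a Terasort-style job in PySpark (read, sort on a
# key prefix, write back). This is not the SparkBench implementation; paths,
# key format, and partition count are placeholders for illustration.
from pyspark import SparkContext

sc = SparkContext(appName="terasort-sketch")

records = sc.textFile("hdfs:///tmp/terasort-input")                    # read stage
keyed = records.map(lambda line: (line[:7], line))                     # key on a fixed-size prefix
sorted_records = keyed.sortByKey(numPartitions=192)                    # sort/partition stages (triggers a shuffle)
sorted_records.values().saveAsTextFile("hdfs:///tmp/terasort-output")  # write stage

sc.stop()
```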

4.5 Workflow improvement

Several Python scripts were written and served the purpose of running the workloads, fetching, and presenting the performance data. For instance, to be able to run weak and strong scaling tests where the number of workers and input size might vary, spark-job-starter.py was created. This script serves as a wrapper for the SparkBench framework. It can, in a few runs, generate the required data for all desired workloads on all different scales, run all those workloads while increasing the number of workers of each run (if desired), run each workload multiple times and then report average results. While doing this, it can also start collectl processes on each node while collecting performance data, and once all workloads have finished running, send the collectl data back to the master and store them in a zip file.

The files produced by collectl are not designed to be read by humans. The script collectl-data-parser.py converts the raw data into statistical data and presents it in a readable format. Some of the statistics this script calculates are average disk read throughput, amount of network I/O, and CPU utilization.
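As an illustration of the kind of processing collectl-data-parser.py performs (the actual script is not reproduced in this document), the hedged sketch below assumes collectl was run with plot-format output (-P), i.e., whitespace-separated columns with a header line starting with '#'; the example column name is an assumption and should be checked against the collectl version in use.

```python
# Hedged sketch of the kind of processing collectl-data-parser.py performs.
# It assumes collectl plot-format output (-P): whitespace-separated columns
# with a header line starting with '#'. The column name below is an assumption.
import sys

def column_average(path, column_name):
    header, values = None, []
    with open(path) as f:
        for line in f:
            if line.startswith("#"):
                header = line.lstrip("#").split()          # remember the last header seen
            elif header:
                row = line.split()
                if len(row) == len(header):
                    values.append(float(row[header.index(column_name)]))
    return sum(values) / len(values) if values else 0.0

if __name__ == "__main__":
    # usage: python collectl_avg.py <collectl-plot-file> <column-name>
    print(column_average(sys.argv[1], sys.argv[2]))
```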

The workers of a running Spark job generate data that is parsed by the Spark history server. This history server enables the user to view previous jobs and explore various statistics from a finished job. While the Spark History UI is useful, it does not display some of the vital metrics required by this project. One such metric was the total time the executors spent idling while waiting for straggler tasks to finish. The script spark-data-parser.py parses the history files and prints some of the statistics that are hidden in the Spark history server, such as the total time each job spent idling or waiting on network I/O.

The Appendix displays some example outputs from the data parsing scripts.

5 Cluster Performance


5.1 Disk throughput

In Figures 7-10, local disk I/O is shown for each different machine used in this thesis. Each row represents how many processes are reading and writing concurrently (to different files).
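A minimal sketch of how such a multi-threaded disk test can be scripted is shown below; the file size, path, and thread counts are illustrative, not the exact parameters used for Figures 7-10. Each thread writes its own file, and the aggregate throughput is derived from the total bytes written and the elapsed wall-clock time.

```python
# Minimal sketch of a multi-threaded local-disk write throughput test:
# each thread writes its own file, and aggregate MB/s is computed from the
# total bytes written and the wall-clock time. Sizes/paths are illustrative.
import os, time
from concurrent.futures import ThreadPoolExecutor

FILE_SIZE_MB = 1024
CHUNK = b"\0" * (1024 * 1024)          # 1 MiB of data per write call

def write_file(path):
    with open(path, "wb") as f:
        for _ in range(FILE_SIZE_MB):
            f.write(CHUNK)
        f.flush()
        os.fsync(f.fileno())           # make sure the data actually hits the disk

def run_test(n_threads, directory="/tmp"):
    paths = [os.path.join(directory, f"disk-test-{i}.bin") for i in range(n_threads)]
    start = time.time()
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        list(pool.map(write_file, paths))
    elapsed = time.time() - start
    for p in paths:
        os.remove(p)
    return n_threads * FILE_SIZE_MB / elapsed

for threads in (1, 2, 4, 8):
    print(f"{threads} threads: {run_test(threads):.0f} MB/s")
```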

As seen in Figures 7-8, the performance of the SSDs inside the HPE 4200 nodes drops quite significantly once the number of threads reaches eight threads for writing and as early as four threads when reading.

Figure 7: Write throughput on a HPE 4200 node with an increasing amount of threads that write to different files on the local SSD (1 thread: 1286 MB/s, 2 threads: 1555 MB/s, 4 threads: 1845 MB/s, 8 threads: 418 MB/s)

Figure 8: Read throughput on a HPE 4200 node with an increasing amount of threads that read from different files on the local SSD (1 thread: 1931 MB/s, 2 threads: 2826 MB/s, 4 threads: 262 MB/s, 8 threads: 301 MB/s)


Figure 9: Write throughput on a HPE 700 node with an increasing amount of threads that write to different files on the local NVMe SSD (1 thread: 1970 MB/s, 2 threads: 1703 MB/s, 4 threads: 1590 MB/s, 8 threads: 1140 MB/s)

Figure 10: Read throughput on a HPE 700 node with an increasing amount of threads that read from different files on the local NVMe SSD (1 thread: 1443 MB/s, 2 threads: 2101 MB/s, 4 threads: 2313 MB/s, 8 threads: 2398 MB/s)

5.2 Storage cluster throughput

This section presents the throughput for writing to and reading from the storage cluster from both the test and production cluster’s standpoint. The tests used the Linux cp command to copy files between the local disk and a folder residing inside the storage cluster. When copying to the cluster, a local 2GB file is copied to the mounted folder using a unique file name, so two different nodes never write to the same location. Four different threads perform the copying operations simultaneously on each node.
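A hedged sketch of this copy test is shown below; the mount point, the local file path, and the destination naming are placeholders. As in the description above, each of four threads copies a local 2GB file to a node-unique destination inside the mounted storage folder using cp, and the aggregate throughput is computed from the elapsed time.

```python
# Hedged sketch of the storage-cluster copy test described above: four threads
# copy a local 2 GB file to unique destinations in the mounted storage folder
# using cp, as in the thesis. Paths and the mount point are placeholders.
import os, socket, subprocess, time
from concurrent.futures import ThreadPoolExecutor

LOCAL_FILE = "/tmp/testfile-2GB"            # pre-generated 2 GB file (placeholder path)
STORAGE_MOUNT = "/mnt/storage-cluster"      # placeholder mount point of the storage cluster
THREADS = 4

def copy_once(thread_id):
    # unique destination per node and thread so no two writers hit the same file
    dest = os.path.join(STORAGE_MOUNT, f"{socket.gethostname()}-t{thread_id}.bin")
    subprocess.run(["cp", LOCAL_FILE, dest], check=True)

size_mb = os.path.getsize(LOCAL_FILE) / (1024 * 1024)
start = time.time()
with ThreadPoolExecutor(max_workers=THREADS) as pool:
    list(pool.map(copy_once, range(THREADS)))
elapsed = time.time() - start
print(f"write throughput: {THREADS * size_mb / elapsed:.0f} MB/s")
```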

5.2.1 Test cluster


Figure 11: Read and write throughput to the storage cluster. Each test has three nodes reading and writing with four threads in parallel (approximately 1250 MB/s for HPE4200 Local, 1200 MB/s for HPE4200 Remote, 1100 MB/s for HPE700 Test, and 890-930 MB/s for HPE700 Prod.)

5.2.2 Production cluster

Figure 12 displays the throughput when copying files from storage to the local file system, while Figure 13 displays throughput in the other direction. The storage cluster runs on five nodes with seven dedicated hard drives each, and every hard drive has around 200MB/s I/O throughput. Given this information, the total throughput of the storage cluster should optimally be close to 7000MB/s (200 * 5 * 7 = 7000) for reading. Since the storage uses three-way replication, another theoretical limit for writes is 2333MB/s (7000/3 = 2333).


Figure 12: Storage cluster read throughput when copying files from the storage cluster to local disks on varying amounts of machines. Each machine copies a different 2GB file stored in the storage cluster to its own disk. (1 node, 2GB read: 918 MB/s; 2 nodes, 4GB: 923 MB/s; 4 nodes, 8GB: 930 MB/s; 8 nodes, 16GB: 1474 MB/s; 16 nodes, 32GB: 2300 MB/s; 32 nodes, 64GB: 2633 MB/s)

Figure 13: Storage cluster write throughput when copying files to the storage cluster from local disks on varying amounts of machines. Each machine copies a 2GB file stored on its local disk to a machine-unique destination path in the storage cluster. (1 node, 2GB write: 432 MB/s; 2 nodes, 4GB: 889 MB/s; 4 nodes, 8GB: 832 MB/s; 8 nodes, 16GB: 1008 MB/s; 16 nodes, 32GB: 1525 MB/s; 32 nodes, 64GB: 1725 MB/s)

5.3 Network transfer throughput

This section displays how fast data can be transferred from one node to another. The experiment was done using the tool iperf3, which allows the user to test network throughput with varying packet sizes. Below are transfer rates presented for all different source and destination routes.


Figure 14: Transfer rates between nodes in the test cluster using the tool iperf3 (4200 to 4200: 2.2 GB/s; 700 to 700: 2.2625 GB/s; 700 to 4200: 0.6175 GB/s; 4200 to 700: 1.15375 GB/s)

Figure 15: Transfer rates between nodes in the production cluster using the tool iperf3 (4200 to 4200: 11.5 GB/s; 700 to 700: 1.155 GB/s; 700 to 4200: 1.24 GB/s; 4200 to 700: 1.24 GB/s)

Figures 16-17 show the results of performing network throughput tests between a single storage node and multiple compute nodes in a many-to-one and one-to-many fashion. The figures show that in both directions, the maximum achieved throughput is around 15GB/s. Comparing this speed with how slowly the storage cluster performed suggests that the storage is very likely to serve as a bottleneck. Figures 12-13 show that for reading, the nodes achieved a maximum throughput of only 2.6GB/s, and 1.7GB/s for writing.

Figure 16: Network throughput between multiple worker nodes and one storage node in the production cluster (2 nodes: 2.5 GB/s; 4 nodes: 4.9 GB/s; 8 nodes: 9.9 GB/s; 16 nodes: 14.2 GB/s; 32 nodes: 15.1 GB/s)

Figure 17: Network throughput between one storage node and multiple worker nodes in the production cluster (2 nodes: 2.4 GB/s; 4 nodes: 4.8 GB/s; 8 nodes: 8.5 GB/s; 16 nodes: 11.0 GB/s; 32 nodes: 14.6 GB/s)

Figure 18 shows the result of a test conducted on the switches which connect the nodes in the production cluster. Up to 32 node pairs (64 nodes in total) sent and received data simultaneously through the tool iperf3 to check for bottlenecks in the switch.

The worker nodes have 10Gbit interfaces, so unless the switch is a bottleneck for the total throughput, the average throughput should be close to 10Gbit/s for each worker. As observed in the figure, the network scales very well up to 16 node pairs. The test reveals a bottleneck at 32 node pairs, where the performance drops by approximately 1.05Gb/s per node.
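A hedged sketch of how such a pairwise iperf3 test can be scripted is shown below. The host names are placeholders, the sketch assumes passwordless SSH to the client nodes and an iperf3 server already running on each server node, and the JSON field names are assumptions from memory that should be verified against the iperf3 version in use.

```python
# Hedged sketch of a pairwise iperf3 test: each (client, server) node pair runs
# iperf3 for 10 seconds and the per-pair throughput is collected. Host names
# are placeholders; assumes passwordless SSH and `iperf3 -s` running on servers.
# The JSON field names are assumptions and should be verified.
import json, subprocess

PAIRS = [("worker01", "worker02"), ("worker03", "worker04")]  # placeholder host names

def run_pair(client, server):
    out = subprocess.run(
        ["ssh", client, "iperf3", "-c", server, "-t", "10", "-J"],
        capture_output=True, text=True, check=True,
    ).stdout
    report = json.loads(out)
    return report["end"]["sum_received"]["bits_per_second"] / 1e9  # Gbit/s

for client, server in PAIRS:
    print(f"{client} -> {server}: {run_pair(client, server):.2f} Gbit/s")
```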

Figure 18: Average throughput per node when running iperf3 between an increasing number of node pairs in the production cluster (4 node pairs: 9.90 Gb/s; 8 node pairs: 9.90 Gb/s; 16 node pairs: 9.90 Gb/s; 32 node pairs: 8.85 Gb/s)

6 Results and Discussion

This section presents the results of this thesis accompanied by discussion.

6.1 Common bottlenecks found

When analyzing the workloads, some bottlenecks recurred across multiple workloads. The consequences for those workloads also vary depending on the bottleneck.

One of the bottlenecks was not having sufficient working memory on each node. This lack of memory resulted in a CPU thrashing [28] state. When memory gets filled, pages of currently unused memory move to secondary storage. If these pages are required again, the system fetches those pages back to the main memory. In a CPU thrashing state, the number of pages that get swapped out and back in grows undesirably large, and the overall performance of the system plummets. Luckily, the nodes in this test are using fast NVMe SSDs, so the effects of CPU thrashing were not as substantial as they could have been.

Another observed bottleneck was the throughput of the storage cluster. The storage nodes of the production cluster were unable to serve too large a number of clients in parallel, which Section 5.2.2 also demonstrated.

6.2 Performance Comparison Between Hardware Setups

This section presents the different results for all workloads when comparing local vs. non-local persistent storage. These results are also contrasted against when running the workloads in the production environment. Table 1 displays an overview of the four different workloads used in this test.

Workload Input Output Shuffle Read Shuffle Write Stages Tasks
Logistic Regression 67 GB 0 GB 97.1 KB 59.5 KB 10 901
PCA 32 GB 0 GB 210.6 MB 389 MB 8 308
Terasort 20 GB 20 GB 12.2 GB 18.1 GB 5 192
SVD++ 1 GB 0 GB 67.1 GB 100.5 GB 103 1296

Table 1: Information on the different workloads.

Figure 19 displays an overview of the different run-times for all the workloads on all the different hardware setups. All setups are similar in having three Spark workers on three different nodes, with each worker having eight cores at its disposal. Below is an explanation of the different setups.

• HPE 4200 (local) refers to the setup where the workload runs on the same HPE 4200 nodes that are running the storage cluster.

• HPE 4200 (remote) refers to the setup where the workload runs on HPE 4200 nodes that are separate from the HPE 4200 nodes running the storage cluster.

• HPE 700 (test cluster) also has the storage on the same nodes as HPE 4200 (local); however, the workers run on remote HPE 700 nodes.

• HPE 700 (prod. cluster) represents nodes that run in the production cluster, with the storage cluster separated from the worker nodes.


Figure 19: Overview of run-times when running different workloads on different hardware setups

Figure 19 shows that the performance results do not differ by a vast amount between setups, despite each workload being executed by compute nodes of varying hardware specifications. The HPE 4200 nodes both utilize a larger number of cores and have more memory available. Because of this, the statistics involving CPU usage and memory usage have been interpolated to emulate all nodes having the same hardware. Both HPE 700 setups also exhibit similar performance, even though the Spark workers of HPE 700 Prod. run inside Docker containers.

6.2.1 Logistic Regression

The Logistic Regression workload spent an overwhelming amount of time on the read stage, which Figure 20 demonstrates. On average, for all the different hardware setups, 82% of the time is spent on reading the data. The setup that finished quickest was HPE 4200 (remote), and even though it was fastest, the difference in run-time for the "Remaining stages" was small.

Figure 20: Run-times when running LogisticRegression on different worker node setups (read stage / remaining stages: 4200 local 203/36 s; 4200 remote 143/34 s; 700 remote 174/39 s; Prod. remote 188/35 s)

An interesting thing to note is that HPE 4200 remote had its read stage finish faster than HPE 4200 local even though the storage cluster runs on those nodes. Looking at Table 2, HPE 4200 remote has double the throughput in terms of average network read in comparison to HPE 4200 local.

Setup 4200 local 4200 remote 700 remote Prod. remote
Avg. Net. read TP. 97MB/s 194MB/s 97MB/s 89MB/s
Avg. CPU util. 84% (interpolated) 115% (interpolated) 91% 89%

Avg. CPU wait 6.5% 0.5% 0% 0.2%

Table 2: Performance statistics for the read stage of the Logistic Regression workload. (Note: since the HPE 4200 nodes has more cores than the HPE 700 nodes, the CPU statistics are interpolated to compensate for this).

The average network read throughput was slightly worse for the nodes in the production cluster.

6.2.2 PCA

For this PCA workload, there were eight stages and 308 tasks in total. The workers read 32GB data from storage, and the amount of shuffle writes and reads are small, in the range of 200-400MB. Figure 21 shows an overview of how this workload performed.

Figure 21: Run-times when running PCA on different worker node setups

Just as for Logistic Regression, the hardware setup that performed best was 4200 remote. For the read stage (Table 3), 4200 remote showed the best storage read performance along with better CPU utilization. Even for this test, its read stage completed faster than for 4200 local, despite the storage cluster running on the HPE 4200 local nodes. Both HPE 700 setups had a similar run-time performance in the read stage.

Setup 4200 local 4200 remote 700 remote Prod. remote
Run-time 94s 70s 70s 70s
Net. read TP. 120MB/s 205MB/s 134MB/s 133MB/s
Avg. CPU util. 66% (interp.) 87.5% (interp.) 82% 83%
Avg. CPU wait 8% (interp.) 87.5% (interp.) 82% 83%
Avg. PageIn. 66% (interp.) 87.5% (interp.) 82% 83%

Table 3: Performance statistics for the read stage of the PCA workload


For the compute stage (Table 4), the HPE 4200 nodes outperformed the HPE 700 nodes, and this is most likely due to more available CPU processing power. HPE 4200 remote also utilizes more than eight cores (142% avg. CPU utilization). The HPE 700 nodes also had their memory usage at 100%, for 75% of the time that the compute stage was running. Scenarios like this are when the HPE 4200 nodes get an unfair advantage due to having more memory, which Figure 21 reflects.

Setup 4200 local 4200 remote 700 remote Prod. remote

Run-time 43s 43s 71s 71s

Avg. CPU util. 94.5% (interp.) 105% (interp.) 84% 81%
Max mem. alloc. 89% (interp.) 147% (interp.) 100% 100%
75th percentile alloc. mem. 88% (interp.) 142% (interp.) 99% 99%

Table 4: Performance statistics for the compute stage of the PCA workload (statistics for the HPE 4200 are interpolated to match the HPE 700).

6.2.3 Terasort

The Terasort workload had four stages, with 192 tasks for the HPE 700 setups and 2688 tasks for the HPE 4200 configurations. 20GB is read from storage and also written back to storage once the workload finishes. In total, it reads 12.2GB in shuffle reads while writing 18.1GB in shuffle writes.

Figure 22: Run-times when running Terasort on different worker node setups

Table 5 reveals some additional information on idle time. Although the HPE 700 setups idle for a longer duration, the idle time ratio is very similar to HPE 4200 remote, so the greater run-time of the HPE 700 nodes cannot be attributed to idling.

Setup 4200 local 4200 remote 700 remote Prod. remote

Run-time 102s 77s 105s 99s

Total idle time 5s 8s 17s 11s

Run-time without idle time 97s 69s 87s 88s

Idle time ratio 0.05 0.13 0.13 0.11

Table 5: Idle-time Statistics for the entire Terasort workload.

In the read stage, the HPE 4200 local setup ran for approximately double the time compared to the rest, which is very unexpected considering that the data already resides on those nodes. Table 6 reveals that the HPE 4200 local setup had an average PageIn of 7GB, while this is at 0.0GB for the rest of the hardware setups.


Setup 4200 local 4200 remote 700 remote Prod. remote

Run-time 57s 21s 27s 18s

Avg. CPU util. 31% (interp.) 66% (interp.) 44% 71%
Avg. CPU wait 12.4% (interp.) 2.2% (interp.) 7.5% 2%
Avg. network input TP. 143MB/s 378MB/s 268MB/s 396MB/s
Avg. network output TP. 144MB/s 11MB/s 9MB/s 8.5MB/s
Avg. disk read TP. 122MB/s 8.7MB/s 2MB/s 0.2MB/s

Avg. PageIn 7GB 0.0GB 0.0GB 0.0GB

Partitions 672 672 48 48

Table 6: Statistics for the read stage of Terasort

Tables 7-8 show the performance data for the sort and partition stages. The setups all perform very similarly in terms of run-time in these stages.

Setup 4200 local 4200 remote 700 remote Prod. remote

Run-time 6s 7s 7s 9s

Avg. CPU util. 140% (interp.) 119% (interp.) 70% 59%
Avg. network input TP. 289MB/s 360MB/s 264MB/s 209MB/s

Table 7: Statistics for the sort stage of Terasort

Setup 4200 local 4200 remote 700 remote Prod. remote

Run-time 16s 15s 18s 20s

Avg. CPU util. 108% (interp.) 108% (interp.) 83% 81%

Table 8: Statistics for the partition stage of Terasort

For the writing stage, the HPE 4200 local outperformed the remaining setups. It had the highest average network output throughput (storage write throughput) along with the best CPU utilization. By looking at this stage, having disk-locality on the nodes performing the computation appeared to be paying off.

HPE 4200 remote suffered from straggler tasks. By removing the time spent idling due to those tasks, the difference in run-time between 4200 local and remote would be reduced to only 5.3 seconds.

Setup 4200 local 4200 remote 700 remote Prod. remote

Run-time 24s 34s 53s 52s

Total idle time 0.7s 4.5s 6.7s 4.3s

Run-time without idle time 24.2s 29.5s 46.7s 47.5s

Avg. CPU util. 101.5% (interp.) 87.5% (interp.) 71% 77%
Avg. network output TP. 305MB/s 202MB/s 135MB/s 76MB/s
Allocated mem. 75th percentile 98% (interp.) 144% (interp.) 99% 86%

Table 9: Statistics for the write stage of Terasort

6.2.4 SVD++

SVD++ is an algorithm for which the input size can be minimal while the generated load and shuffle data can be massive. The workers read 67GB in shuffle reads while writing 100GB in shuffle writes. Figure 23 shows the different run-times for all hardware setups. Note: HPE 4200 remote is excluded from this test.

Figure 23: Run-times when running SVD++ on different worker node setups

HPE 4200 local performed best with a run-time of 182s. HPE 700 remote slightly outperformed prod. remote, which was likely due to slightly faster internal network throughput, as shown in Table 10. HPE 700 remote had an average network input throughput of 327.7MB/s, while HPE 700 prod. had 280.8MB/s and 282MB/s.

This difference is likely caused by the HPE 4200 nodes being able to progress the work faster due to having a more powerful CPU and more available memory. The HPE 700 nodes also had their memory allocated up to 99% for 75% of the time, while the HPE 4200 nodes never got close to their limits.


Setup 4200 local 700 remote Prod. remote

Run-time 182s 217s 236s

Avg. Net. input TP. 129.5MB/s 109.2MB/s 93.6MB/s

Total Net. input TP. 259MB/s 327.7MB/s 280.8MB/s

Max Net. input TP. 2078MB/s 1269MB/s 1079MB/s

Total Net. output TP. 254.7MB/s 322.3MB/s 282MB/s

Avg. CPU util. 135% (interpolated) 65.7% 59.4%

Avg. CPU Wait 0.3% 2.6% 2.3%

Max CPU Wait 11% 57% 49%

75th percentile alloc. mem. 121% 98% 99%

Total. PageOut 22MB 125GB 133GB

Table 10: Performance statistics for the SVD++ workload

6.3 Weak Scaling Tests

The following sections present tests where the Spark workloads are evaluated for weak scaling efficiency. The workloads chosen are relatively easy to analyze and are also sufficient for identifying the possible bottlenecks. Table 11 displays an overview of the chosen workloads.

Workload Input Output Shuffle Read Shuffle Write Stages Tasks

Logistic Regression 8GB*W 0GB 0GB 0GB 10 208*W

PCA 25.5GB*W 0 GB ~260MB*W ~260MB*W 7 ~210*W

Terasort 39GB*W 40.5GB*W 55.5GB*W 37GB*W 4 ~320*W

Table 11: The different weak scaling workloads with their input, output, and shuffle read/write sizes. "W" stands for the number of workers and "~" denotes an approximate number.

6.3.1 Logistic Regression

One of the workloads that performed best in terms of weak scaling was Logistic Regression. It ran with a starting input of 32GB for two workers, and up to 525GB for 32 workers. The amount of data shuffled was insignificantly small. At most, the WSE dropped to 77% for 32 workers, which is above average compared to the rest of the workloads. Figures 24-25 display different run-time performance and weak scaling efficiency for the different numbers of workers.

Figure 24: Run-time of weak scaling tests for Logistic Regression


Figure 25: WSE for the Logistic Regression workload with a starting input size of 16402MB

The study of this workload focused on a single stage that ran for 73-75% of the time. The remaining stages were either single-threaded or very short-lived (running for 0-6 seconds) and were therefore hard to examine. The analyzed stage is one that performs a mapPartitions operation, i.e., a function is applied to each partition inside the RDD. Meanwhile, it reads data from storage, so the read data gets processed as soon as it is available.

Table 12 displays the performance metrics of this mapPartitions/read stage when running with 2-32 workers. One of the reasons that this workload scales so well is that the rate at which the workers can read data from storage does not reach a bottleneck, which the row Total average storage read throughput shows. This metric shows that the average read throughput doubles each time the number of workers is doubled, which means that the storage did not serve as a bottleneck. The result of this is that the average CPU utilization also stays approximately the same for all worker sets.


Workers 2 4 8 16 32

Run-time 139 137 147 149 161

WSE 1 1.01 0.95 0.93 0.86

Idle Time 6.21 5.29 9.65 14.3 17.39

Run-time - Idle time 132.47 131.6 136.6 134.6 144.2

WSE without idle time 1 1 0.97 0.98 0.91

Avg. CPU util. 88% 89% 84.8% 82% 70%

Avg./max CPU wait 0.2% / 44% 0.2% / 26% 0.5% / 49% 0.8% / 60% 5.8% / 65%

Avg. Storage Read TP 72MB/s 86MB/s 97MB/s 90MB/s 89MB/s

Tot. avg. Storage read TP. 144MB/s 344MB/s 779MB/s 1442MB/s 2847MB/s

Total Storage read 20GB 47GB 114GB 215GB 461GB

Table 12: Statistics for performance metrics of the Logistic Regression workload in the read stage

6.3.2 PCA

The PCA workload was run with a starting input size of 25.5GB per worker and increased up until 816GB for 32 workers. For the scenario of having a single worker, the total amount of shuffle writes was 260MB, and the same amount was consumed in shuffle reads. For 32 workers, the shuffle read/write amounted to 7.3GB.

Figures 26-27 show that the workload performed well in the weak scaling tests with up to 16 workers where the WSE only dropped to 0.89. When doubling to 32 workers, the WSE dropped as low as 0.39. This section mainly compares having 16 vs. 32 workers to identify the bottlenecks.

Figure 26: Run-time of weak scaling tests for PCA

Figure 27: WSE for the PCA workload with a starting input size of 26388MB

The storage read stage, which also processes the data it reads (Table 13), ran for only 3.3 minutes with 16 workers but for much longer, 9.6 minutes, with 32 workers. Because it processes the data at the same time as reading it, the read throughput is primarily not capped by how fast each node can read data from storage but rather by how fast the data can be processed. For one worker, the average read throughput was 122MB/s (Table 13), so this number serves as the maximum read rate in this stage. The read throughput being close to 122MB/s also held for 16 workers, at 116MB/s. The maintained read rate is no surprise, since Figure 27 shows that weak scaling still performs well up to 16 workers.


Workers 1 16 32
Run-time 182s 199s 574s
WSE 1 0.91 0.31
CPU Util. 87% 80% 22%
Avg. CPU wait 0.2% 3.3% 28%
Avg. network input TP. 122MB/s 116MB/s 39MB/s
Total network input TP. 122MB/s 1790MB/s 1258MB/s
Total network input 22GB 358GB 723GB

Table 13: Performance statistics for the read stage of the PCA workload

6.3.3 Terasort

This workload started with an input and output size of 39GB for one single worker and up to 1.2TB for 32 workers. The amount of data used for shuffle reads was around 55.5GB per worker, and the shuffle writes averaged 37GB per worker.

Terasort is a highly valuable workload to run when searching for bottlenecks in the system. It has four different stages: one read stage, one sort stage, one partition stage, and finally a write stage. Additionally, the read stage also performs a shuffle write, i.e., it writes shuffle data to the local disk before the next stage starts.

The read stage is excellent for evaluating performance and exploring the existence of potential bottlenecks in the storage system. The sort and partition stages reveal the performance of the computation and shuffling parts of the system. The writing stage can reveal if there are any bottlenecks related to when combining shuffle reads and writing to storage.

Figures 29-30 reveal the performance for different input sizes. For up to four workers, the WSE never drops below 1.0. Starting at eight workers, the WSE begins to drop each time the input amount and worker count increase, as shown in Figure 30.

Figure 28: Run-times for different stages when running weak scaling tests of the workload Terasort with a starting input of 41GB and a varying number of workers

Figure 29: Run-time of weak scaling tests for Terasort


Figure 30: WSE for the Terasort workload with a starting input size of 10355MB

Starting with the read and write stages, Figure 31 reveals the decreasing average storage throughput per worker as the worker count grows. The result of the diminishing throughput is reflected in the chart shown in Figure 28, which displays an increasing run-time for each stage and every input size. The read stage (Table 14) is affected by the lower read throughput at eight workers, where its run-time increases slightly. This trend also continues for 16 and 32 workers.

Figure 31: Average network throughput for each node with varying number of workers for the read stage and write stage


Workers 1 2 4 8 16 32

Run-time 115s 115s 130s 196s 310s 525s

WSE (for this stage) 1 1 0.88 0.59 0.37 0.22
Avg. CPU util. 61% 61% 59% 42% 22.0% 13.5%
Avg. Network Input TP. 194MB/s 189MB/s 179MB/s 115MB/s 74MB/s 46.6MB/s

Table 14: Statistics for the read stage of the Terasort workload

Workers 8 16 32

Run-time 291s 454s 765s

WSE (for this stage) 1 0.64 0.38

Avg. CPU util. 74% 40% 24%

Avg. Storage Write TP. 85MB/s 54MB/s 33MB/s

Table 15: Statistics for the write stage of the Terasort workload

Regarding the sort and partition stages, the weak scaling efficiency performed well. The two stages do not involve any interaction with the storage cluster, which is what usually causes the bottlenecks. The sorting stage has a WSE of 0.78, while the partition stage only falls to 0.95. Table 16 gives an overview of the performance metrics for the sort stage. This table reveals that the drop in WSE is caused by the shuffle read time. Without it, the WSE would stay at 0.98 (WSE without Shuffle read time).

Workers                          8           16          32
Run-time                         47s         50s         60s
WSE                              1.0         0.94        0.78
Shuffle read time                4.17s       7.57s       15.52s
Run-time - Shuffle read time     42.8s       42.4s       44.5s
WSE without Shuffle read time    1.0         1.0         0.96
Idle Time                        0.0s        0.0s        0.0s
Avg. CPU util.                   65%         60%         53%
Avg. network output              332MB/s     336MB/s     301MB/s
Total network output TP.         2662MB/s    5388MB/s    9600MB/s
Total network output data        127GB       274GB       584GB

Table 16: Statistics for the sort stage of the Terasort workload
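As a worked example, the snippet below recomputes the sort-stage WSE values in Table 16 from the measured run-times, assuming WSE is the baseline run-time divided by the run-time at N workers (with the per-worker problem size held constant), which is the relation the table values follow. The object name WseSketch is purely illustrative.

    // Recomputing the sort-stage WSE values in Table 16.
    object WseSketch {
      def wse(baselineRunTime: Double, runTime: Double): Double =
        baselineRunTime / runTime

      def main(args: Array[String]): Unit = {
        // Measured sort-stage run-times (seconds) from Table 16.
        val runTimes = Seq(8 -> 47.0, 16 -> 50.0, 32 -> 60.0)
        runTimes.foreach { case (workers, t) =>
          println(f"$workers%2d workers: WSE = ${wse(47.0, t)}%.2f")
        }
        // Prints roughly 1.00, 0.94 and 0.78, matching the table.

        // Subtracting the shuffle read time isolates the pure sorting work.
        val withoutShuffle = Seq(8 -> 42.8, 16 -> 42.4, 32 -> 44.5)
        withoutShuffle.foreach { case (workers, t) =>
          println(f"$workers%2d workers: WSE w/o shuffle read = ${wse(42.8, t)}%.2f")
        }
        // Prints roughly 1.00, 1.01 and 0.96.
      }
    }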

6.4 Strong Scaling Tests


Workload             Input     Output    Shuffle Read   Shuffle Write   Stages   Tasks
SVD++                1.5GB     0GB       494-1800GB     494GB           27       13824
ConnectedComponent   19.3GB    0GB       67.2GB         52.1GB          23       11776
Terasort             193GB     201GB     446GB          297GB           4        2560

Table 17: The workloads of the strong scaling tests with their input, output, and shuffle sizes, as well as the number of stages and tasks

6.4.1 SVD++

For this workload, 1.5GB of data gets read from storage. The shuffle writes total 494GB, and the shuffle reads vary between 494GB and 1800GB depending on the number of workers.

As mentioned earlier, SVD++ is a workload that reads a small amount of data from storage but generates a far larger amount of shuffle data throughout its stages. Because of this, SVD++ was not well suited for the weak scaling tests: doubling the input size more than doubles the shuffle data, so the problem size grows disproportionately. For strong scaling tests, the input size stays the same, which allows fairer comparisons between the worker setups. This workload is also interesting because it does not put considerable stress on the storage cluster but instead on local computational power, network speed between nodes, and local disk throughput.
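As a rough illustration of how such a job can be expressed, the sketch below runs GraphX's built-in SVD++ on a small edge list. It is not necessarily the exact SparkBench implementation used in these tests; the input path, the comma-separated rating format, and the hyperparameter values are placeholder assumptions.

    import org.apache.spark.graphx.Edge
    import org.apache.spark.graphx.lib.SVDPlusPlus
    import org.apache.spark.sql.SparkSession

    object SvdPlusPlusSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("svdpp-sketch").getOrCreate()
        val sc = spark.sparkContext

        // A small (user, item, rating) edge list is all that is read from storage.
        val edges = sc.textFile("hdfs:///svdpp/ratings") // hypothetical path
          .map { line =>
            val Array(src, dst, rating) = line.split(",") // assumed input format
            Edge(src.toLong, dst.toLong, rating.toDouble)
          }

        // The iterative message passing inside SVD++ shuffles far more data than
        // the input itself, which is why this workload stresses the network and
        // the local disks rather than the storage cluster.
        // Conf arguments: rank, maxIters, minVal, maxVal, gamma1, gamma2, gamma6, gamma7.
        val conf = new SVDPlusPlus.Conf(10, 10, 0.0, 5.0, 0.007, 0.007, 0.005, 0.015)
        val (model, globalMean) = SVDPlusPlus.run(edges, conf)

        println(s"Trained ${model.vertices.count()} latent vectors; global mean $globalMean")
        spark.stop()
      }
    }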

To properly test how well the system performed with a large number of workers, this workload read a relatively large input of 1.4GB. Unfortunately, the smallest number of workers that could handle this input size without crashing was four. As can be observed in Figure 33, the SSE even goes above 1.0 when the number of workers increases. Superlinear scaling is usually an indication that the smaller worker counts suffered from CPU thrashing, resulting in performance loss. (Note: when conducting this test, only 29 workers were available.)
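As a worked example of why values above 1.0 indicate superlinear scaling, the snippet below recomputes the SSE figures later shown in Table 18, assuming the usual strong-scaling relation SSE = (t_base * N_base) / (t_N * N) with the four-worker run as the baseline; this reproduces the values in the table. The object name SseSketch is purely illustrative.

    // Recomputing the SSE values in Table 18 from the measured run-times.
    object SseSketch {
      def sse(baseWorkers: Int, baseRunTime: Double, workers: Int, runTime: Double): Double =
        (baseRunTime * baseWorkers) / (runTime * workers)

      def main(args: Array[String]): Unit = {
        // Measured SVD++ run-times (seconds) from Table 18.
        val measured = Seq(4 -> 1731.0, 8 -> 723.0, 16 -> 225.0, 29 -> 142.0)
        measured.foreach { case (workers, t) =>
          println(f"$workers%2d workers: SSE = ${sse(4, 1731.0, workers, t)}%.2f")
        }
        // Prints roughly 1.00, 1.20, 1.92 and 1.68; values above 1.0 mean the
        // smaller configurations lost time (here to CPU thrashing) rather than
        // the larger ones scaling better than linearly in any real sense.
      }
    }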

Figure 32: Run-time for SVD++ (1.4GB input) with an increasing number of workers


Figure 33: Strong scaling efficiency (SSE) for the SVD++ workload (1.4GB input) with a worker count ranging from 4 to 29

Increasing from eight to 16 workers caused the most significant increase in SSE, from 1.2 to 1.9. Table 18 reveals why CPU thrashing was present. The row % of time with 100% used memory shows that the four-worker setup had all of its memory in use 82% of the time, compared to 65% for eight workers and even less for the remaining worker sets. The row Avg. Page faults/node shows that the number of page faults is nearly four times as high with four workers as with eight. Total Data Paged out of memory shows that the total amount of data paged out of memory is roughly the same for all worker sets; a consequence of this is that the Average Data Paged out of memory per node is larger for the smaller worker counts. Since such a significant portion of the working memory gets paged out for the smaller worker counts, Total Data Paged back to memory reveals that a large share of this data had to be paged back into main memory: in total 1638GB with four workers and 488GB with eight workers (on average 409GB and 55GB per node, respectively). Given the high page-fault rates and the large amount of data that has to be paged back into memory, the nodes of the four- and eight-worker Spark clusters enter a CPU-thrashing state.
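A minimal sketch of how per-node memory and paging samples could be aggregated into the kind of statistics later summarized in Table 18. The MemSample record and its units are assumptions made for illustration; the real numbers come from the Collectl logs, whose format is not reproduced here.

    // Aggregating per-node memory/paging samples into Table 18-style statistics.
    // MemSample is a hypothetical record, not the exact Collectl output.
    case class MemSample(memUsedPct: Double, // share of main memory in use at the sample
                         pageFaults: Long,   // page faults since the previous sample
                         pagedOutMB: Double, // data paged out since the previous sample
                         pagedInMB: Double)  // data paged back in since the previous sample

    object PagingStats {
      // Assumes one inner sequence of evenly spaced samples per node.
      def summarize(samplesPerNode: Seq[Seq[MemSample]]): Unit = {
        val all = samplesPerNode.flatten
        val pctTimeFullMem = 100.0 * all.count(_.memUsedPct >= 100.0) / all.size
        val avgFaultsPerNode = samplesPerNode.map(_.map(_.pageFaults).sum).sum / samplesPerNode.size
        val totalPagedOutGB = all.map(_.pagedOutMB).sum / 1024
        val totalPagedInGB = all.map(_.pagedInMB).sum / 1024

        println(s"Time with 100% used memory: ${math.round(pctTimeFullMem)}%")
        println(s"Avg. page faults per node:  $avgFaultsPerNode")
        println(s"Total data paged out:       ${math.round(totalPagedOutGB)}GB")
        println(s"Total data paged back in:   ${math.round(totalPagedInGB)}GB")
      }
    }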

The CPU thrashing becomes minimal for 16 and 29 workers, where the amount of data that needs to be fetched back to memory is only around 60MB. Because of this, the SSE can grow as the number of processing elements increases. The SSE does, however, decrease from 16 to 29 workers. An initial suspicion for the drop from 1.9 to 1.68 is that some bottleneck is reached for this workload when the number of workers grows too large.

Another contributing factor to the lower performance with fewer workers is the larger amount of data being shuffle-read (Total Shuffle Read Data). Shuffle read wait time and Shuffle read wait time % reveal that for the two smaller worker sets, the shuffle read wait time accounts for 29%-31% of the run-time, while for the larger worker sets it ranges from 14%-18.6%. The exact reason why the smaller worker sets required more remote reads than the rest could not be pinpointed and is something to explore in future work.

The rows Total idle time and Run-time without idle time reveal that the idle time causes a significant delay in the job completion time. When subtracting the total idle time from each run-time and recalculating the SSE, the row SSE without idle time reveals that the strong scaling efficiency stays roughly the same for 16 and 29 workers.

Workers                              4          8          16         29
Run-time                             1731s      723s       225s       142s
SSE                                  1          1.2        1.9        1.68
Total idle time                      112s       135s       60.14s     44.8s
Run-time without idle time           1619s      589s       165s       97.2s
SSE without idle time                1          1.37       2.45       2.30
Avg. CPU util.                       56%        44%        48.1%      45.1%
Avg. CPU wait                        2.4%       2%         1.4%       1.4%
Avg. Shuffle read TP.                196MB/s    139MB/s    281MB/s    262MB/s
Total Shuffle Read Data              1359GB     807GB      460GB      474GB
Shuffle read wait time               500s       223s       41s        18.5s
Shuffle read wait time %             29%        31%        18.6%      14%
Total Shuffle Write Data             460GB      460GB      460GB      460GB
Total network output data            1354GB     803GB      458GB      473GB
% of time with 100% used mem.        82%        65%        35%        5%
Avg. Page faults/node                167469K    48187K     24604K     13653K
Total Data Paged out of memory       488GB      447GB      470GB      454GB
Total Data Paged back to memory      1638GB     488GB      69MB       61MB
Average Data Paged out of memory     122GB      61GB       29GB       15GB
Average Data Paged back to memory    409GB      55GB       4MB        2MB

Table 18: Statistics for the entire SVD++ workload

6.4.2 Connected Component

This workload reads 19.3GB from storage and writes 0GB back to it. In total, the shuffle reads amount to 67.2GB and the shuffle writes to 52.1GB. Connected Component is a workload where the amount of shuffled data is relatively low compared to the other workloads used for the strong scaling tests. Instead of being shuffle-heavy, this workload requires a significant amount of memory to be allocated. If enough memory is available, it scales very well.
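For reference, the sketch below shows how a connected-components job can be expressed with Spark GraphX. It is not necessarily the exact SparkBench implementation used here; the edge-list path is a placeholder.

    import org.apache.spark.graphx.GraphLoader
    import org.apache.spark.sql.SparkSession

    object ConnectedComponentSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("cc-sketch").getOrCreate()
        val sc = spark.sparkContext

        // Load the edge list from the storage cluster and build the graph.
        // Holding the graph and GraphX's internal structures in memory is what
        // makes this workload memory-hungry rather than shuffle-heavy.
        val graph = GraphLoader.edgeListFile(sc, "hdfs:///cc/edges") // hypothetical path

        // Run connected components; each vertex gets labelled with the smallest
        // vertex id in its component.
        val components = graph.connectedComponents().vertices

        // Count the number of distinct components.
        val numComponents = components.map { case (_, componentId) => componentId }
          .distinct()
          .count()
        println(s"Found $numComponents connected components")

        spark.stop()
      }
    }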


As with SVD++, the four-worker configuration suffers from CPU thrashing; when moving to eight workers, the CPU thrashing is gone, which causes the superlinear scaling. After this increase in SSE at eight workers, the SSE starts to decrease with every increment of the worker count. The row % of Run-time spent idling also reveals that the share of the workload spent idling increases as the number of workers grows.

Since the characteristics of this workload are very similar to SVD++ in Section 6.4.1, this workload is not analyzed in the same detail.

Workers                        4        8        16       32
Run-time                       690s     277s     157s     106s
SSE                            1        1.24     1.09     0.81
% of Run-time spent idling     11%      16%      24%      40%
Avg. CPU util.                 64%      72%      65%      53%
% of time with full mem.       49%      0%       0%       0%
Avg. Page In/worker            8GB      201KB    51KB     47KB

Table 19: Statistics for the Connected Component workload. The "max" indicates the maximum observed CPU wait%.

Figure 34: Run-time for Connected Component with an increasing number of workers (87GB)


Figure 35: Strong scaling efficiency (SSE) for Connected Component with an increasing number of workers (87GB)


6.4.3 Terasort

The Terasort workload reads 193GB from storage and writes 201GB back to it. It generates 446GB of shuffle-read data and 297GB of shuffle-write data.

As with the weak scaling tests, the strong scaling tests of Terasort confirm that storage is a considerable bottleneck in this system. Figures 36-37 display the run-times and strong scaling efficiency for an increasing number of workers. For this workload, the strong scaling efficiency plummets as the number of workers grows.

Figure 36: Run-time for the Terasort workload (331GB) with an increasing number of workers

Figure 37: Strong scaling efficiency (SSE) for the Terasort workload (331GB) with an increasing number of workers


Figure 38: Run-times for the read, sort, partition, and write stages when running strong scaling tests of the Terasort workload with an input of 313GB and 4-29 workers

Figure 39: SSE for the read, sort, partition, and write stages when running strong scaling tests of the Terasort workload with an input of 313GB and 4-29 workers

Table 20 displays the performance statistics for the read stage. Average storage read throughput drops each time the number of workers increases.
