Performance Comparison Study of Clusters on Public Clouds

(1)

Performance Comparison Study of

Clusters on Public Clouds

Prestandajämförelse av cluster på offentliga molnleverantörer

Martin Wahlberg

Faculty of Health, Natural Science and Technology (HNT) Computer Science

30 hp Javid Taheri

(2)

(3)

Performance Comparison Study of Clusters

on Public Clouds

(4)

(5)

This report is submitted in partial fulfillment of the requirements for the Master’s degree in Computer Science. All material in this report which is not my own work has been identified and no mate-rial is included for which a degree has previously been conferred.

Martin Wahlberg

Approved, June 13 2019

Advisor: Javid Taheri

(6)

(7)

Abstract

As cloud computing has become the more popular choice to host clusters in recent times there are multiple providers that offer their services to the public such as Amazon web services, Google cloud platform and Microsoft Azure. The decision of cluster provider is not only a decision of provider, it is also an indirect decision of cluster infrastructure. The indirect choice of infrastructure makes it important to consider any potential differences in cluster performance caused by the infrastructure in combination with the workload type, but also the cost of the infrastructure on the available public cloud providers.

(8)

Acknowledgements

First and foremost, I would like to offer my sincerest gratitude to my supervisors Dr. Javid Taheri at Karlstad University and Robin Nilsson at CGI for providing me with guidance throughout the thesis.

I would like to acknowledge Johan Selberg and Andr´e Winberg at CGI for helping me and answering all my questions throughout the first part of the thesis.

To my friends Daniel Larsson and Simon Sundberg, thank you for all our small de-stressing talks during the entire spring.

I am most grateful to my mom Helen´e, my dad Stefan and my sister Johanna for all encouragement and support I have received.

(9)

List of Figures

3.1 Overview of Apache Hadoop [7] . . . 7

5.1 CPU intensive CPU usage AWS 4 cores 16 GB RAM . . . 40

5.2 CPU intensive CPU usage GCP 4 cores 16 GB RAM . . . 40

5.3 CPU intensive memory usage AWS 4 cores 16 GB RAM . . . 40

5.4 CPU intensive memory usage GCP 4 cores 16 GB RAM . . . 40

5.13 I/O intensive CPU usage AWS 4 cores 16 GB RAM . . . 44

5.14 I/O intensive CPU usage GCP 4 cores 16 GB RAM . . . 44

5.15 I/O intensive memory usage AWS 4 cores 16 GB RAM . . . 44

5.16 I/O intensive memory usage GCP 4 cores 16 GB RAM . . . 44

5.17 I/O intensive disk usage AWS 4 cores 16 GB RAM . . . 45

5.18 I/O intensive disk usage GCP 4 cores 16 GB RAM . . . 45

5.19 I/O intensive memory usage AWS 16 cores 64 GB RAM . . . 45

5.20 I/O intensive memory usage GCP 16 cores 64 GB RAM . . . 45

5.21 I/O intensive disk usage AWS 16 cores 64 GB RAM . . . 46

5.22 I/O intensive disk usage GCP 16 cores 64 GB RAM . . . 46

5.23 I/O intensive CPU usage AWS 4 cores 32 GB RAM . . . 46

5.24 I/O intensive CPU usage GCP 4 cores 32 GB RAM . . . 46

(12)

5.26 MapReduce CPU usage GCP 4 cores 32 GB RAM . . . 50

5.27 MapReduce memory usage AWS 4 cores 32 GB RAM . . . 50

5.28 MapReduce memory usage GCP 4 cores 32 GB RAM . . . 50

5.29 MapReduce disk usage AWS 4 cores 32 GB RAM . . . 51

5.30 MapReduce disk usage GCP 4 cores 32 GB RAM . . . 51

5.31 MapReduce CPU usage AWS 16 cores 64 GB RAM . . . 51

5.32 MapReduce CPU usage GCP 16 cores 64 GB RAM . . . 51

5.33 MapReduce memory usage AWS 16 cores 64 GB RAM . . . 51

5.34 MapReduce memory usage GCP 16 cores 64 GB RAM . . . 51

5.35 MapReduce disk usage AWS 16 cores 64 GB RAM . . . 52

(13)

List of Tables

4.1 Input to estimation of π . . . 23

4.2 Amount of input data to TeraSort . . . 24

4.3 Input to Word Count . . . 24

4.4 Potential instance types . . . 27

4.5 Instance types used in experiments . . . 28

4.6 Cost/month per instance type . . . 28

4.7 CPUs used in experiments . . . 28

4.8 Region and location . . . 29

4.9 Machine images . . . 29

4.10 Storage information . . . 30

4.11 SSD disk performance values . . . 30

4.12 Spark configuration for experiments . . . 38

5.1 Completion time of CPU intensive experiment . . . 40

5.2 Completion time of I/O intensive experiment . . . 44

5.3 Completion time of MapReduce experiment . . . 50

5.4 Recommended provider considering performance . . . 54

(14)

(15)

1 Introduction

The term Cloud computing is used regularly throughout our daily lives, especially as the usage of cloud resources is becoming more popular to use than owning the infrastructure ourselves. Since cloud computing involves a lot of different areas and use cases, it can sometimes be difficult to fully grasp the extent of the term. From the National Insti-tute of Standards and Technology [26], the following definition of cloud computing can be read: “Cloud computing is a model for enabling ubiquitous, convenient, on-demand net-work access to a shared pool of configurable computing resources (For example, netnet-works, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction.” The computing resources are commonly seen as one of the following service models: Software as a Service (SaaS), Platform as a Service (PaaS) or Infrastructure as a Service (IaaS).

SaaS means that applications can be directly accessed and used through different types of devices and the user has no control over the underlying infrastructure or how the ap-plication is managed with the exception of user configuration [10, 26]. When it comes to PaaS, the hardware and some additional software that enables the user to create and manage applications are provided, and the user does not have to manage the underlying infrastructure themselves [10, 26]. IaaS provides the fundamental computation resources such as servers, network and storage etc. While also ensuring that the delivered resources are reachable. The user is then responsible for all additional software such as operating system (OS) and applications [10, 26].

(16)

Azure.

Deciding on a cloud provider that should be used to host clusters on is not only a choice of provider, it is also a choice of infrastructure. The choice of infrastructure can be overwhelming because of all available options, but it is a crucial choice because of potential differences in cost and performance between infrastructure on the same or a different cloud provider. A key factor that has to be considered before deciding on a cloud provider and infrastructure, is whether there is a difference in cluster performance for the desired workload between the cloud providers in question and their available infrastructure. Perhaps one cloud provider has access to infrastructure that performs an I/O intensive workload faster and cheaper than any of the other cloud providers in question, which would make that cloud provider the logical choice for I/O intensive workloads. In order to simplify the process of choosing the most suitable cloud provider for a desired workload, it is important to have information about any potential cost and performance differences between the cloud providers in question and their available infrastructure.

1.1 Problem statement

To the authors knowledge it does not exist any extensive study that directly compares the cost and performance of multiple different cluster infrastructures across multiple public cloud providers. The closest work that has been found are [27] and [28]. In [27] an idea for a framework that compares the suitability of the different cloud providers, in relation to the performance needs of the user is discussed. A proof of concept experiment is performed in [27] that includes three different public cloud providers and one single type of virtual machine (VM) instance. [28] is an extension of [27] but as it was not open for the public and only the abstract was available, it is unclear whether or not it includes an extensive study of infrastructure performance between public clouds.

(17)

between a few different public cloud providers. The study began with the goal of including the public cloud providers AWS, GCP and Microsoft Azure in the comparison. Unfortu-nately, Microsoft Azure had to be removed from the study due to time constraints as well as non-native support from the software used to create and setup the clusters.

The working hypothesis of the study was that running an identical workload on different clusters that have access to similar hardware, should produce the same result, in close to similar time, while also utilizing the same amount of resources relative to what’s available, regardless of the provider hosting the cluster. To either verify or disprove the hypothesis, three different types of experiments were executed on clusters residing on the selected cloud providers. The experiment types were chosen from the most common workload types, which resulted in the following experiment types: CPU-intensive, I/O-intensive and MapReduce.

Throughout the study the following research question was pursued:

Q1 How to choose the most suitable cloud provider for a workload based on its charac-teristics.

In order to retrieve sufficient information for the research question to be answered at the end of the thesis, the following sub-questions had to be answered:

SQ1 How to reliably and systematically run experiments to compare cluster performance of cloud providers against each other.

SQ2 How to analyse the experiment results and retrieve performance metrics in minute detail from the selected cloud providers during workload execution.

(18)

(19)

2 Summary of technical work

(20)

(21)

3 Background

This Section presents information about the software, technologies and providers used in the study and which enabled the creation of clusters as well as experiment execution.

3.1 Hadoop

Hadoop is an open source distributed processing framework that manage big-data process-ing and storage in a clustered system [8]. From Figure 3.1 it can be seen that Hadoop consist of two major parts, the Hadoop distributed file system (HDFS), that is used for distributed storage, and also MapReduce which is a framework used for distributed com-putation. Hadoop also consist of two smaller but yet important modules, the Yet Another Resource Negotiator (YARN), which is a job scheduler and resource manager inside the cluster, and also the Common Utilities that consist of Java libraries and utilities required by the rest of Hadoop.

(22)

3.1.1 MapReduce

MapReduce is a framework that is used to process large amount of data, in parallel, inside a clustered system. The MapReduce algorithm mainly consist of two different tasks, Map and Reduce [24, 25]. The Map task takes a data set and breaks it down into a set of tuples, each consisting of a Key-Value pair, essentially creating another set of data from the first one [24, 25]. The Reduce task then takes the output from the Map task and reduces the retrieved set of tuples into a smaller set of tuples, which can then be returned and stored inside the HDFS [24, 25].

According to [24], there are two different types of data processing primitives called mappers and reducers inside a cluster running the MapReduce framework. From [24] it can also be read that, “Decomposing a data processing application into mappers and reducers is sometimes nontrivial. But, once we write an application in the MapReduce form, scaling the application to run over hundreds, thousands, or even tens of thousands of machines in a cluster is merely a configuration change.”, which means that not only can the MapReduce framework handle large quantities of data, it is also highly scalable which makes it an excellent choice for clustered systems.

A core principle of the MapReduce framework is to take the processing to the data, instead of the data to the processing. That is, instead of sending data to a node that have been assigned one of the roles mapper or reducer, the nodes that have local access to the data are assigned a mapper or reducer tasks which minimizes the network traffic between the nodes, thus also allowing the MapReduce framework to instead manage task assignment, task verification and all necessary copying of data between cluster nodes[24]. 3.1.2 HDFS and YARN

(23)

the stored data is done by a Java process called the DataNode [11, 29].

The NameNode is responsible for storing the directory tree of all files in the system, tracking where data is stored in the cluster, receiving heartbeats from the DataNodes as well as replying to client application locate, add, copy, move or delete file requests [29]. The reply to file system operation requests consist of a list of DataNodes that have access to the requested files. The NameNode is the centre piece of HDFS and since it is the only process that contains the metadata for all files, it becomes a single point of failure as well as a possible bottleneck for the entire file system. It is possible to have redundancy in form of a secondary NameNode process that in case of a failure of the first NameNode take its place and continues the serving of client applications and storing of metadata etc.

When the NameNode receives a file system operation request from a client application, it puts a global lock on the metadata of the entire file system in order to perform the operation atomically, this results in a single-writer multiple reader concurrency [19, 32]. There are two other important components of the HDFS system, the journalnode and the zookeeper. The journalnode is responsible for storing any changes made to the metadata stored in the NameNode, as well as keeping the secondary NameNode updated on any metadata changes in case of a failure of the primary NameNode [19, 32]. The responsi-bilities of the zookeeper are coordinating the failover from the primary NameNode to the secondary NameNode as well as having both nodes agree on their respective role (primary or secondary) [19, 32]

The DataNode is responsible for storing and processing of the data. Optimally the data is replicated across the cluster on three different DataNodes to create redundancy. Two of the three DataNodes reside in the same server rack and the third in a different rack as this provides protection against DataNode as well as server rack failure. When processing data the DataNode talks directly with the client application to provide files from the file system operation requests as well as results from MapReduce operations [11].

(24)

manager (NM) for each existing node in the cluster. The RM is responsible for distributing the available resources among all applications and consists of two parts, the scheduler and the application manager. The schedulers’ task is to allocate resources to all applications that are currently running [36], while the application manager accepts job submissions, negotiate container resources that are used for the AM as well as restarting the container if it would fail [36]. Each individual node has a NM that monitors the behaviour and resource utilization of the node and also reports the information to the RM. Each individual application has its own AM that negotiate requested resources from the RM as well as working together with the NM in order to execute and monitor job tasks [36].

Depending on the cluster setup and hardware there might be a need for additional YARN configuration. The YARN configuration file can be found in the HDFS installation directory and is named yarn-site.xml. There are a many different settings that can be configured in the yarn-site.xml file, but parameters that are related to container instances can be especially important to configure before running any YARN applications. One of the important container settings is the yarn.scheduler.maximum-allocation-mb parameter, which regulates the maximum amount of memory in MB that can be allocated by the RM for a single container. Another important setting is yarn.nodemanager.resource.memory -mb that specifies the maximum amount of physical memory that can be allocated for containers on the entire node. Lastly, the yarn.nodemanager.resource.cpu-vcores pa-rameter sets the maximum number of virtual cores that can be allocated to a container on the node.

(25)

3.2 HopsFS

Hops is a new distribution of the Apache Hadoop framework and HopsFS is a replacement of the old HDFS with a set of new features. HopsFS replaces the single node in-memory metadata service of the old Hadoop NameNode, with a distributed metadata service built on a network database (NDB) with the NewSQL language [19, 32]. HopsFS also use MySQL clusters that consist of at least one management node that monitor and configure the cluster, as well as multiple NDB DataNodes that act as a storage engine for regular data but also act as storage for all database tables that can be both accessed and processed [19, 32]. The existing NDB DataNodes are placed in something called node groups. The size of the node group depends on the replication factor for each partition. For example, If there are six NDB DataNodes and the replication factor is two, then three different sub clusters are created, each with two NDB DataNodes that have a copy of the other NDB DataNodes partition.

As mentioned in Section 3.1.2, having only a single NameNode in HDFS creates both a single point of failure as well as a potential bottleneck. HopsFS solution to this problem is to have multiple NameNodes, where one of them is elected as the leader by a special election algorithm [30]. The DataNodes are connected to all existing NameNodes, but they only send their block reports to the leader NameNode which then distributes the received block reports over the currently alive NameNodes for load balancing [19, 31, 32].

Using the solution in Hops cause the metadata to no longer be stored directly in the NameNode(s) as it was in HDFS. Instead the NameNodes have a data access layer driver that is used to process the data directly from the NDB DataNodes while also enabling Hops to use a variety of NewSQL databases [19, 31, 32].

(26)

decoupled. But if the size of the file is smaller than the threshold the metadata and the actual data are stored together [32]. By doing this, the performance in terms of throughput and latency when processing smaller files is increased since the data can be retrieved from the NDB DataNodes instead of the file system [32].

3.3 Spark

Spark is a scalable data processing platform that is used for workloads ranging from stream-ing jobs to batch processstream-ing applications. Spark is also designed to be considerably faster than the MapReduce model used in HDFS [22]. Since Spark has such a variety in the workloads it can execute, it is suitable for both analysis tasks as well as data processing tasks [22]. It is possible to combine Spark with an existing Hadoop cluster since Spark is able to read data from an external source such as HDFS or also run on top of an existing cluster manager such as YARN [22].

The main programming abstraction in Spark is called Resilient distributed datasets (RDDs). An RDD represents a collection of data that is distributed across many different nodes and that can be both accessed and manipulated in parallel [22]. When a spark program is executed it is first submitted into a process called the driver. The driver creates RDDs from the submitted program and delegates operations to other processes called executors. The executors then execute the assigned operation and then either return the result to the driver or saves it themselves [22]. The executors can reside on different nodes or they can all reside on the same node, and it is the responsibility of the driver to manage all available executors for the spark program.

(27)

When submitting a program for execution the following steps are performed [22]: 1. The driver process is launched and calls the main method of the submitted program 2. The driver negotiates resources for all desired executors

3. The cluster manager launches all executors that can be afforded with the negotiated resources

4. The driver runs the submitted program and delegates work (tasks) between all avail-able executors

5. The executors perform their assigned tasks and return/save the result

6. Once all tasks are finished the driver terminates all executors and release their re-sources back to the resource manager

3.4 Flink

(28)

Flink is based on the Kappa architecture which only utilize a single processor that treats all input as a stream, the streaming engine process all data in real-time and treats batch data as a special case of streaming [5]. The Flink execution architecture consist of, a Client, a JobManager as well as TaskManagers, and the workflow of a Flink job execution looks the following way [4]:

1. The Client provides the program and constructs a dataflow graph of the job, which is sent to the JobManager.

2. The JobManager receives the dataflow graph and constructs an execution graph, which is then used to delegate jobs to the TaskManagers.

3. The TaskManagers receive their jobs, executes all of them in parallel and report their respective status back to the JobManager.

Flink can be deployed on multiple different resource providers such as YARN, Mesos, Kubernetes and also baremetal clusters, but it is also possible to integrate Flink with for example, Hadoop, Spark and MapReduce which makes it a flexible framework to work with [4, 6].

3.5 Hopsworks

Hopsworks is a platform that provides access to a range of different open-source frameworks for development, machine learning, data analytics, stream processing, model serving and metric visualization. Some of the frameworks that are accessible through Hopsworks are: Spark, Kafka, Grafana, Jypiter, Kubernets and Tensorflow [9, 15, 17, 20, 21, 22, 23, 34].

(29)

which use code to process any existing data. The quotas are limitations of CPU, GPU and storage usage that is linked to each project. If more resources are necessary it is possible to change the quotas by contacting an administrator [20].

The workloads are created inside the projects and it is done by for example creating a new Job. A Job creates and runs a YARN application such as Spark or Flink. It is also possible to create other workloads such as Tensorflow, but that requires the user to access the Tensorflow module and create the workload form there, while a regular YARN Job have its own module [20]. When the type of YARN application has been decided there are additional configuration to do before the job can be started. The configuration depends on the chosen application type. For example, If Spark is selected as the type then the following have to be configured: The path to the JAR file containing the code for the spark job, the main class name, arguments for the spark job, driver memory, driver cores, number of executors, executor cores and executor memory.

Since Hopsworks run Spark on top of YARN, it is possible that settings regarding YARN have to be changed as well. Changing the default values mentioned in Section 3.1.2 is especially important if all available cluster resources are supposed to be used for a job. If the default values are not changed, only 12 GB of memory and 8 cores can be used for an entire job which can severely limit the cluster performance during the job execution. The default values can be changed by following the steps described in Section 3.1.2.

(30)

3.6 Karamel

Karamel is a software that is used to create VMs on AWS, GCP and Openstack Nova, as well as installing distributed applications on baremetal, cloud or multi cloud clusters [16]. Karamel is built upon the Chef framework and use an abstraction called cookbooks. The cookbooks are hosted on git and contains recipes for distributed applications [16]. A cookbook is a repository that contain multiple installation files called recipes. The recipes can be executed to install the distributed application. To use a cookbook through Karamel it first have to be Karamelized. Karamelize mean that there exists a specific file in the cookbook that tells the Karamel client the order that the recipes should be executed in. More specifically, Karamel creates a Directed acyclic graph (DAG) for all defined recipes. The DAG is used by Karamel to extract the order of recipe execution [16]. Creating a cluster with the help of Karamel requires at least one reference to a valid cookbook that includes the desired recipes. The reference to the cookbook and all recipes are placed in a cluster definition file that is submitted to the Karamel client. 3.6.1 Cluster definition file

The cluster definition file consists of different sections where configuration for the cluster can be specified. The cluster definition file is required to be in YAML format in order to be successfully submitted. Appendix A presents an example of a cluster definition file that installs the entire Hopsworks platform on a cluster consisting of 2 nodes. Breaking down the cluster definition file from Appendix A it can be extracted that the different sections in a general cluster definition file are the following:

1. General settings:

• Name of the cluster.

(31)

• Type: The number of cores and amount of RAM the VM will have. The type names are predefined on AWS and GCP.

• Zone/Region: Where your VMs will be located. Predefined names on AWS and GCP.

• Image/Ami: The virtual machine image used in the VM. Predefined on AWS and GCP or it can be custom made.

• Disksize: The amount of storage in GB that all VMs will have access to. Only an option for GCP at the time of the study.

2. The attribute Section: Consist of a variety of attributes such as installation directory, interface for the Kagent service and credentials to Hopsworks etc.

3. Cookbooks: The name, github url and branch/version of the cookbooks that contain the required recipes.

4. Groups: A collection of VMs that each run the same recipes during the installation. The name, size and recipes for each group have to be specified. It is also possible to override settings from General settings here.

(32)

cluster. If the installation node is not able to connect to all other nodes the installation of the Hopsworks platform will not be successful.

A major advantage of the cluster definition file is that it simplifies the process of changing hardware, image and the size of the cluster etc. For example, changing the number of DataNodes from 1 to 5 in the example from Appendix A, only requires a change of the size variable in the DataNode group. It is also possible to add additional nodes to an already existing cluster via Karamel which provides simple cluster scaling.

3.6.2 API and Web service

Karamel provides the possibility to use both a Java API and a web service to create clusters and install distributed applications. To use the Java API or the web service, the following have to be accessible: Credentials to the platform where the cluster is hosted, a cluster definition file and the path to an existing SSH-key pair that is used to connect to the cluster (optional).

Using the Java API properly requires certain procedures to be done in certain steps. First the credentials have to be validated in order to get access to the desired platform. If the credentials are valid and the path to existing SSH-keys is provided, then those keys will later be used to connect to the cluster. If a path to an existing SSH-key pair is not provided, the API can generate new SSH-keys and use them instead. When the credentials and SSH-keys are set, the cluster definition file have to be converted from YAML to JSON format before submitting it for validation through the API. Once the conversion is done and the cluster definition file is submitted and accepted the cluster creation and application installation can begin by calling a startInstallation method from the API.

If the web service is used instead of the Java API, it can be accessed through

(33)

cluster definition file has been accepted by the Karamel service, the credentials to the cloud providers specified in the cluster definition file have to be submitted and validated. If the provider credentials pass the validation, a path to a pair of existing SSH-keys can be submitted or a new pair can be generated. Once the credentials and SSH-keys have been validated the installation can be started.

Both the API and web UI provide information about the ongoing installation and if any errors occur. The API provides installation information through a terminal and the web service through the Karamel web service UI.

3.7 Cloud providers

The cloud providers used in the study was AWS and GCP, although originally Microsoft Azure was supposed to be a part of the study as well. Due to the reason that Karamel did not have native support for Microsoft Azure as well as time constraints, it was decided that Microsoft Azure was to be removed from the study.

(34)

(35)

4 Experiments

In order to gather performance metrics from the clusters it was necessary to run a number of different experiments. Neither the experiment type nor the actual experiment workloads were known from the beginning. Instead, both the experiment type and workloads were decided upon once it was time to begin the experiment phase. The reason for not deciding on the experiment type and workloads in advance, was due to lack of experience working with the Hopsworks platform. That is, not having sufficient knowledge about what would be both feasible and relevant to work with through Hopsworks. A known requirement was that the experiments should if possible, be limited to a duration time between 10 to 20 minutes. The reason for placing a limit on experiment duration was due to time constraints in combination with the number of experiments that had to be performed. That is, in order to complete all experiments in a reasonable time, the input to the experiments had to be configured is such a way that it allowed a single experiment to complete in a maximum of 20 minutes. The rest of the Section introduce the experiment type, experiment workloads as well as their corresponding configuration that were used to gather performance metrics for the comparison study.

4.1 Experiment types

(36)

find already existing Spark jobs than Flink jobs, thus it was decided that the experiment type would be Spark. The advantage of using already existing Spark jobs was that the jobs had already been tested and the risk that it would not work was minimal. If the job for some reason would fail and there was something wrong with the actual code, a new job could be found and used as a replacement.

The performance aspects that was evaluated in the study are, CPU utilization, memory utilization and I/O usage. Considering the chosen performance aspects, it was desired to find three different Spark jobs that were either CPU intensive, memory intensive or I/O intensive in order to be able to truly test the chosen aspects. Unfortunately, a Spark job that was purely memory intensive could not be found, instead it was replaced with a classic MapReduce job as it is a common workload type for clusters. The individual Spark jobs that were found are further explained in the subsequent subsections.

4.1.1 CPU intensive

The CPU intensive Spark job is a job that estimates a value of π. This value is estimated by simulating dart throwing at a circle that is enclosed by an unit square. The area of a circle can be calculated by the formula πr2_{, which in this case results in the area being}

(37)

Table 4.1: Input to estimation of π Number of data points Type 1 500 000

Type 2 500 000 Type 3 625 000 Type 4 625 000 Type 5 800 000

The program that was used to run the job was a Scala version and it can be found in [33]. The JAR file that contained the main class of the program was already present in one of the example projects on the Hopsworks platform and was downloaded from it for further use in the other experiments. The path to the main class file that was used when creating the Spark job through the Hopsworks platform is: org.apache.spark.examples.SparkPi. 4.1.2 I/O intensive

The I/O intensive Spark job was a Spark version of the TeraSort algorithm. The program in its entirety consist of three smaller programs which are: the data generation program called TeraGen, the actual data sorting program called TeraSort and lastly a validation program for the sorted data called TeraValidate [35]. Since it was only desired to see the cluster performance during the actual sorting of the data, the only program that was of interest was the TeraSort program. The TeraGen program is necessary in order to generate the data that will be sorted, but the performance of the program is of no interest from the perspective of the performance study, thus no metrics was collected for that program. The TeraValidate is only a post sorting completion check that validates that the data is actually sorted by the TeraSort program, no metrics was collected for this program as well. The amount of data that was used as input to the TeraSort program can be seen in Table 4.2.

(38)

Table 4.2: Amount of input data to TeraSort Amount of data Type 1 25 GB Type 2 25 GB Type 3 30 GB Type 4 35 GB Type 5 60 GB

the downloaded code with the following command: mvn install. This will create a JAR file that can be uploaded to the Hopsworks platform and used to run the three Spark jobs. The path to the main class files that should be used when creating the Spark jobs on the Hopsworks platform are: com.github.ehiggs.spark.terasort.TeraGen, com.github. ehiggs.spark.terasort.TeraSort and com.github.ehiggs.spark.terasort.Tera Validate.

4.1.3 MapReduce

The MapReduce Spark job was a classic word count program. The word count program simply calculates the occurrences of each unique word inside a given file. This is done by producing key-value pairs for each word in one or more mappers, and then sending the key-value pairs to a set of reducers that process the key-value pairs into a smaller set of key-value pairs. This is done as many times as it takes until the whole file have been processed. The size of the input file sent to the word count program can be seen in Table 4.3.

(39)

The used program was a Java version of the word count job and it can be found at [33]. Since the word count program comes from the same source as the CPU intensive job, the JAR file is also the same and it was already present on the Hopsworks platform. The path to the main class is: org.apache.spark.examples.JavaWordCount.

The input to the word count program was .txt files that were created by the use of four scripts. The scripts used for creating the 21.6 GB and 41.3 GB file can be found in Appendix C respective D, and the script used for creating the 46.6 GB and 74.5 GB file can be seen in Appendix E respective F.

The scripts for the 21.6 GB and 41.3 GB text files do not create exactly 21.6 GB or 41.3 GB, instead they create 32 respective 64 GB of text. To get 21.6 GB and 41.3 GB the script was simply stopped before it could complete since the necessary file size had been reached. The scripts work in the following way, first the line This is just a sample line appended to create a big file. is written to a text file, then the file is simply appended to itself x amount of times where x is the last number in (a,b,x).

(40)

4.2 Experiment setup

The experiment setup consisted of creating the clusters as well as preparing all nodes with any required external data, scripts or other files. Default settings for installed services or other general settings present on the nodes were changed if necessary. All required setup is presented in this subsection and the problems that occurred during the phases are described in Section 6.

4.2.1 Instance types

In order to be able to compare the cluster performance between the two chosen cloud providers the hardware and configuration of the clusters had to be as similar as possible. It was decided that predefined instance types should be used if possible when creating the clusters. Since Karamel was used to create the clusters and the cluster definition file had to include a pre-defined instance type when submitting it to the API or web UI, the choice to use pre-defined instance types when possible was simple.

The primary information that was of interest regarding the instance types was the number of cores and the amount of memory that would be available on the VM. The available cores and memory are of importance since it was important that the instance types were as similar as possible across both the chosen providers. There was additional information available on AWS such as network performance, but since this was not a performance aspect that was tested in the study the information was disregarded. The instance types that were considered for potential usage from both AWS and GCP can be viewed in Table 4.4

(41)

Table 4.4: Potential instance types

AWS Cores RAM GCP Cores RAM Type 1 m4.xlarge 4 16 GB n1-standard-4 4 15 GB Type 2 r5a.xlarge 4 32 GB n1-highmem-4 4 26 GB Type 3 m4.2xlarge 8 32 GB n1-standard-8 8 30 GB Type 4 r5a.2xlarge 8 64 GB n1-highmem-8 8 52 GB Type 5 m4.4xlarge 16 64 GB n1-standard-16 16 60 GB

The types in Table 4.4 have a core-memory ratio of either 4 or 8 on AWS, and 3.75 or 6.5 on GCP. Before deciding on the exact types to use from the selected cloud providers, it was desired to use instance types with different core-memory ratio, but that also had a similar corresponding instance type on the other cloud provider. Finding such instance types turned out to be harder than what was initially expected, and the reason for this was that the range of instance types offered by the two providers did not correspond particularly well, thus the instance types in Table 4.4 were chosen because they had a good counterpart on respective provider.

After some clusters had been created on both AWS and GCP it was discovered that the clusters on GCP created through Karamel had HDD disks instead of SSD disks as storage type. The fact that clusters on GCP were created with HDD disks was not ideal since the clusters on AWS were created with SSD disks, which meant that an eventual performance comparison would be of no value since the SSD disks are known to perform better than HDD disks. In order to make a fair performance comparison between the two cloud providers it was decided that a baremetal installation of the Hopsworks platform would be done on GCP instead of a regular installation through Karamel.

(42)

as their cost per month can be seen in Table 4.5 and 4.6.

Table 4.5: Instance types used in experiments

AWS Cores RAM GCP Cores RAM Type 1 m4.xlarge 4 16 GB Custom 4 16 GB Type 2 r5a.xlarge 4 32 GB Custom 4 32 GB Type 3 m4.2xlarge 8 32 GB Custom 8 32 GB Type 4 r5a.2xlarge 8 64 GB Custom 8 64 GB Type 5 m4.4xlarge 16 64 GB Custom 16 64 GB

Table 4.6: Cost/month per instance type AWS GCP Type 1 269.21 $ 296.27 $ Type 2 292.63 $ 350.82 $ Type 3 431.71 $ 405.54 $ Type 4 478.56 $ 514.63 $ Type 5 756.72 $ 624.07 $

The CPUs that were used in the clusters are presented in Table 4.7. Worth noting is that the CPUs used in clusters hosted on AWS are 10 - 20 percent faster than the CPUs used in clusters on GCP, only considering their respective base frequency. Having faster CPUs available can have an impact of the performance of a cluster and have to be taken into consideration when analysing the results. The information about the CPUs was taken from [2] and [13].

Table 4.7: CPUs used in experiments

(43)

AWS web portal at the date these experiments was performed, information regarding the attached CPU was displayed before moving to the next step. Looking at this information when selecting type 2 and 4 from Table 4.7 a 2.2 GHz CPU is displayed as the processor that would be used. In [13] it can be read that the processor used for clusters on GCP have a base frequency of 2.0 GHz, and the only other available information that could be found is about the maximum turbo speed for All core turbo and Single core turbo, which is 2.7 GHz and 3.5 GHz respectively.

The region that was used to host the clusters are presented in Table 4.8 and the machine images used to create the cluster nodes can be found in Table 4.9. The machine image used on AWS is a bit special because it was only used as the base for a custom-made machine image that was created before setting up any clusters. The reason a custom machine image was used is because the amount of storage could not be specified in the cluster definition file when creating a cluster on AWS as mentioned in Section 3.6. Therefore a custom machine image was created with the AWS machine image from Table 4.9 with an inclusion of 1 TB of storage. The custom image was then used in the cluster definition file sent to Karamel.

Table 4.8: Region and location AWS GCP Region eu-west-1 europe-north1-a Location Ireland Finland

Table 4.9: Machine images

AWS GCP

Image ami-08660f1c6fb6b01e7 ubuntu-1604-xenial-v20190325 OS Ubuntu server 16.04-LTS Ubuntu server 16.04-LTS

(44)

for clusters on GCP which enabled SSD disks to be used on both cloud providers. The storage type and size that was used in both the NameNode and the DataNode for all cluster types in the performance study can be found in Table 4.10.

Table 4.10: Storage information

AWS GCP

Storage General purpose Standard persistent Disk type SSD SSD

Size 1 TB 1 TB

The overall I/O performance of the cluster for a job is related to the capabilities of the disks that are used. Benchmarking values for the disk throughput (TP) for both the General purpose SSD disk on AWS and the Standard persistent SSD disk on GCP can be found in Table 4.11. The values in the Table was retrieved from [1] and [14]. From [14] the following can be read, “Unlike standard persistent disks, the IOPS performance of SSD persistent disks depends on the number of vCPUs in the instance. SSD persistent disk performance scales linearly until it reaches either the limits of the volume or the limits of each Compute Engine instance.”, and from [1] it can be read that, “GP2 is designed to offer single-digit millisecond latencies, deliver a consistent baseline performance of 3 IOPS/GB (minimum 100 IOPS) to a maximum of 16,000 IOPS”. This is information that will be essential when analysing the results of the experiments and making conclusions from the collected data, especially since the SSD disks on the two providers do not scale in the same way.

Table 4.11: SSD disk performance values

(45)

4.2.2 Creation of clusters

As a part of the performance study, a Java program was developed that simplified the cre-ation of clusters as well as the Hopsworks platform installcre-ation on the two chosen providers. The program has multiple smaller modules, one for each provider, one for the cluster defi-nition file creation, one for calling the Karamel API and one module that asks the user for input. There is also a module that contains different templates that are loaded depend-ing on the user input. The program works in the followdepend-ing way: First, the program asks the user for the cloud provider that the cluster should be created on, then it asks for the number of nodes that should be present in the cluster, after that it asks for the framework it should install (currently it only supports Hopsworks), the program then asks for the name of the cluster and lastly it asks if the cluster should be balanced, CPU-focused or RAM-focused.

Once all input mentioned above have been provided by the user, a cluster definition file is automatically created by the program and submitted to the Karamel API together with the correct cloud provider credentials as well as a SSH key-pair. Preferably the program should ask for the cloud provider credentials and SSH key-pair as well, but due to time constraints this feature was put on hold. The Karamel API then handles authentication towards the chosen cloud provider, enures that the submitted credentials are authorized and then creates the cluster nodes as well as starting the installation of the Hopsworks platform once the nodes are created.

(46)

and it is done this way since it was difficult to predict which recipes should be installed on which node as well as if there were any dependencies between the software. Implementing the software recipes to be added one by one simplified any changes to the installation node for each recipe.

The program was used to create clusters mainly on GCP as there turned out to be a few issues creating clusters on AWS through the Karamel API, which are discussed in more detail in Section 6. In order to create clusters on AWS, the Karamel Web UI was used instead. The web UI was also used to install Hopsworks on baremetal clusters residing on both AWS and GCP.

Installing Hopsworks on a cluster hosted on AWS required that the VMs were created inside a Virtual Private Cloud (VPC) that only allowed incoming traffic to a node if it originated from another node in the VPC. The reason a VPC had to be used was that the installation failed at the installation of the Spark job history server if the cluster could receive traffic from any outside source.

It was decided that the NameNode would always be created with instance type 1 from Table 4.5, while the instance type of the DataNode should vary. This decision was made because the instance type of the NameNode does not affect the workload performance of the cluster as the entire work is performed by the DataNode. For instance type 2 and 4 in Table 4.5 it was not possible to have instance type 1 on the NameNode when setting up a cluster on AWS. One of the reasons instance type 2 and 4 were incompatible with instance type 1 was that neither the Karamel API nor web UI recognized the instance types and were unable to create the nodes.

(47)

4.2.3 General cluster preparation

When the installation of the Hopsworks platform is completed the platform can be reached by connecting to port 8080 on the NameNode. In order to setup a connection to the Na-meNode, a SSH tunnel can be created with the command: ssh -i <path to private ssh key> <username>@<ip of NameNode> -fNL 8080:localhost:<port on local machine>. Once the command completes, the Hopsworks UI can be reached by opening a browser of choice and connecting to localhost:<port you specified>/hopsworks. From there one can login to the Hopsworks platform with the admin credentials which are Username: admin@kth.se and Password: admin.

The easiest way to perform operations such as creating projects and datasets or execut-ing spark jobs is through the browser and the Hopsworks UI. It is possible to do it through the terminal but it takes practice and is not as simple and intuitive as the UI. Before any jobs can be executed a project has to be created, this is done by simply pressing the New project button and choosing the different workload types that should be available in the project. Once the project creation is completed, the desired job can be created, configured and executed. Data that is supposed to be a part of any workload can be placed in its own dataset and then referenced in the job configuration as an argument.

4.2.4 DataNode preparation

(48)

14 GB, while yarn.nodemanager.resource.cpu-vcores would be set to 4.

In order to monitor the memory utilization a script was created that works in the following way. From the /proc/meminfo file the field MemTotal and MemAvailable are extracted. The names of the field are rather self-explanatory, but the value of the MemTotal field is the total amount of memory that is available on the entire node, and the value of the MemAvailable field is the currently available memory on the entire node. That is, the memory that can be used by processes. A percentage value for the memory usage of an entire node can be calculated with equation 4.1, that consist of two steps:

1. Calculate the fraction that the currently available memory constitutes in relation to the total amount of memory on the node.

2. Calculate the fraction that the currently used memory constitutes in relation to the total amount of memory on the node

The second step of equation 4.1 is simply 1 minus the result of the first step, and can be interpreted as the memory usage of the node at a specific moment.

Equation 4.1 is used in the memory utilization script to calculate a value of the memory usage twice every minute, the value is then written with to a .txt file with four decimal precision. The script has to be manually closed when the experiment is finished since it is built as an infinite do-while loop. In order to be able to run the script the bc language have to be supported on the node. Bc can be installed by running the command: sudo apt-get install bc. The script can be found in Appendix G.

CurrentlyU sedM emory

T otalM emoryOnN ode = 1 −

CurrentlyAvailableM emory

T otalM emoryOnN ode (4.1)

4.3 Running the experiments

(49)

4.3.1 Spark-job creation

When a project has been created the option to create a YARN job becomes available. Creating a new YARN job can be accomplished by going to the Jobs option followed by pressing the New job button. Once the New job button has been pressed a new view will be presented where general information about the YARN job such as name, type and JAR file can be entered. The name is simply the name of the job and will be used to identify the job in a list with all other available jobs. The job type can be either Spark or Flink but the experiments that were performed in the study were Spark jobs, as mentioned in Section 4.1. Once the type has been chosen, additional settings for the job becomes available. In order for the YARN job to know what to execute, the actual code to be executed must be provided in JAR format. The JAR should contain all code that is supposed to be executed as well as any additional dependencies that are required by the program. If the code is incomplete or if any dependencies are missing from the JAR the YARN job will fail to execute.

(50)

4.3.2 Spark-job configuration

When all general information such as name, job type and all information related to the JAR file has been entered, the option to provide additional configuration to the YARN job becomes available. The available relevant configuration consists of the following: Driver memory, driver cores, number of executors, executor cores and executor memory. There is additional configuration available but those settings were not changed, enabled or disabled for any experiment.

In order to utilize all available cores as much as possible, the number of executors for each experiment were equal to the number of cores available on the DataNode. That is, if there were four cores available on the DataNode, there were four executors, if there were 16 cores available, then 16 executors were created. The driver was assigned one core, which means that it had to share processing time with the executors. The driver memory was set to 2048 MB or 4096 MB and these values were discovered by trying a set of different values for the different types of experiments while simultaneously monitoring the job to see if it completed or not. There were some issues with the driver running out of memory for some experiments when the assigned memory was too low but with the values mentioned above the experiments managed to complete.

The amount of memory that was allocated for the executors was calculated in two steps. The first step was approximating the memory per executor by the use of equation 4.2.

executorM emory = availableM emory − driverM emory

(51)

with different values for the executor memory to see if the job started correctly and most important, if all specified executors were created.

The reason trial and error were chosen as the method to discover the highest value for the executor memory parameter was because of how executors are created in a Spark job. As mentioned in Section 3.3, when a spark job is created, resources are negotiated by the driver process in order to create a specified number of executors. The resources for the driver container as well as the executor containers are taken from a pool of resources, and the size of this pool is defined in the yarn-site.xml file. The driver and executors are created one by one until no more executors can be created with the negotiated resources. If no executors are created then the Spark job cannot execute since there are no processes that can perform work. But if at least one executor is created the Spark job can successfully be started and completed.

(52)

Table 4.12: Spark configuration for experiments

(53)

5 Result and analysis

This Section will present the metrics used in the study, the result from monitoring the cluster performance and an analysis of the collected data. The analysis also highlights any unexpected results and also provides an explanation to them if possible. The graphs used in this section only constitutes a small part of all collected graphs, but were chosen in order to present the most important results from the experiments. The rest of the graphs are summarized in the subsequent sections but can be provided upon request.

5.1 Metrics

The metrics that were collected for the performance comparison study was CPU utilization, memory utilization and disk usage. The CPU utilization metric was collected in graph form from the CloudWatch software on AWS, and from the Stackdriver software on GCP. It would have been preferable to get raw data instead of only a graph, but unfortunately this seemed to be impossible from both the used software. The only metric that was collected in the form of raw data was the memory utilization with the help of a script that can be found in Appendix G and described in Section 4.2.4. The raw data gathered from the script in Appendix G was later used to create graphs for visualization. The disk usage data was collected from the Hopsworks platform with the help of Grafana that provided different performance metrics for each individual Spark job. The collected disk usage metrics consist of the sub-metrics reads/s, writes/s, bytes read/s and bytes written/s.

5.2 CPU intensive

(54)

Table 5.1: Completion time of CPU intensive experiment AWS GCP

Type 1 19 min 25 min Type 2 23 min 21 min Type 3 14 min 16 min Type 4 15 min 16 min Type 5 12 min 14 min

Figure 5.1: CPU intensive CPU usage AWS 4 cores 16 GB RAM

Figure 5.2: CPU intensive CPU us-age GCP 4 cores 16 GB RAM

Figure 5.3: CPU intensive memory usage AWS 4 cores 16 GB RAM

(55)

Figure 5.5: CPU intensive CPU usage AWS 4 cores 32 GB RAM

Figure 5.6: CPU intensive CPU usage GCP 4 cores 32 GB RAM

(56)

Figure 5.9: CPU intensive CPU us-age AWS 16 cores 64 GB RAM

Figure 5.10: CPU intensive CPU us-age GCP 16 cores 64 GB RAM

(57)

the difference is around 10 percentage units. Figure 5.9 and 5.10 present the CPU usage for clusters using instance type 5, and from the graphs it can be seen that AWS has a significantly lower CPU utilization than GCP.

The CPU intensive results for clusters with type 2 was highly unexpected since the cluster on GCP suddenly completed the workload faster than on AWS. From figures 5.5, 5.6, 5.7 and 5.8 it can be concluded that there was no significant difference in resource utilization which implies that since AWS clusters had access to a faster CPU, it should be able to complete the experiment in a shorter amount of time than GCP clusters. The experiment was redone multiple times but gave the same result every time. With the available information no reasonable explanation was found to why clusters with instance type 2 complete the experiment faster on GCP than on AWS. A possible theory would be that there are differences between the CPU models used by the two instance types, but as there was no time to research this theory in the study it can only be labeled as a possible explanation.

To summarize the results from the CPU intensive experiment it is clear that when looking strictly at performance, AWS is overall the better provider. Clusters on AWS complete the workload faster and use an equal or lesser amount of resources than clusters on GCP for all different instance types, with an exception of type 2 which takes longer on AWS than on GCP. When looking at both the cost and performance aspects simultaneously however, the choice of provider becomes more troublesome. For some instance types the choice is clear as the cluster performance is better and it costs less on one of the providers, but for other instance types the choice will be between either better performance or lower cost.

5.3 I/O intensive

(58)

Table 5.2: Completion time of I/O intensive experiment AWS GCP

Figure 5.13: I/O intensive CPU us-age AWS 4 cores 16 GB RAM

Figure 5.14: I/O intensive CPU us-age GCP 4 cores 16 GB RAM

Figure 5.15: I/O intensive memory usage AWS 4 cores 16 GB RAM

(59)

Figure 5.17: I/O intensive disk us-age AWS 4 cores 16 GB RAM

Figure 5.18: I/O intensive disk us-age GCP 4 cores 16 GB RAM

Figure 5.19: I/O intensive memory usage AWS 16 cores 64 GB RAM

(60)

Figure 5.21: I/O intensive disk us-age AWS 16 cores 64 GB RAM

Figure 5.22: I/O intensive disk us-age GCP 16 cores 64 GB RAM

Figure 5.23: I/O intensive CPU us-age AWS 4 cores 32 GB RAM

(61)

values for the disks can be seen in Table 4.10.

Because of the fact that the disk performance differs between clusters on the two providers, the expectations of the experiment results changed as well. It was instead expected that for clusters that had instance type 1, 2, 3 and 4 would have similar results across both providers as there should not be any significant difference in disk performance. But for clusters with instance type 5 it was expected that the cluster on GCP would out-perform the cluster on AWS because of the disk out-performance scaling. Table 5.2 presents the time it took for all different clusters to complete the experiments, and Figure 5.13 through 5.22 show the CPU and memory utilization as well as the disk usage for type 1 and 5.

It is clear from Figure 5.15, 5.16, 5.19 and 5.20 that the memory utilization is highly similar between clusters on the two providers which corresponds to what was expected. Figure 5.17 and 5.18 present the disk usage for clusters using instance type 1, and from the figures it can be seen that the disk performance is highly similar as well. There are a few differences between them, AWS clusters had more reads than GCP clusters but it read less data each time. At the same time GCP had more writes but wrote less data each time. These differences seem to balance each other and thus results in a similar completion time, although it is little bit faster on AWS. The results for clusters with instance type 2, 3 and 4 follow the same pattern as instance type 1 with similar memory utilization and small differences in disk usage. The similarities in memory and disk usage is reflected in the completion time that can be seen in Table 5.2.

(62)

number of cores which should result in better disk performance on GCP than on AWS because of how the disk performance scales. The difference in performance is also reflected in the time it takes for the experiment to complete, which is presented in Table 5.2 where it can be seen that the GCP cluster takes a little bit more than half of the time that the AWS cluster takes to complete the experiment.

The difference in CPU utilization between the two providers was not expected at all, in Figure 5.13 and 5.14 the CPU utilization for instance type 1 is presented, and it is clear that there is a difference between the two providers. On GCP the CPU utilization is always close to 100%, while it varies a lot on AWS. The reason for this difference is hard to pinpoint, but one would assume that when data is read into memory the CPU would be used as much as possible to process the data as fast as possible. Instead the CPU seems to be restricted during parts of the experiment on clusters hosted on AWS but not for clusters hosted on GCP. Another peculiar result is that when memory and disk graphs are examined more closely there seems to be two different parts in the experiment, one reading phase and one writing phase. The two different parts of the experiment are reflected in the CPU usage from Figure 5.13 and also in the disk usage from Figure 5.17 through 5.22, but not in 5.14 which is peculiar. The most probable reason for the difference in CPU usage is that the CPU is restricted during the reading phase of the experiment on clusters hosted on AWS. The restriction is possibly caused by a bottleneck at either the disk, memory or a combination of the two. But in order to be entirely sure about the root cause more extensive research and testing have to be performed.

(63)

with the result from the CPU intensive experiment, where it was shown that instance type 2 did not perform particularly well compared to the other instance types. However, the theory is only a possible explanation and needs to be thoroughly researched in order to see what the actual reason is. In other words, the reason that only the cluster with instance type 2 on AWS behaves the same way as all clusters on GCP is unknown, but undoubtedly strange and it is difficult to explain this behaviour with the information at hand.

To summarize the results for the I/O intensive experiment there was quite a significant difference in CPU utilization between the two providers which was not expected. The cause of the difference in CPU usage is probably because of a restriction of the CPU while the reason clusters with instance type 2 have similar CPU utilization as clusters on GCP is unknown. The memory utilization and disk usage on the other hand fulfilled the expectations, where the memory utilization had no significant difference between the two providers for any of the instance types. The disk usage had as mentioned earlier a few differences between the providers, where the most significant difference occurred when instance type 5 was used and the GCP disks outperformed the disks on AWS in all categories. Choosing the most suitable provider for I/O intensive workloads with only performance in mind is rather difficult as neither provider is overall better than the other. The only clear choice of provider is for instance type 5 where GCP have slightly more than half the completion time of AWS and the disk on GCP is significantly better than the disk on AWS. Taking the cost of the instance types in consideration will probably result in that the cheapest provider being chosen as the overall cluster performance is almost identical for instance type 1, 2, 3 and 4.

5.4 MapReduce

(64)

Table 5.3: Completion time of MapReduce experiment AWS GCP

Figure 5.25: MapReduce CPU usage AWS 4 cores 32 GB RAM

Figure 5.26: MapReduce CPU usage GCP 4 cores 32 GB RAM

Figure 5.27: MapReduce memory usage AWS 4 cores 32 GB RAM

(65)

Figure 5.29: MapReduce disk usage AWS 4 cores 32 GB RAM

Figure 5.30: MapReduce disk usage GCP 4 cores 32 GB RAM

Figure 5.31: MapReduce CPU usage AWS 16 cores 64 GB RAM

Figure 5.32: MapReduce CPU usage GCP 16 cores 64 GB RAM

Figure 5.33: MapReduce memory usage AWS 16 cores 64 GB RAM

(66)

Figure 5.35: MapReduce disk usage AWS 16 cores 64 GB RAM

Figure 5.36: MapReduce disk usage GCP 16 cores 64 GB RAM

The result for the MapReduce experiment presented in Figure 5.25 through 5.36 as well as Table 5.3 show that clusters on both providers have overall highly similar behaviour for this type of workload. Figure 5.25, 5.26, 5.31 and 5.32 show that the CPU utilization is close to identical between both providers for type 1 and 5. The CPU utilization was close to identical for the rest of the instance types as well, That is, there was no significant difference between the providers regarding the CPU usage aspect. From Figure 5.27, 5.28, 5.33 and 5.34 it can be concluded that there is no significant difference in memory utilization either, and the behaviour regarding memory usage is consistent throughout all of the different instance types as well. The only difference between the two providers can be seen in Figure 5.29, 5.30, 5.35 and 5.36 that show the disk usage for instance type 1 and 5. From the figures previously mentioned it can be seen that the disks on GCP and AWS read the same amount of times, but the disk on GCP read 10 - 20 MB more per read than the disk on AWS. The difference in disk performance is also consistent across all different instance types and it is always in favour of GCP.

(67)

workload is without question GCP when looking only at the performance aspect. Although there is no difference in CPU and memory utilization, the disks performance is better on GCP and this is reflected in the time it takes for the cluster to complete the experiment. One peculiar thing is that even though AWS clusters have access to 10 - 20% faster CPUs, GCP clusters always complete the experiment in the same or a shorter amount of time than AWS. From the available results it can be concluded that the difference in disk per-formance outweighs the difference in CPU speed in favour of GCP. Deciding on a provider when taking only performance in mind it is clear that GCP is the best one as it completes the workload faster, have better disk performance and also use less or an equal amount of CPU and memory. Taking the cost aspect into consideration there will once again be a decision between performance and cost for some of the instance types since one provider has either better cluster performance or a lower cluster cost.

5.5 Summary

(68)

Table 5.4: Recommended provider considering performance Type 1 Type 2 Type 3 Type 4 Type 5 CPU intensive AWS GCP AWS AWS AWS

I/O intensive AWS - GCP AWS GCP MapReduce GCP GCP GCP GCP GCP

Table 5.5: Recommended provider considering cost Type 1 Type 2 Type 3 Type 4 Type 5 CPU intensive AWS AWS GCP AWS GCP

I/O intensive AWS AWS GCP AWS GCP MapReduce AWS AWS GCP AWS GCP

(69)

6 Problems

During the creation and configuration of the clusters several different problems occurred. The first problem occurred when trying to create clusters through the created program that used the Karamel Java API to connect to the cloud provider, create VMs and install the Hopsworks platform. The problem that occurred was that when the credentials had been accepted (authenticated and authorized), the API refused to create any VMs on AWS. Instead it got stuck and refused to progress any further. To solve this issue, The installation logs was examined to see if anything related to the problem was reported. Unfortunately, the information in the logs did not provide any information that helped solve the problem. Instead the solution was to use the Karamel web UI instead of the Karamel API and the created program to setup clusters and install Hopsworks. The web UI had no issues creating VMs on AWS and progressing further in the installation.

The second issue that occurred was during the actual installation of the Hopsworks platform. Since the information regarding how to correctly create a cluster definition file and any dependencies between the different software was sparse or non-existent, many of the installation attempts failed. The total installation time of the Hopsworks platform was around 90 minutes and most of the failed attempts took around 45 to 60 minutes. Since there were many failed attempts, the Hopsworks platform installation phase was very time consuming until a solution to the problem had been discovered. The root cause of the problem was incorrectly placed recipes and missing dependencies between the software. After muchf information gathering, asking questions and examining already existing cluster definition files all dependencies were sorted out and the installation could proceed. The already existing cluster definition files can be found on the github page of Hopsworks’ creators [18].

Performance Comparison Study of Clusters on Public Clouds

Performance Comparison Study of

Clusters on Public Clouds

Prestandajämförelse av cluster på offentliga molnleverantörer

Martin Wahlberg

Performance Comparison Study of Clusters

on Public Clouds

Abstract

Acknowledgements

Contents

List of Figures

List of Tables

1

Introduction

1.1

Problem statement

2

Summary of technical work

3

Background

3.1

Hadoop

3.2

HopsFS

3.3

Spark

3.4

Flink

3.5

Hopsworks

3.6

Karamel

3.7

Cloud providers

4

Experiments

4.1

Experiment types

4.2

Experiment setup

4.3

Running the experiments

5

Result and analysis

5.1

Metrics

5.2

CPU intensive

5.3

I/O intensive

5.4

MapReduce

5.5

Summary

6

Problems