

IN

DEGREE PROJECT COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS

STOCKHOLM SWEDEN 2020,

Spark on Kubernetes using HopsFS as a backing store

Measuring performance of Spark with HopsFS for storing and retrieving shuffle files while running on Kubernetes

SHIVAM SAINI

KTH ROYAL INSTITUTE OF TECHNOLOGY

SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE


Authors

Shivam Saini <shivams@kth.se>

Information and Communication Technology
KTH Royal Institute of Technology

Place for Project

Logical Clocks AB Stockholm, Sweden

Examiner

Seif Haridi

KTH Royal Institute of Technology

Supervisor

Jim Dowling

KTH Royal Institute of Technology

Date: October, 2020


Abstract

Data is a raw list of facts and details, such as numbers, words, measurements or observations, that is not useful by itself. Data processing is a technique that helps process data in order to extract useful information from it. Today, the world produces huge amounts of data that cannot be processed using traditional methods. Apache Spark (Spark) is an open-source distributed general-purpose cluster computing framework for large-scale data processing. In order to fulfill its task, Spark uses a cluster of machines to process the data in a parallel fashion. The external shuffle service is a distributed component of an Apache Spark cluster that provides resilience in case of a machine failure. A cluster manager helps Spark manage the cluster of machines and provides Spark with the resources required to run the application.

Kubernetes is a new cluster manager that enables Spark to run in a containerized environment. However, running the external shuffle service is not possible when running Spark with Kubernetes as the resource manager. This significantly impacts the performance of Spark applications due to failed tasks caused by machine failures. As a solution to this problem, the open-source Spark community has developed a plugin that can provide resilience similar to that of the external shuffle service. When used with Spark applications, the plugin asynchronously backs up the data onto external storage. In order not to compromise Spark application performance, it is important that the external storage provides Spark with minimal latency. HopsFS is a next-generation distribution of the Hadoop Distributed File System (HDFS) and provides special support for small files (<64 KB) by storing them in a NewSQL database, thus enabling lower client latencies. The thesis work shows that HopsFS provides 16% higher performance to Spark applications for small files as compared to larger ones. The work also shows that using the plugin to back up Spark data on HopsFS can reduce the total execution time of Spark applications by 20%-30% as compared to recalculation of tasks in case of a node failure.


Keywords

Spark, Kubernetes, HopsFS, Data processing, Distributed and Parallel processing


Abstract

Data is a raw list of facts and details, such as numbers, words, measurements or observations, that is not useful by itself. Data processing is a technique that helps process data in order to extract useful information from it. Today, the world produces enormous amounts of data that cannot be processed using traditional methods. Apache Spark (Spark) is an open-source distributed general-purpose cluster computing framework for large-scale data processing. To fulfill its task, Spark uses a cluster of machines to process data in a parallel fashion.

The external shuffle service is a distributed component of an Apache Spark cluster that provides resilience in case of machine failure. A cluster manager helps Spark manage the cluster of machines and provides Spark with the resources required to run the application. Kubernetes is a new cluster manager that enables Spark to run in a containerized environment. However, running the external shuffle service is not possible when running Spark with Kubernetes as the resource manager. This strongly affects the performance of Spark applications due to failed tasks caused by machine failures. As a solution to this problem, the open-source Spark community has developed a plugin that can provide resilience similar to that of the external shuffle service. When used with Spark applications, the plugin asynchronously backs up data to external storage. In order not to compromise Spark application performance, it is important that the external storage provides Spark with minimal latency. HopsFS is a next-generation distribution of the Hadoop Distributed File System (HDFS) and provides special support for small files (<64 KB) by storing them in a NewSQL database, thereby enabling lower client latencies. The thesis shows that HopsFS provides 16% higher performance to Spark applications for small files compared to larger ones. The work also shows that using the plugin to back up Spark data on HopsFS can reduce the total execution time of Spark applications by 20%-30% compared to recalculation of tasks in the event of a node failure.

Keywords

Spark, Kubernetes, HopsFS, Data processing, Distributed and Parallel processing


Acknowledgements

I am sincerely grateful to my examiner Seif Haridi and my supervisor Jim Dowling for offering me the opportunity to work on this project and for trusting me with the responsibility of tackling this complicated problem. Secondly, I would like to thank my supervisor Theofilos Kakantousis from Logical Clocks for his constant guidance and genuine support throughout the development of the thesis. I would also like to thank my dear friend Muhammad Haseeb Asif for guiding me and helping me understand the complex concepts in the thesis work.


Contents

1 Introduction

1

1.1 Background . . . 1

1.2 Problem . . . 3

1.3 Purpose . . . 5

1.4 Goal . . . 5

1.5 Benefits, Ethics and Sustainability . . . 5

1.6 Methodology . . . 6

1.7 Outline . . . 7

2 Background

8

2.1 Spark . . . 8

2.1.1 Spark Architecture . . . 9

2.1.2 Execution modes in Spark . . . 11

2.1.3 Life cycle of Spark Application . . . 13

2.2 Kubernetes . . . 15

2.2.1 Pods . . . 16

2.2.2 Desired state . . . 17

2.2.3 Kubernetes Architecture . . . 17

2.3 Integrating Spark with Kubernetes . . . 19

2.3.1 Running Spark on Kubernetes . . . 19

2.4 Shuffle in Spark . . . 20

2.5 HDFS . . . 24

2.5.1 HDFS architecture . . . 24

2.6 Network DataBase . . . 26

2.7 HopsFS . . . 27

2.7.1 HopsFS architecture . . . 27


3 Design and Implementation

29

3.1 Proposed Solution . . . 29

3.2 Choosing the file system . . . 30

3.3 Async-shuffle-upload API Design . . . 31

3.4 Using async-shuffle-upload API in a Spark job . . . 33

3.5 Shuffle files generation format . . . 34

4 Experimental evaluation

35

4.1 Designing the testing scenarios . . . 35

4.2 Evaluation metrics . . . 37

4.3 Datasets . . . 38

4.4 Evaluations . . . 38

4.4.1 Evaluation 1 . . . 39

4.4.2 Evaluation 2 . . . 40

4.4.3 Evaluation 3 . . . 41

4.4.4 Evaluation 4 . . . 42

4.4.5 Evaluation 5 . . . 44

4.4.6 Evaluation 6 . . . 45

4.4.7 Evaluation 7 . . . 46

4.4.8 Evaluation 8 . . . 47

5 Conclusion and future work

50

5.1 Conclusion . . . 50

5.2 Future work . . . 51

Bibliography

53


Acronyms

API Application Programming Interface

CPU Central Processing Unit

GFS Google File System

HDFS Hadoop Distributed File System

K8s Kubernetes

NDB Network Database

NFS Network File System

OS Operating System

RDD Resilient Distributed Dataset


List of Figures

1.1.1 Shuffle Process [16] . . . 2

2.1.1 The architecture of a Spark application [32] . . . 9

2.1.2 A cluster driver and worker (no Spark Application yet) [32] . . . 10

2.1.3 Cluster mode in Spark [32] . . . 12

2.1.4 Client mode in Spark [32] . . . 12

2.1.5 Requesting resources for a driver [32] . . . 13

2.1.6 Launching the Spark Application [32] . . . 14

2.1.7 Application execution [32] . . . 14

2.1.8 Shutting down Application [32] . . . 15

2.2.1 Kubernetes architecture [19] . . . 18

2.3.1 Submitting Spark application on Kubernetes [30] . . . 20

2.4.1 Shuffle Process in Spark [17] . . . 21

2.4.2 Reducer executor fetching shuffle data from Mapper executor [29] . . . 22

2.4.3 Mapper executor terminates unexpectedly [29] . . . 22

2.4.4 Reducer executor getting information about shuffle data from External Shuffle service [29] . . . 23

2.4.5 Isolation between Mapper executor and External Shuffle service in Kubernetes [29] . . . 23

2.5.1 HDFS Architecture [23] . . . 25

2.7.1 HopsFS Architecture [23] . . . 28

3.1.1 Proposed Architecture . . . 30

3.5.1 Shuffle files storage format in external file system . . . 34

4.4.1 Comparison of execution times for Dataset 1 in Scenario A on Cluster 1 . . . 40

4.4.2 Comparison of execution times for Dataset 1 in Scenario B on Cluster 1 . . . 41

4.4.3 Comparison of execution times for Dataset 2 in Scenario A on Cluster 1 . . . 42


4.4.4 Comparison of execution times for Dataset 2 in Scenario B on Cluster 1 . . . 43

4.4.5 Comparison of execution times for Dataset 2 in Scenario A on Cluster 2 . . . 45

4.4.6 Comparison of execution times for Dataset 3 in Scenario B on Cluster 2 . . . 46

4.4.7 Comparison of execution times for Dataset 3 in Scenario A on Cluster 2 . . . 47

4.4.8 Comparison of execution times for Dataset 2 in Scenario B on Cluster 2 . . . 48


List of Tables

4.1.1 Kubernetes Cluster 1 specifications . . . 37

4.1.2 Kubernetes Cluster 2 specifications . . . 37

4.4.1 Dataset size and corresponding shuffle data size per file . . . 39

4.4.2 Evaluation for Dataset 1 in Scenario A on Cluster 1 . . . 39

4.4.3 Evaluation for Dataset 1 in Scenario B on Cluster 1 . . . 40

4.4.4 Evaluation for Dataset 2 in Scenario A on Cluster 1 . . . 42

4.4.5 Evaluation for Dataset 2 in Scenario B on Cluster 1 . . . 43

4.4.6 Evaluation for Dataset 2 in Scenario A on Cluster 2 . . . 44

4.4.7 Evaluation for Dataset 2 in Scenario B on Cluster 2 . . . 45

4.4.8 Evaluation for Dataset 3 in Scenario A on Cluster 2 . . . 47

4.4.9 Evaluation for Dataset 3 in Scenario B on Cluster 2 . . . 48


Chapter 1 Introduction

Data is a list of facts and details, such as numbers, words, measurements, observations or just descriptions of things. It is a raw list of facts that is not useful by itself.

Information is what provides logical meaning to this data [33], and in order to get the information, we need to process the data. Data processing [12] is the technique of processing the available data in order to get useful information out of it. In the past, humans processed data manually, which was slow and prone to errors. The advent of computer technology brought a revolution, as manual processing was replaced by automated processing done by computers, which was fast, reliable and free from human errors.

1.1 Background

Processing data became much easier with the advent of computers, as there was only a small amount of data that needed to be processed. But with time, computers became more advanced and more powerful, and thus generated a lot more data. Today, companies produce petabytes of data in their everyday work [12]. Traditional data processing techniques would take days to process this amount of data. This problem led to the advent of data processing tools that work in a distributed and parallel fashion and reduce the processing time from days to just a few hours. Apache Spark [7] is one of the many data processing frameworks that help us speed up the processing. Others include Apache Hadoop [3], Apache Flink [2], Apache Hive [5], Apache Storm [8], Tableau [11], etc.


Spark is an open-source distributed general-purpose cluster computing framework for large-scale data processing. In order to fulfill its task, Spark uses a cluster of machines to process large-scale distributed data in a parallel fashion, providing the user with quick and accurate results. This thesis aims to solve one of the challenges faced by the Apache Spark framework today. But in order to understand the problem, we first need to get familiar with three concepts of Spark.

Cluster manager: While working in a distributed fashion, Spark needs to manage the cluster of machines in order to get the right results. A cluster manager helps Spark with this management and arranges the required resources to run the application.

Spark supports various cluster managers. Apache Mesos [6] and Apache Yarn [4] are the most widely used cluster managers to date.

Shuffling: Sometimes, while processing and in order to get the final result, Spark needs to redistribute the data across the executor machines running on the cluster. Let's say there is a dataset partitioned across the executors. Spark takes a partition key and transfers all the data whose key matches that partition key to the same executor. The executors then process this data locally to get the final result. To do this shuffling task, Spark has two types of executors: mapper executors and reducer executors.

In figure 1.1.1, each shape or color represents data with the same key. On the left side are the three mapper executors, each holding one key. On the right side are the reducer executors, which get the data for each key reduced on the same executor after the shuffle process.

Figure 1.1.1: Shuffle Process [16]
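The routing of records described above can be sketched as hash partitioning, the idea behind Spark's default partitioning of shuffle data. The function name `partition_for` and the sample records below are illustrative, not Spark's API:

```python
# Illustrative sketch of hash partitioning: every record with the same key
# is routed to the same reducer partition, which is what lets mappers agree
# on the destination of each record without any coordination.
def partition_for(key, num_partitions):
    # A hash of the key modulo the partition count picks the reducer.
    return hash(key) % num_partitions

records = [("red", 1), ("blue", 2), ("red", 3), ("green", 4), ("blue", 5)]
num_reducers = 3

# Group records by their target reducer, as the shuffle would.
partitions = {p: [] for p in range(num_reducers)}
for key, value in records:
    partitions[partition_for(key, num_reducers)].append((key, value))

# All records sharing a key end up in the same partition.
for p, recs in partitions.items():
    print(p, recs)
```

Because the mapping depends only on the key, every mapper independently sends matching records to the same reducer.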


External shuffle service: During the shuffle process, a mapper executor can go down due to system failure before the reducers are able to fetch the intermediate data from the mappers. This requires the Spark application to recalculate the map task, which in turn significantly impacts the performance of the Spark application. To overcome this issue, Spark uses an external shuffle service. In Yarn and in Mesos, each node in the cluster can run a single shuffle service process alongside the other executors. This service has complete information about where the mapper task has stored the intermediate data and helps the reducer to get it in case of mapper executor failure.

1.2 Problem

The following points describe the problems with the current implementation of the Spark framework:

• Lack of isolation: The external shuffle service runs on the same host as other processes that use the YARN node manager. Other Spark applications run on the same host as well. If the shuffle service misbehaves, the applications running on the same node manager will also be negatively impacted.

• Scalability issues: There is only a single external shuffle service running on a node. All the executors running on that node write to the same disk and share their metadata with the same shuffle service. This causes the shuffle service to become a bottleneck in scaling the system. Moreover, if the shuffle service gets corrupted, the metadata written by all executors will be lost and the whole task will have to be rescheduled.

• Requires continuous uptime: In order for the executors to write data on the local disk, the shuffle service needs to be running so that they can provide their metadata to it. If the shuffle service is stopped, the node on which it is collocated will not be able to schedule executors. But a shuffle service that is running without serving any executors is a waste of resources.

• Need for containerization: Various organisations run many of their workloads in containers, for example serving, stateful workloads, databases, etc. Containerising Spark will not only bring a streamlined developer experience but will also help to reduce operational cost, in addition to improved infrastructure utilization.


The Spark open-source community has already addressed these challenges and proposed using a new type of resource manager that can solve all the above problems.

Kubernetes (K8s) [18], an open-source system used to automate deployment, scaling and management of containerized applications, seemed to be a good fit for this role. Kubernetes has gained popularity and has become a leading cluster orchestrator that lets us manage containerized apps at scale. As Kubernetes provides isolation, easy scaling up and down of resources and a container environment, all of which were missing from the current implementation of Spark, the open-source Spark community has already made a lot of progress in using K8s as Spark's resource manager [30].

But using Kubernetes also has its own limitations. K8s is an orchestrator for containers [36] and was not designed with the implementation of the Spark architecture in mind. Although it solves all the problems mentioned above, it also creates new ones. For example, one of the biggest drawbacks is that running the external shuffle service is not possible due to the isolation that the K8s architecture provides for additional security. The following chapter will discuss in detail why this is so. This highly impacts the performance of Spark applications and makes running Spark jobs inefficient.

To address this issue, contributors in the open-source community have suggested using remote storage for persisting shuffle data, as an alternative to the external shuffle service [1]. This idea has captured the attention of various contributors in the Spark community. The aim is to find a file system that serves the Spark application with minimum data latency and affects its performance the least.
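Spark 3 exposes a pluggable shuffle storage interface that such plugins hook into. The sketch below shows only the general shape of such a configuration: the `spark.shuffle.sort.io.plugin.class` key exists in Spark 3, but the plugin class and the backing-store key shown are placeholders, not the actual settings of any particular plugin:

```properties
# Illustrative only: point Spark's shuffle I/O at a plugin implementation.
spark.shuffle.sort.io.plugin.class=<fully.qualified.PluginClass>
# A plugin would additionally need to know where to back up shuffle files,
# e.g. an HDFS-compatible URI for a HopsFS namenode (placeholder values).
<plugin.backing.store.uri>=hdfs://<hopsfs-namenode>:8020/shuffle
```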

Google, an active contributor to the open-source Spark community, has tested the performance of various file systems such as the Hadoop Distributed File System (HDFS), Google File System (GFS), Bigtable, Apache Crail and Network File System (NFS), but none of them has lived up to expectations in terms of performance [16].

Hops: Hops is a next-generation distribution of Apache Hadoop, with a heavily adapted implementation of HDFS, called HopsFS [24]. HopsFS is a new implementation of the Hadoop Filesystem (HDFS). As opposed to HDFS, HopsFS stores the metadata in an in-memory distributed database, Network Database (NDB).

HopsFS has special support for small files (≤64 KB by default), where the metadata and the data are both stored together in NDB. This special support greatly increases


the performance of the file system for small files [23]. A detailed description of the architecture is found in section 2.7.
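The small-file path can be pictured as a routing decision on write. A minimal sketch under the assumptions stated in the text (the 64 KB threshold comes from the text; the function name and tier labels are illustrative, not the HopsFS API):

```python
# Illustrative routing policy for HopsFS-style small-file handling.
# Files at or below the threshold are stored inline with the metadata in
# the NDB database; larger files go to the regular datanode block storage.
SMALL_FILE_THRESHOLD = 64 * 1024  # 64 KB, the default mentioned in the text

def storage_tier(file_size_bytes):
    """Return which tier an incoming file of this size would land in."""
    if file_size_bytes <= SMALL_FILE_THRESHOLD:
        return "ndb"        # data kept next to metadata
    return "datanodes"      # data on datanodes, metadata in NDB

print(storage_tier(16 * 1024))    # a small shuffle file
print(storage_tier(256 * 1024))   # a large shuffle file
```

Keeping the data of small files next to their metadata saves the extra round trip to a datanode, which is where the latency advantage for small shuffle files comes from.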

It is clear that the idea of using external storage to persist Spark shuffle data has a lot of potential and needs to be explored further. Therefore, it is the area of focus of this thesis. Hence, we formulate the following hypothesis and try to answer whether it is valid:

Given the special support for small files, HopsFS provides Spark applications with lower latency when dealing with small files as compared to larger ones while running on Kubernetes (K8s).

1.3 Purpose

As discussed in the previous section, running Spark on K8s has become a demand of the present industry, but it also introduces additional challenges. The purpose is to discuss these challenges and find possible alternatives as solutions. Furthermore, the thesis presents the work that has been done in the direction of solving this problem.

1.4 Goal

The goal of the thesis is to present a solution to the problem described above. Further, our goal is to analyze whether the proposed solution actually solves the problem through qualitative and quantitative analysis. Finally, we want to discuss whether there is scope for further research in the area.

1.5 Benefits, Ethics and Sustainability

This thesis aims to do no harm to any community or person, directly or indirectly, as a result of this work. Instead, it aims to help the data engineers and data scientists who are facing the issue discussed in this thesis work. Since this thesis is work around an open-source project, it has been made sure that proper references are given to the people who have already worked in the area and to the work that has been used as a reference.


This work also has a sustainability impact on the environment. The use of a resource orchestrator such as Kubernetes helps to scale the system on demand and eventually improves the utilization of compute resources, leading to a smaller infrastructure footprint and a corresponding reduction in carbon footprint [15].

1.6 Methodology

The work done in this thesis falls under the category of System Design and follows the Design Science Methodology as described by Aline Dresch et al. [13], although it involves some application integration as well. Therefore, the methodology selected for this thesis is the Design Science Research Methodology as proposed by Ken Peffers et al. [26].

The methodology consists of the following steps:

1. Problem identification and motivation: Identify the research problem and then justify the value of the solution. Since the problem definition is used to develop the solution, the thesis work has tried its best to break the problem into atomic parts to reduce its complexity and gain a better understanding.

Moreover, reasonable justification for the value of the solution has been given using examples and past studies, as this motivates the audience to pursue and accept the solution.

2. Objectives of a solution: In the thesis work, the objectives of the solution are inferred from the problem definition. Moreover, the objectives have been compared with current solutions and their efficacy.

3. Design and develop: Special importance was given to first designing the infrastructure and the functionalities of the solution. Only after selecting the most suitable design was the actual artifact built.

4. Demonstration: The thesis work demonstrates how the designed solution solves the targeted problem by conducting various experiments.

5. Evaluation: Based on the experimental results achieved in the previous step, the thesis work evaluates how well the designed solution solves the target problem. This helps to understand whether the solution has actually solved the problem or whether there was a need to design a new solution and repeat the above steps.

6. Communication: Keeping in mind future research and the


researchers, the thesis work has tried its best to communicate the targeted problem and the importance of its solution to the audience. As this study might prove beneficial to some researchers or might act as a basis for novel research, the thesis provides the future scope of the existing study.

1.7 Outline

The rest of the thesis is organized in the following way:

• Chapter 2 presents the theoretical background knowledge required to deeply understand the context of the thesis. It also comprehensively describes the problem statement targeted in the thesis.

• Chapter 3 describes the implementation of the solution provided by this thesis work. It explains why the choice of solution was made, and discusses its architecture, design, implementation and integration.

• Chapter 4 evaluates the presented solution by discussing and analyzing the experimental outcomes obtained during the research work. This chapter also describes the test environment used during the experiments.

• Chapter 5 concludes the work by reiterating the problem, discussing the provided solution and briefly describing the future scope of the study.


Chapter 2 Background

This chapter introduces the background of the thesis topic in more detail and helps to understand the target problem in depth. Later, we also discuss the work that has already been done in the related field.

2.1 Spark

As introduced in the previous chapter, Spark [7] is an open-source distributed general-purpose cluster computing framework for large-scale data processing. Today, Spark is the most actively developed open-source framework for distributed and parallel processing, and it is therefore naturally one of the standard tools used by developers and data scientists interested in big data [9].

Normally, when we think of a computer, we think of a single machine that we have at home. We use it for various purposes: watching movies, playing games and doing office work. Normally, our computer meets all our demands. But at some point we all realize that this computer is not powerful enough for a specific task we just came across. Large-scale data processing is one such task that a single computer is not capable of completing, or that might take more time than a user can wait for its completion.

Spark uses a cluster of machines to process large-scale distributed data in a parallel fashion, providing the user with quick and accurate results. It uses these clusters of machines in such an organized way that it appears to be a single machine with a huge amount of power. Spark needs to manage this cluster of machines and coordinate them in order to


get the right results. Spark supports multiple widely used programming languages such as Python, Java, Scala and R. It is also capable of diverse tasks, ranging from SQL to data streaming and machine learning.
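The division of one job across many workers that Spark performs on a cluster can be imitated on a single machine, with threads standing in for executors. A pure-Python sketch of a partitioned word count (no Spark involved; all names and the sample data are illustrative):

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# The dataset is split into partitions, one per "executor", mimicking how
# Spark distributes data across worker machines.
partitions = [
    ["spark counts words", "words in parallel"],
    ["parallel workers each count a partition"],
    ["results are merged into one final count"],
]

def count_partition(lines):
    """Map step: each worker counts words in its own partition only."""
    return Counter(word for line in lines for word in line.split())

# Map phase: one task per partition; threads stand in for executors.
with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
    partial_counts = list(pool.map(count_partition, partitions))

# Reduce phase: merge the per-partition results, as a shuffle-and-reduce would.
total = sum(partial_counts, Counter())
print(total.most_common(3))
```

In real Spark the partitions live on different machines and the merge involves a shuffle over the network, but the map-then-reduce shape is the same.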

2.1.1 Spark Architecture

Spark follows the master-slave architecture [21]. Its cluster consists of a single master and multiple slaves. The architecture is explained here using the book Spark:

The Definitive Guide written by Bill Chambers and Matei Zaharia [32].

Figure 2.1.1: The architecture of a Spark application [32]

• Spark driver: The Spark driver is a process which is the heart of the application.

This is where the main() function of the application runs. It sits on a machine, also termed a node, inside the cluster. It has three main tasks:

– Maintaining the information about the Spark application.

– Responding to user’s input or program.

– Analyzing, distributing and scheduling work on the executors.


• Spark executors: The executors are the processes that actually perform operations on data assigned by the driver process. The executors have only two tasks:


– Complete the work assigned by the driver.

– Report the state of the computation on that executor back to the driver process.

Figure 2.1.1 shows the basic architecture of a Spark application. We can see the driver on the left, which acts as the master. There are four executors on the right, which act as slaves to their master.

• Cluster Manager: As mentioned earlier, Spark relies on the concept of using machines in clusters in order to process data in a distributed and parallel fashion. But Spark does not manage the cluster itself. Instead, it relies on a cluster manager, sometimes also referred to as a scheduler, to maintain the coordination between the executors and the driver process. When submitting an application, the Spark driver requests this cluster manager to provide the necessary resources to complete the processing task.

It might sound confusing, but the cluster manager also has its own abstraction of a ”driver” (sometimes called a master) and ”workers”. The main difference lies in the fact that these are tied to physical machines rather than processes, as is the case in Spark. Figure 2.1.2 shows the basic cluster setup. On the left is the Cluster Manager Driver Node, and the circles represent the running daemon processes managing each of the worker nodes. Keep in mind that these processes belong to the cluster manager and not to the Spark application.

Figure 2.1.2: A cluster driver and worker (no Spark Application yet) [32]

When it comes to running a Spark application, Spark will need to request resources. The cluster manager will be solely responsible for providing and managing


the underlying machines where the Spark application will run. The following cluster managers are supported by Spark:

– Standalone: The simplest cluster manager; it comes included with Spark and has very limited features. It makes it easy to set up a cluster that Spark can manage itself.

– Apache Mesos: An open-source cluster manager popular for big data workloads [6]. Mesos handles workloads in a distributed environment through dynamic resource sharing and isolation. It is useful when deploying applications in a large-scale cluster environment.

– Hadoop Yarn: Like Mesos, Yarn is used when deploying applications in a large-scale cluster environment [4]. The difference lies in the fact that Yarn is specially designed for Hadoop workloads, whereas Mesos is designed for all kinds of workloads. It is favourable to use Yarn if the user is already running a Hadoop cluster.

– Kubernetes: A relatively new resource manager for running Spark in a cluster environment, and still an area of research [18]. It provides higher scalability and also additional security by isolating the applications running on the cluster, which no other resource manager is capable of.

2.1.2 Execution modes in Spark

Execution modes give the user the ability to decide where the resources will be located when running an application. Spark supports three execution modes.

• Cluster mode: When Spark submits an application, it requests the cluster manager to provide the resources. The cluster manager creates the Spark driver process on one of the nodes in the cluster, in addition to the executor processes.

In this way, the cluster manager is solely responsible for maintaining all the application-related processes. Figure 2.1.3 shows the cluster manager placing the Spark driver process (solid orange block) on one worker node and the executors (dotted blocks) on other worker nodes.

• Client mode: It is similar to the cluster mode except for the fact that the driver


Figure 2.1.3: Cluster mode in Spark [32]

process runs on the client machine that submits the Spark application. So, the client is responsible for maintaining the driver process, and the cluster manager takes care of the executor processes. In figure 2.1.4, the driver process is located on a machine outside of the cluster and all executors are inside the cluster.

Figure 2.1.4: Client mode in Spark [32]

• Local mode: It is the most basic way of running Spark. In this execution mode, Spark runs all the processes on a single machine and parallelism is obtained in


terms of threads. This method of running Spark is good for learning Spark and doing basic experiments, but is not advised in a production environment.
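The three modes above map to options of the standard `spark-submit` launcher. Illustrative invocations follow, with the master URL, application class and jar as placeholders (`--master` and `--deploy-mode` are standard Spark options; the Kubernetes master URL format `k8s://` is used here as an example of a cluster manager):

```shell
# Cluster mode: the cluster manager places the driver on a worker node.
spark-submit --master k8s://https://<api-server>:6443 --deploy-mode cluster \
  --class org.example.App app.jar

# Client mode: the driver runs on the submitting machine.
spark-submit --master k8s://https://<api-server>:6443 --deploy-mode client \
  --class org.example.App app.jar

# Local mode: everything in one JVM, parallelism via 4 threads.
spark-submit --master "local[4]" --class org.example.App app.jar
```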

2.1.3 Life cycle of Spark Application

In this section, we discuss the stages in the life cycle of a Spark application. We illustrate them assuming that the cluster runs four nodes: one node running the cluster manager's driver and the others acting as worker nodes.

• Client Request: The first step is to submit the application from the local machine in the form of a pre-compiled jar or library. Spark then requests the cluster manager to create the Spark driver process on a node in the cluster. Note that at this point, Spark only asks for resources to run the driver process.

The client program that submitted the application exits and the application starts running on the cluster. Figure 2.1.5 shows the client request process.

Figure 2.1.5: Requesting resources for a driver [32]

• Launch: Now that the driver process is running in the cluster, it starts executing the user code. This code initializes a session called the Spark session, which is required to run the application on a cluster. The Spark session in the driver process then requests the cluster manager to create executor processes on other nodes in the cluster, as shown in figure 2.1.6. The cluster manager responds by creating the required number of executors, which are specified by the user while


submitting the application via the client process. The cluster manager also sends the relevant information about the executor processes to the driver process.

Figure 2.1.6: Launching the Spark Application [32]

• Execution: Now that the Spark cluster (driver + executors) is up and running, the executors start performing the tasks assigned to them by the driver process. They communicate with each other to transfer and process the data, and they also report the state of the current execution to the driver process.

Figure 2.1.7: Application execution [32]

• Completion: The driver process exits with either success or failure, depending on whether all the executor tasks were completed successfully. The cluster manager


then shuts down the executors after the completion of the driver process. Figure 2.1.8 shows the completion phase of a Spark application.

Figure 2.1.8: Shutting down Application [32]

2.2 Kubernetes

Kubernetes [18], also known as K8s, is an open-source system used to automate the deployment, scaling and management of containerized applications. Kubernetes provides support to control the number of containers running on a cluster. The first unified container management system was developed at Google and was called Borg [34].

It was developed to manage both long-running services and batch jobs. But Borg had some pain points, which Google addressed by developing Kubernetes and later open-sourcing it [10]. Kubernetes traces its lineage directly from Borg.

In the early era, organizations ran their applications directly on physical servers. This method had many limitations that made it hard to keep applications running in production. An application running on a server could consume so many resources that other applications never got enough resources to run their processes and eventually went into starvation1. To solve

1Starvation is a problem encountered in concurrent computing where a process is perpetually denied necessary resources to process its work.


this issue, virtualization [35] came into the picture, and companies started running applications in virtual environments. In virtualization, a guest OS runs on top of the host Operating System (OS), which helps to isolate different applications as they run on different guest OSes.

However, in virtualization, the guest OS shares the physical resources with the host OS. The resources are divided among both operating systems, whether or not the applications running on them require that much. As a result, virtualization often leads to wasted resources.

To further address this issue, organizations started using containerized applications.

A containerized application makes use of containers [36]: a container is a standard unit of software that packages up code and all its dependencies so the application runs quickly and reliably from one computing environment to another. A container image is a lightweight, standalone, executable package of software that includes everything needed to run an application: code, runtime, system tools, system libraries and settings, and it can run on a shared OS. Even though containers share the same OS, they never interfere with other containers or the host OS, as they always work in their isolated environment. This makes containers the first choice of organizations when running applications in a production environment.

Big organizations can have thousands of containers running on their servers. It becomes a burden to manage those containers manually and would require a large workforce. This is where Kubernetes comes into play: as a container orchestration tool, it provides the ability to manage thousands of containers running on the same or different servers.

2.2.1 Pods

Kubernetes architecture has a master and multiple workers. These workers run on nodes, and combined together, these nodes form a cluster. The user talks to the master, declaring in a configuration file what they want to run and what the desired state of the cluster should be. The desired state specifies things like which container image to run, how many replicas should be running and which network to operate on. The master then schedules the workload onto the nodes to run as containers.


Strictly speaking, nodes do not run containers directly but pods [27], which are a higher-level abstraction over containers. The pod is the lowest unit of scheduling in Kubernetes, and a pod can have multiple containers running inside it. Each pod has its own IP address, and if two pods want to contact each other, the underlying architecture takes care of that. Containers in the same pod share a single IP address and a single namespace.

2.2.2 Desired state

The Kubernetes cluster tries to maintain the desired state of the cluster by constantly monitoring it in a loop. Let's say the user specified ten replicas of a service pod; that is the desired state of the cluster. Kubernetes will schedule ten replica pods on the nodes. Now suppose that after some time, a node that was running two of the replica pods crashes. Only eight replica pods are left, which is now the actual state of the system. In order to restore the desired state, Kubernetes will spin up two new pods on another node. This is what makes Kubernetes a very powerful tool.
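The reconciliation behaviour described above can be sketched as a simple control loop. The snippet below is a minimal, self-contained Python simulation of the idea; the node and pod model is invented for illustration and is not Kubernetes code:

```python
# Minimal sketch of Kubernetes-style reconciliation: compare the actual
# number of replicas with the desired number and schedule replacements.
def reconcile(desired_replicas, pods_per_node):
    """pods_per_node maps node name -> number of replica pods it runs."""
    actual = sum(pods_per_node.values())
    if actual < desired_replicas:
        # Spin up the missing pods on the least loaded surviving node.
        target = min(pods_per_node, key=pods_per_node.get)
        pods_per_node[target] += desired_replicas - actual
    return pods_per_node

cluster = {"node-a": 4, "node-b": 4, "node-c": 2}   # desired state: 10 replicas
del cluster["node-c"]                                # node-c crashes: actual state is 8
cluster = reconcile(10, cluster)
print(sum(cluster.values()))                         # 10: desired state restored
```

A real controller runs this comparison continuously against the cluster state stored in etcd, rather than once.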

2.2.3 Kubernetes Architecture

As mentioned in the previous section, K8s follows a master-slave architecture [21] [19].

There is a K8s master and there are multiple slave or worker nodes. The user submits the application through the master, and the components (services) of a microservice application might get scheduled over more than one node in the distributed cluster.

• Kubernetes Master: There are multiple components of the master that make up the control plane of the Kubernetes cluster and are responsible for the health and desired state of the cluster. These components make global decisions for the cluster, like scheduling and deploying pods and responding to other cluster events.

– Kube-api server: It is the component of the control plane that exposes the Kubernetes Application Programming Interface (API). We can think of it as the front end of the K8s cluster; it acts as the single point of contact for the outside world into the cluster by exposing a RESTful API.


Figure 2.2.1: Kubernetes architecture [19]

– Etcd: A distributed key-value datastore which acts as the single source of truth for the cluster. It is the only stateful component in the cluster and persistently maintains the configuration and state of the cluster.

– Kube-scheduler: In real-world scenarios, there are thousands of pods running, and deciding on which node to run each pod is an important decision. The kube-scheduler decides where the application or the components of the application will be scheduled.

– Controller manager: Kubernetes makes use of the controller manager to manage its running components. The controller manager consists of various controllers, each keeping track of certain tasks. For example, the Node Controller keeps track of the nodes in the cluster and reacts when a node goes down. Similarly, there is a replication controller that makes sure that exactly the number of pod replicas required by the user is present.

All of these services are exposed through a single API endpoint. Whether the user uses the K8s command line interface ('kubectl'), a dashboard or a third-party tool, they always end up talking to the kube-api server, which exposes the REST API that is the gateway to Kubernetes.

• Components on worker nodes: The following components run on each node in the cluster and provide the Kubernetes runtime environment.

– Kubelet: Kubelet is responsible for running the pods on the node. The kube-scheduler directs the kubelet to create a pod on a node. Kubelet makes sure that the desired number of pods and the required containers are always up and healthy.

– Kube-proxy: Kube-proxy provides the network proxy on the node. It manages the network rules on the node, through which the services inside the cluster are able to communicate with each other.

– Container runtime: To run containers inside a pod, Kubernetes needs a container runtime. Kubernetes can use different runtimes, but Docker is the default.

2.3 Integrating Spark with Kubernetes

A lot of organizations run many of their workloads on containers, for example serving, stateful workloads and databases. Containerizing Spark not only brings a streamlined developer experience but also helps to reduce operational cost while improving infrastructure utilization. Overall, Kubernetes provides Spark with the following benefits:

1. Provides isolation of the running Spark application from other processes in the system.

2. Allows the user to easily scale the application up or down according to the workload, saving operational cost at the same time.

3. Makes provisioning of the Spark application easy for users, with minimal operational effort.

4. Offers many add-ons and services, like third-party logging, monitoring and security tools, as there is a huge ecosystem surrounding both Kubernetes and Docker containers.

2.3.1 Running Spark on Kubernetes

It is easy to run a Spark job on K8s; it can be done using the spark-submit script provided with Spark [30]. As shown in fig 2.3.1, when a user or client runs spark-submit, the request goes to the API server and asks it to run the driver pod, which runs the Spark driver. The driver then talks to the scheduler and requests the executor pods, which


will run the Spark executors. The scheduler creates the executor pods with all the dependencies and configuration given by the user. The executors listen to the Spark driver and perform the individual Spark tasks given by the driver. After completing its tasks, each executor sends its results to the driver. Once the process is completed, the executor pods are terminated and cleaned up, meaning they can not be accessed anymore and no logs can be retrieved from them. But the driver pod persists, remains in the COMPLETED state and keeps the logs so the user can look at the results, or at any errors if the job failed.

Figure 2.3.1: Submitting Spark application on Kubernetes [30]

2.4 Shuffle in Spark

In the last chapter, we introduced the problem of running the external shuffle service while running Spark on Kubernetes. In the following sections we will understand why this is so. But before that, let us refresh the concept of shuffle in Spark and get a better understanding of the shuffle process.

Spark stores data in the form of Resilient Distributed Datasets (RDDs) [28], the fundamental data structure of Spark. Using these RDDs, Spark stores data in a distributed way across different partitions. When certain operations are performed on an RDD, a new RDD is generated. The shuffle operation is the process by which Spark redistributes the data of an RDD across the executors in the Spark application. So let's


say there is a dataset partitioned across the executors in the form of an RDD. Spark takes a partition key, looks in the different executors for the data whose key matches that partition key, and transfers it to the same executor. The executors then process this data locally to get the final result. Shuffle is a very intensive task and consumes a lot of resources, so Spark tries to avoid shuffling whenever possible. But there are a few operations in Spark, like reduceByKey and groupByKey, where shuffle can not be avoided.

Figure 2.4.1: Shuffle Process in Spark [17]

To do this, a shuffle has two sides: a map side and a reduce side. As shown in fig 2.4.2, on the map side, the mapper executors write shuffle files to their local disks, partitioned by the given partition key. On the reduce side, the reducer executors contact the mapper executors in the application and request the shuffle files corresponding to a particular partition key; the mapper executors read the data from local disk and send it to the reducer executors accordingly.

Fig 2.4.1 shows the stages of a reduceByKey operation. The mapper executes the map task using a .map operation, which is performed on data stored in the


form of an RDD. Once the map task is done, Spark stores the processed data in the form of a new RDD. On the reduce side, new RDDs are generated after the data is shuffled (marked as shuffle in the figure) at the request of the reducer. Once all the data with the same keys is present, the reducer performs the .reduceByKey operation to get the final result.
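The data movement of the map and reduce sides described above can be mimicked with plain Python collections. This is an illustrative sketch of hash partitioning by key followed by a per-key reduction, not Spark's actual implementation:

```python
# Simulate a shuffle for reduceByKey: records start spread over mapper
# "executors", get repartitioned by key, then each reducer folds its keys.
records = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
mappers = [records[0:2], records[2:5]]          # two map-side partitions
num_reducers = 2

# Map side: each mapper writes one "shuffle file" per reduce partition.
shuffle_files = [{r: [] for r in range(num_reducers)} for _ in mappers]
for mapper_id, partition in enumerate(mappers):
    for key, value in partition:
        shuffle_files[mapper_id][hash(key) % num_reducers].append((key, value))

# Reduce side: each reducer fetches its partition from every mapper and
# applies the reduce function (here: addition) per key.
result = {}
for reducer_id in range(num_reducers):
    for mapper_id in range(len(mappers)):
        for key, value in shuffle_files[mapper_id][reducer_id]:
            result[key] = result.get(key, 0) + value
print(result)   # e.g. {'a': 4, 'b': 7, 'c': 4}
```

Note that every value for a given key ends up at exactly one reducer, which is why the reduce step can run locally; this all-to-all routing is what makes shuffle expensive.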

Figure 2.4.2: Reducer executor fetching shuffle data from Mapper executor [29]

This goes well if the reducer executors are able to contact the mapper executors directly and the mapper executors also stay around to provide the shuffle data that is required.

But if a mapper executor crashes before the reducer executors could get the data, all the shuffle data written by that executor is lost, and Spark needs to run the map tasks again on the remaining executors. This results in poor scaling.

Figure 2.4.3: Mapper executor terminates unexpectedly [29]

To solve the above issue, Spark introduced the external shuffle service.

In YARN and in Mesos, each node in the cluster can run a single shuffle service process.

Mapper executors no longer need to serve data themselves. After calculating their shuffle data, they still write it to the local disk, but they also inform the external shuffle service about the location of the files. Reducer executors, instead of talking to the mapper executors directly, talk to the external shuffle service that is colocated with the mapper executor to get the data to fetch and further process it, as shown in fig 2.4.4.

Now the mapper executor can exit the cluster for any reason: it can be preempted by YARN, removed by Spark in order to save resources, or it might crash due to an intermittent failure. In all these cases Spark does not lose the shuffle files written


by that mapper executor, as they will still be available on the node, and the reducer will be able to fetch them by contacting the external shuffle service using an RPC protocol. Spark can use this feature to deallocate executors when they are not performing any task and are not needed anymore.

Figure 2.4.4: Reducer executor getting information about shuffe data from External Shuffle service [29]

But this implementation of the external shuffle service does not work in Kubernetes, and the simple explanation for this is isolation. In K8s, we do not want multiple pods to share the same volume mount, for security purposes. In YARN, the mapper executors were writing data to the local disk, and the shuffle service colocated with each executor was reading the data from the same hard drive. This does not work in Kubernetes because no two pods share the same hard drive. If a mapper executor running in a pod writes data to its local disk, then the shuffle service running in another pod can not reach into the local disk of the mapper executor pod and grab the data to serve it to the reducer executor. For the same reason, the mapper executor pod can not simply push its data to the external shuffle service pod.

Figure 2.4.5: Isolation between Mapper executor and External Shuffle service in Kubernetes [29]

Finally, the K8s resource manager does not implement the external shuffle service, so it is not possible to use the Dynamic Resource Allocation feature.


2.5 HDFS

In recent years, the amount of data generated has been increasing exponentially. Companies want to store this data in a distributed manner so that they can process it and extract useful information in order to make services better and improve the user experience. To this end, Apache Hadoop [14] is the state-of-the-art software that provides many frameworks to store and process large amounts of data. Hadoop uses the Hadoop Distributed File System to store large amounts of data in a distributed manner. The stored data can then be processed using different frameworks like Hive, HBase, MapReduce, Spark etc. These frameworks are managed by Apache YARN, which acts as a resource manager and application scheduler for Hadoop clusters. HDFS is an open-source implementation of the Google File System (GFS).

2.5.1 HDFS architecture

HDFS also follows a master-slave architecture [23]. There is a master node called the NameNode. This node decouples the metadata from the data and is responsible for storing the metadata of the file system. Files in HDFS are split into blocks (approximately 256 MB) that are stored on another type of node called DataNodes. These blocks are replicated across the DataNodes for safety and high availability. The replication factor can be chosen by the user and is set to 3 by default. As the NameNode stores the metadata of the whole system, the clients and DataNodes always contact the NameNode (NN) for any action. Since the whole system is controlled by the NN, it can easily get overwhelmed and is a single point of failure. Hence there is a secondary NameNode, called the Standby NameNode (SbNN), that is always in sync with the primary NameNode, called the Active NameNode (ANN).

All the DataNodes are connected to both the Active NameNode and the Standby NameNode, but they report to the Active NameNode by default. They periodically generate a block report and send it to the Active NameNode so that it can validate the metadata about which blocks are stored on which DataNode. Moreover, DataNodes also send periodic heartbeats to the NameNode to confirm that they are still active. If the NameNode does not receive heartbeats from a DataNode for some specific time because that node failed, then for each block on the failed DataNode, the NameNode will ask the surviving DataNodes that store copies of the block to replicate it onto some other alive DataNode. In case the Active NameNode fails, the Standby NameNode


becomes the Active NameNode, and both DataNodes and clients start contacting the new Active NameNode. The ZooKeeper coordination service is responsible for coordinating the failover from the active to the standby NameNode and for deciding on which machine the Active NameNode runs.

Figure 2.5.1: HDFS Architecture [23]
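The failure handling described above can be sketched as follows. This is a simplified Python model of block re-replication, assuming the default replication factor of 3; the node names and data layout are invented for illustration and this is not HDFS code:

```python
# Model: each block maps to the set of Datanodes holding a replica.
# When a node's heartbeats stop, the Namenode re-replicates its blocks.
REPLICATION = 3

def handle_datanode_failure(block_locations, failed, alive):
    for block, holders in block_locations.items():
        holders.discard(failed)
        # Ask surviving holders to copy the block to other alive nodes.
        candidates = [n for n in alive if n not in holders]
        while len(holders) < REPLICATION and candidates:
            holders.add(candidates.pop())
    return block_locations

blocks = {"blk_1": {"dn1", "dn2", "dn3"}, "blk_2": {"dn2", "dn3", "dn4"}}
blocks = handle_datanode_failure(blocks, failed="dn2",
                                 alive=["dn1", "dn3", "dn4", "dn5"])
print(all(len(h) == 3 for h in blocks.values()))   # True: replication restored
```

In the real system, the Namenode drives this from the block reports and heartbeats it receives, and the copy itself happens between Datanodes.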

Whenever there is a request for data, it always goes through the client. The client first contacts the NameNode to get the location of the block on the DataNode where the data is stored. After getting the location, the client directly contacts the DataNode to access the data.

The NameNode implements operations atomically by taking a global lock on the entire file system, providing single-writer, multiple-reader access to the system. Some operations, like deleting large directories, that would require holding the lock for too long are not atomic. Locking the whole system for a long time can starve the clients and adversely affect the performance of the file system. The changes made to the system are written by the Active NameNode and are asynchronously logged to the Standby NameNode. Writing logs synchronously while holding a lock on the entire file system could also starve clients and hence affect performance. Of course, this comes at a cost: if the Standby NameNode also dies


along with the Active NameNode before the changes were asynchronously logged to the former, then those changes will be lost.

The size of the metadata is very small compared to the actual size of the data.

Generally there is 1 GB of metadata for every 15 petabytes of file system data. Companies nowadays generate very high amounts of data every day, and with the advancement of network technologies and according to current trends, the volume of data generated will be even higher in the future. The current implementation of HDFS is not capable of storing and dealing with this volume of data: the single-NameNode architecture is a major scalability bottleneck for the Hadoop system.

2.6 Network DataBase

In a shared-nothing architecture, the computing nodes are distributed and interconnected, but they work independently with their own memory and disk storage, which are not shared. A shared-nothing architecture eliminates the presence of a single point of failure and offers greater scalability by adding as many nodes as the user wants. The data is distributed and partitioned across different nodes that run on different machines. This process, in which data is partitioned horizontally, is called sharding.

MySQL Cluster is a shared-nothing, replicated, in-memory, auto-sharding, consistent, NewSQL relational database [22]. Network DataBase (NDB) is the storage engine for MySQL Cluster. The network database model was designed to overcome some of the shortcomings of hierarchical database models, more specifically their lack of flexibility. It is based on the idea that multiple member files or records can be linked to multiple owner files and vice versa. As opposed to hierarchical models, which allow only a single parent for a record, network databases allow a record to have more than one parent.

In NDB, the Datanodes are divided into node groups. The number of node groups is determined by the replication factor R: for a given set of N nodes, there will be N/R node groups. NDB partitions the data and distributes the partitions across the node groups. The default replication factor is 2, and each Datanode in a group holds a complete copy of the group's data. Hence, with R = 2, each node group can handle at most one node failure. For example, if there are 12 nodes, they will be divided into 6 node groups, and the system can


handle a failure of 6 nodes as long as there is one surviving node in every node group.

The replication factor can be increased in order to sustain multiple node failures within a node group.
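The node-group arithmetic above can be expressed directly. Below is a minimal Python sketch of the idea (illustrative only, not NDB code; the node names are made up):

```python
# For N data nodes and replication factor R, NDB forms N // R node groups.
# The cluster survives as long as every group keeps at least one live node.
def node_groups(nodes, r):
    return [nodes[i:i + r] for i in range(0, len(nodes), r)]

def cluster_alive(groups, failed):
    return all(any(n not in failed for n in group) for group in groups)

nodes = [f"dn{i}" for i in range(12)]
groups = node_groups(nodes, r=2)
print(len(groups))                                            # 6 node groups
# One failure per group (6 failures in total) is survivable...
print(cluster_alive(groups, failed={g[0] for g in groups}))   # True
# ...but losing both nodes of a single group is not.
print(cluster_alive(groups, failed={"dn0", "dn1"}))           # False
```

This matches the example in the text: 12 nodes with R = 2 give 6 node groups, and up to 6 failures can be tolerated as long as no group loses all of its nodes.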

2.7 HopsFS

Hadoop Open Platform-as-a-Service (Hops) [24] is a new open-source distribution of Apache Hadoop that is based on a next-generation, scale-out distributed architecture for HDFS and YARN metadata. Hops uses HopsFS [23], a new implementation of the Hadoop file system, for storage. Hops is designed to overcome the scalability limitations of Hadoop: HopsFS is more scalable and flexible than HDFS and provides much higher performance. Hops can be scaled at runtime at both the metadata layer and the data layer by adding new nodes.

2.7.1 HopsFS architecture

Similar to HDFS, HopsFS has four types of nodes: Datanodes, Namenodes, clients and NDB database nodes. In HopsFS, the metadata is stored in an in-memory, shared-nothing, open-source database called NDB cluster. There are multiple Namenodes that all have access to this database, which makes the Namenodes stateless and helps to scale the whole file system. Hops reads and writes all metadata through this database.

HopsFS decouples data from metadata only for files that are larger than 64 KB; these are stored on the Datanodes. For files of 64 KB or less, the data is stored in the database along with the metadata. This speeds up operations on small files.
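The size-based routing described above amounts to a simple decision rule. The snippet below is an illustrative Python model of HopsFS's small-file rule; the storage target names are just labels, not the real subsystems:

```python
# HopsFS stores files of at most 64 KB together with the metadata in the
# NDB database; larger files are split into blocks on the Datanodes.
SMALL_FILE_LIMIT = 64 * 1024   # 64 KB

def storage_target(file_size_bytes):
    return "ndb-with-metadata" if file_size_bytes <= SMALL_FILE_LIMIT else "datanodes"

print(storage_target(4 * 1024))      # ndb-with-metadata
print(storage_target(256 * 1024))    # datanodes
```

Because reading a small file then needs only one round trip to the database instead of a Namenode lookup followed by a Datanode read, client latency for small files is reduced, which matters for the shuffle files used later in this thesis.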

As mentioned earlier, there are multiple Namenodes that are all active at the same time. They serve requests from different clients in parallel, but there are some housekeeping system operations that need to be coordinated in order to keep the file system in a stable state. Operations like finding dead Datanodes, replacing them and ensuring that data is properly replicated need to be coordinated. If this is not done, the system might end up in an unstable state or the integrity of the system might be lost. For example, the replication manager service keeps track of the replicated data. If any block is over-replicated, the replication manager will remove replicas until the required number of replicas is reached. If there are multiple replication managers running in parallel, there might be a conflict among them, resulting in the total removal of data from the system. Hence these housekeeping operations are managed by a leader Namenode. Unlike HDFS, which uses ZooKeeper for coordination services, HopsFS implements a leader election and group membership service by using the database as shared memory.

Figure 2.7.1: HopsFS Architecture [23]


Chapter 3

Design and Implementation

This chapter provides a detailed description of the architecture of the proposed solution and its implementation. The chapter further discusses how the solution is integrated into the existing architecture.

3.1 Proposed Solution

In order to address the issue of running the external shuffle service while running Spark on K8s, the Spark community came up with the idea of using a distributed storage layer or cloud storage as a replication layer for the shuffle files, asynchronously backing up the shuffle data to this storage layer [1]. The idea is that the mapper executor still writes its shuffle data to the local disk after processing. Once the map task is complete, it starts a separate thread that asynchronously backs up the data to the distributed layer. The backup task runs asynchronously because we do not want the map task to wait for the backup to complete before the start of the subsequent reduce stage.

The community has developed a pluggable interface [25] to read and write the persistent shuffle files to remote storage. This idea got big support from many contributors in the Spark open-source community, and a lot of progress has already been made. Further, this pluggable interface can be used with any file system that is an implementation of HDFS. The feature is expected to be part of an official Spark release in later versions. Keeping in mind the future support and the active participation and contribution of the community members, this idea was chosen as the solution


to the problem presented in this thesis work.

Fig 3.1.1 shows the design of a Spark application running on K8s. Here, each executor runs inside a Kubernetes pod. For simplicity, only one mapper and one reducer executor pod are shown. The executors processing the Spark job work in the following way:

1. After completing the map task, the mapper executor writes the shuffle data to its local disk.

2. The mapper executor also starts a separate thread that asynchronously backs up the shuffle data onto external storage.

3. During the reduce phase, the reducer executor tries to fetch the data from the mapper executor.

4. If the reducer executor fails to get the shuffle data from the mapper, it tries to fetch the data from the external storage.

Figure 3.1.1: Proposed Architecture
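The write path of the steps above can be sketched as follows. This is an illustrative Python model only: "local disk" and "external storage" are plain dictionaries standing in for the executor's disk and HopsFS, and the upload runs on a background thread just as the plugin backs up shuffle files asynchronously. It is not the plugin's code:

```python
import threading

local_disk = {}        # shuffle files on the mapper's local disk
external_store = {}    # backing store, e.g. HopsFS in the thesis setup

def map_task(shuffle_id, data):
    # Step 1: write the shuffle data to local disk as usual.
    local_disk[shuffle_id] = data

    # Step 2: back it up on a separate thread so the subsequent
    # reduce stage does not wait for the upload.
    def upload():
        external_store[shuffle_id] = data

    t = threading.Thread(target=upload)
    t.start()
    return t

backup = map_task("shuffle_0_map_0", b"partitioned bytes")
backup.join()          # only for this demo; the real reduce stage does not wait
print(external_store["shuffle_0_map_0"] == local_disk["shuffle_0_map_0"])   # True
```

Steps 3 and 4, the reducer's local-first fetch with a remote fallback, are discussed with the Reader classes in section 3.3.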

3.2 Choosing the file system

The idea of asynchronously backing up shuffle files needs the support of an external file system where the data can be stored. In order not to compromise


the performance, it is important that the chosen file system provides minimum latency. Spark is compatible with Hadoop data and can read and write data to any file system that is an implementation of HDFS. Spark uses the Hadoop FileSystem API to access data from various file systems. HopsFS, an implementation of HDFS, is highly scalable and provides all the features of HDFS. Moreover, as mentioned before, HopsFS has special support for small files (less than 64 KB in size) and provides quick storage and retrieval of these files. Hence HopsFS was a more favourable choice than HDFS. For similar reasons, cloud storages like Amazon S3 and GFS were not chosen. Therefore, HopsFS was chosen as the file system to be used as the backing store.

3.3 Async-shuffle-upload API Design

As the API is still in the development phase, some modifications were required to the API code to make it compatible with the Hopsworks distribution of Spark [20] in order to use HopsFS as external storage. Moreover, some modifications were also made to the Spark code itself [31]. As an outcome, the new API performs in the following way:

• Keeps the current implementation that stores the shuffle files on the local disk of the executor.

• Extends the current implementation to asynchronously back up shuffle files to a distributed file system.

• Writes shuffle files to an external server; unlike the external shuffle service, this server does not have to be co-located with the executors on the worker nodes.

• Reducer executors first try to fetch the shuffle data from the mapper executors. If they are unable to contact the mapper, they contact the external shuffle file server to fetch the data.

In the following section, we will briefly discuss the implementation of the async-shuffle-upload API. To keep things simple and clear, we will talk only about specific classes and interfaces that are important from the point of view of this thesis.

Overall, the project includes five submodules:

1. API


2. Core

3. Immutables

4. Scala

5. Scala test utils

API: The classes in the API submodule take care of variables like the credentials and endpoint of the server that will be used as a backing store for shuffle files. These credentials are set as configuration on the Spark application when it starts running. This submodule also provides special support for using Amazon S3 as a backing store via the S3A plugin.

Core: The next important module is the core submodule, which is the heart of this API. It has several packages inside it.

• Client package: Inside the Client package, the class BaseHadoopShuffleClientConfiguration acts as an extractor for configuration related to the file system that will be used as a backing store, specified via SparkConf. It checks the base URI to extract the scheme of the file system. Any file system that is an implementation of the Hadoop file system can be used as a backing store. For example, to use HDFS as a backing store, the base URI will be set like ‘hdfs://’. Similarly for S3, the base URI can be set as ‘s3a://’.

Another important interface in this package is ShuffleClient, which is responsible for providing the functionality for storing and retrieving shuffle file data from remote storage. The Client package further includes two more packages.

Further, this module takes care of merging local and remote shuffle files when needed, which is usually the case when some shuffle files can be found locally and others need to be fetched from external storage.

• IO: Inside the IO package, one of the important classes is HadoopAsyncShuffleDataIO. It is the root of this plugin and implements the asynchronous backing up of shuffle data. It also delegates shuffle IO operations to the local-disk implementation. By delegating these operations, the API keeps the current implementation that stores the shuffle files on the local disk of the executor, achieving the backup functionality without changing the existing IO implementation.


The class HadoopAsyncShuffleDriverComponents takes care of the driver process. One of the important tasks done by this class is to determine whether a block needs to be recomputed in case it is also not available on the backing store. Furthermore, the class HadoopAsyncShuffleMapOutputWriter implements the ShuffleMapOutputWriter interface already present in Spark: it delegates the task of writing partitions to local disk, then kicks off a thread that asynchronously backs up the shuffle data and index files to remote storage.

• Reader: Classes in this sub-module are responsible for fetching groups of blocks from other executors and, for blocks that cannot be fetched from local storage, from remote storage. The iterator first tries to fetch as many blocks as possible from the local disk. In case of a fetch failure, it checks whether the remaining blocks can be fetched from remote storage, and if so, initiates the fetch.
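The local-first fetch with a remote fallback can be sketched as follows; the block IDs, method name and in-memory maps are illustrative, not the plugin’s real API:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.NoSuchElementException;

public class FallbackFetchSketch {
    // Try the local/executor storage first; only fall back to the remote
    // backing store when the block is missing locally. If neither copy
    // exists, the block has to be recomputed.
    static byte[] fetchBlock(String blockId,
                             Map<String, byte[]> local,
                             Map<String, byte[]> remote) {
        byte[] bytes = local.get(blockId);   // 1. local disk first
        if (bytes != null) return bytes;
        bytes = remote.get(blockId);         // 2. fall back to remote storage
        if (bytes != null) return bytes;
        // 3. signal that recomputation is needed
        throw new NoSuchElementException("block " + blockId + " must be recomputed");
    }

    public static void main(String[] args) {
        Map<String, byte[]> local = new HashMap<>();
        Map<String, byte[]> remote = new HashMap<>();
        local.put("shuffle_0_0_0", new byte[]{1});
        remote.put("shuffle_0_1_0", new byte[]{2, 3}); // lost locally, but backed up
        System.out.println(fetchBlock("shuffle_0_0_0", local, remote).length);
        System.out.println(fetchBlock("shuffle_0_1_0", local, remote).length);
        // prints 1 then 2
    }
}
```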

The rest of the modules provide functionality that wires up with the modules discussed above to form a fully working API. As mentioned before, to keep things simple and clear, we leave them out of the discussion, as they are not essential for understanding the functionality.

3.4 Using async-shuffle-upload API in a Spark job

While running a Spark job on K8s, the spark-submit script provided with the Spark package is used to submit the job. Different configurations can be specified when using the spark-submit script. To make the Spark application back up the shuffle files on external storage, we need to set two configurations when running spark-submit.

1. spark.shuffle.sort.io.storage.plugin.class.v2 and set its value to org.apache.spark.palantir.shuffle.async.io.HadoopAsyncShuffleDataIo

2. spark.shuffle.hadoop.async.base-uri and set its value to uri://IP:PORT/path_to_folder

Here, uri is the URI scheme of the file system used for storing shuffle files, IP is the address of the server where the file system is hosted, listening on port PORT, and path_to_folder is the location on the file system where the user wants to save the shuffle files. An example might look like this: hdfs://10.0.2.15:8020/shivam/


The first configuration tells Spark to use the plugin that will asynchronously back up the shuffle files. In the second configuration, the user provides the base URI of the file system being used and the location where the shuffle files should be stored.

In the above example, hdfs is the base URI scheme, and the shuffle files will be stored under the folder ‘/shivam’ on the HDFS instance running at 10.0.2.15 and listening on port 8020. If the user does not provide a base URI scheme, Spark reverts to normal execution and does not store any shuffle files on external storage.
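Putting the two configurations together, a spark-submit invocation might look like the following sketch; the master URL, container image and application jar are placeholders, and the HDFS path is the example used above:

```shell
# Sketch of a spark-submit invocation with the async-shuffle-upload plugin
# enabled; <k8s-apiserver>, <spark-image> and <app>.jar are placeholders.
spark-submit \
  --master k8s://https://<k8s-apiserver>:6443 \
  --deploy-mode cluster \
  --conf spark.kubernetes.container.image=<spark-image> \
  --conf spark.shuffle.sort.io.storage.plugin.class.v2=org.apache.spark.palantir.shuffle.async.io.HadoopAsyncShuffleDataIo \
  --conf spark.shuffle.hadoop.async.base-uri=hdfs://10.0.2.15:8020/shivam/ \
  local:///opt/spark/examples/jars/<app>.jar
```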

3.5 Shuffle files generation format

The following section describes how the shuffle files are stored on the external file system. Files stored on HDFS that are accessed by Spark for processing are stored in blocks. During testing, it was observed that for each block of the file, two files were generated during the shuffle after the map task: one holding the data and the other holding the metadata of the data file.

Figure 3.5.1: Shuffle files storage format in external file system

Figure 3.5.1 shows the shuffle files generated during one of the conducted tests. The file being processed was stored on HDFS in three blocks. Hence, after the map phase, three folders (0, 1, 2) were generated during the shuffle, one for each block of the file. On expanding these folders, we can see that two files are generated per HDFS block: for block 0, for instance, one file is 0.data, storing the data, and the other is 0.index, storing the metadata for the data. This is also how Spark stores the shuffle files on local storage while processing data during normal execution. Storing the data on the external file system in the same way eliminates the need to introduce any additional complexity before the reduce phase.
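Assuming the layout above, the backed-up files could be inspected with a standard HDFS client; the base URI and directory are taken from the earlier example and are illustrative:

```shell
# Recursively list the backed-up shuffle files (base URI and directory
# follow the earlier example and are illustrative).
hdfs dfs -ls -R hdfs://10.0.2.15:8020/shivam/
# For the three-block input described above, the listing contains three
# folders (0, 1, 2), each holding one <n>.data and one <n>.index file.
```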


Chapter 4

Experimental evaluation

This chapter presents the experimental setup for this thesis, including the testing scenarios, the datasets used, and the results. The results are then presented in a comparative framework.

4.1 Designing the testing scenarios

While introducing the async-shuffle-upload API in section 3.3, we said that during the reduce phase, Spark first tries to find the files on the local storage of the mapper executor. Only if it fails to find the shuffle files locally are they fetched from the backup file server, if available there. To measure the performance of the file system used as the backing store, we need to make the shuffle files on the mapper executor unavailable so that they are read from the backup storage. In our case, the backup storage is HopsFS. One possible solution is to kill the mapper executor pods after they complete the map task and the shuffle files have been backed up on the remote file system.

In that scenario, when the reducer executors start, they will fetch the files from remote storage.
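One way to realise this scenario manually is sketched below; the namespace is illustrative, while spark-role=executor is the label Spark on Kubernetes attaches to its executor pods:

```shell
# After the map stage has finished and the shuffle files are backed up,
# delete the mapper executor pods so the reducers must read from HopsFS.
kubectl get pods -n spark -l spark-role=executor
kubectl delete pod -n spark <mapper-executor-pod-name>
```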

In another scenario to study the performance of the file system, we want Spark to skip fetching files from local storage during the reduce phase and directly fetch them from external storage. Here, no executors are killed or preempted. Although this does not affect the time taken by the reducer to read files from external storage, it does affect the total time Spark takes to finish executing the application.

Hence, our test cases are divided into the following scenarios.
