Towards an S3-based, DataNode-less implementation of HDFS

DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2020

Towards an S3-based, DataNode-less implementation of HDFS

FRANCO JESUS CACERES GUTIERREZ

KTH ROYAL INSTITUTE OF TECHNOLOGY

SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE

Towards an S3-based, DataNode-less implementation of HDFS

FRANCO JESUS CACERES GUTIERREZ

Master’s Programme, Software Engineering of Distributed Systems, 120 credits

Date: December 8, 2020
Supervisor: Seif Haridi
Examiner: Jim Dowling

School of Electrical Engineering and Computer Science
Host company: Logical Clocks AB

Swedish title: Mot en S3-baserad implementering av HDFS utan DataNodes


Towards an S3-based, DataNode-less implementation of HDFS / Mot en S3-baserad implementering av HDFS utan DataNodes

© 2020 Franco Jesus Caceres Gutierrez


Abstract

The relevance of data processing and analysis today cannot be overstated.

The convergence of several technological advancements has fostered the proliferation of systems and infrastructure that together support the generation, transmission, and storage of nearly 15,000 exabytes of digital, analyzable data. The Hadoop Distributed File System (HDFS) is an open-source system designed to leverage the storage capacity of thousands of servers, and is the file system component of an entire ecosystem of tools to transform and analyze massive data sets. While HDFS is used by organizations of all sizes, smaller ones are not as well-suited to organically grow their clusters to accommodate their ever-expanding data sets and processing needs. This is because larger clusters are concomitant with higher investment in servers, greater rates of failures to recover from, and the need to allocate more resources to maintenance and administration tasks. This poses a potential limitation down the road for organizations, and it might even deter some from venturing into the data world altogether. This thesis addresses this matter by presenting a novel implementation of HopsFS, an already improved version of HDFS, that requires no user-managed data servers. Instead, it relies on S3, a leading object storage service, for all its user-data storage needs.

We compared the performance of both S3-based and regular clusters and found that such an architecture is not only feasible, but also perfectly viable in terms of read and write throughputs, in some cases even outperforming its original counterpart. Furthermore, our solution provides first-class elasticity, reliability, and availability, all while being remarkably more affordable.

Keywords

Hadoop distributed file system, HDFS, HopsFS, S3


Sammanfattning

Relevansen av databehandling och analys idag kan inte överdrivas.

Konvergensen av flera tekniska framsteg har främjat spridningen av system och infrastruktur som tillsammans stöder generering, överföring och lagring av nästan 15,000 exabyte digitala, analyserbara data. Hadoop Distributed File System (HDFS) är ett öppen källkodssystem som är utformat för att utnyttja lagringskapaciteten hos tusentals servrar och är filsystemkomponenten i ett helt ekosystem av verktyg för att omvandla och analysera massiva datamängder.

HDFS används av organisationer i alla storlekar, men mindre är inte lika lämpade för att organiskt växa sina kluster för att tillgodose deras ständigt växande datamängder och behandlingsbehov. Detta beror på att större kluster är samtidigt med högre investeringar i servrar, större misslyckanden att återhämta sig från och behovet av att avsätta mer resurser i underhålls- och administrationsuppgifter. Detta utgör en potentiell begränsning på vägen för organisationer, och det kan till och med avskräcka en del från att våga sig helt in i datavärlden. Denna avhandling behandlar denna fråga genom att presentera en ny implementering av HopsFS, en redan förbättrad version av HDFS, som inte kräver några användarhanterade dataservrar. Istället förlitar sig det på S3, en ledande objektlagringstjänst, för alla dess användardata lagringsbehov. Vi jämförde prestandan för både S3-baserade och vanliga kluster och fann att sådan arkitektur inte bara är möjlig, utan också helt livskraftig när det gäller läs- och skrivgenomströmningar, i vissa fall till och med bättre än dess ursprungliga motsvarighet. Dessutom ger vår lösning förstklassig elasticitet, tillförlitlighet och tillgänglighet, samtidigt som den är anmärkningsvärt billigare.

Nyckelord

Hadoop distributed file system, HDFS, HopsFS, S3


Acknowledgments

I would like to express my gratitude to my examiner Dr. Jim Dowling and my supervisor Dr. Gautier Berthou for their continuous attention, prompt feedback, and openness to address any matter at length. Additionally, I am grateful to Logical Clocks for providing all the necessary resources to carry out my research, and to my peers therein for always being willing to lend a hand. I am fortunate to have been given the opportunity to undertake such an interesting project in a place with some of the smartest people I have met.

Finally, but most importantly, I would also like to thank my parents, and Alexandra, my sister, for their unconditional support. Not only during the last two, distant years, but for as long as I can remember. There is not a shred of doubt in my mind that I would not be where I am without any of you. I owe you everything. Thank you.

Stockholm, December 2020 Franco Jesus Caceres Gutierrez


Contents

1 Introduction
   1.1 Background
   1.2 Problem
   1.3 Purpose
   1.4 Goals
   1.5 Delimitations
   1.6 Structure of the thesis

2 Background
   2.1 Distributed file systems
       2.1.1 The Hadoop Distributed File System
       2.1.2 HopsFS
   2.2 Cloud object storage
       2.2.1 Amazon Simple Storage Service (S3)
   2.3 Consistency and atomicity guarantees of HDFS and S3
   2.4 Related work

3 Implementation
   3.1 S3-based, DataNode-less architecture
       3.1.1 System throughput
       3.1.2 Availability and durability
       3.1.3 Scalability and elasticity
   3.2 Data architecture
       3.2.1 Metadata entity model
       3.2.2 Performance of metadata operations
   3.3 Application model
   3.4 Operations
       3.4.1 Create
       3.4.2 Append
       3.4.3 Open
       3.4.4 Concatenate
       3.4.5 Delete
       3.4.6 Truncate
   3.5 Addressing the mismatch of consistency and atomicity models
   3.6 Object consolidation

4 Evaluation
   4.1 Experiment Setup
       4.1.1 S3-based HopsFS cluster
       4.1.2 Regular HopsFS cluster
       4.1.3 Load-testing cluster
       4.1.4 Regular client
   4.2 HopsFS overhead
       4.2.1 Purpose
       4.2.2 Method
       4.2.3 Results and discussion
   4.3 Load testing
       4.3.1 Purpose
       4.3.2 Method
       4.3.3 Results and discussion
   4.4 Client-side trade-offs
       4.4.1 Purpose
       4.4.2 Method
       4.4.3 Results and discussion
   4.5 Storage cost comparison
       4.5.1 Purpose
       4.5.2 Method
       4.5.3 Results and discussion

5 Conclusions and Future work
   5.1 Limitations
   5.2 Future work

References


List of Figures

2.1 The architecture of HDFS
2.2 Architectures of HDFS and HopsFS
3.1 Architecture of an S3-based HopsFS. NameNodes respond to metadata requests from clients, and make delete requests to S3. Clients read/write user data from/to S3. *Optionally, NameNodes can also read/write user data from/to S3 if the consolidation feature is enabled—see Section 3.6.
3.2 Data model of files in HopsFS. The Inode table stores file and directory metadata, and file records have associated ordered Block records, which in turn have associated Replica records.
3.3 Data model of files in the S3-based HopsFS [1]. The inode table stores file and directory metadata, and file records have associated ordered Block records, which in turn have associated Replica records.
3.4 Anatomy of the S3 file and S3 object abstractions in the S3-based HopsFS.
3.5 Anatomy of the S3 object abstraction in the S3-based HopsFS. An example instance is shown for a non-CRC32-enabled configuration (a) and a CRC32-enabled configuration (b), both with a part size of 10 megabytes.
3.6 High-level diagram that depicts how writes work in the S3-based HopsFS. The writing happens via a Java OutputStream returned by the HDFS Client component when a create or append operation is invoked. Internally, the output stream uploads object parts to S3 in parallel with the data provided by the client process.
3.7 High-level diagram that depicts how reads work in the S3-based HopsFS. The reading happens via a Java InputStream returned by the HDFS Client component when an open operation is invoked. Internally, the input stream downloads the contents of the objects in order and in parallel, and makes them available for the client process to read.
3.8 High-level diagram that depicts how concatenations work in the S3-based HopsFS.
3.9 High-level diagram that depicts how truncations work in the S3-based HopsFS.
3.10 High-level diagrams that depict how the (a) cloud management mechanism works in the S3-based HopsFS. In it, a leader NameNode assigns NameNode workers tasks to (b) delete S3 objects and/or perform (c) object consolidation.
4.1 Write performance comparison between the S3-based HopsFS and raw S3 for various file sizes.
4.2 Read performance comparison between the S3-based HopsFS and raw S3 for various file sizes.
4.3 System throughput comparison between the S3-based HopsFS and the regular HopsFS across various write loads.
4.4 System throughput comparison between the S3-based HopsFS and the regular HopsFS across various read loads.
4.5 Comparison of client throughput achieved versus CPU usage across various client APIs.


List of Tables

4.1 Monthly S3 cost in U.S. Dollars to store 16 TB of user data
4.2 Monthly EC2 cost in U.S. Dollars to store 16 TB of user data across 3 DataNodes


List of acronyms and abbreviations

AMI Amazon Machine Image
API Application Programming Interface
AWS Amazon Web Services
CLI Command Line Interface
COS Cloud Object Storage
CRC Cyclic Redundancy Check
DAL Data Access Layer
DFS Distributed File System
EBS Elastic Block Store
EC2 Elastic Compute Cloud
ENA Elastic Network Adapter
HDFS Hadoop Distributed File System
ISP Internet Service Provider
JVM Java Virtual Machine
NDB Network DataBase
RAM Random Access Memory
RPC Remote Procedure Call
S3 Simple Storage Service
SDK Software Development Kit
VPC Virtual Private Cloud


Chapter 1

Introduction

Since the 1960s, the data storage technology industry has experienced one breakthrough after another in what constitutes a history nothing short of marvelous. For instance, the price per gigabyte of storage devices has gone from 1 million US dollars to as little as 0.02 cents [2]. At the same time, the areal density—i.e. bits per square inch—has made colossal leaps, going from 2,000 to 1,000,000,000,000 [2].

The evolution of physical data storage technologies has undoubtedly played a vital role in enabling the growth of what is known as the digital universe—i.e. literally every bit of digital data that is stored or transmitted in any shape or form—to a previously unfathomable 40,000 exabytes [3], thus ushering in the era of data. However, as noteworthy as the hardware achievements are, the data-driven world of today would not be possible without specialized software systems that allow organizations to store, process, and ultimately harness unprecedented volumes of information.

1.1 Background

Distributed File Systems (DFSs) are systems that transparently leverage the storage capabilities of multiple machines to give users the ability to store files in a hierarchical fashion using directories.

The Hadoop Distributed File System (HDFS) [4, 5] can easily store and serve dozens of petabytes worth of files across thousands of machines in a fault-tolerant way. Essentially, HDFS has two main high-level components: a NameNode and one or more DataNodes. The DataNodes store the actual contents of the files, while the NameNode stores the file system's metadata—i.e. the hierarchical structure of the file system and which DataNodes store which files.

HopsFS [6, 1] is a file system based on HDFS that overcomes its main scalability bottleneck—the fact that there is only one active NameNode with the limited capacity of a single machine. Instead of having one NameNode storing the entirety of the file system’s metadata, HopsFS supports multiple active NameNodes which offload the storage of metadata to an external common database.

Cloud object storage services allow users to store arbitrary blobs of data—i.e. the objects—in a non-hierarchical way via the Internet. Amazon's Simple Storage Service (S3) [7] is one of the most popular of such services. In it, users can create containers (called buckets) in which to upload, manage, and list objects in a key-value fashion.

1.2 Problem

HDFS was designed to support thousands of DataNodes running on commodity hardware, providing great capacity and reliability [5]. While this might be tenable for large companies with their own IT infrastructure teams, such setups are not feasible for smaller organizations, which naturally start out with more modest clusters. As their needs grow, these organizations have to incorporate more DataNodes, whose cost can be a restricting factor. What is more, larger clusters come with higher rates of failure and the need to allocate more resources to the maintenance of servers.

At the same time, HDFS is not well suited to efficiently handle varying levels of load. All DataNodes must be running even if only a small subset of data is being used. Furthermore, barring complex and significant modifications to the system [8], it is not possible to dynamically manage the replication of files in response to specific demand surges, which normally result in throughput degradation.

Finally, potential users need assurances when it comes to the safety of their data, and the ability to migrate out as smoothly as possible. If the metadata components of either HDFS or HopsFS are lost or corrupted, then the user data stored in DataNodes is rendered virtually useless. While both systems have mechanisms [9, 10, 11] to ameliorate the safety of their metadata, they usually involve more resources and complexity, and the risk is only lessened.

Regarding ease of migration, both systems would require users to go through ad hoc processes, which can be long and expensive. This could deter users from choosing either HDFS or HopsFS for fear of being locked into a system that might not end up fitting their use cases.


1.3 Purpose

The purpose of this thesis is to present a proof-of-concept implementation of HopsFS in which user data is stored/retrieved directly in/from S3 without the intervention of any DataNode whatsoever.

A DataNode-less HopsFS has several implications. First, the growth of file systems will not be burdened by a potentially costly increase in servers that need management and whose failures need to be dealt with. Second, having no DataNodes means that users could potentially be more financially efficient since they would be paying for storage only as opposed to storage plus servers. This is particularly relevant when one considers the fact that a single block of data is usually replicated across three servers, not to mention the load that the latter exert on the NameNode(s). Third, the elastic nature of S3 [12] will enable clusters to automatically and efficiently meet sudden surges in demand for any arbitrary subset of files. Finally, storing full and properly named files—i.e. with their full HopsFS path—in S3 means that users are free to stop using HopsFS at any point without having to migrate their data out of a block format, since it would already be in S3 to be trivially moved elsewhere.

The fulfilment of this thesis constitutes a step forward in providing small and medium-sized organizations with an affordable, easy-to-deploy, highly scalable and flexible, yet fully featured, version of HopsFS, thus making it more accessible and helping them better navigate and leverage our current data-rich world.

1.4 Goals

The goals of this degree project are: to design and implement the necessary modifications to the HopsFS NameNode and Client components so that a cluster can work fully on top of S3 while maintaining the HDFS Application Programming Interface (API) contract [13]; to provide an analysis of the trade-offs of the proposed architecture; and to present a comparison between an ordinary, cloud-based HopsFS cluster and an S3-based HopsFS cluster in terms of performance and storage cost.

1.5 Delimitations

The design and implementation of a system that integrates S3's access management and security features [14] with those of HopsFS/HDFS is not within the scope of this project. For the purpose of this thesis, both the NameNode and the clients have unrestricted access to S3.

Furthermore, the system presented in this work is not intended to be production-ready. Failure modes such as loss of connectivity amidst data transfers or corrupted states have not been exhaustively guarded against.

1.6 Structure of the thesis

Chapter 2 presents relevant background information about the systems to be used in this thesis, namely HDFS, HopsFS, and S3, and also the related work vis-à-vis implementing HDFS on top of a cloud object storage service.

The software and data architectures developed in this work are described in Chapter 3 along with an analysis of their properties. Chapter 4 outlines the experimental evaluation of the S3-based HopsFS and presents a discussion of its results. Finally, concluding remarks and possible directions of future work can be found in Chapter 5.


Chapter 2

Background

This chapter first provides fundamental background information about the category within distributed systems in which the proposed solution falls, namely distributed file systems. It thereafter delves into the original system upon which my work is based, HDFS. Furthermore, cloud object-storage services are described, with a particular emphasis on Amazon S3, the service used as the underlying storage layer for the implementation presented in this report. Finally, the chapter also introduces and describes HopsFS, and related work relevant to HDFS and its interaction with cloud object-storage services.

2.1 Distributed file systems

A DFS is one that provides the same services as those provided by the file storage capabilities (e.g. create, delete, random access, etc.) of an operating system, but also one that does so while existing beyond the constraints of the operating system of a single machine. Instead, a DFS functions on top of many interconnected and independent, but cooperating computers, called servers, which together provide file storage and access services to other computers, called clients [15]. Clients then interact with the DFS through a given API [16, 17, 18] as if they were doing so with a single, logical system.

Although a DFS allows leveraging the added storage resources of arbitrary numbers of computers, concerns such as fault tolerance, scalability, performance, concurrency, and transparency become more prominent, as is usually the case whenever one exits the realm of centralized systems [19]. As such, several DFSs have been developed over the decades, many with fundamentally different designs and implementations [20, 21, 17, 22, 23] focusing on maximizing certain qualities over others, depending on their target use case. Examples of areas in which such differences manifest themselves are path resolution, file caching schemes, statefulness of service, crash recovery procedures, file replication mechanisms and strategies, and operation semantics.

2.1.1 The Hadoop Distributed File System

HDFS [4, 5] was developed by Yahoo to meet their rapidly increasing need to store and process large amounts of data—up to 10 petabytes of total capacity across ten thousand nodes being the original requirement. Perhaps HDFS's most notable characteristic and selling point is its ability to grow its storage capacity and I/O bandwidth by just adding more commodity servers. With this, HDFS clusters are able to easily scale up to thousands of servers to serve petabytes of data to an even greater number of clients.

HDFS’s architecture (see Figure 2.1) decouples namespace—file system metadata such as file and directory information, permissions, and access times—from the actual stored data. This decoupling is evident in the two different types of servers: the NameNode, a centralized server which is in charge of maintaining the namespace and processing metadata-related requests; and the DataNodes, which are in charge of storing user data and processing data-access requests. Finally, clients access the file system via the HDFS Client library. All communication between components happens on top of the TCP/IP protocol usingRemote Procedure Calls (RPCs)[24].

Having a decoupled architecture to handle metadata operations and data access operations in separate nodes is motivated by the fact that the former are generally much faster and fewer than the latter—for instance, fetching the location of a file (a metadata request) versus reading 100 GB (multiple data access requests)—which means that if they were to be handled by a single type of node, then the data access requests would quickly bottleneck the metadata requests, and more so in highly distributed setups. Thus, having a dedicated namespace server allows metadata requests to remain fast, while also allowing user-data servers to scale out independently. This type of architecture is also present, albeit with vastly different underlying implementations, in many other DFSs such as the Google File System [22], Ceph [25], Lustre [23, 26], and CalvinFS [27].

HDFS provides availability and reliability of user data via file replication. Files are modelled as a sequence of blocks, each replicated several times across different DataNodes, where they are stored as regular files in the DataNodes' local, native file systems. File blocks are usually configured to be 128 MB in size and replicated 3 times—i.e. 3 copies in total—but both settings are user-definable for each file at the time of creation, and the replication factor can also be modified afterwards, which provides scalability since it can be raised to accommodate periods of high demand. With this overall strategy, the likelihood of losing the content of a block due to hardware failure—i.e. all DataNodes hosting the replicas of a given block failing beyond recovery—is greatly reduced. HDFS provides a default policy to determine how DataNodes are chosen as targets in which block replicas are to be stored, but also the option to define and use custom ones.
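As an illustration only (it is not taken from this thesis, and the path and values are hypothetical), the sketch below shows how these per-file settings are exposed through the standard org.apache.hadoop.fs.FileSystem client API, including the later adjustment of the replication factor:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationSettingsSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);

    // Create a file with a custom replication factor (2) and block size (64 MB).
    Path file = new Path("/user/example/data.bin");
    FSDataOutputStream out = fs.create(
        file,
        true,                 // overwrite if the file already exists
        4096,                 // client-side buffer size in bytes
        (short) 2,            // replication factor for this file
        64L * 1024 * 1024);   // block size for this file, in bytes
    out.write(new byte[] {1, 2, 3});
    out.close();

    // Unlike the block size, the replication factor can be changed afterwards,
    // e.g. to absorb a period of unusually high read demand for this file.
    fs.setReplication(file, (short) 5);

    fs.close();
  }
}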

The metadata stored by the NameNode consists of the hierarchy of files and directories that make up the file system. Both files and directories—logical containers of files and other directories—are tracked by the NameNode with inodes, which are data structures holding information such as permissions and modification and access times. Additionally, the NameNode keeps a mapping of file blocks to the DataNodes where their replicas are stored, so that it can inform HDFS clients of the physical locations of the files they need to read. To keep the response times of metadata operations—e.g. opening, closing, deleting, and renaming files—low, the NameNode maintains the entire namespace in Random Access Memory (RAM). However, it also persists essential information—such as a log of the operations it has executed—to its local file system for durability and robustness purposes—i.e. being able to return to the latest consistent state after restarts or failures. To counter failures for which recovery is impossible or impractical, HDFS can also run with multiple NameNodes at the same time, which work together by either using shared storage [9] or a distributed log of operations [28]. It is important to note that in both approaches, only one NameNode is active at any given time, and the others are standing by for failover. A solution called HDFS Federation [29] enables the concurrent operation of multiple active NameNodes in HDFS; however, each NameNode maps to a different namespace, so the result can be seen more as multiple HDFS clusters with shared DataNodes.

The mapping of file blocks to DataNodes kept by the NameNode is built from the block reports sent to it by the DataNodes. A block report is information about the block replicas that a DataNode stores, and is generally sent every hour. In addition to block reports, DataNodes also send frequent heartbeats to the NameNode to let it know they are operational, and if none are received from a particular DataNode after a configurable period of time, said DataNode, along with the replicas it hosts, is considered unavailable by the NameNode. Given that the NameNode does not directly initiate contact with DataNodes, it uses the replies to heartbeats to send commands to them to, for instance, remove local replicas or replicate them to other DataNodes—commands that are necessary to keep the file system as stable as possible.

To maintain the integrity of user data, HDFS uses checksums whenever such data is transmitted—i.e. when the HDFS client reads/writes data from/to a DataNode or when DataNodes transmit replicas among themselves. Additionally, DataNodes store a metadata file for each replica they host containing checksums for their blocks' data. Those checksums are used to periodically check for possible data degradation, in which case the DataNodes will report such occurrences to the NameNode for correction.
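For illustration only (this snippet is not from the thesis and simplifies the real mechanism, which checksums data in fixed-size chunks as it flows through the HDFS streams), the idea of detecting replica degradation with a stored checksum can be sketched with java.util.zip.CRC32:

import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class ChecksumSketch {
  // Returns true if the bytes still match the checksum recorded when they were written.
  static boolean verify(byte[] data, long recordedChecksum) {
    CRC32 crc = new CRC32();
    crc.update(data, 0, data.length);
    return crc.getValue() == recordedChecksum;
  }

  public static void main(String[] args) {
    byte[] replica = "contents of a block replica".getBytes(StandardCharsets.UTF_8);

    // What a DataNode conceptually persists alongside the replica.
    CRC32 crc = new CRC32();
    crc.update(replica, 0, replica.length);
    long recorded = crc.getValue();

    System.out.println(verify(replica, recorded));   // true: data intact
    replica[0] ^= 0x1;                               // simulate silent data degradation
    System.out.println(verify(replica, recorded));   // false: would be reported to the NameNode
  }
}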

Figure 2.1: The architecture of HDFS showing the interaction of its components and the replication of file blocks across several replicas in different DataNodes. A rack refers to a collection of nodes that share a switch [24].

2.1.2 HopsFS

HopsFS [1] is an open-source implementation of HDFS, meaning that regardless of differences in architecture or logic, it continues to provide the same functionality and strong consistency and atomicity guarantees exposed to external systems. This is significant since HDFS is a core component on top of which many other prominent data-processing frameworks or systems—e.g. MapReduce [30], HBase [31], Hive [32], and Pig [33]—are built. It was developed primarily to address the metadata bottleneck present in HDFS by allowing the concurrent operation of multiple active NameNodes as opposed to just one. With this, HopsFS is able to achieve no downtime during failover and up to 16 times greater throughput than HDFS for industry workloads, and even 37 times greater throughput for write-intensive workloads.

The key design decision behind HopsFS is the displacement of namespace metadata storage from the heap space of a single Java Virtual Machine (JVM) [34] instance to a separate, distributed database. This allows the NameNode implementation to be stateless and thus have multiple active instances processing metadata requests from clients or DataNode-initiated requests.

Network DataBase (NDB) is the storage engine of MySQL Cluster [11], an in-memory, distributed database with transactional consistency. These characteristics, among others, made NDB the choice for the external database with which NameNodes interact to store and manipulate namespace metadata.

Figure 2.2 depicts both HDFS's and HopsFS's architectures; most notable are the presence of multiple active NameNodes in HopsFS's architecture against the single one in HDFS's, and the external NDB cluster for metadata storage in HopsFS. It is important to note that, while NDB is the current engine, HopsFS can work with any storage engine that supports at least read-committed isolation levels [35].

HopsFS also brings to the table additional improvements in the realm of small files. Under real-world workloads, it provides up to 61 times greater throughput for write operations, and lower latencies for both read and write operations by factors of 3.15 and 7.39, respectively [37].

2.2 Cloud object storage

Cloud Object Storage (COS) refers to the data storage model in which users store and access data of arbitrary sizes and arbitrary types in a key-value fashion within a flat—i.e. non-hierarchical, contrary to regular file systems—namespace, through mostly RESTful [38] APIs on top of HTTP [39]. In this model, users first create containers, which represent independent namespaces in which objects can be stored. An object is any data that is uploaded to a namespace with a key that uniquely identifies it. In addition to their keys, objects can also have user-defined metadata attached to them. Objects within a namespace need not share any similarities in type or size.

Figure 2.2: Architectures of HDFS and HopsFS. HDFS appears in high-availability mode using a distributed log [28] with a single passive NameNode. HopsFS is shown with 4 active stateless NameNodes—of which one is the leader [36] for coordination and housekeeping purposes—that communicate with the NDB cluster for metadata storage and manipulation through a DAL driver. It is important to note that both HDFS and HopsFS clients are compatible with HopsFS, albeit with some caveats [1].

The most prominent platforms that offer COS services—e.g. Amazon S3 [7], Azure Blob Storage [40], IBM Cloud Object Storage [41], and Google Cloud Storage [42]—do so while advertising virtually unlimited storage—i.e. an unlimited number of objects per container—maximum scalability, availability, and durability. These attributes make COS a popular choice for archiving data, storing backups, or acting as the storage solution for cloud-native applications.

Object-based storage—of the non-cloud variety—originated as a proposal to enable the creation of high-performance, highly scalable, and secure data-sharing systems, which proved difficult to build due to the limitations posed by block-based interfaces—i.e. interfaces that abstract storage as a simple collection of addressable blocks [43, 44]. Instead of interacting directly with block-storage devices, systems interact with object-storage devices. Object-storage devices provide a much more sophisticated interface that allows clients to create and manipulate uniquely identifiable objects of arbitrary size along with additional, optional metadata. The object-storage device itself handles the management of its internal space, and also keeps per-object system metadata for proper maintenance. This means that, for instance, file systems relying on object-based storage need not concern themselves with tracking which blocks make up which user files, and could simply have a one-to-one mapping of files to object identifiers given that objects can also grow and shrink as requested. What is more, unlike block-storage devices, object-storage devices are capable of enforcing security policies themselves, which effectively removes the need for trust in clients that attempt to access user data. Examples of relevant file systems that rely on object-based storage are Lustre [23] and PanFS [45].

2.2.1 Amazon Simple Storage Service (S3)

S3 [7] is Amazon's object storage service, and one of the most popular storage options available in Amazon Web Services (AWS) [46] due to its ease of use, industry-leading availability and scalability, and a plethora of additional features—e.g. object versioning, protection against accidental deletions, in-place querying for analytics, and full integration with other AWS products for monitoring, budgeting, security, and others [14]. Most notably, S3 provides 99.999999999% durability, which means that "if you store 10,000,000 objects with Amazon S3, you can on average expect to incur a loss of a single object once every 10,000 years" [12, 47]. These facts, coupled with AWS having a major market share [48], make S3 a reasonable choice for users.
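The quoted figure can be sanity-checked with a simple back-of-the-envelope calculation (assuming independent objects and an annual loss probability of $1 - 0.99999999999 = 10^{-11}$ per object):

\[
\underbrace{10^{7}}_{\text{objects}} \times \underbrace{10^{-11}}_{\text{annual loss probability}}
= 10^{-4}\ \text{objects lost per year}
\quad\Longleftrightarrow\quad
\text{one object lost every } 10^{4} = 10{,}000\ \text{years.}
\]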

S3 allows users to create buckets—i.e. containers—in which they can store a virtually infinite number of immutable objects, each as large as 5 terabytes [12], and attach any user-defined metadata to them. Users can create and manage buckets and objects through S3's RESTful APIs, through the AWS SDK [49], or through other tools provided by Amazon. Finally, the costs incurred by users are proportional to the amount of data that they store, how often it is accessed, and in which ways [50].
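As a minimal sketch of this key-value interface (the bucket name, key, and metadata below are made up, and the AWS SDK for Java v1 is assumed since it is the SDK used elsewhere in this work), creating a bucket, uploading an object with user-defined metadata, and retrieving it look roughly as follows:

import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.ObjectMetadata;
import com.amazonaws.services.s3.model.PutObjectRequest;
import com.amazonaws.services.s3.model.S3Object;

public class S3BasicsSketch {
  public static void main(String[] args) throws Exception {
    AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();   // credentials and region from the environment

    s3.createBucket("example-bucket");                     // the container (namespace) for objects

    // Upload an object under a key, attaching optional user-defined metadata.
    byte[] payload = "hello object storage".getBytes(StandardCharsets.UTF_8);
    ObjectMetadata metadata = new ObjectMetadata();
    metadata.setContentLength(payload.length);
    metadata.addUserMetadata("origin", "example");
    s3.putObject(new PutObjectRequest(
        "example-bucket", "user/foo.txt", new ByteArrayInputStream(payload), metadata));

    // Retrieve it back by bucket + key.
    S3Object object = s3.getObject("example-bucket", "user/foo.txt");
    System.out.println(object.getObjectMetadata().getUserMetadata());
    object.close();
  }
}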

S3 is a feature-rich service with a much more complex model than what is described in this section. For example, buckets can be stored in different geographical locations for further reliability [51], different storage classes of various prices are available for objects to be stored in depending on how frequently they are accessed [52], and access control lists and bucket policies can be created to enforce fine-grained authorization schemes. However, these are beyond the scope of this report.

2.3 Consistency and atomicity guarantees of HDFS and S3

HDFS’s goal is to expose a standard (POSIX [16]) file system API to be used by many concurrent and distributed clients. As such, HDFS’s provides one-copy-update semantics [13], which means that after an operation—i.e.

file creation, file update, file deletion, file or directory rename—completes,

(29)

12 | Background

the immediate new state of the file system and its contents should be visible and available to every subsequent client request. For instance, once a client closes a brand new file, any other client should be able to open said file, and every listing request for the file’s directory should include it as well. For our purposes, this can be thought of as strong consistency. Regarding atomicity, HDFS ensures that creating or overwriting a file, deleting a file, renaming a file or a directory, and creating a directory, all happen in a single atomic operation—i.e. either all sub-operations succeed, or none of them do. This is important because it guarantees that failures do not leave the file system in potentially inconsistent state.

On the other hand, S3's goal is to provide maximum durability of the data it stores over everything else. This is evident from the aforementioned claim of only losing one object every 10,000 years, whereas the authors of HDFS expected a loss of one file block every 200 years [4]. To that end, and to also keep latencies as low as possible and throughput as high as possible, S3 offers consistency guarantees that are far more relaxed than those of HDFS. S3 provides read-after-write consistency for the creation of new objects, which means that immediately after the creation of a new object succeeds, any client can expect to be able to retrieve it [53]. A significant caveat to this is that if an object key is used to retrieve an object before the object actually exists, then S3 will no longer guarantee read-after-write consistency for that key. For object overwrites and deletions, and bucket—i.e. object container—listings, only eventual consistency is guaranteed, which is also the same level of consistency guaranteed for those objects affected by the aforementioned read-after-write caveat. This means that after altering the state of a bucket by adding, updating, or removing objects, subsequent listing requests or requests to retrieve the contents of an object are not guaranteed to return the most up-to-date information. This is because the propagation of changes across S3's infrastructure happens asynchronously, and thus portions of that infrastructure can return stale information until they are caught up. As for atomicity, S3 only guarantees that after updating an object, either the old data or the new data will be returned upon retrieval, never partial or corrupted data. Therefore, unlike HDFS, for example, it does not provide a way to perform updates on multiple objects in a single, atomic operation.
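To make the caveat concrete, the hypothetical sequence below (made-up bucket and key, AWS SDK for Java v1) reflects the consistency model described above: because an existence check is issued before the object is first created, the key is downgraded to eventual consistency, and reads or listings issued after the successful PUT may transiently miss the object:

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

public class ReadAfterWriteCaveatSketch {
  public static void main(String[] args) {
    AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

    // A HEAD/GET issued while the object does not yet exist (i.e. a 404)...
    boolean existedBefore = s3.doesObjectExist("example-bucket", "user/new-file.txt");

    s3.putObject("example-bucket", "user/new-file.txt", "contents");

    // ...means read-after-write consistency no longer applies to this key:
    // this check, or a bucket listing, may still report the object as missing
    // for a while even though the PUT above has already succeeded.
    boolean visibleAfter = s3.doesObjectExist("example-bucket", "user/new-file.txt");
    System.out.println(existedBefore + " -> " + visibleAfter);
  }
}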

2.4 Related work

As mentioned in Section 2.1.2, HopsFS is an implementation of HDFS in the sense that systems that rely on its functionality are not aware that they are interacting with a modified version of HDFS. In other words, HopsFS and any other implementation of HDFS must provide the exact same functionality as HDFS along with all of its guarantees, both in terms of consistency and atomicity. However, HDFS also provides the option for other existing systems to provide an HDFS-like interface by means of extending the abstract Java class [54] org.apache.hadoop.fs.FileSystem [55], though in such cases users should not take for granted that all functionality is available nor that the consistency and atomicity guarantees are those of HDFS.
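To make the extension point concrete, the skeleton below sketches what such an extension has to supply (assuming a Hadoop 2.x/3.x classpath; the exact set of abstract methods varies slightly between versions). Nothing in this contract by itself imposes HDFS's consistency or atomicity semantics on the implementations of these methods:

import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;
import org.apache.hadoop.util.Progressable;

// Hypothetical connector skeleton: each method below is abstract in FileSystem,
// and an object-store-backed connector chooses its own semantics for each.
public class ExampleObjectStoreFileSystem extends FileSystem {

  @Override public URI getUri() { return URI.create("example://bucket"); }

  @Override public FSDataInputStream open(Path f, int bufferSize) throws IOException {
    throw new UnsupportedOperationException("read not implemented in this sketch");
  }

  @Override public FSDataOutputStream create(Path f, FsPermission permission, boolean overwrite,
      int bufferSize, short replication, long blockSize, Progressable progress) throws IOException {
    throw new UnsupportedOperationException("create not implemented in this sketch");
  }

  @Override public FSDataOutputStream append(Path f, int bufferSize, Progressable progress)
      throws IOException {
    // Object stores are typically append-less; real connectors often just throw here.
    throw new UnsupportedOperationException("append not supported");
  }

  @Override public boolean rename(Path src, Path dst) throws IOException { return false; }

  @Override public boolean delete(Path f, boolean recursive) throws IOException { return false; }

  @Override public FileStatus[] listStatus(Path f) throws IOException { return new FileStatus[0]; }

  @Override public void setWorkingDirectory(Path newDir) { /* no-op in this sketch */ }

  @Override public Path getWorkingDirectory() { return new Path("/"); }

  @Override public boolean mkdirs(Path f, FsPermission permission) throws IOException { return true; }

  @Override public FileStatus getFileStatus(Path f) throws IOException {
    throw new IOException("getFileStatus not implemented in this sketch");
  }
}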

When it comes to efforts involving modifications to HDFS, most research seems to be focused both on improving the efficiency of the handling of small files, and on presenting alternatives to the default replica management logic to increase the availability and performance of an HDFS cluster. Improving the efficiency of how small files are handled in HDFS is arguably necessary because every single one of them takes up as much metadata space as a much bigger file, which is a concern when there are very limited resources in the NameNode, which can only scale up, unlike the DataNodes, which can scale out. To resolve this, various approaches have been proposed, and they generally work by merging smaller files into a single big HDFS file, and then tracking them individually within the latter via an additional index file [56, 57, 58, 59, 60, 61]. In the area of replica management, notable examples include a replica placement policy to evenly assign DataNodes for the creation of new blocks, which removes the need for HDFS rebalancing [62] while maintaining reliability requirements [63]; a replica management system [8] to quickly adjust the replica count of blocks depending on how frequently they are being accessed, thereby improving the elasticity of HDFS; a replica placement policy to address the reliability and performance issues raised by running clusters on physically co-located virtual machines in the cloud [64]; and modifications to the HDFS balancer [62] itself to prioritize availability during rebalancing, thus improving fault tolerance [65].

Extensions of the org.apache.hadoop.fs.FileSystem class are known as connectors. Relevant connectors exist for cloud object storage services such as S3 [66] and Azure Blob Storage [67]. However, neither provides the full functionality of HDFS—for instance, S3's connector does not allow for file appends and the one for Azure Blob Storage does not allow for file concatenations—or consistency and atomicity guarantees—for example, S3's connector could return an incomplete list of the files in a directory.

The connectors exist and are used because there are many use cases that do not require the full set of features and attributes of HDFS—e.g. storing and accessing unique files. However, the consistency and atomicity requirements are imperative to guarantee correctness in, for instance, MapReduce [30], Hive [32], and Spark [68]. To overcome the limitations of S3's eventual consistency model, and thus be able to leverage its massive scale in workloads that require stronger consistency, solutions like Netflix's S3mper [69], Amazon's EMRFS [70, 71, 72, 73], and S3Guard [74] introduce the use of an additional strongly consistent data store, such as DynamoDB [75], to store and manage extra file system metadata, thereby enabling consistent results. Another solution is presented in a connector developed by IBM called Stocator [76], which relies on a sophisticated file naming scheme to circumvent the issues of S3's eventual consistency without the need for an additional metadata store, thus reducing complexity. The aforementioned improvements notwithstanding, there is still not a solution that fully implements the HDFS API with its required consistency and atomicity guarantees.

In 2019, Stenkvist [77] presented a modified version of the HopsFS DataNode to store block replicas in S3 rather than in local storage—specifically, the write process involves first writing the incoming block from the client to a DataNode's local storage, and only after the file is closed will the DataNode upload it to S3 and then delete it from local storage—the block could subsequently be downloaded again to DataNodes to be streamed back to clients during reads. This solution provided most of HDFS's functionality—deleting files is not supported—and guaranteed consistency and atomicity by leveraging the consistent metadata layer of HopsFS. However, it also featured scenarios in which the eventual consistency of S3 would degrade performance in HopsFS due to having to wait for blocks known to exist to become accessible—i.e. in some situations, requests for an object would be issued before the object actually existed, thus nullifying the read-after-write consistency of new objects. While Stenkvist showed that an S3-backed HopsFS was seemingly feasible, with the added benefit of lower financial costs, the performance of their solution put the scale back into balance when compared to a regular HopsFS setup. Their evaluation showed that both write and read throughputs in the S3-based HopsFS setup were consistently lower across all workloads than those of the regular HopsFS setup, and this was particularly notable in the read scenarios. Their results make sense since, while both setups rely on DataNodes to serve/receive data to/from clients, the S3-based HopsFS setup incurs additional—and significant—latencies due to DataNodes having to transfer/stream data to/from S3 on top of doing so in their local storage.


Chapter 3

Implementation

This chapter will delve into the core details of the work presented in this report. In Section 3.1, it will go through the architectural changes made to HopsFS and what those changes mean for various quality attributes of the system. After that, in Section 3.2, the modifications to the model at the metadata level will be described, as well as the application model shared between components. Afterwards, Section 3.4 will outline the implementation details of some of the most notable client-side operations of HDFS. The way in which the mismatch of the consistency and atomicity models between HDFS and S3 was addressed will be explained thereafter in Section 3.5. Lastly, Section 3.6 introduces an additional, optional process called object consolidation.

3.1 S3-based, DataNode-less architecture

The main contribution of this work is the removal of the DataNode component from HDFS/HopsFS and its replacement with the use of S3. This decision has various potential implications, primarily in terms of throughput, availability, scalability, and elasticity. As depicted in Figure 3.1, HDFS clients now perform reads and writes against S3 rather than the DataNodes of a cluster. Like in HDFS and HopsFS, clients still interact with the NameNode(s) to perform metadata operations. While in HDFS and HopsFS the NameNodes interact with the DataNodes to perform block operations, in the new architecture the NameNodes interact with S3 by making delete requests, and also optionally read and write requests. These last optional requests only happen if object consolidation is enabled, a feature which will be further described in Section 3.6.


Figure 3.1: Architecture of an S3-based HopsFS. NameNodes respond to metadata requests from clients, and make delete requests to S3. Clients read/write user data from/to S3. *Optionally, NameNodes can also read/write user data from/to S3 if the consolidation feature is enabled—see Section3.6.

3.1.1 System throughput

The proposed architecture affects system throughput both client-side and server-side. On either side, there is no inherent reason for which performance would be better or worse in either architecture. However, the factors that affect performance are different, with the common denominator being the fact that S3 is an external system over which users have very limited control. This contrasts with the presence of DataNodes in an HDFS cluster, over which users have complete control with regard to how many they run, and what hardware and platform software they run on.

Client-side, the throughput of interest is that of reads and writes of user data. In an S3-based architecture, the rate of transfer of such data will be a function of the overall network connectivity conditions between the clients and S3, and the computational performance of the clients, both in terms of clock speed and their aptitude for parallel processing.

The network connectivity conditions encompass mainly the bandwidth that is available between all the clients of a cluster and S3, plus the latency thereof. Throughput and latency can be maximized and minimized, respectively, by deploying an S3-based HopsFS cluster in an AWS Virtual Private Cloud (VPC) [78] located within the same region configured for S3 [51]. Therefore, users should note that both quality attributes will tend to deteriorate if clusters are deployed in a different AWS region from that of their S3 configuration, or if clients and NameNodes both run on-premise, in which case Internet Service Provider (ISP) limitations might come into play.

Unlike in a DataNode-based architecture, the processing power of the clients also appears to play a big role in the throughput of an S3-based cluster, at least in the present implementation. This work leverages the AWS Software Development Kit (SDK) for Java provided by Amazon [79], and therein the classes required to access AWS (S3 included) ostensibly perform extra computation that would otherwise not be incurred—regular HDFS/HopsFS clients rely on lower-level network calls that are naturally far more I/O-intensive than CPU-intensive. Furthermore, achieving high-throughput uploads and downloads from S3 is done by launching multiple, parallel requests, each downloading or uploading a given chunk of user data [80]. This means, for instance, that while a client might have an available network bandwidth of 10 Gbps, it is unlikely that it will be able to saturate it completely with a single connection to S3. To make up for this bandwidth underutilization, the client should instead use as many parallel connections as is optimal for whichever setup it is running on. From this, it is evident that the maximum throughput that clients can achieve is not only limited by the available network bandwidth, but is also directly proportional to the number of parallel connections to S3 that they can handle. Ultimately, this makes the computational power of the clients of the S3-based architecture a notable concern, which was previously not the case. On top of this, and to safeguard the integrity of the data that is received by S3, each chunk has to be uploaded with an attached MD5 [81] digest of its contents. This digest has to be created client-side, and thus it is additional computation that the client must perform. In practice, the effect of this extra calculation was found to be negligible, but it is worth pointing out since it is something else that is not a concern in HopsFS/HDFS—granted, the latter still perform cyclic redundancy checks, but so does the S3-based implementation.
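As a rough sketch of the kind of client-side parallelism involved (the bucket, key, file, and part sizes are illustrative, and this uses the SDK's stock TransferManager rather than the custom output stream developed in this work), the AWS SDK for Java v1 obtains high throughput by splitting a payload into parts that are uploaded concurrently on a thread pool:

import java.io.File;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.transfer.TransferManager;
import com.amazonaws.services.s3.transfer.TransferManagerBuilder;
import com.amazonaws.services.s3.transfer.Upload;

public class ParallelUploadSketch {
  public static void main(String[] args) throws Exception {
    AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

    // Multipart uploads are what let a single logical write use many parallel
    // connections; the client pays for this mostly in CPU (request signing and
    // per-part digests) rather than in raw network I/O alone.
    TransferManager tm = TransferManagerBuilder.standard()
        .withS3Client(s3)
        .withMultipartUploadThreshold(16L * 1024 * 1024)   // switch to multipart above 16 MB
        .withMinimumUploadPartSize(8L * 1024 * 1024)        // upload in 8 MB parts
        .build();

    Upload upload = tm.upload("example-bucket", "user/big-file.bin", new File("/tmp/big-file.bin"));
    upload.waitForCompletion();   // blocks until every part has been uploaded and the upload completed
    tm.shutdownNow();
  }
}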

Server-side, performance in an S3-based architecture can be boosted indirectly by the lack of internal load. As explained by Shvachko [5], the internal load of an HDFS cluster refers to the heartbeats and block reports that DataNodes issue to the NameNode(s) periodically. The internal load depends on variables such as the number of DataNodes, the frequency with which heartbeats and block reports are issued, and the number of blocks each DataNode hosts. This makes it impossible to even estimate the expected internal load for arbitrary setups. However, Shvachko [5] presents an expected load that takes about 30% of the NameNode's processing capacity, and this is for a cluster with 10,000 DataNodes providing a total of 60 petabytes of storage capacity, with each one of them sending heartbeats every 3 seconds and block reports every hour. Granted, the internal load is less of a problem in HopsFS since the client load is distributed across all NameNodes. However, being able to completely avoid dealing with internal load means that the cluster can scale better since an increasing number of DataNodes is no longer a concern.
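For a sense of scale (the 30% figure itself comes from [5]; the rates below are simply derived from the numbers just quoted), such a cluster generates on the order of

\[
\underbrace{\frac{10{,}000\ \text{DataNodes}}{3\ \text{s}}}_{\text{heartbeats}} \approx 3{,}333\ \text{messages/s},
\qquad
\underbrace{\frac{10{,}000\ \text{DataNodes}}{3{,}600\ \text{s}}}_{\text{block reports}} \approx 2.8\ \text{reports/s},
\]

with each block report listing every replica its DataNode hosts.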

File deletions in this S3-based implementation have an asynchronous part. Concretely, after a file is deleted at the metadata level, additional work still needs to be performed at a later stage. Although this will be explained in more detail in Section 3.4.5, it is important to note that the additional work required for this operation is carried out in concurrent, subordinate processes, one in each of the NameNodes. Essentially, these processes send the actual and final delete requests to S3 and wait for confirmation. While technically these requests do affect the NameNodes' throughput, they do so to a much more manageable degree. The asynchronous nature of these operations makes it possible for the system to execute them at whichever rate is optimal, which is not the case in HDFS nor in the regular HopsFS due to the need for an up-to-date view of the namespace and the overall preservation of its integrity—which is why it is recommended to keep heartbeats frequent even on large clusters [5]. Furthermore, while heartbeats and block reports are necessary even if there is no client activity, the asynchronous S3 operations only happen when there are deletions to perform, which means that the impact on the NameNodes' throughput is proportional to the number of scheduled deletions. This last point is relevant because, in the industrial workload used to test HopsFS [1], it is shown that deletions make up around 0.75% of the bulk of operations. All in all, it would be safe to assume that the asynchronous deletions would have a rather minuscule impact on server-side performance in comparison to the processing of heartbeats and block reports. What is more, such operations could even be performed by another system with access to the NDB data store—where all the information required to perform deletions is stored—thus eliminating entirely any impact on the NameNodes' throughput. Another option could be considering S3 deletions as tasks that the NameNodes enqueue on a low-latency message queue to be picked up by other arbitrary worker processes, possibly even outside the cluster.
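A hedged sketch of what such a subordinate deletion worker could issue is shown below (the bucket and keys are hypothetical; S3's batch delete call removes up to 1,000 keys per request, so scheduled deletions can be drained at whatever rate is convenient):

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.DeleteObjectsRequest;
import com.amazonaws.services.s3.model.DeleteObjectsResult;

public class DeletionWorkerSketch {
  public static void main(String[] args) {
    AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

    // Keys previously recorded in the metadata store as scheduled for deletion.
    DeleteObjectsRequest request = new DeleteObjectsRequest("example-bucket")
        .withKeys("user/foo.txt", "user/bar.txt", "tmp/old-data.bin");

    DeleteObjectsResult result = s3.deleteObjects(request);
    System.out.println("Deleted " + result.getDeletedObjects().size() + " objects");
  }
}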

Object consolidation, though optional, also makes a dent in server-side performance in a way similar to that of file deletions. This feature will be explained in more detail in Section 3.6, but for this section it is important to note that it involves performing asynchronous and periodic requests to S3. Unlike file deletions, object consolidation performs expensive data streaming from and to S3, rather than only one-off metadata requests—i.e. deletion requests. Although object consolidation tasks are also spread across the available NameNodes in a cluster, they pose a bigger burden on performance due to greater use of bandwidth. Therefore, consolidation frequency and bandwidth usage should be adjusted with care. Just like file deletions, though, because all the data required to perform object consolidation is available from the metadata store in NDB, the actions could easily be performed by a system external to the NameNodes but with access to NDB, thus nullifying its impact on server-side performance.

3.1.2 Availability and durability

It can be argued that availability and durability are closely linked in a system of this type, be it the S3-based one or the regular one. This is because if either S3 or the DataNodes are unavailable, it is as if the data were lost. At the same time, if some data is lost, it is as if the services were not available to serve it. Of course, temporary unavailability does not imply permanent loss of data, nor does permanent loss of data imply temporary unavailability, but given that the purpose of the systems is primarily to serve data, a case can be made for discussing them together.

An S3-based HopsFS is subject both to the availability of the NameNodes and the metadata store, and to that of S3. However, because the former two components are also present in a DataNode-based setup, it makes sense to compare availability and durability only in terms of the underlying storage of user data.

A DataNode-based setup of HopsFS can have varying levels of availability and durability depending on the available resources and user preference. For example, a 50-petabyte cluster can support a given number of files with a certain degree of availability achieved by a determined replication degree. However, the same cluster could increase availability by increasing the replication degree, but this would mean that fewer files could be stored in total. Conversely, the same cluster could support more files if the replication degree were to be reduced, but this would consequently weaken availability. All of this is to say that there are no fixed numbers for availability or durability. Values for these attributes are mainly, if not only, dependent on the specific setup of a given cluster—i.e. number of DataNodes and total capacity—and its configuration—i.e. block size and replication degree.

An S3-based setup, on the other hand, does have relatively fixed or known levels of availability and durability. Regarding availability, S3 provides different levels for the different classes of storage it offers. The highest belongs to the S3 Standard storage class, the one used in this implementation, which has a value of 99.99% [12]—i.e. the service will be available 99.99% of the time in a given year. When it comes to durability, as mentioned in Section 2.2.1, S3 provides 99.999999999% durability of objects in a given year, which also translates into an expected loss of 0.000000001% of objects each year [12]. For reference, these guarantees are good enough for various industry leaders to rely on S3 for critical big-data systems [82].
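Hedging that these are service-level targets rather than measurements, the two percentages translate into roughly

\[
(1 - 0.9999) \times 365.25 \times 24 \times 60 \approx 52.6\ \text{minutes of expected unavailability per year},
\]
\[
1 - 0.99999999999 = 10^{-11} = 0.000000001\,\%\ \text{of stored objects expected to be lost per year.}
\]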

3.1.3 Scalability and elasticity

Scalability refers to the degree to which a system is able to meet greater workloads without taking significant hits in performance, while elasticity is understood as the degree to which a system can dynamically adjust its resources to meet fluctuating levels of workload, both higher and lower. HDFS was designed with scalability as one of its main objectives, and this is made evident by the simplicity with which a great number of DataNodes can be added to an existing cluster to increase its capacity, availability, and/or performance—naturally, this also applies to HopsFS, to an even greater extent, since HopsFS NameNodes and metadata storage can scale out. S3 is similarly known for its massive scalability. However, unlike with HDFS, S3 users are not concerned with having to increase most of the aforementioned attributes themselves. Rather, capacity in S3 is virtually unlimited [12], the availability level is fixed, and they can expect S3 to handle whatever level of load they incur in a transparent fashion and without decreased performance.

While S3 offers simplicity, users have no control over performance other than applying the provided guidelines and patterns to maximize it [80]. That is, once maximum throughput is achieved, they have no way of increasing it if needed. This is not the case for HDFS and HopsFS, in which users can decide what hardware the DataNodes run on, and can also perform ad hoc and tailored optimizations at will. Granted, this advantage is a trade-off between flexibility and control versus complexity. That is, HDFS and HopsFS users must deal with adding and maintaining DataNodes themselves and potentially having to introduce the use of new technologies to handle larger clusters—e.g. extensions like Federation [29] and ViewFS [83]. Some use cases even require further and more advanced engineering when the available solutions are not enough, such as Uber's [84] and Twitter's [85], though these are ostensibly not as pervasive.

In addition to increasing the effective capacity of a cluster, adding more DataNodes also makes room for additional replicas of existing blocks. This can be done not only to increase the availability and durability of those blocks, but also to counteract the performance degradation that occurs when demand for a subset of blocks grows so much higher than usual that throughput drops considerably.


By adding DataNodes, the replication degree of the files for those blocks can be increased, either manually or automatically [8], and thus the system will eventually catch up to the higher demand by increasing the number of sources. Evidently, this requires additional work, thereby increasing complexity even further. Furthermore, when demand stabilizes back to its usual levels, removing the previously added DataNodes can be cumbersome.
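As a concrete illustration of the manual route, the standard HDFS client API (which HopsFS also exposes) lets an administrator raise the replication factor of a hot file; the path and replication factor below are made-up examples:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RaiseReplication {
    public static void main(String[] args) throws Exception {
        // Connects to the file system configured in core-site.xml / hdfs-site.xml.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical hot file whose read demand has spiked.
        Path hotFile = new Path("/user/foo.txt");

        // Ask the NameNode to track a higher replication factor; the actual
        // copying of replicas happens asynchronously in the background.
        boolean accepted = fs.setReplication(hotFile, (short) 5);
        System.out.println("Replication change accepted: " + accepted);

        fs.close();
    }
}

Once demand subsides, the same call with a lower factor marks the excess replicas as over-replicated so that they can eventually be removed.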

This is due to, for example, the possibility that the newly added DataNode(s) now host the last complete replica(s) of one or more blocks, and thus removing them without additional checks to guard against such scenarios would result in the permanent loss of user data that could otherwise be preserved.

This is clearly not something that users of S3 need to do themselves. Though the underlying mechanisms are not clear, S3 seamlessly adapts to demand spikes [80] without explicit user intervention; that is, average throughput is not dampened by increased loads. What is more, users are also not concerned with post-surge operations to scale back down.

3.2 Data architecture

As described in Section 2.1.2, HopsFS' NameNodes rely on a relational database, NDB, to store namespace metadata [6, 1]. Therefore, the current implementation extends the existing entity model, taking special care to adopt the same optimizations put in place by the original authors while replacing the underlying storage with S3. Indeed, a great priority in this project was reusing and replicating as much of the existing data architecture as possible, not only in compliance with the sound notion of code reusability, but also to guarantee comparable performance and, more importantly, to leverage its proven correctness.

3.2.1 Metadata entity model

In the regular HopsFS, user data is stored in the form of replicas of sequential blocks that together make up files. Replicas are persisted in DataNodes, but their metadata is stored in an NDB table which sits at the bottom of a hierarchy; for the purposes of this section, that hierarchy has inodes at the top and blocks in the middle. Figure 3.2 depicts this entity hierarchy: a file inode record in the Inode table (for instance /user/foo.txt) can have multiple ordered block records associated with it in the Block table, which in turn can each have multiple associated records in the Replica table. Readers will note that there is additional metadata associated with blocks and replicas in tables such as PRB (pending replication blocks) and URB (under-replicated blocks); such metadata is used to support operations that maintain the cluster's homeostasis, e.g. to ensure that there are enough valid copies of a piece of user data or to perform clean-up.

Figure 3.2: Data model of files in HopsFS. The Inode table stores file and directory metadata, and file records have associated ordered Block records, which in turn have associated Replica records.

Using S3 as the underlying storage solution for user data allows us to represent files as a collection of ordered S3 objects, rather than a collection of ordered blocks and replicas. While it is technically possible to keep the block and replica abstractions and simply add an adapter that stores the latter in S3 (i.e. keep representing files in terms of blocks and replicas and only change where the bytes are stored), it was determined at the beginning of the project that such an approach would require carefully modifying and testing a great number of moving parts. Taking the time constraints into consideration, I opted for the design with the smallest blast radius possible that still guaranteed correctness.

With user files being expressed in terms of S3 objects, we arrived at the database entity model shown in Figure 3.3. In it, the inode table remains the same, whereas the previous hierarchy of blocks and replicas has been replaced with the s3_object table. The latter contains all the information necessary to locate and uniquely identify any S3 object: the region (region field) to which the S3 bucket (bucket field) that contains it belongs, its key (key field), and the version ID (version_id field) that differentiates objects in the same bucket that share the same key [86]. It also records the position the object has within the collection of objects that make up the file to which it belongs, i.e. its index within the sequence (index field).
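To make this mapping concrete, the sketch below shows how such a table could be exposed to NameNode code through ClusterJ, the standard Java object mapping for NDB. The interface name and column mapping mirror Figure 3.3, but this is an illustrative assumption rather than the actual data-access-layer definition used in HopsFS.

import com.mysql.clusterj.annotation.Column;
import com.mysql.clusterj.annotation.PartitionKey;
import com.mysql.clusterj.annotation.PersistenceCapable;
import com.mysql.clusterj.annotation.PrimaryKey;

// Hypothetical ClusterJ mapping of the s3_object table from Figure 3.3.
// Partitioning by inode_id keeps all objects of a file in a single NDB shard.
@PersistenceCapable(table = "s3_object")
@PartitionKey(column = "inode_id")
public interface S3ObjectDTO {

    @PrimaryKey
    @Column(name = "inode_id")
    int getInodeId();
    void setInodeId(int inodeId);

    @PrimaryKey
    @Column(name = "id")
    int getId();
    void setId(int id);

    @Column(name = "index")
    int getIndex();
    void setIndex(int index);

    @Column(name = "size")
    int getSize();
    void setSize(int size);

    @Column(name = "region")
    String getRegion();
    void setRegion(String region);

    @Column(name = "bucket")
    String getBucket();
    void setBucket(String bucket);

    @Column(name = "key")
    String getKey();
    void setKey(String key);

    @Column(name = "version_id")
    String getVersionId();
    void setVersionId(String versionId);
}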

Finally, the s3_object table also contains a size field, which specifies how many bytes should be read from the S3 object. While S3 exposes the size of an object through its API, that size reflects the total number of bytes in the object, which can differ from what is stored in NDB because users are able to truncate files. For instance, the file /foo.txt could initially have a size of 100 bytes and be made up of a single S3 object of the same size; after a user truncates it to 50 bytes, the object will still be 100 bytes in S3, yet HopsFS will know to read only 50 bytes thanks to the updated value in the size field. This is a trade-off between response time and cost: having truncations simply update an object's size in NDB is a lot faster than downloading said object and uploading it again with the reduced size (S3 objects are immutable), but it also means that users keep paying for bytes in S3 that are no longer tracked by HopsFS.
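As a minimal sketch of how a reader can honor the size tracked in NDB rather than the physical object size, consider a ranged GET with the AWS SDK for Java (v1); the bucket, key, version ID, and tracked size below are placeholders taken from the running example, and the real client would wrap this logic inside the HDFS input-stream machinery:

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.GetObjectRequest;
import com.amazonaws.services.s3.model.S3Object;
import java.io.InputStream;

public class TruncatedRead {
    public static void main(String[] args) throws Exception {
        AmazonS3 s3 = AmazonS3ClientBuilder.standard()
                .withRegion("us-east-2")          // region field from the metadata record
                .build();

        long trackedSize = 50;                    // size stored in NDB after the truncation
        GetObjectRequest request =
                new GetObjectRequest("fjcg", "/foo.txt", "p0w4nm") // bucket, key, version_id
                        .withRange(0, trackedSize - 1);            // only the first 50 bytes

        try (S3Object object = s3.getObject(request);
             InputStream in = object.getObjectContent()) {
            byte[] buffer = in.readAllBytes();    // 50 bytes, even though S3 stores 100
            System.out.println("Read " + buffer.length + " bytes");
        }
    }
}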

Figure 3.3: Data model of files in the S3-based HopsFS [1]. The inode table stores file and directory metadata, and each file record has associated ordered s3_object records; an s3_object row is identified by (inode_id, id) and carries the index, size, region, bucket, key, and version_id of the object. The auxiliary s3_processable table (inode_id, reschedule_at, scheduled_for) and s3_deletable table (id, region, bucket, key, version_id, scheduled_at, scheduled_for) support consolidation and asynchronous deletion, respectively. The s3_object and s3_processable tables are partitioned by inode_id, while s3_deletable is partitioned by id.

We have discussed the new s3_object table that replaces the previous block, replica, and related tables. However, Figure 3.3 also features the tables s3_processable and s3_deletable. The former is needed to perform consolidation operations (see Section 3.6) and the latter to perform asynchronous deletions of S3 objects that are no longer being tracked (see Section 3.4.5). Readers might also note the annotations attached to the tables indicating by which columns they are partitioned. The reason for partitioning the tables will be explained in Section 3.2.2.
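As a rough sketch of what the asynchronous deletion path could look like (the S3ObjectRef record and the way rows reach the worker are hypothetical simplifications; only the deleteVersion call is the actual SDK API, matching the bucket, key, and version_id identity stored in s3_deletable):

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import java.util.List;

public class S3DeletableWorker {

    // Hypothetical in-memory stand-in for a row of the s3_deletable table.
    record S3ObjectRef(String region, String bucket, String key, String versionId) {}

    // Deletes every object version that is no longer tracked by the namespace.
    // In the real system these rows would be read from (and removed from) NDB
    // inside a transaction; here they are simply passed in as a list. A production
    // version would also reuse one client per region instead of building one per row.
    static void purge(List<S3ObjectRef> deletables) {
        for (S3ObjectRef ref : deletables) {
            AmazonS3 s3 = AmazonS3ClientBuilder.standard()
                    .withRegion(ref.region())
                    .build();
            // deleteVersion removes one specific version of the key, which matches
            // the identity columns stored in s3_deletable.
            s3.deleteVersion(ref.bucket(), ref.key(), ref.versionId());
        }
    }

    public static void main(String[] args) {
        purge(List.of(new S3ObjectRef("us-east-2", "fjcg", "/foo.txt", "p0w4nm")));
    }
}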


3.2.2 Performance of metadata operations

The use of NDB, along with its distributed (i.e. multi-shard) nature, introduced the need for additional technical decisions to ensure good performance of operations; namely, decisions on how to partition data and how to properly leverage transactions to carry out HopsFS metadata operations, e.g. read, list, delete. A full explanation of these decisions is beyond the scope of this report; readers are welcome to review the available literature [6, 1]. This section, however, delves into how this S3-based implementation fits into those decisions while maintaining their original goals.

Regarding data partitioning, the original HopsFS implementation partitioned inode-related metadata by the identifier of the inode to which it belongs. The benefit of this is that all the related metadata of a given inode is stored in a single shard, thus enabling efficient operations. To maintain this benefit, the same pattern is repeated with the new inode-related metadata, i.e. S3 processables and S3 objects. In other words, this explains why Figure 3.3 shows the s3_processable and s3_object tables as being partitioned by their inode identifier.

Finally, to ensure safe transactions on inode-related metadata in this implementation, the lock-acquiring logic had to be updated as well. This is because HopsFS takes locks following a total order over the metadata entities involved in a transaction, which prevents cyclic deadlocks. Though this was a simple task (adding the new inode-related metadata entities to the locking order), it remains relevant: without it, some operations would inevitably time out due to deadlocks, and the accumulation of deadlocks would progressively degrade the performance of the system as a whole.
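The idea behind a total locking order can be illustrated with a small, self-contained sketch; the entity types and lock primitives below are hypothetical simplifications, not HopsFS' actual transaction framework:

import java.util.Comparator;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

public class OrderedLocking {

    // A fixed order over entity types; the new S3-related entities simply receive
    // their own rank so that every transaction acquires locks in the same order.
    enum EntityType { INODE, S3_OBJECT, S3_PROCESSABLE, S3_DELETABLE }

    record Entity(EntityType type, long id, ReentrantLock lock) {}

    // Total order: first by entity type rank, then by identifier.
    static final Comparator<Entity> LOCK_ORDER =
            Comparator.comparing((Entity e) -> e.type().ordinal())
                      .thenComparingLong(Entity::id);

    // Acquires all locks needed by a transaction in the agreed total order,
    // which rules out the circular-wait condition required for a deadlock.
    static void lockAll(List<Entity> entities) {
        entities.stream().sorted(LOCK_ORDER).forEach(e -> e.lock().lock());
    }

    static void unlockAll(List<Entity> entities) {
        entities.forEach(e -> e.lock().unlock());
    }

    public static void main(String[] args) {
        List<Entity> txn = List.of(
                new Entity(EntityType.S3_OBJECT, 7, new ReentrantLock()),
                new Entity(EntityType.INODE, 3, new ReentrantLock()));
        lockAll(txn);
        try {
            // ... perform the metadata operation here ...
        } finally {
            unlockAll(txn);
        }
    }
}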

3.3 Application model

HDFS clients no longer interact with DataNodes to read or write block replicas of files. Rather than streaming packets of block chunks to and from DataNodes as in HDFS and HopsFS, clients instead upload or download full or partial S3 objects. From this we arrive at the model depicted in Figure 3.4, which shows an example of the data structures shared between the NameNodes and the clients. As illustrated, the content of a file inode is represented by the abstraction of an S3 file, which is a collection of S3 objects. The latter are also abstractions that contain all the information needed to download an actual object from S3 and the order in which they should be read.

Figure 3.4: Anatomy of the S3 file and S3 object abstractions in the S3-based HopsFS. In the example, the inode /foo.txt (200 MB in size) is backed by three ordered S3 objects (indices 0, 1, and 2, of 44 MB, 100 MB, and 56 MB respectively), each identified by its region, bucket, key, and version ID.

Given that this implementation aims to maximize throughput, the API used to stream data to S3 is the multipart upload [87]. As the name suggests, creating an object using this API means uploading one or more numbered pieces of data independently until the entire content of the object has been uploaded. Afterwards, the object can be downloaded in parts as well.

The maximization of throughput is achieved by uploading and downloading multiple parts in parallel, to the degree that the client is configured for or that the host's constraints (e.g. network bandwidth, memory, CPU) allow.
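Below is a minimal sketch of a parallel multipart upload using the AWS SDK for Java (v1); the bucket, key, part size, thread count, and in-memory payload are illustrative assumptions, and the actual client would additionally record the resulting version ID and object metadata in NDB:

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.*;
import java.io.ByteArrayInputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;

public class ParallelMultipartUpload {
    public static void main(String[] args) throws Exception {
        AmazonS3 s3 = AmazonS3ClientBuilder.standard().withRegion("us-east-2").build();
        String bucket = "fjcg", key = "/foo.txt";
        int partSizeBytes = 10 * 1024 * 1024;           // globally configured part size
        byte[] data = new byte[18 * 1024 * 1024];       // an 18 MB object, as in the example

        // 1. Initiate the multipart upload and obtain its upload ID.
        String uploadId = s3.initiateMultipartUpload(
                new InitiateMultipartUploadRequest(bucket, key)).getUploadId();

        // 2. Upload the numbered parts in parallel; the pool size bounds parallelism.
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<Future<PartETag>> futures = new ArrayList<>();
        int partCount = (int) Math.ceil((double) data.length / partSizeBytes);
        for (int i = 0; i < partCount; i++) {
            int partNumber = i + 1;                     // part numbers start at 1
            int offset = i * partSizeBytes;
            int length = Math.min(partSizeBytes, data.length - offset);
            futures.add(pool.submit(() -> s3.uploadPart(new UploadPartRequest()
                    .withBucketName(bucket).withKey(key)
                    .withUploadId(uploadId).withPartNumber(partNumber)
                    .withInputStream(new ByteArrayInputStream(data, offset, length))
                    .withPartSize(length)).getPartETag()));
        }
        List<PartETag> etags = new ArrayList<>();
        for (Future<PartETag> f : futures) etags.add(f.get());
        pool.shutdown();

        // 3. Complete the upload; only now does the object become visible in S3.
        s3.completeMultipartUpload(
                new CompleteMultipartUploadRequest(bucket, key, uploadId, etags));
    }
}

Since an object only appears in the bucket once the multipart upload is completed, a failed writer can simply abort the upload without leaving partial data behind.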

The usage of the multipart upload API, as advantageous as it is, concomitantly increases complexity. This is because an inode file is ultimately represented by an S3 file, which is a collection of indexed S3 objects, which are themselves collections of ordered parts. Thus, the aforementioned S3 object abstraction can be expanded into what is depicted in Figure 3.5. The part size is globally configured for a given cluster; in the illustration, the part size is 10 megabytes.

This means that, in general, an object will be made up of ceil(s/p) parts, where s is the size of the object and p is the part size.

The complexity of the application model does not culminate with the consideration of S3 object parts. In fact, as readers might have noticed in Figure 3.5, the object under the CRC32-enabled cluster has two 10-megabyte parts even though the object abstraction is only 18 megabytes long. This is because this implementation provides the option of additional integrity checks when reading data from S3. The goal of this feature is to replicate the checksummed packets of user data sent from DataNodes to clients during reads in HDFS. Thus, when enabled, clients append a CRC32 (32-bit cyclic
