
DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2017

Hash-based Eventual Consistency to Scale the HDFS Block Report

AUGUST BONDS

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF INFORMATION AND COMMUNICATION TECHNOLOGY


Abstract

The architecture of the distributed hierarchical file system HDFS imposes limitations on its scalability. All metadata is stored in-memory on a single machine, and in practice, this limits the cluster size to about 4000 servers. Larger HDFS clusters must resort to namespace federation, which divides the filesystem into isolated volumes and changes the semantics of cross-volume filesystem operations (for example, file move becomes a non-atomic combination of copy and delete).

Ideally, organizations want to consolidate their data in as few clusters and namespaces as possible to avoid such issues, increase operating efficiency and utilization, and simplify maintenance. HopsFS, a new distribution of HDFS developed at KTH, uses an in-memory distributed database for storing metadata. It scales to 10k nodes and has shown that in principle it can support clusters of at least 15 times the size of traditional non-federated HDFS clusters. However, an eventually consistent data loss protection mechanism in HDFS, called the Block Report protocol, prevents HopsFS from reaching its full potential.

This thesis provides a solution to scaling the Block Report protocol for HopsFS that uses an incremental, hash-based eventual consistency mechanism to avoid duplicated work. In the average case, our simulations indicate that the solution can reduce the load on the database by an order of magnitude at the cost of less than 10 percent overhead on file mutations while performing similarly to the old solution in the worst case.


Sammanfattning

The architecture of the distributed hierarchical file system Apache HDFS limits its scalability. All metadata is stored in memory on one of the cluster's nodes, and in practice this limits the size of an HDFS cluster to roughly 4000 nodes.

Larger clusters are forced to partition the file system into isolated parts, which changes the behavior of operations that cross partition boundaries (for example, file moves become non-atomic combinations of copy and delete). Ideally, organizations could consolidate all of their storage solutions into one and the same file tree to avoid such behavioral changes, thereby reducing administration and increasing the utilization of the hardware they choose to keep. HopsFS is a new distribution of Apache HDFS, developed at KTH, that uses an in-memory distributed database solution for storing metadata. The solution can handle a cluster size of 10000 nodes and has shown that it can in principle support cluster sizes of up to fifteen times that of Apache HDFS. One of the obstacles that remain before HopsFS can reach these levels is an eventually consistent data-loss protection algorithm in Apache HDFS called the Block Report.

This work proposes a solution for scaling the HDFS Block Report that uses a hash-based, eventually consistent mechanism to avoid duplicated work. Simulations indicate that the new solution can, on average, reduce the load on the database by an order of magnitude, at a performance cost of less than ten percent on ordinary file system operations, while worst-case database usage is comparable to that of the old solution.


Acknowledgements

Firstly, I would like to thank my examiner Jim Dowling for trusting me with the responsibility of tackling this complicated problem. Secondly, I would like to thank my supervisor Salman Niazi for helping me along the way and having patience with me despite the endless flow of questions. Thirdly, I would like to thank my team at Logical Clocks for all the laughs, and finally, RISE SICS for the financial support and for providing a good working environment.

A special thanks to Monika Tõlgo for always pushing me to be better.


Contents

1 Introduction
1.1 Background
1.2 Problem
1.3 Goal
1.3.1 Contributions
1.3.2 Benefits, Ethics and Sustainability
1.4 Method
1.5 Delimitation
1.6 Outline

2 Background
2.1 Hadoop
2.1.1 HDFS
2.2 NDB
2.3 Hops
2.3.1 Transactions in Hops
2.3.2 HopsFS
2.4 Eventual Consistency
2.5 HDFS Block Management
2.5.1 Blocks and Replicas
2.5.2 HDFS Block Report
2.6 HopsFS Block Report
2.7 Merkle Trees
2.8 State of the Art
2.8.1 Federated HDFS
2.8.2 Amazon S3
2.8.3 Google Cloud Storage

3 Method
3.1 Goals
3.2 Tasks
3.3 Evaluation

4 The Improved HopsFS Block Report
4.1 Solution Idea
4.1.1 Incremental Hashing
4.2 Design
4.2.1 A First Attempt
4.2.2 Handling Concurrent Writes
4.2.3 Handling Data Node Restarts
4.2.4 Handling Appends
4.2.5 Handling Recovery
4.2.6 Final Design

5 Analysis
5.1 Results
5.1.1 Simulations
5.1.2 Real-world Performance
5.2 Analysis
5.2.1 Correctness
5.2.2 Pitfalls
5.2.2.1 Sensitivity to Misconfiguration
5.2.2.2 Real-world Large Scale Systems
5.2.2.3 Computation and Storage Overhead
5.3 Catastrophic Failure

6 Conclusions
6.1 Conclusion
6.2 Future Work

Bibliography


List of Figures

2.1 HDFS Architecture
2.2 Replica State Transitions [1]
2.3 Block State Transitions [1]
2.4 HDFS Federation [2]
2.5 Consistency-Scalability Matrix
5.1 Impact of Number of Buckets on Block Report Time
5.2 Impact of Block Processing Speed on Block Report Time


List of Tables

2.1 Block States
2.2 Replica States
2.3 Reported Replica Fields
5.1 Simulation Configuration


List of Listings

4.1 Simple Hash-based approach, Name Node
4.2 Simple Hash-based approach, Data Node
4.3 Hash-based approach with Concurrent Modification Handling, Name Node
4.4 Hash-based approach with Concurrent Modification Handling, Data Node
4.5 Handling data node restarts and stale data nodes, Name Node
4.6 Handling Appends, Name Node
4.7 Handling Recovery, Name Node
4.8 Final Design, Name Node
4.9 Final Design, Data Node


List of Acronyms and Abbreviations

This document requires readers to be familiar with certain terms and concepts.

For clarity, we summarize some of these terms and give a short description of each before they appear in later sections.

HDFS  Hadoop Distributed File System [3]

HopsFS  The new distribution of HDFS developed at KTH [4]

DAL  Data Access Layer, the metadata layer of HopsFS

CRDTs  Conflict-free Replicated Data Types [5]

YARN  Yet Another Resource Negotiator [6]


Chapter 1 Introduction

1.1 Background

With over 4 million YouTube videos watched, 46,000 Instagram posts shared, and 2.5 petabytes of internet data consumed by American users every minute of every day [7], the necessity of large-scale distributed storage systems is evident.

The design of such systems is, however, a very complicated task with many pitfalls and unavoidable compromises.

Because of the thermodynamic limitations of scaling vertically, i.e. building more powerful computers, large-scale distributed systems resort to scaling horizontally, i.e. adding more computers to the system. The problem with horizontal scaling is that increasing the number of devices in a distributed system also increases communication complexity and adds to the number of failure points.

In the case of distributed storage, as the network grows, so does the problem of maintaining data consistency, data integrity, and redundancy.

Arguably the most famous big data cluster framework, Hadoop, is entirely open source and is used by large organizations such as Yahoo!, Facebook, and Spotify for data storage and analysis. Hadoop stores its files in the Hadoop Distributed File System (HDFS), which allows for reading, writing, and appending files of virtually unlimited size. However, as will be discussed, because of its design, the file system and the cluster cannot safely grow beyond about 4000 nodes.

In a Hadoop cluster, there is at least one control node (Name Node) and one or more storage/worker nodes (Data Nodes). The Name Node decides, among other things, where clients should write new data and how the existing data should be balanced and distributed over the cluster. The Name Node stores all the cluster information in-memory.

Because all cluster metadata is stored in-memory on that single Name Node, there is a limit to the number of files the cluster can keep track of. Storing metadata in-memory on one node also means that all file system operations and all data node reports must eventually go through it. There are possibilities for moving the synchronization of metadata outside of the Name Node using message-passing middleware, but, for simplicity's sake, the Name Node has complete control.

Hadoop Open Platform as a Service (Hops) is a new distribution of Hadoop that aims to increase scalability and performance by removing critical bottlenecks in the Hadoop platform. To scale the metadata layer, Hops uses a distributed in-memory database called NDB. Using NDB for metadata raises the potential maximum number of files to 17 billion. Since NDB supports ACID transactions, it can also be used to distribute Name Node work.

HopsFS is the Hops equivalent of HDFS and takes care of file storage. In a distributed system of this scale, the number of potential errors is significant and crashes are frequent. Asynchronous operations that postpone synchronizing with the Name Node improve scalability. However, HopsFS still maintains a high level of redundancy and data-loss protection, and to do this the Name Node needs to keep an up-to-date view of the cluster with its files and blocks. The existing mechanism for these data consistency and redundancy checks is called the block report.

The Block Report is a mechanism in HDFS for maintaining an up-to-date view of the cluster on the Name Node with respect to stored files. Once an hour, every data node sends a complete list of its blocks, the block report, to the Name Node.

The Name Node processes each block in the list and compares it with its view, updating the view when appropriate. To avoid misunderstanding, we will use the word block when talking about the Name Node’s idea of a block, and replica when talking about a copy of a block stored on a Data Node. In these terms, the Data Node submits a list of replicas as its block report.

Data Nodes used in production at companies such as Spotify regularly host more than a million blocks, so the corresponding multi-megabyte block report becomes a heavy burden for the Name Node to process. The Name Node also has to query the distributed database for the most up-to-date view of the reporting node, so block reporting adds load to the database system as well.

The focus of this thesis is improving the performance and processing time of the block report mechanism.

1.2 Problem

Ideally, organizations that do Big Data analysis would like to consolidate their data in the fewest clusters possible for gains in operating efficiency and utilization.

However, the metadata architecture of Hadoop limits cluster size because all metadata is stored in-memory on a single node called the Name Node. HopsFS is a new distribution of HDFS, developed at KTH Royal Institute of Technology in Stockholm, Sweden, that scales metadata size and throughput using a distributed in-memory database. HopsFS has shown that in principle it can support clusters at least 15 times larger than HDFS [4]. However, an internal protocol in HDFS called the block report prevents HopsFS from handling clusters of these sizes.

Currently, the block reporting protocol consumes an increasing amount of computational and bandwidth resources as HDFS cluster sizes grow. In practice, for clusters larger than 10K servers, block reporting saturates the metadata layer, preventing HDFS from scaling further.

Hence, the question becomes: can we improve the scalability of HDFS without compromising performance and durability? And how would such a solution look? In this thesis, we will design and develop a more scalable, eventually consistent protocol for block reporting in HopsFS (and HDFS).

1.3 Goal

The goal of this work is to redesign the current HopsFS block reporting mechanism for improved performance and scalability, as well as to perform the necessary evaluation of the implementation. Furthermore, we hope that our solution will be useful for the original Hadoop project, as a contribution to the open-source community.

1.3.1 Contributions

This work consists of an explanation of the existing HDFS block report mechanism, a design and implementation of a new, hash-based block report mechanism, and a comparison of the old and new design.

1.3.2 Benefits, Ethics and Sustainability

We do not see any ethical issues with this project. The main risk is that we do not find a solution, but a negative result (the impossibility of further optimizing block reports) would also be a contribution. On the other hand, a positive outcome could increase server utilization and efficiency in large-scale clusters, reducing electricity, maintenance, and hardware requirements for Big Data operations.

1.4 Method

We will attempt to formalize the existing consistency mechanism and then create an improved design. After improving on the original design, benchmarks will be conducted to obtain a quantitative measure of the performance increase. Reasoning and tests are used to verify the correctness of the new solution.

1.5 Delimitation

This thesis focuses solely on implementing and evaluating a better-performing consistency mechanism for the distributed file system HopsFS. We wish for our improved solution to maintain the same consistency properties as the old solution so that they can be used interchangeably. Hopefully, the algorithm design will also be applicable to the original HDFS project, but the actual application of the new design to HDFS is outside the scope of this work.

1.6 Outline

The thesis is organized as follows. The introduction (Chapter 1) gives an overview of the problem and the background. The background (Chapter 2) goes in-depth into the context necessary for understanding the work and gives an overview of existing solutions in the space. After that comes an explanation of the methods used (Chapter 3). The hash-based approach chapter (Chapter 4) describes the new solution. At the end of the work is an evaluation (Chapter 5), followed by a summary and suggested future work (Chapter 6).


Chapter 2 Background

2.1 Hadoop

Hadoop is a software framework for "reliable, scalable, distributed computing" [3]. It started out as an open-source implementation of the Google File System (GFS) [8], coupled with cluster management and support for MapReduce jobs [9], and ended up being one of the core technologies of the big data sphere. Today, Hadoop has a 25% market share in big data analytics [10].

Google built GFS and MapReduce to deal with its ever-growing index of web-pages and to perform aggregated analysis on that index. Therefore, Hadoop initially came with some core assumptions: users do not delete data, the data is written in a stream and read by multiple readers, and data-loss is intolerable.

Furthermore, MapReduce used the concept of data-locality to increase processing speed in a communication-bound system. The assumptions about data storage and retrieval will be discussed in Section 2.1.1, dedicated to Hadoop's distributed file system HDFS.

Hadoop uses data-locality to deal with processing large datasets. Instead of moving the data over the limited network, it moves the computation to the nodes where the data resides. One of the insights that led to the creation of GFS, MapReduce, and finally Hadoop was that cheap commodity hardware could be used instead of expensive specialized hardware, as long as the slightly higher failure rate was handled at the software level. The architecture is an example of horizontal scaling: increasing the number of machines instead of increasing the performance of each individual machine.

In May 2012, the first alpha of Hadoop 2 was released [11], which extended the framework in a significant way: it added a scheduler and resource allocator called YARN [6]. YARN made it possible for third-party software to request cluster resources and run jobs, which allowed new data processing frameworks to plug into the platform. Nowadays, Hadoop supports a number of popular big data frameworks such as Spark, Hive, Kafka, HBase, and Storm.

2.1.1 HDFS

The default file system in Hadoop is the Hadoop Distributed File System (HDFS). HDFS is an open-source clone of the Google File System, based on the paper Google published in 2003 [8]. An HDFS cluster consists of two types of nodes: Data Nodes and one or more (backup) Name Nodes. The Name Node synchronizes access to the file system and maintains a view of the cluster: which data nodes store which files, what the desired replication level is, and which files need re-replication. The data nodes serve as simple data storage servers.

Files in HDFS are divided into chunks called blocks. The client operates on files, but internally, and transparently to the user, the files are split, distributed, and replicated on the block level. A block has a predefined maximum size, generally 64 MB or more. HDFS stores these blocks, not the complete files, on its data nodes.

The Name Node controls the cluster and maintains the namespace. Users interact with HDFS via a provided client. When a user wants to read a file, they request a read stream from the client. This stream does the work of asking the Name Node for the locations (Data Nodes) that hold the next block to read and then sets up a read socket to one of those data nodes. To the user, it looks like a local file read.

A file can contain a virtually unlimited number of blocks. Blocks are limited in size, so when a block becomes full, the writer is notified and forced to request a new block to write to. The client handles this transparently to the user.
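To make the client's role concrete, the short sketch below writes and reads a small file through the standard Hadoop FileSystem Java API. The cluster address and path are placeholder values, and the block splitting and pipeline setup described above happen entirely behind these calls; this is an illustration of the client interface, not part of the thesis's contribution.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder address; in a real deployment this comes from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");
        FileSystem fs = FileSystem.get(conf);

        Path path = new Path("/tmp/example.txt");

        // Writing: the client asks the Name Node for block locations and streams
        // data to the Data Node pipeline; none of that is visible at this level.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.writeUTF("hello, HDFS");
        }

        // Reading: the stream fetches block locations from the Name Node
        // and reads directly from a Data Node holding each block.
        try (FSDataInputStream in = fs.open(path)) {
            System.out.println(in.readUTF());
        }
    }
}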

2.2 NDB

NDB [12] is a distributed, in-memory, shared-nothing database solution that supports ACID transactions with row-level locking and is capable of 200 million queries per second [13]. Storing database data in volatile memory requires specific work to maintain durability. In the case of NDB, this means a replication factor of two, coupled with regular snapshotting to disk.

ACID transactions (Atomicity, Consistency, Isolation, Durability) are a concept that defines properties of concurrent single- or multi-statement operations on a database. They are used to ensure that a) either all statements complete or their effects are entirely discarded (Atomicity), b) every transaction takes the database from one valid state to another (Consistency), c) concurrent statements operate on the database as if executed alone (Isolation), and d) the result of a completed operation is never lost (Durability). Such transactions are necessary for a database system to emulate the corresponding properties of a data-race-free in-memory data store. NDB supports such transactions with an isolation level of READ COMMITTED. This means that data read by a transaction is guaranteed to be the result of a completed transaction. In essence, a transaction only operates on its own copy of the rows (the last committed state before the transaction began) and will only complete successfully if no other simultaneous transaction has updated those rows in the meantime. If a transaction fails to complete because its read values were modified by another transaction, it has to roll back and be retried.
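As an illustration of the rollback-and-retry behavior just described, the following sketch shows a generic optimistic-transaction loop. The TransactionBody interface, ConflictException, and runInTransaction helper are hypothetical stand-ins for illustration only; they are not part of the NDB or ClusterJ API.

// Hypothetical sketch of optimistic transaction retries under READ COMMITTED.
// TransactionBody and ConflictException are illustrative names, not NDB API.
interface TransactionBody<T> {
    T execute() throws ConflictException;
}

class ConflictException extends Exception {}

final class TransactionRunner {
    static <T> T runInTransaction(TransactionBody<T> body, int maxRetries)
            throws ConflictException {
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                // Each attempt works on its own snapshot of the rows it reads
                // and commits only if no concurrent transaction updated them.
                return body.execute();
            } catch (ConflictException e) {
                // Rolled back because a read row was modified concurrently;
                // retry from scratch, as described in the text.
                if (attempt == maxRetries) {
                    throw e;
                }
            }
        }
        throw new IllegalStateException("unreachable");
    }
}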

To scale the system horizontally, NDB employs sharding. Sharding partitions a dataset according to some consistent mapping of row keys to data nodes in the cluster. This is done automatically unless another mechanism is specified, e.g. sharding based on a specific column, or full-table replication. Sharding your tables correctly can allow transactions to never involve more than one data node, removing the need for cross-data-node communication and thereby improving transaction speed. NDB falls back to cross-partition transactions when this is not possible.

A key property of NDB that makes it useful for high-performance workloads is that it scales linearly with respect to reads and writes, as you increase the number of datanodes in the system.

2.3 Hops

Hops is an independent fork of Hadoop (see Section 2.1) that aims to solve some of the inherent scalability issues present in the platform. One of the main complaints against Hadoop is the decision to store all cluster metadata in-memory inside a single Java Virtual Machine (JVM) on the primary Name Node. Furthermore, all Hadoop file system operations are synchronized over the same lock, which limits scalability. To remedy these issues, Hops stores all its metadata in a distributed in-memory database called NDB (Section 2.2).

Storing the metadata outside the JVM does not come without drawbacks, though. Access times are slower because of network latency, and the network bandwidth between the name nodes and the database nodes limits the speed and number of parallel operations that can be performed. Hops has redesigned some aspects of the file system internals to remove the synchronization bottleneck present in Hadoop. In HopsFS, all file system operations are executed as concurrent transactions [4].


2.3.1 Transactions in Hops

To reduce the number of round-trips to the NDB database, Hops uses a local cache on each Name Node. Tables are interacted with through so-called "EntityContexts" that manage the cache state. All metadata operations need to implement a transaction interface consisting of three operations: setUp, acquireLocks, and performTask. setUp is allowed to perform individual requests to the database to resolve any additional information necessary for the main task. acquireLocks is the main cache population phase, where the developer specifies a number of database rows that need to be populated (through "locks"), and performTask performs the database updates. Updates performed in performTask are made to the locally cached copies of the database rows, and at the end of the method, the results are automatically committed back to the database by the transaction handler. The transaction handler takes care of rollbacks and retries in case of failure.
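A rough sketch of the three-phase structure described above is shown below. The interface and method signatures are illustrative renderings of the setUp/acquireLocks/performTask phases named in the text; the actual Hops classes may differ.

// Illustrative only: a minimal rendering of the setUp / acquireLocks /
// performTask structure described in the text. Names are assumptions,
// not the actual Hops source.
interface MetadataTransaction {
    // Phase 1: individual reads needed before the main task (e.g. path resolution).
    void setUp() throws Exception;

    // Phase 2: declare the rows ("locks") the EntityContext cache must populate.
    void acquireLocks() throws Exception;

    // Phase 3: mutate the locally cached rows; the transaction handler
    // commits the changes and handles rollbacks and retries on failure.
    Object performTask() throws Exception;
}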

Acquiring locks on multiple rows in multiple tables concurrently can lead to so-called deadlocks, where each of two transactions waits for the other to release a lock it has already acquired. The Data Access Layer, discussed in the following section, enforces a total order of lock acquisition that spans both tables and individual rows, which makes such deadlocks impossible.

2.3.2 HopsFS

HopsFS is a drop-in replacement for the Hadoop Distributed File System that replaces the in-memory metadata store of the control node with a database abstraction called the Hops Data Access Layer (DAL for short). The data access layer has one main implementation, interfacing with the in-memory, shared-nothing, distributed database system Oracle NDB. Moving metadata into a database makes it queryable and batch-editable. It also allows for a much larger potential amount of metadata compared to keeping the metadata in memory. In the case of NDB version 7.5, this equates to 24 TB of metadata in a 48-node cluster with 512 GB of RAM on each NDB data node. Consequently, HopsFS has the potential to store metadata for up to 17 billion files [4].

At the heart of the Hops File System is the representation of a file or directory, the inode. All inodes have a parent inode, an associated timestamp, ownership information, and permissions. The inode dataset is partitioned by parent inode to enable quick directory listings. The root inode is accessed by all file system operations and is therefore made immutable and cached on all name nodes. However, because there is a single namespace root, the whole directory tree would end up in the same partition. To mitigate this, the children of top-level directories are partitioned according to a configurable hashing scheme, spreading the second-level directories across the database nodes [4].

All file system operations are performed as transactions. However, transactions on their own are not sufficient to scale out a hierarchical file system, where simultaneous subtree operations would rarely succeed without being invalidated by concurrent modifications further down the tree. To handle this, HopsFS uses an application-level locking mechanism in which a subtree lock cannot be acquired if a node further down the tree is locked, and vice versa.
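The sketch below illustrates the subtree-locking rule in the simplest possible way, using path prefixes to detect ancestor/descendant conflicts. It is a toy model of the idea under our own simplifying assumptions, not the HopsFS implementation.

import java.util.HashSet;
import java.util.Set;

// Toy model of the application-level subtree lock described in the text:
// a lock on a path conflicts with any held lock above or below it in the tree.
final class SubtreeLockTable {
    private final Set<String> locked = new HashSet<>();

    synchronized boolean tryLock(String path) {
        for (String held : locked) {
            // Conflict if an existing lock covers this path, or this lock
            // would cover an existing one (ancestor or descendant either way).
            if (covers(held, path) || covers(path, held)) {
                return false;
            }
        }
        locked.add(path);
        return true;
    }

    synchronized void unlock(String path) {
        locked.remove(path);
    }

    private static boolean covers(String ancestor, String descendant) {
        if (descendant.equals(ancestor)) return true;
        String prefix = ancestor.endsWith("/") ? ancestor : ancestor + "/";
        return descendant.startsWith(prefix);
    }
}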

While moving the metadata into a database results in slower access times than storing it in-memory, the ability to perform concurrent operations on the data as transactions enables a safe way to scale out the system. Compare this to the HDFS single namespace lock, enforcing serializability, which hampers performance and limits HDFS to a sixteenth of the throughput of HopsFS [4].

HopsFS relies on the same block report protocol for replica management as HDFS and is therefore limited in a similar way. However, as HopsFS introduces a database layer for storing cluster metadata, it actually worsens the scalability constraint of the Block Report. Now, not only does the report need to be transmitted over the network from the data node to the name node, but the processing on the name node requires reading a number of database rows comparable to the number of replicas in the report. Operations that modify replica metadata are performed in transactions, and so every row read must eventually be locked. The result is that HopsFS needs to resort to making full reports less frequently to operate at similar cluster sizes. An improved design of the HopsFS block reporting mechanism that lessens the load on the database and increases block report processing speed is the main contribution of this work, and is presented in Chapter 4.

2.4 Eventual Consistency

Properties in distributed systems are either safety properties or liveness properties. Safety properties give guarantees about what bad things will never happen, and liveness properties state that good things eventually happen. A well-implemented distributed system never violates safety properties and never ceases to make progress (liveness). Consistency models are used to reason about how quickly a system fulfills its liveness properties.

Distributing resources always requires compromises. As famously stated in the CAP theorem [14] by Eric Brewer, any distributed system can provide at most two out of the following three properties: Consistency, Availability, and Partition Tolerance. Consistency considers the behavior of the system under concurrent operations, Availability the continuous operation of the system, and Partition Tolerance the case of lost messages between nodes in the system. Because of this inherent limitation of distributed systems, and the almost inevitable loss of messages, most applications that need to scale must compromise between availability and consistency.

Shared-memory abstractions are architectures that allow memory to be distributed over multiple units of storage while, to the outside observer, looking more or less like a single unit of storage. HDFS is an example of this. Since consistency is not a binary property, we use consistency models to help reason about the different levels of consistency a shared-memory abstraction can exhibit. The consistency model best matching that of the HDFS block report is the eventual consistency model.

Eventual consistency guarantees that after all updates have arrived, the system will eventually be in sync, meaning that there is a period of time after the updates have arrived that is allocated for conflict resolution. This allows for quicker general operation of the system, with performance penalties only when node states diverge. Another example of a consistency model is eventual consistency's more powerful cousin, strong consistency, which guarantees that as soon as each node has received all updates, the system is consistent.

2.5 HDFS Block Management

2.5.1 Blocks and Replicas

The file storage architecture of HDFS (Figure 2.1) is modelled after the Google File System [8]. Similar to GFS, files in HDFS are divided into fixed-size parts called blocks (GFS: "chunks"). Block sizes are configurable, with a default of 64 MB, but the division is completely transparent to the user.

Like Hadoop as a whole, HDFS relies on a 'single control node, multiple worker nodes' architecture. The name node maintains a complete view of the cluster and decides where new blocks should be placed. By default, HDFS replicates every block three times, and the replication level of a file is the same as that of its blocks. Block placement decisions are made with the goal of balancing load on the cluster and minimizing the risk of data loss.

Physical copies of blocks are called replicas. Replicas are stored on data nodes, which serve as dumb replica stores that have no knowledge of the rest of the cluster. Since data nodes are simple, the complexity is moved to the client. To write a file, a client first asks the name node for locations to write the first block. After receiving a list of data nodes to write to, it sets up a pipeline between itself and the data nodes and proceeds with the data transmission. Once a block has been filled and all data nodes in the pipeline have sent acknowledgements, the client requests a new block and new locations to write to from the name node.

Throughout the lifetime of a block, the block itself and its replicas go through a cycle of states. Block states are described in Table 2.1 and replica states are described in Table 2.2.

Figure 2.1: HDFS Architecture

Table 2.1: Block States

UNDER CONSTRUCTION: The block is being written.
COMMITTED: The client has closed the block or requested a new block to write to, and the minimum number of data nodes has not yet reported finalized replicas.
COMPLETE: The client has closed the block or requested a new block to write to, and the block has reached minimum replication.
UNDER RECOVERY: A writing client's lease has expired and recovery has started.

Table 2.2: Replica States

RBW: Replica Being Written; a client has started writing, or is writing, data to this replica.
FINALIZED: The client has closed the file or requested a new block, and all data has arrived.
RUR: Replica Under Recovery; a recovery of the corresponding block has started.
RWR: Replica Waiting to be Recovered; the state all RBW replicas enter after the data node restarts.
TEMPORARY: Temporary replicas are ephemeral and appear during the transfer of blocks for re-replication.

If no errors are encountered during writing, a new block goes from UNDER CONSTRUCTION to COMMITTED to COMPLETE, while its replicas go from RBW to FINALIZED. A deeper dive into the state transitions can be found in Section 2.5.2.

2.5.2 HDFS Block Report

Every file has a target replication level. Ideally, every file in the file system should be fully replicated at all times, but because of node failures, disk failures, and network failures, replicas can be lost or corrupted. The replication level is maintained by a combination of client and name node processes.

On the name node, a periodic procedure triggers re-replication of the blocks known to be under-replicated. To avoid overloading the cluster, the batches of blocks scheduled for repair are sized proportionally to the cluster. Blocks can be marked as under-replicated in several ways: the client reporting corrupted reads, the client reporting a pipeline failure when writing, data nodes not reporting finalized replicas, data nodes being marked dead, and file replication upgrades. The data nodes themselves also periodically run checksum controls on their stored replicas.

Since the Name Node maintains a complete view of the cluster, all writes and reads must first be confirmed by the Name Node. However, data nodes operate independently, so the Name Node's knowledge of the state on the data nodes can quickly become obsolete. To keep track of where blocks are stored and in which state they are, and to deal with missing messages or other situations that leave replicas in states inconsistent with their blocks, Hadoop deploys a consistency mechanism called the Block Report. The Block Report is a two-part mechanism for synchronizing block and replica states.

On the one hand, there is the once-an-hour full report. The full report consists of a complete list of replicas and their states (replica contents excluded, see Table 2.3). The full report is also sent upon data node registration (startup). On the other hand, there are incremental reports. Incremental reports are used to synchronize data node and name node state on the fly as replicas are modified. Specific replica transitions are reported with the following messages:

• Receiving (blockId, generationStamp, RBW)

• Received (blockId, generationStamp, length, FINALIZED)

• Deleted (blockId, generationStamp)

The "Receiving" message is sent when a write is set up, that is, on create, append, append recovery (append pipeline failed), and pipeline recovery (normal write failed). The "Received" message is sent whenever the replica is finalized: close, recovery and close recovery, and replication (TEMPORARY to FINALIZED). The "Deleted" message is sent when a replica is deleted. See Figure 2.2 for a detailed overview. The corresponding block transitions can be found in Figure 2.3.
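For concreteness, the three incremental messages and the reported replica fields of Table 2.3 can be pictured as simple records, as in the sketch below. This is only an illustration of the protocol's payloads; it is not the actual HDFS classes or wire format.

// Illustrative data types for the incremental block report messages and the
// replica fields of Table 2.3; not the actual HDFS classes or wire format.
enum ReplicaStatus { RBW, FINALIZED, RUR }

final class ReceivingMsg {          // sent when a write is set up
    long blockId;
    long generationStamp;           // status is implicitly RBW
}

final class ReceivedMsg {           // sent when a replica is finalized
    long blockId;
    long generationStamp;
    long length;                    // status is implicitly FINALIZED
}

final class DeletedMsg {            // sent when a replica is deleted
    long blockId;
    long generationStamp;
}

final class ReportedReplica {       // one entry of the full block report (Table 2.3)
    long blockId;
    long length;
    long generationStamp;
    ReplicaStatus status;
}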

Incremental block reports and the full block report together provide a guarantee that replica states will eventually be synchronized with the name node. Every data- or version-changing modification is reported on its own, and, in case of lost messages, the full report catches any remaining differences. The Block Report is therefore equivalent to an eventually consistent protocol. With time, if all messages arrive, the state of the data nodes will be in sync with the name node's view.


Table 2.3: Reported Replica Fields

blockId: id of the corresponding block
length: number of bytes in the block
generationStamp: a version number
status: RBW / FINALIZED / RUR

Figure 2.2: Replica State Transitions [1]


Figure 2.3: Block State Transitions [1]


The time and space complexity of the block report is linear in the number of blocks on a data node. In HDFS, this has imposed a strict limitation on cluster sizes: since the complete reporting time for 1M blocks is on the order of one second, clusters have been limited to around 4000 nodes for a report interval of one hour. From the structure of the Block Report mechanism, we can extract the essential requirements it fulfills:

1. On data node start/restart the state in terms of RUR, RBW, FINALIZED and deleted replicas is synchronized by an initial full block report

2. Every replica modification is additionally reported on its own

3. Every hour the state in terms of RUR, RBW, FINALIZED and deleted replicas is synchronized by the full block report.

The fact that incremental reports are sent on every important data modification can be used to define a secondary property: if data nodes receive all replica modifications, and the name node receives all incremental reports, the state is synchronized. We utilize this fact to design the new hash-based block report.

2.6 HopsFS Block Report

While identical in design, the HopsFS Block Report exhibits different performance properties from the HDFS Block Report because it interacts with the database for metadata. As explained in Section 2.3.1, transactions go through three steps: initial data resolution, cache population, and updates. This is also true for the HopsFS Block Report. However, because locking hundreds of thousands of rows is not feasible in NDB within the time constraints of the block report, the replica information is initially read outside of a transaction as an index scan. Then, the list of replicas in the report is compared with the previously read replica information in the database to compute a diff. The differences found in the diff are processed individually as transactions, causing each modification to require a round-trip to the database.

Our measurements show that on an unloaded NDB cluster, we can read the block and replica information at around 200k replicas per second outside a transaction, and a single replica modification transaction takes around 20 ms. As we expect a data node in a typical cluster to contain at least 1M blocks, the initial reading phase takes at least 10 s. Furthermore, if every block requires a modification, there is an additional 2000 s of total transaction processing. For improved start-up times, this second step can be parallelized, and after start-up, we expect the number of modifications needed in a typical block report to be low. Therefore, the focus is on optimizing the first step.


2.7 Merkle Trees

Merkle trees are balanced binary trees of hashes where the bottom level is full. Merkle trees were initially conceived as a solution for digitally signing large amounts of data: any pre-existing, proven cryptographic hash function is applied over equally sized chunks of the data, and two neighboring hashes are combined and hashed again to compute a parent hash, all the way up to the root. In case of alteration during transmission, the offending chunk(s) can be found simply by extracting the smallest subtree of offending hashes, and only that segment needs to be re-transmitted.
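A minimal sketch of the construction just described is given below, with SHA-256 standing in for "any pre-existing, proven cryptographic hash function". Chunking, padding, and proof extraction are omitted, and the sketch assumes at least one chunk; it duplicates the last hash when a level has an odd number of nodes.

import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.List;

// Minimal Merkle-root computation over equally sized chunks, as described above.
final class MerkleRoot {
    static byte[] root(List<byte[]> chunks) throws Exception {
        MessageDigest sha = MessageDigest.getInstance("SHA-256");
        List<byte[]> level = new ArrayList<>();
        for (byte[] chunk : chunks) {
            level.add(sha.digest(chunk));           // leaf hashes
        }
        while (level.size() > 1) {
            List<byte[]> parents = new ArrayList<>();
            for (int i = 0; i < level.size(); i += 2) {
                // Pair neighbors; duplicate the last hash if the level is odd.
                byte[] left = level.get(i);
                byte[] right = (i + 1 < level.size()) ? level.get(i + 1) : left;
                sha.update(left);
                sha.update(right);
                parents.add(sha.digest());          // parent hash
            }
            level = parents;
        }
        return level.get(0);
    }
}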

2.8 State of the Art

As companies have to deal with larger and larger amounts of data, and the need for high-speed access to that data increases as the market becomes more competitive, the performance and scalability of storage systems become a central component of their success. Companies such as Google and MapR and organizations like Apache Hadoop are working on scaling their data storage architectures to larger and larger clusters, but as the CAP impossibility result dictates [14], every solution has to compromise on either consistency, availability, or partition tolerance. As network connectivity problems are the norm rather than the exception, partition tolerance is practically a necessity. Hence, the compromise is between consistency and availability. Varying degrees of consistency and availability require different mechanisms for fault tolerance and redundancy.

This section will discuss some notable distributed data storage solutions and compare them with the properties and performance of the Hadoop Distributed File System. Figure 2.5 illustrates the differences between the platforms and the prospects of HopsFS in terms of scalability and consistency. As federated HDFS does not provide a unified namespace, it is omitted from the figure.

2.8.1 Federated HDFS

HDFS namespace federation tries to lessen the load on the name node by running several overlapping clusters on the same data nodes. Each cluster has its own folder hierarchy and acts as an independent file system. The architecture is illustrated in Figure 2.4. As HDFS in and of itself scales to 4000 nodes, and you can keep adding federated clusters, the scalability of the solution is practically unlimited. Namespace federation does not come without drawbacks, however. As the file hierarchies are isolated from each other, a file move across namespaces becomes a non-atomic operation. Management is also more complex, since every cluster is managed independently. For file operations within a namespace, the performance is the same as that of single-cluster HDFS.

Figure 2.4: HDFS Federation [2]

2.8.2 Amazon S3

Amazon S3 is a direct competitor to HDFS in the data storage market. In terms of elasticity and cost, S3 wins by a landslide [15]. With S3 you pay for what you store, and the pricing is competitive. HDFS forces you to pay the cost up front for buying or renting the hardware, and any storage space you are not using is wasted. However, to scale the system, Amazon has chosen to strongly reduce the consistency guarantees that S3 provides.

Contrary to the hierarchical nature of HDFS, S3 is a simple object store. Objects are stored in so-called buckets, each with a unique identifier key.

According to the documentation, a PUT operation on a new object provides read-after-write consistency, while updates and deletes are eventually consistent [16]. That means that a new file can be read after being written, but might not show up in bucket listings for some time. It also means that in the case of simultaneous updates, a read might respond with the old data or with either of the updated results until consistency has been reached. Put simply, a user cannot expect data to be available immediately after writing. This stands in stark contrast to the consistent operations on an HDFS instance.

Figure 2.5: Consistency-Scalability Matrix

2.8.3 Google Cloud Storage

Google Cloud Storage, like S3, is a non-hierarchical object store where objects are organized in buckets. It does, however, provide consistent PUTs of new objects, consistent PUT updates, and consistent deletes. Only access control is eventually consistent. Google Cloud Storage provides higher throughput than S3 for both uploads and downloads, but as an inevitable consequence of the stronger consistency, access latencies are also much higher [17].


Chapter 3 Method

This chapter presents the scientific approach used in this work, and the expected outcomes.

3.1 Goals

As an engineering thesis, this work focuses on the design and evaluation of a new solution to an engineering problem. The main goal is to produce a rationale for, a design of, and a verification of a new HDFS Block Report. More concretely, the hope is to provide an improved Block Report mechanism in HopsFS that has the potential to scale to clusters at least 10 times the previous maximum.

3.2 Tasks

The tasks to be undertaken in this work are:

1. A thorough review of related materials

2. An analysis of the distributed storage state of the art

3. An in-depth analysis of the properties and behavior of the existing HDFS Block Report

4. The production of a rationale for, and the design of, a new Block Report mechanism

5. An evaluation of the design in terms of correctness and performance


3.3 Evaluation

The correctness evaluation is done by reasoning, and to evaluate the potential performance of the new solution we will run simulations on a model of the system, using model parameters based on real-world measurements. If a correct implementation can be achieved in time, performance measurements will also be taken on a 40-node HopsFS cluster running within the SICS ICE data center in Luleå, Sweden.


Chapter 4

The Improved HopsFS Block Report

The block report protocol in HDFS and HopsFS is a limiting factor on cluster scalability. This chapter outlines a new design that has the potential to scale to 10 times the previous cluster sizes.

4.1 Solution Idea

Our new block report has some strict requirements: it should provide the same guarantees as the old block report (Section 2.5.2), it should do so without the linear-complexity full block report dominating the processing time, and it should not impose more than a marginal overhead on general operation.

The key insight is that, because the incremental block report already reports every replica modification, the full block report is mostly redundant. We use this fact to create a CRDT-inspired incremental hashing solution that can be used to determine whether the full block report is necessary or not. To support concurrent modifications, we split the report into a number of buckets, each with its own hash, so that some buckets can be inconsistent without the whole report having to be redone.

4.1.1 Incremental Hashing

In essence, the block report synchronizes a list of replicas and their states on a data node with the name node's knowledge of that data node. We avoid duplicated work by representing the list of replicas by an integer that we keep up to date based on which incremental reports we have seen. To do this, we use one-way functions and the commutative property of "+" over the integers.

Assume there is a one-way function that compresses a reported replica (see Table 2.3) into an integer:

hash(Replica) → Z,  Replica = (id, length, generationStamp, status)

We can use that function to generate a single integer that represents the whole report, by simply applying it to all the elements in the list and adding up the results:

hash(BlockReport) = Σ_{r ∈ BlockReport} hash(r)

The key realization is that, because of the commutative property of + over Z, the hash function also commutes, such that for replicas R1, R2 the following holds:

hash([R1]) + hash([R2]) = hash([R2]) + hash([R1]) = hash([R1, R2])

The old block reporting scheme already reports every change to a replica's state in an at-most-once fashion. That allows us to draw the following conclusion:

Theorem 1. Assume unordered, exactly-once delivery of incremental reports. Given an initial hash (H0 = 0) on the Name Node and a Data Node with zero replicas, allow only replica modifications that are additions or deletions of complete blocks. Upon performing a number of file writes and file deletions, if the hash is updated incrementally every time an update from the modified Data Node comes in (+ for block additions, - for deletions), then once all incremental reports have arrived, the stored hash will match the hash of the complete block report.
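The mechanism in Theorem 1 can be sketched in a few lines. The per-replica hash below is only a placeholder (any one-way function into the integers works), and wrap-around of Java's long arithmetic simply means the sum is taken modulo 2^64, which preserves commutativity.

import java.util.List;
import java.util.Objects;

// Sketch of the incremental, commutative hash from Theorem 1.
final class IncrementalReportHash {
    private long hash = 0;                       // H0 = 0

    static long hashReplica(long blockId, long length, long generationStamp, String status) {
        // Placeholder one-way function over the reported replica fields (Table 2.3).
        return Objects.hash(blockId, length, generationStamp, status);
    }

    void onReplicaAdded(long h)   { hash += h; } // incremental report: addition
    void onReplicaDeleted(long h) { hash -= h; } // incremental report: deletion

    long current() { return hash; }

    // Hash of a complete block report: order does not matter because + commutes.
    static long ofFullReport(List<Long> replicaHashes) {
        long sum = 0;
        for (long h : replicaHashes) sum += h;
        return sum;
    }
}

If every addition and deletion is applied exactly once, current() equals ofFullReport(...) over the data node's current replicas, which is exactly the matching condition the theorem states.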

The introduction of hflush and append to Hadoop [1] also introduced a set of new messages between the Data Node and the Name Node, namely BlockReceivedAndDeleted, which indicates that a block is being written, finalized, or deleted on the Data Node. The idea was to create an intermediary block report that catches changes at a higher granularity to enable reading intermediary replicas, while the complete block report was kept as a catch-all solution. We use this pre-existing incremental block report and extend it in our hash-based solution.

4.2 Design

4.2.1 A First Attempt

Assuming successful writing of files and, for now, ignoring appends and internal maintenance, the operations left are file creation and file deletion. Creating or deleting a file without errors results in corresponding additions and removals of replicas. Theorem 1 handles precisely that, but it has two limitations: it assumes that every update arrives exactly once, and it assumes that there is only one possible state, FINALIZED, that a replica can be in. We use it to design our first implementation in Listings 4.1 and 4.2, where processReport is the old block report processing code that checks every reported replica against the name node's knowledge of the associated block, schedules work, and updates metadata. The reported data node state is considered the "truth".

This solution is safe but insufficient. Note that the complete block report blockReport/2 calculates the hash of the actual report, not only of the FINALIZED blocks. That way we can be sure that the actual state of the data node is considered. However, given that at least one replica in a non-FINALIZED state is very likely to exist on a Data Node at any given time, the hash will virtually never match, and so the performance of the block report remains the same. In addition, if a data node restarts and misses updates, it will falsely report matching hashes; file appends will cause hash inconsistencies, since the last block of the appended file will be reported finalized twice; and replicas that get their generation stamp bumped will cause hash inconsistencies as well.

To summarize, there are four problems with this solution:

1. Hashes match only if all replicas are FINALIZED (cannot handle concurrent writes)

2. A datanode that restarts and misses updates will falsely report matching hashes (incorrect)

3. Hashes conflict if a file has been appended, since the last block of the appended file will be reported as FINALIZED at least twice: once when it was first written and again after the append (cannot handle appends)

4. Hashes conflict if a block has been recovered, since the replica re-finalized with a bumped generation stamp counts as a new FINALIZED replica (cannot handle recovery)

These issues will be addressed in the following sections.


Listing 4.1: Simple Hash-based approach, Name Node

NAME NODE

upon init():
    dataNodes = the set of data nodes in the cluster
    for dn in dataNodes:
        set hashes[dn] = 0

upon incBlockReport(dn, replica):
    keptReplica = processReportedBlock(replica)
    if (keptReplica && replica.status == FINALIZED):
        hashes[dn] = hashes[dn] + hash(replica)
    else if (!keptReplica):
        hashes[dn] = hashes[dn] - hash(replica)

upon blockReport(dn, replicas):
    reportHash = hash(replicas)
    if (hashes[dn] != reportHash):
        keptBlocks = processReport(replicas)
        hashes[dn] = hash(keptBlocks)
    else:
        // do nothing

Listing 4.2: Simple Hash-based approach, Data Node

DATA NODE

upon init():
    // Read block states from disk
    blocks = getDiskBlocks()
    doWork()

doWork():
    loop:
        receive commands || do work || receive blocks
        on block modified:
            trigger(self, incBlockReport(block))
        once an hour:
            trigger(blockReport(self, blocks))


4.2.2 Handling Concurrent Writes

Traditional Hadoop workloads are read-heavy. The Spotify workload that was handed to the Hops team was roughly 97% reads [4], which means a comparatively small number of replica modifications. However, a present-day Criteo workload adds more than 80 terabytes of data every day. If we assume that the data is being streamed and each file is at least as big as the block size of 64 MB, with a replication factor of three, the number of blocks added per day is bounded by 80 TB / 64 MB ≈ 4M blocks. 4M blocks over 1000 nodes is 4000 blocks per data node per day, ≈ 6 replicas/s per data node. If a replica of 64 MB takes three seconds to write, that means at any point in time there are 18 replicas being written per data node. Therefore, our single-hash solution is rendered useless.

To account for these concurrent writes, we divide the blockId space into a configurable number of "buckets". Blocks are assigned to buckets consistently using the blockId modulo the number of buckets. Each bucket gets an assigned hash and is processed independently of the other buckets. Given a high enough number of buckets, concurrent modifications only render a small fraction of the hashes inconsistent, and major performance gains can still be achieved; a rough estimate is sketched after the listings below. The only modification needed is replacing the single-hash-per-datanode data structure with a datanode-to-buckets map (see Listings 4.3 and 4.4).

Listing 4.3: Hash-based approach with Concurrent Modification Handling, Name Node

NAME NODE:

upon init():
    dataNodes = the set of data nodes in the cluster
    for dn in dataNodes:
        for bucketId in bucketIds:
            set hashes[dn][bucketId] = 0

upon incBlockReport(dn, replica):
    keptReplica = processReportedBlock(replica)
    bucketId = replica.blockId % NUM_BUCKETS
    if (keptReplica && replica.status == FINALIZED):
        hashes[dn][bucketId] = hashes[dn][bucketId] + hash(replica)
    else if (!keptReplica):
        hashes[dn][bucketId] = hashes[dn][bucketId] - hash(replica)

upon blockReport(dn, buckets):
    for bucket in buckets:
        bucketId = bucket.id
        reportHash = hash(bucket.blocks)
        if (hashes[dn][bucketId] != reportHash):
            keptBlocks = processReport(bucket.blocks)
            hashes[dn][bucketId] = hash(keptBlocks)
        else:
            // do nothing

Listing 4.4: Hash-based approach with Concurrent Modification Handling, Data Node

DATA NODE:

upon init():
    // Read block states from disk
    blocks = getDiskBlocks()
    doWork()

doWork():
    loop:
        receive commands || do work || receive blocks
        on block modified:
            trigger(self, incBlockReport(block))
        once an hour:
            trigger(blockReport(self, bucketize(blocks)))
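To get a feel for how many buckets a report can skip, the short sketch below estimates the expected fraction of clean buckets when W replicas are being modified concurrently and blocks are spread over B buckets by blockId modulo B. The uniform-assignment assumption and the formula are ours, not a measured property of the system.

// Back-of-the-envelope estimate (assuming uniform bucket assignment):
// with W concurrently modified replicas spread over B buckets, a given bucket
// is clean with probability (1 - 1/B)^W, so roughly that fraction of buckets
// can be skipped by the hash check.
final class BucketEstimate {
    static double expectedCleanFraction(int numBuckets, int concurrentWrites) {
        return Math.pow(1.0 - 1.0 / numBuckets, concurrentWrites);
    }

    public static void main(String[] args) {
        // Example figure from the text: roughly 18 replicas in flight per data node.
        int inFlight = 18;
        for (int buckets : new int[] {10, 100, 1000}) {
            System.out.printf("B=%d -> about %.0f%% of buckets clean%n",
                    buckets, 100 * expectedCleanFraction(buckets, inFlight));
        }
    }
}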

4.2.3 Handling Data Node Restarts

To handle the case of a stale data node, either due to a restart or because of missing messages, the first report, and any report from a data node detected as stale, is treated as if all hashes were inconsistent. Nodes are detected as stale using a mechanism provided internally by HDFS.

Listing 4.5: Handling data node restarts and stale data nodes, Name Node

NAME NODE

// The first block report and reports from stale data nodes
// should be completely processed
upon blockReport(dn, buckets):
    if (isFirstReport(dn) || isStale(dn)):
        for bucket in buckets:
            keptBlocks = processReport(bucket.blocks)
            hashes[dn][bucket.id] = hash(keptBlocks)
    else:
        for bucket in buckets:
            reportHash = hash(bucket.blocks)
            if (hashes[dn][bucket.id] != reportHash):
                keptBlocks = processReport(bucket.blocks)
                hashes[dn][bucket.id] = hash(bucket.blocks)

4.2.4 Handling Appends

To handle file appends, whenever a client begins an append operation, undo the hash of the file's last FINALIZED replica on each data node that stores it (see Listing 4.6).

Listing 4.6: Handling Appends, Name Node

NAME NODE

upon Client.append(path):
    file = getFile(path)
    block = getLastBlock(file)
    bucketId = block.id % NUM_BUCKETS
    datanodes = getLocations(block)
    for dn in datanodes:
        hashes[dn][bucketId] = hashes[dn][bucketId] - hash(block, FINALIZED)
    ... // Rest of old append logic

4.2.5 Handling Recovery

Upon block recovery initialization, if no recovery is in progress, undo the hashes of the corresponding replicas on the data nodes. Furthermore, if a client is forced to upgrade its pipeline due to a writing error, undo the hash of the FINALIZED replica on all data nodes in the previous pipeline (see Listing 4.7).

Listing 4.7: Handling Recovery, Name Node NAME NODE

/ / i n t e r n a l name node code

/ / c a l l e d d u r i n g l e a s e r e c o v e r y

FSNamesystem . i n i t i a l i z e B l o c k R e c o v e r y ( b l o c k ) : i f ! i n R e c o v e r y ( b l o c k ) :


        oldDataNodes = getLocations(block)
        bucketId = block.id % NUM_BUCKETS
        for dn in oldDataNodes:
            // hash block as replica by adding status
            hashes[dn][bucketId] -= hash(block, FINALIZED)

// Make sure we do not leave stale replicas unnoticed.
upon Client.updatePipeline(block):
    oldDataNodes = getLocations(block)
    bucketId = block.id % NUM_BUCKETS
    for dn in oldDataNodes:
        // hash block as replica by adding status
        hashes[dn][bucketId] -= hash(block, FINALIZED)
    ... // rest of update pipeline code

4.2.6 Final Design

Listings 4.8 and 4.9 contain the complete code of the new hash-based block report with all of the above improvements applied.

Listing 4.8: Final Design, Name Node

NAME NODE:

upon init():
    data_nodes = the set of data nodes in the cluster
    for DN in data_nodes:
        for bucketId in bucketIds:
            set hashes[DN][bucketId] = 0

upon incBlockReport(dn, replica):
    keptBlock = processReportedBlock(replica)
    bucketId = replica.blockId % NUM_BUCKETS
    if (keptBlock && replica.state == FINALIZED):
        hashes[dn][bucketId] =
            hashes[dn][bucketId] + hash(replica)
    else if (!keptBlock):
        hashes[dn][bucketId] =
            hashes[dn][bucketId] - hash(replica)

upon blockReport(dn, buckets):
    if (isFirstReport(dn) || isStale(dn)):


        for bucket in buckets:
            keptBlocks = processReport(bucket.blocks)
            hashes[dn][bucket.id] = hash(keptBlocks)
    else:
        for bucket in buckets:
            hash = hash(bucket.blocks)
            if (hashes[dn][bucket.id] != hash):
                keptBlocks = processReport(bucket.blocks)
                hashes[dn][bucket.id] = hash(bucket.blocks)

// Internal code
FSNamesystem.initializeBlockRecovery(block):
    if !inRecovery(block):
        oldDataNodes = getLocations(block)
        bucketId = block.id % NUM_BUCKETS
        for dn in oldDataNodes:
            // hash block as replica by adding status
            hashes[dn][bucketId] -= hash(block, FINALIZED)

upon Client.updatePipeline(block):
    oldDataNodes = getLocations(block)
    bucketId = block.id % NUM_BUCKETS
    for dn in oldDataNodes:
        // hash block as replica by adding status
        hashes[dn][bucketId] -= hash(block, FINALIZED)
    ... // rest of update pipeline code

upon Client.append(path):
    file = getFile(path)
    block = getLastBlock(file)
    bucketId = block.id % NUM_BUCKETS
    datanodes = getLocations(block)
    for dn in datanodes:
        hashes[dn][bucketId] =
            hashes[dn][bucketId] - hash(block, FINALIZED)
    ... // Rest of old append logic

Listing 4.9: Final Design, Data Node

DATA NODE:

upon init():


    // Read block states from disk
    blocks = get_disk_blocks()
    do_work()

do_work():
    loop:
        receive commands || do work || receive blocks
        on block modified:
            trigger(self, incBlockReport(block))
        once an hour:
            trigger(blockReport(self, bucketize(blocks)))
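To make the hash arithmetic in the listings concrete: all that is required of the bucket hash is a per-replica hash that can be added and subtracted, for example modular addition of a 64-bit digest, so that replicas contribute to the bucket hash independently of the order in which they are reported. The Java sketch below is an illustration of that idea under our own naming; it is not the actual HopsFS implementation, and the digest choice and field layout are assumptions.

    import java.nio.ByteBuffer;
    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    // Illustrative sketch of per-data-node bucket hashes on the name node.
    // Class names, the MD5 digest and the 64-bit fold are assumptions.
    class BucketHashes {
        static final int NUM_BUCKETS = 1000; // assumed configuration value

        // hashes[bucketId] for a single data node, combined by modular addition.
        private final long[] hashes = new long[NUM_BUCKETS];

        // Hash a replica from the fields the report hash depends on:
        // block id, block length, generation stamp and replica state.
        static long replicaHash(long blockId, long length,
                                long generationStamp, String state) {
            try {
                MessageDigest md = MessageDigest.getInstance("MD5");
                ByteBuffer buf = ByteBuffer.allocate(3 * Long.BYTES);
                buf.putLong(blockId).putLong(length).putLong(generationStamp);
                md.update(buf.array());
                md.update(state.getBytes(StandardCharsets.UTF_8));
                // Fold the digest into a long so hashes can be added/subtracted.
                return ByteBuffer.wrap(md.digest()).getLong();
            } catch (NoSuchAlgorithmException e) {
                throw new AssertionError(e);
            }
        }

        static int bucketOf(long blockId) {
            return (int) Math.floorMod(blockId, (long) NUM_BUCKETS);
        }

        // Incremental report: a FINALIZED replica was kept, add its contribution.
        void addReplica(long blockId, long length, long gs) {
            hashes[bucketOf(blockId)] += replicaHash(blockId, length, gs, "FINALIZED");
        }

        // Undo a replica's contribution (removal, append, recovery, new pipeline).
        void undoReplica(long blockId, long length, long gs, String state) {
            hashes[bucketOf(blockId)] -= replicaHash(blockId, length, gs, state);
        }

        // Full report: an unchanged bucket can be skipped if its hash matches.
        boolean bucketMatches(int bucketId, long reportedHash) {
            return hashes[bucketId] == reportedHash;
        }
    }

Because addition over 64-bit integers (implicitly modulo 2^64 in Java) is commutative and invertible, adding and later subtracting the same replica hash returns the bucket hash to its previous value, which is exactly what the undo steps in Listings 4.6 to 4.8 rely on.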


Chapter 5

Analysis

5.1 Results

The performance evaluation was done through simulations of block report processing time for different cluster configurations and rates of file system operations per second. Correctness and design are analyzed in Section 5.2.

5.1.1 Simulations

The model for calculating block report speed in HopsFS is given below, where h is the number of buckets, b is the number of blocks per data node, cps is the number of replica changes per data node per second, bts is the index-scan read speed in rows per second, bms is the number of replica states corrected per second, RTT is the data node to name node round-trip time, RTT_DB is the name node to database round-trip time, i_n is the number of incorrect hashes after the n-th block report, and e is the expected number of lost incremental updates between any two block reports:

\begin{aligned}
i_0 &= 0 \\
i_n &= \max(e,\ BR_n \cdot cps), && n \geq 1 \\
BR_1 &= RTT + \frac{bytesInReport}{netSpeed} \\
BR_n &= RTT + RTT_{DB} + \frac{\min(i_{n-1},\,h)}{h} \cdot \frac{b}{bts} + \frac{e}{bms}, && n \geq 2
\end{aligned}
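As a rough sanity check of the model, plugging in the parameter values of Table 5.1 below together with, for example, h = 2000 buckets per data node, b = 10^6 blocks per data node and i_{n-1} = 50 incorrect hashes (example values of ours, not measurements) gives a per-report time of roughly

BR_n \approx 0.005 + 0.001 + \frac{\min(50,\,2000)}{2000} \cdot \frac{10^6}{150\,000} + \frac{50}{1000} \approx 0.22\ \text{s}.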

Keep in mind that the model assumes that stale data nodes are avoided for new block writes, which is the case as long as the fraction of stale data nodes is below a configurable threshold, defaulting to 50%.

In every case, we assume a 2.7% write percentage and a network latency of 5 ms. In Figure 5.1 we simulate the effect of increasing the number of buckets and measure block reporting time as a function of the number of file system operations performed per second.


Property                         Value
RTT                              5 ms
RTT_DB                           1 ms
bytes per reported block         30
network speed                    5 Gbps
corrections per second (bms)     1000
errors between reports (e)       50
blocks read per second (bts)     150 000

Table 5.1: Simulation Configuration

In Figure 5.2 we simulate the effect of database read speed on block reporting time. The configuration parameter values are defined in Table 5.1, according to measurements taken in our production cluster. Because of the impact of concurrent operations, we need to run a step-wise simulation until a stable report time has been reached; every individual measurement is therefore the last of a sequence of 50 block reports for that configuration.
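A minimal sketch of this step-wise simulation, transcribing the model of Section 5.1.1 with the parameter values of Table 5.1; the class name and the concrete choices of h, b, cluster size and operation rate are example inputs of ours, not measured results:

    // Step-wise simulation of the block report model in Section 5.1.1.
    // Parameter names follow the model; h, b, the cluster size and the
    // operation rate are example inputs, not measured values.
    class BlockReportSimulation {
        public static void main(String[] args) {
            double rtt = 0.005;          // DN-NN round-trip time [s]
            double rttDb = 0.001;        // NN-database round-trip time [s]
            double bts = 150_000;        // index-scan rows read per second
            double bms = 1_000;          // replica states corrected per second
            double e = 50;               // lost incremental updates between reports
            double bytesPerBlock = 30;   // bytes per reported block
            double netSpeed = 5e9 / 8;   // 5 Gbps expressed in bytes per second

            long h = 2_000;              // buckets per data node (example)
            long b = 1_000_000;          // blocks per data node (example)
            double opsPerSecond = 2e6;   // file system operations per second (example)
            double writeRatio = 0.027;   // 2.7% writes
            int dataNodes = 10_000;      // cluster size (example)

            // Replica changes per data node per second, writes spread evenly.
            double cps = opsPerSecond * writeRatio / dataNodes;

            // BR_1: the first report is transmitted and processed in full.
            double br = rtt + (b * bytesPerBlock) / netSpeed;
            System.out.printf("report 1: %.3f s%n", br);

            // Iterate the recurrence; the last of 50 reports is the measurement.
            for (int n = 2; n <= 50; n++) {
                double incorrect = Math.max(e, br * cps);               // i_{n-1}
                br = rtt + rttDb
                   + (Math.min(incorrect, h) / (double) h) * (b / bts) // stale buckets
                   + e / bms;                                          // lost updates
            }
            System.out.printf("report 50: %.3f s%n", br);
        }
    }

With these example inputs the recurrence converges after a few iterations, which is why reporting the last of 50 block reports per configuration is sufficient for the plots.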

5.1.2 Real-world Performance

Unfortunately, we did not manage to reproduce these results in a real cluster because of issues remaining in the implementation. However, to get as close as possible to real-world values in our simulation, we adjusted the model parameters according to values measured on the HopsFS production cluster at SICS ICE in Luleå.

5.2 Analysis

5.2.1 Correctness

Correctness is an essential property for a data-loss protection mechanism. The traditional HDFS block report is quite simple to reason about, since the complete data node state is transmitted and replicas are compared one by one. Our hash-based solution instead compares a number of calculated hashes, and we must therefore show that these hashes correspond to the expected state. The hash-based reporting uses knowledge of the underlying state and its transformations to keep the hashes up to date.

We start by showing correctness of a single-hash solution, and afterwards extrapolate it to multiple "buckets". As we are comparing a hash of the state instead of the actual state, we need to make sure that the hash actually corresponds to the state it is supposed to represent.


[Figure 5.1: Impact of Number of Buckets on Block Report Time. Time per report [s] versus millions of file system operations per second (2.7% writes), for 1k, 2k, 3k and 4k buckets/dn.]

[Figure 5.2: Impact of Block Processing Speed on Block Report Time. Time per report [s] versus millions of file system operations per second (2.7% writes), for 50k, 100k and 150k blocks/s.]


On the data node, processing power is abundant, so we generate a new hash every time we send a report. On the name node it gets more complicated. Block and replica information is spread over multiple tables: block infos, replicas, and replica under constructions. The block infos table contains the block length, block id, inode reference, and generation stamp.

The replicas table is simply a mapping between blocks and storages (data nodes), and the replica under constructions table contains non-finalized replica states.

Our hash depends on the block id, block length, generation stamp and construction state, so if a block info entry is modified in either its length or its generation stamp, this must be reflected in the hash of every replica of that block.

Since we are not re-calculating the hash, but rather incrementally updating it on replica or block updates, we need to make sure that all updates are reflected in the hash and that, in case of doubt, we invalidate the hash rather than risk a false positive. We believe that the reasoning put forward in the first attempt at a solution (see Section 4.2.1) and the improvements introduced in the following Sections 4.2.3, 4.2.4 and 4.2.5 are sufficient to show the validity of the solution.

5.2.2 Pitfalls

The motivation behind this project was a block reporting time of roughly 10 s per million blocks which, given the high durability expectations on HDFS, was not enough to scale an hourly-reporting cluster configuration to more than perhaps a thousand nodes. HopsFS comes with load balancing between name nodes and durable metadata in the distributed database NDB, but that is not quite enough on its own. With the improved design, we can see that, given enough buckets, the block reporting speed is improved by at least an order of magnitude, with a worst-case report time comparable to the old solution.

However, there are several pitfalls to consider.

5.2.2.1 Sensitivity to Misconfiguration

The most obvious problem is configuring the block report with too few buckets to handle concurrent operations. The threshold between large improvements and no improvement at all is very narrow. The results of the simulation clearly show that, given enough buckets to handle concurrent operations, block report times remain very short. However, what happens when the number of file system operations per second grows to the point where the buckets are no longer enough? Currently, the only way to change the number of buckets is to shut down the whole cluster, reconfigure every data node, reconfigure the name nodes, and update the bucket ids of every replica in the database. If
