
Making Big Data Smaller

Reducing the storage requirements for big data with erasure coding for Hadoop

Steffen Grohsschmiedt

Master of Science Thesis

Software and Computer Systems

School of Information and Communication Technology
KTH Royal Institute of Technology

Stockholm, Sweden

21 July 2014

Examiner: Professor Seif Haridi
Supervisor: Dr. Jim Dowling

TRITA-ICT-EX-2014: 98


© Steffen Grohsschmiedt, 21 July 2014


Abstract

The amount of data stored in modern data centres is growing rapidly. Large-scale distributed file systems, which maintain the massive data sets in data centres, are designed to work with commodity hardware. Due to the quality and quantity of the hardware components in such systems, failures are considered normal events and, as such, distributed file systems are designed to be highly fault-tolerant. A common approach to achieving fault tolerance is redundancy: three copies of a file are stored across different storage nodes, thereby increasing the storage requirements by a factor of three and further aggravating the storage problem.

A concrete implementation of such a file system is the Hadoop Distributed File System (HDFS). This thesis explores the use of RAID-like mechanisms in order to decrease the storage requirements for big data. We designed and implemented a prototype that extends HDFS with a simple but powerful erasure coding API.

Compared to existing approaches, we decided to locate the erasure-coding management logic in the HDFS NameNode, as this allows us to use internal HDFS APIs and state. Because of that, we can repair failures associated with erasure-coded files more quickly and with lower cost. We evaluate our prototype, and we also show that the use of erasure coding instead of replication can greatly decrease the storage requirements of big data without sacrificing reliability and availability. Finally, we argue that our API can support a large range of custom encoding strategies, while adding the erasure coding logic to the NameNode can significantly improve the management of the encoded files.


Acknowledgements

I would like to express my deepest gratitude to my supervisor Dr. Jim Dowling, who was a great source of inspiration and coined the vision of this project. Also, I would like to thank my advisers Mahmoud Ismail and Salman Niazi for countless hours of support.

Finally, I would like to thank the Swedish Institute of Computer Science (SICS) for providing me with a nice working environment, all necessary compute resources, and great colleagues always willing to exchange ideas.


Contents

1 Introduction
  1.1 Problem description
  1.2 Problem context
  1.3 Structure of this thesis

2 Background
  2.1 Big data
  2.2 MapReduce
  2.3 The Google File System
  2.4 Apache Hadoop
    2.4.1 HDFS
  2.5 Hadoop Open Platform as a service
    2.5.1 HOP-HDFS
  2.6 Redundant Arrays of Inexpensive Disks
    2.6.1 RAID 1
    2.6.2 RAID 5
    2.6.3 RAID 6
  2.7 Erasure coding
    2.7.1 Reed-Solomon codes
      2.7.1.1 The repair problem
    2.7.2 Locally Repairable Codes
  2.8 Failures in distributed storage systems
    2.8.1 Failure models

3 Related work
  3.1 HDFS-RAID
    3.1.1 Xorbas
  3.2 Diskreduce
  3.3 Microsoft Azure Storage
  3.4 Google Colossus
  3.5 Compression

4 Method
  4.1 Goals
  4.2 Delimitations
  4.3 Solution
    4.3.1 System architecture
    4.3.2 State
    4.3.3 Overall process
    4.3.4 Encoding process
    4.3.5 Repair process
    4.3.6 Deletion process
    4.3.7 Revocation process
    4.3.8 Placement of blocks
    4.3.9 Monitoring of files
    4.3.10 Application programming interface
    4.3.11 Supported codes
    4.3.12 Modified components
    4.3.13 Extensibility

5 Analysis
  5.1 Storage overhead
  5.2 Encoding performance
  5.3 Repair performance
  5.4 Read performance
  5.5 Availability and degraded reads
  5.6 Repair delay
  5.7 Resilience to failures
  5.8 Effect of block placement
  5.9 Reliability and prioritisation of repairs
  5.10 Comparison to HDFS-RAID

6 Conclusions
  6.1 Conclusion
    6.1.1 Evaluation of goals
  6.2 Future work
    6.2.1 Testing under a real workload
    6.2.2 File append
    6.2.3 Overhead of small files
    6.2.4 Advanced placement of blocks
    6.2.5 Adaptive replication and encoding heuristic

Bibliography

A EncodedFileStatus
  A.1 Fields
  A.2 Status
  A.3 Parity status


List of Figures

2.1 The three Vs of big data [1]
2.2 The Google File System [2]
2.3 The Hadoop Distributed File System [3]
2.4 RAID 5 example
2.5 Erasure coding example
2.6 (10, 6, 5) Locally Repairable Code [4]
2.7 Duration of node unavailability [5]
2.8 Failed nodes per day at Facebook [4]
2.9 Markov model for RAID 5 [6]
3.1 HDFS-RAID (Xorbas) overview
4.1 HOP-EC overview
4.2 ErasureCodingManager in the NameNode
4.3 Erasure coding library
4.4 Overall process
4.5 Block placement example for a (6, 4) encoded file
4.6 States of an encoded file
5.1 Storage overhead
5.2 Encoding duration
5.3 Repair duration
5.4 Read performance [7]
5.5 Read duration of a 10 GB file
5.6 Triplication: Corrupted files per failed node
5.7 HOP-EC: Corrupted files per failed node
5.8 HOP-EC: Repair duration per failed node
5.9 HOP-EC: Corrupted files during random block loss
5.10 Markov model for RS and LRC [4]
5.11 HDFS-RAID: Corrupted files per failed node


List of Tables

2.1 MTTF of replication compared to various RS codes [5]
5.1 Comparison of HOP-EC and HDFS-RAID


List of listings

1 Word count in MapReduce
2 HOP-EC API


List of Acronyms and Abbreviations

This document requires readers to be familiar with terms and concepts used in the field of distributed systems and data analytics. For clarity, some of these terms are summarized here with a brief description, before making use of them in the following sections.

API Application Programming Interface - The interface of a piece of software exposed to application developers.

DAL Data access layer - A software layer providing access to data storage.

ECC Error-correcting code - A mechanism for encoding data so that a specific number of failures can be tolerated. [8]

GFS Google File System - A distributed file system developed by Google. [2]

HAR Hadoop Archive - A utility of Hadoop combining multiple files into one file [17, p. 78].

HDFS Hadoop Distributed File System - A distributed file system maintained by the Apache Software Foundation. [9]

HOP Hadoop Open Platform as a Service - A distribution of Hadoop developed at KTH and SICS. [10]

HOP-HDFS A distributed file system extending HDFS with multiple NameNodes. [11,12]

MTTDL Mean Time To Data Loss - Measure of the expected time until data is lost in a storage system. [6]


MTTF Mean Time To Failure - Measure of the expected time to failure of hardware components. [13]

RAID Redundant Array Of Inexpensive Disks - An array of hard disks increasing throughput and reliability. [14]

SICS SICS Swedish ICT - A research institute in Stockholm. [15]


Chapter 1 Introduction

The need to maintain and analyse a rapidly growing amount of data, often referred to as big data, is increasing. Nowadays, not only big internet companies such as Google, Facebook and Yahoo! apply methods to analyse such data [2, 9, 16], but more and more enterprises in general. This trend was already underlined by a study from The Data Warehousing Institute (TDWI) [1] conducted across multiple sectors in 2011. The study revealed that 34% of the surveyed companies were applying methods of big data analytics, while 70% thought of big data as an opportunity.

A common approach to handling massive data sets is running a distributed file system such as the Google File System (GFS) [2] or the Hadoop Distributed File System (HDFS) [9] in data centres with hundreds to thousands of nodes storing petabytes of data. Popular examples of such data centres are the ones operated by Google, Facebook and Yahoo!, with 1000 to 7000 [5], 3000 [4] and 3500 nodes [9] respectively, providing storage capacities ranging from 9.8 petabytes (Yahoo! [9]) to hundreds of petabytes (Facebook [16]). The storage requirements for these data centres are increasing rapidly over time. For instance, Facebook claims a growth rate of a few petabytes every week [16]. To cope with this massive growth without exploding costs, the use of commodity servers and disk drives is common [5].

With this quantity of commodity servers and a high workload, failures are the norm [2]. Consequently, in order to provide data reliability and availability despite frequent node failures, data is usually replicated onto multiple nodes and racks [2, 9]. However, replicating data n times also requires n times the disk space and hence increases the storage requirements even further.


1.1 Problem description

The Hadoop Distributed File System [9] is capable of handling petabytes of data spread over thousands of nodes built of commodity hardware, while tolerating constant node failures. The high level of fault tolerance is achieved by replicating data across nodes and racks. The default strategy is to replicate a chunk of data onto three different nodes located on two different racks [17, p. 74].

Whereas this offers great resilience to failures, it also introduces a large storage overhead. For instance, storing a file of 100 GB requires 300 GB of disk space. The total storage capacity available to users is hence divided by three. A widely used approach to this issue in general file systems is the use of so-called Redundant Arrays of Inexpensive Disks (RAID) [18]. Advanced RAID systems divide files into stripes consisting of a predefined number of blocks. Special mathematical functions are applied to each stripe in order to generate parity blocks. These functions allow the original data to be reconstructed even when some blocks are missing. To allow a code-specific number of disks to fail while still being able to recompute the original data, the blocks of each stripe and the corresponding parity blocks are spread over multiple disks. Using this concept, the redundancy storage overhead introduced by the parity blocks is smaller than that of full replication.

This thesis introduces and evaluates HOP-EC, a RAID-like mechanism for HDFS that replaces the triplication mechanism without sacrificing reliability or availability. HOP-EC thus solves the redundancy storage overhead problem of HDFS and thereby mitigates the storage growth problem of big data.

1.2 Problem context

In The Pathologies of Big Data [19], Jacobs characterizes big data as data that should "be defined at any point in time as 'data whose size forces us to look beyond the tried-and-true methods that are prevalent at that time'". He states that this is not necessarily limited to the possibility of storing the data, but also concerns the means of analysing it. Jacobs claims that this is not a new phenomenon, but that the scale has changed with the available possibilities.

A popular approach to tackling the latest generation of big data is the MapReduce programming model [20] and its open-source implementation Hadoop [17], whose concepts exploit massive parallelism in combination with linear scalability. Hadoop has already been successfully applied by many organizations, with big internet companies being a popular example [9, 16]. Additionally, several data warehouse and machine learning solutions, such as Apache Pig [21], Apache Hive [22] and Apache Mahout [23], have been built on top of it.

Also well known, but not publicly available, are the Google solutions GFS [2] and Colossus [24]. The development of Hadoop was strongly influenced by publications about these systems [17, p. 9].

1.3 Structure of this thesis

Chapter 1 defines the problem and characterizes its context. Chapter 2 then introduces the background and the specific knowledge necessary for the reader to understand the problem and the rest of this thesis. Related work is discussed in chapter 3. Subsequently, chapter 4 states the methods, goals and metrics used, and presents the proposed solution of this work. The solution is then analysed and evaluated in chapter 5. Finally, chapter 6 draws conclusions and gives directions for future work.


Chapter 2 Background

To understand this thesis, the reader needs specific knowledge about the tools and concepts used to analyse massive amounts of data, as well as the methods of providing redundancy. The following sections define the term big data and describe the related concepts. Subsequently, an introduction to RAID technologies and erasure coding is given. Finally, failures in distributed storage systems and associated failure models are discussed.

2.1 Big data

Big data is an umbrella term, often used to describe huge data sets and the respective analytical methods. However, the data sets tagged with this term usually have more characteristics than just their size that make them difficult to handle. Russom [1] gives a more precise definition by characterizing big data via the three Vs, namely volume, variety and velocity, as illustrated in figure 2.1.

Volume describes the physical size of the data. The amount of data to be analysed might be massive but still needs to be processed. The size varies from terabytes to petabytes and may soon reach exabytes.

Variety stands for the diversity of data that needs to be processed. The data that needs to be analysed comes from many different sources, has many different formats and may be structured, semi-structured or even unstructured. Human language and texts, for instance, are unstructured data, while XML documents are semi-structured. All of these kinds of data can include valuable information. Accordingly, the analytical tools to be used need to be capable of handling arbitrary input formats.

Velocity is another important property of big data and can be seen as the frequency at which data is generated or delivered. The data to be analysed might come as a fast, continuous and endless stream. Additionally, the knowledge model derived from this stream might need to be updated in a timely manner, or even in real time.

Figure 2.1: The three Vs of big data [1]

2.2 MapReduce

MapReduce is a programming model originally introduced by Google [20]. MapReduce implementations are automatically parallelizable by design. The model is inspired by aspects of functional programming and breaks algorithms down into a map function, a reduce function and an optional combine function. Each map function reads an input set of key/value pairs and emits an intermediate set of key/value pairs. The MapReduce framework groups the intermediate key/value pairs with the same key and passes them to the reduce function in a single call.

The reduce function reads all values for a specific key and applies a function to them, usually resulting in a much smaller output. The combine function can be applied between the map and the reduce phase in order to reduce the amount of intermediate data before transferring it over the network. Multiple executions of the map and the reduce function are started in parallel by the framework. The input chunk size for mappers and the data partitioning, i.e. the way the intermediate data is divided onto reducers, are specified by the user.

A popular example application, which was introduced in the original paper, is counting the frequency of words in documents. A Python version of the appropriate map and reduce functions is shown in listing 1. The name and the content of a document are passed to the map function as key and value. The function iterates over all words, emitting each word together with the value one. The reduce function obtains a word as key and all of its related values as a list of ones. It then accumulates all the ones in a loop and emits the final result for the word: the number of occurrences in the analysed documents.

The Google implementation of MapReduce is executed on top of GFS. The framework attempts to execute map tasks on a node on which the input data is located or, if that is not possible, on a node close to it. Node failures and slow nodes are handled by rescheduling tasks after a timeout.

The MapReduce model, in combination with a distributed file system, is a powerful way to parallelize algorithms and to handle large data sets, as it scales excellently while reducing bandwidth requirements by exploiting data locality.

# 'emit' is assumed to be provided by the MapReduce framework.

# key = document name
# value = document content
def map(key, value):
    for word in value.split():
        emit(word, 1)       # emit the value one for every word occurrence

# key = word
# value = list of counts (ones) emitted for this word
def reduce(key, value):
    total = 0
    for v in value:
        total += v
    emit(key, total)        # total number of occurrences of the word

Listing 1: Word count in MapReduce

2.3 The Google File System

The Google File System (GFS) [2] is an append-only distributed file system designed to work at large scale, in data centres built of thousands of commodity servers facing constant node failures. Early versions of GFS had a single-master, multiple-chunk-server architecture, as shown in figure 2.2. The namespace was handled by the master, whereas chunk servers stored the file data. Files were divided into large chunks, usually with a size of 64 megabytes, in order to reduce the amount of metadata stored on the master. By default, chunks were replicated three times and placed onto multiple racks. The single-master architecture enabled the application of a sophisticated chunk placement strategy in order to increase availability and throughput. Clients had to query the master for write and read locations.


In order to guarantee the reliability of the metadata, the master state was replicated to multiple servers and a new master process was started in case of fatal errors. A data mutation was only considered committed after being written to all master replicas. To increase availability, so-called 'shadow' masters provided read-only data access in case of master downtime.

The original GFS design had multiple drawbacks [25], with the most significant one being the single-master architecture, which was originally chosen to simplify the design. This architecture limits scalability and availability, as the master becomes a bottleneck and a single point of failure. Given that the master stored the metadata for all files, the workload of the master increases heavily with the number of files. Google identified this issue and modified later versions of GFS to support multiple masters in parallel. The second issue was the chunk size of 64 megabytes. While originally designed for Google's crawling and indexing process, which produces large files, GFS was increasingly used for other applications, which require a much smaller chunk size, as their files are smaller than 64 megabytes. Newer versions use a chunk size of one megabyte, which is easily possible with a multiple-master architecture, as storing metadata is no longer a limitation.

A recent version of GFS, codenamed Colossus [24], has an automatically sharded metadata layer and supports Reed-Solomon codes instead of chunk replication in order to guarantee reliability. Reliability through the use of Reed-Solomon codes will be discussed later in this document.

Figure 2.2: The Google File System [2]


2.4 Apache Hadoop

Apache Hadoop [17] is an open-source software framework for large-scale data processing. It includes a distributed file system called the Hadoop Distributed File System (HDFS) and a framework for MapReduce. Many data analysis, data warehousing and machine learning solutions have been built on top of it. The most commonly known extensions of Hadoop are Apache Pig [21], Apache Hive [22], Apache HBase [26], Apache ZooKeeper [27] and Apache Mahout [23].

Recent versions of Hadoop also include a resource negotiator called Yet Another Resource Negotiator (YARN), often referred to as NextGen MapReduce or MRv2 for short. YARN is, inter alia, used to execute MapReduce jobs. The design and concepts used by Hadoop are inspired by the Google papers about GFS and MapReduce [17, p. 9]. Similar to MapReduce on GFS, Hadoop exploits data locality for MapReduce jobs by trying to execute map jobs on a DataNode which hosts the data. If that is not possible, the framework attempts to execute the job on a node close to the location of the data, for instance on the same rack. This can greatly improve the overall performance [28] and reduces the network bandwidth requirements.

2.4.1 HDFS

HDFS [9] is the core of Hadoop and works similarly to early versions of GFS. It is an append-only distributed file system designed to work with commodity servers facing frequent failures. It has a single master, called the NameNode, which acts as a namespace and metadata server. Files are divided into chunks (blocks), sized 64 megabytes by default, and stored on so-called DataNodes. The blocks are generally replicated three times on different nodes and placed on at least two different racks. Figure 2.3 illustrates this architecture. Clients need to obtain block locations from the NameNode before writing or reading. The NameNode maintains the placement of blocks, as well as their replication level, in order to compensate for node failures.

HDFS faces the same issues as early versions of GFS. The NameNode is a bottleneck and a single point of failure. If the NameNode becomes unreachable, the whole file system stops working. To increase resilience to failures, newer versions of Hadoop allow having additional NameNodes in active standby, able to take over in the event of a failure [17, p. 49]. Additionally, HDFS Federation [17, p. 50] was introduced in Hadoop 0.23, which allows having multiple namespaces on the same set of DataNodes. This partially solves the problem of the NameNode being a congestion point. However, it has some limitations, as clients need to use multiple namespaces.


Figure 2.3: The Hadoop Distributed File System [3]

2.5 Hadoop Open Platform as a service

Hadoop Open Platform as a service (HOP) [10] is a Hadoop distribution based on Hadoop 2. It provides namespace scalability through the support of multiple NameNodes, platform-as-a-service support for creating and managing clusters, and a dashboard for simplified administration. HOP is developed in cooperation between KTH and SICS [15].

2.5.1 HOP-HDFS

HOP-HDFS [11, 12] is a fork of HDFS and part of HOP. It aims at providing high availability and scalability for HDFS. This is achieved by making the NameNode stateless, thereby adding support for the use of multiple NameNodes at the same time. Instead of storing any state in the NameNode, the state is stored in a distributed database offering high availability and high redundancy. The current implementation uses MySQL Cluster [29], which utilizes NDB Cluster as an underlying storage engine.

HOP-HDFS is a promising approach that could make HDFS similar to Colossus, while overcoming the scalability and availability limitations of the current Hadoop implementation. Through its support for larger amounts of metadata, it could also make the use of block sizes smaller than 64 megabytes efficient, which might be useful for many applications.

2.6 Redundant Arrays of Inexpensive Disks

Redundant Arrays of Inexpensive Disks (RAID) were introduced by Patterson et al. [18] in the 1980s, when compute performance was increasing in a way that the performance of secondary storage could not keep up with. As cheap commodity disks had become available, a possible solution for increasing throughput was building arrays out of such disks. Whereas this can greatly improve throughput, it significantly reduces the mean time to failure (MTTF) of such a system. As stated by Patterson et al., the MTTF of a disk array can be calculated by dividing the MTTF of one disk by the number of disks, as shown in equation 2.1. For a large disk array, the MTTF decreases significantly and failures are hence more likely to happen.

\[
\mathrm{MTTF}_{\mathrm{array}} = \frac{\mathrm{MTTF}_{\mathrm{disk}}}{\#\mathrm{disks}} \tag{2.1}
\]
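As an illustration, assuming (purely for the sake of the example) a single-disk MTTF of 1,000,000 hours, which is of the order of the figures quoted in section 2.8.1, an array of 1,000 such disks has

\[
\mathrm{MTTF}_{\mathrm{array}} = \frac{1{,}000{,}000\ \mathrm{h}}{1000} = 1000\ \mathrm{h} \approx 42\ \text{days},
\]

so even highly reliable individual disks lead to frequent failures at the array level.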

As a solution to the decrease in MTTF, Patterson et al. introduced RAID concepts which use error-correcting codes (ECC) to compute redundancy information, which is then spread across disks or stored on additional redundancy disks. This increases reliability significantly and makes the usage of disk arrays feasible. The RAID levels 1, 5 and 6, as described by Chen et al. [14], are summarized in the following subsections, as they are related to approaches discussed in this thesis. Given that the data in HDFS is append-only, RAID level properties such as update costs are not considered, as they are not relevant in the context of this particular thesis.

2.6.1 RAID 1

RAID 1 is a simple mechanism which does not make use of any error-correcting codes. Instead, whenever data is written, it is automatically replicated to a redundant disk. While this mechanism ensures that a single disk failure can be tolerated, as there is always a second copy of the data without requiring any repair, it also reduces the total available storage capacity by a factor of two. When data is accessed, it can be read from both copies, hence increasing read performance. This is the same approach as used by GFS and HDFS.


2.6.2 RAID 5

RAID 5 is a mechanism that allows one out of n disks in an array to fail without losing any data. It divides data into blocks of the same size and groups them into stripes of n − 1 blocks. One parity block of the same size as the source blocks is computed for each stripe. Parity blocks are spread uniformly across disks, and only a single block belonging to each stripe is allowed to be stored on any single disk.

According to Plank [30], RAID 5 uses an ECC scheme called "n+1 parity". Parity computation and block reconstruction can therefore be done using simple XOR operations. A toy example, considering a disk array consisting of five disks and a block size of one byte, is shown in figure 2.4. First, the value of the parity block is calculated by XORing the values of blocks one to four. If the disk on which block 3 was stored then fails, the original value of block 3 can be restored by XORing blocks 1, 2, 4 and the parity block. Blocks larger than one bit are handled analogously, by calculating the ith bit of the parity block from the ith bits of all source blocks.

Given that XOR operations are computationally cheap and the storage overhead per stripe is only one block, RAID 5 is a very efficient way of implementing redundancy.
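To make the XOR mechanism concrete, the following minimal Java sketch computes a RAID 5 style parity block over a stripe of equally sized in-memory blocks and reconstructs a single lost block; the class and method names are illustrative and not part of any system discussed in this thesis.

public class XorParityExample {

    // Compute the parity block as the bytewise XOR of all source blocks of a stripe.
    static byte[] computeParity(byte[][] sourceBlocks) {
        byte[] parity = new byte[sourceBlocks[0].length];
        for (byte[] block : sourceBlocks) {
            for (int i = 0; i < parity.length; i++) {
                parity[i] ^= block[i];
            }
        }
        return parity;
    }

    // Reconstruct a single lost block by XORing the parity block with the surviving blocks.
    static byte[] reconstruct(byte[][] survivingBlocks, byte[] parity) {
        byte[] lost = parity.clone();
        for (byte[] block : survivingBlocks) {
            for (int i = 0; i < lost.length; i++) {
                lost[i] ^= block[i];
            }
        }
        return lost;
    }

    public static void main(String[] args) {
        byte[][] stripe = {{1}, {2}, {3}, {4}};       // four one-byte source blocks
        byte[] parity = computeParity(stripe);
        byte[][] surviving = {{1}, {2}, {4}};         // block 3 is lost
        System.out.println(reconstruct(surviving, parity)[0]);  // prints 3
    }
}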

Figure 2.4: RAID 5 example

2.6.3 RAID 6

RAID 6 is designed to tolerate double disk failures. This is required as failures become more likely in larger disk arrays. To achieve double failure tolerance, two parity blocks are computed [31] and two additional disks are required. The data is divided into blocks and spread over disks in a way that is similar to RAID 5. Reed-Solomon (RS) codes are usually applied to compute the parity blocks. RS codes are discussed in section 2.7.1.

2.7 Erasure coding

Erasure coding is a type of Forward Error Correction (FEC) [32]. An erasure code transforms a set of input blocks into a larger set of encoded blocks, so that a subset of the encoded blocks is sufficient to reconstruct the original data. The encoded blocks do not necessarily contain the original blocks.

An important characteristic of an erasure code is whether it is maximum distance separable (MDS) or not. An MDS code generates n encoded blocks out of k source blocks in such a way that any k of the encoded blocks are sufficient to reconstruct the original data. A code with this property is denoted as an (n, k) code [32]. MDS codes are proven to be optimal in terms of encoding overhead [4]. Consequently, it is not possible to construct a code which provides the same level of redundancy with a smaller overhead.

An example of such a code is shown in figure 2.5. An encoder creates n encoded blocks out of k source blocks, which are then transmitted to a decoder. The decoder does not receive all of the encoded blocks but is still able to reconstruct the original k blocks.

Another property of erasure codes is the minimum code distance. The minimum code distance of a code of length n is the minimum number of block erasures required to prevent the reconstruction of the original k blocks [4]. A (14, 10)-MDS code, for instance, has a minimum code distance of 5.

Finally, the block locality r of an erasure code is defined as the maximum number of blocks required to reconstruct a single block [4].
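In summary, for an (n, k) MDS code the storage overhead, the number of tolerated erasures and the minimum code distance follow directly from n and k:

\[
\text{storage overhead} = \frac{n}{k}, \qquad \text{tolerated erasures} = n - k, \qquad d_{\min} = n - k + 1.
\]

For the (14, 10) code used as an example above, this gives an overhead of 1.4, four tolerated erasures and a minimum code distance of 5.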

Figure 2.5: Erasure coding example


2.7.1 Reed-Solomon codes

The most commonly applied class of erasure codes are Reed-Solomon (RS) codes. A well-known example of the application of RS codes are digital audio discs, which use RS codes for error correction [33]. RS codes were introduced in 1960 by Irving S. Reed and Gustave Solomon [34]. RS codes have the MDS property and are hence optimal in terms of storage overhead, while providing the maximum possible tolerance for block erasures. For instance, the application of a (14, 10)-RS code requires only 1.4 times the original storage, while tolerating up to four block erasures. RS codes do not modify the source blocks but add additional parity blocks. The original blocks can hence be read directly, without any decoding. RS codes are also very flexible and can be used in arbitrary combinations regarding the number of source and parity blocks. Configurations known to be applied in distributed storage systems are (6, 4), (9, 6), (12, 8), (13, 9) and (14, 10) [5, 4].

The mathematical construction of RS codes requires deep knowledge of algebra and coding theory, which is not necessary to understand this thesis, as it uses existing implementations. The details are therefore omitted here, and the interested reader is referred to the original paper [34] and an implementation tutorial by Plank [30].

A study about the usage of RS codes in distributed storage systems, conducted by Google [5], compared the MTTF of different levels of block replication to various configurations of RS codes. The results are shown in table 2.1. The block replication factor is denoted by R. Correlated failures are failures of multiple nodes produced by the same cause, for instance a power outage.

Policy (% overhead)    MTTF (days), with correlated failures    MTTF (days), w/o correlated failures
R = 2 (100)            1.47E+05                                 4.99E+05
R = 3 (200)            6.82E+06                                 1.35E+09
R = 4 (300)            1.40E+08                                 2.75E+12
R = 5 (400)            2.41E+09                                 8.98E+15
(6, 4)-RS (50)         1.80E+06                                 1.35E+09
(9, 6)-RS (50)         1.03E+07                                 4.95E+12
(13, 9)-RS (44)        2.39E+06                                 9.01E+15
(12, 8)-RS (50)        5.11E+07                                 1.80E+16

Table 2.1: MTTF of replication compared to various RS codes [5]

The results of the study show that RS encoding is superior to replication in terms of the trade-off between storage overhead and MTTF. Comparing triplication to a (9, 6)-RS code, it can be seen that RS has a higher MTTF when facing correlated failures, while introducing only one fourth of the storage overhead of triplication. Without correlated failures, the results are even better. Hence, RS codes provide the same or even greater MTTF with only a fraction of the storage overhead of replication.

A drawback of RS encoding is the large storage overhead for files which are smaller than the input stripe length of the code [7]. Assuming a (14, 10)-RS code and a source file of only two blocks, the storage overhead of four parity blocks is 200%. However, this is still not worse than block triplication.

2.7.1.1 The repair problem

A problem with applying RS codes in distributed storage systems, which has been discussed frequently in recent years [4, 16, 35], is the amount of bandwidth required for repairs and degraded reads. Given a (14, 10) RS encoded data stripe and assuming the erasure of only one of the source blocks, 10 blocks need to be read to reconstruct the erased block. Hence, RS codes have a block locality of r = k. In a data centre with thousands of storage nodes, built of commodity hardware and facing constant failures, this can consume a large amount of the overall available bandwidth [4]. A possible solution to this problem, named Locally Repairable Codes, was recently developed and is described in the next section.

2.7.2 Locally Repairable Codes

Locally Repairable Codes (LRC) are a recently developed extension of RS codes, addressing the repair problem by adding additional parity blocks and thereby decreasing the block locality significantly [4].

The notation of LRC is slightly different from the standard notation. A code is denoted as (k, n − k, r), meaning that it has k source blocks, an encoded stripe length of n, n − k parity blocks, and a block locality of r. Figure 2.6 shows an example based on a (14, 10) RS code. Three additional local parity blocks are created in such a way that, within each group of five source blocks plus its local parity block, a single block failure can be recovered by reading the five remaining blocks. Additionally, by carefully selecting the function for creating the additional parity, the third local parity block S3 can be omitted, and the blocks P1 to P4 together with S1 and S2 form another group in which any single lost block can be recovered from the remaining five. Thereby, the block locality is decreased to r = 5 for single block failures, while erasures of up to four blocks are still tolerated by defaulting to the underlying RS code.

Given that 99.75% of recoveries are for single block failures [35], this encoding scheme can significantly decrease the amount of repair traffic.


Figure 2.6: (10, 6, 5) Locally Repairable Code [4]

2.8 Failures in distributed storage systems

A study by Google [5] identified that distributed storage systems used in a data centre environment suffer from various failure sources such as software, hardware, network and power failures. These failures can be grouped into two event classes: node availability and disk failures. Additionally, events can be correlated or not.

Most availability failures are transient, as shown in figure 2.7. The figure shows the cumulative distribution function of unavailability events for several of Google's storage systems, consisting of 1000 to 7000 nodes. It has to be noted that unavailability is not only caused by failures but also by planned reboots during maintenance. Restarts are defined as software restarts without rebooting the whole node. It can be seen that the majority of unavailability events are shorter than 15 minutes.

Looking at the correlation between failures, the study revealed that most failures affecting multiple nodes are bound to a small number of racks. This is confirmed by Rashmi et al. [16], who state that spreading file blocks correctly across racks leads to 98.08% of failures being single block failures.

Considering the frequency of failures, a one-month-long trace of failing nodes in a Facebook production cluster consisting of 3000 nodes underlines that large clusters constantly face a large number of failures, as shown in figure 2.8.


Figure 2.7: Duration of node unavailability [5]

Figure 2.8: Failed nodes per day at Facebook [4]


2.8.1 Failure models

The Mean Time To Failure (MTTF) is a characteristic of hard disks that manufacturers state in their data sheets. It specifies the average time a hard disk is likely to operate without failures and can nowadays be as high as 1,000,000 hours, which is more than 100 years. It is derived from accelerated tests under laboratory conditions [13].

The Mean Time To Data Loss (MTTDL) is a concept widely used to compare the reliability of storage systems consisting of multiple storage devices; it predicts the likelihood of data loss and is based on a Markov model [6].

Figure 2.9: Markov model for RAID 5 [6]

Figure 2.9 shows an example Markov model for a RAID 5 system, which tolerates a single disk failure. In this model, n is the number of disks, λ denotes the failure rate of a single disk, and µ denotes the repair rate. State 0 stands for n healthy disks, whereas states 1 and 2 stand for one and two broken disks, respectively. The MTTDL is the expected time until state 2 is reached. From the Markov model, the MTTDL can be derived as shown in equation 2.2 [6].

\[
\mathrm{MTTDL} = \frac{\mu + (2n - 1)\lambda}{n(n - 1)\lambda^{2}} \tag{2.2}
\]
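As a rough worked example with purely illustrative numbers, consider a RAID 5 array with n = 5 disks, a disk failure rate of λ = 1/500,000 per hour and a repair rate of µ = 1/24 per hour (one day per repair). Equation 2.2 then gives

\[
\mathrm{MTTDL} = \frac{\frac{1}{24} + 9 \cdot \frac{1}{500{,}000}}{5 \cdot 4 \cdot \left(\frac{1}{500{,}000}\right)^{2}} \approx 5.2 \times 10^{8}\ \mathrm{hours},
\]

i.e. tens of thousands of years, which illustrates why such model-based figures should be treated with the caution discussed below.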

Although these models are widely applied to predict the failure rates of storage systems, it has to be noted that the concepts of MTTDL and MTTF are controversial. Schroeder and Gibson [13], as well as Greenan et al. [6], questioned the accuracy of these models and showed that they often model failure rates incorrectly.


Chapter 3 Related work

The idea of integrating erasure coding into distributed storage systems has already been implemented in several solutions. This chapter describes the solutions that are closely related to this thesis and discusses their advantages and drawbacks.

3.1 HDFS-RAID

HDFS-RAID [36] is an open-source module for Hadoop available for version 0.22. It offers two classes of erasure codes, namely XOR and Reed-Solomon, which can be flexibly configured in terms of source stripe length and number of parity blocks. HDFS-RAID is also extensible to support other codes. The encoding process is designed to encode cold files after they have not been accessed for a predefined amount of time. After the encoding, the replication factor is decreased in order to reduce the storage requirements. Paths containing the files to be encoded can be specified at runtime by adding them to a configuration file. HDFS-RAID then constantly iterates over the predefined paths, checking for files that can be encoded. Similarly, the NameNode is constantly queried for broken blocks using a file system check provided by HDFS, and HDFS-RAID then attempts to recover corrupted blocks of encoded files. The scheduling of encodings and repairs is implemented as a so-called RaidNode daemon, which can be executed on an arbitrary node in the cluster. It offers both local and remote encoding and repair mechanisms: encoding and repair jobs can either be executed locally on the RaidNode or forwarded to another node as a map job. Additionally, a so-called DistributedRaidFileSystem is included, which provides transparent file repairs to client applications. HDFS-RAID is built as a client application on top of HDFS.

HDFS-RAID applies a random block placement policy. Additionally, the placement of blocks is monitored in a separate daemon: co-located blocks are detected and the daemon attempts to copy them to other nodes.

While being a great tool for archiving cold files, HDFS-RAID faces several limitations. First of all, because of its separation from HDFS, the RaidNode has to constantly query the NameNode while iterating through all files over and over again. Additionally, although it does prioritise the repair of files which have both corrupted source and corrupted parity blocks, it does not consider the number of corrupted blocks per file. Considering the number of corrupted blocks could decrease the repair time of critical files and hence increase reliability. The separation from the NameNode also requires a garbage collection process that deletes parity blocks of deleted files, and a daemon that monitors source files for changes and updates parity files accordingly.

Secondly, the random block placement strategy is not optimal, especially on relatively small clusters where blocks frequently end up on the same nodes. On the one hand, this decreases reliability, as multiple blocks of one stripe can be lost during a single node failure. On the other hand, the block mover has to frequently move blocks to other DataNodes, which increases the bandwidth utilization.

Additionally, HDFS-RAID does not offer a flexible API, as it only allows specifying a file path plus the access delay after which the encoding will be triggered.

Finally, HDFS-RAID is not available for recent versions of Hadoop.

Figure 3.1: HDFS-RAID (Xorbas) overview


3.1.1 Xorbas

Xorbas [37] is the Facebook version of HDFS-RAID that includes Locally Repairable Codes. It was built on top of Facebook's own Hadoop distribution, which is based on Hadoop 0.20. The included version of HDFS-RAID has been developed further than the original version, although the basic concept is the same. Hence, it faces the same limitations as HDFS-RAID.

The prototype presented in this thesis is built on top of the Facebook version of HDFS-RAID.

3.2 Diskreduce

Diskreduce [7, 38] is an approach similar to HDFS-RAID and also implemented for HDFS. Diskreduce was implemented separately from HDFS as an administration tool. The administration tool allows specifying a directory, which is then encoded in a MapReduce job. The encoding process is actually distributed and not implemented as a single mapper as in HDFS-RAID. The supported erasure codes are taken from RAID 5 and RAID 6, respectively, and are based on the Jerasure library [39]. A DistributedRaidFileSystem, similar to the one in HDFS-RAID, is provided. One major difference to HDFS-RAID is that clients actively trigger the repair process when they detect a block failure.

The most interesting feature of Diskreduce is the ability to group files together, encoding them like a single file. Considering the large overhead for files smaller than the input stripe length of a code, this significantly decreases the storage requirements for small files. The drawback of this approach is that it is difficult to maintain the state of the parity information, as the combined parity needs to be updated whenever files are deleted or changed.

Being implemented alongside HDFS, Diskreduce faces the same limitations as HDFS-RAID. Additionally, it implements neither a block placement strategy nor file append or deletion of single files in an encoded directory. As its encoding is based on RAID 5 and 6, a maximum of two block erasures can be tolerated. Triggering file repairs from clients might also be an issue, as cold files might not be read frequently, and hence the risk of losing more than two blocks in a stripe increases over time.

3.3 Microsoft Azure Storage

Microsoft implemented erasure coding for their cloud storage service Windows Azure Storage (WAS) [40]. It applies Reed-Solomon codes with a storage overhead of 1.3 or 1.5 in order to reduce storage requirements. Besides the usage of RS codes, Microsoft is putting effort into exploring codes that provide more efficient repairs [35]. WAS is a proprietary solution and hence not publicly available.

3.4 Google Colossus

A newer version of GFS, codenamed Colossus [24], implements Reed-Solomon codes with a storage overhead of 1.5. It uses a rack-aware block placement strategy and prioritises repairs according to the number of lost blocks [5]. Unfortunately, it is not publicly available, and little information has been published about it.

3.5 Compression

An alternative approach to reducing the storage requirements for data stored on HDFS is file compression. Hadoop offers a variety of compression formats, which can easily be used. Possible options are gzip, bzip2, LZO and Snappy. Of these, only bzip2 and LZO (after being indexed) are splittable, which means that individual blocks can be decompressed without reading the whole file. Splittable formats are usable in MapReduce jobs, which decreases the storage requirements of files as well as the bandwidth during block transfers [17, p. 85].
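As an example, output compression can be enabled for a MapReduce job through the Hadoop 2 Java API roughly as follows; this is only a sketch, and the chosen codec should be checked against the codecs actually available in the cluster.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.BZip2Codec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressedOutputExample {
    public static void main(String[] args) throws Exception {
        // Create a job and enable bzip2 (splittable) compression for its output files.
        Job job = Job.getInstance(new Configuration(), "job with compressed output");
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, BZip2Codec.class);
        // Mapper, reducer and input/output paths would be configured as usual,
        // followed by job.waitForCompletion(true).
    }
}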

Compression should not be seen as an alternative to erasure coding, as both can be applied in combination to significantly decrease the storage requirements.


Chapter 4 Method

The overall goal of this thesis is to decrease the storage requirements of big data and thereby to address the problem of its rapid growth. The hypothesis is that this goal can be achieved by adding erasure coding to one of the most commonly used tools for storing and analysing big data, namely Hadoop. It is assumed that this will enable data analysts to store and analyse more data and will reduce the operating costs of data centres. This will empower enterprises, as well as researchers, to gain new, valuable insights, thereby increasing public welfare.

In order to confirm the given hypothesis, an experimental approach is chosen. First, a prototype named HOP Erasure Coding (HOP-EC) is introduced, which is then analysed quantitatively and analytically. From a software engineering point of view, an iterative approach is taken in order to constantly test and improve the prototype while validating it against the specified goals.

To differentiate this work from existing solutions, existing work was analysed in chapter 3, mainly based on research papers and publicly available open-source approaches. The limitations identified there are used in the following section to specify goals, so that the prototype developed for this thesis can overcome them.

Published statistics about the characteristics of large data centres are used to make reasonable assumptions about the execution environment.

The success of this thesis project is measured by analysing the characteristics of the developed prototype.


4.1 Goals

The main goals of this project, incorporated in the design goals of HOP-EC, are defined as follows.

• Add erasure coding to Hadoop to approach the storage problem of big data.

• Offer a flexible API allowing the encoding of individual files, at any time and with an arbitrary codec, as well as the revocation of the encoding.

• Allow the implementation of custom strategies for identifying the files to encode and triggering the encoding.

• Detect failures as fast as possible and prioritise the repair of critical files by considering the number of erased blocks.

• Enforce the block placement of encoded files to ensure maximum reliability also for smaller clusters.

• Provide a mechanism for transparent source block repair on the client-side.

• Automatically remove the parity information during source file deletion.

• Enable the addition of new erasure codes.

• Support a recent version of Hadoop.

4.2 Delimitations

This work does not focus on the design or implementation of new erasure codes, as this is coding theory and outside of the focus of this thesis. Instead, existing implementations will be used.

4.3 Solution

To facilitate the achievement of the previously defined goals, a major design decision, different from existing solutions, was taken. Although the prototype is based on the HDFS-RAID version of Xorbas, the scheduling of encoding and repair jobs was moved to the NameNode. This greatly simplifies the detection of failures, the prioritisation of repairs, the enforcement of block placement and the creation of a flexible API. Existing solutions avoided this design, as it puts more pressure on the already congested NameNode. However, as our solution was built on HOP, which has a multiple-NameNode architecture, the cost of adding the management of erasure coding to the NameNode is relatively low and the benefits predominate.

An overview of the whole system is given in figure 4.1. The individual parts are explained in the following sections.

Figure 4.1: HOP-EC overview

4.3.1 System architecture

The HOP-EC architecture is divided into two parts: the ErasureCodingManager in the NameNode and an erasure coding library which is loaded at runtime. Component diagrams showing the major components of each are given in figures 4.2 and 4.3, respectively.

The ErasureCodingManager forms the centre of HOP-EC. It is implemented as a single thread executed on the leading NameNode. The leading NameNode is determined by a leader election which is already included in HOP. The ErasureCodingManager is responsible for scheduling encodings and repairs, monitoring the revocation process of encoded files and executing the garbage collection of parity files. The state of each encoded file is represented by an EncodedFileStatus, which is stored in MySQL Cluster.

EncodingManager and RepairManager are abstract and specify the interface implemented by the encoding library. The interface is kept simple and includes operations such as encoding or repairing a file. Upon request, EncodingManager and RepairManager return their progress to the ErasureCodingManager in the form of a Report.

Figure 4.2: ErasureCodingManager in the NameNode

ErasureCode is an abstract implementation of an erasure code and has to be extended to provide actual codes. A Codec is a concrete configuration of an ErasureCode. For instance, with Reed-Solomon as the erasure code, a specific codec would be a (14, 10) RS code. New codecs can simply be specified in an HDFS configuration file. An EncodingPolicy represents the combination of a codec and the respective block replication level as requested for a specific file by a client. It is stored together with the EncodedFileStatus.

The BlockManager was extended in order to monitor the corruption and repair of encoded files.

The erasure coding library is based on the Xorbas version of HDFS-RAID and is quite complex. For the sake of simplicity, only the main components necessary to understand the following sections are presented here. The library includes implementations of the EncodingManager and RepairManager interfaces, as well as three implementations of erasure codes, namely XOR, Reed-Solomon and Locally Repairable Codes. The concrete code implementations are used by the Encoder to encode files and by the Decoder to reconstruct corrupted blocks.

Figure 4.3: Erasure coding library

The MapReduceEncodingManager starts a mapper for each encoding request it receives and keeps track of its progress. The MapReduceRepairManager acts similarly to the encoding manager, but for repairs. Both components use the MapReduce framework of Hadoop to schedule the execution of jobs on nodes in the cluster.

The ErasureCodingFileSystem implements the FileSystem API and provides transparent repairs to client applications.


4.3.2 State

Each file for which encoding was requested has an associated EncodedFileStatus. It represents all important information about the encoding, such as the codec, the replication level to be applied, the associated parity file, and the status of the source and parity files. Additionally, information about the integrity of the source and parity files, such as the number of lost source and parity blocks, is stored. Finally, it includes modification timestamps in order to prioritise older requests.
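As a rough illustration, the state described above could be represented as in the following Java sketch; the class, field and status names are assumptions made for this example, and the actual HOP-EC implementation (see appendix A) may differ.

class EncodedFileStatusSketch {
    enum Status { ENCODING_REQUESTED, ENCODING_ACTIVE, ENCODED, REPAIR_ACTIVE, REVOKED, DELETED }

    long inodeId;                      // the source file this status belongs to
    String codec;                      // e.g. a (14, 10) Reed-Solomon configuration
    short replication;                 // replication level to apply to the encoded file
    String parityFileName;             // parity file in the common parity directory
    Status sourceStatus;               // status of the source file
    Status parityStatus;               // status of the parity file
    int lostSourceBlocks;              // integrity counters used to prioritise repairs
    int lostParityBlocks;
    long statusModificationTime;       // timestamps used to prioritise older requests
    long parityStatusModificationTime;
}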

The complete state is stored in the MySQL Cluster instance used by the HOP-HDFS NameNode and is hence safely persisted. The necessary layers, as required by the HOP-HDFS data model, were implemented in order to do this. As MySQL Cluster does not support transactions, the existing transaction system of HOP-HDFS was used in order to guarantee the atomicity, consistency, isolation and durability of the persisting operations.

For the sake of completeness, a list of all EncodedFileStatus fields and possible values can be found in appendix A.

4.3.3 Overall process

The process of scheduling encodings and repairs, as well as the monitoring of revocations and the garbage collection of parity files, is implemented as a single thread of the ErasureCodingManager. The process is illustrated in figure 4.4. The process is user-configurable through several parameters: the frequency of the process as well as limits for the number of active encodings and the number of active repairs can be specified, and the repair of files can be delayed by a predefined amount of time.

The process wakes up periodically, checks for succeeded or failed encodings or repairs, and updates the state accordingly. New jobs are triggered if capacity is available. Additionally, the process looks for parity files marked as deleted and garbage collects them. Finally, the replication level of revoked encodings is checked. Details about the individual sub-processes are given in the following subsections.

Figure 4.4: Overall process
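The following strongly simplified Java sketch illustrates the structure of this loop; all class and method names are placeholders rather than the actual implementation.

class ErasureCodingManagerLoopSketch implements Runnable {
    private final long intervalMs;          // configurable frequency of the process
    private final int maxActiveEncodings;   // configurable limit for concurrent encodings
    private final int maxActiveRepairs;     // configurable limit for concurrent repairs

    ErasureCodingManagerLoopSketch(long intervalMs, int maxActiveEncodings, int maxActiveRepairs) {
        this.intervalMs = intervalMs;
        this.maxActiveEncodings = maxActiveEncodings;
        this.maxActiveRepairs = maxActiveRepairs;
    }

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            checkActiveJobsAndUpdateState();       // succeeded or failed encodings and repairs
            scheduleEncodings(maxActiveEncodings); // start new jobs if capacity is available
            scheduleRepairs(maxActiveRepairs);
            garbageCollectParityFiles();           // parity files of deleted source files
            checkRevokedEncodings();               // delete parity once replication is restored
            try {
                Thread.sleep(intervalMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    private void checkActiveJobsAndUpdateState() { /* omitted */ }
    private void scheduleEncodings(int limit) { /* omitted */ }
    private void scheduleRepairs(int limit) { /* omitted */ }
    private void garbageCollectParityFiles() { /* omitted */ }
    private void checkRevokedEncodings() { /* omitted */ }
}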

4.3.4 Encoding process

Triggered by the overall process, the MapReduceEncodingManager starts a map job for the encoding of an individual file. In this process, the mapper, executed on a remote machine, reads through the source file stripe by stripe and computes the parity blocks, which are stored in a separate file in a special parity folder. The encoding manager stores a reference to each job so that it can check and return its status upon request. To prioritise earlier requests, each encoding request carries a timestamp set to the time of its creation. Older requests are handled first.

The concept of storing all parity files in one directory is different from the approach of HDFS-RAID, which stores them in a parallel folder hierarchy so that the parity file related to a source file can be easily identified. In the case of HOP-EC, however, the identity of the parity file is stored in the EncodedFileStatus and hence known. This greatly simplifies operations such as renaming or moving source files, as the parity file does not need to be modified.

4.3.5 Repair process

The repair process works similarly to the encoding process but has a different mechanism for prioritizing requests. For each file, the number of lost source blocks, the number of lost parity blocks and the time when the first loss of a source or parity block was detected are stored. The repair of source files is prioritised in the following order, leading to a prioritised repair of critical files (a sketch of this ordering is given below):

1. Total number of lost blocks (descending)
2. Number of lost source blocks (descending)
3. The detection time of the first source block loss (ascending)
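The ordering above can be expressed as a comparator, as in the following Java sketch; the class and field names are illustrative assumptions rather than the actual HOP-EC types.

import java.util.Comparator;

class RepairCandidate {
    int lostSourceBlocks;
    int lostParityBlocks;
    long firstSourceLossDetectedMs;   // when the first source block loss was detected

    int totalLostBlocks() {
        return lostSourceBlocks + lostParityBlocks;
    }

    // Highest priority first: most lost blocks, then most lost source blocks,
    // then the oldest detection time of the first source block loss.
    static final Comparator<RepairCandidate> REPAIR_PRIORITY =
            Comparator.comparingInt((RepairCandidate c) -> c.totalLostBlocks()).reversed()
                    .thenComparing(Comparator.comparingInt(
                            (RepairCandidate c) -> c.lostSourceBlocks).reversed())
                    .thenComparingLong(c -> c.firstSourceLossDetectedMs);
}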

Parity file repairs are scheduled separately and are prioritised similarly. Repaired blocks are sent individually to DataNodes, while considering block placement constraints.


Rewriting individual blocks of already completed files was enabled by modifying the HDFS class DFSOutputStream. A special mode for rewriting a single block was added to the existing write pipeline. In this mode, the NameNode is not asked to create a new block, but is only queried for new block locations. A write pipeline is then opened to the given DataNodes, as usual, and the block is written to them. Additionally, the file finalization of DFSOutputStream, which is usually executed when the stream is closed, is skipped.

4.3.6 Deletion process

When an encoded file is deleted using the HDFS API, the EncodedFileStatus is marked as deleted. More precisely, when the remove function of an HDFS inode representing an encoded file is called, it sets its state to deleted. The main process detects this the next time it executes and deletes the parity file as well as the state in the database. Although this does not happen at the very moment of the source file deletion, it happens shortly after, depending mainly on the interval configured for the overall process.

4.3.7 Revocation process

The encoding of files can be revoked by using an API call which also specifies the replication factor to be applied. To guarantee reliability when revoking the encoding of a file, the parity file is kept until the requested replication level has been reached. Therefore, the replication factor of the file is increased immediately when the revocation is requested and the state of the encoded file is changed to revoked. The overall process then periodically checks the actual replication level and deletes the parity file and the database entry once the requested replication level has been achieved.
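
A minimal sketch of this two-step revocation is given below. All names, including the simplified status record and the helper methods, are hypothetical; the real logic is split between the NameNode and the periodic process.

// Sketch of the two-step revocation: replication is raised at once, parity is
// dropped later by the periodic process. All names here are hypothetical.
import java.util.List;

public class RevocationSketch {

  public static class RevocationEntry {
    String sourcePath;
    String parityPath;
    short requestedReplication;
  }

  // Invoked when a client revokes the encoding of a file.
  public void revokeEncoding(RevocationEntry entry, short replication) {
    entry.requestedReplication = replication;
    setReplication(entry.sourcePath, replication); // re-replication starts immediately
    markAsRevoked(entry);                          // parity file is kept for now
  }

  // Invoked periodically by the overall process.
  public void checkRevokedEncodings(List<RevocationEntry> revoked) {
    for (RevocationEntry entry : revoked) {
      if (currentReplication(entry.sourcePath) >= entry.requestedReplication) {
        deleteFile(entry.parityPath); // replication target reached, parity can go
        removeStatus(entry);          // remove the database entry
      }
    }
  }

  private void setReplication(String path, short replication) { }
  private void markAsRevoked(RevocationEntry entry) { }
  private short currentReplication(String path) { return 0; }
  private void deleteFile(String path) { }
  private void removeStatus(RevocationEntry entry) { }
}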

4.3.8 Placement of blocks

In order to guarantee reliability for encoded files, the placement of source and parity blocks needs to be considered [5]. As discussed earlier, files to be encoded are divided into stripes and parity blocks are created for each of these stripes.

The blocks of each stripe and the related parity blocks need to be placed on different nodes, so that at most one of them is lost if a single node fails. Given a sufficiently large number of racks, it is even advisable to place at most one of them on any individual rack in order to tolerate rack failures.

HOP-EC enforces the proper placement of blocks whenever the encoding of a file is requested. Additionally, the correct placement of repaired blocks is enforced. The current version does not consider the placement on different racks.

However, it can easily be extended to do so. An example placement of a (6, 4) encoded file with three stripes is shown in figure 4.5. Circles denote parity blocks, while the remaining blocks are source blocks; stripes are distinguishable by colour.

A placement following this scheme ensures that the maximum possible number of node failures can be tolerated.

The block placement is implemented as part of the DFSOutputStream on the client side. When writing a file for which erasure coding was requested, the writing pipeline keeps track of all nodes already used for the current stripe and excludes them in subsequent block location requests to the NameNode. The same approach is used for writing parity files, but the nodes already used in the current source file stripe are obtained from the NameNode and also excluded. Writing a repaired block follows the same scheme and excludes all nodes used for the respective source stripe and its parity blocks.
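
As an illustration of this exclusion logic, the sketch below tracks the DataNodes used within the current stripe. The class and its methods are hypothetical and merely mirror the behaviour of the modified write pipeline; when a parity file or a repaired block is written, the nodes of the corresponding source stripe would be added to the same exclude list.

// Hypothetical illustration of stripe-aware node exclusion in the write path.
import java.util.ArrayList;
import java.util.List;

public class StripePlacementTracker {
  private final int stripeLength;                                  // source blocks per stripe
  private final List<String> usedNodes = new ArrayList<String>();  // nodes used in the current stripe
  private int blocksInStripe = 0;

  public StripePlacementTracker(int stripeLength) {
    this.stripeLength = stripeLength;
  }

  // Nodes to pass as excluded when requesting locations for the next block.
  public List<String> excludedNodes() {
    return new ArrayList<String>(usedNodes);
  }

  // Record the DataNode chosen for the block that was just written.
  public void blockWritten(String dataNode) {
    usedNodes.add(dataNode);
    if (++blocksInStripe == stripeLength) {
      usedNodes.clear();   // a new stripe starts, nodes may be reused again
      blocksInStripe = 0;
    }
  }
}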

If the encoding is requested after the file has already been written, then the whole file is rewritten by the requesting client. The existing placement mechanisms ensure that the necessary constraints are satisfied.

Figure 4.5: Block placement example for a (6, 4) encoded file

4.3.9 Monitoring of files

The state of each encoded file and its parity file is monitored in the BlockManager, which is part of the NameNode. As its name suggests, the BlockManager manages blocks and can be utilized to track lost, corrupted and recovered blocks. The BlockManager has three functions which are called when a block is added (addStoredBlock), removed (removeStoredBlock) or reported as corrupted (markAsCorrupt). The functions are called during block reports, when a client reports a block as corrupted, when a DataNode is suspected to have failed or when a DataNode recovers from being suspected. Full block reports are sent by DataNodes at predefined intervals and include all blocks currently stored on them.

Additionally, partial block reports are sent by DataNodes when a new block has been stored. In all cases, the BlockManager updates its state according to the event by calling the listed functions.

The discussed functions were utilized to track the state of encoded files and their parity files, with the state being stored in the EncodedFileStatus. Amongst others, the EncodedFileStatus stores the number of lost source blocks and the number of lost parity blocks, and has a status field which can be used to request repairs.

Every time a block of an encoded file or its parity file is removed or marked as corrupt, it is checked whether further replicas are available. If not, a file repair is requested and the number of lost blocks is incremented. Conversely, if a block is added and no replica was previously present, the number of lost blocks is decremented and, if no more blocks are missing, the file status is set to healthy.
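
The following sketch captures this bookkeeping. The counters correspond to those stored in the EncodedFileStatus, while the class name, the exact state names and the method names are hypothetical; figure 4.6 shows the authoritative state machine.

// Sketch of the per-file bookkeeping driven by the BlockManager hooks.
// State and method names are illustrative; see figure 4.6 for the actual states.
public class EncodedFileStatusSketch {
  public enum State { HEALTHY, REPAIR_REQUESTED, DELETED, REVOKED }

  private int lostSourceBlocks;
  private int lostParityBlocks;
  private State state = State.HEALTHY;

  // Called when removeStoredBlock or markAsCorrupt leaves a block with no replica.
  public void blockLost(boolean parityBlock) {
    if (parityBlock) lostParityBlocks++; else lostSourceBlocks++;
    state = State.REPAIR_REQUESTED; // picked up by the periodic repair scheduling
  }

  // Called when addStoredBlock reports a block that was previously missing.
  public void blockRecovered(boolean parityBlock) {
    if (parityBlock) lostParityBlocks--; else lostSourceBlocks--;
    if (lostSourceBlocks == 0 && lostParityBlocks == 0) {
      state = State.HEALTHY; // no more blocks missing
    }
  }

  public State getState() { return state; }
}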

The whole process is illustrated as a state machine in figure 4.6.

Note that a block which was repaired by the erasure coding mechanism is reported to the BlockManager by the receiving DataNode and is hence covered by this process.

Figure 4.6: States of an encoded file

4.3.10 Application programming interface

HOP-EC provides an easy-to-use client API, which is embedded in the HDFS file system API as exposed by the DistributedFileSystem class, and was implemented in the form of additional RPC calls to the NameNode. The method signatures of the API are shown in listing 2. Clients can request the encoding either during file creation, by specifying an EncodingPolicy, or by calling the encodeFile method. Either way, the encoding process will be executed asynchronously as soon as capacities are available. The replication factor to be applied after the file has been successfully encoded can be specified in the policy. Additionally, a custom replication factor to be applied before the encoding can be given.

The encoding of files can be revoked at any time using the revokeEncoding method. When doing so, the replication factor to be applied before the parity information is deleted can be specified.

The deletion of encoded files is done using the regular delete method of the HDFS file system API.

public HdfsDataOutputStream create(Path f, boolean overwrite, short replication, EncodingPolicy policy);

public void encodeFile(String filePath, EncodingPolicy policy);

public void revokeEncoding(String filePath, short replication);

public boolean delete(Path f, boolean recursive);

Listing 2: HOP-EC API

The simple but powerful API of HOP-EC enables client applications to apply their own encoding strategies. For instance, an archiving approach that triggers the encoding of cold files, as in HDFS-RAID, can be implemented. Alternatively, an administration terminal could be built on top of it, allowing administrators to select files or folders to be encoded.
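
As a usage illustration, the sketch below exercises the API from listing 2. The codec name "RS" and the EncodingPolicy constructor arguments are assumptions made for the purpose of the example, and the imports of the HOP-EC specific classes are omitted.

// Usage sketch for the API in listing 2. The EncodingPolicy constructor and the
// codec name "RS" are assumptions; imports of HOP-EC specific classes omitted.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.client.HdfsDataOutputStream;

public class EncodingClientExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    DistributedFileSystem dfs =
        (DistributedFileSystem) new Path("/").getFileSystem(conf);

    // Encoding policy: codec to use and replication factor after encoding (assumed constructor).
    EncodingPolicy policy = new EncodingPolicy("RS", (short) 1);

    // Request encoding at creation time; the file is encoded asynchronously later.
    HdfsDataOutputStream out =
        dfs.create(new Path("/archive/events.log"), true, (short) 3, policy);
    out.writeBytes("example data\n");
    out.close();

    // Encode a file that already exists.
    dfs.encodeFile("/archive/old-events.log", policy);

    // Revoke the encoding later and return to three-way replication.
    dfs.revokeEncoding("/archive/old-events.log", (short) 3);
  }
}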

4.3.11 Supported codes

The current implementation supports XOR (RAID 5), Reed-Solomon and Locally Repairable Codes. The codes can be freely configured by specifying the source stripe length and the number of parity blocks, as well as the replication factor to be applied after encoding. New codes can be supported by implementing the provided interfaces and are automatically loaded at runtime if specified in a codec configuration.

The configuration of codes is done in a single configuration file. For each codec, the code class, the requested code length, such as (14, 10), and the desired parity folder need to be given under a chosen name. After restarting the NameNode, the specific code configuration can then be used by referring to that name. New codes simply need to be added to the class path in order to be used by this configuration.

4.3.12 Modified components

Most of the code for the erasure coding functionality was implemented separately from the original Hadoop code. However, multiple components needed to be modified in order to support the new functionality. This section lists the major components that needed to be modified and states the kind and purpose of each modification.

On the client side, DistributedFileSystem, DFSClient and DFSOutputStream were modified. DistributedFileSystem was extended to include the new API calls in the form of extended create, encode and revoke methods. The changes in DFSClient reflect these extensions and implement the appropriate Remote Procedure Calls (RPC). Additionally, DFSClient was extended for the repair process and provides an API call for sending single blocks as well as one for requesting block locations for reconstructed blocks. The latter two are not exposed by the public API of the DistributedFileSystem. The data sending pipeline of DFSOutputStream was adjusted to support the sending of single blocks and to enforce the placement of blocks. To this end, nodes already storing blocks for a stripe are added to the exclude list of block location requests whenever new block locations are requested for encoded files or their parity.

On the NameNode side, NameNodeRpcServer, FSNameSystem, BlockManager and INode were modified. NameNodeRpcServer implements the server-side RPC calls for the new API, whereas FSNameSystem was changed to offer functionality to retrieve and update the state of encoded files. Additionally, FSNameSystem starts and stops the ErasureCodingManager during its initialization.

As described earlier, the BlockManager functions addStoredBlock, removeStoredBlock and markAsCorrupt were altered in order to automatically monitor and update the integrity state of encoded files. The remove function of INode was modified to set the encoding state of the file to be removed to deleted, so that the parity information can be garbage collected.

No major changes were made in the HOP-HDFS code, but necessary classes were added, as required by the existing data persistence model. This includes a data context, a data access layer (DAL) interface, a DAL adapter, a concrete ClusterJ implementation of the database functionality and a class for transaction lock acquisition.

4.3.13 Extensibility

HOP-EC is easily extensible. If necessary, even the whole procedure of encoding files could be replaced by substituting the erasure coding library. This is possible because the library communicates with the erasure coding manager through a small set of well-defined interfaces. If, for example, the mechanism of encoding files using MapReduce needs to be replaced, this can be done without reimplementing the scheduling capabilities of the ErasureCodingManager. The chosen design also simplifies maintenance, as the erasure coding library can be updated independently from the Hadoop binaries. New erasure codes extending the given interfaces can simply be added by putting them on the class path and specifying them in a configuration file.
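
To give an idea of the shape of such an interface, the following is a purely illustrative sketch; the actual HOP-EC interfaces and their method signatures differ.

// Purely illustrative sketch of an interface separating job scheduling from the
// concrete encoding mechanism; the real HOP-EC interfaces differ.
import java.util.List;

public interface EncodingManagerSketch {

  enum JobState { ACTIVE, SUCCEEDED, FAILED }

  final class JobStatus {
    public final String sourcePath;
    public final JobState state;
    public JobStatus(String sourcePath, JobState state) {
      this.sourcePath = sourcePath;
      this.state = state;
    }
  }

  // Submit an asynchronous encoding job producing a parity file for sourcePath.
  void encodeFile(String sourcePath, String parityPath, String codecName);

  // Submit an asynchronous repair job for lost blocks of an encoded file.
  void repairSourceBlocks(String sourcePath, String parityPath, String codecName);

  // Poll the status of previously submitted jobs.
  List<JobStatus> checkJobs();
}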

Chapter 5

Analysis

This section evaluates the HOP-EC prototype by analysing it experimentally and analytically. All experiments were executed on a HOP-HDFS cluster consisting of 18 DataNodes and 1 NameNode with the following configuration: Seven DataNodes had two AMD Opteron 2435 processors with 6 cores at 2.6GHz, 32GB of RAM and Oracle JRE version 1.6.0_26 installed on the 64-bit version of Ubuntu 11.04. The other 11 DataNodes had two Intel Xeon X5660 processors with 6 cores at 2.80GHz, Hyper-threading support, 40GB of RAM and Oracle JRE version 1.6.0_26 installed on the 64-bit version of Ubuntu 12.04.3 LTS. All DataNodes were connected via Gigabit Ethernet using the same switch. The NameNode had two Intel Xeon E5410 processors with 4 cores at 2.33GHz, 32GB of RAM and Oracle JRE version 1.6.0_26 installed on the 64-bit version of Linux Mint 16. The NameNode was connected via 100 Mbit/s Ethernet. The resource manager and the history server were executed on one of the nodes acting as a DataNode. Each job was configured to use at most 8GB of RAM and 4 virtual cores. Experiments with HDFS-RAID use the Facebook version published with Xorbas.

5.1 Storage overhead

In order to validate that the actual storage overhead for encoded files matches the theoretical one, an experiment was conducted in which a 10GB file was encoded with an (11, 10)-XOR, a (14, 10)-RS and a (10, 6, 5)-LRC code, respectively. The block size was 64MB. The resulting storage overhead for stripe-aligned files, in comparison to triplication (R3), is shown in figure 5.1. It can be seen that all erasure coding policies have a significantly smaller overhead than triplication. XOR requires by far the smallest amount of parity information, but can also tolerate no more than one failure per stripe, while RS and LRC tolerate up to four failures. Because LRC uses two parity blocks per stripe more than RS, its overhead is 20% higher
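
For reference, the theoretical overhead factors follow directly from the code parameters, assuming stripe-aligned files and a single replica per stored block after encoding (the LRC term assumes six parity blocks per ten source blocks, two more than RS):

\[
\text{overhead factor} = \frac{\text{stored blocks}}{\text{source blocks}}:\qquad
\text{R3}: \frac{3 \cdot 10}{10} = 3.0,\qquad
(11,10)\text{-XOR}: \frac{11}{10} = 1.1,\qquad
(14,10)\text{-RS}: \frac{14}{10} = 1.4,\qquad
(10,6,5)\text{-LRC}: \frac{10+6}{10} = 1.6
\]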
