
DEGREE PROJECT IN ELECTRICAL ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2017

Multi-version Storage: Code Design and Repair in Distributed Storage Systems

YUANJIA GONG


Abstract

With the rapid growth of data volume, data storage has attracted more and more research interest in recent years. Distributed storage systems play an important role in meeting the demand for data storage in large amounts. That is, data are stored by multiple storage nodes which are connected together in various network topologies. The main merits of such distributed storage are faster response, higher reliability and better scalability. However, due to network failure, link outage or buffer overflow, updated data might not be received by all storage nodes, resulting in the coexistence of multiple versions of a file in the system. Thus, the major challenge is consistency, which means that the latest version of the file is accessible to any read request. We aim to study multi-version storage and code design in distributed storage systems, where the latest version of the file or a version close to the latest version is recoverable. Moreover, compared to previous studies, higher availability can be achieved in our system model; namely, at least one version of the file can always be obtained.


Sammanfattning

Given the rapidly growing volume of data, research interest in data storage has grown in recent years. Distributed storage systems play an important role in meeting the need for storage of large volumes of data. In a distributed storage system, data are stored on several nodes that are interconnected in various ways in a network. Compared to traditional local storage, distributed storage has the advantages of shorter response times, higher reliability and better scalability. However, if the network goes down, a link fails or a buffer reaches its maximum capacity, an update may not reach all nodes, which results in several different versions of a file being stored in the system at the same time. One of the challenges is therefore consistency, i.e., that the latest version of a file is available to everyone at any given time. The goal of this work is to study multi-version storage of files and the corresponding code design for distributed storage systems; in other words, when several versions of a file exist, it should be possible to recover them. Compared to previous studies, our system achieves higher availability, namely that a client can always obtain at least one version of a file.


Acknowledgements


Contents

Abstract
Sammanfattning
Acknowledgements
List of Figures
List of Tables
Abbreviations

1 Introduction
1.1 Motivation
1.2 Problem Description
1.3 Outline

2 Background
2.1 Distributed Storage System
2.2 MDS Codes
2.3 Network Coding

3 System Model Design
3.1 System Model Description
3.2 Multi-version Storage Allocation and Code Design
3.3 Main Results and Case Studies
3.3.1 Main Results
3.3.2 Case Studies for v = 2 Versions
3.3.3 Case Studies for v = 3 Versions

4 Repair in Consistent Distributed Storage
4.1 Optimal Repair Bandwidth with DR Storage Nodes
4.2 Reducing Communication Cost with High Link Cost
4.3 Successful Repair with Link Failure

5 Conclusions and Future Work
5.1 Conclusions
5.2 Future Work

A Appendix
A.1 Proofs of Main Results
A.2 Storage Bandwidth Trade-off


List of Figures

1.1 The typical structure of data centers.
1.2 A typical cloud storage system.
1.3 The simplified system model, where v ordered versions of the file are forwarded to n storage nodes through the intermediate router.
4.1 A DR storage node is introduced to repair node 1.
4.2 A DR storage node is introduced to repair node 2.
4.3 Both the repair bandwidth and the additional storage space are reduced when increasing the number of DR storage nodes, for M = 10 Mb, d = 15, s = 10.
4.4 A typical application of distributed storage systems with different link costs.
4.5 By the traditional repair approach, the total communication cost is 2 × (1 + 20 + 1) = 44.
4.6 By linear combinations, the total communication cost is reduced to 2 × 1 + 20 + 1 = 23.
4.7 Link failure between the surviving node and the new node.
4.8 Repair of node 1 by the cooperation of the surviving node and DR storage node.
A.1 The information flow graph with DR storage nodes, where S, DR and DC stand for source node, DR storage node and data collector, respectively.

List of Tables

3.1 Storage allocations for arbitrary v ≥ 2, where Ver p is the latest common version among k < n storage nodes.
3.2 Optimal storage allocations for v = 2, k = 3, µ = 1/2 and p = 1.
3.3 A scenario in which F2 is recoverable, where at least two nodes receive both version 1 and version 2 and then only store 2 bits of F2.
3.4 A scenario in which F1 is recoverable, where at least two nodes only receive version 1 and each stores 2 bits of F1.
3.5 Storage allocations for v = 2, k = 2, µ = 3/4 and p = 1.
3.6 A scenario in which F2 is recoverable, where two nodes receive both version 1 and version 2 and each stores 1 bit of F1 and 2 bits of F2.
3.7 A scenario in which F1 is recoverable, where one node only receives version 1 and stores 3 bits of F1, and one node receives both versions and stores 1 bit of F1 and 2 bits of F2.
3.8 Another scenario in which F1 is recoverable, where two nodes only receive version 1 and each stores 3 bits of F1.


Abbreviations

P2P    Peer-to-Peer
DEC    Differential Erasure Codes
DR     Dedicated-to-Repair
MDS    Maximum Distance Separable
DC     Data Collector


Chapter 1

Introduction

1.1 Motivation

In the age of information, the volume of data is already enormous and still growing rapidly. For example, 2.3 × 10^6 petabytes of data were generated every day in 2014. Moreover, every minute in 2015, around 200 million posts were sent by email, nearly 220 thousand photos were uploaded to Instagram and about 2.5 million files were shared on Facebook [1]. In a word, data storage has become a big challenge for current communication applications.

Distributed storage systems play an important role in meeting the demand for data storage in large amounts, where data are stored by a set of connected storage nodes. Data centers, cloud storage systems, peer-to-peer (P2P) storage systems, wireless networks and large-scale sensor networks are popular applications of distributed storage systems. A data center is a complicated facility, which includes computer systems and ancillary equipment, redundant data connections, and devices for environmental monitoring and control [2]. The typical structure of data centers is depicted in Figure 1.1. Each data center, which is connected to the core network, consists of multiple storage nodes. Required data is accessible to a client by connecting to the closest data center, which is feasible, fast and efficient.


Figure 1.1: The typical structure of data centers.

Cloud storage is another popular application, where data are stored and maintained by a hosting provider. By the approach of cloud computing, as shown in Figure 1.2, shared resources and information are accessible to computer terminals and various devices over the Internet [3].

Figure 1.2: A typical cloud storage system.

As stated in the CAP theorem, a distributed storage system can simultaneously satisfy at most two out of the three properties Consistency, Availability and Partition tolerance [4]. Consistency means that the latest version of the data is accessible to every read client. Availability means that the system always returns a non-error response to any read request. Partition tolerance indicates that delay or failure in parts of the system has no effect on normal operations of the whole system.


To ensure consistency at the cost of loosening availability, no other write or read request is allowed until all updates have been completed in the system. As a result, the latest version of the data is accessible to any read client, which is termed consistency. However, the system might return a time-out or error response while a partial update is not yet finished, which means that the system is unavailable during the whole update period. On the contrary, if consistency is sacrificed for availability, the system will return any available version of the data without guaranteeing that the most recent updates are contained.

Consistency is one of the main challenges in distributed storage systems. For example, any change from a write operation is supposed to be announced to all storage nodes, which is a necessary condition for the latest version of the data to be accessible to any read client. However, due to network failure, link outage or buffer overflow, the updated data might not be received by all storage nodes. As a consequence, a read client will obtain different versions of the file when connecting to a set of storage nodes.

Techniques based on erasure codes have been proposed to ensure consistency in distributed storage systems in current research [5], [6], [7], [8]. For instance, update-efficient codes are designed to reduce the communication cost of the update process [8]. In addition, multi-version storage and code design in distributed storage systems are considered in [9] and [10]. In [9], Harshan and Datta worked on storing an archive with multiple versions by proposing an efficient technique termed compressed differential erasure codes (DEC). In [10], multi-version codes with near-optimal storage space have been studied from an information-theoretic view; these codes are designed to recover the latest common version or a later version of the file. However, one limitation of such code constructions is that nothing can be recovered if no common version exists in the system.


1.2 Problem Description

The system model of distributed storage is depicted in Figure 1.3, where F1, F2, ..., Fv stand for v different versions of the file, and Fp contains more recent updates than Fq for p > q. All these versions will be forwarded to n storage nodes in total through an intermediate router. However, due to network failure, link outage or buffer overflow, an updated version of the file might not be received by all storage nodes. Eventually, each storage node only receives an arbitrary subset of the versions. As a result, a read client will obtain different versions of the file when connecting to a set of storage nodes. The question is how to obtain the latest version of the file, which contains the most recent updates.

On the other hand, node failure is a frequent occurrence, and the lost data is supposed to be regenerated to meet the demand for reliability; that is, a sufficient number of bits must be downloaded from the surviving storage nodes. Furthermore, how to reconstruct the lost data with as few downloaded bits as possible is well worth considering. In addition, another common scenario is that the link costs might be inconsistent due to the different locations of the storage nodes, i.e., some costs may be low, but others might be much higher. The problem of interest is how to reduce the communication cost of the repair process when the link cost is high. Moreover, if one or more links between the surviving nodes and the new node fail, is there any approach to complete the repair process successfully?


1.3 Outline

The scope and contributions of this thesis are organized as follows.

The background and related work are briefly introduced in Chapter 2, including distributed storage systems, maximum distance separable (MDS) codes and network coding, which are basics that will be used in the following chapters.

In Chapter 3, the system model description is given first. Details of the storage allocation strategy and code design are elaborated subsequently. Then, we show the main results of the project; the corresponding proofs can be found in Appendix A.1. Moreover, two case studies for v = 2 and v = 3 versions are discussed in the last part of this chapter.

In Chapter 4, we study the multi-version repair process in consistent distributed storage and mainly focus on reducing the repair bandwidth and the communication cost of repair. First, additional storage nodes dedicated to repair (DR storage nodes) are introduced to reduce the repair bandwidth with minimal additional storage space. Next, linear coding is provided to reduce the communication cost of repair when the link cost is high. Finally, we show that the cooperation among surviving nodes and DR storage nodes suffices to complete the repair process successfully even with link failures.


Chapter 2

Background

The background and related work of this project are given in this chapter. Since we formulate the problem as a distributed storage problem, distributed storage systems are introduced first. Next, brief descriptions of maximum distance separable (MDS) codes and network coding are given in the following two sections, respectively.

2.1 Distributed Storage System

In recent years, with the rapid growth of data volume, data storage has attracted more and more research interest, e.g., [11], [12] and [13]. Distributed storage systems play an important role in meeting the demand for data storage in large amounts. That is, data are stored in a distributed way by multiple storage nodes which are connected together in various network topologies. The main merits of distributed storage systems are faster response, higher reliability and better scalability.

For the sake of reliability, routine repair is a necessity in distributed storage systems due to the unreliability of current networks, such as link outages and node failures. However, the repair cost cannot be underestimated, e.g., the repair bandwidth (the number of bits downloaded for repair) [14], [15], the repair locality (the required number of surviving storage nodes) [16], disk I/O reads [17] and the transmission cost [18].


2.2 MDS Codes

To provide high reliability, redundancy has been introduced in many practical applications. In distributed storage systems, hot data (frequently accessed information) are stored by replication to ensure reliability, while cold data (less frequently accessed data) are suggested to be stored by erasure codes instead. Compared to replication, erasure codes provide higher efficiency in terms of storage space for the same level of reliability. That is, erasure codes improve the fundamental trade-off between reliability and redundancy. More detailed comparisons between replication and erasure codes are given in [19].

One of the most popular families of erasure codes is maximum distance separable (MDS) codes, owing to their maximum erasure tolerance, with which the optimal trade-off between reliability and redundancy is achieved. For example, consider a file of size M units, divided into k fragments of M/k units each, then encoded into n fragments and stored on n nodes. An MDS code can tolerate the failure or erasure of up to n − k coded fragments, which means that any set of k coded fragments suffices to reconstruct the source file correctly. More related studies of MDS codes can be found in [20] and [21]; e.g., Reed-Solomon (RS) codes are the most widely used MDS codes in storage systems.
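To make the any-k-of-n property concrete, the following is a minimal sketch of an (n, k) MDS code in the Reed-Solomon style, written in Python over the prime field GF(257). The field choice, the one-symbol-per-fragment simplification and all function names are our own illustration, not a construction taken from the thesis or the cited references.

```python
# Minimal (n, k) MDS sketch: Reed-Solomon style polynomial evaluation over
# the prime field GF(257). The k source fragments are the coefficients of
# a degree-(k-1) polynomial; the n coded fragments are its values at
# x = 1..n, so any k of them determine the polynomial (the MDS property).
P = 257  # a prime, so arithmetic mod P forms a field

def poly_mul(a, b):
    """Multiply two polynomials (lowest-degree coefficient first) mod P."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] = (out[i + j] + ai * bj) % P
    return out

def encode(source, n):
    """n coded fragments: evaluate the source polynomial at x = 1..n."""
    return [sum(s * pow(x, j, P) for j, s in enumerate(source)) % P
            for x in range(1, n + 1)]

def decode(points, values, k):
    """Lagrange interpolation: recover the k source fragments from any
    k coded fragments (given as evaluation points and values)."""
    coeffs = [0] * k
    for i in range(k):
        num, denom = [1], 1                          # basis polynomial l_i
        for j in range(k):
            if j != i:
                num = poly_mul(num, [(-points[j]) % P, 1])   # (x - x_j)
                denom = denom * (points[i] - points[j]) % P
        scale = values[i] * pow(denom, P - 2, P) % P  # Fermat inverse
        for t in range(k):
            coeffs[t] = (coeffs[t] + scale * num[t]) % P
    return coeffs

source = [11, 22, 33]          # k = 3 source fragments (symbols in GF(257))
frags = encode(source, 5)      # n = 5 coded fragments, one per node
# nodes 1 and 4 fail (n - k = 2 erasures); decode from nodes 2, 3 and 5
assert decode([2, 3, 5], [frags[1], frags[2], frags[4]], 3) == source
```

Since a degree-(k − 1) polynomial is determined by any k of its evaluations, every subset of k fragments decodes to the same source, which is exactly the MDS property described above.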

2.3 Network Coding


Chapter 3

System Model Design

In this chapter, we start with the system model description in Section 3.1. Next, multi-version storage allocations and code design are elaborated in Section 3.2. Then, the main results and two case studies are given in Section 3.3. The corresponding proofs of the main results can be found in Appendix A.1.

3.1 System Model Description

Notations: For simplicity, the integer set {1, 2, ..., n} is denoted by [n] for n ∈ N+, and {m, m + 1, ..., n} is simplified to [m, n] for positive integers 1 ≤ m < n.

The distributed storage system model is shown in Figure 1.3, where F1, F2, ..., Fv stand for v different versions of the file, and Fp contains more recent updates than Fq for p > q. All these versions will be forwarded to n storage nodes in total through the intermediate router. More specifically, each version follows a uniform distribution over the set [2^M]. That is, for i ∈ [v],

\[ F_i \in [2^M] = \{1, 2, 3, \ldots, 2^M\}, \]

where M is the size of each version. Notably, we suppose that each storage node can receive the original version of the file (i.e., F1) without error. However, due to system delay, an updated version (Fi for i ∈ [2, v]) might not be received by all storage nodes.


Furthermore, we use Sj to denote the set of versions received by storage node j, for j ∈ [n], i.e.,

\[ S_j \subseteq \{1\} \cup \{2, 3, \ldots, v\}. \]

Note that control information transmission between any two storage nodes is allowed in our system model. Therefore, each storage node is aware of the versions received by the other nodes.

3.2 Multi-version Storage Allocation and Code Design

The main idea of the multi-version storage allocation strategy is to allocate a certain number of bits of each received version to the corresponding storage node. Note that the number of bits of a given version allocated to each storage node need not be equal across nodes. Actually, a given version is recoverable so long as the sum of its storage bits among k < n storage nodes is larger than or equal to the size of the version. Details of the allocation strategy and code design are given as follows.

Storage Allocation Strategy: Allocate µiM bits of version i to storage node j for i ∈ Sj and j ∈ [n], where µi denotes the storage cost of version i at storage node j and M is the file size of each version.

The total amount of storage bits at node j is the sum of the number of bits stored for each received version. The size of the storage space is defined as the maximum total amount of storage bits among all storage nodes, that is,

\[ \mu M = \max_{j \in [n]} \sum_{i \in S_j} \mu_i M, \]

where µ denotes the total storage cost of all received versions at each storage node.

The Encoding Function:

\[ E : [2^M] \rightarrow [2^{\mu_i M}], \]

which maps the M bits of version i (i.e., Fi) into µiM bits at storage node j.


The Decoding Function:

\[ D : [2^{\mu M}]^k \rightarrow [2^M] \cup \{\text{null}\}. \]

That is, Fi can be decoded from any set of k < n storage nodes if and only if the following condition is satisfied:

\[ \sum_{m=1}^{k} \mu_{i,m} \geq 1, \]

where µi,m denotes the storage cost of version i at the m-th node of the set; otherwise, version i is not decodable.

In other words, Fi is recoverable so long as the sum of the storage bits for version i among any set of k < n storage nodes is at least M bits (the size of each version).

3.3 Main Results and Case Studies

In this section, we first show the main results for arbitrary v ≥ 2. Subsequently, two special cases, v = 2 and v = 3, are discussed.

3.3.1 Main Results

Definitions: We use F_{max Sj} to denote the latest version of the file received by storage node j. Additionally, the latest common version of the file among k < n storage nodes is denoted by Fp for p = max ∩_{j∈[k]} Sj, where Sj ⊆ {1} ∪ {2, 3, ..., v} and j ∈ [n].

The main idea of our storage allocation strategy is that each node stores the latest common version among the k storage nodes and the latest version of the file it receives. That is, only Fp and F_{max Sj} are supposed to be stored by storage node j. The corresponding storage allocations are summarized in Table 3.1.


Latest Version |    p    |   p+1   |   p+2   |  ...  |   v−1   |    v
Ver p          |    µ    | µ − 1/d | µ − 1/d |  ...  | µ − 1/d | µ − 1/d
Ver p+1        |         |   1/d   |         |       |         |
Ver p+2        |         |         |   1/d   |       |         |
...            |         |         |         |  ...  |         |
Ver v−1        |         |         |         |       |   1/d   |
Ver v          |         |         |         |       |         |   1/d

Table 3.1: Storage allocations for arbitrary v ≥ 2, where Ver p is the latest common version among k < n storage nodes; each column corresponds to the latest version received by a node.

The value of the storage cost µ is specified in the following four cases, where v ≥ 2 and k ≥ d ≥ 2.

Case 1: For k = v(d − 1) + 1, the storage cost µ = 1/d (optimal storage allocations).

Case 2: For k ∈ [v(d − 1) + 2, (v − 1)d] and d ≤ v − 2, the storage cost µ = 1/d (optimal storage allocations).

Case 3: For k ∈ [(v − 1)(d − 1) + 1, v(d − 1)] and d ≤ v − 1, the storage cost µ = (v(d − 1) + 1)/(dk).

Case 4: For k ∈ [v(d − 2) + 2, v(d − 1)] and d ≥ v, the storage cost µ = (v(d − 1) + 1)/(dk).

3.3.2 Case Studies for v = 2 Versions

For v = 2 versions, which corresponds to Case 1 and Case 4, the storage cost can be summarized as below:

\[ \mu = \begin{cases} \dfrac{1}{d}, & k = 2d - 1,\ d \geq 2, \\[6pt] \dfrac{2d-1}{dk}, & k = 2d - 2,\ d \geq 2. \end{cases} \]

In the following parts, we discuss two examples for v = 2 versions, where the size of each version is 4 bits and the two versions are denoted as F1 = (a1, a2, a3, a4) and F2 = (b1, b2, b3, b4), respectively. Additionally, suppose that the latest common version is version 1, i.e., p = 1.

Example 2.1: v = 2, d = 2, k = 2d − 1 = 3, µ = 1/d = 1/2 (corresponding to Case 1).


Latest Version |  1  |  2
Ver 1          | 1/2 |
Ver 2          |     | 1/2

Table 3.2: Optimal storage allocations for v = 2, k = 3, µ = 1/2 and p = 1.

In this case, only the latest received version is stored by each storage node, as shown in Table 3.2. That is, a node that only receives version 1 will store 2 bits of F1, while a node that receives both version 1 and version 2 will only store 2 bits of F2. As a result, a read client connecting to any set of k = 3 storage nodes can recover either version 2 or version 1, since at least two nodes store either F2 or F1. Two possible scenarios, in which F2 or F1 is recoverable, are given in Table 3.3 and Table 3.4, respectively.

Node  |   1    |   2    |   3
Ver 1 | a1, a2 |        |
Ver 2 |        | b1, b2 | b3, b4

Table 3.3: A scenario in which F2 is recoverable, where at least two nodes receive both version 1 and version 2 and then only store 2 bits of F2.

Node  |   1    |   2    |   3
Ver 1 | a1, a2 | a3, a4 |
Ver 2 |        |        | b1, b2

Table 3.4: A scenario in which F1 is recoverable, where at least two nodes only receive version 1 and each stores 2 bits of F1.

Example 2.2: v = 2, d = 2, k = 2d − 2 = 2, µ = (2d − 1)/(dk) = 3/4 (corresponding to Case 4).

Latest Version |  1  |  2
Ver 1          | 3/4 | 1/4
Ver 2          |     | 1/2

Table 3.5: Storage allocations for v = 2, k = 2, µ = 3/4 and p = 1.

In this case, both the latest common version and the latest version of the file are stored by each storage node, as shown in Table 3.5. That is, a node that only receives version 1 will store 3 bits of F1, while a node that receives both version 1 and version 2 will store 1 bit of F1 and 2 bits of F2. As a result, either version 2 or version 1 is recoverable from any set of k = 2 storage nodes, as illustrated in Tables 3.6–3.8.


Node  |   1    |   2
Ver 1 |   a1   |   a2
Ver 2 | b1, b2 | b3, b4

Table 3.6: A scenario in which F2 is recoverable, where two nodes receive both version 1 and version 2 and each stores 1 bit of F1 and 2 bits of F2.

Node  |        1        |   2
Ver 1 | a1, a2, a3 + a4 |   a3
Ver 2 |                 | b1, b2

Table 3.7: A scenario in which F1 is recoverable, where one node only receives version 1 and stores 3 bits of F1, while the other node receives both version 1 and version 2 and stores 1 bit of F1 and 2 bits of F2.

Node  |     1      |     2
Ver 1 | a1, a2, a3 | a2, a3, a4
Ver 2 |            |

Table 3.8: Another scenario in which F1 is recoverable, where two nodes only receive version 1 and each stores 3 bits of F1.

3.3.3 Case Studies for v = 3 Versions

For v = 3 versions, the corresponding storage costs are given as below:

Case 1: For k = 3d − 2 and d ≥ 2, the storage cost µ = 1/d (optimal storage allocations).

Case 3: For k = 3 and d = 2, the storage cost µ = 2/3.

Case 4: For k ∈ [3d − 4, 3d − 3] and d ≥ v = 3, the storage cost µ = (3d − 2)/(dk).

Similarly, two examples are provided as follows, where the size of each version is 6 bits.

Example 3.1: v = 3, k = 7, d = 3, µ = 1/d = 1/3 (corresponding to Case 1).

Latest Version |  1  |  2  |  3
Ver 1          | 1/3 |     |
Ver 2          |     | 1/3 |
Ver 3          |     |     | 1/3

Table 3.9: Optimal storage allocations for k = 7, d = 3, µ = 1/d = 1/3 and p = 1.


For p = 2, the latest common version is F2, which is similar to the case of v = 2 versions, as discussed in Example 2.1.

Example 3.2: v = 3, k = 3, d = 2, µ = 2/3 (corresponding to Case 3).

Latest Version |  1  |  2  |  3
Ver 1          | 2/3 | 1/6 | 1/6
Ver 2          |     | 1/2 |
Ver 3          |     |     | 1/2

Table 3.10: Storage allocations for v = 3, k = 3, d = 2, µ = 2/3 and p = 1.

In this case, both the latest common version and the latest version will be stored by each storage node, as shown in Table 3.10. That is, a node that only receives version 1 will store 4 bits of F1; a node with version 2 as its latest version will store 1 bit of F1 and 3 bits of F2; and a node with version 3 as its latest version (no matter whether version 2 is received or not) will only store 1 bit of F1 and 3 bits of F3. Consequently, at least one version of the file is recoverable from any set of k = 3 storage nodes. That is, F2 is recoverable if at least two nodes store version 2, and likewise for F3. If version 2 and version 3 are each stored by at most one storage node, then at least one storage node only receives and stores version 1, and thus F1 is recoverable since 2/3 × 1 + 1/6 × 1 + 1/6 × 1 = 1. This can be checked exhaustively, as in the sketch below.
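The following sketch (our own, assuming the Table 3.10 allocation with p = 1) enumerates all 3^3 assignments of latest versions to the k = 3 nodes and verifies that at least one version is always recoverable:

```python
# Exhaustive check of Example 3.2 (v = 3, k = 3, d = 2, p = 1): for every
# assignment of latest versions to the three nodes, at least one version
# is recoverable under the Table 3.10 allocation.
from fractions import Fraction
from itertools import product

F = Fraction
allocation = {1: {1: F(2, 3)},               # latest version 1: 4 bits of F1
              2: {1: F(1, 6), 2: F(1, 2)},   # latest version 2: 1b F1, 3b F2
              3: {1: F(1, 6), 3: F(1, 2)}}   # latest version 3: 1b F1, 3b F3

for latest in product([1, 2, 3], repeat=3):  # latest version at each node
    totals = {1: F(0), 2: F(0), 3: F(0)}
    for l in latest:
        for ver, frac in allocation[l].items():
            totals[ver] += frac
    assert any(t >= 1 for t in totals.values()), latest
print("at least one version recoverable in all", 3 ** 3, "cases")
```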

For p = 2, the latest common version is F2, which is similar to the case of v = 2 versions discussed in Section 3.3.2.

Chapter 4

Repair in Consistent Distributed Storage

In this chapter, we study multi-version repair in consistent distributed storage. Our analysis is inspired by the recent research in [27], where dedicated-to-repair (DR) storage nodes were first proposed for packet erasure channels in distributed storage systems. Here, we mainly focus on reducing the repair bandwidth and the communication cost of the multi-version repair process by means of DR storage nodes.

First of all, additional storage nodes designated for repair (DR storage nodes) are introduced to reduce the repair bandwidth in Section 4.1. In the following two sections, we consider the communication cost of repair in different scenarios. In Section 4.2, high link cost is taken into account, and our goal is to reduce the communication cost of repair by linear combinations. In Section 4.3, supposing that one or more links between the surviving nodes and the newcomer fail, we show that the cooperation among surviving nodes and DR storage nodes suffices to complete the repair process successfully with less communication cost.


4.1 Optimal Repair Bandwidth with DR Storage Nodes

In this section, the repair goal is that the regenerated node, together with any other k − 1 storage nodes, can recover the source file. Besides the existing storage nodes, additional storage nodes dedicated to repair (DR storage nodes) are introduced in our multi-version repair model. We show that by increasing the number of DR storage nodes, the repair bandwidth is significantly reduced. Moreover, a notable observation is that the necessary storage space of the DR storage nodes decreases as well. Our goal is to achieve the optimal repair bandwidth with less additional storage cost.

The main results in Section 3.3 show that, when connecting to any set of k storage nodes, d ≤ k nodes suffice to recover Fi for i ∈ [p + 1, v]. Moreover, Fp is also recoverable from d storage nodes under the optimal storage allocations, where only the latest version of the file is stored by each storage node. Each of the d storage nodes then contains α = M/d bits of version i, where M is the file size of each version. In our multi-version repair model, if one or more storage nodes fail, then besides the s surviving storage nodes, r additional storage nodes dedicated to repair are introduced, and the storage space of each DR storage node is denoted by α* bits. The newcomer downloads β bits each from the s + r storage nodes. We use γ to denote the total repair bandwidth, i.e., γ = (s + r)β bits.

Recall Example 2.1 in the case study for v = 2 versions, where k = 3, d = 2 and µ = 1/d = 1/2. The file size of each version is 4 bits and each of the k = 3 storage nodes contains 2 bits of either version 1 or version 2. Thus F1 or F2 can be recovered from d = 2 storage nodes when connecting to any set of k = 3 storage nodes. Suppose that F1 = (a1, a2, a3, a4) is recoverable from node 1 and node 2, as shown in Table 3.4. Now, we study the repair process if one of the two storage nodes fails. Besides the surviving node, a DR storage node is introduced to reconstruct the lost data, as depicted in Figures 4.1 and 4.2.

Figure 4.1: A DR storage node is introduced to repair node 1.

Figure 4.2: A DR storage node is introduced to repair node 2.

Inspired by Theorem 1 in [13], we can state that, if the size α of the storage space is no less than the lower bound α_t, i.e., α ≥ α_t, then the parameters (n, k, d, s, r, γ = (s + r)β) are feasible via linear network coding over a sufficiently large finite field GF(q). The relationship between the storage space and the repair bandwidth in distributed storage systems can be derived as below:

\[
\alpha_t = \begin{cases}
\dfrac{M}{d}, & \gamma \in [\,y(0), +\infty), \\[8pt]
\dfrac{M - h(i)\,\gamma}{d - i}, & \gamma \in [\,y(i), y(i-1)),\ i = 1, \ldots, d-1,
\end{cases} \tag{4.1}
\]

where

\[ y(i) = \frac{2M(s+r)}{i(2d-i-1) + 2d(s+r-d+1)}, \tag{4.2} \]

\[ h(i) = \frac{i(2s+2r-2d+i+1)}{2(s+r)}. \tag{4.3} \]

Note that i = 0, 1, 2, ..., d − 1 and s + r ≥ d, which means that at least d storage nodes are needed to reconstruct a given version. The corresponding proof is given in Appendix A.2.


At the minimum storage point α = M/d (i.e., i = 0), the repair bandwidth is

\[ \gamma = y(0) = \frac{M(s+r)}{d(s+r-d+1)}. \tag{4.4} \]

As can be seen from equation (4.4), increasing the number r of DR storage nodes reduces the repair bandwidth, since d ≥ 2.

Subsequently, the size of each downloaded packet can be derived from the relation γ = (s + r)β, i.e.,

\[ \beta = \frac{\gamma}{s+r} = \frac{M}{d(s+r-d+1)}. \tag{4.5} \]

The same as a surviving node, each DR storage node is also supposed to transmit β bits to the newcomer. Thus, the storage space α* of each DR storage node should be no less than β bits, i.e.,

\[ \alpha^* \geq \beta = \frac{M}{d(s+r-d+1)}. \tag{4.6} \]

We specify α* to be equal to β bits in order to reduce the storage cost. Then, the total storage space of the DR storage nodes can be obtained as below:

\[ \alpha^* r = \beta r = \frac{Mr}{d(s+r-d+1)}. \tag{4.7} \]

As observed from equation (4.7), the additional storage space is also reduced by increasing the number r of DR storage nodes. The reason is that the size β of each downloaded packet decreases faster than r grows, so the product βr decreases.

To sum up, both the repair bandwidth and the additional storage space are decreasing functions of the number of DR storage nodes r, as plotted in Figure 4.3, where M = 10 Mb, d = 15, s = 10 and r ≥ d − s > 0.


Figure 4.3: Both the repair bandwidth and the additional storage space are reduced when increasing the number of DR storage nodes, for M = 10 Mb, d = 15, s = 10.

In short, a low repair bandwidth with little additional storage space can be achieved by introducing a certain number of DR storage nodes in distributed storage systems.
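The trend of Figure 4.3 follows directly from equations (4.4) and (4.7); the sketch below (our own, using the figure's parameters M = 10 Mb, d = 15, s = 10) prints both decreasing quantities for a few values of r:

```python
# Sketch reproducing the trend of Figure 4.3: repair bandwidth (4.4) and
# total DR storage (4.7) as functions of the number r of DR storage nodes.
M, d, s = 10.0, 15, 10                  # M in Mb, as in the figure

def gamma(r):                           # total repair bandwidth, eq. (4.4)
    return M * (s + r) / (d * (s + r - d + 1))

def dr_storage(r):                      # total DR storage beta * r, eq. (4.7)
    return M * r / (d * (s + r - d + 1))

for r in range(d - s, 31, 5):           # r >= d - s so that s + r >= d
    print(f"r = {r:2d}: gamma = {gamma(r):6.3f} Mb, "
          f"DR storage = {dr_storage(r):6.3f} Mb")
```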

4.2 Reducing Communication Cost with High Link Cost

In this section, we discuss another aspect of the repair cost in distributed storage systems, i.e., the communication cost. Consider a scenario where the link costs between the surviving storage nodes and the newcomer differ due to the different locations of the storage nodes. Our goal is to reduce the communication cost of repair when the link cost is high.

Consider a typical application of distributed storage systems, i.e., data centers, as depicted in Figure 4.4, where some storage nodes are located in the same rack while others belong to different racks. Additionally, the storage nodes in each rack are coordinated by a rack server. In general, the link cost between storage nodes belonging to different racks is much higher than within the same rack.

For example, consider a certain version of a file, i.e., F = (a1, a2), stored by two storage nodes in one rack, together with a DR storage node in the same rack that stores the combined fragment a1 + a2, while the new node resides in another rack. Suppose that the link cost


Figure 4.4: A typical application of distributed storage systems with different link costs.

between the rack server and each storage node in the same rack is 1 unit, while communication between any two rack servers costs 20 units.

Figure 4.5: By the traditional repair approach, the total communication cost is 2 × (1 + 20 + 1) = 44.

Further, we analyse the repair process when one of the storage nodes (i.e., node 2) fails, as shown in Figure 4.5. By the traditional repair approach, fragment a1 from node 1 and fragment a1 + a2 from the DR storage node are both transmitted to rack server A and then forwarded, via rack server B, to the new node, for a total communication cost of 2 × (1 + 20 + 1) = 44 units.


Figure 4.6: By linear combinations, the total communication cost is reduced to 2 × 1 + 20 + 1 = 23.

By linear combinations instead, as shown in Figure 4.6, fragment a1 and fragment a1 + a2 are first combined in rack server A. Then only the combined fragment a2 is forwarded to the new node through rack server B (suppose the finite field is GF(2) for simplicity). As a consequence, the communication cost from server A to the new node is half of that of the traditional method, and the total communication cost is reduced to 2 × 1 + 20 + 1 = 23 units.

In short, linear combinations of the received fragments at the rack server make it possible to reduce the communication cost of repair when the link cost is high in distributed storage systems.
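For illustration, the following sketch (our own; fragments modeled as small bit patterns) mimics the GF(2) combination at rack server A and compares the two communication costs:

```python
# Illustrative GF(2) combination at rack server A: XOR the fragment a1
# from node 1 with a1 ^ a2 from the DR storage node, and ship only the
# single result a2 across the expensive inter-rack link.
a1, a2 = 0b1011, 0b0110        # example fragments as bit patterns
dr_fragment = a1 ^ a2          # the DR storage node holds a1 + a2 over GF(2)

combined = a1 ^ dr_fragment    # computed locally at rack server A
assert combined == a2          # exactly the lost fragment of node 2

cost_traditional = 2 * (1 + 20 + 1)   # both fragments cross every link: 44
cost_combined = 2 * 1 + 20 + 1        # two in-rack hops, one inter-rack: 23
print(cost_traditional, cost_combined)
```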

4.3 Successful Repair with Link Failure

In this section, another scenario is considered, where one or more links between the surviving nodes and the newcomer fail during the repair process. We show that the cooperation among the surviving nodes and DR storage nodes suffices to complete the repair process successfully with less communication cost.


The main idea is to relay and combine fragments among the surviving nodes and the DR storage nodes. Finally, the combined fragments are forwarded to the newcomer by one DR storage node. Notably, we restrict our strategy to the scenario where at least one DR storage node can successfully communicate with the newcomer and any two storage nodes are connected in the distributed storage system.

Figure 4.7: Link failure between the surviving node and the new node.

Again, recall the scenario in Table 3.4 of Example 2.1: if one of the storage nodes fails, we introduce two DR storage nodes to reconstruct the lost data. Unfortunately, the link between the surviving node (node 2) and the newcomer also fails, as shown in Figure 4.7. Therefore, we aim to reconstruct the lost data by the cooperation of the surviving node and the DR storage nodes. As depicted in Figure 4.8, where the coefficients are selected from the finite field GF(5), the two fragments in node 2 are combined first, i.e., c1 = a3 + a4. Then, the combined fragment c1 is transmitted to DR node 1 and combined with its fragment, i.e., c2 = 4c1 + (a1 + a2 + a3 + a4) = a1 + a2. Next, both c1 and c2 are sent to DR node 2 and combined with its fragment. Finally, the lost fragments a1 = 3c1 + 2c2 + (4a1 + 3a2 + 2a3 + 2a4) and a2 = c2 + 4a1 can be reconstructed and then forwarded to the new node by DR node 2.
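The chain above can be verified numerically; the sketch below (our own) replays the GF(5) combinations for random source fragments and checks that a1 and a2 are recovered:

```python
# Numerical check of the GF(5) repair chain, over random source fragments.
import random

q = 5
for _ in range(1000):
    a1, a2, a3, a4 = (random.randrange(q) for _ in range(4))
    dr1 = (a1 + a2 + a3 + a4) % q              # stored at DR node 1
    dr2 = (4*a1 + 3*a2 + 2*a3 + 2*a4) % q      # stored at DR node 2
    c1 = (a3 + a4) % q                         # combined at surviving node 2
    c2 = (4*c1 + dr1) % q                      # at DR node 1: equals a1 + a2
    r1 = (3*c1 + 2*c2 + dr2) % q               # at DR node 2: equals a1
    r2 = (c2 + 4*r1) % q                       # at DR node 2: equals a2
    assert (r1, r2) == (a1, a2)
print("a1 and a2 reconstructed correctly in all trials")
```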


Chapter 5

Conclusions and Future Work

5.1 Conclusions

In this thesis, we studied multi-version storage and code design in distributed storage systems, where the latest version of the file or a version close to the latest version is recoverable. The main idea of our storage allocations is that each storage node stores the latest version it receives and possibly the latest common version among k < n storage nodes. A certain version can be recovered if and only if the sum of the storage bits among k storage nodes is larger than or equal to the size of the version. In addition, compared to the previous study [10], higher availability is achievable in our system model; that is, at least one version of the file is recoverable by connecting to a set of k storage nodes.


We also studied the repair process in consistent distributed storage. First, DR storage nodes were introduced to reduce the repair bandwidth with little additional storage space. Next, we considered inconsistent link costs, where some links are cheap but others might be much more expensive; we showed that linear combinations suffice to reduce the communication cost of repair when the link cost is high. Last but not least, we considered the scenario where one or more links between the surviving nodes and the new node fail, and showed that the cooperation among the surviving nodes and DR storage nodes makes it possible to complete the repair process successfully.

5.2 Future Work

There are still some limitations and open questions for future work. We restricted our analysis to the case where all versions are independent and each version is encoded separately by linear combinations. In practice, however, the versions of a file are often correlated in applications of distributed storage systems. In our future work, linear coding across multiple versions will be studied.


Appendix A

A.1 Proofs of Main Results

The corresponding proofs of main results are given as follows.

Case 1: For k = v(d − 1) + 1 and d ≥ 2, the storage cost µ = 1/d (optimal storage allocations).

First, it is obvious that k is larger than d, i.e.,

\[ k = v(d-1) + 1 = (v-1)(d-1) + d > d, \]

due to v ≥ 2 and d ≥ 2.

Then, we can state that, among any set of k storage nodes, there are at least d nodes that store a common version. That is, it is impossible for every version to be stored by at most d − 1 storage nodes, since

\[ k = v(d-1) + 1 > v(d-1). \]

The extreme case is that only one version is stored by d nodes and each of the other v − 1 versions is stored by d − 1 nodes, that is,

\[ k = (v-1)(d-1) + d = v(d-1) + 1. \]

Thus, that sole version can be recovered, since the number of storage bits in its d storage nodes equals the file size of the version, i.e.,

\[ \mu \times d = \frac{1}{d} \times d = 1. \]

In conclusion, when connecting to any set of k storage nodes, d < k nodes suffice to recover version i for i ∈ [p, v], where version p is the latest common version and version v is the latest version of the file. In addition, at least one version of the file is recoverable in our system, which means that higher availability can be achieved compared to the previous study [10].

Case 2: For k ∈ [v(d − 1) + 2, (v − 1)d] and d ≤ v − 2, the storage cost µ = 1/d (optimal storage allocations).

In this case, k > d is also satisfied, since

\[ k \geq v(d-1) + 2 = (v-1)(d-1) + d + 1 > d. \]

Similarly, as discussed in Case 1, among any set of k storage nodes there are at least d nodes that store a common version, due to

\[ k \geq v(d-1) + 2 > v(d-1). \]

As a result, Fi can be recovered by connecting to any set of k storage nodes so long as at least d < k nodes store version i, for i ∈ [p, v]. Moreover, we can guarantee that at least one of the versions Fp, Fp+1, ..., Fv is recoverable.

Case 3: For k ∈ [(v − 1)(d − 1) + 1, v(d − 1)] and d ≤ v − 1, the storage cost µ = (v(d − 1) + 1)/(dk).

First, we can prove that µ > 1/d, since

\[ k \leq v(d-1) \implies \mu = \frac{v(d-1)+1}{dk} \geq \frac{v(d-1)+1}{v(d-1)} \cdot \frac{1}{d} > \frac{1}{d}. \]

Similarly, k > d is satisfied as well:

\[ k \geq (v-1)(d-1) + 1 = (v-2)(d-1) + d > d, \]

due to v ≥ d + 1 ≥ 3.

If at least d of the k storage nodes store version i, for i ∈ [p + 1, v], then Fi can be recovered, since d × (1/d) = 1.

If at most d − 1 nodes store each such version i, then at least k − (v − p)(d − 1) storage nodes only store version p, where

\[ k - (v-p)(d-1) \geq (v-1)(d-1) + 1 - (v-p)(d-1) = (p-1)(d-1) + 1 \geq 1. \]

Therefore, Fp can be recovered from the k storage nodes, due to

\[ \mu \big[ k - (v-p)(d-1) \big] + \Big( \mu - \frac{1}{d} \Big)(v-p)(d-1) = k\mu - \frac{(v-p)(d-1)}{d} = \frac{v(d-1)+1}{d} - \frac{(v-p)(d-1)}{d} = \frac{p(d-1)+1}{d} \geq \frac{(d-1)+1}{d} = 1. \]

To sum up, Fi for i ∈ {p + 1, p + 2, ..., v} is recoverable if at least d < k storage nodes store version i; otherwise Fp can be recovered from the k storage nodes. In other words, the latest version of the file or a version close to the latest version is recoverable by connecting to any set of k storage nodes.

Case 4: For k ∈ [v(d − 2) + 2, v(d − 1)] and d ≥ v, the storage cost µ = (v(d − 1) + 1)/(dk).

Similarly, as proved in Case 3, µ is larger than 1/d. In this case, k ≥ d can be proved:

\[ k \geq v(d-2) + 2 = (v-1)(d-2) + d \geq d. \]

Similarly, connecting to any set of k storage nodes, Fi for i ∈ {p + 1, p + 2, ..., v} can be recovered if at least d nodes store version i. If at most d − 1 nodes store each such version, then at least k − (v − p)(d − 1) storage nodes only store version p, since

\[ k \geq v(d-2) + 2 = (v-1)(d-1) + d - v + 1 \geq (v-1)(d-1) + 1, \]

due to d ≥ v, and Fp is then recoverable as in Case 3.

All in all, by connecting to any set of k storage nodes, Fi for some i ∈ {p, p + 1, ..., v} can be recovered in distributed storage systems.

A.2 Storage Bandwidth Trade-off

The relationship between the storage space and repair bandwidth in distributed storage systems is provided in this part.

First, the information flow graph with DR storage nodes is depicted in Figure A.1. The source file is forwarded to the n storage nodes by the source node in a distributed way. Each node among the s surviving nodes and the r DR storage nodes is supposed to transmit β bits to the newcomer. Note that, in order to reconstruct a given version, at least d storage nodes are needed, as proved before; thus s + r ≥ d should be satisfied.

Figure A.1: The information flow graph with DR storage nodes, where S, DR and DC stand for source node, DR storage node and data collector, respectively.

According to the cut-set bound constraint [22], the necessary condition for reconstructing the source file is that the min-cut value is no less than the file size, that is,

\[ \sum_{i=0}^{d-1} \min\{ (s+r-i)\beta,\ \alpha \} \geq M. \]


Following Theorem 1 in [13], if α ≥ α_t then this condition is met and the parameters are feasible via linear network coding over a sufficiently large finite field GF(q) in distributed storage systems. In general, we have

\[ \alpha_t = \frac{M - \sum_{j=0}^{i-1} m_j}{d - i}, \quad i = 0, 1, 2, \ldots, d-1, \]

where

\[ M \in \Big( \sum_{j=0}^{i-1} m_j + (d-i)\, m_{i-1},\ \ \sum_{j=0}^{i} m_j + (d-i-1)\, m_i \Big], \quad m_i = \Big( 1 - \frac{d-1-i}{s+r} \Big)\gamma. \]

We can derive that

\[ \sum_{j=0}^{i-1} m_j = \sum_{j=0}^{i-1} \Big( 1 - \frac{d-1-j}{s+r} \Big)\gamma = \frac{i(2s+2r-2d+i+1)}{2(s+r)}\,\gamma, \]

and

\[ \sum_{j=0}^{i} m_j + (d-i-1)\, m_i = \Big[ \frac{(i+1)(2s+2r-2d+i+2)}{2(s+r)} + (d-i-1)\Big(1 - \frac{d-1-i}{s+r}\Big) \Big]\gamma = \frac{i(2d-i-1) + 2d(s+r-d+1)}{2(s+r)}\,\gamma. \]

For simplicity, we denote

\[ h(i) = \frac{i(2s+2r-2d+i+1)}{2(s+r)}, \]

and

\[ y(i) = \frac{2M(s+r)}{i(2d-i-1) + 2d(s+r-d+1)}. \]

Notably, y(i) is a decreasing function of i, so it attains its maximum y(0) at i = 0, where the corresponding storage space is α_t = M/d, since h(0) = 0.
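As a numerical check of this derivation (our own sketch, for the illustrative parameters M = 8, d = 4, s = 3, r = 2, which satisfy s + r ≥ d), each threshold point (y(i), α_t) meets the cut-set bound with equality:

```python
# Numerical check of the trade-off: at each threshold gamma = y(i) with
# alpha_t = (M - h(i) * gamma) / (d - i), the cut-set bound
# sum_{t=0}^{d-1} min((s + r - t) * beta, alpha) >= M holds with equality.
M, d, s, r = 8.0, 4, 3, 2                # illustrative; s + r >= d

def y(i):
    return 2 * M * (s + r) / (i * (2*d - i - 1) + 2*d * (s + r - d + 1))

def h(i):
    return i * (2*s + 2*r - 2*d + i + 1) / (2 * (s + r))

def min_cut(alpha, gamma):
    beta = gamma / (s + r)               # per-node download
    return sum(min((s + r - t) * beta, alpha) for t in range(d))

for i in range(d):
    gamma = y(i)
    alpha_t = (M - h(i) * gamma) / (d - i)
    assert abs(min_cut(alpha_t, gamma) - M) < 1e-9
    print(f"i={i}: gamma={gamma:.3f}, alpha_t={alpha_t:.3f}")
```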


Bibliography

[1] Majid Gerami. Coding, computing, and communication in distributed storage systems. Doctoral thesis in electrical engineering, Stockholm, Sweden, 2016.

[2] Krishna Kant. Data center evolution: A tutorial on state of the art, issues, and challenges. Comput. Netw., 53(17):2939–2965, December 2009.

[3] https://www.techopedia.com/definition/26535/cloud-storage.

[4] Seth Gilbert and Nancy Lynch. Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. SIGACT News, 33(2):51–59, June 2002.

[5] J. S. Plank and C. Huang. Tutorial: Erasure coding for storage applications. In Proceedings of the 11th USENIX Conference on File and Storage Technologies, FAST'13. USENIX Association, 2013.

[6] James Lee Hafner. WEAVER codes: Highly fault tolerant erasure codes for storage systems. In Proceedings of the 4th USENIX Conference on File and Storage Technologies, FAST'05, pages 16–16. USENIX Association, 2005.

[7] A. G. Dimakis, V. Prabhakaran, and K. Ramchandran. Decentralized erasure codes for distributed networked storage. IEEE Transactions on Information Theory, 52(6):2809–2816, June 2006.

[8] N. P. Anthapadmanabhan, E. Soljanin, and S. Vishwanath. Update-efficient codes for erasure correction. In 48th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 376–382, Sept 2010.


[10] Zhiying Wang and Viveck R. Cadambe. Multi-version coding in distributed storage. CoRR, abs/1506.00684, 2015.

[11] Sean Rhea, Chris Wells, Patrick Eaton, Dennis Geels, Ben Zhao, Hakim Weatherspoon, and John Kubiatowicz. Maintenance-free global data storage. IEEE Internet Computing, 5(5):40–49, September 2001.

[12] C. Huang, M. Chen, and J. Li. Pyramid codes: Flexible schemes to trade space for access efficiency in reliable data storage systems. In Sixth IEEE International Symposium on Network Computing and Applications (NCA 2007), pages 79–86, July 2007.

[13] A. G. Dimakis, P. B. Godfrey, M. J. Wainwright, and K. Ramchandran. Network coding for distributed storage systems. In IEEE INFOCOM 2007 - 26th IEEE International Conference on Computer Communications, pages 2000–2008, May 2007.

[14] Soroush Akhlaghi, Abbas Kiani, and Mohammad Reza Ghanavati. Cost-bandwidth tradeoff in distributed storage systems. Computer Communications, 33(17):2105–2115, November 2010.

[15] N. B. Shah, K. V. Rashmi, P. V. Kumar, and K. Ramchandran. Explicit codes minimizing repair bandwidth for distributed storage. In 2010 IEEE Information Theory Workshop (ITW 2010, Cairo), pages 1–5, Jan 2010.

[16] A. S. Rawat, D. S. Papailiopoulos, A. G. Dimakis, and S. Vishwanath. Locality and availability in distributed storage. IEEE Transactions on Information Theory, 62(8):4481–4493, Aug 2016.

[17] Osama Khan, Randal Burns, James Plank, and Cheng Huang. In search of I/O-optimal recovery from disk failures. In Proceedings of the 3rd USENIX Conference on Hot Topics in Storage and File Systems, HotStorage'11, pages 6–6, 2011.

[18] M. Gerami, M. Xiao, C. Fischione, and M. Skoglund. Decentralized minimum-cost repair for distributed storage systems. In Proc. IEEE International Conference on Communications (ICC), 2013.


[19] Hakim Weatherspoon and John Kubiatowicz. Erasure coding vs. replication: A quantitative comparison. In Peer-to-Peer Systems: First International Workshop, IPTPS 2002, Cambridge, MA, USA, March 7–8, 2002, Revised Papers, pages 328–337, 2002.

[20] R. Singleton. Maximum distance q-nary codes. IEEE Transactions on Information Theory, 10(2):116–118, April 1964.

[21] Lihao Xu and J. Bruck. X-code: MDS array codes with optimal encoding. IEEE Transactions on Information Theory, 45(1):272–276, Jan 1999.

[22] R. Ahlswede, N. Cai, S. Y. R. Li, and R. W. Yeung. Network information flow. IEEE Transactions on Information Theory, 46(4):1204–1216, July 2000.

[23] S. Y. R. Li, R. W. Yeung, and Ning Cai. Linear network coding. IEEE Transactions on Information Theory, 49(2):371–381, Feb 2003.

[24] Antony Rowstron and Peter Druschel. Storage management and caching in PAST, a large-scale, persistent peer-to-peer storage utility. SIGOPS Oper. Syst. Rev., 35(5):188–201, October 2001.

[25] Kiran Tati and Geoffrey M. Voelker. On object maintenance in peer-to-peer systems. In Proc. of the 5th International Workshop on Peer-to-Peer Systems, 2006.

[26] A. Phutathum, M. Gerami, M. Xiao, and D. Lin. A practical study of distributed storage systems with network coding in wireless networks. In Proc. IEEE International Conference on Communication Systems (ICCS), 2014.

