An Analytical Study of Consistency and Performance of DHTs under Churn


Sameh El-Ansary¹, Supriya Krishnamurthy¹, Erik Aurell¹,² and Seif Haridi¹,³

¹ Distributed Systems Laboratory, SICS Swedish Institute of Computer Science, P. O. Box 1263, SE-16429 Kista, Sweden

² Department of Physics, KTH-Royal Institute of Technology, SE-106 91 Stockholm, Sweden

³ IMIT, KTH-Royal Institute of Technology

{supriya,sameh,eaurell,seif}@sics.se

SICS Technical Report (Draft) T2004:12 ISSN 1100-3154

ISRN:SICS-T–2004/12-SE

Abstract. In this paper, we present a complete analytical study of dynamic membership (aka churn) in structured peer-to-peer networks. We use a master-equation-based approach, which is used traditionally in non-equilibrium statistical mechanics to describe steady-state or transient phenomena. We demonstrate that this methodology is in fact also well suited to describing structured overlay networks by an application to the Chord system. For any rate of churn and stabilization rates, and any system size, we accurately account for the functional form of: the distribution of inter-node distances, the probability of network disconnection, and the fraction of failed or incorrect successor and finger pointers, and we show how these quantities can be used to predict both the performance and consistency of lookups under churn. Additionally, we discuss how churn may actually be of different "types" and the implications this will have for structured overlays in general. All theoretical predictions match simulation results closely. The analysis includes details that are applicable to a generic structured overlay deploying a ring as well as Chord-specific details that can act as guidelines for analyzing other systems.

Keywords: Peer-To-Peer, Structured Overlays, Distributed Hash Tables, Dynamic Membership in Large-scale Distributed Systems, Analytical Modeling, Master Equations.


1 Introduction

An intrinsic property of Peer-to-Peer systems is the process of never-ceasing dynamic membership. Structured Peer-to-Peer Networks (aka Distributed Hash Tables (DHTs)) have the underlying principle of arranging nodes in an overlay graph of known topology and diameter. This knowledge results in the provision of performance guarantees. However, dynamic membership continuously “corrupts/churns” the overlay graph and every DHT strives to provide a technique to “correct/maintain” the graph in the face of this perturbation.

Both theoretical and empirical studies have been conducted to analyze the performance of DHTs undergoing “churn” and simultaneously performing “maintenance”. Liben-Nowell et al. [6] prove a lower bound on the maintenance rate required for a network to remain connected in the face of a given dynamic membership rate. Aspnes et al. [2] give upper and lower bounds on the number of messages needed to locate a node/data item in a DHT in the presence of node or link failures. The value of such theoretical studies is that they provide insights neutral to the details of any particular DHT. Empirical studies have also been conducted to complement these theoretical studies by showing how, within the asymptotic bounds, the performance of a DHT may vary substantially depending on different DHT designs and implementation decisions. Examples include the work of Li et al. [5], Rhea et al. [8] and Rowstron et al. [3].

† This work is funded by the Swedish VINNOVA AMRAM and PPC projects, the European IST-FET PEPITO and 6th FP EVERGROW projects.

In this paper, we present a new approach to studying churn, based on working with master equations, a widely used tool wherever the mathematical theory of stochastic processes is applied to real-world phenomena [7]. We demonstrate the applicability of this approach to one specific DHT: Chord [9].

A master-equation description for a dynamically evolving system is achieved by first defining a state of the system. This is just a listing of the quantities one would need to know for the fullest description of the system. For Chord, the state could be defined as a listing of how many nodes there are in the system and what the state (whether correct, incorrect or failed) of each of the pointers of those nodes is. This information is not enough to draw a unique graph of network connections (because, for example, if we know that a given node has an "incorrect" successor pointer, this still does not tell us which node it is pointing to). However, as we will see, beginning at this level of description is sufficient to keep track of most of the details of the Chord protocols.

Having defined a state, the master equation is simply an equation for the evolution of the probability of finding the system in this state, given the details of the dynamics. The specific nature of the dynamics plays a role in evaluating all the terms leading to the gain or loss of this probability, i.e. keeping track of the contribution of all the events which can bring about changes in the probability in a micro-instant of time.

Using this formalism, our results are accurate functional forms of the following: (i) The distribution of inter-node distances when the system is in equilibrium or, in general, when a network is growing or shrinking. This distribution is independent of any details of Chord and is applicable to any DHT deploying a ring. (ii) Chord-specific inter-node distribution properties. (iii) For every outgoing pointer of a Chord node, we systematically compute the probability that it is in any one of its possible states. This probability is different for each of the successor and finger pointers. We then use this information to predict other quantities such as (iv) the probability that the network gets disconnected, (v) lookup consistency (number of failed lookups), and (vi) lookup performance (latency). All quantities are computed as a function of the parameters involved and all results are verified by simulations.

2 Related Work

Closest in spirit to our work is the informal derivation in the original Chord paper [9] of the average number of timeouts encountered by a lookup. This quantity was approximated there by the product of the average number of fingers used in a lookup and the probability that a given finger points to a departed node. Our methodology not only allows us to derive the latter quantity rigorously but also demonstrates how this probability depends on which finger (or successor) is involved. Further, we are able to derive an exact relation between this probability and lookup performance and consistency, accurate at any value of the system parameters.

In the works of Aberer et al. [1] and Wang et al. [10], DHTs are analyzed under churn and the results are compared with simulations. However, the main parameter of the analysis is the probability that a randomly selected entry of a routing table is stale. In our analysis, we determine this quantity from system details and churn rates.

A brief announcement of the results presented in this paper has appeared earlier in [4].

3 Assumptions & Definitions

Basic Notation. In what follows, we assume that the reader is familiar with Chord. However, we introduce the notation used below. We use $K$ to mean the size of the Chord key space and $N$ the number of nodes. Let $M = \log_2 K$ be the number of fingers of a node and $S$ the length of the immediate successor list, usually set to a value $= O(\log N)$. We refer to nodes by their keys, so a node $n$ implies a node with key $n \in 0 \cdots K-1$. We use $p$ to refer to the predecessor, $s$ for referring to the successor list as a whole, and $s_i$ for the $i$th successor. Data structures of different nodes are distinguished by prefixing them with a node key, e.g. $n'.s_1$, etc. Let $fin_i.start$ denote the start of the $i$th finger (where for a node $n$, $\forall i \in 1..M$, $n.fin_i.start = n + 2^{i-1}$) and $fin_i.node$ denote the node pointed to by that finger (which is the closest successor of $n.fin_i.start$ on the ring).
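The finger notation can be made concrete with a minimal Python sketch. The helper names (`finger_starts`, `successor_of`, `finger_table`) and the toy ring are illustrative assumptions, not part of the Chord implementation used in the paper; the sketch also assumes $K$ is a power of two and that a node sitting exactly on a finger start counts as its successor.

```python
from bisect import bisect_left


def finger_starts(n, K):
    """Start keys n + 2^(i-1) (mod K) for i = 1..M, with M = log2(K)."""
    M = K.bit_length() - 1  # assumes K is a power of two
    return [(n + 2 ** (i - 1)) % K for i in range(1, M + 1)]


def successor_of(key, nodes, K):
    """Closest node at or after `key` on the ring; `nodes` must be sorted."""
    idx = bisect_left(nodes, key % K)
    return nodes[idx % len(nodes)]  # wrap around past the largest key


def finger_table(n, nodes, K):
    """fin_i.node for i = 1..M, computed from a global view of the ring."""
    return [successor_of(start, nodes, K) for start in finger_starts(n, K)]


K = 16                  # toy key space (hypothetical)
nodes = [1, 4, 9, 14]   # toy ring (hypothetical)
print(finger_starts(4, K))
print(finger_table(4, nodes, K))
```

In this toy ring the first three fingers of node 4 all resolve to the same node, which is the kind of pointer sharing that Section 4.1 quantifies.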

Steady State Assumption. $\lambda_j$ is the rate of joins per node, $\lambda_f$ the rate of failures per node and $\lambda_s$ the rate of stabilizations per node. We carry out our analysis for the general case when the rate of doing successor stabilizations, $\alpha\lambda_s$, is not necessarily the same as the rate at which finger stabilizations, $(1-\alpha)\lambda_s$, are performed. In all that follows, we impose the steady state condition $\lambda_j = \lambda_f$ unless otherwise stated. Further, it is useful to define $r \equiv \lambda_s/\lambda_f$, which is the relevant ratio on which all the quantities we are interested in will depend; e.g., $r = 50$ means that a join/fail event takes place every half an hour for a stabilization which takes place once every 36 seconds. Throughout the paper we will use the terms $\lambda_j\Delta t$, $\lambda_f\Delta t$, $\alpha\lambda_s\Delta t$ and $(1-\alpha)\lambda_s\Delta t$ to denote the respective probabilities that a join, a failure, a successor stabilization, or a finger stabilization takes place during a micro period of time of length $\Delta t$.
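As a small worked example of these definitions, the sketch below computes $r$ from mean event periods and lists the four per-node event probabilities for a micro period $\Delta t$. The function names and the numerical rates are illustrative assumptions; each rate is taken as the inverse of a mean period.

```python
def churn_ratio(failure_period_s, stabilization_period_s):
    """r = lambda_s / lambda_f, with each rate taken as 1 / (mean period)."""
    return failure_period_s / stabilization_period_s


def event_probabilities(lam_j, lam_f, lam_s, alpha, dt):
    """Per-node probabilities of each event type in a micro period dt."""
    return {
        "join": lam_j * dt,
        "failure": lam_f * dt,
        "successor_stabilization": alpha * lam_s * dt,
        "finger_stabilization": (1 - alpha) * lam_s * dt,
    }


# Example from the text: a join/fail every half hour, a stabilization every 36 s.
r = churn_ratio(failure_period_s=1800, stabilization_period_s=36)
print(r)

# Steady state (lam_j == lam_f), in units where lam_f = 1 (hypothetical choice).
print(event_probabilities(lam_j=1.0, lam_f=1.0, lam_s=r, alpha=0.5, dt=1e-4))
```

The two stabilization probabilities always sum to $\lambda_s\Delta t$, so $\alpha$ only controls how the stabilization budget is split between successors and fingers.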

Parameters. The parameters of the problem are hence: $K$, $N$, $\alpha$ and $r$. All relevant measurable quantities should be entirely expressible in terms of these parameters.

Chord Algorithms & Simulation. A detailed description of the algorithms used is provided in Appendix A. Since we are collecting statistics like the probability of a particular finger pointer to be wrong, we need to repeat each experiment 100 times before obtaining well-averaged results. The total sequential real time of the simulations for this paper was about 1800 hours, parallelized on a cluster of 14 nodes, where we had $N = 1000$, $K = 2^{20}$, $S = 6$, $200 \le r \le 2000$ and $0.25 \le \alpha \le 0.75$.

4 The Analysis

4.1 Distributional Properties of Inter-Node Distances

During churn, the average inter-node distance is a fluctuating quantity whose distribution is used throughout our analysis. The derivation we present here of this distribution is independent of any details of the DHT implementation and depends solely on the dynamics of the join and leave process. It is hence applicable to any DHT that deploys a circular key space.

Definition 4.1 Given two keys $u, v \in \{0 \ldots K-1\}$, the “distance” between them is $u - v$ (with modulo-$K$ arithmetic). We interchangeably say that $u$ and $v$ form an “interval” of length $u - v$. Hence the number of keys inside an interval of length $\ell$ is $\ell - 1$.

Definition 4.2 Let $Int_x$ be the number of intervals of length $x$, i.e. the number of pairs of consecutive nodes which are separated by a distance of $x$ keys on the ring.
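The two definitions translate directly into code. The following sketch, with hypothetical helper names and a toy ring, computes the modulo-$K$ distance and the interval counts $Int_x$ for a given set of node keys.

```python
from collections import Counter


def distance(u, v, K):
    """Distance u - v between two keys, with modulo-K arithmetic."""
    return (u - v) % K


def interval_counts(nodes, K):
    """Int_x: number of pairs of consecutive nodes separated by x keys."""
    ring = sorted(nodes)
    counts = Counter()
    for i, v in enumerate(ring):
        u = ring[(i + 1) % len(ring)]  # next node clockwise, wrapping around
        counts[distance(u, v, K)] += 1
    return counts


K = 16                  # toy key space (hypothetical)
nodes = [1, 4, 9, 14]   # toy ring (hypothetical)
int_x = interval_counts(nodes, K)
print(dict(int_x))
print(sum(int_x.values()), sum(x * c for x, c in int_x.items()))
```

The two printed sums are the number of nodes $N$ and the key-space size $K$, which are the two constraints that the proof of Theorem 4.1 relies on.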

Theorem 4.1 For a process in which nodes join or leave with equal rates independently of each other and uniformly on the ring, and the number of nodes $N$ in the network is almost constant with $N \ll K$, the probability $P(x) \equiv Int_x/N$ of finding an interval of length $x$ is $P(x) = \rho^{x-1}(1-\rho)$, where $\rho = \frac{K-N}{K}$.

Proof: By definition $\sum_x P(x) = 1$ and $\sum_x x P(x) = K/N$. Further, for the mean number of peers, the join-leave process we consider simply implies that $\frac{dN}{dt} = \lambda_j - \lambda_f$. We will need to check that an equation for $Int_x$ does indeed satisfy the above constraints.

| $Int_x(t+\Delta t)$ | Rate of change |
|---|---|
| $= Int_x(t) - 1$ | $c_{1.1} = (\lambda_f\Delta t)\,2P(x)$ |
| $= Int_x(t) - 1$ | $c_{1.2} = (\lambda_j\Delta t)\,\frac{N(x-1)P(x)}{K-N}$ |
| $= Int_x(t) + 1$ | $c_{1.3} = (\lambda_f\Delta t)\sum_{x_1=1}^{x-1}P(x_1)P(x-x_1)$ |
| $= Int_x(t) + 1$ | $c_{1.4} = (\lambda_j\Delta t)\,\frac{2N}{K-N}\sum_{x_1>x}P(x_1)$ |
| $= Int_x(t)$ | $1 - (c_{1.1} + c_{1.2} + c_{1.3} + c_{1.4})$ |

Table 1: Gain and loss terms for $Int_x$, the number of intervals of length $x$.

We now write an equation for $Int_x$ by considering all the processes which lead to its gain or loss. These are summarized in Table 1.

First, a failure of either of the boundary nodes of an interval of size $x$ leads to its loss at rate $c_{1.1}$. That is, since the node killed is randomly picked amongst all the nodes in the interval, the probability that it was participating on either side of an interval of length $x$ is $2P(x)$.

Second, an interval of size $x$ can be lost at rate $c_{1.2}$ if a joining node splits it. Only joining with keys that belong to one of the $Int_x$ intervals can lead to the loss of an interval of length $x$, and in each one of these there are $x-1$ ways (available keys) for splitting. Therefore $(x-1) \times Int_x$ positions out of the $K-N$ available keys can destroy an interval of length $x$. That is, the probability that one of the intervals of length $x$ is destroyed is $\frac{(x-1)Int_x}{K-N}$, which can be rewritten as $\frac{N(x-1)P(x)}{K-N}$.

Third, the number of intervals of size $x$ can increase by 1 at rate $c_{1.3}$ if a failure of a boundary node results in the aggregation of two adjacent intervals. To clarify that, we give the following examples. An interval of length 1 cannot be formed by such a process. An interval of length 2 can be formed by the failure of a node if the node that failed was shared between two adjacent intervals of length 1. We are assuming here that the probability of picking two adjacent intervals of length 1 is $P(1)P(1)$. This is in effect assuming that the probability of having two adjacent intervals of size 1 factorises to $P(1)^2$. However, for this system, this is an accurate estimation. Thus, in general, the probability of forming an interval of length $x$ is $\sum_{x_1=1}^{x-1}P(x_1)P(x-x_1)$.

Fourth, an increase can happen at rate $c_{1.4}$ if a join event splits a larger interval into an interval of size $x$. For a join to form an interval of length $x$, it must occur in an interval of length greater than $x$. In each interval of length $x_1 > x$, there are exactly two ways of forming an interval of length $x$. Therefore, the probability of forming an interval of length $x$ is equal to $\frac{2\sum_{x_1>x}Int_{x_1}}{K-N}$, which can be rewritten as $\frac{2N\sum_{x_1>x}P(x_1)}{K-N}$.

Finally, $Int_x$ remains the same if none of the above happens.

Therefore the equation for $Int_x$ for $x > 1$ is:

$$\frac{dInt_x}{dt} = -P(x)\left[2\lambda_f + \frac{N\lambda_j(x-1)}{K-N}\right] + \lambda_f\sum_{x_1=1}^{x-1}P(x_1)P(x-x_1) + 2\lambda_j\frac{N}{K-N}\sum_{x_1>x}P(x_1). \qquad (1)$$

The equation for $Int_1$ is the same as the above except that the second term is missing. We can check that

$$\frac{d}{dt}\sum_x Int_x = \frac{dN}{dt} = \lambda_j - \lambda_f \qquad (2)$$

as required. Further, we can check that the constraint

$$\frac{d}{dt}\sum_x x\,Int_x = \frac{dK}{dt} = 0$$

is also obeyed. Equation 1 can be readily solved, leading to the solution:

$$P(x) = \rho^{x-1}(1-\rho) \qquad (3)$$

where $\rho = \frac{K-N}{K-N(1-\lambda_j/\lambda_f)}$. In the special case we are interested in here, where $\lambda_j = \lambda_f$, we have $\rho = \frac{K-N}{K}$. Note that if $\lambda_j \neq \lambda_f$, then $N$ is actually an increasing/decreasing function of time.

Given the above term for $\rho$ we can state the following corollary that gives an intuitive meaning for $\rho$ in the case $\lambda_j = \lambda_f$.

Corollary 4.1.1 Given a ring of $K$ keys populated by $N$ nodes, $\rho \equiv \frac{K-N}{K}$ is the ratio of the unpopulated keys to the total number of keys, i.e. the probability of picking a key at random and finding it empty is $\rho$.
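The claims of Theorem 4.1 can be checked numerically. The sketch below is not the authors' code: it evaluates the right-hand side of Eq. 1 at the geometric solution of Eq. 3 for $\lambda_j = \lambda_f$, and verifies the normalization and mean constraints. The sizes $K = 1024$, $N = 64$ and the function names are illustrative assumptions.

```python
def P(x, rho):
    """Interval-length distribution of Eq. 3."""
    return rho ** (x - 1) * (1 - rho)


def eq1_rhs(x, K, N, lam=1.0):
    """Right-hand side of Eq. 1 at P(x) of Eq. 3, with lam_j = lam_f = lam."""
    rho = (K - N) / K
    loss = P(x, rho) * (2 * lam + N * lam * (x - 1) / (K - N))
    merge = lam * sum(P(x1, rho) * P(x - x1, rho) for x1 in range(1, x))
    # sum_{x1 > x} P(x1) is a geometric tail and equals rho^x
    split = 2 * lam * N / (K - N) * rho ** x
    return -loss + merge + split


K, N = 1024, 64          # illustrative sizes, not the paper's
rho = (K - N) / K
xs = range(1, 3000)      # rho^3000 is negligible for these sizes
norm = sum(P(x, rho) for x in xs)
mean = sum(x * P(x, rho) for x in xs)
print(norm, mean, K / N)
print(max(abs(eq1_rhs(x, K, N)) for x in range(1, 200)))
```

The residual of Eq. 1 vanishes up to floating-point rounding for every $x$ tested, including $x = 1$, where the aggregation sum is empty.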

The proof of the above theorem does assume that (in the case $\lambda_j = \lambda_f$) the number of nodes $N$ is fairly constant. Indeed, at first sight this seems to be strictly true from Eq. 2. However, just as in a random walk, the variance in this case increases with time. We will comment more on the properties of the variance later. For the moment, we note that the above result can be generalised to also include the case when $N$ is a largely fluctuating quantity. In this case we only need to multiply the $N$-dependent terms in Eq. 1 with $Prob(N, t)$, the probability that there are $N$ nodes in the system at time $t$, and average over $N$.

We now derive some properties of this distribution which will be used in the ensuing analysis.


Figure 1: (a) Case when $n$ and $p$ have the same value of $fin_k.node$. (b) Case where a newly joined node $p$ copies the $k$th entry of its successor node $n$ as the best approximation for its own $k$th entry (by the join protocol). In this case, there could be a node $o$ which is the "correct" entry for $p.fin_k.node$. However, since $p$ is newly joined, the only information it has access to is the finger table of $n$.

Property 4.1 For any two keys $u$ and $v$, where $v = u + x$, let $b_i$ be the probability that the first node encountered in between these two keys is at $u + i$ (where $0 \le i < x$). Then $b_i \equiv \rho^i(1-\rho)$. The probability that there is definitely at least one node between $u$ and $v$ is $a(x) \equiv 1 - \rho^x$. Hence the conditional probability that the first node is at a distance $i$, given that there is at least one node in the interval, is $bc(i, x) \equiv b(i)/a(x)$.

Explanation: Consider $b_i$ first. For any key $u$, the probability that the first node encountered is at $u$ itself ($b_0$) is $1 - \rho$ from Corollary 4.1.1. Similarly, the probability that the first node encountered is at $u + 1$ ($b_1$) is $\rho(1-\rho)$. In general, the probability that the first populated node starting from $u$ is at $u + i$ is $b(i) \equiv \rho^i(1-\rho)$. Given this, the probability that there is at least one node between $u$ and $v = u + x$ (not including the case when the node is at $v$) is $\sum_{i=0}^{x-1} b_i = 1 - \rho^x \equiv a(x)$.
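Property 4.1 is a statement about a truncated geometric distribution, which the following sketch checks numerically. The function names mirror the symbols $b$, $a$ and $bc$; the value $\rho = 0.9$ is chosen only for illustration.

```python
def b(i, rho):
    """Probability that the first node at or after key u is at u + i."""
    return rho ** i * (1 - rho)


def a(x, rho):
    """Probability of at least one node among the x keys u, ..., u + x - 1."""
    return 1 - rho ** x


def bc(i, x, rho):
    """Probability that the first node is at u + i, given one exists before v."""
    return b(i, rho) / a(x, rho)


rho, x = 0.9, 7   # illustrative values
print(sum(b(i, rho) for i in range(x)), a(x, rho))
print(sum(bc(i, x, rho) for i in range(x)))
```

The first line prints the same value twice, and the conditional probabilities sum to one, as the explanation above requires.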

Property 4.2 The probability that a node and at least one of its immediate predecessors share the same $k$th finger is $p_1(k) \equiv \frac{\rho}{1+\rho}\left(1 - \rho^{2^k-2}\right)$. This is $\sim 1/2$ for $K \gg 1$ and $N \ll K$. Clearly $p_1 = 0$ for $k = 1$. It is straightforward (though tedious) to derive similar expressions for $p_2(k)$, the probability that a node and at least two of its immediate predecessors share the same $k$th finger, $p_3(k)$ and so on.

Explanation: If the distance between node $n$ and its predecessor $p$ is $x$, the distance between $n.fin_k.start$ and $p.fin_k.start$ is also $x$ (see Fig. 1(a)). If there is no node in between $n.fin_k.start$ and $p.fin_k.start$, then $n.fin_k.node$ and $p.fin_k.node$ will share the same value. From Eq. 3, the probability that the distance between $n$ and $p$ is $x$ is $\rho^{x-1}(1-\rho)$. However, $x$ has to be less than $2^{k-1}$, otherwise $p.fin_k.node$ will be equal to $n$. The probability that no node exists between $n.fin_k.start$ and $p.fin_k.start$ is $\rho^x$ (by Property 4.1). Therefore the probability that $n.fin_k.node$ and $p.fin_k.node$ share the same value is:

$$\sum_{x=1}^{2^{k-1}-1}\rho^{x-1}(1-\rho)\rho^x = \frac{\rho}{1+\rho}\left(1 - \rho^{2^k-2}\right)$$
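The closed form for $p_1(k)$ can be compared against the direct sum over predecessor distances. This is a minimal sketch with hypothetical function names; the values of $\rho$ are illustrative, with the last one matching the simulation setting $N = 1000$, $K = 2^{20}$.

```python
def p1_closed(k, rho):
    """Closed form: rho / (1 + rho) * (1 - rho^(2^k - 2))."""
    return rho / (1 + rho) * (1 - rho ** (2 ** k - 2))


def p1_sum(k, rho):
    """Direct sum over predecessor distances x = 1 .. 2^(k-1) - 1."""
    return sum(rho ** (x - 1) * (1 - rho) * rho ** x
               for x in range(1, 2 ** (k - 1)))


rho = 0.9   # illustrative value
for k in (1, 2, 5, 10):
    print(k, p1_sum(k, rho), p1_closed(k, rho))

# For N = 1000 and K = 2^20, large fingers approach 1/2.
rho_sim = (2 ** 20 - 1000) / 2 ** 20
print(p1_closed(20, rho_sim))
```

For $k = 1$ both expressions give zero, since the sum is empty and the exponent $2^k - 2$ vanishes.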

Property 4.3 We can similarly assess the probability that the join protocol results in further replication of the $k$th pointer. Let us define $p_{join}(i, k)$ as the probability that a newly joined node chooses the $i$th entry of its successor's finger table for its own $k$th entry. Note that this is unambiguous even in the case that the successor's $i$th entry is repeated. All we are asking is, when is the $k$th entry of the new joinee the same as the $i$th entry of the successor? Clearly $i \le k$. In fact, for the larger fingers, we need only consider $p_{join}(k, k)$, since $p_{join}(i, k) \sim 0$ for $i < k$. Using the interval distribution we find, for large $k$,

$$p_{join}(k, k) \sim \rho\left(1 - \rho^{2^{k-2}-2}\right) + (1-\rho)\left(1 - \rho^{2^{k-2}-2}\right) - (1-\rho)\rho\left(2^{k-2}-2\right)\rho^{2^{k-2}-3}.$$

This function goes to 1 for large $k$.

Explanation: A newly joined node $p$ tries to assign $p.fin_k.node$ to the best approximate value from the finger table of its successor $n$. This approximate value might turn out to be $n.fin_k.node$, especially for the larger fingers. If $p$ chooses the $k$th entry of $n$ as its own $k$th entry, it must be because the $(k-1)$th entry of $n$ (if distinct, as is always the case for large $k$) does not afford it a better choice. The condition for this is $p.fin_k.start > n.fin_{k-1}.node$. If the distance between $n.fin_k.start$ and $p.fin_k.start$ is $x$, and the distance between $n.fin_{k-1}.start$ and $n.fin_{k-1}.node$ is $y$ (see Fig. 1(b)), then the constraint on $x$ and $y$ is $n + 2^{k-1} - x > n + 2^{k-2} + y$, or $x + y < 2^{k-2}$. We also have the added constraint that $x < 2^{k-1}$, since otherwise $p.fin_k.node$ would simply be $n$. Thus the probability $p_{join}(k, k)$ is:

$$\sum_{x=1}^{2^{k-1}-1}\;\sum_{y=1}^{2^{k-2}-x} P(x)P(y) = \sum_{z=2}^{2^{k-2}-1}\rho^{z-2}(1-\rho)^2(z-1) \qquad (4)$$

where we have put in the expressions for $P(x)$ and $P(y)$ from Eq. 3 and converted the double summation to a single one. This expression can be summed easily to obtain the result quoted above.

We can also analogously compute $p_{join}(i, k)$ for any $i$. The only trick here is to estimate the probability that, starting from $i$, the last distinct entry of $n$'s finger table does not give $p$ a better choice for its $k$th entry. This can again readily be computed using Property 4.1.
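The expression quoted for $p_{join}(k, k)$ can be checked against the double sum over the pairs $(x, y)$ allowed by the constraint $x + y < 2^{k-2}$. The sketch below is an independent numerical check, not the authors' code; it assumes $k \ge 3$ so that the exponents stay non-negative, and uses an illustrative $\rho = 0.9$.

```python
def P(x, rho):
    """Interval-length distribution of Eq. 3."""
    return rho ** (x - 1) * (1 - rho)


def pjoin_kk_double_sum(k, rho):
    """Sum of P(x) P(y) over x, y >= 1 with x + y < 2^(k-2)."""
    limit = 2 ** (k - 2)
    total = 0.0
    for x in range(1, limit):
        for y in range(1, limit - x):   # enforces x + y <= limit - 1
            total += P(x, rho) * P(y, rho)
    return total


def pjoin_kk_closed(k, rho):
    """Expression quoted in Property 4.3, with m = 2^(k-2) - 2."""
    m = 2 ** (k - 2) - 2
    return (rho * (1 - rho ** m) + (1 - rho) * (1 - rho ** m)
            - (1 - rho) * rho * m * rho ** (m - 1))


rho = 0.9   # illustrative value
for k in (4, 6, 8, 12):
    print(k, pjoin_kk_double_sum(k, rho), pjoin_kk_closed(k, rho))
```

The two columns agree, and the values approach 1 as $k$ grows, which is why only the $i = k$ term is kept for the larger fingers.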

4.2 Successor Pointers

We now turn to estimating various quantities of interest for Chord. In all that follows we will evaluate various average quantities as a function of the parameters. However, this same formalism can also be used for evaluating higher moments like the variance.

Figure 2: Theory and simulation for $w_1(r, \alpha)$, $d_1(r, \alpha)$ and $I(r, \alpha)$. First panel: $w_1(r, \alpha)$ for $\alpha = 0.25, 0.5, 0.75$ and $d_1(r, 0.75)$, plotted against the rate of stabilisation/rate of failure ($r = \lambda_s/\lambda_f$). Second panel: $I(r, \alpha)$ for $\alpha = 0.25, 0.5, 0.75$, plotted against the rate of stabilisation of successors/rate of failure ($\alpha r = \alpha\lambda_s/\lambda_f$).

Figure 3: Changes in $W_1$, the number of wrong (failed or outdated) $s_1$ pointers, due to joins, failures and stabilizations.

In the case of Chord, we need consider only one of three kinds of events happening at any micro-instant: a join, a failure or a stabilization. One assumption made in the following is that such a micro-instant of time exists, or in other words, that we can divide time until we have an interval small enough that only one of these three processes occurs in it. Another (more serious) assumption is that the state of the system is a product of the states of all the nodes. Nodes are hence assumed to have, for the most part, states independent of each other, i.e. the probability of two adjacent nodes having a wrong successor pointer is taken to be the product of the probabilities of the individual nodes having wrong successor pointers (though, as we will see, in the case of finger pointers we do also consider the case when adjacent nodes might have correlated fingers). Nevertheless, this ansatz works very well.

Consider first the successor pointers. Let $w_k(r, \alpha)$, $d_k(r, \alpha)$ denote the fraction of nodes having a wrong $k$th successor pointer or a failed one respectively, and $W_k(r, \alpha)$, $D_k(r, \alpha)$ be the respective numbers. A failed pointer is one which points to a departed node, and a wrong pointer points either to an incorrect node (alive but not correct) or a dead one. As we will see, both these quantities play a role in predicting lookup consistency and lookup length.

| Change in $W_1(r, \alpha)$ | Rate of change |
|---|---|
| $W_1(t+\Delta t) = W_1(t) + 1$ | $c_{2.1} = (\lambda_j\Delta t)(1 - w_1)$ |
| $W_1(t+\Delta t) = W_1(t) + 1$ | $c_{2.2} = \lambda_f(1 - w_1)^2\Delta t$ |
| $W_1(t+\Delta t) = W_1(t) - 1$ | $c_{2.3} = \lambda_f w_1^2\Delta t$ |
| $W_1(t+\Delta t) = W_1(t) - 1$ | $c_{2.4} = \alpha\lambda_s w_1\Delta t$ |
| $W_1(t+\Delta t) = W_1(t)$ | $1 - (c_{2.1} + c_{2.2} + c_{2.3} + c_{2.4})$ |

Table 2: Gain and loss terms for $W_1(r, \alpha)$, the number of wrong first successors as a function of $r$ and $\alpha$.

By the protocol for stabilizing successors in Chord, a node periodically contacts its first successor, possibly correcting it and reconciling with its successor list. Therefore, the numbers of wrong $k$th successor pointers are not independent quantities but depend on the number of wrong first successor pointers. We first consider $s_1$ here, and then briefly discuss the other cases towards the end of this section.

We write an equation for $W_1(r, \alpha)$ by accounting for all the events that can change it in a micro period of time $\Delta t$. An illustration of the different cases in which changes in $W_1$ take place due to joins, failures and stabilizations is provided in Fig. 3. In some cases $W_1$ increases/decreases while in others it stays unchanged. For each increase/decrease, Table 2 provides the corresponding probability.

By our implementation of the join protocol, a new node $n_y$, joining between two nodes $n_x$ and $n_z$, has its $s_1$ pointer always correct after the join. However, the state of $n_x.s_1$ before the join makes a difference. If $n_x.s_1$ was correct before the join, then after the join it will be wrong and therefore $W_1$ increases by 1. If $n_x.s_1$ was wrong before the join, then it will remain wrong after the join and $W_1$ is unaffected. Thus, we need to account for the former case only. The probability that $n_x.s_1$ is correct is $1 - w_1$, and from that follows the term $c_{2.1}$.

| Change in $N_{bu}(2, r, \alpha)$ | Rate of change |
|---|---|
| $N_{bu}(t+\Delta t) = N_{bu}(t) + 1$ | $c_{3.1} = (\lambda_f\Delta t)\,d_1(r, \alpha)$ |
| $N_{bu}(t+\Delta t) = N_{bu}(t) + 1$ | $c_{3.2} = \lambda_f\Delta t\,(1 - d_1)\,d_2$ |
| $N_{bu}(t+\Delta t) = N_{bu}(t) - 1$ | $c_{3.3} = \alpha\lambda_s\Delta t\,P_{bu}(2, r, \alpha)$ |
| $N_{bu}(t+\Delta t) = N_{bu}(t)$ | $1 - (c_{3.1} + c_{3.2} + c_{3.3})$ |

Table 3: Gain and loss terms for $N_{bu}(2, r, \alpha)$, the number of nodes with dead first and second successors.

For failures, we have 4 cases. To illustrate them we use nodes $n_x$, $n_y$, $n_z$ and assume that $n_y$ is going to fail. First, if both $n_x.s_1$ and $n_y.s_1$ were correct, then the failure of $n_y$ will make $n_x.s_1$ wrong and hence $W_1$ increases by 1. Second, if $n_x.s_1$ and $n_y.s_1$ were both wrong, then the failure of $n_y$ will decrease $W_1$ by one, since one wrong pointer disappears. Third, if $n_x.s_1$ was wrong and $n_y.s_1$ was correct, then $W_1$ is unaffected. Fourth, if $n_x.s_1$ was correct and $n_y.s_1$ was wrong, then the wrong pointer of $n_y$ disappears and $n_x.s_1$ becomes wrong, therefore $W_1$ is unaffected. For the first case to happen, we need to pick two nodes with correct pointers; the probability of this is $(1 - w_1)^2$. For the second case to happen, we need to pick two nodes with wrong pointers; the probability of this is $w_1^2$. From these probabilities follow the terms $c_{2.2}$ and $c_{2.3}$.

Finally, a successor stabilization does not affect $W_1$, unless the stabilizing node had a wrong pointer. The probability of picking such a node is $w_1$. From this follows the term $c_{2.4}$.

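Hence the equation for $W_1(r, \alpha)$ is:

$$\frac{dW_1}{dt} = \lambda_j(1 - w_1) + \lambda_f(1 - w_1)^2 - \lambda_f w_1^2 - \alpha\lambda_s w_1$$

Solving for $w_1$ in the steady state and putting $\lambda_j = \lambda_f$, we get:

$$w_1(r, \alpha) = \frac{2}{3 + r\alpha} \approx \frac{2}{r\alpha} \qquad (5)$$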

This expression matches well with the simulation results as shown in Fig. 2. $d_1(r, \alpha)$ is then $\approx \frac{1}{2}w_1(r, \alpha)$, since when $\lambda_j = \lambda_f$, about half the number of wrong pointers are incorrect and about half point to dead nodes. Thus $d_1(r, \alpha) \approx \frac{1}{r\alpha}$, which also matches well the simulations as shown in Fig. 2. We can also use the above reasoning to iteratively get $w_k(r, \alpha)$ for any $k$.
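The steady state of Eq. 5 can be reproduced by relaxing the balance of gain and loss terms numerically. The sketch below is an illustration under stated assumptions, not the paper's simulator: it treats the right-hand side as a balance for the fraction $w_1$, measures time in units of $1/\lambda_f$, and uses a plain forward-Euler step whose size and step count are arbitrary choices.

```python
def w1_steady(r, alpha):
    """Eq. 5: steady-state fraction of wrong first-successor pointers."""
    return 2.0 / (3.0 + r * alpha)


def dw1_dt(w1, r, alpha, lam_f=1.0):
    """Gain minus loss terms of Table 2, with lam_j = lam_f and lam_s = r * lam_f."""
    lam_j, lam_s = lam_f, r * lam_f
    return (lam_j * (1 - w1) + lam_f * (1 - w1) ** 2
            - lam_f * w1 ** 2 - alpha * lam_s * w1)


def relax(r, alpha, w1=0.0, dt=1e-4, steps=20000):
    """Forward-Euler integration from w1 = 0 (step size is an assumption)."""
    for _ in range(steps):
        w1 += dt * dw1_dt(w1, r, alpha)
    return w1


for r, alpha in ((200, 0.25), (200, 0.5), (2000, 0.75)):
    print(r, alpha, relax(r, alpha), w1_steady(r, alpha), 0.5 * w1_steady(r, alpha))
```

The quadratic terms cancel in the balance, so the relaxation is linear in $w_1$ and converges to $2/(3 + r\alpha)$; the last column is the corresponding estimate of $d_1 \approx w_1/2$.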

4.3 Break-up (Network Disconnection) Probability

We demonstrate below how calculating $d_k(r, \alpha)$, the fraction of nodes with dead $k$th pointers, helps in estimating precisely the probability that the network gets disconnected for any value of $r$ and $\alpha$. Let $P_{bu}(n, r, \alpha)$ be the probability that $n$ consecutive nodes fail. If $n = S$, the length of the successor list, then clearly the node gets disconnected from the network and the network breaks up. For the range of $r$ considered in Fig. 2, $P_{bu}(S, r, \alpha) \sim 0$. However, should we go lower, this starts becoming finite. The master equation analysis introduced here can be used to estimate $P_{bu}(n, r, \alpha)$ for any $1 \le n \le S$. We indicate how this might be done by considering the case $n = 2$. Let $N_{bu}(2, r, \alpha)$ be the number of configurations in which a node has both $s_1$ and $s_2$ dead and $P_{bu}(2, r, \alpha)$ be the fraction of such configurations. Table 3 indicates how this is estimated within the present framework.

Figure 4: Theory and simulation for $d_2(r, \alpha)$, plotted against the rate of stabilisation/rate of failure ($r = \lambda_s/\lambda_f$) for $\alpha = 0.25, 0.5, 0.75$.

A join event does not affect this probability in any way. So we need only consider the effect of failures or stabilization events. The term $c_{3.1}$ accounts for the situation when the first successor of a node is dead (which happens with probability $d_1(r, \alpha)$ as explained above). A failure event can then kill its second successor as well, and this happens with probability $c_{3.1}$. The second term is the situation that the first successor is alive (with probability $1 - d_1$) but the second successor is dead (with probability $d_2$). The probability $d_2$ is $\sim 2/\alpha r$: the second successor of a node being dead either implies that the first successor of its first successor is dead, with probability $d_1$, or that it has not stabilized recently and hence has not corrected its second successor pointer, which happens with probability $\sim 1/\alpha r$. These two terms add up to $2/\alpha r$. A stabilization event reduces the number of such configurations by one, if the node doing the stabilization had such a configuration to begin with.

Solving the equation for $N_{bu}(2, r, \alpha)$, one hence obtains that $P_{bu}(2, r, \alpha) \sim 3/(\alpha r)^2$. As Fig. 4 shows, this is a precise estimate.

Figure 5: Changes in $F_k$, the number of failed $fin_k$ pointers, due to joins, failures and stabilizations.

We can similarly estimate the probabilities for three consecutive nodes failing, etc., and hence also the disconnection probability $P_{bu}(S, r, \alpha)$. This formalism thus affords the possibility of making a precise prediction for when the system runs the danger of getting disconnected as a function of the parameters.
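The estimate $P_{bu}(2, r, \alpha) \sim 3/(\alpha r)^2$ follows from balancing the gain terms $c_{3.1} + c_{3.2}$ of Table 3 against the loss term $c_{3.3}$. The sketch below carries out this balance with the approximations $d_1 \approx 1/\alpha r$ and $d_2 \approx 2/\alpha r$ quoted in the text; the function names are hypothetical.

```python
def pbu2_balance(r, alpha):
    """Steady state of Table 3: (c3.1 + c3.2) = c3.3, solved for P_bu(2)."""
    ar = alpha * r
    d1 = 1.0 / ar                      # approximation from Section 4.2
    d2 = 2.0 / ar                      # approximation quoted in the text
    gain = d1 + (1.0 - d1) * d2        # (c3.1 + c3.2) / (lam_f * dt)
    return gain / ar                   # divide by alpha * lam_s / lam_f


def pbu2_leading(r, alpha):
    """Leading-order estimate 3 / (alpha r)^2."""
    return 3.0 / (alpha * r) ** 2


for r in (200, 1000, 2000):
    print(r, pbu2_balance(r, 0.5), pbu2_leading(r, 0.5))
```

The two values differ only by a correction of relative order $1/\alpha r$, which is negligible over the range $200 \le r \le 2000$ used in the simulations.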

Lookup Consistency. By the lookup protocol, a lookup is inconsistent if the immediate predecessor of the sought key has a wrong $s_1$ pointer. However, we need only consider the case when the $s_1$ pointer is pointing to an alive (but incorrect) node, since our implementation of the protocol always requires the lookup to return an alive node as an answer to the query. The probability that a lookup is inconsistent, $I(r, \alpha)$, is hence $w_1(r, \alpha) - d_1(r, \alpha)$. This prediction matches the simulation results very well, as shown in Fig. 2.
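As a brief illustration, the inconsistency probability can be tabulated over the simulated parameter range by combining Eq. 5 with the approximation $d_1 \approx w_1/2$. The function name is hypothetical, and the use of that approximation (rather than an independently derived $d_1$) is an assumption of this sketch.

```python
def lookup_inconsistency(r, alpha):
    """I(r, alpha) = w1 - d1, using Eq. 5 and the approximation d1 ~ w1 / 2."""
    w1 = 2.0 / (3.0 + r * alpha)
    d1 = 0.5 * w1
    return w1 - d1


for alpha in (0.25, 0.5, 0.75):
    row = [round(lookup_inconsistency(r, alpha), 5) for r in (200, 1000, 2000)]
    print(alpha, row)
```

Under this approximation $I(r, \alpha)$ reduces to $1/(3 + r\alpha)$, so it depends on $r$ and $\alpha$ only through the successor stabilization ratio $\alpha r$.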

4.4 Failure of Fingers

We now turn to estimating the fraction of finger pointers which point to failed nodes. As we will see, this is an important quantity for predicting lookups, since failed fingers cause timeouts and increase the lookup length. We need, however, only consider fingers pointing to dead nodes. Unlike members of the successor list, alive fingers, even if outdated, always bring a query closer to the destination and do not affect consistency or, substantially, even the lookup length. Therefore we consider fingers in only two states, alive or dead (failed). By our implementation of the stabilization protocol (see Appendix A), fingers and successors are stabilized entirely independently of each other. Thus even though the first finger is also always the first successor, this information is not used by the node in updating the finger.

Let $f_k(r, \alpha)$ denote the fraction of nodes having their $k$th finger pointing to a failed node and $F_k(r, \alpha)$ denote the respective number. For notational simplicity, we write these as simply $F_k$ and $f_k$. We can predict this function for any $k$ by again estimating the gain and loss terms for this quantity, caused by a join, failure or stabilization event, and keeping only the most relevant terms. These are listed in Table 4 and illustrated in Fig. 5.

| $F_k(t+\Delta t)$ | Rate of change |
|---|---|
| $= F_k(t) + 1$ | $c_{4.1} = (\lambda_j\Delta t)\sum_{i=1}^{k}p_{join}(i, k)\,f_i$ |
| $= F_k(t) - 1$ | $c_{4.2} = (1-\alpha)\frac{1}{M}f_k(\lambda_s\Delta t)$ |
| $= F_k(t) + 1$ | $c_{4.3} = (1 - f_k)^2\,[1 - p_1(k)]\,(\lambda_f\Delta t)$ |
| $= F_k(t) + 2$ | $c_{4.4} = (1 - f_k)^2\,(p_1(k) - p_2(k))\,(\lambda_f\Delta t)$ |
| $= F_k(t) + 3$ | $c_{4.5} = (1 - f_k)^2\,(p_2(k) - p_3(k))\,(\lambda_f\Delta t)$ |
| $= F_k(t)$ | $1 - (c_{4.1} + c_{4.2} + c_{4.3} + c_{4.4} + c_{4.5})$ |

Table 4: Some of the relevant gain and loss terms for $F_k$, the number of nodes whose $k$th fingers are pointing to a failed node, for $k > 1$.

A join event can play a role here by increasing F_k if the successor of the joinee had a failed ith pointer (which occurs with probability f_i) and the joinee replicated this from the successor as the joinee's kth pointer (which occurs with probability p_join(i, k), from property 4.3). For large enough k, this probability is one only for p_join(k, k); that is, the new joinee mostly replicates only the successor's kth pointer as its own kth pointer. This is what we consider here.

A stabilization evicts a failed pointer if there was one to begin with. The stabilization rate is divided by M, since a node stabilizes any one finger randomly, every time it decides to stabilize a finger at rate (1 − α)λ_s.

Given a node n with an alive kth finger (which occurs with probability 1 − f_k), when the node pointed to by that finger fails, the number of failed kth fingers (F_k) increases. The amount of this increase depends on the number of immediate predecessors of n that were pointing to the failed node with their kth finger. That number of predecessors could be 0, 1, 2, etc. Using property 4.2, the respective probabilities of those cases are 1 − p_1(k), p_1(k) − p_2(k), p_2(k) − p_3(k), etc.

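Solving for f_k in the steady state, we get:

$$
f_k = \frac{\left[2\tilde{P}_{rep}(k) + 2 - p_{join}(k) + \frac{r(1-\alpha)}{M}\right]}{2\left(1 + \tilde{P}_{rep}(k)\right)} - \frac{\sqrt{\left[2\tilde{P}_{rep}(k) + 2 - p_{join}(k) + \frac{r(1-\alpha)}{M}\right]^2 - 4\left(1 + \tilde{P}_{rep}(k)\right)^2}}{2\left(1 + \tilde{P}_{rep}(k)\right)} \tag{6}
$$

where P̃_rep(k) = Σ_i p_i(k). In principle, it is enough to keep as few as three terms in the sum. The above expressions match very well with the simulation results (Fig. 7).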
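As an illustration, Eq. 6 can be evaluated numerically and checked against the gain and loss terms of Table 4. The following is a minimal sketch, assuming λ_j = λ_f, that only the p_join(k, k) term of the join contribution is kept, and that the expected increase on a failure is 1 + P̃_rep(k) (the weighted sum of the increments +1, +2, +3, ...). The function names and the input values are hypothetical and are not taken from the simulations reported here.

```python
import math


def failed_finger_fraction(p_rep, p_join, r, alpha, M):
    """Steady-state fraction f_k of dead k-th fingers, following Eq. 6.

    p_rep  : list of p_i(k) for i = 1, 2, ... (truncated sum)
    p_join : p_join(k, k)
    r      : lambda_s / lambda_f
    alpha  : fraction of stabilizations spent on successors
    M      : number of fingers per node
    """
    P = sum(p_rep)
    B = 2.0 * P + 2.0 - p_join + r * (1.0 - alpha) / M
    disc = B * B - 4.0 * (1.0 + P) ** 2
    if disc < 0.0:
        # Happens when r(1 - alpha)/M < p_join: gains outpace stabilization.
        raise ValueError("no real steady-state solution for these parameters")
    return (B - math.sqrt(disc)) / (2.0 * (1.0 + P))


def balance_residual(f, p_rep, p_join, r, alpha, M):
    """Gain minus loss per failure event (Table 4, with lambda_j = lambda_f)."""
    P = sum(p_rep)
    gain = p_join * f + (1.0 + P) * (1.0 - f) ** 2
    loss = r * (1.0 - alpha) / M * f
    return gain - loss


if __name__ == "__main__":
    # Hypothetical inputs, chosen only to exercise the formula.
    p_rep = [0.5, 0.25, 0.125]
    f_k = failed_finger_fraction(p_rep, p_join=1.0, r=500.0, alpha=0.5, M=20)
    print("f_k =", f_k)
    print("residual =", balance_residual(f_k, p_rep, 1.0, 500.0, 0.5, 20))
```

The residual should vanish up to rounding error, which confirms that the root with the minus sign in front of the square root is the one satisfying the steady-state balance with 0 < f_k < 1.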


Figure 6: Cases that a lookup can encounter with the respective probabilities and costs.

4.5 Cost of Finger Stabilizations and Lookups

In this section, we demonstrate how the information about the failed fingers and successors can be used to predict the cost of stabilizations, lookups or, in general, the cost of reaching any key in the id space. By cost we mean the number of hops needed to reach the destination, including the number of timeouts encountered en route. Timeouts occur every time a query is passed to a dead node: the node does not answer and the originator of the query has to use another finger instead. For this analysis, we consider timeouts and hops to add equally to the cost. We can easily generalize this analysis to investigate the case when a timeout costs some factor n times the cost of a hop.

Define C_t(r, α) (also denoted C_t) to be the expected cost for a given node to reach some target key which is t keys away from it (which means reaching the first successor of this key). For example, C_1 would then be the cost of looking up the adjacent key (1 key away). Since the adjacent key is always stored at the first alive successor, if the first successor is alive (which occurs with probability 1 − d_1), the cost will be 1 hop. If the first successor is dead but the second is alive (which occurs with probability d_1(1 − d_2)), the cost will be 1 hop + 1 timeout = 2, which contributes 2 × d_1(1 − d_2) to the expected cost, and so forth. Therefore, we have C_1 = 1 − d_1 + 2 × d_1(1 − d_2) + 3 × d_1 d_2(1 − d_3) + · · · ≈ 1 + d_1 = 1 + 1/(αr).
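The series for C_1 and its leading-order approximation can be compared directly. The sketch below sums the series for a truncated list of d_j values; setting d_2 and d_3 equal to d_1 = 1/(αr) is an assumption made purely for illustration, and the function name is hypothetical.

```python
def cost_adjacent_key(d):
    """Expected cost C_1 from the series in the text.

    d : list [d_1, d_2, ...]; the series is truncated at len(d) terms.
    """
    cost = 0.0
    all_dead_so_far = 1.0  # probability that successors 1..j-1 are all dead
    for j, d_j in enumerate(d, start=1):
        cost += j * all_dead_so_far * (1.0 - d_j)
        all_dead_so_far *= d_j
    return cost


if __name__ == "__main__":
    alpha, r = 0.5, 200.0
    d1 = 1.0 / (alpha * r)
    series = cost_adjacent_key([d1, d1, d1])  # assumed: d_2 = d_3 = d_1
    print("series:", series, "approximation:", 1.0 + d1)
```

The difference between the two values is of order d_1 d_2, which is why the approximation C_1 ≈ 1 + d_1 is adequate whenever αr is large.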

For finding the expected cost of reaching a general distance t, we need to follow closely the Chord protocol, which would look up t by first finding the closest preceding finger. For the purposes of the analysis, we will find it easier to think in terms of the closest preceding start. Let us hence define ξ to be the start of the finger (say the kth) that most closely precedes t. Hence ξ = 2^{k−1} + n and t = ξ + m, i.e. there are m keys between the sought target t and the start of the most closely preceding finger. With that, we can write a recursion relation for C_{ξ+m} as follows:

$$
C_{\xi+m} = C_{\xi}\,[1 - a(m)] + (1 - f_k)\,a(m)\left[1 + \sum_{i=0}^{m-1} bc(i, m)\,C_{m-i}\right] + f_k\,a(m)\left[1 + \sum_{i=1}^{k-1} h_k(i) \sum_{l=0}^{\xi/2^i - 1} bc(l, \xi/2^i)\,\bigl(1 + (i - 1) + C_{\xi_i - l + m}\bigr) + O(h_k(k))\right] \tag{7}
$$

where ξ_i ≡ Σ_{m=1}^{i} ξ/2^m and h_k(i) is the probability that a node is forced to use its (k − i)th finger owing to the death of its kth finger. The probabilities a, b, bc have already been introduced in section 4, and we define the probability h_k(i) below.

The lookup equation, though rather complicated at first sight, merely accounts for all the possibilities that a Chord lookup will encounter, and deals with them exactly as the protocol dictates.

The first term (Fig. 6 (a)) accounts for the eventuality that there is no node intervening between ξ and ξ + m (which occurs with probability 1 − a(m)). In this case, the cost of looking for ξ + m is the same as the cost of looking for ξ.

The second term (Fig. 6 (b)) accounts for the situation when a node does intervene in between (with probability a(m)), and this node is alive (with probability 1 − f_k). Then the query is passed on to this node (with 1 added to register the increase in the number of hops) and then the cost depends on the length of the distance between this node and t.

The third term (Fig. 6 (c)) accounts for the case when the intervening node is dead (with probability f_k). Then the cost increases by 1 (for a timeout) and the query needs to find an alternative lower finger that most closely precedes the target.


Figure 7: Theory and simulation for f_k(r, α) and L(r, α). First panel: f_k(r, 0.5) for k = 7, 9, 11 and 14, theory and simulation. Second panel: lookup latency (hops + timeouts) L(r, 0.5), theory and simulation. Horizontal axis in both panels: rate of stabilisation of fingers / rate of failure, (1 − α)r = (1 − α)λ_s/λ_f.

Let the (k − i)th finger (for some i, 1 ≤ i ≤ k − 1) be such a finger. This happens with probability h_k(i); that is, h_k(i) denotes the probability that the lookup is passed back to the (k − i)th finger, either because the intervening fingers are dead or because they share the same finger table entry as the kth finger. The start of the (k − i)th finger is at ξ/2^i and the distance between ξ/2^i and ξ is equal to Σ_{m=1}^{i} ξ/2^m, which we denote by ξ_i. Therefore, the distance from the start of the (k − i)th finger to the target is equal to ξ_i + m. However, note that fin_{k−i}.node could be l keys away (with probability bc(l, ξ/2^i)) from fin_{k−i}.start (for some l, 0 ≤ l < ξ/2^i). Therefore, after making one hop to fin_{k−i}.node, the remaining distance to the target is ξ_i + m − l. The increase in cost for this operation is 1 + (i − 1); the 1 indicates the cost of taking up the query again by fin_{k−i}.node, and the i − 1 indicates the cost for trying and discarding each of the i − 1 intervening nodes. The probability h_k(i) is easy to compute given property 4.1 and the expression for the f_k's computed in the previous section:

$$
\begin{aligned}
h_k(i) &= a(\xi/2^i)\,(1 - f_{k-i}) \times \prod_{s=1}^{i-1} \bigl(1 - a(\xi/2^s) + a(\xi/2^s)\,f_{k-s}\bigr), \quad i < k \\
h_k(k) &= \prod_{s=1}^{k-1} \bigl(1 - a(\xi/2^s) + a(\xi/2^s)\,f_{k-s}\bigr)
\end{aligned} \tag{8}
$$

Eq. 8 accounts for all the reasons that a node may have to use its (k − i)th finger instead of its kth finger. This could happen because the intervening fingers were either dead or not distinct. The probabilities h_k(i) satisfy the constraint Σ_{i=1}^{k} h_k(i) = 1 since, clearly, either a node uses one of its fingers or it doesn't. This latter probability is h_k(k), that is, the probability that a node cannot use any earlier entry in its finger table. In this case, n proceeds to its successor list. The query is now passed on to the first alive successor and the new cost is a function of the distance of this node from the target t. We indicate this case by the last term in Eq. 7, which is O(h_k(k)). This can again be computed from the inter-node distribution and from the functions d_k(r, α) computed earlier. However, in practice, the probability for this is extremely small except for targets very close to n. Hence this does not significantly affect the value of general lookups and we ignore it for the moment.
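The probabilities of Eq. 8 and their normalization can be checked with a few lines of code. The following is a minimal sketch: the function name is hypothetical, ξ is taken as the distance 2^{k−1} from n to the start of the kth finger, and the form a(x) = 1 − ρ^x with ρ = 1 − N/K used in the example is an assumption standing in for the definition given in section 4. The f_j values are likewise illustrative only.

```python
def fallback_probabilities(k, a, f):
    """Return [h_k(1), ..., h_k(k)] following Eq. 8.

    k : index of the finger that turned out to be dead (k >= 2)
    a : callable, a(x) = probability that at least one node lies within x keys
    f : dict mapping finger index j to f_j, for j = 1 .. k-1
    """
    xi = 2 ** (k - 1)  # distance from n to the start of finger k
    h = []
    skipped = 1.0  # probability that fingers k-1, ..., k-i+1 were all unusable
    for i in range(1, k):
        a_i = a(xi // 2 ** i)
        h.append(a_i * (1.0 - f[k - i]) * skipped)
        skipped *= 1.0 - a_i + a_i * f[k - i]
    h.append(skipped)  # h_k(k): no lower finger is usable
    return h


if __name__ == "__main__":
    N, K = 1000, 2 ** 20          # assumed population and key-space size
    rho = 1.0 - N / K
    a = lambda x: 1.0 - rho ** x  # assumed form of a(x)
    k = 15
    f = {j: 0.05 for j in range(1, k)}  # illustrative f_j values
    h = fallback_probabilities(k, a, f)
    print("sum of h_k(i):", sum(h))
    print("h_k(k):", h[-1])
```

The sum equals one for any choice of a and f, because each factor in the product is exactly the complement of the probability that the corresponding finger is distinct and alive.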

The cost for general lookups is hence

$$
L(r, \alpha) = \frac{\sum_{i=1}^{K-1} C_i(r, \alpha)}{K}
$$

The lookup equation is solved recursively, given the coefficients and C_1. We plot the result in Fig. 7. The theoretical result matches the simulation very well.

5 What is Churn?

We now discuss a broader issue, connected with churn, which arises naturally in the context of our analysis. As we mentioned earlier, all our analysis is performed in the steady state, where the rate of joins (λ_j) is equal to the rate of failures (λ_f). However, the rates λ_j and λ_f can themselves each be chosen in one of two different ways: they could be either “per-network” or “per-node”. In the former case, the number of joinees (or the number of failures) does not depend on the current number of nodes in the network. This is the case when a Poisson model is considered either for arrivals or departures. Put another way, on average there is always a fixed number of nodes joining or failing per time interval, irrespective of the total number of nodes in the network. When these rates are instead chosen to be per-node, the number of joinees or failures does depend on the current number of occupied nodes. We consider three possibilities here: λ_j is per-network and λ_f is per-node; both are per-network; or (as is the case studied in this paper) both are per-node. In all three cases, since the system is always studied in the steady state, where the total number of joinees per unit time is equal to the total number of failures per unit time, the equation for the mean is always dN/dt = 0. We hence expect the mean behaviour to be the same, at least in the regime when N is roughly constant. However, the behaviour of fluctuations is very different in each of these three cases.

In the first case, the steady state condition is λ_j/N_0 = λ_f, where N_0 is the initial number of nodes in the system. The equation for the mean is dN/dt = λ_j/N − λ_f, which ensures that N cannot deviate too much from the steady state value. Similarly, one can write an equation for the second moment N^2: dN^2/dt = (λ_j/N + λ_f) + 2(λ_j − Nλ_f). While the first term is a 'noise' term which encourages fluctuations, the second term becomes stronger the larger the deviation from N_0 and hence strongly damps out fluctuations. Thus the number of nodes in the system remains close to its initial value.

In the second case, where the join and failure rates are both per-network, the equation for the mean is dN/dt = λ_j/N − λ_f/N. Hence putting λ_j = λ_f ensures the steady state condition. However, in this case the equation for the second moment is dN^2/dt = (λ_j/N + λ_f/N). The joins-failures process thus makes the system execute a “random walk” in N, where the “steps” of the walk depend on N and are smaller if N is larger. For such a system, fluctuations are not bounded and a large deviation can and will take the system to the N = 0 state eventually. The time for this to happen scales with N as N^3 for this process.

The third case (which is also the case considered in this paper) is when both rates are per-node. This is very similar to the second case. The equation for the mean is just dN/dt = λ_j − λ_f, as mentioned earlier. Again, setting λ_j = λ_f ensures steady state. The equation for the second moment is now dN^2/dt = (λ_j + λ_f). There is thus again no “repair” mechanism for large fluctuations, and the system will eventually be driven to extinction. In this case the process on N is just an ordinary random walk and the time taken to hit the N = 0 state scales as N^2.
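The qualitative difference between the three cases can be explored with a small event-driven simulation. The sketch below is not the simulator used for the results in this paper; it is a minimal illustration under the assumption that a per-network rate is a total rate independent of N, while a per-node rate gives a total rate proportional to N. Time is then measured in absolute units rather than in the rescaled units of the equations above. All names and parameter values are hypothetical.

```python
import random


def total_rates(kind, n, lam_j, lam_f):
    """Total (join, failure) rates of the whole network when it holds n nodes."""
    if kind == "join-network/fail-node":   # first case
        return lam_j, n * lam_f
    if kind == "both-network":             # second case
        return lam_j, lam_f
    if kind == "both-node":                # third case
        return n * lam_j, n * lam_f
    raise ValueError("unknown churn type: " + kind)


def simulate(kind, n0, lam_j, lam_f, events, seed=0):
    """Run up to `events` join/failure events; stop early if the network empties.

    Returns (elapsed time, final N, lowest N seen, highest N seen).
    """
    rng = random.Random(seed)
    n, t = n0, 0.0
    lowest = highest = n0
    for _ in range(events):
        if n == 0:
            break
        join, fail = total_rates(kind, n, lam_j, lam_f)
        t += rng.expovariate(join + fail)                   # waiting time to next event
        n += 1 if rng.random() < join / (join + fail) else -1
        lowest, highest = min(lowest, n), max(highest, n)
    return t, n, lowest, highest


if __name__ == "__main__":
    n0 = 1000
    # First case: steady state requires lam_j / n0 = lam_f.
    print(simulate("join-network/fail-node", n0, float(n0), 1.0, 20000))
    # Second and third cases: steady state requires lam_j = lam_f.
    print(simulate("both-network", n0, 1.0, 1.0, 20000))
    print(simulate("both-node", n0, 1.0, 1.0, 20000))
```

In the first case the join probability falls below one half as soon as N exceeds N_0, which is the restoring force described above. In the other two cases each event is a fair coin flip, so N performs an unbiased walk, and the two cases differ only in how fast events occur.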

Which of these 'types' of churn is the most relevant? In the real world, the churn felt by a DHT might be some time-varying mixture of these three, and will possibly also depend on the application. It is hence probably important to study all these mechanisms and their implications in detail.

6 Discussion and Conclusion

To summarize, in this paper we have presented a detailed theoretical analysis of a DHT-based P2P system, Chord, using a master-equation formalism. This analysis differs from existing theoretical work done on DHTs in that it aims not at establishing bounds, but at a precise determination of the relevant quantities in this dynamically evolving system. From the match of our theory and the simulations, it can be seen that we can predict these quantities with an accuracy of better than 1% in most cases.

Though this analysis is not exact (in the sense that approximations are made to make the analysis simpler), it provides a methodology to keep track of most of the relevant details of the system. We expect that the same analysis can be done for most other DHTs in a similar manner, thus helping to establish quantitative guidelines for their comparison.

Apart from the usefulness of this approach for its own sake, we can also gain some new insights into the system from it. For example, we see that the fraction of dead finger pointers f_k is an increasing function of the length of the finger. In fact, for large enough K, all the long fingers will be dead most of the time, making routing very inefficient. This implies that we need to consider a different stabilization scheme for the fingers (such as, perhaps, stabilizing the longer fingers more often than the shorter ones), in order that the DHT continues to function at high churn rates.

References

[1] Karl Aberer, Anwitaman Datta, and Manfred Hauswirth, Efficient, self-contained handling of identity in peer-to-peer systems, IEEE Transactions on Knowledge and Data Engineering 16 (2004), no. 7, 858–869.

[2] James Aspnes, Zoë Diamadi, and Gauri Shah, Fault-tolerant routing in peer-to-peer systems, Proceedings of the twenty-first annual symposium on Principles of distributed computing, ACM Press, 2002, pp. 223–232.

[3] Miguel Castro, Manuel Costa, and Antony Rowstron, Performance and dependability of structured peer-to-peer overlays, Proceedings of the 2004 International Conference on Dependable Systems and Networks (DSN'04), IEEE Computer Society, 2004.

[4] Supriya Krishnamurthy, Sameh El-Ansary, Erik Aurell, and Seif Haridi, A statistical theory of Chord under churn, The 4th International Workshop on Peer-to-Peer Systems (IPTPS'05) (Ithaca, New York), February 2005.

[5] Jinyang Li, Jeremy Stribling, Thomer M. Gil, Robert Morris, and Frans Kaashoek, Comparing the performance of distributed hash tables under churn, The 3rd International Workshop on Peer-to-Peer Systems (IPTPS'04) (San Diego, CA), Feb 2004.

[6] David Liben-Nowell, Hari Balakrishnan, and David Karger, Analysis of the evolution of peer-to-peer systems, ACM Conf. on Principles of Distributed Computing (PODC) (Monterey, CA), July 2002.

[7] N.G. van Kampen, Stochastic Processes in Physics and Chemistry, North-Holland Publishing Company, 1981, ISBN-0-444-86200-5.

[8] Sean Rhea, Dennis Geels, Timothy Roscoe, and John Kubiatowicz, Handling churn in a DHT, Proceedings of the 2004 USENIX Annual Technical Conference (USENIX '04) (Boston, Massachusetts, USA), June 2004.

[9] Ion Stoica, Robert Morris, David Liben-Nowell, David Karger, M. Frans Kaashoek, Frank Dabek, and Hari Balakrishnan, Chord: A scalable peer-to-peer lookup service for internet applications, IEEE Transactions on Networking 11 (2003).

[10] Shengquan Wang, Dong Xuan, and Wei Zhao, On resilience of structured peer-to-peer systems, GLOBECOM 2003 - IEEE Global Telecommunications Conference, Dec 2003, pp. 3851–3856.


A Our Implementation of Chord

A.1 Joins, Failures & Ring Stabilization

Initialization. Initially, the predecessor p, successors (s_{1..S}) and fingers (fin_{1..M}) are all assigned to nil.

Joins (Fig. 8). A new node n joins by acquiring its successor from an initial random contact node c. It also starts its first stabilization of the successors and initializes its fingers.

Stabilization of Successors (Fig. 8). The function fixSuccessors is triggered periodically with rate αλ_s. A node n tells its first alive successor y that it believes itself to be y's predecessor and expects as an answer y's predecessor y.p and successors y.s. The response of y can lead to three actions:

Case A. Some node exists between n and y (i.e. n's belief is wrong), so n prepends y.p to its successors list as a first successor and retries fixSuccessors.

Case B. y confirms n's belief and informs n of y's old predecessor y.p. Therefore n considers y.p as an alternative/initial predecessor for n. Finally, n reconciles its successors list with y.s.

Case C. y agrees that n is its predecessor and the only task of n is to update its successors list by reconciling it with y.s.

By calling iThinkIamYourPred (Fig. 8), some node x informs n that it believes itself to be n's predecessor. If n's predecessor p is not alive or is nil, then n accepts x as a predecessor and informs x about this agreement by returning x. Alternatively, if n's predecessor p is alive (how this is discovered is explained in section A.3), then there are two possibilities. The first is that x is in the region between n and its current predecessor p; therefore n should accept x as a new predecessor and inform x about its old predecessor. The second is that p is already pointing to x, so the state is correct at both parties and n confirms that to x by informing it that x is the predecessor of n. In all cases the function returns a predecessor and a successors list.

The function firstAliveSuccessor (Fig. 8) iterates through the successors list. In each iteration, if the first successor s_1 is alive, it is returned. Otherwise, the dead successor is dropped from the list and nil is appended to the end of the list. If the first successor is nil, this means that all immediate successors are dead and that the ring is disconnected.
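The behaviour of firstAliveSuccessor described above can be sketched in a few lines. This is an illustrative rendering only: the list-based representation, the use of None for nil and the is_alive callback are assumptions made for the example and are not part of the implementation described in this appendix.

```python
def first_alive_successor(successors, is_alive):
    """Return the first alive successor, pruning dead entries in place.

    successors : list of node ids of fixed length S, padded with None (nil)
    is_alive   : callable taking a node id and returning True or False

    Each dead head entry is dropped and None is appended, so the list keeps
    its length. None is returned when the head is nil (broken ring).
    """
    while successors[0] is not None:
        if is_alive(successors[0]):
            return successors[0]
        successors.pop(0)        # drop the dead first successor
        successors.append(None)  # keep the list at length S
    return None


if __name__ == "__main__":
    succ = [3, 7, 9]
    alive = {7, 9}
    print(first_alive_successor(succ, lambda node: node in alive), succ)
```

Because the list is modified in place, this corresponds to the variant used by fixSuccessors rather than the read-only variant used during lookups.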

A.2 Lookups and Stabilization of Fingers

Stabilization of Fingers (Fig. 9). Stabilization of fingers occurs at a rate (1 − α)λ_s. Each time the fixFingers function is triggered, a random finger fin_i is chosen, a lookup for fin_i.start is performed, and the result is used to update fin_i.node.

Initialization of Fingers (Fig. 9). After having initialized its first successor s_1, a node n sets all fingers with starts between n and s_1 to s_1. The rest of the fingers are initialized by taking a copy of the finger table of s_1 and finding an approximate successor to every finger from that finger table.

    n.join(c)
        s_1 = c.findSuccessor(n)
        fixSuccessors()
        initFingers(s_1)

    n.fixSuccessors()
        y = firstAliveSuccessor()
        {y.p, y.s} = y.iThinkIamYourPred(n)
        if (y.p ∈ ]me, y[)            //Case A
            prepend(y.p)
            fixSuccessors()
        elsif (y.p ∈ ]y, me[)         //Case B
            considerANewPred(y.p)
            reconcile(y.s)
        else                          //Case C: y.p == me
            reconcile(y.s)

    n.firstAliveSuccessor()
        while (true)
            if (s_1 == nil)           //Broken Ring!!
            if (isAlive(s_1))
                return(s_1)
            ∀i ∈ 1..(S − 1): s_i = s_{i+1}
            s_S = nil

    n.iThinkIamYourPred(x)
        if (isNotAlive(p) or (p == nil))
            p = x
            return({s, x})
        if (x ∈ ]p, me[)
            oldp = p
            p = x
            return({s, oldp})
        else
            return({s, p})

    n.considerANewPred(x)
        if (isNotAlive(p) or (p == nil) or (x ∈ ]p, n[))
            p = x

    n.reconcile(s′)
        for i = 1..(S − 1)
            s_{i+1} = s′_i

    n.prepend(y)
        for i = S..2
            s_i = s_{i−1}
        s_1 = y

Figure 8: Joins and Ring Stabilization Algorithms

    n.initFingers(s_1)
        f′ = s_1.f
        ∀i ∈ 1..M s.th. (fin_i.start ∈ ]n, s_1]): fin_i.node = s_1
        ∀j ∈ 1..M s.th. (fin_j.start ∉ ]n, s_1]): fin_j.node = localSuccessor(f′, fin_j.start)

    n.localSuccessor(f, k)
        for i = 1..M
            if (k ∈ ]n, fin_i])
                return(fin_i)
        return(nil)

    n.fixFingers(k)
        1 ≤ i = random() ≤ M
        fin_i.node = findSuccessor(fin_i.start)

Figure 9: Initialization and Stabilization of Fingers

Lookups (Fig. 10). A lookup operation is a fundamental

operation that is used to find the successor of a key. It is used by many other routines and its performance and consistency are the main quantities of interest in the evaluation of any DHT. A node n looking up the successor of k runs the findSuccessor algorithm which can lead to the following cases:


    n.findSuccessor(k)
        //Case A: k is exactly equal to n
        if (k == n)
            return(n)
        //Case B: k is between n and s_1
        if (k ∈ ]n, s_1])
            return(firstAliveSuccessorNoChange())
        //Case C: Forward the lookup to
        //the closest preceding alive finger
        cpf = closestAlivePrecedingFinger(k)
        if (cpf == nil)
            y = firstAliveSuccessorNoChange()
            if (k ∈ ]n, y])
                return(y)
            cpf = closestAlivePrecedingSucc(k)
            return(cpf.findSuccessor(k))
        else
            return(cpf.findSuccessor(k))

    n.firstAliveSuccessorNoChange()
        i = 1
        while (true)
            if (s_i == nil)           //Broken Ring!!
            if (isAlive(s_i))
                return(s_i)
            i++

    n.closestAlivePrecedingFinger(k)
        for i = M..1
            if ((fin_i ∈ ]n, k[) and (fin_i ≠ nil) and isAlive(fin_i))
                return(fin_i)
        return(nil)

    n.closestAlivePrecedingSucc(k)
        for i = S..1
            if ((s_i ∈ ]n, k[) and (s_i ≠ nil) and isAlive(s_i))
                return(s_i)
        return(cpf)

Figure 10: The Lookup Algorithm

Case A. If k is exactly equal to n, then n is the successor of k.

Case B. If k ∈ ]n, s_1] then n has found the successor of k, but it could be that s_1 failed and n did not discover that as yet. However, entries in the successors list can act as backups for the first successor. Therefore, the first alive successor of n is the successor of k. Note that, in this case, while we try to find the first alive successor, we do not change the entries in the successors list. This is mainly because, for the sake of the analysis, we want the successor list to be changed only at rate αλ_s by the fixSuccessors function and not to be affected by any other rate.

Case C. The lookup should be forwarded to a node closer to k, namely the closest alive finger preceding k in n's finger table. The call to the function closestAlivePrecedingFinger returns such a node if possible and the lookup is forwarded to it. However, it could be the case that all fingers preceding k are dead. In that case, we need to use the successors list as a last resort for the lookup. Therefore, we locate the first alive successor y and if k ∈ ]n, y] then y is the successor of k. Otherwise, we locate the closest alive preceding successor to k and forward the lookup to it.
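The interval tests ]n, k[ and ]n, s_1] used by the lookup wrap around the identifier ring, which is easy to get wrong in code. The sketch below shows one way to express them with modular arithmetic, together with the finger scan of closestAlivePrecedingFinger. The representation of the finger table as a plain list of node ids, the is_alive callback and the ring size K passed as a parameter are all assumptions made for the illustration.

```python
def in_open_interval(x, a, b, K):
    """True if x lies in ]a, b[ going clockwise on a ring of K keys."""
    span = (b - a) % K or K          # ]a, a[ covers the whole ring except a
    return 0 < (x - a) % K < span


def in_half_open_interval(x, a, b, K):
    """True if x lies in ]a, b] going clockwise on a ring of K keys."""
    span = (b - a) % K or K
    offset = (x - a) % K or K        # x == a is a full turn away from a
    return offset <= span


def closest_alive_preceding_finger(n, k, fingers, is_alive, K):
    """Scan fin_M down to fin_1 for an alive finger node in ]n, k[."""
    for node in reversed(fingers):
        if node is not None and in_open_interval(node, n, k, K) and is_alive(node):
            return node
    return None


if __name__ == "__main__":
    K = 64
    fingers = [61, 62, 2, 5, 12, 30]   # fin_1 .. fin_6 of node 60 (made-up ids)
    alive = {61, 62, 2, 12, 30}        # node 5 has failed
    print(closest_alive_preceding_finger(60, 10, fingers, lambda v: v in alive, K))
```

In this example the scan skips fingers 30 and 12 because they lie beyond the target key 10, skips 5 because it is dead, and settles on 2.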

A.3 Failures

Throughout the code we use the calls isAlive and isNotAlive. A simple interpretation of those routines would be to equate them to performing a ping. However, a correct implementation is one in which liveness is discovered by performing the operation required. For instance, a call to firstAliveSuccessor in Fig. 8 is performed to retrieve a node y and then call y.iThinkIamYourPred, so alternatively the first alive successor could be discovered by iterating over the successor list and calling iThinkIamYourPred.