Second cycle, 30 credits. Stockholm, Sweden, 2018.

Self-adaptive and hierarchical membership management in distributed system

JIANGFENG DU

KTH ROYAL INSTITUTE OF TECHNOLOGY


Self-adaptive and hierarchical membership management in distributed system

Självadaptivt och hierarkiskt medlemskapshantering i distribuerade system

JIANGFENG DU

Master’s Thesis at KTH Information and Communication Technology
Examiner: Viktoria Fodor

Industrial supervisor: Xuejun Cai

TRITA-EECS-EX-2018:681


Abstract

Cloud computing is widely deployed in industry, enabling customers to save infrastructure acquisition and network deployment costs. To provide better performance for applications driven by cloud computing, management of the resource pool in the cloud needs to be effective. For efficient communication within the resource pool, an overlay of nodes is formed. Membership management maintains the membership lists and relationships for all nodes in the overlay, detects member changes, and exposes the membership list to other management components or upper-layer services. However, most existing solutions only maintain a relatively static overlay that cannot reflect the real-time and dynamic network and system conditions, and thus the performance of the tasks or services that depend on membership management is less than optimal.

To deal with this problem, we propose and implement a self-adaptive and hierarchical membership management system. The structure of the overlay can be changed dynamically according to predefined and real-time cost values between each pair of nodes. A transfer approach is proposed to move a node from one cluster to another and thereby replace a high-cost link with a low-cost link; a merge approach is proposed to decrease the number of clusters with relatively small size and improve the connectivity of the whole overlay network.

Ideally, the resulting overlay network will possess a fully connected and tree-based hierarchical structure with minimum overall cost. The optimized structure could benefit the services running on top, such as resource scheduling and task placement.

The system is evaluated in an emulated environment.

Results from experiments show that the structure can adapt to cost changes and that the overall cost can be reduced when the parameters are set properly. The communication overhead incurred by messages remains low for non-leader nodes, grows with the level for leader nodes, and does not increase significantly with the number of nodes in the system. Node failures can be detected with high accuracy and relatively low latency. Moreover, the whole structure can recover promptly from events such as leader failure.

Sammanfattning

Cloud services are widely deployed in industry, which allows customers to save the costs of infrastructure acquisition and network deployment. To provide better performance for applications driven by cloud services, the management of the resource pool in the cloud must be effective. Membership management maintains the membership list and the relationships between nodes in the resource pool, detects changes in membership, and exposes the membership list to other management components or to services in upper network layers.

However, most existing solutions only maintain a relatively static overlay that cannot reflect the real-time and dynamic network and system conditions, and thus the performance of the tasks or services that depend on membership management becomes less optimal.

To address the problem, we proposed and implemented a self-adaptive and hierarchical system for membership management. The structure of the overlay can be changed dynamically according to predefined and real-time cost values between each pair of nodes. A transfer method is proposed to move a node from one cluster to another and thereby replace a high-cost link with a low-cost link. A cluster-merge method is proposed to reduce the number of clusters of relatively small size and to improve the connectivity of the whole overlay network. Ideally, the resulting overlay network will have a fully connected, tree-based hierarchical structure with minimum overall cost. The optimized structure can benefit the services running in the top network layer, such as resource scheduling and job placement.

The system was evaluated in an emulated environment. The results from the experiments show that the structure can adapt to cost changes and that the overall cost can be reduced when the parameters are set correctly. The communication overhead of membership management remains low for non-leader nodes, grows with the level for leader nodes, and does not increase much with the number of nodes in the system. Node failures can be detected with high accuracy and relatively low latency. Moreover, the whole structure can recover in a timely manner from events such as leader failure.


Contents

List of Tables
List of Figures

1 Introduction
  1.1 Background
  1.2 Problem statement
  1.3 Purpose
  1.4 Goals
  1.5 Scope
  1.6 Methodology
  1.7 Sustainability and ethics
  1.8 Outline

2 Background
  2.1 Ericsson Nefele Compute Architecture
    2.1.1 Background of research question
  2.2 Membership management
    2.2.1 Churn property
    2.2.2 Overlay network
    2.2.3 Failure detector
  2.3 Related literature in membership management
    2.3.1 Scamp
    2.3.2 HiScamp
    2.3.3 Oversenia
    2.3.4 X-BOT
    2.3.5 Serf
  2.4 Data center topology
    2.4.1 ThreeTier
    2.4.2 FatTree

3 Design
  3.1 Rationale
  3.2 Control messages
  3.3 Algorithms
    3.3.1 Parameters and functions
    3.3.2 Join procedure
    3.3.3 Transfer procedure
    3.3.4 Merge procedure
  3.4 Failure detection and leader election

4 Implementation
  4.1 Software architecture
    4.1.1 Data structure
    4.1.2 Communication protocol
    4.1.3 Serialization

5 Evaluation
  5.1 Experiment design
    5.1.1 Topology setup
    5.1.2 Configuration
    5.1.3 Link cost setup
    5.1.4 Experiments process
  5.2 Influence of RADIUS
  5.3 Performance of dynamic adaption
    5.3.1 Overlay and cost optimization
    5.3.2 Cost distribution
    5.3.3 Convergence time
  5.4 Communication overhead
    5.4.1 Relation of processed messages
    5.4.2 Overhead of different roles
    5.4.3 Real-time overhead
  5.5 Failure detection and leader election
    5.5.1 Behavior analysis
    5.5.2 Recovery time
  5.6 Cost based on centroid

6 Discussion
  6.1 Conclusion
  6.2 Future work

Bibliography


List of Tables

5.1 Statistic of measured RTT values in ms
5.2 Detection and recovery time

List of Figures

2.1 Proposed cloud architecture[1]
2.2 Hierarchical structure
2.3 ThreeTier Architecture
2.4 FatTree Architecture
3.1 Sequence diagram of join procedure
3.2 Sequence diagram of transfer procedure
3.3 Sequence diagram of merge procedure
4.1 Structure of Agent system
5.1 Example topology with 30 nodes in IMUNES
5.2 Average cost with RADIUS values of 1.0, 0.6, 0.2
5.3 Experiment of 30 nodes
5.4 Cost changes for 30 nodes
5.5 Experiment of 60 nodes
5.6 Cost changes for 60 nodes
5.7 Experiment of 90 nodes
5.8 Cost changes for 90 nodes
5.9 Box plots of the number of high-cost links before and after adaption in experiments with 30, 60, 90 nodes
5.10 Convergence time in experiments with 30, 60, 90 nodes
5.11 Messages sent, received, and handled by randomly selected nodes with the roles of (from top to bottom) level-3 leader, level-1 leader, and non-leader, in an experiment with 60 nodes
5.12 Histograms of 30 nodes regarding number of messages
5.13 Histograms of 60 nodes regarding number of messages
5.14 Histograms of 90 nodes regarding number of messages
5.15 Send and receive rate of 4 nodes with different roles in an experiment with 90 nodes
5.16 Leader election after node failure
5.17 Leader election after node failure (continued)
5.18 Average cost and total cost with two definitions of cost


Chapter 1

Introduction

1.1 Background

In recent years, cloud computing has developed rapidly and has promoted business model innovation by providing new service consumption and delivery models[2]. According to NIST[3], cloud computing is defined as a model for enabling network access to a shared repository of various computing resources, which can be provisioned with low latency and released with low management cost. Provisioning means placement, deployment, and management of the resources from the repository for each specific task. When a task is completed, the resources occupied by the task are released and returned to the repository. With cloud computing widely deployed, developers can focus on implementing their innovative and entrepreneurial ideas and deploy their services on cloud platforms without having to design and purchase hardware.

The resource pool in the cloud, which is the foundation of cloud computing, consists of data centers located around the world. A data center provides a pool of computing resources interconnected by network infrastructure[4]. Computing resources provided by data centers include storage, servers, applications, and services. Two major components of a data center are the servers and the network infrastructure. Conventional data centers usually accommodate thousands of physical commodity servers that comprise the resource pool, and data center networks enable internal and external communication for the servers.

Management of data centers is essential and helps data centers provide high performance efficiently. Each physical machine in the data center corresponds to a node in the management layer. The management layer itself is responsible for membership of nodes, task placement and migration, monitoring, etc. Membership management includes mechanisms dealing with nodes joining, leaving, and failing.

Placement involves finding a node that meets the requirements of a task and on which the task will be executed. Migration of a task means moving a running task from one node to another due to node failure or maintenance. Monitoring provides administrators with information about the status and utilization of resources in data centers. To achieve these management goals, a large number of messages are exchanged among nodes through the network infrastructure. Therefore, the membership management system has to be highly efficient in terms of both communication latency and the additional overhead imposed on nodes and underlying switches.

1.2 Problem statement

A data center has a physical topology consisting of physical machines and switches (or routers). Usually, membership management maintains a membership list for each node in the data center. The list is denoted as the view of a node. A node considers the nodes in its view as neighbor nodes. All the views from the nodes together establish an overlay over the physical topology, where the vertices are the nodes themselves and the edges are formed by the views of each node.

The structure of an overlay affects the performance of services running on top of it. For example, if a service requires some nodes to exchange a large number of messages, a good overlay structure in this case should reflect the physical topology and place these nodes as neighbors. In this way, communication among these nodes would only affect the switch connecting them and not overwhelm the other nodes in the overlay or the other switches.

Nodes in one data center can also be organized into clusters that reflect proximity among them. In this thesis, a cluster represents a group of nodes. Proximity among nodes in the same cluster means that the nodes hold similar resources or have low-cost links among them. Here, the cost metric can be determined by the data center topology or defined by services running on top of the overlay structure. For example, nodes connected to the same edge switch, or nodes with similar computing resources such as a large memory volume, could be organized into one cluster. An advantage of clustering is that each node, instead of keeping a list of nodes in random locations, only needs to maintain membership information about nodes within its cluster as well as additional information about other existing clusters. Scalability issues that arise from a large number of nodes in distributed systems can be resolved in this way.

The problem studied is to provide a hierarchical membership management system with the ability to adapt its structure according to predefined and real-time parameters. Membership management maintains the membership information and relationships, detects changes of members, and exposes the membership list to other management systems and services on top. The hierarchical structure researched in the thesis is formed by nodes in a recursive and bottom-up way. Each cluster has one leader that takes charge of connecting with other leader nodes and updating the membership information within the cluster. Examples of adjustment behaviors: after a node joins a cluster, the node could leave its current cluster and join another one, perhaps due to a lower link cost in the new cluster; a few clusters with relatively small sizes could merge into one big cluster; isolated clusters that have no connection to the largest component in the overlay network could discover this case autonomously and establish a new link. We assume that each link in the overlay is associated with a link cost. The overlay formed by the designed system is expected to have a low overall cost, defined as the sum of the costs of all links.

1.3 Purpose

In this thesis, we design a membership management system where nodes form an efficient overlay network topology that can also adapt dynamically based on historical data, predefined policies, and real-time parameters. As for services running on top of the structure provided by the system, we consider that the optimized structure can benefit task placement and resource management in a cloud operating system running on data centers. Taking task placement as an example, a node that receives a new task could use its cluster to find the best node to execute the task. If no suitable node is found, the leader could then utilize the hierarchical structure to send the request to other leaders in a higher-level cluster until the task is placed on a node successfully.

1.4 Goals

The high-level goal of the thesis work is to design core algorithms that allow for autonomous adaption of the overlay formed by membership management, to implement the management system as a daemon including modules running the designed algorithms together with modules handling the joining and failing of nodes, and to evaluate the system in an emulated environment.

The specific goals of the thesis are summarized as follows:

• Investigate the state-of-the-art methods of overlay adjustment and node membership management in data centers and summarize their pros and cons.

• Design algorithms that enable self-adaptive features based on related works and the Ericsson Nefele Compute System.

• Discover an efficient way to handle nodes joining, failure detection, and leader election.

• Implement the membership management system as a daemon with a modular architecture.

• The system is expected to possess the following features:

– The overall cost is reduced during each round of adaption.

– Clusters with sizes smaller than a predefined threshold merge into existing clusters.

– Overhead incurred does not grow significantly with the number of nodes.


– Detection of failed nodes and recovery of the overlay should be completed with low latency and high accuracy.

• Set up an emulated environment that allows for evaluation on a realistic physical topology, and define two different ways to implement link cost.

• Evaluate the system in the environment under different link cost metrics, and analyze the performance of the system in terms of dynamic adaption behaviors, communication overhead, failure detection, and recovery.

1.5 Scope

The research on membership management in this thesis focuses on forming a tree-based hierarchical structure and making the structure self-adaptive based on link cost metrics. We assume lossless communication, which means that messages sent arrive intact at their destination nodes. Security issues are not taken into consideration when designing the system.

1.6 Methodology

This thesis applies both quantitative and qualitative research methods. The thesis studies related works in the fields of membership management and overlay management. Qualitative methods are used in the design of the core algorithms. The algorithms adopt ideas from multiple works in the literature and combine them with heuristic design based on reasoning and past experience. Quantitative methods are used in the evaluation phase in order to investigate how the designed system performs under different parameter choices in the emulation environment.

1.7 Sustainability and ethics

According to [5], in 2010 the global electricity consumption by data centers was estimated to be between 1.1% and 1.5% of worldwide electricity usage. Furthermore, the energy costs of a typical data center double every five years[6]. Research on membership management systems would be beneficial to the services running on top of the resulting overlay in terms of reduced latency, reduced overhead, and higher reliability. In terms of sustainability, the proposed system will hopefully help reduce the energy expenditure of data centers. Implementing the system will not pollute the environment or produce harmful waste.

In terms of ethics, the research conducted in the thesis does not violate the privacy of data held by physical machines in data centers. The result of the thesis work is a daemon managing the membership information and the overlay.


1.8 Outline

The rest of this report is organized as follows. Chapter 2 provides related background knowledge, including a summary of prior art and other information needed to understand the proposed solution and its implementation. Chapter 3 explains the adaption algorithms in detail and the motivation for the design. Chapter 4 explains the details of the implementation of the membership management system. Then, the experiment environment setup along with the results and analysis is presented in Chapter 5. Finally, Chapter 6 concludes this report, reviews the thesis objectives, and summarizes the future work.


Chapter 2

Background

In this chapter, we start with a brief introduction to the Ericsson Nefele Compute Architecture as well as the relation between the thesis work and that architecture. Then we introduce basic concepts of membership management and the intrinsic churn property of distributed systems. Since membership management and overlays are closely correlated concepts, the definition and main categories of overlay networks are also explained. Next, we introduce the main techniques proposed in the literature on membership management and overlay management. Finally, we end this chapter with an introduction to data center topology.

2.1 Ericsson Nefele Compute Architecture

The solution proposed in the thesis works as a component in the Ericsson Nefele Compute Architecture[1]. The Nefele Compute service is a “Computation as a Service” platform for hosting cloud-native development and deployment environments. It differs from the paradigm of establishing a platform on top of the operating systems of isolated single servers. Single System Image (SSI) concepts are adopted to simplify the management of cloud execution contexts. Developers access resources in the cloud through the SSI abstraction rather than through multiple levels of execution contexts. With the Nefele Compute service realizing fully decentralized resource management, developers see the entire data center as a single compute resource instead of individual servers, because Nefele makes certain operating system services span the entire data center.

Figure 2.1 illustrates the proposed cloud architecture. Below Nefele Compute, at the base of the stack, is the data center hardware, consisting of servers interconnected with network switches. Nefele Compute in the middle provides distributed “Computation as a Service” across the data center, with Saranyu handling tenant management¹ and an extra module handling data center policy management.

¹ Saranyu[7] is a cloud tenant management system based on smart contracts and blockchain.


Figure 2.1: Proposed cloud architecture[1]

At the top of the stack, the application runtime provides a fast Remote Procedure Call (RPC) service for API calls in the data center, enabling responses to developers’ requests with low latency. The development environments for cloud-native applications allow developers to access services offered by others.

Figure 2.2: Hierarchical structure

In Nefele Compute, the distributed control plane is established upon Nefele Agents, each of which runs on a single server in the data center. Membership management is one of the services realized by Nefele Agents. All nodes together form a hierarchical overlay in a recursive and bottom-up way. These clusters of nodes enable optimized communication paths. Figure 2.2 shows the resulting structure formed by the membership management service with three levels. Newly joining nodes start at level 1, represented by a small green circle, and try to join a cluster at level 1. Each cluster has exactly one leader, marked with an "L" in the figure, which is responsible for managing membership information inside the cluster and for connecting to other level-1 leaders in clusters at level 2. Similarly, leaders of level-2 clusters upgrade to level 3 and try to find a cluster to join.

2.1.1 Background of research question

The thesis aims to add a dynamic adaption function to a hierarchical membership management system according to predefined criteria. We focus on one specific category of topology with a tree-based hierarchical structure, as shown in Figure 2.2. The reason we choose a hierarchical structure with the role of leaders (also referred to as super nodes) is that, compared with a flat structure, a hierarchical structure can reduce link stress on core switches, at the expense of degraded reliability [8]. Besides, the hierarchical structure facilitates the task placement service in Ericsson Nefele Compute. A node receiving a request for task placement can utilize the structure to first contact other nodes in its bottom cluster. If no appropriate node can execute the task, the request can then be forwarded by the cluster leader to other leaders at higher levels. Moreover, the cluster leader at the highest level can also provide membership information of the whole system to services on top.

2.2 Membership management

2.2.1 Churn property

Distributed large-scale systems are faced with frequent membership changes due to the dynamic conditions of nodes and the underlying network. For instance, new nodes join the system; existing nodes leave the system with a notification, or without one in the case of failure; some nodes may leave temporarily and rejoin the system because of maintenance. These changes can happen concurrently, which makes it challenging to maintain and synchronize membership information in large-scale distributed systems. Following [9], we refer to the continuous process of node arrival and departure as churn. Churn is an intrinsic property of distributed systems for two reasons. First, nodes have the freedom to join or leave the system at any time; second, the underlying network in charge of communication is typically unreliable and thus can fail at any time [10].


2.2.2 Overlay network

An overlay network is defined as a logical or virtual network created on top of another network. The category of overlay network discussed in this report is formed at the application level, on top of the IP network. The overlay network is comprised of overlay nodes, which in this case correspond to underlying servers in the data center, and overlay links, which connect a pair of nodes and permit them to communicate with each other directly. Overlay links are also called logical links, as they are often independent of the topology of the underlying network: if a node contacts a neighbor in the overlay network, the message sent by the node may traverse an arbitrary number of routers and switches in the underlying network.

An overlay network reflects the relationship among nodes. Two nodes joined by an overlay link usually tend to communicate with each other directly, whereas nodes without direct links can sometimes only exchange messages with the help of intermediate nodes relaying the message. The main objective of an overlay network is to enable a huge amount of computing resources to be linked and accessed effectively[10].

Overlays and membership management are related concepts because membership management in distributed systems usually maintains an overlay network. A membership protocol provides each member participating in the system with a list of other non-faulty members. The list is usually maintained locally and made available to the application either directly in the address space of the member, or through a callback interface or an API [11]. A membership protocol should ensure that 1) a newly joined node is not isolated and forms at least one overlay link with existing nodes in the system; 2) a failed or leaving node is detected and removed from the overlay network; 3) the whole network remains a fully connected component when some overlay links are destroyed.

Structured and unstructured overlay networks

There are two main categories of overlay networks: structured and unstructured overlays. Although a unified definition cannot be found, the interpretations in different academic papers have some key points in common[12, 13]. A structured overlay usually has a global coordination scheme as well as a topology that is known a priori. An unstructured overlay often has a simpler design and fewer constraints on the location of any specific node and on the resulting overlay topology; therefore, it is more resilient to churn and has a larger degree of freedom.

Structured approaches to building overlays are usually efficient for exact queries. For example, Distributed Hash Tables (DHTs) have been applied to location-independent routing to the closest data or service in Tapestry[14], to a topology-aware routing service in [15], and to database systems with distributed storage such as Dynamo[16] and Cassandra[17].

Unstructured approaches to building overlays usually depend on some sort of flooding to implement search algorithms. Thus, some measures are designed to decrease the cost of query flooding, such as adding super-peers[18, 19].

Overlay network metrics

To evaluate an overlay network formed by different systems, several metrics may be considered. The topology of an overlay network consisting of nodes and links can be viewed as a graph with vertices and edges, and thus analyzed with graph theory[20]. Also, metrics such as the performance of services executed on top can be taken into account.

Connectivity The whole overlay network should be fully connected, which means that at least one path exists between any pair of nodes. If this metric is not met, distributed services running on top of the overlay network will not be able to reach all nodes or hold an integrated view of the system. Isolated nodes will then consider themselves to form a new overlay network.

Average Shortest Path A path between two distinct nodes in a graph is a sequence of edges that connects them; the length of a path is defined as the number of edges that have to be traversed from one node to the other; the shortest path is the path with the smallest length among all paths. Thus the average shortest path of the overlay network is the average length of the shortest paths between all pairs of nodes. This metric is closely related to the latency of message exchange in the overlay network, as it reflects the average number of hops that a message may cross.

Cost We assume that a cost is associated with each edge of the overlay network. The overall cost can be considered as the sum of the costs of all links that form the overlay network and should, in theory, be kept as low as possible. However, in reality it is quite difficult to evaluate the overall cost of a distributed system in real time, as nodes, links, and the cost of each link can change at any time.

The link cost may correlate with the corresponding link latency of the underlying network, in which case the structure of the overlay network reflects the topology of the underlay network. At the same time, the link cost may also reflect the requirements of higher-level applications. For instance, in a resource location system, the cost could be correlated to the similarity of the resources held by a pair of nodes. As a result, nodes providing similar resources form overlay links with higher probability.
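To make the three metrics concrete, the following is a minimal sketch of how they could be computed for a snapshot of an overlay, assuming the overlay is given as a list of (node, node, cost) links and that the networkx library is available; the function and variable names are illustrative and not part of the thesis implementation.

# Sketch: connectivity, average shortest path (in hops), and overall cost of an
# overlay snapshot. Purely illustrative; not the thesis implementation.
import networkx as nx

def overlay_metrics(links):
    g = nx.Graph()
    for a, b, cost in links:
        g.add_edge(a, b, cost=cost)
    connected = nx.is_connected(g)
    # Path length counts edges (hops), matching the definition given above.
    avg_path = nx.average_shortest_path_length(g) if connected else float("inf")
    total_cost = g.size(weight="cost")   # sum of the cost attribute over all links
    return connected, avg_path, total_cost

if __name__ == "__main__":
    links = [("n1", "n2", 0.4), ("n2", "n3", 0.7), ("n3", "n4", 1.2)]
    print(overlay_metrics(links))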

2.2.3 Failure detector

In a traditional heartbeat failure detector algorithm, each group member periodically sends a heartbeat message, usually carrying an increasing counter, to all its neighbors in the network. If no heartbeat message is sent out from one node within a preset time period, other non-faulty members will declare the failure of that node.

A heartbeat failure detector without timeouts is also feasible and was presented in [21]. An eventually non-increasing counter of one node marks its crash. The information about a failed node should then be disseminated in the network to ensure the consistency of membership information. However, the accuracy, efficiency, and scalability of the simple heartbeat algorithm vary and depend on the actual implementation.

We will introduce a protocol with better performance. SWIM[11] is a scalable, weakly-consistent, infection-style process group membership protocol. The failure detection algorithm chooses random members to which a ping message is sent. If no acknowledgement message is received within a period of time, a failure message is sent out. Since the underlying IP network is usually unreliable, a member that keeps losing messages is indistinguishable from one that has crashed[22]. To improve the accuracy of failure detection, SWIM introduces an indirect ping strategy. Instead of sending a failure message directly after a timeout, a node chooses k random members to which it sends indirect ping request messages. Members that receive a ping request ping the original target member and send back an ack message to the requesting node if an ack is received from the target.

To reduce the detection time, SWIM introduces a round-robin probe target selection mechanism. A member does not choose ping targets randomly each round from its local membership list but in a round-robin fashion. After completing a traversal of the whole list, the member shuffles the list randomly.
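The following is a simplified sketch of one probe round in the SWIM style described above: round-robin target selection, a direct ping, and then k indirect pings through random helpers before a node is considered failed. The transport calls ping and ping_req are stand-ins supplied by the caller, and the class is an illustration rather than the SWIM implementation.

# Simplified sketch of one SWIM-style probe round. ping()/ping_req() are
# caller-supplied stand-ins for the real message transport.
import random

class FailureDetector:
    def __init__(self, members, k=3):
        self.members = list(members)      # non-empty list of other members
        random.shuffle(self.members)
        self.next_index = 0
        self.k = k

    def next_target(self):
        # Round-robin traversal; reshuffle after a full pass over the list.
        if self.next_index >= len(self.members):
            random.shuffle(self.members)
            self.next_index = 0
        target = self.members[self.next_index]
        self.next_index += 1
        return target

    def probe(self, ping, ping_req):
        target = self.next_target()
        if ping(target):
            return (target, True)
        helpers = random.sample([m for m in self.members if m != target],
                                min(self.k, len(self.members) - 1))
        # Any helper that still reaches the target rescues it from being declared failed.
        alive = any(ping_req(helper, target) for helper in helpers)
        return (target, alive)

# Example: simulate a transport where node "n3" is unreachable; after three
# probe rounds every member has been checked once.
fd = FailureDetector(["n1", "n2", "n3"], k=2)
for _ in range(3):
    print(fd.probe(ping=lambda t: t != "n3", ping_req=lambda helper, t: t != "n3"))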

2.3 Related literature in membership management

2.3.1 Scamp

Scamp [23] is a scalable peer-to-peer membership protocol which operates in a completely decentralized manner. Each node is provided with a sample of the entire system membership. The size of the partial views converges to (c + 1)log(n) on average, where n denotes the number of nodes in the system and c is a design parameter that determines the proportion of failures tolerated.

A new node joins the system by contacting an arbitrary pre-existing member and sending a subscription request. The contact node first forwards the subscription to all nodes in its view and then forwards c additional copies to nodes randomly chosen from its partial view. A node receiving a forwarded subscription integrates it into its view with a probability inversely proportional to the size of its view at that moment. If the receiving node does not keep it, the subscription continues to be forwarded to a random node in the view of the receiving node.

An unsubscription mechanism is used to keep the size of the partial views: a leaving node asks some of the nodes that gossip to it to replace its node id with one from its own partial view, and simply asks the other nodes gossiping to it to delete its node id. The analysis shows that Scamp is almost as resilient to failures as a membership protocol relying on a global view of the membership at each node.
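As an illustration of the integration rule, the sketch below handles a forwarded subscription in the way described above, keeping the new id with a probability inversely proportional to the current view size (1/(1 + |view|) is used here as one concrete choice); forward is a caller-supplied send function and all names are illustrative.

# Sketch of Scamp-style handling of a forwarded subscription. Illustrative only.
import random

def handle_subscription(view, new_id, forward):
    """view: list of known node ids; forward: callable(next_hop, new_id)."""
    keep_probability = 1.0 / (1 + len(view))
    if new_id not in view and random.random() < keep_probability:
        view.append(new_id)                 # integrate into the local partial view
        return True
    if view:                                # otherwise pass it on to a random neighbor
        forward(random.choice(view), new_id)
    return False

view = ["n1", "n2"]
handle_subscription(view, "n9", forward=lambda hop, nid: print("forward", nid, "to", hop))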

2.3.2 HiScamp

HiScamp [8] is a peer-to-peer hierarchical membership protocol which depends strongly on some agreed measure of distance. For instance, the distance could be the round-trip time or the number of hops between nodes. It is proposed to relieve stress on core links, at the expense of a negative impact on latency and reliability.

The protocol organizes nodes into clusters and forms different levels according to a proximity measure between nodes. Each cluster represents an abstract single node at the upper level and implements the Scamp protocol to maintain membership within the cluster. Links that connect distinct clusters are usually set up between the nodes that initiated each cluster. In this way, inter-cluster traffic is reduced, but the probability of new isolated clusters is higher.

2.3.3 Oversenia

Oversenia[24] is a resilient overlay network with inter-connected replicated super-peers, constructed by the proposed protocol. The target size of each cluster is the most important parameter of this protocol. It builds an overlay network with three characteristics: each peer belongs to a cluster; within the cluster, each peer knows the whole membership information and maintains a link to each of its neighbors; each peer maintains additional links to other clusters.

In the join procedure, the message from a newcomer is sent to the contact node and then forwarded using a limited-length random walk until the walk ends or the message reaches a cluster whose size is smaller than the target size. Clusters whose size is larger than a predefined maximum value trigger the division procedure and break into two smaller clusters; clusters whose size is smaller than a predefined minimum value disband gracefully with a collapse procedure.

The protocol provides an innovative way to build overlay networks according to a target cluster size. However, the underlying topology and the cost of each adjustment procedure are not considered. Moreover, in the use case of query routing that the protocol is designed for, super-peers can be easily overloaded with the work of routing and processing each query.

2.3.4 X-BOT

X-BOT[25] is a protocol that enables unstructured overlay networks to adapt to a resilient and optimized state according to target criteria. The goal is to reduce the overall overlay cost, which is the sum of the costs of all links that form the overlay network. The cost of a link can be based on a network metric and/or a higher-level metric.

Each node in the system maintains a small Active View, maintained with a reactive strategy and sorted by link cost, and a much larger, unbiased Passive View, maintained with a cyclic strategy. TCP connections are established between a node and each of its neighbors and act as an unreliable failure detector.


A parameter pi denotes the maximum number of optimization rounds in each optimization procedure. In one adaptation round, the initiating node attempts to replace an old link that connects to a member of its active view with a new, lower-cost link that connects to a member of its passive view. Only four nodes are involved in each round, and the degree of the nodes remains constant. As a result, the optimized unstructured overlay network should keep relevant properties of the original overlay, such as randomness, a low clustering coefficient, and the in-degree distribution.

A peer-to-peer architecture for an efficient resource location service[26] uses X-BOT to adapt its unstructured overlay network. A structured layer is then constructed by a fraction of nodes which are regional contacts for a particular resource category c. A distributed hash table, which contains all information about the regional contacts, helps route a query to the desired region of the unstructured topology.

2.3.5 Serf

Serf[27] is a tool that provides solutions for three major problems: cluster membership, failure detection, and orchestration. Serf is decentralized, fault-tolerant, and highly available, and it runs on major platforms such as Windows, Linux, and Mac OS X. It incurs extremely low overhead of 5 to 10 MB of resident memory. The communication is primarily based on infrequent UDP messages.

Serf uses a gossip protocol based on [11] to broadcast messages to all nodes. In addition, a network tomography system that computes network coordinates for nodes is deployed in Serf. Coordinates are calculated by appending data to the probe messages that are part of Serf’s gossip protocol. In this way, the system incurs little overhead and scales well. The RTT between a pair of nodes is estimated with a formula mainly based on the Euclidean distance between their coordinates. The formula, which can be found in the Serf documentation, is out of the scope of the thesis. Serf also provides protection against security attacks through a symmetric-key cryptosystem, which enables Serf to run over untrusted networks.

From the study of related works, several ideas are adopted in the designed membership management system. As in HiScamp, a hierarchical structure is formed and decisions about the behaviors of the system are based on some agreed measure of distance. Our design uses a hierarchical structure with one leader in each cluster. Each link in the overlay is assumed to have a link cost. For dynamic adaption, we propose a merge procedure based on the idea of the division and collapse procedures in Oversenia, and a transfer procedure based on X-BOT's idea of replacing one link in each round to reduce the overall overlay cost. These two procedures are explained in detail in the next chapter. In our evaluation, we use the RTT between a pair of nodes and the network coordinates of each node provided by Serf.


2.4 Data center topology

Our designed membership management system takes the underlying topology into consideration and reflects it in the resulting structure of the generated overlay. The topologies summarized here are also used to evaluate the designed solution.

A good overlay network structure maintained by membership management should make the protocols or services running on top impose less load on switches, especially on switches located at critical positions. For instance, switches at critical positions could be those at the core layer of the data center topology or those that a large number of shortest paths traverse. Data center architectures can be classified into two major categories: switch-centric and server-centric networks[28]; the former rely on switches and routers to forward and route packets through the network, while the latter rely on physical servers to take the responsibility of routing. Here we focus on switch-centric networks and introduce ThreeTier in detail and FatTree briefly.

2.4.1 ThreeTier

The most commonly deployed data center topology today is the ThreeTier architecture[29]. It consists of a three-tier hierarchy of routers and switches and a single layer of physical commodity servers[28]. Usually, the physical servers are clustered in racks and connect to a switch. This kind of switch is called a top-of-rack switch and is typically linked to around 40 servers[30]. All top-of-rack switches comprise the first layer, which is called the access layer. A number of access-layer switches then connect to the switches located at the second layer, the aggregate layer. The top layer, namely the core layer, consists of powerful switches, each of which is connected to all aggregate-layer switches in the data center.

Figure 2.3: ThreeTier Architecture


2.4.2 FatTree

The FatTree topology is a tree-based hierarchy of network switches. Servers at the bottom layer, along with the switches at the first and second layers, are deployed in separate pods. The number of pods is given by a parameter k, which also controls the number of switches at each layer. The core layer accommodates (k/2)² switches; in each pod, the aggregate and access layers each accommodate k/2 switches, each of which connects to k/2 switches at the next higher layer. As a result, one pod contains k switches and (k/2)² servers. Compared to ThreeTier, the FatTree architecture performs better in terms of throughput, scalability, and energy efficiency[31]. Moreover, a low-cost custom addressing and two-level routing scheme has been proposed specifically for this architecture[32].

Figure 2.4: FatTree Architecture
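The sizing rules above can be summarized in a few lines of code; the sketch below simply evaluates them for a given (even) pod parameter k and is purely illustrative arithmetic.

# Sketch of the FatTree sizing formulas from the text, for an even pod parameter k.
def fattree_sizes(k):
    core = (k // 2) ** 2              # core switches
    aggregate_per_pod = k // 2
    access_per_pod = k // 2
    servers_per_pod = (k // 2) ** 2
    return {
        "pods": k,
        "core_switches": core,
        "switches_per_pod": aggregate_per_pod + access_per_pod,
        "servers_per_pod": servers_per_pod,
        "servers_total": k * servers_per_pod,
    }

print(fattree_sizes(4))   # k = 4: 4 core switches, 4 switches and 4 servers per pod, 16 servers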


Chapter 3

Design

In this chapter, we first explain the idea of the proposed solutions that deal with membership changes and dynamic adaption. Then all messages used in communication are defined and explained. Definitions of important parameters and functions are also described, and the core algorithms are presented in pseudo-code. Finally, we describe how nodes detect failures and how nodes elect new leaders after the failure of leader nodes.

3.1 Rationale

The main goals of the designed system are to manage the membership and to establish an overlay network that possesses the following features: (1) each node builds at least one link to other nodes; (2) each node belongs to at least one cluster and knows of the existence of the other nodes in the same cluster; nodes tend to disseminate information within their cluster first; (3) nodes build a hierarchical overlay with a tree topology from bottom to top, as shown in Figure 2.2; (4) nodes can dynamically change the overlay based on real-time parameters and predefined requirements.

In principle, the number of levels in the hierarchical structure could be unlimited. However, since a level-3 leader is at the same time also the leader of a level-2 and a level-1 cluster, the cost of recovering from the failure of a leader increases with the level the leader is on. As a result, in this thesis the highest level in the system is set to 3.

To handle the basic tasks of membership management, namely nodes joining and nodes leaving or failing, we propose a join procedure for newcomers to discover a cluster to join or to become leader nodes. For node leaving and failure, we use the indirect ping and round-robin target strategy from SWIM [11] within each cluster to detect node failures, and a simple strategy to elect a new leader if a leader fails.

The structure of the overlay network should reflect proximity among nodes in the same cluster based on cost metrics and adjust dynamically to changes in cost.

For example, if the real-time RTT is taken as the link cost, then when the conditions of the underlying network change, nodes that previously belonged to the same cluster may become unsuitable to stay in that cluster as the cost increases. In the design, all nodes with the role of leader are responsible for reorganizing the structure. Therefore, we propose a transfer procedure and a merge procedure based mainly on cost metrics and the cooperation of all leader nodes. Ordinary nodes do not participate in monitoring the conditions for dynamic adjustment, but follow the leader nodes' instructions to complete a movement from one cluster to another. The essence of a successful transfer is to reduce the overall cost by replacing an existing high-cost link with a new low-cost link. The purpose of the merge procedure is to reduce the number of clusters with relatively small sizes and to eliminate isolated components in the overlay network.

3.2 Control messages

Nodes communicate with each other through the exchange of messages. Message types and fields in each type of message are predefined for all nodes in the system.

A message can be sent either to a group of nodes by broadcast or to a single node by unicast. Messages received by a node are handled depending on the type of the message and the role of the node. Certain types of messages can only be handled by the leader of a cluster and are discarded by non-leader nodes.

All messages defined in the system are explained in the following list; a sketch of how such messages could be represented and dispatched is given after the list:

• discover: A newly-joined node or a leader upgrading to a higher level broadcasts a discover message to a specific channel together with its current level. This type of message can only be handled by leaders, which decide whether to respond to the sender with an offer message.

• offer: This message is sent by a leader if the leader is willing to accept a new node into the cluster it is responsible for. A node in the discovery state can receive multiple offer messages and accept any one of them.

• accept: This message is sent by a node if it would like to accept an offer and join the cluster.

• update: This message is broadcast by the leader to all members of a cluster when any membership change happens, for instance a member's join, leave, or failure. Members that receive the message retrieve the information from the message fields and update their local state in order to keep the membership information inside the cluster consistent.

• transferreq, transferack, transfercmd: These three types are used in the procedure of transferring one node from its original cluster to a new cluster. The names are short for transfer request, transfer acknowledgement, and transfer command.

• ping, pingack, pingreq: These three types are used in the failure detection module. The last two names are short for ping acknowledgement and ping request.


• mergereq, mergeack, mergeaccept, mergecancel: These four types are used in the procedure by which a small cluster merges into another, relatively large cluster. The first two names are short for merge request and merge acknowledgement.
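To illustrate how such control messages might be represented and dispatched, the sketch below uses a small dataclass and a handler table, and drops leader-only message types on non-leader nodes as described above; the particular set of leader-only types, the field layout, and all names are illustrative assumptions rather than the exact implementation.

# Sketch: a simple message representation and dispatch. Illustrative only.
from dataclasses import dataclass, field

LEADER_ONLY = {"discover", "transferreq", "transferack", "mergereq", "mergeack"}  # illustrative subset

@dataclass
class Message:
    mtype: str                       # e.g. "offer", "accept", "update", "ping"
    sender: str
    fields: dict = field(default_factory=dict)

def dispatch(message, node_is_leader, handlers):
    """handlers: mapping from message type to a callable taking the message."""
    if message.mtype in LEADER_ONLY and not node_is_leader:
        return None                  # non-leaders silently discard leader-only types
    handler = handlers.get(message.mtype)
    return handler(message) if handler else None

# Example: a non-leader node discards a discover message but handles an offer.
handlers = {"offer": lambda m: f"offer from {m.sender}"}
print(dispatch(Message("discover", "n7", {"level": 1}), False, handlers))   # None
print(dispatch(Message("offer", "n2", {"group": "g1"}), False, handlers))   # offer from n2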

3.3 Algorithms

In this section, we start with definitions of the parameters that play an important role in controlling the behavior of the membership management system, followed by the functions used in the algorithms. The three core algorithms of the system are then defined.

3.3.1 Parameters and functions

The behavior of the designed system is mainly controlled by the following parameters:

• RADIUS: the target cost threshold inside a cluster. If the cost between two nodes is larger than this value, adjustment behaviors such as the transfer of one node from its original cluster to another cluster will be triggered.

• MAX_GROUP_SIZE: an array of the maximum cluster size at each level. If the current number of nodes in a cluster equals or exceeds this value, the leader of the cluster stops accepting newly-joined nodes.

• UPGRADE_GROUP_SIZE: an array of the target upgrade size of clusters at each level. If the current number of nodes in a cluster reaches this value, the leader of the cluster upgrades to a higher level and starts the discovery process.

• TRANSFER_GROUP_SIZE: the minimum size for starting the transfer procedure at level 1. When the size of a level-1 cluster reaches this value, its leader starts checking the cost and may trigger the transfer procedure.

• MERGE_MAX_SIZE: the maximum size above which the merge procedure is no longer triggered.

• MERGE_INTERVAL: the time interval between merge request messages sent by a leader.

We also explain some functions that are used in the algorithms; a configuration sketch grouping these parameters and functions is given after the list:

• time(now): the function returns the current timestamp.

• get_cost(id1, id2): the function returns the cost value between two nodes with identifier id1 and id2.


• Send(message_type, parameter1, parameter2, ..., destination): the function sends the specified type of message with the parameters filled into the fields of the message. When the destination is the identifier of a node, the message is sent to that node by unicast. When the destination is the identifier of a group, the message is broadcast to all nodes in that group.
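The sketch below groups the parameters above into a single configuration object and shows an RTT-based get_cost, assuming a table of measured RTT values is available; all concrete values and names are illustrative, not the thesis configuration.

# Sketch: grouping the control parameters and an RTT-based get_cost(). Illustrative only.
from dataclasses import dataclass

@dataclass
class Config:
    radius: float = 0.6                       # RADIUS: intra-cluster cost threshold
    max_group_size: tuple = (10, 8, 6)        # MAX_GROUP_SIZE per level (example values)
    upgrade_group_size: tuple = (4, 4, 4)     # UPGRADE_GROUP_SIZE per level (example values)
    transfer_group_size: int = 3              # TRANSFER_GROUP_SIZE at level 1
    merge_max_size: int = 3                   # MERGE_MAX_SIZE
    merge_interval: float = 30.0              # MERGE_INTERVAL in seconds

rtt_table = {("n1", "n2"): 0.4, ("n1", "n3"): 1.3}   # hypothetical measured RTT values

def get_cost(id1, id2):
    """Cost of the link between id1 and id2; here simply the measured RTT."""
    if (id1, id2) in rtt_table:
        return rtt_table[(id1, id2)]
    return rtt_table.get((id2, id1), float("inf"))

cfg = Config()
print(get_cost("n1", "n3") > cfg.radius)    # True: this link would trigger a transfer check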

3.3.2 Join procedure

The parameters and functions used in the join procedure algorithm are explained as follows:

• T_lastoffer: the time when the last offer message was received.

• T_offer_timeout: a timeout value that controls whether the saved offer will be accepted.

• state: initialized to 0. After the first offer message is received, the value of state is changed to 1.

Figure 3.1: Sequence diagram of join procedure

When a new node joins the system, it creates a socket and binds the socket to a TIPC address with a common type used for discovery and a port number calculated from its own IP address. The newcomer starts the discovery process by broadcasting a discover message to the common type; its initial state value is 0. State value 0 means that it has not received any offer message since it joined the system.


Algorithm 1 Join Procedure
 1: every ∆T do
 2:   if time(now) - T_lastoffer > T_offer_timeout and state == 1 then
 3:     trigger Send(accept, leader, group)
 4:     return
 5:   if time(now) - discovery_start > T_leader_timeout then
 6:     trigger become_leader()
 7:     return
 8:   trigger Send(discover, id, level)
 9: upon event Receive(offer, leader, group) do
10:   if state == 0 then
11:     state ← 1
12:     cost ← get_cost(id, leader)
13:     T_lastoffer ← time(now)
14:     save_offer(leader, group, cost)
15:     if cost <= RADIUS then
16:       trigger Send(accept, leader, group)
17:   if state == 1 then
18:     cost ← get_cost(id, leader)
19:     if cost <= RADIUS then
20:       trigger Send(accept, leader, group)
21:     else if cost < previous_cost then
22:       save_offer(leader, group, cost)
23:     else
24:       pass

State value 1 means that it has already received at least one offer message and has saved the best offer so far.

When the node receives the first offer message from an existing leader node, it first extracts and keeps the information in the message fields and changes its state value to 1. It then calculates the cost value between itself and the leader that provided the offer. If the value is smaller than the predefined parameter RADIUS, it immediately sends an accept message to the leader using the information kept before.

If the cost is larger than RADIUS, it chooses to wait for other offers and picks the best one according to the criteria.

Every ∆T seconds, the node checks whether the elapsed time exceeds two thresholds. If it is in state 1 and T_offer_timeout has expired, it accepts the best offer it has saved. If T_leader_timeout has expired, it elects itself as the leader of a new cluster.
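The following sketch mirrors the joining node's behavior in plain Python: offers with a cost at most RADIUS are accepted immediately, otherwise the best offer is remembered and accepted once T_offer_timeout expires, and the node elects itself leader once T_leader_timeout expires without any offer. The send callback and all names are illustrative stand-ins for the real messaging layer, not the thesis daemon.

# Sketch of the joining node's offer handling and periodic timeout check. Illustrative only.
import time

class JoiningNode:
    def __init__(self, radius, offer_timeout, leader_timeout, send):
        self.radius = radius
        self.offer_timeout = offer_timeout
        self.leader_timeout = leader_timeout
        self.send = send                       # callable(message_type, **fields)
        self.discovery_start = time.time()
        self.state = 0                         # 0: no offer yet, 1: an offer is saved
        self.best_offer = None                 # (cost, leader, group)
        self.last_offer_time = None

    def on_offer(self, leader, group, cost):
        if cost <= self.radius:                # good enough: accept immediately
            self.send("accept", leader=leader, group=group)
            return
        if self.best_offer is None or cost < self.best_offer[0]:
            self.best_offer = (cost, leader, group)
        self.state = 1
        self.last_offer_time = time.time()

    def tick(self):
        # Periodic check, corresponding to the "every dT" loop of Algorithm 1;
        # a real daemon would leave the discovery loop after accepting or leading.
        now = time.time()
        if self.state == 1 and now - self.last_offer_time > self.offer_timeout:
            _, leader, group = self.best_offer
            self.send("accept", leader=leader, group=group)
        elif now - self.discovery_start > self.leader_timeout:
            print("no suitable offer received; becoming leader of a new cluster")
        else:
            self.send("discover")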


3.3.3 Transfer procedure

The parameters used in the transfer procedure algorithm are explained as follows:

• S_trans: the TRANSFER_GROUP_SIZE parameter.

• in_transfer_state: becomes true when the leader initiates a transfer procedure by broadcasting a transferreq message, and becomes false when the leader sends a transfercmd message. It ensures that the leader only accepts the first transferack message and ignores messages that arrive later.

• group_n: the identifier of the cluster at level n that the node belongs to.

• node.leader: a flag reflecting whether the node has a leader role in any cluster.

Figure 3.2: Sequence diagram of transfer procedure

This procedure runs only on nodes with the role of leader that have already joined a group at level 2. In each round, the leader calculates the cost between itself and each of the group members and saves the maximum cost value together with the id of the corresponding member. The leader then sends a transferreq message to the level-2 group it belongs to and sets in_transfer_state. The fields in the message include cost, which is the maximum cost value, tid, which is the id of the member being transferred, and group, which is the id of the current group.


Algorithm 2 Transfer Procedure
 1: every ∆T do
 2:   if len(member_list) > S_trans and len(group_list) > 2 then
 3:     cost, tid ← get_max_cost()
 4:     if cost > RADIUS and not in_transfer_state then
 5:       in_transfer_state ← True
 6:       trigger Send(transferreq, cost, tid, group_2)
 7:       if group_3 != None then
 8:         trigger Send(transferreq, cost, tid, group_3)
 9: upon event Receive(transferreq, cost, tid, group) do
10:   newcost ← get_cost(id, tid)
11:   if newcost < cost and newcost < RADIUS then
12:     trigger Send(transferack, mycost, tid, leader_id)
13: upon event Receive(transferack, mycost, tid, leader_id) do
14:   if in_transfer_state then
15:     trigger Send(transfercmd, tid, j_group, leader_id)
16:     in_transfer_state ← False
17: upon event Receive(transfercmd, tid, j_group, leader_id) do
18:   current_leader_id ← group_1.leader
19:   trigger Send(leave, tid, current_leader_id)
20:   group_list ← group_list \ group
21:   trigger Send(join, tid, j_group, leader_id)
22:   group_list ← group_list ∪ j_group
23: upon event Receive(leave, tid, current_leader_id) do
24:   member_list ← member_list \ tid
25: upon event Receive(join, tid, j_group, leader_id) do
26:   member_list ← member_list ∪ tid



The other leaders in the level-2 group receive the message and then decode and handle it. First, the cost between the receiving leader and the node with identifier tid is calculated and compared with the cost in the message as well as with the parameter RADIUS. If the leader has a lower cost value that satisfies both conditions, it sends back a transferack message expressing its willingness to accept the transferred node.

When handling the transferack message, if the initiating leader is still in in_transfer_state, it accepts the offer that arrives first for the migration of the target node.

To start the migration, the leader sends a transfercmd message to the target node and sets in_transfer_state to false. The node that receives the message then knows that it will be transferred from the current group to another group. A leave message is sent to the current leader and a join message is sent to the new leader. Both messages trigger updates of the membership information inside the two groups. The transferred node discards all information about the previous group and receives membership information about the new group in an update message from the new leader. At this point, the transfer of a single node is complete.

This procedure is designed to give nodes that belong to a common cluster a low link cost to their leader. In the resulting overlay, in a stable state, the cost of each link between a leader and each member of its level-1 group is smaller than the predefined parameter RADIUS. Thus, the cost of the link between any two members of the group can also be kept relatively small.
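On the initiating leader's side, the trigger condition can be summarized as in the sketch below: find the member with the highest link cost to the leader and, if that cost exceeds RADIUS, announce it to the level-2 group with a transferreq. The get_cost and send callables are caller-supplied stand-ins, and the function is illustrative only.

# Sketch of the leader-side check that starts a transfer round. Illustrative only.
def check_transfer(leader_id, member_list, get_cost, radius, send, group_2):
    costs = {m: get_cost(leader_id, m) for m in member_list if m != leader_id}
    if not costs:
        return None
    tid, max_cost = max(costs.items(), key=lambda item: item[1])
    if max_cost > radius:
        # Other leaders in group_2 answer with transferack if they can host tid more cheaply.
        send("transferreq", cost=max_cost, tid=tid, destination=group_2)
        return tid
    return None

# Example with a toy cost table: n3 is the most expensive member and exceeds RADIUS.
rtt = {("L1", "n2"): 0.3, ("L1", "n3"): 1.4}
print(check_transfer("L1", ["n2", "n3"], lambda a, b: rtt[(a, b)], 0.6,
                     lambda *args, **kwargs: None, "group-2"))    # -> n3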

3.3.4 Merge procedure

The parameters and functions used in the merge procedure algorithm are explained as follows:

• T_lastcheck: the timestamp of the last check of the conditions for triggering merge behaviors.

• T_merge_timeout: a timeout value that controls when the leader checks the conditions.

• level: The largest level of all clusters the node is in.

• leader_level: The largest level of all clusters the node leads.

• mlist: a list that saves the identifiers of nodes to which the leader has sent a mergeack and from which it has not yet received a response.

• mstate: The default value is 0. The value is 1 when mlist is not empty, indicating that the leader cannot initiate a merge procedure. The value is 2 when a mergeaccept from a leader has arrived and a join from that leader is on the way; leaders in this state will not send mergeack messages, in order to prevent too many nodes from merging into the cluster simultaneously (a small sketch of these states follows this list).
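The sketch below names the three mstate values and their meaning. It is purely illustrative; the implementation stores the plain integers 0, 1 and 2.

MSTATE_IDLE     = 0  # may initiate a merge and may answer mergereq messages
MSTATE_OFFERED  = 1  # sent mergeack to the nodes in mlist, waiting for their join messages
MSTATE_ACCEPTED = 2  # a mergeaccept arrived and a join from that leader is on the way;
                     # in this state the leader does not answer with mergeack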

Figure 3.3: Sequence diagram of merge procedure

This procedure is also run only on nodes with the leader role. The purpose is to decrease the number of small clusters and increase the connectivity of the whole overlay network. Two situations are handled by the algorithm: 1) Small clusters stay on the bottom level, cannot upgrade to a higher level due to a lack of nodes, and thus remain isolated from the main component. They should merge into other clusters on the same level, and the leader of the original cluster becomes an ordinary node.

2) Small clusters on level 2 cannot upgrade to level 3 due to their small size and are isolated from the largest component. The leader should upgrade and join a cluster at level 3.

Every Tmerge_timeout seconds, the leader checks the conditions for the two situations. If the size of its level-1 cluster is smaller than Merge_Max_Size and mstate is 0, the leader broadcasts a mergereq to all nodes in the system. If the leader is on level 2 and leads a level-2 cluster, it upgrades and triggers discovery at level 3.
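A compact Python sketch of this periodic check follows. The attribute and helper names (t_lastcheck, t_merge_timeout, broadcast(), discover()) are assumptions for illustration only, and MERGE_MAX_SIZE stands for the Merge_Max_Size parameter.

import time

MERGE_MAX_SIZE = 3  # illustrative value for Merge_Max_Size

def periodic_merge_check(leader):
    if time.time() - leader.t_lastcheck <= leader.t_merge_timeout:
        return
    size = len(leader.group_1.member_list)
    if size < MERGE_MAX_SIZE and leader.mstate == 0:
        # Situation 1: a small, isolated level-1 cluster advertises itself for merging.
        leader.broadcast("mergereq", leader.id, leader.group_1.gid, size)
    if leader.level == 2 and leader.leader_level == 2:
        # Situation 2: the leader of a small level-2 cluster upgrades and looks for a level-3 cluster.
        leader.discover(level=3)
    leader.t_lastcheck = time.time()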

A leader that receives a mergereq message responds with a mergeack if the size of its level-1 cluster is larger than the size of the cluster in the merge request and its mstate is 0. At the same time, its mstate changes to 1 and the sender of the mergereq is added to mlist.

For leaders that receive mergeack messages (lines 16-26), the handling is as follows.


Algorithm 3 Merge Procedure

1: every ∆T do

2: size ← len(group_1.member_list)

3: if time(now) - Tlastcheck > Tmerge_timeout then

4: if size<Merge_Max_Size and mstate==0 then

5: trigger Send(mergereq, id, group, size)

6: if level == 2 and leader_level == 2 then

7: trigger Discover at level 3

8: Tlastcheck ← time(now)

9: upon event Receive(mergereq, id, group, size) do

10: mysize ← len(group_1.member_list)

11: if mysize<size or mstate == 2 then

12: return

13: mstate ← 1

14: trigger Send(mergeack, id, gid, mysize)

15: mlist ← mlist ∪ id

16: upon event Receive(mergeack, id, gid, size) do

17: j_group ← gid

18: if not leader or len(group_list) == 0 then

19: trigger Send(mergecancel, id)

20: return

21: trigger Send(mergeaccept, id, group)

22: for member in group_1.memberlist do

23: trigger Send(transfercmd, member.id, j_group, group.leader)

24: group_list ← group_list \ group_1

25: group_list ← group_list ∪ j_group

26: trigger Send(join, id, j_group)

27: upon event Receive(mergeaccept, id, group, mysize) do

28: mstate ← 2

29: upon event Receive(mergecancel, id) do

30: mlist ← mlist \ id

31: if len(mlist) == 0 then

32: mstate ← 0

33: upon event Receive(join, id, j_group) do

34: if id in mlist then

35: mlist ← mlist \ id

36: if len(mlist) == 0 then

37: mstate ← 0

38: else



If they have already completed a merge procedure and are no longer leaders, they respond with a mergecancel message. Otherwise, they simply accept the first mergeack message that arrives.

First, they send a mergeaccept to the leader of the target cluster. Second, they send transfercmd messages to every member in their cluster. From the point of view of a member, there is no difference between a merge and a transfer: in both cases the member obeys the leader's order and joins the target cluster. Third, they change the local membership information by removing the original cluster identifier and appending the target cluster identifier. Finally, they send a join message to finish the merge procedure.
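These steps correspond roughly to the following Python sketch of lines 16-26 of Algorithm 3. The message helpers and attribute names are assumptions, not the actual implementation.

def on_mergeack(leader, sender_id, gid, sender_size):
    j_group = gid
    if not leader.is_leader or len(leader.group_list) == 0:
        # This cluster has already merged elsewhere; decline the offer.
        leader.send("mergecancel", leader.id, sender_id)
        return
    leader.send("mergeaccept", leader.id, leader.group_1.gid, sender_id)
    for member in leader.group_1.member_list:
        # Members see an ordinary transfercmd; merge and transfer look the same to them.
        leader.send("transfercmd", member.id, j_group, sender_id)
    leader.group_list.remove(leader.group_1)        # forget the original cluster identifier
    leader.group_list.append(j_group)               # record the target cluster identifier
    leader.send("join", leader.id, j_group, sender_id)  # finish the merge at the target leader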

Lines 27-39 update mlist and mstate according to the type of message received. The main purpose of mstate is to avoid the situation where a cluster wants to merge into a target cluster but finds that the target cluster has disappeared because it has merged into another cluster.

3.4 Failure detection and leader election

To detect node failures, we adopt the indirect ping strategy and the round-robin probe target selection mechanism from SWIM [11], and use them as an example to introduce how failed nodes are handled in the designed system. During each detection round, a node N1 sends a ping message to another node (denoted Nt here) in the same cluster. If no response is received from that target node, N1 sends a pingreq message to another node (denoted Nr) in the cluster and then waits for a response from either Nt or Nr. Upon receiving the pingreq message, Nr sends a ping message to Nt; if Nr gets a response from Nt, it forwards the response to N1. N1 tracks the time since the first ping was sent and, if it exceeds the predefined timeout, N1 declares Nt failed.
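The detection round can be summarised by the following sketch. The ping(), send() and wait_for_ack() primitives are assumed to block with the given timeout; this is only an illustration, not SWIM's reference implementation.

def probe(node, target, helper, timeout):
    # One detection round run by N1 against Nt (target), with Nr (helper) as relay.
    if node.ping(target, timeout):            # direct ping answered in time
        return True
    node.send("pingreq", target, helper)      # ask Nr to ping Nt and forward any answer
    if node.wait_for_ack(target, timeout):    # response relayed by Nr (or arriving late from Nt)
        return True
    return False                              # timeout exceeded: declare Nt failed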

During each detection round, N1 chooses the ping target in round-robin style instead of randomly. In this way, the detection time of a failure decreases and the situation where a failed node is never probed cannot occur. In our design, we introduce an index that controls the probe target selection. When membership changes, the leader broadcasts an update message to the cluster containing a list of the identifiers of all members. Each member gets a copy of the list, finds the position of its own identifier in the list and uses that position as its initial index. In each detection round, the node increases the index by 1 (wrapping around to 0 when the index reaches the length of the list) and chooses the target in the list according to the index. If the target is the node itself, the node repeats the operation and chooses the target at the next position of the list. With the help of the index, every member in the cluster receives a ping message in each round.
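The index-based selection described above can be written as a small, self-contained helper (illustrative only):

def next_probe_target(member_list, my_id, index):
    # Return (target, new_index) for the next round-robin ping, skipping our own entry.
    n = len(member_list)
    for _ in range(n):
        index = (index + 1) % n              # wrap around at the end of the list
        if member_list[index] != my_id:
            return member_list[index], index
    return None, index                        # this node is the only member left

Each member initialises index with the position of its own identifier in the list received from the leader, so the probes of different members stay spread over the cluster.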

When a failed node is found, the role of the node that detected the failure, together with the role of the failed node, determines the actions to be taken. We discuss three different scenarios below; a condensed sketch of the corresponding handling follows the list. The failed node is denoted NF, the leader of the failed node is denoted NL, and the node that declared the node dead is denoted ND.


1. A node with the leader role declared one of its group members failed, which means NL = ND. If the dead node is still in the list of group members, the leader removes it and sends an update message to all group members. If NF is not in the list, the leader does nothing, because the dead node may have already been handled through a leave message.

2. A non-leader node ND found one of its group members, NF, failed. ND sends a leave message to the group leader NL, notifying it of the failure of NF. NL handles the leave message and sends an update message to the group.

3. The leader failed. The node ND broadcasts an election message to the group, putting the group into election state. The election criterion is that the node with the smallest value of the last 8 bits of its IP address is elected as the new leader. The new leader then broadcasts an update message to the group.
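The three cases can be condensed into the following sketch. The helper names are assumptions for illustration; the election criterion is the one stated above (the smallest value of the last 8 bits, i.e. the last octet, of the IP address).

def on_node_failure(node, failed_id):
    # node is ND, the node that declared failed_id (NF) dead.
    if node.is_leader:                                    # case 1: NL == ND
        if failed_id in node.group_1.member_list:
            node.group_1.member_list.remove(failed_id)
            node.broadcast_update(node.group_1)
    elif failed_id == node.group_1.leader:                # case 3: the leader itself failed
        node.broadcast("election", node.id)               # the group enters election state
    else:                                                 # case 2: an ordinary member failed
        node.send("leave", failed_id, node.group_1.leader)

def elect_new_leader(candidate_ips):
    # Case 3 continued: the smallest last octet of the IP address wins.
    return min(candidate_ips, key=lambda ip: int(ip.split(".")[-1]))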


Chapter 4

Implementation

4.1 Software architecture

The proposed solution is implemented as a daemon, named Agent. Agent is written in Python and composed of the following building blocks:

• The main module is responsible for receiving new messages and taking actions depending on the type of message and the role of the node (a minimal dispatch sketch follows this list).

• Data structures hold useful information about the node and the clusters that the node belongs to.

• The discovery module is responsible for a node’s discovery and join process.

A separate thread is started when a node enters the discovery state and is stopped when the node successfully joins a cluster or elects itself as a new leader.

• The adjustment module makes changes to the current overlay topology according to cost values and other parameters.

• The failure detection and leader election module monitors the state of nodes and takes actions when a dead node is found.

• The cost map module provides the cost between any pair of nodes to the adjustment module when requested.

• The logging module keeps the system log, which is used for debugging and for collecting numerical results after an experiment.
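As referenced in the first bullet, the main module's dispatch could look like the sketch below. The JSON wire format, the handler names and the receive buffer size are illustrative assumptions; the actual encoding and handlers used by Agent are not shown here.

import json

def main_loop(node):
    handlers = {
        "transferreq": node.on_transferreq,
        "transfercmd": node.on_transfercmd,
        "mergereq":    node.on_mergereq,
        "mergeack":    node.on_mergeack,
        "join":        node.on_join,
        "leave":       node.on_leave,
    }
    while True:
        data, _addr = node.sock.recvfrom(4096)   # blocking receive on the node's socket
        msg = json.loads(data)                   # assumed message encoding for this sketch
        handler = handlers.get(msg.get("type"))
        if handler is not None:
            handler(**msg.get("payload", {}))    # each handler checks the node's role itself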

4.1.1 Data structure

The Agent program maintains a Node data structure that holds basic information about the node and membership information about the clusters it belongs to. The data structure is implemented as a Python class; an object of the class is created at boot time and passed to functions throughout the program.


Figure 4.1: Structure of Agent system

The main attributes are as follows (a minimal class sketch follows the list):

• ipaddr: the IP address of the node.

• uuid: the same as ipaddr here.

• sock: a socket object bound to TIPC_ADDR_NAME when created.

• port: the port number used in TIPC.

• level: the highest level of all clusters that the node belongs to.

• leader_level: the highest level of all clusters that the node leads.

• leader: a flag indicating whether the node plays a role of leader.

• group_list: a list of objects of class Group saving the membership information of all clusters (also referred to as groups in the implementation) that the node joins.
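The attributes above can be summarised as a minimal class sketch. This is an illustration of the data layout rather than the actual Agent code: the socket creation is simplified (binding to TIPC_ADDR_NAME is omitted), the Group class is only a placeholder, and the initial values are assumptions.

import socket

class Group:
    # Placeholder for the Group class holding a cluster's membership information.
    pass

class Node:
    def __init__(self, ipaddr, port):
        self.ipaddr = ipaddr
        self.uuid = ipaddr                 # the uuid is simply the IP address here
        self.port = port                   # port number used in TIPC
        self.sock = socket.socket(socket.AF_TIPC, socket.SOCK_RDM)  # bound to TIPC_ADDR_NAME in the real code
        self.level = 1                     # highest level of all clusters the node belongs to
        self.leader_level = 0              # highest level of all clusters the node leads
        self.leader = False                # whether the node currently acts as a leader
        self.group_list = []               # Group objects for every cluster the node has joined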
