
DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS

STOCKHOLM, SWEDEN 2017

Scalable Streaming Graph Partitioning

REZA KHAMOUSHI


KTH Royal Institute of Technology

Department of Software and Computer Systems

Degree project in Software Engineering of Distributed Systems

Scalable Streaming Graph Partitioning

Author: Reza Khamoushi
Supervisor: Hooman Peiro Sajjad
Examiner: Vladimir Vlassov


Abstract

Large-scale graph-structured datasets are growing at an increasing rate. Social network graphs are an example of these datasets. Processing large-scale graph-structured datasets is central to many applications, ranging from telecommunication to biology, and has led to the development of many parallel graph algorithms. The performance of parallel graph algorithms largely depends on how the underlying graph is partitioned.

In this work, we focus on studying streaming vertex-cut graph partitioning algorithms, where partitioners receive a graph as a stream of vertices and edges and assign partitions to them on their arrival, once and for all. Some of these algorithms maintain a state during partitioning. In some cases, the size of the state is so large that it cannot be kept in the memory of a single machine. In many real-world scenarios, several instances of a streaming graph partitioning algorithm are run simultaneously to improve the system throughput. However, running several instances of a partitioner drops the partitioning quality considerably, because each partitioner has only incomplete information.

Even frequent state sharing, combined with buffering mechanisms, does not completely solve the problem, because of the heavy communication overhead the partitioners produce.

In this thesis, we propose an algorithm which tackles the low scalability and performance of existing streaming graph partitioning algorithms by providing an efficient way of sharing state, combined with a windowing mechanism. We compare state-of-the-art streaming graph partitioning algorithms with our proposed solution in terms of performance and efficiency.

Our solution combines a batch processing method with a shared-state mechanism to achieve both high performance and high partitioning quality. The shared-state mechanism is used for sharing the partitioners' states. We provide a robust implementation of our method in the PowerGraph framework. Furthermore, we empirically evaluate the impact of partitioning quality on how graph algorithms perform in a real cloud environment.

The results show that our proposed method outperforms other algorithms in terms of partitioning quality and resource consumption and improves partitioning time considerably. On average, our method improves partitioning time by 23%, decreases communication load by 15% and increases memory consumption by only 5% compared to the state-of-the-art streaming graph partitioning algorithms.


Acknowledgment

I am deeply thankful to my supervisor, Hooman Peiro Sajjad, and my examiner, Vladimir Vlassov, for helping me to define the project, set the goals and stay on schedule. I would like to thank Rong Chen for answering my questions about the PowerGraph framework. Lastly, I wish to thank my family and friends for supporting me during this project.

Stockholm, 06 April 2017 Reza Khamoushi


Contents

1 Introduction

1.1 Motivation
1.2 Objective
1.3 Approach & Solution
1.4 Structure of the Thesis

2 Background

2.1 Graphs Basics
2.2 Power-Law Graphs & Small-World Networks
2.3 Graph Partitioning
2.4 Streaming Vertex-Cut Graph Partitioning
2.4.1 Degree-Based Hashing (DBH)
2.4.2 Powergraph
2.4.3 HDRF
2.4.4 HoVerCut

3 Related Work

3.1 Online Edge-Cut Algorithms
3.1.1 Fennel
3.2 Offline Edge-Cut Algorithms
3.2.1 Ja-be-Ja
3.2.2 Sheep
3.3 Online Vertex-Cut Algorithms
3.4 Offline Vertex-Cut Algorithms
3.5 Other Graph Partitioning Algorithms
3.5.1 Restreaming algorithms
3.5.2 Hybrid methods
3.6 Streaming graph processing frameworks
3.6.1 Spark
3.6.2 Flink
3.6.3 PowerGraph

4 Solution

4.1 State Management
4.2 SharedState+
4.3 Sample partitioning

5 Implementation

5.1 Shared State
5.2 The implementation

6 Evaluation Methodology

6.1 Overview
6.2 Datasets
6.3 Partitioning Algorithms
6.4 Graph Algorithms
6.4.1 PageRank
6.4.2 K-Core
6.5 Environment
6.5.1 Amazon EC2
6.5.2 Google Cloud Platform
6.5.3 KTH Cluster
6.6 Experiments
6.6.1 Partitioning experiments
6.6.2 Shared state experiments
6.6.3 Runtime experiments

7 Experiments & Results

7.1 Measurements
7.2 Partitioning experiments
7.2.1 Number of Partitions impact on Replication Factor
7.2.2 Number of Partitioners impact on Partitioning Time
7.3 Shared state experiments
7.3.1 Window Size impact on Partitioning Time
7.3.2 Partitioning Algorithm impact on Memory Consumption
7.3.3 Partitioning Algorithm impact on Network Calls
7.4 Runtime experiments
7.4.1 Partitioning Algorithm impact on Execution Time

8 Conclusion & Future Work

8.1 Conclusion
8.2 Future Work

List of Figures

4.1 Sample graph streaming
4.2 Sample graph partitioning
4.3 Sample graph partitioning - Replications
7.1 Number of Partitions impact on Replication Factor
7.2 Number of Partitioners impact on Partitioning Time
7.3 Window Size impact on Replication Factor
7.4 Window Size impact on Partitioning Time
7.5 Partitioning Algorithm impact on Memory Consumption
7.6 Partitioning Algorithm impact on Network Calls
7.7 Partitioning Algorithm impact on Network Traffic
7.8 Window Size impact on Network Calls
7.9 Partitioning Algorithm impact on Execution Time - K-Core
7.10 Partitioning Algorithm impact on Execution Time - Pagerank

List of Tables

6.1 Datasets Properties
6.2 Amazon EC2 instances
6.3 Google Cloud Platform instances
7.1 Replication Factor of the Partitioned Graph - 24 Partitions
7.2 Cloud Environments - Experiments

Chapter 1

Introduction

1.1 Motivation

The past few years have witnessed a deluge of data and a continuous growth of information production on the Internet. Since 2012, approximately 2.5 exabytes of data have been created every day [46], and globally stored information grows at an annual rate of roughly 23 percent [17]. Social networks are good examples of these colossal datasets. Facebook has more than one billion active users with an average friend count of 190 [41]. This is not merely growth in size but also a progression in data connectivity. Connectivity refers to the relationships between data elements in a dataset. For instance, in a Facebook dataset, friendship bonds are the connections between the data elements, which in this case are individuals.

Generally, higher connectivity of data leads to a greater complexity in processing and analysing data. Modelling complex datasets as graphs is a popular approach in which data elements are modeled as vertices, and edges represent data connectivity.

The graph representation of data comes with many advantages. Primarily, it fits the statistical nature of many natural datasets. Social networks, communication networks, citation networks, web graphs, peer-to-peer networks, geographical datasets and online communities are examples of graph-like datasets. Furthermore, a wide range of data analysis algorithms can be modeled as graph algorithms, which is a well-researched area. Consequently, there is an increasing demand for platforms that can store and process graph datasets.

Traditional databases are capable of storing graph-like data in a centralised fashion. Nonetheless, storing enormous datasets in a centralised database, even if possible, is in most cases not convenient. Besides, gigantic complex datasets cannot be processed by a single computational unit within a reasonable time. The demand for scalability has caused a paradigm shift toward decentralised data processing in the fields of graph processing and analysis. In the decentralised approach, graph datasets must be partitioned into smaller pieces in order to be distributed and processed simultaneously by several processing units. Using this approach, scalability is achieved by dynamically changing the capacity of the system.

A good graph partitioning is very useful because edges in real-world graphs display a great deal of locality. Generally, the issue of locality, i.e. accessing local data, and communication limitations are the main obstacles in designing distributed algorithms, because accessing data that is not stored locally is a time-consuming operation and drops the performance of the algorithm dramatically. Moreover, in some environments, accessing remote data is either too limited or totally unfeasible. Distributed graph algorithms are designed to process graph data in environments with the aforementioned limitations, but the performance of these algorithms highly depends on the quality of the underlying partitioning. Distributed graph coloring [24] and calculating the local clustering coefficient [42] are examples of such algorithms. In the latter algorithm, each node needs to know only about its surroundings, i.e. its directly adjacent nodes and their neighbours; having local access to this data improves performance significantly.

Typically, graph partitioning problems fall under the NP-hard category [12]. Even partitioning grid graphs while approximating both the number of edges between components and the component sizes is not possible in fully polynomial time [11] [3]. Therefore, graph partitioning algorithms usually use heuristic and approximation solutions. Typically, a good graph partitioning is defined as one in which the number of connections between partitions is minimal while the partitions are balanced, i.e. approximately equal in size.

It is, therefore, not surprising that the best graph partitioning algorithms are centralised, with access to the global graph structure. Hierarchical clustering, reviewed in [9] [16], and K-means [27] [25] are examples of centralised graph clustering algorithms. However, partitioning large graphs in a centralised fashion is unscalable and impractical in real environments: primarily because large graphs do not fit in the memory of a single machine, and secondly because of the dynamic structure of real-world graphs, where nodes are added and removed frequently and any change requires repartitioning the graph.

Many distributed graph partitioning algorithms have been introduced in the past few years, such as JA-BE-JA [33] [32]. But even distributed algorithms incur high computation and communication costs, and in the case of big data they fall short of producing timely partitions. One of the reasons is that the whole dataset needs to be loaded into a cluster before running the partitioning algorithm, which is a time-consuming operation.

An alternative high-performance solution for partitioning large graph datasets is streaming graph partitioning. In streaming graph partitioning, data comes as a stream and partitioning is performed at the same time as the graph is loaded into the Cloud. In these algorithms, the whole dataset does not need to be loaded into memory at once. Therefore, stream graph partitioning algorithms have better scalability and are faster compared to non-stream graph partitioning algorithms. To the best of our knowledge, HDRF [31], which is a greedy online stream graph partitioning algorithm, outperforms all existing stream graph partitioning algorithms. Nevertheless, when several partitioners work at the same time to boost performance, partitioning quality drops due to each partitioner's incomplete information about how the other partitioners have partitioned their part of the graph. Although the same quality as a single instance can be achieved by coordinating the local states of the instances, coordination at each step adds communication overhead and consequently increases partitioning time.

Peiro et al. in their recent work introduced a framework, HoVerCut [35], which provides existing stream graph partitioning algorithms with a windowing method for efficiently sharing states between partitioners. The method decreases the communication load between partitioners while achieving performance similar to the coordinated version of running several partitioners. However, HoVerCut has not been implemented in any of the popular graph computation frameworks or tested in real Cloud environments. Also, the shared state used in the algorithm has not been studied in detail, and the impact of sharing partial information on performance and efficiency has not been considered.

We show that by sharing only a small portion of the information and keeping the remaining parts of the state in local storage, the algorithm reaches higher performance while preserving the superior partitioning quality. We also expect the algorithm to achieve a lower communication overhead compared to the original algorithm by sharing less information between partitioners.

1.2 Objective

In particular, the objectives of this thesis are:

• To conduct a study of existing partitioning algorithms with a focus on stream graph partitioning algorithms for power-law graphs.

• To provide a robust and scalable implementation of our method in the PowerGraph framework, using a distributed hash table as the shared state.

• To empirically evaluate and compare the performance and efficiency of stream graph partitioning algorithms in real Cloud environments on natural graphs, using the new scalable and robust implementation.

• To study the impact of storing partial information in the shared state on the performance and efficiency of stream graph partitioning algorithms.


1.3 Approach & Solution

Using batch processing and shared states improves the replication factor while preserving low execution time. We expect that sharing only part of the information in the shared state improves the algorithm's execution time with a negligible drop in partitioning quality. The idea behind this method is to add a new windowing mechanism for decreasing the communication load. We show that by sharing a small amount of information between partitioners every few steps, partitioning quality close to the centralised version can be achieved without any performance loss.

In this work, we provide a new robust and scalable implementation of the shared-state mechanism using a distributed hash table. For assessment, we compare our method (SharedState+) with two state-of-the-art vertex-cut stream graph partitioning algorithms, namely Powergraph and HDRF. We implemented our methods in the PowerGraph [13] framework. We evaluate all partitioning algorithms against four natural graphs in real Cloud environments: Google Cloud Platform, Amazon EC2 and the KTH Cloud. We show that our solution ensures scalability and robustness while preserving the performance of the centralised shared-memory approach.

1.4 Structure of the Thesis

The rest of this thesis is structured as follows: Chapter 2 describes graph partitioning and provides a formal definition of some graph properties. Chapter 3 describes existing stream graph partitioning algorithms. Chapter 4 introduces our novel method for running streaming graph partitioning algorithms. Chapter 5 details the implementation and motivates our design choices. Chapter 6 details the experiments and specifies the evaluation methods. Chapter 7 provides results and reasoning. Chapter 8 summarizes our work and provides conclusions.

Chapter 2

Background

In this chapter, graph basics are covered and basic definitions and notation are introduced. Graph types, with a focus on power-law graphs, are studied. Also, graph partitioning methods and categories are detailed, and a few sample vertex-cut graph partitioning algorithms are explained. Moreover, popular graph computation frameworks are introduced.

2.1 Graphs Basics

A simple undirected graph is represented as G(V, E), where V is the set of vertices and E is the set of edges. The size of V is denoted by |V| = n and is the number of vertices; it is also called the graph size in this text. |E| = m is the number of edges in the graph. An edge e_{vu} connects vertices v and u. D_v is the degree of vertex v. N_v is the collection of all immediate neighbours of vertex v, is defined formally by Equation 2.1, and is referred to as the neighbourhood of vertex v.

N_v = {u : e_{uv} ∈ E ∨ e_{vu} ∈ E}.    (2.1)

The graph diameter is denoted by D and is the longest shortest path between any two vertices. For more details about graph basics, refer to [34].

The local clustering coefficient of a vertex v is denoted by C_v and quantifies the connectivity of N_v. C_v is calculated according to Equation 2.2.

C_v = |{e_{uw} : u, w ∈ N_v, e_{uw} ∈ E}| / (D_v (D_v − 1)).    (2.2)

The global clustering coefficient, C, is the number of closed triangles over the total number of possible triplets.
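As a concrete illustration of Equation 2.2, the following C++ sketch (not from the thesis; the toy graph and function names are illustrative) computes C_v for one vertex of a small undirected graph stored as adjacency sets.

#include <cstdio>
#include <set>
#include <unordered_map>

using Graph = std::unordered_map<int, std::set<int>>;

// C_v = |{e_uw : u, w in N_v, e_uw in E}| / (D_v * (D_v - 1)),
// following the form used in Equation 2.2.
double local_clustering(const Graph& g, int v) {
    const auto it = g.find(v);
    if (it == g.end() || it->second.size() < 2) return 0.0;
    const std::set<int>& nv = it->second;
    std::size_t links = 0;
    for (int u : nv)
        for (int w : nv)
            if (u != w && g.at(u).count(w)) ++links;  // ordered pairs (u, w)
    const double dv = static_cast<double>(nv.size());
    return static_cast<double>(links) / (dv * (dv - 1.0));
}

int main() {
    Graph g;
    auto add_edge = [&](int a, int b) { g[a].insert(b); g[b].insert(a); };
    add_edge(1, 2); add_edge(2, 3); add_edge(1, 3); add_edge(3, 4);
    std::printf("C_3 = %.3f\n", local_clustering(g, 3));  // 2 / (3 * 2) = 0.333
    return 0;
}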


2.2 Power-Law Graphs & Small-World Networks

Power-law graphs are named after their highly skewed power-law degree distribution. This indicates that most vertices have a low degree, and a few vertices have a very high degree.

Another property of power-law graphs is their high local clustering coefficient, which implies that the neighbours of a vertex are typically also connected to each other. This property has been exploited in designing many algorithms for partitioning power-law graphs.

The impact of the few high-degree vertices can be witnessed in the low diameter of power-law graphs. In these graphs, there is a path between every pair of vertices whose length is small compared to random graphs, and the average length of shortest paths between vertices increases slowly as the graph size increases. For more details about small-world graphs, refer to [10].

2.3 Graph Partitioning

In graph partitioning, P represents the set of partitions and |P| = k is the number of partitions. p represents a specific partition in P. A partition function is a function that maps a vertex to a partition, π : V → P.

Graph partitioning algorithms are typically categorised into offline and online categories. This categorization is based on the assumptions they make about graph availability. Online algorithms receive a graph as a stream of vertices or edges and partition it as it arrives. This is in contrast with offline partitioning algorithms, where the whole graph is available at any point in time.

Online graph partitioning algorithms, also known as streaming partitioning algorithms, do not depend on global graph information such as the graph structure, actual node degrees, and the degree distribution. This is because the graph is received as a stream of vertices and edges, and global information is not available. In contrast, offline graph partitioning algorithms assume that the whole graph is available at any point in time and use global graph information during partitioning.

Another important reason for the popularity of online partitioning algorithms is their good adaptation to dynamic graphs, where the graph structure is constantly changing. For instance, in the Skype service, each time a user signs in, all of her/his friends are notified. In these environments, offline algorithms need to repartition the whole graph and are computationally expensive. Other examples of dynamic graphs are the Facebook and Twitter graphs, where new accounts are created and deleted periodically, and new relations are added to and removed from the graph at any given time.

From another angle, graph partitioning algorithms are categorised into vertex-cut or edge-cut categories. Vertex-cut algorithms replicate vertices across a subset of partitions and assign each edge to only one partition. Edge-cut algorithms work in the opposite way, i.e. they replicate edges across partitions and assign each vertex to one partition only. Four categories are made up from the four possible combinations of the two aforementioned properties: Online Edge-Cut, Offline Edge-Cut, Online Vertex-Cut and Offline Vertex-Cut algorithms.

In online edge-cut algorithms, the target graph is fed to the algorithm as a stream of vertices and edges. Each vertex is permanently assigned to a partition on arrival; however, edges can be replicated on several partitions. Similarly, in online vertex-cut algorithms, the graph is received from a source as a stream. However, the algorithm assigns edges, instead of vertices, to the partitions on their arrival. It has been shown that for power-law graphs better partitioning is achieved using vertex-cut algorithms [13]. Online vertex-cut algorithms are further categorised into hash-based and greedy algorithms.

Hash-based online vertex-cut algorithms use hashing methods to assign edges to partitions. This approach results in balanced partition sizes if a uniform hash function is selected. However, the replication factor of the partitioning is high, since the assignment of the edges is random. In contrast, greedy online vertex-cut algorithms use a scoring function which considers the history of assignments when allocating a new edge. This approach decreases the replication factor while also ensuring load balance. As our work is based on streaming vertex-cut graph partitioning, we detail state-of-the-art algorithms in the next section.

The offline graph partitioning problem, including both edge-cut and vertex-cut, is NP-hard. Several centralised algorithms have been introduced to find a solution to this problem [15]. Nonetheless, because of resource limitations, these algorithms are impractical for large-scale graphs. It has been observed that even centralised heuristics are not capable of partitioning large-scale graphs [6]. Therefore, large-scale graphs must be divided into pieces that fit into memory before they can be partitioned.

2.4 Streaming Vertex-Cut Graph Partitioning

In this section, we detail three streaming vertex-cut graph partitioning algorithms, namely Degree-Based Hashing [43], Powergraph [13] and HDRF [31], which have achieved the best results.


2.4.1 Degree-Based Hashing (DBH)

The DBH algorithm introduces two hash functions, h(v) and h(e), for hashing vertices and edges respectively. h(v) is a simple uniform hash function which assigns vertices randomly and uniformly to different partitions. The main idea behind the algorithm is to define h(e) in such a way that it provides a low replication factor while ensuring a good load balance. To satisfy this constraint, considering the structure of power-law graphs, h(e) is defined as follows:

h(e) = h(v)  if d(v) < d(u),
h(e) = h(u)  otherwise.    (2.3)

where v and u are the two vertices connected to the edge e, and d(v), d(u) are the partial degrees of vertices v and u respectively.

The intuition behind this definition is that in power-law graphs low-degree vertices have a high local clustering coefficient, so replicating the small portion of vertices with high degrees ends up in a better replication factor. Xie et al. showed that their algorithm outperforms the Grid solution of GraphBuilder [18] and the randomised version of Powergraph [13].
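As an illustration of Equation 2.3, here is a minimal C++ sketch of the degree-based assignment rule. It is not the DBH reference implementation; the partial-degree bookkeeping and the toy edge stream are assumptions made for the example.

#include <cstdio>
#include <functional>
#include <unordered_map>
#include <utility>
#include <vector>

using Edge = std::pair<int, int>;

int dbh_assign(const Edge& e, std::unordered_map<int, int>& partial_degree,
               int num_partitions) {
    int v = e.first, u = e.second;
    ++partial_degree[v];          // maintain partial degrees on the fly
    ++partial_degree[u];
    std::hash<int> h;
    // Hash the endpoint with the smaller partial degree (Equation 2.3),
    // so high-degree vertices end up replicated across partitions.
    int pivot = (partial_degree[v] < partial_degree[u]) ? v : u;
    return static_cast<int>(h(pivot) % num_partitions);
}

int main() {
    std::vector<Edge> stream = {{1, 2}, {1, 3}, {1, 4}, {2, 3}, {4, 5}};
    std::unordered_map<int, int> degree;
    for (const Edge& e : stream)
        std::printf("edge (%d,%d) -> partition %d\n",
                    e.first, e.second, dbh_assign(e, degree, 4));
    return 0;
}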

2.4.2 Powergraph

Many edge-cut graph partitioning algorithms have been introduced to efficiently partition and provide a distributed representation of large-scale graphs. However, these algorithms have poor performance in partitioning natural graphs. The poor performance is because of the highly skewed degree distribution of natural graphs, which challenges the assumptions made by these algorithms. In general, power-law graphs with a highly skewed degree distribution are hard to represent in a distributed fashion [1] [23], and partitioning them with edge-cut algorithms results in partitions with an unbalanced workload, poor locality, high communication load, unbalanced memory consumption and skewed computation load. These behaviours are all explained by the presence of a small portion of vertices with a very high degree and the fact that high-degree vertices need to store and process a big portion of the data, because each node stores and processes its neighbourhood data locally.

Gonzalez et al. [13] introduce a novel algorithm, Powergraph, which factors computations over edges instead of vertices. At each step, a computation function is invoked over the edges and the result is sent to the neighbouring vertices of each edge. Each vertex stores the received data in a local accumulator. After each step, an algorithm-specific function is applied over the sum, and the results are sent to all neighbours and connecting edges. Using this approach, the heavy computation load of high-degree vertices is distributed over the cluster.


However, since these algorithms are supposed to run in a distributed environment, the partitioning of the graph plays a key role in lowering communication and balancing workloads. It has been shown that edge-cut algorithms perform poorly on power-law graphs and result in a high number of edge cuts and a high replication factor.

The expected number of edge cuts per vertex for a power-law graph is calculated based on Equation 2.4:

E[edge cuts per vertex] = (1 − 1/p) · λ(α − 1) / λ(α)    (2.4)

where λ is the normalization constant of the Zipf distribution, α is the clustering exponent and p is the number of clusters.

On the other hand, a balanced randomised vertex-cut algorithm results in a lower replication factor compared to edge-cut algorithms. It has been proved [13] that any edge-cut partitioning can be converted to a better vertex-cut partitioning. The expected replication factor of a power-law graph partitioned by a randomised vertex-cut algorithm can be calculated based on Equation 2.5:

E[RF] = p − (p / λ(α)) · Σ_d ((p − 1)/p)^d · d^(−α)    (2.5)

where λ is the normalization constant of the Zipf distribution, α is the clustering exponent and p is the number of clusters.

However, these figures can still be improved by replacing the randomised assignment of edges with a greedy approach. Gonzalez et al. introduced a greedy vertex-cut algorithm which focuses on efficiently partitioning power-law graphs. The algorithm considers the assignments of previous edges when a new edge is streamed, and assigns the new edge to a partition based on the following rules, where P(v) denotes the set of partitions that currently hold a replica of vertex v.

• If P(v) ∩ P(u) ≠ ∅, then assign the edge to a partition in P(v) ∩ P(u).

• If P(v) ≠ ∅ and P(u) ≠ ∅ and P(v) ∩ P(u) = ∅, then assign the edge to the partition in P(v) ∪ P(u) with the most unassigned edges.

• If (P(v) = ∅ or P(u) = ∅) and P(v) ∪ P(u) ≠ ∅, then assign the edge to a partition in P(v) ∪ P(u).

• If P(v) = ∅ and P(u) = ∅, then assign the edge to the partition with the smallest size.

Powergraph significantly improves partitioning quality compared to edge-cut methods using this greedy partitioning algorithm.
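A minimal C++ sketch of these placement rules is given below. It is illustrative only, not the PowerGraph source: replica sets stand in for P(v), partition size is counted in edges, and rules 2 and 3 are simplified to picking the least-loaded partition in P(v) ∪ P(u) rather than the vertex with the most unassigned edges.

#include <algorithm>
#include <cstdio>
#include <iterator>
#include <set>
#include <unordered_map>
#include <utility>
#include <vector>

struct GreedyState {
    std::unordered_map<int, std::set<int>> replicas;   // P(v) for each vertex
    std::vector<long> partition_size;                   // edges assigned per partition
    explicit GreedyState(int k) : partition_size(k, 0) {}
};

int greedy_assign(GreedyState& s, int v, int u) {
    const std::set<int>& pv = s.replicas[v];
    const std::set<int>& pu = s.replicas[u];
    std::set<int> both, either;
    std::set_intersection(pv.begin(), pv.end(), pu.begin(), pu.end(),
                          std::inserter(both, both.begin()));
    std::set_union(pv.begin(), pv.end(), pu.begin(), pu.end(),
                   std::inserter(either, either.begin()));
    auto least_loaded = [&](const std::set<int>& cand) {
        int best = *cand.begin();
        for (int p : cand)
            if (s.partition_size[p] < s.partition_size[best]) best = p;
        return best;
    };
    int target;
    if (!both.empty())        target = least_loaded(both);    // rule 1
    else if (!either.empty()) target = least_loaded(either);  // rules 2 and 3 (simplified)
    else {                                                     // rule 4: smallest partition
        target = 0;
        for (int p = 1; p < (int)s.partition_size.size(); ++p)
            if (s.partition_size[p] < s.partition_size[target]) target = p;
    }
    s.replicas[v].insert(target);
    s.replicas[u].insert(target);
    ++s.partition_size[target];
    return target;
}

int main() {
    GreedyState state(3);
    std::vector<std::pair<int, int>> stream = {{1, 2}, {2, 3}, {1, 3}, {3, 4}, {5, 6}, {6, 7}};
    for (auto& e : stream)
        std::printf("edge (%d,%d) -> partition %d\n",
                    e.first, e.second, greedy_assign(state, e.first, e.second));
    return 0;
}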


2.4.3 HDRF

HDRF effectively exploits the structure of the input power-law graph and tries to replicate high-degree vertices more often than low-degree vertices. The intuition behind the algorithm is that in natural graphs low-degree vertices have a higher local clustering coefficient, and by replicating high-degree vertices first, the partitioning result has a lower replication factor compared to a randomised vertex-cut. The name of the algorithm is also based on this principle: High-Degree (are) Replicated First (HDRF).

More formally, HDRF calculates a score C(p) for each partition p, given an edge e and its vertices u and v. It then assigns the edge to the partition with the highest score. Ties are broken randomly. The score of a partition p is calculated using Equation 2.6:

C^HDRF(p, v, u, e) = C^HDRF_balance(p, v, u, e) + C^HDRF_replication(p, v, u, e)    (2.6)

where C^HDRF_balance(p, v, u, e) and C^HDRF_replication(p, v, u, e) are the partial scores calculated for ensuring a good load balance and for minimizing the replication factor. These partial scores are calculated using Equations 2.7 and 2.8 respectively.

C^HDRF_balance(p, v, u, e) = λ · (max_{a∈P} |a| − |p|) / (ε + max_{a∈P} |a| − min_{a∈P} |a|)    (2.7)

where ε is a small positive value and λ > 0 is the parameter that controls the importance of load balancing. The larger λ gets, the more important load balancing becomes. λ → ∞ makes the algorithm a random heuristic, where previous assignments are ignored and the partitioning result has the highest load balance, while with λ = 0 the algorithm is agnostic to load balance.

C^HDRF_replication(p, v, u, e) = g(p, v, u) + g(p, u, v)    (2.8)

g(p, u, v) = 1 + d(v)/(d(u) + d(v)),  if p ∈ P(u);
g(p, u, v) = 0,  otherwise.    (2.9)

where d(v) is the partial degree of v.
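The following C++ sketch evaluates the HDRF score of Equations 2.6-2.9 for a single incoming edge. It is illustrative rather than the reference implementation; the replica sets, partial degrees, partition sizes and the λ and ε values are assumed inputs that a real partitioner would maintain while streaming.

#include <algorithm>
#include <cstdio>
#include <set>
#include <vector>

struct HdrfInput {
    std::set<int> pv, pu;             // P(v), P(u): partitions holding replicas
    double dv = 0.0, du = 0.0;        // partial degrees d(v), d(u)
    std::vector<long> partition_size; // |a| for every partition a
    double lambda = 1.0;              // load-balance weight
    double epsilon = 1e-3;            // small positive constant
};

double hdrf_score(const HdrfInput& in, int p) {
    // Replication term (Eq. 2.8/2.9): favour partitions already holding the
    // lower-degree endpoint, so that the high-degree vertex gets replicated.
    double rep = 0.0;
    if (in.pv.count(p)) rep += 1.0 + in.du / (in.dv + in.du);
    if (in.pu.count(p)) rep += 1.0 + in.dv / (in.dv + in.du);
    // Balance term (Eq. 2.7).
    long maxsz = *std::max_element(in.partition_size.begin(), in.partition_size.end());
    long minsz = *std::min_element(in.partition_size.begin(), in.partition_size.end());
    double bal = in.lambda * (maxsz - in.partition_size[p]) /
                 (in.epsilon + static_cast<double>(maxsz - minsz));
    return rep + bal;  // Eq. 2.6; the edge goes to the arg-max partition
}

int main() {
    HdrfInput in;
    in.pv = {0};            // v already replicated on partition 0
    in.pu = {1};            // u already replicated on partition 1
    in.dv = 5;  in.du = 1;  // v is the high-degree endpoint
    in.partition_size = {10, 8, 9};
    for (int p = 0; p < 3; ++p)
        std::printf("partition %d score %.3f\n", p, hdrf_score(in, p));
    return 0;
}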

Counting the degree of each vertex before running the algorithm is a time-consuming and memory-intensive task. To overcome this problem, Petroni et al. introduced a new parameter called the partial degree. The partial degree of a vertex is an estimation of its actual degree and can be calculated and maintained during the execution of the algorithm, based on the vertices and edges assigned so far. Actual degrees can be replaced by partial degrees in the partitioning calculations. Replacing actual degrees with their estimations has an impact on the algorithm's performance. This impact is even higher when the algorithm is run on a cluster with several partitioners. In such a setup, each partitioner maintains its own copy of the degree estimations. In order to minimise the impact of degree estimations, partitioners share their states at each step. Although sharing states improves the performance of the algorithm, it also adds an overhead and increases the partitioning time.

2.4.4 HoVerCut

HoVerCut tackles the problem of the performance drop when running several partitioners in parallel. In this thesis, we refer to this method as SharedState().

HoVerCut takes an input streaming vertex-cut partitioning algorithm and executes it in a newly introduced context. We call this algorithm 'the input algorithm' and refer to it using the first letter of the algorithm name. For instance, if HDRF is used as the input algorithm we refer to it as SharedState(H), and if the input algorithm is the Greedy algorithm we refer to it as SharedState(G).

The HoVerCut framework uses a shared state in which partitioners store and share their states. For instance, when the input algorithm is HDRF, partitioners store partial degrees, partition sizes and replication lists in the shared state.

Sharing information through the shared state requires network communication, which is a time-consuming operation and drops the performance of the input algorithm significantly. To tackle this issue, HoVerCut uses a windowing mechanism.

Using the windowing mechanism, each partitioner uses a buffer (window) to collect edges locally. When the buffer gets full, the partitioner executes the partitioning algorithm to partition the edges in the buffer. Each partitioner thus contacts the shared state only once per full window.
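The sketch below (C++, illustrative only) shows the effect of this windowing on shared-state traffic: the placeholder SharedStateClient is contacted once per full window rather than once per edge, and assign_partition is a dummy stand-in for the input partitioning algorithm.

#include <cstdio>
#include <utility>
#include <vector>

using Edge = std::pair<int, int>;

struct SharedStateClient {            // placeholder for the remote shared state
    int round_trips = 0;
    void sync(const std::vector<Edge>& window) { ++round_trips; (void)window; }
};

int assign_partition(const Edge& e) { return (e.first + e.second) % 4; }  // dummy heuristic

void stream_with_window(const std::vector<Edge>& stream, std::size_t window_size,
                        SharedStateClient& state) {
    std::vector<Edge> window;
    for (const Edge& e : stream) {
        window.push_back(e);
        if (window.size() == window_size) {           // window full: one round trip
            state.sync(window);
            for (const Edge& w : window) (void)assign_partition(w);
            window.clear();
        }
    }
    if (!window.empty()) { state.sync(window); window.clear(); }  // flush the tail
}

int main() {
    std::vector<Edge> stream = {{1,2},{2,3},{3,4},{4,5},{5,6},{6,7},{7,8}};
    SharedStateClient state;
    stream_with_window(stream, 3, state);
    std::printf("shared-state round trips: %d\n", state.round_trips);  // 3 instead of 7
    return 0;
}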


Chapter 3

Related Work

In this chapter, we study a wide range of different graph partitioning algorithms. For ease of understanding, we divide graph partitioning algorithms into four categories, namely Online Edge-Cut Algorithms, Offline Edge-Cut Algorithms, Online Vertex-Cut Algorithms and Offline Vertex-Cut Algorithms. Moreover, we explain restreaming and hybrid methods, which are combinations of these four categories. In the last section of this chapter, we go through different streaming graph processing frameworks and discuss their pros and cons.

3.1 Online Edge-Cut Algorithms

Stanton et al. introduced online graph partitioning in 2012 [38]. In this section we cover Fennel [40] and the work of Stanton et al. [37], [38] as two state-of-the-art online edge-cut algorithms.

3.1.1 Fennel

Tsourakakis et al. in their work Fennel [40] fill the gap between mathematical graph partitioning algorithms and the heuristics used in practice. They define a general framework where the global objective function is defined as in Equation 3.1.

f(P) = C_OUT(|e(S_1, V\S_1)|, ..., |e(S_k, V\S_k)|) + C_IN(σ(S_1), ..., σ(S_k)).    (3.1)

The first term corresponds to the total edge cuts and the second term corresponds to the load balance. In order to minimize the objective function in a streaming graph setting, a greedy assignment is used to place a newly streamed vertex v in a cluster. The score of each cluster for the new vertex is calculated based on Equation 3.2.


δg(v, S_i) = C_OUT(v, S_i) + C_IN(v, S_i)    (3.2)

where C_OUT and C_IN are defined as shown in Equation 3.3 and correspond to the edge cut and the load balance respectively.

C_OUT(v, S_i) = |N(v) ∩ S_i|
C_IN(v, S_i) = −(c(|S_i ∪ {v}|) − c(|S_i|))    (3.3)

where c(x) = αx^γ for α > 0 and γ ≥ 1.
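A minimal C++ sketch of this greedy placement follows; it is not the Fennel reference code, and the α and γ values as well as the toy clusters are illustrative assumptions.

#include <cmath>
#include <cstdio>
#include <set>
#include <vector>

struct FennelParams { double alpha = 1.5, gamma = 1.5; };

int fennel_assign(const std::set<int>& neighbours_of_v,
                  const std::vector<std::set<int>>& clusters,
                  const FennelParams& prm) {
    int best = 0;
    double best_score = -1e300;
    for (std::size_t i = 0; i < clusters.size(); ++i) {
        const std::set<int>& s = clusters[i];
        double inside = 0;                      // |N(v) ∩ S_i|
        for (int n : neighbours_of_v) inside += s.count(n);
        double size = static_cast<double>(s.size());
        double penalty = prm.alpha * (std::pow(size + 1.0, prm.gamma) -
                                      std::pow(size, prm.gamma));  // c(|S_i|+1) - c(|S_i|)
        double score = inside - penalty;        // δg(v, S_i)
        if (score > best_score) { best_score = score; best = static_cast<int>(i); }
    }
    return best;
}

int main() {
    std::vector<std::set<int>> clusters = {{1, 2, 3}, {4, 5}};
    std::set<int> neighbours = {2, 3, 9};       // vertex v's already-seen neighbours
    std::printf("v -> cluster %d\n", fennel_assign(neighbours, clusters, {}));
    return 0;
}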

This greedy assignment, in many cases, outperforms fully offline algorithms, namely METIS [19] and the restreaming approach introduced in [29]. However, there are also issues with the Fennel algorithm. For instance, this algorithm only considers the balance between the number of vertices in each cluster, and it usually ends up with an imbalanced number of edges between clusters when it is run on regular graphs.

Chen et al. [8] tackled this issue by adding an extra term to the cost function which forms partitions with a balanced number of edges. The new term is µ|S_i|_E, where µ is the ratio of vertices to edges and |S_i|_E is the number of edges in cluster S_i.

3.2 Offline Edge-Cut Algorithms

For the reasons explained in Section 2.3, in this section we only consider distributed offline edge-cut algorithms.

3.2.1 Ja-be-Ja

Ja-be-Ja [33] is a fully distributed partitioning algorithm where pairs of nodes periodically switch their places if it improves the partitioning quality. Formally, the algorithm is categorised under distributed local search algorithms. However, since the initialization of the partitions is random, if the initialization is not balanced the partitioning result will have the same skewness. Another drawback of Ja-be-Ja is that the biggest dataset used in the paper has fewer than 70,000 vertices.


3.2.2 Sheep

Sheep [28] is another distributed graph partitioner which is similar to METIS [20] in essence. They both reduce the target graph to simpler data structures, namely an elimination tree and smaller graphs, then partition the newly created structures and, in the final phase, reconstruct the partitioning of the original graph from the result. In particular, Sheep transforms the input graph to an elimination tree using a distributed map-reduce operation.

The strength of the Sheep algorithm is its partitioning time, where it outperforms both METIS and streaming algorithms. When it comes to partitioning quality, Sheep is competitive with METIS when the number of partitions is low, and it can compete with online algorithms when the number of partitions is high.

3.3 Online Vertex-Cut Algorithms

Degree-Based Hashing [43], Powergraph [13] and HDRF [31] are, to the best of our knowledge, the most efficient algorithms in the field of streaming vertex-cut graph partitioning.

Degree-Based Hashing is a hash-based online vertex-cut algorithm with a focus on partitioning power-law graphs. It outperforms other hash-based algorithms by exploiting the power-law degree distribution of natural graphs. Moreover, it keeps a good load balance between partitions. Gonzalez et al. introduced the Powergraph algorithm, which achieves better results compared to DBH by considering previous assignments.

Petroni et al. introduced a novel online vertex-cut graph partitioning algorithm called HDRF [31], which outperforms the two aforementioned algorithms. Similar to other online vertex-cut partitioning algorithms, it reads the graph from a source as a stream of vertices and edges. The algorithm assigns edges to partitions on their arrival. Similar to Degree-Based Hashing (DBH) [43], HDRF focuses on partitioning natural graphs with a power-law degree distribution. However, in contrast to DBH, it employs a greedy-based approach for partitioning. Using the greedy approach, it considers both vertex degrees and previous assignments. The algorithm calculates a score for each partition and assigns the edge to the partition with the highest score.

Peiro et al. [35] introduce HoVerCut, which is a parallel and distributed framework for boosting existing streaming vertex-cut partitioning algorithms by introducing a windowing mechanism and using shared states. Other heuristic methods can be combined with this framework to achieve higher performance when several partitioners are executed simultaneously.


3.4 Offline Vertex-Cut Algorithms

Contrary to edge-cut partitioning, there are few offline vertex-cut partitioning algorithms.

Florian et al. in [37] analyzed balanced vertex-cut as an alternative to balanced edge-cut partitioning by providing an explicit characterization of the expected communication cost for different variants of the graph partitioning problem.

Ja-be-Ja-vc [32] is a distributed vertex-cut algorithm. It is based on the same idea as Ja-be-Ja, which is for edge-cut partitioning. In Ja-be-Ja-vc, pairs of nodes iteratively find each other and take actions toward improving the partitioning quality. This algorithm is a form of local search.

DFEP [14] is an offline vertex-cut partitioning algorithm; it works on the basis of a market model where partitions compete to buy edges with a limited budget. Each edge is assigned to the partition with the highest offer. An additional coordinator sends more budget to partitions in inverse proportion to their size.

3.5 Other Graph Partitioning Algorithms

In this section, we study partitioning algorithms which do not fall under any of the four previously studied categories in this chapter.

3.5.1 Restreaming algorithms

Restreaming is a newly introduced method that can be combined with online edge-cut algorithms to achieve better results. Nishimura et al. [29] improved the results of the Fennel and Linear Deterministic Greedy (LDG) [38] algorithms by combining them with the restreaming method. In this approach, the partitioner has access to the previous partitioning result.

This approach is also used for graph reloading when minimal changes have occurred. Restreaming is used whenever an update is requested. However, this approach is not scalable when separate workers are operating, because restreaming the whole graph while messages are passing between workers is usually a very time-consuming process.


3.5.2 Hybrid methods

Hybrid methods combine different partitioning methods to achieve better results. PowerLyra [8] is a modification of Powergraph. PowerLyra is a hybrid method that combines a streaming vertex-cut algorithm (the Powergraph algorithm) with a streaming edge-cut algorithm (the Fennel algorithm). It simply uses the Powergraph algorithm for high-degree vertices (vertices with a degree above a user-defined value) and the Fennel algorithm for low-degree vertices.

3.6 Streaming graph processing frameworks

The most common stream graph processing frameworks are Flink, Spark and PowerGraph. In this section, we investigate each of these frameworks and go through their similarities and differences. At the end of this section, we select the PowerGraph framework as our choice for implementing vertex-cut streaming graph partitioning algorithms and performing the experiments.

3.6.1 Spark

Spark [45] is an efficient, general-purpose platform for large-scale data processing, hosted by the Apache Software Foundation (ASF). Spark has a stack of libraries, namely Spark SQL, MLlib, GraphX and Spark Streaming. In this section, we focus on Spark Streaming, which is a library for building scalable streaming applications.

Spark Streaming provides high-level operators that help developers write streaming jobs the same way as they write batch jobs. The input data stream can be ingested from many different sources, such as Kafka, Flume and TCP sockets, and the processed data can be used as input to other algorithms, such as machine learning or graph processing algorithms.

Spark Streaming wraps data streams in small batches. This means the framework waits for a specified period of time and collects data during this waiting time. After that, it runs batch programs on the collected data while concurrently collecting data from the stream for the next batch, and repeats this process. Using this approach, the data stream is converted into small batches of data. In the Spark framework, the small batches are called RDDs (Resilient Distributed Datasets). An RDD is the lowest abstraction in the Spark platform and represents an immutable collection of elements that can be processed simultaneously. There are many operations in Spark that can be performed on RDDs, such as map, filter and persist. For more information regarding Spark, refer to [45]. We believe the micro-batch principle is the major factor preventing Spark Streaming from providing true streaming solutions.


Although some claim that more than 95% of all stream processing use cases can be handled with micro-batching, there are use cases where micro-batch processing is an inappropriate approach. Particularly in use cases where low latency is a key factor, using the Spark streaming engine has a negative impact on an algorithm's performance. For example, in anomaly detection, quick detection and low latency are the key factors in preventing damage. However, the micro-batch approach fails to deliver the stream data with low latency. This failure is because of the extra delay the framework adds for batching the incoming stream before delivering it to the application.

3.6.2 Flink

Flink [4] is a popular open-source platform for distributed big data processing. Flink is integrated with many other open-source projects, namely Kafka, YARN, HDFS and Hadoop. Flink also has several APIs and libraries; Gelly is a bundled library for graph processing. Flink provides a true low-latency stream processing platform, which means data elements are pipelined immediately as soon as they are delivered to the platform. Consequently, Flink can perform flexible window operations on streams. Windows can be customized with triggering conditions. It also provides iterative computations for machine learning and graph analysis.

3.6.3 PowerGraph

PowerGraph [26] is a high-performance distributed graph processing framework written in C++. PowerGraph has built-in implementations of the most state-of-the-art streaming graph partitioning algorithms and distributed graph algorithms. Among the aforementioned frameworks, PowerGraph is focused on vertex-cut streaming graph partitioning of power-law graphs, whereas Flink and Spark are general frameworks in the area of stream graph processing.

In summary, PowerGraph is a domain-specific framework for vertex-cut partitioning of power-law graphs and defines the state of the art of graph computation frameworks. Moreover, it provides a low-latency streaming platform that can be used in use cases where low latency is a key factor.

Chapter 4

Solution

In this chapter, we detail our solution to the problem of running several parallel partitioners while maintaining high performance. We also justify our design choices and provide a sample graph partitioning.

4.1 State Management

As explained in Chapter 3, heuristics maintain a state during partitioning. This state includes partition sizes, vertex copies and partial degrees. In some cases, the size of the state is so large that it cannot be kept in the memory of a single machine. Furthermore, using one partitioner provides a low throughput.

In many real-world scenarios, several instances of a streaming graph partitioning algorithm are run simultaneously to improve the system throughput. In current practice, each partitioner is a stand-alone process which works independently from all other partitioners. In this work, we refer to this approach as 'Oblivious'. In the Oblivious approach, partitioners do not share any state.

Although running several instances of a partitioner in an Oblivious fashion improves the throughput and consequently decreases the partitioning time, the partitioning quality drops drastically. The drop is due to the partitioners' incomplete information. For instance, two partitioners might replicate a vertex in two disjoint subsets of partitions. This could be avoided if they were aware of the current replicas created by the other partitioners.

To tackle this problem and improve the partitioning quality, partitioners can share state frequently. However, frequent reads from and writes to a shared state add a lot of network and communication overhead. Therefore, a windowing mechanism, as explained in Chapter 3, is used in combination with the shared state. In this approach, partitioners buffer edges and vertices in a window and update the shared state in batches. Although the windowing mechanism improves performance, it is still much slower than the Oblivious way of running partitioners.


In this work, we introduce a new algorithm where only the list of replicas is stored in the shared state. We show that this approach improves the performance of the partitioning algorithm, decreases the communication load, and preserves the partitioning quality.

4.2 SharedState+

SharedState+ is our modification of the HoVerCut (SharedState) framework in which only the list of replicas is stored in the shared state. We will show in the experiments that this modification decreases partitioning time and communication load while preserving partitioning quality.

The idea behind this modification is that in most vertex-cut graph partitioning algorithms, the current replicas are the key factor in the decision-making process of assigning edges to partitions. For instance, in the Greedy algorithm, if there is a partition in which both vertices of an edge have been replicated, the edge will be assigned to it independently of the partition sizes. Similarly, in HDRF under the same conditions, the edge will be assigned to that partition independently of both partition sizes and vertex degrees.

Partitioners in SharedState+ store partial degrees and partition sizes locally, and when the score for two partitions is the same, they use these local estimates to break the tie.

The only drawback of SharedState+ is a higher memory consumption compared to SharedState. It is worth mentioning that, despite using local storage, the memory consumption of SharedState+ is remarkably lower than that of the input partitioning algorithm, where all information is stored locally. More details on the SharedState+ implementation are provided in Section 5.2.

4.3 Sample partitioning

To clarify the process of partitioning and to explain each step in detail, we go through the process of partitioning a sample graph. The graph shown in Figure 4.1a, with size six, is to be partitioned into two partitions. For simplicity and without loss of generality, we assume that there are only two partitioners, each of which maintains one partition. To make it easy to understand, we refer to the partitioners and their corresponding partitions by the colours green and red.

Moreover, the partitioners form a Distributed Hash Table (DHT) for storing the states of the vertices. Each partitioner is responsible for storing the states of half of the vertices. In this example, the red vertices {a, c, e} are stored by the red partitioner and the green vertices {b, d, f} are stored by the green partitioner. This means that the hash function used for the DHT returns even numbers for {a, c, e} and odd numbers for {b, d, f}; the hash value modulo two equals the id of the responsible partitioner in the DHT, as shown in Equation 4.1.

hash(v) % 2 = 0  if v ∈ {a, c, e},
hash(v) % 2 = 1  if v ∈ {b, d, f}.    (4.1)

Edges of the original graph are streamed into the partitioners concurrently and randomly, but each edge is streamed only once and only to one partitioner. In this example, we assume that the dashed edges are streamed into the red partitioner (Figure 4.1c) and the dotted edges are streamed into the green partitioner (Figure 4.1d). The overall division of the edges is displayed in Figure 4.1b.

[Figure 4.1: Sample graph streaming. (a) original graph; (b) streamed graph; (c) stream of edges to the red partitioner; (d) stream of edges to the green partitioner.]

Each partitioner checks whether its window is full each time it receives a new edge. When the window becomes full or when there are no more edges available (Algorithm 1), the partitioner invokes the partitioning method.

In the partitioning method (Algorithm 2), first, the partial degrees are updated based on the edges in the window. Depending on where the partial degrees are stored, the shared state or the local state is updated.


After the update, all necessary information for partitioning is read from the DHT (Algorithm 2, line 4) and the partitioning process is initiated. In this particular example, the red partitioner needs to read the state of vertices a, b, c, d and e from the DHT. The information of vertices a, c and e is locally available, as this partitioner stores the information of these nodes in the DHT. The partitioner then sends a message to the green partitioner to read the state of vertices b and d. The second partitioner sends a similar message to the first partitioner to read the state of vertices c and e. After gathering all the necessary information for partitioning, each partitioner assigns partitions to its buffered edges (Algorithm 2, lines 5-7).

Figure 4.2a and Figure 4.2b show the edge assignments of the red and green partitioners respectively. Red edges were assigned to the red partition and green edges were assigned to the green partition. At this point, the partitioners transfer the assigned edges to their destinations (Algorithm 2, lines 8-10). In this example, the red partitioner sends edge (b, c) to the green partitioner, and the green partitioner sends edge (e, f) to the red partitioner. The result after this communication is shown in Figures 4.2c and 4.2d.

At this point, the partitions are ready and the partitioning phase has finished. The last step of the algorithm is to update the replicas of the vertices in the DHT (Algorithm 2, line 11, which triggers Algorithm 4). In this example, the red partitioner updates the lists of replicas of vertices a, c and e locally and sends a message to the green partitioner to update the lists of replicas of vertices b and d. Similarly, the green partitioner updates the lists of replicas of the green nodes b, d and f locally and sends a message to the red partitioner to update the lists of replicas of vertices c and e.

Figure 4.3 shows the replicas of each vertex after partitioning, where dotted replicas are the replicas stored outside their responsible node in the DHT. In this sample partitioning, the replication factor is calculated as shown in Equation 4.2.

RF = (Σ_{v∈V} Replicas(v)) / |V| = (1 + 1 + 2 + 2 + 1 + 2) / 6 = 1.5    (4.2)
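The following C++ sketch (illustrative only) computes the replication factor of Equation 4.2 from per-vertex replica sets; the concrete replica lists are chosen to reproduce the value 1.5 above and are not taken verbatim from Figure 4.3.

#include <cstdio>
#include <map>
#include <set>

double replication_factor(const std::map<char, std::set<int>>& replicas) {
    std::size_t total = 0;
    for (const auto& kv : replicas) total += kv.second.size();  // Σ Replicas(v)
    return static_cast<double>(total) / static_cast<double>(replicas.size());
}

int main() {
    // 0 = red partition, 1 = green partition (assumed labels).
    std::map<char, std::set<int>> replicas = {
        {'a', {0}}, {'b', {1}}, {'c', {0, 1}},
        {'d', {0, 1}}, {'e', {0, 1}}, {'f', {0}}};
    std::printf("RF = %.2f\n", replication_factor(replicas));  // (1+1+2+2+2+1)/6 = 1.5
    return 0;
}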


[Figure 4.2: Sample graph partitioning. (a) edge assignment in p1; (b) edge assignment in p2; (c) red partition; (d) green partition.]

[Figure 4.3: Sample graph partitioning - Replications. (a) replicas in the red partition; (b) replicas in the green partition.]


Chapter 5

Implementation

In this chapter, we go through the design and implementation of the shared-state algorithms in the PowerGraph framework. Moreover, we motivate our design choices and discuss alternative solutions.

5.1 Shared State

In this section, we review alternative design solutions for implementing a shared state while also justifying our choice of a distributed hash table. The main purpose of the shared state is to share information between partitioners and obtain a better global view of the ongoing partitioning. There are several methods to implement a shared state. However, in distributed environments, when there are many processes communicating with the shared state, the most important property that needs to be considered is the scalability of the state server.

The simplest solution is to have a separate server that keeps the state information of all vertices. Each partitioner is responsible for updating the state server when necessary. Due to the high communication load and the massive number of read/write requests, this approach does not scale well. Moreover, it breaks the 'no single point of failure' design principle. Furthermore, the high number of requests requires a lot of synchronisation to avoid race conditions.

An alternative to the centralised approach is to consider distributed shared states. Distributed shared states are managed by several computers. Among the many available distributed shared states, we chose a distributed hash table (DHT), which fits the environment and is highly scalable. In our environments, where partitioning time is relatively short, information does not need to be persisted, i.e., it can be kept in memory during the partitioning. These requirements led us to the design of an in-memory distributed hash table. Another reason for using a distributed hash table is that it can be implemented as a component in the software stack which each partitioner executes.


Due to the short life cycle of the shared state, and in order not to compromise performance, only one replica of the information is stored. Each partitioner is responsible for maintaining the information of some of the vertices. For assigning vertices to partitioners, a uniform hash function is used. To find the responsible partitioner for a specific vertex, the hash value of the vertex id is calculated; the remainder of the hash value divided by the number of partitioners is then the id of the responsible partitioner. The uniformity of the hash function ensures equal load on the partitioners. As detailed in Section 2.4.4, only vertex information is stored in the shared state.
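A minimal C++ sketch of this ownership rule follows; it is illustrative rather than the thesis implementation, with VertexState and the per-partitioner shards standing in for the real in-memory DHT.

#include <cstdio>
#include <functional>
#include <set>
#include <unordered_map>

struct VertexState {             // the per-vertex state kept in the DHT
    std::set<int> replicas;      // partitions holding a replica of the vertex
};

// hash(vertex id) modulo the number of partitioners gives the responsible partitioner.
int owner_of(long vertex_id, int num_partitioners) {
    return static_cast<int>(std::hash<long>{}(vertex_id) % num_partitioners);
}

int main() {
    const int partitioners = 2;
    // One local shard of the DHT per partitioner.
    std::unordered_map<long, VertexState> shard[partitioners];

    long vertices[] = {10, 11, 12, 13, 14, 15};
    for (long v : vertices) {
        int owner = owner_of(v, partitioners);
        shard[owner][v];         // create this vertex's (initially empty) state on its shard
        std::printf("vertex %ld is owned by partitioner %d\n", v, owner);
    }
    return 0;
}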

5.2 The implementation

In this section, we detail the design and implementation of SharedState(H) and SharedState+(H) (the shared-state algorithms). Pseudocode for the main parts of the algorithms is provided. In the rest of this work, we use the terms buffer and window interchangeably.

As explained in the previous chapters, the shared-state algorithms use a windowing method to buffer edges and process them in batches. Partitioners receive edges as a stream and put them in a local buffer; when the buffer is full or no more edges are available, the partitioning process is triggered. Algorithm 1 shows the steps of buffering edges and triggering the partitioning algorithm.

Algorithm 1 Buffering algorithm.

1: procedure Load
2:   while reader.hasnext() do
3:     e ← reader.next()                ▷ read the next edge
4:     window.add(e)
5:     if window.isfull() or reader.end() then
6:       partition(window)
7:       window.clear()
8:     end if
9:   end while
10: end procedure

Algorithm 2 shows the partitioning part of SharedState(H). First, the partitioner calculates the changes in vertex degrees based on the edges in the window. After that, it updates the shared state with this data. The update is made by sending an update message. The update message is an array of pairs, where each pair consists of a vertex id and the corresponding change in its degree.

In the next step, the partitioner reads all necessary information from the shared state in order to partition the edges in the buffer. The response from the shared state contains the replicas of the corresponding vertices, the partial degrees, and the partition sizes. Each edge is then assigned to a partition based on this information using the actual partitioning algorithm (in this case HDRF). After partitioning, the edges are sent to the partitions they were assigned to. Finally, the partitioner updates the shared state by sending a message to the partitioners responsible for the vertices connected to the newly partitioned edges.

Algorithm 2 SharedState(H) partitioning algorithm.

1: procedure partition(Window window)
2:   deltas ← calculateDeltas(window)              ▷ calculate degree changes
3:   SharedState.update(window.vertices, deltas)   ▷ update shared state
4:   state ← SharedState.read(window.vertices)     ▷ read updated state from shared state
5:   for e ∈ window.edges do
6:     e.p ← assignPartition(e, state)             ▷ any partitioning algorithm, for instance HDRF
7:   end for
8:   for e ∈ window.edges do
9:     send(e.p, e)                                ▷ send edges to their assigned partitions
10:  end for
11:  SharedState.update(window)                    ▷ update shared state with new replicas and partition sizes
12: end procedure

Sending and receiving messages is performed asynchronously and in parallel with the processing of edges. Each partitioner has a thread for sending and receiving messages. After the edges are sent to their assigned partitions, the target partitioner receives them and runs Algorithm 3. In this algorithm, the target partitioner updates its local graph with the received edges. Local graphs are used for running the actual target algorithms after the partitioning phase. The combination of all local graphs forms the original graph.

Algorithm 3 Update local graph

1: procedure Receive(Set edges)
2:   LocalGraph.Update(edges)   ▷ updates local graph and partition size
3: end procedure

Algorithm 4 is executed by the responsible partitioner each time the shared state is updated. The algorithm updates the list of replicas of each newly added vertex. The steps of this procedure are shown in Algorithm 4.

Algorithm 4 Update list of replicas

1: procedure Update(assignedPartition p, Set vertices)
2:   for v ∈ vertices do
3:     if !v.replicas.contains(p) then
4:       v.replicas.add(p)
5:     end if
6:   end for
7: end procedure

The implementation of SharedState+(H) is slightly different from that of SharedState(H). SharedState+(H) sends fewer requests and updates to the shared state; instead, the partitioners need to maintain their own local storage. The buffering algorithm is the same in both SharedState(H) and SharedState+(H). In the partitioning procedure, the first update only exists in the implementation of SharedState(H), because in SharedState+(H) the partial degrees are maintained locally and, instead of this step, every partitioner updates its local state.

Algorithm 5 shows the steps of partitioning in SharedState+(H). Partitioners in SharedState+(H) store partial degrees and partition sizes locally, therefore no write operation is performed on the shared state before the actual partitioning. Instead, they update their local state. Thereafter, the partitioners read the replication information of the vertices in the window. The rest of the algorithm is the same as SharedState(H), except for the update operation after partitioning, which in SharedState+(H) sends only replication information to the shared state.

Algorithm 5 SharedState+(H) partitioning algorithm.

1: procedure partition(Window window)
2:   deltas ← calculateDeltas(window)              ▷ calculate degree changes
3:   LocalState.update(window.vertices, deltas)    ▷ update local state
4:   state ← SharedState.read(window.vertices)     ▷ read replication information from shared state
5:   for e ∈ window do
6:     e.p ← assignPartition(e, state, localState) ▷ any partitioning algorithm, for instance HDRF
7:   end for
8:   for e ∈ window do
9:     send(e.p, e)                                ▷ send edges to their assigned partitions
10:  end for
11:  SharedState.update(window)                    ▷ update shared state with new replicas
12:  LocalState.update(window)                     ▷ update partial degrees and partition sizes
13: end procedure

Chapter 6

Evaluation Methodology

To study the state-of-the-art streaming vertex-cut graph partitioning algorithms in real-world environments, we implemented several stream graph partitioning algorithms in a distributed framework. We also used a Distributed Hash Table (DHT) as the shared state to implement our modified version of the shared-state algorithm. To compare these algorithms, we ran each partitioning algorithm on four graph datasets in different cloud environments. Several performance metrics were used to compare the results obtained from running the algorithms. Moreover, we studied the impact of partitioning quality on the execution time of various graph algorithms (target algorithms), and the impact of maintaining only replication information in the shared state compared to the existing approaches, where all vital information, i.e. replication information, partial degrees and partition sizes, is stored in the shared state.

In this chapter, we explain the algorithms that were used and the metrics that were measured during the experiments, and discuss the datasets along with some of their characteristics. We elaborate on the evaluation environment and the experiment setups, and briefly motivate each set of experiments we performed and explain their purpose.

6.1 Overview

We used a broad range of scenarios to compare the partitioning algorithms. We selected four different real-world power-law datasets with different characteristics. For details about the datasets, refer to Section 6.2. We partitioned each dataset with five different partitioning algorithms, namely Random, Greedy, HDRF, SharedState(H) and SharedState+(H). These algorithms are briefly explained in Chapter 3. We used the partitioned graphs to run different graph algorithms with different communication loads. These target algorithms, along with some of their characteristics, are explained in Section 6.4. Both partitioning algorithm and target algorithms were
