Evaluation and Optimization of Execution Plans for Fixpoint Iterative Algorithms in Large-Scale Graph Processing


DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2016


RICCARDO DIOMEDI


Abstract

In large-scale graph processing, a fixpoint iterative algorithm is a set of operations in which iterative computation is the core: repetitive operations refine a set of parameter values until a fixed point is reached. To describe fixpoint iterative algorithms, template execution plans have been developed. For an iterative algorithm, an execution plan is a set of dataflow operators describing the way in which the parameters have to be processed in order to implement the algorithm.

In the Bulk iterative execution plan, all the parameters are recomputed in each iteration. The Dependency plan calculates dependencies among the vertices of a graph in order to iteratively update fewer parameters during each step. To do so, it performs an extra pre-processing phase. This phase, however, is a demanding task, especially in the first iterations, where the amount of data is considerable.

We describe two methods to address the pre-processing step of the Dependency plan. The first exploits an optimizer that allows the plan to be switched at runtime, based on a cost model. We develop three cost models taking into account various features that characterise the cost of a plan. The second method introduces optimizations that bypass the pre-processing phase. All of these implementations are based on caching parameter values and are therefore memory-hungry.

The experiments show that, while the alternative implementation of the Dependency plan does not give the expected results in terms of per-iteration time, the cost models are able to refine the existing basic cost model, increasing its accuracy. Furthermore, we demonstrate that switching plans at runtime is a successful strategy to decrease the whole execution time and improve performance.

(4)

Referat

Fixpoint iterative algorithms are a typical example of large-scale graph processing, where iterative computation is the core. Their aim is to perform repeated operations, refining a set of parameter values until a fixed point is reached. To model fixpoint iterative algorithms, template execution plans have been developed. An execution plan is a set of dataflow operators describing the way in which the parameters have to be processed in order to implement such algorithms.

In the Bulk iterative execution plan, all parameters are recomputed in every iteration. The Dependency execution plan computes dependencies between the vertices of a graph in order to iteratively update fewer parameters in each step. To do this, it performs an extra pre-processing phase. This phase is demanding, however, particularly during the first iterations, where the amount of data is considerable.

We describe two methods to address the pre-processing step of the Dependency execution plan. The first exploits an optimizer that allows the execution plan to be switched at runtime, based on a cost model. We develop three cost models that take into account different features characterising the cost of a plan. The second method introduces optimizations that bypass the pre-processing step. All implementations are based on caching parameter values and are therefore memory-intensive.

The experiments show that, even though the alternative implementation of the Dependency plan does not give the expected results in terms of per-iteration time, the cost models can refine the accuracy of the existing basic cost model. Furthermore, we show that switching plans at runtime is a successful strategy for reducing the overall execution time and improving performance.

(5)

Contents

List of Figures

List of Tables

1 Introduction
   1.1 Background
   1.2 Notation
   1.3 Problem Statement
   1.4 Goal
   1.5 Methodology
   1.6 Related Works
   1.7 Outline

2 Theoretical Background
   2.1 Apache Flink and Iterate operators
   2.2 Iterative Execution Plans
   2.3 Connected Components in Apache Flink

3 Methods and Implementations
   3.1 Cost Model
      3.1.1 Legend
      3.1.2 Basic Cost Model
      3.1.3 Cost Model with Shipping Strategies
      3.1.4 Cost Model with Rebuilding Solution
      3.1.5 Cost Model with Cost Operation Discrimination
   3.2 Alternative Dependency Plans
      3.2.1 Alternative Dependency plan with cached neighbors' values
      3.2.2 Alternative Dependency plan with cached neighbors
      3.2.3 Alternative Dependency with per-partition HashMap

4 Results
   4.1 Execution Environment Settings
   4.2 Applications and Datasets
      4.2.1 Connected Components (CC)
      4.2.2 PageRank (PR)
      4.2.3 Label Propagation (LP)
      4.2.4 Community Detection (CD)
   4.3 Tests Setting
   4.4 Tests Results
      4.4.1 Alternative Dependency plan tests
      4.4.2 Cost Model tests
   4.5 Discussion

5 Conclusion and Future Work

Bibliography

List of Figures

1.1 Distributed Execution of Apache Flink
1.2 Dependency Plan emphasizing the pre-processing phase
1.3 Connected Components algorithm applied to the LiveJournal dataset with three different plans, emphasizing the difference in execution time between Bulk and Dependency
2.1 Flink's Stack
2.2 Bulk Plan
2.3 Dependency Plan
2.4 An example explaining the difference between the Bulk plan and the Incremental plan. Node B is shown in red because it is the only node that changed its value in the previous iteration
2.5 Incremental/Delta Plan
2.6 Connected Components example
3.1 Cost Models Switching
3.2 Bulk plan with shipping strategies highlighted
3.3 Plan-switching never occurs if D ≥ 2S
3.4 Node with cached values; only the changed in-neighbors send their values again
3.5 Comparison between the first implementation with the join and the second with HashSet
3.6 Join-Reduce Dataflow and per-partition HashMap
3.7 Example of Join-Reduce Dataflow with per-partition HashMap
4.1 Connected Components algorithm, LiveJournal
4.2 Community Detection algorithm, Orkut
4.3 Per-iteration time of the updating-hashmaps implementation on the LiveJournal dataset, Connected Components algorithm
4.4 PageRank algorithm, LiveJournal
4.5 Label Propagation algorithm, LiveJournal
4.6 Community Detection algorithm, LiveJournal
4.7 Connected Components algorithm, Orkut
4.8 PageRank algorithm, Orkut
4.9 Label Propagation algorithm, Orkut
4.10 Per-iteration time of PageRank algorithm
4.11 Cost Models on LiveJournal dataset
4.12 Cost Models on Orkut dataset
4.13 CC algorithm, Orkut dataset

List of Tables

4.1 Total iterations and crossing-point for the LiveJournal dataset
4.2 Total iterations and crossing-point for the Orkut dataset


Glossary

CC Connected Components

CD Community Detection

DAG Directed Acyclic Graph

HDFS Hadoop Distributed File System

JVM Java Virtual Machine

LP Label Propagation

ML Machine Learning

PR PageRank

UDF User-Defined Function


Chapter 1

Introduction

In graph theory, graphs are mathematical data structures used to model the relations among objects. A graph is composed of vertices, or nodes, connected by edges, or arcs. Finding the shortest path from one city to another, ranking web pages, or analyzing friendship links among the users of a social network are just some of the use-cases easily modeled with graphs.

For decades, graph processing and analysis have been useful tools not only in computer science but also in other domains[1, 3]. Nowadays, with graphs growing to billions or even trillions of nodes and edges, processing graph-structured data has become challenging. Furthermore, processing a large graph presents drawbacks such as poor locality of memory access, a high ratio of data access to computation, and an irregular degree of parallelism during execution[15].

Large-scale graph processing is the term used to refer to processing amounts of graph-structured data too large to be processed on a single machine.

For this reason, there has recently been a rise in the use of large-scale graph systems and general-purpose distributed systems, as both are able to handle enormous graphs by deploying state-of-the-art distributed graph processing models[10, 9, 14].

In fact, it is common to partition and then distribute graph-structured data among multiple machines instead of processing it on a single one[15]. There are many reasons for adopting such an approach. First of all, if the amount of data is huge, it may not fit on a single machine, so it is preferable to split it across multiple machines. Furthermore, processing data in a distributed environment makes it possible to better exploit parallelism and, if the degree of parallelism of the program changes over the execution time, this environment turns out to be more flexible[15]. Moreover, having multiple small commodity machines is often cheaper than having a single machine with a high memory capacity and a fast processor[15].


1.1 Background

In the past years, high-level data-parallel frameworks have been widely exploited to process and analyze data-intensive applications. One of the most famous, and most used for this purpose, is MapReduce[4].

One type of data-intensive application is the iterative application, which repeatedly executes a set of operations. Typical examples are value-propagation algorithms like Connected Components (CC) or Machine Learning (ML) algorithms like Random Forest or Alternating Least Squares. However, the MapReduce framework performs poorly with iterative applications because of its inefficiency in performing iterative operations[14]. Hence, programming abstractions like GraphX and Pregel have been developed, which are able to handle these applications at large data sizes[5, 16, 14].

An iteration-based algorithm is composed of several operations performed until a final condition is met. Well-known iteration-based algorithms are fixpoint iterative algorithms, where input parameter values are refined in each iteration and the whole execution terminates when a convergence condition is met. The convergence condition in fixpoint iterative algorithms is often defined by the number of converged parameters: when all the parameters have converged to a fixpoint, the execution terminates.

The parameters of a fixpoint iterative algorithm applied to a graph problem are the vertices of the graph. In detail, in a fixpoint iterative graph algorithm we define two input sets: the Solution set and the Dependency set. The Solution set contains all the nodes of the graph with their related values, while the Dependency set contains all the edges of the graph. At the end of the execution, the Solution set holds the final solution of the fixpoint algorithm, i.e. all the vertices with converged values. We often refer to the Solution set as the Parameter set because it contains the parameters of the problem. In a fixpoint iterative graph algorithm, refining the parameter values means refining the value of each vertex until a fixpoint is reached.

In order to modify the values of the vertices and reach the fixpoint, an iterative execution plan has to be defined. An iterative execution plan is a set of dataflow operators describing the way in which the vertices have to be processed in order to implement a fixpoint graph algorithm. Each plan is composed of two parts: a constant part that is the same for every fixpoint algorithm, and a variable part defined by the User-Defined Function (UDF), i.e. a function defined by the user that implements the purpose of the specific fixpoint algorithm.

The execution plans differ according to the operations involved in the dataflow and, consequently, to the implemented logic. In this work we always refer to four specific plans, analyzed in depth in Section 2.2. The four plans are:

• Bulk plan: the logic of this plan is to update all the nodes’ values in each iteration.

• Dependency plan: it updates only those vertices that need to be recomputed, by filtering the Dependency set. The vertices that need to be recomputed are all those that have not yet reached their fixed value.

• Incremental and Delta plans: these two plans have the same logic as the Dependency plan, but they can only be applied under certain constraints[11].

Incremental and Delta plans usually outperform the other two, but because of their constraints they cannot be applied to every fixpoint graph algorithm. In those cases, only the Bulk and Dependency plans can be used. Since more than one plan can be used to solve a fixpoint iterative algorithm, choosing a suitable one is an essential task for both the computational time and the computational resources of the system.

The platform used in this work is Apache Flink[6], an open source platform for distributed stream and batch data processing. It gives the possibility to manipulate and process data by applying map, reduce, join and many other operators. The most important feature that makes this platform suitable for fixpoint iterative algorithms is that it natively supports iterative applications.

In general, input datasets are fed to the platform and then transformed by operators: in particular, the output of one operator is the input of another. The set of all operators is commonly represented as a Directed Acyclic Graph (DAG), which indicates the execution plan of the system.

The Flink runtime is composed of two kinds of processes: master processes and worker processes. A master process, also called JobManager, schedules and coordinates a job; there is always at least one master process. A worker process, also called TaskManager, executes tasks of the dataflow operators; there is always at least one worker process. A worker is a Java Virtual Machine (JVM) process composed of one or more task slots, in which it executes subtasks of the dataflow operators. Before the execution of a job, both the workload of the dataflow and the input datasets are partitioned and distributed among the available TaskManagers and, consequently, among the available task slots. During the execution, TaskManagers also send and receive subsets of the input datasets from other TaskManagers over the network in order to perform their assigned subtasks. Figure 1.1 shows the interaction among the components of Apache Flink when executing a job. Distributing both the dataflow operators and the input datasets allows the job execution to be performed and terminated.

In an Apache Flink environment, given the Solution set and the Dependency set as input, the Bulk plan exchanges the vertex values, runs the UDF and finally updates the Solution set entirely. The Dependency plan performs an extra phase in which it evaluates which vertices are likely to be recomputed in the next iteration by refining the Dependency set; it then updates only a subset of the Solution set rather than all of it. Even though, in general, the Dependency plan performs better than Bulk in terms of total execution time, because it recomputes fewer nodes, it spends a lot of time retrieving the vertices that need recomputation. This thesis tries to address this pre-processing problem of the Dependency plan.

Both Apache Flink and the execution plans concerned in this thesis are explained in depth in Chapter 2.


Figure 1.1: Distributed Execution of Apache Flink

1.2 Notation

A fixpoint iterative graph problem can be defined by the following constructs:

• Vertex/Node set ≡ Parameter set. The set of the nodes of a graph, where each node is identified by a unique ID and has a value related to the node itself. A record in this set is represented as (vertexID, vertexValue). We also use another name for the Parameter set, namely the Solution set.

• Edge set ≡ Dependency set. The set of all the edges of the graph, highlighting the relationships among the nodes. Each entry is composed of a source ID, a target ID and possibly a value related to the edge. A record of this set is represented as (sourceID, targetID, edgeValue).

• UDF ≡ step function. A function defined by the user and tailored to the specific algorithm. It describes how a vertex value is updated according to its neighbors' values. A minimal sketch of how these constructs can be declared with Flink's Java API is shown below.
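As an illustration only (not code from the thesis), the three constructs can be declared with Flink's DataSet API as in the following sketch; the concrete types (Long IDs and values, Double edge values) and the toy data are assumptions made for this example.

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple3;

public class NotationSketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Vertex/Node set (Parameter set, Solution set): records of (vertexID, vertexValue).
        DataSet<Tuple2<Long, Long>> solutionSet = env.fromElements(
                Tuple2.of(1L, 1L), Tuple2.of(2L, 2L), Tuple2.of(3L, 3L));

        // Edge set (Dependency set): records of (sourceID, targetID, edgeValue).
        DataSet<Tuple3<Long, Long, Double>> dependencySet = env.fromElements(
                Tuple3.of(1L, 2L, 1.0), Tuple3.of(2L, 3L, 1.0));

        // The UDF (step function) would later be expressed as joins/reduces over these two sets.
        solutionSet.print();
        dependencySet.print();
    }
}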

1.3 Problem Statement

While the Bulk and Dependency plans can always be used to implement a fixpoint graph algorithm, the Incremental and Delta plans cannot, because of their constraints on the UDF. In detail, in order to apply either the Incremental or the Delta plan, the UDF of the fixpoint algorithm must be idempotent and weakly monotonic (see Section 2.2).


Figure 1.2: Dependency Plan emphasizing the pre-processing phase

In order to update fewer nodes in each iteration, the Dependency plan performs an extra phase, called pre-processing, in which it calculates the nodes that need to be updated and filters the Dependency set. Since not all the nodes of a graph change value in an iteration, it is possible to include in the computation only the smaller number of nodes that need recomputation. The value of a vertex might not change since the last iteration for two reasons:

• none of its neighbor vertices has modified its value;

• the new value is equal to the old one.

While the Bulk plan considers all the nodes and all the edges in the computation, the Dependency plan is able to calculate the nodes that need to be recomputed and thus excludes converged nodes from the computation. However, calculating those nodes requires additional operations. For this reason, the Dependency dataflow is composed of more operations than the Bulk plan. Even though, in general, the Dependency plan outperforms Bulk because it recomputes fewer parameters, sometimes the number of parameters is so high that the plan spends a lot of time performing these extra tasks. To sum up, in order to compute fewer nodes in each iteration, the Dependency plan must perform more operations than Bulk.

We can therefore state that the pre-processing phase is a drawback for the Dependency plan, especially when the sizes of the datasets involved in the computation are large.

Figure 1.2, taken from [11], highlights the demanding pre-processing phase that represents the weak point of the Dependency plan.


Figure 1.3: Connected Components algorithm applied to the LiveJournal dataset with three different plans, emphasizing the difference in execution time between Bulk and Dependency

Other remarkable problems that we have to deal with during the implementation phase are network and memory overhead. Exchanging data among the machines of the system, especially when the data need to be partitioned, requires a lot of effort, because network communication is a bottleneck for the system.

We face memory overhead instead when the amount of input data is very high. In that case, the main risk is that the system runs out of available memory and has to spill some data to disk, decreasing the system performance.

Overall, the question is: considering all the constraints that we have to face in this environment, is it possible to reduce the total execution time, especially of the Dependency plan, and so increase the performance of the system?

1.4 Goal

Figure 1.3, also taken from [11], shows the iteration time of three different plans, applying the CC algorithm to the LiveJournal dataset[22]: the Bulk, Dependency and Incremental plans. Leaving the Incremental plan aside, what stands out is the clear difference in time between Bulk and Dependency in the first few iterations.

The goal of the thesis project is to reduce the total execution time of a fixpoint algorithm when the Incremental and Delta plans cannot be used; in other words, to reduce the execution time by using either the Bulk or the Dependency plan. We want to accomplish this goal either by avoiding the execution of the Dependency plan in the first iterations, using the Bulk plan instead, or by trying to bypass the pre-processing phase of the Dependency plan by exploiting some strategies. Overall, we want to push the performance of the system forward, decreasing the whole execution time and consequently saving resources.

1.5 Methodology

Regarding the methods that we want to apply in order to reach our goal, a first strategy is to quantitatively evaluate the cost of both the Bulk and the Dependency plan, taking into account the main factors affecting the iteration execution time, and then decide which one to perform in the next iteration. Basically, the intention is to study the cost factors of each plan and then develop a cost-based optimizer that is able to automatically select and switch plans. This method should lead to a lower total execution time by avoiding redundant computations.

Furthermore, since the Dependency plan outperforms the Bulk plan from a certain iteration onward, another approach is to modify the Dependency plan so that it bypasses the pre-processing phase. The results coming from both the alternative Dependency plan and the standard one are then formally and experimentally compared in order to assess whether a reduction of the iteration execution time can be achieved.

Using both methods and mixing the two strategies could lead to the goal.

1.6 Related Works

Excluding [11], which is the basis of this work, there are no works directly linked to an in-depth study of execution-plan cost for fixpoint iterative graph algorithms in a distributed batch data processing environment.

Starting from [11], four execution plans for large-scale graph processing are presented: the Bulk plan, Dependency plan, Incremental plan and Delta plan.

While the first two can be adopted in every case, the last two can only be applied under specific constraints, explained in Section 2.2. The execution model and the main differences between the plans have been highlighted by implementing some of the most well-known fixpoint iterative graph algorithms, such as PageRank (PR) and CC, and by showing that the Incremental and Delta plans, when they can be applied, outperform the other two.

Pregel is one of the most popular high-level frameworks for large-scale graph processing[16]. The framework consists of a sequence of iterations called supersteps where, in parallel, vertices receive the messages sent by their in-neighbors in the previous step, run a UDF and forward messages to their out-neighbors. In this work a vertex-centric approach is used, i.e. the focus is on the local computation of each vertex, which is independent of the computation of the other vertices. If a vertex does not receive a message in a superstep, it is automatically deactivated. A deactivated node can be reactivated only if it receives a message from another node. The iteration terminates when all nodes are deactivated and there are no more messages in transit.

Pregel implements the Incremental plan by default because only the nodes that have received at least one message from an in-neighbor are considered active.

The model of computation used in that work and the one adopted in this thesis are similar. Abstractly, the message-passing model of Pregel is a notable reference for our model of computation, in which the nodes of the graph first exchange values among themselves and then execute the UDF.

Message passing among different machines constitutes a large overhead. If, for instance, the sum of the values of the received messages is the information that matters for the computation, then the messages directed to a target node can be combined into a single one before sending, in order to decrease the communication overhead. This task is performed by combiners. Aggregators are another feature of this framework and may be a valid reference for our work because they are able to compute statistics on the graph and evaluate its distribution. Aggregators are also implemented in Apache Flink[6].

Several systems have been developed using Delta-plan execution; a noteworthy one is REX[18]. REX is a good reference because it develops a system for recursive delta-based data-centric computation. In REX, the Delta plan is implemented through user-defined annotations, where a delta, i.e. the difference between the latest value and the old one, is seen as an annotated tuple. An annotation tells the operator which operation, such as insertion, deletion, replacement or a UDF, has to be performed on that tuple. Operators keep the state of unchanged nodes and modify records only according to the delta.

The REX optimizer takes into account network, disk and CPU costs, as well as different interleavings of joins and UDFs, where predicates that are cheaper or that discard more records should be executed first. Furthermore, for deterministic functions, the constant parts should be cached in order to reuse data with lower overhead. Regarding the recursive computation, since it is composed of a base case and a recursive case, the optimizer first estimates the cost of the base case and obtains the optimal plan for it, then considers it as an input of the recursive case and again retrieves the optimal plan.

This is repeated at each iteration, considering the previous plan as an input, and the costs are calculated accordingly. In the end, the optimal plan is chosen.

Literature [5] explains a method to integrate Bulk and Incremental iterations with parallel dataflows. Parallel dataflow is a programming paradigm which models a program as a DAG of operators with a source and a sink. This implementation improves the exploitation of the sparse computational dependencies present in many iterative algorithms. Sparse computational dependencies means that, taking a graph as input, the change of a vertex affects only its neighbors.

While the Bulk plan is implemented as in the related work[11], the Incremental one is developed differently, despite being conceptually similar. Further details of all the execution plans are shown in Section 2.2.

To implement an alternative Dependency execution plan, we were inspired by the way this paper implements the Incremental execution plan.


The optimizer estimates the cost of the iterative part and of the non-iterative part in order to pick a plan. Since it is not possible to know a priori how many times the iterative part needs to be computed, the optimizer chooses the iteration plan depending on the cost of the first iteration. Moreover, a second optimization is caching the constant iteration parts. In other words, all the parts that remain constant can be stored in a hash table or a B+ tree, in such a way as to avoid the overhead of recomputing them every time.

1.7 Outline

The thesis starts off with an overview of the theoretical background related to the topic in Chapter 2. The theoretical background is followed by Chapter 3, where an in-depth analysis of the different implementation methods is presented, with particular attention to the studied cost models. In Chapter 4 the experimental results of the implemented methods are presented, followed by a discussion. Lastly, in Chapter 5 a conclusion is drawn and future work is described.


Chapter 2

Theoretical Background

2.1 Apache Flink and Iterate operators

"Apache Flink is an open source platform for distributed stream and batch data processing. Flink’s core is a streaming dataflow engine that provides data distribution, communication, and fault tolerance for dis- tributed computations over data streams. Flink also builds batch pro- cessing on top of the streaming engine, overlaying native iteration sup- port, managed memory, and program optimization.[7]"

Born as an academic project named Stratosphere[2], the Apache Flink platform is now widely exploited for distributed big data analysis and processing. One of the most important features that makes it suitable for iterative applications is its native support for iterations. Similar platforms are Apache Hadoop[24], a framework that allows the distributed processing of large datasets across clusters of computers using simple programming models such as MapReduce[4], and Apache Spark[17], another engine for large-scale data processing.

Figure 2.1 gives an overview of Flink's stack[6]. The core of Flink is a distributed streaming dataflow engine, on top of which there are two APIs, for batch and for streaming processing respectively: the DataSet API and the DataStream API. Once a Flink program is written, employing the operations that the platform provides, it is parsed and a first rough plan is drawn. This first plan does not contain any information regarding the strategy adopted by the system for partitioning the datasets or the specific type of operators needed. The naive plan is then optimized by the Optimizer of the system. The purpose of the Optimizer is to evaluate all the transformations that need to be applied to the input datasets and, consequently, to choose the most appropriate type of operator for each operation and the most appropriate partitioning strategy for each operator.

Flink programs can be written either in Java or in Scala and then deployed in different environments: in a local environment, running in a JVM, in a cluster environment like YARN[23], or in the cloud, exploiting for instance the OpenStack platform[20].


Figure 2.1: Flink’s Stack

Going deeper into the implementation of a Flink program, an input dataset can be created by loading a file from different sources: either from a local file system or from a distributed file system like the Hadoop Distributed File System (HDFS). Input datasets can also be created within a Flink program itself. According to the result that we want to retrieve and the algorithm that we implement, the datasets are then manipulated and transformed by the operators provided by Flink. Some of them are listed below[8]:

• Map and FlatMap operators: map a dataset into another one, possibly with different fields. An example might be to take a record composed of two integer values and return a record with a single value that is the sum of the previous two.

• GroupBy -> Aggregate operator: groups a dataset on a given key and then performs an aggregation operation such as selecting the Max/Min or computing the Sum.

• GroupBy -> Reduce operator: groups a dataset on a given key and then applies a reduce function to each group. The constraint of the Reduce is that if the input type is I, the output type must also be I.

• GroupBy -> Combine operator: groups a dataset on a given key and then applies a combine function to each group. Here the type constraint of the Reduce does not apply.

• Distinct operator: eliminates duplicates within a single dataset.

• Project operator: eliminates one or more fields of a dataset.

• Join operator: joins two or more datasets together and can apply an operation on the intermediate result, such as keeping only some fields in the final result.

• coGroup operator: jointly processes two or more datasets. Its advantage over the join is that the user can iterate separately over the elements of both joined datasets.

Once a dataset has been transformed, the result is returned via a sink, which may be a raw file, a text file, a standard output such as the error or output stream, or a distributed file system like HDFS. A small example chaining a few of these operators and writing to a sink is sketched below.
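The following standalone sketch is an illustration only (not taken from the thesis); the output path and the toy data are assumptions of this example.

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.aggregation.Aggregations;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.core.fs.FileSystem;

public class OperatorChainSketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Toy (key, value) records.
        DataSet<Tuple2<Long, Long>> input = env.fromElements(
                Tuple2.of(1L, 10L), Tuple2.of(1L, 4L), Tuple2.of(2L, 7L));

        // Map: transform each record (here, increment the value).
        DataSet<Tuple2<Long, Long>> mapped = input.map(
                new MapFunction<Tuple2<Long, Long>, Tuple2<Long, Long>>() {
                    @Override
                    public Tuple2<Long, Long> map(Tuple2<Long, Long> t) {
                        return Tuple2.of(t.f0, t.f1 + 1L);
                    }
                });

        // GroupBy -> Aggregate: keep the minimum value per key.
        DataSet<Tuple2<Long, Long>> minPerKey = mapped.groupBy(0).aggregate(Aggregations.MIN, 1);

        // Sink: write the result as text (the path is illustrative).
        minPerKey.writeAsText("/tmp/min-per-key", FileSystem.WriteMode.OVERWRITE);
        env.execute("operator chain sketch");
    }
}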

Apache Flink provides a useful tool to implement iterative algorithms: the iterative operator. Within an iterative operator, the user can define a step function that is repeatedly performed in each iteration until a convergence condition is reached.

Two alternative iterative operators are provided: Iterate and Delta Iterate. In the Iterate operator, the Solution set is entirely replaced by the computed partial solution at the end of each iteration. In the Delta Iterate operator, in addition to the Solution set, a Working set is defined and filled with the parameters that changed their value according to the step function. In each iteration, the Working set is entirely replaced with the newly updated parameters, while in the Solution set only the parameters that changed their value are substituted.

The execution terminates when a convergence condition is fulfilled. Usually, the convergence condition is a maximum number of iterations. Flink also gives the possibility to define a tailored convergence condition that is verified before starting a new iteration; however, this is possible only in the Iterate operator.

The final result is the last partial solution in the case of the Iterate operator, and the state of the Solution set after the last iteration in the case of the Delta Iterate operator. A minimal sketch of the two operators is shown below.
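For reference, here is a minimal sketch of the two operators with Flink's Java DataSet API; the identity-like step functions and the toy data are placeholders of this example, not the plans used in the thesis.

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.operators.DeltaIteration;
import org.apache.flink.api.java.operators.IterativeDataSet;
import org.apache.flink.api.java.tuple.Tuple2;

public class IterateOperatorsSketch {

    // Placeholder step function: returns each record unchanged.
    static class IdentityStep implements MapFunction<Tuple2<Long, Long>, Tuple2<Long, Long>> {
        @Override
        public Tuple2<Long, Long> map(Tuple2<Long, Long> t) {
            return t;
        }
    }

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        int maxIterations = 10;

        DataSet<Tuple2<Long, Long>> vertices = env.fromElements(
                Tuple2.of(1L, 1L), Tuple2.of(2L, 2L), Tuple2.of(3L, 3L));

        // Iterate: the whole partial solution replaces the previous one at each iteration.
        IterativeDataSet<Tuple2<Long, Long>> bulkIteration = vertices.iterate(maxIterations);
        DataSet<Tuple2<Long, Long>> bulkResult =
                bulkIteration.closeWith(bulkIteration.map(new IdentityStep()));
        bulkResult.print();

        // Delta Iterate: a Solution set plus a Working set; only the returned delta is merged
        // into the Solution set, and the new Working set drives the next iteration.
        DeltaIteration<Tuple2<Long, Long>, Tuple2<Long, Long>> deltaIteration =
                vertices.iterateDelta(vertices, maxIterations, 0); // field 0 is the key
        DataSet<Tuple2<Long, Long>> delta = deltaIteration.getWorkset().map(new IdentityStep());
        DataSet<Tuple2<Long, Long>> deltaResult = deltaIteration.closeWith(delta, delta);
        deltaResult.print();
    }
}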

2.2 Iterative Execution Plans

There are four iterative execution plans analyzed in [11]: the Bulk plan, the Dependency plan, the Incremental plan and the Delta plan. An execution plan is a set of dataflow operators describing the way in which the parameters have to be processed in order to implement fixpoint iterative algorithms such as CC, PR or Label Propagation (LP). In this section, we describe both the dataflow shaping each plan and its main operating principles.

Starting with the Bulk, we can define this plan as the simplest one. Figure 2.2 shows its mode of operation. The plan joins the two input datasets, the Solution set S and the Dependency set D, in order to produce the neighbors' values and exchange those values among the vertices. After the join, the intermediate result is used to feed the UDF, or step function, which performs a specific task according to the purpose of the algorithm. The last step is to compare the new partial solution with the old one, updating the Solution set with the new values. In fact, the most relevant feature of this plan is that the Solution set is entirely updated at the end of each iteration: even those vertices that have not been modified are renewed.

Figure 2.2: Bulk Plan

The second plan under focus is the Dependency execution plan. The Dependency plan can be seen as an optimization of the Bulk. In the Bulk plan, all the vertices are updated even if their values have not changed. The value of a vertex might not change since the last iteration for two reasons: either none of its neighbor vertices has modified its value, or the new value is equal to the old one. The Dependency plan therefore avoids recomputing those vertices.

What stands out from Figure 2.3 is that the Dependency plan is a Bulk plan with a pre-processing phase added at the bottom. The purpose of the pre-processing is to find candidates for recomputation, based on the Working set: a subset of the Solution set containing the vertices updated in the previous iteration. In doing so, the Dependency set is shrunk by removing all the edges connected to already computed vertices. As a result, fewer vertices are involved in the computation; the filtering idea is sketched below.
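To make the filtering idea concrete, here is a hedged, standalone sketch (not the thesis's actual dataflow): the Dependency set is joined with the Working set so that only the edges whose source changed in the previous iteration survive, and their targets become re-computation candidates. All names and data are illustrative.

import org.apache.flink.api.common.functions.JoinFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple1;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple3;

public class PreProcessingSketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Working set: vertices whose value changed in the previous iteration.
        DataSet<Tuple2<Long, Long>> workingSet = env.fromElements(Tuple2.of(2L, 1L));

        // Dependency set: (sourceID, targetID, edgeValue).
        DataSet<Tuple3<Long, Long, Double>> dependencySet = env.fromElements(
                Tuple3.of(1L, 2L, 1.0), Tuple3.of(2L, 3L, 1.0), Tuple3.of(3L, 4L, 1.0));

        // Keep only the edges whose source vertex changed, and emit their targets
        // as candidates for re-computation in the next iteration.
        DataSet<Tuple1<Long>> candidates = dependencySet
                .join(workingSet)
                .where(0)      // edge source ID
                .equalTo(0)    // changed vertex ID
                .with(new JoinFunction<Tuple3<Long, Long, Double>, Tuple2<Long, Long>, Tuple1<Long>>() {
                    @Override
                    public Tuple1<Long> join(Tuple3<Long, Long, Double> edge, Tuple2<Long, Long> changed) {
                        return Tuple1.of(edge.f1);
                    }
                })
                .distinct(0);

        candidates.print(); // prints (3) for this toy input
    }
}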

Bulk and Dependency are the two plans that we have mainly studied in this thesis work.

Regarding the Incremental and Delta plans, only the first one is presented in detail, and we then introduce the minor differences of the second. The Incremental plan is very similar to the Bulk but, instead of calculating the neighbors' values using the Solution set in the first join, it uses the Working set to perform this operation.

Figure 2.3: Dependency Plan

Figure 2.4: An example explaining the difference between the Bulk plan and the Incremental plan. Node B is shown in red because it is the only node that changed its value in the previous iteration

We can explain the result of substituting the Solution set with the Working set using Figure 2.4. If we join the Solution set with the Dependency set, the result, in the case of Figure 2.4, is that all the in-neighbors of node Z send their values to Z, even though only node B has changed its value. This is what happens in the Bulk plan. Note that node B is colored red to indicate that it is the only node that changed its value in the previous iteration. In the Incremental plan instead, only B sends its value to Z, so node Z receives only the value from B in the current iteration.

However, this plan has some constraints: the UDF that we apply must be idempotent[26] and weakly monotonic[27]. Figure 2.5 illustrates the dataflow of this plan.

Figure 2.5: Incremental/Delta Plan

The difference between the Incremental and the Delta plan is that, while in the Incremental plan the updated vertices send their full value to their neighbors, in the Delta plan only the delta, i.e. the difference between the new and the old value, is sent. As a consequence, the amount of exchanged data is usually smaller.

2.3 Connected Components in Apache Flink

To give an idea of how Flink can implement a fixpoint iterative graph algorithm, we show an implementation of the CC algorithm. The objective of this algorithm is to find the number of connected components of a graph. A connected component is a subgraph composed of all the nodes connected by a path. The goal can be accomplished by scattering the initial value of each node through the graph until all nodes have converged. The initial value of a node might be a random value or the ID of the node itself. In many cases the minimum value is propagated, so each node has to decide on it. By collecting the minimum flooded values, the number of connected components can finally be obtained. Figure 2.6 shows the execution of this algorithm, composed of two phases:

1. Active nodes of the graph send their values to their neighbors;

2. Each vertex decides on a value according to the step function (in the figure, the minimum value).

Figure 2.6: Connected Components example

As for the code, two dataset structures are first created: one for the vertices and one for the edges. DataSet is a Flink structure that models a set of items of the same type. Tuple is a type of the Flink Java API that specifies the item fields. For the CC algorithm, the vertices DataSet is defined as follows:

DataSet<Tuple2<K, V>> vertices;

Each vertex item is composed of a key K and a value V, representing respectively the ID and the connected-component value of the node.

The edges DataSet is instead defined as:

DataSet<Tuple3<K, K, E>> edges;

In this set, the first two fields are respectively the IDs of the source node and the target node, and this pair represents an edge, while the third field is the value associated with the edge. In the specific case of Connected Components, the third value is null because the value of an edge is not involved in the process.

If we want to implement the algorithm using the Bulk plan, we define the Iterate operator as follows:

IterativeDataSet<Tuple2<K, V>> iteration = vertices.iterate(maxIterations);

// ...
// define step function
// ...

iteration.closeWith(parametersWithNewValues);

The closeWith function of the Iterate operator represents the last operation, where the Solution set is updated with the previously calculated partial solution. Everything defined within the operator is repeatedly performed until the convergence condition is fulfilled. By default, the convergence condition is the maximum number of iterations defined in the iterate function.

The first task that needs to be done is the exchange of values among the graph nodes. The vertices and edges sets are therefore joined together to perform this task.

DataSet<Tuple4<K, K, V, E>> parametersWithNeighborValues = iteration
        .join(edges)
        .where(0)
        .equalTo(0)
        .with(new ProjectStepFunctionInput());

The result of the join, called parametersWithNeighborValues, is then projected in order to take into account only some of the fields. It can be seen as the set of values received by each single node of the graph. Next, the step function is called and fed with the intermediate result:

DataSet<Tuple2<K, V>> parametersWithNewValues = parametersWithNeighborValues
        .groupBy(0)
        .aggregate(Aggregations.MIN, 2)
        .project(0, 2);

The result of the step function is the vertices set with the updated values. The closeWith function performs the last step, rebuilding the Solution set with the newly computed values. A complete, self-contained version of this Bulk CC job is sketched below.
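For convenience, the fragments above can be assembled into a runnable sketch of the Bulk CC job; the concrete Long types, the toy graph, and the union-plus-aggregate step function below are assumptions of this sketch rather than the exact code of the thesis.

import org.apache.flink.api.common.functions.JoinFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.aggregation.Aggregations;
import org.apache.flink.api.java.operators.IterativeDataSet;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple3;

public class BulkConnectedComponents {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        int maxIterations = 10;

        // (vertexID, componentValue): each vertex starts with its own ID as value.
        DataSet<Tuple2<Long, Long>> vertices = env.fromElements(
                Tuple2.of(1L, 1L), Tuple2.of(2L, 2L), Tuple2.of(3L, 3L), Tuple2.of(4L, 4L));

        // (sourceID, targetID, edgeValue); the edge value is unused for CC.
        DataSet<Tuple3<Long, Long, Long>> edges = env.fromElements(
                Tuple3.of(1L, 2L, 0L), Tuple3.of(2L, 3L, 0L), Tuple3.of(4L, 4L, 0L));

        IterativeDataSet<Tuple2<Long, Long>> iteration = vertices.iterate(maxIterations);

        // Exchange values: each vertex sends its current component value to its out-neighbor.
        DataSet<Tuple2<Long, Long>> neighborValues = iteration
                .join(edges).where(0).equalTo(0)
                .with(new JoinFunction<Tuple2<Long, Long>, Tuple3<Long, Long, Long>, Tuple2<Long, Long>>() {
                    @Override
                    public Tuple2<Long, Long> join(Tuple2<Long, Long> vertex, Tuple3<Long, Long, Long> edge) {
                        return Tuple2.of(edge.f1, vertex.f1); // (targetID, candidate component value)
                    }
                });

        // Step function: every vertex keeps the minimum of its own and the received values.
        DataSet<Tuple2<Long, Long>> newValues = neighborValues
                .union(iteration)
                .groupBy(0)
                .aggregate(Aggregations.MIN, 1);

        DataSet<Tuple2<Long, Long>> components = iteration.closeWith(newValues);
        components.print();
    }
}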


Chapter 3

Methods and Implementations

In order to solve the problem stated in Section 1.3 and accomplish the goal, we have developed two methods. The first determines the factors that affect the cost of a plan in terms of time and builds a cost model that allows an optimizer to select the plan with the lower cost. This method requires an in-depth study of the cost factors of each iterative execution plan and the development of a cost-based optimizer that automatically decides on the suitable plan. In practice, the plan-switching operation is performed by imposing a convergence condition on the Bulk plan (see Section 2.2).

In fact, since the Bulk always outperforms the Dependency plan in the early iterations, the idea is to start executing a job using Bulk and to define a convergence condition that allows the comparison of the two plans' cost models. When the cost of the Dependency plan becomes smaller than that of Bulk, the plan is switched.

The second method tries to optimize the Dependency plan by bypassing the pre-processing phase. This phase is the heaviest part of the plan, especially when the amount of data is very high. A possible solution is to cache the neighbors' values so that they do not have to be exchanged every time this is not necessary. Eventually, we formally and experimentally compare the alternative implementations of the Dependency execution plan with the standard Dependency plan.

3.1 Cost Model

A cost model gathers the main features of a plan into a formula. Calculating the cost of a plan then means applying a formula whose output is a number indicating the value of the execution plan. This value is strictly related to the total execution time of a single iteration of the plan itself. Thus, the idea is to calculate the cost of both the Bulk and the Dependency plan at the beginning of each iteration, and then choose at runtime the one with the lower cost.

We build the model in a modular way, i.e. we add as many terms to the formula as there are operations involved in the dataflow of the execution plan. To evaluate the cost of an operation, we consider both the cost of the operator and the size of the datasets used to perform that operation. Only the cost of the Dependency plan must be recalculated, because the cost of the Bulk plan never changes during the iterations: the Bulk plan works with the same datasets and operations in each iteration, so its cost model does not need to be recomputed every time. Moreover, what is important is not the real cost of a plan but rather understanding which of the two is cheaper.

Starting from the basic cost model discussed in [11], we have studied the factors that may affect the cost of a plan. The system gives different performance depending on the plan that we are running; hence, changing the plan-switching point modifies the total execution time of the algorithm.

In the following, we highlight all the models, item by item, focusing mainly on their advantages and drawbacks.

3.1.1 Legend

Shown below is a list of all the symbols that we use in the formulas of the cost models.

• Cj: Join Cost

• Cr: Reduce Cost

• cp: Shipping Cost

• cb: Rebuilding Cost

• S: Solution set

• D: Dependency set

• W: Working set, #updated nodes

• Z: Candidates set, #candidate nodes

• λk: #updatedNodes / Solution set, 0 ≤ λk ≤ 1

• μk: #candidateNodes / Solution set, 0 ≤ μk ≤ 1

3.1.2 Basic Cost Model

The Basic Cost Model is described in the draft of [11]. This model approximates the cost of the two plans with the size of the input datasets. In fact, it is assumed that the cost of an operation is independent of its type and only depends on the size of the involved datasets. In addition, the computation cost of each operator can be considered irrelevant compared to the network and memory overhead.

Considering only the size of the datasets, the cost of the Bulk plan is always the same over the iterations, while the cost of the Dependency plan changes because of the Working set and the Candidates set, i.e. the set containing the re-computation candidates. In fact, the Working set usually becomes smaller as the iterations progress, and the cost of the Dependency plan decreases accordingly.


We assume that at the k-th iteration the Working set contains λk·S nodes, while the Candidates set contains μk·S nodes. Since both the Working set and the Candidates set are subsets of the Solution set, the factors λk and μk lie between 0 and 1.

The Candidates set is strongly related to the Working set because the re-computation candidates, i.e. all the nodes that are recomputed during the current iteration, are retrieved from the elements of the Working set.

The cost models of the two plans are the following:

BulkCost: Cj(S + D) + Cr·D + Cj·2S = 3Cj·S + (Cj + Cr)·D

DepCost: Cj(λkS + D) + Cr·λkD + Cj(D + μkS) + Cj(S + μkD) + Cr·μkD + Cj(S + μkS) = Cj(λk + 2μk + 2)·S + [Cj(μk + 2) + Cr(μk + λk)]·D

The aim is to satisfy this formula:

DepCost ≤ BulkCost

which means:

Cj(λk + 2μk + 2)·S + [Cj(μk + 2) + Cr(μk + λk)]·D ≤ 3Cj·S + (Cj + Cr)·D     (3.1)

As already discussed, we assume Cj = Cr and, considering only the size of the datasets, Cj = Cr = 1. The formula becomes:

(λk + 2μk + 2)(S + D) ≤ 3S + 2D  ⇒  (λk + 2μk)(S + D) ≤ S

Another assumption made in this model is to consider λk equal to μk, which means considering the size of the Working set equal to the size of the Candidates set. Applying λk = μk, the formula becomes:

3λk(S + D) ≤ S  ⇒  3λk ≤ S / (S + D)     (3.2)

The plot in Figure 3.1 displays the trend of the Bulk and Dependency costs with respect to λk. The same plot shows the switching point between the two costs.

Figure 3.1: Cost Models Switching
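As a plain illustration of the decision rule in Equation 3.2 (not the optimizer implemented in the thesis), the per-iteration comparison can be sketched as follows; the dataset sizes are made-up values and μk is set equal to λk as in the basic model.

public class BasicCostModelSketch {

    // BulkCost with Cj = Cr = 1: 3S + 2D.
    static double bulkCost(double s, double d) {
        return 3 * s + 2 * d;
    }

    // DepCost with Cj = Cr = 1: (lambdaK + 2*muK + 2) * (S + D).
    static double dependencyCost(double s, double d, double lambdaK, double muK) {
        return (lambdaK + 2 * muK + 2) * (s + d);
    }

    public static void main(String[] args) {
        double s = 1000;  // |Solution set| (illustrative)
        double d = 2000;  // |Dependency set| (illustrative)

        for (double lambdaK = 1.0; lambdaK >= 0.0; lambdaK -= 0.1) {
            // Basic model assumption: muK == lambdaK.
            boolean runDependency = dependencyCost(s, d, lambdaK, lambdaK) <= bulkCost(s, d);
            System.out.printf("lambda_k = %.1f -> run %s plan%n",
                    lambdaK, runDependency ? "Dependency" : "Bulk");
        }
    }
}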

3.1.3 Cost Model with Shipping Strategies

The next step is to consider the shipping strategies adopted by each plan. A shipping strategy is the way in which the platform decides to distribute one or more datasets across all the parallel instances of the system in order to perform a distributed operation. In a scenario where the size of the datasets is extremely large, distributing the workload and shipping data become key tasks. Regarding the cost of shipping, both the effort to partition the data and the overhead to distribute it need to be considered.

Figure 3.2: Bulk plan with shipping strategies highlighted

Figure 3.2 shows an example of how the system, specifically Apache Flink, selects the most suitable shipping strategies for the Bulk execution plan. Even though many shipping strategies can be used to partition data, in our specific case only the forward strategy and the hash partition have been used. The forward strategy indicates that a dataset does not need to be distributed because it is already partitioned across the nodes, while the hash partition exploits a hash function based on a key to perform this task.

According to the performed operation, the shipping strategy is then picked. In our case, the two operations that influence the choice of the shipping strategies are the join and the reduce operators.

In a distributed environment, the way we perform the join operation strongly depends on the datasets involved and, consequently, this affects the total cost. According to [21], the most appropriate types of join for parallelization are the natural join and the equi-join.

In order to perform a distributed join, the two datasets have to be partitioned on the join attribute, i.e. the attribute on which the datasets are going to be merged, thus parallelizing the entire operation. Thanks to this partitioning, a single node is able to perform the local join on a subset of the input datasets without the inconvenience of missing data.

Regarding the equi-join, there are three types of such operations, depending on how the involved datasets are partitioned.

• Co-located join: both datasets are partitioned on the join key using a hash function. The cost only depends on the locally performed join.

• Directed join: only one of the two datasets is partitioned on the join attribute, and therefore the second dataset must also be partitioned on the same attribute. The cost of this operation is clearly higher than that of the previous one: it is equal to the cost of partitioning the not-yet-partitioned dataset, plus the cost of shipping the data over the network, plus the local cost of the join. The cost of partitioning and shipping can be approximated by the number of blocks forming the dataset.

• Repartition join: neither of the two datasets is partitioned on the join attribute, so both must be partitioned. The cost is twice the partitioning and shipping since two datasets are involved, plus the local execution of the join.

Another possible solution to merge two datasets is the broadcast join. One of the two datasets is already partitioned on the join attribute but, instead of partitioning the other one as well, the latter is sent entirely to all nodes. In brief, the first dataset is already partitioned across different nodes and the second is sent to all of them. This technique is often used when the size of one of the two datasets is much smaller than the size of the other. Flink lets the programmer request such strategies explicitly through join hints, as sketched below.
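The following standalone sketch (toy data, illustrative variable names) shows the repartition and broadcast variants requested through Flink's DataSet API join hints; only the hint names come from Flink's API, everything else is an assumption of this example.

import org.apache.flink.api.common.operators.base.JoinOperatorBase.JoinHint;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple3;

public class JoinHintSketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        DataSet<Tuple2<Long, Long>> vertices = env.fromElements(Tuple2.of(1L, 1L), Tuple2.of(2L, 2L));
        DataSet<Tuple3<Long, Long, Long>> edges = env.fromElements(Tuple3.of(1L, 2L, 0L));

        // Repartition join: both inputs are hash-partitioned on the join key;
        // the hash table is built from the first input.
        DataSet<?> repartitioned = vertices
                .join(edges, JoinHint.REPARTITION_HASH_FIRST)
                .where(0).equalTo(0);

        // Broadcast join: the (small) first input is sent entirely to every
        // parallel instance holding a partition of the second input.
        DataSet<?> broadcasted = vertices
                .join(edges, JoinHint.BROADCAST_HASH_FIRST)
                .where(0).equalTo(0);

        repartitioned.print();
        broadcasted.print();
    }
}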

Regarding the shipping strategy adopted to execute a reduce function, this depends both on how the dataset is distributed among the nodes and on the specific type of reduce operation. The reduce function involves just one dataset, which can be either an input dataset or the intermediate result of another operation. Often, if the dataset needs to be partitioned, a first instance of the reduce function is applied locally; the dataset is then distributed, and finally a second instance is applied.

An example is eliminating the duplicates in a dataset. In this case, the removal function is applied once before partitioning. The dataset is then partitioned and distributed by the shipping strategy on a key, likely the same key used by the reduce function, and eventually the reduce function is applied again. Note that performing the reduce function before distributing the data decreases the data traffic over the network.

For instance, the first reduce function of the Dependency plan operates in this way, removing duplicates from the intermediate result given by the join between the Working set and the Dependency set.

The UDF is a reduce function different from the first reduce of the Dependency, briefly explained in the sentence above. It is impossible to know a priori the shipping
