Data Flow Optimization - ACTA UNIVERSITATIS UPSALIENSIS Uppsala Dissertations from the Faculty

The purpose of data flow optimization is to create an optimized data flow graph for a given continuous query. Optimization provides a higher degree of transparency than specifying queries through templates, where the degree of parallelism and the partitioning strategy are given explicitly. The optimizer automatically generates distributed execution plans and selects an optimized plan according to some optimality metric. Traditionally, a database query optimizer consists of plan enumeration, which generates plans in the space of possible plans, and a cost estimation model on which the selection of an optimized plan is based.

We have developed a CQ optimizer for PCC with limited functionality that optimizes parallel execution plans for a single expensive SQF. Next, we briefly describe the components of the current CQ optimizer. The optimization of the individual SQFs relies on traditional query optimization.

5.6.1 Estimating Plan Costs

Traditional cost models rely on relatively accurate estimates of the costs of individual operators, which are used to estimate the cost of the entire plan. However, in an extensible system such as GSDM, which executes user-defined functions over user-defined data, the cost model of individual functions might be hard to define or to obtain from the author of the code. Furthermore, to allow utilization of processing resources allocated on demand among computers with different performance parameters, the system would need to support a separate cost model for each architecture.

Therefore, the CQ optimizer selects an optimized plan based on trial runs that collect execution statistics rather than on a cost estimation model. The optimizer collects statistics about the utilization time of working nodes, which is the sum of the SQF’s processing times, the communication time, and the time spent in system tasks. The working node with the maximum utilization time limits the throughput achievable by the data flow graph. Hence we define the optimality metric (cost) of a plan as the maximum utilization time among the working nodes in the plan, and we select the plan with the lowest cost. In order to compare the statistics of several plans, the trial runs use the same cluster and work on stream segments of equal size.
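The metric defined above can be illustrated with a minimal Python sketch. The dictionary layout, the plan names, and all numbers are hypothetical and chosen only for illustration; they are not GSDM's statistics format.

```python
from typing import Dict

# Hypothetical layout: each working node reports three time components in seconds.
NodeStats = Dict[str, float]


def utilization_time(node: NodeStats) -> float:
    """Utilization time of one node: processing + communication + system time."""
    return node["processing"] + node["communication"] + node["system"]


def plan_cost(nodes: Dict[str, NodeStats]) -> float:
    """Cost of a plan: the maximum utilization time among its working nodes."""
    return max(utilization_time(n) for n in nodes.values())


def select_plan(stats: Dict[str, Dict[str, NodeStats]]) -> str:
    """Return the name of the plan with the lowest cost."""
    return min(stats, key=lambda plan: plan_cost(stats[plan]))


# Made-up statistics from two trial runs over the same stream segment.
trial_stats = {
    "split-2": {
        "w1": {"processing": 4.0, "communication": 1.0, "system": 0.5},
        "w2": {"processing": 3.0, "communication": 1.0, "system": 0.5},
    },
    "split-4": {
        "w1": {"processing": 2.0, "communication": 1.5, "system": 0.5},
        "w2": {"processing": 1.5, "communication": 1.5, "system": 0.5},
        "w3": {"processing": 2.0, "communication": 1.5, "system": 0.5},
        "w4": {"processing": 1.0, "communication": 1.5, "system": 0.5},
    },
}

best = select_plan(trial_stats)
```

With these made-up numbers the busiest node of "split-2" is utilized for 5.5 seconds and the busiest node of "split-4" for 4.0 seconds, so "split-4" is selected.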

5.6.2 Plan Enumeration

A naive plan enumeration for PCC has been implemented as follows:

1. The maximum degree of parallelism is specified as a system parameter.

2. Plans for partitioned parallelism are generated using the PCC template constructor.

3. A function registry contains meta-data about the valid parallel strategies and their parameters for a given SQF. Both the valid stream partitioning strategies and the valid degrees of parallelism are specified in the registry.

4. The enumerator generates different plans by using the PCC template and varying the strategy (i.e. window split or window distribute) and the valid degrees of parallelism. The enumeration of plans for a given strategy stops when a plan has been generated such that either the maximum degree of parallelism or the resource limit is reached.

Figure 5.3: Life cycle of an optimized CQ. [Diagram not reproduced; its labels are: CQ Specification, Plan Enumeration, Data Flow Graph, Compilation, Execution Plan, Trial Run, Statistics, Plan Selection, Optimized Execution Plan, Run, Running CQ, Deactivation, CQ Optimizer.]
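The enumeration steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the GSDM implementation: the registry contents are invented, and the resource limit is simplified to a number of available nodes that the degree of parallelism may not exceed.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List


@dataclass(frozen=True)
class Plan:
    sqf: str
    strategy: str
    degree: int


# Hypothetical registry: valid strategies and degrees of parallelism per SQF.
REGISTRY: Dict[str, Dict[str, List[int]]] = {
    "fft3": {
        "window_split": [2, 4, 8],
        "window_distribute": [2, 3, 4, 5, 6],
    },
}


def enumerate_plans(sqf: str, max_degree: int, available_nodes: int) -> Iterator[Plan]:
    """Yield one plan per valid (strategy, degree) pair.

    For each strategy, enumeration stops once a plan reaches the maximum
    degree of parallelism or the (simplified) resource limit.
    """
    for strategy, degrees in REGISTRY[sqf].items():
        for degree in sorted(degrees):
            if degree > max_degree or degree > available_nodes:
                break  # this and all larger degrees exceed a limit
            yield Plan(sqf, strategy, degree)
            if degree == max_degree or degree == available_nodes:
                break  # a limit has been reached by the plan just generated


plans = list(enumerate_plans("fft3", max_degree=4, available_nodes=6))
```

With a maximum degree of 4, the sketch yields window split plans of degree 2 and 4 and window distribute plans of degree 2, 3, and 4.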

Each of the graphs is compiled and run in trial mode, and statistics about the execution are stored in the coordinator’s metadata by the statistics collector.

The CQ optimizer then chooses an optimized data flow graph using the statistics collected and the above model for optimality. The life cycle of a CQ optimized in this way is shown in Figure 5.3.
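The cycle of trial runs followed by plan selection might be sketched as below. The function names are hypothetical, and the trial run is replaced by a stand-in that returns made-up utilization times; in the real system this step would compile the graph and run it on a fixed stream segment.

```python
from typing import Callable, Dict, Hashable, Iterable, Tuple

# Utilization time per working node, as reported by one trial run (seconds).
Utilization = Dict[str, float]


def optimize(
    plans: Iterable[Hashable],
    trial_run: Callable[[Hashable], Utilization],
) -> Tuple[Hashable, Dict[Hashable, Utilization]]:
    """Trial-run every plan, keep its statistics, and return the cheapest plan."""
    metadata: Dict[Hashable, Utilization] = {}
    for plan in plans:
        metadata[plan] = trial_run(plan)
    # Plan cost is the maximum utilization time among the plan's working nodes.
    best = min(metadata, key=lambda p: max(metadata[p].values()))
    return best, metadata


def fake_trial_run(plan) -> Utilization:
    """Stand-in for a real trial run; the cost formula is invented."""
    _strategy, degree = plan
    work = 12.0 / degree      # made-up compute time, split evenly among workers
    overhead = 0.5 * degree   # made-up communication time growing with the degree
    return {f"w{i}": work + overhead for i in range(degree)}


candidates = [("window_split", 2), ("window_split", 4), ("window_split", 8)]
best_plan, collected = optimize(candidates, fake_trial_run)
```

Under this invented cost formula the plans cost 7.0, 5.0, and 5.5 seconds respectively, so the degree-4 plan is chosen. Passing the trial run in as a function keeps the selection logic independent of how statistics are actually gathered.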

The optimizer is implemented as a function with the following signature:

opt(Charstring templ, Charstring fun,
    Vector params, Vector inpstr) -> Dataflow d;

It takes as parameters the function fun to be executed, its parameters params, and the input streams inpstr. The first parameter, templ, names the template to be used for plan enumeration; currently only PCC is supported. The optimizer automatically assigns an output stream to collect statistics from trial runs. The result of the opt function is an optimized data flow graph for the provided function fun. Using the CQ optimizer functionality, the continuous query on page 63 is specified by an alternative template constructor as follows:

set q = cq(opt("PCC","fft3",{},{s1}), {s1}, {s2});

The optimized plan is set by the cq constructor to the plan property of the query and used when the run(q) procedure starts the execution.

In the example above, plans with RR and user-defined stream partitioning are generated with different degrees of parallelism, and the plan with the best execution time from the trial runs is selected. In this way transparency is provided to the user: only the SQF fft3 needs to be specified in the query, rather than all explicit parameters of the PCC template as in the example on page 63.

The current CQ optimizer provides automatic optimization of parallelizable expensive SQFs using the PCC pattern. The experiments show that the optimization framework with trial runs and statistics collection is feasible, but it needs to be generalized in several important directions:

• Sophisticated plan enumeration. Since naive enumeration might create a very large space of possible data flow graphs, efficient optimization requires heuristics about which plans to generate and in what order. For example, enumeration strategies such as a random walk of the search space, binary search, or greedy search can be investigated. Such heuristics are even more important in our setting than for cost-based optimization because, rather than computing the cost, the CQ optimizer runs each plan in a trial mode.

• Optimality model. As alternatives to the maximum throughput metric, other metrics such as latency and precision can be used. Furthermore, a multi-criteria optimality model combining several metrics might fit some applications better.

• The optimization framework needs to be generalized to use different distribution templates besides the PCC template. For example, if a user specifies a pipeline of two SQFs, the CQ optimizer has to enumerate two kinds of plans: plans where each stage of the pipeline is parallelized independently of the other, possibly with different degrees of parallelism and partitioning strategies, and plans where the stages are executed together by defining a meta-SQF that encapsulates them and is in turn parallelized by some data partitioning.
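As one illustration of the enumeration heuristics mentioned in the first item above, the following Python sketch applies a greedy search over the degree of parallelism: it runs trials in increasing order of degree and stops at the first degree that does not improve the cost. This is not part of the current optimizer. The function names and the cost formula are hypothetical, and the heuristic finds the best degree only if the cost first decreases and then increases with the degree.

```python
from typing import Callable, List, Tuple


def greedy_degree_search(
    degrees: List[int],
    trial_cost: Callable[[int], float],
) -> Tuple[int, int]:
    """Try degrees in increasing order; stop at the first non-improving one.

    Returns (best_degree, number_of_trial_runs).
    """
    ordered = sorted(degrees)
    best_degree = ordered[0]
    best_cost = trial_cost(best_degree)
    runs = 1
    for degree in ordered[1:]:
        cost = trial_cost(degree)
        runs += 1
        if cost >= best_cost:
            break  # no improvement: skip all remaining, larger degrees
        best_degree, best_cost = degree, cost
    return best_degree, runs


def made_up_cost(degree: int) -> float:
    """Invented stand-in for the cost measured by a trial run."""
    return 12.0 / degree + 0.5 * degree


result = greedy_degree_search(list(range(1, 9)), made_up_cost)
```

With this invented cost function the search settles on degree 5 after six trial runs instead of the eight needed by exhaustive enumeration. The saving grows with the number of valid degrees, which matters because every cost evaluation is a trial run.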

6. Execution of Continuous Queries

This chapter presents the execution of continuous queries at working nodes.

First, we describe the implementation of operators executing SQFs and inter-GSDM communication. Next, we present the scheduling policies used by the scheduler. Finally, we describe important observations concerning the system performance.

In document ACTA UNIVERSITATIS UPSALIENSIS Uppsala Dissertations from the Faculty of Science and Technology 66 (Page 86-89)