6.1.5 Decomposition tree distribution


We conclude the section with the observation that, for OO data sources, the described strategy is more general than the strategies used in some other multidatabase systems (e.g. [56, 20, 48]) where the joins are performed in the mediator system. Such strategies do not allow mediation of OO sources that provide functions that are not stored, but rather computed by programs executed in the data source (e.g. image analysis, matrix operations). In this case, it is necessary to ship intermediate results to the source in order to execute the programs using the result tuples as input. In this respect, the strategy presented above generalizes and improves on the bind-join strategy in [36].
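As a rough illustration of this point, the sketch below contrasts a mediator-side join with shipping intermediate tuples to the source so that a non-stored function can be applied there. The SourceConnection interface and the function names are illustrative assumptions, not AMOSII APIs.

```python
# Sketch: generalized bind-join, where intermediate result tuples are shipped to a
# data source so that a program resident there (e.g. image analysis) can consume
# them. SourceConnection and its execute() method are hypothetical placeholders.

class SourceConnection:
    """Stand-in for a connection to a data source that can run programs."""
    def execute(self, program_name, input_tuples):
        # A real implementation would ship input_tuples to the source, run the
        # named program there, and return the tuples it produces.
        raise NotImplementedError

def mediator_side_join(left, right, key):
    """Conventional strategy: both inputs are shipped to and joined in the mediator."""
    index = {}
    for r in right:
        index.setdefault(r[key], []).append(r)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]

def ship_and_execute(source, intermediate, program_name):
    """Generalized strategy: the intermediate result itself is shipped to the source
    and used as input to a function that is executed, not stored, there."""
    return source.execute(program_name, intermediate)
```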


A node with an empty PPL must have a non-empty SAEDS. Therefore, the lower node describes an operation where data is shipped from the mediator to another AMOSII server, some computation is performed there, and the result is then shipped back to the mediator. In the next step, described by the upper node, the result of the previous step, now stored in the mediator, is once again shipped to the site described in the SAEDS of the upper node, this time as input to the SF in the upper node's SAEDS. For example, the left tree in Figure 6.7 describes a plan where SFdb2 is executed at DB2 and the result is shipped to the mediator. The upper node then ships the same result from the mediator further to DB1, where it is used in an equi-join.

In the case when the lower node has a non-empty PPL, the result of the lower node is processed locally before it is used in the processing step described by the upper node. The transformations described in this section are therefore not applicable in such a case.
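To make the terminology concrete, the following is a minimal sketch of a DcT node as described above, with a SAEDS naming the subquery function (SF) and its execution site, and a PPL of post-processing operations performed locally in the mediator. The class and field names are illustrative assumptions, not the actual AMOSII data structures.

```python
# Minimal sketch of a decomposition tree (DcT) node, assuming the structure
# described in the text; names are illustrative, not the real AMOSII classes.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class SAEDS:
    sf_name: str    # subquery function (SF) compiled for the execution site
    site: str       # site where the SF executes, e.g. "DB1" or "DB2"

@dataclass
class DcTNode:
    saeds: Optional[SAEDS]                        # remote execution description
    ppl: List[str] = field(default_factory=list)  # local post-processing operations
    child: Optional["DcTNode"] = None             # the node below in the tree

def mergeable(upper: DcTNode, lower: DcTNode) -> bool:
    """Two consecutive nodes qualify for a merge only when the lower node has an
    empty PPL, i.e. its result is not processed locally before the upper step."""
    return (not lower.ppl) and lower.saeds is not None and upper.saeds is not None
```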

The node merge operation is shown in Figure 6.9. Two consecutive nodes with the required properties are identified (Figure 6.9a) and substituted with a single node. The new node has the PPL from the upper node and a SAEDS assembled from the SAEDSs of the merged nodes. In order for the new tree to represent a correct query schedule, the SF in the SAEDS of the new node should perform the same operations as both SAEDSs of the original nodes.

Therefore, the SF in the new node's SAEDS is a combination of the SFs of the original nodes. Nevertheless, to avoid unnecessarily routing the data through the mediator, the combined SF is compiled and executed in the participating servers instead of locally in the mediator. This is done by defining an envelope SF that calls the two original SFs. The envelope SF is compiled at both of the participating sites and the cheaper alternative is accepted. This cost, in turn, is compared with the cost of the original tree, and if it is lower, the modified tree is accepted instead of the original.
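A minimal sketch of this transformation, building on the DcTNode/SAEDS sketch above, is given below. The callables compile_sf_at and estimate_cost are hypothetical stand-ins for the mediator's compilation and cost-estimation services, and the envelope SF is represented by a simple name rather than a real compiled function.

```python
# Hedged sketch of the node merge: an envelope SF calling both original SFs is
# compiled at each participating site, the cheaper placement is kept, and the
# merged node is accepted only if it beats the cost of the original pair.

def try_node_merge(upper, lower, compile_sf_at, estimate_cost):
    """Return a merged DcTNode if it is cheaper than the original pair, else None.
    compile_sf_at(sf_name, site) and estimate_cost(plan) are hypothetical services."""
    if not mergeable(upper, lower):
        return None   # e.g. the lower node post-processes its result locally

    # Envelope SF that performs the work of both original SFs.
    envelope_name = f"wrap({upper.saeds.sf_name},{lower.saeds.sf_name})"

    # Compile the envelope at both participating sites; keep the cheaper plan.
    candidates = []
    for site in {upper.saeds.site, lower.saeds.site}:
        plan = compile_sf_at(envelope_name, site)
        candidates.append((estimate_cost(plan), site))
    best_cost, best_site = min(candidates)

    # Accept the merged node only if it is cheaper than the original two-node plan.
    if best_cost >= estimate_cost((upper, lower)):
        return None

    return DcTNode(saeds=SAEDS(sf_name=envelope_name, site=best_site),
                   ppl=list(upper.ppl),   # the merged node keeps the upper node's PPL
                   child=lower.child)
```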

Figure 6.10 illustrates the data flow between the three AMOSII servers in the example from Figure 6.7a. This example is a general case of an application of a node merge. An exception is the case where the merged nodes have SAE SFs that are both at the same site. In such cases there would be only two sites involved.

Figure 6.9: Node merge: a) the original tree (upper node: SAE SFupp, PPL PPLupp; lower node: SAE SFlow, empty PPL), b) the result of the merge operation (merged node: SAE Wrap(SFupp+SFlow), PPL PPLupp)

In Figure 6.10a the data flow of the execution of the original schedule is presented. The query execution starts by the mediator contacting DB2 to execute SFdb2 in step 1, and shipping across the results in step 2. Next, from the result of the previous step, the mediator sends the values of the argument va to DB1, where SFdb1 is executed and the result is joined with the incoming set of va values. For each joined value of va a temporary boolean value is returned, indicating which of the incoming va values joined with the result of the execution of SFdb2. Finally, after joining the result shipped in step 4 with the result of step 2, the mediator emits the values of r for which the temporary iteration variable _tmp_ is TRUE.

This strategy would be very inefficient in cases when the set of va values is very large and the network links connecting the mediator with DB1 and DB2 are very slow (e.g. due to geographical separation). Also, note that with this strategy the va values are shipped twice.

The strategy illustrated in Figure 6.10b is obtained by merging the nodes of the DcT in Figure 6.7a and placing the envelope SF at DB2. Here, the values of va are sent directly from DB2 to DB1, therefore shipping them only once. Figure 6.10c represents the execution strategy of the transformed DcT in Figure 6.7a where the envelope SF is placed at DB1. This strategy is favorable when SFdb1 has high selectivity or the network link between the mediator and DB2 is slow.

Figure 6.10: Execution diagrams of the decomposition tree of the example query before and after the node merge
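The trade-off between these data flows can be made concrete with a back-of-the-envelope transfer-cost comparison. The per-tuple link weights and the simplified formulas below are illustrative assumptions rather than the cost model actually used by the mediator.

```python
# Illustrative comparison of shipping the va set via the mediator (Figure 6.10a)
# versus directly between the data sources after the node merge (Figure 6.10b/c).

def cost_via_mediator(n_va, w_med_db1, w_med_db2):
    # Original schedule: va travels DB2 -> mediator -> DB1, i.e. it is shipped twice.
    return n_va * (w_med_db2 + w_med_db1)

def cost_direct(n_va, w_db1_db2):
    # Merged schedule: the envelope SF lets va travel directly between the sources.
    return n_va * w_db1_db2

# Example: a large va set over slow mediator links strongly favors the merged plan.
print(cost_via_mediator(100_000, 0.5, 0.5))   # 100000.0
print(cost_direct(100_000, 0.1))              # 10000.0
```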

A series of node merge operations can produce longer data flow patterns that do not necessarily pass through the query issuing mediator. One feature of the trees produced by node merges is that the SFs in the SAEDSs are themselves multidatabase functions over data in multiple data sources.

These SFs are also compiled and described by a DcT. Since, by repeated application of the merging process, their SAEDSs can also have SFs over multiple data sources, a query can be represented by a set of DcTs distributed over more than one AMOSII server. Hence, the process can be viewed as DcT distribution. Compared with traditional query tree balancing [17], the node merge exhibits the following differences:


• Distributed compilation: node merging is a distributed process where the envelope SFs are compiled at nodes other than the mediator. This distributed compilation process is decentralized and does not need a centralized catalogue of optimization information, which is a potential bottleneck when the number of mediators increases.

• Distributed tree: the resulting tree is not stored in one AMOSII server, but rather is spread over the participating servers, which expose only an already compiled function for the subquery sent by the coordinating mediator.

In a tree produced by the cost-based scheduling there might be more than one spot that qualifies for a merge operation. An important issue in applying node merging is where in the DcT to apply the operator. Different sequences of merge operations can produce different results. The simplest solution to this problem is to perform an exhaustive application of all possible sequences of merge operations by backtracking. However, it is clear that this would require a large number of SF compilations and is therefore not suitable. An alternative is to use hill-climbing from a few randomly chosen positions and continue the process until no transformation can be made that produces a cheaper tree. The process can be guided by heuristics that prioritize DcT nodes where the transformation can be especially useful, and avoid merging nodes that are unlikely to produce a merged node with lower cost. Examples of such heuristic rules are given below, and a sketch of the guided search follows the list:

• Merge only nodes where the SF in the SAEDS of the lower node has at least one result variable used in the input of the SF in the SAEDS of the upper node. This rule avoids producing merged nodes where the result shipped to the mediator is a cross product of the results of the two SFs.

• Merge nodes where the weight of the network connection between the SAE SFs in the upper and the lower nodes is considerably lower than the weights of their network connections with the mediator (e.g. the data sources are close to each other, but geographically far away from the mediator).
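A minimal sketch of the guided hill-climbing search is given below, under the assumption that the mediator exposes the pieces above as callables; candidate_merge_spots, passes_heuristics, apply_merge and tree_cost are hypothetical placeholders, not actual AMOSII interfaces.

```python
import random

def hill_climb_merges(tree, candidate_merge_spots, passes_heuristics,
                      apply_merge, tree_cost, restarts=3):
    """Greedy node-merge search: from a few randomly chosen starting positions,
    keep applying heuristically promising merges while they lower the cost."""
    best_tree, best_cost = tree, tree_cost(tree)
    for _ in range(restarts):
        current, current_cost = tree, tree_cost(tree)
        improved = True
        while improved:
            improved = False
            # Consider only spots that pass the heuristic rules above.
            spots = [s for s in candidate_merge_spots(current) if passes_heuristics(s)]
            random.shuffle(spots)                   # randomize the starting position
            for spot in spots:
                merged = apply_merge(current, spot)  # may return None (no useful merge)
                if merged is not None and tree_cost(merged) < current_cost:
                    current, current_cost = merged, tree_cost(merged)
                    improved = True
                    break                            # take the first improving merge
        if current_cost < best_cost:
            best_tree, best_cost = current, current_cost
    return best_tree
```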
