Query processing and data integration

One of the reasons for the success of database technology is the capability of DBMSs to accept declarative query requests from the user. As noted earlier, the user only needs to specify what is to be retrieved, rather than how it is retrieved. In other words, queries are not programs stating precisely how the data is retrieved. The burden of producing a query execution plan from a query falls on the DBMS. In a multidatabase environment consisting of heterogeneous and autonomous data sources, this task becomes even more demanding.

Resolving heterogeneity usually requires advanced queries containing operators that are more complex than those in traditional select-project-join queries. An example of such an operator, used in this work to integrate overlapping data from different sources, is the outer-join operator. This operator returns not only the matching tuples of the operands, but also the non-matching tuples, padded with NULL values. It does not have the associativity and commutativity properties used heavily in the optimization of regular join-based queries.
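The padding behaviour of the outer join can be illustrated with a minimal Python sketch. It is not the AMOSII implementation; the function name `full_outer_join`, the list-of-dictionaries representation, and the sample data are assumptions made for illustration only, with `None` standing in for NULL.

```python
def full_outer_join(left, right, key):
    """Full outer join of two lists of dicts on `key` (illustrative only)."""
    left_cols = {c for row in left for c in row}
    right_cols = {c for row in right for c in row}
    result = []
    matched_right = set()
    for l_row in left:
        partners = [(i, r) for i, r in enumerate(right) if r[key] == l_row[key]]
        if partners:
            for i, r_row in partners:
                matched_right.add(i)
                result.append({**l_row, **r_row})
        else:
            # Non-matching left tuple: pad the right-hand attributes with None.
            result.append({**{c: None for c in right_cols}, **l_row})
    for i, r_row in enumerate(right):
        if i not in matched_right:
            # Non-matching right tuple: pad the left-hand attributes with None.
            result.append({**{c: None for c in left_cols}, **r_row})
    return result


# Two hypothetical sources holding overlapping data about the same entities.
source_a = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}]
source_b = [{"id": 2, "salary": 50}, {"id": 3, "salary": 60}]
rows = full_outer_join(source_a, source_b, "id")
```

Only the tuple with `id` 2 occurs in both sources; the tuples with `id` 1 and `id` 3 are kept, each padded with `None` for the attributes the other source would have supplied.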

Another issue is the difference in the capabilities of the participating data sources. While in the distributed database framework all nodes have the same functionality, here some nodes might not even be databases (e.g. an e-mail system). This makes query compilation and the division of tasks among the nodes harder than in distributed databases.

The autonomy of the data sources also greatly influences query processing in an MDBMS. As the MDBMS interacts with the data sources only via an external interface, the internal statistical information needed for query optimization is not available. Obtaining this type of information is typically very hard in an MDBMS operating over autonomous sources. In this thesis we do not elaborate on this problem. A few solutions have been proposed in the literature: query sampling, query probing, and piggyback in [88], and calibration and regression in [31]. A survey of these techniques is presented in [5].

The MDBMS environment is also much more dynamic in comparison with the classical distributed database environment. Here, the participating data sources are free to withdraw from the system or refuse certain requests.

An Overview of the AMOSII System

The AMOSII system was developed from the AMOS system, which has its roots in the workstation version of the Iris system, WS-Iris [52]. The core of AMOSII is an open, lightweight, and extensible database management system (DBMS). The aim of the AMOSII architecture is to provide for efficient integration of data stored in different repositories by both active and passive techniques. To achieve better performance, and because most of the data resides in the data repositories, AMOSII is designed as a main-memory DBMS. Nevertheless, it contains all the traditional database facilities, such as a recovery manager, a transaction manager, active rules, and an OO query language. A running instance of AMOSII, named an AMOSII server (or simply server), provides services to applications, as well as to other AMOSII servers.

Figure 3.1 illustrates the different roles that an AMOSII server can assume. In this example, several applications access data stored in several data sources through a collection of interconnected AMOSII servers. AMOSII servers can run on separate workstations and provide different types of data integration services. One server is designated as the name server and provides information about the locations of the servers on the net. Different interconnection topologies can be used to connect the servers, depending on the integration requirements of the environment. Also, a single AMOSII server can perform more than one of the tasks shown in the figure and serve more than one application simultaneously. Each AMOSII server is a fully fledged DBMS and can store data locally. Imported and local data are described in each AMOSII server by an OO type hierarchy.

Figure 3.1: Interconnected AMOSII servers (diagram labels: Pricing Data Feed, Purchasing, Prod. Estimates, Design / Analysis, Manufact. System, Materials Database, Name Server, Mediator, Translator, Local Data)

In [23], an approach to wrapping relational data sources with AMOSII is described. There, the sources are not only wrapped; query optimization techniques are also used to simplify queries over both local and relational data. Therefore, to distinguish between the wrapper subsystem in AMOSII and an AMOSII server whose role is to wrap a data source with this extended functionality, the latter is called a translator. The term wrapper is used for the wrapper subsystem.

This thesis describes the design and implementation of the mediation services in AMOSII.

3.1 Data model

The data model in AMOSII is an OO extension of the DAPLEX [71] functional data model. It has three basic constructs: objects, types (i.e. classes), and functions. Objects model entities in the domain of interest. An object can be classified into one or more types, which makes the object an instance of those types. The set of all instances of a type is called the extent of the type. Object properties and their relationships are modeled by functions.

The types in AMOSII are divided into literal and surrogate types. The literal types, e.g. int, real and string, have a fixed (possibly infinite) extent and self-identifying instances. Each instance of a surrogate type is identified by a unique, system-generated object identifier (OID). The types are organized in a multiple inheritance, supertype/subtype hierarchy that sets constraints on the classification of the objects. One example of such a constraint is: If an object is an instance of a type, then it is also an instance of all the supertypes of that type; conversely, the extent of a type is a subset of the extents of its supertypes (extent-subset semantics). The AMOSII data model supports multiple inheritance, but requires an object to have a single most specific type.
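The extent-subset semantics can be made concrete with a small Python sketch. It is an illustration under stated assumptions and not AMOSII code: the classes `Type` and `Database`, the integer OIDs, and the sample type names are all hypothetical. Each object records a single most specific type, and the extent of a type is computed from those records.

```python
import itertools


class Type:
    """A surrogate type with zero or more direct supertypes (hypothetical)."""

    def __init__(self, name, supertypes=()):
        self.name = name
        self.supertypes = tuple(supertypes)

    def ancestors(self):
        """This type together with all its direct and indirect supertypes."""
        result = {self}
        for supertype in self.supertypes:
            result |= supertype.ancestors()
        return result


class Database:
    """Keeps one most specific type per object and derives extents from it."""

    def __init__(self):
        self._next_oid = itertools.count(1)
        self._most_specific = {}  # OID -> single most specific type

    def create(self, type_):
        oid = next(self._next_oid)  # system-generated identifier
        self._most_specific[oid] = type_
        return oid

    def extent(self, type_):
        # An object is an instance of its most specific type and of
        # every supertype of that type.
        return {oid for oid, t in self._most_specific.items()
                if type_ in t.ancestors()}


person = Type("person")
student = Type("student", [person])
employee = Type("employee", [person])
assistant = Type("assistant", [student, employee])  # multiple inheritance

db = Database()
p = db.create(person)
s = db.create(student)
a = db.create(assistant)
```

Because extents are derived from the most specific type, the extent of `assistant` is automatically contained in the extents of `student`, `employee`, and `person`.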

The surrogate types are divided into stored, derived, proxy, and integration union types:

- The instances of stored types are explicitly stored locally in AMOSII and created by the user.

- The extent of a derived type (DT) is a subset of an intersection of the extents of the constituent supertypes. The instances of the supertypes are selected and matched using a declarative query. DTs are described in chapter 4.

- The proxy types represent objects stored in other AMOSII servers or in some of the supported types of data sources. The proxies are also described in chapter 4.

- The integration union types (IUTs) are defined as supertypes of other types. An IUT extent contains one instance for each real-world entity represented by the (possibly overlapping) extents of the subtypes. The integration union types are the subject of chapter 5.
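The two extent definitions above can be sketched in Python as plain set computations. This is only an illustration under assumptions: the helper names, the use of a Python predicate in place of a declarative query, and the matching of instances by name are hypothetical, and the actual mechanisms are the subject of chapters 4 and 5.

```python
def derived_extent(supertype_extents, predicate):
    """Objects in every supertype extent that also satisfy the predicate."""
    common = set.intersection(*supertype_extents)
    return {obj for obj in common if predicate(obj)}


def integration_union_extent(subtype_extents, entity_key):
    """Group subtype instances so each real-world entity appears once.

    `entity_key` is a hypothetical function deciding which instances
    denote the same real-world entity.
    """
    merged = {}
    for extent in subtype_extents:
        for instance in extent:
            merged.setdefault(entity_key(instance), []).append(instance)
    return merged


# Derived type: young persons who are both students and employees.
students = {1, 2, 3}
employees = {2, 3, 4}
age = {1: 20, 2: 25, 3: 31, 4: 40}
young_working_students = derived_extent([students, employees],
                                        lambda oid: age[oid] < 30)

# Integration union type over two overlapping sources, matched by name.
source_a = {("a", "Ann"), ("a", "Bob")}
source_b = {("b", "Bob"), ("b", "Cid")}
iut = integration_union_extent([source_a, source_b],
                               lambda instance: instance[1])
```

Here the derived extent is a subset of the intersection of `students` and `employees`, and the integration union extent holds three entities even though the two sources contribute four instances, since "Bob" is represented in both.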

The functions are divided by their implementations into three groups. The extent of a stored function is physically stored in the database. Derived functions are implemented in a declarative OO query language, AMOSQL. Foreign functions are implemented in some other programming language, e.g. Lisp, Java or C++. Each foreign function can have several associated access paths with different implementations and, to help the query processor, each access path has an associated cost and selectivity function [52]. This mechanism is called a multi-directional foreign function.
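A multi-directional foreign function can be pictured as one logical function with several access paths, one per combination of bound and free arguments. The Python sketch below is hypothetical and does not reproduce the AMOSII interface: the class names, the "bf"/"fb" binding-pattern strings (b for bound, f for free), and the constant cost and selectivity values, which stand in for the cost and selectivity functions mentioned above, are all assumptions.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class AccessPath:
    implementation: Callable[[float], List[float]]
    cost: float         # assumed estimate of the cost of one call
    selectivity: float  # assumed estimate of the results per call


class MultiDirectionalFunction:
    """One logical function with an access path per binding pattern."""

    def __init__(self, name):
        self.name = name
        self.paths: Dict[str, AccessPath] = {}

    def add_path(self, binding_pattern, path):
        self.paths[binding_pattern] = path

    def call(self, binding_pattern, value):
        return self.paths[binding_pattern].implementation(value)

    def cheapest(self, usable_patterns):
        """Pick the usable binding pattern with the lowest estimated cost."""
        return min(usable_patterns, key=lambda p: self.paths[p].cost)


def square_forward(x):
    return [x * x]


def square_inverse(y):
    if y < 0:
        return []
    if y == 0:
        return [0.0]
    root = math.sqrt(y)
    return [root, -root]


square = MultiDirectionalFunction("square")
# "bf": argument bound, result free.
square.add_path("bf", AccessPath(square_forward, cost=1.0, selectivity=1.0))
# "fb": result bound, argument free (the inverse direction).
square.add_path("fb", AccessPath(square_inverse, cost=2.0, selectivity=2.0))
```

With both paths registered, a query processor could evaluate the function in whichever direction the available bindings allow, and use the estimates to choose among the usable paths; `square.cheapest(["bf", "fb"])` selects the forward path in this example.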

In document Vanja Josifovski (Page 32-38)