The Customized Database Fragmentation Technique in Distributed Database Systems: A Case Study

Mohammed Ibrahim Shareef Aus Wail Al-Rawi

MASTER THESIS 2011 INFORMATICS

This thesis was carried out at Tekniska Högskolan i Jönköping (School of Engineering, Jönköping University) within the subject area of informatics. The work is part of the master's programme in information technology and management. The authors alone are responsible for the opinions, conclusions and results presented.

Supervisor: Anders Carstensen
Examiner: Vladimir Tarasov
Scope: 30 credits (D-level)
Date:

Abstract

Many companies today use a centralized database system for daily business transactions in different domains. Critical issues have been observed with the complexity, maintenance, performance and communication cost of a centralized data repository when it must process queries from end-users at different locations. Enterprises are therefore striving to implement efficient distributed database systems in their business environments for scalability. Distributed database architecture covers several factors, such as transparent management, replication, fragmentation and allocation. This dissertation focuses on database fragmentation and on the techniques used to perform it.

The objective of this research is to investigate efficient algorithms and techniques for database fragmentation in a distributed environment. After a comparative study of the most suitable techniques, we propose a customized ISUD (Insert, Select, Update, Delete) technique, which is selected for implementation. The customized ISUD technique yields the precedence of the attributes of a relation at the various sites or locations, which serves as the basis for fragmenting the relation horizontally.

The practical objective of this dissertation is to design the architecture, to develop and implement a customized ISUD user interface, and to test the selected technique through that interface. We used C#.NET as the development tool. The user interface accepts ISUD frequencies as input and produces ALP (attribute locality precedence) values as output. We applied the design science research (DSR) method in developing the customized ISUD technique. The technique can be considered a foundation for implementing horizontal database fragmentation in a distributed environment, so that the database administrator can make a proper decision on allocating the fragmented data to the various sites at the initial stage of distributed database design.

Acknowledgements

It is with immense pleasure that we take this opportunity to thank everyone who helped make this project possible.

First of all, we would like to thank almighty God, the Most Beneficent and the Most Merciful. We would like to thank Jönköping University for giving us the opportunity to work on a thesis as part of our curriculum. We would also like to thank our supervisor, Anders Carstensen, for his advice, support and role as facilitator throughout this final project, and our examiner and professor, Dr. Vladimir Tarasov, for his valuable suggestions and guidance throughout our thesis. We also thank Mr Markus Milerup, representing Jordbruksverket (the Swedish Board of Agriculture), for providing information about the company's problems within the scope of this thesis project. Finally, we would like to thank our families and friends, who gave us the social and moral support needed to complete this thesis.

Key words

Distributed database, Database Fragmentation, Attribute Locality precedence, Customized ISUD.

List of Figures

Figure 1: Database Management System Implementation Alternatives [1]
Figure 2: Stages of the Top-Down Approach in Distributed Databases [3] [5]
Figure 3: Block Diagram of the System [24]
Figure 4: Algorithm for Fragmentation [24]
Figure 5: ALP Table Construction Pseudo-Code [24]
Figure 6: Research Design Method [7]
Figure 7: Constructive Research Methodology for Research Design
Figure 8: The General Methodology of Design Science Research [22]
Figure 9: 5-Layer Architecture for Proposed Fragmentation Technique
Figure 10: Application of a Case Study
Figure 11: Database of Case Study Application
Figure 12: CISUD Matrix Table
Figure 13: User Interface for CISUD Application
Figure 14: Interface for Setting and Getting the Predicate Set for Individual Highest Attribute
Figure 15: Predicate Set for Highest Attribute Precedence at Individual Site
Figure 16: Allocation of Fragments
Figure 17: ISUD User Interface for Total Cost of Attribute from All Sites
Figure 18: Results Retrieve for Total ALP (Attribute Locality Precedence) Value from Three Sites
Figure 19: ISUD User Interface for Individual Cost of Attribute from Individual Sites
Figure 20: Individual ALP Results from Individual Sites
Figure 21: Allocation of Data to Different Sites
Figure 22: ISUD Input Values (1)
Figure 23: Interpretation of Result 1
Figure 24: Graphical Interpretation of Result 1
Figure 25: ISUD Input Values (2)
Figure 26: Interpretation of Result 2
Figure 27: Graphical Interpretation of Result 2

List of Tables

Table 1: Project S1
Table 2: Project S2
Table 3: Comparison Framework of Different Techniques with Respect to Key Characteristics

List of Abbreviations

DDBS: Distributed Database Systems
DDBMS: Distributed Database Management Systems
ALP: Attribute Locality Precedence
CISUD: Customized (Insert, Select, Update, Delete)
DSR: Design Science Research

1 Introduction

This introductory chapter presents the research domain, explains the importance of the research, and states its objectives and limitations. It also gives the background of the problem domain and identifies the potential problems found in the literature.

1.1 Background

Distributed database systems are becoming increasingly important for sharing and managing information within large corporations and other organizations. The emergence of distributed database management systems (DDBMS) rests on the maturing of database management systems (DBMS) together with significant developments in computer networks and distributed computing technologies [1]. A distributed database (DDB) is defined as a collection of multiple, logically interrelated databases distributed over a computer network [1]. Distributed database activities are governed by a distributed database management system: “A distributed database management system (DDBMS) is the software system that permits the management of the distributed database and makes the distribution transparent to the users” [1, p.3].

Before going further, it is useful to give a brief overview of the types of distributed database systems. They are categorized as homogeneous distributed database systems (Home-DBS) and heterogeneous distributed database systems (Hetro-DBS) [2]. In a homogeneous distributed database, all sites use the same data models, schemas and database systems, whereas a heterogeneous distributed database has to deal with schema integration, distributed query processing, distributed transaction management, administrative functions and different types of heterogeneity [2]. Heterogeneity can arise in computer hardware, operating systems, communication links, data models, protocols and database management systems [2].

Distributed and parallel processing in database management systems (DBMS) is regarded as an efficient way of improving the performance of applications that manipulate large volumes of data [8]. Distributed database design aims to eliminate irrelevant data accesses during the execution of queries issued from different locations and to reduce the cost of communicating data shared among sites. Distribution design also involves decisions on data fragmentation and on the placement of data across the sites of the distributed environment [8].

In a distributed database, data can be fragmented, replicated and distributed [9] over an intranet or the internet, within one organization or across organizations. The client/server architecture provides a platform in which a number of client machines can access a single database server; it also helps to distribute and allocate data across multiple sites, which have to communicate with each other when responding to user queries and executing remote transactions [1].

Distributed database design involves several issues [5] that complicate the design architecture. In a distributed database system, data often has to be allocated in fragmented, replicated and decentralized form [9]. Fragmentation describes how a relation is divided into several parts that are stored at several sites; a relation can be fragmented horizontally, vertically, or in a mixed form [9]. Replication means that copies of the same data are stored at several sites; a copy may be a fragment of a relation or the whole relation. Replicated data is known to cause problems for update operations, because every copy has to be kept consistent [9]. A decentralized database refers to the distribution of data over a LAN/WAN environment in which relations are stored at different sites [9].

Various approaches [10] [11] have been proposed for database partitioning and fragment allocation in distributed databases. Distributed database design aims to enhance application performance by minimizing the irrelevant data accessed by applications and by minimizing the cost of transferring data when applications are processed at different sites [12].

This dissertation focuses on strategies [12] for propagating data over the network between sites within one organization or across several organizations. These strategies are based on fragmentation [1], which is applied to a relational database schema as horizontal or vertical fragmentation [1]. The main advantage of introducing fragmentation into a distributed database architecture is that data can be placed close to where it is used, which reduces transmission cost and also the size of the relations involved in user queries [1].

1.1.1 Contribution of the thesis

The first contribution of this thesis is an investigation of algorithms for database fragmentation, using a comparative framework of techniques proposed by different researchers, which is explained in detail in section 5.1. The second contribution is the design of the architecture and the implementation of the customized ISUD technique, which is taken from [24] and explained in detail in section 4. The main contribution is the proposed 5-layer architecture, which adds a feature: the creation of an individual ALP table for each site. The technique in [24] only considers the summarized total cost of attribute locality precedence (ALP) over all sites. A detailed explanation is given in sections 5.2.1 and 5.2.2.

1.2 Case Study

1.2.1 Swedish Board of Agriculture

This research relates to the Swedish Board of Agriculture, which has a centralized database system that provides the infrastructure for end-users to access data remotely from all over Sweden. End-users of the organization access the information in the centralized database over a restricted network (extranet) using the Internet Authentication Service (IAS). The centralized database is developed in Oracle and serves the different sites in a homogeneous environment. Many resources in the organization are engaged in maintaining the centralized data, disseminating it within the company and providing different services in response to end-user queries, so the cost of quality assurance is high. The centralized database system contains data from several relational databases, such as the customer database and the administrative/employee database.

1.2.2 Case Study for Testing Purpose or for Evaluation of Proposed Technique

To test the technique developed in this thesis, a separate case study was set up. It uses an information system previously developed for Bharat Transport Service, an Indian logistics company situated in Hyderabad, India. The software comprises several applications, such as vehicle billing information, daily loading reports, vehicle payment details and the generation of different reports. For testing our technique, only the billing information application was used. This application can retrieve data by selected bill number and employee name, save information in the database, and update and delete information. It thus uses the DML (Data Manipulation Language) operations insert, select, update and delete. Because these DML operations are available, we selected this application to test our technique.

1.3 Purpose/Objectives

1.3.1 Research Question

An analytical assessment of the literature [5] shows that some issues in distributed database development relate to database fragmentation. Within distributed database design architecture, this thesis addresses the following questions.

Q.1. What algorithms exist for uniformly fragmenting the relations in a distributed database?

Q.2. How should the architecture of the algorithm selected in Q.1 be designed?

1.3.2 Theoretical Purpose

This dissertation contributes to the field of distributed databases by providing one solution for transforming a traditional centralized database system into a distributed database system. The theoretical purpose of this research is to address data fragmentation problems and to investigate efficient algorithms and techniques for horizontal database fragmentation in a distributed environment.

1.3.3 Practical Purpose

The practical purpose of the research is to design the architecture of a carefully selected algorithm (described in [24]) in a real-world scenario (the case study of Bharat Transport Service), and to implement and test the proposed algorithmic approach. The result helps the database administrator or end-users to make proper fragmentation decisions at the initial stage of a distributed database system by using the ISUD (Insert, Select, Update, Delete) matrix table, which is shown in detail in section 4.

1.3.4 Assumption

The assumptions concern things that were already in place before this work. In our research work:

• The database of the case study used in this research had already been built before the distributed database architecture was developed for testing.

• The techniques for fragmenting the database that are discussed in this research are taken from the literature review; they support the creation of distributed data in distributed environments.

1.4 Limitations

The limitations restrict the scope of the study and set the boundaries and functionality considered in this research work.

1. Our research work is focused on implementing the algorithm of [24] for a distributed database using the horizontal fragmentation technique.

2. We are not concerned with vertical fragmentation or mixed fragmentation.

3. We are also not concerned with the allocation of data to the different sites of the distributed environment.

1.5 Thesis outline

The first chapter introduces the research work, its motivation and purpose, and identifies its problems, assumptions and limitations. The second chapter presents previous approaches, techniques and strategies for fragmentation in distributed databases. The third chapter describes the methodology used to conduct the research and to implement fragmentation in a distributed database architecture. The fourth chapter describes the design and implementation of the algorithm using the horizontal fragmentation technique. The fifth chapter presents the results and their analysis. The sixth chapter contains the conclusion and future work.

2 Theoretical Background

A distributed database is based on several architecture layers that describe a logical collection of data from inter-linked databases [2]. Before going into a detailed discussion, we review the basic fundamentals of database theory. Some basic definitions related to database management systems follow.

A database “is a collection of data, typically containing the information about one or more related organizations” [33, p.11].

A database management system (DBMS) “is a software package designed to store and manage databases” [33, p.11].

A data model “is a collection of concepts for describing data. Data model in database vs. type system in programming language” [33, p.12].

A schema “is a description of a particular collection of data, using the given data model. Schemas in database vs. types in programming language” [33, p.12].

There are different types of data model [33], listed here with example systems:

• Relational data model (the most commonly used): SQL Server, Oracle, Sybase.
• Object-oriented data model: ObjectStore, O2.
• Object-relational data model: UniSQL, Informix Universal Server.
• Semi-structured data model: XML [33].

2.1 General Description of Distributed Database

In the real world, many people need access to a company's databases: employees, customers, potential customers, vendors and suppliers of every kind. Until now, companies have been able to keep their databases at a single server site, accessed worldwide through telecommunication networks and the internet [5]. With a centralized database system, companies have been able to disseminate data within the organization in a very structured manner. However, new business needs and demands, and the adoption of new database architectures for scalability, require new ways of propagating data over distributed locations. Using a distributed database system has many benefits, as explained in the following sections. There are also associated complexities, some of which are described in sections 2.3 and 2.4.

2.1.1 What is a Distributed Database System?

Different authors give several definitions of a distributed database system (DDBS). A basic and generic definition is that a distributed database is a “collection of multiple, logically interrelated databases distributed over a computer network” [1, p.3]. A distributed database management system (DDBMS) is then defined as the “software system that permits the management of the distributed database and makes the distribution transparent to the users” [1, p.3].

2.1.2 Application of Distributed database technology

Many advantages of DDBSs have been listed from different perspectives. The following sections describe some fundamental promises of DDBSs as presented by M. Tamer Özsu [1].

2.1.2.1 Transparent Management of Distributed Systems

Distributed database technology is intended to extend the concept of data independence to environments in which data is distributed and replicated over a number of machines connected by a network [13]. Data independence is provided by several forms of transparency: network (and therefore distribution) transparency, replication transparency, and fragmentation transparency. Transparent access to data separates a system's higher-level semantics from lower-level implementation issues [13].

A transparent system hides the implementation details from its users. The real benefit of a fully transparent DBMS is the support it gives to the development of complex applications. This can be explained with an example adapted from M. Tamer Özsu [1].

Consider Jönköping University, which has several schools, such as the School of Engineering (JTH), Jönköping International Business School (JIBS) and the School of Health Sciences. The university runs programs at each school's site and maintains a database of its employees, program information and related data. Assume the database is relational and stores this information in two relations, EMP(ENO, ENAME, TITLE) and PROG(PNO, PNAME, PROGDETAIL). We add a third relation, SAL(TITLE, AMT), to store the salary for each title, and a fourth relation, ASG(ENO, PNO, RESP, DUR), which records which employees are assigned to which program, with what responsibility and for what duration. If this data is stored in a centralized DBMS and we want to find the names and salaries of the employees who have worked on a program for more than 6 months, we can retrieve them with the following SQL query [1].

Example

SELECT ENAME, AMT
FROM EMP, ASG, SAL
WHERE ASG.DUR > 6
  AND EMP.ENO = ASG.ENO
  AND SAL.TITLE = EMP.TITLE

The example shows a query that obtains its results from a centralized database system, using the tables (relations) named in the FROM clause and joined in the WHERE condition; where and how the data is stored is transparent to the user. The centralized university database can be made distributed by localizing the data, so that the data of JTH employees is stored at the JTH office, the data of JIBS employees at the JIBS office, and so on. The same can be applied to the other relations, which hold program and salary information. What we do here is partition the relations and store each partition at a different site, which is known as fragmentation. Fully transparent access means that the user can still use the same query as in the example above, without any concern about fragmentation or the location of data, and rely on the system to resolve these issues [1]. Several types of transparency in a distributed environment are explained by M. Tamer Özsu in [1], including fragmentation transparency, network transparency and replication transparency. As our research is concerned with fragmentation, we explain fragmentation transparency below.
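Fragmentation transparency can be illustrated with a minimal sketch in Python using the standard sqlite3 module. It is an illustration under stated assumptions, not the system described in [1]: a single in-memory database stands in for all sites, the fragment names EMP_JTH and EMP_JIBS and all sample rows are hypothetical, and a view named EMP plays the role of the layer that reassembles the global relation from its horizontal fragments. The query from the example above then runs unchanged.

```python
import sqlite3

# One in-memory database stands in for all sites (assumption for illustration).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.executescript("""
-- Hypothetical horizontal fragments of EMP, one per school site.
CREATE TABLE EMP_JTH  (ENO TEXT, ENAME TEXT, TITLE TEXT);
CREATE TABLE EMP_JIBS (ENO TEXT, ENAME TEXT, TITLE TEXT);
CREATE TABLE ASG (ENO TEXT, PNO TEXT, RESP TEXT, DUR INTEGER);
CREATE TABLE SAL (TITLE TEXT, AMT INTEGER);

-- Hypothetical sample rows.
INSERT INTO EMP_JTH  VALUES ('E1', 'Anna', 'Lecturer');
INSERT INTO EMP_JIBS VALUES ('E2', 'Erik', 'Professor');
INSERT INTO ASG VALUES ('E1', 'P1', 'Teacher', 12);
INSERT INTO ASG VALUES ('E2', 'P2', 'Manager', 3);
INSERT INTO SAL VALUES ('Lecturer', 40000);
INSERT INTO SAL VALUES ('Professor', 60000);

-- The global relation EMP is rebuilt as the union of its fragments.
CREATE VIEW EMP AS
    SELECT * FROM EMP_JTH
    UNION ALL
    SELECT * FROM EMP_JIBS;
""")

# The same global query as in the centralized case; the user never
# names a fragment or a site.
GLOBAL_QUERY = """
SELECT ENAME, AMT
FROM EMP, ASG, SAL
WHERE ASG.DUR > 6
  AND EMP.ENO = ASG.ENO
  AND SAL.TITLE = EMP.TITLE
"""
rows = cur.execute(GLOBAL_QUERY).fetchall()
print(rows)
```

In a real DDBMS the fragments would reside at different sites and the system, rather than a view in one database, would locate and combine them; the point of the sketch is only that the query text does not change.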

2.1.2.2 Fragmentation Transparency

The form of transparency most relevant here is fragmentation transparency. With the technique proposed in chapter 4 we show that it is possible to fragment a relation horizontally into smaller fragments and to treat each fragment as a separate database object (relation). The motive for fragmentation is to increase performance, availability and reliability [1]. There are generally two types of fragmentation, horizontal fragmentation (HF) and vertical fragmentation (VF). In HF, a relation is divided into sub-relations, each of which holds a subset of the rows (tuples) of the original relation. In VF, a relation is divided into sub-relations, each of which is defined on a subset of the columns (attributes) of the original relation.
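The two fragmentation types correspond to the relational selection and projection operators. The following Python sketch shows both on a small in-memory copy of EMP; the sample rows and the predicate on TITLE are hypothetical and chosen only for illustration. It also checks that the original relation can be rebuilt, by a union for the horizontal fragments and by a join on the key ENO for the vertical ones, which is why each vertical fragment keeps the key.

```python
def horizontal_fragment(relation, predicate):
    """Selection: keep the rows (tuples) that satisfy the predicate."""
    return [row for row in relation if predicate(row)]


def vertical_fragment(relation, attributes):
    """Projection: keep only the named columns (attributes) of every row."""
    return [{a: row[a] for a in attributes} for row in relation]


def by_key(rows):
    """Sort rows on the key ENO so that relations can be compared."""
    return sorted(rows, key=lambda r: r["ENO"])


# Hypothetical sample data for EMP(ENO, ENAME, TITLE).
EMP = [
    {"ENO": "E1", "ENAME": "Anna", "TITLE": "Lecturer"},
    {"ENO": "E2", "ENAME": "Erik", "TITLE": "Professor"},
    {"ENO": "E3", "ENAME": "Sara", "TITLE": "Lecturer"},
]

# Horizontal fragmentation: disjoint subsets of rows.
hf1 = horizontal_fragment(EMP, lambda r: r["TITLE"] == "Lecturer")
hf2 = horizontal_fragment(EMP, lambda r: r["TITLE"] != "Lecturer")

# Vertical fragmentation: subsets of columns, each including the key ENO.
vf1 = vertical_fragment(EMP, ["ENO", "ENAME"])
vf2 = vertical_fragment(EMP, ["ENO", "TITLE"])

# Reconstruction: union of horizontal fragments, join of vertical fragments.
rebuilt_from_hf = by_key(hf1 + hf2)
title_of = {r["ENO"]: r["TITLE"] for r in vf2}
rebuilt_from_vf = by_key([{**r, "TITLE": title_of[r["ENO"]]} for r in vf1])

print(rebuilt_from_hf == EMP, rebuilt_from_vf == EMP)
```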

When the relations of a database are fragmented, user queries have to be handled in terms of the sub-relations. This is done by finding a query processing strategy based on the fragments rather than on the relations [1]; in other words, a global query is converted into several fragment queries. Fragmentation transparency is therefore largely a query processing issue [1].

2.1.2.3 Availability and Reliability

Availability is defined as the probability that the system is continuously up during a given time period [12], whereas reliability is defined as the probability that the system is up at a specified point in time [12]. Both improve with a DDBS. In a centralized DBS, if the single site goes down, the entire system goes down, whereas in a DDBS a failure affects only the site that is down; the other sites and the rest of the system keep working. If data is also replicated at different sites, the effect of a failure is reduced further [12].
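The effect of replication on failures can be made concrete with a small calculation. Assuming, purely for illustration, that sites fail independently and that each site is up with probability p at a given moment, the probability that at least one of n replicas of a data item is up is 1 - (1 - p)^n. The function name and the figures below are our own and are not taken from [12].

```python
def prob_any_replica_up(site_up_probability: float, replicas: int) -> float:
    """Probability that at least one of `replicas` copies is up.

    Assumes independent site failures (an illustrative simplification).
    """
    if not 0.0 <= site_up_probability <= 1.0:
        raise ValueError("site_up_probability must lie in [0, 1]")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return 1.0 - (1.0 - site_up_probability) ** replicas


# Hypothetical figures: each site is up 90% of the time.
for n in (1, 2, 3):
    print(n, round(prob_any_replica_up(0.9, n), 4))
```

The calculation covers read access only. An update has to reach every copy, which is where replication becomes costly.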

2.1.2.4 Improved Performance

When a very large database is distributed over a number of sites, the local subset of the database at each site is much smaller, which tends to reduce the amount of data a local transaction must process and therefore its processing time. Response time can also improve for transactions that access more than one site, because processing at the different sites can be performed in parallel [12].

2.2 Distributed Database Architecture

A distributed database system allows applications to access data from local and remote databases. In a homogeneous distributed system, every site runs the same DBMS. In a heterogeneous distributed system, at least one of the sites runs a different DBMS.

2.2.1 Architectural Models for Distributed database system

A DDBMS can be constructed in several ways. The alternatives are classified along three dimensions, shown in Figure 1: (1) autonomy, (2) distribution and (3) heterogeneity [1].

Figure 1: Database Management System Implementation Alternatives[1]

(1) Autonomy. Autonomy refers to the distribution of control, not of data, and indicates the degree to which individual DBMSs can operate independently [1]. It is a function of factors such as whether the component systems exchange information, whether they can execute transactions independently, and whether they may be modified. The requirements of an autonomous system have been specified in several ways [1]. According to Gligor and Popescu-Zeletin [1]:

(i) Local operations are not affected by participation in the global multi-database system.

(ii) Query processing and optimization at the individual DBMSs are not affected by global queries that access multiple databases.

(iii) System consistency is not compromised when individual databases are added to or removed from the global database.

According to Du and Elmagarmid [1][13]:

(i) Design autonomy: each DBMS is free to use the data model and transaction management techniques it prefers.

(ii) Communication autonomy: each DBMS decides for itself what information to provide to the other DBMSs.

(iii) Execution autonomy: each DBMS can execute the transactions submitted to it in any way it wants.

Autonomy can be classified as follows.

- Tight integration: a single image of the entire database is available to all users who want to share the information.

- Semiautonomous systems: these consist of DBMSs that each determine which parts of their own database are shared; the DBMSs have to be modified so that they can exchange information with each other.

(2) Distribution. Distribution refers to the physical distribution of data and software components over multiple sites, while the user sees the data as one transparent, logical pool [1]. There are two classes of distribution, client/server and peer-to-peer [1].

Client/server distribution: data management is concentrated at the server, where the data is primarily stored, while the clients generate requests and obtain the data whenever it is needed [1].

Peer-to-peer distribution: the data is fully distributed and there is no distinction between client machines and servers. Every machine has full DBMS functionality and can communicate with other machines to execute queries and transactions. Each site maintains a portion of the database [1].

(3) Heterogeneity. Heterogeneity occurs in various forms in distributed systems, such as differences in hardware, communication and operating systems. With respect to databases, it concerns data models, data formats, query languages and transaction management algorithms. Accessing a remote DBS that differs in any of these therefore requires conversion [1].

2.3 Unsolved problems in DDBS

2.3.1 Distribution design

The methodology for distributed database design varies with the system architecture. For tightly integrated distributed databases, design proceeds top-down, from requirements analysis and logical design of the global database to physical design of each local database [13]. For distributed multi-database systems, the design process is bottom-up and involves the integration of existing databases [13].

The step of interest in the top-down process is distribution design, described in [13], which involves designing the local conceptual schemas by distributing the global entities over the sites of the distributed system. The global entities are specified in the global conceptual schema. In the relational model, both global and local entities are relations, so distribution design maps global relations to local ones [13]. One of the most important research issues requiring attention is the development of a practical distribution design methodology and its integration into the general data-modeling process [13].

The two main aspects of distribution design are fragmentation and allocation. In fragmentation, each global relation is partitioned into a set of fragment relations [13], whereas allocation concerns the (possibly replicated) distribution of these local relations across the sites of the distributed system [13]. Research on fragmentation has focused on horizontal (selecting) and vertical (projecting) fragmentation of global relations [13]. Many algorithms based on mathematical optimization formulations have been proposed for allocation [13]. However, there is no underlying design methodology that combines the fragmentation and allocation techniques; they are typically treated independently.

2.3.2 Network scaling problems

The database community does not have a full understanding of the performance implications of all the design alternatives for a distributed DBMS [13]. Questions have therefore been raised about the scalability of some protocols and algorithms as systems become geographically distributed or as the number of system components increases [13]. One concern is the suitability of the distributed transaction-processing mechanisms (two-phase locking, 2PL, and particularly the two-phase commit protocol, 2PC) for distributed database systems based on wide area networks [13]. These protocols carry an overhead, and implementing them over a slow wide area network may pose difficulties [13].

2.4 Distribution Design Problems

For distributed databases, fragmentation and allocation are the major problems of database distribution design. Current research often applies design methods such as mathematical programming in order to minimize the cost of storing the database, the cost of processing transactions against it, and the communication cost [28]. In practice it is very difficult to study database distribution design together with other problems, because each problem is difficult to study in its own right.

2.4.1 The Complexity of the Problems

The combined problem of fragmentation and allocation has been shown to be hard [28]. Fragmentation and allocation are distribution design techniques used to improve system performance, and each of them has a massive search space for the best solution.

Because of the complexity of the fragmentation and allocation problems, allocation is treated independently from fragmentation [28]. The literature shows that most allocation methods assume that fragmentation has already been done, so that the output of fragmentation becomes the input to allocation. Separating fragmentation from allocation simplifies the formulation of the problem by reducing the decision space, although the separation itself contributes to the complexity of the allocation models [28]. Both steps take user applications as input and aim to improve system performance; they differ only in that fragmentation works on the global database schema while allocation works on fragments. Thus, the application information and the relationships between fragments need to be specified again when doing allocation [28]. It would be worthwhile to develop a methodology that takes the interdependence of fragmentation and allocation into account [28].

2.4.2 Interdependencies with Query Optimization

Designing distributed database systems is a complex task because many other issues are involved, such as query processing and optimization, data replication, concurrency control, directory management, reliability, and recovery [28]. Of these, query processing and optimization is closely interrelated with fragmentation and allocation. Query optimization in distributed systems depends on how data are fragmented and allocated, since query processing defines the sequence of operations of a query and assigns those operations to sites according to the allocation of fragments [28].

2.4.3 Improvised Solution for the problems mentioned

To reduce the complexity of the problem and make it more tractable, researchers have used the following simplifications in the literature.

• Fragmentation and allocation are mostly treated as two separate steps. Fragmentation is performed first, without considering how the resulting fragments will be allocated, while allocation is performed on the assumption that fragmentation has already been decided [28]. Similarly, allocation is studied on the assumption that a fixed query optimization method is used to generate the processing schedule [28], while query optimization is studied on the assumption of a fixed data allocation [28].

• Both a simple query environment and a query site strategy are assumed while studying allocation. Under the first assumption, network information is not considered [28]. Under the second assumption, queries that need to be processed in a distributed way are not considered. Therefore, query trees are not used and the allocation of intermediate nodes is not considered [28].

• Query optimization is disregarded while studying allocation. A realistic fragment allocation can only be achieved when distributed query optimization is performed after fragmentation [28].

Some other ad hoc solutions proposed in the literature yield workable overall system designs by ignoring the interdependencies between the individual problems, which makes these approaches inefficient with respect to obtaining an optimal database distribution design [28].

2.5 Initial Design Approach for Distributed Database Design

One of the prime tasks of this research is to investigate and develop a fragmentation technique for a distributed database environment that manages data from various locations. We chose the top-down design process for database fragmentation in the initial state of the design. A framework for this process is shown in Figure 2 [3] [5].

The top-down approach is used frequently in different areas of computer science. The top-down design process consists of a set of stages for designing the distributed database. These stages share information at various levels in an incremental style in order to construct a homogeneous distributed database system from scratch [5].

[Diagram: Requirement analysis, Conceptual project, Logical project, Distribution project and Physical project, connected by user input, integration and correction flows]

Figure 2: Stages of the top-down approach in distributed databases [3] [5]

The stages of the top-down approach in distributed databases are described below.

2.5.1 Requirements analysis

In this stage, information about the data, restrictions and relationships within the organization is collected. The requirements analysis is carried out through meetings with the users, in which it can be observed how the organization operates. After the requirements have been analyzed, a requirement specification document is created.

2.5.2 Conceptual project

At this level the data and its relationships are modeled independently of the structure that will represent the distributed database system (conceptual modeling). The conceptual project is carried out by analyzing the requirement specification.

On completion of the conceptual project, a conceptual schema with the data integrity restrictions is obtained.

2.5.3 Logical project

At this level the conceptual schema is converted into the schema that represents the distributed database system, i.e. the logical schema. The logical project is carried out by applying conversion rules that translate the conceptual schema into the relational model of the distributed database. At the end of the logical project a logical schema with tables, stored procedures, views, access authorizations, etc. is obtained [5].

2.5.4 Distribution project

At this level it is decided how the data and programs are to be fragmented and allocated across the nodes of the computer network. In a few cases the network itself is designed and built to satisfy the needs of the distributed database project. This level is considered the most critical and important one in a distributed database project. To support this phase of the top-down approach, we relate it to five generic steps for data distribution with respect to fragmentation and allocation in a distributed environment, which are explained in detail in section 2.8.

2.5.5 Physical project

At this level the logical schema is defined in a DDS that suits the data model. The physical project is carried out by means of SQL instructions. The result is a physical schema that conforms to what was established in the distribution project. Once the physical project has been finished for each node of the computer network, the distributed database is ready for use. A monitoring process is set up to detect errors; these errors constitute the system feedback and are sent to the people responsible for the construction of the distributed database [5].

2.6 Fragmentation in Distributed Database Design

Fragmentation: “Fragmentation is a design technique to divide a single relation or class of a database into two or more partitions such that the combination of the partitions provides the original database without any loss of information” [28, p.3].

“A fragment i.e. horizontal or vertical of a database object in an object-oriented database system contains subsets of its instance objects (or class extents) reflecting the way applications access the database objects” [34, p.1].

Distributed processing on a DBMS is an effective way of improving the performance of applications that operate on huge amounts of data [2]. The major goals of distributed database design are to eliminate access to irrelevant data while executing queries and to reduce the data exchange among sites. To this end, the design fragments the relations in the case of an RDBMS (relational DBMS), or the classes in the case of object-oriented databases, and allocates and replicates the fragments at different sites of the distributed system with local optimization at each site.

Fragmentation is a promising design technique that divides a single relation or class of a database schema into two or more partitions such that the combination of the partitions provides the original database without loss of information [28] [4]. Horizontal fragmentation (HF) partitions a relation or class into disjoint sets of tuples or instances [2]. Vertical fragmentation (VF) partitions a relation or class into sets of columns or attributes that are disjoint except for the primary key [2].
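As a minimal sketch of the two definitions above, the following Python snippet fragments a small relation horizontally (by selection) and vertically (by projection, repeating the primary key) and checks that each fragmentation reconstructs the original relation. The rows, the budget threshold, and the helper names `horizontal_fragment`, `vertical_fragment` and `rejoin` are illustrative assumptions, not part of the cited techniques.

```python
# Illustrative relation; rows and column names are assumptions for this sketch.
PROJECTS = [
    {"project_no": "P1", "name": "Database Development", "budget": 150_000},
    {"project_no": "P2", "name": "Ontology based Portal", "budget": 200_000},
    {"project_no": "P3", "name": "Web Development", "budget": 250_000},
]


def horizontal_fragment(rows, predicate):
    """Selection: keep whole tuples that satisfy the predicate."""
    return [row for row in rows if predicate(row)]


def vertical_fragment(rows, key, columns):
    """Projection: keep the primary key plus the chosen columns."""
    return [{c: row[c] for c in (key, *columns)} for row in rows]


def rejoin(left, right, key):
    """Reconstruct a vertically fragmented relation by joining on the key."""
    index = {row[key]: row for row in right}
    return [{**row, **index[row[key]]} for row in left]


# Horizontal fragments: disjoint sets of tuples.
hf_low = horizontal_fragment(PROJECTS, lambda r: r["budget"] <= 200_000)
hf_high = horizontal_fragment(PROJECTS, lambda r: r["budget"] > 200_000)

# Vertical fragments: disjoint sets of non-key columns, key repeated in both.
vf_name = vertical_fragment(PROJECTS, "project_no", ["name"])
vf_budget = vertical_fragment(PROJECTS, "project_no", ["budget"])

# Reconstruction: union for horizontal, key join for vertical fragments.
assert sorted(hf_low + hf_high, key=lambda r: r["project_no"]) == PROJECTS
assert rejoin(vf_name, vf_budget, "project_no") == PROJECTS
```

The two assertions correspond to the "without loss of information" requirement: a union restores a horizontally fragmented relation, and a join on the primary key restores a vertically fragmented one.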

Previous HF, VF and MF (mixed fragmentation) techniques have the following problems in common:

• Most of them use the frequency of queries, the affinity of minterm predicates, or the attribute affinity matrix (AAM) as the basis of fragmentation. These require sufficient empirical data, which are not available in most cases at the initial stage [24] [28].

• Most of them concentrate only on the fragmentation problem and overlook the allocation problem in order to reduce complexity [24].

• Minimizing distributed joins is a fundamental fragmentation issue [3].

• A second fundamental issue is related to semantic data control, specifically to integrity checking [3].

2.6.1 Horizontal Fragmentation

Horizontal fragmentation is divided into two types: primary and derived. Primary horizontal fragmentation of a relation or class is performed using the predicates of the queries that access the relation or class, while derived horizontal fragmentation of a relation or class is performed based on the horizontal fragmentation of another relation or class [28].

2.6.1.1 Primary Horizontal Fragmentation for Relational Databases

Among the existing approaches to horizontal fragmentation, primary horizontal fragmentation in the context of the relational data model was first proposed by Ceri et al. in 1982 [29], using minterm predicates.

Minterm-predicate-based approaches: “which perform primary horizontal fragmentation using a set of minterm predicates” [28, p.11] [29].

Subsequently, [24] [30] proposed a technique based on the attribute usage matrix (AUM) for vertical fragmentation.

Affinity-based approaches: “which first group predicates according to predicate affinities and then perform primary horizontal fragmentation using conjunctions of the grouped predicates, e.g., [28] [30]. The way of grouping predicates is either graph-based or using an objective function [28] [30]”.

We have taken the following definitions related to minterm predicates from the literature [28].

Definition 1: “For a given relation $R = \{A_1 : D_1, \ldots, A_n : D_n\}$, a simple predicate is in the form of $p_k : A_i \; \theta \; Value$, with $A_i$ as an attribute defined over $D_i$ and $\theta \in \{=, <, \neq, \leq, >, \geq\}$” [28].

Definition 2: “Minterm predicates $M = \{m_1, m_2, \ldots, m_z\}$ over a set $Pr$ of simple predicates are the conjunctions of simple predicates and their negations: $M = \{ m_j \mid m_j = \bigwedge_{p_k \in Pr} p_k^{*} \}$, $k = 1, \ldots, m$, $j = 1, \ldots, z$, where $p_k^{*} = p_k$ or $p_k^{*} = \neg p_k$. Note that all simple predicates in $Pr$ appear (positively or negatively) in each minterm predicate” [28, p.12].

Definition 3: “A set of simple predicates Pr is said to be complete if and only if there is an equal probability of access by every application to any tuple belonging to any fragment that is defined according to Pr” [28, p.12].
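To make Definition 2 concrete, the sketch below enumerates every minterm predicate over a small set of simple predicates and assigns each tuple to the one minterm it satisfies. The two simple predicates, the sample rows and the function names `minterms` and `satisfies` are hypothetical choices made only for illustration.

```python
from itertools import product


def minterms(simple_predicates):
    """Return all 2^m conjunctions of the predicates or their negations.

    Each minterm is a tuple of (name, positive) pairs; positive == False
    means the simple predicate appears negated.
    """
    names = list(simple_predicates)
    return [tuple(zip(names, polarity))
            for polarity in product((True, False), repeat=len(names))]


def satisfies(row, minterm, simple_predicates):
    """True when the row makes every literal of the minterm true."""
    return all(simple_predicates[name](row) == positive
               for name, positive in minterm)


# Hypothetical simple predicates over a Projects-like relation.
PREDICATES = {
    "budget<=200000": lambda r: r["budget"] <= 200_000,
    "location=Stockholm": lambda r: r["location"] == "Stockholm",
}

ROWS = [
    {"project_no": "P1", "budget": 150_000, "location": "Jönköping"},
    {"project_no": "P2", "budget": 200_000, "location": "Stockholm"},
    {"project_no": "P3", "budget": 250_000, "location": "Göteborg"},
]

M = minterms(PREDICATES)

# Each minterm defines one candidate horizontal fragment.
fragments = {m: [r["project_no"] for r in ROWS if satisfies(r, m, PREDICATES)]
             for m in M}
```

Because every simple predicate appears positively or negatively in each minterm, each tuple falls into exactly one fragment; minterms that no tuple satisfies simply yield empty fragments.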

The use of minterm predicates for horizontal fragmentation was first proposed by Ceri and Pelagatti in 1982 [28]; files are fragmented horizontally to optimize the frequency of accesses performed by the application programs at the different sites. According to that work, the minterm fragments contain records that are accessed homogeneously by all transactions, and they are therefore used as the proper units of allocation.

Several researchers have adapted affinity-based vertical fragmentation algorithms to horizontal fragmentation. Because checking the completeness of the set of simple predicates used for horizontal fragmentation is complex, Zhang [28] adapted an affinity-based vertical fragmentation approach to horizontal fragmentation. This approach takes predicate usage and the predicate affinity matrix as input and employs the bond energy algorithm to cluster predicates. However, the fragments in the resulting fragmentation schema may overlap and therefore may not satisfy the correctness criteria of fragmentation.

2.6.1.2 Derived Horizontal Fragmentation

In the relational data model, derived fragmentation refers to horizontal fragmentation. Derived horizontal fragmentation splits a relation in dependence on another relation by applying semi-join operations [28].

The dependence among the relations depicts a binary relationship between them. The direct link is based on equi-join operations and represents a one-to-many relationship [28]. When more than one candidate fragmentation exists, [28] suggests two criteria: choosing the fragmentation with better join characteristics, or choosing the fragmentation used in more applications. Derived horizontal fragmentation is explained here by an example with the relations Employee, Assignment, Projects and Salary. Every relation has its own primary key for selecting records according to the values of the predicate constraints.

Relations:

Employee: Employee ID, Employee Name, Title
Assignment: Employee Number, Project No, Duration
Projects: Project No, Project Name, Budget, Location
Salary: Title, Salary

The underlined attributes above are the primary key attributes of the relations. The foreign key relationships are as follows:

Employee.Employee_ID → Assignment.Employee_No
Projects.Project_No → Assignment.Project_No
Salary.Title → Employee.Title

Derived horizontal fragmentation fragments a relation S based on the fragmentation of another relation R, where R is already fragmented into R1, R2, R3, ..., Rn. Using the semi-join operator, each fragment of S is defined as

$S_i = S \ltimes R_i = S \ltimes \sigma_{p_i}(R) = \pi_{S.*}(S \bowtie \sigma_{p_i}(R))$

where the fragmentation expression (the predicate $p_i$) refers only to R. The following example shows the mechanism of derived horizontal fragmentation. The Projects relation is first split by primary horizontal fragmentation into the fragments Project S1 and Project S2, and the relation that depends on it is then fragmented accordingly.

Project S1

Project_No  Project_Name           Budget   Location
P1          Database Development   150.000  Jönköping
P2          Ontology based Portal  200.000  Stockholm

Table 1: Project S1

Project S2

Project_No  Project_Name           Budget   Location
P3          Web Development        250.000  Göteborg
P4          Maintenance            100.00   Växjö

Table 2: Project S2

Similarly, the Assignment relation is fragmented according to the Project fragments S1 and S2 by means of semi-joins:

Assignment S1 = Assignment ⋉ Project S1
Assignment S2 = Assignment ⋉ Project S2

Assignment

Employee_No  Project_No  Duration
E1           P1          5
E2           P4          4
E2           P1          3
E3           P4          5
E4           P1          4
E4           P3          5
E5           P2          7

Assignment S1

Employee_No  Project_No  Duration
E1           P1          5
E2           P1          3
E4           P1          4
E5           P2          7

Assignment S2

Employee_No  Project_No  Duration
E2           P4          4
E3           P4          5
E4           P3          5

With the above mechanism of derived horizontal fragmentation, we achieve the desired fragmentation with good join characteristics. The benefit of derived fragmentation is that join operations in the distributed database retrieve the desired tuples or records efficiently according to the predicate or minterm. A real-world scenario that illustrates the mechanism of fragmentation and allocation in a distributed database system is presented in section 2.8.
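The derived fragmentation of the example can be reproduced with a few lines of Python. This is only a sketch under simple assumptions: relations are lists of tuples, the Budget column is left out because the semi-join does not use it, and `semi_join` is a hypothetical helper name.

```python
# Owner relation Projects: (Project_No, Project_Name, Location).
PROJECTS = [
    ("P1", "Database Development", "Jönköping"),
    ("P2", "Ontology based Portal", "Stockholm"),
    ("P3", "Web Development", "Göteborg"),
    ("P4", "Maintenance", "Växjö"),
]

# Member relation Assignment: (Employee_No, Project_No, Duration).
ASSIGNMENT = [
    ("E1", "P1", 5), ("E2", "P4", 4), ("E2", "P1", 3), ("E3", "P4", 5),
    ("E4", "P1", 4), ("E4", "P3", 5), ("E5", "P2", 7),
]


def semi_join(member, owner_fragment):
    """Keep member tuples whose Project_No occurs in the owner fragment."""
    keys = {project_no for project_no, *_ in owner_fragment}
    return [row for row in member if row[1] in keys]


# Primary horizontal fragments of the owner relation.
project_s1 = [p for p in PROJECTS if p[0] in {"P1", "P2"}]
project_s2 = [p for p in PROJECTS if p[0] in {"P3", "P4"}]

# Derived horizontal fragments of the member relation.
assignment_s1 = semi_join(ASSIGNMENT, project_s1)
assignment_s2 = semi_join(ASSIGNMENT, project_s2)

# The derived fragments are disjoint and together rebuild Assignment.
assert sorted(assignment_s1 + assignment_s2) == sorted(ASSIGNMENT)
```

Since every Assignment tuple references exactly one project and the Project fragments are disjoint, each Assignment tuple lands in exactly one derived fragment, which is what the final assertion checks.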

2.7 Previous works on Fragmentation in DDBS

The two main design techniques of distributed database design are fragmentation and allocation. The database distribution problem has been studied since the 1970s: first as the problem of file distribution, and then as the problem of distributing relations or relation fragments. After the emergence of the object-oriented data model, some existing fragmentation and allocation approaches were adapted to it. To give an overall picture of database distribution design, we present an overview of previous work with respect to horizontal fragmentation and allocation.

In 1999, Ozsu and Valduriez proposed an iterative algorithm called COM_MIN, which generates a complete and minimal set of predicates from a given set of simple predicates [1]. Once the minterm predicates have been obtained, the access frequencies are defined, and the data is fragmented using the access frequency table, as explained by Ozsu.

In 2002, Cheng et al. [28] [31] proposed a genetic algorithm-based clustering approach that takes a predicate matrix as input and treats horizontal fragmentation as a traveling salesman problem (TSP). Horizontal fragmentation is achieved by performing selection operations using the sets of grouped predicates, which are grouped according to their distances. The distance of each pair of attributes measures the access frequencies of the transactions that do not access the pair together. Additional analysis is needed to simplify the clusters of predicates. None of the affinity-based horizontal fragmentation approaches takes data locality into consideration while clustering predicates.

In 2004, Baião et al. proposed a technique that takes a predicate affinity matrix as input and builds a predicate affinity graph, which then defines the horizontal class fragments [24].

In 2006, H. Ma and K. D. Schewe proposed a technique that takes an attribute usage frequency matrix (AUFM) as input; vertical fragmentation is performed based on this matrix and a cost model [24]. In 2007, M. Alfares et al. extended this technique, using an Attribute Affinity Matrix (AAM) as input to generate groups based on affinity values [24].

In 2008, Marwa et al. extended the technique of M. Alfares et al. by using an instance request matrix to fragment the data horizontally for object-oriented databases [24] [32]. Their paper introduces a new algorithm for horizontal fragmentation in an Object Oriented Distributed Database System (OODDBS) [32]. In 2009, H. Mahboubi and J. Darmont proposed a technique that uses predicate affinity for HF in data warehouses [24] [33]. Their work addresses XML warehouse fragmentation; it focuses on the initial horizontal fragmentation of the dimensions' XML documents and exploits two alternative algorithms [33].

In the context of our study, the research paper by Shahidul Islam Khan and Dr. A. S. M. Latiful Hoque [24], published in 2010, provides a fragmentation technique that can be applied at the initial stage of the design of a distributed database system. They propose a single algorithm that performs fragmentation and allocation simultaneously, and they state that the technique can be used to solve the initial fragmentation problem of relational databases for any distributed database system. From the literature review we found this technique to be the most suitable one to implement with respect to the characteristics we were looking for, as shown in table-3 in chapter 5.

2.7.1 Database Fragmentation Technique by Shahidul Islam Khan and Dr. A. S. M. Latiful Hoque

This technique fragments a relation horizontally with the help of the locality precedence of its attributes. “Attribute locality precedence (ALP) can be defined as the value of importance of an attribute with respect to sites of distributed database” [24, p.2]. The block diagram of their system, which depicts the development of the fragmentation technique, is shown below.

[Block diagram with the components: Relation, MCRUD Frequency Matrix, ALP Table, Predicate Set, Fragmented Sub-Relation, Allocation]

Figure 3. Block diagram of the system [24]

The block diagram shows the working pattern of their technique in sequential form. First, a relation that needs to be fragmented is taken from the database; then a modified CRUD (Create, Read, Update, Delete) frequency matrix, MCRUD, is created according to the predicates (queries) of the selected relation. “A data-to-location MCRUD matrix is a table of which rows indicate attributes of the entities of a relation and column indicate different locations of the applications” [24, p.2]. It is used by database designers and system analysts during requirement analysis to decide how to map data to the different locations [24]. We customized the existing MCRUD matrix according to our requirements and named it the Customized Insert, Select, Update, Delete (CISUD) matrix. The reason for customizing the MCRUD matrix into the CISUD matrix is to make the technique practically implementable in a real-world scenario. The MCRUD technique provides an algorithm and pseudo-code to calculate the total ALP value over all three sites; we customized and extended it to calculate the ALP value for individual sites, and we provide an architecture for implementing the technique in practice.

2.7.1.1 Fragmentation Allocation algorithm

The algorithm generates the ALP (Attribute Locality Precedence) table, i.e. it calculates the importance of each attribute at a particular location. An overview of the fragmentation and allocation algorithm is given in Figure 4. The inputs of the algorithm are the total number of sites, the relation of the database that needs to be fragmented, and the CISUD matrix of the relation; the output is the set of fragments F1, F2, F3, etc., obtained from the ALP costs. In step 1 the ALP table is constructed from the ISUD matrix based on the cost functions; in step 2 a predicate set is generated for the attribute with the highest value in the ALP table, and it is then rearranged to fragment the relation across the different sites.

Figure 4: Algorithm for Fragmentation[24]

In Figure 4, the authors of [24] express the algorithm in pseudo-code, including the construction of the ALP (Attribute Locality Precedence) table, as explained in [24]. Our contribution in this research work is a customization of their MCRUD (Create, Read, Update, and Delete) technique.

“Input:  Total number of sites: S = {S1, S2, …, Sn}
         Relation to be fragmented: R
         ISUD matrix: ISUD[R]

Output: Fragments F = {F1, F2, F3, …, Fn}

Step 1: Construct ALP[R] from ISUD[R] based on cost functions

Step 2: For the highest valued attribute of ALP table
    a. Generate predicate set P = {P1, P2, …, Pm}
    b. Rearrange P so that #P = #S
    c. Fragment R using P as selection predicate: $\sigma_p(R)$”
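Step 2 of the algorithm can be sketched in Python as follows. The sketch assumes that step 1 has already produced the ALP values and that the predicate set has already been rearranged so that there is exactly one selection predicate per site; the ALP numbers, the rows and the name `fragment_on_top_attribute` are hypothetical and serve only as an illustration.

```python
def fragment_on_top_attribute(rows, alp, site_predicates):
    """Fragment rows on the attribute with the highest ALP value.

    site_predicates[attribute] maps each site to the selection predicate
    (a function row -> bool) that defines the fragment for that site.
    """
    top = max(alp, key=alp.get)
    fragments = {site: [row for row in rows if predicate(row)]
                 for site, predicate in site_predicates[top].items()}
    return top, fragments


# Hypothetical inputs: ALP values, rows, and one predicate per site.
ALP = {"location": 8, "budget": -1}
ROWS = [
    {"project_no": "P1", "location": "Jönköping", "budget": 150},
    {"project_no": "P2", "location": "Stockholm", "budget": 200},
    {"project_no": "P3", "location": "Göteborg", "budget": 250},
]
SITE_PREDICATES = {
    "location": {
        "S1": lambda r: r["location"] == "Jönköping",
        "S2": lambda r: r["location"] == "Stockholm",
        "S3": lambda r: r["location"] == "Göteborg",
    },
    "budget": {
        "S1": lambda r: r["budget"] <= 150,
        "S2": lambda r: 150 < r["budget"] <= 200,
        "S3": lambda r: r["budget"] > 200,
    },
}

top, fragments = fragment_on_top_attribute(ROWS, ALP, SITE_PREDICATES)
```

Here the attribute `location` has the highest ALP value, so only its predicates are used for the selection; each resulting fragment is assigned to the site whose predicate produced it.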

The pseudo-code has two parts. First, the CISUD (Insert, Select, Update, Delete) matrix of the relation to be fragmented is taken as input. Second, the ALP table of that relation is produced as output. The pseudo-code consists of five nested for loops that calculate the cost of each attribute, i.e. the ALP of the relation [24].

Figure 5. ALP-table-construction pseudo-code [24]

Input: ISUD matrix of the relation to be fragmented
Output: ALP table for that relation

for (i = 1; i <= TotalAttributes; i++) {
    for (j = 1; j <= TotalPredicates[i]; j++) {
        MAX[i][j] = 0;
        for (k = 1; k <= TotalSites; k++) {
            S[i][j][k] = 0;
            /* Sum of all applications' cost of predicate j of attribute i at site k */
            for (r = 1; r <= TotalApplications[k]; r++) {
                C[i][j][k][r] = fc*C + fr*R + fu*U + fd*D;
                S[i][j][k] += C[i][j][k][r];
            } // end of fourth loop
            /* Find out at which site the cost of predicate j is maximum */
            if (S[i][j][k] > MAX[i][j]) {
                MAX[i][j] = S[i][j][k];
                POS[i][j] = k;
            }
        } // end of third loop
        /* Sum of the costs at all sites other than the maximum-cost site */
        SumOther = 0;
        for (k = 1; k <= TotalSites; k++) {
            if (k != POS[i][j])
                SumOther += S[i][j][k];
        }
        /* Actual cost for predicate j of attribute i */
        ALPsingle[i][j] = S[i][j][POS[i][j]] - SumOther;
    } // end of second loop
    /* Total cost for attribute i (locality precedence) */
    ALP[i] = 0;
    for (j = 1; j <= TotalPredicates[i]; j++) {
        ALP[i] += ALPsingle[i][j];
    }
} // end of first loop

The figure above shows the pseudo-code of the fragmentation and allocation algorithm given in [24]. We chose to test this algorithm against our requirements and to fragment the database accordingly.
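A compact Python rendering of the ALP-table construction may help in following the loops. It is a sketch rather than the implementation of [24] or of this thesis: the nested-dictionary layout of the input, the function name `alp_table`, the sample frequencies and the operation weights (2, 1, 3, 2) are all assumptions chosen for illustration.

```python
def alp_table(isud, weights):
    """Compute the ALP value of every attribute.

    isud[attribute][predicate][site] is a list with one
    (f_insert, f_select, f_update, f_delete) tuple per application
    at that site; weights holds the four operation weights.
    """
    alp = {}
    for attribute, predicates in isud.items():
        total = 0
        for sites in predicates.values():
            # Sum of all applications' costs for this predicate at each site.
            site_cost = {
                site: sum(sum(f * w for f, w in zip(freqs, weights))
                          for freqs in applications)
                for site, applications in sites.items()
            }
            # Site with the maximum cost, and the sum over the other sites.
            best = max(site_cost, key=site_cost.get)
            others = sum(c for s, c in site_cost.items() if s != best)
            total += site_cost[best] - others
        alp[attribute] = total
    return alp


# Hypothetical weights for Insert, Select, Update, Delete.
WEIGHTS = (2, 1, 3, 2)

# Hypothetical frequencies: one application per site for each predicate.
ISUD = {
    "location": {
        "location=Stockholm": {"S1": [(1, 1, 0, 0)], "S2": [(0, 1, 0, 0)],
                               "S3": [(0, 0, 0, 0)]},
        "location=Göteborg": {"S1": [(0, 1, 0, 0)], "S2": [(1, 1, 1, 1)],
                              "S3": [(0, 1, 0, 0)]},
    },
    "budget": {
        "budget<=200000": {"S1": [(0, 1, 0, 0)], "S2": [(0, 1, 0, 0)],
                           "S3": [(0, 1, 0, 0)]},
    },
}

ALP = alp_table(ISUD, WEIGHTS)
```

With these numbers the predicates on `location` are accessed mainly from a single site, so `location` receives a high ALP value, whereas `budget` is accessed equally from all sites and ends up with a negative value. An attribute whose accesses are concentrated at one site is therefore the better candidate for horizontal fragmentation.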

2.7.1.2 Mathematical Measurement of the algorithm

To execute the algorithm, [24] uses a number of mathematical formulas to calculate the cost of ALP and to test the algorithm under different operational changes. Using these linear-combination equations, we test the algorithm with different frequencies retrieved from the customized ISUD matrix table. Cost is treated as the effort of access to and modification of a particular attribute of a relation by an application from a particular site [24]. To calculate the precedence of an attribute of a relation, we take the CISUD matrix of the relation as input together with the following cost functions. Equation (1) calculates the cost of a single application as the weighted sum of its operation frequencies; equation (2) calculates the total cost at a particular site; equation (3) gives the maximum cost among the sites for predicate j of attribute i; equation (4) calculates the actual cost for predicate j of attribute i; and equation (5) calculates the total cost of attribute i (i.e. its locality precedence) [24]. All of the following equations are implemented in the code of the user interface application. The customized ISUD frequencies can be retrieved automatically from the CISUD matrix table with the help of the user interface.

$C_{i,j,k,r} = f_i I + f_s S + f_u U + f_d D$ (1)

$S_{i,j,k} = \sum_{r=1}^{A_{i,j,k}} C_{i,j,k,r}$ (2)

$S_{i,j,m} = \max_{k} \left( S_{i,j,k} \right)$ (3)

$ALP_{i,j} = S_{i,j,m} - \sum_{k \neq m} S_{i,j,k}$ (4)

$ALP_{i} = \sum_{j=1}^{l} ALP_{i,j}$ (5)

Here

$f_i$ = frequency of Insert operation
$f_s$ = frequency of Select operation
$f_u$ = frequency of Update operation
$f_d$ = frequency of Delete operation
$I$ = weight of Insert operation
$S$ = weight of Select operation
$U$ = weight of Update operation
$D$ = weight of Delete operation
$C_{i,j,k,r}$ = cost of predicate j of attribute i accessed by application r at site k
$S_{i,j,k}$ = sum of all applications' cost of predicate j of attribute i at site k
$S_{i,j,m}$ = maximum cost among the sites for predicate j of attribute i
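As a worked illustration of equations (1) to (4), assume hypothetical weights $I = 2$, $S = 1$, $U = 3$, $D = 2$ and a single application at each of three sites accessing predicate j of attribute i. The frequencies used below are invented for the example and are not measurements from the case study.

```latex
\begin{align*}
S_{i,j,1} &= 1\cdot 2 + 1\cdot 1 + 0\cdot 3 + 0\cdot 2 = 3 \\
S_{i,j,2} &= 1\cdot 2 + 1\cdot 1 + 1\cdot 3 + 1\cdot 2 = 8 \\
S_{i,j,3} &= 0\cdot 2 + 1\cdot 1 + 0\cdot 3 + 0\cdot 2 = 1 \\
S_{i,j,m} &= \max(3,\, 8,\, 1) = 8 \qquad (m = 2) \\
ALP_{i,j} &= 8 - (3 + 1) = 4
\end{align*}
```

With only one application per site, the site cost of equation (2) equals the single application cost of equation (1). The positive result indicates that the predicate is used predominantly at site 2; summing such values over all predicates of the attribute gives its locality precedence as in equation (5).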
