
Linköpings universitet SE–581 83 Linköping

Linköping University | Department of Computer and Information Science

Master’s thesis, 30 ECTS | Computer Engineering (Datateknik)

2018 | LIU-IDA/LITH-EX-A--18/008--SE

Object Migration in a

Distributed, Heterogeneous

SQL Database Network

Datamigrering i ett heterogent nätverk av SQL-databaser

Joakim Ericsson

Supervisor: Tomas Szabo
Examiner: Olaf Hartig



Copyright

The publishers will keep this document online on the Internet – or its possible replacement – for a period of 25 years starting from the date of publication barring exceptional circumstances. The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/hers own use and to use it unchanged for non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility. According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement. For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.

© Joakim Ericsson


Abstract

There are many different database management systems (DBMSs) on the market today. They all have different strengths and weaknesses. What if all of these different DBMSs could be used together in a heterogeneous network? The purpose of this thesis is to explore ways of connecting the many different DBMSs together. This thesis will explore suitable architectures, features, and performance of such a network. This is all done in the context of Ericsson’s wireless communication network. This has not been done in this context before, and a big part of the thesis is exploring if it is even possible. The result of this thesis shows that it is not possible to find a solution that can fulfill the requirements of such a network in this context.


Acknowledgments

Thanks to my family, who have encouraged and supported me through all my years of education. It has not always been easy, but looking back it has been entirely worth it.

This thesis marks a cornerstone of my formal education, but it is only the start of lifelong learning.


Contents

Abstract iii

Acknowledgments iv

Contents v

List of Figures vii

List of Tables viii

List of Listings ix
List of Acronyms x
1 Introduction 1
1.1 Motivation . . . 1
1.2 Old System . . . 2
1.3 Aim . . . 2

1.4 Initial System Description . . . 3

1.4.1 Requirements . . . 4
1.4.1.1 Heterogeneous Databases . . . 4
1.4.1.2 SQL Database . . . 4
1.4.1.3 Data Migration . . . 5
1.4.1.4 Performance Requirements . . . 5
1.5 Research Questions . . . 5
1.6 Delimitations . . . 5
2 Method 7
2.1 Pre-study . . . 7

2.2 System Design and Implementation . . . 8

2.3 Evaluation . . . 8

3 Theory 9
3.1 Relational Database Management Systems and SQL . . . 9

3.1.1 SQL . . . 9

3.1.2 ACID . . . 9

3.2 OLAP versus OLTP . . . 10

3.3 Distributed Database Management Systems . . . 11

3.4 ODBC, JDBC, and OLE DB . . . 12

3.5 Microsoft Linked Servers . . . 13

3.6 Distributed Query Engines . . . 13

3.6.1 PrestoDB . . . 13


3.7 Multistore and Polystore Systems . . . 15

3.7.1 CloudMdsQL and CoherentPaaS . . . 15

3.7.2 BigDAWG . . . 15
3.8 Data Warehousing . . . 16
4 Distributed Database 17
4.1 Database Schema . . . 17
4.2 Distribution Schema . . . 17
5 System Architecture 21
5.1 Centralized Approach . . . 21
5.2 Distributed Approach . . . 22
5.3 Possible Architectures . . . 23
5.3.1 CellApp Architectures . . . 23
5.3.1.1 Architecture A . . . 23
5.3.1.2 Architecture B . . . 24
5.3.1.3 Architecture C . . . 25

5.3.2 Machine Learning Application . . . 25

5.3.2.1 Architecture D . . . 26
5.3.2.2 Architecture E . . . 26
6 Results 28
6.1 Performance Measurements . . . 28
6.1.1 Test Setup . . . 28
6.1.1.1 Mock Application . . . 29

6.1.1.2 Database Management Systems (DBMSs) Tested . . . 30

6.1.1.3 Test Parameters . . . 31

6.1.2 Result using ODBC . . . 31

6.1.2.1 ODBC Using a High-performance Machine . . . 34

6.1.3 Using JDBC . . . 35

6.1.4 Introducing a Network Delay . . . 35

6.1.5 Introducing a Distributed Query Engine . . . 37

6.1.6 Comparing Relational Database Management Systems (RDBMSs) to a non SQL (NoSQL) Alternative . . . 38

6.2 The Machine Learning Application . . . 39

6.3 Data Migration . . . 40

6.4 Final System . . . 41

7 Discussion 43
7.1 Literature Study . . . 43

7.2 Performance Measurements . . . 44

7.3 Uniform Structured Query Language (SQL) Syntax . . . 44

7.4 Data Migration . . . 45

7.5 Method . . . 45

7.6 Final System . . . 46

7.7 Future Work . . . 46

7.8 Societal and Ethical Considerations . . . 46

8 Conclusion 48

Bibliography 50


List of Figures

1.1 Data format . . . 2

1.2 Initial sketch of the proposed system architecture . . . 3

4.1 SQL database schema . . . 18

4.2 States table distribution schema . . . 19

4.3 Neighbors table distribution schema . . . 20

5.1 Architecture A - Simple direct connection . . . 23

5.2 Architecture B - Simple middleware . . . 24

5.3 Architecture C - Simple . . . 25

5.4 Architecture D - Middleware . . . 26

5.5 Architecture E - Database access component . . . 27

6.1 Test setup 1 . . . 29

6.2 MySQL Memory engine read and write operations . . . 33

6.3 SQLite read and write operations . . . 33

6.4 SQLite spike read and write operations . . . 34

6.5 SQLite main memory read and write operations . . . 34

6.6 Test setup 2 . . . 36

6.7 Test setup 3 . . . 37

6.8 Redis read and write operations . . . 39


List of Tables

6.1 Test parameters . . . 31

6.2 Test machine . . . 31

6.3 ODBC - Initial measurements of query read operations . . . 32

6.4 Test machine 2 . . . 35

6.5 ODBC high-performance machine - Measurements of query read operations . . . . 35

6.6 JDBC - Initial measurements of query read operations . . . 35

6.7 ODBC - Over network . . . 36

6.8 Distributed query engines - Over network . . . 38


List of Listings

3.1 PrestoDB distributed query example . . . 14

3.2 Drill distributed query example . . . 15

6.1 Mock CellApp pseudo code . . . 30

6.2 Read request . . . 30
6.3 Write request . . . 30
6.4 GET operation . . . 38
6.5 SET operation . . . 39
6.6 SQL data migration . . . 40
6.7 Delete functionality . . . 41


List of Acronyms

5G 5th generation mobile network.

ACID atomicity, consistency, isolation, and durability.

ANSI American National Standards Institute.
API application programming interface.
DBMS database management system.

DDBMS distributed database management system.
ISTC Intel Science and Technology Center for Big Data.

JDBC java database connectivity.
NoSQL non SQL.

ODBC open database connectivity.
OLAP online analytical processing.

OLE DB object linking and embedding, database.
OLTP online transaction processing.

RDBMS relational database management system.
RTT round-trip time.


1

Introduction

The field of distributed systems is growing rapidly. An increasing number of systems are distributed in some way to enhance performance and/or stability, but at the same time the distribution often increases the complexity of the system.

1.1

Motivation

The 5th generation mobile network (5G) architecture is a highly distributed system with a lot of different requirements such as the need to route huge amounts of data in real-time. To keep this system up and running there is a need to store and share a lot of configuration data between the applications in the system. This configuration data is used to calculate handovers between cells. It is therefore important to maintain a common configuration throughout the entire network.

Distributed databases are also a growing field, with new techniques developing at a rapid pace. Having a large number of databases distributed over large geographical areas leads to interesting problems that need to be solved. There exist two distinctly different architectural approaches, homogeneous and heterogeneous.

In a heterogeneous distributed database system, each site may run different database software. Even the operating system or hardware may differ. This has the advantage of making the entire system easier to expand, since it is compatible with a variety of configurations. This flexibility does, however, come with some disadvantages; communication between the different types of databases is not as straightforward as it would have been if the system were homogeneous. Data must be translated when it is exchanged between databases of different types. This increases the complexity of the system and can prove to be both a technical and economical challenge. [1]

A homogeneous distributed database system consists of a network of similar machines running the same hardware and software. This is often simpler and less costly to implement, but at the same time, it puts more requirements on the system. [2]

The customer, in this case Ericsson, has a set of requirements that the system needs to conform to in order to be a useful replacement for their current system. To measure whether the system meets these requirements, some metrics must be used. These metrics represent an important part of the requirements. More specific information about the requirements and the metrics used to evaluate the system can be found below in the method and theory sections.

1.2

Old System

Ericsson has thousands of base stations that currently hold a lot of configuration data using in-memory storage. This data is used to keep track of neighboring cell towers, optimize handovers, and store application state. For simplicity, these functions will be combined into a fictitious application called "CellApp" in this thesis. The data stored for each CellApp uses the structure illustrated in Figure 1.1.

Figure 1.1: Data format

The attributes in Figure 1.1 are key-value pairs where the keys are strings and the values are of simple primitive types like integers, booleans, and short strings. Each neighbor entry in the list consists of a small number (usually ten to fifty) of key-value attributes representing the state of one of the neighboring base stations. This data about neighbors is used for handover calculations.

1.3

Aim

This master's thesis will implement and evaluate ways of communicating with distributed heterogeneous structured query language (SQL) database systems. The solutions will be compared, using some selected metrics, to similar implementations in a heterogeneously designed network. The metrics used to compare the systems will be selected by examining

system using a heterogeneous architecture. In that case, different heterogeneous solutions will be evaluated against each other.

1.4

Initial System Description

The problem with the current way of storing data, using in-memory storage, is that it is hard to change if another database is to be used or if the data is to be moved to a remote database. This would require rewrites of the code, with specially written code for each of the different ways of storing data. This thesis aims to find solutions to this problem by moving towards a more abstract approach to data storage for this application. This will enable data to be stored more easily using different database systems, both locally and remotely, which is not possible at the moment, all through a uniform syntax. The thought, by Ericsson, is that SQL could be a suitable abstraction to solve this. Therefore, SQL DBMSs will be an important focus. An initial sketch of the architecture wanted for this system can be seen in Figure 1.2.


Figure 1.2: Initial sketch of the proposed system architecture

Another reason why this abstraction is wanted is the possibility for some machine learning application to read all the data in the entire network of databases and analyze it (see Figure 1.2). This application will then do some calculations to optimize handovers between CellApps and make changes to configuration data in the heterogeneous database network based on these optimizations. Such a change could, for example, be adding or removing neighbors for a specific CellApp. The machine learning program has significantly lower performance requirements than the CellApp, because the machine learning algorithm will only be run during some maintenance period, typically once a day. The CellApp, however, needs to run continuously in real time.

It is possible that the data of a CellApp needs to be migrated to another database. One case for this could be that too many CellApps are running on the same piece of hardware and some of them need to be moved to another machine for performance reasons. In this case, the data will also need to be moved into a remote or cloud database. This migration should preferably not interfere with the availability of the data. The data also needs to be consistent between databases after a migration, so that it is not corrupted, only partially migrated or, in the worst case, removed completely. The decision to start a migration of the data is made by human intervention.

Another aspect of the research is to check whether it is possible to move this data to a remote or cloud database, or whether this adds such a large performance penalty that it becomes unusable. It will often be the case that multiple CellApps use the same DBMS.

1.4.1

Requirements

In this section, the requirements of the system will be listed and explained. The requirements come from the customer, Ericsson, and are what this thesis will focus on; they are tightly coupled with the research questions listed in Section 1.5. It is not certain that all the requirements below are feasible to implement together, but the goal of the project is to find and evaluate such a system. This section will also try to explain why these requirements exist and why they are important in the context of this project and the system in question. The source of the information in this chapter is the customer, Ericsson, if nothing else is specified; the information was extracted during meetings with the technical supervisor of the thesis project, and the requirements were formed from it.

The following requirements on the system exist.

1.4.1.1 Heterogeneous Databases

The network is heterogeneous and distributed because the system will consist of many different machines that handle different loads, are located at different sites and, most importantly, run different DBMSs. This becomes even more complex when taking into account that not all of the DBMSs will have the same features or even run on the same system architecture. Some of the DBMSs could run in the cloud with access to thousands of gigabytes of storage and plenty of processing power and RAM; these machines may even be able to scale up and down depending on the demand of the service. Other parts of the system may not have such resources and instead offer the processing power and storage of an average desktop PC.

All of this increases the complexity of connecting and using the network of databases using a uniform syntax.

1.4.1.2 SQL Database

One of the requirements of the system is that it should be an SQL database system. SQL will be used because SQL is considered, by Ericsson, to be a good abstraction for communication in a heterogeneous database system. From this requirement it follows that relational databases are a good choice because they almost always use SQL. The data that is being stored in the system is relatively simple and will not use all the features that a relational SQL database provides. The data, as it looks today, is almost entirely key-value storage without complex relations between the stored values.

1.4.1.3 Data Migration

Data in this system should be able to be moved from one DBMS to another; this will be called a migration of data. Whenever such a migration takes place, the data should at the same time remain available to the applications. Hence, data migration should impose as little loss of availability as possible. The data must also remain consistent throughout the migration, so that it is not corrupted or only partially migrated.

1.4.1.4 Performance Requirements

These are the performance requirements for the system. There are two types of applications that are part of the system: the CellApps and a machine learning application. These two applications have different data access patterns and different performance requirements.

• CellApp typical usage: query data from 100 different neighbors per second, and only a few write operations every minute. Each of the neighbors that are being queried contains 20 attributes. One read query will be issued every 10ms and should not overrun that time.

• Machine learning application: Reads all the data periodically (daily) and makes updates to selected attributes. The time frame for this is hours.

To benchmark this system, mock applications with the same access patterns as the real system will be developed. Based on these mock applications, I will measure the time it takes for typical queries in the system to run. More specifically, the time is measured from when the query is executed until the queried data is returned to the application.
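
To make these access patterns concrete in SQL terms, a typical CellApp read and an occasional write could look roughly like the statements below, assuming the neighbors table described in Chapter 4 (the literal values are made up). The benchmark then measures the wall-clock time from issuing such a statement until the result is returned to the application.

    -- Read the attributes of one neighbor of one CellApp; issued roughly every 10 ms
    SELECT *
    FROM neighbors
    WHERE CellAppId = 17 AND NeighborId = 42;

    -- Occasional write: update a single attribute of one neighbor
    UPDATE neighbors
    SET Field4 = 3
    WHERE CellAppId = 17 AND NeighborId = 42;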

1.5

Research Questions

The following questions will be considered in this master's thesis:

1. What are the important properties for a distributed network of heterogeneous SQL DBMSs in the given setting?

2. What frameworks and techniques can be used to communicate with a heterogeneous network of DBMSs using a uniform SQL syntax?

a) How do these frameworks and techniques compare to each other?

b) Is it feasible to build such a system, using the frameworks and techniques, that fulfills the functional and performance requirements of this project?

3. Will the system built, using the frameworks and techniques, enable data to migrate from one DBMS to another while keeping consistency and availability of the data?

1.6

Delimitations

The focus of the project will be to find already existing solutions, frameworks, and techniques that fulfill the requirements, rather than to design and implement such a system from scratch. Ericsson acknowledges that a system that achieves all the requirements could be built, but it would take a lot of time and resources to implement from scratch. They also believe that there are similar, already existing, solutions and techniques that could be used to solve the problem. Finding these existing solutions and techniques will be the focus of this master's thesis.

For the machine learning application, only solutions, frameworks, and techniques that can query data in place, at the different DBMSs, will be considered. This is opposed to copying data from the different DBMSs to a single location and then running queries on the collection of data. This delimitation was made to limit the scope of the thesis.

One delimitation is that the frameworks and techniques must be able to run on Linux, since this is the intended operating system in production.

Another delimitation made in this project is that only open source and other "free to use" projects will be considered. This is because it is, in most cases, a hassle to get trial subscriptions of paid products, while the open source projects can just be downloaded and run instantly. While these paid alternatives will not be tested in this thesis, some of them will be evaluated in the theory chapter and compared, conceptually, with the open source alternatives found.


2

Method

This chapter describes the method used to build a system consisting of a distributed database, which stores the data in the system, and a middleware that handles communication with the many DBMSs that the distributed database consists of.

To develop this system and find answers to the research questions, the first step was to extract the requirements for the system from the customer. The requirements are by design coupled with the research questions. After the requirements were collected, a preliminary design of the system architecture could be established. The requirements were then used to perform a pre-study of techniques believed to solve the problem. These techniques were then tested and benchmarked, using the requirements, in order to find the most suitable solution for this project. To benchmark the different techniques, they were first integrated to fit into the system. Following this, tests were made with synthetic data to measure the different metrics. Synthetic data was used because the system in question is not yet in production and no real data is available at the moment. The synthetic test data is based on data from a legacy system. Therefore, results and measurements made in this thesis can be expected to be similar to the results in production with real data.

The method chapter is divided into three parts. The first part is the pre-study, where a literature study was performed to find suitable techniques for the problem. The second part is the system design and implementation phase, where the techniques found in the pre-study were developed into possible architectures for the system; a test environment was then set up in order to test these techniques with the different architectures. In the last part, an evaluation of the different techniques and architectures was done with the data measurements collected.

2.1

Pre-study

During the pre-study, alternative techniques and architectures for building the system were assessed in a literature study. The focus was on finding systems that satisfy the requirements stated in Section 1.4.1. To do this evaluation, techniques brought up in the theory chapter of this master's thesis were studied with the theory as a foundation. The techniques that, according to the theory, are believed to be able to fulfill all of the requirements were researched further and implemented in the design phase of the project. In the case that no system is found that can satisfy all the requirements, a system that fulfills most of the requirements will be used.


The result of the pre-study is a theory chapter with techniques that could be used to build the distributed database and the system around it. Based on these techniques alternative architectures of the system could be developed. These architectures and techniques will be implemented and tested in the next phase of the method.

2.2

System Design and Implementation

In the design and implementation phase, the different techniques and architectures were implemented in a realistic environment. Because the final system will be massively distributed, a less complex version of the environment was used to run the tests and benchmarks in this project. The test environment tries to capture the most important features of the real environment, but it is possible that the test results could differ from those in the real environment.

In this step the database schema for the databases was developed. All the databases in the system combine into a distributed database. A distribution schema for the entire distributed database was also developed.

2.3

Evaluation

Here the data collected in the performance measurements and in the literature study were evaluated. The performance measurements were collected by creating a testbench for the CellApps, to emulate real-world conditions. This test was then run in a realistic environment. The measurements from the test were collected and evaluated against each other.

The candidate for the final system architecture was chosen based on the measured performance and on how well the architecture fit the other requirements.


3

Theory

3.1

Relational Database Management Systems and SQL

A database is a collection of data. In a relational database management system (RDBMS) the data is stored as rows in a table. Data in an RDBMS can be connected through relations and constraints. It is up to the RDBMS to ensure that all the data adheres to the constraints. One example of such a constraint could be that all rows in a table must have a unique primary key. [3]

A traditional RDBMS needs to provide many different features, and many of them are supported by almost all DBMSs, both non-relational and relational. This chapter will highlight some techniques that are used in RDBMSs.

3.1.1

SQL

Structured Query Language, SQL for short, is a language used in RDBMSs. The language can be used to describe the structure of the database and the data stored. SQL can also be used to query the database for data. A query is a request to read from the database, while an SQL statement is a request to either read or write to the database. [4]

SQL is a standard published by the American National Standards Institute (ANSI). They progressively publish new additions to the SQL standard every couple of years. [5]

Even if many RDBMSs use SQL, they do not always follow the standard completely. Different RDBMSs often support different data types and other features. This means that a query that is supported in one RDBMS does not necessarily work in another RDBMS, although both should be able to support roughly the same functionality. [4]
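
A small example of such a dialect difference: limiting the number of rows returned is written differently in different RDBMSs, even though the functionality is roughly the same (the table and column names below are hypothetical).

    -- MySQL, PostgreSQL, and SQLite
    SELECT name FROM users ORDER BY name LIMIT 10;

    -- Microsoft SQL Server
    SELECT TOP 10 name FROM users ORDER BY name;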

3.1.2

ACID

ACID is an acronym that stands for atomicity, consistency, isolation, and durability. These are properties of a transaction in an RDBMS. A transaction in SQL is one or multiple SQL statements that are always run in full. Either all statements are executed completely or none of the statements should have any effect on the database. An RDBMS ensures that every transaction is run in accordance with the ACID properties. [6]


Most of the ACID concepts do not only concern database theory but are general concepts used in computer science. [7]

• Atomicity: This property guarantees that a transaction is always executed in full. It can never happen that a transaction partially updates the database. This includes unexpected situations where, for example, the DBMS crashes due to loss of power or due to other failures or bugs.

• Consistency: This property makes sure that every transaction that is run in the database will result in a valid state of the database. To be in a valid state means that all data that is written follows the rules and constraints in the database.

• Isolation: Isolation is used to ensure database integrity when running transactions concurrently in the database. What this means is that the result of concurrent transactions should be able to be reproduced by the same transactions running in some sequential order.

• Durability: Durability means that a committed transaction remains committed indefinitely, even if the program or the computer running the database crashes. To implement this, the DBMS must make sure to always save committed transactions to permanent storage. Durability will therefore not be achieved if the data is only stored in regular RAM.
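
As a minimal illustration of these properties, consider the classic transfer example below, written against a hypothetical accounts table. Either both UPDATE statements take effect or neither does (atomicity), and once COMMIT has returned the change must survive a crash (durability). The exact syntax for starting a transaction varies slightly between DBMSs.

    BEGIN TRANSACTION;
    UPDATE accounts SET balance = balance - 100 WHERE id = 1;
    UPDATE accounts SET balance = balance + 100 WHERE id = 2;
    COMMIT;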

3.2

OLAP versus OLTP

There are two main classes of use cases when it comes to database systems: online analytical processing (OLAP) and online transaction processing (OLTP).

OLTP: An OLTP system is characterized as a system that performs small and fast transactions against a database. This means that queries need to be fast. Often, in an OLTP system, the data read or modified is directly facing a user or application. [8]

A typical example of this kind of OLTP system is a database storing user login credentials for a website. The queries will, in this case, be exceedingly small and need to be fast. If the user wants to log in, a simple query to read the data for one user will be executed. If the user wants to change password, a simple update statement will be executed.
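
Expressed as SQL against a hypothetical users table, the two operations in this example are small point queries of the following kind.

    -- Log in: fetch the stored password hash for one user
    SELECT password_hash FROM users WHERE username = 'alice';

    -- Change password: update a single row
    UPDATE users SET password_hash = 'new_hash' WHERE username = 'alice';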

OLAP: An OLAP system, on the other hand, is, as the name suggests, more suitable for analytics. Queries in an OLAP system can be complex and require data from many sources, with different database schemas, to resolve. These queries can take a long time to execute compared to queries in an OLTP system. OLAP systems often query a lot of historical data, whereas OLTP systems often only query current data. Because the queries are used for analytics, these systems are often read-only by design. They also typically do not face an end user the same way that OLTP systems do. [8]

An example of this kind of OLAP system could be a big collection of databases. Data from these databases will be aggregated and queried using complex and long-running functions. Based on this analysis some changes or decisions will then be made to optimize the system.
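
A hedged sketch of what such an analytics query could look like, against a hypothetical handover log, is given below; it aggregates a year of historical data instead of touching individual rows.

    -- Average handover time per cell over one year of history
    SELECT cell_id, AVG(handover_time_ms) AS avg_handover_time_ms
    FROM handover_log
    WHERE event_date BETWEEN '2017-01-01' AND '2017-12-31'
    GROUP BY cell_id;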

Most systems today support either OLTP or OLAP, not both. However, there has been some research on combining them into a single system that can handle both, but this increases the complexity of the system. [9]

3.3

Distributed Database Management Systems

A distributed database management system (DDBMS) is a management system for a collection of distributed databases. Similar to a DBMS that manages a local database, the DDBMS manages a distributed collection of databases. It hides the complexity of the distributed system from the user. [2]

Although the system developed in this thesis will not directly be a DDBMS, it will be similar in many ways, and therefore some conceptual background on these systems is required. Before a decision is made to implement a distributed database system for storage of data, there are several factors to consider. A distributed database system has both advantages and disadvantages compared to a regular centralized system. A distributed system will probably be more expensive in terms of hardware required, but runtime costs may benefit from the possibility to fine-tune each machine separately from the rest to achieve maximum performance. When the data is distributed between several locations, there are also security and integrity related benefits, since the protection of data is improved when the data is not all located at the same site. [10]

A distributed system may be heterogeneous in a few different ways. The different components of a heterogeneous distributed system may differ in hardware, software, or communication protocols. Two different systems can have different data models which in turn have differences in data structures, constraints, and query language. The difference in structure is easier to deal with if the two representations have the same data content. If not, the difference in content may require significant work to make the systems compatible with each other. [11]

Data is fetched from a database by executing so-called queries. In a conventional, non-distributed database system, all data asked for in the query can be provided by the single database. This may not be the case in a distributed system. To retrieve all data a user is interested in, several queries may have to be executed, by the user, on different databases. The results of the queries then have to be combined, by the user, into a resulting dataset. To make the distributed system convenient to use for the end-user, a distributed query manager can be used. The query manager makes the distributed system behave like its non-distributed counterpart by taking a single query provided by the user and combining data from its respective sources to form a single resulting set of data. Creating an efficient distributed query manager might be a more or less difficult task, depending on the differences between the databases in the system. [12]

When a transaction is executed in a distributed environment, it could be the case that it writes data located in multiple databases at once. In this case, synchronization problems arise, because in a distributed system there are no native synchronization or atomic operations. This could lead to unwanted situations where data in the database becomes corrupt because of race conditions, or deadlocks could occur because of partially updated data. There are some cases where this kind of synchronization is not needed, but in the cases where it is, two properties need to be taken into consideration. The first is local synchronization, which makes sure that concurrently running queries on the same database are synchronized and run in order. The other is global synchronization, which makes sure that the entire network of databases keeps a consistent state. The latter is harder to achieve and can add additional overhead. [12]


For any database system, there has to be a way of performing various administrative tasks. These include authorization of users and management of semantic integrity rules. In a heterogeneous and distributed database system, the method chosen to perform these tasks depends on the degree of centralization. When authorizing users, there can be an advantage of providing permissions from a centralized system. Thomas et al. [12] bring up the example of giving a user access to the average salary for employees at a company while denying access to the salary of individual employees, where the salaries and employees are stored in different databases in a distributed system. If the authorization of users is centralized, it is trivial to create a database view that captures the necessary queries and operations to obtain the average salary. The user can then be granted access to this view by the centralized authorization system. If one chooses to use a decentralized authorization system, no method exists to grant the user access to the average salary without them being able to also access information about the salary of individual employees as well. If the database management system of the distributed database system supports having semantic integrity rules for the stored data, this can be handled either centrally, through a global schema, or locally at every database. The global approach does, in this case, have a significant advantage due to the possibility to add constraints that depend on data stored in different databases. [12]
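
In SQL terms, the centralized variant of the salary example could be expressed roughly as follows (the table, view, and user names are hypothetical); the user is granted access to the view only, not to the underlying table, and in the distributed case the view would combine salary data from several databases.

    CREATE VIEW average_salary AS
        SELECT AVG(salary) AS avg_salary FROM employees;

    GRANT SELECT ON average_salary TO analyst;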

3.4

ODBC, JDBC, and OLE DB

There are many existing DBMSs that often use different interfaces. Open database connectivity (ODBC) aims to provide a uniform interface to the different DBMSs using SQL, acting as a kind of middleware.

There are three ways that ODBC tries to standardize this:

• Provide a uniform communication middleware: This means that it is possible to handle connections to different DBMSs in the same way.

• Datatype standard: Provide a standard for which data types can be used and how they are mapped to the target DBMS.

• Provide an application programming interface (API): Used to execute queries with the same SQL syntax regardless of the underlying target DBMS.

To be able to use ODBC with a specific DBMS, a driver for that DBMS is required. This driver is often provided by the vendor of the DBMS. [13]

For the Java environment, there is java database connectivity (JDBC), developed by Oracle. It aims to solve the same problem as ODBC and includes the same features. There are bridges that connect ODBC to JDBC and vice versa, which means that one can almost always be used instead of the other. [13]

Object linking and embedding, database (OLE DB) is, like ODBC, developed by Microsoft. The main difference is that while ODBC works with most DBMSs, OLE DB supports connecting to many other kinds of sources, including files on disk and other non-DBMSs. OLE DB also contains functionality to connect to regular DBMSs by using ODBC as a middleware. [14]

3.5

Microsoft Linked Servers

Linked servers is a technology developed by Microsoft and shipped with Microsoft's SQL Server. It makes it possible to connect to and access data from other data sources. Linked servers use ODBC and OLE DB as middleware for this connection and can therefore access data from a variety of sources. [15]

Using linked servers, it is possible to perform distributed queries. Performing a distributed query means that it is possible to query data from many heterogeneous sources in one query. As mentioned in Section 3.3, there are multiple problems when dealing with a distributed query compared to a local one. Microsoft solves this with what they call distributed transactions, which try to emulate the transparency of a query on a local database. These distributed queries enforce the ACID properties, implemented using the two-phase commit principle, which adds some overhead. [16]
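
As an illustration, a distributed query through a linked server uses SQL Server's four-part naming (linked server, database, schema, object). The sketch below, with hypothetical names and a linked server that is itself an SQL Server, joins a local table with a remote one in a single statement.

    -- Join a local table with a table on the linked server "RemoteServer"
    SELECT l.CellAppId, r.Field4
    FROM dbo.states AS l
    JOIN RemoteServer.testdb.dbo.neighbors AS r
        ON l.CellAppId = r.CellAppId;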

This technology seems to be a good fit for the work in this thesis but, unfortunately, it is Windows-exclusive software and therefore cannot be used in a Linux environment, which is a requirement for this thesis. [15]

3.6

Distributed Query Engines

A query engine is a component that can query data from a DBMS or some other data store. A query engine may also support making distributed queries and joining data over multiple heterogeneous sources. The query engine can itself be distributed (a distributed query engine), meaning that it can scale and run on multiple machines at the same time to enhance performance. [17]

3.6.1

PrestoDB

PrestoDB brands itself as a distributed query engine [18]. It is mainly designed for big data analytics and is able to query multiple heterogeneous databases using SQL. It does this by providing connectors to many of the most common DBMSs. [19] Among the supported connectors relevant to this project are:

• MySQL
• PostgreSQL
• Redis

and many more. [18] These connectors provide the PrestoDB core with specific information needed for the DBMS.

PrestoDB is built in a distributed manner to enhance performance when analyzing large quantities of data. It is designed with one coordinator server that is the central component of PrestoDB. When a query is executed in PrestoDB, it first reaches the coordinator. The coordinator takes the query and divides it into tasks that it then schedules on the workers. The workers then handle the reading or writing of data to the databases using the connectors. [18]

PrestoDB was originally developed by Facebook to query the company's internal data from many different sources at once. Today PrestoDB is an open source project licensed under the Apache license. [18] PrestoDB is backed and used by a lot of big companies. An example is Airbnb, which develops Airpal, a web user interface for PrestoDB. [20]


PrestoDB is designed for analytics and can be considered an OLAP system. This means that when it comes to modifying data, it has some limitations in which SQL statements are supported. The supported SQL statements depend on the connector used but are mainly the same for all RDBMSs. For the MySQL connector, the INSERT statement is supported but neither DELETE nor UPDATE is. This limitation could cause problems if the system should be used to update any data. [18]
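
Within these limitations, a bulk copy from one catalog to another can still be expressed, which is relevant for the data migration discussed later in the thesis. A hedged sketch, reusing the catalog and table names from Listing 3.1, could look like this:

    -- Copy the neighbors of one CellApp from PostgreSQL to MySQL through PrestoDB
    INSERT INTO mysql2.test.neighbors
    SELECT * FROM postgresql.public.neighbors
    WHERE cellappid = 17;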

Illustrated in Listing 3.1 is a simple distributed analytics query. In this example, two databases are queried using the same query. These databases are in this example called mysql2 and postgresql. As the names suggest, mysql2 is a MySQL DBMS while postgresql is a PostgreSQL DBMS. This means that this is not only an example of a distributed query but also an example of querying different, heterogeneous, DBMSs. The test and public identifiers reference databases and neighbors references a table. The query in Listing 3.1 will find rows where field4 does not match between the two databases.

Listing 3.1: PrestoDB distributed query example

    SELECT mysql.cellappid, post.cellappid, mysql.field4, post.field4
    FROM postgresql.public.neighbors AS post, mysql2.test.neighbors AS mysql
    WHERE mysql.field4 != post.field4
      AND mysql.cellappid = post.cellappid
      AND mysql.neighborid = post.neighborid;

3.6.2

Apache Drill

Apache Drill is also, like PrestoDB, a distributed query engine. Drill has its roots in Google's Dremel project, published in 2010; the techniques from Dremel were then used in an Apache project under the name Drill. [17]

Drill supports a variety of data sources, including both RDBMSs and NoSQL. It supports full ANSI SQL. [17] It is built using storage plugins that handle connections to specific data stores. Drill is written in Java and uses JDBC to connect to RDBMSs. By using JDBC, Drill should be able to connect to most sources that support the interface. Actively supported and tested data sources relevant to this project include:

• MySQL
• PostgreSQL

Like PrestoDB, Drill is mainly designed for data analytics and, therefore, does not contain all the functionality for modifying and updating data that is available in ANSI SQL. Operations like INSERT, UPDATE or even DELETE are not yet supported. [21]

Illustrated in Listing 3.2 is a distributed query using Drill. The query does the same kind of analytics as the example for PrestoDB in Listing 3.1. The queries for PrestoDB and Drill look similar; one difference is that Drill requires an explicit JOIN clause, whereas in PrestoDB no explicit JOIN keyword is needed since it is implicit. Drill also did not support the != operator, which PrestoDB did. Other than this, both queries gave the same result.


Listing 3.2: Drill distributed query example

    SELECT mysql.cellappid, post.cellappid, mysql.field4, post.field4
    FROM mysql2.test.neighbors AS mysql
    JOIN postgresql.public.neighbors AS post
      ON mysql.cellappid = post.cellappid
      AND mysql.neighborid = post.neighborid
      AND NOT (mysql.field4 = post.field4);

3.7

Multistore and Polystore Systems

According to the book Data Management and Analytics for Medicine and Healthcare by Edmon Begoli, Fusheng Wang, and Gang Luo, both multistore and polystore systems are defined as systems that combine heterogeneous DBMSs behind a single uniform interface or language, such as SQL. [22]

Multistore and Polystore Systems are, similar to the distributed query engines, a way to query data from multiple heterogeneous data sources in a single query. [23]

3.7.1

CloudMdsQL and CoherentPaaS

CloudMdsQL is a data query language similar to SQL. The syntax of CloudMdsQL is comparable to SQL but contains some additional features. It is developed within CoherentPaaS, a project funded by the EU that aims to solve the problem of querying heterogeneous data sources. [24] It achieves this by taking another approach than the systems previously described. Instead of providing a uniform syntax to all data stores, it enables the user to write native queries for each different database. It also provides a way of combining these native queries into bigger distributed queries, using an SQL-like syntax, stretching over many data sources. [23]

By using this technique, CloudMdsQL can query a wide range of data sources including relational database systems, NoSQL systems, and even graph-based databases. Out of the box, CloudMdsQL supports one of each to prove the concept: Derby as a relational database, MongoDB as NoSQL database, and Sparksee as a graph database. With some work, other databases can be added and used with the system [23]

CloudMdsQL should at this time be seen as an interesting research project on the subject, but it is not yet a system that should be used in production, according to V. Giannakouris et al. [19]. Even if it is well documented how the language works, it is not well documented how to run the project or which parts need to be extended in order to make it work with databases other than those provided with the project.

3.7.2

BigDAWG

Much like CloudMdsQL, BigDAWG is mainly a research proof of concept and not yet production ready. [19] BigDAWG stands for Big Data Analytics Working Group and was originally developed by the Intel Science and Technology Center for Big Data (ISTC). BigDAWG is developed around an idea expressed by M. Stonebraker and U. Cetintemel [25] as:

"No one size fits all"

This idea claims that there is no single DBMS that can be used in all circumstances. Instead, DBMSs need to be chosen based on the domain of the data.

BigDAWG is able to query data from multiple sources and different types of DBMSs. Like CloudMdsQL, BigDAWG comes with support for a number of different DBMSs to be able to prove the concept. It also contains a sample dataset, consisting of anonymized hospital records from the Beth Israel Deaconess Medical Center, that can be used to try all the different functionalities. [26]

3.8

Data Warehousing

Data warehousing is a somewhat different approach from the ones mentioned previously. Data warehousing means that all the data from different sources is collected and unified into a single data store. This is mainly used in large corporations with large collections of data. The data may be spread out over different locations and even stored in heterogeneous databases. The idea of a data warehouse is to collect all this information to be able to run analytics on the data. [27]

According to "The data warehouse toolkit: The complete guide to dimensional modeling" a book written by R. Kimball and M. Ross [27], there are several components relevant to a data warehouse system:

• Source systems: These are the original data sources. It is this part of the system that is facing the user or applications. These systems are often of OLTP type and need to be able to resolve simple queries fast. These systems often only contain the current state of the system and little to no historical data.

• Staging area: This is the stage where data extracted from the source systems is transformed to fit into the database schema of the data warehouse.

• Presentation Area: In this step, the data is structured in a suitable way and stored in the data warehouse system. From here it can be analyzed using different queries and other tools.

Data warehousing is something that will not fit directly into the work done in this thesis since it will require data to first be copied into the data warehouse before being queried. This project is looking for solutions where the data can be queried in place, without being copied first.


4

Distributed Database

In this chapter, the developed schema for each database is illustrated. Also illustrated is the distribution schema for the entire distributed database. The distributed database consists of all of the databases in the system.

4.1

Database Schema

Illustrated in Figure 4.1 is the database schema used in all the databases in the system. The schema is created from the initial data class structure illustrated in Figure 1.1. The states table contains data about the application state of each CellApp. The neighbors table contains all the neighbors of the CellApps. A neighbor contains configuration and relation data about a neighboring CellApp. One CellApp can have many neighbors. In Figure 4.1, the bold fields with a key symbol next to them are primary keys. The line between the tables stands for a foreign key constraint: it shows that CellAppId in the neighbors table is a foreign key connected to a single CellAppId in the states table, and that many rows in the neighbors table are connected to a single CellAppId.

One thing to consider when designing this schema is that not all of the DBMSs in this heterogeneous system support the same data types. An example of this is that MySQL with the Memory engine does not support the STRING data type. This data type was used in an initial design of the schema but had to be changed when this was discovered. Another similar example is that SQLite does not fully support the BOOL data type. It will accept a BOOL data type in the schema, but it will just convert it to NUMERIC. This means that when inserting data into the SQLite DBMS, it does not recognize the symbols TRUE and FALSE; instead, 0 and 1 are used. This is also something that must be taken into account when designing an SQL statement that should work on many different DBMSs.
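
To make the data type issue concrete, a portable sketch of the neighbors table and an insert is shown below. It avoids STRING and BOOL in favor of types and literals accepted by all the DBMSs mentioned above; this is a minimal sketch and not the exact schema used in the thesis.

    CREATE TABLE neighbors (
        CellAppId  INTEGER NOT NULL,
        NeighborId INTEGER NOT NULL,
        Field1     VARCHAR(50),
        Field2     INTEGER,
        PRIMARY KEY (CellAppId, NeighborId)
    );

    -- Boolean-like fields are written as 0 or 1, since SQLite maps BOOL to NUMERIC
    INSERT INTO neighbors (CellAppId, NeighborId, Field1, Field2)
    VALUES (17, 0, 'data_field', 1);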

4.2

Distribution Schema

The many databases in the system together make up a distributed database. The structure of this distributed database is illustrated in Figure 4.2 and Figure 4.3. The data in the two tables in Figure 4.2 and Figure 4.3 is horizontally partitioned. These illustrations show all the data of the entire system. The illustrations also show how the data is connected to the different base stations and CellApps.

Figure 4.1: SQL database schema


Figure 4.2: States table distribution schema

Figure 4.3: Neighbors table distribution schema


5

System Architecture

The problem that will be studied in this chapter is finding an architecture for the network of heterogeneous databases that makes up the distributed database. First, different approaches for designing the architecture will be discussed. After that, more concrete architectures for this specific problem will be presented. The architectures presented are connected to the techniques found in the theory chapter, because the different techniques often promote the use of a certain architecture.

There are two distinct implementation approaches on which to build this architecture. One is a centralized approach where there is a middleware that consists of a single component connected to every database in the system. The other approach is that every CellApp in the system is connected directly to all the databases it needs to communicate with. Each of these approaches has different pros and cons.

The problem consists of two specific parts: the CellApps, and the single machine learning application. These two applications have different access patterns and will be used for different purposes. One approach could be to use a single middleware for communication with both applications. This requires a more complex middleware since it demands something that is both good for real-time queries (OLTP) and something that is good for complex long-running queries for analytics (OLAP). Because of the different characteristics of these two applications, it could be simpler to use two middlewares, one for each. These two approaches will be discussed more in depth in the following two sections. After those two sections, more concrete architectures will be illustrated.

5.1

Centralized Approach

A centralized approach means that there is a single component in the system that can connect and execute queries to all databases in the system. In reality, this single component could itself consist of a cluster of components to improve performance. This is the case with both PrestoDB and Drill mentioned in Chapter 3.

By having this centralized component, it could be able to execute distributed queries over multiple data sources at once. This is possible using both PrestoDB and Drill mentioned in Chapter 3. The central component will have a lot of knowledge about the system of databases and can often optimize queries to get good performance, especially for complex analytics queries that these systems are designed for.

A positive thing, related to the configuration needed to connect to the DBMSs in this centralized approach, is that only one component needs to be configured to connect to all databases in the system. This way it will be simple to add new databases to an already large system; all that is needed is to change the configuration of one single centralized component. The downside of using a central component is that it will add extra overhead for each query compared to a direct connection to a DBMS. This overhead is because the query needs to go through a central component instead of directly to the source. The overhead could, for instance, come from added network data transfer or the additional query processing inside this central component. It could be the case that this overhead adds too much of a performance penalty to work with the real-time part of the problem (the CellApp).

Another downside created by using a central component is that all queries need to go through a single component. Hence, a single point of failure is introduced. If this single component for some reason does not work, the entire system will not be able to access data. Another, more concrete, negative aspect of this approach is that most of the central systems found in Section 3.6 do not fully support update and write operations on the data. The systems are mainly designed for analytics and, while they offer some crude ways of doing bulk updates on data, they are not ideal for frequent small updates.

5.2

Distributed Approach

In the distributed approach there is no centralized component that constitutes the middleware. Here the middleware is a part of each of the CellApps in the system. Each CellApp will have its own functionality to connect to the DBMSs that it needs to communicate with.

What this means in practice is that each CellApp will need to be configured to connect to one or multiple DBMSs. If two different CellApps need to connect to the same DBMS, they both require configuration data on how to connect to that specific source. This may not be a big problem in a small system containing only a few DBMSs and a few CellApps, but the problem will grow as the size of the system increases. If, for example, there exists a high-performance DBMS that thousands of CellApps connect to, all the CellApps will need to be re-configured in order to connect them to another DBMS. Doing the same thing in the centralized approach only requires one component to change configuration, and the change will apply to all CellApps using that central component.

For small and quick transactions this distributed approach might be a good fit, since each application has a direct connection to the DBMS in question. Having a direct connection is beneficial for fast transactions that only need to read or update a few values. Also, since there is a direct connection to the DBMS, it should be possible to both read and write data quickly without the limitations seen in the centralized approach. Both ODBC and JDBC could be used for this direct connection to the databases; they support almost all the actions that can be performed by an RDBMS, as mentioned in Chapter 3.
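As a rough sketch of what such a direct connection could look like, the snippet below uses the pyodbc library with a hypothetical data source name CellDB; the table and column names follow the neighbors table used in the tests in Chapter 6, and the chosen key values are arbitrary.

import pyodbc

# Direct ODBC connection from one CellApp to one specific DBMS
# (hypothetical DSN "CellDB" configured on the machine).
conn = pyodbc.connect("DSN=CellDB")
cursor = conn.cursor()

# Small, fast read: fetch a single neighbor row by its primary key.
cursor.execute(
    "SELECT * FROM neighbors WHERE CellAppId = ? AND NeighborId = ?",
    (17, 42),
)
row = cursor.fetchone()

# Small update of a single attribute, committed immediately.
cursor.execute(
    "UPDATE neighbors SET Field4 = ? WHERE CellAppId = ? AND NeighborId = ?",
    ("test", 17, 42),
)
conn.commit()
conn.close()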



So, in the case of this system with the machine learning algorithm, there is no way to query the entire distributed database, consisting of all the individual databases, with a single query. This means that each database must be queried separately. One approach could be to copy data from each database into a single database in some data warehouse solution and then run the analytics queries there. In that case, after the analytics and the updates of the data inside the data warehouse, the data needs to be distributed again to update all the source databases. However, this approach also breaks the delimitation of querying data in-place.

5.3 Possible Architectures

As illustrated in Figure 1.2 there are basically two parts of the problem of connecting this system of heterogeneous databases: finding a suitable middleware for the CellApps, and finding a middleware for the machine learning application. It could, however, turn out that the CellApp and the machine learning application can use the same middleware. The list of possible architectures is nevertheless divided into one section for the CellApp and one section for the machine learning application. This split is done for simplicity; as mentioned previously, the same middleware may end up being used in both cases.

5.3.1 CellApp Architectures

In this section, different architectures for the CellApp will be presented.

5.3.1.1 Architecture A

Figure 5.1: Architecture A - each CellApp contains its own database access component, connecting to databases both on the base station and at a remote location (figure not reproduced here)



Illustrated in Figure 5.1 is Architecture A, one of the simplest architectures. In this architecture, each of the CellApps contains a database access component that handles the communication between the DBMS and the CellApp. This architecture could be implemented using some library that handles communication with the DBMS.

A portability problem with this architectural approach can be seen in Figure 5.1: every CellApp needs to contain its own configuration for how to connect to the DBMSs. The problem becomes apparent when many CellApps connect to the same database and that database is moved to another address or replaced completely. In that case, every CellApp needs to be re-configured to be able to connect to the new location or database. A step towards a solution to this problem can be seen in Architecture B (Section 5.3.1.2).

5.3.1.2 Architecture B

Figure 5.2: Architecture B - Simple middleware

Illustrated in Figure 5.2 is Architecture B. This architecture would still be classified as a distributed approach since every base station still contains the middleware. But in this case, the configuration of the database connection is shared between all CellApps on the same machine that connect to the same database. This means that the middleware is still local to each machine but can be used by all the CellApps on the machine. This architecture could be realized using ODBC or JDBC.
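A hedged sketch of what this shared per-machine configuration could look like with unixODBC is shown below. The DSN name, driver, host, and port are placeholders, and the exact keys depend on which ODBC driver is used; the point is only that the connection details live in one file per machine rather than inside every CellApp.

[NeighborDB]
Driver   = PostgreSQL Unicode
Server   = db.remote.example
Port     = 5432
Database = cells

Every CellApp on the machine then connects using only the data source name (for example DSN=NeighborDB), so moving the database only requires editing this one file on each base station.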

As seen in Figure 5.2 the portability problem, described in Architecture A, has been mitigated to an extent. If the database is moved to a new location, each machine still has to be re-configured with the new connection details. This can still be a problem in a big system, but it is still better than having to change the configuration of each CellApp, as was required in Architecture A.

5.3.1.3 Architecture C

Figure 5.3: Architecture C - Simple

Illustrated in Figure 5.3 is Architecture C, a more centralized approach. Here the middleware is abstracted outside the machine that the CellApps are running on. When looking at the more advanced setup of this approach in Figure 5.3 it is clear that this is better from the configuration perspective: only one component is required to know the location of the databases, so in the case of a database move, only that one component would need to be re-configured. The problem with this architecture is the overhead that it adds, as discussed in Section 5.1. This overhead is especially clear in Figure 5.3, where the query is sent to the middleware outside the machine and then back inside the machine to the database, which adds network overhead. Also, compared to the previous two architectures, this architecture introduces a single point of failure: if the middleware component crashes, all CellApps that rely on that component will stop working.

This kind of architecture could be built using PrestoDB or Drill.
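For example, with PrestoDB, making a new database reachable through the central component roughly amounts to adding one catalog properties file on that component. The sketch below follows the format of Presto's PostgreSQL connector, with placeholder host and credentials.

# etc/catalog/celldata.properties on the central Presto component
connector.name=postgresql
connection-url=jdbc:postgresql://db.remote.example:5432/cells
connection-user=cellapp
connection-password=secret

No CellApp needs to be touched; they keep querying the central component, which then also exposes the new database as a catalog.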

5.3.2 Machine Learning Application

For the machine learning application, there are two different architectures that could be implemented. Either a database access component is integrated inside the application, with one connection to every database, as shown in Figure 5.5; or the middleware is abstracted into its own component, as seen in Figure 5.4. Since there is only one machine learning application, these two solutions are similar to each other.



5.3.2.1 Architecture D

Illustrated in Figure 5.4 is Architecture D, where the machine learning application is connected to a middleware. Because there is only one machine learning application, it does not really matter whether this middleware component runs on the same machine as the machine learning application or on a different one. This kind of architecture could be implemented either using ODBC or using a distributed query engine as middleware.


Figure 5.4: Architecture D - Middleware

5.3.2.2 Architecture E

Illustrated in Figure 5.5 is Architecture E, where a database access component inside the machine learning application connects directly to the DBMSs.



Figure 5.5: Architecture E - database access component integrated in the machine learning application, connected directly to the databases (figure not reproduced here)


6 Results

6.1 Performance Measurements

This test was devised to measure the performance of ODBC and JDBC with different DBMSs, in order to assess whether it is possible to fulfill the performance requirements listed in Section 1.4.1.4. The test shows which DBMSs can be used and whether ODBC or JDBC should be used. The CellApp was the focus of the test, since it has by far the toughest performance requirements, see Section 1.4.1.4.

Since the CellApp is a real-time application that needs data quickly to make its calculations, the average query time is not the only important metric. Just as important is the worst-case (maximum) time for a query, because the CellApp must be able to guarantee execution time.

Because the read queries have much stricter performance requirements than the write operations, the read queries are the focus of the tests. It is still important to perform write operations as well, since write operations executed concurrently also affect the query time of the read operations.

6.1.1 Test Setup

Illustrated in Figure 6.1 is the setup used for this test. The architecture of the test setup is similar to Architecture B, described in Section 5.3.1.2. The test was performed using ODBC and JDBC as middleware between the CellApp and the DBMS, so the CellApp has a direct connection to the DBMS through ODBC or JDBC. To minimize network lag as a factor, both the DBMS and the CellApps were run on the same machine.



Figure 6.1: Test setup 1

6.1.1.1 Mock Application

To run the test, a mock application of the CellApp was created. This mock application does not do any "real work" but emulates the data access pattern of the CellApp. From the performance requirements in Section 1.4.1.4 the following access patterns can be derived:

• The CellApp should be able to query data from 100 neighbors per second. Each query may take at most 10 milliseconds. This also means a throughput of 6000 queries per minute for every CellApp. Each neighbor query returns 20 attributes.

• The CellApp will update an attribute two times per minute. This means that a write statement has at most 30 seconds to execute, which is a more relaxed condition than for the read queries. The write statements are executed in parallel with the read queries.

The pseudo code for the mock CellApp is seen in Listing 6.1. The function write_request executes the SQL statement seen in Listing 6.3, and the read_request function executes the SQL query seen in Listing 6.2. The row to read from or write to in the database is chosen with a random component, so that the measured time is closer to that of a random-access query rather than just a cached response. The variable currentId is the id of the CellApp, and randomId is an integer drawn uniformly at random from an interval between zero and the maximum number of neighbors, which in this case equals 300 neighbors.

Listing 6.1: Mock CellApp pseudo code

import time

# Run in parallel every 30 seconds
set_interval(write_request, 30)

i = 0
# Number of read requests to execute in the test
TIME = 3 * 60 * 100
while i < TIME:
    i += 1
    start = time.time()
    read_request()
    end = time.time()
    # Remaining part (in seconds) of the 10 ms budget for this query
    res = (10 - (end - start) * 1000.0) / 1000.0
    if res > 0:
        # Wait so that at most 1 request is executed per 10 ms
        time.sleep(res)

Listing 6.2: Read request

SELECT * FROM neighbors
WHERE CellAppId = currentId AND NeighborId = randomId;

Listing 6.3: Write request

UPDATE neighbors SET Field4 = 'test'
WHERE CellAppId = currentId AND NeighborId = randomId;

6.1.1.2 Database Management Systems (DBMSs) Tested

The following DBMSs were benchmarked in this test:

• MySQL InnoDB - Version 5.7.22
• MySQL Memory - Version 5.7.22
• MySQL MyISAM - Version 5.7.22
• PostgreSQL - Version 9.5.12
• SQLite - Version 3.11.0
• SQLite on RAM disk - Version 3.11.0
• H2 - Version 1.4.197

All of them were tested with ODBC to compare the performance of the different DBMSs. Some of the DBMSs were also tested with JDBC to compare the performance of ODBC and JDBC.



6.1.1.3 Test Parameters

The following test parameters were used during the test; they were chosen to emulate real-life conditions. Each test was run until 18000 read queries had been resolved for each CellApp. The number 18000 corresponds to 3 minutes if the read queries take, on average, 10 ms to resolve. To emulate a real system, 18 CellApps were run simultaneously, which means that a total of 324000 read queries were executed during the course of the test. Each CellApp has 300 neighbors, again to emulate real-world conditions. A summary of the parameters used in this test can be seen in Table 6.1.

Indexes for the different DBMSs were created implicitly for the primary keys. Since the primary keys are the only attributes used to query the data, no other indexes should be needed.

Table 6.1: Test parameters

Number of CellApps running concurrently      18
Total number of read queries executed        324000
Database table "states" number of rows       18
Database table "neighbors" number of rows    5400

Each of the DBMSs was initialized with mock data for 18 CellApps before the test started. As seen in Table 6.1, the states table consists of 18 rows of data and the neighbors table of 5400 rows. The number of rows in the neighbors table comes from the fact that each of the 18 CellApps has 300 neighbors (18 × 300 = 5400). In this test, the neighbors table was the one that the read queries and update statements were run against. The schema for the tables is illustrated in Figure 4.1.
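The initialization itself is not part of what was measured; as an indication of its shape, the sketch below fills the neighbors table through pyodbc, assuming a hypothetical DSN CellDB, a composite primary key on (CellAppId, NeighborId), and that the remaining attribute columns (of which only Field4 is referenced by the test listings) are nullable or have defaults.

import pyodbc

# Initialize mock data: 18 CellApps x 300 neighbors = 5400 rows in "neighbors".
conn = pyodbc.connect("DSN=CellDB")
cursor = conn.cursor()
for cell_app_id in range(18):
    for neighbor_id in range(300):
        cursor.execute(
            "INSERT INTO neighbors (CellAppId, NeighborId, Field4) VALUES (?, ?, ?)",
            (cell_app_id, neighbor_id, "init"),
        )
conn.commit()
conn.close()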

The test was done on a virtual machine running on a server provided by Ericsson. Specifications for this machine can be seen in Table 6.2. This machine is about as powerful as an average home PC, which is realistic for this application. Both the processes of the DBMSs and the CellApps were scheduled automatically by Ubuntu.

Table 6.2: Test machine

CPU               Intel Xeon E3-12xx
Cores             8
RAM               16 GB
Operating system  Ubuntu 16.04.1 LTS

6.1.2 Result using ODBC

The result of the tests using ODBC can be seen in Table 6.3. The first third of the read queries executed was used as a warm-up period for the DBMS. Hence, the first 6000 queries were not used when calculating the result.

Since the CellApp mock application only issues one query every 10 ms, the real throughput for each CellApp will always be constant at 6000 queries per minute as long as no query exceeds the maximum allowed query time of 10 ms. The throughput shown in the tables below is therefore calculated based on the average query time, and is an approximation of the throughput if the CellApps were to issue read queries as fast as possible. The calculation used for the throughput can be seen in Equation 6.1, where t is the throughput in queries per minute and a is the average query time in milliseconds.

t = (18 · 60 · 1000) / a    (6.1)
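As an illustration of Equation 6.1 (the figures below follow directly from the formula and are not measured results), an average query time of exactly 10 ms gives 108 000 queries per minute in total, i.e. the required 6000 queries per minute for each of the 18 CellApps.

# Approximate total throughput (queries per minute) for 18 concurrent CellApps,
# given the average read query time a in milliseconds (Equation 6.1).
def throughput(a_ms: float) -> float:
    return 18 * 60 * 1000 / a_ms

print(throughput(10.0))  # 108000.0 queries per minute (the required minimum)
print(throughput(1.0))   # 1080000.0 queries per minute (if queries average 1 ms)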
