
Execution Time Analysis of Electrical Network Tracing in Relational and Graph Databases

FELIX DE SILVA

KTH ROYAL INSTITUTE OF TECHNOLOGY

SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE


Execution Time Analysis of Electrical Network Tracing in Relational and Graph Databases

FELIX DE SILVA

Master in Computer Science
Date: March 16, 2019
Supervisor: Mika Cohen
Examiner: Mads Dam
School of Electrical Engineering and Computer Science


Abstract

In today’s society, we handle a lot of connected data; companies like Facebook and Amazon, for example, handle connected data in different ways. Geographic Information Systems and Network Information Systems handle connected data in the form of networks or graphs that can represent anything from an electrical network to a product network.

When it comes to connected data, the most commonly used database technology is relational databases. However, with a lot of new databases emerging, there may be better alternatives for connected data that can provide higher performance.

In this study we look at the Oracle relational database and the Neo4j graph database and study how both databases traverse an electrical network. The findings indicate that the Neo4j graph database outperforms the Oracle relational database regarding execution time of search queries.


Sammanfattning

In today’s society, we handle a lot of connected data; companies like Facebook and Amazon, for example, handle connected data in different ways. Geographic information systems and network information systems handle connected data in the form of networks or graphs that can represent anything from electrical grids to a product network.

When it comes to connected data, the most widely used technology is relational databases. But with many new databases emerging, there may now be better alternatives for connected data that can provide higher performance.

In this study we look at the relational database Oracle and the graph database Neo4j and examine how both databases traverse an electrical network. The presented results show that the graph database Neo4j performs graph traversal faster than the relational database Oracle, with the focus on execution time.


Contents

1 Introduction
  1.1 Problem Background
  1.2 Research Question
  1.3 Objective
  1.4 Purpose
  1.5 Scope
  1.6 Terminology

2 Theoretical Background
  2.1 Relational Databases
    2.1.1 Tables and Keys
    2.1.2 Index
    2.1.3 Stored Procedure
    2.1.4 Query Processing
    2.1.5 The Oracle Relational Database
  2.2 Graph Databases
    2.2.1 Graphs
    2.2.2 Index-Free Adjacency
    2.2.3 Query Processing
    2.2.4 The Neo4j Graph Database
  2.3 Database Benchmarking
  2.4 Database Modeling
    2.4.1 Relational Modeling
    2.4.2 Graph Modeling
  2.5 Query Execution Time Estimation
    2.5.1 Access Time
    2.5.2 Storage Time
    2.5.3 Computation Time
    2.5.4 Communication Time
  2.6 Database Storage

3 Related Research
  3.1 Benchmarking Database Systems for Social Network Applications
  3.2 The Shortest Path Algorithm Performance Comparison in Graph and Relational Database on a Transportation Network
  3.3 Relational Database and Graph Database: A Comparative Analysis
  3.4 Comparative Analysis of Relational and Graph Databases

4 Methodology
  4.1 Datasets
  4.2 Modeling
  4.3 Benchmark Framework
    4.3.1 Query

5 Results
  5.1 Execution Time
  5.2 Throughput
  5.3 Standard Deviation

6 Discussion
  6.1 Benchmark Comparison
  6.2 Complexity Analysis
  6.3 Execution Plan Analysis
  6.4 Practical Applications

7 Conclusion
  7.1 Summary
  7.2 Future Work

Bibliography

A Neo4j Graph Creation Algorithm
B Cypher BFS Complete Search
C Cypher BFS Stop-Label Search
D Result Data - Complete Search
E Result Data - Stop-Label Search


Introduction

In this chapter we present the research question along with the objective of this thesis. The limitations of this thesis are presented in the scope, and the purpose of the project is thereafter clarified.

1.1 Problem Background

We live in a world where information exists everywhere and information storage is an essential part of society. Information such as e-mails and personal information needs to be stored in easily accessible and efficient ways. Today, information can be stored in either physical or digital form; for computerized devices, digital formats like file systems and databases are the most efficient ways.

The development of traditional SQL databases has been ongoing since the 1980s [19] and has laid the foundation of modern databases. NoSQL and NewSQL are prominent database types that are widely used today.

There are many aspects to consider when selecting a database type for storage in a system or application, such as the property of connectivity.

Everything that can be abstracted into a graph or network has the property of connectivity. The Internet with its hyperlink network, Facebook with its social network and Amazon with its product network are examples of large information networks built from data with high connectivity. When data in a database are connected in such a manner, the connections can be described as relationships between data points.


In relational databases, the data structures used for data storage are grid-structured tables. Connectivity among data here means that a data cell from a table refers to a data row in the same or another table.

Accessing the data points in such relations can be done through SQL JOIN operations, which merge the tables to make the related data points accessible. When datasets become more interrelated, carrying out queries can become more complex because of the possible need for more JOIN operations.

Native graph databases store data in nodes and connect the nodes with relationships, both of which are first-class citizens [10], creating a graph structure. Contained within each node is a list of relationship records that represent the node’s relationships to other nodes. When carrying out queries similar to JOIN operations, the database uses these lists and has direct access to the connected nodes, eliminating the need for a time-consuming computation like in relational databases.

There are many reasons why connectivity and JOIN operations have such a complicated relationship. One reason is the underlying architectures of the database types, in how the inner mechanics of handling and enabling connectivity work. According to Harrison [9], native graph databases are built with a primary focus on connectivity, which is not the case for relational databases. The choice of database type therefore varies depending on the application and what aspects are prioritized.

Digpro is a company that deals with Geographic Information Technologies. It develops and provides software in the form of Geographic Information Systems (GIS) and Network Information Systems (NIS). dpSpatial is a platform developed by Digpro that lays the foundation of all of Digpro’s product applications. It utilizes an Oracle relational database for a range of functionalities, mostly for storing and retrieving data and executing search queries.

As mentioned before, Digpro is using a relational database in which connectivity is not the main focus, which may affect the execution time of search queries. Using an alternative database that has connectivity as its main focus, such as a native graph database, may enhance the performance of dpSpatial.

The Neo4j graph database [20] is of specific interest for Digpro, because it is widely used and optimized for connected data [19][10]. Digpro handles highly connected infrastructural electrical networks with different layers representing various levels of structures, cables and cable housing. Due to the complexity of the structured data with respect to connectivity, it is possible that the Neo4j graph database can be an effective solution for handling search queries. This possibility is further explored in this project.

1.2 Research Question

How does a native graph database affect the execution time of search queries in an interconnected electrical network, compared to a relational database?

1.3 Objective

This thesis aims to investigate whether a native graph database is a better alternative than a relational database, concerning the execution time of search queries requiring retrieval of connected data. In this study, the concept of execution time is interpreted as the time from when a query is sent until a response is received from the database.

1.4 Purpose

This study explores the improvement possibilities of search queries in terms of execution time by replacing relational databases with native graph databases. If this study shows improvements, this may be relevant for companies handling connected data and may aid them in further improving their software by exploring native graph databases.

1.5 Scope

The databases that are studied in this thesis are the Oracle relational database and the Neo4j graph database. Specifically, it is investigated whether Neo4j has a shorter execution time when traversing an interconnected electrical network than Oracle. The underlying models of the electrical networks are in this study identical for both databases.

In this study, we compare Neo4j and Oracle in terms of the database internals, the underlying architecture and the algorithms. External factors such as system latency are not within the scope of this study, since both databases operate on machines with similar hardware (e.g. SSD, CPU and RAM).

1.6 Terminology

Expression                             Abbreviation   Definition

Oracle Relational Database             Oracle         A relational database developed by Oracle Co.
Neo4j Graph Database                   Neo4j          A native graph database developed by Neo4j Inc.
Relational Database Management System  RDBMS          An application that allows a user to manage a relational database.
Structured Query Language              SQL            A programming language to access data in a relational database.
Database Benchmark                     DBB            An application that measures the performance of a database.

Table 1.1: Short definitions of relevant expressions.


Theoretical Background

This chapter covers in-depth background into relational databases and graph databases as well as database benchmarking, database modeling, time estimation and database storage.

2.1 Relational Databases

The relational model was first introduced by Ted Codd of IBM in the 1970s [14][8]. This model has its theoretical foundation in set theory and first-order predicate logic, and it has given birth to relational algebra, which is the fundamental logic behind relational databases.

A relational database is filled with collections of tables, where each collection is represented as a separate relational database unit. A Relational Database Management System (RDBMS) is an application that allows a user to manage access to the relational database [13].

Accessing the collection of data that resides within an RDBMS can be done in different ways. The standard way of accessing data is through a non-procedural, declarative, domain-specific programming language called Structured Query Language (SQL). The SQL functionality is divided into three main parts:

• SQL schema statements - The creation of data structures.

• SQL data statements - The manipulation of data.

• SQL transaction statements - The management of transactions.


All of these parts make SQL an efficient language for the relational database model [5].

JOIN operations are the part of the SQL language that handles data manipulation. These operations combine rows from different tables based on related column names to form a new table. There are several types of JOIN operations, but the only type of concern in this project is the inner-JOIN. This JOIN operation can be used to combine two different tables; when it is used to combine a table with itself, it is called a self-JOIN.
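To make the distinction concrete, the following snippet runs an inner-JOIN and a self-JOIN through JDBC. This is an illustrative sketch: the EMPLOYEE and DEPARTMENT tables and their columns are hypothetical and not taken from this thesis’s datasets.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class JoinExample {
        public static void main(String[] args) throws Exception {
            // Connection details are placeholders.
            Connection conn = DriverManager.getConnection(
                    "jdbc:oracle:thin:@localhost:1521:orcl", "user", "password");
            try (Statement stmt = conn.createStatement()) {
                // Inner-JOIN: combine rows from two tables on a related column.
                ResultSet departments = stmt.executeQuery(
                        "SELECT e.name, d.name FROM employee e "
                        + "JOIN department d ON e.dept_id = d.id");
                departments.close();
                // Self-JOIN: the same table combined with itself, here pairing
                // each employee with their manager.
                ResultSet managers = stmt.executeQuery(
                        "SELECT e.name, m.name AS manager FROM employee e "
                        + "JOIN employee m ON e.manager_id = m.id");
                while (managers.next()) {
                    System.out.println(managers.getString(1)
                            + " reports to " + managers.getString("manager"));
                }
            } finally {
                conn.close();
            }
        }
    }

The more interrelated the schema becomes, the more of these JOIN operations a single query needs, which is the cost discussed above.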

The following subsections cover relational tables along with techniques to improve execution time (i.e. indexes and stored procedures) and the process behind query processing. Lastly, the Oracle relational database is introduced.

2.1.1 Tables and Keys

Most relational databases provide a database management system that contains a variety of data structures and fundamental properties. These structures and properties allow the RDBMS to create database abstractions for many types of systems.

Tables

Tables are the primary data storage structure of relational databases, where all data about a certain kind of entity are kept. A table consists of rows and columns. The table and column names are used to interpret the meaning of the data in each row. A simple example is a table called PERSON, where a PERSON entity (i.e. a row in the table) has a name, age and occupation (see Figure 2.1). These attributes are the column names of the PERSON table. When a PERSON entity is added to the table, the data values are stored under the corresponding columns to represent that specific PERSON entity.


Figure 2.1: A PERSON table with the attributes name, age and occupation.

Keys

A key consists of one or more attributes that serve as a unique identifier of a row in a table [8]. Keys can come in different forms, but the most common ones are primary keys and foreign keys.

A primary key is defined as a set of columns of a table that uniquely identifies a row in the table. With primary keys, the database ensures that the rows have unique, non-duplicated and non-NULL values in those columns. A foreign key is defined as a column in a table that references a primary key in another table. Foreign keys, in contrast to primary keys, accept NULL values and duplicate values. For example, see Figure 2.2, where we have a table PERSON and a table OCCUPATION. In the PERSON table we have a person called Johan whose occupation is the number 2. Here, the number 2 is a foreign key referencing the ID value 2, which is a primary key in the OCCUPATION table.


Figure 2.2: A representation of a primary-foreign key relationship.
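As a sketch of how the relationship in Figure 2.2 could be declared, the following JDBC snippet creates the two tables with a primary key and a foreign key constraint; the exact column types are assumptions made for the example.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class KeyExample {
        public static void main(String[] args) throws Exception {
            Connection conn = DriverManager.getConnection(
                    "jdbc:oracle:thin:@localhost:1521:orcl", "user", "password");
            try (Statement stmt = conn.createStatement()) {
                // ID is the primary key: unique, non-duplicated and non-NULL.
                stmt.execute("CREATE TABLE occupation ("
                        + "id NUMBER PRIMARY KEY, "
                        + "title VARCHAR2(50))");
                // The OCCUPATION column references the primary key of the
                // OCCUPATION table; unlike a primary key, it may be NULL
                // or duplicated.
                stmt.execute("CREATE TABLE person ("
                        + "id NUMBER PRIMARY KEY, "
                        + "name VARCHAR2(50), "
                        + "age NUMBER, "
                        + "occupation NUMBER REFERENCES occupation(id))");
            } finally {
                conn.close();
            }
        }
    }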

2.1.2 Index

Indexes are used to speed up the search for data in a table, known as a look-up. They can be used to efficiently find all rows matching some attribute in a search query and then search through only that subset of the table to find exact matches. If a table does not have an index on any column, the database has to search through the entire table and check every row to find matches, which is a slow operation on larger tables [8][13].

Clustered Index

A clustered index can be applied to a column in a table, by which the rows in the table will then be sorted. An example is the table of contents in this paper, where the enumerations of the chapters and sections serve as the clustered index. Clustered indexes are suitable for retrieving a lot of data for range-based queries since all data are located next to each other. A clustered index also reflects how data is stored in memory, which limits the database to one clustered index per table [8][13].

Non-Clustered Index

A non-clustered index is different from a clustered index in that a table can have many non-clustered indexes pointing to data cells in the table [8][13]. The index pages at the end of a book are an example of how non-clustered indexes work; they point to the places where a sought word is. For instance, when searching for the word "Mom", the non-clustered index points to the places in the table where the word "Mom" is stored. This is convenient for larger tables in order to retrieve specific information that would otherwise require searching the entire table for matches, thus streamlining the execution time.

2.1.3 Stored Procedure

Stored procedures are sets of pre-compiled statements used to perform a particular task. These procedures are parsed and optimized before the first execution, and compiled versions of the stored procedures remain in memory for later use. This means that a stored procedure does not need to be re-parsed or re-optimized for each execution, which results in shorter execution times.

A benefit of stored procedures is that the developer of a procedure can centralize data-access logic in a single place. That is, the procedure itself handles the relevant data access internally, and callers of the procedure only need to be concerned with the output. With stored procedures, the developer can specify access permissions for specific procedures, adding another level of security. This means that a user does not need to have read/write permissions on the underlying tables of the procedures. Furthermore, stored procedures allow for more structure and organization in larger SQL code bases, since repeated SQL segments can be replaced with procedure calls.
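A sketch of how an application would invoke a stored procedure through JDBC is shown below; the procedure name and its parameters are hypothetical and only illustrate the call pattern, not the procedure used later in this thesis.

    import java.sql.CallableStatement;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Types;

    public class StoredProcedureExample {
        public static void main(String[] args) throws Exception {
            Connection conn = DriverManager.getConnection(
                    "jdbc:oracle:thin:@localhost:1521:orcl", "user", "password");
            // The SQL inside the procedure was parsed and optimized ahead of
            // time; the caller only supplies parameters and reads the output.
            try (CallableStatement call = conn.prepareCall("{call count_persons(?, ?)}")) {
                call.setInt(1, 2);                           // IN: occupation id
                call.registerOutParameter(2, Types.INTEGER); // OUT: count
                call.execute();
                System.out.println("Count: " + call.getInt(2));
            } finally {
                conn.close();
            }
        }
    }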

2.1.4 Query Processing

In the Oracle relational database used in this study, the query processing has four steps. As described by Elmasri [8] and Oracle [17], these steps are:

1. Parsing

2. Optimization

3. Query plan generation

4. Execution


The way queries are processed affects the execution time. The following paragraphs explain the steps in more detail and allow for a deeper understanding of the execution plan analysis in Chapter 6.

Parsing

The purpose of the parsing phase is to make sure that the query is valid before going to the next phase. During parsing, the query is passed through three different checks:

1. Syntax check

Looks at the query and checks that it is syntactically correct according to the SQL grammar.

2. Semantic check

Looks at references to the database objects and checks that host variables have correct data types and are valid.

3. Shared pool check

Looks through the shared pool (i.e. a shared memory space) and tries to find previously parsed statements, in order to determine whether some resource-intensive steps of the query processing can be skipped. If a matching statement exists in the shared pool, the database performs semantic checks to determine whether the two statements are semantically equivalent.

Optimization

The optimization phase of the query processing generates multiple execution plans and then chooses the one with the lowest cost for the next phase. The optimizer uses collected statistics to calculate a cost for a subquery execution plan. SQL is a non-procedural language, which means that the optimizer can reorganize the subqueries in any order. The cost computation takes into account factors of query execution such as I/O, CPU and communication; the resulting cost is an internal unit that the optimizer uses for plan comparisons.

The optimizer determines the optimal execution plan for a subquery by examining multiple access methods, such as full table scans, index scans and different JOIN operations. After calculating the total cost for each execution plan, the optimizer chooses the plan with the lowest cost estimate. The selected execution plan is a recommendation of an execution method for the SQL statement. However, this execution plan is not directly usable by the database, but only by the optimizer; the next step formulates it as a query plan that the database can use.

Query Plan Generation

The query plan generation phase formulates a query plan that is usable for the database. The query plan itself takes the form of an iterative plan with several steps. Each step returns a row source, which is either used in the next step or returned to the application that initiated the SQL statement. A row source is a set of rows returned by a step in the query plan, along with a control structure that helps process the rows.

Execution

The execution phase uses the query plan generated in the query plan generation phase. During the execution phase, the SQL engine goes through the query plan and executes each row source. If the data are not stored in memory, then the database reads the data from disk to memory and then executes. The database makes sure to preserve the integrity by releasing internal locks and logging any changes made during the query execution.
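As a practical aside (not part of this thesis’s benchmark), the plan that Oracle’s optimizer selected can be inspected with EXPLAIN PLAN and DBMS_XPLAN; the queried table names below are hypothetical.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class ExplainPlanExample {
        public static void main(String[] args) throws Exception {
            Connection conn = DriverManager.getConnection(
                    "jdbc:oracle:thin:@localhost:1521:orcl", "user", "password");
            try (Statement stmt = conn.createStatement()) {
                // Ask the optimizer to generate and store a plan without
                // executing the statement itself.
                stmt.execute("EXPLAIN PLAN FOR SELECT e.name FROM employee e "
                        + "JOIN department d ON e.dept_id = d.id");
                // Read back the chosen plan: access methods, JOIN order, costs.
                ResultSet rs = stmt.executeQuery(
                        "SELECT plan_table_output FROM TABLE(DBMS_XPLAN.DISPLAY())");
                while (rs.next()) {
                    System.out.println(rs.getString(1));
                }
            } finally {
                conn.close();
            }
        }
    }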

2.1.5 The Oracle Relational Database

The Oracle relational database is an RDBMS from the Oracle Corporation. It has been in development for the last 35 years. The first commercially available version, called Oracle version 2, was introduced in 1979 and was the first SQL-based RDBMS. The system is built with a relational database framework where data objects can be directly accessed by users (or an application) through SQL statements. Today, the Oracle relational database is one of the most used RDBMSs and is often used by global enterprises [17].


2.2 Graph Databases

Graph databases come from the NoSQL family of database systems that have their foundation in graph theory [18]. In a graph database, an object or system is represented by a mathematical graph. Figure 2.3 is an example of a graph sketch of a person, where the person has a name, age and occupation. The different elements used to create a graph-like abstraction are further explained in section 2.2.1.

Figure 2.3: A sketch of a graph representing a person and some of its attributes.

Accessing data that reside in a graph database can be done using different languages. Since graph databases are quite new in comparison to relational databases, there has not been any widespread standardization of a graph query language [1]. However, SPARQL, Gremlin, GraphQL and Cypher are graph query languages that many graph databases have adopted, with Cypher being the most popular one due to its resemblance to SQL [12]. Cypher is a declarative graph query language, created by Neo4j Inc., which allows for expressive and efficient creation of graphs as well as manipulation of graph data. Cypher’s code aesthetics resemble ASCII art and SQL, making it relatively easy to formulate bidirectional queries for complicated systems [20].
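As a small illustration of Cypher’s ASCII-art style, the snippet below sends a Cypher query through the Neo4j Java Driver (the driver family used by the benchmark in Chapter 4). The PERSON/COMPANY data, property names and connection details are assumptions for the example.

    import org.neo4j.driver.v1.AuthTokens;
    import org.neo4j.driver.v1.Driver;
    import org.neo4j.driver.v1.GraphDatabase;
    import org.neo4j.driver.v1.Record;
    import org.neo4j.driver.v1.Session;
    import org.neo4j.driver.v1.StatementResult;

    public class CypherExample {
        public static void main(String[] args) {
            Driver driver = GraphDatabase.driver(
                    "bolt://localhost:7687", AuthTokens.basic("neo4j", "password"));
            try (Session session = driver.session()) {
                // The (node)-[:RELATIONSHIP]->(node) pattern is the
                // ASCII-art-like part of Cypher.
                StatementResult result = session.run(
                        "MATCH (p:PERSON)-[:WorksAt]->(c:COMPANY) "
                        + "WHERE p.name = 'Johan' "
                        + "RETURN p.age AS age, c.name AS company");
                while (result.hasNext()) {
                    Record record = result.next();
                    System.out.println(record.get("age") + " " + record.get("company"));
                }
            }
            driver.close();
        }
    }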

The following subsections cover graphs in detail, followed by index-free adjacency and graph query processing. Lastly, the Neo4j graph database is introduced.


2.2.1 Graphs

Graphs are created with two core data structures: the nodes, where the data can be stored, and the relationships that represent the connections between the nodes. The way the data is stored in each node depends on the graph database itself. A node can represent either an entity or an attribute depending on the data key-value mappings within. The data that reside within a node are schema-less, meaning each individual node can have its own number of data mappings. If a node contains one key-value mapping, it could represent an attribute such as NAME in Figure 2.3. However, if a node contains several key-value mappings, it could represent an entity such as PERSON in Figure 2.4. We can also assign labels to nodes to help classify them; labels are used in a similar fashion to table names in that they describe which collection a node belongs to. For example, Figure 2.5 shows two nodes having the same label (PERSON), meaning they are classified as the same type of node [18][9].

Figure 2.4: A node representing a PERSON.

The other core data structure for forming graphs is the relationship. As mentioned previously, the purpose of a relationship is to represent a connection between nodes. A relationship can be classified by a label in a similar way as a node. Figure 2.5 shows several nodes labeled PERSON and COMPANY and their corresponding data; these nodes are connected via differently labeled relationships (:Friends, :WorksAt). In addition, a relationship can also hold data, mapped as key-value pairs [18][9].


Figure 2.5: An illustration of relationships between nodes.

2.2.2 Index-Free Adjacency

There are two main families of graph databases, native and non-native, where the underlying storage mechanism can vary. Some non-native graph databases depend on relational storage and store the graph data in relational tables, imposing another level of abstraction between the graph database, its management system and the physical devices where data are stored. Others use a key-value or document-oriented database for storage, making them inherently NoSQL databases.

Native graph databases, in contrast, are graph databases that are specifically built for graph handling [18]. Their storage, memory management and query engine all support index-free adjacency. With index-free adjacency, each node has a direct reference to its adjacent nodes. This means a look-up has a constant time complexity O(1), as opposed to a (global) index look-up, which typically has time complexity O(log(n)), where n is the number of nodes in the graph [18][9]. With the property of index-free adjacency, traversing a graph operates like pointer-chasing [22].


Figure 2.6: An illustration of a graph database with relational storage and a native graph database with graph storage.
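The difference can be sketched in a few lines of Java. The classes below are a toy model of index-free adjacency (not Neo4j’s actual storage format): each node holds direct references to its relationship records, so enumerating neighbors is pointer-chasing with O(1) access per hop, with no global index involved.

    import java.util.ArrayList;
    import java.util.List;

    // Toy model: each node keeps its own list of relationship records,
    // each of which points directly at the neighboring node.
    class Node {
        final long id;
        final List<Relationship> relationships = new ArrayList<>();

        Node(long id) { this.id = id; }

        // Following a relationship is a direct reference dereference, O(1).
        void printNeighbors() {
            for (Relationship rel : relationships) {
                System.out.println(id + " -[" + rel.type + "]-> " + rel.target.id);
            }
        }
    }

    class Relationship {
        final String type;
        final Node target;

        Relationship(String type, Node target) {
            this.type = type;
            this.target = target;
        }
    }

    public class AdjacencyExample {
        public static void main(String[] args) {
            Node busBar = new Node(1);
            Node cable = new Node(2);
            busBar.relationships.add(new Relationship("CONNECTED", cable));
            busBar.printNeighbors(); // prints: 1 -[CONNECTED]-> 2
        }
    }

A table-based store would instead answer "who is adjacent to node 1?" with an index look-up, which is typically O(log(n)).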

2.2.3 Query Processing

Queries are processed differently in different graph databases. In the Neo4j graph database used in this study, the processing of a query until it is ready for execution has several steps [18][9][20]:

1. Parsing

2. Query graph generation

3. Logical plan generation

4. Optimization

5. Execution plan generation

6. Execution

These steps may themselves contain several underlying steps. The steps in Neo4j’s query processing have similarities to the query processing in Oracle; both follow a general pattern of parsing, optimization and execution. However, because Oracle uses SQL and Neo4j uses Cypher, the similarities stop there. The following paragraphs explain the steps in more detail and allow for a deeper understanding of the execution plan analysis in Chapter 6.


Parsing

The parsing phase of the query processing has the end goal of creating a normalized Abstract Syntax Tree (AST). For that to happen, the query needs to be syntactically validated, tokenized and parsed into an AST. Then, using the AST, a semantic check is performed to validate types, scoping and binding of variables.

At this point, the AST is both syntactically and semantically validated. Now the parser only needs to normalize the AST. This is done by rewriting the AST in different ways; some of these revisions could be label reorganization, type reorganization, redundancy suppression or alias expansion.

Query Graph Generation

The query graph generation phase uses the normalized AST to generate a query graph, which is a more high-level abstraction of the query. This query graph allows for more high-level optimization in further steps.

Logical Plan Generation

The logical plan generation uses the query graph to generate logical plans. A logical plan is a plan that is used internally to determine which operations to use in future steps. A query graph can contain several subquery graphs; a logical plan is produced for each of the subgraphs in a step-by-step, bottom-up fashion.

At each step, using information from the query graph, the phase estimates the number of matching nodes using previously collected statistics and then uses these estimates to compute the cost of building a preferred logical plan. The cost of a logical plan is a statistical estimation of how much work the database needs to do. Usually, this cost mainly accounts for I/O reads from storage, in-memory computations and communication.

Building the logical plan bottom-up is done using Iterative Dynamic Programming (IDP), which makes sure that each individual step is the most cost-effective [16]. However, the cost effectiveness is built on estimates and is therefore subject to errors. This can occur in at least three different ways:

• Statistical inputs can be incorrect because sampling is not perfect.

• When combining costs, the phase does not know how the costs correlate, which can produce wrong estimates.

• If the search space grows too much between steps, the phase will prune parts of the logical plan, which may lead to some plans never being evaluated.

At the end of the logical plan generation phase, the preferred cost-effective logical plan is selected for the next step, the optimization step. The selection of the most effective plan is made using IDP as described above. However, the selected plan is not guaranteed to be the most optimal one.

Optimization

During the optimization phase, the logical plan is improved as much as possible. The process consists of running several improvement checks on the plan and applying suitable operations according to the results of these checks. Examples of such operations are unnesting and merging.

Execution Plan Generation

The execution plan generation phase uses the optimized logical plan and generates an execution plan. This is done by constructing an execution tree consisting of internal operations, where each non-leaf node gets information from one or two child nodes. The information sent between operators in the intermediate states of the tree is the needed attributes and their types.

Execution

The execution phase uses the execution plan tree and executes each operation in a pipeline until the final result is obtained.

2.2.4 The Neo4j Graph Database

The Neo4j graph database is an open-source native graph database, written in Java and developed by Neo4j Inc. It has been on the market since 2007 and is currently the world’s most used graph database [20].


2.3 Database Benchmarking

A database benchmark (DBB) is a set of tasks sent to databases to evaluate their relative performance in a controlled manner. However, creating a good and reliable DBB is not an easy task; for a DBB to be reliable, some factors need to be considered. Huppler [11] points out five characteristics that make a general benchmark good:

• Relevant – A reader of the result believes the benchmark reflects something important.

• Repeatable – There is confidence that the benchmark can be run a second time with the same result.

• Fair – All systems or software being compared can participate equally.

• Verifiable – There is confidence that the documented result is correct.

• Economical – The test sponsors can afford to run the benchmark.

According to Huppler [11], creating a perfect benchmark with all of these characteristics is almost impossible. Thus, benchmark developers need to abandon some of the characteristics and focus on the others.

There are plenty of DBBs for relational databases. The most common standards for RDBMSs are the TPC-C, TPC-H and TPC-E benchmarks [7], but there are open-source alternatives too. All of these focus on relational operations such as JOINs, projections, selections, aggregations and sorting. However, these types of DBBs do not work for graph databases, since graphs operate under a different principle.

Another type of DBB is the OO7 benchmark by Carey, Dewitt, and Naughton [6]. This benchmark is for object-oriented databases (OODB), which has similarities to graph databases. Data stored in an OODB can follow a graph-like structure, where the entities have relationships among themselves. The OO7 benchmark contains a set of queries that are categorized into two groups: traversal queries and general queries.

Even though OODB benchmarks can create graphs, their structures are not like the graphs in a graph database or in graph analysis. That is because OODB relationships are more like references in a relational database than pointers in a graph database.

Currently, there are only a handful of graph-oriented DBBs. One is the HPC Scalable Graph Analysis Benchmark [3]. This DBB contains four different categories of queries:

• Insert the graph database as a bulk load.

• Retrieve the set of edges with maximum weight.

• Perform a k-hops operation.

• Calculate the centrality of a graph, where the performance is measured in edges traversed per second.

In this project, the comparison is between two different types of databases (relational and graph), and at this moment there is no DBB that operates across paradigms. Thus, further details regarding the implementation of the benchmark for this project, and how it reflects the characteristics that Huppler [11] points out, are presented in Chapter 4.

2.4 Database Modeling

When benchmarking an abstraction of a system, many factors contribute to its performance, and the modeling of the abstraction is one of them. Creating a model that correctly and efficiently represents a system is crucial to the performance and is therefore essential to this project.

Creating a model of a system for a database can be done in different ways. Depending on the underlying data structure of the database, the procedure to develop a model that works efficiently for that specific database may be tedious and difficult. Generally, for both graph databases and relational databases, the starting point when modeling a system is to understand the domain of the system. From the understanding of the domain, entities and properties can be defined by how they interrelate and what rules govern the domain. Most of this tends to be quite informal and often takes the form of paper sketches. However, from here, the modeling characteristics diverge. Figure 2.7 shows an example of a domain model representing a small electrical circuit. There, several components are connected together according to the domain model; for example, several fuses can be connected to one bus bar and one cable can be connected to one fuse.

Figure 2.7: An example of a small electrical circuit domain model.

2.4.1 Relational Modeling

In the case of relational databases, the process of creating a model given a domain can be done in different ways. One way to model a system, given an understanding of the domain, is to first create a diagram of the system. This diagram shows all the identified entities and how they are connected, similar to Figure 2.7. The next step is to create a logical model, called an Entity-Relationship diagram (E-R diagram). This is a more comprehensive graph with more information about the entities. This information can be anything related to the entity, usually the relevant properties that define it.

Having an E-R diagram with all the needed information, the developer can map it into tables and normalize them to reduce redundancy. Normalization is a technique to reduce redundant data in tables by splitting them into intermediate tables and creating references between the intermediate tables.

At this point, the tables are normalized and correspond reasonably well to the domain, and the process of creating a model can be seen as complete. However, one of the problems with normalized models is that they are typically not fast enough for real-world applications; to satisfy real-world requirements, they need to be adapted to the database engine through denormalization.

Denormalization involves finding out what redundant data to introduce in order to reduce the complexity of a query and improve efficiency. An example could be introducing redundant data to reduce the number of JOIN operations needed in a query. However, when a database contains many tables and queries contain many JOIN operations, denormalization can be difficult. Also, given the total lifespan of a model, this may be difficult to perfect, since changes to the model may happen not only during development but also during its use.

2.4.2 Graph Modeling

In the case of native graph databases, the process of creating a graph model given an understanding of the domain is more straightforward. The first step is to create a graph of the domain, similar to the first diagram when creating a relational model, and then include all the information of each entity/node to create an accurate representation of the domain. This means each node captures its appropriate label, properties and relationships to other nodes.

Domain modeling is usually identical to graph modeling; by representing the domain model as accurately as possible, the graph model becomes equally representative. An advantage of modeling in this manner is that the semantic context remains intact. Each node and relationship still has its domain representation, and there is no need for normalization or denormalization to make the model efficient.

At this point, the graph model is done and only needs to be transcribed into the graph database. This manner of modeling a system for a graph database can easily be described as "What you draw, is what you store" by Webber [22], meaning that how you draw the model on a whiteboard or paper is how it will be stored and represented in the graph database.

2.5 Query Execution Time Estimation

When benchmarking execution time, there is a need to understand which steps within the query process take time. However, execution time is not simple to estimate; there are several factors to take into account. It primarily depends on the setup of the environment itself. A poorly designed environment consists of components set up in such a way that they negatively affect the execution time. A well-designed environment will instead minimize the access time, storage time, computation time and communication time.

All factors take into account different operations, and the total time estimation is a combination of them. For example, a distributed environment can have a more considerable communication time than a standard local environment (i.e. everything is run on the same computer). A standard local setup typically has a more sizable access time than a local full-RAM environment (i.e. the whole database is in primary memory after one read). The following sections explain these different time factors and what they take into account.

2.5.1 Access Time

Access time is the time it takes for the CPU to get the sought-after data from secondary storage into primary storage for further processing. Today, this process is quite fast, but it depends on what type of secondary storage device the database uses.

Hard Disk Drives (HDD) and Solid State Drives (SSD) are the standard types of secondary storage devices. An SSD does functionally everything an HDD does, but data are stored on flash memory chips, whereas an HDD stores data on magnetic discs. The main advantage that an SSD has over an HDD is that it is faster. An SSD can perform more I/O operations per second, which makes SSDs more efficient than HDDs and contributes a lot to the decrease in access time. Generally, for databases where access time is essential, using an SSD is recommended.

2.5.2 Storage Time

Storage time is the time it takes for the CPU to access data from primary memory. Since primary memory is the fastest storage medium, RDBMSs can utilize whole systems with large amounts of primary memory to fit the entire database. This allows for high-speed performance, and the CPU only needs to access data from secondary storage once.

2.5.3 Computation Time

Computation time is the time it takes for the query engine to compute a query and produce the result. The majority of the computation time comes from the query processing phase (i.e. parsing, optimization and execution). This process happens within primary memory, which makes it quite fast. However, databases that do not have enough primary memory space are affected by access time drawbacks and may need to optimize their database structure to decrease the computation time and increase overall performance.

Modern relational query engines have had a lot of time to improve the underlying algorithms that perform the query processing. Today, a look-up usually has a time complexity of O(log(n)), where n is the number of rows in the concerned table. However, a modern native graph query engine can have a look-up time complexity of O(1) [18]. This makes JOIN-operation-heavy queries quite fast to compute; see section 2.2.2 for further explanation.

2.5.4 Communication Time

Communication time is the time it takes to send data between environment points. This can be minimal for a local system. However, for a distributed system, this can affect the total time when data need to be transported from a cluster of different databases and used to compute a result.

2.6 Database Storage

When benchmarking a system, there is a need to understand the machine that it is performed on. Depending on the storage medium, the performance of the benchmark may vary. The collection of data that makes up a database needs to be stored physically in this storage medium. The RDBMS can then retrieve, update and process these data when needed. The storage medium forms a storage hierarchy that includes two main categories:


• Cache and main memory are classified as primary storage.

• HDD and SSD, as mentioned before, are classified as secondary storage.

Primary storage provides fast access to data but has limited storage capacity. Although primary storage capacity has been growing rapidly in recent years, it is still more expensive and smaller than secondary storage.

Secondary storage devices have a more extensive storage capacity but provide slower access to data than primary storage devices, and their data cannot be processed directly by the CPU. The CPU first needs to copy data into primary storage in order to access it for processing.

Programs reside and execute in main memory. Generally, large databases reside on secondary storage (usually HDD or SSD), and parts of the database are read into main memory when needed; typically, database applications require only a small part of the database when in use. Whenever specific data that is not in main memory is required, it first needs to be located in secondary storage, copied to primary memory for processing and then written back to secondary storage if it has been modified.

There are good reasons for storing a database on secondary storage. Apart from the cost of storage per unit of data, one reason is that databases today can be as large as hundreds of terabytes, which in most cases is too large to fit in primary memory. Another reason is that the circumstances causing permanent loss of stored data arise less frequently for secondary storage than for primary storage.

Data stored in secondary storage is organized as files of records. Each record is a collection of data values that can be interpreted as facts about entities (which could be seen as rows in the database), their attributes and their relationships. Records are stored in secondary storage in a manner that makes locating them as efficient as possible for the database.


Related Research

This chapter provides insight into research related to this thesis.

3.1 Benchmarking Database Systems for Social Network Applications

In a paper by Angles et al. [2], the authors perform a benchmark on five different databases: two graph databases (Dex and Neo4j), one RDF database (RDF-3X) and two relational databases (Virtuoso and PostgreSQL). The purpose of the paper is to investigate how different database paradigms handle graph workloads, in this particular case social network workloads.

The focus of the paper is on performance and the specific aspects are:

• Data loading time - The time it takes to load the data from a source file.

• Query execution time - The time it takes for the database to exe- cute the query.

• Data indexes - The time it takes to create indexes.

However, the performance aspect of query execution time is the central focus of the paper and is also the most relevant aspect for this project.

The results presented in the paper are based on several queries that the microbenchmark performs, and all of them show that graph databases perform better than relational and RDF databases when the graph increases in size. In the paper, the graph size increases from 1 million to 10 million nodes. The significant differences between relational and graph databases appear when the graph size is large.

The paper concludes that graph databases have shorter query execution times in comparison to relational databases when the graph size increases into the millions. However, the paper was published in 2013 and most of the databases, especially Neo4j, have since been updated to increase their performance, meaning the results may be different now. Nevertheless, the conclusion should still be valid: graph databases perform better than relational databases when it comes to social-network-related queries and interconnected networks. This can be seen in the increased interest in graph databases and the wide use of Neo4j [22].

3.2 The Shortest Path Algorithm Performance Comparison in Graph and Relational Database on a Transportation Network

In a paper by Miler, Medak, and Odobasic [15], the authors conduct a performance comparison of Dijkstra’s Shortest Path Algorithm (DSPA) between a relational database (PostgreSQL) and a graph database (Neo4j).

In this paper, they model a transportation network and execute DSPA several times with different settings. They experiment with different numbers of threads and different configurations of Neo4j to see which database operates the fastest.

The results show that Neo4j performs best with the recommended settings. Additionally, although Neo4j performed better than PostgreSQL, it required 20-75% more memory, depending on the number of threads used in the computation. The authors’ initial assumption was that Neo4j must perform better, since the transportation network is stored in the database.

The paper concludes that Neo4j is 30-35% faster than PostgreSQL. However, there is an increased memory cost with Neo4j. If memory is not an issue, then Neo4j is the choice for DSPA, according to the authors. The paper provides some more in-depth insight into Neo4j and its memory management when performing more advanced computations. Also, although PostgreSQL is not Oracle, the paper is still relevant when comparing relational databases and graph databases.

3.3 Relational Database and Graph Database: A Comparative Analysis

In a paper by Medhi and Baruah [14], the authors perform a benchmark on a relational database (MySQL) and a graph database (Neo4j). The purpose is to compare their performance when handling connected data, where the performance metric is execution time.

The benchmark presented in the paper is based on two predefined queries on a graph schema representing the sport of cricket, where the entities and relationships are as follows:

• Entities: Player, Team, Game

• Relationships: Player to Team, Player to Game

The predefined queries used by the authors are simple retrieve queries that follow the structure of:

• Find attribute/property X from entity Y.

• Find attribute/property X from entity Y that has relationship Z.

The results show that Neo4j outperforms MySQL for all queries, sometimes by a factor of 20. The authors performed the queries on graphs of sizes 100, 300 and 400 nodes. With connected data, Neo4j outperformed MySQL regarding execution time.


3.4 Comparative Analysis of Relational and Graph Databases

In a paper by Batra and Tyagi [4], the authors perform a benchmark similar to the one in the paper by Medhi and Baruah [14], using the same databases, MySQL and Neo4j. As with the previous study, the purpose of the paper is to compare the performance of the two databases on connected data, where the performance metric is execution time.

The benchmark presented in the paper is based on three predefined queries on a graph schema representing relationships between users, movie interests and movie actors. Here the entities and relationships are:

• Entities: User, Movie, Actor

• Relationships: User to User, User to Movie, Movie to Actor

The graph schema used in this paper is a simple undirected graph. A difference in this study, compared to the previous one, is that the graph seems to be more focused on reachability rather than retrieval.

The predefined queries are simple reachability queries that follow the structure of:

• Find all neighbors of X.

• Find all neighbors of X with attribute/property Y.

• Find all neighbors of Y with attribute/property Z and a relationship with X.

The presented results show that Neo4j outperforms MySQL for all queries, sometimes by a factor of 20. The paper performed the three queries on graphs of sizes 100 and 500 nodes. At first glance, the graph sizes seem too small, but according to the authors, the differences in execution time were already sufficiently significant, and larger graph sizes would not give any more insightful results.

The paper and the authors’ conclusions were clear: like in the previous study, Neo4j outperformed MySQL regarding execution time in the context of connected data.


Methodology

This chapter describes what data were used and how the electrical network model, the environment of the experiment and the benchmark were set up and implemented.

4.1 Datasets

Three datasets are used in this project, representing three different electrical networks. The first dataset, the 8K-set, is comprised of 8000 nodes and is used for testing and for benchmarking on a small dataset. The two other datasets, the 220K-set and the 6.3M-set, are more extensive, with 220 000 and 6.3 million nodes respectively, and are used for benchmarking on extensive datasets.

The electrical networks represented by the datasets are logically connected such that they depict real-world networks in Sweden. Each dataset consists of a table comprised of rows that represent electrical components occurring in the network and columns describing the properties of a component. For this work, the relevant component properties are id, type and connection points (c1 and c2, depicting each connector end of a component). Other properties are redundant for this work but are still preserved in order to keep the same amount of data in each database.

Each electrical component can be categorized into one of two main categories: one-connection component (OCC) or two-connection component (TCC). OCC's can be described as components that have only one connection point (i.e. c1 and c2 have the same value) and can therefore not be logically chained. Examples of OCC's are bus bars, bays and delivery points. TCC's instead represent components with two connection points (i.e. c1 and c2 have different values) that can be logically chained. Examples of TCC's are cables, fuses and circuit breakers.

The electrical networks that the three datasets represent can be depicted as graphs where each component is coupled, end to end, with another component, similarly to the domain model in Figure 2.7.

4.2 Modeling

The focus of the model that is stored in both the Oracle RDBMS and the Neo4j graph database is on how the components are connected. For the Oracle relational database, a connection table (conntab) containing every component was created for each of the three datasets described in section 4.1. Since each component is represented as a table row holding its properties, the approach to traversing the network is to iteratively perform look-ups in conntab for the next connections.

Figure 4.1 shows an example of two cables connected via a bus bar in the middle and the circuit’s relationship to the conntab data. Notice that all the components are in the same table and distinguishing between the OCC and TCC’s is done by looking at their connection points. Here, the bus bar has the same value for c1 and c2, while the cables have different values for c1 and c2, indicating that the bus bar is an OCC and the cables are TCC’s.

Figure 4.1: An example representation of the relationship between conntab and a small connected circuit.


For the Neo4j graph database, the graph of the electrical network was created in a more controlled manner. The algorithm used creates the graph step by step as follows:

1. Map connection point to a set of components containing that connection point.

2. For each connection point set:

• If an OCC exists in the set, connect all TCC’s in the set to the OCC.

• Else connect all components in the set to each other in an arbitrary direction (in this project, they are connected in the direction of the higher id).

This algorithm was created during the project and creates a graph that is similar to the domain model of an electrical circuit. Each component is represented as a node and thus has its neighbors directly linked, due to index-free adjacency. Figure 4.2 shows the different types of connection point sets that can occur and how they are connected by the algorithm. Appendix A shows a pseudo-code version of the algorithm, and a compact Java sketch of the same idea follows after Figure 4.2.

Figure 4.2: A representation of the different connection point sets that can occur and how they are connected.
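The following is our own Java rendering of the two-step algorithm, written for this text; it is not the Appendix A pseudo-code verbatim, and the Component type with its fields is an assumption based on the dataset description in section 4.1.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class GraphCreationSketch {

        // Assumed component record: an OCC has c1 == c2, a TCC has c1 != c2.
        static class Component {
            long id;
            long c1, c2;
            boolean isOcc() { return c1 == c2; }
        }

        // Stand-in for creating the relationship a -> b in the graph database.
        static void connect(Component a, Component b) {
            System.out.println(a.id + " -> " + b.id);
        }

        static void buildGraph(List<Component> components) {
            // Step 1: map each connection point to the components containing it.
            Map<Long, List<Component>> byPoint = new HashMap<>();
            for (Component comp : components) {
                byPoint.computeIfAbsent(comp.c1, k -> new ArrayList<>()).add(comp);
                if (comp.c2 != comp.c1) {
                    byPoint.computeIfAbsent(comp.c2, k -> new ArrayList<>()).add(comp);
                }
            }
            // Step 2: connect the components within each connection point set.
            for (List<Component> set : byPoint.values()) {
                Component occ = null;
                for (Component comp : set) {
                    if (comp.isOcc()) { occ = comp; break; }
                }
                if (occ != null) {
                    // An OCC exists: connect all TCC's in the set to the OCC.
                    for (Component comp : set) {
                        if (comp != occ) connect(comp, occ);
                    }
                } else {
                    // No OCC: connect the components to each other, directed
                    // towards the higher id, following the thesis's convention.
                    for (int i = 0; i < set.size(); i++) {
                        for (int j = i + 1; j < set.size(); j++) {
                            Component a = set.get(i), b = set.get(j);
                            if (a.id < b.id) connect(a, b); else connect(b, a);
                        }
                    }
                }
            }
        }
    }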


4.3 Benchmark Framework

The purpose of the benchmark is to measure the execution time of different search queries. This is done by sending queries to each database and then waiting for the results. The measured execution time is presented in three different statistical forms: average time, standard deviation and average throughput. Figure 4.3 shows a flowchart of a benchmark test given a search query.

Figure 4.3: A flowchart representation of a benchmark test given a query.

The framework of the benchmark is built in Java version 8 and uses Oracle Java Database Connectivity version 12.2.0.1 and Neo4j Java Driver version 1.5.1 to connect to the corresponding databases. Using these connection APIs, the benchmark can send queries and receive results that are then verified. The tests that the benchmark performs are comprised of several runs for each query. The gathered results from each test are then stripped of the cold runs (i.e. query executions with an empty database cache) so that only data from the hot runs (i.e. query executions with a non-empty cache) remain for the statistical calculations. This process is performed for both Oracle and Neo4j, and the result accuracy is verified by checking that the resulting sets of components from both databases have identical properties.

The measurement of execution time performed by the benchmark is done using the machine’s internal clock. The time is collected before the query is sent and right after the result is received. Using these two time values, an elapsed time is calculated for each run of the query, and at the end the average time is calculated over all the elapsed time values. The throughput and standard deviation are conventionally calculated using statistical formulas [21] and the collected data from the execution time measurements.
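A minimal sketch of this measurement loop is shown below. The runQuery callback, the number of runs and the cold/hot split are assumptions chosen for illustration; the thesis does not state the exact counts here.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.function.Supplier;

    public class TimingSketch {

        static List<Double> measure(Supplier<Object> runQuery, int runs, int coldRuns) {
            List<Double> elapsedMs = new ArrayList<>();
            for (int i = 0; i < runs; i++) {
                long start = System.nanoTime();  // clock before the query is sent
                runQuery.get();                  // send query, wait for the result
                long stop = System.nanoTime();   // clock right after the result
                if (i >= coldRuns) {             // strip cold runs (empty cache)
                    elapsedMs.add((stop - start) / 1_000_000.0);
                }
            }
            return elapsedMs;
        }

        public static void main(String[] args) {
            List<Double> times = measure(() -> "dummy result", 25, 5);
            double mean = times.stream().mapToDouble(Double::doubleValue).average().orElse(0);
            double variance = times.stream()
                    .mapToDouble(t -> (t - mean) * (t - mean)).average().orElse(0);
            double stdDev = Math.sqrt(variance);
            double throughput = 1000.0 / mean;   // queries per second
            System.out.printf("avg=%.2f ms, stddev=%.2f ms, %.2f queries/s%n",
                    mean, stdDev, throughput);
        }
    }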

As explained in section 2.3, five characteristics make a general benchmark good according to Huppler [11]. These characteristics provide insight into making a good benchmark in the aspects of relevance, repeatability, fairness, verifiability and economy. However, as mentioned before, making a perfect benchmark that takes all of these aspects into account is almost impossible. For that reason, certain non-applicable aspects must be overlooked in order to put more focus on the other aspects. For this work, the economic aspect can be considered fulfilled, or irrelevant, because there are no economic factors involved.

This project is of a scientific nature: the databases being tested need to be on equal ground, and the results need to be repeatable, verifiable and fair to have scientific value. Hence, the most important factors to consider are fairness, repeatability and verifiability. The aspect of relevance comes from the measurement unit of the benchmark, which is also the focus of this project; relevance also follows naturally from the fairness and verifiability aspects.

For fairness, both databases are run on similar machines and have enough memory to load each dataset into main memory, although the memory needed may grow beyond that during execution.

4.3.1 Query

The query used in the benchmark is based on the Breadth-First Search (BFS) algorithm which, given a component and a stop-property, traverses the network. The traversal stops when the stop-property is fulfilled or when the entire connected network has been traversed. In this project the query is part of the benchmark suite and is used in two ways: as a complete search query (traversal without a stop-property) and as a stop-property search query.

The query was implemented as an SQL stored procedure for the Oracle RDBMS and as a Cypher stored procedure for the Neo4j graph database. For Oracle, the stored procedure was implemented by Digpro, since no pre-existing procedures can perform these kinds of executions. In contrast, the Neo4j graph database has a Cypher library called APOC that contains pre-defined procedures for graph-related algorithms, including BFS-based ones.

The logic behind the Oracle stored procedure is an iterative process that finds all components connected to a starting point. Step by step, the process first finds the components connected to a specified starting point, then checks whether any of them fulfills the stop-property. For each component that did not fulfill the stop-property, the query finds its connected components and repeats the process until no new components are found. All components that the stored procedure encounters, including the starting point, are collected in a list, which is returned at the end of the procedure.
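Digpro's procedure itself is not public, but the iterative expansion it performs can be sketched client-side in Java over JDBC as below. The CONNECTIONS(component_id, neighbor_id) table and the in-memory stop-property check are assumptions made for illustration, not the actual schema or stored procedure.

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.HashSet;
    import java.util.Set;

    // BFS over a relational adjacency table: one look-up per expanded
    // component, mirroring the step-by-step process described above.
    public class RelationalTrace {
        public static Set<Long> trace(Connection conn, long start,
                                      Set<Long> stopComponents) throws SQLException {
            Set<Long> visited = new HashSet<>();
            Deque<Long> frontier = new ArrayDeque<>();
            visited.add(start);
            frontier.add(start);
            String sql = "SELECT neighbor_id FROM connections WHERE component_id = ?";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                while (!frontier.isEmpty()) {
                    long current = frontier.poll();
                    if (stopComponents.contains(current)) {
                        continue; // stop-property fulfilled: keep, do not expand
                    }
                    ps.setLong(1, current); // one table look-up per hop
                    try (ResultSet rs = ps.executeQuery()) {
                        while (rs.next()) {
                            long neighbor = rs.getLong(1);
                            if (visited.add(neighbor)) {
                                frontier.add(neighbor);
                            }
                        }
                    }
                }
            }
            return visited; // all encountered components, incl. the start
        }
    }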

For the Neo4j graph database, the procedure used was the subgraphNodes procedure provided by the APOC library. The subgraphNodes procedure performs BFS-based operations given some parameters, such as a starting node and a stop-property, and returns the set of unique nodes that corresponds to the operation. One inconvenience with the subgraphNodes procedure's BFS algorithm is that it does not include the nodes that fulfill the stop-property (the end nodes). A possible solution is to perform subgraphNodes twice: once for a normal BFS and once to find the outer boundary nodes, i.e. the end nodes. Appendices B and C show the two versions of the query.
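As a rough illustration of the Neo4j side (the actual queries are those in Appendices B and C), a subgraphNodes call through the Neo4j Java Driver could look as follows. The Component label, CONNECTED_TO relationship type and Stop label are assumed names; the '-Stop' label filter blacklists the stop nodes, which matches the observation above that the end nodes are not included and must be fetched separately.

    import org.neo4j.driver.v1.Driver;
    import org.neo4j.driver.v1.Session;
    import org.neo4j.driver.v1.StatementResult;

    import static org.neo4j.driver.v1.Values.parameters;

    // Counts the nodes reached by a BFS from a start component,
    // excluding nodes carrying the (assumed) stop label.
    public class GraphTrace {
        public static int trace(Driver driver, long startId) {
            try (Session session = driver.session()) {
                StatementResult result = session.run(
                    "MATCH (start:Component {id: $id}) " +
                    "CALL apoc.path.subgraphNodes(start, { " +
                    "  relationshipFilter: 'CONNECTED_TO', " +
                    "  labelFilter: '-Stop' " +
                    "}) YIELD node " +
                    "RETURN count(node) AS n",
                    parameters("id", startId));
                return result.single().get("n").asInt();
            }
        }
    }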


Results

This chapter presents the results in charts with accompanying explanations. The full benchmark test results can be seen in Appendices D and E.

5.1 Execution Time

The benchmark performed two tests on each of three datasets: a complete search test that traversed the network without any stop-property, and a stop-property search test. A complete search on a dataset finds the whole connected graph containing a given node; since each dataset contains a number of separate graphs, the results are relative to the size of both the dataset and the traversed graph. Figure 5.1 presents the results of the complete search query and Figure 5.2 the results of the stop-property search query.


                8K-set   220K-set   6.3M-set
    Oracle (ms)    276        410      3,518
    Neo4j (ms)      39         41        193

Figure 5.1: Execution time result for the complete search query.

The result from the complete search query shows that Neo4j's execution time is roughly 7 to 18 times shorter than Oracle's (276 vs. 39 ms, 410 vs. 41 ms and 3,518 vs. 193 ms). As mentioned before, the results are given relative to the graph size. The graph sizes used for the datasets are 7745, 8268 and 35868 nodes respectively. This means that it takes Neo4j 39 ms to find a graph of 7745 nodes in a dataset of approximately 8000 nodes, corresponding to a throughput of 7745/39 ≈ 199 nodes/ms (cf. Figure 5.3).

                8K-set   220K-set   6.3M-set
    Oracle (ms)    123        253      1,193
    Neo4j (ms)      33         55        102

Figure 5.2: Execution time result for the stop-label search query.

The result from the stop-property search query shows that Neo4j's execution time is roughly 4 to 12 times shorter than Oracle's. Here the graph sizes are 3358, 5008 and 48061 nodes respectively.


5.2 Throughput

Another way of looking at execution time is to consider the throughput of nodes (i.e. nodes per millisecond) delivered by each database.

                       8K-set   220K-set   6.3M-set
    Oracle (nodes/ms)   28.04      20.15      10.19
    Neo4j (nodes/ms)   198.59     201.66     186.03

Figure 5.3: Throughput result for the complete search query.

                       8K-set   220K-set   6.3M-set
    Oracle (nodes/ms)   27.31      19.81       7.27
    Neo4j (nodes/ms)   102.79      91.45      85.46

Figure 5.4: Throughput result for the stop-label search query.

Figures 5.3 and 5.4 show that the throughput of Neo4j is relatively constant, deviating by about 20 nodes/ms, whereas for Oracle it becomes apparent that the larger the graph, the lower the throughput. There is also a clear difference in magnitude: Neo4j's throughput is roughly 4 to 18 times higher than Oracle's.


5.3 Standard Deviation

Looking further into the results and the differences in execution time between iterations, the standard deviation of each test on each dataset is presented in Figures 5.5 and 5.6.

                8K-set   220K-set   6.3M-set
    Oracle (ms)      4         18        722
    Neo4j (ms)       6          5         39

Figure 5.5: Standard deviation of the complete search results.

                8K-set   220K-set   6.3M-set
    Oracle (ms)      4         12        330
    Neo4j (ms)       4         11         28

Figure 5.6: Standard deviation of the stop-label search results.

The results show that for the smaller datasets there is not much of a difference in standard deviation between Oracle and Neo4j. However, for a larger dataset such as the 6.3M-set, the difference becomes significant.


Discussion

This chapter discusses the results presented in the previous chapter and reflects on whether the objective of this project has been fulfilled.

6.1 Benchmark Comparison

Looking at Figures 5.1 and 5.2, we can deduce that Neo4j's execution time is shorter than Oracle's in all tests. This supports the findings of previous research and the claim that index-free adjacency makes traversal of an interconnected network faster than repeated table look-ups.

When looking at the standard deviation of each test, shown in Figures 5.5 and 5.6, we can see that the variation in execution time grows with the graph size. In the case of the 6.3M-set, the reason for the markedly larger deviation on the Oracle side could be the size of the graph or the SQL procedure itself. The procedure in the Oracle database was developed by Digpro and has more functionality than its Neo4j counterpart. This could increase the computation time when large amounts of data are involved and thereby increase the deviation.

Figures 5.3 and 5.4 show the throughput results of the three datasets for the complete search query and the stop-label search query respectively. The throughput results for Neo4j are approximately equal across the datasets, while Oracle's throughput decreases as the graph size grows.
