
Response time analysis on indexing in relational databases and their impact

Sam Pettersson

&

Martin Borg

18 June 2020

Dept. of Computer Science & Engineering
Blekinge Institute of Technology

SE–371 79 Karlskrona, Sweden


This thesis is equivalent to 10 weeks of full-time studies.

Contact Information:

Authors:

Martin Borg
E-mail: mabn17@student.bth.se

Sam Pettersson
E-mail: sapt17@student.bth.se

University advisor:

Associate Professor Mikael Svahnberg
Dept. of Computer Science & Engineering

Faculty of Computing
Blekinge Institute of Technology
SE–371 79 Karlskrona, Sweden

Internet : www.bth.se
Phone : +46 455 38 50 00
Fax : +46 455 38 50 57


Abstract

This is a bachelor thesis concerning the response time and CPU effects of indexing in relational databases. Two popular databases, PostgreSQL and MariaDB, are analyzed with a real-world database structure using randomized entries. The experiment was conducted with Docker and its command-line interface, without cached values, to ensure fair outcomes. The procedure was carried out with seven different create, read, update and delete queries over multiple volumes of database entries, to discover the databases' strengths and weaknesses when utilizing indexes.

The results indicate that indexing has an overall enhancing effect on almost all of the queries. It is found that join, update and delete operations benefit the most from non-clustered indexing. PostgreSQL gains the most from indexes, while MariaDB shows less of an improvement in response time. A greater usage of CPU resources does not seem to correlate with quicker response times.

Keywords: Database index, MariaDB, PostgreSQL, response time

Abbreviations and Definitions

Columns exist inside tables and give information about which type an object is.

Counterpart refers to the other version of a database in the experiment (the one with, or without, non-clustered indexes).

CPU (Central Processing Unit) is a computer component that processes instructions. It runs the operating system, applications and much more, constantly receiving input from the user or software programs.

CRUD is a database term that stands for create, read, update and delete.

Database management system (DBMS) is a group of programs that gives a user access to database entries and the power to change, add or remove information stored inside them [3].

DDL (Data definition language) is the set of SQL commands used to create and modify the structure of database objects.

DML (Data manipulation language) contains the SQL commands that handle the manipulation of database data.

Docker is a container engine which runs containers on top of the operating system and utilizes features of the Linux kernel [49]. In return, this gives the host the ability to run multiple different systems on one machine.

Foreign keys (FK) are columns in a database table that provide links between data in two tables. FKs act as cross-references between the tables and are used to reference the primary key of another table.


Hierarchical is a data model that organizes data in a tree-like struc- ture.

Indexes are data structures that improve the speed of most retrieval operations on database tables.

ISO/IEC refers to data management standards for local and distributed system environments, promoting the harmonization of data management facilities across different areas [55].

Primary key (PK) is a database key that is unique for every record. It acts as an identifying entry, such as a vehicle identification number or a telephone number.

Query is executable code that accesses or modifies data.

Record (Database record) is a set of data stored inside a table.

Response time (latent period, latency) is a measurement of how quickly a system responds to input.

Row is present in a table and represents an object of information, also known as a record.

Storage engine (database engine) is the root software component that a DBMS uses to alter information in a database. These are usually unique to the database itself and allow the user to interact with the engine without going through the DBMS interface [22].

Structured Query Language (SQL) is the standard language for communicating with relational DBMS (RDBMS). The data has to be predefined inside tables and columns. It is particularly useful when working with systematic data (relations between entities and variables) [3, 1].

Table is a structure that contains rows and columns and can hold different types of data.


Abstract
Abbreviations and Definitions

1 Introduction
  1.1 Scope and Limitations
  1.2 Research Questions

2 Literature Study
  2.1 Literature Study Methodology
    2.1.1 Search results
    2.1.2 Selection of databases

3 Theory
  3.1 Relational Databases
    3.1.1 Structured query language
    3.1.2 Database schema
    3.1.3 Data definition and manipulation language
    3.1.4 Executing queries
    3.1.5 Join operation
    3.1.6 Compound statements
  3.2 Keys
    3.2.1 Primary keys
    3.2.2 Foreign keys
  3.3 Indexes
    3.3.1 Index types
    3.3.2 Index structures
    3.3.3 Indexes impact on CPUs
  3.4 Docker
    3.4.1 Dockerfiles & images
    3.4.2 Docker containers & compose files

4 Method
  4.1 Empirical
    4.1.1 Hardware and software setup
    4.1.2 Preparation
    4.1.3 Experiment
  4.2 Statistical Analysis
  4.3 Execution
    4.3.1 Digressions

5 Results
  5.1 The Effects of Indexes
    5.1.1 Query #1 - Select
    5.1.2 Query #2 - Inner join
    5.1.3 Query #3 - Outer join
    5.1.4 Query #4 - Inner & outer join
    5.1.5 Query #5 - Insert
    5.1.6 Query #6 - Update
    5.1.7 Query #7 - Delete
  5.2 The Results as a Whole

6 Analysis and Discussion
  6.1 Overview
  6.2 Research Questions
    6.2.1 RQ1 - How much do indexes on foreign keys impact the databases' speed when retrieving data, compared to using no non-clustered indexes during a set of select and join queries?
    6.2.2 RQ2 - In what way does indexing affect the databases' response time when inserting, deleting and updating data?
    6.2.3 RQ3 - To what degree does indexing impact the CPU performance when analyzing the data from research questions one and two?
  6.3 Threats to Validity
    6.3.1 Internal validity
    6.3.2 Conclusion validity
    6.3.3 External validity
    6.3.4 Construct validity

7 Conclusions and Future Work

References

  B.2 init_maria_no_index.sql

B Python Code
  A.1 main.py
  B.2 generate/__init__.py
  C.3 generate/generateRandomUser.py
  D.4 generate/generateRandomPost.py
  E.5 generate/generateRandomThread.py
  F.6 generate/generateCategories.py
  G.7 generate/utils.py
  H.8 generate/config/values.json

C Dockerfiles
  A.1 docker-compose.yml
  B.2 Dockerfile_maria
  C.3 Dockerfile_postgre


Chapter 1

Introduction

Databases are stores of organized information that are in most cases located electronically on a computer or server. The database is controlled by a database management system (DBMS) which can add, alter, remove and display information [20]. The first DBMS arrived in the 1950s, when computer software started to emerge and made computers more than just giant calculators with minimal storage capacity. It was when businesses started using more software that storage became important. Charles W. Bachman designed the first DBMS in 1960 [59]. Bachman would later form the Database Task Group, which handled the standardization of the programming language Common Business Oriented Language (COBOL) and presented its work in 1971. This became known as the CODASYL approach [51, 52].

Today, in 2020, things are different. There is a plethora of databases to choose from, varying from SQL (relational databases, RDBMS) and NoSQL to object databases, among others [3]. This paper will study relational databases because of their popularity, and because the feature this thesis analyzes operates differently across architectures [48]. Relational databases are the most common; they are typically modelled with rows and columns in an array of tables, making the processing and querying of data more efficient [20]. This structure simplifies the modification of data in a controlled and well-organized way. Databases of today have substantially more speed and are ”much easier to use” [10].

The performance of databases has improved over the years, but so has the amount of data they store. Statistics from 2012 show that Facebook alone generates around 100 petabytes of new data every day, and accessing all of that takes time [58, 66]. This data turns into new tables and rows that make it more challenging for the database to find what it is looking for. Customers will not wait several minutes for everything to load so that they can read through their feed [50]. The search has to go through every row and column inside the tables to access the data being asked for [30]. In an attempt to satisfy consumers, companies need to optimize their databases in every way possible. There are multiple ways to do this [46].

This thesis will touch on one of the topics that have an impact on performance, namely indexes. The reason behind studying indexes specifically is


to find out the improvement that could be applied to RDBMS overall during different create, read, update and delete (CRUD) operations, without having to restructure the whole database design [6]. This could give valuable information when picking between relational databases, such as whether they have performance differences in particular operations, or whether one database benefits more from indexes than the other.

1.1 Scope and Limitations

This study’s scope is limited to answering the specific research questions, where the focus lies on the index feature of RDBMS, only measuring and analyzing the CPU usage and latency outcomes that non-clustered indexes cause when applied to foreign keys.

This thesis will not compare non-clustered and clustered indexes against each other. The goal is to analyze the benefits that relational database operations (select, insert, update, delete) can draw from using non-clustered indexes on foreign keys. Primary keys will always have a clustered index applied, no matter the experiment.

Two databases, MariaDB and PostgreSQL, were chosen to represent the relational databases. This was due to the 10-week time limit the researchers had to finish this thesis.

A more thorough description of the key concepts is given in Chapter 3, Theory.

1.2 Research Questions

The following research questions will be answered in this thesis:

RQ1 - How much do non-clustered indexes on foreign keys impact the databases' latency when retrieving data, compared to using no non-clustered indexes, during:

• A select;

• The use of inner joins;

• The use of outer joins;

• Both inner and outer joins?

Indexing does improve the retrieval of data when used correctly [26], but by how much? The information gathered here could show how much of an improvement indexes make during different tasks.


This information will also provide data indicating whether one would need to change from their current database to see improvements, or whether tweaks to the indexing will be enough to fulfill ongoing performance needs.

One could argue that using multiple select statements instead of a join is easier to test, but there are only a few cases where this can be applied in a real application. A join can save network resources and is more affected by indexes, which makes it more essential to the thesis [5].

RQ2 - In what way does indexing affect the databases' response time when inserting, deleting and updating data?

The goal of utilizing databases is not only to retrieve data. The objective of this research question is to understand and analyze the possible trade-offs of indexing when performing the different scenarios.

RQ3 - To what degree does indexing impact the CPU performance when analyzing the data from research questions one and two?

By tracing the CPU usage and how it correlates with the response time, one can see what benefits or trade-offs indexes can have on the hardware itself.


Chapter 2

Literature Study

This chapter will explain how the references were gathered and how they shaped the experiment in chapter 4.

2.1 Literature Study Methodology

A theoretical method is used to search for the relevant academic papers, articles and blogs. These searches include combinations of the keywords ”PostgreSQL”, ”MariaDB”, ”performance”, ”indexing”, ”clustered”, ”non-clustered”, ”relational” and ”database”, which give a more tailored result when gathering related information in Google Scholar and in Blekinge Institute of Technology's library database for academic articles, called Summon. The documentation of SQL databases is used extensively for finding information that explains features, concepts and differences.

These searches do not focus on a strict timeline, because a considerable amount of seminal information dates back more than twenty years. This is relevant to enhance the value of the thesis; a source can date farther back than 30 years if it gives insight into the field of study. This includes history, theory and data to enhance the thesis background, experiment and overall research.

2.1.1 Search results

Searches using keyword combinations such as ”PostgreSQL MariaDB Index Performance” gave no results. There were, however, 66 results for ”Database Index Performance” and 40 results for ”Relational database index”, none of which were relevant to the thesis's purpose. The combined results were filtered to only match academic papers and books. Testing the other keyword combinations gave between 0-2 results, which differed from this study.

Refining the search results

There were no academic papers or books that analyze the same type of study this thesis is looking into. Most of the search results are comparisons between


various database types, but none that explores the indexing feature. The closest matches are papers that analyze the query optimizer or performance differences (such as RAM, CPU and disk storage) in SQL or NoSQL databases.

Six papers are used together with a snowballing method (i.e. following the references they contain) to find more references closer to what this study requires.

2.1.2 Selection of databases

The results gave an inclination of which databases to experiment on. Picking two different databases, MariaDB and PostgreSQL, gives a better understanding of how RDBMS react to indexing as a whole. This was based on the fact that both:

1. Have different code-bases [45];

2. Are contemporary databases with an active development [65];

3. Consist of active open-source projects in which anyone can adjust their code [65].

Most of the books and research papers referenced [10, 7, 11] have analyzed older versions of PostgreSQL. New and updated algorithms have since been added, which causes different results [19]; the same goes for MariaDB. MySQL is a far more explored and popular database [65], so using MariaDB (compatible with MySQL) makes the paper more unique.

(14)

Chapter 3

Theory

This chapter will introduce and explain the key concepts the reader needs to understand when reading this thesis.

3.1 Relational Databases

Almost all software needs some kind of database, a record-keeping system that stores data. The database is much more than a storage facility for information; it also has a database management system used for writing, reading and altering the stored information [20]. The data inside is kept in tables consisting of rows and columns. These tables can share information with each other through relational connections [27]. For example, say that one table holds information about different employees. This table contains different columns, one of which is the employee's identification number. The same number is then stored inside a column in a different table, so that one can find more information about a specific employee in a simple way.

3.1.1 Structured query language

SQL is the language most DBMS use to communicate with the database [9]. It was previously used for research and was defined as an ISO/IEC standard, but turned to manipulating databases in the early 70s [54]. A statement communicating with the database is often called a query.

3.1.2 Database schema

The schema is defined as the structure of a database’s tables, relations and columns. The database schema can then be used to create the blueprint of the database with the use of logical formulas [21].



3.1.3 Data definition and manipulation language

Data definition language (DDL) and data manipulation language (DML) are two standard languages and, much like computer programming languages, have their own commands. The main use of DDL is to define data structures like schemas. Its statements create and modify the database to build new data objects such as tables and indexes [35]. DML, however, is specifically used to manipulate the data inside the database.
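The distinction can be sketched with a few statements. The snippet below uses SQLite through Python's sqlite3 module purely as an illustration (the thesis itself targets MariaDB and PostgreSQL, and the table and column names are invented for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# DDL: defines structures -- the schema, its tables and indexes.
cur.execute("CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE INDEX idx_employee_name ON employee (name)")

# DML: manipulates the data stored inside those structures.
cur.execute("INSERT INTO employee (name) VALUES (?)", ("Maria",))
cur.execute("UPDATE employee SET name = ? WHERE id = ?", ("Maria B", 1))
conn.commit()

print(cur.execute("SELECT name FROM employee").fetchone()[0])  # Maria B
```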

3.1.4 Executing queries

A database query passes through four different steps: parsing, optimization, code generation and execution [4].

• The parsing stage takes place when issuing an SQL statement. The DBMS makes a call to the database to prepare the statement. During this phase, the parsing goes through three different checks [31]: the syntax check makes sure that the query structure is correct; the semantic check verifies that the references used, such as a column or a table, exist inside the database, throwing an error if one does not; the shared pool check compares the textual and executable forms of the query against cached statements, to determine whether some of the statement's resource-intensive steps can be skipped. All of these are implemented to find errors before the statement is executed [31].

• The optimization stage works out the best access plan; DDL only goes through this step if it contains a DML component, such as a subquery, that requires optimization [31]. The optimizer looks for the fastest way to obtain the information and uses indexes if they are determined to be the most rapid path [37].

• The code generation stage, better known as the row source generator, is software that receives the execution plan from the optimizer and restructures it into an iterative format usable by the database [4]. The iterative plan is executed by the SQL engine, which in turn produces the result set [31]. This is a step-by-step plan where each step returns a row set, which the next step either uses or returns.

• SQL execution is the final stage, where the database engine executes the row source; it is the mandatory part of DML [4].
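Parts of these stages can be observed from client code. A small sketch with SQLite via Python's sqlite3 (an illustration only, not one of the databases under test): a semantic error is caught before any execution, and EXPLAIN QUERY PLAN exposes the access path chosen by the optimizer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT)")

# Parsing: syntax and semantic checks reject the statement before execution.
try:
    conn.execute("SELECT * FROM no_such_table")
except sqlite3.OperationalError as err:
    print("rejected during parsing:", err)

# Optimizing: the planner picks the access path; here the primary key
# allows a search instead of a full table scan.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM employee WHERE id = 1"
).fetchall()
print(plan[0][-1])
```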


3.1.5 Join operation

The join operation is one of the core statements in SQL. The purpose of a join is to combine tables with relations, extracting information from both into a single return of the involved tables.

This thesis will only use two types of joins, INNER and OUTER, both of which combine rows from two or more tables into a single result. Through the join condition, one can specify how columns from the tables match each other. In most instances, the objective is to find equal values between tables and return those. This approach is most common when matching the foreign key of one table against the primary key of another [62].

The difference lies in how they handle rows without a match. Inner joins only include rows where the match condition succeeds; outer joins also keep the unmatched rows.

Figure 3.1: Overview of a few join combinations.

Outer joins can be modified with keywords such as LEFT or RIGHT to change how the results are combined. RIGHT will return all of the records from table B and the matched records from table A, while LEFT will return all of the records from table A and its matched records from table B [61].
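The behaviour on unmatched rows can be demonstrated in a few lines, here with SQLite through Python's sqlite3 as a stand-in (table names echo the thesis schema; the data is invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE user (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE post (id INTEGER PRIMARY KEY, user_id INTEGER, body TEXT);
    INSERT INTO user VALUES (1, 'Sam'), (2, 'Martin');
    INSERT INTO post VALUES (10, 1, 'hello');  -- only user 1 has a post
""")

inner = conn.execute(
    "SELECT u.name, p.body FROM user u "
    "INNER JOIN post p ON u.id = p.user_id"
).fetchall()
left = conn.execute(
    "SELECT u.name, p.body FROM user u "
    "LEFT OUTER JOIN post p ON u.id = p.user_id"
).fetchall()

print(inner)  # [('Sam', 'hello')] -- the unmatched user is dropped
print(left)   # [('Sam', 'hello'), ('Martin', None)] -- unmatched rows kept
```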

3.1.6 Compound statements

Compound statements are used to perform a series of database actions. The syntax of a compound statement starts with BEGIN and ends with END [33]. When the statement reaches END, the changes are committed unless an error occurs or the RDBMS has been instructed otherwise.
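SQLite, used here only as an illustration, has no BEGIN ... END blocks, but the commit-unless-error behaviour described above can be sketched with a transaction in Python's sqlite3 (table and values are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO account VALUES (1, 100), (2, 0)")

# The 'with' block opens a transaction: all statements commit together
# when the block ends, or roll back together if an error occurs.
with conn:
    conn.execute("UPDATE account SET balance = balance - 50 WHERE id = 1")
    conn.execute("UPDATE account SET balance = balance + 50 WHERE id = 2")

print(conn.execute("SELECT balance FROM account ORDER BY id").fetchall())
# [(50,), (50,)]
```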


3.2 Keys

The main functionality of database keys is to establish relations between tables. They also serve the purpose of identifying records, or a specific row of data. Keys can be defined with single or multiple attributes [10].

3.2.1 Primary keys

A primary key (PK) is the identifying key of a table, as it uniquely identifies any row of data in the table [7].

3.2.2 Foreign keys

Foreign keys are not used for performance reasons, but work as an integrity tool that ensures a relationship has been established between two tables [38]. Caution is required when entering data into a foreign key (FK), as invalid data can break the relationship between the tables [16].

FKs do impact the performance of the tables themselves, as they too have to be scrutinized after inserting, updating or removing information [64]. This should hardly be noticeable on medium-sized databases (around 9 million records) or below, unless keys are overused where they are not needed.
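The integrity role of an FK can be sketched with SQLite via Python's sqlite3 (an illustration only; note that SQLite, unlike MariaDB and PostgreSQL, requires FK enforcement to be switched on, and the tables here are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # FK enforcement is opt-in in SQLite
conn.executescript("""
    CREATE TABLE user (id INTEGER PRIMARY KEY);
    CREATE TABLE post (
        id      INTEGER PRIMARY KEY,
        user_id INTEGER REFERENCES user (id)
    );
    INSERT INTO user VALUES (1);
""")

conn.execute("INSERT INTO post VALUES (10, 1)")       # valid reference: accepted
try:
    conn.execute("INSERT INTO post VALUES (11, 99)")  # no user 99: rejected
except sqlite3.IntegrityError as err:
    print("integrity check failed:", err)
```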

3.3 Indexes

An index is a data structure which makes data retrieval operations more efficient. Indexes work as data entries pointing to the values, with search keys for a column [32].

Indexes are great in many ways, but on smaller tables they might be more of an obstacle. Indexing is not free; it takes up storage and memory which could be better utilized [11, 63]. Indexes should be used on larger tables, which benefit more from the shortcut they provide and make the cost a valuable trade-off.

3.3.1 Index types

There are multiple index types in a database, and a column can, in some cases, have more than one applied at the same time [24]. This thesis, however, will only use the clustered index (automatically applied to primary keys [23]) and the non-clustered index, which will be enforced on the foreign keys.

Clustered indexes

A clustered index performs the task of sorting the data and then storing it in the corresponding order [30]. There can only be one clustered index per table, reason


being that the data can only be stored in one way. For example, say a table has two columns (Id and Age) and a clustered index is applied to Id. The Id is set to be an auto-incremented value starting at 1; if a new entry is inserted it will get the Id of 2, and all of the rows are then sorted on the Id in ascending order.

Non-clustered indexes

In contrast to clustered indexes, a non-clustered index is sorted on its columns without changing the physical order of the rows. If the type is not specified, an index defaults to non-clustered [30].
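The effect of adding a non-clustered index to a foreign-key column, the core manipulation of this thesis, can be sketched with SQLite via Python's sqlite3 (an illustration only; the index name is invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE user (id INTEGER PRIMARY KEY);
    CREATE TABLE post (id INTEGER PRIMARY KEY, user_id INTEGER);
""")

def plan(sql):
    """Return the optimizer's chosen access path as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " ".join(r[-1] for r in rows)

query = "SELECT * FROM post WHERE user_id = 1"
before = plan(query)   # full scan: every row is checked
conn.execute("CREATE INDEX idx_post_user_id ON post (user_id)")
after = plan(query)    # index search on the foreign-key column

print(before)
print(after)
```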

3.3.2 Index structures

The index structure is restricted by the storage engine one chooses. InnoDB (MariaDB's default storage engine) and PostgreSQL's own database engine both have the balanced tree (a.k.a. BTREE) index as their default type [34].

Balanced tree indexing

A balanced tree handles data as a hierarchically sorted tree in which the leaf nodes consist of keys in ascending order [11]. These nodes are connected to child nodes (or non-leaf nodes). The child-node keys on the left side are less than the current key, and those on the right side are larger. A node that has n keys can have a maximum of n + 1 child nodes [53].

Figure 3.2: BTREE layers and lookup [53].


Figure 3.2 describes how a search process could look for key 8. As eight is greater than both of the root keys, the search follows the right side. The first layer of child nodes happens to contain a key equivalent to the one searched for, which forces the left pointer to be picked. The same operation goes on until the search reaches the leaf node [53]. The upside of the BTREE method is that it reduces the number of reads made on disk, thus making the lookup operation faster [60]. The efficiency is, however, dependent on the number of layers the tree has [15].

A downside of BTREE is how it affects some of the other operations within CRUD. When inserting a value, the value has to find a suitable position inside the existing structure [60]. This means that the tree may have to restructure most of a layer, which takes time. A similar process is needed during the removal of a key. The time of this process can nonetheless vary, as the affected key is random; there is a chance that a root key itself gets removed, which causes the non-leaf nodes to be restructured [60].
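The descent described above can be illustrated with a toy, fixed tree in Python (a sketch only: real BTREE nodes hold many keys per page and rebalance themselves; the key values here are invented, not those of Figure 3.2):

```python
from bisect import bisect_left

# Each node keeps its keys sorted; child i holds keys smaller than
# keys[i], and the last child holds keys larger than every key in the node.
tree = {
    "keys": [7, 12],
    "children": [
        {"keys": [1, 4], "children": []},
        {"keys": [8, 10], "children": []},
        {"keys": [14, 20], "children": []},
    ],
}

def search(node, key, depth=0):
    i = bisect_left(node["keys"], key)
    if i < len(node["keys"]) and node["keys"][i] == key:
        return depth          # key found after descending `depth` layers
    if not node["children"]:
        return None           # leaf reached without a match
    return search(node["children"][i], key, depth + 1)

print(search(tree, 8))   # 1: found one layer below the root
print(search(tree, 9))   # None: not stored anywhere in the tree
```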

It is important to understand that BTREE indexes are limited by the queries themselves. Their primary use is for column comparisons with the >, >=, =, <=, < and BETWEEN operators. LIKE comparisons can also take advantage of the index type, but only if the pattern begins with a constant [34]. An example of this could be the queries SELECT * FROM Employees WHERE FirstName LIKE ’Maria%’; and SELECT * FROM Employees WHERE FirstName LIKE ’%aria’;. The first pattern will always start with ’Maria’, which makes it valid; the second could start with any value, which makes it incompatible with the search process, and a full table scan will have to be performed.

3.3.3 Indexes impact on CPUs

CPU stands for central processing unit, a component often compared to the brain in humans [57]. Its main purpose is to process the input instructions it gathers from hardware and software.

CPU usage varies between databases, but the CPU is the common denominator of database performance. High CPU usage and a decline in performance can be caused by poorly written queries, indexing on write operations, and memory or hardware issues [47].

3.4 Docker

Docker is software that simplifies the process of building and running applications with the help of containers. Containers allow one to bundle an application or infrastructure with all of its dependencies into a single package [49], with the result that the package can run on any other Linux machine regardless of its settings.


3.4.1 Dockerfiles & images

A Dockerfile consists of instructions that Docker uses to build an image representing the application or infrastructure [40]. It can contain instructions such as copying host files, installing packages, running commands et cetera.

When the file has been built, a Docker image is created. This is the executable version of what was built, as it contains all of the code necessary to run what one wants [41].

3.4.2 Docker containers & compose files

A compose file uses one or more Dockerfile(s) to aid the process of building and running images, by adding the options one would normally include in docker build and docker run [42].

A running Docker image can be seen as a Docker container. This lets one access the application/infrastructure as an isolated process [41].

Multiple containers can run separately at the same time, and the compose file can aid the containers in sharing information and communicating with each other [43], but it will not be used in such a way in this thesis's empirical study.
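The compose file actually used in the study is listed in Appendix C. Purely as a hypothetical sketch (service names and credentials here are invented; only the image tags come from Table 4.1), a minimal compose file for the two databases might look like:

```yaml
version: "3"
services:
  mariadb:
    image: mariadb:10.5.1
    environment:
      MYSQL_ROOT_PASSWORD: example    # placeholder credential
  postgres:
    image: postgres:12.2
    environment:
      POSTGRES_PASSWORD: example      # placeholder credential
```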


Chapter 4

Method

This chapter describes the methods and approaches used in this thesis.

4.1 Empirical

The thesis uses an empirical method to perform experiments. These provide evidence for research questions one, two and three. The experiments are set up to create an equal basis for both databases, in order to achieve the most accurate results that translate into definitive answers.

4.1.1 Hardware and software setup

Name                   Version
Operating System (OS)  macOS 10.14 Mojave
Processor (CPU)        2.2 GHz Intel Core i7
Memory (RAM)           16 GB 2400 MHz DDR4
Graphics               Radeon Pro 555X 4 GB; Intel UHD Graphics 630 1536 MB
Hard-drive             SSD 500 GB
Docker                 Desktop Community 2.2.0.3
Docker Compose         v1.25.5, build 8a1c60f6
MariaDB image          mariadb:10.5.1
PostgreSQL image       postgres:12.2

Table 4.1: Hardware and software specifications used when running the experiments.

4.1.2 Preparation

The database setup consists of four tables: user, post, category and thread. This is the same forum database that was experimented on by Marcus Olsson and Simon Hevosmaa [2]. The only difference is the data inside of it, which is auto-generated into a comma-separated CSV file by a Python script (Appendix B), as the data is not shared in their report.

The original DDL file is meant for MySQL and MariaDB and will be converted to PostgreSQL. This changeover can be complicated, because the conversion risks being imperfect, for instance by using the wrong data types [29]. The approach taken is one of the ways recommended by PostgreSQL themselves, which eliminates most of these risks and involves the command mysqldump -u{user} -p{password} --compatible=”postgresql” {database name} > {output} [36].

This dump command can be utilized to create a DDL file which PostgreSQL can use [28]. {user} and {password} are the login credentials the database requires, --compatible=”postgresql” makes sure that the syntax will be compatible with PostgreSQL, {database name} stands for the database one wants to replicate and, lastly, {output} is the desired file name for the DDL file.

Figure 4.1: A relation diagram over the PostgreSQL and MariaDB database.

The created DDL files had no data from the CSV files inside of them, as the data is appended during the initialization phase Docker uses when building the images.

The data used for the experiments together with the results can be found at the end of the reference list.


4.1.3 Experiment

There are two main phases to the experiment. In the first phase, queries (select, insert, update, delete) will be individually executed with and without non-clustered indexes to establish a baseline. In the second phase, compound statements will be executed in the following order: first, database entries will be inserted; second, entries will be updated; finally, entries will be deleted. Each of these queries is executed inside its respective Docker shell, reiterated over the total data-entry breakpoints of 1, 10³, 10⁴ and 10⁵ entries per table (besides ”category”, which always has a constant number of entries, see Appendix G.7). This is done to achieve an accurate overview of how the corresponding databases behave in different scenarios, all of which are tested thirty times each.

The response times are taken directly from the databases' own command-line interfaces. The default configuration lets MariaDB print out the query execution time, but for PostgreSQL an extra configuration (\timing) is needed to achieve the same result [17].
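As an illustration of the repetition scheme (thirty measurements per query), the sketch below times one select against an in-memory SQLite table with Python's perf_counter. This is a hypothetical harness: the thesis instead reads the timings from each database's command-line interface, and the table size here is invented.

```python
import sqlite3
import statistics
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO user (name) VALUES (?)",
                 [(f"user{i}",) for i in range(10_000)])

# Run the same query thirty times and record each response time.
times = []
for _ in range(30):
    start = time.perf_counter()
    conn.execute("SELECT * FROM user").fetchall()
    times.append(time.perf_counter() - start)

print(f"median response time: {statistics.median(times) * 1000:.3f} ms")
```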

Docker setup, execution and tear-down

docker-compose build --no-cache --build-arg foldername={csv-folder} is used to build the image. The --no-cache option makes sure that the image will not influence the next build. --build-arg foldername={csv-folder} specifies which experiment is meant to be done and which folder to copy. Once a test is done, each of the created volumes is removed together with the existing Docker image, to ensure the next data set will start from zero.

CPU measurements

As for RQ3, the CPU measurements are collected at peak activity while executing each of the queries from research questions one and two. The stream of information is collected from the docker stats command, which displays the load Docker places on the host's memory, I/O and central processing unit [44]. One thing that has to be acknowledged is that the CPU measurements can show results above one hundred percent, since each CPU core is taken into account: with n cores, the usage can reach up to n ∗ 100%.
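As an illustration of how such readings can be processed, the sketch below (a hedged example; the sample strings are hypothetical, and the percentage format is the one printed in the CPUPerc column of docker stats) converts the percentage strings into numbers and keeps the peak value observed during a query:

```python
# Minimal sketch: extract the peak CPU percentage from a series of
# `docker stats --no-stream --format "{{.CPUPerc}}"` samples.
# The sample strings below are hypothetical, for illustration only.

def parse_cpu_perc(sample: str) -> float:
    """Turn a docker stats CPUPerc string such as '85.14%' into a float.

    Values above 100.0 are valid: with n cores the usage can reach n * 100%.
    """
    return float(sample.strip().rstrip("%"))

def peak_cpu(samples):
    """Return the highest CPU reading observed during a query execution."""
    return max(parse_cpu_perc(s) for s in samples)

if __name__ == "__main__":
    readings = ["1.04%", "85.14%", "108.64%", "12.64%"]  # hypothetical samples
    print(peak_cpu(readings))  # the peak may exceed 100% on a multi-core host
```

In practice the readings would be captured from a second terminal running docker stats while the query executes, as described in the execution section.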

Executed queries

The following is a representation of the queries used for the research questions. The first statement in each block represents MariaDB, while the second (separated by a lone "--" line) represents PostgreSQL. Other lines starting with "--" are comments that clarify some of the variable names used inside each query.


RQ1:

Query #1 - Select:

SELECT * FROM user;
--
SELECT * FROM "user";

Query #2 - Inner join:

SELECT
    *
FROM
    user u
    INNER JOIN post p ON u.id = p.user_id
    INNER JOIN thread t ON t.id = p.thread_id
    INNER JOIN category c ON c.name = t.category_name;
--
SELECT
    *
FROM
    "user" u
    INNER JOIN post p ON u.id = p.user_id
    INNER JOIN thread t ON t.id = p.thread_id
    INNER JOIN category c ON c.name = t.category_name;

Query #3 - Outer join:

SELECT
    *
FROM
    post p
    LEFT OUTER JOIN user u
        ON p.user_id = u.id
    LEFT OUTER JOIN thread t
        ON p.thread_id = t.id
    LEFT OUTER JOIN category c
        ON c.name = t.category_name;
--
SELECT
    *
FROM
    post p
    LEFT OUTER JOIN "user" u
        ON p.user_id = u.id
    LEFT OUTER JOIN thread t
        ON p.thread_id = t.id
    LEFT OUTER JOIN category c
        ON c.name = t.category_name;


Query #4 - Inner & outer join:

SELECT
    *
FROM
    post p
    INNER JOIN thread t ON t.id = p.thread_id
    LEFT OUTER JOIN category c ON c.name = t.category_name;
--
SELECT
    *
FROM
    post p
    INNER JOIN thread t ON t.id = p.thread_id
    LEFT OUTER JOIN category c ON c.name = t.category_name;

RQ2:

Query #5 - Insert:

-- Assumes that a data entry exists inside user
-- Assumes that a data entry exists inside category with the name 'SQL'

BEGIN;
INSERT INTO thread (title, category_name, user_id)
VALUES ('title', "'SQL'", 1);
INSERT INTO post (thread_id, user_id, title, content)
VALUES (LAST_INSERT_ID(), 1, 'title', 'content');
COMMIT;
--
BEGIN;
INSERT INTO thread (title, category_name, user_id)
VALUES ('title', $$'SQL'$$, 1);
INSERT INTO post (thread_id, user_id, title, content)
VALUES (currval(pg_get_serial_sequence('thread', 'id')), 1, 'title', 'content');
COMMIT;


Query #6 - Update:

-- To_Update = The number of entries inside thread

UPDATE post SET
    thread_id = To_Update / 2,
    content = 'updated'
WHERE
    id = 1 OR
    id = To_Update OR
    id = (To_Update + 1) OR
    id = (To_Update - 1);
--
EXPLAIN ANALYZE UPDATE post SET
    thread_id = To_Update / 2,
    content = 'updated'
WHERE
    id = 1 OR
    id = To_Update OR
    id = (To_Update + 1) OR
    id = (To_Update - 1);

Query #7 - Delete:

-- To_Delete = Half of the number of entries inside thread

BEGIN;
DELETE FROM post WHERE
    thread_id = To_Delete;
DELETE FROM thread WHERE
    id = To_Delete;
COMMIT;
--
BEGIN;
DELETE FROM post WHERE
    thread_id = To_Delete;
DELETE FROM thread WHERE
    id = To_Delete;
COMMIT;


4.2 Statistical Analysis

The data include CPU usage together with response times, collected during the same test run. The information was then sorted by query and database entry volume. As stated in the empirical method, each test was run thirty times.

Graphs were then set up to compare the queries over the full range of entries tested.

To analyze and present the gathered data, the researchers use descriptive statistics together with simple graphical analysis, as these are the basis of virtually every quantitative analysis [18]. This process dissects the collected data until a sample of average values can be drawn, giving an overview of whether the data are grouped together or spread out and whether the values are high or low.
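As a minimal sketch of this descriptive step (the latency samples below are hypothetical stand-ins for the thirty recorded runs), one query's response times can be reduced to the summary values used in the graphs:

```python
# Descriptive-statistics sketch: reduce one query's recorded response-time
# samples to the summary values that the graphs are built from.
# The sample list is hypothetical; real values come from the CLI output.
import statistics

def summarize(samples):
    """Return mean, median and standard deviation of the response times."""
    return {
        "mean": statistics.mean(samples),
        "median": statistics.median(samples),
        "stdev": statistics.stdev(samples),
    }

if __name__ == "__main__":
    latencies = [0.085, 0.089, 0.091, 0.084, 0.088]  # seconds, illustrative
    print(summarize(latencies))
```

The mean gives the plotted central value, while the spread (standard deviation) indicates whether the thirty runs are grouped together or scattered.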

4.3 Execution

The experiment stage starts with building the Docker image along with the correct CSV folder that is needed for the current test range. Build and run the Docker container in a terminal. Open a second terminal that runs the docker stats command to see the CPU usage. When the container is up, open a third terminal and execute the desired query. The following steps will display the CPU usage in the second terminal and the database’s response time in the third. When the results are gathered, tear everything down by stopping the container, removing the volume and image.

Example RQ 1 and 3, 1 Entry, Select with MariaDB

1. Terminal 1: docker-compose build --no-cache --build-arg foldername=1 maria

2. Terminal 1: docker-compose up maria
3. Terminal 2: docker stats

4. Terminal 3: docker exec -it maria mysql -uuser -ppass kandidat -e "SELECT * FROM user"

5. Terminal 3: Leave the docker connection and tear everything down;

rm -rf ./data/*

docker-compose down -v

yes y | docker rmi performance_analysis_maria:latest


4.3.1 Digressions

The only obstacle that occurred was the time it took for Docker to automatically load the DDL files, which causes a delay during which one cannot execute SQL commands inside the container.

Calculating outliers and leftover results

There were results which missed the target range of the test majority; these are identified with the "interquartile range" (IQR) method, a mathematical way to determine whether a data point is an outlier [8].

Two quartiles are used to calculate the upper and lower bounds. The first quartile (Q1) and the third quartile (Q3) are placed into the formulas below, and anything above the upper bound or below the lower bound is an outlier.

Q1 is the median of the lower half of the original data set. This means that 25% of the numbers in the data set lie below Q1 and 75% above. Q3 is calculated in the same way: 75% of the values are below Q3 and 25% above.

IQR: Q3 − Q1

Lower bound: Q1 − 1.5 ∗ IQR
Upper bound: Q3 + 1.5 ∗ IQR

Outliers were then replaced with a rerun to ensure no external issues interfered. This happened in less than 2% of the test cases and could be traced down to random CPU and RAM spikes on the host machine.
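The calculation above can be sketched as follows (the quartile convention, medians of the lower and upper halves, follows the description above; the sample values are hypothetical):

```python
# IQR outlier-detection sketch following the formulas above:
# lower bound = Q1 - 1.5 * IQR, upper bound = Q3 + 1.5 * IQR.
# Quartiles are the medians of the lower and upper halves, as described.
import statistics

def iqr_bounds(samples):
    """Return (lower, upper) outlier bounds for a list of measurements."""
    data = sorted(samples)
    half = len(data) // 2
    q1 = statistics.median(data[:half])   # median of the lower half
    q3 = statistics.median(data[-half:])  # median of the upper half
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def outliers(samples):
    """Return the data points falling outside the IQR bounds."""
    low, high = iqr_bounds(samples)
    return [x for x in samples if x < low or x > high]

if __name__ == "__main__":
    runs = [18, 19, 19, 20, 20, 21, 21, 22, 55]  # one obvious spike (55)
    print(outliers(runs))  # the spike is flagged and would trigger a rerun
```

Any run flagged this way would be discarded and repeated, as described above.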


Chapter 5

Results

This chapter contains the results acquired by the empirical study and evaluates the data collected from the two databases in a series of queries. The chapter goes through the queries in the order seen from the method chapter. Each query is presented as a subsection, breaking down the primary factors of the study.

5.1 The Effects of Indexes

The graphs come in two formats, illustrating the CPU workload and latency. The CPU graphs display the docker stats CPU usage on the Y-axis (as %), with the X-axis showing the number of entries per table currently inside the database, while the response time graphs' Y-axis displays the time taken (in seconds) for the query to be fully executed. The tables contain the breakpoints which the graphs use.

5.1.1 Query #1 - Select

Figure 5.1: Query #1 - results.

PostgreSQL with indexed foreign keys shows an apparent correlation between CPU usage and latency: it uses the most processing power and is the slowest. Its counterpart, however, proves to have a faster response time even with the same CPU usage. Both databases with altered FKs seem to have a better response time at first, but get slower as the database grows, letting their counterparts show better results.

MariaDB IMariaDB PostgreSQL IPostgreSQL

CPU Latency CPU Latency CPU Latency CPU Latency

1 .018 .018 .412 .017 1.035 .009 1.026 .003

1k 1.493 .020 3.401 .021 3.852 .010 2.403 .005

10k 9.112 .027 8.673 .028 12.643 .016 12.414 .020

100k 78.217 .085 77.831 .089 85.147 .099 96.607 .106

Table 5.1: Query #1 - breakpoints.

5.1.2 Query #2 - Inner join

Figure 5.2: Query #2 - results.

MariaDB IMariaDB PostgreSQL IPostgreSQL

CPU Latency CPU Latency CPU Latency CPU Latency

1 .147 .019 .392 .018 1.464 .027 .662 .014

1k 3.487 .036 8.190 .024 9.355 .027 6.565 .021

10k 42.153 .098 25.737 .061 36.597 .140 11.355 .052

100k 108.644 .531 108.935 .504 98.922 1.570 100.006 1.075

Table 5.2: Query #2 - breakpoints.


Both of the Maria databases are quicker at retrieving information, although the latency difference between the altered and the non-modified version is almost even. The PostgreSQL duo differ greatly in processing power compared to the others; the non-modified interpretation displays the lowest CPU usage and is also the slowest.

5.1.3 Query #3 - Outer join

Figure 5.3: Query #3 - results.

The results display that both of the non-modified interpretations have higher CPU usage than their counterparts. The two Maria databases remain quicker, just as in the last query where JOIN was used. There are only a few milliseconds between the two iterations, but the results show that the non-clustered versions are faster.

MariaDB IMariaDB PostgreSQL IPostgreSQL

CPU Latency CPU Latency CPU Latency CPU Latency

1 .558 .025 .532 .018 1.842 .026 .616 .017

1k 3.741 .062 8.211 .033 9.865 .027 8.360 .021

10k 47.792 .112 25.352 .063 58.006 .052 11.132 .043

100k 116.850 .602 111.523 .46 99.393 1.374 93.133 1.162

Table 5.3: Query #3 - breakpoints.


5.1.4 Query #4 - Inner & outer join

Figure 5.4: Query #4 - results.

This query is the first to give a clear view of the impact non-clustered indexes can have. It is once again the JOIN operator (here a combination of the two previous queries) which shows MariaDB being faster than PostgreSQL in both adaptations. One can see how both of the Maria databases remain close throughout, while PostgreSQL develops a gap as more records are introduced.

MariaDB IMariaDB PostgreSQL IPostgreSQL

CPU Latency CPU Latency CPU Latency CPU Latency

1 .379 .019 .488 .015 1.627 .023 .579 .012

1k 2.642 .029 6.259 .022 6.367 .026 6.304 .014

10k 34.696 .082 17.461 .053 37.969 .055 48.742 .029
100k 90.285 .375 98.824 .308 99.474 1.072 99.711 .464

Table 5.4: Query #4 - breakpoints.


5.1.5 Query #5 - Insert

Figure 5.5: Query #5 - results.

The databases are overall close at most entry levels, but with an expanding number of entries both of the altered databases develop an increasing gap to their partners. The non-clustered versions utilize more CPU power on average. No improvement is to be found, as the latency continues to increase.

MariaDB IMariaDB PostgreSQL IPostgreSQL

CPU Latency CPU Latency CPU Latency CPU Latency

1 .628 .018 .646 .018 1.753 .015 2.020 .015

1k .819 .019 .946 .0195 1.753 .016 2.040 .017

10k .883 .020 1.090 .021 1.920 .018 2.290 .019

100k .939 .021 1.101 .024 1.997 .021 2.480 .023

Table 5.5: Query #5 - breakpoints.


5.1.6 Query #6 - Update

Figure 5.6: Query #6 - results.

Query number six provides a clear perspective of how much PostgreSQL gains in performance from non-clustered indexes, although the database itself seems to struggle when updating values as more entries match the query.

There are small differences between the Maria databases, but overall the modified version provides a lower response time, with CPU usage varying depending on the database itself.

MariaDB IMariaDB PostgreSQL IPostgreSQL

CPU Latency CPU Latency CPU Latency CPU Latency

1 .352 .017 .465 .017 1.020 .017 .435 .006

1k .719 .018 .987 .019 1.050 .019 .608 .007

10k .731 .020 1.003 .019 1.153 .020 .746 .023

100k 1.116 .021 1.176 .020 2.446 .036 .911 .024

Table 5.6: Query #6 - breakpoints.


5.1.7 Query #7 - Delete

Figure 5.7: Query #7 - results.

Both of the modified databases use less processing power and display a faster response time compared to their counterparts.

When looking at the graphs, it is hard to overlook the spiking values of the non-modified databases. The main cause is how the storage engines handle their index structure; a more in-depth explanation can be found under section 6.2.2, paragraph 2.

MariaDB IMariaDB PostgreSQL IPostgreSQL

CPU Latency CPU Latency CPU Latency CPU Latency

1 .760 .020 .449 .017 1.500 .025 .923 .011

1k .823 .034 .723 .017 1.563 .025 1.191 .011

10k 1.447 .055 .773 .019 1.630 .025 1.898 .014

100k 5.952 .072 1.070 .019 2.410 .040 2.001 .016

Table 5.7: Query #7 - breakpoints.

5.2 The Results as a Whole

There seems to be an improvement, both in CPU usage and response times, for almost all queries using non-clustered indexes. PostgreSQL shows the largest differential results percent-wise. MariaDB, however, shows that the altered and non-modified adaptations are almost always close to each other, resulting in non-clustered indexes having a lesser impact on this database.


Chapter 6

Analysis and Discussion

This chapter begins with an overview of the findings, followed by a more thorough analysis of each research question. Lastly, it brings up risks and validity threats that could change the outcome of the thesis's empirical method.

6.1 Overview

The results show that non-clustered indexes enhance the response time for both databases. The degree of reduction varies between the systems: they seem to have a greater effect on PostgreSQL than on MariaDB, which does not differentiate much from its counterpart. With this, one can state that the database itself matters to a high degree, as there might be less room for improvement in a Maria database. The results could, however, change, as the latency seems to develop an increasing gap while more entries are included.

The CPU data demonstrate that MariaDB has lower CPU usage on almost all of the queries, and non-clustered indexes do not really seem to influence this. What distinguishes the non-modified version is that it starts at a higher capacity in queries 2, 3 and 4, although both end up at around the same results as more entries are added. Q#7 is the only case where the gap is extensive between the pair; in all of the other tests, they are either equal or slightly higher than their counterpart.

PostgreSQL looks to be more affected, as the altered versions display lower values on queries 2 and 3. However, Q#4, which consists of a combination of the two previous queries, indicates a higher CPU usage.

The correlation between CPU usage and latency is hard to notice in both databases since, most of the time, no matter the percentage, the modified adaptation proves to be faster. With that said, the time gap does increase as the non-clustered PostgreSQL system rises in capacity. The common theme for MariaDB and PostgreSQL is that the processing curve is higher for both non-modified adaptations at lower populations and increases in scale as entries are added, with only a few exceptions. Q#4 (for PostgreSQL) and Q#6 (MariaDB) are two of these, as their non-clustered adaptations have the highest usage from the start.



6.2 Research Questions

This subsection answers the thesis research questions one by one. The tables displayed cover the differences in percentages, where a negative value means that the non-clustered index performs slower and a positive number indicates that it is faster.

6.2.1 RQ1 - How much do indexes on foreign keys impact the database's speed when retrieving data, compared to using no non-clustered indexes, during a set of select and join queries?

There are some improvements when it comes to using non-clustered indexes. This can be seen in queries 2, 3 and 4. The most problematic query was Q#1 (a plain select), where the altered MariaDB adaptation is 5.6% faster at the first breakpoint but thereafter crumbles as the database is populated with more records. PostgreSQL is faster at the two first breakpoints but, just like the non-clustered MariaDB, gets slower as the table size increases.

MariaDB PostgreSQL

Q1 Q2 Q3 Q4 Q1 Q2 Q3 Q4

1    5.6  5.3  28   21.1 66.7 48.1 34.6 47.8
1k   -5   50   46.8 24.1 50   22.2 22.2 46.2
10k  -3.7 37.8 43.8 35.4 -25  62.9 17.3 47.3
100k -4.7 5.1  23.6 17.9 -7.1 31.5 15.4 56.7

Table 6.1: RQ1 - Index differentials

Joins make the best use of indexes, as the results here report. The modified versions of both databases are quicker during queries 2, 3 and 4, since they utilize their non-clustered indexes, in contrast to query one, which does not. The graphs show that the altered MariaDB is faster than its counterpart, although only barely so at the first and last breakpoints in Q#2. This can be explained by the first range being so low that the DDL does not need any optimization; hence, accounting for more indexes makes it slower [11]. The last range generates a result set which causes the non-clustered version to have a higher memory consumption, decelerating the row source generator stage [31, 63].


Summary

RQ1's outcome exhibits the expected results for queries 2, 3 and 4, where non-clustered indexes are faster than their counterparts, given that the columns are already sorted for the optimization stage. Q#1 is also affected by indexes, but in a negative way, as they consume unnecessary memory during the query execution, similarly to joins with a large result set.

6.2.2 RQ2 - In what way does indexing affect the database's response time when inserting, deleting and updating data?

Insert, delete and update operations are in theory thought to be slower with more indexes, as the database needs to add more data to its BTREE structure [53]. This shows to be true during the insert test. Queries 6 and 7, however, display a faster result, which contradicts this assumption. Both of these utilized the WHERE statement to determine which entries to manipulate, and this accelerates the process to the point where the improvements from the optimization stage seem to outweigh the slower SQL execution stage [34]. An update seems to benefit less than a delete overall, even though it differs depending on the database.

MariaDB PostgreSQL

Q5 Q6 Q7 Q5 Q6 Q7

1 0 0 15 0 64.7 56

1k -2.6 -5.5 50 -6.3 63.2 56

10k -5 5 65.5 -5.6 15 44

100k -14.3 4.8 73.6 -9.5 33.3 60

Table 6.2: RQ2 - Index differentials

The strange spike caused by MariaDB in query number seven is a side effect of InnoDB, where the clustered index has to keep the table ordered by its value. This means that when a row gets removed, it has to be removed from the physical disk and the table reordered. The problem is not the delete itself but all that is going on in the background: InnoDB will not lock the table during the delete, but will instead require more system resources (causing the CPU spike).

If one were to use MyISAM instead, the whole table would be locked, fewer resources would be taken, and the process would therefore speed up [39]. Update statements are not treated the same way, as no physical reordering is needed unless the clustered index itself is updated [12].

Summary

A conclusion can be made that indexes do not always affect updates and deletes in a negative way, as the WHERE statement seems to accelerate these operations. Performing them on a larger scale, or on data sets with different table connections, could possibly change this outcome. The insert statement, however, does not seem to benefit, as expected given the theory chapter.

6.2.3 RQ3 - To what degree does indexing impact the CPU performance when analyzing the data from research questions one and two?

MariaDB PostgreSQL

1 1k 10k 100k 1 1k 10k 100k

Q1 -2188.9 -127.8 4.8 0.5 0.9 37.6 1.8 -13.5
Q2 -166.7 -134.9 38.9 -0.3 54.8 29.8 69 -1.1

Q3 4.7 -119.5 46.9 4.6 66.6 15.3 80.8 6.3

Q4 -28.8 -136.9 49.7 -9.5 64.4 1 -28.4 -0.2
Q5 0 97.6 -23.4 17.3 -15.2 -16.4 -19.3 -24.2
Q6 -32.1 -37.3 -37.2 -5.4 -2458.8 42.1 35.3 62.8

Q7 40.9 12.2 46.6 82 38.5 23.8 -16.4 17

Table 6.3: RQ3 - Index differentials

From the data collected within this thesis's scope, no conclusion can be drawn. The databases' CPU usage behaves differently while using the same query structure, as seen in table 6.3.

This behaviour is spread out over multiple queries and ranges and cannot be pinpointed without understanding the CPU usage of each database's core infrastructure.

6.3 Threats to Validity

6.3.1 Internal validity

In order to ensure that the database tests used in the study had equal circumstances, where index performance would not be hindered, reduced or boosted, certain steps were taken to control the conditions. A database design that had been experimented on in another study about performance was adopted, given that a poorly designed database can hamper performance [13]. The databases are set up inside Docker containers, where hardware resources can be controlled and restrained; the Docker container is rebuilt after each run, any traces of the database volume and files are removed, and the cache is turned off. Performing the tests in this order removes most of the variables which could have given a database an unfair advantage [14].

6.3.2 Conclusion validity

As described above, all tests were put under the same conditions when the data were collected. The data were acquired from docker stats and the databases' own command-line tools, which are close to the original data source. All tests were run in Docker containers to reduce the chance of any outside process influencing the results. The tests were run thirty times each, as this balances running often enough to average out random errors against the diminishing returns of further runs; there is an old statistics adage that with enough tests most data regress towards a normal distribution, and 30 is a common approximation of when this has usually happened [56]. None of the data was analyzed before drawing up the results, which reduces the chance of the raw data being corrupted by human interference, with outliers removed afterwards. With this setup there is little to no risk that the results are influenced.

Having a larger database set could also have been beneficial to this study, but it would not have mattered, as this thesis does not focus on scalability; it studies the impact of non-clustered indexes on foreign keys. Such an argument also works against itself, because it would then require defining different sizes of databases and verifying them on multiple levels to get the correct index impact for each. Instead, the thesis focuses on finding a middle ground between a small and a medium scale. The results can still be applied more broadly, given that one can start to see a pattern in how the queries change as more records are added.

The generated data should not have an impact on the databases, as it is not fully random. For example, the names of the users are real and the posts' content has the same number of characters. The generated data is structured in a realistic setting, which means not all of the users create threads or posts, as some people only visit to look for answers that already exist.

A badly formatted query can reduce the performance of an operation, but this should not matter, since all of the database versions use the same queries and are hence equally affected. The select and join statements are not specifically tailored to indexes, as they do not filter out specific values with the help of other comparison statements. If they did, one would see a larger differential between the versions, with the indexed database having the advantage. The manipulation queries, however, are the same as or close to "real" queries, as they simulate how a new thread is created in or removed from a database.


6.3.3 External validity

To be able to generalize the results of index performance, two relational databases with distant code bases were picked to give an overall view of the impact indexes have as a whole. The tests were carried out with the acknowledgement that PostgreSQL's main focus is on scaling, which could contribute to slower read-heavy workflows, but with the way the tests are performed this should not be an issue. The tests are not designed to compare the databases against one another; they are drawn up to extract data on the impact that indexes have on each database individually.

6.3.4 Construct validity

”The tests are only performed on a single database design”

This is a common problem when analysing databases. It is factual that test results depend on the data set that is used and on how the database is constructed [13]. As the database is based on a real example that has also been experimented on beforehand, the results are more trustworthy [2]. It is one database, but the thesis's focus lies on the relations between the tables, which are not heavily dependent on a specific application or design. It is more about how the keys connect different tables, which is similar across most systems.

As mentioned under conclusion validity, the tests are repeated multiple times to give a larger data set, which leads to a more precise result. The tests are set up to produce sets of both the CPU and response time at the same moment, which ensures a better overview in case a test result is corrupted.

Chapter 7

Conclusions and Future Work

The thesis goal is to evaluate how indexes affect the response time and CPU usage during different CRUD operations. The empirical study conducts tests on two databases, PostgreSQL and MariaDB, in an effort to put more emphasis on relational databases' index performance as a whole.

All of the experiments are organized inside the mentioned databases' official Docker images, hosted on a MacBook Pro, so that the conditions can be equalized for each newly run test. The experiments are conducted thirty times each to get an accurate set of results. To further strengthen the findings and minimize some of the external risks, a database design that had previously been experimented on was adopted.

Research question one: How much do indexes on foreign keys impact the database's speed when retrieving data, compared to using no non-clustered indexes, during a set of select and join queries? In order to identify the read operations' latency, three join statements together with a plain select were used to see how non-clustered indexes influence the retrieval of data. The evidence found concludes that queries which do not utilize their indexes are slower, due to unnecessary consumption of memory, while the ones that do are faster, given that the columns are sorted, which optimizes the query.

Research question two: In what way does indexing affect the database's response time when inserting, deleting and updating data? In the other category of CRUD operations (data manipulation), inserting new entries points towards a negative effect, as expected. However, a surprising factor was found, given that update and delete operations yield a faster result, which contradicts the theory. The researched outcome was supposed to display a higher latency, as the storage engine has to refactor more of its index structure.

Research question three: To what degree does indexing impact the CPU performance when analyzing the data from research questions one and two? Throughout all the tests, the CPU was monitored in order to find a correlation between response times and processing power. The two databases' results were compared against one another to find the impact indexes have on CPUs, with the understanding that the data would display the same outcome for both databases.

The results indicate that the CPU is not affected by indexes themselves, as the two databases display different patterns.



The conclusion shows that both databases react similarly regarding latency's correlation with non-clustered indexes and the number of entries. A similar result could not be established for the CPU usage: the effect of indexes there is concluded to be absent, as no evidence supporting a relationship between them could be found. This indicates that the CPU is not affected by the indexes themselves, but rather depends on the software itself.

The continuation of this work would be a comparison study, where different optimization strategies are ranked by their effect and the effort they require to implement. This would give clear insight into where to invest time when optimizing database response times. It would also be interesting to follow up on the negative effects that indexes could have, for example if one overuses the feature or implements the wrong type. There is a shortage of information about the stages a query has to go through and what they do. Analyzing the time all stages take and how indexes affect them might lead to a better understanding and a concrete result.


References

[1] Erik Andersson and Zacharias Berggren - A Comparison Between MongoDB and MySQL Document Store Considering Performance, Thesis/Paper (English), 2017

[2] Marcus Olsson, Simon Hevosmaa - Experimentell jämförelse av Cassandra, MongoDB och MySQL, Thesis/Paper (English), 2014

[3] Johan Gustavsson - Comparison of relational databases and NoSQL-databases, Thesis/Paper (Swedish), 2014

[4] P. Griffiths Selinger, M. M. Astrahan, D. D. Chamberlin, R. A. Lorie and T. G. Price - http://x86.cs.duke.edu/courses/spring03/cps216/papers/selinger-etal-1979.pdf, Thesis/Paper (English), 1979. Accessed: 2020/03/04 - 19:45

[5] Mike Gunderloy, Joseph L. Jorden, David W. - Mastering Microsoft SQL Server 2005, Book (English), ISBN: 978-0-782-14380-5, 2005

[6] Darl Kuhn, Sam Alapati, Bill Padfield - Expert Indexing in Oracle Database 11g Performance for your Database, Book (English), ISBN: 978-1-4302-3736-5, 2011

[7] Gerald Quirchmayr, Erich Schweighofer and Trevor J. M. Bench-Capon - Database and Expert Systems Applications, Book (English), ISSN: 0302-9743, 1998

[8] H.P. Vinutha, B. Poornima, B.M. Sagar - Detection of Outliers Using Interquartile Range Technique from Intrusion Dataset, Book (English), ISBN: 978-981-10-7563-6, 2018

[9] Carlos Coronel, Steven Morris - Database Systems: Design, Implementation, & Management, Book (English), ISBN: 978-1-337-62790-0, 2018

[10] Stanislaw Kozielski, Dariusz Mrozek, Pawel Kasprowski, Bozena Malysiak-Mrozek, Daniel Kostrzewa - Beyond Databases, Architectures and Structures. Paving the Road to Smart Data Processing and Analysis, Book (English/Polish), ISBN: 978-3-030-19093-4, 2019
