Comparing database management systems with SQLAlchemy : A quantitative study on database management systems

(1)

Linköpings universitet

Linköping University | Department of Computer and Information Science

Bachelor thesis, 16 ECTS | Datateknik

2019 | LIU-IDA/LITH-EX-G--19/074--SE

Comparing

database

man-agement

systems

with

SQLAlchemy

–

A quantitative study on database management systems

Jämförelse av databashanterare med hjälp av SQLAlchemy

Marcus Fredstam

Gabriel Johansson

(2)

Upphovsrätt

Detta dokument hålls tillgängligt på Internet – eller dess framtida ersättare – under 25 år från publiceringsdatum under förutsättning att inga extraordinära omständigheter uppstår. Tillgång till dokumentet innebär tillstånd för var och en att läsa, ladda ner, skriva ut enstaka kopior för enskilt bruk och att använda det oförändrat för ickekommersiell forskning och för undervisning. Överföring av upphovsrätten vid en senare tidpunkt kan inte upphäva detta tillstånd. All annan användning av dokumentet kräver upphovsmannens medgivande. För att garantera äktheten, säkerheten och tillgängligheten finns lösningar av teknisk och admin-istrativ art. Upphovsmannens ideella rätt innefattar rätt att bli nämnd som upphovsman i den omfattning som god sed kräver vid användning av dokumentet på ovan beskrivna sätt samt skydd mot att dokumentet ändras eller presenteras i sådan form eller i sådant sam-manhang som är kränkande för upphovsmannens litterära eller konstnärliga anseende eller egenart. För ytterligare information om Linköping University Electronic Press se förlagets hemsida http://www.ep.liu.se/.

Copyright

The publishers will keep this document online on the Internet – or its possible replacement – for a period of 25 years starting from the date of publication barring exceptional circum-stances. The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/hers own use and to use it unchanged for non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the con-sent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility. According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement. For additional information about the Linköping Uni-versity Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.

c

Marcus Fredstam Gabriel Johansson

(3)

Abstract

Knowing which database management system to use for a project is difficult to know in advance. Luckily, there are tools that can help the developer apply the same database design on multiple different database management systems without having to change the code. In this thesis, we investigate the strengths of SQLAlchemy, which is an SQL toolkit for Python.

We compared SQLite, PostgreSQL and MySQL using SQLAlchemy as well as com-pared a pure MySQL implementation against the results from SQLAlchemy.

We conclude that, for our database design, PostgreSQL was the best database manage-ment system and that for the average SQL-user, SQLAlchemy is an excellent substitution to writing regular SQL.

(4)

Acknowledgments

We want to thank Jon Dybeck at the Department of Computer and Information Science at Linköping University for giving us the opportunity to work on a project that this thesis is based on.

We would also like to thank Anders Fröberg at the Department of Computer and Infor-mation Science at Linköping University for helping us throughout this thesis by giving us ideas and inspiration when we got stuck.

(5)

List of Figures

2.1 Example of 1NF of two database schemas. . . 7

2.2 Database schema in 2NF using ID as primary key. . . 7

2.3 Example of 3NF of two database schemas. . . 8

2.4 ER-model of the SQLAlchemy example. . . 9

3.1 EER-model of the database design . . . 13

4.1 Average insert time. . . 18

4.2 Average delete time. . . 18

4.3 Average search time. . . 19

4.4 Average update time. . . 19

4.5 Inserts per second. . . 20

4.6 Deletes per second. . . 20

4.7 Searches per second. . . 21

(7)

List of Tables

3.1 Hardware and OS used for testing the DBMS. . . 14

4.1 Result data from the insert time tests . . . 16

4.2 Result data from the delete time tests . . . 17

4.3 Result data from the search time tests . . . 17

4.4 Result data from the update time tests . . . 17

(8)

1 Introduction

There are many database management systems (DBMS) on the market. Each DBMS has a unique set of commands available to the user to be able to interact with the database. This means that once a database has been implemented with one DBMS, it will require a lot of work to implement the same database for a different DBMS. We believe that it is difficult to know which DBMS to use in advance, making it a risk when developing applications that use databases. In other words, since it takes a lot of effort to switch DBMS, it is important to carefully consider which DBMS to use.

Luckily, there exist tools that developers can use to help create databases, some of which can be applied to multiple different DBMS. This is a huge advantage for developers who have not decided or are unsure of which DBMS they want to use. However, the DBMS may have different performances using the same tool. I.e., DBMS A may perform better than DBMS B (and vice versa) using tool X.

In this thesis, we use the database tool SQLAlchemy for Python and compare the perfor-mance of different DBMS.

1.1 Background

Due to administration difficulties at the Department of Computer and Information Science at Linköping University, the system administrators wanted a tool that would help them keep track of what software was needed by each course on their computer environment, and store that information in a database. Therefore, the system administrators thought of the idea of running a website that the course staff could use to request software that they need for their course during the semester. While working on this project, we were considering which DBMS to use for the website. We eventually found a database tool called SQLAlchemy and used it to perform a study to figure out which DBMS fit our design best.

1.2 Aim

The purpose of this project is to evaluate different database management systems with SQLAlchemy.

(9)

1.3. Research questions

1.3 Research questions

These are the questions that our project will be looking into.

1. How well do different database management systems perform compared to each other with SQLAlchemy?

2. How does SQLAlchemy, using MySQL as database management system, compare to a pure MySQL implementation?

1.4 Delimitations

There were a couple of self-set delimitations during this project such as only using Python, only using relational databases, and having the tool only be compatible with Linux. These are delimitations we chose to work with ourselves, either because of ease in developing or because of requirement specification. We also delimited ourself to use SQLAlchemy exclu-sively.

Another factor that impacted the project was time. Since we had limited time with this project, we could not run some of the tests at larger volumes, which we otherwise would have wanted.

(10)

2 Theory

This chapter discusses the theory of different topics that are necessary for understanding this thesis. The chapter is divided into the following sections. Section 2.1 discusses the very basics of what data is. Section 2.2 explains what a database is at high level. Section 2.3 brings up what a database management system is and why they are necessary for managing data. Section 2.4 discusses design steps for making sure the database schema is in a good state. Section 2.5 explains what object-relational mapping is. Section 2.6 talks about the Python package SQLAlchemy for creating databases at a higher level.

2.1 Data

Data can be described as information [1] if it has been given meaning. It should have some value that makes the owner of this information more knowledgeable. Data can either be qualitative or quantitative.

Qualitative data

Qualitative data is data that cannot be represented by numbers [2]. Instead, the qualitative data is observed and defined in some other way. E.g., the colour of a shirt, the mood of your friend, and the name of software are data that cannot be described with numbers.

Quantitative data

Quantitative data can, unlike qualitative data, be represented by numbers [2]. Numbers can have different form depending on what kind of data it is, e.g., percentages, averages, and time. E.g., how fast a car is travelling, the number of employees at a company, and the temperature in a room are data that can be represented in numbers.

Data is most often collected when the person who collects the data want to become more knowledgeable [1]. When collecting data, the amount can become overwhelming in some cases. Thus, having the data organised some way is a good idea.

(11)

2.2. Database

2.2 Database

A database stores and organizes data electronically [3]. Databases are often used by compa-nies and organizations to manage their vast amounts of data. An example is an airport which keeps track of flights, travellers, tickets, and planes. The airport can use that information to figure out how many seats of a flight are not booked, how many tickets are sold, and which traveller used which ticket.

There are different kinds of databases, whereas the most common ones are relational databases, object databases and object-relational databases [3]. Another database type that has become popular in recent years are NoSQL databases [4, 5].

Relational data model

Based on the relational model proposed by Codd [6], relational databases use one or more tables. Each table has attributes (columns) that are of some data type, and rows, that hold the data. Tables can be related to each other, i.e., it is possible to use one table to get data from another table. This makes it easier to find and manage data.

Object data model

Another database model is the object data model, where the idea is to use the notion of object-orientation to group data. Instead of storing the data in tables, the data is stored in objects. The object data model defines objects, their attributes and functionality, whereas the key fac-tor is that designers can define data types, which is not possible in the relational data model [3].

Object-Relational data model

Over the years, the relational data model has been expanded with concepts and functionality from the object data model, making up what is called the object-relational data model [3]. In the object-relational data model, just like the object data model, designers can define data types, but instead of storing the data in the objects, the data is stored in tables, just like in the relational data model.

NoSQL model

NoSQL stands for "Not only SQL" and does not use the traditional tables that relational databases use. They are designed for millions of users interacting with the database all at the same time [4]. NoSQL databases scale horizontally, i.e., you can add more servers instead of having to upgrade on a central server, as well as being able to distribute the load and data on all of the servers. Three popular types of NoSQL databases are key-value stores, column-oriented databases, and document-based stores [5] (not covered here).

To be able to use these models and manage data, one typically would use a database management system.

2.3 Database Management Systems

A database management system (DBMS) is software that uses database models and thus allows for creating and managing databases [3]. Users interact with the DBMS by either a graphical user interface (GUI) or by running structured query language (SQL) commands. SQL is used to tell the DBMS what to do, such as insert new data, update or delete already existing data, and select data for viewing.

(12)

2.4. Database design

A few advantages in using a DBMS are redundancy control, restricting unauthorized ac-cess and communicating from a distance [3].

Redundancy control makes sure that each logical item is only stored once, and sometimes, even using denormalization to reduce the number of redundant queries on important data.

Restricting unauthorized access uses the notion of user permission to make sure that each user only can perform what their access-right allows them to do.

By using a DBMS, it is also possible for multiple users from multiple locations to receive data from the system without knowing the physical location of the database. This means that it is possible to create applications with databases, in which users do not have to query a specific position for data since the application handles requests to the database for them.

A few well-known DBMS of this kind are MySQL, PostgreSQL, Oracle, and SQLite.

2.4 Database design

Teorey et al. [7] describes a method of designing a good relational database using three steps. The first step is to create an enhanced entity-relationship model from the requirements of the system. The second step is to transform the model into relations in a preferred database language. The third and final step is to normalize the relations. In the upcoming parts, these steps are discussed in more depth.

Entity-Relationship and Enhanced Entity-Relationship modeling

The upcoming two sections explain entity-relationship and enhanced entity-relationship modelling, and how they differ from each other.

ER-modeling

Entity-relationship (ER) modelling is used to represent information about entities, entities’ attributes, and their relationships [7, 3]. Entities are an independent existence in the real world. While it can be a physical object like a car, phone, or a computer, it can also be a type of conceptual existence such as a school, a job, or a country.

The attributes of an entity are used to describe it. For example, an entity "Car" may use attributes like colour, registration number, and weight to describe itself, while an entity "dog" may use attributes like race, colour, height, and age.

Relationships describe the connectivities between entity occurrences such as one to one, one to many and many to many. Some connections between the entities are mandatory while others are not. E.g., the entity "person" may or may not own an animal, but the "person" certainly has a name and a birth date.

The ER model is useful when communicating fundamental data and relationship defini-tions with the end user. However, using the ER model as a conceptual schema representation is difficult because of the flaws of the original modeling constructs [7]. One such example is integration, which requires the use of abstract concepts such as generalization.

Another example is data integrity which involves null attribute values. This usually in-volves deciding whether a relationship allows a null attribute. I.e., whether it is mandatory that a person owns an animal or if it is optional.

The Enhanced ER (EER) model solves this by providing such generalisation and integrity of entities [3, 7]. Moreover, it is compatible with the ER-model.

EER-modeling

The EER-model expands upon the ER-model, inheriting the concepts of and ER-model while adding concepts like subclass and superclass, specialization and generalization, and category or union type [3]. Therefore, earning the name "enhanced" ER-model.

(13)

These concepts are fundamental when enhancing the ER-model, of which superclass (or supertype) and subclass (or subtype), is a concept concerning relations. For example, for an entity type Employee, there may also be different types of employees such as receptionists, managers, cleaners and so on. These are all a subset to the entity Employee, meaning that every member of these entities is also a member of the Employee entity. Therefore, these entities are called a subclass or subset of the Employee entity type, which in turn makes the Employee entity a superclass or a supertype.

The process of defining a set of subclasses of a superclass is called specialization. Spe-cialization also helps with establishing specific attributes of each subclass and establishing relations among other entity types or subclasses.

Generalization is the inverse of specialization. Namely, it defines a superclass from a set of entity types. For example, a dog, a house, and a car all have an owner. Therefore, it is pos-sible to create the entity type Owner from this common attribute. Since generalization is the inverse of specialization, it is also possible to say that dog, house, and car is a specialization of Owner.

Lastly, the concept of union or category types deals representing more than one super-class. For example, if the superclasses School, Household, and Company exist. Then for a database containing computer id:s, an owner could be any one of the School, Household, and Company. Therefore, it is possible to create a category named Owner, which is a subclass of the union between the superclasses Furniture, Household electronics, and Estates for this purpose.

Keys and Functional Dependencies

One part of designing a database is determining which attributes can uniquely identify records in the database. Attributes that can identify records are called keys. There are many types of keys, whereas some of them are candidate key, primary key, superkey and foreign key [3].

A superkey is a key that can uniquely identify a record and contains one or more attributes in the table to do so. A candidate key is a key that has the minimum amount of attributes of all superkeys. There can exist multiple candidate keys. Attributes that are part of some candidate key is called prime attributes and attributes that are not in any candidate key is called non-prime attributes. A primary key is a chosen candidate key that is used to identify unique records in the database.

Sometimes you want to reference another table. Referencing another table is done by us-ing a foreign key. A foreign key is a key that connects one table to another table, establishus-ing a relationship between the tables. The foreign key must reference the other tables primary key.

These keys are not only used to identify records but also in functional dependencies. A functional dependency (FD) is a constraint between attributes in database tables [3]. I.e., they define relationships in a database. FDs are used to determine the normal form of the database schema. There are six inference rules, which among other things are used to make the normal form of the database better.

1. Reflexivity: If Y is a subset of X, then X -> Y. 2. Augmentation: If X -> Y, then XZ -> YZ. 3. Transitivity: If X -> Y and Y -> Z, then X -> Z. 4. Union: If X -> Y and X -> Z, then X -> YZ.

5. Decomposition: If X -> YZ, then X -> Y and X -> Z. 6. Pseudotransitivity: If X -> Y and WY -> Z, then XW -> Z.

(14)

Normal Forms and Normalisation

To make sure the database schema is of high quality, the schema has to be in at least third normal form, which it gets to by going through normalisation. The four most critical normal forms for a high-quality schema are the first normal form, second normal form, third normal form, and Boyce-Codd normal form.

Normal Forms

The first normal form (1NF) states that each attribute in a schema has to be atomic [3, 8]. I.e., the attribute cannot be divided into multiple values. E.g., a schema containing a single attribute "Person" (see figure 2.1a) which contains data about a person; name, height, weight and hair colour. This schema is not in 1NF, but can effortlessly become that by distributing the attribute into multiple attributes, as seen in figure 2.1b.

(a) Database schema that is not in 1NF. _{(b) Database schema that is in 1NF.}

Figure 2.1: Example of 1NF of two database schemas.

The second normal form (2NF) states that the schema must be in 1NF and also that every non-prime attribute is fully functionally dependent on the entire primary key [3, 8]. I.e., the primary key must be able to determine the values of every non-prime attributes. E.g., in figure 2.1b (which is in 1NF), the database schema would not be in 2NF if the name was a primary key since there could be multiple people who have the same name. The same goes for all other attributes. By adding a unique identifier to the schema and using that attribute as the primary key, the schema would be in 2NF (as seen in figure 2.2).

Figure 2.2: Database schema in 2NF using ID as primary key.

The third normal form (3NF) states that the schema must be in 2NF and also that there are no non-prime attributes that are transitive dependent on the primary key [3, 8]. This means that no non-prime attributes should be able to determine another non-prime attribute. E.g., figure 2.3a shown a schema of people’s graduation, and where ID is the primary key. The table is in 2NF and not in 3NF because the non-prime attribute School can determine in which city the school is. To make the schema in 3NF, the tables have to be split into two tables as seen in figure 2.3b.

(15)

2.5. Object-Relational Mapping

(a) Database schema that is not in 3NF. (b) Database schema that is in 3NF.

Figure 2.3: Example of 3NF of two database schemas.

Boyce-Codd normal form (BCNF) is a strong version of 3NF. It states that the schema must be in 3NF and also that, for every functional dependency X -> Y, X is a superkey.

Normalization

Normalization is about reducing data redundancy and inconsistency [8, 9]. Data redundancy is having the same data in multiple places in the database. It should be reduced because it makes insertions, deletions and updates difficult. Insertions fill the database with more and more of the same data, which consumes disk space. When deleting records, the problem of accidentally removing information about other things arises. When we delete the last record in the table that holds redundant data, the redundant data is also deleted. I.e., the last record that we delete also deletes the only information we have about the redundant data. When updating, all records that have redundant data have to be updated. If the tables hold millions of records, then this update take longer to complete.

Normalization solves this problem by splitting the table that has redundant data into two tables. The redundant data is then saved in one place, meaning that it is not repeated when inserting new records, it is not accidentally deleted when removing the last record in the other table, and only one record needs to be updated.

2.5 Object-Relational Mapping

Object-relational mapping (ORM) is a way to convert between rows in a relational database to an object in a programming language [10]. The ORM sits between the programmer’s appli-cation and the database and helps the programmer by abstracting the SQL queries to access, remove and modify the data in the database.

2.6 SQLAlchemy

SQLAlchemy is a Python package that helps the programmer by abstracting the tedious parts of creating databases. Instead of writing SQL queries, the programmer writes Python classes for the design of the tables and functions to access/modify the tables, and then let SQLAlchemy take over. SQLAlchemy uses ORM to translate from Python objects to SQL (and vice versa). The programmer can then query the database by issuing queries via SQLAlchemy’s built-in functions. An example of usage of SQLAlchemy can be seen in listing 2.1 with its ER-diagram in figure 2.4.

SQLAlchemy can be used with a wide variety of DBMS. To be able to connect and query the database, the programmer has to install a database driver for the DBMS that is used. A database driver is a piece of software that acts as a middleman between the programmer’s application and the database. Its job is to convert the SQL queries sent by the application to the underlying format the database is using. These drivers can, just like SQLAlchemy, be installed with pip. SQLAlchemy supports the following DBMS out-of-the-box.

(16)

2.6. SQLAlchemy

• Firebird1- open-source relational DBMS developed by Firebird Project in 2000. • Microsoft SQL Server2_{- relational DBMS developed by Microsoft in 1989.}

• MySQL3_{- relational DBMS developed by Oracle Co. in 1995.}

• Oracle4- multi-model DBMS developed by Oracle Co. in 1979.

• PostgreSQL5 - open-source object-relational DBMS developed by PostgreSQL Global Development Group in 1996.

• SQLite6_{- public relational DBMS developed by Dwayne Richard Hipp in 2000.}

• Sybase7database - relational DBMS developed by Sybase in 1980’s.

Figure 2.4: ER-model of the SQLAlchemy example.

Listing 2.1: Example of SQLAlchemy usage. # i m p o r t s q l a l c h e m y

from sqlalchemy import Column , I n t e g e r , S t r i n g , Date , ForeignKey

from sqlalchemy import c r e a t e _ e n g i n e

from sqlalchemy . e x t . d e c l a r a t i v e import d e c l a r a t i v e _ b a s e

from sqlalchemy . orm import sessionmaker engine = c r e a t e _ e n g i n e ( ’ s q l i t e : / / / :memory : ’ ) Base = d e c l a r a t i v e _ b a s e ( ) S e s s i o n = sessionmaker ( bind=engine ) # d e s i g n a t a b l e c l a s s Room( Base ) : __tablename__ = ’ room ’ id = Column ( I n t e g e r , primary_key=True )

room_name = Column ( S t r i n g ( 2 0 ) , unique=True , n u l l a b l e = F a l s e )

def _ _ i n i t _ _ ( name ) : room_name = name

c l a s s Course ( Base ) :

1_{URL: https://www.firebirdsql.org/, Accessed: 2018-10-18}

2_{URL: https://www.microsoft.com/en-us/sql-server, Accessed: 2018-10-18}

3_{URL: https://www.mysql.com/, Accessed: 2018-10-18}

4_{URL: https://www.oracle.com/database/index.html, Accessed: 2018-10-18}

5_{URL: https://www.postgresql.org/, Accessed: 2018-10-18}

6_{URL: https://www.sqlite.org/index.html, Accessed: 2018-10-18}

(17)

2.6. SQLAlchemy __tablename__ = ’ c o u r s e ’ id = Column ( I n t e g e r , primary_key=True ) course_name = Column ( S t r i n g ( 5 0 ) , n u l l a b l e = F a l s e ) c o u r s e _ i d = Column ( S t r i n g ( 1 2 ) , unique=True , n u l l a b l e = F a l s e ) def _ _ i n i t _ _ ( name , c i d ) : course_name = name c o u r s e _ i d = c i d c l a s s L e c t u r e ( Base ) : __tablename__ = ’ l e c t u r e ’ id = Column ( I n t e g e r , primary_key=True ) c o u r s e = Column ( I n t e g e r , ForeignKey ( ’ c o u r s e . i d ’ ) , n u l l a b l e = F a l s e ) room = Column ( I n t e g e r , ForeignKey ( ’ room . id ’ ) , n u l l a b l e = F a l s e ) date = Column ( Date , n u l l a b l e = F a l s e )

def _ _ i n i t _ _ ( c , r , d ) : c o u r s e = c

room = r date = d # c r e a t e t h e t a b l e

Base . metadata . c r e a t e _ a l l ( engine ) #Add new rooms

room1 = Room( " room1 " ) room2 = Room( " room2 " ) room3 = Room( " room3 " )

S e s s i o n . add ( room1 ) S e s s i o n . add ( room2 ) S e s s i o n . add ( room3 ) S e s s i o n . commit ( ) #Add new c o u r s e s c o u r s e 1 = Course ( " I n t r o . C++ " , " ABC123 " ) c o u r s e 2 = Course ( " I n t r o . Python " , " ABC484 " ) S e s s i o n . add ( c o u r s e 1 ) S e s s i o n . add ( c o u r s e 2 ) S e s s i o n . commit ( ) #Add new l e c t u r e s l e c t u r e 1 = L e c t u r e ( c o u r s e 1 . id , room2 . id , " 2018´10´15 " ) l e c t u r e 2 = L e c t u r e ( c o u r s e 1 . id , room3 . id , " 2018´10´18 " ) l e c t u r e 3 = L e c t u r e ( c o u r s e 2 . id , room1 . id , " 2018´11´5 " ) S e s s i o n . add ( l e c t u r e 1 ) S e s s i o n . add ( l e c t u r e 2 ) S e s s i o n . add ( l e c t u r e 3 ) S e s s i o n . commit ( ) # P r i n t a l l room names f o r e v e r y l e c t u r e i n t h e d a t a b a s e f o r row in S e s s i o n . query ( L e c t u r e s ) . a l l ( ) :

room = S e s s i o n . query (Room ) . f i l t e r (Room . id == row . room ) . f i r s t ( )

(18)

2.6. SQLAlchemy

# D e l e t e some o f t h e rows S e s s i o n . d e l e t e ( l e c t u r e 3 ) S e s s i o n . d e l e t e ( room1 ) S e s s i o n . commit ( )

(19)

3 Method

This chapter explains the method used to answer the research questions "How well do different database management systems perform compared to each other using SQLAlchemy?" and "How does SQLAlchemy, using MySQL as database management system, compare to a pure MySQL implemen-tation?". Section 3.1 discusses design choices for the implementation, and section 3.2 explains the testing procedure for the research questions.

3.1 Design

The design of the database is split into a database design and database interface. These two are connected as the database interface uses the tables designed in database design when querying the database.

In this section, we will discuss the design choices made when implementing the database design and the core functionality of the database interface.

Database Design

The database design used for our research questions was made using a requirement specifi-cation. This requirement specification was created with the idea that Linköping University wants to keep track of their software, courses, and computer rooms. It also wants to keep track of any connections between them such as which course uses which software and so on. Therefore, the database design was made with the general idea being.

The database shall store all available software, all courses run by the university and all rooms that have computers connected to the system in them. The database shall also store information about which available software has been requested by which course, the people that can administer the courses, and the installed software for each room.

From this general idea, a database design was created using relational models and nor-malisation to reach 3NF. The design was then made into an EER-model which can be seen in figure 3.1. Since most of the data in this database are in a string format, which is hard to use as an identifier when working with databases. The primary keys in the database are named ID, while all names ending with an id are foreign keys except for course_id in Course. All foreign keys in the design are set to cascade on delete. Therefore, whenever a record in

(20)

3.1. Design

Figure 3.1: EER-model of the database design

a foreign table is deleted, all record with a foreign key to that record’s attribute will also be deleted.

Lastly, the database is set so that auto-commit is on. This means that whenever a change is done to any of the tables, it is immediately committed, which in turn means that the database is better suited to conduct quick changes and not suited for changes made in batches.

Database Interface

The database interface uses the database design to create functions that make the correct database queries so that the user does not have to write them every time. Therefore, for every table in figure 3.1 there exists an insert function, a delete function, an update function, and a find function.

The insert function is simple since it merely has to insert a record into the table. For tables such as Available Software, Room, and Course the insert is simple. While for the other tables the insert also has to consider the foreign keys and make sure they are correct before inserting the record.

The delete function has to consider a few key points when deleting a record form a table. For example, Installed Software has foreign keys to both the Room and Available Software tables for its attributes. This means that if a record in Available Software exists in Installed Software and the delete function tries to delete said software from Available Software. It has to first delete all records with that software from Installed Software before deleting the record in Available Software. Otherwise, it will cause constraint errors because one or more record’s foreign key in Installed Software has lost their source. This is usually a pretty complicated situation when deleting records as several queries have to be made. However, since we use cascade on delete in our database design, we do not have to worry about this and can thus delete records in all tables without having to worry about these constraint errors.

(21)

3.2. Testing

Table 3.1: Hardware and OS used for testing the DBMS. Component Model

CPU Intel Core i3 2310M @ 2.1GHz

RAM 4 GB Samsung SODIMM DDR3 @ 1333 MHz

SSD Samsung EVO 840 250GB

OS Ubuntu 18.04

The find function is merely searching for specific attribute values in a table and returning the results. In Available Software or any other table that doesn’t have foreign keys, it isn’t necessary to do anything complex. However, in a table like Installed Software, depending on the input, it has to first figure out the proper id:s for the targeted software and room, before performing the search.

The update function first searches for a targeted record before changing some of its values. The first part is just like the find functions, while the second part merely changes the values before committing the changes to the database.

3.2 Testing

This section discusses the testing to be able to answer the research questions. The database design and interface were implemented in both Python using SQLAlchemy and SQL for MySQL. For the first research question, we chose to use MySQL, PostgreSQL and SQLite as DBMS. We chose these DBMSs because they are available and compatible with SQLAlchemy. For the second research question, we went with MySQL because we had previous experience with it. The hardware and operating system used for testing can be seen in table 3.1.

Data generation

To be able to perform tests, data for filling the database had to be generated. We created a script for generating data, that accepted an integer N for how many records we wanted to generate. The tables in the database had relatively many string attributes. We, therefore, decided that each string-attribute should be filled with its maximum length. The generated string used a variety of upper case and lower case characters, as well as digits. After the script had generated N records, it wrote the records to a file on the disk in JSON format. Listing 3.1 shows an example of how data was generated for one of the tables.

Listing 3.1: Data generating example

def generate_room_data (N) : room_names = [ ]

f o r _ in range (N) :

room_name = rand om_text_generator ( l e n g t h =12) room_names . append ( room_name )

data = { " rooms " : room_names } f i l e n a m e = s t r (N) + " _room . j s o n "

w r i t e _ j s o n _ d a t a _ t o _ f i l e ( filename , data )

Time test

For both research questions, we ran time-tests for inserts, deletes, updates and search using Python’s time library. We measured time because we believe it is one of the most relevant metrics for database testing. For inserts, we used the size N=100, 1000, 10000 and 100000.

(22)

3.2. Testing

For deletes, updates and search we used 50%, 25% and 20% of N respectively. Data was generated for these sizes of N for all types of operation. We ran the test three times for each type of operation and on all DBMS for N=100, 1000 and 10000 to get an average. Because of limited time, we were only able to run the test for N=100000 once.

Listing 3.2 shows an example of how the insert test was carried out for the SQLAlchemy implementation. The MySQL implementation looks very similar with the only difference be-ing how the generated data was loaded and how they were inserted. When loadbe-ing the data for the SQLAlchemy implementation, the data was read from the file and stored in memory as a dictionary of lists for each table, whereas the MySQL implementation loaded the file names that the data was stored in into a list. When they then were inserted, the SQLAlchemy indexed on the dictionary to get the correct list of data for each table, whereas the MySQL implementation iterated through the list of files for each table and inserted them via shell through Python using the function shown in listing 3.3.

Listing 3.2: Insert time test example for SQLAlchemy implementation.

def i n s e r t _ t e s t (N) : # Load t h e g e n e r a t e d d a t a data = g e t _ j s o n _ d a t a (N) # Measure t h e t i m e t h e i n s e r t t a k e s t a r t _ t i m e = time . time ( ) i n s e r t _ r e c o r d s ( data ) end_time = time . time ( )

# C a l c u l a t e t h e a c t u a l t i m e i t t o o k t o t a l _ t i m e = end_time ´ s t a r t _ t i m e

r e t u r n t o t a l _ t i m e

Listing 3.3: Insert for the MySQL implementation.

def i n s e r t _ r e c o r d s ( s q l _ f i l e s ) :

f o r f i l e in s q l _ f i l e s :

command = " mysql ´´l o g i n ´path= e x j o b b ´D e x j o b b < " + f i l e s u b p r o c e s s . run ( command , s h e l l =True )

Code comparison

For the second research question, we wanted to compare the code for both the SQLAlchemy implementation and the MySQL implementation. We did this because we wanted to get a feeling of how complex the code itself was and it there was any significant difference between the two implementations. We created a script that, given a file, calculated the number of lines, the number of lines that were comments, and the number of lines that were expressions. The script ran on both the design and interface file for both implementations.

(23)

4 Results

This chapter presents the results from the testing. Section 4.1 presents the results for the time tests and section 4.2 presents the results for the code comparison.

4.1 Time results

Table 4.1, 4.2, 4.3, and 4.4 shows the time it took to insert, delete, search and update for all test runs and sizes. The column Pure MySQL is the MySQL implementation, and the rest of the columns are the SQLAlchemy implementation.

Figure 4.1, 4.2, 4.3, and 4.4 compares the average time for insert, delete, search and update. Note that the figures are log-scaled on the Y-axis.

Figure 4.1 compares the average insert time for all sizes. Overall, we see similar trends for the SQLAlchemy implementation’s DBMSs for all sizes. PostgreSQL has the lowest insert time, SQLite has the largest insert time, and MySQL stays somewhere in between the two other. The MySQL implementation has an entirely different story. It starts of being faster than all DBMSs of the SQLAlchemy implementation, but as insert-size increases, so does the insert-time. At size 100000, the MySQL implementation performs the worst out of all DBMS.

Table 4.1: Result data from the insert time tests

DMBS SQLite (sec) PostgreSQL (sec) MySQL (sec) Pure MySQL (sec) 100 Test-run 1 10.447 4.864 7.498 4.193 Test-run 2 10.757 4.825 7.562 4.188 Test-run 3 10.960 5.117 7.597 4.272 1000 Test-run 1 112.882 49.286 76.189 46.590 Test-run 2 110.997 48.299 76.224 46.737 Test-run 3 109.198 47.953 77.213 45.819 10000 Test-run 1 1171.439 483.592 777.567 794.455 Test-run 2 1208.083 479.709 814.619 840.806 Test-run 3 1197.372 480.752 815.737 847.164 100000 Test-run 1 12649.407 5056.615 8450.248 38292.714

(24)

4.1. Time results

Table 4.2: Result data from the delete time tests

Table 4.3: Result data from the search time tests

Table 4.4: Result data from the update time tests

(25)

4.1. Time results 100 1000 10000 100000 101 102 103 104 Size A verage time (seconds)

SQLite PostgreSQL MySQL Pure MySQL Figure 4.1: Average insert time.

In figure 4.2 the comparison of the average delete time for all sizes is shown. Regarding the SQLAlchemy, PostgreSQL has the lowest average time in all sizes while SQLite has the highest. However, concerning the MySQL implementation, that is not always the case. For the sizes of 100 and 1000, Pure MySQL has the lowest averages. However, during the rest of the sizes, PostgreSQL has the lowest average while MySQL has the third highest for 5000 and the highest for 50000.

50 500 5000 50000 100 101 102 103 104 105 Size A verage time (seconds)

SQLite PostgreSQL MySQL Pure MySQL Figure 4.2: Average delete time.

Figure 4.3 shows the average time the search-test took. The trends are similar for Post-greSQL and MySQL for all sizes. PostPost-greSQL is faster than MySQL. SQLite starts of being faster than the other SQLAlchemy implementations, but equally well or worse at larger sizes. The MySQL implementation out-performs the SQLAlchemy implementation for all sizes.

Figure 4.4, has the comparison of averages for updates. In the figure, it is possible to see the trend of PostgreSQL having the lowest average while SQLite has the highest regarding the

(26)

4.1. Time results 20 200 2000 20000 10´1 100 101 102 103 104 105 Size A verage time (seconds)

SQLite PostgreSQL MySQL Pure MySQL Figure 4.3: Average search time.

SQLAlchemy implementation. The MySQL implementation has the lowest average during the first half of sizes while it has the highest during the latter half.

25 250 2500 25000 100 101 102 103 104 Size A verage time (seconds)

SQLite PostgreSQL MySQL Pure MySQL Figure 4.4: Average update time.

Figure 4.5, 4.6, 4.7, and 4.8 compares how many records per second insert, delete, search and update does. In figure 4.5, 4.6 and 4.8, the trends are more or less identical for all DBMSs and sizes. PostgreSQL dominate the SQLAlchemy implementations, MySQL comes in second place, and SQLite is at the bottom. The MySQL implementation out-performs the SQLAlchemy implementation at lower sizes but is out-performed at bigger sizes. Figure 4.7 shows the number of records searched per second. We can see that SQLite performs the best out of the SQLAlchemy implementation and that the MySQL implementation dominates the SQLAlchemy implementation by a large margin.

(27)

4.1. Time results 100 1000 10000 100000 5 10 15 20 25 Size Recor ds per second

SQLite PostgreSQL MySQL Pure MySQL Figure 4.5: Inserts per second.

50 500 5000 50000 0 5 10 15 20 Size Recor ds per second

SQLite PostgreSQL MySQL Pure MySQL Figure 4.6: Deletes per second.

(28)

4.1. Time results 20 200 2000 20000 0 50 100 150 200 Size Recor ds per second

SQLite PostgreSQL MySQL Pure MySQL Figure 4.7: Searches per second.

25 250 2500 25000 0 5 10 15 20 Size Recor ds per second

SQLite PostgreSQL MySQL Pure MySQL Figure 4.8: Updates per second.

(29)

4.2. Code comparison results

Table 4.5: Result data from the code comparison test.

File Lines Comments Expressions

SQLAlchemy database design 101 0 101

MySQL database design 73 0 73

SQLAlchemy database interface 675 132 543

MySQL database interface 531 118 413

4.2 Code comparison results

Table 4.5 shows the results of the code comparison. We can see that the MySQL implementa-tion has fewer lines of code in both the design file and the interface file than the SQLAlchemy implementation.

(30)

5 Discussion

This chapter discusses the results and method in section 5.1 and 5.2 respectively. The work in a wider context is also discussed in section 5.3.

5.1 Results

Research question 1

A small surprise that happened during testing was the results for PostgreSQL. We had as-sumed that PostgreSQL and MySQL would be almost equal in time performance. However, the results show that PostgreSQL outperformed MySQL by large margins in every category tested. This could be due to the difference in translation from Python to the DBMS done by SQLAlchemy. It could also be related to the driver used when sending the queries to the DBMS. Therefore, this result does not mean that PostgreSQL is better than MySQL as a DBMS, but that SQLAlchemy is better at using PostgreSQL than MySQL.

There is another DBMS other than SQLite, PostgreSQL and MySQL called Firebird that we wanted to test as well. Because of how complicated it was to set up Firebird and the fact that we did not get it to work, we chose to not include it in the list of DBMSs we tested. The fact that we did not test Firebird does not change any of the results we got for the other DBMSs, but it would have been interesting to see how Firebird handled the test.

Research question 2

The most challenging part of implementing a design in two different languages is making the implementations as similar as possible.

One thing to note is that we believe that SQLAlchemy optimises the queries before ex-ecuting them, whereas we have not done any optimisation for the MySQL implementa-tion. Thus, the time it took to insert, delete and update records in the database with the MySQL implementation performs worse as the size increased. When looking at the result for the SQLAlchemy implementation, the time it took to do the same inserts, deletes and updates only get larger like you would expect, since more queries equal longer time. The SQLAlchemy implementation does not scale as poorly as the MySQL implementation does.

(31)

5.2. Method

Even though the MySQL implementation scaled poorly, it definitively performed the best when searching for records. This could be due to the way we searched for records in the SQLAlchemy implementation, or just that a pure MySQL implementation is faster than SQLAlchemy at searching.

The result from comparing the number of lines, comments, and expressions shows that Python code uses up more space than SQL. This is not that surprising as SQL is a language purely focused on database queries, unlike Python which has a wide array of uses. For exam-ple, in SQL you can do a lot of expressions and statements in a single query, while in Python it may be more efficient to split some expressions and statements into functions and then use the result to do the actual query. SQL also has handles errors for its queries automatically, unlike Python in which you have to manually handle any errors that pop up as well as pos-sibly doing a database rollback in some situations. These are some of the reasons that lead to Python taking up more lines and expressions.

Another thing to discuss regarding the difference between writing in Python or SQL is the ease of use. When combining a DBMS with another program, the programmer is bound to switch between using the SQL language and their other programming language. How-ever, if we take Python in this case, then with the possibility of using SQLALchemy it is no longer necessary to program in two languages at the same time. For example, when using SQLAlchemy, it is possible to use Python syntax when creating query functions. This means that the programmer is not only on home-ground but that the programmer can combine other Python functionalities when issuing the queries. This would arguably increase the effective-ness of the programmer compares to needing reasonable knowledge on both SQL and Python to accomplish the same task. All that is needed for a programmer to use SQLAlchemy is basic knowledge about databases and queries, and SQLAlchemy will handle the rest.

5.2 Method

Hardware

One interesting factor is the hardware used for testing. We only had access to an old laptop, meaning that the results could potentially be better if we had a better computer to run the tests for all DBMSs on. If someone were to replicate our work on other hardware, the results could vary depending on the what hardware is used. On the other hand, the results should always be the same on a particular hardware set-up, no matter what the hardware is.

Database management system

Like mentioned earlier, we chose SQLite, PostgreSQL and MySQL as our DBMSs. We chose these because they were available and compatible with SQLAlchemy. If we had had more time and money, we would have liked to test all the other DBMSs mentioned in 2.6 to get an even better understanding of which DBMS performed the best using SQLAlchemy.

Time testing

Limited time was the biggest factor when investigating the first research question. This was for both getting the average time the test took and testing an even more massive amount of records to be inserted, deleted, updated and searched.

We wanted to get the average time it took to insert, delete, update and search records, which we were only able to get three samples of. To get the true average time it took, we would have liked to get at least 30 samples.

We were only able to test the DBMSs for sizes of 100, 1000, 10000, and 100000. Initially, we wanted to also test for size of 1000000. After the test for 100000 was done, we calculated that

(32)

5.3. The work in a wider context

in the best case scenario, testing for size 1000000 would have taken about one month for all DBMSs, which we did not have time for.

5.3 The work in a wider context

Comparing databases against each other does not entail any ethical discussions in our opin-ion. However, data collection is strongly connected to our work in a wider context, since most data is stored in databases. In the modern digital era, companies collect data about people who visit the companies website or use their products. It is important that people can control the information that is gathered and how it is used. An example is when Facebook let third-party app-developers collect data which they didn’t need. This led to the Cambridge Analytica scandal [11, 12] where an app-developer provided personal data about Facebook users to Cambridge Analytica, which they used for marketing purposes. In May 2018, the EU legislation General Data Protection Regulation (GDPR) was enforced, which purpose is to help people have better control over their information online. We believe this is a step in the right direction, however, it is still is not enough as GDPR is only enforced in the EU as opposed to being a worldwide regulation.

(33)

6 Conclusion

This chapter concludes the thesis by answering the research questions. Section 6.1 concludes which DBMS performed the best regarding time, and section 6.2 concludes the code compar-ison.

6.1 Time test

For the first research question, the results show that PostgreSQL was the fastest in the major-ity of all tests. PostgreSQL only fell short on the search test where SQLite was faster at lower sizes. Therefore, we conclude that PostgreSQL is the best option for our database design and SQLAlchemy implementation.

For the second research question, the MySQL implementation performed better than the SQLAlchemy implementation at lower sizes but worse at larger sizes. I.e., the MySQL imple-mentation scaled worse than the SQLAlchemy impleimple-mentation. We believe this is because SQLAlchemy optimises its queries, which we did not do for the MySQL implementation. Therefore, we conclude that SQLAlchemy is the better option for the average SQL-user.

6.2 Code comparison

As the results show, the MySQL implementation can be written with fewer lines of code. With this said, the difference between the MySQL implementation and the SQLAlchemy im-plementation is not big enough to exclude SQLAlchemy as an option. We, therefore, conclude that SQLAlchemy is the better option since it performed better as the time test concluded.

6.3 Future work

For future work on this project, it may be interesting to consider adding more relational databases in the comparisons that SQLAlchemy supports but we could not afford. This will give a better perspective on the optional DBMS when using SQLAlchemy. Also, using non-relational databases such as NoSQL is another option to study as it is considered more mod-ern than relational databases.

(34)

6.3. Future work

Another thing to consider would be to increase the number of tables in the database used in this research. The current database is very small compared to big enterprises. Therefore, it would be good to also run these tests on an enterprise sized database too to get results closer to reality.

It would also be worthwhile to find other tools similar to SQLAlchemy. That way, it would be possible to perform the same research on those tools as for SQLAlchemy and thus create the opportunity to compare SQLAlchemy with its possible competitors.

Lastly, it would be of great interest to further investigate the reasons for the results re-garding the Pure MySQL implementation and possibly correcting any issues found. This is to either give a concrete reason for the nonlinear results found in 4 or redo the tests with a better version of the MySQL implementation.

(35)

Bibliography

[1] Gene Bellinger, Durval Castro, and Anthony Mills. “Data, information, knowledge, and wisdom”. In: (2004).

[2] Keith F Punch. Introduction to social research: Quantitative and qualitative approaches. sage, 2013.

[3] Ramez Elmasri and Sham Navathe. Database systems : models, languages, design and application programming. Boston, Mass. ; Singapore : Pearson, cop. 2011, 2011. ISBN: 9780132144988.URL: https : / / login . e . bibl . liu . se / login ? url = https : //search.ebscohost.com/login.aspx?direct=true&db=cat00115a&AN= lkp.575458&lang=sv&site=eds-live&scope=site.

[4] Rick Cattell. “Scalable SQL and NoSQL data stores”. In: Acm Sigmod Record 39.4 (2011), pp. 12–27.

[5] Neal Leavitt. “Will NoSQL databases live up to their promise?” In: Computer 43.2 (2010). [6] Edgar F Codd. “A relational model of data for large shared data banks”. In:

Communi-cations of the ACM 13.6 (1970), pp. 377–387.

[7] Toby J Teorey, Dongqing Yang, and James P Fry. “A logical design methodology for relational databases using the extended entity-relationship model”. In: ACM Computing Surveys (CSUR) 18.2 (1986), pp. 197–222.

[8] William Kent. “A simple guide to five normal forms in relational database theory”. In: Communications of the ACM 26.2 (1983), pp. 120–125.

[9] Moussa Demba. “Algorithm for relational database normalization up to 3NF”. In: In-ternational Journal of Database Management Systems 5.3 (2013), p. 39.

[10] Elizabeth J O’Neil. “Object/relational mapping 2008: hibernate and the entity data model (edm)”. In: Proceedings of the 2008 ACM SIGMOD international conference on Man-agement of data. ACM. 2008, pp. 1351–1356.

[11] Robinson Meyer. The Cambridge Analytica Scandal, in Three Paragraphs. 2018. URL: https : / / www . theatlantic . com / technology / archive / 2018 / 03 / the -cambridge - analytica - scandal - in - three - paragraphs / 556046/(visited on 02/09/2019).

(36)

Bibliography

[12] Alvin Chang. The Facebook and Cambridge Analytica scandal, explained with a simple dia-gram. 2018.URL: https : / / www . vox . com / policy - and - politics / 2018 / 3 / 23/17151916/facebook- cambridge- analytica- trump- diagram(visited on 02/09/2019).

Comparing database management systems with SQLAlchemy : A quantitative study on database management systems

Linköping University | Department of Computer and Information Science

Bachelor thesis, 16 ECTS | Datateknik

2019 | LIU-IDA/LITH-EX-G--19/074--SE

Comparing

database

man-agement

systems

with

SQLAlchemy

A quantitative study on database management systems

Jämförelse av databashanterare med hjälp av SQLAlchemy

Marcus Fredstam

Gabriel Johansson

Upphovsrätt

Copyright

Acknowledgments

Contents

List of Figures

List of Tables

1

Introduction

1.1

Background

1.2

Aim

1.3

Research questions

1.4

Delimitations

2

Theory

2.1

Data

Qualitative data

Quantitative data

2.2

Database

Relational data model

Object data model

Object-Relational data model

NoSQL model

2.3

Database Management Systems

2.4

Database design

Entity-Relationship and Enhanced Entity-Relationship modeling

Keys and Functional Dependencies

Normal Forms and Normalisation

2.5

Object-Relational Mapping

2.6

SQLAlchemy

3

Method

3.1

Design

Database Design

Database Interface

3.2

Testing

Data generation

Time test

Code comparison

4

Results

4.1

Time results

4.2

Code comparison results

5

Discussion

5.1

Results

Research question 1

Research question 2

5.2

Method

Hardware

Database management system