Authors: Albin Ljungdahl & Simon Bäcklund
Supervisor: Daniel Martinsson
LNU Supervisor: Morgan Ericsson
Examiner: Jonas Lundberg

Semester: Spring 2021

Engineering Degree Project

Data storage for a small lumber

processing company in Sweden


Abstract

The world is becoming increasingly digitized, and with this trend comes an increasing need for data storage for companies of all sizes. For smaller enterprises, this can prove to be a major challenge due to limitations in knowledge and financial assets. The purpose of this study is therefore to investigate how smaller companies can satisfy their need for data storage, and which database management system to use, so that these shortcomings do not hold their development and growth back. To fulfill this purpose, a small wood processing company in Sweden is examined and used as an example. Literary research is conducted to gain knowledge about data storage and the different options that exist. Microsoft Access, MySQL, and MongoDB are selected for evaluation, and their performance is compared in controlled experiments. The results of this study indicate that, due to the small amount of data that the example company possesses, the simplicity of Microsoft Access trumps the high performance of its competitors.

However, with increasingly developed internet infrastructure, hosting a database in the cloud has become feasible. If a cloud-hosted database is the desired solution, Microsoft Access has a higher operating cost than the other alternatives, making MySQL come out on top.

Keywords: SME, DBMS, Database, MySQL, Microsoft Access, MongoDB, relational database, non-relational database, cloud, DBaaS, data storage


Preface

The authors of this paper would like to thank Småland Timber, and especially Daniel Martinsson, for giving us their time and allowing us to use them as an example for this paper. We would also like to thank our supervisor Morgan Ericsson for his guidance and expertise during this study.


Contents

1 Introduction
  1.1 Background
  1.2 Related work
  1.3 Problem formulation
    1.3.1 List of requirements
    1.3.2 Research Questions
    1.3.3 Expected result
  1.4 Motivation
  1.5 Milestones
  1.6 Scope/Limitation
  1.7 Target group
  1.8 Outline

2 Theory
  2.1 SMEs and digitization
  2.2 Relational DBMS vs non-relational DBMS
    2.2.1 Microsoft Access
    2.2.2 MySQL
    2.2.3 MongoDB
  2.3 Measuring Values for evaluating a DBMS
  2.4 Cloud-based database
    2.4.1 Security
    2.4.2 Manage a database

3 Method
  3.1 Research Project
  3.2 Method
    3.2.1 Literature review
    3.2.2 Controlled experiments
  3.3 Reliability and Validity
  3.4 Ethical considerations

4 Implementation
  4.1 Microsoft Access
  4.2 MySQL
  4.3 MongoDB
  4.4 Data insertion
    4.4.1 Design decisions
    4.4.2 Script
  4.5 Graphical interface

5 Experimental Setup, Results, and Analysis
  5.1 Performance Experiments
    5.1.1 Setup
    5.1.2 Selecting data
    5.1.3 Inserting data
    5.1.4 Database size
  5.2 Cloud hosting
    5.2.1 Microsoft Access
    5.2.2 MySQL
    5.2.3 MongoDB
    5.2.4 Summary

6 Discussion
  6.1 Choosing the right DBMS
  6.2 Choosing the right hosting option

7 Conclusion
  7.1 Future work

References

Appendix A SSTE Small
Appendix B SSTE Large
Appendix C Sort & Mottagningsplats
Appendix D Average Brutto
Appendix E Sum Pris


1 Introduction

In today’s digitizing society, many small companies face the difficult challenge of keeping up with changing technology. According to Faller and Feldmüller, becoming skilled in the applications and technologies of modern industry is especially hard for small and medium-sized enterprises (SMEs) [1]. This often forces these companies to develop and implement their own computer systems, due to the large cost of buying an existing one.

These companies also usually cannot afford to hire a designated programmer to create the systems. Naturally, this might lead the person developing these systems to keep them as simple as possible. This raises the question: are these companies hindered by the simplicity of their self-developed systems, or is it that very simplicity that lets them keep up with the ever-evolving technology?

This paper presents a case study of Småland Timber, a small company in Rydaholm, Sweden, investigating different options for how this business can keep up with the evolving needs and technologies in the field of data storage. Today, Småland Timber uses an old database, self-developed and maintained by the company itself and built using a simple system. The goal of this paper is to investigate and present different options for data storage for this company and, based on different requirements, determine which solution suits them best.

1.1 Background

As the world we are living in becomes increasingly digitized, the need for data storage among companies of all sizes is ever-increasing. Between the years 2005 and 2020, the total amount of data in the world grew by a factor of 300 [2], and there is no indication of this increase slowing down. These days there are many options for how to store data, and for SMEs the choice of storage method might fall on the first option rather than the one that suits them best. If the database used to store the data is developed by a self-taught employee besides their normal duties, the company might end up with a database that is sub-optimal for its needs, developed and operating in a sub-optimal manner while handling an increasing amount of data.

Småland Timber is a small factory working with wood processing. All the lumber that arrives is measured by an impartial party, called Biometria, in order for the transactions to be fair. The measurements and data surrounding each transaction are manually inputted and stored in an old database, developed in 1998 using Microsoft Access. The database is stored on a server in the office facilities of the factory. It has been patched and maintained since its creation to accommodate Småland Timber’s changing needs over the years. It now needs to be replaced to keep up with the growth of the company as well as the digitizing society. Småland Timber thus stands before a choice: should they stick with a Microsoft Access database, which their economist can continually maintain and alter based on their needs, or is it time to find another, more powerful alternative? Additionally, should the database stay on an in-house server, or is it time to follow the trend and move to a cloud-based solution?

1.2 Related work

This study addresses questions related to digitization, databases, and cloud hosting. These are active research topics, and many articles investigate them. Below are a few articles that are relevant to this study:


• Eyada, Saber, El Genidy, and Amer present a comparison between two different database management systems (MongoDB and MySQL) and describe how the comparison is performed. In addition to measuring the performance of the two database systems on different sizes of data, they also present two prediction models for assisting in deciding which database will perform best on a specific size of data. The results of their study showed that MongoDB outperformed MySQL in terms of latency and resources used as the size of the data increased [3].

• Vrchota, Volek, and Novotná perform research with the purpose of identifying factors that affect SMEs’ adaptation to modern society. The research was performed through case studies on 72 different SMEs. The studies revealed several factors that complicate SMEs’ ability to adapt to increasing digitization, for example the lack of a written strategy, employees lacking the needed expertise, and financial limitations [4].

• Čerešňák and Kvet evaluate and compare the performance of relational versus non-relational databases. Their experiments included multiple database systems of both types and were conducted first on a small amount of data and then on a larger quantity. They concluded that on a huge amount of data non-relational databases will outperform relational databases, but on a smaller set the difference is not as noticeable [5].

• Sultan explores the option of hosting a database in the cloud as an SME. Even though some concerns regarding cloud computing are highlighted in the article, Sultan concludes that cloud computing is suitable for SMEs thanks to the increased flexibility and pay-as-you-go financial model. The article also includes a case study of an SME with just 20 employees, where it is concluded that cloud hosting largely decreases management costs and that support for future growth is the main advantage of cloud computing [6].

1.3 Problem formulation

The purpose of this study is to investigate how a small wood processing company can keep up with the progressing technology in the area of databases and satisfy its need for data storage. This is done in cooperation with Småland Timber, a small lumber processing company which is used as an example. Literary research is performed to gain knowledge about databases, different options for database management systems (DBMSs), and how to evaluate a database. A few selected DBMSs are implemented and evaluated based on several measuring points, and the best solution for Småland Timber, and for other companies in a similar situation with resembling needs, is presented. Solutions for hosting the database are also investigated, and the most fitting solution for Småland Timber is presented.

Many articles have been written about the issue of SMEs trying to keep up with evolving technology. These are mainly focused on the whole aspect of digitization and general strategies for SMEs to adopt. This report has a more in-depth focus specifically on how SMEs should handle data storage, as this is what Småland Timber requires in their operation.

1.3.1 List of requirements

The requirements from Småland Timber on the final product are specified as follows:


• A database which stores data related to the measuring reports from Biometria.

• A graphical user interface to be able to manually enter new data, view existing data, and create reports.

• An automatic way of inserting data into the database.

• A long-term solution for how and where the database should be hosted.

• Småland Timber needs to be able to maintain and improve upon the final product.

The main part of the product is the database used to store measuring reports from Biometria. The measuring reports consist of raw data such as numbers and text, which does not require a large amount of storage space. The data is to be stored indefinitely and is mainly used for transactions between Småland Timber and their suppliers.

Another important part of the project is to develop a graphical user interface which the employees will use to access the data, compile reports, and manually input data into the database. The reports created from the graphical user interface are a vital part of Småland Timber’s operations and are used to report to the suppliers how much timber was purchased.

The current solution of manually inputting the data from the measuring reports into the database is very time consuming, so a high priority for Småland Timber is to automate this process. This will free up time for Småland Timber to work on more crucial tasks.

At the moment, the existing database is hosted locally on their premises, and the server on which the database is hosted is typically replaced in favor of superior hardware every 3 years. As cloud hosting has become more and more popular, this option will be investigated to determine whether it is a more suitable alternative for Småland Timber than local hosting.

Bearing in mind that Småland Timber has no dedicated IT staff, simplicity has to be prioritized in all parts of the system in order to allow for day-to-day maintenance and slight changes. Some tasks will require Småland Timber to hire a consultant, but this should not be necessary on a weekly basis.

1.3.2 Research Questions

The research questions of this study are derived from the requirements specified in the previous section and are as follows:

• Which DBMS is most appropriate for Småland Timber’s operation?

• How should server hosting be handled at Småland Timber, onsite or cloud-based?

1.3.3 Expected result

The primary expected result of the study is a comparison between an updated version of Småland Timber’s existing Microsoft Access database and other solutions for data storage. Advantages and disadvantages of each option will be brought forward. Based on the comparison, a recommendation for which solution best fits Småland Timber’s needs and requirements is to be presented. A recommendation regarding whether or not hosting the database in the cloud is suitable for Småland Timber will also be produced.


1.4 Motivation

As previously mentioned, Småland Timber is not the only company in need of evolving its technical solutions. For a company with a limited amount of IT knowledge, updating existing IT solutions is a daunting task. The motivation behind this report is to shed some light on manageable ways to improve technical solutions and keep up with the ever-moving technology, and why this is important for SMEs.

1.5 Milestones

For the study to be completed on time, it has been broken down into smaller parts, represented as the milestones below, which were completed in order.

M1 Literary research regarding SMEs and digitization
M2 Literary research regarding relational vs non-relational DBMSs
M3 Literary research regarding important measuring values of a database
M4 Literary research regarding onsite vs cloud-based hosting
M5 Research which DBMSs will be analyzed in this study
M6 Automate data retrieval from Biometria
M7 Build a database in Microsoft Access
M8 Build two additional databases using other DBMSs
M9 Test performance of the databases produced
M10 Investigate the possibility of a cloud-hosted solution
M11 Present finished solution to Småland Timber
M12 Hand in finished report

1.6 Scope/Limitation

Småland Timber’s resources are limited and do not allow for an IT professional to manage and maintain the solution presented in this study. Thus, the finished product needs to be easy for a layman to use, maintain, and understand. This is the reason why Microsoft Access is one of the DBMSs included in this study: Småland Timber uses Access in their existing database, and since they are familiar with it, it is natural to include this DBMS as one of the options to evaluate and to use as a baseline regarding simplicity.

Another limitation of this paper is that, due to time constraints, only three databases could be constructed using various DBMSs. The DBMSs chosen for comparison are Microsoft Access, MySQL, and MongoDB. The motivation behind these choices is as follows: Microsoft Access is included due to Småland Timber’s familiarity with it, and MySQL and MongoDB are two of the most widely used DBMSs at the time of writing [7]. They are also different types of DBMSs, with MySQL being a relational DBMS and MongoDB a non-relational DBMS, so both types are included in the experiments. There is a possibility that the DBMS that would best fit the needs of Småland Timber, and of SMEs in general, is not one of the systems included in this study. Due to this limitation, the choices of DBMSs were made to cover as broad a range of the available DBMSs as possible.

There are many aspects to a company’s technical solutions, such as data analysis, automation in production, internal systems, and so on. This study is limited to examining Småland Timber’s solutions regarding data storage. The motivation behind this decision is that, within the time frame of this study, it is not possible to look at every aspect, and data storage was the most pressing matter for Småland Timber at the time of the study.

1.7 Target group

The target group that could be interested in the results of this paper is most, if not all, SMEs in a similar situation to Småland Timber: in need of new technical solutions, but with the existing alternatives on the market out of the financial scope of their business. Larger enterprises typically have more financial strength and can therefore afford a proper IT department to develop their systems, or even outsource the entire development of a solution. Hence, this paper is not targeted towards those companies.

1.8 Outline

The structure of the rest of the paper is as follows: Chapter 2 describes the theories and important concepts as well as related work in this field. Chapter 3 introduces and presents the methods used to handle the problems defined in Section 1.3. In Chapter 4, thorough descriptions of the different practical implementations in this study are presented. Chapter 5 contains the experimental setup, the results obtained from these experiments, and an analysis of each result. Chapter 6 contains a discussion of the findings and whether or not the problems at issue in this paper have been answered. Finally, Chapter 7 describes the conclusions of this study and how it is relevant for society.


2 Theory

2.1 SMEs and digitization

An SME is a small or medium-sized enterprise. According to the European Commission, "Small and medium-sized enterprises (SMEs) represent 99% of all businesses in the EU" [8]. The primary factors deciding whether a business belongs in the SME category are the number of people employed by the enterprise and either turnover or balance sheet total [8]. The exact intervals for each category can be seen in Figure 2.1.

Figure 2.1: Definition of an SME [8]

Digitization has been a hot topic in the manufacturing world during the 2010s, and all around the globe enterprises strive to keep up with this evolution, which has been particularly challenging for SMEs. Large amounts of money are invested in new systems, robotics, and other technologies to increase production and expand enterprises. Evolving a company’s technological solutions in this way requires a large amount of competence and copious amounts of money to invest in the area. SMEs cannot afford the luxury of being at the front of this evolution, lacking both the staff needed to research new solutions and technologies and the money to risk investing in new technologies with little to no testing in the field. This leads to some SMEs being left with older software that hinders their growth [1].

2.2 Relational DBMS vs non-relational DBMS

DBMSs differ in one main respect: whether they are relational or non-relational. The relational database was first invented by Edgar F. Codd in the 1970s and was immediately recognized as a large improvement over the non-relational alternatives of the time. A relational database divides the data into tables consisting of columns and rows. Each column represents a field describing the data in that column, and the rows are records of this particular data. These fields can then be used as search keys to match records in different tables and find the desired data. During the 1980s this model became the preferred one, and non-relational databases fell out of favor [9].
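As an illustration of the relational model described above, the following sketch uses Python's built-in sqlite3 module. The tables, fields, and values are hypothetical examples invented for this sketch, not Småland Timber's actual schema.

```python
import sqlite3

# A relational database divides data into tables of columns (fields)
# and rows (records). Hypothetical example schema:
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE supplier (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE delivery (id INTEGER PRIMARY KEY, "
            "supplier_id INTEGER, volume REAL)")

cur.execute("INSERT INTO supplier VALUES (1, 'Forest AB')")
cur.execute("INSERT INTO delivery VALUES (10, 1, 42.5)")

# A field (supplier_id) acts as a search key matching records
# across tables:
cur.execute("""
    SELECT supplier.name, delivery.volume
    FROM delivery JOIN supplier ON delivery.supplier_id = supplier.id
""")
print(cur.fetchall())  # -> [('Forest AB', 42.5)]
```

The same table/row/key structure underlies Microsoft Access and MySQL; only the surrounding tooling differs.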

During the 21st century, the amount of data to store has increased substantially, and because of this the need for improved ways of scaling has emerged [10]. Scaling can be done in one of two ways: vertically or horizontally. Vertical scaling means adding more power, in terms of better hardware, to an existing storage unit, while horizontal scaling refers to adding additional storage units [11]. Naturally, there is an upper limit to how much the hardware can be improved, while no such limit exists when it comes to adding storage units.

One drawback of relational databases is the fact that they can only scale vertically, while non-relational databases can also be scaled horizontally. Instead of storing data in static tables, non-relational databases make use of more dynamic approaches such as key-value pairs, document-based, column-based, or graph-based storage [5]. Static tables are good for storing structured data which always has the same attributes and format. For unstructured data with varying attributes, the non-relational way of storing it is generally advantageous. With the increasing need for highly scalable databases, non-relational DBMSs have reemerged as a viable option, and each of the two variants now has its own strengths and weaknesses.

In this paper, three different DBMSs have been chosen for evaluation: Microsoft Access, MySQL, and MongoDB. Microsoft Access is a relational database chosen because it is the database used at Småland Timber today. MySQL and MongoDB are a relational and a non-relational database, respectively, and as of April 2021, MySQL is ranked second among the most used databases in the world while MongoDB is ranked fifth, first among non-relational databases [7]. This implies that both of these DBMSs have a bright future and will remain relevant for some time to come, hence the choice to study these systems in this paper.

2.2.1 Microsoft Access

Microsoft Access was first introduced in 1992. It focuses on being an easy-to-use DBMS with a self-explanatory graphical user interface that allows users to create and design their own tables, forms, and queries without prior programming knowledge. Microsoft Access can also be used to create a front-end, not only for an Access database but also for other DBMSs such as MySQL [12]. In the previously mentioned list of the most popular DBMSs, Microsoft Access ranks ninth, indicating that it is widely used in the industry even though it has clear drawbacks, for example a maximum database size of 2 GB and a maximum of 255 concurrent users [13]. These limitations will most likely make a large corporation shy away from this DBMS, but might not affect an SME’s choice.

A clear downside to Microsoft Access is that it is not free software; a Microsoft 365 Business Standard license is required to use it [14]. Although cost may be a deciding factor in many decisions for a company, especially an SME, it might not carry as much weight in this case, since the license also provides access to other Microsoft products such as Word, Excel, and Teams. Because of this, the Microsoft 365 Business Standard license is likely to be purchased by the company regardless of whether or not they intend to use Access.

A factor to consider when using any software is the longevity of its support. Apart from Access, Microsoft also maintains a DBMS called SQL Server, which is more powerful than Access. This raises the question of whether Microsoft will keep developing two separate DBMSs or direct their customers towards one of them.

For a brief time, Microsoft offered a cloud version of Access called Access Web Apps, but this was shut down in April 2018 [15], further fueling the thought that support for Access may be discontinued. However, Microsoft has made no such claims at the time of writing.

2.2.2 MySQL

MySQL, developed by the two Swedes Michael Widenius and David Axmark, was released in 1995 [16] and is, as previously mentioned, a widely popular DBMS based on the relational model. As the name suggests, it is based on the Structured Query Language (SQL) [3]. MySQL is fast, free to use, has a low overhead cost, and can store up to 8 terabytes of data per table [17]. This has led to MySQL having a large community around it and makes it suitable for corporations of all types and sizes in need of a relational DBMS, resulting in its popularity. MySQL is free and available for anyone to use without any sort of license [18].

To connect to the MySQL server, a tool called Open Database Connectivity (ODBC) is used. ODBC is a standardized Application Programming Interface (API) that allows an application to communicate with a DBMS. For example, it could be used by Microsoft Access to connect to tables in a MySQL database instead of having to copy the data and import it into Microsoft Access tables [19]. ODBC consists of four components: an application, a driver manager, one or more drivers, and a data source. The application is a program of some sort, and the data source is a database that the application wishes to access. The driver manager is a library responsible for the communication between the application and a driver. When the application wants to perform some action on the database, it contacts the driver manager, which allocates a driver for that task. The driver takes the action that the application wants to perform, processes and submits the call to the data source, and returns the results to the application [20]. ODBC is useful since it allows an application to communicate with a database; however, it adds another layer between the application and the database, which could affect performance.
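The application → driver → data source pattern that ODBC standardizes is mirrored by Python's standard database API. Since a live MySQL server cannot be assumed here, the sketch below uses the built-in sqlite3 module as a stand-in data source; the commented pyodbc call and its connection string are hypothetical and would require an installed MySQL ODBC driver.

```python
import sqlite3

# With a real MySQL server, an application would go through an ODBC
# driver, roughly (hypothetical connection string, pyodbc package):
#   conn = pyodbc.connect(
#       "DRIVER={MySQL ODBC Driver};SERVER=...;DATABASE=...;UID=...;PWD=...")
# Here sqlite3 stands in for the driver + data source; the application
# code is the same either way.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE report (id INTEGER, brutto REAL)")
# The driver layer forwards each call to the data source and returns
# the results to the application:
cur.execute("INSERT INTO report VALUES (?, ?)", (1, 17.3))
cur.execute("SELECT brutto FROM report WHERE id = ?", (1,))
print(cur.fetchone())  # -> (17.3,)
```

This separation is what lets a front-end such as Access work against MySQL tables without importing the data.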

2.2.3 MongoDB

MongoDB is the world’s most popular non-relational DBMS. It uses documents to store the data instead of tables, as a relational database would [21]. Documents are sets of key-value pairs, and in MongoDB the documents are identical to JavaScript Object Notation (JSON) objects, making them intuitive for many programmers to work with. The documents are stored as Binary JSON (BSON) objects to optimize speed and space. The schema of these documents may vary from document to document and can even change during run time [22]. This is a stark contrast to how relational databases work, where the schema of the tables may be hard to alter after inserting data. Just like MySQL, MongoDB is free and anyone can use it without a license [23].
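The flexible document schema can be sketched with plain Python dictionaries, which map directly to JSON objects. The field names below are invented for illustration, not Biometria's actual report format, and the list merely stands in for a MongoDB collection.

```python
import json

# Two documents in the same collection may have different schemas,
# unlike two rows in a relational table. Hypothetical fields:
doc_a = {"report_id": 1, "supplier": "Forest AB", "brutto": 42.5}
doc_b = {"report_id": 2, "supplier": "Timber Co",
         "notes": "re-measured"}  # extra field, and no 'brutto'

collection = [doc_a, doc_b]  # a list standing in for a collection

# With a real server and the pymongo client this would roughly be
#   collection.insert_one(doc_a)
# and each document would be stored in binary form as BSON.
print(json.dumps(doc_a))
```

The point of the sketch is that no table schema had to be altered to accommodate the second document's extra field.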

2.3 Measuring Values for evaluating a DBMS

There are many different DBMSs available today, and choosing between them may prove difficult. It is therefore necessary to have a strategy for evaluating and comparing the different DBMSs in order to make a well-informed decision. A good strategy is to define measuring values: concrete numbers that can be compared and plotted in figures to visualize which system performs best. So what measuring values are there?

Eyada, Saber, El Genidy, and Amer performed an experiment where large amounts of data were recorded from different sensors and stored in two different DBMSs, which were then compared. The values examined were the latency of insertion operations, the size of the databases, and the latency of selection operations [3].

Another experiment was performed by C. Győrödi, Dumse-Burescu, Zmaranda, Gabor, R. Győrödi, and Pecherle, where the performance of a relational database was compared to that of a non-relational database. The databases were compared based on latency for different operations, both inserting and retrieving data using queries of varying complexity [24].

In 2015, Lourenço, Cabral, Carreiro, Vieira, and Bernardino performed extensive comparisons between several different NoSQL DBMSs. As different DBMSs excel in different use cases, the purpose of the experiment was to determine which DBMS was most suitable for which scenario. The DBMSs were compared on the following software quality attributes: availability, consistency, durability, maintainability, performance, reliability, robustness, scalability, stabilization time, and recovery time [25].

Veen, Waaij, and Meijer compared three DBMSs, one SQL and two NoSQL, across several different scenarios, where the main measuring value was operations per second. The experiments were performed on both physical and virtual servers to also examine how this affects performance [26].

To summarize, the most frequent measuring value for DBMS performance is latency, for both inserting and retrieving values. A large variety of other measuring values exist, but latency is the most widely used.
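The latency measurements summarized above can be sketched as follows. The sketch times insert and select operations with `time.perf_counter`, using the built-in sqlite3 module as a stand-in DBMS; the table and data are invented, and the same pattern applies to MySQL or MongoDB through their respective client libraries.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE t (id INTEGER, val REAL)")

# Latency of an insertion workload: wall-clock time around the operation.
start = time.perf_counter()
cur.executemany("INSERT INTO t VALUES (?, ?)",
                [(i, i * 0.5) for i in range(10_000)])
insert_latency = time.perf_counter() - start

# Latency of a selection workload, measured the same way.
start = time.perf_counter()
rows = cur.execute("SELECT * FROM t WHERE val > 100").fetchall()
select_latency = time.perf_counter() - start

print(f"insert: {insert_latency:.4f}s, "
      f"select: {select_latency:.4f}s, rows: {len(rows)}")
```

In practice such measurements would be repeated many times and averaged, since a single run is sensitive to caching and scheduling noise.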

2.4 Cloud-based database

Since the dawn of databases, the standard location for stored data has been a server (a computer with specific hardware and software to optimize a specific service) on the company’s premises, often in a dedicated server room. This introduced a number of questions for the enterprise, such as: How big should the storage be? How much potential growth should be accounted for when buying new hardware? The enterprise also needed to store backups of this data in a different room or an entirely different location to account for a potentially catastrophic event, like a fire or an earthquake. Nowadays, when all companies have access to the Internet, and with the speed at which data can be uploaded and downloaded, many of these issues can be solved by hosting the database online, in the "cloud".

Hosting a database in the cloud can be done in two ways: on a virtual machine, or through Database as a Service (DBaaS). Hosting the database on a virtual machine means creating a virtual version of a server located in the cloud and then setting up and maintaining the database on this virtual server. In this way, the database is hosted in the cloud but still maintained and updated by the customer. With DBaaS, the cloud service provider (CSP) takes full responsibility for maintaining the database, and the customer only pays per use and does not have to worry about maintenance and security [27]. DBaaS relieves some of the difficulties of having a cloud database, as most of the management and responsibility is transferred to the CSP, who specializes in this field.

At the time of writing, there are plenty of CSPs to choose from, and each offers many choices regarding processor power and storage size. Consequently, the company does not have to predict how much storage and power will be needed in the foreseeable future and can instead increase these as the company’s needs grow. A "cloud database" is a database that is both built and accessed through the internet. It is hosted on the CSP’s servers, and the benefits, apart from easier scaling, are that the database is accessible from anywhere at any time and that the CSP takes care of backups and security of the data [28]. Some examples of CSPs are Amazon Web Services (AWS), Microsoft Azure, and Google Cloud [29]. A drawback of using cloud-based data storage is that a different enterprise handles your potentially sensitive data, and you have no control over who has access to the data or whether it is securely backed up [30]. This is an important issue a company has to consider before fully committing to cloud-based storage.

2.4.1 Security

With a database, there are several security concerns regarding the data. Some of these are: platform vulnerabilities (for example, the operating system of the machine might have security flaws), SQL injections (harmful SQL commands targeting the data in a database), denial of service (an attack preventing users from accessing the database), weak authentication (login credentials obtained by someone unauthorized, granting them access to the data), and backup data exposure (backups of the data in the database might be damaged or stolen) [31]. Staying updated with the latest software and protecting the data against various threats is essential to avoid these types of attacks, and the trend indicates that they will only increase. Damages caused by cyber-attacks globally cost around 3 trillion dollars, and this value is predicted to increase to around 10.5 trillion dollars by 2025 [32]. These concerns are something an SME without a functioning IT department might never have heard of, and most, if not all, of them can be minimized or even completely taken care of by a CSP [33].
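One of the threats listed above, SQL injection, can be made concrete with a short sketch. It uses Python's built-in sqlite3 module and an invented users table; the same contrast between string-built and parameterized queries applies to MySQL and other SQL databases.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE users (name TEXT, secret TEXT)")
cur.execute("INSERT INTO users VALUES ('alice', 's3cret')")

malicious = "x' OR '1'='1"

# Unsafe: concatenating user input into the query text changes the
# query's meaning, so the condition is always true and every row
# (including other users' data) is returned.
unsafe = cur.execute(
    f"SELECT * FROM users WHERE name = '{malicious}'").fetchall()

# Safe: a parameterized query treats the input as a plain value,
# so the injection attempt matches nothing.
safe = cur.execute(
    "SELECT * FROM users WHERE name = ?", (malicious,)).fetchall()

print(len(unsafe), len(safe))  # -> 1 0
```

Consistently using parameterized queries is one of the few injection defenses available even to an SME without dedicated security staff.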

2.4.2 Managing a database

Managing a database involves a plethora of tasks, from making sure to use appropriate hardware that can store enough data while at the same time being fast enough, to controlling the software side of things. As previously mentioned, security loopholes in software are detected all the time, which might force the enterprise to make changes to how the database is structured. The data to be saved in the database and the specific needs of the company may also change over the years. For an SME without a dedicated IT department, answering these needs for change could prove difficult and also costly if a consultant needs to be hired, which is why the authors in [6] conclude that cloud computing could become an attractive option for SMEs.


3 Method

The method chapter is divided as follows: Section 3.1 briefly describes what method is used to fulfill each milestone. Section 3.2 provides a detailed description of how each method is applied. Section 3.3 includes a discussion regarding the reliability and validity of this report and the research project performed. Finally, Section 3.4 discusses the ethical considerations that have been made during the study.

3.1 Research Project

Milestones 1-4 (described in Section 1.5) are handled by consecutive literature reviews where the majority of the material consists of peer-reviewed articles to achieve high reliability. Milestone 5 is also accomplished through a literature review, but in this case the result is primarily based on statistical reports stating which DBMSs are most widely used in the world. The motivation behind this is that it is more relevant to examine and evaluate the most popular DBMSs, rather than some anonymous outliers that are rarely used.

Milestone 6 is a practical task: a script is implemented which automatically transfers data from Biometria into the database. Milestones 7 and 8, similarly to milestone 6, are practical assignments where databases are constructed. The DBMSs for these databases are chosen based on the result of milestone 5. Milestone 9 consists of controlled experiments where the performance of the databases is tested and compared. The motivation behind this milestone is to determine if one of the DBMSs significantly outperforms the other two in this setting. Milestone 10 is achieved by researching what options there are for hosting the database in the cloud. Milestones 11 and 12 consist of presenting the study and the results to Småland Timber and to the examiners from Linnaeus University.

3.2 Method

In this section, the different methods are described both in general and in terms of how they are applied to this specific study. The practical tasks for milestones 7, 8, and 9 are described in detail in Chapter 4.

3.2.1 Literature review

A literature review implies that data from what others have published is studied and analyzed to gain an understanding of a specific area or subject. It is important to define what terms and keywords the review will revolve around and what important criteria the results must adhere to. The terms and keywords are used for finding articles, and the criteria are used to choose which ones are relevant to the study [34].

The literature review carried out to complete milestones 1-4 (Section 1.5) is performed using literature found through the scientific search engines OneSearch and Google Scholar.

For milestone 1, "Literary research regarding SMEs and digitization", the keywords "SME" and "digitization" are used together to find relevant articles regarding this subject. This is a popular topic and there is no absence of relevant articles to pick from. Milestone 2, "Literary research regarding relational vs non-relational DBMS", is completed in a similar fashion to the previous one, but the keywords used to find relevant articles are not as tightly defined. Instead, they are centered around DBMS, relational database, and non-relational database.


The important measuring values researched during milestone 3 are defined by reviewing scientific papers comparing different DBMSs, to identify which measuring values are usually measured and compared when looking at DBMSs. This is also a field where a lot of research has been done, so the keywords used to find articles regarding this topic are also flexible and solely centered around different popular DBMSs and the keyword "compare". For milestone 4, "Literary research regarding onsite vs cloud-based hosting", the result of the literature review does not only come from the two scientific search engines but also from online websites, to acquire a deeper understanding of what cloud-based hosting implies, which options exist and how they differ, and what the pricing model looks like. To find relevant sources of information for this topic, different keywords around onsite and cloud-based hosting are used.

3.2.2 Controlled experiments

Controlled experiments are performed to measure quantitative data and be able to prove a hypothesis. They are conducted in a controlled environment where specific aspects of the system are measured. Before the experiment, it is important to define the dependent and independent variables, where dependent variables are the ones being measured and independent variables are altered to affect the result [35].

The controlled experiments performed to accomplish milestone 9 are set up so that anyone who tries to replicate them would reach the same results as in this paper. In general, when performing controlled experiments it is important that no matter who performs an experiment and how many times it is performed, the result should always remain the same. Therefore, the experiments carried out during the research for this paper are conducted exclusively through software, to avoid human interference. The software produced to monitor the performance must be able to measure the important measuring values decided on during milestone 3, as well as being able to perform the same tests on all the different databases produced.

To perform the experiment where the performance of the different DBMSs in this setting is measured, a script is created by the authors of this paper. The measuring values decided on are:

• Speed of inserting data into the database

• Speed of retrieving data from the database

• Final size of the database

Speed of inserting data and checking the final size of the database are both rather straightforward to test, as it is a matter of telling the database what data to store in which place, or inserting a lot of data and measuring the size. In contrast, the speed of retrieving data is more complicated, since this task could involve a simple job such as returning all data in a field within a table or collection, or something more difficult such as combining different tables or collections, saving all rows where a condition is met, and then returning the average of all instances of one field. Therefore, several types of queries are constructed to test different aspects and scenarios and give a fair evaluation of each DBMS. How the database and the queries are constructed is, needless to say, also important, and it is paramount that they are as identical as possible across all three databases.

Småland Timber does not have a large amount of data, so in order to highlight the performance differences and make them more palpable, more data is generated by tweaking and multiplying the data from Småland Timber. All tests are run three times each, and the average time is calculated and used as the result. This is done to minimize the impact of outliers that may occur.

The purpose of this paper and the question at hand is which DBMS is most suitable for an SME. With this in mind, the performance of the database might not be decisive in the final verdict, as the amount of data stored by an SME is typically not overly extensive. Despite this, it is always a good idea to have concrete values and facts to base a decision upon, and performance is easy to measure and gives concrete values to be evaluated.

A more detailed description regarding how the controlled experiment is set up and performed, as well as all the results acquired through the previously described methods, is presented in Chapter 5 "Experimental Setup, Results, and Analysis".

3.3 Reliability and Validity

A vital aspect of the study is to produce reliable and valid results, and this is kept in mind throughout the whole process. Despite this, some of the methods used in the study might introduce threats to this goal. For example, to fulfill milestone 4, both scientific and non-scientific articles are used, and there is always a possibility that non-scientific articles are unreliable or provide false information. To minimize this risk, the sources must be carefully evaluated and analyzed before their material is used. Another possible threat to validity is the controlled experiments. It is important that the experiments are performed identically for each of the databases and that as many sources of error as possible are removed. Therefore, the tests for this report involve no human interaction, and all the tests are performed on the same computer to maintain the same physical environment.

In order to perform correct and fair evaluations, all people involved are unbiased, with no affiliation to any of the DBMSs analyzed in this study. Without this, the results of the study could be deemed invalid in terms of reliability and validity. The study is not funded by any company, which could otherwise have an impact on the result, and there is no economic gain to be made by promoting one DBMS over another. Therefore, the experiments and their results can be trusted.

3.4 Ethical considerations

Considering the fact that all experiments performed in this study are conducted in software with no humans involved, the procedures and results of the experiments can be presented with total transparency without the need for any ethical considerations. However, the data within the database includes personal information about the suppliers of Småland Timber and is therefore not included in this report.


4 Implementation

The software implemented during this study consists of:

• An updated version of the Microsoft Access database used by Småland Timber

• A replica of the database in MySQL

• A replica of the database in MongoDB

• Automated data insertion from an Excel file to the database used by Småland Timber

• A user-friendly graphical interface in Microsoft Access

How the implementation process is handled and how it proceeded is described in the following sections.

4.1 Microsoft Access

Microsoft Access is a software that can be used both to implement the back end of a relational database, with all its tables and relations, and to create a user-friendly front end that displays the stored data in various forms and reports. This part of the implementation chapter concerns the back end; the front end is covered in Section 4.5.

The first thing to implement is a new database in Microsoft Access. The majority of the implementation process is not actual development, but is angled more towards investigating workflows and understanding how the database is to be used. Close cooperation with Småland Timber is therefore necessary to create a database fitted to their needs. A lot of design decisions are made by looking at the existing database, asking "how could this be done better?", and trying to improve upon it. Mostly it is a case of removing data redundancy by connecting tables using foreign keys. Data redundancy is when the same data is stored in several different places.
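The normalization idea can be sketched in a few lines. The example below uses an in-memory SQLite database as a stand-in for Access/MySQL, and the table and column names are simplified, ASCII-only inventions loosely modelled on those in this thesis: instead of repeating a sort's clear-text name on every report row, the name is stored once and referenced through a foreign key.

```python
import sqlite3

# Sketch only: in-memory SQLite stands in for Access/MySQL, and the
# schema below is a simplified invention for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only with this
conn.executescript("""
    -- Sort codes and their clear-text names live in ONE place...
    CREATE TABLE tblSort (
        code TEXT PRIMARY KEY,
        cleartext TEXT
    );
    -- ...and measuring reports refer to them via a foreign key,
    -- instead of repeating the clear text on every row.
    CREATE TABLE tblMatbesked (
        id INTEGER PRIMARY KEY,
        sort_code TEXT REFERENCES tblSort(code),
        brutto REAL
    );
""")
conn.execute("INSERT INTO tblSort VALUES ('SAGTGRAN', 'Sawlog, spruce')")
conn.execute("INSERT INTO tblMatbesked VALUES (1, 'SAGTGRAN', 12.5)")

# The clear text is recovered with a join when it is actually needed.
row = conn.execute("""
    SELECT m.id, s.cleartext, m.brutto
    FROM tblMatbesked m JOIN tblSort s ON m.sort_code = s.code
""").fetchone()
print(row)  # (1, 'Sawlog, spruce', 12.5)
```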

Figure 4.2 displays how the finished product looks, with a complete view of all tables it contains and their respective relations. The central part of the database is the two tables "tblMätbesked" and "tblSkotad", as storing this data is the main purpose of the database. This is where the measurements mentioned in Section 1.1 are stored, and because of this, a big part of the implementation process went into deciding how this data should be stored and into what different tables the result of these reports should be split. The difference between the two is that the data in "tblSkotad" is a preliminary measurement done by the truck picking up the lumber where it was harvested, whereas "tblMätbesked" is a more precise measurement performed by the independent third party.

The development process in Microsoft Access became simple thanks to the user-friendly graphical interface, which is clearly catered towards users with no previous experience in developing databases. This is apparent from the large number of tools available for implementing different functionality in the database, such as relations and queries. For example, if a field should be restricted to only allow values that already exist in a different table, the "Lookup Wizard" can be used to accomplish this. Similarly, if a query is to be constructed to retrieve some data from the database, there is a wizard to help with this task as well. This level of simplicity makes the software easy and good for a beginner.


Figure 4.2: Relations in Microsoft Access database

However, for an experienced database manager, it is often time-consuming to have to click through these helpers every time.

4.2 MySQL

Since both Microsoft Access and MySQL are relational databases based on the SQL language, the implementation process of the MySQL database became simple thanks to the fact that the database had already been created in Microsoft Access. Because a lot of time had been spent making sure that all the data types, restrictions, and table relations were proper, all of this could basically just be translated into a MySQL database. All of the tables look the same, have the same names, have the same fields, and have the same relations as in the Access database.

When developing the database in MySQL, MAMP and MySQL Workbench are used. MAMP is a collection of free software which can be used to create a local server environment [36]. This makes it possible to host a MySQL database on a local computer while still accessing it as if it were hosted on a server, which simplifies the development process. MySQL Workbench is a visual tool designed to help with developing databases by providing a graphical user interface. MySQL Workbench also supplies the user with a way to monitor the tables and the relations between them. Figure 4.3 displays how this looks and makes the similarities between the Microsoft Access database and the MySQL database apparent.

When developing a MySQL database there is no need for a user interface; everything could be done through queries. Doing it this way requires a high level of expertise from the developer and has a high entry barrier which is not beginner-friendly. Through the use of software such as MySQL Workbench, this threshold is lowered and the development process becomes more similar to Microsoft Access, shrinking the gap in the knowledge required to construct the database. Knowledge of the SQL programming language is still required, but the tools provided by MySQL Workbench facilitate the process.


Figure 4.3: Relations in MySQL

4.3 MongoDB

As previously described, MongoDB is different from the two other DBMSs in this study because it is a non-relational DBMS. The strength of a non-relational DBMS is that instead of consisting of tables with set fields, it stores data in collections that are flexible, and each document in a collection can contain widely different data. This is useful for unstructured data that varies in shape. The downside is that the database might look unstructured and data might be redundant.

MongoDB is not made to have different collections which are combined in queries, unlike relational databases. Therefore, two MongoDB databases were developed: one mimicking the relational databases, where data is divided into multiple collections to give it a bit more structure, and the other storing most of the data in one collection to cater to the strengths of MongoDB, but increasing data redundancy in the process. The structure of these two variations can be seen in Figure 4.4.

The schema and structure of the first MongoDB database are almost identical to those in Figures 4.2 and 4.3. The naming convention of prefixing each name with "tbl" is maintained for MongoDB as well, despite the database containing collections rather than tables.

This is decided to clearly show that a specific table in MySQL and Microsoft Access has a counterpart in MongoDB. The central collections "tblMätbesked" and "tblSkotad" store only the most essential data from the measuring reports, while more detailed information regarding objects, sorts, and suppliers is stored in separate collections.

Figure 4.4: Collections in MongoDB

Like relational databases, if the details for a specific measuring instance are wanted, this information is combined in a query. For example, in a measuring report the sort of timber measured is described both with a code and in clear text. The essential part of this information is the code, which is stored in "tblMätbesked" and "tblSkotad", while the more detailed information, the clear text, is stored in "tblSort". With this approach, storing the same data in multiple places is reduced, but if the clear text of a sort on a measuring report is desired, the two collections need to be combined.

In the second MongoDB database, all data received from the impartial measuring party is stored in "tblMätbesked" and "tblSkotad". This approach is more angled towards non-relational databases: all the details are present without the need for joining collections in a query. However, compared to the example above, both the code and the clear text of the sort would be stored in the central collections ("tblMätbesked" and "tblSkotad"), increasing redundancy.
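The difference between the two schema variants can be sketched with plain Python dictionaries standing in for MongoDB documents; the field names and values below are invented for illustration only.

```python
# Two ways one measuring report might be modelled in MongoDB.
# Field names and values are invented for this sketch.

# Variant 1: relational-style. The sort's clear text lives in a
# separate "tblSort" collection and is referenced by its code.
report_referenced = {"_id": 1, "sort_code": "SAGTGRAN", "brutto": 12.5}
sort_doc = {"code": "SAGTGRAN", "cleartext": "Sawlog, spruce"}

# Variant 2: denormalized. Everything is embedded in the report itself,
# so no join is needed, at the price of repeating the clear text.
report_embedded = {
    "_id": 1,
    "sort": {"code": "SAGTGRAN", "cleartext": "Sawlog, spruce"},
    "brutto": 12.5,
}

# Resolving the clear text for variant 1 requires combining documents:
lookup = {sort_doc["code"]: sort_doc["cleartext"]}
print(lookup[report_referenced["sort_code"]])  # Sawlog, spruce
```

In a real MongoDB deployment, the combination step for variant 1 would be expressed as an aggregation with `$lookup`, which is exactly the operation the experiments in Chapter 5 show to be costly.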

To develop a database in MongoDB, the database can either be hosted using MongoDB Atlas or on a local machine. MongoDB Atlas hosts the database in the cloud, and there is a free version where several databases share the same processor. For the performance evaluation between the different DBMSs, this is not a viable solution, as it would add time during network transfers, while waiting for the processor to become available, and so on. Because of this, the MongoDB database constructed during this study is hosted on a local machine.

MongoDB is similar to MySQL in that it does not require a graphical user interface for development; all actions can be performed through a console window or from an application. However, to get a good overview of the schema and data of the database, MongoDB Compass is used to provide a graphical user interface similar to MySQL Workbench.

4.4 Data insertion

The data insertion process at Småland Timber is today handled by a single employee who dedicates a substantial part of the week to this task. All measurement reports are received through traditional paper mail and manually inserted into the database, one report at a time. This is not only time-consuming but also prone to human error.

To improve and simplify this process, the authors of this paper developed a solution that instead fetches these reports in the shape of an Excel file and automatically inserts the data into the database. The third-party company, Biometria, which performs the measuring, also provides a solution for receiving this data automatically through an API. However, this solution is expensive: the initial setup cost is 16 236 SEK, with an additional 0.71 SEK for every measuring report delivered. This large start-up cost is too expensive for Småland Timber, which led them to a different solution in the form of extracting the data from Biometria as an Excel file before inserting it into the database.

The data insertion script is described in two steps. First, all major design decisions are motivated and explained, and then the flow of the script is described.

4.4.1 Design decisions

The first big design decision is which programming language to use for the script. The main criterion is that Småland Timber, with limited programming knowledge, should be able to grasp the flow of the script. Python sprang to mind early on because of its simple syntax. The authors are also already comfortable with Python, making it a strong alternative from the start. The second advantage of Python is its support for libraries. A library is a program that implements a specific functionality and can be imported to avoid having to implement that functionality again. Using libraries makes a program easier to read, since much of the implementation behind the scenes is hidden inside the library. Python is built around libraries, and for every functionality to implement there is likely a library for it.

Due to these advantages, Python is the obvious choice in this situation.

During the development process of the script, there are a lot of things to consider and design decisions to be made. Because the authors of this paper are only working on this project for a short period of time, after which the management and maintenance of the system is handed over to Småland Timber, simplicity and clarity in all software are critical for future manageability. With this in mind, the script created for data insertion is heavily commented, with a focus on readability and understandability instead of efficiency and conciseness. An example of this can be seen below in Listings 1 and 2, where the two code snippets produce the same results, but the code in Listing 1 is compact, which makes it less understandable for a beginner in Python. The code in Listing 2 is a more verbose solution with more clarity. Solutions like the one in Listing 2 are always chosen in the script produced, to increase readability.

first = datum(''.join(c for c in temp[-2] if c.isdigit()))
last = datum(''.join(c for c in temp[-1] if c.isdigit()))

Listing 1: Compact python example

first = ''
for c in temp[-2]:
    if c.isdigit():
        first += c
last = ''
for c in temp[-1]:
    if c.isdigit():
        last += c
first, last = datum(first), datum(last)

Listing 2: More readable python example

As previously mentioned, a big advantage of the Python programming language is the ease of importing different libraries. When importing libraries, it is important to remember that support for a library can cease at any time, and there is also a possibility that the creator of the library has malicious intent. Therefore it is necessary to do some background research and choose with care. The libraries imported for the script in this study are: Pyodbc, PyMongo, Tkinter, and Openpyxl. Pyodbc is used for connecting to a database, sending queries, and receiving data. Pyodbc is open source and publicly supported by Microsoft [37], indicating that the longevity of the library is reliable. Pyodbc is used in this study to connect to Microsoft Access and MySQL, but for MongoDB the library PyMongo is used instead. PyMongo is a simple and widely used library, and it is the recommended way of connecting to a MongoDB database from Python [38].

The library Tkinter is used for selecting which file to import to the database. As stated on the official Python website, "Tkinter is Python’s de-facto standard GUI (Graphical User Interface) package" [39], and since the file selector dialog window is part of the GUI, Tkinter is an obvious choice for implementing this functionality. Finally, Openpyxl is chosen so that the script is able to read Excel files. Openpyxl is also an open-source library, and its first version was released in August 2010 [40], suggesting that long-term support should be no issue.

When importing data from the Excel file with the script, there are two ways of determining what data is in which column. The first option is to rely on the columns always coming in a specific order, and the second is to rely on them always having the same column names. In this situation, the column names are determined by the third-party measuring company, Biometria, whereas the order in which the columns appear is decided by the person retrieving the data. There is no assurance that the columns will always be named the same by Biometria, so basing the script upon the column names would make it vulnerable. Relying on the order of the columns gives more control to Småland Timber, because they are the ones collecting the data, and therefore the script is based on column order instead of column names.
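The idea of position-based parsing can be sketched as follows. The column order below is an assumption made up for this example; the real order is decided by whoever exports the file from Biometria.

```python
# Sketch of reading a row by column POSITION rather than by header name.
# The column layout below is a hypothetical example, not the real export.
COLUMNS = ("report_id", "date", "sort_code", "brutto")

def parse_row(row):
    """Map a tuple of cell values to named fields by position."""
    return dict(zip(COLUMNS, row))

# A row as openpyxl might hand it over, as a plain tuple of cell values:
record = parse_row((1042, "2021-04-19", "SAGTGRAN", 12.5))
print(record["sort_code"])  # SAGTGRAN
```

If Biometria renames a header, this code is unaffected; it only breaks if the person exporting the file reorders the columns, which is under Småland Timber's own control.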

4.4.2 Script

The initial step of the script is to connect to the database and then open a file selector dialog, with the help of Tkinter, for the user to choose the Excel file from which to import data. The selected file is examined to make sure that it is of the correct file type and, assuming that it is, the script examines the Excel file and extracts four values: the start and end date of the period the measuring data covers, the row at which the actual data begins, and finally the row where the data ends. The following step of the script is to query the database for all data between the start and end date of the new data. This is done to later be able to check for duplicate entries without having to check through the entire database. This data is stored in an array, and it is now time to start entering data into the database. For each row of data in the Excel file, the data is stored in a variable, and a couple of the values in the row are compared to the data previously extracted from the database. These values need to be unique, and in the case of duplicate entries, the database can only save one of them.

Whenever a conflict occurs, the script halts and the user is asked how to proceed. The user is presented with four options: option 1 is to delete the existing row in the database and instead save the new data. Option 2 is the opposite: the new data is discarded. If the user wishes to save all the new data from the Excel sheet instead of the existing data, they can choose option 3, and option 4 does the same thing in reverse, discarding all new conflicting data and keeping the already existing rows. It is quite a simple flow, and the user's input is handled within a terminal window, where the user makes their choice by entering the number of the option they wish to choose followed by pressing the enter key.
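A hedged sketch of this conflict-resolution step is shown below. The real script prompts for input in a terminal; here the choice is passed in as a parameter instead, and the function and constant names are invented for the example.

```python
# Hypothetical names; the real script reads the choice from the terminal.
KEEP_NEW, KEEP_OLD, KEEP_ALL_NEW, KEEP_ALL_OLD = 1, 2, 3, 4

def resolve(choice, existing, incoming):
    """Return the row to keep for one duplicate, given the user's choice.

    Options 3 and 4 behave like 1 and 2 but would be remembered and
    applied to every remaining conflict in the import (not shown here).
    """
    if choice in (KEEP_NEW, KEEP_ALL_NEW):
        return incoming   # replace the stored row with the new data
    if choice in (KEEP_OLD, KEEP_ALL_OLD):
        return existing   # discard the new data, keep what is stored
    raise ValueError("choice must be 1-4")

print(resolve(1, existing={"id": 7, "brutto": 12.5},
              incoming={"id": 7, "brutto": 13.0}))  # keeps the new row
```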

4.5 Graphical interface

For this study, a simple graphical interface is implemented containing the most crucial parts needed to visualize the system. It is decided in conjunction with Småland Timber to go for a simplified approach for two reasons. The first reason is that the time frame for this study is rather limited, so the focus needs to be on functionality and robustness instead of aesthetics. The second reason is that Småland Timber wanted a stripped-down user interface covering the basic functionality to start with, with more functions added as they further develop it over time.

The graphical user interface is developed in Microsoft Access; since the employees at Småland Timber are familiar with this software, it is a natural choice. Implementing a graphical user interface in Access is based upon creating different forms and reports. For this simplified version, a list form and a detailed form are created for each table. The list form displays each row in a table in a simple list with only the most important information. The detailed form is displayed when a row is double-clicked, and this page lists all specific details about the row. There are also forms created for running the script that inserts data into the database.

Working in Microsoft Access to create a graphical user interface is simple and intuitive, but it also takes a lot of time. No prior knowledge is needed, and there is a button or wizard available for any functionality or element to be added to a form. The interface can also be connected either to a Microsoft Access database or to a database created using a different system, such as the MySQL database created in this study.


5 Experimental Setup, Results, and Analysis

5.1 Performance Experiments

The performance of the databases is measured in three different ways: the time to select data using varying queries, the time to insert data, and the storage size of the database. The experiments measuring these, as well as a brief analysis of the results, are presented in their own sections below.

5.1.1 Setup

All experiments were conducted on a Lenovo IdeaPad Y700-15ISK with the following specifications:

• Intel i5-6300HQ 2.3GHz CPU

• 8GB RAM

• Nvidia GeForce GTX 950M GPU

• Windows 10 Home 64-bit operating system

As mentioned in Section 3.2.2, all experiments are performed three times each in order to reduce the impact of outliers that may occur, and the final time presented is the calculated average of the three.

The data provided for this study includes every measuring report from 2018-08-13 to 2021-04-19, which is roughly 4000 instances. To amplify the differences in performance between the databases, more data is required, and therefore the data is replicated with minor changes. For the experiments regarding inserting data, the original data is multiplied 5 times, resulting in about 20 000 rows. Selecting data from the databases is generally quicker, and because of this, the original data is instead multiplied by 25, yielding slightly over 100 000 rows.

When performing the tests, all unnecessary processes and applications running on the computer are terminated, giving the tests full access to the computer's resources. All tests are written in Python using the code editor Visual Studio Code, and the time is measured using the library Time. Prior to each iteration of a test, a garbage collection is performed, and the garbage collector is then disabled to prevent it from running during the test and negatively affecting the results.
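A minimal version of this measurement harness might look as follows; the timed workload below is a stand-in, and in the study the function would instead issue a query via pyodbc or PyMongo against one of the three databases.

```python
import gc
import time

def benchmark(run_query, repetitions=3):
    """Run a function several times with garbage collection disabled
    and return the average wall-clock time in milliseconds."""
    times = []
    for _ in range(repetitions):
        gc.collect()    # clean up BEFORE the timed section...
        gc.disable()    # ...and keep the collector out of it
        start = time.perf_counter()
        run_query()
        times.append((time.perf_counter() - start) * 1000)
        gc.enable()
    return sum(times) / len(times)

# Stand-in workload; a real test would call the database here.
avg_ms = benchmark(lambda: sum(range(100_000)))
print(f"average over 3 runs: {avg_ms:.2f} ms")
```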

5.1.2 Selecting data

The experiment measuring how quickly each database can select data is performed by creating five different queries and testing them. Since most of the queries complete almost instantly, each query is run 100 times to increase the measured time. The queries have varying levels of complexity and perform different actions, in order to measure performance in different scenarios. They are listed below, and the code for all queries can be viewed in Appendices A to E.

• SSTE Small - Select IDs of all measuring reports where the type of log measured is "MAVSTLGRAN".

• SSTE Large - Select IDs of all measuring reports where the type of log measured is "SÅGTGRAN".

• Sort & Mottagningsplats - Select IDs of all measuring reports where the type of log is "KLENTTALL" and the receiving station is "ATA-TIMBER".

• Average Brutto - Calculate the average brutto of measuring reports where the type of log is the most frequently occurring type of log in all the measuring reports.

• Sum Pris - Sum up the "Pris"-column for every measuring report.

The five queries can be divided into two categories. The first three queries are similar: the main action is selecting instances based on one or two conditions, including eventual join statements. Queries four and five can be grouped together as well, as they both include some sort of aggregate function. The DBMSs will thus be evaluated on how they perform on the two different types of queries.
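The two categories can be illustrated with a small example; in-memory SQLite and invented table contents stand in for the three real databases, and the actual queries used in the experiments are found in Appendices A to E.

```python
import sqlite3

# Sketch only: in-memory SQLite with invented rows stands in for the
# real databases; column names mirror the simplified ASCII naming.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tblMatbesked (id INTEGER, sort_code TEXT, brutto REAL, pris REAL);
    INSERT INTO tblMatbesked VALUES
        (1, 'SAGTGRAN', 10.0, 100.0),
        (2, 'SAGTGRAN', 14.0, 150.0),
        (3, 'KLENTTALL', 8.0, 80.0);
""")

# Category 1: select rows matching a condition (like SSTE Small/Large).
ids = [r[0] for r in conn.execute(
    "SELECT id FROM tblMatbesked WHERE sort_code = 'SAGTGRAN'")]

# Category 2: an aggregate function (like Average Brutto / Sum Pris).
avg_brutto = conn.execute(
    "SELECT AVG(brutto) FROM tblMatbesked WHERE sort_code = 'SAGTGRAN'"
).fetchone()[0]
print(ids, avg_brutto)  # [1, 2] 12.0
```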

After performing the first test (SSTE Small), the database produced in MongoDB as a replica of the relational databases turned out to be inefficient when information from multiple collections was needed, as can be seen in Table 5.1. As mentioned in Section 4.3, combining multiple collections is not the strength of MongoDB, and it would therefore be unjust to evaluate it based on a sub-optimal schema. It was consequently decided to disregard this implementation of the database and proceed with evaluating the remaining three versions.

The results from the experiments can be found in Table 5.1.

Query                     Microsoft Access    MySQL    MongoDB    MongoDB (relational)
SSTE Small                           211         70          6                 559 388
SSTE Large                           123      1 280          7                       -
Sort & Mottagningsplats              185      2 284          4                       -
Average Brutto                    37 490     17 722     26 635                       -
Sum Pris                          10 943      3 619     17 393                       -

Table 5.1: Execution times for the five queries, measured in milliseconds.

To sum up the tests: MongoDB substantially outperforms its competitors on the first three queries, with Microsoft Access performing reasonably well overall and MySQL performing well on the first query but falling far behind on queries 2 and 3. For the last two queries, MySQL achieves the best results by a large margin, with Microsoft Access and MongoDB performing similarly. Overall, each of the databases has its own strengths and weaknesses for different tasks.

For the first three queries, which are mostly extraction of data, MongoDB is exceptionally quick. However, this comparison is not completely fair, since the two relational databases have to combine multiple tables to complete the task, whereas all the data needed is contained within a single collection in MongoDB. Even though relational databases are better at joining tables than MongoDB, the join is still work that has to be done and affects the result. Nevertheless, the result from the MongoDB database replicating the relational structure (559 388 milliseconds) makes it obvious that this implementation had to be excluded from the experiments due to its poor performance, and the experiment is as fair as possible in this setting.

One additional detail to note from the first three tests is that MySQL's performance decreases as the size of the result returned by a query increases. This is most evident when comparing "SSTE Small" and "SSTE Large", where the only difference between the two scenarios is that more rows match the search criteria in "SSTE Large". One possible reason is that the test script uses ODBC to connect


to the MySQL server to send queries and receive data, whereas Microsoft Access simply reads data from the Microsoft Access file. The ODBC layer adds an extra step to each exchange, and when large amounts of data are returned this adds a small amount of time per call. When the same query is repeated 100 times, this overhead accumulates, which is why MySQL performs worse when more data is returned.
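This reasoning can be made concrete with a toy model: a fixed per-call overhead, multiplied by 100 repetitions, dominates the total when the query itself is fast. All numbers below are invented for illustration, not measured values from the experiments.

```python
# Hypothetical per-call costs in milliseconds (illustrative, not measured).
query_cost = 0.5       # time the DBMS needs to execute the query once
odbc_overhead = 12.0   # fixed cost of shipping query and result over ODBC
repetitions = 100

without_odbc = repetitions * query_cost
with_odbc = repetitions * (query_cost + odbc_overhead)
print(without_odbc, with_odbc)  # the fixed overhead dominates the total
```

Because the overhead term scales with the amount of data shipped per call, a query returning more rows (as in "SSTE Large") pays it more heavily.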

The fourth and fifth queries, where the task is to combine tables and apply aggregate functions, are where MySQL shines. Since the databases built in Microsoft Access and MySQL, as well as the queries run against them, are almost identical, the results for these two queries make it evident that MySQL is superior for this type of task.

The strengths of Microsoft Access are not as obvious as those of the other two DBMSs. Instead, they appear in the bigger picture: Access performs decently across all queries. It does not excel in any specific area, but it also does not perform distinctly poorly in any test except "SSTE Small". The reason for the poor performance on this query is hard to tell, considering that Access performed well on "SSTE Large", which is a very similar query. It may simply be that this specific value is hard for Access to match or select on. However, this is pure speculation and impossible to state as fact without knowing the intricate details of how Access works.

5.1.3 Inserting data

Testing the insertion of data was done by inserting 20 000 measuring reports into the three databases and measuring the time. The test was run three times for each database and the average was calculated and used as the result. The results can be seen in Figure 5.5.

Figure 5.5: Time needed to insert 20 000 rows
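A minimal sketch of this measurement procedure, again using Python's sqlite3 as a stand-in DBMS (the thesis measured Access, MySQL and MongoDB; the table and column names are illustrative):

```python
import sqlite3
import time

def timed_bulk_insert(n_rows):
    """Insert n_rows into a fresh database and return the elapsed seconds."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE report (id INTEGER PRIMARY KEY, pris REAL)")
    rows = [(float(i),) for i in range(n_rows)]
    start = time.perf_counter()
    conn.executemany("INSERT INTO report (pris) VALUES (?)", rows)
    conn.commit()  # include the commit, so disk flushing is part of the time
    return time.perf_counter() - start

# Three runs of 20 000 rows each, averaged, as in the experiment.
runs = [timed_bulk_insert(20_000) for _ in range(3)]
average_seconds = sum(runs) / len(runs)
print(f"average insert time: {average_seconds * 1000:.1f} ms")
```

Averaging over several runs smooths out one-off effects such as cold caches or background load, which is why the experiment reports the mean of three runs.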

The conclusion to be drawn here is that MongoDB is faster than the two relational databases, most likely due to the foreign keys in MySQL and Microsoft Access. Before inserting a row, the relational databases have to make sure that some of the


fields in the row already exist in other tables. For example, these databases do not allow a row to be inserted if the code in the SSTE field does not exist as a row in "tblSort". This adds an extra step to be performed for each insert and this does not occur in the MongoDB database.
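The foreign-key check described above can be demonstrated with sqlite3 as a stand-in for MySQL and Access; the "tblSort" and SSTE names come from the text, while the rest of the schema is an illustrative simplification:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # sqlite has FK checks off by default
conn.executescript("""
CREATE TABLE tblSort (code TEXT PRIMARY KEY);
CREATE TABLE report  (id INTEGER PRIMARY KEY,
                      sste TEXT NOT NULL REFERENCES tblSort(code));
INSERT INTO tblSort VALUES ('SAGTGRAN');
""")

conn.execute("INSERT INTO report (sste) VALUES ('SAGTGRAN')")  # accepted

try:
    # This SSTE code has no row in tblSort, so the FK check blocks the insert.
    conn.execute("INSERT INTO report (sste) VALUES ('NOSUCHCODE')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print("rejected:", rejected)
```

Every insert pays for this lookup in the referenced table, which is the extra step MongoDB skips when no such constraint is declared.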

The time difference between MySQL and Microsoft Access can again be explained by the added layer of using ODBC to send data to the MySQL server, while Access simply writes the data to a file. This is a small increase in the time each insert requires, but over 20 000 iterations the difference becomes apparent.

5.1.4 Database size

The final value to be measured was the total storage size of each database after 100 000 measuring reports had been inserted. There was also some data in the other tables, but all the databases contained the same data to ensure a fair comparison.

The result of the test is displayed in Figure 5.6.

Figure 5.6: Storage size of databases with 100 000 measuring reports.

The results show that the final size of the MongoDB database is less than one-third of the size of the MySQL database and roughly 40% of the size of the Microsoft Access database, indicating that MongoDB's way of storing the data as BSON documents is more space-efficient than MySQL's and Microsoft Access's storage formats in this case.
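For a file-based database, this kind of measurement amounts to checking the on-disk size after the inserts; a sketch with sqlite3 as the stand-in (the thesis measured the storage of Access, MySQL and MongoDB instead):

```python
import os
import sqlite3
import tempfile

# Create a file-backed database, insert rows, and measure the resulting file.
path = os.path.join(tempfile.mkdtemp(), "reports.db")
conn = sqlite3.connect(path)
conn.execute("CREATE TABLE report (id INTEGER PRIMARY KEY, sste TEXT, pris REAL)")
conn.executemany("INSERT INTO report (sste, pris) VALUES (?, ?)",
                 [("SAGTGRAN", 1.0)] * 10_000)
conn.commit()
conn.close()  # close first so all pages are flushed to the file

size_bytes = os.path.getsize(path)
print(f"database file size: {size_bytes / 1024:.0f} KiB")
```

For a server-based DBMS the equivalent figure comes from the server itself (for example MongoDB's `db.stats()` or MySQL's `information_schema` size columns) rather than from a single file.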

5.2 Cloud hosting

As mentioned in Section 2.4, there are plenty of cloud service providers (CSPs), and among them there are many choices regarding computing power and storage size. For this study, Google Cloud and Microsoft Azure were chosen for evaluation and comparison because they are two of the biggest CSPs on the market and are backed by large, widely known tech companies. It was also decided that only DBaaS would be considered as a hosting option, as setting up and maintaining a virtual machine is beyond Småland Timber's capabilities. DBaaS is simpler for the customer, as the CSP takes full responsibility for maintaining the database.
