Performance Evaluation of MMAPv1 and WiredTiger Storage Engines in MongoDB: An Experiment

(1)

Master of Science in Computer Science February 2017

Faculty of Computing

Blekinge Institute of Technology SE-371 79 Karlskrona Sweden

Performance Evaluation of MMAPv1 and WiredTiger Storage Engines in MongoDB

An Experiment

Rohith Reddy Gundreddy

(2)

ii

This thesis is submitted to the Faculty of Computing at Blekinge Institute of Technology in partial fulfillment of the requirements for the degree of Master of Science in Computer Science. The thesis is equivalent to 20 weeks of full time studies.

Contact Information:

Author(s):

Rohith Reddy Gundreddy E-mail: rogu15@student.bth.se

University advisor:

Emiliano Casalicchio, PhD

Associate Professor in Computer Science

Department of Computer Science and Engineering

Faculty of Computing

Blekinge Institute of Technology SE-371 79 Karlskrona, Sweden

Internet : www.bth.se

Phone : +46 455 38 50 00

Fax : +46 455 38 50 57

(3)

i

C ^ONTENTS

CONTENTS ...I ABSTRACT ... III ACKNOWLEDGEMENT ... IV LIST OF FIGURES ... V LIST OF TABLES ... VI

1 INTRODUCTION ... 7

1.1 OVERVIEW ... 7

1.2 PROBLEMIDENTIFICATIONANDMOTIVATION ... 8

1.3 AIMSANDOBJECTIVES ... 8

1.4 RESEARCHQUESTIONS ... 9

1.5 CONTRIBUTION ... 9

1.6 OUTLINE ... 9

2 BACKGROUND ... 10

2.1 DATABASEMANAGEMENTSYSTEMS ... 10

2.1.1 SQL Databases ... 10

2.1.2 NoSQL Databases ... 10

2.1.3 CAP Theorem ... 10

2.2 MONGODB ... 11

3 RELATED WORK ... 13

4 METHODOLOGY ... 16

4.1 LITERATUREREVIEW ... 16

4.2 EXPERIMENTATION ... 17

4.2.1 Testing Environment ... 17

4.2.2 Initial Setup ... 17

4.2.3 Workloads and Metrics ... 17

4.2.4 Experiment ... 18

4.3 DATAANALYSIS ... 18

5 RESULTS ... 19

5.1 FINDINGSOFTHELITERATUREREVIEW ... 19

5.1.1 Benchmarking Tool ... 19

5.1.2 Performance Evaluation Metrics ... 19

5.2 RESULTSOFTHEEXPERIMENT ... 20

5.2.1 Case 1: Experimentation on m3.xlarge Instance ... 20

5.2.2 Case 2: Experiment on m3.2xlarge Instance ... 29

5.2.3 Comparing Case 1 and Case2 ... 37

6 ANALYSIS ... 40

6.1 M3.XLARGE INSTANCE ... 40

6.2 M3.2XLARGE INSTANCE ... 41

7 DISCUSSION ... 44

7.1 FINDINGSOFTHERESEARCH ... 44

7.1.1 Benchmarking Tool (RQ-1) ... 44

7.1.2 Performance Evaluation Metrics (RQ-2) ... 44

7.1.3 Discussion on Findings from the Experiment ... 44

(4)

ii

7.2 VALIDITYTHREATS ... 45

7.2.1 Statistical conclusion validity ... 45

7.2.2 Internal Validity ... 46

7.2.3 External Validity ... 46

7.3 LIMITATIONS ... 46

8 CONCLUSION AND FUTURE WORK ... 47

8.1 CONCLUSION ... 47

8.2 FUTUREWORK ... 47

REFERENCES ... 48

APPENDIX ... 50

(5)

iii

A ^BSTRACT

Context. As the data world entered Web 2.0 era, there is loads of structured, semi-structured and unstructured data growing enormously. The structured data can be handled efficiently by SQL databases. But to handle unstructured and semi-structured data, NoSQL databases have been introduced.

NoSQL databases can be broadly classified into four types – key-value, column-oriented, document- oriented and graph-oriented. MongoDB is one such NoSQL databases which comes under the category of document-oriented databases. The data in MongoDB is stored using storage engines. MongoDB currently uses two different storage engines– MMAPv1 and WiredTiger.

Objectives. This study focuses on presenting a performance evaluation of two data storage engines, MMAPv1 and WiredTiger, emphasizing on certain metrics which will be obtained from the literature review. This thesis aims to show which storage engine is better while using different workloads.

Methods. Literature study is done to obtain knowledge on performance evaluation of MongoDB database comparing with other SQL and NoSQL databases. YCSB benchmarking tool has been chosen to evaluate the performance of the storage engines. Later, to show which storage engine is better on different workloads, penalties have been calculated.

Results. The literature search resulted in obtaining four metrics – Execution time, Throughput, CPU Utilization and Memory Utilization as the metrics which best comply with presenting the evaluation of two storage engines, MMAPv1 and WiredTiger. The experiment resulted in generation of penalties that indicate which storage engine is better than the other and in which scenarios.

Conclusions. MMAPv1 shows better performance when the workloads are Read favorable. On the other hand, WiredTiger shows better performance when the workloads are Write favorable and also when the workloads are neutral (equal amounts of reads and writes).

Keywords: Performance Evaluation, MongoDB, NoSQL Databases, MMAPv1, WiredTiger

(6)

iv

A CKNOWLEDGEMENT

I would like to express my sincere gratitude to my thesis supervisor Emiliano Casalicchio, for his incredible support and guidance throughout the thesis. He was always available and was very fast and patient in answering to my queries. This work would not have been possible without his patience, exceptional guidance and supervision. I am grateful to my parents and my brother for their unconditional love and support till date. I would like to thank all my friends, especially Pavani, who made my stay in Sweden worthwhile.

(7)

v

L ^IST O ^F F ^IGURES

Figure 2.1: CAP Theorem [15] ... 11

Figure 2.2: MongoDB Architecture [16] ... 11

Figure 5.1: Execution Time of MMAPv1 and WiredTiger for Different Workloads... 20

Figure 5.2: Throughput of MMAPv1 and WiredTiger for Different Workloads ... 21

Figure 5.3: CPU Utilization of MMAPv1 and WiredTiger for Workload A ... 22

Figure 5.4: Memory Utilization of MMAPv1 and WiredTiger for Workload A ... 22

Figure 5.5: CPU Utilization of MMAPv1 and WiredTiger for Workload B ... 23

Figure 5.6: Memory Utilization of MMAPv1 and WiredTiger for Workload B ... 23

Figure 5.7: CPU Utilization of MMAPv1 and WiredTiger for Workload C ... 24

Figure 5.8: Memory Utilization of MMAPv1 and WiredTiger for Workload C ... 24

Figure 5.9: CPU Utilization of MMAPv1 and WiredTiger for Workload D ... 25

Figure 5.10: Memory Utilization of MMAPv1 and WiredTiger for Workload D ... 25

Figure 5.11: CPU Utilization of MMAPv1 and WiredTiger for Workload E ... 26

Figure 5.12: Memory Utilization of MMAPv1 and WiredTiger for Workload E ... 26

Figure 5.13: Average CPU Utilization of MMAPv1 and WiredTiger for different Workloads ... 27

Figure 5.14: Average Memory Utilization of MMAPv1 and WiredTiger for different Workloads ... 28

Figure 5.15: Execution Time of MMAPv1 and WiredTiger for different Workloads ... 29

Figure 5.16: Throughput of MMAPv1 and WiredTiger for different Workloads ... 30

Figure 5.17: CPU Utilization of MMAPv1 and WiredTiger for Workload A ... 31

Figure 5.18: Memory Utilization of MMAPv1 and WiredTiger for Workload A ... 31

Figure 5.19: CPU Utilization of MMAPv1 and WiredTiger for Workload B ... 32

Figure 5.20: Memory Utilization of MMAPv1 and WiredTiger for Workload B ... 32

Figure 5.21: CPU Utilization of MMAPv1 and WiredTiger for Workload C ... 33

Figure 5.22: Memory Utilization of MMAPv1 and WiredTiger for Workload C ... 33

Figure 5.23: CPU Utilization of MMAPv1 and WiredTiger for Workload D ... 34

Figure 5.24: Memory Utilization of MMAPv1 and WiredTiger for Workload D ... 34

Figure 5.25: CPU Utilization of MMAPv1 and WiredTiger for Workload E ... 35

Figure 5.26: Memory Utilization of MMAPv1 and WiredTiger for Workload E ... 35

Figure 5.27: Average CPU Utilization of MMAPv1 and WiredTiger for different Workloads ... 36

Figure 5.28: Average Memory Utilization of MMAPv1 and WiredTiger for different Workloads ... 37

Figure 5.29: Average Execution time of MMAPv1 and WiredTiger on different instances .. 38

Figure 5.30: Average Throughput of MMAPv1 and WiredTiger on different instances 38

Figure 5.31: Average CPU Utilization of MMAPv1 and WiredTiger on different instances 39 Figure 5.32: Average Memory Utilization of MMAPv1 and WiredTiger on different instances ... 39

(8)

vi

L ^{IST OF} T ^ABLES

Table 2.1: Comparing two Storage Engines of MongoDB - MMAPv1 and WiredTiger [17]12 Table 4.1: Configuration of two different AWS Instances – m3.xlarge and m3.2xlarge ... 17 Table 5.1: Performance Evaluation of Databases using different Benchmarking tools ... 19 Table 5.2: Different metrics used to measure the performance of the databases... 20 Table 5.3: Average Execution time in milliseconds of MMAPv1 and WiredTiger for different

Workloads ... 20 Table 5.4: Average Throughput of MMAPv1 and WiredTiger for different Workloads ... 21 Table 5.5: Average CPU Utilization of MMAPv1 and WiredTiger for different Workloads 27 Table 5.6: Average Memory Utilization of MMAPv1 and WiredTiger for different

Workloads ... 28 Table 5.7: Average Execution time in milliseconds for MMAPv1 and WiredTiger for different

Workloads ... 29 Table 5.8: Average Throughput for MMAPv1 and WiredTiger for different Workloads ... 30 Table 5.9: Average CPU Utilization of MMAPv1 and WiredTiger for different Workloads 36 Table 5.10: Average Memory Utilization of MMAPv1 and WiredTiger using different

Workloads ... 37 Table 5.11: Average Execution time in milliseconds of MMAPv1 and WiredTiger on different

instances ... 38 Table 5.12: Average Throughput of MMAPv1 and WiredTiger on different instances... 38 Table 6.1: F-Measure and Significant Value for Dependent Variable: Execution Time ... 40 Table 6.2: Standard Deviation for the Execution Time for the results obtained on m3.xlarge

instance ... 40 Table 6.3: CV values for the Execution time for the results obtained on m3.xlarge instance 40 Table 6.4: Standard Deviation for the Throughput for the results obtained on m3.xlarge

instance ... 41 Table 6.5: CV values for the Throughput for the results obtained on m3.xlarge instance ... 41 Table 6.6: F-Measure and Significant Value for Dependent Variable: Throughput ... 41 Table 6.7: Standard Deviation for the Execution Time for the results obtained on m3.2xlarge

instance ... 41 Table 6.8: CV values for the Execution time for the results obtained on m3.2xlarge

instance ... 42 Table 6.9: Standard Deviation for the Throughput for the results obtained on m3.2xlarge

instance ... 42 Table 6.10: CV values for the Throughput for the results obtained on m3.2xlarge instance . 42 Table 6.11: Showing improved performance of WiredTiger compared to MMAPv1... 43 Table 7.1: Penalties between MMAPv1 and WiredTiger on m3.xlarge instances ... 45 Table 7.2: Penalties between MMAPv1 and WiredTiger on m3.2xlarge instances ... 45 Table 0.1: Execution time in milliseconds of two storage engines on m3.xlarge instance .... 50 Table 0.2: Throughput of two storage engines on m3.xlarge instance ... 50 Table 0.3: Execution time in milliseconds of two storage engines on m3.2xlarge instance .. 51 Table 0.4: Throughput of two storage engines on m3.2xlarge instance ... 51

(9)

7

1 I NTRODUCTION 1.1 O ^VERVIEW

Now-a-days, the amount of data has increased enormously with the advent of Web 2.0.

To store such huge amounts of data and with the changing needs of applications, there emerged new types of databases termed as “NoSQL” (Not Only SQL) databases. Many companies use different types of databases according to their requirements. Data can be broadly categorized as – structured data, unstructured data and semi-structured data [1]. Structured data has a well- defined form which can be handled by traditional relational databases very efficiently. But, unstructured and semi-structured data, couldn’t be handled by traditional relational models. This emerged the need for NoSQL database. NoSQL databases are highly scalable, capable of parallel processing and can be said database-as-a-service. NoSQL databases can be broadly classified into 4 categories – Key-value, Column-oriented, Document-oriented and Graph-oriented [2].

Key-value Data Storage - It is a data storage designed for storing, retrieving and managing the associate arrays – A data structure. It contains a collection of objects or records and these contain many different fields within them containing data. These records of data can be stored and retrieved using a key that uniquely identifies the record and is used to find the data in the database [2]. Column-oriented Data Storage - Column oriented databases store the data-tables as columns instead of rows. Both column and row databases use traditional database languages like SQL to load data. But the main difference is the column databases can more precisely access the data to a query entered. It is very good when it comes to handle huge data and is very scalable [2]. Document-oriented Data Storage - The data is stored in the form of documents. We can say they are a sub-class of key-value storages. Document based storages are more like Relational databases. The difference is, Relational Database stores in several different tables but Document-oriented databases will store the data in one instance for a given object [3]. Graph-oriented Data Storage - As the name says, the data is stored in the form of Graphs. It is the best way to deal with complex, semi-structured and densely connected data. These are very useful in healthcare, retail, financial, social networking, online media etc.,[4].

Now, when we want to store the data like a traditional and old school relational database and also want to go with the new methods of NoSQL to store the data, we can consider document-oriented databases. Currently MongoDB is the number 1 NoSQL database out there for storing different forms of data [37]. So, this shifted our interest on knowing more about the MongoDB database and conducting a performance evaluation on it. MongoDB is a NoSQL database which comes under the category of document-oriented database developed by 10gen company. Data in MongoDB is organized in the form of BSON (Binary JavaScript Object Notation) documents [5]. MongoDB architecture consists three levels – Query Language, Data model and Storage engines. Query language - used for accessing and operating documents [5]. Data model - allows to customize the model of the database without having any impact on the performance of the query execution [5]. Storage engines - used to control how the data is being stored [5].

(10)

8

1.2 P ^ROBLEM I DENTIFICATION A ^ND M ^OTIVATION

After extensive literature study on various aspects of MongoDB, one of the topics which caught my attention was the different storage engines which MongoDB is providing.

MongoDB provides three storage engines - MMAPv1, WiredTiger, In-Memory [6]. But In- Memory storage engine is just a part of general availability, which leaves us with two storage engines, MMAPv1 and WiredTiger [6]. A database storage engine is an underlying software that a database uses to create, read, update and delete data from a database [7]. Storage engines significantly affect the performance of the database and applications which are dependent on it and they also control the database’s interaction with memory and storage subsystems [7].

As MongoDB offers two storage engines (at present), MMAPv1 and WiredTiger, it can be hard to choose which storage engine is good to use for someone who is considering MongoDB as their database and what can be the tradeoffs incurred. There is not much research presenting the comparison of the two storage engines, MMAPv1 and WiredTiger. This drove the focus on the idea of presenting performance evaluation of the two storage engines.

There are many researches presenting the performance evaluation of MongoDB in comparison with other NoSQL databases or with relational databases. Authors in [8,9,10] have presented performance evaluation of MongoDB by comparing them with other NoSQL databases like – Raik, Cassandra, HBase, ElasticSearch, Redis and OrientDB. While [11], presented the comparison/evaluation of MongoDB with respect to relational databases like MySQL. Most of the researchers [9,10,11] have carried out the performance evaluation of MongoDB by considering the benchmarking tool – Yahoo! Cloud Serving Benchmark (YCSB). The performance of the databases was measured using some defined parameters and by considering a well suited workload of read and write operations. Some researchers [8,9,10,11] have considered the time taken by basic database operations – Create, Read, Update and Delete as the parameters for presenting the evaluation.

Considering the reliability and the way which more aptly depicts the real-world situations, we are considering the former approach of using suitable benchmarking tool with well-suited workloads for presenting the evaluation of the two storage engines, MMAPv1 and WiredTiger.

1.3 A ^IMS A ^ND O ^BJECTIVES

The main aim of the thesis is to evaluate the performance of two data storage engines, MMAPv1 and WiredTiger, in MongoDB. This thesis also aims at comparing the performance of the data storage engines based on the identified metrics and on different workloads.

Objectives:

 Study different benchmarking tools available that support evaluation of MongoDB and choose the suitable benchmarking tool.

 Study different metrics that better comply with presenting the evaluation of two storage engines, MMAPv1 and WiredTiger.

 Measure the performance of the storage engines based on both read and write favorable scenarios.

 Analyze performance of the storage engines and to present detailed evaluation of the two storage engines with respect to the above identified metrics.

(11)

9

1.4 R ^ESEARCH Q ^UESTIONS

RQ-1: What are the different benchmarking tools available for presenting the evaluation of MongoDB?

Motivation: Identifying different benchmarking tools facilitates us with the metrics which can be used for performance evaluation of MongoDB.

RQ-2: What are the different performance metrics which can better comply with the evaluation of two storage engines, MMAPv1 and WiredTiger?

Motivation: Identifying different performance metrics allows us to easily evaluate the performance of the two storage engines, MMAPv1 and WiredTiger.

RQ-3: What are the outcomes when we compare two storage engines, MMAPv1 and WiredTiger using different workloads of Write(W) and Read(R).

Motivation: Comparison of performance of these two storage engines allows us to understand which storage engine performs better under different workloads. (Example: W- 80% : R-20% , W-50% : R-50% , W-20% : R-80%)

1.5 C ONTRIBUTION

The present research work contributes on which data storage engine in MongoDB performs better in terms of speed and throughput, and how it effects overall performance of the database. The present research also tells you which storage-engine is optimal for different combinations of write and read workloads on different environments. The present research will give the service providers a view on which storage engine to use when they select certain workload.

1.6 O ^UTLINE

Chapter 1 gives a brief introduction to the research work. Chapter 2 covers background information about SQL Databases, NoSQL Databases, CAP – Theorem, MongoDB and differences between MMAPv1 and WiredTiger. Chapter 3 gives a brief overview of other studies conducted in the same research area. Chapter 4 – literature review and experiment methods used were discussed. Results obtained from the experiment are reported in Chapter 5. Chapter 6 consists of Analysis of the results. Chapter 7 will be our discussion about the results obtained. Chapter 8 concludes the research work done in this thesis and possible areas of future work.

(12)

10

2 B ^ACKGROUND

2.1 D ^ATABASE M ^ANAGEMENT S ^YSTEMS

A Database Management System(DBMS) is a software which creates and manages the databases. It provides a systematic way to create, retrieve, update and manage data. It serves as an interface between database and application programs. It makes sure the data is arranged in a way it can be easily accessible and also plays an important role in backing-up and recovery of data[12].

There are two types of databases which uses DBMS in the present data world – 1) SQL Databases

2) NoSQL Databases

2.1.1 SQL Databases

SQL databases are basically Relational Database Management Systems (RDBMS). The leading users of these database systems are Oracle, IBM and Microsoft. RDBMS uses the concept of database normalization and sets a limitation for primary and foreign keys to establish a relationship with rows of data in different database tables [13]. That eliminates the need to store related data in multiple tables, which reduces data storage requirements, streamlines database maintenance and enables faster querying of databases [13].

2.1.2 NoSQL Databases

NoSQL databases are rapidly growing database management systems. NoSQL, which has a wide range of technologies and architectures, seeks to solve the scalability and big data performance issues that relational databases were not able to address [14]. NoSQL is useful when a company or an organization needs to access and analyze massive amounts of unstructured data or data which needs to be stored remotely on multiple virtual servers in the cloud [14]. There are more and more NoSQL databases coming into market. The most commonly heard NoSQL databases are Apache Cassandra, MongoDB, HBase, HyperTables, etc.,

But how to choose between so many SQL and NoSQL databases? How can we decide which database is good for our requirements? The answer is CAP theorem. CAP theorem makes it easy to understand the way databases work in a simple way.

2.1.3 CAP Theorem

CAP – Consistency, Availability and Partition Tolerance theorem indicates that no database can provide all three features. We can choose any two of the features and select a database according to the user requirements [15]. Figure 2.1 will give clear idea on how this theorem works.

(13)

11

Figure 2.1 – CAP Theorem [15]

Here we can observe how some renowned databases are spread according to their primary features. Pick any 2 features and a data model type to choose a database which is right for you.

2.2 M ^ONGO DB

MongoDB is a document oriented database which is highly scalable and very consistent.

The data is stored in BSON (Binary JavaScript Object Notation) format. It is rapidly getting famous among the developers due it’s dominant performance over other available NoSQL databases in the present data world. MongoDB architecture consists of three major components Mongos, Config server and Shards. Client first contacts the process called Mongos, which is the in-charge of rooting and co-ordination [16]. The config server retains meta information on shards [16]. Shards is where the data will be stored and each shard in- turn has a minimum of three replica sets. To form a database we need a minimum of three shards which is recommended by MongoDB [16].

Figure 2.2 – MongoDB Architecture [16]

(14)

12

But how will the data get stored? What is underlying source that inserts the data into the database? The answer is MongoDB provides with two data storage engines, MMAPv1 and WiredTiger. A data storage engine is responsible for managing how data is stored, both in memory and on disk. The data will be entered using either of these two data storage engines.

MMAPv1 is the original MongoDB storage engine and is the default storage engine for MongoDB versions before 3.2 [6]. WiredTiger is the default storage engine starting in MongoDB 3.2. WiredTiger provides a document-level concurrency model, checkpointing, and compression, among other features [6].

Features MMAPv1 WiredTiger

Write Performance Good

Collection-level concurrency control

Excellent

Document-level concurrency control

Read Performance

Excellent Excellent

Compression

Support NO YES

Query Language

Support YES YES

Secondary Index

Support YES YES

Replication

Support YES YES

Sharding Support YES YES

Ops manager and

MMs support YES YES

Security Controls YES YES

Platform

Availability Linux, Windows, Mac OS X,

Solaris (x86) Linux, Windows, Mac OS X

Table 2.1 – Comparing two Storage Engines of MongoDB — MMAPv1 and WiredTiger [17]

(15)

13

3 R ^ELATED W ^ORK

From the commencement of Web 2.0 there is lot of structured, semi-structured and unstructured data which is being produced. To handle such amount of data, flexible schemes, speed and distributed databases are necessary [18]. All these can be fulfilled by NoSQL databases which became preferred adaption for operating Big-data [18].

Authors in [18] stated that, not all NoSQL databases perform better than SQL databases.

The main operations in any database are read, write, delete and insertion operations. But there is a wide variation in these operation in different NoSQL databases. Though NoSQL gives speed and scalability advantage, they still have few drawbacks. They are fast when doing simple tasks and time consuming for complex tasks and are less consistent [18]. The primary way the databases get compared are through these following features – scalability, consistency, support for data models, support for quires and management tools. The authors in [18] have done a performance evaluation of MongoDB, RavenDB, CouchDB, Cassandra, Hypertable, Couchbase and Microsoft SQL using basic CRUD (Create, Read, Update and delete) commands. MongoDB and Couchbase performed had high performance in this experiment.

Microsoft SQL performed better than CouchDB and RavenDB proving not all NoSQL databases are better than SQL [18].

Authors in [19] stated that, MongoDB is a document oriented database. A document has a set of fields and can be thought of as a row, in a collection. It can contain complex structures like lists or even an entire document [19]. Each document has a unique ID field, which will be used as primary key and each collection can contain any kind of document but queries can only be applied on collections [19]. The authors in [19] have given a good example on how MongoDB works by explaining the development of a forum. Everyone have their own unique way to build a forum, so structure can be made according to the user. Thus using MongoDB, the static structure in which a forum contains subforums, the subforums contain discussions and discussions contain comments and so on [19]. There is a possibility of attaching other discussion in the same forum making it endless possibility to increase a forum. MongoDB also has one-to-many relationships, but concept of foreign key is not used, but instead, the concept of annotation is used [19]. The authors in [19] have conducted an experiment showing the performance of both the databases – MongoDB and MySQL. They have used basic insert, select, update and delete operations. Result show that MongoDB has performed better than MySQL in all the operations [19]. Switching from a relational database to a non-relational database can be a challenge in many ways. We need to carefully study all types of NoSQL databases and then decide which is best for that particular use. We can choose MongoDB instead of MySQL if the application is data intensive and stores-and- queries lots of data [19].

Authors in [20] stated that, there is no need for a join operation in MongoDB. Storing data in MongoDB can be done in one of the two ways. The first way is by nesting documents inside each other [20], which works for one-to-one and one-to-many relationships and this process is called embedding. The second option is to store a reference to other document rather nesting the entire document and this process is called referencing [20]. The authors in [20]

have compared MongoDB with Microsoft MySQL. They have used basic insert, updates and simple queries to compare the performance. The results show that MongoDB has performed well in many of the cases which were presented in the article [20]. MongoDB could be a good solution for larger data sets in which the schema is constantly changing or in the case that queries performed will be less complex [20]. MongoDB is definitely the choice for users who needs a less rigid database structure [20]. MongoDB has shown poor performance when it comes to querying non-key values and for aggregate functions.

(16)

14 MongoDB has gained tremendous popularity and is being used by many multi-media companies [21]. MongoDB gives consistency, durability and conditional atomicity. Oracle in other hand offers isolation, transaction, referential integrity and revision control [21]. Both – MongoDB and Oracle are horizontally scalable and has support for data replication but MongoDB is easy to deploy and can be copied from one server to other easily [21]. The authors in [21] have made performance evaluation of MongoDB and Oracle using basic insert, update and delete operations. The results how MongoDB has performed well in all the operations when compared to Oracle [21]. When you want to use fast and flexible database MongoDB is your choice. If rapidness is not your concern and you want to rely on relations between tables and the collections, you can always stick to old classic Oracle [21].

The authors in [15,22] have discussed about CAP – Consistency, Availability and Partition tolerance theorem. According to this theorem there is no database which provides all the three features prominently. Any one of the feature must be compromised. The authors in [22] made a good comparison for these two databases – MongoDB and Cassandra and made a performance evaluation of these two databases. They have selected mixed workloads to work on. The results show that, in all cases Cassandra has performed better than MongoDB [22].

The performance of Cassandra has been increasing continuously while data was getting increased [22].

The authors in [23] have made performance evaluation of NoSQL databases – MongoDB, Redis and HBase. They have used YCSB benchmarking tool to compare the performance. They have also used Attila, a data-oriented load balancer that monitors the performance of each database node. The results show that Redis and MongoDB have high performance when compared to HBase [23].

The authors in [24] have discussed about the YCSB benchmarking tool which is widely used for benchmarking database management systems. They have discussed about cloud servicing system characteristics, classification of system-and-tradeoffs and a brief survey of cloud data systems. They also discussed about different workloads YCSB is providing for evaluating the performance of a database. They have discussed about the architecture of YCSB. They made performance evaluation of Cassandra, HBase, PNUTS and MySQL databases. The results show that Cassandra and HBase performed equally well when compared to PNUTS and MySQL.

The authors is [25] have compared the performance of NoSQL databases – Cassandra, MongoDB, HBase and Couchbase on different number of nodes. They have used EndPoint benchmarking tool which is customized version of YCSB. Cassandra performed better for read and writes when compared to other databases [25].

The authors in [8] have made performance evaluation of NoSQL databases – MongoDB, ElasticSearch, OrientDB and Redis. Insert, update and read operations were used to measure the performance. The results show that, for insert operation MongoDB and Redis have performed well, for update and read operations Redis and ElasticSearch have performed well.

Redis had high performance when compared to all other databases [8]. IT professionals must choose a database carefully by running a performance test on the databases they would like to work on [8].

The authors in [9] have made performance evaluation of NoSQL databases – Cassandra, MongoDB and HBase. They have used AWS servers to conduct the experiment. Different read and update workloads were used. The results show that the data models are able to capture much of the main performance characteristics of the studied databases at workloads [9].

(17)

15 The authors in [10] have made performance evaluation of NoSQL databases – Cassandra, MongoDB and Riak. They have used Amazon Web Services (AWS) servers to conduct the experiment and have used different read/write workloads. The results show that Cassandra has given best overall performance when compared to Riak and MongoDB [10].

The authors in [11] have made performance evaluation of SQL and MongoDB databases. Basic insert, delete and update operations were used to measure the performance of the databases. The results show MongoDB has performed well in all the operations when compared to SQL [11].

(18)

16

4 M ^ETHODOLOGY

An appropriate method must be chosen to answer the research questions. The answers must be valid and reliable. For answering the research questions which are presented in this thesis, literature review has been chosen for answering RQ-1 and RQ-2 and an experiment is conducted for answering RQ-3.

Exclusion Criteria

Survey – Survey is done to know the opinion of the certain group of people or practitioners [36]. We can use survey to find an appropriate benchmarking tool and performance metrics, but the answers which we get from the survey are not reliable.

Performance evaluation of a database cannot be done using survey.

Case Study – We can prefer case study when we are doing exploratory studies which are indefinite in nature and takes a long time to get results [36]. This method cannot be used for finding benchmarking tool nor performance evaluation of a database.

Action Research – Action research is used to solve the real world problems [36].

Finding benchmarking tool and performance evaluation of a database are definitely not real world problems.

4.1 L ^ITERATURE R ^EVIEW

Literature review is conducted in order to answer RQ-1 and RQ-2 which focuses on finding appropriate benchmarking tool and to find appropriate metrics to measure the performance of MongoDB while using different storage engines. To gain knowledge on the topic, we need to search for articles which are about or related to our topic.

The steps followed in conducting the literature review are –

 Inspec, Scopus and ACM digital library databases were primarily used

 Using keywords which are related to this topic is a good way to start searching for the articles. The keywords which are used in this research are “SQL databases”, “NoSQL databases”, “MongoDB”, “Storage Engines”, “WiredTiger”, “MMAPv1” and “performance evaluation”

 Forward and reverse snowballing has been done to find more relevant articles

 Filtering the research papers using inclusion and exclusion criteria

 Analyzing on what the authors are presenting in their articles



Limiting the problem on which we are working on and avoiding unnecessary duplication of the content which is already there

(19)

17

4.2 E XPERIMENTATION 4.2.1 Testing Environment

In order to depict real-world scenarios and make the answers more reliable, the experiment is conducted on virtual servers. In this experiment, Amazon Web Services(AWS) servers are used. The experiment will be conducted on m3.xlarge instances first and later with m3.2xlarge instances to know if there is any performance difference when number of vCPU(s) and RAM is increased. Four instances – three for MongoDB replica set, one for setting up YCSB are needed to run the experiment. Hence, we need a total of eight instances – four m3.xlarge and four m3.2xlarge. The configuration and architecture of each of the instances are as follows –

S.No Category m3.xlarge m3.2xlarge

1 Operating System Ubuntu 16.04 LTS Ubuntu 16.04 LTS

2 Architecture x86_64-bit x86_64-bit

3 Kernel Version 4.4.0-53-generic 4.4.0-57-generic 4 Storage Solid State Drive(SSD) Solid State Drive (SSD) 5 Model Name Intel(R) Xeon(R) CPU E5-

2670 v2 @ 2.50GHz Intel(R) Xeon(R) CPU E5- 2670 v2 @ 2.50GHz

6 RAM 15 GB 30 GB

7 vCPU(s) 4 8

8 CPU MHz 2500.082 2500.092

9 Virtualization Type Full Full

Table 4.1 – Configuration of two different AWS Instances – m3.xlarge and m3.2xlarge

4.2.2 Initial Setup

Firstly, the MongoDB database is required for conducting the experiment. Latest stable version of MongoDB-3.4.1 is installed on three instances. Next, YCSB was installed on forth instance after installing the basic requirements – JAVA and Maven-3.3.9. The initial setup to run the experiment is now complete.

4.2.3 Workloads and Metrics

The workloads are chosen to be both read and write favorable. The first workload is completely write favored. Second workload has more writes and less reads. Third is neutral workload which has 50% writes and 50% reads. Forth workload has more reads and less writes.

Fifth and last workload is completely read favored.

1. Workload A – Write/Read = 100/0 (%) 2. Workload B – Write/Read = 80/20 (%) 3. Workload C – Write/Read = 50/50 (%) 4. Workload D – Write/Read = 20/80 (%) 5. Workload E – Write/Read = 0/100 (%)

The metrics were taken from the findings of the literature review. The first metric is

‘Execution Time’ of the experiment which is measured in milliseconds(ms). The second metric is ‘Throughput’ which is measured in operations-per-second (ops/sec). The third metric is total ‘CPU utilization’ when the experiment is running. The forth metric will be ‘Memory Utilization’ when the experiment is running.

(20)

18

4.2.4 Experiment

To start the experiment, a replica set must be created using the three servers which were taken. MongoDB must be running in all the three servers. Select any one of the server as primary storage and other two as secondary storages and create a replica set. YCSB will be running on forth instance.

In YCSB workloads, the default records and operations are 1000. The value can be changed accordingly. The records and operations that have been taken to conduct this experiment is one million i.e., 1,000,000. The size of each record is 1KB.

As all the servers are connected through the primary node, experiment will run through the primary node. Data needs to be loaded using YCSB to run the experiment. So Workload- A will be loaded first and then will be runt to complete the experiment. After running the experiment, the time taken to complete the experiment (execution time) and throughput (operation per second) will be noted. To know the CPU and Memory Utilization “sar” utility tool has been used.

In this way the experiment will be repeated using different workloads and storage engines. The experiment will be repeated 10 times for each workload and storage engine so that results will be more reliable and biased results will be avoided.

4.3 D ^ATA A ^NALYSIS

We need to prove that the results we get from the experiment are statistically significant – to know whether there is any significant difference between the values of execution time and throughput separately. To analyze the results statistically we need to select an appropriate statistical test.

In this experiment we have two independent variables – Storage Engines and Workloads and two dependent variables – Execution time and Throughput but the results for execution time and throughput will be collected separately. So, when we are tabulating the data, Storage Engines and Workloads will be our independent variables and there will be one dependent variable – Execution time or Throughput.

Normality Distribution test was done on all the datasets using Shapiro-Wilk test [26]

and after finding that the data is normal, a parametric test has been chosen. Considering the type of research questions, type of variables, levels of variables and measurement scale, Factorial ANOVA method has been chosen as an appropriate statistical method for analyzing the outputs of the experiment [27]. For conducting both Shapiro-Wilk test and Factorial ANOVA test we have used IBM’s SPSS statistics tool [28] .

Now to know which Storage engine performs better than the other, we use simple and easiest way to find i.e., Ratio of the storage engines.

For example: 𝑀𝑀𝐴𝑃𝑣1

𝑊𝑖𝑟𝑒𝑑𝑇𝑖𝑔𝑒𝑟

=

²

4

=

¹

2

2 × 𝑀𝑀𝐴𝑃𝑣1 = 𝑊𝑖𝑟𝑒𝑑𝑇𝑖𝑔𝑒𝑟

From the above example we can say that WiredTiger performs 2 times better than MMAPv1. Like this we will calculate the penalties between two storage engines for different workloads.

(21)

19

5 R ^ESULTS

This chapter will show findings of the literature review and the results from the experiment.

5.1 F ÎNDINGS O ^F T ^HE L ÎTERATURE R ÊVIEW

Literature review was conducted to find the appropriate benchmarking tool and performance metrics for the experiment.

5.1.1 Benchmarking Tool

Benchmarking tools used to evaluate the performance of the databases –

References Databases Compared Benchmarking tool used [8] ElasticSearch vs OrientDB vs Redis vs

MongoDB YCSB

[9] Cassandra vs MongoDB vs HBase YCSB

[10] Cassandra vs MongoDB vs Riak YCSB

[22] Cassandra vs MongoDB YCSB

[23] MongoDb vs Redis vs HBase YCSB

[24] Cassandra vs HBase vs PNUTS YCSB

[25] Cassandra vs MongoDB vs HBase vs

Couchbase EndPoint

Table 5.1 – Performance Evaluation of Databases using different Benchmarking tools

From Table 5.1 we can observe all the databases are NoSQL databases. In most of the cases YCSB benchmarking tool was used.

5.1.2 Performance Evaluation Metrics

Metrics used to measure and compare the performance of the databases – References Databases Compared Performance Metrics

[8] ElasticSearch vs OrientDB vs

Redis vs MongoDB Execution Time in Milliseconds (ms) [9] Cassandra vs MongoDB vs

HBase Throughput in Operations per Second (ops/sec)

[10] Cassandra vs MongoDB vs

Riak Throughput in Operations per Second (ops/sec)

[11] MongoDB vs SQL Execution Time in Milliseconds (ms) [16] Cassandra vs MongoDB vs

HBase Execution Time in Milliseconds (ms) [18] NoSQL Databases vs

Microsoft SQL Execution Time in Milliseconds (ms) [19] MongoDB vs MySQL Execution Time in Milliseconds (ms) [20] MongoDB vs MySQL Execution Time in Milliseconds (ms) [21] MongoDB vs Oracle Execution Time in Milliseconds (ms) [22] Cassandra vs MongoDB Execution Time in Milliseconds (ms) [23] MongoDB vs Redis vs HBase Throughput in Operations per Second

(ops/sec)

(22)

20 [24] Cassandra vs PNUTS vs

HBase Throughput in Operations per Second (ops/sec)

[25] Cassandra vs MongoDB vs

HBase vs Couchbase Throughput in Operations per Second (ops/sec)

Table 5.2 – Different metrics used to measure the performance of the databases

From Table 5.2 we can observe that, execution time and throughputs are the metrics used to measure and compare the performance of the databases.

5.2 R ^ESULTS F ^ROM T ^HE E ^XPERIMENT

5.2.1 Case 1: Experimentation on m3.xlarge Instance

m3.xlarge instances have been used first to conduct the experiment. The results are as follows –

5.2.1.1 Execution Time Storage

Engines Workload

A Workload

B Workload

C Workload

D Workload

MMAPv1 146082.2 216869.4 252875.9 269564.4 265656.8 E WiredTiger 125505.3 179082.9 228772.2 274468.4 281195.8

Table 5.3 – Average Execution time in milliseconds of MMAPv1 and WiredTiger for different Workloads

Figure 5.1 – Execution Time of MMAPv1 and WiredTiger for Different Workloads 0

50000 100000 150000 200000 250000 300000

Workload A Workload B Workload C Workload D Workload E

Time in Milliseconds (ms)

Execution Time

MMAPv1 WiredTiger

(23)

21 In Figure 5.1 we can observe the execution time of MMAPv1 and WiredTiger for different workloads. WiredTiger has better performance in Workload A, B and C but MMAPv1 has better performance in the Workload D and E.

5.2.1.2 Throughput Storage

A Workload

B Workload

C Workload

D Workload

Table 5.4 – Average Throughput of MMAPv1 and WiredTiger for different Workloads

Figure 5.2 – Throughput of MMAPv1 and WiredTiger for Different Workloads

In Figure 5.2 we can observe the throughput of MMAPv1 and WiredTiger. Similar to execution time, WiredTiger has better performance in Workload A,B and C but MMAPv1 has better performance in Workload D and E.

Analysis of overall performance of MMAPv1 and WiredTiger for m3.xlarge instance is as follows –

 Workload A: WiredTiger performs 1.163 times better than MMAPv1.

 Workload B: WiredTiger performs 1.211 times better than MMAPv1.

 Workload C: WiredTiger performs 1.105 times better than MMAPv1.

 Workload D: MMAPv1 performs 1.018 times better than WiredTiger.



Workload E: MMAPv1 performs 1.058 times better than WiredTiger.

0 1000 2000 3000 4000 5000 6000 7000 8000 9000

Operationss per Second (ops/sec)

Throughput

MMAPv1 WiredTiger

(24)

22 5.2.1.3 CPU and Memory Utilization while running different Workloads

CPU Utilization and Memory Utilization of different workloads are presented in the graphs below –

Figure 5.3 – CPU Utilization of MMAPv1 and WiredTiger for Workload A

In Figure 5.3 we can observe the CPU utilization of MMAPv1 and WiredTiger for Workload A.

Figure 5.4 – Memory Utilization of MMAPv1 and WiredTiger for Workload A

In Figure 5.4 we can observe the memory utilization of MMAPv1 and WiredTiger for Workload A.

0 10 20 30 40 50 60 70

1 7 13 19 25 31 37 43 49 55 61 67 73 79 85 91 97 103 109 115 121 127 133 139 145 151 157

CPU Utilization in Percentage (%)

Time in Seconds

CPU Utilization Workload A

MMAPv1 WiredTiger

0 10 20 30 40 50 60

1 6 11 16 21 26 31 36 41 46 51 56 61 66 71 76 81 86 91 96 101 106 111 116 121 126 131 136 141 146

Memory Utilization in Percentage (%)

Time in Seconds

Memory Utilization Workload A

MMAPv1 WiredTiger

(25)

23

Figure 5.5 – CPU Utilization of MMAPv1 and WiredTiger for Workload B

In Figure 5.5 we can observe the CPU utilization of MMAPv1 and WiredTiger for Workload B.

Figure 5.6 – Memory Utilization of MMAPv1 and WiredTiger for Workload B

In Figure 5.6 we can observe the memory utilization of MMAPv1 and WiredTiger for Workload B.

0 10 20 30 40 50 60

1 8 15 22 29 36 43 50 57 64 71 78 85 92 99 106 113 120 127 134 141 148 155 162 169 176 183 190 197 204 211

CPU Utiliztion in Percentage (%)

Time in Seconds

CPU Utilization Workload B

MMAPv1 WiredTiger

0 10 20 30 40 50 60

1 9 17 25 33 41 49 57 65 73 81 89 97 105 113 121 129 137 145 153 161 169 177 185 193 201 209 217

Time in Seconds

Memory Utilization Workload B

MMAPv1 WiredTiger

(26)

24

Figure 5.7 – CPU Utilization of MMAPv1 and WiredTiger for Workload C

In Figure 5.7 we can observe the CPU utilization of MMAPv1 and WiredTiger for Workload C.

Figure 5.8 – Memory Utilization of MMAPv1 and WiredTiger for Workload C

In Figure 5.8 we can observe the memory utilization of MMAPv1 and WiredTiger for Workload C.

0 10 20 30 40 50 60 70

1 9 17 25 33 41 49 57 65 73 81 89 97 105 113 121 129 137 145 153 161 169 177 185 193 201 209 217 225 233 241 249

Time in Seconds

CPU Utilization Workload C

MMAPv1 WiredTiger

0 10 20 30 40 50 60

1 10 19 28 37 46 55 64 73 82 91 100 109 118 127 136 145 154 163 172 181 190 199 208 217 226 235 244 253

Time in Seconds

Memory Utilization Workload C

MMAPv1 WiredTiger

(27)

25

Figure 5.9 – CPU Utilization of MMAPv1 and WiredTiger for Workload D

In Figure 5.9 we can observe the CPU utilization of MMAPv1 and WiredTiger for Workload D.

Figure 5.10 – Memory Utilization of MMAPv1 and WiredTiger for Workload D

In Figure 5.10 we can observe the memory utilization of MMAPv1 and WiredTiger for Workload D.

0 10 20 30 40 50 60

1 10 19 28 37 46 55 64 73 82 91 100 109 118 127 136 145 154 163 172 181 190 199 208 217 226 235 244 253 262 271

Time in Seconds

CPU Utilization Worklod D

MMAPv1 WiredTiger

0 10 20 30 40 50 60

1 10 19 28 37 46 55 64 73 82 91 100 109 118 127 136 145 154 163 172 181 190 199 208 217 226 235 244 253 262 271

Time in Seconds

Memory Utilization Workload D

MMAPv1 WiredTiger

(28)

26

Figure 5.11 – CPU Utilization of MMAPv1 and WiredTiger for Workload E

In Figure 5.11 we can observe the CPU utilization of MMAPv1 and WiredTiger for Workload E.

Figure 5.12 – Memory Utilization of MMAPv1 and WiredTiger for Workload E

In Figure 5.12 we can observe the memory utilization of MMAPv1 and WiredTiger for Workload E.

0 0.2 0.4 0.6 0.8 1 1.2

1 10 19 28 37 46 55 64 73 82 91 100 109 118 127 136 145 154 163 172 181 190 199 208 217 226 235 244 253 262 271

Time in Seconds

CPU Utilization Workload E

MMAPv1 WiredTiger

0 10 20 30 40 50 60

1 11 21 31 41 51 61 71 81 91 101 111 121 131 141 151 161 171 181 191 201 211 221 231 241 251 261 271 281

Time in Seconds

Memory Utilization Workload E

MMAPv1 WiredTiger

(29)

27

Figure 5.13 – Average CPU Utilization of MMAPv1 and WiredTiger for different Workloads

Figure 5.13 represents the average values of the CPU Utilization for different workloads.

Storage

A Workload

B Workload

C Workload

D Workload

MMAPv1 26.257 22.228 13.68 6.352 0.386 E

WiredTiger 41.792 28.261 17.251 8.457 0.494

Table 5.5 – Average CPU Utilization of MMAPv1 and WiredTiger for different Workloads

Analysis of CPU Utilization of MMAPv1 and WiredTiger on m3.xlarge instances is as follows –

 Workload A: WiredTiger CPU Utilization is 1.591 times more than MMAPv1.

 Workload B: WiredTiger CPU Utilization is 1.271 times more than MMAPv1.

 Workload C: WiredTiger CPU Utilization is 1.261 times more than MMAPv1.

 Workload D: WiredTiger CPU Utilization is 1.331 times more than MMAPv1.



Workload E: WiredTiger CPU Utilization is 1.279 times more than MMAPv1

.

0 5 10 15 20 25 30 35 40 45

CPU Utilization

MMAPv1 m3.xlarge WiredTiger m3.xlarge

(30)

28

Figure 5.14 – Average Memory Utilization of MMAPv1 and WiredTiger for different Workloads

Figure 5.14 represents the average value of the Memory Utilization for different workloads.

Storage

A Workload

B Workload

C Workload

D Workload

Table 5.6 – Average Memory Utilization of MMAPv1 and WiredTiger for different Workloads

Analysis of Memory Utilization of MMAPv1 and WiredTiger on m3.xlarge instances is as follows –

 Workload A: WiredTiger Memory Utilization is 1.610 times more than MMAPv1.

 Workload B: WiredTiger Memory Utilization is 2.148 times more than MMAPv1.

 Workload C: WiredTiger Memory Utilization is 2.179 times more than MMAPv1.

 Workload D: WiredTiger Memory Utilization is 2.125 times more than MMAPv1.

 Workload E: WiredTiger Memory Utilization is 2.235 times more than MMAPv1.

0 10 20 30 40 50 60

Memory Utilization

MMAPv1 m3.xlarge WiredTiger m3.xlarge

(31)

29

5.2.2 Case 2: Experiment on m3.2xlarge Instance

We have used m3.2xlarge instances to conduct the experiment second time. The results are as follows –

5.2.2.1 Execution Time Storage

A Workload

B Workload

C Workload

D Workload

Table 5.7 – Average Execution time in milliseconds for MMAPv1 and WiredTiger for different Workloads

Figure 5.15 – Execution Time of MMAPv1 and WiredTiger for different Workloads

In Figure 5.15 we can see the execution time of MMAPv1 and WiredTiger for different workloads. WiredTiger has better performance in Workload A, B and C but MMAPv1 has better performance in the Workload D and E.

0 50000 100000 150000 200000 250000 300000

Time in Milliseconds (ms)

Execution Time

MMAPv1 WiredTiger

(32)

30 5.2.2.2 Throughput

Storage

A Workload

B Workload

C Workload

D Workload

Table 5.8 – Average Throughput for MMAPv1 and WiredTiger for different Workloads

Figure 5.16 – Throughput of MMAPv1 and WiredTiger for different Workloads

In Figure 5.16 we can observe the throughput of MMAPv1 and WiredTiger for different workloads. WiredTiger has better performance in Workload A, B and C but MMAPv1 has better performance in the Workload D and E.

Analysis of overall performance of MMAPv1 and WiredTiger for m3.2xlarge instance is as follows –

 Workload A: WiredTiger performs 1.437 times better than MMAPv1.

 Workload B: WiredTiger performs 1.214 times better than MMAPv1.

 Workload C: WiredTiger performs 1.122 times better than MMAPv1.

 Workload D: MMAPv1 performs 1.002 times better than WiredTiger.

 Workload E: MMAPv1 performs 1.030 times better than WiredTiger.

0 1000 2000 3000 4000 5000 6000 7000 8000 9000 10000

Operationss per Second (ops/sec)

Throughput

MMAPv1 WiredTiger

Performance Evaluation of MMAPv1 and WiredTiger Storage Engines in MongoDB: An Experiment

Faculty of Computing

Blekinge Institute of Technology SE-371 79 Karlskrona Sweden

Performance Evaluation of MMAPv1 and WiredTiger Storage Engines in MongoDB

An Experiment

Rohith Reddy Gundreddy

This thesis is submitted to the Faculty of Computing at Blekinge Institute of Technology in partial fulfillment of the requirements for the degree of Master of Science in Computer Science. The thesis is equivalent to 20 weeks of full time studies.

Author(s):

Rohith Reddy Gundreddy E-mail: rogu15@student.bth.se

University advisor:

Emiliano Casalicchio, PhD

Associate Professor in Computer Science

Department of Computer Science and Engineering

Faculty of Computing

Blekinge Institute of Technology SE-371 79 Karlskrona, Sweden

Internet : www.bth.se

Phone : +46 455 38 50 00

Fax : +46 455 38 50 57

C ONTENTS

A BSTRACT

A CKNOWLEDGEMENT

L IST O F F IGURES

L IST OF T ABLES

1 I NTRODUCTION 1.1 O VERVIEW

1.2 P ROBLEM I DENTIFICATION A ND M OTIVATION

1.3 A IMS A ND O BJECTIVES

1.4 R ESEARCH Q UESTIONS

1.5 C ONTRIBUTION

1.6 O UTLINE

2 B ACKGROUND

2.1 D ATABASE M ANAGEMENT S YSTEMS

2.1.1 SQL Databases

2.1.2 NoSQL Databases

2.1.3 CAP Theorem

2.2 M ONGO DB

3 R ELATED W ORK

4 M ETHODOLOGY

Exclusion Criteria

4.1 L ITERATURE R EVIEW



4.2 E XPERIMENTATION 4.2.1 Testing Environment

4.2.2 Initial Setup

4.2.3 Workloads and Metrics

4.2.4 Experiment

4.3 D ATA A NALYSIS

=

=

5 R ESULTS

5.1 F INDINGS O F T HE L ITERATURE R EVIEW

5.1.1 Benchmarking Tool

5.1.2 Performance Evaluation Metrics

From Table 5.2 we can observe that, execution time and throughputs are the metrics used to measure and compare the performance of the databases.

5.2 R ESULTS F ROM T HE E XPERIMENT

5.2.1 Case 1: Experimentation on m3.xlarge Instance

Execution Time



Throughput

CPU Utilization Workload A

Memory Utilization Workload A

CPU Utilization Workload B

Memory Utilization Workload B

CPU Utilization Workload C

Memory Utilization Workload C

CPU Utilization Worklod D

Memory Utilization Workload D

CPU Utilization Workload E

Memory Utilization Workload E



.

CPU Utilization

Memory Utilization

5.2.2 Case 2: Experiment on m3.2xlarge Instance

Execution Time

Throughput

C ^ONTENTS

A ^BSTRACT

L ^IST O ^F F ^IGURES

L ^{IST OF} T ^ABLES

1 I NTRODUCTION 1.1 O ^VERVIEW

1.2 P ^ROBLEM I DENTIFICATION A ^ND M ^OTIVATION

1.3 A ^IMS A ^ND O ^BJECTIVES

1.4 R ^ESEARCH Q ^UESTIONS

1.6 O ^UTLINE

2 B ^ACKGROUND

2.1 D ^ATABASE M ^ANAGEMENT S ^YSTEMS

2.2 M ^ONGO DB

3 R ^ELATED W ^ORK

4 M ^ETHODOLOGY

4.1 L ^ITERATURE R ^EVIEW

4.3 D ^ATA A ^NALYSIS

5 R ^ESULTS

5.1 F ÎNDINGS O ^F T ^HE L ÎTERATURE R ÊVIEW

5.2 R ^ESULTS F ^ROM T ^HE E ^XPERIMENT