Resource utilization comparison of Cassandra and Elasticsearch
Nizar Selander
September 2019
Faculty of Computing
Blekinge Institute of Technology SE-371 79 Karlskrona, Sweden
This thesis is submitted to the Faculty of Computing at Blekinge Institute of Technology in partial fulfillment of the requirements for the bachelor's degree in software engineering. The thesis is equivalent to 10 weeks of full-time studies.
Contact Information:
Author:
Nizar Selander
nizar.selander@gmail.com
External advisors:
Ruwan Lakmal Silva
ruwan.lakmal.silva@ericsson.com
Pär Karlsson
par.a.karlsson@ericsson.com
University advisors:
Krzysztof Wnuk
krzysztof.wnuk@bth.se
Conny Johansson
conny.johansson@bth.se
I. Abstract
Elasticsearch and Cassandra are two of the most widely used databases today, with Elasticsearch enjoying a recent resurgence thanks to its full-text search feature, akin to that of a search engine, which contrasts with the conventional query-language-based methods used for data search and retrieval.
The demand for more powerful and better performing, yet more feature-rich and flexible, databases has been growing steadily. This project studies how the two databases perform under a specific workload of 2,000,000 fixed-size logs, in an environment in which the two can be compared while keeping the results meaningful for the production environment for which they are intended.
A total of three benchmarks were carried out: an Elasticsearch deployment using the default configuration, and two Cassandra deployments, one with the default configuration and one with a modified configuration reflecting a setup currently running in production for the task at hand.
The benchmarks showed notable performance differences in terms of CPU, memory and disk space usage. Elasticsearch showed the best performance overall, using significantly less memory and disk space, as well as somewhat less CPU.
However, the benchmarks were run with a very specific set of configurations and a very specific data set and workload. These constraints should be kept in mind when interpreting the benchmark results.
Keywords: Databases, Benchmark, Performance, Kubernetes, Cassandra, Elasticsearch.
II. Acknowledgment
I would like to thank my incredible supervisors Ruwan Lakmal Silva, Pär Karlsson and Stefan Wallin at Ericsson as well as Krzysztof Wnuk and Conny Johansson at BTH for all the guidance and support they have given me and for affording me this valuable opportunity. I would also like to thank my family for their support in making my achievements possible.
III. Contents
I. Abstract
II. Acknowledgment
III. Contents
1. Introduction
1.1. Context
1.2. Problem
1.3. Target Group
1.4. Delimitations
1.5. Research questions
2. Background
2.1. History and Overview
2.2. Flat model
2.3. Relational
2.4. Post-Relational
3. Environment
3.1. App
3.2. Stream processing
3.3. Log stashing
3.4. Storage
3.5. Management
4. Experiment
4.1. Method
4.2. Design
4.3. Data replication & consistency
4.4. Deployment
5. Results
5.1. CPU usage
5.1.1. Cassandra default configuration: CPU usage
5.1.2. Cassandra modified configuration: CPU usage
5.1.3. Elasticsearch default configuration: CPU usage
5.1.4. All deployments: Total CPU usage
5.2. Memory
5.2.1. Cassandra default configuration: Memory usage
5.2.2. Cassandra modified configuration: Memory usage
5.2.3. Elasticsearch default configuration: Memory usage
5.2.4. All deployments: Total memory usage
5.3. Disk space
5.3.2. Cassandra default configuration: Data disk usage
5.3.3. Cassandra modified configuration: Commit disk usage
5.3.4. Cassandra modified configuration: Data disk usage
5.3.5. Elasticsearch default configuration: Data disk usage
5.3.6. All deployments: Total disk usage
6. Analysis
6.1. CPU utilization analysis
6.1.1. CPU usage vs logs output rate
6.1.2. CPU usage vs logs output delta
6.1.3. CPU usage vs Producer and Logstasher CPU usage
6.2. Memory utilization analysis
6.2.1. Memory usage vs logs output rate
6.2.2. Memory usage vs Producer & Logstasher CPU usage
6.3. Disk space utilization analysis
6.3.1. Disk space usage vs logs output rate
7. Conclusion
8. Concluding remarks
8.1. Summary
8.2. Limitations
8.3. Future work
9. References
1. Introduction
Ever since humanity developed written communication and evolved from oral cultures into ones capable of storing and preserving their knowledge, people have had the means to record information vital to their identity, beliefs, culture and trade, to name a few. Although the fundamental requirements of physical information storage, such as reliability, security, accessibility and cost, have remained largely unchanged at the most basic level, the challenges we face today in meeting those demands are far more complex [1]. This is largely due to the scale and speed at which we operate, and to ever-evolving technology and how it continues to shape the world around us and how we interact with it.
As more and more of the processes that underpin our infrastructure and business world are digitalized, many of them have found a home on distributed cloud computing platforms. With that, the demand for more efficient, more reliable and more use-case-specific technologies for handling, managing and processing such operations has grown, as is evident from the number of choices available for consideration.
The topic of this thesis project, a study of the resource utilization of Cassandra [1] and Elasticsearch [2], is such an attempt at comparing two available technologies, and seeks to provide empirical and relevant information to aid the decision between the two databases in a real-world scenario.
1.1. Context
The purpose of this research is to examine how the two databases compare in terms of resource utilization. Cassandra, one of the two technologies, is currently used in production at Ericsson to store operation and transaction logs produced by applications in a Kubernetes system.
Elasticsearch, an alternative solution, is being considered as a potential replacement.
By benchmarking the two and measuring their resource utilization in terms of CPU, disk and memory usage, the experiment will provide insight into how they compare and will allow a cost-benefit analysis of whether Elasticsearch is a suitable alternative for the use case at hand.
1.2. Problem
Cassandra, a wide-column store database, is currently used in one of the products offered by Ericsson for storing logs. While it handles the task well, an alternative database, Elasticsearch, offers features that are interesting and useful for the product in which it would be implemented. Full-text search is one such feature: it allows searching through records stored in the database much as one would use a search engine, largely thanks to its document-based data model, which allows for greater operational flexibility. The questions, however, are: at what cost would such a feature come, and how would it perform compared to the currently implemented database in terms of resource utilization? These are the questions this thesis aims to answer. By placing the two in a controlled environment in which they are benchmarked under the same workload, the collected data will help us understand how the two utilize resources in terms of CPU, disk and memory usage.
1.3. Target Group
This research is aimed at organizations, groups and individuals with an interest in how Elasticsearch performs compared to Cassandra in terms of resource utilization in a real-world use case. The study may also be of interest to those looking to explore how such a problem can be tackled, as well as those who want to further their knowledge in this field.
1.4. Delimitations
The study deals with one very specific deployment, configuration, workload, data set, and set of database versions and dependencies. It is important to keep this in mind when drawing conclusions from the results in this thesis.
1.5. Research questions
RQ1: How would Cassandra and Elasticsearch perform in terms of resource utilization under a heavy workload?
RQ2: What factors influence their resource utilization?
2. Background
This study focuses on a cost-benefit analysis of the performance of the Cassandra and Elasticsearch databases on a real-world problem, in cooperation with Ericsson. The goal is to establish how the two technologies perform relative to each other in a data storage task in a Kubernetes environment. To familiarize ourselves with key concepts, this chapter begins with a brief history of the development of databases and how we got to where we are today.
2.1. History and Overview
Although databases are strongly associated with computers and digital information, humans stored and cataloged information long before the computer era. A recently uncovered Sumerian medical tablet dating to 2400 BC, which lists 15 prescriptions used by a pharmacist [2], gives us a glimpse of how far back the practice dates and of its fascinating progression. The concepts and philosophies used to build and improve those systems have both shaped and helped guide the development of databases to where they are today.
Technological evolution has since changed the physical storage medium from clay tablets to papyrus and later to paper. Lists, ledgers, journals, card catalogs and archives have been used in elaborate systems developed by governments, libraries, hospitals and businesses [1].
Databases were created to overcome the limitations and difficulties such systems faced, in areas such as automation, speed, security, ease of use and accessibility.
Systems based on punch cards, and variations thereof (paper tape, etc.), allowed information stored on them to be processed automatically, particularly before general-purpose computers existed. Although such systems are mostly outdated and obsolete today, they are still used for tabulating votes and grading standardized tests.
The introduction and growing availability of disk and drum memory from the 1960s onwards, as computers grew in speed and capability, allowed for direct-access storage and shared interactive use [1].
The emergence of the World Wide Web, whose first iteration had users consuming content created by webmasters, and later Web 2.0, which shifted toward user-generated content, brought new challenges and difficulties that changed the needs developers and administrators had for their database systems [3].
All these technological evolutions had significant effects on the development of databases and the models used to store records in them. The development of database technology is generally divided into three eras in which data models and structures underwent fundamental changes: navigational, relational and post-relational. The following sub-chapters take a closer look at them.
2.2. Flat model
Early computer models followed a flat-file model: a simple, consecutive list of records. This model uses a separate file for each entity, and the files themselves can be plain text or binary. Plain text files usually contain one record per line, with field values separated by commas or another delimiter [4].
In the example below we see how records of fruits and vegetables are stored in two different databases. This is due to a limitation of this model: each database consists of a single table.
"Lime", 18.49, "Spain"
"Orange", 16.99, "USA"
"Fig", 32.99, "Greece"
"Avocado", 24.49, "Peru"
Code block 2.2.1: Records for fruits stored in a Fruits flat file using the format: name, price, origin.
"Cabbage", 11.49, "USA"
"Asparagus", 16.99, "Germany"
Code block 2.2.2: Records for vegetables stored in a Vegetables flat file using the format: name, price, origin.
The flat model lacks structures for indexing or recognizing relationships between records. Relationships can be inferred from the data in the database but the database itself does not make those relationships explicit.
Fruit:
Name      Price  Country
Lime      18.49  Spain
Orange    16.99  USA
Fig       32.99  Greece
Avocado   24.49  Peru

Vegetable:
Name       Price  Country
Cabbage    11.49  USA
Asparagus  16.99  Germany

Table 2.2.3: Records in the file-based model are stored in separate files for each entity. Record entries are separated by new lines and values are separated by commas.
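A short sketch makes the flat model's storage and lookup concrete, assuming the Fruits file from code block 2.2.1; note that, with no index, every lookup is a sequential scan:

```python
# Minimal sketch of reading a flat-file "database" like the Fruits file above.
# Records are one per line, with comma-separated fields (name, price, origin).
import csv
import io

# In-memory stand-in for the Fruits flat file shown in code block 2.2.1.
fruits_file = io.StringIO(
    '"Lime",18.49,"Spain"\n'
    '"Orange",16.99,"USA"\n'
    '"Fig",32.99,"Greece"\n'
    '"Avocado",24.49,"Peru"\n'
)

records = [row for row in csv.reader(fruits_file)]

# The flat model has no index: every lookup is a sequential scan
# from the first record to the last.
def find_by_name(rows, name):
    for row in rows:
        if row[0] == name:
            return row
    return None

print(find_by_name(records, "Fig"))  # ['Fig', '32.99', 'Greece']
```

Any structure beyond "one record per line" (types, relationships, indexes) has to live in the application code, which is precisely the weakness discussed next.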
The advantages the flat model enjoys over other models are its simplicity, in concept and ease of use, and the fact that it is inexpensive [4]. It practically functions as a list of records in a text file that anyone can open and modify to their needs. Examples of flat files used today are /etc/passwd and /etc/group in Unix-like operating systems. Other examples are found in contact lists and address books when imported and exported between devices and services.
However, the model has major limitations and disadvantages that generally fall into three categories: integrity, durability and implementation.
In terms of data integrity, the model lacks what is commonly known as referential integrity, the direct linking of different attributes in a database [5]. If a database administrator were to remove the record for David Howard or change his address, he or she would have to manually search through all the records in the database where David Howard or his address is referenced. Further, adding a contact with multiple phone numbers or addresses is very difficult to implement due to the limitations of this model.
Contacts:
FirstName  LastName  PhoneNr      Street             PostCode
Adam       Palmer    07486563476  West garden 1B     49 412
David      Howard    07486464464  Library street 12  48 216
Lisa       Johns     07486432342  Main road 32       43 975
Sara       Williams  07443654368  Down street 6A     47 197

Table 2.2.4: Contact records in a flat model.
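The update problem this causes can be sketched as follows, with hypothetical in-memory stand-ins for two flat files that both mention David Howard's address:

```python
# Sketch of the manual update problem in a flat model: changing one value
# means scanning every record, in every file, for every place it appears.
contacts = [
    ["Adam", "Palmer", "07486563476", "West garden 1B", "49 412"],
    ["David", "Howard", "07486464464", "Library street 12", "48 216"],
]
orders = [  # hypothetical second file that repeats the same address
    ["David", "Howard", "Library street 12", "4418988"],
]

# No referential integrity: the new address must be written into every
# record where the old one is referenced, and any missed occurrence
# silently leaves the data inconsistent.
old, new = "Library street 12", "Main road 7"
for table in (contacts, orders):
    for record in table:
        for i, field in enumerate(record):
            if field == old:
                record[i] = new

print(orders[0][2])  # Main road 7
```

A relational database avoids this by storing the address once and referencing it by key, as described in the next section.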
Another data-integrity-related issue is the model's inability to ensure that values in record fields are of a valid type and that references to the same subject are consistent across multiple entries. Both problems can be seen in the example below, where an unintended misspelling of Sara Williams' last name appears in the LastName field of one of her customer order records, and an invalid data type appears in the Price field. Such problems can be expensive and laborious to deal with, as the database will produce inaccurate search results or, in some instances, simply fail to produce any.
Orders:
FirstName  LastName  CustomerID  Date        OrderNr  Price
Sara       Williams  114354      04-03-2018  4418988  562.99
Sara       Williams  114354      07-09-2018  4419689  Organic apples
Sara       William   114354      25-12-2018  4420294  299.99
Sara       Williams  114354      28-02-2019  4421481  499.98

Table 2.2.5: Customer order records in a flat model.
We can also see in the example above the redundancy caused by repeating the customer's name and ID in every record, which is an inefficient use of disk space and resources.
In terms of implementation, search operations, especially over large volumes of records, are extremely slow and time-consuming, since the format requires the computer to start every search at the beginning of the list of entries and work its way through sequentially.
If another application is to use the same database, the code in the first application responsible for accessing, parsing, processing and modifying the data must be rewritten and duplicated in the new application. Similarly, any change to the data structure of the database requires the code in both applications to be updated accordingly [6]. Further, records in the database are susceptible to being lost or corrupted when multiple users, applications or threads try to write to the database at the same time.
In terms of durability, data in the flat model is prone to corruption if the machine hosting it crashes while a program is updating a record.
Advantages:
- Simple to create
- Easy to use
- Inexpensive

Disadvantages:
- Data redundancy
- Data inconsistency
- Data access control difficulties
- Requires extensive programming
- Poor security

Table 2.2.6: Advantages and disadvantages of the flat-file model.
These limitations and disadvantages are some of the challenges that later database models and database management systems try to tackle and find solutions for.
2.3. Relational
In the 1970s, Edgar F. Codd, a programmer and Oxford-educated mathematician working at IBM, published a paper, "A Relational Model of Data for Large Shared Data Banks" [7], showing how information in large databases can be accessed without needing to know how the data is stored or structured [8]. In this paper Codd introduced the term "relational database" and proposed shifting from storing data in hierarchical or navigational structures to storing it in rows and columns.
Each table in a relational database has one or more data categories known as columns, and each row in the table is a record. In the table below we can see what such a table could look like. The table describes a list of bands and consists of two columns: an ID column specifying a band's ID and a Band column specifying the band's name. In this example we have a total of 7 records.
ID     Band
ART01  DBMS Hoodlums
ART02  Binary Beasts
ART03  Callback Cats
ART04  Life Cycle Thugs
ART05  Multiprocessing Moguls
ART06  Open Source Pundits
ART07  Source Code Cannibals
Table 2.3.1: Bands table in a relational database model.
The following two tables describe other attributes related to the bands such as albums and songs they have produced. They are however separated into different tables as they represent different entities with different attributes.
ID     Title                Year  Label                      Band
ALB01  Boolean Autocrats    2006  Flip Framework Records     ART03
ALB02  Mind Map Cache       2013  Garbage Collector Records  ART03
ALB03  Ode To Code          2018  Glueware Gremlin Records   ART04
ALB04  Open Source Pundits  2007  Overflow Archives          ART07
ALB05  Pentium Predators    2010  Garbage Collector Records  ART01
ALB06  We Push to Master    1998  Overflow Archives          ART05
Table 2.3.2: Albums table in a relational database model.
ID Song Length Composer Album
SNG01 405 Found 04:10 Joe ALB05
SNG02 Binary Fetch 03:29 Steve ALB05
SNG03 Code Push 02:46 Joe ALB01
SNG04 Byte Me 03:26 Joe ALB04
SNG05 C-Sick 03:50 Steve ALB03
SNG06 Dirty Bits 02:55 Thomas ALB03
SNG07 Endless Embed 04:50 Thomas ALB02
SNG08 Error By Night 00:25 Joe/Steve ALB06
SNG09 Floating Encapsulation 03:08 Steve ALB05
SNG10 Hex Hypercity 03:41 Thomas ALB01
SNG11 Hypertext Assassins 04:20 Thomas/Joe ALB03
SNG12 Loon Bit Loop 03:34 Joe ALB02
SNG13 Regex Natives 03:57 Steve ALB01
SNG14 Runtime Terror 03:41 Steve ALB04
SNG15 The Epic Objective 03:25 Joe/Steve ALB03
Table 2.3.3: Songs table in a relational database model.
Relationships between tables are created using keys. Several types of keys are used to create such relationships; in our example we can see the use of the primary key and the foreign key. In the Songs table above, the first column, ID, stores primary keys: fields used to uniquely identify a row in the table. In the fifth column of the same table, the Album column, we see the primary keys of the Albums table being used. When keys are used this way they are known as foreign keys, and they establish a relationship between columns in two database tables.
With the help of such keys in a relational database, one can obtain a view of the database that suits specific needs. In the table below, we can see how relationships between different tables can be used to create a desired view.
Artists                  Albums                       Songs
Band                     Album                 Year   Song          Composer
DBMS Hoodlums            Boolean Autocrats     2006   405 Found     Joe
Binary Beasts            Mind Map Cache        2013   Binary Fetch  Steve
Callback Cats            Ode To Code           2018   Code Push     Joe
Life Cycle Thugs         Open Source Pundits   2007   Byte Me       Joe
Multiprocessing Moguls   Pentium Predators     1998   C-Sick        Steve

Query result:
Title         Artist             Album               Composer   Year
405 Found     Life Cycle Thugs   Ode To Code         Joe        2018
Binary Fetch  Life Cycle Thugs   Ode To Code         Joe        2018
Code Push     Life Cycle Thugs   Mind Map Cache      Joe/Steve  2013
Byte Me       DBMS Hoodlums      Boolean Autocrats   Steve      2006
C-Sick        DBMS Hoodlums      Pentium Predators   Thomas     1998

Table 2.3.4: Accessing data from different tables in a relational model database.
The main advantage of relational databases is the ability they give users to easily query, filter, sort and combine data to extract the information they need. Further, such databases can be extended with new data categories without modifying the existing applications that use them, since the data is not tied to its physical organization.
The relational model also excels at data consistency across applications and database instances, ensuring that multiple instances of a database have the same data all the time [9].
Advantages:
- Does not require familiarity with internal structure
- Multiple users can access the data at the same time
- Supports distributed databases

Disadvantages:
- Substantial hardware and system software overhead
- Can facilitate poor design and implementation
- May promote "islands of information" problems

Table 2.3.5: Advantages and disadvantages of the relational model.
2.4. Post-Relational
With the introduction and rise of Web 2.0, web servers and databases were put under more and more performance pressure. The solution to this new challenge was to split databases and servers into smaller instances that work together, in contrast with the earlier trend of moving to larger and more powerful machines to deal with increased workloads [3]. Further, the demand for databases with more flexible data structures has increased [10].
NoSQL is an example of such a post-relational, or non-relational as it is sometimes described, class of databases. Several variations of NoSQL databases are in use today, for example document-oriented databases, graph databases and wide-column databases. Cassandra is an example of a wide-column database and Elasticsearch is an example of a document database.
As previously mentioned, Elasticsearch is document-oriented, meaning that it stores entire objects or documents rather than flattening them into a table schema, one field per column, and reconstructing them every time they are retrieved [11]. It uses JSON, short for JavaScript Object Notation, as the serialization format for documents, a format that has become the standard in the NoSQL movement. It is simple, concise, easy to read and supported by most programming languages. An example of how an object and its attributes are represented in a JSON document is shown below.
Illustration 2.4.1: An identification card for a John Smith that includes different types of information about him.
{
  "name": "John Smith",
  "id_number": 11541556111841,
  "info": {
    "born": "1988-06-15",
    "memberships": [ "Theatre club", "Chess club" ],
    "bio": "Software engineering student looking to meet like-minded people."
  },
  "join_date": "2018-10-01"
}
Code block 2.4.2: An identification card object represented in a JSON document format.
3. Environment
The database use case in this study is responsible for storing logs generated by applications running in the production system. However, the data goes through a couple of intermediary steps before it is stored by the database and made available for searching and exporting.
Those steps are illustrated below in an architectural overview of the logging system and where the database fits into it. On one end of the operational process we have the apps, which produce both operation and transaction logs. The operation logs consist of system state and process execution logs used for maintenance, upkeep and monitoring. The transaction logs consist of information requested from the application instances, similar in nature to API/HTTP requests. These logs are stored for recordkeeping.
Diagram 3.0: An overview of the logging architecture.
3.1. App
The first step of the logging system is the source of the logs. What those applications do is not directly relevant to our study, as we are only interested in the data set, or workload, that they produce. That is to say, the applications themselves and the tasks they perform can be changed without any impact on the database, as long as the logs they generate are of the same type from the database's perspective. However, as previously mentioned, those applications produce both operation and transaction logs which contain valuable information, and as such they are stored.
Diagram 3.1: The app instances in the logging architecture.
3.2. Stream processing
The logs generated and output by the app instances form a stream of data. This stream is handled by Kafka, a distributed streaming platform, before being sent further down the operation line. The reason for including Kafka in the logging architecture is to decouple the producer side from the consumer side, so as to prevent any blocking in the system for a user due to data indexation operations.
A streaming platform is a platform capable of publishing and subscribing to streams of records, storing them in a fault-tolerant, durable way and processing them as they occur [12]. Apache Kafka is such a platform, offering high throughput and low latency for handling real-time data feeds.
Diagram 3.2: Stream processing in the logging architecture.
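The decoupling argument can be illustrated with a minimal in-memory stand-in for Kafka: a buffered queue between a producer thread and a consumer thread. Real Kafka adds durability, partitioning and distribution on top of this idea; this is only a conceptual sketch.

```python
# Conceptual sketch of the decoupling Kafka provides: the producer appends
# logs to a buffered queue and moves on, while a consumer indexes them at
# its own pace. (A stand-in using Python's queue module, not real Kafka.)
import queue
import threading

log_queue = queue.Queue()
indexed = []

def producer():
    for i in range(5):
        log_queue.put(f"log-{i}")   # returns immediately; no waiting on the indexer
    log_queue.put(None)             # sentinel: no more logs

def consumer():
    while True:
        log = log_queue.get()
        if log is None:
            break
        indexed.append(log)         # stands in for a (slow) database write

t_prod = threading.Thread(target=producer)
t_cons = threading.Thread(target=consumer)
t_prod.start(); t_cons.start()
t_prod.join(); t_cons.join()

print(indexed)  # ['log-0', 'log-1', 'log-2', 'log-3', 'log-4']
```

The producer never blocks on indexation: a slow database only lets the queue grow, which is exactly the buffering role Kafka plays in the architecture above.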
3.3. Log stashing
The logs handled by Kafka are finally sent to a data collection and log-parsing engine, where they are further processed so that they can be served to the database in the appropriate format and structure.
Diagram 3.3: The log stashing plugins in the logging architecture.
3.4. Storage
Finally, the logs are ready to be stored by the database.
Diagram 3.4: The databases in the logging architecture.
3.5. Management
Once the logs are successfully stored, they can be accessed and exported through a graphical interface specifically designed for this purpose.
Diagram 3.5: The log manager in the logging architecture.
4. Experiment
4.1. Method
The research is split into two parts: a theoretical analysis of the differences between Cassandra and Elasticsearch, and an empirical experiment in which their performance is measured and compared.
The former was necessary in order to conduct the latter, as knowledge of how the databases function structurally and technically was used to build a small piece of software known as a log-stasher. As the name indicates, its purpose is to receive log data from a log producer and then use the appropriate client operations to send the logs to the database and instruct it on how to store them.
The log-stasher was built in Scala and used the same dependencies and dependency versions as the one used against the Cassandra database, to ensure a controlled environment when running the benchmarking experiment.
The benchmarking was done in a Kubernetes cluster, and Helm charts were used to deploy the databases along with the log producers and log-stashers. This was done to simulate the production environment under which they will be running.
The performance metrics of the systems were collected with the help of two utilities, Prometheus and Grafana, which are commonly used in this type of environment.
The workload consists of randomly generated log data of consistent size. A total of 2,000,000 logs are generated by the Producer; the Logstasher receives them and runs the database instructions necessary to store them.
After the workload is completed, the metrics data is downloaded from the web interfaces of Grafana and Kubernetes. Grafana provides the metrics data as time series in CSV text files. Kubernetes allows the logs produced by the deployed applications in the cluster to be printed and stored as text files.
The metrics and logs are then imported into Excel, where they are processed and plotted onto scatter charts for analysis and presentation.
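The same per-series processing can be done programmatically; the sketch below parses a small CSV time series and computes an average. The exact column layout of a Grafana CSV export depends on the dashboard, so the header used here is an assumption.

```python
# Sketch of processing an exported metrics series: read a CSV time series
# and compute a summary statistic. The column names are assumptions, as the
# layout of a Grafana CSV export varies with the dashboard.
import csv
import io

grafana_csv = io.StringIO(
    "Time,cpu_usage\n"
    "2019-09-01 12:00:00,0.42\n"
    "2019-09-01 12:00:30,0.55\n"
    "2019-09-01 12:01:00,0.47\n"
)

samples = [float(row["cpu_usage"]) for row in csv.DictReader(grafana_csv)]
average = sum(samples) / len(samples)
print(round(average, 2))  # 0.48
```

In the thesis this step was done in Excel; the sketch only shows that the exported CSV is directly machine-readable.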
4.2. Design
In order to conduct the benchmarks, a number of applications needed to be put in place to simulate the log indexation procedure. As previously mentioned, we utilize a log producer application to create the workload, in order to put the indexation procedure under stress.
The producer, as seen in the illustration below, consists of two primary processes: a log generator that produces the logs constituting the benchmark workload, consistent in size but with randomly generated parameter values, and the Kafka Producer, which sends the generated logs further down the operation flow. The logs are thereafter fetched by the Logstasher, the second stage of the operation flow. The Logstasher is a piece of software which I wrote for this specific use case; it is responsible for three primary tasks: receiving the logs generated by the producer, performing any necessary formatting and data processing on the logs, and finally executing the appropriate database calls to store them.
The retrieval of the logs is done through the Kafka Consumer, which together with the Kafka Producer and Zookeeper makes up the Kafka system. Zookeeper is responsible for coordinating the consumption of the logs in the queue by the different Consumer instances. Once the logs have been received by the Logstasher, they are converted to the format the database requires in order to store them: in the case of Elasticsearch, JSON documents for its document-based model, and in the case of Cassandra, conventional query statements. Finally, for its third responsibility, the Logstasher runs the appropriate API calls against the database it is connected to so that the database receives and stores the log data.
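The formatting step can be sketched as follows. The log fields and the table and index names are illustrative assumptions, not the production schema; the point is only that the same record takes two different shapes for the two databases.

```python
# Sketch of the Logstasher's formatting step: the same log record is turned
# into a JSON document for Elasticsearch and a CQL-style INSERT for Cassandra.
# The log fields and the table name "logs" are illustrative assumptions.
import json

log = {"app": "APP 1", "level": "INFO", "message": "request handled",
       "timestamp": "2019-09-01T12:00:00Z"}

# Elasticsearch: the document model, serialized as JSON.
es_document = json.dumps(log)

# Cassandra: a conventional query statement. A real client would use a
# prepared statement with bound values rather than string formatting.
cql_insert = (
    "INSERT INTO logs (app, level, message, timestamp) "
    "VALUES ('{app}', '{level}', '{message}', '{timestamp}')".format(**log)
)

print(es_document)
print(cql_insert)
```

In the actual experiment this logic lives in the Scala log-stasher and uses the databases' client libraries; the sketch only contrasts the two target formats.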
The illustrations below provide an under-the-hood overview of the entire logging system used for the benchmarking. The first illustration represents the Cassandra configuration and the second the Elasticsearch configuration.
One difference between the two configurations, aside from the databases themselves, is how the data is handled once it is received by the database instances. In the case of Elasticsearch, the database instances, or more technically the pods, each have one persistent storage into which they store the data. In the case of Cassandra, however, there are two storages per pod or instance.
This design choice, as explained to me by one of my supervisors, is for performance and data security. Before the data is indexed into the database it is held in memory, which can be problematic in the case of system crashes or restarts, as it leads to data loss. Thus, the data is first written to a commit log, from which it is then flushed into the data storage.
This implementation was not possible for me to mimic in Elasticsearch due to the time constraints under which the benchmarks were conducted. It is therefore something to consider when drawing conclusions from the benchmark results and when comparing the resource utilization of the Cassandra and Elasticsearch deployments.
Diagram 4.2.1: An under-the-hood overview of the logging system used for the benchmarking of Cassandra.
Diagram 4.2.2: An under-the-hood overview of the logging system used for the benchmarking of Elasticsearch.
4.3. Data replication & consistency
The data replication model, which ensures data consistency in replicated databases such as Elasticsearch and Cassandra, is the process responsible for keeping the database replicas in sync when data is added or removed. A failure in this process can lead to situations where reading from one replica returns different results than reading from another.
In the case of Elasticsearch, each index is divided into shards, and each shard can have multiple copies; these copies are known as a replication group [13]. The data replication model is based on the primary-backup model described in the PacificA paper, “PacificA: Replication in Log-Based Distributed Storage Systems” [14]. In this model, a single copy acts as the main entry point for all indexing operations; it is responsible for validating the operations and for replicating them to the other copies. Once all replicas have successfully performed the operation and responded to the primary, the primary acknowledges the successful completion of the request to the client.
The implication of this implementation is that it can be fault tolerant while maintaining only two copies of the data. This contrasts with quorum-based systems, explained below, which need a minimum of three copies to maintain fault tolerance. However, a single slow shard copy is enough to slow down an entire operation, as the primary has to wait for all the replicas [13].
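The copy-count difference can be made concrete with a little arithmetic, where f is the number of simultaneous replica failures the system should survive. This is a sketch of the standard argument, not code from either database:

```python
def min_copies_primary_backup(f: int) -> int:
    # Primary-backup remains correct as long as one copy survives,
    # so surviving f failures requires f + 1 copies.
    return f + 1

def min_copies_quorum(f: int) -> int:
    # A quorum system needs a majority to remain reachable after any
    # f failures, which requires 2f + 1 copies in total.
    return 2 * f + 1
```

For a single tolerated failure (f = 1), primary-backup needs two copies where a quorum system needs three, matching the replication factors used in the benchmarks.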
A quorum-based technique, in distributed systems, enforces consistent operations. Such a technique can be used in replicated databases in the form of a replica control protocol that ensures that no two copies of a data item are read or written by transactions concurrently.
In the case of Cassandra, the approach takes inspiration from the Amazon Dynamo paper, “Dynamo: Amazon’s Highly Available Key-value Store” [15]. Shards are independent of one another and work together without a primary shard or leader, using a peer-to-peer communication system. Whichever node receives an operation request becomes the coordinator node for that operation.
Cassandra allows for configuring the desired consistency level for the particular implementation’s use case. This consistency level specifies how many replicas need to respond in order to consider the operation successful.
Among the available consistency level configurations are: ONE – only a single replica must respond, TWO – two replicas must respond, QUORUM – a majority of the replicas must respond, ALL – all of the replicas must respond. These consistency levels represent a tradeoff between data consistency and data availability. A lower consistency level, such as ONE, provides higher throughput and availability and lower latency, as it does not involve waiting for the other replicas, but comes at the cost of data correctness. The opposite is true for higher consistency levels [16].
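The number of acknowledgements each level requires can be expressed as a small helper; this is a sketch of the rule, and the function name is mine, not a Cassandra API:

```python
def required_acks(consistency_level: str, replication_factor: int) -> int:
    """Number of replicas that must acknowledge an operation before the
    coordinator reports success, for a few common Cassandra levels."""
    levels = {
        "ONE": 1,
        "TWO": 2,
        "QUORUM": replication_factor // 2 + 1,  # strict majority of replicas
        "ALL": replication_factor,
    }
    return levels[consistency_level]
```

With the replication factor of 3 used in the benchmarks, QUORUM requires two of the three replicas to respond.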
The table below shows the data replication and consistency levels used for each of the experiments.
Data replication and consistency configuration

| Parameter         | Cassandra default | Cassandra modified | Elasticsearch |
| Replication       | 3                 | 3                  | 2             |
| Consistency level | QUORUM            | QUORUM             | All*          |
Table 4.3.1: Replication and consistency level configurations used for each of the deployments.
* NOTE: Elasticsearch doesn’t use a quorum-based system; the All value is used as it is the closest equivalent to represent its configuration.
4.4. Deployment
The benchmarking of the databases is performed in Kubernetes, a container orchestration system for cloud-run services and applications. In order for the applications to be deployed on the platform, they first have to be built into Docker images. This step is performed using a Dockerfile, in which the configuration specifications for the image are set. Similarly, once the images are ready for deployment, a YAML file is used to configure the parameters and specifications under which they are deployed.
The YAML file allows for a great deal of configuration and provides the information the applications use to communicate with one another, such as networking configuration, proxy, encryption and port numbers, to name a few. Other parameters include how many instances of each application, or more technically pods, should be created and how many resources they are allowed to utilize.
The Kubernetes cluster services use this information to distribute the deployed images and allocate any specified storage among the nodes the cluster is configured to run on.
The illustration below shows how the YAML file is provided to the Kubernetes cluster services through its API, and the number of workers (nodes) it is set to work with. Each worker, or node, runs a kubelet, a “node agent” responsible for ensuring that the containers on the node are running and healthy according to the specifications it is provided with [17].
Diagram 4.4.1: Kubernetes deployment overview.
Diagram 4.4.2: Overview of the data and operation flow for the Cassandra default and modified configuration deployments
Diagram 4.4.3: Overview of the data and operation flow for the Elasticsearch deployment.
The illustration above provides an overview of the data and operation flow for the Cassandra default and modified configuration deployments. The illustration that follows it provides the same overview for the Elasticsearch deployment. As can be seen, the applications that would generate the logs in production are replaced by the Producer, which was discussed previously.
4.4.1. Deployment similarities
As previously mentioned, the YAML file contains the values and the configuration specifications for the images, such as applications, services and networks, which are deployed. The table below shows the parameter values that were identical across the three benchmark deployments conducted for the experiment.
YAML deployment configuration similarities (Databases)

| Parameter                  | Cassandra default                     | Cassandra modified                    | Elasticsearch                                     |
| Database: Service          | cassandra                             | cassandra                             | elasticsearch                                     |
| Database: Nodes            | cassandra-0, cassandra-1, cassandra-2 | cassandra-0, cassandra-1, cassandra-2 | elasticsearch-0, elasticsearch-1, elasticsearch-2 |
| Database: Port             | 9042                                  | 9042                                  | 9200                                              |
| Database: Replicas         | 3                                     | 3                                     | 3                                                 |
| Persistence: Storage class | Standard                              | Standard                              | Standard                                          |
| Persistence: Volume size   | 8Gi                                   | 8Gi                                   | 8Gi                                               |
| Resources: Requests memory | 8Gi                                   | 8Gi                                   | 8Gi                                               |
| Resources: Requests CPU    | 1                                     | 1                                     | 1                                                 |
| Resources: Limits memory   | 8Gi                                   | 8Gi                                   | 8Gi                                               |
| Resources: Limits CPU      | 3                                     | 3                                     | 3                                                 |
| Heap new size              | 2048                                  | 2048                                  | 2048                                              |
| Heap max size              | 4096                                  | 4096                                  | 4096                                              |
Table 4.4.1.1: YAML-file specified parameter values that are similar for all three benchmark deployments.
YAML deployment configuration similarities (Logging)

| Parameter                               | Cassandra default | Cassandra modified | Elasticsearch |
| Logstasher: Replicas                    | 3                 | 3                  | 3             |
| Logstasher: Requests memory             | 4Gi               | 4Gi                | 4Gi           |
| Logstasher: Requests CPU                | 2                 | 2                  | 2             |
| Logstasher: Limits memory               | 8Gi               | 8Gi                | 8Gi           |
| Logstasher: Limits CPU                  | 4                 | 4                  | 4             |
| Zookeeper: Replicas                     | 1                 | 1                  | 1             |
| Zookeeper: PVC storage class name       | Standard          | Standard           | Standard      |
| Zookeeper: Storage                      | 1Gi               | 1Gi                | 1Gi           |
| Kafka: Replicas                         | 3                 | 3                  | 3             |
| Kafka: PVC storage class name           | Standard          | Standard           | Standard      |
| Kafka: Persistence enabled              | False             | False              | False         |
| Kafka: Default replication factor       | 1                 | 1                  | 1             |
| Kafka: Offsets topic replication factor | 1                 | 1                  | 1             |
| Producer: Kind                          | Deployment        | Deployment         | Deployment    |
| Producer: Replicas                      | 1                 | 1                  | 1             |
Table 4.4.1.2: YAML-file specified parameter values that are similar for all three benchmark deployments.
YAML deployment configuration similarities (Monitoring)

| Parameter                                     | Cassandra default | Cassandra modified | Elasticsearch |
| Prometheus: Access mode                       | ReadWriteOnce     | ReadWriteOnce      | ReadWriteOnce |
| Prometheus: Storage class                     | Standard          | Standard           | Standard      |
| Prometheus: Requests storage                  | 5Gi               | 5Gi                | 5Gi           |
| Grafana: Persistence enabled                  | True              | True               | True          |
| Grafana: Storage class                        | Standard          | Standard           | Standard      |
| Grafana: Access mode                          | ReadWriteOnce     | ReadWriteOnce      | ReadWriteOnce |
| Grafana: Size                                 | 5Gi               | 5Gi                | 5Gi           |
| Alert manager: Access modes                   | ReadWriteOnce     | ReadWriteOnce      | ReadWriteOnce |
| Alert manager: Storage class                  | Standard          | Standard           | Standard      |
| Alert manager: Requests size                  | 5Gi               | 5Gi                | 5Gi           |
| Prometheus node exporter: Service port        | 30206             | 30206              | 30206         |
| Prometheus node exporter: Service target port | 30206             | 30206              | 30206         |
| Kube etcd: Enabled                            | False             | False              | False         |
| Kube controller manager: Enabled              | False             | False              | False         |
| Kube scheduler: Enabled                       | False             | False              | False         |
Table 4.4.1.3: YAML-file specified parameter values that are similar for all three benchmark deployments.
4.4.2. Deployment differences
In the tables above, two Cassandra configurations were presented. This is because the Cassandra version running in the production environment is configured to use both data encryption and compression between the database nodes. Since the time constraint did not allow me to produce a matching Elasticsearch configuration, which would have required familiarizing myself with the security systems and certification procedures in place, I opted to add a second Cassandra benchmark stripped of these features, as a reference for how much of an impact they have on the overall results.
Thus, two Cassandra deployments were used. The first, Cassandra default, contains all the in-house configuration except for data encryption and compression between the nodes. The second, Cassandra modified, is configured to use the in-house configuration including the configuration variables that enable and set the data encryption and compression between the nodes.
Aside from those differences, everything else is configured to run with identical specifications, as can be seen in the table above. The tools used to collect the metrics, and how they were configured, can also be seen at the bottom of the table.
Deployment differences

| Parameter         | Cassandra default | Cassandra modified | Elasticsearch |
| Replication       | 3                 | 3                  | 2             |
| Consistency level | QUORUM            | QUORUM             | All*          |
| Compression       | No                | Yes                | No            |
| Encryption        | No                | Yes                | No            |
Table 4.4.2.1: Differences between the three benchmarked configurations.
* NOTE: Elasticsearch doesn’t use a quorum-based system; the All value is used as it is the closest equivalent to represent its configuration.
5. Results
This section of the thesis presents the resource utilization of the three conducted benchmarks. It is divided into three sections: the first presents the results for CPU utilization, followed by a section for memory utilization and finally a section for disk space usage.
5.1. CPU usage
The CPU usage results show that the modified Cassandra configuration took the longest to complete the benchmark, roughly 25 minutes in total, averaging around 600% overall CPU usage during the workload. In comparison, the default Cassandra configuration took about 9 minutes to complete the task and Elasticsearch about 13 minutes.
The default Cassandra configuration, however, used roughly twice the amount of CPU to complete the task compared to the two other deployments. We can also see in chart 5.1.3 that not all of Elasticsearch’s instances used the same amount of CPU. This is due to the role assignment system under which the instances operate.
The main differences between Elasticsearch’s results and Cassandra’s are the number of copies made of the data and the indexing procedure. Each Cassandra deployment made three copies of the data while Elasticsearch made only two, due to the differences in the replication models of the two databases: Elasticsearch can maintain a fault tolerant implementation with only two copies while Cassandra needs at least three. Furthermore, Cassandra first commits the data to a separate storage before indexing it and storing it in the data storage. These two factors increase the amount of computation needed and consequently the CPU usage.
Another difference, between the default and the modified Cassandra configurations, is data compression and encryption between the nodes. The default Cassandra configuration runs without encryption or compression, while the modified configuration uses both. Compression and encryption come at a cost, as they add work to the communication between the nodes and thus impact the throughput negatively.
These differences can be seen more clearly if we calculate the area under the curve of each deployment’s overall CPU utilization; the area represents the total amount of CPU work done by each deployment.
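The area under the curve can be approximated with the trapezoidal rule over the sampled utilization values. This is a sketch of the calculation, not the exact script used for the thesis:

```python
def cpu_work(samples: list) -> float:
    """Approximate total CPU work as the area under a utilization curve,
    given (time_seconds, cpu_percent) samples, via the trapezoidal rule."""
    area = 0.0
    for (t0, u0), (t1, u1) in zip(samples, samples[1:]):
        area += (u0 + u1) / 2 * (t1 - t0)
    return area
```

For example, a deployment holding a steady 100% CPU for one minute accumulates the same area as one ramping linearly from 0% to 200% over that minute.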
Chart 5.1: Overall Total CPU work done for each deployment.
[Figure for chart 5.1: bar chart of total CPU work (area under overall CPU utilization), 0–9,000, for Cassandra Default, Cassandra Modified and Elasticsearch.]
The default Cassandra deployment exceeded the maximum amount of CPU allocated to it in the configuration, as can be seen in the graphs below, which might be due to the way the compression and encryption were removed. Running more benchmarks of each deployment would have helped clarify this issue, but unfortunately time did not allow for more runs. The default Cassandra results are nonetheless left here as a reference.
5.1.1. Cassandra default configuration: CPU usage
Chart 5.1.1: CPU usage of each Cassandra pod under the workload of the Cassandra default configuration benchmark.
5.1.2. Cassandra modified configuration: CPU usage
Chart 5.1.2: CPU usage of each Cassandra pod under the workload of the Cassandra modified configuration benchmark.
[Figure for chart 5.1.1: CPU usage (vCPU %), 0–800%, over 00:05:00–00:15:00; series Cassandra-1, Cassandra-2, Cassandra-3.]
[Figure for chart 5.1.2: CPU usage (vCPU %), 0–400%, over 00:03:00–00:31:00; series Cassandra-1, Cassandra-2, Cassandra-3.]
5.1.3. Elasticsearch default configuration: CPU usage
Chart 5.1.3: CPU usage of each Elasticsearch pod under the workload of the Elasticsearch default configuration benchmark.
5.1.4. All deployments: Total CPU usage
Chart 5.1.4: Total CPU usage of all instances for each deployment.
In the chart above we see the overall CPU utilization of each deployment. The differences in CPU usage and in the time taken, discussed at the beginning of this section, can be seen more clearly here.
[Figure for chart 5.1.3: CPU usage (vCPU %), 0–400%, over 00:22:00–00:36:00; series Elasticsearch-1, Elasticsearch-2, Elasticsearch-3.]
[Figure for chart 5.1.4: total CPU usage (vCPU %), 0–1,800%, over 00:00:00–00:25:00; series Cassandra Default, Cassandra Modified, Elasticsearch.]
5.2. Memory
The memory utilization results show that both Cassandra configurations utilized the same amount of memory, both starting at just over 8 gigabytes per instance and landing at around 13 gigabytes by the end of the workload. This indicates that the differences in their configurations did not require more memory to complete the task; the duration for which the memory was used, however, differed.
Elasticsearch, however, used significantly less memory than Cassandra, even when considering that Elasticsearch created only two copies of the data while Cassandra created three.
We can see from Elasticsearch’s results that all three instances started at about 3 gigabytes of memory usage, one of which remained at that level throughout the benchmark. The other two instances, Elasticsearch-2 and Elasticsearch-3, followed an increase in memory usage under the workload similar to Cassandra’s but ended at around 4 gigabytes each by the end of the workload.
Elasticsearch-1’s reduced memory usage compared to the other two Elasticsearch instances mirrors its reduced CPU usage seen in the previous section, specifically in chart 5.1.3.
5.2.1. Cassandra default configuration: Memory usage
Chart 5.2.1: Memory usage of each Cassandra pod under the workload of the Cassandra default configuration benchmark.
[Figure for chart 5.2.1: memory usage (gigabytes), 0–16, over 00:00:00–00:16:00; series Cassandra-1, Cassandra-2, Cassandra-3.]
5.2.2. Cassandra modified configuration: Memory usage
Chart 5.2.2: Memory usage of each Cassandra pod under the workload of the Cassandra modified configuration benchmark.
5.2.3. Elasticsearch default configuration: Memory usage
Chart 5.2.3: Memory usage of each Elasticsearch pod under the workload of the Elasticsearch default configuration benchmark.
[Figure for chart 5.2.2: memory usage (gigabytes), 0–16, over 00:00:00–00:32:00; series Cassandra-1, Cassandra-2, Cassandra-3.]
[Figure for chart 5.2.3: memory usage (gigabytes), 0–5, over 00:00:00–00:48:00; series Elasticsearch-1, Elasticsearch-2, Elasticsearch-3.]
5.2.4. All deployments: Total memory usage
Chart 5.2.4: Total memory usage of all instances for each deployment.
5.3. Disk space
A difference between this section, disk space usage, and the two preceding sections, CPU usage and memory usage, is that the results for each Cassandra configuration are presented in two charts, while Elasticsearch’s results remain in a single chart. This is because each Cassandra instance is assigned two storage volumes while each Elasticsearch instance is assigned a single one. It follows from how the Cassandra image and version currently in use handles the data before it is indexed into storage: the aim is to prevent data loss in case of a crash or restart, to reduce redundancy in the indexing process, and to improve performance for the specific production use case. These differences and the reasoning behind them are discussed in more detail in section 4.2.
The disk space utilization results show that both Cassandra deployments ended up using about 1,480 megabytes of disk space per instance, while two of the Elasticsearch instances ended up using about 800 megabytes each, with the third, Elasticsearch-1, remaining at its initial disk space usage of 52 megabytes throughout the benchmark. Considering that Cassandra creates three copies of the data while Elasticsearch creates only two, Elasticsearch still used less disk space even when that difference is accounted for.
We can also see how the disk space usage in Cassandra’s Data directories corresponds with changes in its Commit log directories. A clear example can be seen in charts 5.3.1 and 5.3.2 at the 13-minute mark, where the disk space usage of the Commit log drops by about 100 megabytes, which coincides with an increase of disk space in the Data directory shortly after.
5.3.1. Cassandra default configuration: Commit disk usage
Chart 5.3.1: Disk space usage of each Cassandra commit volume under the workload of the Cassandra default configuration benchmark.
5.3.2. Cassandra default configuration: Data disk usage
Chart 5.3.2: Disk space usage of each Cassandra data volume under the workload of the Cassandra default configuration benchmark.
[Figure for chart 5.3.1: disk space usage (megabytes), 0–400, over 00:00:00–00:21:00; series Cassandra Commit Log-1, -2, -3.]
[Figure for chart 5.3.2: disk space usage (megabytes), 0–2,000, over 00:00:00–00:18:00; series Cassandra Data Dir-1, -2, -3.]
5.3.3. Cassandra modified configuration: Commit disk usage
Chart 5.3.3: Disk space usage of each Cassandra commit volume under the workload of the Cassandra modified configuration benchmark.
5.3.4. Cassandra modified configuration: Data disk usage
Chart 5.3.4: Disk space usage of each Cassandra data volume under the workload of the Cassandra modified configuration benchmark.
[Figure for chart 5.3.3: disk space usage (megabytes), 0–400, over 00:00:00–00:35:00; series Cassandra Commit Log-1, -2, -3.]
[Figure for chart 5.3.4: disk space usage (megabytes), 0–1,600, over 00:00:00–00:35:00; series Cassandra Data Dir-1, -2, -3.]
5.3.5. Elasticsearch default configuration: Data disk usage
Chart 5.3.5: Disk space usage of each Elasticsearch volume under the workload of the Elasticsearch default configuration benchmark.
5.3.6. All deployments: Total disk usage
Chart 5.3.6: Total disk space usage of all instances for each deployment.
[Figure for chart 5.3.5: disk space usage (megabytes), 0–1,000, over 00:00:00–00:50:00; series Elasticsearch-1, Elasticsearch-2, Elasticsearch-3.]
[Figure for chart 5.3.6: disk space usage (megabytes), 0–5,000, over 00:00:00–00:25:00; series Cassandra Default Data Dir, Cassandra Default Commit Log, Cassandra Modified Data Dir, Cassandra Modified Commit Log, Elasticsearch.]
6. Analysis
The results from the benchmarks, presented in the previous chapter, are used along with data generated by the Producer and the Logstasher to provide context to the results. This data was collected from console logs written by the Producer and the Logstasher during the benchmark and consists of timestamps along with the total number of logs produced by the Producer or stashed by the Logstasher.
By matching the timestamps to those collected from the metrics data, we can plot the two on the same graph using two separate y-axes.
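The timestamp matching can be sketched as a simple join on the timestamps the two series share. This is illustrative only, not the actual tooling used to produce the charts:

```python
def align_by_timestamp(metrics: dict, outputs: dict) -> list:
    """Join CPU metric samples with Producer/Logstasher log counts on the
    timestamps both series share, so they can be plotted against two y-axes.
    Both inputs are dicts keyed by an 'hh:mm:ss' timestamp string."""
    shared = sorted(set(metrics) & set(outputs))
    return [(ts, metrics[ts], outputs[ts]) for ts in shared]
```

Each resulting triple (timestamp, CPU %, log count) supplies one x position and one value per y-axis in the combined chart.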
In this section we also use the total resource utilization of all three instances, unlike the results sections, where we looked at the resource utilization of each instance individually.
6.1. CPU utilization analysis
6.1.1. CPU usage vs logs output rate
Comparing the CPU usage of the three benchmarked deployments with their respective Producer’s log output rate and Logstasher’s stashing rate shows irregularities in how the two rates follow one another in the case of the default Cassandra configuration. We also see an increasing latency between the two the further along the workload timeline we go. This irregularity is not as evident in the other two benchmarks, the modified Cassandra configuration and Elasticsearch.
This difference could be an indirect result of the high CPU usage of Cassandra in this deployment, where a bottleneck may have occurred in the Producer’s and Logstasher’s ability to match the rate at which the database can receive the logs. It could also be due to limitations in how the Producer and Logstasher were implemented. Unfortunately, time did not allow for repeated benchmark runs to investigate this further.
6.1.1.1. Cassandra default configuration
Chart 6.1.1.1: Cassandra pods total CPU usage and the output rates of the Producer and the Logstasher under the Cassandra default configuration benchmark.
[Figure for chart 6.1.1.1: left axis Cassandra total CPU usage (vCPU %), 0–1,800%; right axis Producer & Logstasher output rate (logs), 0–10,000; time 14:18:00–14:28:00. Series: Cassandra CPU, 5-period moving averages of Producer and Logstasher output.]
6.1.1.2. Cassandra modified configuration
Chart 6.1.1.2: Cassandra pods total CPU usage and the output rates of the Producer and the Logstasher under the Cassandra modified configuration benchmark.
6.1.1.3. Elasticsearch default configuration
Chart 6.1.1.3: Elasticsearch pods total CPU usage and the output rates of the Producer and the Logstasher under the Elasticsearch default configuration benchmark.
[Figure for chart 6.1.1.2: left axis Cassandra total CPU usage (vCPU %), 0–1,000%; right axis Producer & Logstasher output rate (logs), 0–3,200; time 12:08:00–12:32:00. Series: Cassandra CPU, 5-period moving averages of Producer and Logstasher output.]
[Figure for chart 6.1.1.3: left axis Elasticsearch total CPU usage (vCPU %), 0–800%; right axis Producer & Logstasher output rate (logs), 0–8,000; time 17:21:00–17:35:00. Series: Elasticsearch CPU, 5-period moving averages of Producer and Logstasher output.]