Resource utilization comparison of Cassandra and Elasticsearch
Nizar Selander
September 2019
Faculty of Computing
Blekinge Institute of Technology SE-371 79 Karlskrona, Sweden
This thesis is submitted to the Faculty of Computing at Blekinge Institute of Technology in partial fulfillment of the requirements for the bachelor's degree in software engineering. The thesis is equivalent to 10 weeks of full-time studies.
Contact Information:
Author:
Nizar Selander
nizar.selander@gmail.com
External advisors:
Ruwan Lakmal Silva
ruwan.lakmal.silva@ericsson.com
Pär Karlsson
par.a.karlsson@ericsson.com
University advisors:
Krzysztof Wnuk
krzysztof.wnuk@bth.se
Conny Johansson
conny.johansson@bth.se
I. Abstract
Elasticsearch and Cassandra are two of the most widely used databases today, with Elasticsearch enjoying a recent resurgence thanks to its full-text search feature, akin to that of a search engine, which contrasts with the conventional query-language-based methods used for data search and retrieval.
The demand for more powerful and better performing, yet more feature-rich and flexible, databases has been growing steadily. This project studies how the two databases perform under a specific workload of 2,000,000 fixed-size logs, in an environment in which the two can be compared while keeping the results meaningful for the production environment for which they are intended.
A total of three benchmarks were carried out: an Elasticsearch deployment using the default configuration, and two Cassandra deployments, one with the default configuration and one with a modified configuration reflecting a setup currently running in production for the task at hand.
The benchmarks showed notable performance differences in terms of CPU, memory and disk space usage. Elasticsearch showed the best performance overall, using significantly less memory and disk space, as well as somewhat less CPU.
However, the benchmarks were run with a very specific set of configurations and a very specific data set and workload. These constraints should be kept in mind when interpreting the benchmark results.
Keywords: Databases, Benchmark, Performance, Kubernetes, Cassandra, Elasticsearch.
II. Acknowledgment
I would like to thank my incredible supervisors Ruwan Lakmal Silva, Pär Karlsson and Stefan Wallin at Ericsson as well as Krzysztof Wnuk and Conny Johansson at BTH for all the guidance and support they have given me and for affording me this valuable opportunity. I would also like to thank my family for their support in making my achievements possible.
III. Contents
I. Abstract
II. Acknowledgment
III. Contents
1. Introduction
1.1. Context
1.2. Problem
1.3. Target Group
1.4. Delimitations
1.5. Research questions
2. Background
2.1. History and Overview
2.2. Flat model
2.3. Relational
2.4. Post-Relational
3. Environment
3.1. App
3.2. Stream processing
3.3. Log stashing
3.4. Storage
3.5. Management
4. Experiment
4.1. Method
4.2. Design
4.3. Data replication & consistency
4.4. Deployment
5. Results
5.1. CPU usage
5.1.1. Cassandra default configuration: CPU usage
5.1.2. Cassandra modified configuration: CPU usage
5.1.3. Elasticsearch default configuration: CPU usage
5.1.4. All deployments: Total CPU usage
5.2. Memory
5.2.1. Cassandra default configuration: Memory usage
5.2.2. Cassandra modified configuration: Memory usage
5.2.3. Elasticsearch default configuration: Memory usage
5.2.4. All deployments: Total memory usage
5.3. Disk space
5.3.2. Cassandra default configuration: Data disk usage
5.3.3. Cassandra modified configuration: Commit disk usage
5.3.4. Cassandra modified configuration: Data disk usage
5.3.5. Elasticsearch default configuration: Data disk usage
5.3.6. All deployments: Total disk usage
6. Analysis
6.1. CPU utilization analysis
6.1.1. CPU usage vs logs output rate
6.1.2. CPU usage vs logs output delta
6.1.3. CPU usage vs Producer and Logstasher CPU usage
6.2. Memory utilization analysis
6.2.1. Memory usage vs logs output rate
6.2.2. Memory usage vs Producer & Logstasher CPU usage
6.3. Disk space utilization analysis
6.3.1. Disk space usage vs logs output rate
7. Conclusion
8. Concluding remarks
8.1. Summary
8.2. Limitations
8.3. Future work
9. References
1. Introduction
Ever since humanity developed written communication and evolved from oral cultures into ones capable of storing and preserving their knowledge, people have had the means to record information vital to their identity, beliefs, culture and trade, to name a few. Although the fundamental requirements of physical information storage, such as reliability, security, accessibility and cost, have remained largely unchanged at the most basic level, the challenges we face today in meeting those demands are far more complex [1]. This is largely due to the scale and speed at which we operate, and to ever-evolving technology and how it continues to shape the world around us and how we interact with it.
As more and more of the processes that underpin our infrastructure and business world are digitalized, many of them have found a home on distributed cloud computing platforms. With that, the demand for more efficient, more reliable and more use-case-specific technologies for handling, managing and processing such operations has grown, as is evident from the number of choices available for consideration.
The topic of this thesis project, a study of the resource utilization of Cassandra [1] and Elasticsearch [2], is such an attempt at comparing two available technologies, and seeks to provide empirical and relevant information to aid the decision between the two databases in a real-world scenario.
1.1. Context
The purpose of this research is to examine how the two databases compare in terms of resource utilization. Cassandra, one of the two technologies, is currently used in production at Ericsson to store operation and transaction logs produced by applications in a Kubernetes system.
Elasticsearch, an alternative solution, is being considered as a potential replacement.
By benchmarking the two and measuring their resource utilization in terms of CPU, disk and memory usage, the experiment will provide insight into how they compare and will allow a cost-benefit analysis of whether Elasticsearch is a suitable alternative for the use case at hand.
1.2. Problem
Cassandra, a wide-column store database, is currently used in one of the products offered by Ericsson for storing logs. While it handles the task well, an alternative database, Elasticsearch, offers features that are interesting and useful for the product in which it would be implemented. Full-text search is one such feature: it allows searching through records stored in the database much as one would use a search engine, largely thanks to its document-based data model, which allows for greater operational flexibility. The questions, however, are: at what cost would such a feature come, and how would it perform compared to the currently implemented database in terms of resource utilization? These are the questions this thesis aims to answer. By placing the two in a controlled environment in which they are benchmarked under the same workload, the collected data will help us understand how the two utilize resources in terms of CPU, disk and memory usage.
1.3. Target Group
This research is aimed at organizations, groups and individuals with an interest in how Elasticsearch performs compared to Cassandra in terms of resource utilization in a real-world use case. The study may also be of interest to those looking to explore how such a problem can be tackled, as well as those who want to further their knowledge in this field.
1.4. Delimitations
The study deals with one very specific deployment, configuration, workload, data set, and set of database versions and dependencies. It is important to keep this in mind when drawing conclusions from the results in this thesis.
1.5. Research questions
RQ1: How would Cassandra and Elasticsearch perform in terms of resource utilization under a heavy workload?
RQ2: What factors influence their resource utilization?
2. Background
This study focuses on a cost-benefit analysis of the performance of the Cassandra and Elasticsearch databases on a real-world problem, in cooperation with Ericsson. The goal is to establish how the two technologies perform relative to each other in a data storage task in a Kubernetes environment. To familiarize ourselves with key concepts, this chapter begins with a brief history of the development of databases and how we got to where we are today.
2.1. History and Overview
Although databases are strongly associated with computers and digital information, humans stored and cataloged information long before the computer era. A recently uncovered Sumerian medical tablet dating to 2400 BC, which lists 15 prescriptions used by a pharmacist [2], gives us a glimpse of how far back the practice dates and of its fascinating progression. The concepts and philosophies used to build and improve those systems have both shaped and helped guide the development of databases to where they are today.
Technological evolution has since changed the physical storage medium from clay tablets to papyrus and later to paper. Lists, ledgers, journals, card catalogs and archives have been used in elaborate systems developed by governments, libraries, hospitals and businesses [1].
Databases were created to overcome the limitations and difficulties such systems faced, in areas such as automation, speed, security, ease of use and accessibility.
Systems based on punch cards, and variations thereof (paper tape, etc.), allowed information stored on them to be processed automatically, particularly before general-purpose computers existed. Although such systems are mostly outdated and obsolete today, they are still used for tabulating votes and grading standardized tests.
The introduction and growing availability of disk and drum memory from the 1960s onwards, as computers grew in speed and capability, allowed for direct-access storage and shared interactive use [1].
The emergence of the World Wide Web, whose first iteration had users consuming content created by webmasters, and later Web 2.0, which shifted toward user-generated content, brought new challenges and difficulties that changed the needs developers and administrators had for their database systems [3].
All these technological evolutions had significant effects on the development of databases and the models used to store records in them. The development of database technology is generally divided into three eras in which data models and structures underwent fundamental changes: navigational, relational and post-relational. The following sub-chapters take a closer look at them.
2.2. Flat model
Early computer models followed a flat-file model: a simple, consecutive list of records. This model uses a separate file for each entity, and the files themselves can be plain text or binary. Plain text files usually contain one record per line, with field values separated by commas or another delimiter [4].
In the example below we see how records of fruits and vegetables are stored in two different databases. This is due to a limitation of this model: each database consists of a single table.
"Lime", 18.49, "Spain"
"Orange", 16.99, "USA"
"Fig", 32.99, "Greece"
"Avocado", 24.49, "Peru"
Code block 2.2.1: Records for fruits stored in a Fruits flat file using the format: name, price, origin.
"Cabbage", 11.49, "USA"
"Asparagus", 16.99, "Germany"
Code block 2.2.2: Records for vegetables stored in a Vegetables flat file using the format: name, price, origin.
The flat model lacks structures for indexing or recognizing relationships between records. Relationships can be inferred from the data in the database but the database itself does not make those relationships explicit.
Fruit:
Name      Price  Country
Lime      18.49  Spain
Orange    16.99  USA
Fig       32.99  Greece
Avocado   24.49  Peru

Vegetable:
Name       Price  Country
Cabbage    11.49  USA
Asparagus  16.99  Germany

Table 2.2.3: Records in the file-based model are stored in separate files for each entity. Record entries are separated by new lines and values are separated by commas.
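A short sketch makes the flat model's storage and lookup concrete, assuming the Fruits file from code block 2.2.1; note that, with no index, every lookup is a sequential scan:

```python
# Minimal sketch of reading a flat-file "database" like the Fruits file above.
# Records are one per line, with comma-separated fields (name, price, origin).
import csv
import io

# In-memory stand-in for the Fruits flat file shown in code block 2.2.1.
fruits_file = io.StringIO(
    '"Lime",18.49,"Spain"\n'
    '"Orange",16.99,"USA"\n'
    '"Fig",32.99,"Greece"\n'
    '"Avocado",24.49,"Peru"\n'
)

records = [row for row in csv.reader(fruits_file)]

# The flat model has no index: every lookup is a sequential scan
# from the first record to the last.
def find_by_name(rows, name):
    for row in rows:
        if row[0] == name:
            return row
    return None

print(find_by_name(records, "Fig"))  # ['Fig', '32.99', 'Greece']
```

Any structure beyond "one record per line" (types, relationships, indexes) has to live in the application code, which is precisely the weakness discussed next.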
The advantages the flat model enjoys over other models are its simplicity, in concept and ease of use, and the fact that it is inexpensive [4]. It practically functions as a list of records in a text file that anyone can open and modify to their needs. Examples of flat files used today are /etc/passwd and /etc/group in Unix-like operating systems. Other examples are found in contact lists and address books when imported and exported between devices and services.
However, the model has major limitations and disadvantages that generally fall into three categories: integrity, durability and implementation.
In terms of data integrity, the model lacks what is commonly known as referential integrity, the direct linking of different attributes in a database [5]. If a database administrator were to remove the record for David Howard or change his address, he or she would have to manually search through all the records in the database where David Howard or his address is referenced. Further, adding a contact with multiple phone numbers or addresses is very difficult to implement due to the limitations of this model.
Contacts:
FirstName  LastName  PhoneNr      Street             PostCode
Adam       Palmer    07486563476  West garden 1B     49 412
David      Howard    07486464464  Library street 12  48 216
Lisa       Johns     07486432342  Main road 32       43 975
Sara       Williams  07443654368  Down street 6A     47 197

Table 2.2.4: Contact records in a flat model.
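The update problem this causes can be sketched as follows, with hypothetical in-memory stand-ins for two flat files that both mention David Howard's address:

```python
# Sketch of the manual update problem in a flat model: changing one value
# means scanning every record, in every file, for every place it appears.
contacts = [
    ["Adam", "Palmer", "07486563476", "West garden 1B", "49 412"],
    ["David", "Howard", "07486464464", "Library street 12", "48 216"],
]
orders = [  # hypothetical second file that repeats the same address
    ["David", "Howard", "Library street 12", "4418988"],
]

# No referential integrity: the new address must be written into every
# record where the old one is referenced, and any missed occurrence
# silently leaves the data inconsistent.
old, new = "Library street 12", "Main road 7"
for table in (contacts, orders):
    for record in table:
        for i, field in enumerate(record):
            if field == old:
                record[i] = new

print(orders[0][2])  # Main road 7
```

A relational database avoids this by storing the address once and referencing it by key, as described in the next section.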
Another data-integrity-related issue is the model's inability to ensure that values in record fields are of a valid type and that references to the same subject are consistent across multiple entries. Both problems can be seen in the example below, where an unintended misspelling of Sara Williams' last name appears in the LastName field of one of her customer order records, and an invalid data type appears in the Price field. Such problems can be expensive and laborious to deal with, as the database will produce inaccurate search results or, in some instances, simply fail to produce any.
Orders:
FirstName  LastName  CustomerID  Date        OrderNr  Price
Sara       Williams  114354      04-03-2018  4418988  562.99
Sara       Williams  114354      07-09-2018  4419689  Organic apples
Sara       William   114354      25-12-2018  4420294  299.99
Sara       Williams  114354      28-02-2019  4421481  499.98

Table 2.2.5: Customer order records in a flat model.
We can also see in the example above the redundancy caused by repeating the customer's name and ID in every record, which is an inefficient use of disk space and resources.
In terms of implementation, search operations, especially over large volumes of records, are extremely slow and time-consuming, since the format requires the computer to start every search at the beginning of the list of entries and work its way through sequentially.
If another application is to use the same database, the code in the first application responsible for accessing, parsing, processing and modifying the data must be rewritten and duplicated in the new application. Similarly, any change to the data structure of the database requires the code in both applications to be updated accordingly [6]. Further, records in the database are susceptible to being lost or corrupted when multiple users, applications or threads try to write to the database at the same time.
In terms of durability, data in the flat model is prone to corruption if the machine hosting it crashes while a program is updating a record.
Advantages:
- Simple to create
- Easy to use
- Inexpensive

Disadvantages:
- Data redundancy
- Data inconsistency
- Data access control difficulties
- Requires extensive programming
- Poor security

Table 2.2.6: Advantages and disadvantages of the flat-file model.
These limitations and disadvantages are some of the challenges that later database models and database management systems try to tackle and find solutions for.
2.3. Relational
In the 1970s, Edgar F. Codd, a programmer and Oxford-educated mathematician working at IBM, published a paper, "A Relational Model of Data for Large Shared Data Banks" [7], showing how information in large databases can be accessed without needing to know how the data is stored or structured [8]. In this paper Codd introduced the term "relational database" and proposed shifting from storing data in hierarchical or navigational structures to storing it in rows and columns.
Each table in a relational database has one or more data categories known as columns, and each row in the table is a record. In the table below we can see what such a table could look like. The table describes a list of bands and consists of two columns: an ID column specifying a band's ID and a Band column specifying the band's name. In this example we have a total of 7 records.
ID     Band
ART01  DBMS Hoodlums
ART02  Binary Beasts
ART03  Callback Cats
ART04  Life Cycle Thugs
ART05  Multiprocessing Moguls
ART06  Open Source Pundits
ART07  Source Code Cannibals
Table 2.3.1: Bands table in a relational database model.
The following two tables describe other attributes related to the bands such as albums and songs they have produced. They are however separated into different tables as they represent different entities with different attributes.
ID     Title                Year  Label                      Band
ALB01  Boolean Autocrats    2006  Flip Framework Records     ART03
ALB02  Mind Map Cache       2013  Garbage Collector Records  ART03
ALB03  Ode To Code          2018  Glueware Gremlin Records   ART04
ALB04  Open Source Pundits  2007  Overflow Archives          ART07
ALB05  Pentium Predators    2010  Garbage Collector Records  ART01
ALB06  We Push to Master    1998  Overflow Archives          ART05
Table 2.3.2: Albums table in a relational database model.
ID Song Length Composer Album
SNG01 405 Found 04:10 Joe ALB05
SNG02 Binary Fetch 03:29 Steve ALB05
SNG03 Code Push 02:46 Joe ALB01
SNG04 Byte Me 03:26 Joe ALB04
SNG05 C-Sick 03:50 Steve ALB03
SNG06 Dirty Bits 02:55 Thomas ALB03
SNG07 Endless Embed 04:50 Thomas ALB02
SNG08 Error By Night 00:25 Joe/Steve ALB06
SNG09 Floating Encapsulation 03:08 Steve ALB05
SNG10 Hex Hypercity 03:41 Thomas ALB01
SNG11 Hypertext Assassins 04:20 Thomas/Joe ALB03
SNG12 Loon Bit Loop 03:34 Joe ALB02
SNG13 Regex Natives 03:57 Steve ALB01
SNG14 Runtime Terror 03:41 Steve ALB04
SNG15 The Epic Objective 03:25 Joe/Steve ALB03
Table 2.3.3: Songs table in a relational database model.
Relationships between tables are created using keys. Several types of keys are used to create such relationships; in our example we can see the use of the primary key and the foreign key. In the Songs table above, the first column, ID, stores primary keys: fields used to uniquely identify a row in the table. In the fifth column of the same table, the Album column, we see the primary keys of the Albums table being used. When keys are used this way they are known as foreign keys, and they establish a relationship between columns in two database tables.
With the help of such keys in a relational database, one can obtain a view of the database that suits specific needs. In the table below, we can see how relationships between different tables can be used to create a desired view.
Artists                  Albums                       Songs
Band                     Album                 Year   Song          Composer
DBMS Hoodlums            Boolean Autocrats     2006   405 Found     Joe
Binary Beasts            Mind Map Cache        2013   Binary Fetch  Steve
Callback Cats            Ode To Code           2018   Code Push     Joe
Life Cycle Thugs         Open Source Pundits   2007   Byte Me       Joe
Multiprocessing Moguls   Pentium Predators     1998   C-Sick        Steve

Query result:
Title         Artist             Album               Composer   Year
405 Found     Life Cycle Thugs   Ode To Code         Joe        2018
Binary Fetch  Life Cycle Thugs   Ode To Code         Joe        2018
Code Push     Life Cycle Thugs   Mind Map Cache      Joe/Steve  2013
Byte Me       DBMS Hoodlums      Boolean Autocrats   Steve      2006
C-Sick        DBMS Hoodlums      Pentium Predators   Thomas     1998

Table 2.3.4: Accessing data from different tables in a relational model database.
The main advantage of relational databases is the ability they give users to easily query, filter, sort and combine data to extract the information they need. Further, such databases can be extended with new data categories without modifying the existing applications that use them, since the data is not tied to its physical organization.
The relational model also excels at data consistency across applications and database instances, ensuring that multiple instances of a database have the same data all the time [9].
Advantages:
- Does not require familiarity with internal structure
- Multiple users can access the data at the same time
- Supports distributed databases

Disadvantages:
- Substantial hardware and system software overhead
- Can facilitate poor design and implementation
- May promote "islands of information" problems

Table 2.3.5: Advantages and disadvantages of the relational model.
2.4. Post-Relational
With the introduction and rise of Web 2.0, web servers and databases were put under more and more performance pressure. The solution to this new challenge was to split databases and servers into smaller instances that work together, in contrast with the earlier trend of moving to larger and more powerful machines to deal with increased workloads [3]. Further, the demand for databases with more flexible data structures has increased [10].
NoSQL is an example of such a post-relational, or non-relational as it is sometimes described, class of databases. Several variations of NoSQL databases are in use today, for example document-oriented databases, graph databases and wide-column databases. Cassandra is an example of a wide-column database and Elasticsearch is an example of a document database.
As previously mentioned, Elasticsearch is document-oriented, meaning that it stores entire objects or documents rather than flattening them into a table schema, one field per column, and reconstructing them every time they are retrieved [11]. It uses JSON, short for JavaScript Object Notation, as the serialization format for documents, a format that has become the standard in the NoSQL movement. It is simple, concise, easy to read and supported by most programming languages. An example of how an object and its attributes are represented in a JSON document is shown below.
Illustration 2.4.1: An identification card for a John Smith that includes different types of information about him.
{
  "name": "John Smith",
  "id_number": 11541556111841,
  "info": {
    "born": "1988-06-15",
    "memberships": [ "Theatre club", "Chess club" ],
    "bio": "Software engineering student looking to meet like-minded people."
  },
  "join_date": "2018-10-01"
}
Code block 2.4.2: An identification card object represented in a JSON document format.
3. Environment
The database use case in this study is responsible for storing logs generated by applications running in the production system. However, the data goes through a couple of intermediary steps before it is stored by the database and made available for searching and exporting.
Those steps are illustrated below in an architectural overview of the logging system and where the database fits into it. On one end of the operational process we have the apps, which produce both operation and transaction logs. The operation logs consist of system state and process execution logs used for maintenance, upkeep and monitoring. The transaction logs consist of information requested from the application instances, similar in nature to API/HTTP requests. These logs are stored for recordkeeping.
Diagram 3.0: An overview of the logging architecture.
3.1. App
The first step of the logging system is the source of the logs. What those applications do is not directly relevant to our study, as we are only interested in the data set, or workload, that they produce. That is to say, the applications themselves and the tasks they perform can be changed without any impact on the database, as long as the logs they generate are of the same type from the database's perspective. However, as previously mentioned, those applications produce both operation and transaction logs which contain valuable information, and as such they are stored.
Diagram 3.1: The app instances in the logging architecture.
3.2. Stream processing
The logs generated and output by the app instances form a stream of data. This stream is handled by Kafka, a distributed streaming platform, before being sent further down the operation line. The reason for including Kafka in the logging architecture is to decouple the producer side from the consumer side, so as to prevent any blocking in the system for a user due to data indexation operations.
A streaming platform is a platform capable of publishing and subscribing to streams of records, storing them in a fault-tolerant, durable way and processing them as they occur [12]. Apache Kafka is such a platform, offering high throughput and low latency for handling real-time data feeds.
Diagram 3.2: Stream processing in the logging architecture.
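The decoupling argument can be illustrated with a minimal in-memory stand-in for Kafka: a buffered queue between a producer thread and a consumer thread. Real Kafka adds durability, partitioning and distribution on top of this idea; this is only a conceptual sketch.

```python
# Conceptual sketch of the decoupling Kafka provides: the producer appends
# logs to a buffered queue and moves on, while a consumer indexes them at
# its own pace. (A stand-in using Python's queue module, not real Kafka.)
import queue
import threading

log_queue = queue.Queue()
indexed = []

def producer():
    for i in range(5):
        log_queue.put(f"log-{i}")   # returns immediately; no waiting on the indexer
    log_queue.put(None)             # sentinel: no more logs

def consumer():
    while True:
        log = log_queue.get()
        if log is None:
            break
        indexed.append(log)         # stands in for a (slow) database write

t_prod = threading.Thread(target=producer)
t_cons = threading.Thread(target=consumer)
t_prod.start(); t_cons.start()
t_prod.join(); t_cons.join()

print(indexed)  # ['log-0', 'log-1', 'log-2', 'log-3', 'log-4']
```

The producer never blocks on indexation: a slow database only lets the queue grow, which is exactly the buffering role Kafka plays in the architecture above.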
3.3. Log stashing
The logs handled by Kafka are finally sent to a data collection and log-parsing engine, where they are further processed so that they can be served to the database in the appropriate format and structure.
Diagram 3.3: The log stashing plugins in the logging architecture.
3.4. Storage
Finally, the logs are ready to be stored by the database.
Diagram 3.4: The databases in the logging architecture.
3.5. Management
Once the logs are successfully stored, they can be accessed and exported through a graphical interface specifically designed for this purpose.
Diagram 3.5: The log manager in the logging architecture.
4. Experiment
4.1. Method
The research is split into two parts: a theoretical analysis of the differences between Cassandra and Elasticsearch, and an empirical experiment in which their performance is measured and compared.
The former was necessary in order to conduct the latter, as knowledge of how the databases function structurally and technically was used to build a small piece of software known as a log-stasher. As the name indicates, its purpose is to receive log data from a log producer and then use the appropriate client operations to send the logs to the database and instruct it on how to store them.
The log-stasher was built in Scala and used the same dependencies and dependency versions as the one used against the Cassandra database, to ensure a controlled environment when running the benchmarking experiment.
The benchmarking was done in a Kubernetes cluster, and Helm charts were used to deploy the databases along with the log producers and log-stashers. This was done to simulate the production environment under which they will be running.
The performance metrics of the systems were collected with the help of two utilities, Prometheus and Grafana, which are commonly used in this type of environment.
The workload consists of randomly generated log data of consistent size. A total of 2,000,000 logs are generated by the Producer; the Logstasher receives them and runs the database instructions necessary to store them.
After the workload is completed, the metrics data is downloaded from the web interfaces of Grafana and Kubernetes. Grafana provides the metrics data as time series in CSV text files. Kubernetes allows the logs produced by the deployed applications in the cluster to be printed and stored as text files.
The metrics and logs are then imported into Excel, where they are processed and plotted onto scatter charts for analysis and presentation.
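The same per-series processing can be done programmatically; the sketch below parses a small CSV time series and computes an average. The exact column layout of a Grafana CSV export depends on the dashboard, so the header used here is an assumption.

```python
# Sketch of processing an exported metrics series: read a CSV time series
# and compute a summary statistic. The column names are assumptions, as the
# layout of a Grafana CSV export varies with the dashboard.
import csv
import io

grafana_csv = io.StringIO(
    "Time,cpu_usage\n"
    "2019-09-01 12:00:00,0.42\n"
    "2019-09-01 12:00:30,0.55\n"
    "2019-09-01 12:01:00,0.47\n"
)

samples = [float(row["cpu_usage"]) for row in csv.DictReader(grafana_csv)]
average = sum(samples) / len(samples)
print(round(average, 2))  # 0.48
```

In the thesis this step was done in Excel; the sketch only shows that the exported CSV is directly machine-readable.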
4.2. Design
In order to conduct the benchmarks, a number of applications needed to be put in place to simulate the log indexation procedure. As previously mentioned, we utilize a log producer application to create the workload, in order to put the indexation procedure under stress.
The producer, as seen in the illustration below, consists of two primary processes: a log generator that produces the logs constituting the benchmark workload, consistent in size but with randomly generated parameter values, and the Kafka Producer, which sends the generated logs further down the operation flow. The logs are thereafter fetched by the Logstasher, the second stage of the operation flow. The Logstasher is a piece of software which I wrote for this specific use case; it is responsible for three primary tasks: receiving the logs generated by the producer, performing any necessary formatting and data processing on the logs, and finally executing the appropriate database calls to store them.
The retrieval of the logs is done through the Kafka Consumer, which together with the Kafka Producer and Zookeeper makes up the Kafka system. Zookeeper is responsible for coordinating the consumption of the logs in the queue by the different Consumer instances. Once the logs have been received by the Logstasher, they are converted to the format the database requires in order to store them: in the case of Elasticsearch, JSON documents for its document-based model, and in the case of Cassandra, conventional query statements. Finally, for its third responsibility, the Logstasher runs the appropriate API calls against the database it is connected to so that the database receives and stores the log data.
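The formatting step can be sketched as follows. The log fields and the table and index names are illustrative assumptions, not the production schema; the point is only that the same record takes two different shapes for the two databases.

```python
# Sketch of the Logstasher's formatting step: the same log record is turned
# into a JSON document for Elasticsearch and a CQL-style INSERT for Cassandra.
# The log fields and the table name "logs" are illustrative assumptions.
import json

log = {"app": "APP 1", "level": "INFO", "message": "request handled",
       "timestamp": "2019-09-01T12:00:00Z"}

# Elasticsearch: the document model, serialized as JSON.
es_document = json.dumps(log)

# Cassandra: a conventional query statement. A real client would use a
# prepared statement with bound values rather than string formatting.
cql_insert = (
    "INSERT INTO logs (app, level, message, timestamp) "
    "VALUES ('{app}', '{level}', '{message}', '{timestamp}')".format(**log)
)

print(es_document)
print(cql_insert)
```

In the actual experiment this logic lives in the Scala log-stasher and uses the databases' client libraries; the sketch only contrasts the two target formats.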
The illustrations below provide an under-the-hood overview of the entire logging system used for the benchmarking. The first illustration represents the Cassandra configuration and the second the Elasticsearch configuration.
One difference between the two configurations, aside from the databases themselves, is how the data is handled once it is received by the database instances. In the case of Elasticsearch, the database instances, or more technically the pods, each have one persistent storage into which they store the data. In the case of Cassandra, however, there are two storages per pod or instance.
This design choice, as explained to me by one of my supervisors, is for performance and data security. Before the data is indexed into the database it is held in memory, which can be problematic in the case of system crashes or restarts, as it leads to data loss. Thus, the data is first written to a commit log, from which it is then flushed into the data storage.
This implementation was not possible for me to mimic in Elasticsearch due to the time constraints under which the benchmarks were conducted. It is therefore something to consider when drawing conclusions from the benchmark results and when comparing the resource utilization of the Cassandra and Elasticsearch deployments.
Diagram 4.2.1: An under-the-hood overview of the logging system used for the benchmarking of Cassandra.
Diagram 4.2.2: An under-the-hood overview of the logging system used for the benchmarking of Elasticsearch.
4.3. Data replication & consistency
The data replication model, which ensures data consistency in replicated databases such as Elasticsearch and Cassandra, is the process responsible for keeping the database replicas in sync when data is added or removed. A failure in this process can lead to situations where reading from one replica returns different results than reading from another.
In the case of Elasticsearch, each index is divided into shards, and each shard can have multiple copies; these copies are known as a replication group [13]. The data replication model is based on the primary-backup model described in the PacificA paper, “PacificA: Replication in Log-Based Distributed Storage Systems” [14]. In this model, a single copy acts as the main entry point for all indexing operations; it is responsible for validating the operations and for replicating them to the other copies. Once all replicas have successfully performed the operation and responded to the primary, the primary acknowledges the successful completion of the request to the client.
The implication of this implementation is that it can be fault tolerant while maintaining only two copies of the data. This contrasts with quorum-based systems, explained below, which need a minimum of three copies to maintain fault tolerance. However, a single slow shard copy is enough to slow down an entire operation, as the primary has to wait for all the replicas [13].
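The copy-count difference can be made concrete with a little arithmetic, where f is the number of simultaneous replica failures the system should survive. This is a sketch of the standard argument, not code from either database:

```python
def min_copies_primary_backup(f: int) -> int:
    # Primary-backup remains correct as long as one copy survives,
    # so surviving f failures requires f + 1 copies.
    return f + 1

def min_copies_quorum(f: int) -> int:
    # A quorum system needs a majority to remain reachable after any
    # f failures, which requires 2f + 1 copies in total.
    return 2 * f + 1
```

For a single tolerated failure (f = 1), primary-backup needs two copies where a quorum system needs three, matching the replication factors used in the benchmarks.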
A quorum-based technique, in distributed systems, enforces consistent operations. Such a technique can be used in replicated databases in the form of a replica control protocol that ensures that no two copies of a data item are read or written by transactions concurrently.
In the case of Cassandra, the approach takes inspiration from the Amazon Dynamo paper, “Dynamo: Amazon’s Highly Available Key-value Store” [15]. Shards are independent of one another and work together without a primary shard or leader, using a peer-to-peer communication system. Whichever node receives an operation request becomes the coordinator node for that operation.
Cassandra allows for configuring the desired consistency level for the particular implementation’s use case. This consistency level specifies how many replicas need to respond in order to consider the operation successful.
Among the available consistency level configurations are: ONE – only a single replica must respond, TWO – two replicas must respond, QUORUM – a majority of the replicas must respond, ALL – all of the replicas must respond. These consistency levels represent a tradeoff between data consistency and data availability. A lower consistency level, such as ONE, provides higher throughput and availability and lower latency, as it does not involve waiting for the other replicas, but comes at the cost of data correctness. The opposite is true for higher consistency levels [16].
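The number of acknowledgements each level requires can be expressed as a small helper; this is a sketch of the rule, and the function name is mine, not a Cassandra API:

```python
def required_acks(consistency_level: str, replication_factor: int) -> int:
    """Number of replicas that must acknowledge an operation before the
    coordinator reports success, for a few common Cassandra levels."""
    levels = {
        "ONE": 1,
        "TWO": 2,
        "QUORUM": replication_factor // 2 + 1,  # strict majority of replicas
        "ALL": replication_factor,
    }
    return levels[consistency_level]
```

With the replication factor of 3 used in the benchmarks, QUORUM requires two of the three replicas to respond.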
The table below shows the data replication and consistency levels used for each of the experiments.
Data replication and consistency configuration

| Parameter         | Cassandra default | Cassandra modified | Elasticsearch |
| Replication       | 3                 | 3                  | 2             |
| Consistency level | QUORUM            | QUORUM             | All*          |
Table 4.3.1: Replication and consistency level configurations used for each of the deployments.
* NOTE: Elasticsearch doesn’t use a quorum-based system; the All value is used as it is the closest equivalent to represent its configuration.
4.4. Deployment
The benchmarking of the databases is performed in Kubernetes, a container orchestration system for cloud-run services and applications. In order for the applications to be deployed on the platform, they first have to be built into Docker images. This step is performed using a Dockerfile, in which the configuration specifications for the image are set. Similarly, once the images are ready for deployment, a YAML file is used to configure the parameters and specifications under which they are deployed.
The YAML file allows for a great deal of configuration and provides the information the applications use to communicate with one another, such as networking configuration, proxy, encryption and port numbers, to name a few. Other parameters include how many instances of each application, or more technically pods, should be created and how many resources they are allowed to utilize.
The Kubernetes cluster services use this information to distribute the deployed images and allocate any specified storage among the nodes the cluster is configured to run on.
The illustration below shows how the YAML file is provided to the Kubernetes cluster services through its API, and the number of workers (nodes) it is set to work with. Each worker, or node, runs a kubelet, a “node agent” responsible for ensuring that the containers on the node are running and healthy according to the specifications it is provided with [17].
Diagram 4.4.1: Kubernetes deployment overview.
Diagram 4.4.2: Overview of the data and operation flow for the Cassandra default and modified configuration deployments
Diagram 4.4.3: Overview of the data and operation flow for the Elasticsearch deployment.
The illustration above provides an overview of the data and operation flow for the Cassandra default and modified configuration deployments. The illustration that follows it provides the same overview for the Elasticsearch deployment. As can be seen, the applications that would generate the logs in production are replaced by the Producer, which was discussed previously.
4.4.1. Deployment similarities
As previously mentioned, the YAML file contains the values and the configuration specifications for the images, such as applications, services and networks, which are deployed. The table below shows the parameter values that were identical across the three benchmark deployments conducted for the experiment.
YAML deployment configuration similarities (Databases)

| Parameter                  | Cassandra default                     | Cassandra modified                    | Elasticsearch                                     |
| Database: Service          | cassandra                             | cassandra                             | elasticsearch                                     |
| Database: Nodes            | cassandra-0, cassandra-1, cassandra-2 | cassandra-0, cassandra-1, cassandra-2 | elasticsearch-0, elasticsearch-1, elasticsearch-2 |
| Database: Port             | 9042                                  | 9042                                  | 9200                                              |
| Database: Replicas         | 3                                     | 3                                     | 3                                                 |
| Persistence: Storage class | Standard                              | Standard                              | Standard                                          |
| Persistence: Volume size   | 8Gi                                   | 8Gi                                   | 8Gi                                               |
| Resources: Requests memory | 8Gi                                   | 8Gi                                   | 8Gi                                               |
| Resources: Requests CPU    | 1                                     | 1                                     | 1                                                 |
| Resources: Limits memory   | 8Gi                                   | 8Gi                                   | 8Gi                                               |
| Resources: Limits CPU      | 3                                     | 3                                     | 3                                                 |
| Heap new size              | 2048                                  | 2048                                  | 2048                                              |
| Heap max size              | 4096                                  | 4096                                  | 4096                                              |
Table 4.4.1.1: YAML-file specified parameter values that are similar for all three benchmark deployments.
YAML deployment configuration similarities (Logging)

| Parameter                               | Cassandra default | Cassandra modified | Elasticsearch |
| Logstasher: Replicas                    | 3                 | 3                  | 3             |
| Logstasher: Requests memory             | 4Gi               | 4Gi                | 4Gi           |
| Logstasher: Requests CPU                | 2                 | 2                  | 2             |
| Logstasher: Limits memory               | 8Gi               | 8Gi                | 8Gi           |
| Logstasher: Limits CPU                  | 4                 | 4                  | 4             |
| Zookeeper: Replicas                     | 1                 | 1                  | 1             |
| Zookeeper: PVC storage class name       | Standard          | Standard           | Standard      |
| Zookeeper: Storage                      | 1Gi               | 1Gi                | 1Gi           |
| Kafka: Replicas                         | 3                 | 3                  | 3             |
| Kafka: PVC storage class name           | Standard          | Standard           | Standard      |
| Kafka: Persistence enabled              | False             | False              | False         |
| Kafka: Default replication factor       | 1                 | 1                  | 1             |
| Kafka: Offsets topic replication factor | 1                 | 1                  | 1             |
| Producer: Kind                          | Deployment        | Deployment         | Deployment    |
| Producer: Replicas                      | 1                 | 1                  | 1             |
Table 4.4.1.2: YAML-file specified parameter values that are similar for all three benchmark deployments.
YAML deployment configuration similarities (Monitoring)

| Parameter                                     | Cassandra default | Cassandra modified | Elasticsearch |
| Prometheus: Access mode                       | ReadWriteOnce     | ReadWriteOnce      | ReadWriteOnce |
| Prometheus: Storage class                     | Standard          | Standard           | Standard      |
| Prometheus: Requests storage                  | 5Gi               | 5Gi                | 5Gi           |
| Grafana: Persistence enabled                  | True              | True               | True          |
| Grafana: Storage class                        | Standard          | Standard           | Standard      |
| Grafana: Access mode                          | ReadWriteOnce     | ReadWriteOnce      | ReadWriteOnce |
| Grafana: Size                                 | 5Gi               | 5Gi                | 5Gi           |
| Alert manager: Access modes                   | ReadWriteOnce     | ReadWriteOnce      | ReadWriteOnce |
| Alert manager: Storage class                  | Standard          | Standard           | Standard      |
| Alert manager: Requests size                  | 5Gi               | 5Gi                | 5Gi           |
| Prometheus node exporter: Service port        | 30206             | 30206              | 30206         |
| Prometheus node exporter: Service target port | 30206             | 30206              | 30206         |
| Kube etcd: Enabled                            | False             | False              | False         |
| Kube controller manager: Enabled              | False             | False              | False         |
| Kube scheduler: Enabled                       | False             | False              | False         |
Table 4.4.1.3: YAML-file specified parameter values that are similar for all three benchmark deployments.
4.4.2. Deployment differences
In the tables above, two Cassandra configurations were presented. This is because the Cassandra version running in the production environment is configured to use both data encryption and compression between the database nodes. Since the time constraint did not allow me to produce a matching Elasticsearch configuration, which would have required familiarizing myself with the security systems and certification procedures in place, I opted to add a second Cassandra benchmark stripped of these features, as a reference for how much of an impact they have on the overall results.
Thus, two Cassandra deployments were used. The first, Cassandra default, contains all the in-house configuration except for data encryption and compression between the nodes. The second, Cassandra modified, is configured to use the in-house configuration including the configuration variables that enable and set the data encryption and compression between the nodes.
Aside from those differences, everything else is configured to run with identical specifications, as can be seen in the table above. The tools used to collect the metrics, and how they were configured, can also be seen at the bottom of the table.
Deployment differences

| Parameter         | Cassandra default | Cassandra modified | Elasticsearch |
| Replication       | 3                 | 3                  | 2             |
| Consistency level | QUORUM            | QUORUM             | All*          |
| Compression       | No                | Yes                | No            |
| Encryption        | No                | Yes                | No            |
Table 4.4.2.1: Differences between the three benchmarked configurations.
* NOTE: Elasticsearch doesn’t use a quorum-based system; the All value is used as it is the closest equivalent to represent its configuration.
5. Results
This section of the thesis presents the resource utilization of the three conducted benchmarks. It is divided into three sections: the first presents the results for CPU utilization, followed by a section for memory utilization and finally a section for disk space usage.
5.1. CPU usage
The CPU usage results show that the modified Cassandra configuration took the longest to complete the benchmark, roughly 25 minutes in total, averaging around 600% overall CPU usage during the workload. In comparison, the default Cassandra configuration took about 9 minutes to complete the task and Elasticsearch about 13 minutes.
The default Cassandra configuration, however, used roughly twice the amount of CPU to complete the task compared to the two other deployments. We can also see in chart 5.1.3 that not all of Elasticsearch’s instances used the same amount of CPU. This is due to the role assignment system under which the instances operate.
The main differences between Elasticsearch’s results and Cassandra’s are the number of copies made of the data and the indexing procedure. Each Cassandra deployment made three copies of the data while Elasticsearch made only two, due to the differences in the replication models of the two databases: Elasticsearch can maintain a fault tolerant implementation with only two copies while Cassandra needs at least three. Furthermore, Cassandra first commits the data to a separate storage before indexing it and storing it in the data storage. These two factors increase the amount of computation needed and consequently the CPU usage.
Another difference, between the default and the modified Cassandra configurations, is data compression and encryption between the nodes. The default Cassandra configuration runs without encryption or compression, while the modified configuration uses both. Compression and encryption come at a cost, as they add work to the communication between the nodes and thus impact the throughput negatively.
These differences can be seen more clearly if we calculate the area under the curve of each deployment’s overall CPU utilization; the area represents the total amount of CPU work done by each deployment.
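The area under the curve can be approximated with the trapezoidal rule over the sampled utilization values. This is a sketch of the calculation, not the exact script used for the thesis:

```python
def cpu_work(samples: list) -> float:
    """Approximate total CPU work as the area under a utilization curve,
    given (time_seconds, cpu_percent) samples, via the trapezoidal rule."""
    area = 0.0
    for (t0, u0), (t1, u1) in zip(samples, samples[1:]):
        area += (u0 + u1) / 2 * (t1 - t0)
    return area
```

For example, a deployment holding a steady 100% CPU for one minute accumulates the same area as one ramping linearly from 0% to 200% over that minute.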
Chart 5.1: Overall Total CPU work done for each deployment.
[Figure for chart 5.1: bar chart of total CPU work (area under overall CPU utilization), 0–9,000, for Cassandra Default, Cassandra Modified and Elasticsearch.]
The default Cassandra deployment exceeded the maximum amount of CPU allocated to it in the configuration, as can be seen in the graphs below, which might be due to the way the compression and encryption were removed. Running more benchmarks of each deployment would have helped clarify this issue, but unfortunately time did not allow for more runs. The default Cassandra results are nonetheless left here as a reference.
5.1.1. Cassandra default configuration: CPU usage
Chart 5.1.1: CPU usage of each Cassandra pod under the workload of the Cassandra default configuration benchmark.
5.1.2. Cassandra modified configuration: CPU usage
Chart 5.1.2: CPU usage of each Cassandra pod under the workload of the Cassandra modified configuration benchmark.
[Figure for chart 5.1.1: CPU usage (vCPU %), 0–800%, over 00:05:00–00:15:00; series Cassandra-1, Cassandra-2, Cassandra-3.]
[Figure for chart 5.1.2: CPU usage (vCPU %), 0–400%, over 00:03:00–00:31:00; series Cassandra-1, Cassandra-2, Cassandra-3.]
5.1.3. Elasticsearch default configuration: CPU usage
Chart 5.1.3: CPU usage of each Elasticsearch pod under the workload of the Elasticsearch default configuration benchmark.
5.1.4. All deployments: Total CPU usage
Chart 5.1.4: Total CPU usage of all instances for each deployment.
In the chart above we see the overall CPU utilization of each deployment. The differences in CPU usage and in the time taken, discussed at the beginning of this section, can be seen more clearly here.
[Figure for chart 5.1.3: CPU usage (vCPU %), 0–400%, over 00:22:00–00:36:00; series Elasticsearch-1, Elasticsearch-2, Elasticsearch-3.]
[Figure for chart 5.1.4: total CPU usage (vCPU %), 0–1,800%, over 00:00:00–00:25:00; series Cassandra Default, Cassandra Modified, Elasticsearch.]
5.2. Memory
The memory utilization results show that both Cassandra configurations utilized the same amount of memory, both starting at just over 8 gigabytes per instance and landing at around 13 gigabytes by the end of the workload. This indicates that the differences in their configurations did not require more memory to complete the task; the duration for which the memory was used, however, differed.
Elasticsearch, however, used significantly less memory than Cassandra, even when considering that Elasticsearch created only two copies of the data while Cassandra created three.
We can see from Elasticsearch’s results that all three instances started at about 3 gigabytes of memory usage, one of which remained at that level throughout the benchmark. The other two instances, Elasticsearch-2 and Elasticsearch-3, followed an increase in memory usage under the workload similar to Cassandra’s but ended at around 4 gigabytes each by the end of the workload.
Elasticsearch-1’s reduced memory usage compared to the other two Elasticsearch instances mirrors its reduced CPU usage seen in the previous section, specifically in chart 5.1.3.
5.2.1. Cassandra default configuration: Memory usage
Chart 5.2.1: Memory usage of each Cassandra pod under the workload of the Cassandra default configuration benchmark.
[Figure for chart 5.2.1: memory usage (gigabytes), 0–16, over 00:00:00–00:16:00; series Cassandra-1, Cassandra-2, Cassandra-3.]
5.2.2. Cassandra modified configuration: Memory usage
Chart 5.2.2: Memory usage of each Cassandra pod under the workload of the Cassandra modified configuration benchmark.
5.2.3. Elasticsearch default configuration: Memory usage
Chart 5.2.3: Memory usage of each Elasticsearch pod under the workload of the Elasticsearch default configuration benchmark.
[Figure for chart 5.2.2: memory usage (gigabytes), 0–16, over 00:00:00–00:32:00; series Cassandra-1, Cassandra-2, Cassandra-3.]
[Figure for chart 5.2.3: memory usage (gigabytes), 0–5, over 00:00:00–00:48:00; series Elasticsearch-1, Elasticsearch-2, Elasticsearch-3.]
5.2.4. All deployments: Total memory usage
Chart 5.2.4: Total memory usage of all instances for each deployment.
5.3. Disk space
A difference between this section, disk space usage, and the two preceding sections, CPU usage and memory usage, is that the results for each Cassandra configuration are presented in two charts, while Elasticsearch’s results remain in a single chart. This is because each Cassandra instance is assigned two storage volumes while each Elasticsearch instance is assigned a single one. It follows from how the Cassandra image and version currently in use handles the data before it is indexed into storage: the aim is to prevent data loss in case of a crash or restart, to reduce redundancy in the indexing process, and to improve performance for the specific production use case. These differences and the reasoning behind them are discussed in more detail in section 4.2.
The disk space utilization results show that both Cassandra deployments ended up using about 1,480 megabytes of disk space per instance, while two of the Elasticsearch instances ended up using about 800 megabytes each, with the third, Elasticsearch-1, remaining at its initial disk space usage of 52 megabytes throughout the benchmark. Considering that Cassandra creates three copies of the data while Elasticsearch creates only two, Elasticsearch still used less disk space even when that difference is accounted for.
We can also see how the disk space usage in Cassandra’s Data directories corresponds with changes in its Commit log directories. A clear example can be seen in charts 5.3.1 and 5.3.2 at the 13-minute mark, where the disk space usage of the Commit log drops by about 100 megabytes, which coincides with an increase of disk space in the Data directory shortly after.
5.3.1. Cassandra default configuration: Commit disk usage
Chart 5.3.1: Disk space usage of each Cassandra commit volume under the workload of the Cassandra default configuration benchmark.
5.3.2. Cassandra default configuration: Data disk usage
Chart 5.3.2: Disk space usage of each Cassandra data volume under the workload of the Cassandra default configuration benchmark.
[Figure for chart 5.3.1: disk space usage (megabytes), 0–400, over 00:00:00–00:21:00; series Cassandra Commit Log-1, -2, -3.]
[Figure for chart 5.3.2: disk space usage (megabytes), 0–2,000, over 00:00:00–00:18:00; series Cassandra Data Dir-1, -2, -3.]
5.3.3. Cassandra modified configuration: Commit disk usage
Chart 5.3.3: Disk space usage of each Cassandra commit volume under the workload of the Cassandra modified configuration benchmark.
5.3.4. Cassandra modified configuration: Data disk usage
Chart 5.3.4: Disk space usage of each Cassandra data volume under the workload of the Cassandra modified configuration benchmark.
[Figure for chart 5.3.3: disk space usage (megabytes), 0–400, over 00:00:00–00:35:00; series Cassandra Commit Log-1, -2, -3.]
[Figure for chart 5.3.4: disk space usage (megabytes), 0–1,600, over 00:00:00–00:35:00; series Cassandra Data Dir-1, -2, -3.]
5.3.5. Elasticsearch default configuration: Data disk usage
Chart 5.3.5: Disk space usage of each Elasticsearch volume under the workload of the Elasticsearch default configuration benchmark.
5.3.6. All deployments: Total disk usage
Chart 5.3.6: Total disk space usage of all instances for each deployment.
[Figure for chart 5.3.5: disk space usage (megabytes), 0–1,000, over 00:00:00–00:50:00; series Elasticsearch-1, Elasticsearch-2, Elasticsearch-3.]
[Figure for chart 5.3.6: disk space usage (megabytes), 0–5,000, over 00:00:00–00:25:00; series Cassandra Default Data Dir, Cassandra Default Commit Log, Cassandra Modified Data Dir, Cassandra Modified Commit Log, Elasticsearch.]
6. Analysis
The results from the benchmarks, presented in the previous chapter, are used along with data generated by the Producer and the Logstasher to provide context to the results. This data was collected from console logs written by the Producer and the Logstasher during the benchmark and consists of timestamps along with the total number of logs produced by the Producer or stashed by the Logstasher.
By matching the timestamps to those collected from the metrics data, we can plot the two on the same graph using two separate y-axes.
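The timestamp matching can be sketched as a simple join on the timestamps the two series share. This is illustrative only, not the actual tooling used to produce the charts:

```python
def align_by_timestamp(metrics: dict, outputs: dict) -> list:
    """Join CPU metric samples with Producer/Logstasher log counts on the
    timestamps both series share, so they can be plotted against two y-axes.
    Both inputs are dicts keyed by an 'hh:mm:ss' timestamp string."""
    shared = sorted(set(metrics) & set(outputs))
    return [(ts, metrics[ts], outputs[ts]) for ts in shared]
```

Each resulting triple (timestamp, CPU %, log count) supplies one x position and one value per y-axis in the combined chart.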
In this section we also use the total resource utilization of all three instances, unlike the results sections, where we looked at the resource utilization of each instance individually.
6.1. CPU utilization analysis
6.1.1. CPU usage vs logs output rate
Comparing the CPU usage of the three benchmarked deployments with their respective Producer’s log output rate and Logstasher’s stashing rate shows irregularities in how the two rates follow one another in the case of the default Cassandra configuration. We also see an increasing latency between the two the further along the workload timeline we go. This irregularity is not as evident in the other two benchmarks, the modified Cassandra configuration and Elasticsearch.
This difference could be an indirect result of the high CPU usage of Cassandra in this deployment, where a bottleneck may have occurred in the Producer’s and Logstasher’s ability to match the rate at which the database can receive the logs. It could also be due to limitations in how the Producer and Logstasher were implemented. Unfortunately, time did not allow for repeated benchmark runs to investigate this further.
6.1.1.1. Cassandra default configuration
Chart 6.1.1.1: Cassandra pods total CPU usage and the output rates of the Producer and the Logstasher under the Cassandra default configuration benchmark.
[Figure for chart 6.1.1.1: left axis Cassandra total CPU usage (vCPU %), 0–1,800%; right axis Producer & Logstasher output rate (logs), 0–10,000; time 14:18:00–14:28:00. Series: Cassandra CPU, 5-period moving averages of Producer and Logstasher output.]
6.1.1.2. Cassandra modified configuration
Chart 6.1.1.2: Cassandra pods total CPU usage and the output rates of the Producer and the Logstasher under the Cassandra modified configuration benchmark.
6.1.1.3. Elasticsearch default configuration
Chart 6.1.1.3: Elasticsearch pods total CPU usage and the output rates of the Producer and the Logstasher under the Elasticsearch default configuration benchmark.
[Figure for chart 6.1.1.2: left axis Cassandra total CPU usage (vCPU %), 0–1,000%; right axis Producer & Logstasher output rate (logs), 0–3,200; time 12:08:00–12:32:00. Series: Cassandra CPU, 5-period moving averages of Producer and Logstasher output.]
[Figure for chart 6.1.1.3: left axis Elasticsearch total CPU usage (vCPU %), 0–800%; right axis Producer & Logstasher output rate (logs), 0–8,000; time 17:21:00–17:35:00. Series: Elasticsearch CPU, 5-period moving averages of Producer and Logstasher output.]