VoloDB: High Performance and ACID Compliant Distributed Key Value Store with Scalable Prune

Index Scans

ALI DAR

Master’s of Science Thesis KTH Royal Institute of Technology

School of Information and Communication Technology Supervisor: Dr. Jim Dowling

Examiner: Prof. Seif Haridi Stockholm, Sweden, June 2015

TRITA-ICT-EX-2015:96


Abstract

Relational databases provide an efficient mechanism to store and retrieve structured data with ACID properties, but they are not ideal for every scenario. Their scalability is limited by the huge data processing requirements of modern systems. NoSQL offers an alternative way of looking at a database: NoSQL systems generally store unstructured data and relax some of the ACID properties in order to achieve massive scalability. There are many flavors of NoSQL system; one of them is the key value store. Most key value stores currently available on the market offer reasonable performance but compromise on important features such as transactions, strong consistency and range queries. The stores that do offer these features lack good performance.

The aim of this thesis is to design and implement VoloDB, a key value store that provides high read and write throughput without compromising on ACID properties. VoloDB is built on MySQL Cluster and, instead of using high-level abstractions, communicates with the cluster through the highly efficient native low-level asynchronous C++ NDB API. VoloDB talks directly to the data nodes without going through the MySQL Server, which further enhances performance. It exploits many of MySQL Cluster's features, such as primary and partition key lookups and prune index scans, to hit only one of the data nodes and achieve maximum performance.

VoloDB offers a high-level abstraction that hides the complexity of the underlying system without requiring the user to think about internal details. Our key value store also offers various additional features, such as multi-query transactions and bulk operation support. C++ client libraries are provided to allow developers to interface easily with our server. An extensive evaluation benchmarks various scenarios and compares VoloDB with another high-performance open source key value store.


Acknowledgement

I would like to thank my examiner, Prof. Seif Haridi, for giving me the opportunity to work on this research project. I would especially like to thank Dr. Jim Dowling for his constant support and guidance at every step of the way during the thesis. His expert input was instrumental in solving complex design, implementation and optimization problems in the project.

In the end, I would like to thank my wife and parents for their moral support throughout my studies.


Contents

Acknowledgement
List of Figures
Acronyms

1 Introduction
1.1 Motivation
1.2 Contribution
1.3 Outline

2 Background
2.1 ACID Properties
2.2 Strong and Eventual Consistency
2.3 CAP Theorem
2.4 BASE
2.5 SQL Based Databases
2.6 NoSQL Databases

3 Related Work
3.1 MapReduce
3.2 Apache Hadoop and YARN
3.3 BigTable
3.4 Apache HBase
3.5 Dynamo
3.6 Apache Cassandra
3.7 MongoDB
3.8 Riak
3.9 FoundationDB
3.10 Aerospike
3.11 Spanner and F1

4 Design
4.1 MySQL Cluster
4.1.1 Architecture
4.1.2 Partition Key
4.2 NDB API
4.2.1 Asynchronous API
4.3 Abstractions and Mappings
4.3.1 Database
4.3.2 Store
4.3.3 Key Value Pair
4.3.4 Key
4.3.5 Value
4.3.6 Index
4.3.7 Operations
4.4 Feature Set
4.4.1 Strong Consistency
4.4.2 ACID Properties
4.4.3 Transactions
4.4.4 Prune Index Scans
4.4.5 Strong Data Types
4.4.6 Multi-column Key
4.4.7 Multi-column Value
4.4.8 Supported Data Types
4.5 Performance Considerations
4.5.1 Allowed Queries
4.5.2 Disallowed Queries
4.6 Definition Operations
4.6.1 Create Store
4.6.2 Delete Store
4.7 Manipulation Operations
4.7.1 Set
4.7.2 Get
4.7.3 Delete
4.7.4 Atomic Mode
4.8 High Level System Architecture
4.9 VoloDB Architecture
4.9.1 Network I/O Handler
4.9.2 Definers
4.9.3 Executors

5 Implementation
5.1 Transport
5.2 Serialization
5.3 Worker Threads
5.4 Inter-process Communication
5.5 Memory Management and Garbage Collection
5.6 Bulk Requests Handling
5.7 Protocol Buffer Messages
5.8 Parallel Database Connections
5.9 Zero Copy Semantics
5.10 Client Library

6 Evaluation
6.1 Evaluation Setup
6.1.1 Hardware
6.1.2 Software
6.1.3 Workload
6.2 Experiment 1 - Set
6.3 Experiment 2 - Get
6.4 Experiment 3 - Delete
6.5 Experiment 4 - Mixed Reads and Writes
6.6 Experiment 5 - Prune Index Scans
6.7 Experiment 6 - Comparison with Aerospike

7 Conclusion
7.1 Future Work
7.1.1 Joins
7.1.2 Additional Data Types
7.1.3 Language Bindings
7.1.4 Load Balancing

Bibliography
Appendices
A User Guide
A.1 Download
A.2 Project Structure
A.3 Prerequisites
A.4 Setting up VoloDB
A.4.1 Installation
A.4.2 Quick Installation
A.4.3 Configuration
A.4.4 Execution
A.5 Setting up VoloDB Client Library
A.5.1 Compilation
A.5.2 Usage
A.6 Sample Application
A.6.1 Setup
A.6.2 Important Classes
A.6.3 Store Creation
A.6.4 Key Value Pair Insertion
A.6.5 Fetching Key Value Pair
A.6.6 Fetching Key Value Pairs using Partition Key
A.6.7 Fetching Key Value Pairs using Non-Keyed Column
A.6.8 Key Value Pair Deletion
A.6.9 Store Deletion


List of Figures

4.1 Overview of MySQL Cluster Architecture
4.2 High Level System Architecture
4.3 VoloDB Architecture
5.1 Protocol Buffer Messages
6.1 Throughput of Set Operations
6.2 Throughput of Get Operations
6.3 Throughput of Delete Operations
6.4 Throughput of Mixed Reads and Writes
6.5 Throughput of Prune Index Scans with 50 Records Returned
6.6 Throughput of Prune Index Scans with Variable Records Returned
6.7 Throughput Comparison with Aerospike: 0% Reads, 100% Writes
6.8 Throughput Comparison with Aerospike: 20% Reads, 80% Writes
6.9 Throughput Comparison with Aerospike: 50% Reads, 50% Writes
6.10 Throughput Comparison with Aerospike: 80% Reads, 20% Writes
6.11 Throughput Comparison with Aerospike: 100% Reads, 0% Writes


Acronyms

ACID Atomicity, Consistency, Isolation, Durability
API Application Programming Interface
BASE Basic Availability, Soft State, Eventual Consistency
BSON Binary JavaScript Object Notation
CAP Consistency, Availability, Partition Tolerance
CPU Central Processing Unit
CQL Cassandra Query Language
GFS Google File System
HDFS Hadoop Distributed File System
IBM International Business Machines
JSON JavaScript Object Notation
NDB Network DataBase
NewSQL New Structured Query Language
NoSQL Not Only SQL
OLAP Online Analytical Processing
OLTP Online Transaction Processing
OODBMS Object Oriented Database Management System
OOP Object Oriented Programming
ORD Object-Relational Database
ORDBMS Object-Relational Database Management System
POSIX Portable Operating System Interface
RDBMS Relational Database Management System
SQL Structured Query Language
XML Extensible Markup Language


Chapter 1

Introduction

During 1970’s E.F Codd while working at International Business Machines (IBM) Research Laboratory proposed relational model[1] to organize and manipulate data.

The database based over this model organizes the records into rows or tuples which are grouped together into a two dimensional relation or a table. The row itself is an ordered collection of attributes containing associated values. The model pre- sented was very extensive and included things like relational joins, constraints and normalization procedures to make sure that designed models were consistent and optimized. Most of the current Relational Database Management System (RDBMS) use Structured Query Language (SQL) to manage the data. The model was a big success and became an industry standard to store and manipulate data. It was and is currently in use by Online Transaction Processing (OLTP) systems that ensures ACID properties of every transaction. It is also being used for Online Analytical Processing (OLAP) to offer business intelligence services to users. The relational model was extended later to Object-Relational Database (ORD) and Object Ori- ented Database Management System (OODBMS), they supported data to be saved in terms of objects. They were mainly inspired from the object oriented program- ming paradigm such as C++.

Relational databases are designed for specific use cases and perform rather well in those scenarios. They work well when ACID properties are a requirement and data is structured in predefined schemas that can be logically organized into rows and tables. Although an RDBMS works and scales well with reasonably large amounts of data, its scalability is limited[2].

Data requirements today have grown immensely and can no longer be handled efficiently by conventional database systems. Processing petabytes and exabytes of data is nowadays a frequent task for large enterprises. Beyond the massive volume, other characteristics of data have changed: the speed at which data is generated and processed is enormous, and its complexity has increased[3].


The data is simply not rigid enough to be mapped into predefined schemas; it is unstructured, complex and comes from various sources. Such data sets are termed Big Data. Big data posed an enormous challenge, and a new approach called Not Only SQL (NoSQL) was conceived for data storage and retrieval. NoSQL presented a model significantly different from the traditional relational way of looking at things. NoSQL supports databases that do not impose rigid schemas, as RDBMS do, and that can scale massively. The massive scaling is attributed to relaxing the ACID properties; for example, consistency can be relaxed to achieve the desired effect.

NoSQL started as internal projects at big enterprises. Google developed an implementation of the MapReduce programming model[4] to efficiently process huge amounts of data in parallel using clusters of distributed commodity computers.

It later presented storage solutions such as the column oriented database BigTable[5], built upon the Google File System (GFS), which in turn inspired many open source projects such as Hadoop, HBase and the Hadoop Distributed File System (HDFS)[6]. FlockDB was developed by Twitter, Cassandra[7] was started initially by Facebook, the highly available key value store Dynamo[8] was built by Amazon, and Voldemort by LinkedIn. There are other examples, such as the key value store Riak. Document oriented stores like MongoDB[9] and CouchDB, and graph based stores such as OrientDB and Neo4j, have become very popular as well. The NoSQL landscape has become extremely versatile, letting users and enterprises choose among many solutions depending on their unique requirements. NoSQL databases can be complex to use, which has resulted in the creation of many high level tools and languages to help users and reduce development time. For example, Pig Latin and HiveQL are languages built over Hadoop that offer SQL-like features.

Though NoSQL databases continue to serve the purpose of handling big data, they are still not perfect, especially because they do not guarantee ACID properties. New systems are being researched and developed that seek to overcome this limitation. A class of relational databases called NewSQL is a new initiative that ensures ACID properties while allowing massive scaling. Google Spanner[10], MemSQL and VoltDB are a few examples.

1.1 Motivation

The NoSQL databases that exist today are built to scale and offer good performance, but they are limited in terms of many important functionalities offered by traditional relational databases, such as strong data types, range queries and transactions. They also lack strong Atomicity, Consistency, Isolation, Durability (ACID) properties, relaxing one of these in order to scale better. Strong consistency is generally relaxed to eventual consistency, which means that after some


time the system will converge to a consistent state. Though databases with these properties work just fine for most use cases, they are not ideal for all scenarios.

The ideal is a database system that scales well but does not compromise on ACID properties.

As mentioned earlier, the lack of features in NoSQL databases is a big limitation for many users. Some NoSQL databases, such as HBase, provide transactions, but their scope is very limited since they only ensure row level consistency[11].

Key value stores such as Riak, Aerospike and FoundationDB also impose numerous limitations to ensure strong consistency[12][13]. We would like to run transactions like those on traditional relational databases, without any limitations and with full ACID properties.

Beyond the lack of fully featured transactions, there are many other limitations. Key value stores have either limited or no support for strong data types[14]; for example, FoundationDB only allows bytes as a key or a value[15]. Due to the lack of strong data types, the user has to encode to and decode from bytes every time, which is inconvenient. Many stores also do not allow composite keys and values, which is inefficient since the user has to serialize data to, and deserialize it from, a single column. Support for non primary key lookups is either missing or, where supported, does not scale well and lacks performance. There is also the possibility of increasing the overall performance of key value stores by using novel techniques to boost read and write throughput.

1.2 Contribution

The purpose of this thesis is to research, design and implement VoloDB, a distributed NoSQL key value store that offers high read and write throughput. Besides its high performance, it offers features that other key value stores, and NoSQL databases in general, lack. The following main features are implemented:

• Provides high throughput of reads and writes for primary key lookups.

• Fully supports transactions to allow execution of multiple operations as an atomic operation.

• Provides full compliance with ACID properties for all types of queries.

• Supports highly scalable non primary key lookup queries by exploiting the distribution/partition key.

• Supports many strong data types, with support for composite keys and values.


1.3 Outline

• Chapter 1: Gives a brief introduction and the motivation for the project.

• Chapter 2: Explains the background concepts required to understand the design and implementation of the project.

• Chapter 3: Discusses work related to this project that was considered while determining its design and feature set.

• Chapter 4: Describes in detail the system architecture and design of VoloDB.

• Chapter 5: Describes the implementation details of VoloDB.

• Chapter 6: Evaluates the performance of VoloDB by conducting various experiments and then comparing it with another open source high performance key value store.

• Chapter 7: Gives concluding remarks and discusses future work that could be carried out.


Chapter 2

Background

2.1 ACID Properties

ACID is a set of properties that describes the guarantees provided by transactions, which are central to any RDBMS. The concept of these properties for reliable transactions was described and implemented by Jim Gray[16], but the term ACID was introduced by Andreas Reuter and Theo Härder in their 1983 paper[17]. The properties are explained below:

Atomicity: All the operations in a transaction either complete or none of them do. There cannot be a situation where some operations executed successfully and some failed. Consider a financial transaction transferring money to another account: it has two operations, one to debit money from one account and another to credit the deducted money to the other account. If these operations are not done atomically, the amount could be credited to the destination account but never debited at the source, causing financial damage to the bank. It could also happen that the amount is debited at the source but never credited at the destination.

Consistency: This property ensures that only valid data is ever written to the database. Validity means that the data written does not violate any constraints or consistency rules. Any violation of the rules results in a rollback of all the operations in the transaction. If a transaction succeeds, it takes the database from one consistent state to another.

Isolation: Multiple transactions happening at the same time must not affect each other and must work in isolation. Any intermediate data generated by one transaction should not be visible to another. Databases usually provide several isolation levels that users can choose from, depending on their use case.


For example, Read Committed allows only committed data from one transaction to be visible to any other concurrent transaction. Another isolation level, Serializable, releases all locks only after the transaction completes, thus allowing transactions to happen one after another in a serial manner.

Durability: Any data committed by a transaction will be persisted and must not be lost. No system, hardware or power failure should result in the loss of committed data.

2.2 Strong and Eventual Consistency

Distributed systems generally offer two notions of consistency, described below:

Strong Consistency: This property ensures that all users of the system have the same view of the data. Put another way, a read operation always returns the last committed value of a write operation, for every user[18].

Eventual Consistency: This is a weaker form of consistency which does not guarantee that a read will always return the last written value, but states that after some period of time all reads will start returning the last written value, thus reaching a consistent state[19]. It is also referred to as weak consistency.

2.3 CAP Theorem

The CAP theorem, introduced by Eric A. Brewer[20], states that a distributed system can achieve only two of the following three properties:

Consistency: All users of the database have the exact same view of the data. It is the same as strong consistency.

Availability: The system is able to respond to the requests of all users even in the case of node failures. Every request should get a response from some non-failed node.

Partition Tolerance: The system stays in working condition even in the case of lost messages, node failures, or any number of nodes forming a separate partition because of isolation. The system must keep responding to any request received.


2.4 BASE

As discussed in section 2.1, the ACID model is used by transaction processing relational databases. This model can be too strong for some applications, resulting in very limited scalability. A weaker model called BASE[20], also introduced by Eric A. Brewer, exists. This model is generally adopted by NoSQL databases and has the following three properties:

Basic Availability: NoSQL databases tend to focus on availability, making sure that services are always up and running to respond to user requests even when there are disruptions and node failures. They tend to achieve this through active replication of data to many nodes, so that if a node dies, the data remains available to users.

Soft State: The requirement of consistency is generally weakened: data does not have to be consistent at all times, but should eventually converge to a valid consistent form after some period of time. The state is thus always soft, changing without the intervention of the user.

Eventual Consistency: As mentioned in section 2.2, this is a weaker form of strong consistency that drops the requirement of data being consistent at all times, but requires it to converge to a consistent state eventually. Weakening this property allows the system to scale massively.

2.5 SQL Based Databases

SQL is a declarative language used to manage and manipulate databases. SQL based databases adhere to the ACID model and fall into the following main categories:

Relational Databases (RDBMS):

As mentioned in chapter 1, a relational database organizes records into rows, or tuples, which are grouped together into a two-dimensional relation, or table. A row itself is an ordered collection of attributes containing associated values.

Object-Relational Databases (ORDBMS):

This kind of database is very similar to a relational database: the data is stored in tables, but it has many features inspired by Object Oriented Programming (OOP)[21]. Added features include inheritance, custom types, classes, polymorphism and object methods[22]. The data is manipulated in a fashion very similar to relational databases, using SQL.


Object Oriented Databases (OODBMS):

An OODBMS stores data in the form of objects only, which differs from an ORDBMS, which uses a hybrid approach of relational and object orientation[23]. It offers tight integration and direct mapping from object oriented programming languages, since both use the same model.

2.6 NoSQL Databases

These are another class of databases that do not adhere to ACID properties but instead to the BASE model, as discussed in section 2.4. Adhering to strong ACID properties prevents a system from scaling massively, so these databases weaken one of the properties to achieve the desired scalability. Compared to relational databases, they can be highly unstructured and need not be bound to a static schema. They are divided into the following main categories:

Key Value Store

This is the simplest type of NoSQL database: each record in the database has a value corresponding to a given key, as a (key, value) pair. The value is generally a byte array that can store any serialized data. The most common operations are Set, to insert or update a value, and Get, to retrieve it. Conceptually it is like the map or dictionary data structure of an object oriented programming language. Dynamo[8] by Amazon, Riak and FoundationDB are examples of this kind of database.

Document Oriented Store

This is one of the most popular types of NoSQL database; it stores documents in the form of semi-structured data, most commonly encoded in XML or JSON. Documents are identified by keys, but unlike relational databases, documents in a store do not need a fixed structure, and every document can have a totally different structure. MongoDB[9] and CouchDB[24] are examples of such databases.

Column Oriented Store

In this type of database, data is stored in terms of columns instead of rows. There is a concept of a column family that groups a number of columns. Each family can contain any number of columns, which can be added at runtime. Since the values of a particular column are stored together, any column of a table can be accessed extremely fast. Google's BigTable, Apache Cassandra and HBase are examples of column oriented databases.


Graph Database

These databases are ideal for data whose relationships are well suited to a graph structure. A graph consists of nodes, with edges between them representing relationships. The World Wide Web, network topologies, road and rail networks and the Facebook friends graph can all be represented well in these databases. Products such as OrientDB, InfiniteGraph and Neo4j are examples of this category.


Chapter 3

Related Work

Since the emergence of NoSQL databases, a lot of research and development has been taking place, resulting in a variety of products in every category with increasingly better performance. Databases that store big data can be complex, so just storing the data is not enough: while delivering high performance, a database should also be easy for new users, so they can focus on the problem at hand without thinking too much about how to use a complex database.

3.1 MapReduce

MapReduce is a programming model for parallel distributed data processing over a group of computers connected in a cluster. It was inspired by older concepts from functional programming, and in 2004 Google developed an implementation of this model to process large scale data over commodity computers[4].

The main idea of the framework is that it takes (key, value) pairs as input and produces another list of (key, value) pairs as output. The computation is done using two main functions:

Map: Takes as input a set of (key, value) pairs and produces another, intermediate list of (key, value) pairs. It can be represented as Map(key, value) -> List(key1, value1). This function is applied to the inputs in parallel on different distributed computers, and a separate group is created for each unique key, with the values for that key combined into a list.

Reduce: The reduction function takes as input each group generated after the map stage, as (key, list of values) pairs, and produces a list of values. It can be represented as Reduce(key1, List(value1)) -> List(value2).

The map and reduce functions are written by users, but all the underlying complex details and implementation are hidden and managed by the system. The user provided input is partitioned by the system and then assigned to worker machines in the cluster. All workers are controlled and coordinated by a master node. The input is processed in parallel by different worker machines applying the Map function to produce intermediate (key, value) pairs. The intermediate values are then grouped by unique key and again assigned to worker machines to execute the Reduce function. Each worker machine then produces the final output from the reduction function.

3.2 Apache Hadoop and YARN

Google's MapReduce implementation has inspired many other products and open source implementations of the model; Apache Hadoop is one of them. It allows distributed parallel processing of big data, using HDFS, modeled on Google's GFS, as storage and MapReduce as the processing framework. Apache Hadoop has itself inspired many other products and data processing tools, where it is used as the underlying base technology, as we will discuss later.

Apache Hadoop's MapReduce underwent a big overhaul with MapReduce 2.0, also called Apache YARN[25]. The initial version tightly integrated the processing framework with resource management, which caused scalability issues, and YARN introduced a different approach that fixed many of them. MapReduce has only one master node that tracks and manages all the jobs and nodes in the cluster, which results in scalability problems and a single point of failure: if the master node crashes, all currently running jobs fail. Moreover, if the number of running jobs becomes too high, scalability is limited, since it is difficult for one node to manage all concurrent jobs effectively.

YARN decouples resource management and job scheduling by introducing a per-cluster Resource Manager and a per-application Application Master. The Resource Manager manages all the resources in the cluster. A Node Manager was also introduced, which exists on every node to manage that node's health and heartbeats. Every job is handled by a separate Application Master, which manages the whole lifecycle of the job, such as negotiating the required resources. The decoupling significantly reduces the scalability issues, since the Resource Manager handles resources while the Application Masters separately handle jobs, removing a major bottleneck.

3.3 BigTable

BigTable is a column oriented NoSQL database developed by Google in 2006[5]. It is a proprietary database built over GFS, used internally by Google in numerous products such as Google Earth, Google Analytics and its web indexes. It efficiently handles petabytes of data, running on clusters of thousands of computers. BigTable


is in effect a distributed, sorted, multidimensional map. A high performance C++ API is available to interact with the database. Google not only uses it to store data for its products; it also serves as an input source or an output destination for MapReduce jobs.

BigTable arranges data in the form of tables. In each table, data is stored in rows, identified by keys. Keys are byte arrays with no specific type associated with them. Every row consists of column families, where each column family groups a number of related columns. Column families are fixed at table creation time, but on insert a row can skip any column family or add new columns to one at runtime. The columns are of a generic byte array type. Each column value is called a cell, referred to by the combination of row id, column family and column name. Each cell value is versioned by a timestamp: whenever a new value is written to a column, an updated value corresponding to a new timestamp is added. At read time, if no timestamp is specified, the value associated with the latest timestamp is returned; otherwise the cell value with the user given timestamp is returned. The number of versioned values kept per column family is configurable.

3.4 Apache HBase

Apache HBase is an open source implementation of BigTable[26]. It is written in Java and is built over Apache Hadoop and HDFS. It provides strong consistency of reads and writes and offers partition tolerance, so according to the CAP theorem it is a CP database system. It features in-memory operations, compression, compaction, linear scalability and automatic failover support. In 2010, Facebook implemented its messaging platform over HBase.

HBase provides a similar data model as BigTable, it has similar concepts such as structure of a row, column family and versioning. In HBase each column family is saved into a separate file in GFS that allows extremely fast access to columns. It provides auto sharding by partitioning tables automatically if they become too big.

Contiguous ranges of rows, called regions, are stored together, and each region is managed by one region server that is responsible for all reads and writes within it. Region servers can be added dynamically as needed. There is also a master node that manages all the region servers.

It offers an easy and comprehensive set of client libraries: a JRuby-based shell, a Thrift gateway, a Java API, REST with XML, and Protocol Buffers support. It supports real time queries and provides convenient classes for massive MapReduce jobs.


CHAPTER 3. RELATED WORK

3.5 Dynamo

It is a NoSQL distributed key value store developed by Amazon[8]. Amazon is an e-commerce giant with a customer centric business that has to deal with millions of customers at a time. With these needs in mind, Dynamo was designed to provide reliability at massive scale and to ensure an 'always on' experience. Amazon uses it for its internal core services. It provides high availability and partition tolerance with eventual consistency; according to the CAP theorem it is an AP database system.

Dynamo nodes are arranged into a ring structure like the Chord DHT[27]. Data is partitioned by consistent hashing: nodes are responsible for ranges of keys, and an incoming request can be received by any node. For a given key there is a preference list of nodes, available at every node in the system. In order to provide high availability and failover support, the data is replicated to the N-1 successor nodes in the ring. Inconsistencies between replicas are handled through Merkle trees[28], which ensure that correction is done quickly and without generating much data traffic over the network.

Another prime feature of Dynamo is that it is always writable, which is achieved by a concept called sloppy quorum. Instead of enforcing strict quorums, values are read from or written to any N healthy nodes at the given time. The values of R and W are configurable; for example, W can be set to 1 to significantly increase write availability. Every written value has a vector clock[29] value associated with it. Conflicting values resulting from node failures or partitions are resolved during the read operation. Conflicts that can be resolved automatically are handled by the system by examining the causal relationships of the vector clock values. For unresolved conflicts, all causally unrelated values are returned to the client, which reconciles them; the reconciled value is then considered correct and written back to the system. New nodes can easily be added to the system; there is no central authority that manages membership. Membership and failure detection are done through a gossip based protocol, which prevents a single point of failure.

Dynamo has inspired many other open source NoSQL databases that use its techniques. For end users, Amazon provides a database called DynamoDB, which is based on a data model similar to Dynamo's[30]. DynamoDB can be used as part of Amazon Web Services. It also provides numerous language bindings for developers, with support for Java, .NET, Erlang, Python and many other programming and scripting languages.

3.6 Apache Cassandra

It is an open source distributed store inspired by Google's BigTable and Amazon's Dynamo[7]. It was initially developed internally by Facebook, which later made it open source. Like Dynamo, it is highly available, partition tolerant and eventually consistent; according to the CAP theorem it is an AP database system. It is incrementally scalable, requires minimal administration and has no single point of failure. Although it is an eventually consistent data store, it can be tuned to offer strong consistency, trading off consistency against latency depending on the user's needs.

The data model of Cassandra is column oriented, similar to BigTable, which is discussed in section 3.3. The internals, such as node arrangement, data partitioning and replication, are based on techniques learned from Dynamo. It integrates with Hadoop and supports MapReduce. It supports other features such as secondary indexes on columns, online schema changes, compression, compaction and dynamic upgrades without downtime. In addition to bindings for popular programming and scripting languages such as C++, Java, Python and Ruby, it provides an easy to use SQL-like query language called the Cassandra Query Language (CQL).

3.7 MongoDB

It is a document oriented NoSQL database developed by MongoDB Inc[9]. Unlike traditional relational databases, it saves data in the form of JSON-like documents. The format used by MongoDB for network transfer and storage is BSON, the binary form of JSON, which is extremely efficient in terms of storage, network transfer and scan speed. MongoDB is a strongly consistent database and is a CP system as described by the CAP theorem.

MongoDB has a very dynamic schema in which documents are stored as JSON. Documents are identified by keys, but unlike rows in relational databases, documents in a store do not need to have a fixed structure; every document can have entirely different fields. It supports horizontal scaling to allow deployment of thousands of nodes on cloud platforms. The database supports MapReduce operations and contains an expressive query language for communicating with the database. MongoDB also has various management tools for deployment, monitoring and scaling. It has a large, active developer community driving the product forward at a rapid pace. MongoDB is considered the most popular and widely used NoSQL database, mostly used as a backend for medium and large scale websites; companies such as SAP, Foursquare, eBay and SourceForge use it in their backends.

3.8 Riak

Riak is a NoSQL key value store developed by Basho Technologies[31]. It is an open source project based on the principles and techniques learned from Dynamo. It is distributed and provides partition tolerance, high availability, scalability and fault tolerance. Riak maintains predictably low latency even during node failures, network partitions and peak load. It offers eventual consistency, though it can be tuned to offer strong consistency.

Riak is implemented in Erlang, a language ideal for massive distributed systems. The data model is extremely simple: each value is identified by a key and stored in a bucket. A value can be anything, such as bytes, XML, JSON or documents. Riak offers developers an extremely simple interface via a RESTful web services API for get and set operations. It also supports Protocol Buffers and offers language bindings for popular programming languages.

Creation of secondary indexes is allowed, and non key value operations on large data sets are supported through MapReduce. Another product called Riak Cloud Storage is also available: an open source database built on top of Riak to offer both public and private cloud solutions.

It has several limitations: it does not allow composite keys and values, it supports only a few strong data types, it has limited support for transactions in CP mode, and it does not support prune index scans.

3.9 FoundationDB

FoundationDB is a database that provides both SQL and NoSQL access with high performance[32]. It is a multi-model database based on a shared nothing architecture. At its core, FoundationDB is an ordered key value store, with additional features implemented on top in the form of layers. It is scalable, fault tolerant, and supports various operating systems and cloud platforms such as EC2. The database can be accessed through a command line interface, numerous language bindings and an SQL-like layer.

Currently most NoSQL databases are based on BASE and do not support ACID properties. The distinctive feature of FoundationDB is its support for transactions, which fully support ACID properties like traditional relational databases.

Though ACID properties are supported, there are numerous limitations. FoundationDB does not support transactions that last more than five seconds, or whose commit takes place more than five seconds after the first read operation. Keys and values in the database must not exceed 10 KB and 100 KB respectively, and the total data written to keys and values in a single transaction must not exceed 10 MB. These limitations exist for performance reasons, and failing to comply with them results in a failure. Beyond the limitations on transactions, it is also limited in terms of data types: it does not support strong data types, and key value pairs can only be byte strings. It also has no support for queries like scalable prune index scans, which are based on non primary key columns.

3.10 Aerospike

It is an in-memory NoSQL key value store, highly optimized for Flash and SSD storage. It is mainly an AP system according to the CAP theorem, providing partition tolerance and availability. The properties of the system are tunable and can be set to CP mode to allow strong consistency and ACID compliance.

It also claims to be 10 times faster than both Cassandra and MongoDB according to the Yahoo! Cloud Serving Benchmark[33]. Despite its various features it has many limitations. Transaction support in CP mode is limited: it does not allow bulk write operations in a transaction and imposes a limit on batch size. It has limited support for data types and does not allow composite keys and values. It also does not allow prune index scans.

3.11 Spanner and F1

Spanner is Google's proprietary modern semi-relational database management system and the successor of BigTable[10]. It is a NewSQL database that tries to solve the issues faced by NoSQL databases; in particular, Spanner fully supports transactions, which were lacking in BigTable. The data is replicated, stored and managed across distributed data centers. Its core API introduces a concept of clock uncertainty to provide strong consistency. It offers the scalability of a NoSQL database while guaranteeing the ACID properties of a traditional relational database system.

F1, a database management system based on Spanner, is used as the backend for the AdWords business[34]. For internal use, Google had implemented a custom version of MySQL that failed to scale, and F1 replaced it as well. F1 is a fully relational database system in the traditional sense, with ACID properties and SQL query support, but it also scales massively and hence has properties of both NoSQL and SQL databases.


Chapter 4

Design

VoloDB, our NoSQL key value store, is designed and implemented over MySQL Cluster. MySQL Cluster is an in-memory database that is scalable, ACID compliant, runs on commodity computers and provides 99.99% availability[35]. It is built upon shared nothing technology with auto-sharding to efficiently process massive read and write loads.

VoloDB's features are mapped onto MySQL Cluster concepts: a store is mapped to a table, a key value pair to a table row, and an individual key or value of a pair to a table column. An index on a store column is mapped to a standard column index of MySQL Cluster. There are various other MySQL Cluster-specific features that are exploited to enhance VoloDB's overall performance; these are discussed later.

4.1 MySQL Cluster

In order to understand how VoloDB is implemented over MySQL Cluster, it is important to understand its overall architecture, available client libraries and interfaces.

4.1.1 Architecture

MySQL Cluster is composed of several components. It integrates a standard MySQL database with an in-memory storage engine called Network DataBase (NDB). A cluster setup consists of a number of host machines that can run many processes, called nodes. A node can run a MySQL Server to help access the data, or it can only store data. There can also be one or more management servers to configure and manage the cluster.

The data stored by the NDB engine can be accessed by any of the MySQL Server nodes. An update made on a data node through any server is visible to the other server nodes. Data is actively replicated to other data nodes, so the failure of one data node does not cause availability issues, since the data remains available on other active data nodes. Nodes can be added, removed or restarted easily via the NDB management servers, which can also make configuration changes or apply software updates. MySQL Cluster also offers various client interfaces, such as the standard MySQL client, various language bindings and connectors, and even low level high performance interfaces. The architecture is shown in figure 4.1[36].

Figure 4.1: Overview of MySQL Cluster Architecture

4.1.2 Partition Key

A concept that is important to understand, and used throughout this project, is key partitioning. Since MySQL Cluster contains multiple data nodes, the partition key of a table determines on which data node the main replica of a row resides. The responsible data node is computed by taking an MD5 hash of the partition key columns specified by the user. In this way the data is randomly distributed across the data nodes, which provides load balancing. If no partition key is specified, the primary key columns are automatically taken as the partition key columns. The point to note is that the partition key must be part of the primary key.


4.2 NDB API

It was briefly discussed in section 4.1.1 that MySQL Cluster provides various interfaces that users and programmers can use to interact with it. Examples of the provided interfaces are shown in figure 4.1 under Clients/API. As we can see, most of these interfaces go through SQL nodes before reaching the data nodes. This extra layer in front of the data nodes can hinder performance for applications with extremely high throughput requirements. In figure 4.1, we can also see that the NDB API talks directly to the NDB storage engine running on the data nodes. The NDB API is a low level, high performance C++ programming interface for communicating with the data nodes.

VoloDB is built using this low level NDB C++ API. The interface provides functions to create tables and indexes and to insert, fetch, update and delete records asynchronously in a highly efficient way.

4.2.1 Asynchronous API

NDB API comes in two flavors, one synchronous and one asynchronous. The synchronous flavor, as the name suggests, is a collection of functions and classes that wait for results to return from MySQL Cluster. Though handy in certain scenarios, this behavior can be extremely inefficient when high throughput is required, and thus it is not feasible here.

The asynchronous flavor allows the user to send queries and commands to the cluster repeatedly without waiting for responses. The client can issue as many requests as it wants without blocking and then receive the responses in bulk, increasing throughput considerably. VoloDB is built upon this asynchronous version of the NDB API.

4.3 Abstractions and Mappings

As mentioned earlier, our VoloDB key value store is built upon MySQL Cluster. In order to make it work, it is very important to correctly map the concepts and entities of a relational database onto our NoSQL key value store. The following are the main mappings to MySQL Cluster.

4.3.1 Database

The concept of a database, a collection of individual stores, is mapped to a schema. A schema in MySQL Cluster is a collection of tables; in our case we treat it as a collection of stores.


4.3.2 Store

A store, which holds a set of key value pairs, is mapped to a database table. By abstracting a store over tables, we get all the features of MySQL Cluster tables for free, such as automatic sharding and replication, achieving high availability and failover support.

4.3.3 Key Value Pair

A key value pair in the store is mapped to a row of a table in MySQL Cluster. As MySQL Cluster is an in-memory database, it keeps all table rows in memory at all times, giving extremely fast access to the store's records.

4.3.4 Key

The key of an individual store is mapped to the primary key of a table that uniquely identifies a given row of the table.

4.3.5 Value

The values for a given key are stored as the columns of a table row. VoloDB supports multiple values per key, so internally a value can be mapped to more than one table column.

4.3.6 Index

Our key value store also internally creates indexes to allow faster access to data. An index on a store is simply mapped to an index on the corresponding MySQL Cluster table column.

4.3.7 Operations

The Set operation is mapped to Insert Into/Update Table, Get to Select From Table, Delete to Delete From Table, and Create/Drop Store to Create/Drop Table.

4.4 Feature Set

The following are the salient high level features of VoloDB.

4.4.1 Strong consistency

VoloDB is strongly consistent: all concurrent users have the same view of the store. Most current NoSQL databases offer a weaker form of consistency, as framed by the CAP theorem. The few NoSQL databases that do offer strong consistency are limited; HBase, for example, offers consistency only at the row level and not when a query affects more than one row. Our goal is to offer strong consistency without such limitations and to support table level consistency.

4.4.2 ACID Properties

As discussed earlier, NoSQL databases generally adhere to BASE instead of ACID because of scalability issues. The few NoSQL databases that do offer ACID properties lack performance and impose certain restrictions. VoloDB fully supports ACID properties without such performance or feature limitations by using the underlying MySQL Cluster functionality.

4.4.3 Transactions

VoloDB fully supports transactions with ACID properties, whereas most current key value stores lack transactions or provide only a weak form of them. Our transactions support not just a single query per transaction but as many as the user wants, all of which run atomically. VoloDB maps a transaction to a standard MySQL Cluster transaction and thus inherits all its required properties.

4.4.4 Prune Index Scans

VoloDB not only supports fetching a value by its unique key, it also supports queries that are not based on a primary key. Prune index scans are one of the distinct features of VoloDB, allowing users to query a store using the partition key.

As explained in section 4.1.2, the records of a table are distributed across data nodes using the partition key; since prune index scans query by this key, they hit only a single data node, which is very efficient.

4.4.5 Strong Data Types

Most current key value stores support only a generic byte array as the value for a key. Though this scheme allows the user to store any value, it can be inefficient and inconvenient, as the user has to encode and decode values to and from bytes. Since VoloDB maps values to strongly typed table columns, it offers better usability and performance.

4.4.6 Multi-column key

The key for a record in the store is not limited to a single column. A user can specify multiple key columns when creating a store, and they are not required to be of a generic byte type.


4.4.7 Multi-column value

VoloDB supports more than one value for a given key: when creating a store, a user can specify as many value columns as needed. Many key value stores support only a single value, so to store more than one value for a key the user has to pack and encode the values into a single byte array. Our multi-column values support scenarios where the user needs to store structure-like values efficiently, without any such overhead.

4.4.8 Supported Data Types

VoloDB supports a wide range of strong data types. A total of ten data types are supported: boolean, int32, uint32, int64, uint64, float, double, char, varchar and varbinary.

4.5 Performance Considerations

VoloDB takes several performance considerations into account before running operations. Not all kinds of queries are supported: queries that tend to be very slow are not allowed, to make sure the throughput remains high at all times. An inherently slow query not only slows down the current user, it can slow down the whole system, affecting the other concurrent users.

As discussed in section 4.1, data in MySQL Cluster is distributed across many data nodes. For every inserted row, an MD5 hash of the partition key columns determines the destination node where the data is stored. When querying a table, the cluster may have to hit every data node to access the required data. Queries of this type, which hit all the data nodes, are slow and are blocked by VoloDB.

4.5.1 Allowed Queries

Here we discuss the kinds of operations that are supported in VoloDB.

Primary Key Lookup/Insert: A primary key lookup or insert affects only one record, since the key is guaranteed to be unique. Also, as the partition key is required to be part of the primary key, the system can compute the data node responsible for the operation, so only a single data node is hit. Because this type of query touches just one data node, it can be executed efficiently and is allowed in our key value store.

Prune Index Scan: Prune index scan queries are those that do not use the primary key. They are based on the partition key, which maps to a single data node. A query on the partition key columns using the equality operator can fetch more than one record, but since the records were distributed using these partition key columns, the cluster still hits only one data node. As this type of query accesses only one data node, it is allowed in VoloDB.

4.5.2 Disallowed Queries

As a rule, any query that can end up hitting more than one data node is not allowed due to performance reasons.

Non Primary Key Operation: A query based on a non primary key is not allowed, since the matching record is not unique and the cluster would have to hit all the data nodes to fulfill the query.

Non Partition Key Operation: The partition key is only required to be part of the primary key, but queries based on it are allowed, since records are distributed using this key and MySQL Cluster can go directly to the destination data node without visiting the others. A query that is not based on the partition key columns is therefore disallowed.

Full Table/Index Scan: Although any query not based on either the primary or the partition key is already blocked, to make things clear: queries resulting in a full table or index scan are disallowed. Note that although an index scan can be much faster than a full table scan, satisfying such a query would still require going through all the data nodes to traverse the column index values, which is a performance hit.

4.6 Definition Operations

These are operations that are used to make changes in the schema/database of VoloDB. The supported operations are as follows:

4.6.1 Create Store

This operation creates a new store in VoloDB. A store is a container for key value pairs. The user must specify the store's fields, which have the following properties:

1. A single or a composite key columns which will uniquely identify the values.

2. One or more columns to hold values for a given key.

3. A single or composite partition key column set. This option is not mandatory; if it is not specified, the primary key is automatically taken as the partition key.


4.6.2 Delete Store

The Delete command is the opposite of Create: it removes the given store. Any records within the store are removed automatically.

4.7 Manipulation Operations

This set of operations allows the user to manipulate the data. The following manipulation operations are supported.

4.7.1 Set

This operation adds a key value pair to the given store. The user specifies the data for each key and value column to be inserted. The key must be unique and contain no null values. The key and non key columns must match exactly those defined at store creation; a user is not allowed to add or remove fields in a store on the fly. Note that no separate Update command exists to modify existing values: values are updated with the same Set command. If no value exists for the key, a new key value pair is added; otherwise the old values are overwritten with the new ones.

4.7.2 Get

Given a key, this operation fetches the values that were inserted earlier using the Set operation. As mentioned in section 4.5.1, only fetch queries that hit a single data node are supported. To satisfy this condition, only the equality operator on either the primary key or the partition key is allowed. If the query is based on the primary key, a single key value pair is returned; if it is based on the partition key, more than one pair can be returned.

4.7.3 Delete

Given a key, this operation removes the corresponding key value pair from the store. In order to delete a key value pair, the user has to provide the full primary key of the record.

4.7.4 Atomic Mode

All of the manipulation operations above can be grouped together into a transaction and run atomically: either all of the grouped operations run successfully or none do. The result is guaranteed to be strongly consistent even when concurrent transactions run on the same store.


4.8 High Level System Architecture

A high level view of VoloDB and how it fits in with the external entities is shown in figure 4.2. Our key value store sits between the clients and the MySQL Cluster data nodes. Clients talk to VoloDB using a custom library that transports data on the wire using Protocol Buffers; the client library and its implementation are discussed later. VoloDB receives requests from different clients and talks directly to the data nodes, without the additional MySQL Server layer in the middle. After receiving a response from a data node, VoloDB returns the result to the client, again using Protocol Buffers. The details of Protocol Buffers and the wire format are discussed in section 5.2.

Figure 4.2: High Level System Architecture (clients communicate with VoloDB via Protocol Buffers; VoloDB communicates with the NDBD data nodes via the NDB API)

4.9 VoloDB Architecture

The design of VoloDB is inspired by Mikael Ronstrom's benchmark program showing how to create a highly scalable key lookup engine[37]. The internal architecture of VoloDB is shown in figure 4.3, which highlights the concepts most critical to the design and performance.

4.9.1 Network I/O Handler

VoloDB has a single asynchronous I/O thread that handles all network traffic. All client requests, and the responses sent back, pass through this thread, shown as Network I/O Thread in figure 4.3. Whenever the network I/O thread receives a request from a client, it forwards it to one of the definer threads, using a round robin approach to distribute the load equally. The network I/O thread can also receive messages from the internal executor component: the response to a user request is handed over to it, and it is responsible for delivering the message to the destination client. In order to reply to the correct destinations, it keeps track of all connected clients by maintaining state information.


Figure 4.3: VoloDB Architecture (incoming requests flow from the network I/O thread through the definer threads to the executor threads, which talk directly to the data nodes; responses flow back the same way)

4.9.2 Definers

Definer threads sit between the network I/O thread and the executor threads and receive the user requests forwarded by the network I/O thread. Each definer thread is connected to all the executor threads running in the system. The number of definer threads is configurable and can be increased or decreased as per user requirements, but at least one definer thread is required.

The task of a definer thread is to decode received messages and assign them to executors. It collects as many messages as possible within a specified period of time (10 milliseconds by default); the requests are then randomly assigned to one of the executors.


4.9.3 Executors

Executor threads are responsible for running user queries by communicating directly with the data nodes using the asynchronous NDB API. The number of executor threads is configurable and can be increased or decreased as per user requirements, but at least one executor thread is required. For better performance, there should be a one to one mapping between executor threads and the data nodes in the cluster, so that executor 1 is responsible for handling queries for data node 1, executor 2 for data node 2, and so on.

Every executor thread collects the requests received from the definer threads, silently ignoring any unrecognized messages. It then prepares as many queries as possible in bulk, since sending queries to the data nodes in bulk has obvious performance benefits. After all the queries are prepared, they are sent in bulk to the data nodes using the NDB API in an asynchronous fashion. The executor thread then waits asynchronously for the responses from the data nodes. When a response arrives, the executor thread serializes it into the format expected by the client and forwards it to the network I/O thread for delivery.


Chapter 5

Implementation

VoloDB is completely written in C++ using the latest standard called C++11. In this chapter we will discuss implementation specific details of the store. It includes details of any specific language, framework or libraries used. It also mentions any optimizations and techniques used to enhance the performance of the store. All the implementation follows the same functional specifications and design guidelines as discussed earlier.

5.1 Transport

Sending and receiving data over the network is one of the most important aspects, since all client requests are received over the wire. Inefficient handling of network messages would slow down the system and hence adversely affect all connected users.

For our project, ZeroMQ[38] is used for all transport needs. ZeroMQ is an efficient, ultra fast socket library built for general purpose distributed system applications, with a focus on extreme optimization. It ships with various ready-made design patterns, transport mechanisms and language bindings. For VoloDB, the transport server used to service requests and responses over the network is written entirely using this library. A pattern called the asynchronous server is used, with a user-configurable number of worker threads, via the C/C++ language binding to achieve the best possible performance.

As alternatives, many other approaches and frameworks were considered for handling requests and responses over the network. One obvious approach was to write our own library using native sockets. This approach was risky, since it takes a lot of effort and time to achieve the same level of optimization, efficiency and stability that a popular, community driven, already stable and tested library readily provides. Another asynchronous, high performance and popular C++ library called Boost Asio[39] was considered, but it was not chosen due to its usability issues and complexity compared to ZeroMQ.

5.2 Serialization

In order for clients to send messages to VoloDB and receive responses, all messages must be encoded in a pre-defined format. The encoded message should be small, so that it takes less time to transfer over the wire; otherwise clients on a high-latency network will suffer poor performance. It should also be encoded in such a way that decoding is cheap; otherwise precious CPU cycles are spent decoding messages, cycles that could have been better spent on actual request handling.

The encoding and decoding of network messages is handled by Google's Protocol Buffers[40]. Protocol Buffers is an extensible, language- and platform-neutral way of serializing data. All messages and operations supported by VoloDB are defined in it, and classes are then generated using the tools provided by Google.

The generated classes can serialize themselves into a raw stream of bytes that can be sent over the network. The same generated classes are provided to the client, which can easily populate, encode and send them to VoloDB for handling.
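The actual message definitions are not reproduced here; a hypothetical, much-simplified schema of the following shape illustrates how a key-value operation can be described in Protocol Buffers (all message and field names are illustrative only, not VoloDB's real schema):

```protobuf
// Hypothetical, simplified operation message (proto2-era syntax).
message Operation {
  enum Type {
    READ   = 0;
    INSERT = 1;
    UPDATE = 2;
    DELETE = 3;
  }
  required Type   type  = 1;
  required string table = 2;
  required bytes  key   = 3;
  optional bytes  value = 4;  // absent for reads and deletes
}
```

Running protoc over such a definition generates the C++ classes (and matching client-side classes) with the serialization methods described above.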

As an alternative, many other approaches were considered for data serialization.

For example, the data could have been encoded as XML or JSON, but these formats are verbose and take many bytes to encode, which would slow down transfer over the network. Cap'n Proto[41] was one framework that could have been used instead of Protocol Buffers. It is another language- and platform-neutral serialization library that requires no separate encoding and decoding step: data added to an object is always laid out in a form suitable both as an in-memory representation and as a data interchange format. As a side effect, however, an object takes far more bytes than the corresponding Protocol Buffers serialization. Cap'n Proto is therefore ideal for use cases where serialized objects are stored and retrieved locally, but sending them over a network incurs a significant performance loss because of the larger data size. Another framework by Google, FlatBuffers[42], which is based on the same techniques as Cap'n Proto, was considered but left out for the same reason.
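One reason Protocol Buffers' output is so compact is its use of base-128 varints for integers and field tags: each byte carries 7 payload bits, with the high bit set on every byte except the last, so small values take a single byte. A minimal sketch of such an encoder (our own illustrative code, not part of the protobuf library):

```cpp
#include <cstdint>
#include <vector>

// Encode an unsigned integer as a base-128 varint, the scheme Protocol
// Buffers uses on the wire: 7 payload bits per byte, least significant
// group first, MSB set on all bytes except the last.
std::vector<uint8_t> encode_varint(uint64_t v) {
    std::vector<uint8_t> out;
    while (v >= 0x80) {
        out.push_back(static_cast<uint8_t>(v) | 0x80);
        v >>= 7;
    }
    out.push_back(static_cast<uint8_t>(v));
    return out;
}
```

For instance, the value 300 encodes to the two bytes 0xAC 0x02, whereas a JSON or XML representation of the same field would spend additional bytes on the field name and delimiters.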

5.3 Worker Threads

The worker threads run by VoloDB are standard C++11 threads, chosen for performance and portability. Other popular implementations such as Boost and POSIX threads were not used because of standardization and portability concerns. The number of worker threads is configurable, and each thread acts as either a definer or an executor, as discussed earlier.

