
Institutionen för datavetenskap

Department of Computer and Information Science

Final thesis

Database analysis and managing large data sets

in a trading environment

by

Per Månsson

LIU-IDA/LITH-EX-A--14/009--SE

2014-02-10


Supervisor: Patrick Lambrix

Examiner: Fang Wei-Kleiner


Abstract

Start-up companies today tend to find a need to scale up quickly and smoothly, to cover rapidly increasing demands for the services they create. It is also always a necessity to save money and find a cost-efficient solution which can meet the demands of the company.

This report uses Amazon Web Services for infrastructure. Hosting databases on the Elastic Compute Cloud, on the Relational Database Service and on Amazon DynamoDB for NoSQL storage is compared, benchmarked and evaluated.


Contents

1 Introduction
  1.1 Background
  1.2 Purpose
    1.2.1 Requirements
    1.2.2 Desires
  1.3 NoSQL and SQL
  1.4 Related Work
  1.5 Disclaimer
  1.6 Data sets
    1.6.1 Core
    1.6.2 Token
    1.6.3 Payment
    1.6.4 Log
    1.6.5 Search

2 Research
  2.1 The Cloud
  2.2 Amazon Web Services
    2.2.1 Instance Sizes
    2.2.2 Amazon Relational Database System
  2.3 Database Technologies
    2.3.1 Relational Databases
    2.3.2 Graph Databases
    2.3.3 Key/Value stores
    2.3.4 Document stores
  2.4 Database Systems
    2.4.1 PostgreSQL
    2.4.2 Cassandra
    2.4.3 Accumulo
    2.4.4 Neo4j
    2.4.5 Riak
    2.4.6 MongoDB
    2.4.7 Other solutions
    2.4.8 Amazon Relational Database Service
    2.4.9 Amazon DynamoDB
    2.4.10 Amazon Redshift
    2.4.11 Amazon ElastiCache
    2.4.12 Amazon SimpleDB
  2.5 Database systems selected for further evaluation
  2.6 Prototypes
    2.6.1 Cassandra
    2.6.2 MongoDB
    2.6.3 PostgreSQL
  2.7 Database usage plan
    2.7.1 Core
    2.7.2 Token and Payment
    2.7.3 Log
  2.8 Data Generation
    2.8.1 Core
    2.8.2 Log

3 Benchmarks
  3.1 Benchmark setup
  3.2 Core
    3.2.1 Tests
  3.3 Log
    3.3.1 Tests
  3.4 On Cacheability and Latency
  3.5 Benchmark Results
  3.6 Core
    3.6.1 Price for performance
  3.7 Log

4 Analysis
  4.1 Discussion
  4.2 Recommendation
    4.2.1 Token and Payment
    4.2.2 Log
    4.2.3 Core

5 Summary
  5.1 Future work


Chapter 1

Introduction

1.1 Background

Kulipa AB is a Swedish startup company creating a trading solution for mobile purchases.

Kulipa is running a trading system on Amazon Web Services. The databases responsible for storing the data they manage are the focus for this thesis work.

Prior to this report, everything was stored in several different PostgreSQL databases, and a little in Amazon SimpleDB. Kulipa had set up and were maintaining the Postgres databases themselves.

Concerns regarding the scalability of Postgres, as well as simplicity and maintainability as Kulipa grows, make them want to evaluate the alternatives for data storage. These data storage systems should all run on the Amazon cloud, either as a hosted database service or on the computing service.

1.2 Purpose

The goal of this work is to present a solution which ensures future scalability, flexibility, availability and maintainability for the involved databases. The following sections specify what is required for a database system to be acceptable at Kulipa.

1.2.1 Requirements

1. It shall be possible to take and restore backups of the system without downtime.

2. The system shall handle minor failures, for example single server crashes, without downtime.


3. The system shall be performant and scalable enough to cover the performance demands of Kulipa for the foreseeable future.

1.2.2 Desires

1. The system is desired to scale down to low price levels, so that Kulipa can start using the system as soon as possible.

2. The system is desired to scale elastically, so that servers can be added to the system to help with scaling without downtime.

3. The system is desired to be easily managed with regards to handling node failures, backups, etc.

1.3 NoSQL and SQL

In traditional relational databases a primary concern is the ACID property. ACID means Atomicity, Consistency, Isolation and Durability, which in turn means that a transaction in a traditional relational database will be atomic and independent of other simultaneous transactions in the database. This is a fairly strict requirement.

NoSQL was created from the idea that by sacrificing the ACID property one can gain other advantages, mainly in terms of scalability and raw speed [28]. Since the beginnings of NoSQL databases they have evolved mainly in the direction of the old relational databases, without sacrificing their advantages. For example, FoundationDB provides ACID transactions in a multi-node cluster [11]. Other databases guarantee isolation and atomicity at lower levels, for example at the row level in Cassandra [9].

1.4 Related Work

In Survey on NoSQL Database [34] Jing Han, Haihong E and Guan Le survey several NoSQL databases, and describe them by evaluating their pros and cons. The databases compared are Redis, Tokyo Cabinet, Flare, Cassandra, Hypertable, MongoDB and CouchDB. The conclusion reached is that to select your NoSQL database, you need to consider the business model, the demand for ACID transactions and other requirements. They clarify that each NoSQL database has its own strengths and weaknesses.

In NoSQL Evaluation: A Use Case Oriented Survey [35] Robin Hecht and Stefan Jablonski compare the data models, query possibilities, concurrency controls, partitioning and replication opportunities of several NoSQL databases. They mainly focus on which type of database technology one would want, such as key/value stores, document stores or graph databases. Since the NoSQL databases tend to be very different in characteristics, they help decide on which ones are good at what. They go fairly deeply into features such as Map Reduce support as a query possibility.

In Evaluating Caching and Storage Options on the Amazon Web Service Cloud [29] David Chiu and Gagan Agrawal evaluate data storage solutions on Amazon Web Services. They mainly compare storing on S3 versus EC2. Some pricing evaluation and network accessibility between the instance types is also compared. They do cost evaluation on all their solutions as well. They conclude that S3 is the most efficient storage option and using m1.xlarge is the fastest one in their study.

In A comparison between several NoSQL databases with comments and notes [37] Bogdan George Tudorica and Cristian Bucur compare Cassandra, HBase, Sherpa and MySQL with regards to read latency during varying workloads, focusing on percentages of reads and writes. They show that many databases have problems scaling past a certain point, except Cassandra, which seems to decline relatively gracefully. They conclude that NoSQL databases behave differently in similar situations and cannot be used interchangeably.

In Scalable SQL and NoSQL Data Stores [28] Rick Cattell examines a number of SQL and NoSQL storage solutions which are scalable over many servers. It serves as a good introduction to several of the databases available. The databases introduced are, among others, Riak, Redis, SimpleDB, MongoDB, HBase and Cassandra. It provides use cases for different kinds of NoSQL databases. It remarks on the variable nature of the data storage landscape and the moving target this presents.

In Eventual Consistency: How soon is eventual? An evaluation of Amazon S3's consistency behavior [27] David Bermbach and Stefan Tai create a method to evaluate the consistency behavior of data storage systems and apply it to Amazon S3, especially with regards to the rate at which data becomes stale.

1.5 Disclaimer

The database landscape changes very quickly. NoSQL as a concept is still fairly young and the databases are adding new features very quickly. For example, Cassandra was updated from version 1.2 to 2.0 during the writing of this project, and a lot of features were added during the upgrade, invalidating claims which were true before it.

1.6 Data sets

Five data sets are the focus for this report, named Core, Token, Payment, Log and Search. There is also a future interest in creating a system for analysis of accumulated data for reasons of market research and such things.


1.6.1 Core

Core contains information about users, stores, customers, organizations and several other subparts of the system. These are highly relational, where several stores belong to the same organization, cashiers work in stores, and so on. This database also stores the orders placed in the system. Orders track which user bought what, when, and where. A simplified overview of core can be seen in Figure 1.1.

1.6.2 Token

The token database stores authentication tokens for logging in with devices. The tokens are relatively relationless and are only used to record and verify logged-in clients.

1.6.3 Payment

Kulipa does not handle money directly, but uses third party providers for this. These payment services use credential systems for payments. These credentials are stored in this database. These, like the tokens, are relationless.

1.6.4 Log

The log database is responsible for storing the logs generated by all other services in the system. This could be large amounts of data. The logs need to be searchable, and logs which occurred within 10 days should be quickly accessible.

1.6.5 Search

The search database is used to search for nearby stores, products in stores, and so on.


Chapter 2

Research

In this chapter I write about Amazon Web Services, database technologies and the available database systems. I make a plan for which types of databases are interesting for which data sets. I describe how I created prototypes for the interesting database systems. How test data is generated is also covered.

2.1 The Cloud

The Cloud is a term which in practice means a service where you can provision (rent) computing power or some other resource. The most basic form of cloud computing is Infrastructure as a Service (IaaS), where you provision things like storage space and virtually hosted computers, on which you build your system. There are also higher levels, Platform as a Service (PaaS) and Software as a Service (SaaS), where you provision things like databases or even games. The Cloud is an alternative to purchasing physical computers and storing them in a data center.

The five essential characteristics of cloud computing are, as defined in The NIST Definition of Cloud Computing [36], as follows:

Self service. A customer can provision the resources they need, when they need them, without speaking to an employee of the provisioner.

Broad network access. Capabilities are available over the network through standard mechanisms such as a simple website.

Resource Pooling. The provider uses a multi-tenant model where the hardware is shared through virtualization. The process is transparent to the end user, who has no control over the locality of resources.

Rapid elasticity. Resources can be provisioned and released rapidly and elastically, so that capacity can follow demand.

Measured service. The utilization of resources can be monitored and reported, providing transparency to both the end user and the provisioner.

Kulipa uses Amazon Web Services, which is a Cloud provider. The main use is the IaaS, where they provision virtual computation instances and other things such as load balancers.

2.2 Amazon Web Services

Amazon Web Services (AWS) is a cloud computing platform consisting of many remote computing services.

AWS is a global service split into regions. Kulipa AB has all its services in the EU-West-1 region, which is the only European region, and located in Ireland. There are also four regions in North America, one in Brazil, two in Asia and one in Australia.

The EU-West-1 region consists of three Availability Zones (AZ). These are physical server centers located about one millisecond of network latency apart. They are separated to provide availability even if one of the AZs were to fail.

The computing service in AWS is called Amazon Elastic Compute Cloud (EC2). In this service you provision virtual servers which run in the cloud. There are many server types, from the cheap micro instance up to high end servers with dedicated hardware.

One can provision hard drives on a Storage Area Network and connect them to EC2 instances. This is called Elastic Block Storage (EBS). This is useful for storing data which needs to persist, since most of the storage space on the virtual servers is in ephemeral storage, meaning that if the instance is shut off for any reason, the data is removed.

There is also a storage service called Simple Storage Service (S3). This service allows you to upload data and pay elastically for what you use. It has additional features such as static website hosting.

2.2.1 Instance Sizes

The provisionable instances vary from the very low computing power of the m1.micro instance up to very high performance, highly specialized instances such as the c3.8xlarge compute-optimized instance, and the cg1.4xlarge with two NVIDIA Tesla graphics cards for GPU computing.

The instance sizes used in this work are the ones listed in Table 2.1. Throughout this report the naming conventions m1.micro vs just micro are used interchangeably, since all the instance sizes are from the m1 series.

Instance Type   Memory     EC2 Compute Units   Instance Storage
m1.micro        613 MiB    2                   None
m1.small        1.7 GiB    1                   160 GiB
m1.medium       3.75 GiB   2                   410 GiB
m1.large        7.5 GiB    4                   850 GiB
m1.xlarge       15 GiB     8                   1690 GiB

Table 2.1: AWS instance performance comparison table

The m1.micro instance type has 2 EC2 Compute Units, which is faster than the m1.small type. However, if the processor is used for more than short bursts of computation, it is severely throttled to something very low, approximately 10%, though this is not exactly specified [16].

In addition to the higher memory and computing power of the larger instances, they are also advertised as getting generally better performance for EBS access, networking and so on. Instances m1.large and above have an EBS-optimized option available. The m1.micro instance type has Low network performance, and the m1.xlarge has High. Exactly what this means is not specified.

Instance Type   Hourly price   Price per 30 days
m1.micro        $0.020         $14.4
m1.small        $0.065         $46.8
m1.medium       $0.130         $93.6
m1.large        $0.260         $187.2
m1.xlarge       $0.520         $374.4

Table 2.2: AWS instance price comparison table

The prices for the instances at the time of writing can be seen in Table 2.2. The micro instance is comparatively very cheap, as it works differently from the other sizes. The small instance is 3.25 times the price of the micro instance. For every increase in size after that, the instance price doubles.

2.2.2 Amazon Relational Database System

The Amazon Relational Database System (RDS) allows one to host a database as a service. This means one specifies the instance size and, on the larger types, the IO requirements. This is essentially equivalent to hosting your own database on EC2, except that the setup, the backups and the failover in case of failure are handled for you. It should be noted that the prices for the hosted service are slightly higher than for the corresponding setup of EC2 instances. This difference, roughly 10% of the cost of the hosted RDS service, is interpreted as the maintenance overhead.


2.3 Database Technologies

Databases differ in what technologies they use for implementation. Many databases start out in a single category but end up overlapping other categories as the database matures. The result is that, especially among the NoSQL databases, there is a lot of overlap. For example, Riak is a key/value store with extra features enabling it to act as a graph database in some situations.

2.3.1 Relational Databases

Relational databases are the classic answer to data storage, and are used by all of Kulipa's systems today. Emphasis is always on consistency, with ACID transactions being the norm.

In a relational database the data is organized into tables. The tables are strongly typed, so every row in a table contains the same fields. The set of specifications of all tables is called the schema. Data is commonly accessed through a Primary Key, which is a set of fields that uniquely identifies a row. It is also possible to look up rows based on almost any other field or constraint. Interconnections between tables are handled by Foreign Keys, where a field in one table refers to a key in another, most commonly the Primary Key.

Examples of relational databases are PostgreSQL, MySQL, MSSQL, OracleDB and many more.

This type of technology should be very suitable for the Core database, since it is relational with many interconnections.
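To make the relational model concrete, the following is a minimal sketch of how a few Core-like tables could be declared; the column names are hypothetical and do not reproduce Kulipa's actual schema, but the foreign keys illustrate the interconnections described above.

# A minimal sketch of a Core-like relational schema, executed via psycopg2.
# Table and column names are hypothetical illustrations only.
import psycopg2

SCHEMA = """
CREATE TABLE organization (
    id   serial PRIMARY KEY,
    name text NOT NULL
);

CREATE TABLE store (
    id              serial PRIMARY KEY,
    name            text NOT NULL,
    organization_id integer NOT NULL REFERENCES organization (id)  -- foreign key
);

CREATE TABLE "user" (
    id    serial PRIMARY KEY,   -- the primary key uniquely identifies a row
    name  text NOT NULL,
    phone text
);

CREATE TABLE "order" (
    id       serial PRIMARY KEY,
    user_id  integer NOT NULL REFERENCES "user" (id),
    store_id integer NOT NULL REFERENCES store (id),
    placed   timestamp NOT NULL,
    total    numeric(10, 2) NOT NULL
);
"""

conn = psycopg2.connect("dbname=core host=localhost user=postgres")
with conn, conn.cursor() as cur:
    cur.execute(SCHEMA)   # the transaction is committed when the block exits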

2.3.2 Graph Databases

Graph databases store data as a graph, with nodes and relations between them. This gives access times which scale with locality rather than with the total size of the database. Graph databases are part of the NoSQL movement and are schemaless.

A node can contain data, often organized as key/value pairs without a predefined structure. Connections between nodes are created dynamically and contain data the same way as nodes do.

Examples of graph databases are Neo4j and Titan.

Graph databases could possibly be used for Core, due to the relational nature of the stored data.

2.3.3 Key/Value stores

These store values which you look up using a primary key. They are based on the idea that if you get really fast access to reading and writing on the primary key, you can construct whatever features you need on top of that. A key/value store is approximately the same as a distributed hash table.


When you look up an entry in a key/value store you get the whole row back as a binary blob of data. Some key/value stores understand that a row can have columns, and can access only part of a row or use the columns in queries and indexes.

Examples of key/value stores are Cassandra and DynamoDB.

Key/Value stores are interesting mainly for the Token and Payment databases, but also for Log.

2.3.4 Document stores

Document stores store objects in the form of JSON or similar. These objects can be nested and queried on any part of the object. The database is schemaless, and not all objects need to contain the same fields. Document stores often have a primary key for unique access, similarly to a key/value store, but put more emphasis on the ability to store arbitrary objects and query on individual parts of them.

Examples of document stores are MongoDB and Couchbase.

The ability to query on secondary attributes makes this technology interesting for the Log database.

2.4 Database Systems

In this section I detail the various database solutions available. I go through their options for reliability and backups and how the machines are set up. I also write about Amazon's solutions for databases, RDS and DynamoDB. Characteristics such as consistency versus availability and how scaling works are also covered.

The databases I decided to focus on are covered in more detail in this chapter.

The ones I focus less on have one flaw or another which I discovered when reading about them. Things such as inability to do backups or simply having a data model which does not easily fit with any of the data sets involved disqualified the database from further research and testing.

2.4.1 PostgreSQL

PostgreSQL is the database system used for most database needs at Kulipa today. It is a fully relational ACID compliant database with excellent support for transactions, replication and backups [20].

2.4.1.1 Replication

Replication is handled via Write-Ahead Log (WAL) shipping. This variant of replication means shipping the WALs from the master server to one or more slave servers, which are therefore kept up-to-date. The slaves can then serve read requests, but not write requests. They can lag behind the master server due to delays in the WAL shipping.

2.4.1.2 Backup

PostgreSQL has several solutions for backing up the database. The simplest solution is to use pg_dump, a tool which dumps the database to SQL form. Restoring a backup is then simply a matter of feeding the dump to the PostgreSQL client, psql.

A more complex, but also more flexible, solution is on-line backup with point-in-time recovery. This requires more setup and involves setting up WAL archiving. Since the WALs are already used in replication, this is probably the best solution for Kulipa.
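As an illustration of the simpler pg_dump approach, the sketch below dumps and restores a database from Python; the host, database name and user are hypothetical placeholders, and authentication is assumed to be handled outside the script (for example via a .pgpass file).

# A minimal sketch of a pg_dump based backup and restore.
# Host, user and database name are hypothetical.
import subprocess

DB_HOST = "db.example.internal"
DB_NAME = "core"
DUMP_FILE = "core_backup.sql"

# Dump the database to a plain SQL file.
subprocess.check_call(
    ["pg_dump", "-h", DB_HOST, "-U", "postgres", "-f", DUMP_FILE, DB_NAME])

# Restoring is simply feeding the dump back through psql.
subprocess.check_call(
    ["psql", "-h", DB_HOST, "-U", "postgres", "-d", DB_NAME, "-f", DUMP_FILE])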

2.4.1.3 Setup

To set up a PostgreSQL system which can handle server failures, you need at least two servers, deployed in different AZs. This is the level of replication you get if you use the hosted Amazon RDS. For even more redundancy and read capacity, a third server could be used in a third AZ.

It is also possible to set up a read replica, which works the same way as a replica used for availability purposes but is instead used for read-only queries. One cannot, however, use the same replica both for availability purposes and as a read replica, since if the primary server goes down, that replica would need to act as both the read replica and the primary, which it by definition cannot handle.

2.4.1.4 Characteristics

Since PostgreSQL is a fully ACID database it prioritizes consistency over availability. In the case of a network partition, the master server would continue operating as usual, but the slave servers would be rendered unusable. Scaling PostgreSQL is mostly about using more powerful and more expensive servers. Putting a caching layer in front of a Postgres server is almost always a good thing, for example memcached [15], either hosted on EC2 or via Amazon ElastiCache [3]. If this is not enough, the database content would have to be sharded into smaller databases, which would be problematic due to the relational nature of Core.

2.4.2 Cassandra

Cassandra is a Key/Value store. In a Cassandra cluster, each instance is called a node. All nodes in a Cassandra cluster are equivalent. There are no master nodes or slave nodes [4]. This makes it fairly simple to add new nodes to a cluster, as they simply acquire the appropriate amount of space and the servers collaborate to make sure the data is synchronized and replicated.


When an item is stored in a Cassandra database, it is given a token. When this project started, each server was given a starting token, meaning it stored every token from its own starting token up to that of the next server in the ring. This has since changed to a virtual node system, where the servers divide up a set of virtual tokens, which effectively divides the token space between the servers.

It is recommended to store Cassandra data on the ephemeral storage rather than on EBS, due to the lower latency and higher speed [9]. This is in contrast to most other databases, where you can resurrect a node after stopping it. With Cassandra, to start a node you usually need to make sure it is empty anyway.

Cassandra is very established on Amazon EC2. It is used on EC2 by big actors such as Netflix [31]. Netflix has made a couple of tools for managing Cassandra clusters on AWS, as follows:

Priam. Priam is a co-process for Cassandra and is run on every node alongside Cassandra. It provides backup/recovery, token management, configuration management and an HTTP API for managing nodes [21].

Chaos Monkey. Tests the fault tolerance of the system by randomly killing nodes in production. This is to make sure that in case of an accidental node crash, the system will behave as expected [6]. This tool is not specific to Cassandra, but it shows a dedication to absolute availability in the face of server failures. The fact that they are using this, I think, speaks well for Cassandra.

2.4.2.1 Backup

Using Priam, Cassandra backups are simple [21]. All data is moved to a specified S3 bucket. If Priam is not used, it is fairly simple to do manually: a snapshot can be created by running nodetool snapshot simultaneously on all nodes. This creates a snapshot which can then be stored elsewhere [9]. It is also possible to enable incremental backups.

2.4.2.2 Setup

Cassandra needs at least 3 servers to be able to handle a node failure. In this scenario all data needs to be replicated on all three nodes. Cassandra can also be used with more servers, and is indeed designed to scale horizontally, meaning that adding more servers should increase the performance of the database a proportional amount. Netflix scaled a Cassandra cluster up to 300 nodes with linear scaling characteristics [30], which can be seen in Figure 2.1.


Figure 2.1: Netflix scaling test graph. Note the very linear scaling characteristics. Figure by Adrian Cockcroft taken from Netflix Tech Blog [30].

2.4.2.3 Characteristics

Cassandra is tuneable regarding availability versus consistency. For example, if some user data is read, it might not be a problem if the data is a few minutes out of sync. Order data shared by a customer and a store, however, needs to always be up to date. This is managed on a per-query level, where you specify a consistency level of ONE for an eventually consistent read, while to get real consistency you ask for a majority approval on the read.

To get consistent reads, so-called Quorum operations are used, meaning a majority of nodes must agree on a value. This way, even if some nodes are yet to be updated, a majority will agree and therefore be correct.
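As an illustration of this per-query tuning, the sketch below uses the DataStax Python driver to issue the same read at consistency level ONE and at QUORUM; the node address, keyspace, table and id value are hypothetical.

# A sketch of per-query consistency levels with the DataStax Python driver.
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["10.0.1.10"])      # hypothetical seed node address
session = cluster.connect("kulipa")   # hypothetical keyspace
order_id = "1234"                     # hypothetical key to look up

# Eventually consistent read: a single replica's answer is accepted.
fast = SimpleStatement(
    "SELECT * FROM orders WHERE id = %s",
    consistency_level=ConsistencyLevel.ONE)

# Consistent read: a majority (quorum) of replicas must agree on the value.
safe = SimpleStatement(
    "SELECT * FROM orders WHERE id = %s",
    consistency_level=ConsistencyLevel.QUORUM)

rows_fast = session.execute(fast, [order_id])
rows_safe = session.execute(safe, [order_id])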

2.4.3 Accumulo

Apache Accumulo is a sorted, distributed key/value store [1]. It was created in 2007 by the American National Security Agency, and was contributed to the Apache Foundation in 2011.

Accumulo has cell-level security, meaning that when you connect to the database with a certain role specified, for example admin, you can only see values in the database with a visibility specifier for admin. Visibility is specified as for example admin|user to allow both admin and user, or admin|(user&store) to allow admins and connections with both the user and store specifiers.

2.4.4 Neo4j

Neo4j is a graph database, which means that it stores data as a graph with nodes and relations between nodes [18]. For example, in Core, the users and stores would be nodes, and the relations between them would be the edges in the graph. This graph is queried by traversing these relations, which does not require lookups in large tables as in an SQL database.

Neo4j has support for both full and incremental backups on running clusters. There is support for report instances, used for running ad-hoc reports in production without disturbing the rest of the cluster.

All writes to the system have to go through a master node. This means that if writes are too slow, a more powerful machine is needed; adding more nodes does not help. Additional slave nodes do help with read-only queries.

If the master server crashes or otherwise becomes unavailable, the remaining slave servers will perform an election to automatically determine the new master.

2.4.4.1 Setup

A high availability cluster in Neo4j requires at least 3 servers.

2.4.5 Riak

Riak is a key/value store. In addition to primary key lookups, there are secondary indexes and link walking [23].

Link walking lets Riak act like a graph database. Links are created between objects and, for example, querying friends of friends is possible.

Riak is eventually consistent, meaning that not all clients will necessarily see the same values from the database at a given time.

Conflicting writes are resolved using vector clocks, which are a tool to determine in which order nodes have seen writes. Some conflicts are automatically resolvable; otherwise the application layer becomes responsible for choosing how to resolve conflicts.

2.4.6 MongoDB

MongoDB is a document store. Documents are stored in a binary JSON form called BSON [17].

Data replication in MongoDB is handled with replica sets. These are clusters of servers that replicate amongst one another and ensure automated failover. One server is the primary and all others are secondaries. All writes to a replica set are directed to the primary, and the secondary servers replicate from the primary asynchronously.

If a single replica set is not performant enough, a sharded cluster is used instead. Every shard contains a subset of a collection. A shard is deployed in a replica set.

2.4.6.1 Setup

The minimal setup for a redundant cluster is a single replica set. A replica set is two or more MongoDB servers.


If sharding is needed, the minimal setup becomes much larger, as follows:

3 x config servers. Configuration servers for the cluster.

2 x mongos. mongos is a routing service for MongoDB shard configurations.

Shards. The shards are replica sets containing the actual data.

2.4.7 Other solutions

These are solutions I decided not to research deeper after a cursory glance.

2.4.7.1 Couchbase

Couchbase, formerly known as Membase, is a document store. It evolved from memcached, adding disk persistence and replication. [7]

2.4.7.2 VoltDB

VoltDB is an in-memory SQL database designed for high performance transactions, for example in a trading environment. [26]

2.4.7.3 Kyoto Tycoon

Kyoto Tycoon is a key/value store which uses Kyoto Cabinet as a backend and is accessed through HTTP. [14]

2.4.7.4 Hypertable

Hypertable is a Google BigTable clone. [13]

2.4.7.5 HBase

HBase is a Google BigTable clone. [12]

2.4.7.6 CouchDB

CouchDB is a document store with the ability to go offline. [8]

2.4.7.7 Redis

Redis is an in-memory database with a focus on several data types and queries over them. There is support for hash tables, lists and more. [22]

2.4.7.8 Scalaris

Scalaris is a key/value store with ACID properties for multi key transactions. [24]


2.4.7.9 ElasticSearch

ElasticSearch is a database which focuses on full text search. This is not really something which is of interest for Kulipa. [10]

2.4.7.10 OrientDB

OrientDB is a graph database. [19]

2.4.7.11 Titan

Titan is a graph database layer which uses another database as a backend. [25]

2.4.8 Amazon Relational Database Service

Amazon Relational Database Service (RDS) is a service in which one provisions a database instance. These are either single instances or deployed in two separate AZs, to provide redundancy and failover.

Things like backups, replication, failover and host replacement are handled automatically by Amazon, which makes using this system simple.

You can choose between MySQL, PostgreSQL, Oracle DB and Microsoft SQL Server for the RDS instances. Of these, MySQL and PostgreSQL are evaluated, as they have free licenses.

2.4.9 Amazon DynamoDB

Amazon DynamoDB is a managed key/value store. No servers or hardware are provisioned. Instead, you provision the number of IO operations per second (iops) you need. You then pay for the provisioned iops and the amount of data you store.

Iops are provisioned in terms of read units per second and write units per second. A unit in this context means an operation on 1 kB of data or less. If you settle for eventually consistent reads you can do two reads per read unit, instead of one.

It is possible to scale the provisioned iops both upwards and downwards without any downtime.

2.4.9.1 Keys

DynamoDB allows you to specify two columns for your primary key, the hash key and the range key. A query will return the results ordered by the range key. Note that a range key is not required; it is allowed to simply use the hash key as primary.

On their homepage an example forum database layout is provided. The example uses Forum Name as hash key and Subject as range key. That way a query can easily ask for all posts in a given forum and order them by subject name.

In addition to the primary key, DynamoDB allows you to query on secondary indexes. These have to be specified at table creation time. These secondary indexes are a way to provide additional range keys to the database. You will still need the hash key to query, but you can, for example, order the results by date as well as by subject this way.
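The sketch below illustrates the forum example with a hash and a range key. It uses the boto3 library, which is not the driver used in this thesis, and the table name, attribute names and capacity numbers are placeholders chosen for illustration.

# A sketch of a DynamoDB table with a hash key and a range key, using boto3.
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb", region_name="eu-west-1")

table = dynamodb.create_table(
    TableName="Thread",
    KeySchema=[
        {"AttributeName": "ForumName", "KeyType": "HASH"},   # hash key
        {"AttributeName": "Subject", "KeyType": "RANGE"},    # range key
    ],
    AttributeDefinitions=[
        {"AttributeName": "ForumName", "AttributeType": "S"},
        {"AttributeName": "Subject", "AttributeType": "S"},
    ],
    ProvisionedThroughput={"ReadCapacityUnits": 150, "WriteCapacityUnits": 50},
)
table.wait_until_exists()

# All posts in one forum, returned ordered by the range key (Subject).
response = table.query(KeyConditionExpression=Key("ForumName").eq("Amazon DynamoDB"))
for item in response["Items"]:
    print(item["Subject"])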

2.4.9.2 Setup

The Token and Payment databases are good fits for DynamoDB, seeing as they would only use the hash key on the ID column as primary key.

Log is more problematic. For the last-10-days-for-a-service query you would want the service column as the hash key and the date column as the range key. For the connection time query, however, it is natural to use the owner column as hash key and the date column as range key. This disparity means you would either need to work around it and consolidate the hash/range keys into something DynamoDB specific, or create several log databases, one for each type of log. Neither of the alternatives seems attractive, so I decided against using DynamoDB for Log. It is still evaluated for the purposes of Payment and Token, however.

2.4.10 Amazon Redshift

Amazon Redshift is an analytics oriented database designed to work with huge data sets. This is possibly a good solution for a future analytics database, as it is designed for data warehousing rather than live databases.

2.4.11 Amazon ElastiCache

Amazon ElastiCache is a cache service designed to reside between a proper database and the user. This can drastically improve performance for an RDBMS, or any other system where lateral scaling is needed.

2.4.12 Amazon SimpleDB

Amazon SimpleDB is a key/value store which is primarily designed for ease of use. It does not allow data sets larger than 10 GB or more than 25 iops. It is possible to query on all columns in a SimpleDB database.

2.5 Database systems selected for further evaluation


EC2 hosted PostgreSQL. The configuration Kulipa is running today, with added replication for availability purposes.

RDS hosted PostgreSQL. The equivalent system to EC2 hosting, only hosted in RDS instead.

RDS hosted MySQL. An alternative to using PostgreSQL is MySQL.

Cassandra. The NoSQL database all other NoSQL databases compare themselves against, used by Netflix with great success on EC2.

MongoDB. Recommended by people at Amazon, advertised for simplicity of use.

DynamoDB. Amazon's NoSQL database.

2.6 Prototypes

I use a local virtual instance provisioning tool called Vagrant to set up virtual prototype clusters on my local computer, to test what it takes to start and configure instances so that they run properly and communicate with each other. The systems which were selected for evaluation but are not discussed in this section are not manually managed, and therefore do not require anything in the way of setting up.

2.6.1 Cassandra

For Cassandra I set up a cluster and give each server an initial token, so that each server is given an equal share of the token space. I also give each server the IP address of the first server I set up in the cluster, so that the servers can find each other using the gossip protocol [5]. An example setup is seen in Figure 2.2.

Figure 2.2: Cassandra example node setup. All nodes receive the same seed to the initial node. The solid lines are the connections established through the seed,


I create a so-called keyspace, a namespace which contains the data tables, for storing the test data, and set the replication factor to 3 to make sure the data is replicated on at least three nodes.
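A minimal sketch of the keyspace creation is shown below, assuming a hypothetical keyspace and table name; the replication factor of 3 is the relevant part.

# A sketch of creating a keyspace with replication factor 3 via the Python driver.
from cassandra.cluster import Cluster

cluster = Cluster(["10.0.1.10"])   # hypothetical address of the seed node
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS kulipa_test
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS kulipa_test.log (
        id uuid PRIMARY KEY,
        service text,
        action text,
        created timestamp
    )
""")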

2.6.2 MongoDB

I set up a single replica set of MongoDB, not a sharded cluster. I let the first node initiate the replica set using rs.initiate() from the mongo console, and once the other servers are started they are added from the master server using rs.add(node_ip), also from the mongo console. An example MongoDB setup can be seen in Figure 2.3.

Figure 2.3: MongoDB example setup. The connection between the nodes is explicitly established using rs.add(’10.0.1.132’) from the primary node.

Seeing as everything is funneled through the master server, two servers are enough for the purposes of this work.

2.6.3 PostgreSQL

I create one configuration for both the master server and the slaves, since options regarding only the master are ignored on the slaves and vice versa. This makes it simple to have a slave take over if the master fails, since it is configured equivalently.

In the configuration file I activate WAL logging and ensure that slave servers will be kept as hot standbys, which means that they are ready to replace the master at any time. These settings are appropriately mirrored for the slave servers.

On the slave servers, I create a recovery.conf file, which lets them know that they are slaves, and what master to connect to.
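The sketch below shows roughly what such a recovery.conf could contain, generated from Python; the master address, replication user and data directory are hypothetical, and the settings correspond to PostgreSQL 9.x streaming replication rather than Kulipa's exact configuration.

# A sketch of generating a slave's recovery.conf; all values are hypothetical.
RECOVERY_CONF = """\
standby_mode = 'on'
primary_conninfo = 'host={master} port=5432 user={user}'
"""

def write_recovery_conf(data_dir, master_ip, repl_user="replicator"):
    """Write a minimal recovery.conf into the slave's data directory."""
    path = "%s/recovery.conf" % data_dir
    with open(path, "w") as f:
        f.write(RECOVERY_CONF.format(master=master_ip, user=repl_user))

write_recovery_conf("/var/lib/postgresql/9.3/main", "10.0.1.10")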


The master server is started first. To create a slave, a backup is taken from the master, and recovered on the slave. When the slave subsequently starts, it connects to the master server and is brought up to date, and starts the replication process.

An example PostgreSQL setup can be seen in Figure 2.4.

Figure 2.4: PostgreSQL example setup. The slave server connects to the master using the primary_conninfo data in the recovery.conf file.

2.7 Database usage plan

2.7.1 Core

Since Core stores a data set with many interconnections, a relational database is an excellent fit. Either the PostgreSQL solution, the hosted RDS system with MySQL or one of the Graph databases would be the best solution for this data.

If the scaling demands grow too large, single relatively independent tables such as Order can be sharded out to a database of their own.

2.7.2 Token and Payment

These are very similar systems, since they both have completely non-relational data. These would preferably be stored in a key/value store such as DynamoDB or Cassandra.

2.7.3 Log

Log is the database that accumulates data the fastest, since all operations in the system are logged. The requirement for fast log access translates to fast access to approximately the 10 million latest log messages. Due to this requirement, the recent data needs to be stored in a way that makes such access fast. As the data is almost completely non-relational, a NoSQL store is reasonable.

Seeing as the fast-access requirement is 10 million log messages of 1 kB each, this means an active data set of 10 GB. This is precisely the upper limit of a data set stored in DynamoDB. If indexes are added, their sizes count towards the limit, meaning that DynamoDB cannot host Log in a single table if the data reaches the projected size.

Since one of Cassandra’s strengths is fast writes [9], and Log messages are very write-heavy, Cassandra seems like a good choice.

An alternative solution is to use a log framework such as scribe to write log messages to log files, and then use more ad-hoc methods for querying, such as grep and perl. This solution is not covered in this report, however.

2.8 Data Generation

I decided that since all the non-relational data is quite similar, I only generate the log data and test all the non-relational access patterns against it. This leaves me with two data sets, Core and Log.

2.8.1 Core

Core is a fairly large system with many interactions. I have extracted what I believe is the most important part of the system, namely the tables User, Store, Organization, Product, Order, Device and Terminal. A diagram over this is shown in Figure 1.1. This enables me to test the important relational queries.

The core generation creates the relations properly, but much of the data is simply randomly generated text. Some of the things, for example phone numbers and names, are generated from a data set.

The generation system starts off with a number of users and generates the other properties following that. A user has one or more devices, for approximately every 100 users there is a store, for every 2 stores there is an organization, and so on. It assumes users make many orders, on average 100 orders per user. The total size of the generated data amounted to slightly above 5GB.

2.8.2 Log

As with the Core data, the log data is mostly random. However, part of the access patterns of Log require specific structure for part of the data, so I ensure to generate this properly. This is further covered in the next chapter in Section 3.3.1.

As the rows of the tables are independent they are simply generated one at a time. The total size of the generated data amounts to almost 12GB.


Chapter 3

Benchmarks

In this chapter I write about the setup used for benchmarking and the tests procured for benchmarking the databases.

3.1 Benchmark setup

For the benchmark setup I create a Virtual Private Cloud (VPC) with subnets in all three AZs. For the EC2-hosted databases this means the instances do not have to be open to the internet, which is an attractive security feature.

For querying I set up three EC2 instances, one in each AZ, with all the benchmark scripts on them. I then use parallel-ssh to run the scripts on all these instances at once and gather the data. This creates a workload for the databases which ensures that the queries are not all on the local network. I run all scripts at a parallelism level of three and nine, which is one and three connections at a time, respectively, per instance.

3.2 Core

To benchmark the core system, the data was generated and put into the test setups, namely the EC2 Postgres database and the RDS hosted Postgres and MySQL databases.

Since all these systems use SQL, the tests are simply SQL queries which are timed. There are some differences between the syntax in Postgres and MySQL. For example, in a count(*) query in Postgres the resulting column is automatically named count, while in MySQL you have to write count(*) as count for the same effect. MySQL will happily do sum on a VARCHAR column, while in Postgres a cast statement needs to be inserted. One of the columns in Kulipa is named order, which conflicts with the order by statement in SQL. In Postgres this is handled by quoting the name, "order", while in MySQL it is instead surrounded by backticks, `order`. Such differences may require some queries to be written differently than engineers at Kulipa are used to.
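The sketch below illustrates the kind of dialect differences described above, with hypothetical table and column names; both strings use %s placeholders, which the Python drivers for both databases accept.

# Hypothetical illustration of the Postgres and MySQL dialect differences.
PG_QUERY = '''
    SELECT cashier_id, count(*)
    FROM "order"                      -- reserved word quoted with double quotes
    WHERE placed BETWEEN %s AND %s
    GROUP BY cashier_id
'''

MYSQL_QUERY = '''
    SELECT cashier_id, count(*) AS count
    FROM `order`                      -- reserved word quoted with backticks
    WHERE placed BETWEEN %s AND %s
    GROUP BY cashier_id
'''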

For the Postgres systems, I use the instance sizes from Table 2.1. For the AWS-hosted RDS, I use the corresponding database instance sizes.

3.2.1 Tests

These are the test queries run for each setup. Each test is a self-contained SQL script. I use different scripts for Postgres and MySQL to handle the differences in syntax. They have identical semantics, however, and should not result in any performance differences.

Get monthly top sellers This query gets the top selling products for all stores in an organization. This is the heaviest analytical test query for Core. It would not be expected to run very often due to the long query time, but could still provide useful information.

Average transaction value for frequent customers This extracts customers who are frequent, meaning they are repeat customers at the specified store, and extracts the average transaction values for the purchases of these customers between the dates specified.

Popular products among frequent customers Among the frequent customers between the two dates, this query answers which products are the most popular.

Popular products among non-frequent customers The same idea as above, only for non-frequent customers instead of frequent ones. The idea is to find out if the products attracting new customers are the same ones as the products which keep customers coming back.

Sells by cashier This query lists the amount of sales all cashiers make in the specified time frame.

3.3 Log

For the non-relational data I generate the log data and put it into Cassandra, MongoDB and Amazon DynamoDB, as well as PostgreSQL.

All the EC2-hosted databases use the instance sizes from Table 2.1. For DynamoDB I tried to match the price level and run the benchmarks at a price point which is approximately equivalent to the EC2 instances. This is not entirely trivial, as you pay elastically for the amount of space you use in DynamoDB, while the instances have fixed storage space. The amount of IOPS I provisioned to match each instance size is listed in Table 3.1. The price for 10 write units or 50 read units is $0.00735 per hour. I estimated that the amount of write operations should be about 25% of the total amount of operations. If the difference between the price listed in the table and the instance costs in Table 2.2 is taken as the storage cost, I believe I undershoot the storage costs a bit. This needs to be kept in mind when comparing the results.

Instance Type matched   Write Capacity   Read Capacity   IOPS total price
m1.small                50               150             $0.059
m1.medium               100              300             $0.118
m1.large                200              600             $0.235
m1.xlarge               400              1200            $0.470

Table 3.1: DynamoDB provisioned IOPS matching instance sizes
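As a sanity check of Table 3.1, the following short calculation reproduces the listed hourly prices from the per-unit rate and the chosen 25% write share.

# Reproduce the Table 3.1 prices: $0.00735/hour buys 10 write units or 50 read units.
PRICE_PER_BLOCK = 0.00735  # USD per hour

def hourly_price(write_units, read_units):
    return write_units / 10.0 * PRICE_PER_BLOCK + read_units / 50.0 * PRICE_PER_BLOCK

for name, writes, reads in [("m1.small", 50, 150), ("m1.medium", 100, 300),
                            ("m1.large", 200, 600), ("m1.xlarge", 400, 1200)]:
    print(name, round(hourly_price(writes, reads), 3))
# Prints approximately 0.059, 0.118, 0.235 and 0.470, matching the table.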

When I moved on from prototypes to the benchmark servers, Cassandra had released an updated version making the initial token assignment irrelevant, so it was removed. Instead, a system of virtual tokens is used, distributing tokens randomly across the cluster. This makes the setup of a Cassandra cluster significantly simpler.

For Cassandra the minimal setup is three instances. I run all the tests in this configuration for each instance size.

Compared to Cassandra, MongoDB is very comfortable to use. The Python code for the connection time query is simply

client.test.log.find({'action': 'connection.lost', 'owner': owner})

which can then be iterated over. The objects returned are structured so that they can be accessed using simply obj['meta_data']['connection_lost']. Compare this to Cassandra, where an indexed expression has to be explicitly constructed, and the objects are not naturally nested.
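A slightly fuller sketch of the query above, using pymongo, is shown below; the server address, the owner value and the exact field names under meta_data are hypothetical, and the subtraction assumes the timestamps are stored as dates.

# A sketch of the MongoDB connection time query with pymongo.
from pymongo import MongoClient

client = MongoClient("10.0.1.131")   # hypothetical address of the primary
owner = "some-user-id"               # hypothetical owner to query for

cursor = client.test.log.find({'action': 'connection.lost', 'owner': owner})
for obj in cursor:
    # The nested document comes back as a plain Python dictionary.
    meta = obj['meta_data']
    connected_time = obj['created'] - meta['connection_established']
    print(owner, connected_time)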

3.3.1 Tests

These are the test queries run for each setup. For each test and database a Python script is made. I chose Python because appropriate drivers exist for all the databases, and Kulipa uses a lot of Python in their codebase.

A bug in the index usage in Cassandra makes the Get single row by ID query time out. It also affects the replication ability, so you have to recreate all indexes when replicating a node. This bug is fixed in the latest version of Cassandra, but the version I ran the tests on was affected. This unfortunately means I have no data for that test for Cassandra.

Get all log lines for this service from last 10 days Simple query for getting interesting logs regarding a service.


Get a single row by ID The benchmark test for single row access, useful for the Token and Payment databases. For this test I fetch a random set of 1000 rows from the database on the same connection.

Get login times This query fetches logs of the type connection.lost, which contain an internal field of type connection_established. The two are subtracted to find the total time the user was connected. Since this uses a particular subfield, care is taken to generate it properly in the data generation.

3.4 On Cacheability and Latency

Since most data tends to be fairly cacheable, the benchmarks are not written to spread the queries over the entire data set, but rather a small part of it. For example, it is unlikely that every user of the system tries accessing every store at once, but more likely that a subset of users use a subset of the stores. Later orders are more interesting than old orders, and so on.

The time taken for connecting to the database is included in most benchmarks. This is because I am looking for the real total time taken for a typical query, not only the time spent inside the database. If the benchmarks were comparing different ways of obtaining the same results, the connection time could be discarded.

3.5 Benchmark Results

After all the tests are run and the data collected, the data is averaged and plotted in the following charts. The bars represent the average and the error bars are plus/minus one standard deviation. For some queries the standard deviation actually extends below 0; this does not mean that some queries took negative time, but rather that the standard deviation is larger than the average. If you look at the actual data where this happens, you find that the timings form groups of some very quick runs and some very slow ones. This is the contrast between the micro instance running at full speed and when it gets throttled.

Most scripts were run 100 times per query instance, meaning 300 times in total for a parallelism level of three and 900 times for nine. The slowest ones were occasionally terminated early when a stable result was deemed to have been achieved.

3.6 Core

Average transaction value for frequent customers (Figure 3.1) As can be seen in Figure 3.1, the micro instances seem to be doing very well on the 3 parallel benchmark, actually faster than the small instances, followed by an expected successive reduction in time taken by the larger instances. On the 9 parallel benchmark the micro instances become very unstable, with a very large standard deviation.

Figure 3.1: Average transaction value for frequent customers benchmark in seconds. (a) 3 parallel; (b) 9 parallel.

The EC2-hosted Postgres database seems to be doing very well, but with a strange hiccup in that the large instance runs slower than the medium instance.

Figure 3.2: Most popular by month benchmark in seconds. (a) 3 parallel; (b) 9 parallel.

Most Popular by Month (Figure 3.2) The micro instances are very slow and have a very high standard deviation in this benchmark, as seen in Figure 3.2. The other instances behave as predicted, approximately doubling performance for each instance size.

Orders per Cashier (Figure 3.3) Here RDS Postgres seems to underperform, while MySQL seems to perform very well. Note that also here the micro instances are faster than the small instances.

Figure 3.3: Orders per cashier benchmark in seconds. (a) 3 parallel; (b) 9 parallel.

Figure 3.4: Rare buyers buy what benchmark in seconds. (a) 3 parallel; (b) 9 parallel.

Rare Buyers Buy What (Figure 3.4) Here in Figure 3.4 the micro level of RDS MySQL seems extremely unstable, while otherwise performing very well. RDS Postgres follows a more predictable curve than the EC2 one.

Store Total Sales (Figure 3.5) In Figure 3.5 we see that RDS Postgres seems very slow and unstable. RDS MySQL also acts unstable for the micro instance. For the others, EC2 Postgres has a hiccup at the large level, but otherwise seems predictable. RDS MySQL completely crushes the other databases for this query.

Figure 3.5: Store total sales benchmark in seconds. (a) 3 parallel; (b) 9 parallel.

Top 10 Products for Frequent Customers (Figure 3.6) For this benchmark, shown in Figure 3.6, the databases seem to act approximately the same as in the previous benchmarks. RDS MySQL is a lot faster than the alternatives here as well.

Figure 3.6: Top 10 products for frequent customers benchmark in seconds. (a) 3 parallel; (b) 9 parallel.

3.6.1 Price for performance

Since this report aims to optimize the price for the necessary performance, it is interesting to plot the data in relation to the price of the instance size used. Two representative benchmark results were chosen and the resulting data multiplied by the price of the database instance to get a price-per-performance figure.

The result of this can be seen in Figure 3.7. Since the instance price doubles for every step up in size, the expected result would be a doubling in performance as well, giving a flat characteristic, which is approximately what is seen in the graph. The large instance of EC2 Postgres is clearly performing uncharacteristically well. This is discussed further in Section 4.1.
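A small sketch of the weighting used for Figure 3.7 is given below, using the hourly prices from Table 2.2; the example numbers are made up to show why a doubling in both price and performance gives a flat curve.

# Price-for-performance weighting: average query time multiplied by hourly price.
HOURLY_PRICE = {"micro": 0.020, "small": 0.065, "medium": 0.130,
                "large": 0.260, "xlarge": 0.520}

def price_for_performance(avg_seconds, instance_size):
    return avg_seconds * HOURLY_PRICE[instance_size]

# A query averaging 40 s on a small instance and 20 s on a medium instance
# gets the same weighted value, i.e. a flat characteristic.
print(price_for_performance(40, "small") == price_for_performance(20, "medium"))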


Figure 3.7: Representative benchmarks weighted with instance costs. Lower means a lower price for performance. (a) The most popular by month benchmark; (b) the orders per cashier benchmark.

A clear trend in this graph is that RDS MySQL is significantly cheaper per unit of performance, in addition to having better absolute performance. The notable exception is the micro instances, where EC2-hosted Postgres was superior for the most popular by month benchmark and RDS Postgres was the better choice for the orders per cashier benchmark.

3.7 Log

When inserting data into MongoDB on the small instance size, the insertion became unresponsive after about 80% of the insert script, reducing the insertion rate to less than 1%. I aborted the insertion and ran some test queries on the database. It turned out that scripts which ought to take less than a second took more than 45 seconds to run. I abandoned my efforts to run MongoDB on small instances. On every other instance size, MongoDB ran all the benchmarks.

Raw Access by ID (Figure 3.8) Since this is such a simple benchmark, all the databases completed it without trouble. The only missing measurement is the MongoDB small instance.

For the 3 parallel benchmark, DynamoDB did not improve its performance when capacities were increased, and Cassandra only improved when going from small to medium; the higher sizes are basically the same. For the 9 parallel benchmark the same effect can be spotted, only between medium and large for Cassandra. DynamoDB also sees an improvement when going from small to medium, but then flattens out. These results indicate that nine parallel clients running the Python drivers cannot saturate the databases on the higher instance sizes.


The differences in time between the databases are interpreted as indicating the difference in latency for a query. MongoDB seems to have the lowest latency of the tested databases.

Figure 3.8: Raw access by ID in seconds. (a) 3 parallel; (b) 9 parallel.

Get last 10 days for a service (Figure 3.9) Only MongoDB and Postgres were able to run this benchmark. Upgrading to a larger instance does not seem to improve speeds much on this benchmark either, which suggests that it is fast enough to be dominated by latency rather than database performance.

[Figure 3.9: Get last 10 days for a service benchmark in seconds. (a) 3 parallel. (b) 9 parallel.]

Get login times (Figure 3.10) In this benchmark MongoDB performs very poorly. The slowdown is so severe that it suggests a broken or missing index, or something equivalent, because a difference in access times this large is not seen anywhere else.
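A first step in investigating a suspected broken or missing index could be to inspect and recreate it from pymongo, as sketched below (collection and field names are assumptions):

from pymongo import MongoClient, ASCENDING, DESCENDING

log = MongoClient("localhost", 27017)["log"]["entries"]  # hypothetical names

# List the indexes MongoDB actually has on the collection.
print(log.index_information())

# (Re)create the compound index the login-times query needs; create_index
# is a no-op if an identical index already exists.
log.create_index([("user_id", ASCENDING), ("timestamp", DESCENDING)])

# explain() shows whether the query now uses the index instead of a full
# collection scan.
print(log.find({"user_id": 42}).sort("timestamp", DESCENDING).explain())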



For Cassandra and Postgres, the results seem to be capped by latency rather than by database performance.

[Figure 3.10: Get login times for user in seconds. (a) 3 parallel. (b) 9 parallel.]


Chapter 4

Analysis

In which the benchmark results are discussed and a recommendation is presented.

4.1 Discussion

For the most part, the databases behaved as expected, at least for the instance sizes small up to xlarge.

Generally the micro instances were surprisingly fast; for example in Figure 3.1 they are faster than the small instances. They clearly dropped off when the level of parallelism increased, and were much less predictable than the normal instances. This is probably because of the advertised throttling of CPU performance when the workload increases. This is also visible in the Most popular by month benchmark (Figure 3.2), where they seem to get throttled even at the lower level of parallelism.

The large instance of EC2 Postgres seems to act strangely on many benchmarks; for example in the Store Total Sales benchmark in Figure 3.5 it is slower than the medium instance. It is possible that there were some network connectivity issues, since the effect is not apparent in the analytical most popular by month benchmark. This is probably not a general issue with the large instances, but rather a problem with the specific one I ran the benchmark on. To be really sure one could rerun the benchmarks on a new large instance.

Hosting PostgreSQL on EC2 versus using the Amazon-hosted RDS version seems to be fairly equivalent performance-wise. With the exception of the large instances, the EC2 version tends to be slightly faster. If a PostgreSQL system were to be chosen, the EC2 version is thus both slightly cheaper and slightly faster. This has to be weighed against the simplicity of RDS, where setup and maintenance are much easier.

In most benchmarks the RDS MySQL was significantly faster than either of the Postgres solutions. For example, in the Rare Buyers buy What benchmark in Figure 3.4, MySQL completes the test in about 50% to 70% of the time taken by the RDS Postgres solution. This weighs heavily in favor of MySQL. The RDS MySQL instances are also cheaper than the RDS PostgreSQL ones. Both, however, are more expensive than the manual EC2 hosting.

Looking at the log tests, the difference between Postgres and the NoSQL databases was smaller than I expected. Otherwise, I think the NoSQL databases mostly lived up to their hype, with a general increase in performance for a specific set of tasks. Unfortunately most of the tests did not stress the databases quite as much as I had hoped.

With the results from Log, one can conclude that there is no significant difference in latency between the different instance types. In the Raw Access by ID benchmark (Figure 3.8) there was no apparent difference at all between the medium and xlarge instance sizes. Since the xlarge instance certainly has more computational power, the similarity in performance must be attributed to network latency. The differences in network performance characteristics described for the instances might concern throughput only, or are perhaps not as significant as one might think.

4.2 Recommendation

All the solutions tested live up to the requirements presented in the introduction chapter. Most of the desires are fulfilled as well. In conclusion, all systems are acceptable. This recommendation presents what I believe to be the most desirable solution.

The desired solution should optimize these criteria:
• Simplicity of transferral from the old system,
• Scalability for the foreseeable future, and
• Performance per dollar.

Each of the Kulipa data sets is examined in turn.

4.2.1 Token and Payment

For storing any type of non-relational data and accessing it by row ID, any of the databases examined can certainly fulfill the requirements. Judging primarily by the Raw Access benchmark, all the databases ran into latency issues on my setup.

If the database were chosen for lowest latency, MongoDB would be a good choice. It is also very pleasant to interact with from Python and the other languages used at Kulipa.

If simplicity of setup is deemed more important, DynamoDB is the superior choice. It is not quite as nice to interact with, but very usable. It is also significantly simpler to scale either upwards or downwards. You can even do things like scale down over night, or scale up significantly during popular times, both of which are very bothersome to do in MongoDB.
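As an illustration of both points, here is a sketch of row access and of changing the provisioned throughput with the boto3 client; boto3 is newer than the library used in this work, and the table name, key schema and capacity values are hypothetical.

import boto3

# Fetch a single item by its primary key (table and key names hypothetical).
dynamodb = boto3.resource("dynamodb", region_name="eu-west-1")
table = dynamodb.Table("tokens")
item = table.get_item(Key={"token_id": "token-123"}).get("Item")
print(item)

# Scale the provisioned throughput up or down, for example over night.
client = boto3.client("dynamodb", region_name="eu-west-1")
client.update_table(
    TableName="tokens",
    ProvisionedThroughput={"ReadCapacityUnits": 10, "WriteCapacityUnits": 5},
)

The same throughput change could of course be scheduled, which is what makes the scale-down-over-night pattern practical.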

If a 5 ms latency is acceptable, I would recommend DynamoDB for both Token and Payment. Otherwise, MongoDB seems an excellent choice.

4.2.2 Log

For the log database, Cassandra unfortunately could not run one of the tests and MongoDB performed very poorly in the other. Still, if one assumes that the MongoDB issue is fixable so that it reaches performance equivalent to its first test, and extrapolates both tests under the assumption that each database performs roughly the same relative to Postgres in both, some conclusions can be drawn.

Both NoSQL databases are faster than Postgres in this case. Since Cassandra is optimized for writes, it seems like a good choice for a log database. Seeing as it is used heavily on EC2 by Netflix, I think it is the best choice.

A problem with the Cassandra solution is that it should use ephemeral storage. This means it has to use at least three instances of the small type, which is more than what is used at Kulipa today. Another problem is the indexing bug discussed in Section 3.3.1. To overcome the bug one could either use the latest version directly from the Cassandra developers or wait until the Ubuntu repositories pull in a fixed version. A good compromise may be to either set up Cassandra on micro instances for the time being and ignore the bug, or to continue using the current Postgres solution until the system becomes large enough to utilize three small instances and the bug is fixed.

One thing not done in this report is testing write performance in addition to read performance. This is mostly relevant for the log database, where a large percentage of the load is presumably writes. However, this is covered fairly thoroughly in A comparison between several NoSQL databases with comments and notes [37], where they find that Cassandra does not have a problem with scaling up writes.
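To give an idea of what appending to such a log looks like from the DataStax Python driver, here is a sketch; the hosts, keyspace and table schema are assumptions.

from datetime import datetime
from uuid import uuid1

from cassandra.cluster import Cluster

# Connect to a three-node ring (hypothetical addresses and keyspace).
cluster = Cluster(["10.0.0.11", "10.0.0.12", "10.0.0.13"])
session = cluster.connect("log")

# Prepared statements are parsed once and reused for every write.
insert = session.prepare(
    "INSERT INTO entries (service_id, ts, id, message) VALUES (?, ?, ?, ?)")

# Append one log entry; appends like this are the write pattern Cassandra
# is optimized for.
session.execute(insert, (42, datetime.utcnow(), uuid1(), "user logged in"))

cluster.shutdown()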

4.2.3 Core

Examining the Core benchmarks, MySQL is generally the fastest of the solutions. It is the solution that performs best, and best per dollar as well. However, MySQL has fewer protections against programmer mistakes, some examples of which can be seen in [32], and a history of security flaws, such as allowing a user to connect with any password if attempted enough times [33]. These flaws are not insurmountable by any means, since Kulipa uses an ORM layer to abstract away most details about the database, and it should not be possible to connect to the database from outside of the VPC anyway. If these flaws are considered acceptable, MySQL is a good choice.



If the increased robustness of Postgres is deemed necessary, the choice between RDS Postgres and EC2 Postgres remains. The performance seems to be approximately equivalent, without a clear advantage to either. The price for RDS is approximately 30% higher than for EC2. This has to be weighed against the complexity of maintaining an EC2-hosted cluster.

An advantage of EC2 hosting is flexibility in backups; there can be problems with RDS backups if, for example, you want to move out of RDS [2]. This counts against both Postgres and MySQL when hosted in RDS. All in all, I think EC2 Postgres seems the way to go. It is the cheapest solution, as well as the most flexible. It unfortunately needs the most manual maintenance. It is also what Kulipa is running today, which makes the migration costs very low and does not require engineers at Kulipa to learn the quirks of another database.

The only thing necessary for the transition would be to add a replica server and start the replication stream to it.


Chapter 5

Summary

In this report a number of database solutions were considered for several data sets. The data sets considered were the highly relational Core, the purely non-relational Token and Payment databases, and the semi-relational, write-heavy Log database.

The data sets were grouped based on whether they were sets of relational data, which left two data sets on which to run benchmarks. It was decided to test the relational data in relational databases and the non-relational data in both relational and non-relational NoSQL databases.

After researching several database solutions, a few were selected as the most interesting ones. For all data sets, the current system in place at Kulipa today stores the data in relational Postgres databases. For the relational data, it was decided to compare the current system of EC2-hosted Postgres databases against Amazon RDS-hosted Postgres as well as MySQL databases. For the non-relational data the current Postgres solution was compared against Cassandra, MongoDB and Amazon DynamoDB.

For the benchmarks, both relational and non-relational data sets were randomly generated and inserted into the corresponding database solutions. A set of benchmarks grounded in business logic from Kulipa was produced. These benchmarks were then run over all database systems with the generated data.

After the database solutions were tested, the results were analyzed and a recommendation for a scalable future database solution was presented. The recommendation took several factors into consideration, including simplicity of setup and maintenance as well as performance and price. The recommended systems were to keep the current solution, EC2-hosted Postgres, for the relational data, to use DynamoDB for the purely non-relational data sets Token and Payment, and finally to use Cassandra for the semi-relational, write-heavy Log database.



5.1 Future work

Since the field of NoSQL databases is so young, there are always opportunities to run database benchmarks with newer versions and with new databases. Several database systems were decided against, many of them for reasons of maturity, for example because there was no real system in place for backups. When these come to maturity they can be tested properly.

The results from this report are probably fairly specific to companies using Amazon Web Services for hosting. The database benchmarks used were chosen from business requirements for Kulipa. If someone is interested in doing equivalent benchmarks, a set of tests should be constructed from the requirements on the database from the point of view of the interested party. Another thing to note is that the prices for almost everything on AWS have been lowered several times over the last few years. Naturally, if the prices for hosting are lowered, so are the prices for the databases.

Unfortunately I did not really touch on the Search database in this report. It seemed very different from the other databases, so I left it aside and eventually ran out of time.

The future concern of an analysis database was also left fairly untouched. This is mainly because Kulipa is small enough that large-scale analysis of data seems very far away, and can for a long time probably be covered by simply running analysis jobs on the Core database at low-traffic times, for example at night. When this becomes too heavy a task for Core to handle, this point can be revisited and something like Amazon Redshift could be analyzed more in depth.


Bibliography

[1] Accumulo. http://accumulo.apache.org/, Mar. 2013.
[2] Amazon discussion forums about downloading backups. https://forums.aws.amazon.com/message.jspa?messageID=239119#jive-message-242280, Dec. 2013.
[3] Amazon ElastiCache. http://aws.amazon.com/elasticache/, Dec. 2013.
[4] Cassandra. http://cassandra.apache.org/, Mar. 2013.
[5] Cassandra gossiper. http://wiki.apache.org/cassandra/ArchitectureGossip, Dec. 2013.
[6] Chaos Monkey wiki on GitHub. https://github.com/Netflix/SimianArmy/wiki/Chaos-Home, Dec. 2013.
[7] Couchbase. http://www.couchbase.org/membase, Mar. 2013.
[8] CouchDB. http://couchdb.apache.org/, Mar. 2013.
[9] DataStax architecture planning for EC2. http://www.datastax.com/documentation/cassandra/2.0/webhelp/index.html, Dec. 2013.
[10] Elasticsearch. http://www.elasticsearch.org/, Mar. 2013.
[11] FoundationDB. http://foundationdb.com, Mar. 2013.
[12] HBase. http://hbase.apache.org/, Mar. 2013.
[13] Hypertable. http://hypertable.org/, Mar. 2013.
[14] Kyoto Tycoon. http://fallabs.com/kyototycoon/, Mar. 2013.
[15] Memcached. http://memcached.org/about, Dec. 2013.
[16] Micro instances throttling explanation. http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/concepts_micro_instances.html, Dec. 2013.
