MASTER'S THESIS
Scalable Storage of Sensor Data for a High Frequency Measurement Vehicular
Monitoring System
Erik Eliasson, Daniel Öhrlund
Master of Science in Engineering Technology, Computer Science and Engineering
Luleå University of Technology
Department of Computer Science, Electrical and Space Engineering
December 9, 2012
ABSTRACT
This thesis investigates how to store sensor data for a large scale vehicular monitoring system with high frequency measurements. The system is intended to measure multiple sensors at a rate of one measurement per second and allow for non-real-time analysis of the data. The goal is that the system should be able to support all cars in Sweden, which means that 240 TB of raw sensor data must be stored every year. To keep up with the new data generated by the cars, sensor batches (a batch is a collection of sensor measurements captured at the same time by a single car) must also be inserted at a rate of at least 110 000 per second.
Cassandra is identified to be the database that best suits the system requirements.
It is then investigated how to store and index the sensor data in Cassandra to be able to perform one- and two-dimensional range queries. The data models focus on how to store the data so that it is equally distributed over the cluster and is able to scale up for large numbers of records.
An empirical study is performed that evaluates the performance of insertion, retrieval and storage for the different data models. The result is then used to determine how many servers would be required to meet the system requirements. The study shows that it is feasible to store the data in Cassandra. Fewer than 21 servers are required to reach the insertion rate, and the data can be stored using between 360 and 1260 servers (with 2 TB of storage each) per year with three replicas of the data, depending on the indexing approach.
Keywords: databases, scalable data storage, Cassandra, big data, sensor data, vehicular monitoring system, vehicle monitoring
PREFACE
This thesis was carried out during the spring semester of 2012 at Tieto AB in Luleå.
They wanted to investigate how to create a generic system to support pay as you drive and other similar applications that use sensor values from vehicles to generate statistical data. However, the scope of the initial project was deemed too broad. During the analysis of the problem we realised that one of the bigger challenges would be to handle the huge amount of sensor data that the system needs to store. The purpose then changed to finding a scalable storage solution for the system, which is the focus of this report.
We would like to thank Tieto AB Luleå, our supervisors Evgeny Osipov (LTU), Olov Schelén (LTU) and Robin Mellberg (Tieto), and Jim Sundqvist (Tieto) for coming up with the original idea for the thesis and providing guidance along the way. We would also like to thank our families and friends for their support and understanding.
Erik Eliasson, Daniel Öhrlund
CONTENTS
Chapter 1 – Introduction 1
1.1 System Specification . . . . 1
1.2 Assumptions, delimitations and purpose . . . . 3
1.3 Structure of the report . . . . 4
Chapter 2 – Data load analysis 7
2.1 The amount of data generated by vehicles . . . . 7
2.2 Data transfer at gas stations . . . . 8
2.3 The required insertion rate . . . . 9
Chapter 3 – Databases and database management systems 11
3.1 Schemas . . . . 11
3.2 Queries . . . . 11
3.3 The ACID contract . . . . 12
3.3.1 Atomicity . . . . 12
3.3.2 Consistency . . . . 12
3.3.3 Isolation . . . . 12
3.3.4 Durability . . . . 12
3.4 Indexing . . . . 12
3.5 Redundancy (replication) . . . . 13
3.6 Distribution . . . . 13
3.6.1 Clusters . . . . 13
3.7 CAP theorem . . . . 13
3.8 Partitioning and sharding . . . . 13
3.9 Caching . . . . 14
3.10 NoSQL . . . . 14
3.11 Existing distributed databases . . . . 15
Chapter 4 – Choosing a database 17
Chapter 5 – Introduction to Cassandra 21
5.1 Data model . . . . 21
5.2 Cassandra architecture . . . . 22
5.3 Communication . . . . 23
5.4 Queries . . . . 24
5.5 Mutating data . . . . 24
5.6 Replication & Consistency level . . . . 25
5.7 Composites . . . . 25
5.8 Compression . . . . 25
5.9 Secondary indices . . . . 26
5.10 Spatial indexing in Cassandra . . . . 26
5.10.1 Using a Z-curve . . . . 26
5.10.2 BinTree . . . . 28
5.10.3 Why a BinTree is a good fit for Cassandra . . . . 30
Chapter 6 – Related work 31
6.1 Discussion . . . . 32
Chapter 7 – Data models for storing and indexing sensor data 35
7.1 Storing sensor batches in Cassandra . . . . 35
7.1.1 Alternative key generation methods . . . . 36
7.1.2 Alternative model . . . . 36
7.2 Value-embedded index vs Reference-embedded index . . . . 37
7.3 Data models for indexed range queries in Cassandra . . . . 37
7.3.1 Approach 1 - Many rows . . . . 37
7.3.2 Approach 2 - Wide rows . . . . 39
7.3.3 Approach 3 - BinTree overlay . . . . 40
7.3.4 Pagination of query results . . . . 46
7.3.5 Multidimensional queries . . . . 47
Chapter 8 – Empirical study 49
8.1 Cassandra configuration . . . . 50
8.2 Tests . . . . 51
8.2.1 Inserting unindexed batches . . . . 53
8.2.2 Retrieving unindexed batches . . . . 53
8.2.3 Determining the best split threshold for retrievals . . . . 53
8.2.4 Determining the best split threshold for insertions . . . . 53
8.2.5 Insertion rate for the indexing approaches . . . . 53
8.2.6 Retrieval rate for the indexing approaches as the retrieval size grows . . . . 54
8.2.7 Retrieval rate for the indexing approaches as the number of stored indices grows . . . . 54
8.2.8 Storage . . . . 54
8.3 Data structures . . . . 54
8.4 Time measurements . . . . 54
Chapter 9 – Results and Evaluation 57
9.1 Inserting unindexed batches . . . . 57
9.2 Retrieving unindexed batches . . . . 58
9.3 Determining the best split threshold for retrievals . . . . 59
9.4 Determining the best split threshold for insertions . . . . 60
9.5 Insertion rate for the indexing approaches . . . . 62
9.6 Retrieval rate for the indexing approaches as the retrieval size grows . . 63
9.7 Retrieval rate for the indexing approaches as the number of stored indices grows . . . . 64
9.8 Storage . . . . 65
9.8.1 Overhead . . . . 66
Chapter 10 – Hardware estimation for a full-scale system 69
10.1 Insertion rate . . . . 69
10.2 Storage . . . . 71
10.3 Retrieval times . . . . 72
10.4 Putting it all together . . . . 74
Chapter 11 – Discussion 75
Chapter 12 – Conclusion 79
Appendix A – Complexities of a balanced BinTree from equally distributed points in the plane 81
A.1 Height of a balanced BinTree . . . . 81
A.2 Creation complexity . . . . 82
A.2.1 Alternative amortized analysis of the number of operations . . . . 83
A.2.2 Traversals . . . . 83
A.3 Simulation verifying the result . . . . 83
Appendix B – Estimating the effect of multiple rows vs one row for approach 2 87
Appendix C – String constants in tests 89
Appendix D – Test results 91
Appendix E – Approximating approach 3 traversal impact in practice 93
Appendix F – Retrieval time for indexed range queries over network 95
CHAPTER 1 – Introduction
Having access to logged sensor data from every vehicle in a country enables very rich and powerful applications that do not exist today. An example of such an application would be to allow the highway department to identify dangerous sites based on how many hard decelerations occur in the area. Another example could be a pay as you drive insurance solution (PAYD), where everyone could get an individual insurance fee based on their own driving performance. By providing access to the logged data in a generic way, it becomes possible to perform other types of analysis on the data as well, and it enables other, currently unthought-of, applications to be created. The question is how to store the data for such a system.
The systems that exist today that monitor vehicles are mainly fleet tracking systems and pay as you drive insurance solutions. Fleet tracking systems gather sensor data at low frequencies, about once every five minutes, and upload the data either by manually extracting it from the vehicles or by using cellular networks. PAYD solutions aggregate the data extensively and only upload predefined evaluation metrics. Common to these systems is that neither logs nor stores sensor data at high frequency [1].
In order to provide generic access to sensor data from all vehicles in Sweden, a system that collects, stores and makes it possible to retrieve the data in a generic way is to be created.
1.1 System Specification
The system is used to gather and store one- and two-dimensional sensor data from vehicles. The purpose of the system is to be able to support non-real-time analysis of the gathered data for various applications that can access the data through an API. It should be possible to retrieve data by specifying a range of interest, such as all values in a region or all values with a speed greater than 120 km/h.
As can be seen in Figure 1.1 the system consists of:
• client devices (the hardware mounted inside the vehicles),
• servers that the client devices and the applications communicate with (these will be referred to as backend-servers from here on)
Figure 1.1: An architectural overview of the system and its components.
• and the data storage solution.
Each client device is identified by a unique identifier that we refer to as the deviceID.
The client devices collect sensor data when the vehicles are driven, and as soon as they are connected to the internet they start uploading the logged data to the backend-servers. The backend-servers are responsible for storing the data in an efficient manner in the database, as well as retrieving the correct data from the database upon requests from applications.
According to the specification of requirements, every vehicle will log data from each sensor once a second. The collection of these sensor values combined with the deviceID is referred to as a batch. The system shall support one- and two-dimensional sensor values, and be capable of having every registered car in Sweden connected to the system.
To make sure that all client devices will be connected regularly, an initial idea is to place hotspots at gas stations, as every car has to pass one at some point in time. However, most client devices will be connected much more often, since hotspots are common in family homes, at offices, in stores, etc.
Since the sensor values arrive in groups when a car arrives at a hotspot, the type of operation required by the data store is a grouped insert. The same goes for retrievals: as the system is used for non-real-time analysis, values are retrieved in groups. Both one- and two-dimensional range queries should be supported.
Sensor data consists of measurements of a physical quantity, usually accompanied by a timestamp of when the measurement was performed. A pair of a sensor value and a timestamp represents the state of a sensor at a specified point in time. That point in time will never occur again. Thus, a pair of a sensor value and a timestamp is immutable. This means that to be able to represent the state of a sensor over time, a lot of data points must be recorded.
The values generated by a sensor take on a bounded set of values determined by the properties of the physical quantity measured and the precision of the sensor hardware. The precision (number of decimals) and range of values determine the data type needed. If the value has a precision lower than 7 decimal digits and a magnitude less than 2^24, it can be stored using a float, which is only 4 bytes. If the precision is higher (up to ≈16 decimals) or the value is bigger, a double has to be used, which is stored using 8 bytes. If the higher accuracy is not needed, the storage space can be reduced by using floats instead.
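As a minimal, self-contained illustration of this size/precision trade-off (not part of the original system), the following Python snippet packs a hypothetical high-precision reading as a 4-byte float and as an 8-byte double and shows how much precision survives:

```python
import struct

value = 12.3456789012345  # hypothetical high-precision sensor reading

as_float = struct.pack('>f', value)   # 4 bytes, ~7 significant decimal digits
as_double = struct.pack('>d', value)  # 8 bytes, ~15-16 significant decimal digits

print(len(as_float), struct.unpack('>f', as_float)[0])    # 4, value rounded to float precision
print(len(as_double), struct.unpack('>d', as_double)[0])  # 8, value preserved
```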
Most sensors measure in one dimension, but there are some, such as GPS, that measure in two dimensions. As of today, no sensor imaginable for this type of system measures values in more than two dimensions. Sensors likely to be used for this type of system include those measuring:
• acceleration,
• speed,
• temperature,
• pressure,
• and position (GPS).
1.2 Assumptions, delimitations and purpose
In this thesis we will focus on finding a data storage solution for this system that is capable of storing a huge amount of sensor data, and on identifying how to structure the data so that it can be both inserted and retrieved in a fast manner. The specification of requirements states that every vehicle in the system will gather data from its sensors at an interval of one data point per second. This means that if the system shall be capable of handling every registered vehicle in Sweden, it has to store about 10^12 data points at a total size of approximately 240 TB each year, which is more than the total size of most databases today^1. During full load the system must also handle the insertion of more than 100 000 data points per second.
The system is intended to work as a platform that provides access to the data for external third-party applications. The applications that will be built on top of the platform are not known in advance. It is therefore necessary to be able to provide the data in a general way that can be used for any kind of application that might want to use it. The employer therefore wants to store the data without any aggregation.
Because of the amount of time dedicated to this thesis, we will not be able to implement our own data store and will therefore use an existing database management system (dbms). As our budget for the thesis is practically non-existent, the dbms and all other software included in our solution must be free to use. We assume that there already exist good load-balancing solutions that can be used for the back-end servers, and we also presuppose that data can be accurately transmitted from a vehicle to the back-end servers using hotspots.

1 Compare Business Products, "10 largest databases in the world." [online] http://www.comparebusinessproducts.com/fyi/10-largest-databases-in-the-world, Mar. 2010. [Accessed 2012-10-19].
The purpose of this thesis has therefore been to identify the dbms most suitable for the system and evaluate how to store the data so that it:
• scales well and can handle these amounts of sensors data,
• supports one- and two-dimensional range queries and
• has fast insertion and retrieval speeds.
There are several published papers related to this area and our purpose. Considerations in the architecture and approach for storing big data are evaluated in [2]. [3], [4], [5] and [6] perform surveys of data stores capable of storing a lot of data and scaling up to support big data applications. [6] and [7] evaluate the performance and scalability of the most common NoSQL databases. [8] investigates the performance of storing sensor data in both SQL and NoSQL databases, and the difference between using physical and virtual servers. The performance of one-dimensional range queries in various key-value stores and distributed ordered tables^2 is investigated in [9]. [10], [11], [12] and [13] examine the most common methods for two-dimensional spatial indexing, such as kd-trees, quadtrees and R-trees. Variations of these for both spatial and multidimensional queries are presented in [14] and [15]. [16] shows a spatial indexing structure based on the B+-tree and R-tree, but using a grid structure instead of a tree to access the leaf nodes. [17] and [18] examine how to map multidimensional data to one dimension using space-filling curves such as the Z-order curve and the Hilbert curve. [19] investigates how to perform spatial queries in key-value stores, and [20] and [21] present methods for performing multidimensional range queries in distributed ordered tables.
1.3 Structure of the report
The rest of this report is structured as follows: In Chapter 2 a data load analysis is performed to approximate how much data the system needs to handle. Chapter 3 then provides background information about databases and database management systems.
Chapter 4 covers the choosing of a database for this type of system; the requirements are examined and Cassandra is identified to be the database that best suits them. The most important aspects of Cassandra are then presented in Chapter 5.
Chapter 6 covers work related to our investigation. Chapter 7 then presents different data models in Cassandra for storing and indexing sensor data that allow range queries in one and two dimensions to be performed. An empirical study is then presented in Chapter 8, describing the different tests that have been performed. The evaluation of the results is then covered in Chapter 9. Chapter 10 estimates the number of servers required to meet the system requirements based on the empirical study. The results and possible improvements are then discussed in Chapter 11. The report is finally concluded in Chapter 12, where we also discuss future work.

2 A distributed database that partitions on continuous sets of keys and supports one-dimensional range queries.
CHAPTER 2 – Data load analysis
The system is intended to be able to store sensor data from all cars in Sweden, generated every second while driving. In this chapter we will try to approximate how much data this actually is and how much data a car has to transmit on average each time it has to refuel.
Average gas consumption, G_cons: 0.083 L/km = 8.3 · 10^-8 m^3/m
Total number of cars, N_cars: 4335182
Average travel distance per year for a car, d_avg: 12910 km = 1.291 · 10^7 m
Table 2.1: Swedish car statistics
Table 2.1 shows the car statistics that are required for the analysis, taken from Statistics Sweden (SCB) [22]. We assume that the average gas tank volume, V_tank, is 60 liters = 0.06 m^3 and that a car drives with an average speed, v_avg, of 60 km/h = 16.67 m/s.
2.1 The amount of data generated by vehicles
Using the average distance a car is driven during a year from Table 2.1 and the assumed average speed, the average time a car spends driving is given by

t_avg = d_avg / v_avg = 774600 s.

Consequently, since a batch is generated every second, the average number of sensor batches generated by a single car in a year is also equal to this value:

N_batches per car = 774600.
Multiplying this by the number of cars from Table 2.1, the total number of batches per year is

N_tot,batches per year = 774600 · 4335182 = 3358031977200 ≈ 3.36 · 10^12.    (2.1)

We assume that a batch contains measurements from at most 9 different sensors and that each sensor has at most two dimensions that can be stored using floats (4 bytes). A batch also contains a timestamp of when the measurements were taken, stored using a long (8 bytes). The size of a batch is therefore 9 · 2 · 4 + 8 = 80 bytes. This makes the total amount of data generated by a single car in a year

D_car per year = N_batches per car · 80 B = 774600 · 80 B = 61968000 B ≈ 59.10 MB    (2.2)

and the total amount of data generated by all cars in Sweden

D_year,tot = 3358031977200 · 80 B ≈ 244.34 TB.    (2.3)
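The estimate above is easy to reproduce. The following Python sketch recomputes Eqs. 2.1-2.3 from the Table 2.1 figures and our assumptions (one batch per second, 80-byte batches, 60 km/h average speed); it is illustrative only.

```python
N_CARS = 4_335_182             # registered cars in Sweden (Table 2.1)
D_AVG_M = 1.291e7              # average distance driven per car and year [m] (Table 2.1)
V_AVG_MS = 60 / 3.6            # assumed average speed: 60 km/h in m/s
BATCH_SIZE_B = 9 * 2 * 4 + 8   # 9 sensors x 2 dimensions x 4 B floats + 8 B timestamp

t_avg = D_AVG_M / V_AVG_MS                    # seconds driven per car and year
batches_per_car = t_avg                       # one batch per second of driving
total_batches = batches_per_car * N_CARS      # ~3.36e12 batches/year (Eq. 2.1)
data_per_car_mb = batches_per_car * BATCH_SIZE_B / 2**20   # ~59 MB/car (Eq. 2.2)
data_total_tb = total_batches * BATCH_SIZE_B / 2**40       # ~244 TB in total (Eq. 2.3)

print(f"{total_batches:.3g} batches, {data_per_car_mb:.1f} MB/car, {data_total_tb:.1f} TB total")
```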
To put this into perspective, one of the largest databases today, AT&T's call records, contains 323 TB of information and 1.9 · 10^12 rows that have been gathered over several decades^3. Our system will need to store almost this much raw data every year, and this is assuming all values have low precision and can be stored using floats. In addition, the database will need to add some overhead for its internal data structures. The sensor batches may also need to be indexed to improve the time it takes to retrieve data, and these indices need to be stored as well. Neither of these is included in the size in Eq. 2.3. The total size of the database will therefore be even bigger than this.
2.2 Data transfer at gas stations
In the previous section we derived an approximation of the amount of data that the sensors in a single car will generate during an entire year. We will now determine how much data is generated between each refueling and approximate how long it will take to transfer this data to the backend-servers.
The average number of refuelings per car and year is given by

N_refuel = d_avg · G_cons / V_tank = 17.86.    (2.4)

Assuming that the batches are equally spread out over all gas refills, the amount of data that has to be transferred when a car is at a gas station can be computed by dividing the data generated per year for a car in Eq. 2.2 by the number of refuelings per year:

D_refuel = D_car per year / N_refuel = 3.3091 MB.
3 Compare Business Products, "10 largest databases in the world." [online] http://www.comparebusinessproducts.com/fyi/10-largest-databases-in-the-world, Mar. 2010. [Accessed 2012-10-19].
Assuming that the WiFi hotspot has a bandwidth of 10 Mbit/s = 1.25 MB/s and is serving an average of 10 users at the same time, giving a throughput of 0.125 MB/s per user, the time it takes to transfer this data is

t_upload = D_refuel / 0.125 MB/s = 26.47 s.

To get an approximate upper bound on the maximum upload time of a single car, we assume a gas consumption of 0.03 L/km, a tank size of 90 L and an average speed of 25 km/h. Performing the same analysis, the time it takes to perform this upload is t_upload,max = 264 s = 4 min 24 s.
2.3 The required insertion rate
Using the total number of batches generated per year in Eq. 2.1 and the total number of refuelings per year in Eq. 2.4, the number of batches generated by all cars combined between each refueling is

N_batches,refuel = 3.36 · 10^12 / 17.8588 = 188142540372.25 batches.

If these can be handled before it is time for the next refueling, i.e. within

t_between = 1 year / 17.8588 = 1765852.12 s,

the system should be able to manage all cars in Sweden. The number of batches that the system has to be able to handle per second (the insertion rate) is then

N_batches,refuel / t_between = 106544.90 batches/s.
Thus, the storage solution must be quite fast at inserting data.
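The same numbers can be carried through Sections 2.2-2.3 in a few lines. The sketch below recomputes the refueling frequency (Eq. 2.4), the data uploaded per refueling, the upload time at the assumed 0.125 MB/s per-user throughput, and the resulting insertion rate; all constants are the assumptions stated above.

```python
D_AVG_M = 1.291e7                    # average distance per car and year [m] (Table 2.1)
G_CONS = 8.3e-8                      # average gas consumption [m^3 per m] (Table 2.1)
V_TANK = 0.06                        # assumed tank volume [m^3] (60 L)
DATA_PER_CAR_B = 774_600 * 80        # yearly data per car [bytes] (Eq. 2.2)
TOTAL_BATCHES = 774_600 * 4_335_182  # yearly batches for all cars (Eq. 2.1)
THROUGHPUT_B_S = 0.125 * 2**20       # 0.125 MB/s per user, as in the text [bytes/s]
YEAR_S = 365 * 24 * 3600

n_refuel = D_AVG_M * G_CONS / V_TANK             # ~17.86 refuelings/year (Eq. 2.4)
data_per_refuel = DATA_PER_CAR_B / n_refuel      # ~3.3 MB uploaded per refueling
t_upload = data_per_refuel / THROUGHPUT_B_S      # ~26 s at the hotspot
batches_per_refuel = TOTAL_BATCHES / n_refuel
t_between = YEAR_S / n_refuel                    # seconds between refuelings
insert_rate = batches_per_refuel / t_between     # ~106 500 batches/s overall

print(f"{n_refuel:.2f} refuels/yr, {t_upload:.1f} s upload, {insert_rate:.0f} batches/s")
```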
CHAPTER 3 – Databases and database management systems
A database is an organized collection of data (and its data structures), and a database management system is a collection of software that controls the database. In this report we will use the term database when we talk about the database and its software. What defines a database is the data model, the way the data is structured, and the operations that can be executed to modify it.
Among popular data models one finds the relational model and the object model.
The relational model is a highly flexible and easy to understand data model. The data is organized as two-dimensional arrays, called tables, with a set of defined columns.
Each row must contain a value for each column. Data can be associated between multiple tables with a set of one or more columns common to each table, which is called a key. [23]
The object model arose from the object-oriented programming paradigm as an attempt to bring databases and applications closer to each other. The object model adopts the patterns of object-oriented programming, such as encapsulation, and tries to remove the overhead that appears when data is transformed between the representation in an application and the database representation.
3.1 Schemas
A database schema is the structure of a database described in a formal language, which in the case of a relational database defines what tables exist and their properties.
3.2 Queries
Queries are performed using a query language, a computer language that is used to access or alter the data in a database. A query can generally be seen as either asking the database a question and expecting a result, called a read, or asking the database to insert something, called a write. A group of queries can be executed uniformly as a transaction.
3.3 The ACID contract
The ACID contract [24] is a set of properties that, if fulfilled, guarantee correct transactions.
3.3.1 Atomicity
Atomicity demands that no transaction partly fails. Either all operations succeed, or the entire transaction fails and the database is left in an unchanged state. Atomicity must be guaranteed in every situation, including crashes and power failures.
3.3.2 Consistency
Consistency ensures that any database operation will leave the database in a valid state. No data can be written to the database that does not fulfill the defined rules.
3.3.3 Isolation
Isolation guarantees that when multiple users access the database concurrently, the actions of any user should be unaffected and uninterrupted by those of any other user.
In other words, concurrent actions must maintain the consistency of the database.
3.3.4 Durability
Durability means that once a transaction has been performed, its effects will persist even in the event of errors.
Many databases achieve the ACID properties by implementing locking, which means that the database marks all the data a transaction will access and prevents other transactions from accessing it until the transaction has either failed or succeeded. [25]
3.4 Indexing
Indexing is a technique used to improve the performance of database queries. The idea of indices is to reduce the need to examine each entry in the database when running a query. They are not a part of the dataset stored in the database, and thus they do not affect the result of a query. The same data can have multiple indices that improve the time needed to find the data based on different parameters. As an example, a data entry for a user may be indexed by the name of the user and the country of the user.
Even though indices can increase the query speed, they consume space and require an
update each time the indexed data is mutated (which can slow down the maintenance
speed).
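As a minimal, database-agnostic illustration of the idea, the Python sketch below keeps a secondary index as a separate lookup structure that maps an indexed attribute to the keys of matching entries; the data and attribute names are made up.

```python
# Primary store: key -> record (the actual data set).
users = {
    1: {"name": "Alice", "country": "SE"},
    2: {"name": "Bob",   "country": "NO"},
    3: {"name": "Carol", "country": "SE"},
}

# Secondary index on "country": attribute value -> set of primary keys.
# It stores no user data itself and must be updated whenever a user is mutated.
country_index = {}
for key, record in users.items():
    country_index.setdefault(record["country"], set()).add(key)

# A query by country now touches only the matching keys instead of every record.
swedish_users = [users[k] for k in country_index.get("SE", set())]
print(swedish_users)
```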
3.5 Redundancy (replication)
Some databases utilize replication of database data to prevent data loss and downtime in case of system or hardware failures. Redundancy can also improve the performance of the database when multiple queries are made in parallel to the same data. Data replication introduces the need for synchronization when replicated data gets updated, to maintain consistency.
3.6 Distribution
A distributed database is a database that stores data on multiple nodes that do not share a common processing unit. The nodes may be separated by location and communicate over a network, or reside next to each other.
The idea of a distributed database is to increase the performance and capacity. Users can interact with the database as if it was one logical system, but if multiple users query for data located on different nodes the nodes can serve the data in parallel and thus increase the performance.
3.6.1 Clusters
A cluster is a group of nodes that acts as one database towards the users. A user may connect to one of the nodes in the cluster and start asking for data as if it was connected to all nodes in the cluster at once.
3.7 CAP theorem
The CAP theorem states that no distributed computing system can fulfill all three of the following properties at the same time [26]:
• Consistency — all nodes see the same data at the same time.
• Availability — a guarantee that every request receives a response about whether it failed or succeeded.
• Partition tolerance — the system continues to operate despite arbitrary message loss or failure of part of the system.
3.8 Partitioning and sharding
A partition of a database is a division of the database into independent parts. Regular partitioning can be performed in two ways: horizontal and vertical. Horizontal partitioning means splitting up a table (or the equivalent) so that rows that belong to the same table actually are stored in different tables (i.e. different partitions). One example of horizontal partitioning would be to store rows that correspond to users based on what area the user lives in. All users with zipcode < 50000 are stored in the table UsersBelow50000 and the rest are stored in the table UsersOver50000.
Vertical partitioning uses the same reasoning, but instead of dividing up the rows in different partitions, it stores the columns of a table in different partitions. A similar example as to horizontal partitioning would be to store different columns of a user- row in different tables. The table UserFirstName would contain a row with a unique identifier and the first name of a user, and the table UserSurname would contain a row with the same unique identifier and the user’s surname.
Sharding is a special architecture that uses the same approach as horizontal partitioning, but stores the rows in a single database schema instead of using multiple tables.
One advantage that partitioning provides is that it may reduce the index size, given that there is some way to identify in which table a row will be found without searching the index, which would be possible in the example of horizontal partitioning described earlier.
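A minimal sketch of the horizontal-partitioning example above: rows are routed to one of the two tables purely from the zipcode, so a lookup never needs to touch the other partition. The table and function names are ours, chosen to mirror the example.

```python
# Two horizontal partitions of the same logical "users" table.
users_below_50000 = {}   # rows with zipcode < 50000
users_over_50000 = {}    # all remaining rows

def partition_for(zipcode):
    """Pick the partition from the zipcode alone, without consulting any index."""
    return users_below_50000 if zipcode < 50000 else users_over_50000

def insert_user(user_id, name, zipcode):
    partition_for(zipcode)[user_id] = {"name": name, "zipcode": zipcode}

def get_user(user_id, zipcode):
    return partition_for(zipcode).get(user_id)

insert_user(1, "Alice", 41000)
insert_user(2, "Bob", 97200)
print(get_user(2, 97200))  # found directly in users_over_50000
```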
3.9 Caching
As with caching in general, the idea is to reduce the need to do an actual lookup by storing some data in a layer above the actual database. Caching can be a cheap way to increase the performance by distributing the workload of queries to cheaper systems.
However, if all the data in the database is accessed equally often, the cache would have to be as large and efficient as the database itself.
3.10 NoSQL
NoSQL ("Not only SQL") databases do not use the common relational model. Typically associated with NoSQL databases are the following properties:
• It does not use SQL as its query language.
• It may not give full ACID guarantees.
• It has a distributed, fault-tolerant architecture.
NoSQL databases are often extremely optimized for retrieval and write operations, and seldom offer any functionality other than storing records (e.g. key-value stores).
The lost flexibility is made up for by gains in scalability and performance. These
systems are helpful when working with extremely large data sets that do not require a
relational model for the data structure, e.g. for statistical analysis. [4]
3.11 Existing distributed databases
Many database implementations exist that can handle a large amount of data, where the most notable are AT&T's Daytona^4, Google's BigTable [27], HBase^5, MongoDB^6 and Cassandra^7. The last three are open-source projects, where HBase and Cassandra are maintained by Apache. There are many other solutions that can be scaled up using techniques described earlier, such as sharding, but the three alternatives that are free to use and designed to solve the problem of scaling up are MongoDB, HBase and Cassandra.
4 AT&T, "Daytona." [online] http://www2.research.att.com/~daytona/. [Accessed 2012-10-19].
5 The Apache Software Foundation, "HBase." [online] http://hbase.apache.org/. [Accessed 2012-10-19].
6 10Gen, "MongoDB." [online] http://www.mongodb.org/. [Accessed 2012-10-19].
7 The Apache Software Foundation, "The Apache Cassandra project." [online] http://cassandra.apache.org/. [Accessed 2012-10-19].
CHAPTER 4 – Choosing a database
According to the data load analysis (see Chapter 2) the system needs to store approximately 240 TB of raw data per year, and the data must be inserted at a rate of approximately 110 000 batches per second including indices. This requires a database that is highly scalable and can perform fast writes.
To be able to handle the scalability requirement, a distributed database is needed.
According to the CAP theorem (see Section 3.7), no distributed database can support more than two out of three of consistency, availability and partition tolerance. For this system, availability and partition tolerance are favoured over consistency. A system in a production environment that is delivered as a service to customers can not afford substantial downtime. This requires that the database is reliable. It is better to get some data eventually than to lose data or have the system go down. In large scale distributed systems the probability of nodes failing is high. The system needs to be robust against such failures. Having the system go down in such cases is not acceptable.
It is also necessary to know if a request was successful or not to ensure that data is not lost and so that it can be discarded from the back-end servers. Eventual consistency is acceptable for several reasons:
• Analysis of the data is not performed instantly, giving the data time to propagate.
• The analysis does not necessarily require all values.
• New values are produced constantly and could be inserted at any time which means that the result of a computation constantly changes.
• Values are immutable which means that there is no issue with updated values not yet being propagated and having an incorrect value. All values that can be retrieved are correct.
Storing sensor data means storing a lot of records. As each record is very small, the database will for a given size contain an uncommonly large amount of records. We need a database that can handle large amounts of records well and maintains good performance as the database size grows.
Scalability
  System requirement: High scalability (≈ 240 TB/year, 3.36 · 10^12 batches, 110 000 batches/s).
  Cassandra: Highly scalable (linear scalability). Also has compression.
  HBase: Highly scalable (linear scalability).

Write performance
  System requirement: Fast writes (110 000 batches/s).
  Cassandra: Nearly unmatched write performance, considered the fastest there is.
  HBase: Approximately half the write performance of Cassandra according to [7].

Read performance
  System requirement: Since the system is not designed for real-time analysis, the read performance is not as important as the write performance. However, the faster the better.
  Cassandra: Fast read performance. It is not the fastest, but can compete against some of the faster solutions.
  HBase: Approximately twice as fast as Cassandra.

Data model
  System requirement: Sensor data is simple, immutable and non-relational. Nothing more than a simple data model is required.
  Cassandra: Key-value store with a simple data model (based on BigTable's data model).
  HBase: Key-value store with a simple data model (based on BigTable's data model).

Transactions
  System requirement: Not necessary. The data is never mutated. Thus, a database that fulfills the ACID contract is not needed.
  Cassandra: Not supported.
  HBase: No, but has support for atomic updates of a single row (can insert many columns into one row atomically).

One-dimensional range queries
  System requirement: Yes, it is the most common type of sensor query.
  Cassandra: Yes, one-dimensional range queries are supported by default.
  HBase: Can perform range gets natively on row keys, but not on column keys. Achievable with Map/Reduce jobs.

Spatial queries
  System requirement: Yes, needs to be able to perform regional queries to get all batches in a region.
  Cassandra: Not by default, but data can be arranged to support spatial queries [28].
  HBase: Not by default, but data can be arranged to support spatial queries [19].

Consistency
  System requirement: Desired, but not necessary. Eventual consistency is OK. Partition tolerance (reliability) and availability are favoured (CAP).
  Cassandra: Eventual consistency, with tunable tradeoffs between consistency and latency.
  HBase: Yes.

Reliability
  System requirement: The system should be reliable and robust against failure. Losing data or having the system go down is not acceptable.
  Cassandra: Cassandra is reliable. Replication across data centers, partition tolerance and single-server durability are supported.
  HBase: Yes.

Availability
  System requirement: Yes, it is necessary to know if a request succeeded.
  Cassandra: Yes, high availability.
  HBase: Low availability.

Table 4.1: The correspondence between properties required by the system and those supported by Cassandra and HBase.
When it comes to retrieving sensor data, the most common type of query is the range query. It should be possible to retrieve all batches where the speed is greater than 120 km/h, where the temperature is between -5° and 5° Celsius, that are located in a certain region, and so on. These are either one-dimensional range queries or two-dimensional regional queries. Combining several such requirements becomes a multi-dimensional range query. The system requires that the database can perform such queries fast enough to allow for non-real-time processing.
The sensor data that needs to be stored is immutable. This eliminates the need for transactions to keep values consistent when updating. Thus, an ACID compliant database is not needed. The sensor data is also simple, i.e. there are no complex relations or inheritances between different entities, so a relational or object-oriented database is not warranted either.
Given the requirements three potential candidates were identified: Cassandra, HBase and MongoDB. However, while looking into MongoDB it was identified that it has poor scalability in comparison to Cassandra and HBase — and in general as well [29]. It also favours consistency over availability [3]. Thus, MongoDB is not a good candidate for the system. This left us with two alternatives, Cassandra and HBase.
HBase and Cassandra both fulfill most of the system requirements; a summary of this can be seen in Table 4.1. However, Cassandra is superior to HBase in two important aspects: HBase favours consistency over availability [3] and has worse write performance [7]. We will therefore not go any deeper into HBase.
Cassandra [30][31] is a fully distributed key-value store with a simple data model. In accordance with the CAP theorem and the system requirements, Cassandra supports availability and partition tolerance but is instead eventually consistent. The level of consistency is tunable and exchanged for increased latency on a per-request basis. Cassandra also supports replication across multiple data centers and single-server durability for reliability.
Cassandra is known for its fast writes, and several studies have been made to support this claim. [7] performs a benchmark comparison between Cassandra (0.6.0-beta3) and HBase (0.20.3). The results show that Cassandra has better write performance (2.1 times as fast), although HBase has much faster reads (8 times). However, in [32] they show that the read performance of Cassandra has since then been increased by 400%.
The performance difference should therefore have been reduced to approximately 2x faster reads for HBase in the latest versions. Cassandra’s write performance has also been improved by ≈ 40% since the comparison in [7] according to [32].
To relate Cassandra to the relational paradigm, [6] performs a comparison of multiple databases including Cassandra and a cluster-based variant of the world's most used database, MySQL^9, where they show that a Cassandra cluster scales better and has better read and write throughput than an equivalent MySQL cluster. For instance, the Cassandra cluster reaches a throughput of approximately 200 000 writes/s with a write-heavy workload using 12 nodes, which is twice as much as the MySQL equivalent.
In a benchmarking study made by Netflix for Cassandra 0.8.6 [33], they show that a single node using ordinary hardware (Amazon EC2 instances with 15 GB RAM, 4 cores, 1 Gbit network and 4 internal disks of 1.6 TB total) can handle approximately 10 000 writes/s. They also show that the number of writes/s scales linearly with the number of nodes in the cluster, assuming the load is distributed equally over all nodes using Cassandra's random partitioner (see Chapter 5). Using 288 Cassandra instances, a replication factor of three and 60 EC2 instances to put load on the cluster, they achieve a performance of 1.1 million writes/s, where each write is 400 bytes. The load is constant per node as the cluster grows. This means that having n nodes will increase the performance n times. Cassandra is thus very scalable.

9 MySQL, "MySQL :: The world's most popular open source database." [online] http://www.mysql.com/. [Accessed 2012-10-19].
When it comes to reads, Cassandra supports range queries in one dimension by default. [20] has shown that it is possible to extend Cassandra to support multidimensional range queries as well. Cassandra does not support two-dimensional spatial queries by default, but it is possible to store the data in a way that enables such queries to be performed. [28] has come up with a solution for performing spatial queries efficiently by storing a BinTree on top of Cassandra. They claim that the solution works well, but do not provide any performance evaluation or details on their implementation. A further investigation of the solution and its performance is therefore needed to determine if it is efficient enough for this system. The same approach has also been applied to HBase in [19], where it is shown to work well. This is explained in more detail in Chapter 6.
To summarize this chapter, Cassandra seems to be the best available storage solution for the system. However, some things remain to be investigated:
• How to and if sensor data can be stored and indexed in Cassandra to allow for one-dimensional range queries to be performed efficiently.
• The performance (and implementation details) of the BinTree overlay for two- dimensional regional queries.
• If the overhead required for storing the values in Cassandra is small enough that the solution is feasible.
• How the system should be dimensioned according to the capacity needed, and whether this is feasible in practice.
The rest of the report will focus on investigating these things.
CHAPTER 5 – Introduction to Cassandra
Cassandra is a highly scalable, eventually consistent, fully distributed key-value store that is optimized for fast write throughput [34]. It is distributed under the Apache License v. 2.0^10 and is thus free to use. This chapter describes the most important characteristics of Cassandra required to understand our approaches and reasoning in the following chapters.
5.1 Data model
The Cassandra data model^11, as illustrated in Figure 5.1, is based on Google's BigTable [27] and consists of keyspaces, column families (CFs), rows, columns and super columns.
A keyspace is simply a namespace for column families. Typically a single keyspace is used for an application. The keyspace is analogous to the database in a relational database management system.
Column families are containers for rows. Each row is associated with a row key that is used to access it. The key (translated to a token by the partitioner) also determines which node in the cluster the row is stored on, as described in Section 5.2. The rows are stored in token order, making it possible to perform slice queries on rows if an order preserving partitioner is used. Each column family is stored in a separate file on disk.
Rows are containers for columns. Each row can contain an arbitrary number of columns. However, there is a limit of 2 billion columns per row^12. A column consists of a key/value pair and a timestamp. The columns are sorted by their key, making it possible to perform column slice queries, see Section 5.4. All columns associated with the same row are stored on the same node in the cluster. Super columns are simply columns that contain columns.
10 The Apache Software Foundation, "Apache License, Version 2.0." [online] http://www.apache.org/licenses/LICENSE-2.0.html, Jan. 2004. [Accessed 2012-10-19].
11 "DataModel." [online] http://wiki.apache.org/cassandra/DataModel/, Sep. 2012. [Accessed 2012-10-19].
12 "CassandraLimitations." [online] http://wiki.apache.org/cassandra/CassandraLimitations, May 2012. [Accessed 2012-10-19].
Figure 5.1: The Cassandra data model. The keyspace is a container for column families.
Each column family contains a number of rows that are mapped to a set of columns.
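One rough way to picture this hierarchy (ignoring super columns and the on-disk layout) is as nested sorted maps. The Python sketch below is only a mental model of the data model described above, not how Cassandra stores data internally; the keyspace, column family and row key names are invented.

```python
from collections import OrderedDict

# Mental model only: keyspace -> column family -> row key -> sorted columns.
# Names ("VehicleData", "Batches", "device42") are made up for illustration.
cluster = {
    "VehicleData": {                      # keyspace (namespace for column families)
        "Batches": {                      # column family
            "device42": OrderedDict([     # row key; all its columns sit on one node
                (1355011200, ("batch-blob-0", 1355011200123)),  # column key -> (value, timestamp)
                (1355011201, ("batch-blob-1", 1355011201123)),
            ]),
        },
    },
}

# Columns are kept sorted by their key, which is what makes column slice
# queries (e.g. all columns between two timestamps) possible.
row = cluster["VehicleData"]["Batches"]["device42"]
print([col for col in row if 1355011200 <= col <= 1355011201])
```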
5.2 Cassandra architecture
Cassandra uses a ring architecture based on Amazon’s Dynamo [35] to distribute rows over the set of nodes (together called the cluster), see Figure 5.2. The ring represents a cyclic range of token values, referred to as the token space.
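As a rough illustration of the ring, the sketch below hashes a row key to a token (MD5, similar in spirit to Cassandra's random partitioner) and walks clockwise to the first node whose token is greater than or equal to it. The node tokens and the row key are made up; this is not Cassandra's actual implementation.

```python
import hashlib
from bisect import bisect_left

TOKEN_SPACE = 2 ** 127  # the cyclic token space; tokens lie in [0, 2^127)

# Hypothetical nodes, each responsible for the arc of the ring ending at its token.
NODE_TOKENS = sorted([
    (int(0.10 * TOKEN_SPACE), "node1"),
    (int(0.45 * TOKEN_SPACE), "node2"),
    (int(0.80 * TOKEN_SPACE), "node3"),
])
TOKENS = [t for t, _ in NODE_TOKENS]

def token(row_key: str) -> int:
    """Map a row key to a token by hashing it (MD5), similar to a random partitioner."""
    return int(hashlib.md5(row_key.encode()).hexdigest(), 16) % TOKEN_SPACE

def node_for(row_key: str) -> str:
    """Walk clockwise from the key's token to the first node token, wrapping around."""
    idx = bisect_left(TOKENS, token(row_key))
    return NODE_TOKENS[idx % len(NODE_TOKENS)][1]

print(node_for("device42:1355011200"))  # which node stores this row
```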