MASTER'S THESIS
Scalable Storage of Sensor Data for a High Frequency Measurement Vehicular
Monitoring System
Erik Eliasson, Daniel Öhrlund
Master of Science in Engineering Technology, Computer Science and Engineering
Luleå University of Technology
Department of Computer Science, Electrical and Space Engineering
December 9, 2012
ABSTRACT
This thesis investigates how to store sensor data for a large scale vehicular monitoring system with high frequency measurements. The system is intended to measure multiple sensors at a rate of one measurement per second and allow for non-real-time analysis of the data. The goal is that the system should be able to support all cars in Sweden, which means that 240 TB of raw sensor data must be stored every year. To keep up with the new data generated by the cars, sensor batches (a batch is a collection of sensor measurements captured at the same time by a single car) must also be inserted at a rate of at least 110 000 per second.
Cassandra is identified to be the database that best suits the system requirements.
It is then investigated how to store and index the sensor data in Cassandra to be able to perform one- and two-dimensional range queries. The data models focus on how to store the data so that it is equally distributed over the cluster and is able to scale up for large numbers of records.
An empirical study is performed that evaluates the performance of insertion, retrieval and storage for the different data models. The result is then used to determine how many servers would be required to meet the system requirements. The study shows that it is feasible to store the data in Cassandra. Fewer than 21 servers are required to reach the insertion rate, and the data can be stored using between 360 and 1260 servers (with 2 TB of storage each) per year with three replicas of the data, depending on the indexing approach.
Keywords: databases, scalable data storage, Cassandra, big data, sensor data, vehicular monitoring system, vehicle monitoring
PREFACE
This thesis was carried out during the spring semester of 2012 at Tieto AB in Luleå.
They wanted to investigate how to create a generic system to support pay as you drive and other similar applications that use sensor values from vehicles to generate statistical data. However, the scope of the initial project was deemed too broad. During the analysis of the problem we realised that one of the bigger challenges would be to handle the huge amount of sensor data that the system needs to store. The purpose then changed to finding a scalable storage solution for the system, which is the focus of this report.
We would like to thank Tieto AB Luleå, our supervisors Evgeny Osipov (LTU), Olov Schelén (LTU) and Robin Mellberg (Tieto), and Jim Sundqvist (Tieto) for coming up with the original idea for the thesis and providing guidance along the way. We would also like to thank our families and friends for their support and understanding.
Erik Eliasson, Daniel Öhrlund
CONTENTS
Chapter 1 – Introduction 1
1.1 System Specification . . . . 1
1.2 Assumptions, delimitations and purpose . . . . 3
1.3 Structure of the report . . . . 4
Chapter 2 – Data load analysis 7
2.1 The amount of data generated by vehicles . . . . 7
2.2 Data transfer at gas stations . . . . 8
2.3 The required insertion rate . . . . 9
Chapter 3 – Databases and database management systems 11
3.1 Schemas . . . . 11
3.2 Queries . . . . 11
3.3 The ACID contract . . . . 12
3.3.1 Atomicity . . . . 12
3.3.2 Consistency . . . . 12
3.3.3 Isolation . . . . 12
3.3.4 Durability . . . . 12
3.4 Indexing . . . . 12
3.5 Redundancy (replication) . . . . 13
3.6 Distribution . . . . 13
3.6.1 Clusters . . . . 13
3.7 CAP theorem . . . . 13
3.8 Partitioning and sharding . . . . 13
3.9 Caching . . . . 14
3.10 NoSQL . . . . 14
3.11 Existing distributed databases . . . . 15
Chapter 4 – Choosing a database 17
Chapter 5 – Introduction to Cassandra 21
5.1 Data model . . . . 21
5.2 Cassandra architecture . . . . 22
5.3 Communication . . . . 23
5.4 Queries . . . . 24
5.5 Mutating data . . . . 24
5.6 Replication & Consistency level . . . . 25
5.7 Composites . . . . 25
5.8 Compression . . . . 25
5.9 Secondary indices . . . . 26
5.10 Spatial indexing in Cassandra . . . . 26
5.10.1 Using a Z-curve . . . . 26
5.10.2 BinTree . . . . 28
5.10.3 Why a BinTree is a good fit for Cassandra . . . . 30
Chapter 6 – Related work 31
6.1 Discussion . . . . 32
Chapter 7 – Data models for storing and indexing sensor data 35
7.1 Storing sensor batches in Cassandra . . . . 35
7.1.1 Alternative key generation methods . . . . 36
7.1.2 Alternative model . . . . 36
7.2 Value-embedded index vs Reference-embedded index . . . . 37
7.3 Data models for indexed range queries in Cassandra . . . . 37
7.3.1 Approach 1 - Many rows . . . . 37
7.3.2 Approach 2 - Wide rows . . . . 39
7.3.3 Approach 3 - BinTree overlay . . . . 40
7.3.4 Pagination of query results . . . . 46
7.3.5 Multidimensional queries . . . . 47
Chapter 8 – Empirical study 49
8.1 Cassandra configuration . . . . 50
8.2 Tests . . . . 51
8.2.1 Inserting unindexed batches . . . . 53
8.2.2 Retrieving unindexed batches . . . . 53
8.2.3 Determining the best split threshold for retrievals . . . . 53
8.2.4 Determining the best split threshold for insertions . . . . 53
8.2.5 Insertion rate for the indexing approaches . . . . 53
8.2.6 Retrieval rate for the indexing approaches as the retrieval size grows . . . . 54
8.2.7 Retrieval rate for the indexing approaches as the number of stored indices grows . . . . 54
8.2.8 Storage . . . . 54
8.3 Data structures . . . . 54
8.4 Time measurements . . . . 54
Chapter 9 – Results and Evaluation 57
9.1 Inserting unindexed batches . . . . 57
9.2 Retrieving unindexed batches . . . . 58
9.3 Determining the best split threshold for retrievals . . . . 59
9.4 Determining the best split threshold for insertions . . . . 60
9.5 Insertion rate for the indexing approaches . . . . 62
9.6 Retrieval rate for the indexing approaches as the retrieval size grows . . 63
9.7 Retrieval rate for the indexing approaches as the number of stored indices grows . . . . 64
9.8 Storage . . . . 65
9.8.1 Overhead . . . . 66
Chapter 10 – Hardware estimation for a full-scale system 69
10.1 Insertion rate . . . . 69
10.2 Storage . . . . 71
10.3 Retrieval times . . . . 72
10.4 Putting it all together . . . . 74
Chapter 11 – Discussion 75
Chapter 12 – Conclusion 79
Appendix A – Complexities of a balanced BinTree from equally distributed points in the plane 81
A.1 Height of a balanced BinTree . . . . 81
A.2 Creation complexity . . . . 82
A.2.1 Alternative amortized analysis of the number of operations . . . . 83
A.2.2 Traversals . . . . 83
A.3 Simulation verifying the result . . . . 83
Appendix B – Estimating the effect of multiple rows vs one row for approach 2 87
Appendix C – String constants in tests 89
Appendix D – Test results 91
Appendix E – Approximating approach 3 traversal impact in practice 93
Appendix F – Retrieval time for indexed range queries over network 95
CHAPTER 1 – Introduction
Having access to logged sensor data from every vehicle in a country enables very rich and powerful applications that do not exist today. An example of such an application would be to allow the highway department to identify dangerous sites based on how many hard decelerations occur in the area. Another example could be a pay as you drive insurance solution (PAYD), where everyone could get an individual insurance fee based on their own driving performance. By providing access to the logged data in a generic way, it becomes possible to perform other types of analysis on the data as well, and it enables other, currently unthought-of, applications to be created. The question is how to store the data for such a system.
The systems that exist today that monitor vehicles are mainly fleet tracking systems and pay as you drive insurance solutions. Fleet tracking systems gather sensor data at low frequencies, about once every five minutes, and upload the data either by manually extracting it from the vehicles or by using cellular networks. PAYD solutions aggregate the data extensively and only upload predefined evaluation metrics. Common to these systems is that neither logs nor stores sensor data at high frequency [1].
In order to provide generic access to sensor data from all vehicles in Sweden, a system that collects, stores and makes it possible to retrieve the data in a generic way is to be created.
1.1 System Specification
The system is used to gather and store one- and two-dimensional sensor data from vehicles. The purpose of the system is to be able to support non-real-time analysis of the gathered data for various applications that can access the data through an API. It should be possible to retrieve data by specifying a range of interest, such as all values in a region or all values with a speed greater than 120 km/h.
As can be seen in Figure 1.1 the system consists of:
• client devices (the hardware mounted inside the vehicles),
• servers that the client devices and the applications communicate with (these will be referred to as backend-servers from here on)
Figure 1.1: An architectural overview of the system and its components.
• and the data storage solution.
Each client device is identified by a unique identifier that we refer to as the deviceID.
The client devices collect sensor data when the vehicles are driven, and as soon as they are connected to the internet they start uploading the logged data to the backend-servers. The backend-servers are responsible for storing the data in an efficient manner in the database, as well as retrieving the correct data from the database upon requests from applications.
According to the specification of requirements, every vehicle will log data from each sensor once a second. The collection of these sensor values combined with the deviceID is referred to as a batch. The system shall support one- and two-dimensional sensor values, and be capable of having every registered car in Sweden connected to the system.
To make sure that all client devices will be connected regularly, an initial idea is to place hotspots at gas stations, as every car has to pass one at some point in time. However, most client devices will be connected much more often, since hotspots are common in family homes, at offices, in stores, etc.
Since the sensor values arrive in groups when a car arrives at a hotspot, the type of operation required by the data store is a grouped insert. The same goes for retrievals: as the system is used for non-real-time analysis, values are retrieved in groups. Both one- and two-dimensional range queries should be supported.
Sensor data consists of measurements of a physical quantity, usually accompanied by a timestamp of when the measurement was performed. A pair of a sensor value and a timestamp represents the state of a sensor at a specified point in time. That point in time will never occur again. Thus, a pair of a sensor value and a timestamp is immutable. This means that to be able to represent the state of a sensor over time, a lot of data points must be recorded.
The values generated by a sensor take on a bounded set of values determined by the properties of the physical quantity measured and the precision of the sensor hardware. The precision (number of decimals) and range of values determine the data type needed. If the value has a precision lower than 7 decimal digits and a magnitude less than 2^24, it can be stored using a float, which is only 4 bytes. If the precision is higher (up to ≈16 decimals) or the value is bigger, a double has to be used, which is stored using 8 bytes. If the higher accuracy is not needed, the storage space can be reduced by using floats instead.
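As a minimal, self-contained illustration of this size/precision trade-off (not part of the original system), the following Python snippet packs a hypothetical high-precision reading as a 4-byte float and as an 8-byte double and shows how much precision survives:

```python
import struct

value = 12.3456789012345  # hypothetical high-precision sensor reading

as_float = struct.pack('>f', value)   # 4 bytes, ~7 significant decimal digits
as_double = struct.pack('>d', value)  # 8 bytes, ~15-16 significant decimal digits

print(len(as_float), struct.unpack('>f', as_float)[0])    # 4, value rounded to float precision
print(len(as_double), struct.unpack('>d', as_double)[0])  # 8, value preserved
```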
Most sensors measure in one dimension, but there are some, such as GPS, that measure in two dimensions. As of today, no sensor imaginable for this type of system measures values in more than two dimensions. Sensors likely to be used for this type of system include those measuring:
• acceleration,
• speed,
• temperature,
• pressure,
• and position (GPS).
1.2 Assumptions, delimitations and purpose
In this thesis we will focus on finding a data storage solution for this system that is capable of storing a huge amount of sensor data, and on identifying how to structure the data so that it can be both inserted and retrieved in a fast manner. The specification of requirements states that every vehicle in the system will gather data from its sensors at an interval of one data point per second. This means that if the system shall be capable of handling every registered vehicle in Sweden, it has to store about 10^12 data points at a total size of approximately 240 TB each year, which is more than the total size of most databases today^1. During full load the system must also handle the insertion of more than 100 000 data points per second.
The system is intended to work as a platform that provides access to the data for external third-party applications. The applications that will be built on top of the platform are not known in advance. It is therefore necessary to be able to provide the data in a general way that can be used for any kind of application that might want to use it. The employer therefore wants to store the data without any aggregation.
Because of the amount of time dedicated to this thesis, we will not be able to implement our own data store and will therefore use an existing database management system (dbms). As our budget for the thesis is practically non-existent, the dbms and all other software included in our solution must be free to use. We assume that there already exist good load-balancing solutions that can be used for the back-end servers, and we also presuppose that data can be accurately transmitted from a vehicle to the back-end servers using hotspots.

1 Compare Business Products, "10 largest databases in the world." [online] http://www.comparebusinessproducts.com/fyi/10-largest-databases-in-the-world, Mar. 2010. [Accessed 2012-10-19].
The purpose of this thesis has therefore been to identify the dbms most suitable for the system and evaluate how to store the data so that it:
• scales well and can handle these amounts of sensors data,
• supports one- and two-dimensional range queries and
• has fast insertion and retrieval speeds.
There are several published papers related to this area and our purpose. Considerations in the architecture and approach for storing big data are evaluated in [2]. [3], [4], [5] and [6] perform surveys of data stores capable of storing a lot of data and scaling up to support big data applications. [6] and [7] evaluate the performance and scalability of the most common NoSQL databases. [8] investigates the performance of storing sensor data in both SQL and NoSQL databases, and the difference between using physical and virtual servers. The performance of one-dimensional range queries in various key-value stores and distributed ordered tables^2 is investigated in [9]. [10], [11], [12] and [13] examine the most common methods for two-dimensional spatial indexing, such as kd-trees, quadtrees and R-trees. Variations of these for both spatial and multidimensional queries are presented in [14] and [15]. [16] shows a spatial indexing structure based on the B+-tree and R-tree, but using a grid structure instead of a tree to access the leaf nodes. [17] and [18] examine how to map multidimensional data to one dimension using space-filling curves such as the Z-order curve and the Hilbert curve. [19] investigates how to perform spatial queries in key-value stores, and [20] and [21] present methods for performing multidimensional range queries in distributed ordered tables.
1.3 Structure of the report
The rest of this report is structured as follows: In Chapter 2 a data load analysis is performed to approximate how much data the system needs to handle. Chapter 3 then provides background information about databases and database management systems.
Chapter 4 covers the choosing of a database for this type of system; the requirements are examined and Cassandra is identified to be the database that best suits them. The most important aspects of Cassandra are then presented in Chapter 5.
Chapter 6 covers work related to our investigation. Chapter 7 then presents different data models in Cassandra for storing and indexing sensor data that allow range queries in one and two dimensions to be performed. An empirical study is then presented in Chapter 8, describing the different tests that have been performed. The evaluation of the results is then covered in Chapter 9. Chapter 10 estimates the number of servers required to meet the system requirements based on the empirical study. The results and possible improvements are then discussed in Chapter 11. The report is finally concluded in Chapter 12, where we also discuss future work.

2 A distributed database that partitions on continuous sets of keys and supports one-dimensional range queries.
CHAPTER 2 – Data load analysis
The system is intended to be able to store sensor data from all cars in Sweden, generated every second while driving. In this chapter we will try to approximate how much data this actually is and how much data a car has to transmit on average each time it has to refuel.
Average gas consumption, G_cons: 0.083 L/km = 8.3 · 10^-8 m^3/m
Total number of cars, N_cars: 4335182
Average travel distance per year for a car, d_avg: 12910 km = 1.291 · 10^7 m
Table 2.1: Swedish car statistics
Table 2.1 shows the car statistics that are required for the analysis, taken from Statistics Sweden (SCB) [22]. We assume that the average gas tank volume, V_tank, is 60 liters = 0.06 m^3 and that a car drives with an average speed, v_avg, of 60 km/h = 16.67 m/s.
2.1 The amount of data generated by vehicles
Using the average distance a car is driven during a year from Table 2.1 and the assumed average speed, the average time a car spends driving is given by

t_avg = d_avg / v_avg = 774600 s.

Consequently, since a batch is generated every second, the average number of sensor batches generated by a single car in a year is also equal to this value:

N_batches per car = 774600.
Multiplying this by the number of cars from Table 2.1, the total number of batches per year is

N_tot,batches per year = 774600 · 4335182 = 3358031977200 ≈ 3.36 · 10^12.    (2.1)

We assume that a batch contains measurements from at most 9 different sensors and that each sensor has at most two dimensions that can be stored using floats (4 bytes). A batch also contains a timestamp of when the measurements were taken, stored using a long (8 bytes). The size of a batch is therefore 9 · 2 · 4 + 8 = 80 bytes. This makes the total amount of data generated by a single car in a year

D_car per year = N_batches per car · 80 B = 774600 · 80 B = 61968000 B ≈ 59.10 MB    (2.2)

and the total amount of data generated by all cars in Sweden

D_year,tot = 3358031977200 · 80 B ≈ 244.34 TB.    (2.3)
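The estimate above is easy to reproduce. The following Python sketch recomputes Eqs. 2.1-2.3 from the Table 2.1 figures and our assumptions (one batch per second, 80-byte batches, 60 km/h average speed); it is illustrative only.

```python
N_CARS = 4_335_182             # registered cars in Sweden (Table 2.1)
D_AVG_M = 1.291e7              # average distance driven per car and year [m] (Table 2.1)
V_AVG_MS = 60 / 3.6            # assumed average speed: 60 km/h in m/s
BATCH_SIZE_B = 9 * 2 * 4 + 8   # 9 sensors x 2 dimensions x 4 B floats + 8 B timestamp

t_avg = D_AVG_M / V_AVG_MS                    # seconds driven per car and year
batches_per_car = t_avg                       # one batch per second of driving
total_batches = batches_per_car * N_CARS      # ~3.36e12 batches/year (Eq. 2.1)
data_per_car_mb = batches_per_car * BATCH_SIZE_B / 2**20   # ~59 MB/car (Eq. 2.2)
data_total_tb = total_batches * BATCH_SIZE_B / 2**40       # ~244 TB in total (Eq. 2.3)

print(f"{total_batches:.3g} batches, {data_per_car_mb:.1f} MB/car, {data_total_tb:.1f} TB total")
```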
To put this into perspective, one of the largest databases today, AT&T's call records, contains 323 TB of information and 1.9 · 10^12 rows that have been gathered over several decades^3. Our system will need to store almost this much raw data every year, and this is assuming all values have low precision and can be stored using floats. In addition, the database will need to add some overhead for its internal data structures. The sensor batches may also need to be indexed to improve the time it takes to retrieve data, and these indices need to be stored as well. Neither of these is included in the size in Eq. 2.3. The total size of the database will therefore be even bigger than this.
2.2 Data transfer at gas stations
In the previous section we derived an approximation of the amount of data that the sensors in a single car will generate during an entire year. We will now determine how much data is generated between each refueling and approximate how long it will take to transfer this data to the backend-servers.
The average number of refuelings per car and year is given by

N_refuel = d_avg · G_cons / V_tank = 17.86.    (2.4)

Assuming that the batches are equally spread out over all gas refills, the amount of data that has to be transferred when a car is at a gas station can be computed by dividing the data generated per year for a car in Eq. 2.2 by the number of refuelings per year:

D_refuel = D_car per year / N_refuel = 3.3091 MB.
3 Compare Business Products, "10 largest databases in the world." [online] http://www.comparebusinessproducts.com/fyi/10-largest-databases-in-the-world, Mar. 2010. [Accessed 2012-10-19].
Assuming that the WiFi hotspot has a bandwidth of 10 Mbit/s = 1.25 MB/s and is serving an average of 10 users at the same time, giving a throughput of 0.125 MB/s per user, the time it takes to transfer this data is

t_upload = D_refuel / 0.125 MB/s = 26.47 s.

To get an approximate upper bound on the maximum upload time of a single car, we assume a gas consumption of 0.03 L/km, a tank size of 90 L and an average speed of 25 km/h. Performing the same analysis, the time it takes to perform this upload is t_upload,max = 264 s = 4 min 24 s.
2.3 The required insertion rate
Using the total number of batches generated per year in Eq. 2.1 and the total number of refuelings per year in Eq. 2.4, the number of batches generated by all cars combined between each refueling is

N_batches,refuel = 3.36 · 10^12 / 17.8588 = 188142540372.25 batches.

If these can be handled before it is time for the next refueling, i.e. within

t_between = 1 year / 17.8588 = 1765852.12 s,

the system should be able to manage all cars in Sweden. The number of batches that the system has to be able to handle per second (the insertion rate) is then

N_batches,refuel / t_between = 106544.90 batches/s.
Thus, the storage solution must be quite fast at inserting data.
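The same numbers can be carried through Sections 2.2-2.3 in a few lines. The sketch below recomputes the refueling frequency (Eq. 2.4), the data uploaded per refueling, the upload time at the assumed 0.125 MB/s per-user throughput, and the resulting insertion rate; all constants are the assumptions stated above.

```python
D_AVG_M = 1.291e7                    # average distance per car and year [m] (Table 2.1)
G_CONS = 8.3e-8                      # average gas consumption [m^3 per m] (Table 2.1)
V_TANK = 0.06                        # assumed tank volume [m^3] (60 L)
DATA_PER_CAR_B = 774_600 * 80        # yearly data per car [bytes] (Eq. 2.2)
TOTAL_BATCHES = 774_600 * 4_335_182  # yearly batches for all cars (Eq. 2.1)
THROUGHPUT_B_S = 0.125 * 2**20       # 0.125 MB/s per user, as in the text [bytes/s]
YEAR_S = 365 * 24 * 3600

n_refuel = D_AVG_M * G_CONS / V_TANK             # ~17.86 refuelings/year (Eq. 2.4)
data_per_refuel = DATA_PER_CAR_B / n_refuel      # ~3.3 MB uploaded per refueling
t_upload = data_per_refuel / THROUGHPUT_B_S      # ~26 s at the hotspot
batches_per_refuel = TOTAL_BATCHES / n_refuel
t_between = YEAR_S / n_refuel                    # seconds between refuelings
insert_rate = batches_per_refuel / t_between     # ~106 500 batches/s overall

print(f"{n_refuel:.2f} refuels/yr, {t_upload:.1f} s upload, {insert_rate:.0f} batches/s")
```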
CHAPTER 3 – Databases and database management systems
A database is an organized collection of data (and its data structures), and a database management system is a collection of software that controls the database. In this report we will use the term database when we talk about the database and its software. What defines a database is the data model, the way the data is structured, and the operations that can be executed to modify it.
Among popular data models one finds the relational model and the object model.
The relational model is a highly flexible and easy to understand data model. The data is organized as two-dimensional arrays, called tables, with a set of defined columns.
Each row must contain a value for each column. Data can be associated between multiple tables with a set of one or more columns common to each table, which is called a key. [23]
The object model arose from the object-oriented programming paradigm as an attempt to bring databases and applications closer to each other. The object model adopts the patterns of object-oriented programming, such as encapsulation, and tries to remove the overhead that appears when data is transformed between the representation in an application and the database representation.
3.1 Schemas
A database schema is the structure of a database described in a formal language, which in the case of a relational database defines what tables exist and their properties.
3.2 Queries
Queries are performed using a query language, a computer language that is used to access or alter the data in a database. A query can generally be seen as either asking the database a question and expecting a result, called a read, or asking the database to insert something, called a write. A group of queries can be executed uniformly as a transaction.
3.3 The ACID contract
The ACID contract [24] is a set of properties that, if fulfilled, guarantee correct transactions.
3.3.1 Atomicity
Atomicity demands that no transaction partly fails. Either all operations succeed, or the entire transaction fails and the database is left in an unchanged state. Atomicity must be guaranteed in every situation, including crashes and power failures.
3.3.2 Consistency
Consistency ensures that any database operation will leave the database in a valid state. No data can be written to the database that does not fulfill the defined rules.
3.3.3 Isolation
Isolation guarantees that when multiple users access the database concurrently, the actions of any user should be unaffected and uninterrupted by those of any other user.
In other words, concurrent actions must maintain the consistency of the database.
3.3.4 Durability
Durability means that once a transaction has been performed, its effects will persist even in the event of errors.
Many databases achieve the ACID properties by implementing locking, which means that the database marks all the data a transaction will access and prevents other transactions from accessing it until the transaction has either failed or succeeded. [25]
3.4 Indexing
Indexing is a technique used to improve the performance of database queries. The idea of indices is to reduce the need to examine each entry in the database when running a query. They are not a part of the dataset stored in the database, and thus they do not affect the result of a query. The same data can have multiple indices that improve the time needed to find the data based on different parameters. As an example, a data entry for a user may be indexed by the name of the user and the country of the user.
Even though indices can increase the query speed, they consume space and require an
update each time the indexed data is mutated (which can slow down the maintenance
speed).
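As a minimal, database-agnostic illustration of the idea, the Python sketch below keeps a secondary index as a separate lookup structure that maps an indexed attribute to the keys of matching entries; the data and attribute names are made up.

```python
# Primary store: key -> record (the actual data set).
users = {
    1: {"name": "Alice", "country": "SE"},
    2: {"name": "Bob",   "country": "NO"},
    3: {"name": "Carol", "country": "SE"},
}

# Secondary index on "country": attribute value -> set of primary keys.
# It stores no user data itself and must be updated whenever a user is mutated.
country_index = {}
for key, record in users.items():
    country_index.setdefault(record["country"], set()).add(key)

# A query by country now touches only the matching keys instead of every record.
swedish_users = [users[k] for k in country_index.get("SE", set())]
print(swedish_users)
```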
3.5 Redundancy (replication)
Some databases utilize replication of database data to prevent data loss and downtime in case of system or hardware failures. Redundancy can also improve the performance of the database when multiple queries are made in parallel to the same data. Data replication introduces the need for synchronization when replicated data gets updated, to maintain consistency.
3.6 Distribution
A distributed database is a database that stores data on multiple nodes that do not share a common processing unit. The nodes may be separated by location and communicate over a network, or reside next to each other.
The idea of a distributed database is to increase the performance and capacity. Users can interact with the database as if it was one logical system, but if multiple users query for data located on different nodes the nodes can serve the data in parallel and thus increase the performance.
3.6.1 Clusters
A cluster is a group of nodes that acts as one database towards the users. A user may connect to one of the nodes in the cluster and start asking for data as if it was connected to all nodes in the cluster at once.
3.7 CAP theorem
The CAP theorem states that no distributed computing system can fulfill all three of the following properties at the same time [26]:
• Consistency — all nodes see the same data at the same time.
• Availability — a guarantee that every request receives a response about whether it failed or succeeded.
• Partition tolerance — the system continues to operate despite arbitrary message loss or failure of part of the system.
3.8 Partitioning and sharding
A partition of a database is a division of the database into independent parts. Regular partitioning can be performed in two ways: horizontal and vertical. Horizontal partitioning means splitting up a table (or the equivalent) so that rows that belong to the same table actually are stored in different tables (i.e. different partitions). One example of horizontal partitioning would be to store rows that correspond to users based on what area the user lives in. All users with zipcode < 50000 are stored in the table UsersBelow50000 and the rest are stored in the table UsersOver50000.
Vertical partitioning uses the same reasoning, but instead of dividing up the rows in different partitions, it stores the columns of a table in different partitions. A similar example as to horizontal partitioning would be to store different columns of a user- row in different tables. The table UserFirstName would contain a row with a unique identifier and the first name of a user, and the table UserSurname would contain a row with the same unique identifier and the user’s surname.
Sharding is a special architecture that uses the same approach as horizontal partitioning, but stores the rows in a single database schema instead of using multiple tables.
One advantage that partitioning provides is that it may reduce the index size, given that there is some way to identify in which table a row will be found without searching the index, which would be possible in the example of horizontal partitioning described earlier.
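A minimal sketch of the horizontal-partitioning example above: rows are routed to one of the two tables purely from the zipcode, so a lookup never needs to touch the other partition. The table and function names are ours, chosen to mirror the example.

```python
# Two horizontal partitions of the same logical "users" table.
users_below_50000 = {}   # rows with zipcode < 50000
users_over_50000 = {}    # all remaining rows

def partition_for(zipcode):
    """Pick the partition from the zipcode alone, without consulting any index."""
    return users_below_50000 if zipcode < 50000 else users_over_50000

def insert_user(user_id, name, zipcode):
    partition_for(zipcode)[user_id] = {"name": name, "zipcode": zipcode}

def get_user(user_id, zipcode):
    return partition_for(zipcode).get(user_id)

insert_user(1, "Alice", 41000)
insert_user(2, "Bob", 97200)
print(get_user(2, 97200))  # found directly in users_over_50000
```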
3.9 Caching
As with caching in general, the idea is to reduce the need to do an actual lookup by storing some data in a layer above the actual database. Caching can be a cheap way to increase the performance by distributing the workload of queries to cheaper systems.
However, if all the data in the database is accessed equally often, the cache would have to be as large and efficient as the database itself.
3.10 NoSQL
NoSQL ("Not only SQL") databases do not use the common relational model. Typically associated with NoSQL databases are the following properties:
• It does not use SQL as its query language.
• It may not give full ACID guarantees.
• It has a distributed, fault-tolerant architecture.
NoSQL databases are often extremely optimized for retrieval and write operations, and seldom offer any functionality other than storing records (e.g. key-value stores).
The lost flexibility is made up for by gains in scalability and performance. These
systems are helpful when working with extremely large data sets that do not require a
relational model for the data structure, e.g. for statistical analysis. [4]
3.11 Existing distributed databases
Many database implementations exist that can handle a large amount of data, where the most notable are AT&T's Daytona^4, Google's BigTable [27], HBase^5, MongoDB^6 and Cassandra^7. The last three are open-source projects, where HBase and Cassandra are maintained by Apache. There are many other solutions that can be scaled up using techniques described earlier, such as sharding, but the three alternatives that are free to use and designed to solve the problem of scaling up are MongoDB, HBase and Cassandra.
4 AT&T, "Daytona." [online] http://www2.research.att.com/~daytona/. [Accessed 2012-10-19].
5 The Apache Software Foundation, "HBase." [online] http://hbase.apache.org/. [Accessed 2012-10-19].
6 10Gen, "MongoDB." [online] http://www.mongodb.org/. [Accessed 2012-10-19].
7 The Apache Software Foundation, "The Apache Cassandra project." [online] http://cassandra.apache.org/. [Accessed 2012-10-19].
CHAPTER 4 – Choosing a database
According to the data load analysis (see Chapter 2) the system needs to store approximately 240 TB of raw data per year, and the data must be inserted at a rate of approximately 110 000 batches per second including indices. This requires a database that is highly scalable and can perform fast writes.
To be able to handle the scalability requirement, a distributed database is needed.
According to the CAP theorem (see Section 3.7), no distributed database can support more than two out of three of consistency, availability and partition tolerance. For this system, availability and partition tolerance are favoured over consistency. A system in a production environment that is delivered as a service to customers can not afford substantial downtime. This requires that the database is reliable. It is better to get some data eventually than to lose data or have the system go down. In large scale distributed systems the probability of nodes failing is high. The system needs to be robust against such failures. Having the system go down in such cases is not acceptable.
It is also necessary to know if a request was successful or not to ensure that data is not lost and so that it can be discarded from the back-end servers. Eventual consistency is acceptable for several reasons:
• Analysis of the data is not performed instantly, giving the data time to propagate.
• The analysis does not necessarily require all values.
• New values are produced constantly and could be inserted at any time which means that the result of a computation constantly changes.
• Values are immutable which means that there is no issue with updated values not yet being propagated and having an incorrect value. All values that can be retrieved are correct.
Storing sensor data means storing a lot of records. As each record is very small, the database will for a given size contain an uncommonly large amount of records. We need a database that can handle large amounts of records well and maintains good performance as the database size grows.
Scalability
  System requirement: High scalability (≈ 240 TB/year, 3.36 · 10^12 batches, 110 000 batches/s).
  Cassandra: Highly scalable (linear scalability). Also has compression.
  HBase: Highly scalable (linear scalability).

Write performance
  System requirement: Fast writes (110 000 batches/s).
  Cassandra: Nearly unmatched write performance, considered the fastest there is.
  HBase: Approximately half the write performance of Cassandra according to [7].

Read performance
  System requirement: Since the system is not designed for real-time analysis, the read performance is not as important as the write performance. However, the faster the better.
  Cassandra: Fast read performance. It is not the fastest, but can compete against some of the faster solutions.
  HBase: Approximately twice as fast as Cassandra.

Data model
  System requirement: Sensor data is simple, immutable and non-relational. Nothing more than a simple data model is required.
  Cassandra: Key-value store with a simple data model (based on BigTable's data model).
  HBase: Key-value store with a simple data model (based on BigTable's data model).

Transactions
  System requirement: Not necessary. The data is never mutated. Thus, a database that fulfills the ACID contract is not needed.
  Cassandra: Not supported.
  HBase: No, but has support for atomic updates of a single row (can insert many columns into one row atomically).

One-dimensional range queries
  System requirement: Yes, it is the most common type of sensor query.
  Cassandra: Yes, one-dimensional range queries are supported by default.
  HBase: Can perform range gets natively on row keys, but not on column keys. Achievable with Map/Reduce jobs.

Spatial queries
  System requirement: Yes, needs to be able to perform regional queries to get all batches in a region.
  Cassandra: Not by default, but data can be arranged to support spatial queries [28].
  HBase: Not by default, but data can be arranged to support spatial queries [19].

Consistency
  System requirement: Desired, but not necessary. Eventual consistency is OK. Partition tolerance (reliability) and availability are favoured (CAP).
  Cassandra: Eventual consistency, with tunable tradeoffs between consistency and latency.
  HBase: Yes.

Reliability
  System requirement: The system should be reliable and robust against failure. Losing data or having the system go down is not acceptable.
  Cassandra: Cassandra is reliable. Replication across data centers, partition tolerance and single-server durability are supported.
  HBase: Yes.

Availability
  System requirement: Yes, it is necessary to know if a request succeeded.
  Cassandra: Yes, high availability.
  HBase: Low availability.

Table 4.1: The correspondence between properties required by the system and those supported by Cassandra and HBase.
When it comes to retrieving sensor data, the most common type of query is the range query. It should be possible to retrieve all batches where the speed is greater than 120 km/h, where the temperature is between -5° and 5° Celsius, that are located in a certain region, and so on. These are either one-dimensional range queries or two-dimensional regional queries. Combining several such requirements becomes a multi-dimensional range query. The system requires that the database can perform such queries fast enough to allow for non-real-time processing.
The sensor data that needs to be stored is immutable. This eliminates the need for transactions to keep values consistent when updating. Thus, an ACID compliant database is not needed. The sensor data is also simple, i.e. there are no complex relations or inheritances between different entities, so a relational or object-oriented database is not warranted either.
Given the requirements three potential candidates were identified: Cassandra, HBase and MongoDB. However, while looking into MongoDB it was identified that it has poor scalability in comparison to Cassandra and HBase — and in general as well [29]. It also favours consistency over availability [3]. Thus, MongoDB is not a good candidate for the system. This left us with two alternatives, Cassandra and HBase.
HBase and Cassandra both fulfill most of the system requirements; a summary of this can be seen in Table 4.1. However, Cassandra is superior to HBase in two important aspects: HBase favours consistency over availability [3] and has worse write performance [7]. We will therefore not go any deeper into HBase.
Cassandra [30][31] is a fully distributed key-value store with a simple data model. In accordance with the CAP theorem and the system requirements, Cassandra supports availability and partition tolerance but is instead eventually consistent. The level of consistency is tunable and exchanged for increased latency on a per-request basis. Cassandra also supports replication across multiple data centers and single-server durability for reliability.
Cassandra is known for its fast writes, and several studies have been made to support this claim. [7] performs a benchmark comparison between Cassandra (0.6.0-beta3) and HBase (0.20.3). The results show that Cassandra has better write performance (2.1 times as fast), although HBase has much faster reads (8 times). However, in [32] they show that the read performance of Cassandra has since then been increased by 400%.
The performance difference should therefore have been reduced to approximately 2x faster reads for HBase in the latest versions. Cassandra’s write performance has also been improved by ≈ 40% since the comparison in [7] according to [32].
To relate Cassandra to the relational paradigm, [6] performs a comparison of multiple databases including Cassandra and a cluster-based variant of the world's most used database, MySQL^9, where they show that a Cassandra cluster scales better and has better read and write throughput than an equivalent MySQL cluster. For instance, the Cassandra cluster reaches a throughput of approximately 200 000 writes/s with a write-heavy workload using 12 nodes, which is twice as much as the MySQL equivalent.
In a benchmarking study made by Netflix for Cassandra 0.8.6 [33], they show that a single node using ordinary hardware (Amazon EC2 instances with 15 GB RAM, 4 cores, 1 Gbit network and 4 internal disks of 1.6 TB total) can handle approximately 10 000 writes/s. They also show that the number of writes/s scales linearly with the number of nodes in the cluster, assuming the load is distributed equally over all nodes using Cassandra's random partitioner (see Chapter 5). Using 288 Cassandra instances, a replication factor of three and 60 EC2 instances to put load on the cluster, they achieve a performance of 1.1 million writes/s, where each write is 400 bytes. The load is constant per node as the cluster grows. This means that having n nodes will increase the performance n times. Cassandra is thus very scalable.

9 MySQL, "MySQL :: The world's most popular open source database." [online] http://www.mysql.com/. [Accessed 2012-10-19].
When it comes to reads, Cassandra supports range queries in one dimension by default. [20] has shown that it is possible to extend Cassandra to support multidimensional range queries as well. Cassandra does not support two-dimensional spatial queries by default, but it is possible to store the data in a way that enables such queries to be performed. [28] has come up with a solution for performing spatial queries efficiently by storing a BinTree on top of Cassandra. They claim that the solution works well, but do not provide any performance evaluation or details on their implementation. A further investigation of the solution and its performance is therefore needed to determine if it is efficient enough for this system. The same approach has also been applied to HBase in [19], where it is shown to work well. This is explained in more detail in Chapter 6.
To summarize this chapter, Cassandra seems to be the best available storage solution for the system. However, some things remain to be investigated:
• How to and if sensor data can be stored and indexed in Cassandra to allow for one-dimensional range queries to be performed efficiently.
• The performance (and implementation details) of the BinTree overlay for two- dimensional regional queries.
• If the overhead required for storing the values in Cassandra is small enough that the solution is feasible.
• How the system should be dimensioned according to the capacity needed, and whether this is feasible in practice.
The rest of the report will focus on investigating these things.
CHAPTER 5 – Introduction to Cassandra
Cassandra is a highly scalable, eventually consistent, fully distributed key-value store that is optimized for fast write throughput [34]. It is distributed under the Apache License v. 2.0^10 and is thus free to use. This chapter describes the most important characteristics of Cassandra required to understand our approaches and reasoning in the following chapters.
5.1 Data model
The Cassandra data model^11, as illustrated in Figure 5.1, is based on Google's BigTable [27] and consists of keyspaces, column families (CFs), rows, columns and super columns.
A keyspace is simply a namespace for column families. Typically a single keyspace is used for an application. The keyspace is analogous to the database in a relational database management system.
Column families are containers for rows. Each row is associated with a row key that is used to access it. The key (translated to a token by the partitioner) also determines which node in the cluster the row is stored on, as described in Section 5.2. The rows are stored in token order, making it possible to perform slice queries on rows if an order preserving partitioner is used. Each column family is stored in a separate file on disk.
Rows are containers for columns. Each row can contain an arbitrary number of columns. However, there is a limit of 2 billion columns per row^12. A column consists of a key/value pair and a timestamp. The columns are sorted by their key, making it possible to perform column slice queries, see Section 5.4. All columns associated with the same row are stored on the same node in the cluster. Super columns are simply columns that contain columns.
10 The Apache Software Foundation, "Apache License, Version 2.0." [online] http://www.apache.org/licenses/LICENSE-2.0.html, Jan. 2004. [Accessed 2012-10-19].
11 "DataModel." [online] http://wiki.apache.org/cassandra/DataModel/, Sep. 2012. [Accessed 2012-10-19].
12 "CassandraLimitations." [online] http://wiki.apache.org/cassandra/CassandraLimitations, May 2012. [Accessed 2012-10-19].
Figure 5.1: The Cassandra data model. The keyspace is a container for column families.
Each column family contains a number of rows that are mapped to a set of columns.
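One rough way to picture this hierarchy (ignoring super columns and the on-disk layout) is as nested sorted maps. The Python sketch below is only a mental model of the data model described above, not how Cassandra stores data internally; the keyspace, column family and row key names are invented.

```python
from collections import OrderedDict

# Mental model only: keyspace -> column family -> row key -> sorted columns.
# Names ("VehicleData", "Batches", "device42") are made up for illustration.
cluster = {
    "VehicleData": {                      # keyspace (namespace for column families)
        "Batches": {                      # column family
            "device42": OrderedDict([     # row key; all its columns sit on one node
                (1355011200, ("batch-blob-0", 1355011200123)),  # column key -> (value, timestamp)
                (1355011201, ("batch-blob-1", 1355011201123)),
            ]),
        },
    },
}

# Columns are kept sorted by their key, which is what makes column slice
# queries (e.g. all columns between two timestamps) possible.
row = cluster["VehicleData"]["Batches"]["device42"]
print([col for col in row if 1355011200 <= col <= 1355011201])
```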
5.2 Cassandra architecture
Cassandra uses a ring architecture based on Amazon’s Dynamo [35] to distribute rows over the set of nodes (together called the cluster), see Figure 5.2. The ring represents a cyclic range of token values, referred to as the token space.
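As a rough illustration of the ring, the sketch below hashes a row key to a token (MD5, similar in spirit to Cassandra's random partitioner) and walks clockwise to the first node whose token is greater than or equal to it. The node tokens and the row key are made up; this is not Cassandra's actual implementation.

```python
import hashlib
from bisect import bisect_left

TOKEN_SPACE = 2 ** 127  # the cyclic token space; tokens lie in [0, 2^127)

# Hypothetical nodes, each responsible for the arc of the ring ending at its token.
NODE_TOKENS = sorted([
    (int(0.10 * TOKEN_SPACE), "node1"),
    (int(0.45 * TOKEN_SPACE), "node2"),
    (int(0.80 * TOKEN_SPACE), "node3"),
])
TOKENS = [t for t, _ in NODE_TOKENS]

def token(row_key: str) -> int:
    """Map a row key to a token by hashing it (MD5), similar to a random partitioner."""
    return int(hashlib.md5(row_key.encode()).hexdigest(), 16) % TOKEN_SPACE

def node_for(row_key: str) -> str:
    """Walk clockwise from the key's token to the first node token, wrapping around."""
    idx = bisect_left(TOKENS, token(row_key))
    return NODE_TOKENS[idx % len(NODE_TOKENS)][1]

print(node_for("device42:1355011200"))  # which node stores this row
```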