
DEGREE PROJECT IN ELECTRICAL ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2019

A Comparative Study of Databases for Storing Sensor Data

JIMMY FJÄLLID

KTH
SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE


A Comparative Study of Databases for Storing Sensor Data

Jimmy Fjällid

Master of Science Thesis

Communication Systems

School of Information and Communication Technology KTH Royal Institute of Technology

Stockholm, Sweden

29 May 2019

Examiner: Peter Sjödin
Supervisor: Markus Hidell


© Jimmy Fjällid, 29 May 2019


Abstract

More than 800 Zettabytes of data is predicted to be generated per year by the Internet of Things by 2021. Storing this data necessitates highly scalable databases. Many different data storage solutions exist that specialize in specific use cases, and designing a system to accept arbitrary sensor data while remaining scalable presents a challenge.

The problem was approached through a comparative study of six common databases, inspecting documented features and evaluations, followed by the construction of a prototype system. Elasticsearch was found to be the best suited data storage system for the specific use case presented in this report, and a flexible prototype system was designed. No single database was determined to be best suited for sensor data in general, but with more specific requirements and knowledge of future use, a decision could be made.

Keywords: IoT, NoSQL


Sammanfattning

More than 800 Zettabytes of data are predicted to be generated by the Internet of Things by 2021. Storing this data makes highly scalable databases a necessity. Many different data storage solutions exist that specialize in specific use cases, and designing a system that can receive arbitrary sensor data while remaining scalable is a challenge.

The problem was approached through a comparative study of six popular databases, which were compared based on documented functionality and independent evaluations. This was followed by the development of a prototype system. Elasticsearch was judged to be best suited for the specific use case presented in this report, and a flexible prototype system was developed. No single database was judged to be best suited for handling sensor data in general, but with more specific requirements and knowledge of future use, a database could be selected.

Keywords: IoT, NoSQL


Contents

1 Introduction
1.1 Overview
1.2 Problem Description
1.3 Problem Context
1.4 Requirements
1.5 Purpose
1.6 Goals
1.7 Deliverables
1.8 Research Methodology
1.9 Delimitations
1.10 Structure of This Thesis

2 Background
2.1 Data Storage Models
2.1.1 ACID
2.1.2 BASE
2.2 Database Types
2.2.1 Relational Databases
2.2.2 NoSQL Databases
2.2.3 Time Series Databases
2.3 B+-tree vs LSM
2.4 Indexing
2.5 Database Replication Architectures
2.6 Scaling
2.6.1 Master-Master vs Master-Slave
2.6.2 Sharding
2.7 Query Interfaces
2.7.1 Information Query Interfaces
2.7.1.1 SOAP
2.7.1.2 REST
2.7.1.3 GraphQL
2.7.2 Database Query Interfaces
2.7.3 System Integration
2.8 Related Work

3 Comparison
3.1 Comparison Criteria
3.2 Scalability and Backups
3.3 Maintenance
3.4 Support for New Data Types
3.5 Query Language
3.6 Long Term Storage
3.7 Summary

4 Prototype development
4.1 First Iteration
4.1.1 Data Structure
4.1.2 Dynamic Mapping of Data Types
4.1.3 Parsing SenML Messages
4.1.4 Calculating Measurement Time
4.1.5 Message Transport Protocol
4.2 Second Iteration
4.2.1 Long Term Storage and Backups
4.2.2 Scaling Through Modularity
4.2.3 Multi Threading and Asynchronous Requests
4.2.4 Index Lifecycles
4.3 Third Iteration
4.3.1 Data Retrieval Interface
4.3.2 SenML Parsing
4.3.3 Automated Deployment

5 Evaluation
5.1 Basic Suitability
5.2 Data Retrieval
5.3 Licensing, Cost, and Support
5.4 Parts Not Evaluated
5.4.1 Scalability
5.4.2 Performance
5.4.3 Methods of Transport
5.4.4 Accessing Old Data

6 Analysis
6.1 Prototype
6.1.1 Index Rotation
6.1.2 UUID and Duplicated Measurements
6.1.3 Data Structure
6.1.4 Data Retrieval Interface

7 Conclusions
7.1 Conclusion
7.1.1 Goals
7.1.2 Requirements
7.1.3 Insights
7.2 Future Work
7.2.1 What Has Been Left Undone?
7.2.2 Security
7.2.3 Next Obvious Things to Be Done

Bibliography

A Elementary evaluation
A.1 Overview
A.1.1 Basic Functionality
A.1.2 Varying Data Fields
A.1.3 Variable Data Structure
A.1.4 Retiring of Data


List of Figures

2.1 Replication architectures
2.2 Splitting a collection into multiple shards
4.1 Architecture of the prototype system


List of Tables

3.1 Database comparison


List of Listings

1 Resolving of a SenML message
A.1 A SenML structured message
A.2 A SenML message after ingestion into elastic
A.3 A SenML message with multiple occurrences of the same base fields
A.4 A resolved SenML message with multiple occurrences of the same base fields after ingestion into elastic
A.5 A JSON object sent to elastic
A.6 A JSON document after ingestion by elastic


List of Acronyms and Abbreviations

ACID Data consistency model with strong consistency guarantees.

API Application Programming Interface

BASE Data consistency model with a focus on availability.

CoAP Constrained Application Protocol

CPU Central Processing Unit

CQL Cassandra Query Language

DBMS Database Management System

DSL Domain Specific Language

HATEOAS Hypermedia as the Engine of Application State

HTTP Hypertext Transfer Protocol

InfluxQL Influx Query Language

IoT Internet of Things

JDBC Java Database Connectivity

LSM Log-structured Merge-tree

MQTT Message Queuing Telemetry Transport

ODBC Open Database Connectivity

RAC Real Application Cluster

RAM Random Access Memory

REST Representational State Transfer


SenML Sensor Measurement Lists

SOAP Simple Object Access Protocol

SQL Structured Query Language

UUID Universally Unique Identifier

XML Extensible Markup Language


Chapter 1 Introduction

Most application developers will at some point require persistent data storage.

There are many data storage solutions to choose from, and choosing which type of storage to use for a specific application is not always straightforward. Previously there was a clear distinction between databases working with structured data and those working with unstructured data. Recently, however, the distinction has become harder to make because some structured databases now also support unstructured data, and vice versa.

Cisco has predicted that the Internet of Things (IoT) will generate more than 800 Zettabytes of data per year by 2021 [1]. Environmental sensors, smart parking systems, and self-driving cars can all generate data that could be useful. Cars could be equipped with various sensors that report information such as the outside temperature and current emission levels. The estimated number of cars in Stockholm, Sweden in 2017 was 375 per 1000 citizens, which with a population of about 950,000 amounts to more than 350,000 cars [2].

Collecting sensor data from cars alone would put a high load on a database, and also including stationary sensors throughout a city, and potentially sensor data generated by wearables and smartphones, requires a scalable system.

Information about the air quality could be a factor in deciding where to go, and a smart parking system could inform the self-driving car where to find a parking slot at the destination. Furthermore, the car could retrieve information about current road traffic conditions to make an informed choice of which route to take.

Connecting these information sources together to make the data useful presents a challenge. While multiple cloud providers already offer solutions to collect, store, and analyze sensor data, the focus of this thesis is how to design an open system that is not locked to a specific company or product. This is to promote system longevity, and flexibility in what types of sensors to use, how they


transmit data, and how that data is stored.

To understand how to build such a system, this report contains two parts: a comparative study, and the development of a prototype system. In the first part of the report, a selection of different types of databases is compared to find the most suitable one. Next, a query interface is designed to enable the retrieval of collected data and to facilitate integration with other systems. Once a choice has been made, a prototype back-end system is developed to evaluate the chosen solution.

1.1 Overview

Any computer program that requires persistent data storage must use a database.

This might be some embedded database bundled with the program, or it could be an external database that the program interacts with using some remote interface.

Regardless of which type of database is used, a database interface and some form of query language are required for the interaction. This could be a database driver that integrates the database with a program, or a simple Application Programming Interface (API) leveraging the Hypertext Transfer Protocol (HTTP) to connect arbitrary devices.

1.2 Problem Description

There are many different data storage solutions to choose from that specialize in different use cases, and selecting which one to use can be hard. How can an open and scalable system be built that supports a large quantity of sensors from varying manufacturers and continuously collects, stores, and makes the data available? How should different data storage solutions be compared, and what metrics are relevant?

Problem statement: How to design a data storage solution and build a scalable back-end system for receiving and handling large amounts of sensor data, and how to make that data readily available?

1.3 Problem Context

IoT enables easier and more cost-effective deployment of sensors. Every connected machine is a potential information source that can collect and report different kinds of metrics. 42 billion IoT devices are forecast to be connected by 2022 [3]. A system to collect sensor data would thus have to scale well to support


the growing number of data sources. A sensor might produce a regular stream of data that adds an even load to a database, but it might also send data in bursts, highly irregularly. This means that it is not safe to assume that there will be periods of lighter load during which the database can work through a backlog and catch up. A system would thus have to be able to handle both steady and variable loads.

Previously, the distinction between relational and NoSQL databases was clearer. Relational databases all supported SQL, while NoSQL databases had their own individual query languages. Relational databases handled only structured data, and NoSQL databases handled only unstructured data. If there was a need for huge data sets, NoSQL was the way to go due to its better scalability. More recently, however, the distinction has become blurred as some relational databases have begun to support unstructured data, as well as clustering to scale better (at least for read operations) [4][5].

1.4 Requirements

To select a suitable database and build a prototype system, a requirement specification is needed to provide some of the selection criteria. This project aims to design a system for the collection of various types of sensor data, such as environmental data in a city environment. It should be able to handle the huge number of connected devices expected in the future, and the increasing amount of generated data. The system should thus be scalable and extendable to support new types of sensor data and methods of collection.

When environmental data is collected, there is no clear point in time after which the data becomes irrelevant. To support this the system should be able to store data for an undefined amount of time, and the collected data should be accessible to the public.

The following list contains the formal requirements for the chosen solution:

• Any sensor can be connected to supply complex data structures.

• New types of sensor readings can be added in the future.

• The collected data must be publicly accessible.

• A new method of transporting a report from sensor to back-end can be implemented in the future.

• A sensor can report different types of information at separate time intervals.


• The system should scale with the number of sensors, to support the growing number of connected devices.

1.5 Purpose

The purpose of this thesis is to answer the questions raised in Section 1.2 by investigating how to design a generic data storage solution: one that is suitable for receiving large amounts of sensor data and that promotes interoperability. Part of the purpose is also to investigate how to make the data readily available and accessible, and to implement a prototype back-end system.

1.6 Goals

The goal of this project is to determine how to store sensor data and make it available, and why the chosen method is best suited. The project is also expected to deliver a prototype implementation of a back-end system that can receive large amounts of data, store it, and make it available to the public.

The aforementioned goals are listed below:

1. Determine the best way to store sensor data.

2. Determine the most suitable way to access stored sensor data.

3. Design a prototype back-end system to receive sensor data, store it, and make it available.

To aid in the design of the prototype, the GreenIoT research project [6], in which IoT sensors are deployed in a city environment to monitor air quality, is chosen as a use case. This is meant to benefit the GreenIoT project, and by extension the citizens where the project is deployed, by making the collected data freely and readily available.

1.7 Deliverables

This project is expected to deliver a comprehensive guide to choosing the appropriate database for storing IoT sensor data. The project is also expected to deliver a prototype back-end system to handle IoT sensor data for the GreenIoT project.


1.8 Research Methodology

This research project will be carried out using an inductive approach [7], and it will consist of two parts. The first part of the project will collect and establish metrics for comparison and then form a decision based on the collected data.

The first step will be to gather background knowledge about databases and query interfaces to identify various important properties. Next, a subset of the identified properties will be selected and used as a basis for comparison. The conclusion of which database and query interface are most suitable will be formed based on how well they match the selection criteria.

The second part of the project will focus on the implementation of the chosen database and query interface in a prototype back-end system. This prototype will be developed using an iterative approach: a simple prototype will be developed first, and features will then be added continuously in new iterations until the prototype is complete.

1.9 Delimitations

The database comparison will be theoretical and based on information available in product documentation and research papers. No practical tests will be performed while comparing the databases.

There are more databases available to choose from than can be covered by a thesis project. Therefore, this project will focus on some of the most popular SQL, NoSQL, and time-series databases. In regards to query interfaces, only REST and GraphQL will be examined, as they are considered to be the top candidates.

1.10 Structure of This Thesis

Chapter 1 introduces the reader to the problem and its context. Chapter 2 provides the background necessary to understand the basis for the choice, and additional knowledge useful for understanding the rest of this thesis. Chapter 3 compares the different alternatives to find the most suitable choice. Following this, Chapter 4 implements the selected database and query interface in a prototype system. The solution is evaluated in Chapter 5 and analyzed in Chapter 6. Finally, Chapter 7 presents the conclusion and provides suggestions for future work.


Chapter 2 Background

This chapter provides background information required to understand the properties of different databases, and serves as the basis for the comparison. It also contains information about how data can be retrieved and manipulated using various query interfaces.

2.1 Data Storage Models

A database is a collection of data that is stored electronically and is accessible in various ways. When designing a database there is a trade-off to be made because of the CAP theorem, first described by Eric A. Brewer in a paper published in 1999 [8]. The paper explains that a distributed system can only guarantee two out of three of the following:

• Consistency: A read will always return the last written value.

• Availability: Given replication of data over multiple network nodes, a client can always reach some replica.

• Partition-resilience: Given a partition in the network, the system as a whole will still be operational.

In practice this has been a choice between consistency and availability, because the trade-off is only relevant at the time of a network partition. When there is no partition in the network, a database can have both consistency and availability. Traditionally, consistency has been the preferred choice, which led to the development of ACID.


2.1.1 ACID

ACID is a data consistency model that promises:

• Atomic: Either all operations in a transaction are completed or everything is rolled back.

• Consistent: After a successful transaction, the database is consistent and structurally sound.

• Isolated: Transactions do not compete for access to specific data and are isolated from each other.

• Durable: The committed data is stored so that it will be available in the correct state even after a failure and system restart.

This model was defined by Haerder and Reuter in 1983 [9], and has frequently been used when strong data consistency is a top requirement. A typical example is found in the banking industry: whenever a bank transaction occurs, it is of paramount importance that the transaction record is stored consistently and permanently.
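To make the atomicity guarantee concrete, the following minimal sketch uses Python's standard sqlite3 module and a hypothetical accounts table: both balance updates are committed together, or the transaction is rolled back and the database is left untouched.

```python
import sqlite3

# Minimal sketch: a hypothetical "accounts" table with (id, balance).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 0)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Move money atomically: both updates succeed, or neither does."""
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?",
                     (amount, src))
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?",
                     (amount, dst))

transfer(conn, 1, 2, 40)
print(conn.execute("SELECT * FROM accounts").fetchall())  # [(1, 60), (2, 40)]
```

If either UPDATE raises an error, the context manager rolls the whole transaction back, so no partially applied transfer can ever be observed.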

2.1.2 BASE

Another model, not as strict as ACID, is BASE. The BASE data semantics consist of:

• Basically Available: Guarantees availability as described by the CAP theorem [8]. However, while a response is guaranteed, it might be a failure response, or the returned data might be inconsistent.

• Soft state: The system state might change over time, without input, as the result of eventual consistency.

• Eventually consistent: System consistency will eventually be achieved provided that the user input stops, but the system will not wait for consistency before processing new requests [10].

The BASE model was first defined by Eric A. Brewer in 1997 [11]. It is a looser specification than ACID, shifting the focus from consistency to availability. This enables a database to handle a higher insertion load and to scale more easily across multiple machines.


2.2 Database Types

There are many different kinds of databases, and a rough classification is into structured and unstructured databases, also known as relational and NoSQL databases. A third classification that exists somewhere in between is time series databases.

2.2.1 Relational Databases

Relational databases are a type of database that builds on the relational model [12]. The data is stored in tables, consisting of rows and columns, that can be connected to other tables to create relationships between the data. A relational database is governed by strict schemas that must be defined before any data is inserted into the database. This means that knowledge about the data structure is required to configure the database before any data can be stored. Relational databases are typically ACID compliant, meaning that they guarantee strong consistency.

2.2.2 NoSQL Databases

NoSQL databases come in different flavors and the main ones are [13]:

• key-value store: data is stored as key-value pairs where the key is some name that is mapped to a simple value.

• document database: similar to a key-value store, but the value is a more complex data structure known as a document.

• column-based store: data is stored in columns instead of rows, so that each row can have a varying number and types of columns.

• graph database: built on graph theory and treats relationships between data as equally important to the data itself.

A key difference from relational databases is that NoSQL databases do not require a predefined schema to be configured before data can be inserted. They are built around the looser BASE model instead of ACID, to support high availability.

When document databases are discussed in this report, a document is the basic item that is stored. At its base it is a JSON object, but it can contain regular fields, nested objects, and nested arrays. A JSON object is thus referred to as a document when discussed in relation to a document database.
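As an illustration, a single sensor reading stored as a document could look like the following sketch; the field names are hypothetical and chosen only to show plain fields, a nested object, and a nested array in one JSON object.

```python
import json

# A hypothetical sensor reading as a document: plain fields,
# a nested object, and a nested array in one JSON object.
reading = {
    "sensor_id": "urn:dev:mac:0024befffe804ff1",
    "timestamp": "2019-05-29T12:00:00Z",
    "location": {"lat": 59.3293, "lon": 18.0686},   # nested object
    "measurements": [                                # nested array
        {"name": "temperature", "value": 21.5, "unit": "Cel"},
        {"name": "humidity", "value": 40.0, "unit": "%RH"},
    ],
}
print(json.dumps(reading, indent=2))
```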


2.2.3 Time Series Databases

A time series database has a shifted focus compared to relational and NoSQL databases in that it can specialize in append operations. A key characteristic of time series data is that the data is tightly coupled with a timestamp. This means that it is primarily inserted and not updated, as the data is mainly relevant together with the timestamp from when it was collected.

Because each new data point is coupled with the time of the measurement, it is almost always inserted at the end, and if records are deleted, it is done in bulk. Since each value is coupled with a timestamp, there is seldom any reason to update an old value; instead, new data is inserted with the current timestamp. Because data is nearly always inserted at the end, random inserts may be slow as long as inserts at the end are fast. Another characteristic of time series data is that queries are typically over ranges of values rather than over dispersed data points.

2.3 B+-tree vs LSM

When data is stored, a data structure is used to enforce some layout that promotes a desired trait such as fast queries or quick inserts. Two well known data structures are B+-tree and log-structured merge-tree (LSM).

A B+-tree is a generalization of the binary search tree [14] that is often used, in some variant, to store database indices. It is a self-balancing tree structure designed to be quick for both random and sequential access [14]. However, while it is fast at read operations even for very large data sets, it is not as fast to insert new data.

LSM, on the other hand, is a data structure designed for fast inserts and indexing, making it well suited for applications where writes are more common than reads [15]. It is specialized for sequential disk access, exploiting the fact that disks are commonly much faster at sequential access than at random access, since there is no extra overhead from seeking different data locations [15].


2.4 Indexing

An index is used to increase the performance of specific database queries. Typically, data is stored on disks that are slow to access compared to main memory. To retrieve a record without an index, the database management system (DBMS) would have to scan through every stored record to find the right one, which can become prohibitively expensive as the database grows.

An index can be used to create a smaller table containing all the keys from the main table together with pointers to the exact location of each database record.

This smaller table would be much faster to scan through to find the exact location of the sought entry, resulting in fewer disk reads.

As an example, consider a company database containing the employee number, first name, last name, and email address of every employee. Assuming that the employee numbers are ordered, binary search could be used to access a specific record when using the employee number as a key. However, if the query looks up an employee by last name, all of the records might have to be scanned to find the right one. In this case an index could be created by putting all the employee last names, together with a pointer to the full employee record, in a new table.

This new table would be sorted by last name, resulting in a fast lookup of a specific employee by last name. Multiple indices might be needed to support fast queries on other keys, such as first name or email address. However, creating a new index requires extra storage space. Another drawback is that insertion of new records and deletion of old ones are slowed down by the process of updating the indices.
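The employee example can be reproduced in a few lines. The following sketch, using Python's standard sqlite3 module and hypothetical table contents, creates an index on the last-name column and asks the query planner how a lookup is executed before and after.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE employees (
    emp_no INTEGER PRIMARY KEY,
    first_name TEXT, last_name TEXT, email TEXT)""")
conn.executemany("INSERT INTO employees VALUES (?, ?, ?, ?)",
                 [(1, "Ada", "Lovelace", "ada@example.com"),
                  (2, "Alan", "Turing", "alan@example.com")])

# Without an index on last_name, this query scans the whole table.
plan = conn.execute("EXPLAIN QUERY PLAN "
                    "SELECT * FROM employees WHERE last_name = 'Turing'")
print(plan.fetchall())  # ... SCAN employees ...

# The index is a smaller, sorted structure with pointers into the table.
conn.execute("CREATE INDEX idx_last_name ON employees(last_name)")

plan = conn.execute("EXPLAIN QUERY PLAN "
                    "SELECT * FROM employees WHERE last_name = 'Turing'")
print(plan.fetchall())  # ... SEARCH employees USING INDEX idx_last_name ...
```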


2.5 Database Replication Architectures

When a database is hosted on a single machine, a crash might result in the loss of all data. A method of avoiding this is to use database replication. Two common database replication architectures are master-slave and master-master (also known as multi-master).

When a group of machines hosting a database is configured for master-slave replication, only one of the machines, the master, performs any writes [16]. The master replicates the database to the slaves, and any update written to the master is pushed to the slaves, as seen in Figure 2.1a. In this setup any of the machines can handle queries, which allows for load distribution, but because only the master is allowed to update the database, there is still a performance bottleneck. Master-slave replication can also be implemented with strong consistency requirements such as ACID, as there is only one machine that owns the data.

The second architecture is master-master, sometimes called multi-master, seen in Figure 2.1b. This architecture allows any node to handle both database writes and reads [17], which improves availability, but it cannot guarantee consistency as strong as that required by ACID.

Figure 2.1: Replication architectures: (a) master-slave; (b) multi-master


2.6 Scaling

Two methods of scaling a database are vertical and horizontal scaling. With vertical scaling, the machine's resources are increased to handle a larger load, while horizontal scaling distributes the load over multiple machines [18]. It might be easy to scale a single machine by adding memory, Central Processing Units (CPUs), and disk storage. However, this method of scaling cannot continue indefinitely, and when the resources of a single machine can no longer handle the load, the alternative is to implement horizontal scaling. Adding additional machines to a system provides great scalability, but there are some inherent problems with horizontal scaling that must be dealt with. There are a few ways to scale horizontally, depending on what part of the system requires more resources.

2.6.1 Master-Master vs Master-Slave

If there are many read requests, it might be enough to add slaves that each hold a replica of the database but can only handle read requests. This would be the case in a master-slave architecture, as described in Section 2.5. If the system needs to handle a larger number of write operations, a multi-master architecture would be more suitable. With this method, each machine has a full copy of the database and can handle both read and write requests, but issues concerning data consistency between the machines must be resolved. In both of these methods, the full database is stored on each machine; if the data set is too large for one machine, a solution is to use sharding.

2.6.2 Sharding

With sharding, as seen in Figure 2.2, the data is split into multiple parts that are stored on different machines. This way, only a fraction of the data is stored on each machine, and all machines can handle both write and read requests. A challenge here is to split the data evenly across the machines and to make sure that read and write requests are sent to the correct database instance, as illustrated by the sketch following the figure.


Figure 2.2: Splitting a collection into multiple shards. (Adapted from https://blog.pythian.com/sharding-sql-server-database/)
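A minimal way to route requests to the correct shard is to hash the shard key and take the result modulo the number of shards. The sketch below uses hypothetical node names, and the modulo placement is a simplification rather than any specific product's algorithm (real systems also handle re-balancing, for example via consistent hashing).

```python
import hashlib

# Hypothetical shard nodes holding disjoint parts of the data.
SHARDS = ["db-node-0", "db-node-1", "db-node-2"]

def shard_for(key: str) -> str:
    """Map a shard key (e.g. a sensor id) to the node that stores it."""
    digest = hashlib.sha1(key.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

# Every read and write for the same key goes to the same node.
print(shard_for("sensor-42"))   # always the same node for this key
print(shard_for("sensor-43"))   # possibly a different node
```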


2.7 Query Interfaces

Query interfaces fall into two broad categories: those used to retrieve information from a database, and the more general interfaces used to retrieve information from any source. In the rest of this report, these will be referred to as database query interfaces and information query interfaces, respectively.

2.7.1 Information Query Interfaces

In order to expose a system to a client for data retrieval and manipulation, a common method is to introduce an API that defines a set of methods available to the client for interacting with the system. There are multiple different types of APIs, and the more well-known ones are: Simple Object Access Protocol (SOAP) [19], Representational State Transfer (REST), and GraphQL [20].

2.7.1.1 SOAP

SOAP is an Extensible Markup Language (XML) based protocol created to exchange information in a distributed environment [19]. It is designed to be lightweight and operating-system agnostic, and it enables simple communication between programs over HTTP*. It is usually combined with the Web Services Description Language, which defines the available methods and how to call them [21], resulting in a tight coupling between server and client.

2.7.1.2 REST

REST is a well-known design architecture for an API that was first described by Roy T. Fielding in his doctoral dissertation [22]. In the dissertation, Fielding describes how REST is designed to promote longevity and enable independent evolution of server and client through the use of hypermedia as the engine of application state (HATEOAS). This means that the client should not have any out-of-band knowledge about the server other than how to handle hypermedia and the entry point location for the API. Among the key characteristics of the REST architecture described by Fielding are a uniform and stateless design that enables great scalability, further promoting API longevity.

* SOAP is not dependent on HTTP and works over other protocols as well, such as the Simple Mail Transfer Protocol.


2.7.1.3 GraphQL

GraphQL is a graph query language developed by Facebook to give clients more control over what data they retrieve [20]. It is presented as an alternative to the RESTful approach, and it is contract-driven through the use of schemas that define the functionality. A client can specify which attributes to retrieve, which enables highly granular requests as well as a potential reduction in network bandwidth.
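As a sketch of the idea, a GraphQL request is typically an HTTP POST carrying the query as a string; the server returns only the fields the client named. The endpoint URL and the schema below are hypothetical.

```python
import json
import urllib.request

# Hypothetical GraphQL endpoint and schema: the client names exactly
# the fields it wants, and the server returns only those.
query = """
{
  sensor(id: "sensor-42") {
    lastReading { value unit time }
  }
}
"""

req = urllib.request.Request(
    "http://localhost:8080/graphql",
    data=json.dumps({"query": query}).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp))  # {"data": {"sensor": {...}}}
```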

2.7.2 Database Query Interfaces

There are multiple methods developed to access and manipulate a database. One such method is the Structured Query Language (SQL), which is based on the relational model and is typically used to manipulate data in relational databases such as MySQL and OracleDB*. It was defined by D. D. Chamberlin and R. F. Boyce in 1974 [23] and has since been through multiple revisions, the latest one called SQL:2016 [24].

For non-relational databases, no standard interface exists. Every database implements an interface tailored to itself, and while some are very similar to SQL, they often lack features that depend on the strong consistency guarantees of ACID-compliant databases [25][26][27]. There are also databases, such as Elasticsearch, that use their own query strings instead of SQL to provide access and support custom queries [28].

2.7.3 System Integration

A database driver can be used to connect a database to another system. Drivers such as Open Database Connectivity (ODBC) and Java Database Connectivity (JDBC) enable the execution of SQL queries from programming languages such as C, C++, and Java [29], thereby facilitating integration with other systems.

Another way to access a database is through a REST API [22], using HTTP to communicate with the database. Using a REST API introduces an abstraction between the database and the client, allowing the database back-end to be changed without updating the client.
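As a concrete sketch of such a REST interface, Elasticsearch accepts plain JSON over HTTP for both indexing and searching. The host, index name, and document below are assumptions for illustration only.

```python
import json
import urllib.request

ES = "http://localhost:9200"  # assumed local Elasticsearch node

def es_request(method, path, body=None):
    """Send a JSON request to the database's REST API and decode the reply."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    req = urllib.request.Request(ES + path, data=data, method=method,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Index one (hypothetical) sensor reading; the index is created on demand.
doc = {"sensor_id": "sensor-42", "temperature": 21.5,
       "time": "2019-05-29T12:00:00Z"}
print(es_request("POST", "/readings/_doc", doc))

# Query it back through the same HTTP interface.
query = {"query": {"match": {"sensor_id": "sensor-42"}}}
print(es_request("POST", "/readings/_search", query))
```

Because the client only speaks HTTP and JSON, the database behind the /readings endpoints could in principle be swapped without changing client code, which is the abstraction the paragraph above describes.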

* Multiple dialects of SQL exist and they can differ considerably between implementations, but they are all based on the original SQL definition created by Chamberlin and Boyce.


2.8 Related Work

LittleTable is a relational database developed by Cisco Meraki that is specialized for storing time-series data from IoT devices [30]. It is designed around the assumptions of a single writer, append-only inserts, and the ability to retrieve the most recent data again from the IoT device in the event of a crash. This allows for weaker consistency and durability guarantees, which simplifies the implementation. However, there is no support for horizontal scaling; any such requirement is handled by using independent LittleTable instances and external systems to shard the data [30].

Nayak, Poriya, and Poojary discussed different types of NoSQL databases in their paper Type of NOSQL Databases and its Comparison with Relational Databases [31]. In the paper, they describe the four common categorizations of NoSQL databases and give an overview of how they compare to relational databases. They conclude that one of the major drawbacks of NoSQL databases, which has resulted in lower usage than that of relational databases, is the lack of a common query language.

The authors of [32] compare the NoSQL database MongoDB to Microsoft SQL Server on a modest-sized data set to find out whether NoSQL databases are beneficial for data sets smaller than what is referred to as "Big Data". They come to the conclusion that MongoDB performs as well as or better than SQL Server for everything except aggregate queries, where SQL Server is shown to be up to 23 times faster.

In [33], the authors compare 14 different NoSQL databases based on five aspects: data model, query possibilities, concurrency control, partitioning, and replication. They argue that the use case is the dominant factor in determining which database to use, because the various NoSQL databases were developed to solve specific problems that a relational database could not handle efficiently. They also state that the choice of database has to be made based on what types of queries are required, as databases such as key-value stores do not support complex queries.

The developers of InfluxDB explain in their documentation some of the design insights and trade-offs that are specific to time series data such as sensor data [34]. They argue that operations such as delete and update are very rare and thus do not require high performance, and that if records are deleted, it is often over large ranges of old data. They also state that time series data consists mostly of append operations, because the timestamps of new inserts are primarily very recent. Finally, they identify that when it comes to time series data, no single data point is too important; in effect, the main focus is on larger data aggregates and not on individual data points.


Chapter 3 Comparison

The first step in finding a good candidate was to limit the number of databases to choose from. In this report, only a few candidates are selected from the three categories: relational, NoSQL, and time series. The candidates are mainly chosen based on basic suitability for the task, popularity, and longevity of the database*, with some exceptions for promising newcomers. The selected candidates for comparison are: InfluxDB, Elasticsearch, Cassandra, MongoDB, TimescaleDB, and OracleDB.

The following sections compare the databases, starting in Section 3.1 with a list of the selected comparison criteria. After that, each criterion is discussed in turn, ending with a summary and conclusion.

3.1 Comparison Criteria

To find a suitable database for a given project, a set of requirements and a few chosen properties are needed as a basis for comparison. The databases were compared based on the following properties:

• Scalability and Backups

• Maintenance

• Support for new data types

• Query language: how is the data accessed?

• Long term storage

* A strong community, or low risk of being discontinued by the developers.


A system that handles sensor data from a city environment must be able to grow to support an increasing number of connected sensors, and an increasing number of users as the project is integrated into other products. Another aspect is that historical data could be valuable and should be kept in storage indefinitely. As the system grows in size, the required storage space will be ever increasing. Thus, an important aspect is how more storage can be added and whether doing so temporarily disrupts the service.

In order to ensure the longevity of the data and to protect against failures, clear procedures for backup and restoration are important. Although there are technologies such as replica shards that protect against database corruption, they do not protect against user errors such as the accidental deletion of data, or the insertion of data that corrupts the database. While performing full backups every time might be feasible for small-scale databases, incremental backups better support growth and encourage shorter backup intervals.

Part of building a robust system that promotes longevity is to strive for low maintenance requirements and, more importantly, to keep potential downtime to a minimum. Another aspect is flexibility, because there is no broadly adopted standard for how sensor data should be formatted when reported. This means that the way data is represented might change in the future, and a system that can adapt to this without losing the data in the old format has a clear advantage.

While the details of how the data is stored are very important, another consideration is how the data is accessed. What query language is used, and does it support queries that include data processing? When requesting aggregates of the data, it might be more efficient to perform the aggregation in the database than to perform multiple queries and let the caller compute the result.

When considering whether a specific feature is supported by a database, only built-in features are considered fully supported. While some databases might offer extended functionality when combined with other products, that functionality is not considered to be supported by the database itself, but rather a possible extension. If a certain feature is only supported when running in the cloud, it is not considered to be supported by the database, as it depends on cloud features not available when running a self-hosted database.


3.2 Scalability and Backups

NoSQL databases have been the clear choice from the start when great scalability is needed. The main method of scaling is horizontal scaling through the addition of extra nodes.

In a blog post, Netflix demonstrates how Cassandra achieves linear scalability when adding new nodes to the cluster [35]. To scale a Cassandra cluster for read and write operations, it is enough to add more nodes to the cluster, which can be done seamlessly without any downtime [36]. Backups are performed using a snapshot operation that can be run on the whole cluster or on a single node, and after the first snapshot, further backups can be incremental [37].

MongoDB scales through sharding, which is based on a shard key, as explained in the official documentation [38]. According to the documentation, both reads and writes are scaled by adding a new shard. MongoDB can begin on a single machine and implement sharding at a later time to scale horizontally. However, the initial sharding can only take place if the current database size does not exceed certain limits [39].

Multiple options exist to back up the data, and one of them is a built-in tool called mongodump that takes a backup of a cluster node. Using this method, a backup of each cluster node has to be performed manually while the load balancing feature is disabled [40]. A drawback of this approach is that it only backs up the documents and not the index data. The recommended approach is to use the snapshot features of the underlying file system to back up the database [40]. Another approach is to use MongoDB Cloud Manager, which can automatically back up clusters running both in the cloud and on local infrastructure using incremental backups [41].

TimescaleDB supports clustering for replication of data and for sharding read queries. However, scaling out to multiple nodes to support higher write workloads is currently unsupported [42]. Using replication, the full data set is available on each replica, and in the case of a failure of the primary node, one of the slaves can take over*. However, only manual failover is available natively, and a third-party solution is required for automatic failover [43]. To protect against user errors that could result in data loss, such as the accidental deletion of data, it is also possible to back up the database using the built-in backup functions from PostgreSQL [42].

* If the primary node fails, the time to elect a new master might result in unavailability for new writes and thus data loss.


OracleDB can be scaled to support more read and write operations as well as larger storage, using different techniques. Oracle Real Application Clusters (Oracle RAC) enables horizontal scaling so that multiple database instances share the same storage and can thus handle more read and write operations, provided that the bottleneck is CPU and Random Access Memory (RAM) [44]. Because all the instances share the same database storage, the bottleneck could still be disk access, and the database storage back-end remains a single point of failure. In order to utilize horizontal scaling for extended storage, Oracle Exadata can be used [45]. However, a drawback of both Oracle RAC and Oracle Exadata is that they are quite expensive, and the cost might not always be justifiable for the collection of sensor data [46][47]. To protect against data loss, a utility called RMAN can be used to perform incremental backups and restore operations [48].

InfluxDB scales through clustering, by adding additional nodes. Adding new nodes to the cluster can help scale both read and write operations, and increase the available disk space [49][50]. The clustering feature is only available in the enterprise edition, which means that if clustering might be required in the future, the enterprise version should be used from the start instead of starting with the free version. InfluxDB supports both full and incremental backups of the database to recover from data loss [51].

Elasticsearch scales through clustering, by adding additional nodes and utilizing sharding. Different types of nodes can be added to the cluster to improve read and write performance, and replica shards can be created and distributed among the nodes to keep copies of the data [52]. There is also a snapshot API that can be used to back up the entire cluster, to protect against catastrophic failures that cannot be recovered from by replica shards. The snapshot API takes a full backup of the cluster the first time, and incremental backups on subsequent calls [53]. An Elasticsearch cluster can be created on a single node and then expanded as required through the introduction of additional nodes [52].

While backups and replica shards protect against losing saved data, there is also the aspect of losing future data if the cluster crashes, preventing new data from being written. A solution could be to write the data to multiple locations on each insertion. The data ingestion system could write every piece of data both to the database and to some cheap long-term storage solution that is not necessarily optimized for fast queries. In the event that the database becomes temporarily unavailable for new insertions, the data would still be written to the other storage solution. Later, when the database is available again, it could ingest whatever data it missed from the secondary storage.
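A minimal sketch of this dual-write idea is shown below; the database endpoint is assumed, and a local append-only file stands in for the cheap secondary storage.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200/readings/_doc"  # assumed database endpoint
ARCHIVE = "readings-archive.jsonl"              # cheap append-only storage

def post_to_db(reading: dict) -> None:
    """Send one reading to the database over its HTTP interface."""
    req = urllib.request.Request(
        ES_URL, data=json.dumps(reading).encode("utf-8"),
        headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req, timeout=5)

def ingest(reading: dict) -> None:
    """Append to the archive first, then try the database."""
    with open(ARCHIVE, "a", encoding="utf-8") as f:
        f.write(json.dumps(reading) + "\n")
    try:
        post_to_db(reading)
    except OSError:
        pass  # database temporarily unavailable; the archive still has the data

def replay(start_line: int = 0) -> None:
    """Re-ingest archived readings the database missed while it was down."""
    with open(ARCHIVE, encoding="utf-8") as f:
        for i, line in enumerate(f):
            if i >= start_line:
                post_to_db(json.loads(line))
```

Writing to the archive before the database means no reading is ever lost to a database outage; the trade-off is that duplicates must be handled on replay, for example by using deterministic document identifiers.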

3.3 Maintenance

Regardless of which database is in use, some form of maintenance will likely be required. Typical maintenance tasks are upgrading to the next version of the database, recalculating or compacting the indices, extending disk storage, and updating a schema. How these tasks are performed, and how downtime is minimized, varies among the different databases.

With MongoDB, most maintenance tasks can be performed without downtime when running a cluster. A MongoDB cluster has a collection of nodes with one elected as the master, and each secondary node can be taken offline for maintenance and then reintegrated into the cluster [54]. Once all the secondary nodes are done, a manual failover can be performed to elect a new master, which allows the previous master to be taken down for maintenance.

Like MongoDB, InfluxDB supports maintenance tasks such as cluster upgrades without downtime, by updating one node at a time [51]. However, it is still recommended to schedule a maintenance window for an offline upgrade.

Cassandra's documentation [55] describes two maintenance tasks, repair and compaction, that should be performed regularly to keep the database healthy. Repair is used to synchronize missed writes to a cluster node that has been temporarily offline, in order to enforce data consistency. Compaction is performed to remove expired data and free up disk space; this task can be automated in Cassandra by enabling the autocompaction option. Cassandra also supports version upgrades without downtime in a cluster, by updating one node at a time [56].

Some of the primary maintenance tasks for Elasticsearch are rotating indices, deleting old indices, and upgrading the cluster to newer versions. The index maintenance tasks can be automated by configuring Elasticsearch to automatically rotate the index, and by using a tool called Curator [57] together with a cron job to regularly move or delete old indices. Index maintenance can also be automated directly in Elasticsearch using a pipeline and an index lifecycle management policy [53]. Thus, the only manual maintenance required is to upgrade to new versions, and this can be done without downtime by updating one node at a time.


TimescaleDB does not currently support cluster deployments fully, so any maintenance task that requires a restart, such as an upgrade, has to be performed during a scheduled downtime. As TimescaleDB is based on PostgreSQL [58], the same maintenance tasks apply, but most of them can be automated using external tools such as a cron job [59].

Oracle provides instructions to help minimize downtime when performing planned maintenance. Features such as online patching and Automatic Storage Management [60] allow some of the common tasks to be performed with no downtime [61]. Other maintenance tasks, such as updating statistics and database tuning, can be fully automated [62].

3.4 Support for New Data Types

Building a system to support the collection of sensor data with complex data types, and one that is expected to grow, requires a database that can be updated to handle both new types of sensors and new types of readings. A sensor that measures a new type of reading might be added to the system, and the database should support this without needing to be recreated and without severe disruption to operations. When considering whether a database supports new data formats, a new format should not lead to performance degradation.

InfluxDB organizes data into tags for metadata and fields for measured values, and only the tags are indexed. The schema cannot be created in advance; it is automatically created based on the inserted data. Once the schema has been created, it cannot be updated with new tags. Thus, if a new sensor that reports new fields or tags is introduced to the system, its data will automatically be inserted into a new series [63].

Elasticsearch is a search engine designed to index JSON objects, which means that it does not depend on any fixed schema other than the use of JSON formatting [64]. Elasticsearch stores the raw data and indexes everything* with the help of mapping templates [66]. According to the documentation, Elasticsearch uses a dynamic mapping template by default, which means that new fields are automatically added and their data types guessed [66]. It further specifies that a dynamic mapping template is applied to new indices, and can be configured to specify the existence and types of a subset of the fields, or of all of them. Thus, when a sensor reports a new type of field, it will automatically be discovered and handled.

* Elasticsearch can be configured explicitly not to index certain fields [65].


To change the recognized type, or the properties, of a dynamically mapped field, the mapping template can be updated so that the changes apply at the next index rotation.
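As an illustration, the sketch below registers a hypothetical index template that maps any string field whose name ends in _id as an exact-value keyword, while leaving all other fields to dynamic mapping. The request follows the legacy _template API, and the exact format varies between Elasticsearch versions.

```python
import json
import urllib.request

# Hypothetical template for indices named readings-*; the request shape
# follows the Elasticsearch 7.x _template API and may differ by version.
template = {
    "index_patterns": ["readings-*"],
    "mappings": {
        "dynamic_templates": [
            {
                "ids_as_keywords": {
                    "match": "*_id",
                    "match_mapping_type": "string",
                    "mapping": {"type": "keyword"},
                }
            }
        ]
    },
}

req = urllib.request.Request(
    "http://localhost:9200/_template/readings",
    data=json.dumps(template).encode("utf-8"),
    method="PUT",
    headers={"Content-Type": "application/json"},
)
print(urllib.request.urlopen(req).read())  # {"acknowledged": true}
```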

Cassandra supports the addition of new columns by altering the schema before insertion [27]. Because each row can have a different set of columns, a new sensor with different measurements will simply use another set of columns.

MongoDB is a document store and supports different structures for each document. There are multiple ways to store sensor data in a document store, but one alternative is to create one document per sensor per measurement. With this approach, the addition of new sensor types will simply result in differently structured documents, and it will not affect the other measurements, as they are in separate documents and each document is independent.

TimescaleDB supports storing semi-structured data using a binary JSON format. Using this format, only the fields that are mandatory for every sensor, such as identifier and timestamp, are defined as columns; the rest are stored as a binary JSON object [42]. This means that a new sensor can report any type of measurement as long as it also reports the mandatory sensor identifier and timestamp, thereby enabling the automated addition of new sensor types to the system. However, since the remaining fields are stored as a JSON object in a binary format, the whole object has to be deserialized and reconstructed just to access a single field. The JSON object can be indexed either using a GIN index, or by indexing individual fields [42]. A GIN index will only optimize certain queries that inspect top-level fields in the JSON object, which might negatively affect performance depending on the structure of the collected sensor data. The alternative is to index individual fields, but this is limited to indexing only the fields that are common to all JSON objects [42].
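A sketch of this layout, assuming a running TimescaleDB instance, the psycopg2 driver, and hypothetical table and field names:

```python
import json
import psycopg2

# Assumed connection parameters for a local TimescaleDB/PostgreSQL instance.
conn = psycopg2.connect("dbname=sensors user=postgres host=localhost")
cur = conn.cursor()

# Mandatory fields as columns; everything else in a binary JSON column.
cur.execute("""
    CREATE TABLE IF NOT EXISTS readings (
        time      TIMESTAMPTZ NOT NULL,
        sensor_id TEXT        NOT NULL,
        data      JSONB
    );
""")
# Turn the table into a TimescaleDB hypertable partitioned on time.
cur.execute("SELECT create_hypertable('readings', 'time', if_not_exists => TRUE);")
# A GIN index speeds up containment queries on top-level JSON fields.
cur.execute("CREATE INDEX IF NOT EXISTS readings_data_idx "
            "ON readings USING GIN (data);")

cur.execute(
    "INSERT INTO readings (time, sensor_id, data) VALUES (now(), %s, %s);",
    ("sensor-42", json.dumps({"temperature": 21.5, "unit": "Cel"})),
)
conn.commit()
```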

OracleDB supports storing time series data using a special schema described in [67]. However, the schema is created before data insertion and is very strict: sensor reports that do not match the predefined columns will not be inserted into the table. A way to store complex data is to use the same approach as TimescaleDB and store it as JSON objects. However, as with TimescaleDB, using JSON to store data imposes some restrictions, and performance might not equal that of storing the data using less complex data types [68].


3.5 Query Language

It is important that the data can be stored efficiently and that the system can be scaled as needed, but it is also important that the data can be queried properly to make use of the information. When collecting data from various sensors, it might be hard to know what the data will be used for in the future, or what types of queries might be run against it. Thus, a flexible solution that does not require advance knowledge in order to optimize for certain queries could be better.

The importance of this criterion depends on how the database will be used. Some query languages support powerful queries that put the heavy load on the database, thereby enabling the use of weak clients that cannot aggregate the data themselves. Another aspect is that a well-known language such as SQL might be important if various applications should communicate with the database directly. On the other hand, even a language supporting powerful queries might not be enough. In that case, an additional service would have to be implemented that queries the database, performs the heavy operations, and exposes whatever interface is best suited for the clients.

InfluxDB implements a query language called the Influx Query Language (InfluxQL) that is used for data exploration. It is designed to look and feel similar to SQL, but it omits some relational-database-specific features such as table joins, and instead implements some new features useful for time series data [51]. One of the specific features of InfluxQL is continuous queries, which enable the creation of automated periodic queries. A typical use case is to periodically run a query that calculates some aggregate of the data inside a sliding time window and inserts the result into a new table. As mentioned in Section 3.4, InfluxDB splits data into tags and fields, and only the tags are indexed. This means that upon data insertion, the user should already know which part of the data will be used to formulate future queries, since the tags cannot be changed without deleting the data and reinserting it with new tags. While it is still possible to query based on fields, this comes with a performance impact, as all the data entries must be searched without an index.

Elasticsearch implements a custom query language called the Query Domain Specific Language (Query DSL) that uses a JSON-style format and enables a wide range of queries to be performed on the data to find specific information [53]. Query DSL supports, among other things, aggregate queries such as calculating the average, sum, or percentiles of specific fields, as well as finding min and max values across collected data points. Elasticsearch also supports a subset of SQL that can be used to read data using common SQL SELECT statements [69].
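For example, an average can be computed in the database itself with an aggregation query. The sketch below assumes a local node and a hypothetical readings index with time and temperature fields.

```python
import json
import urllib.request

# Ask Elasticsearch for the average temperature over the last hour;
# the index name and field names are hypothetical.
body = {
    "size": 0,  # no individual hits, only the aggregate
    "query": {"range": {"time": {"gte": "now-1h"}}},
    "aggs": {"avg_temp": {"avg": {"field": "temperature"}}},
}

req = urllib.request.Request(
    "http://localhost:9200/readings/_search",
    data=json.dumps(body).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
resp = json.load(urllib.request.urlopen(req))
print(resp["aggregations"]["avg_temp"]["value"])
```

Only the single aggregated value crosses the network, which is exactly the scenario where pushing the computation to the database helps weak clients.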

Cassandra implements its own query language called the Cassandra Query Language (CQL), which is similar to SQL but lacks certain features. Some of the differences from SQL are that CQL does not support joins, nested queries, or transactions, and it does not support logical operators such as OR and NOT [27]. Queries can be filtered using the WHERE clause, but only on columns that are either part of the primary key or a secondary index [27]. Thus, future types of queries might not be trivial to perform unless the primary key is used to select the data.

MongoDB implements its own rich query language that can be used to extract information from stored documents. It has built-in support for finding data based on constraints, as well as aggregation methods to calculate sums, averages, and min and max values over documents [40].

OracleDB implements SQL, which brings the whole range of powerful queries available to relational databases. SQL has many built-in functions that can be applied to the data, such as calculating sums and averages and finding minimum and maximum values. However, some statistics, such as the median, have no dedicated aggregate function in standard SQL and may have to be expressed through inverse distribution functions or complex nested queries, which can impact performance.
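In Oracle, for instance, a median can be computed with the PERCENTILE_CONT inverse distribution function. The query below is a sketch against a hypothetical measurements table:

    SELECT sensor_name,
           AVG(val) AS avg_val,
           PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY val) AS median_val
    FROM measurements
    GROUP BY sensor_name;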

Indices are important for good query performance, and they are created together with the schema before data insertion. Relational databases store data across multiple tables connected by relations to avoid, among other things, duplicated data. How the data is distributed across tables, and which columns are indexed, determines which queries can be run efficiently.

TimescaleDB, like OracleDB, implements SQL, which enables a wide range of queries to be run on the data. It also adds some extra functions to support more advanced analytics, such as median and percentile calculations and histograms. Because TimescaleDB is based on PostgreSQL, it likewise requires the database administrator to create appropriate schemas and indices before data insertion to support efficient queries.
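One example of such an extra function is time_bucket, which groups rows into fixed time intervals. The query below is a sketch assuming a hypothetical hypertable named measurements with a "time" column and a "val" column:

    SELECT time_bucket('1 hour', time) AS bucket,
           avg(val) AS avg_val
    FROM measurements
    GROUP BY bucket
    ORDER BY bucket;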


3.6 Long Term Storage

Regularly collecting sensor data from hundreds of thousands to millions of nodes is likely to consume storage space rapidly. All of the databases examined in this report support some form of data retention policy that dictates how long inserted data is kept before it is removed from the database. Storage can be scaled to support longer retention periods, but there is a trade-off between keeping data for a long time and keeping the system fast.

Elasticsearch supports time-based indices, where new indices can be created regularly to keep the index size manageable. The documentation describes an index lifecycle in which an index transitions through hot, warm, cold, and delete stages [53]. When an index is created it is in the hot stage, with heavy read and write activity. Once a new index has been created and data is no longer inserted into the old index, the old index can be moved to the warm stage, where queries are still fast but the data is read-only. The next step is to move the index to the cold stage once it is seldom queried and slower query response times are acceptable. After a certain amount of time the index can be deleted from the cold stage if it is no longer relevant, or archived to some long term storage solution. If it is archived, it can later be re-imported to support queries.
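Recent Elasticsearch versions can automate these transitions through index lifecycle management (ILM) policies. The policy below is a sketch with hypothetical names and ages: indices roll over after 30 days, then become read-only, and are deleted after a year:

    PUT _ilm/policy/sensor_data_policy
    {
      "policy": {
        "phases": {
          "hot": {
            "actions": { "rollover": { "max_age": "30d" } }
          },
          "warm": {
            "min_age": "30d",
            "actions": { "readonly": {} }
          },
          "delete": {
            "min_age": "365d",
            "actions": { "delete": {} }
          }
        }
      }
    }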

MongoDB has a feature called tag-aware sharding that makes it possible to bind ranges of data to specific shards, and thereby to specific machines [70]. This could be used to tag shards so that new documents are placed on a fast machine and migrated to a slower machine as they age. This enables fast queries of the most recent data, but a drawback is that the cutoff date between new and old documents must be manually specified and regularly updated.
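A sketch of such a setup in the MongoDB shell is shown below; the shard names, the namespace "greeniot.measurements", the timestamp shard key "ts", and the cutoff date are all hypothetical:

    // Tag one shard for recent data and one for archived data.
    sh.addShardTag("shard0", "recent")
    sh.addShardTag("shard1", "archive")

    // Documents with ts before the cutoff go to the archive shard...
    sh.addTagRange("greeniot.measurements",
                   { ts: MinKey }, { ts: ISODate("2019-01-01") }, "archive")

    // ...and newer documents go to the fast shard. The cutoff date
    // must be updated manually as time passes.
    sh.addTagRange("greeniot.measurements",
                   { ts: ISODate("2019-01-01") }, { ts: MaxKey }, "recent")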

In a whitepaper, Oracle describes how table partitioning can be used to move less frequently accessed data to slower storage [71]. This could be used in OracleDB to move old data to long term storage and make room for new data on the fast storage.
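The sketch below illustrates the idea with a range-partitioned table whose partitions are placed in different tablespaces; the table, column, and tablespace names are hypothetical:

    CREATE TABLE measurements (
        sensor_name  VARCHAR2(128),
        ts           TIMESTAMP,
        val          NUMBER
    )
    PARTITION BY RANGE (ts) (
        PARTITION p_2018 VALUES LESS THAN (TIMESTAMP '2019-01-01 00:00:00')
            TABLESPACE slow_storage,
        PARTITION p_current VALUES LESS THAN (MAXVALUE)
            TABLESPACE fast_storage
    );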

None of the other databases has a dedicated feature to facilitate long term storage through separation of recent and old data. Cassandra has a feature to compress the data and reduce storage by up to 33%, but it is only suitable when the inserted data uses the same fields or columns [72]. Sensor data, which might vary greatly in what types of fields are reported, could thus be hard to compress. Both InfluxDB and TimescaleDB also support data compression to reduce storage requirements. However, it is unclear how effective the compression is on non-uniform data.


3.7 Summary

Table 3.1 contains a summary of how the databases match most of the chosen criteria; some criteria, such as maintenance effort and support for complex queries, do not fit in the table. In the use case for this project, described in Section 1.4, the main focus is on scalability, support for new data structures, and long term storage of the collected data. For this use case Elasticsearch was the top candidate because it offered the greatest flexibility in all of these areas.

Table 3.1: Database comparison

                        InfluxDB   Elasticsearch   Cassandra   MongoDB         TimescaleDB   OracleDB
   Scale Read*          Yes        Yes             Yes         Yes             Yes           Yes
   Scale Write*         Yes        Yes             Yes         Yes             No            Yes
   Scale Storage*       Yes        Yes             Yes         Yes             No            Exadata
   Incremental Backups  Yes        Yes             Yes         Cloud Manager   No            Yes
   Query Language       InfluxQL   Query DSL,      CQL         JavaScript      SQL           SQL
                                   Partial SQL
   Add new data types   Yes        Yes             Yes         Yes             No            No
   Long term Storage    No         Yes             No          Yes             No            Yes

   * Scaling here means scaling out, without imposing restrictions or performance degradation.


Chapter 4

Prototype development

This chapter explains the approach taken to implement a prototype system that stores sensor data using Elasticsearch, henceforth referred to as elastic, as the database back-end. The prototype was created to handle sensor data generated by the GreenIoT project, which uses the Sensor Measurement Lists (SenML) message structure [73].

The prototype was developed over three iterations, each adding new functionality. The first iteration implemented only the basic functionality to parse and insert data into a single-node elastic cluster, without any support for scaling or long term storage. The second iteration added scalability, support for long term storage using index rotation, and unique identifiers to enable correlation between a document stored in elastic and the original message stored elsewhere. The last iteration moved to a three-node cluster, added a data retrieval interface, and created an automated deployment method.

4.1 First Iteration

Elastic can be installed directly on a server or in a containerized environment. To facilitate migration to other hosts, and extension from a single-node to a multi-node cluster running on the same host, elastic was installed in a Docker container. Encapsulating elastic in a Docker container had the benefit of bundling all of its dependencies in an isolated environment, thereby preventing conflicts with other system libraries. In this first iteration, elastic version 6.7 was set up with default settings for a single-node cluster.
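The exact deployment commands are not part of this report, but a single-node cluster of this version can, for instance, be started with the official Docker image as follows:

    docker run -d --name elastic \
        -p 9200:9200 -p 9300:9300 \
        -e "discovery.type=single-node" \
        docker.elastic.co/elasticsearch/elasticsearch:6.7.0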


4.1.1 Data Structure

The first design choice to make was how to internally structure the data. The SenML format used by the GreenIoT project defines a message as a list of measurements (also known as records). A record contains regular fields with values such as the sensor name, the time of measurement, and the measured value.

Multiple representations are defined, one of which is the JSON representation. This representation is used by the GreenIoT project, and it is defined as a JSON array containing multiple records as JSON objects. Elastic stores data as JSON objects and cannot directly ingest a SenML-formatted JSON array [73]. Thus, a SenML message must be restructured into one or more JSON objects before insertion into the database.

According to the SenML specification, a record can also contain base fields (in addition to regular fields), as seen in Listing 1a. These fields start with the letter "b", such as "bn" for base name and "bt" for base time. A base field applies to every subsequent regular field with a corresponding name until it is overridden by a new base field with the same name. Thus, to get the full record, the regular fields first have to be combined with the base fields to form what the SenML specification refers to as resolved records, as seen in Listing 1b.

The choice was to either store the SenML messages unresolved and let the client retrieve and resolve the records when querying, or to store the measurements in resolved form. A problem with storing the measurements unresolved is that a client would have to retrieve the whole SenML message in order to resolve it, even when interested in only a single field, because a base field specified earlier in the message might have to be applied to the regular field being retrieved. Another drawback is that elastic cannot be used to aggregate the measured values, as they are not correct until they are resolved.

Furthermore, every SenML message could contain multiple definitions of a base field, each applying to subsequent regular fields. A structure for storing unresolved SenML records would thus also have to keep track of the record order, and it would become unnecessarily complex in order to support multiple occurrences of the same base field. Because of these drawbacks, the SenML messages were first resolved and then split into multiple documents before ingestion by elastic.
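The prototype's own code is not listed here, but the resolution step can be sketched in Python as below. The sketch handles only the base fields that occur in Listing 1 ("bn", "bt", and "bu"), not the full SenML specification, and it assumes every record carries a value "v":

    def resolve_senml(records):
        """Resolve SenML records by applying base fields (simplified)."""
        base = {}
        resolved = []
        for record in records:
            # A base field overrides the previous base field with the
            # same name and applies to this and all subsequent records.
            for key in ("bn", "bt", "bu"):
                if key in record:
                    base[key] = record[key]
            resolved.append({
                # The base name is prepended to the regular name.
                "n": base.get("bn", "") + record.get("n", ""),
                # Relative time offsets are added to the base time.
                "t": base.get("bt", 0) + record.get("t", 0),
                # A regular unit overrides the base unit.
                "u": record.get("u", base.get("bu")),
                "v": record["v"],
            })
        return resolved

Applied to the message in Listing 1a, this function produces the resolved records in Listing 1b.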


    [
      {
        "bn":"urn:dev:ow:10a10240b1020085;",
        "bt":1.554098400e+09,
        "bu":"%RH",
        "n":"humidity",
        "v":55.85
      },
      {
        "n":"humidity",
        "t":-5,
        "v":55.80
      },
      {
        "n":"temp",
        "u":"Cel",
        "t":-5,
        "v":18.5
      }
    ]

(a) Original SenML message

    [
      {
        "n":"urn:dev:ow:10a10240b1020085;humidity",
        "t":1.554098400e+09,
        "u":"%RH",
        "v":55.85
      },
      {
        "n":"urn:dev:ow:10a10240b1020085;humidity",
        "t":1.554098395e+09,
        "u":"%RH",
        "v":55.80
      },
      {
        "n":"urn:dev:ow:10a10240b1020085;temp",
        "t":1.554098395e+09,
        "u":"Cel",
        "v":18.5
      }
    ]

(b) Resolved SenML message

Listing 1: Resolving of a SenML message
