
External Streaming State Abstractions and Benchmarking

SRUTHI SREE KUMAR

KTH ROYAL INSTITUTE OF TECHNOLOGY

SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE


Examiner: Professor Seif Haridi

School of Electrical Engineering and Computer Science
Host company: Logical Clocks

Host organization: RISE

Swedish title: Extern strömmande statliga abstraktioner och benchmarking


Abstract

Distributed data stream processing is a popular research area and is one of the promising paradigms for faster and efficient data management.

Application state is a first-class citizen in nearly every stream processing system. Nowadays, stream processing is, by definition, stateful. For a stream processing application, the state is backing operations such as aggregations, joins, and windows. Apache Flink is one of the most accepted and widely used stream processing systems in the industry. One of the main reasons engineers choose Apache Flink to write and deploy continuous applications is its unique combination of flexibility and scalability for stateful programmability, and the firm guarantee that the system ensures. Apache Flink’s guarantees always make its states correct and consistent even when nodes fail or when the number of tasks changes. Flink state can scale up to its compute node’s hard disk boundaries using embedded databases to store and retrieve data. Nevertheless, in all existing state backends officially supported by Flink, the state is always available locally to compute tasks. Even though this makes deployment more convenient, it creates other challenges such as non-trivial state reconfiguration and failure recovery. At the same time, compute, and state are bound to be tightly coupled. This strategy also leads to over-provisioning and is counterintuitive on state intensive only workloads or compute-intensive only workloads. This thesis investigates an alternative state backend architecture, FlinkNDB, which can tackle these challenges. FlinkNDB decouples state and computes by using a distributed database to store the state. The thesis covers the challenges of existing state backends and design choices and the new state backend implementation. We have evaluated the implementation of FlinkNDB against existing state backends offered by Apache Flink.

Keywords

Apache Flink, Distributed Systems, NDB, FlinkNDB, State, State Backends, External State, Stream Processing Systems, Benchmarking, Caching


Sammanfattning (Summary)

Distributed data stream processing is a popular research area and one of the promising paradigms for faster and more efficient data management. Application state is a first-class citizen in nearly every stream processing system. Nowadays, stream processing is, by definition, stateful. In a stream processing application, state backs operations such as aggregations, joins, and windows. Apache Flink is one of the most accepted and widely used stream processing systems in the industry.

One of the main reasons engineers choose Apache Flink for writing and deploying continuous applications is its unique combination of flexibility and scalability for stateful programmability, and the strong guarantees the system provides. Apache Flink's guarantees keep its state correct and consistent even when nodes fail or when the number of tasks changes. Flink state can scale up to the limits of a compute node's hard disk by using embedded databases to store and retrieve data. However, in all state backends officially supported by Flink, the state is always available locally to the compute tasks.

Even though this makes deployment more convenient, it creates other challenges, such as non-trivial state reconfiguration and failure recovery. At the same time, compute and state become tightly coupled. This strategy also leads to over-provisioning and is counterintuitive for workloads that are only state-intensive or only compute-intensive. This thesis investigates an alternative state backend architecture, FlinkNDB, which can tackle these challenges. FlinkNDB decouples state and compute by using a distributed database to store the state. The thesis covers the challenges of existing state backends, the design choices, and the new state backend implementation. We have evaluated the implementation of FlinkNDB against existing state backends offered by Apache Flink.

Nyckelord (Keywords)

Apache Flink, Distributed Systems, NDB, FlinkNDB, State, State Backends, External State, Stream Processing Systems, Benchmarking, Caching


Acknowledgments

First and foremost, I would like to express my gratitude to my supervisors Paris Carbone and Mahmoud Ismail for their continuous guidance and motivation throughout this journey. I am also very thankful to RISE and Logical Clocks for providing this opportunity. Furthermore, I cannot thank my friend and fellow student Haseeb enough for his continuous effort and motivation to complete this work. I am also thankful to my friends, who were always there to motivate me to complete this work, and grateful to Google for providing the research credits used to perform all our experiments. Finally, special gratitude to my family for supporting my decision to join the master's programme. Their encouragement during my studies is much appreciated and acknowledged.

Stockholm, February 2021


Contents

1 Introduction
  1.1 Background
  1.2 Problem
  1.3 Goals
  1.4 Research Methodology
  1.5 Ethics and Sustainability
  1.6 Delimitations
  1.7 Structure of the Thesis

2 Background
  2.1 Data Processing Systems
  2.2 Apache Flink
    2.2.1 Flink Component Stack
    2.2.2 Flink Architecture and Process Model
  2.3 State Management in Apache Flink
    2.3.1 ValueState<T>
    2.3.2 ListState<T>
    2.3.3 MapState<UK, UV>
    2.3.4 ReducingState<T>
    2.3.5 AggregatingState<IN, OUT>
  2.4 Flink State Backends
    2.4.1 Memory State Backend
    2.4.2 File System State Backend
    2.4.3 RocksDB State Backend
  2.5 NDB - Network Database
  2.6 Apache Kafka
  2.7 Benchmarking SPS
    2.7.1 NEXMark
    2.7.2 Apache Beam NEXMark Benchmark Suite

3 Design and Implementation
  3.1 Design Decisions
  3.2 FlinkNDB Design
  3.3 NDB: Decoupled Storage
    3.3.1 Value State
    3.3.2 Map State
    3.3.3 List State
  3.4 Implementation
    3.4.1 Value State
    3.4.2 Map State
    3.4.3 List State
  3.5 Performance Optimization - Caching

4 Results and Analysis
  4.1 Experiment Setup
  4.2 Benchmarking External State without Caching
  4.3 NDW Benchmark
  4.4 Log Processing Pipeline
  4.5 Results of NDW Benchmarking
    4.5.1 Experiment 1
    4.5.2 Experiment 2
    4.5.3 Experiment 3
    4.5.4 Experiment 4
    4.5.5 Experiment 5

5 Conclusions and Future Work

References

A NEXMark Queries


List of Figures

1.1 Reconfiguration in Flink [1]
1.2 Rescaling in Flink [1]
2.1 Generations of Big Data Processing [2]
2.2 Flink Component Stack [3]
2.3 Flink System: Actors and Interactions [4]
2.4 RocksDB Architecture [5]
2.5 RocksDB State Backend Architecture [1]
2.6 NDB Cluster [6]
2.7 NEXMark Entities [7]
2.8 NEXMark Components [7]
3.1 (a) Existing Flink state backend (b) FlinkNDB state backend [1]
3.2 FlinkNDB Initial Architecture
3.3 Activity Diagram
3.4 Value State Schema [1]
3.5 Map State Schema [1]
3.6 List State Schema [1]
3.7 List State Attribute Schema [1]
3.8 FlinkNDB Optimized Architecture [1]
3.9 Activity Diagram with Caching [8]
4.1 NEXMark Results
4.2 Map State Read and Write Operation Time for Query 3
4.3 List State Read and Write Operation Time for Query 3
4.4 Value State Read and Write Operation Time for Query 3
4.5 Map State Read and Write Operation Time for Query 11
4.6 Value State Read and Write Operation Time for Query 11
4.7 NDW Benchmark
4.8 Log Processing Pipeline
4.9 Experiment 1 - Read Time
4.10 Experiment 1 - Write Time
4.11 Experiment 2 - Read Time
4.12 Experiment 2 - Write Time
4.13 Experiment 3 - Read Time
4.14 Experiment 3 - Write Time
4.15 Experiment 4 - Read Time
4.16 Experiment 4 - Write Time
4.17 Experiment 5 - Read Time
4.18 Experiment 5 - Write Time


List of Tables

4.1 Experiment 1 configurations
4.2 Experiment 2 configurations
4.3 Experiment 3 configurations
4.4 Experiment 4 configurations
4.5 Experiment 5 configurations


Acronyms

GCE Google Compute Engine

HDFS Hadoop Distributed File System

JVM Java Virtual Machine

NDB Network Database

NIST National Institute of Standards and Technology

S3 Simple Storage Service

SPS Stream Processing Systems


Chapter 1

Introduction

Big Data and Cloud Computing are two crucial technologies that have entered mainstream information technology. Even though they are not the same, their combination has already proved its capabilities, and most industries have adopted these technologies. Cloud Computing provides the infrastructure, whereas Big Data represents the data or content. According to the official NIST definition [9],

Cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction.

Big Data refers to data that is too big to be processed by existing hardware and that is generated too fast. The significant characteristics of Big Data are represented by the four V's: Velocity, Volume, Variety, and Veracity [10].

With the advent of Cloud Computing, on-demand computing power has become cheaper, which has helped enterprises embrace Big Data. Big Data Processing [11] is a set of techniques or programming models for accessing large-scale data to extract useful information that supports and informs decisions.

Initially, data were processed in batches, but the increasing need for real-time analytics over data led to many stream processing systems. Apache Flink [7] is one of the most powerful and widely used stream processing systems.

Flink is also widely used in production-level systems to build large-scale data analytics and processing components over massive streaming data.


1.1 Background

Production-grade data stream processors such as Apache Flink provide stable and fast end-to-end stateful operations based on distributed snapshots [12], which offer transactional guarantees for the locally maintained state. While this approach provides fast access to data-parallel stateful operations, it limits performance when any form of reconfiguration is needed (e.g., scaling out).

Whenever the system configuration needs to change, large amounts of state must be transferred from external storage to different workers before the system can restart. This can lead to significant delays when jobs scale out or in, proportional to the size and the partitioning granularity of the state that needs to be loaded into the new operators. In this thesis, we: 1. investigate the possibility of a new state backend that decouples state from compute, and 2. perform benchmarking to measure the runtime overhead as well as the reconfiguration benefits of this architecture.

1.2 Problem

Figure 1.1 – Reconfiguration in Flink [1]

Currently, in all existing state backends [13][14] officially supported by Flink, the state is always available locally to compute tasks. While this provides fast access to data-parallel stateful operations, performance suffers when any form of reconfiguration is needed (e.g., scaling out). Whenever the system configuration needs to change, large amounts of state have to be transferred from external storage to different workers before the system can restart, as in Figure 1.1. This can lead to significant delays.

Another common problem with embedded state concerns the rescaling operation. Application logic can be either compute-intensive or state-intensive. If the application logic is compute-intensive, we need more virtual CPUs; however, if we scale out our compute pipeline, we also have to scale out storage together with our compute tasks, as in Figure 1.2.

Figure 1.2 – Rescaling in Flink [1]

Similarly, if the application logic is state-intensive, we need more storage nodes, but we also have to allocate more compute while scaling out storage, as in Figure 1.2. The problem with this approach, especially in the compute-intensive case, is that we also need to re-shuffle and move state around every time we rescale, which is an expensive operation.

1.3 Goals

The goal of this thesis is to explore the possibilities of decoupling storage and compute for Apache Flink. The entire thesis can be further divided into four sub-goals:

1. Implement a new state backend for Apache Flink which externalizes the state from local stream compute nodes while maintaining the same consistency guarantees provided by the underlying, orthogonal snapshotting mechanism.

2. Implement a benchmarking tool to measure the performance of the newly implemented state backend.


3. Benchmark the performance of our new state backend against existing embedded state backends.

4. Optimize the performance of the implementation based on the benchmarking results.

1.4 Research Methodology

This thesis aims to explore the possibility of externalizing the state from the local compute nodes in Apache Flink. Achieving this goal involves two critical phases: 1. implementing a new state backend that adheres to this requirement, and 2. benchmarking the new implementation against the existing ones to verify the results. The author has adopted an exploratory research methodology to investigate the research question. Quantitative data collection is also performed to compare the results and verify how successful the approach is.

1.5 Ethics and Sustainability

There are no ethical concerns related to this project. Most ethical issues in this area relate to the data used for processing and testing. No personal data has been obtained or used for any processing throughout this project; the data we have used for testing and benchmarking is either randomly generated or verified to contain no sensitive information.

Furthermore, we have not used any data that might cause security risks to individuals or organizations, and references have been provided for all previous work utilized throughout this project.

With the existing state backend implementations, compute and storage are tightly coupled, which means that even if we only need to scale compute or only storage, we have to scale out both, leading to resource underutilization. Our proposed implementation decouples compute and storage, thus eliminating this resource wastage: we can scale out components (storage or compute) selectively instead of having to scale the entire system. Our proposed solution also reduces rescaling and reconfiguration time, giving a cost-efficient and energy-efficient data processing system.


1.6 Delimitations

This thesis does not cover fault tolerance and recovery of the system, nor its reconfiguration and re-scaling. These are covered and implemented in the parallel thesis work FlinkNDB: Guaranteed Data Streaming Using External State [8].

1.7 Structure of the Thesis

The thesis consists of five chapters, organized as follows:

• Chapter 2 provides the theoretical background for this work. We introduce the core concepts and technologies used in this thesis research. We first introduce Apache Flink and its architecture, then cover state management concepts and the existing state backends of Apache Flink. We then cover details of NDB, which is the storage engine in the new state backend. Finally, the chapter covers the approaches and tools for benchmarking stream processing systems.

• In Chapter 3, we propose the design of the new state backend. The design is followed by its implementation, along with the database schema. The chapter also covers the performance optimization techniques that were added to the initial implementation.

• Chapter 4 explains the implementation of the tool we developed to benchmark our implementation against the existing frameworks, including the data generator used to produce data for the experiments. It also presents the results of the different benchmarks and the experiments we performed to evaluate the existing state backends against the new state backend implemented as part of this thesis.

• Chapter 5 concludes the work with a summary of the thesis and the new state backend's performance. It also includes possible improvements and extensible features for future work.


Chapter 2

Background

Big data processing has been an active research area for the past few decades. Initially, data were processed in batches, with the data set known before processing. Later, the need for real-time data analytics led to stream processing, where data is produced continuously and processed on the fly.

2.1 Data Processing Systems

The growth of big data processing is generally divided into four generations, as depicted in Figure 2.1.

Figure 2.1 – Generations of Big Data Processing [2]


Apache Hadoop is considered the first-generation system. It provides an open-source implementation of MapReduce [15] and introduced the concepts of Map and Reduce to distributed data processing. Apache Hadoop focuses on batch data: the MapReduce model reads input data from disk, maps a function across the input data, reduces the map results, and stores the results back to disk. One of the significant drawbacks of Apache Hadoop is that it involves many disk operations.

Hadoop was followed by a wide variety of frameworks that introduced incremental improvements over Hadoop; these are considered the second generation. Tez, which is prominent among the second generation, introduced the concept of interactive programming in addition to batch processing.

Apache Spark is considered the third-generation tool for big data analytics. It is a unified model for both batch and stream processing, and it also supports iterative processing, which is helpful for machine learning. The RDD (Resilient Distributed Dataset) [16] is the core abstraction of Spark. Spark is faster than MapReduce as it performs in-memory computation and optimizes processing. It also provides high-level APIs in Java, Scala, and Python.

Apache Flink [17] is the next-generation, or fourth-generation, stream processing framework, providing real-time stream processing compared to the other frameworks. DataSet and DataStream are the core APIs of Apache Flink for batch and stream processing, respectively. Flink also supports other high-level APIs such as the Table API and SQL, as well as iterative processing and stateful stream processing computations. Among the significant challenges in scalable data stream processing [18] are fault tolerance and scalable state management; Apache Flink handles these challenges efficiently. The following sections cover Apache Flink in detail.

2.2 Apache Flink

Apache Flink can be defined as a distributed data processing engine and a scalable data analytics framework with near real-time analytical capabilities.

On top of this, Flink also provides scalable, distributed, fault-tolerant, and stateful stream processing capabilities, making it a popular choice across industries. Flink is widely used in production-level systems to build large-scale data analytics and processing components over massive streaming data.

Apache Flink is a pioneering data processing framework that supports both batch (bounded data) processing and stream (unbounded data) processing.

A bounded data stream has a defined start and an end. While processing a bounded stream, we can ingest the entire data set before starting any computation; hence, we can perform operations such as sorting and data summarization. An unbounded stream has a start but no end, so the data needs to be processed continuously as it is generated. Data is processed based on the event time (the time when each event occurred on its producing device). In the following sections, we describe the significant aspects of the Flink architecture.

2.2.1 Flink Component Stack

Flink is a layered system where different layers are stacked on top of each other, and each layer increases the level of abstraction. Figure 2.2 shows the breakdown of the Flink software stack, from the high-level domain-specific libraries down to the runtime.

Figure 2.2 – Flink Component Stack [3]

Users can deploy Flink locally on a single node, in a cluster, or using cloud providers such as GCP. In the local setup, Flink runs on a single machine for basic testing and experimentation. Flink supports different cluster deployments such as standalone, YARN, and Mesos, and it also supports different cloud providers such as Amazon and Google.


The runtime layer receives the program in the form of a JobGraph. This layer supports most of the core functionality, such as distributed stream processing, the mapping of a JobGraph to an ExecutionGraph, and scheduling. It also provides essential services to the upper API layer.

The core API of Flink provides a set of abstract types such as DataStream and DataSet. The DataSet API handles data at rest (bounded streams). These input datasets are created, before processing starts, from data sources (e.g., by reading text or CSV files). The DataSet API allows users to apply different transformations to data sets, such as filtering, mapping, joining, and grouping; a transformation converts the input dataset into a new dataset, and programs can combine multiple transformations. The output of a transformation is emitted via a sink (e.g., writing data to text or CSV files).

The DataStream API handles transformations on a continuous stream of data. Data streams are created from various sources such as message queues and socket streams. To process the continuous data stream, the API provides various transformations such as filtering, updating state, defining windows, and aggregating. As with the DataSet API, the output of a transformation is emitted via a sink. The code snippet below, in Java, obtains an execution environment, reads a data stream from a socket, applies a DataStream transformation, and writes to a sink.
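A minimal sketch of such a pipeline follows; the host, port, transformation, and output path are illustrative assumptions.

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SocketPipeline {
    public static void main(String[] args) throws Exception {
        // Obtain the execution environment.
        final StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Source: read a text stream from a socket (host and port are illustrative).
        DataStream<String> lines = env.socketTextStream("localhost", 9999);

        // Transformation: normalize every line to upper case.
        DataStream<String> upperCased = lines.map(String::toUpperCase);

        // Sink: write the transformed stream to a text file (path is illustrative).
        upperCased.writeAsText("/tmp/output");

        // The pipeline is built lazily and only runs once execute() is called.
        env.execute("Socket Pipeline Example");
    }
}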

The libraries layer comprises high-level libraries built on top of Flink, each dedicated to different functionality. FlinkCEP is the Complex Event Processing (CEP) library used to detect event patterns in a stream of events. The Table and SQL APIs are two relational APIs: the Table API allows the composition of queries from relational operators, whereas the SQL API is based on Apache Calcite, which implements the SQL standard. Gelly is Flink's API for graph processing.

2.2.2 Flink Architecture and Process Model

Three runtime components manage a Flink cluster: the JobClient, the JobManager, and the TaskManager. The JobManager manages resource allocation and distributes jobs among the available TaskManagers. The TaskManagers perform the computations and update the JobManager on the progress of the tasks.

Flink uses the Akka framework [19] for its distributed communication. Akka is an actor-based system used to develop concurrent, fault-tolerant, and scalable applications. Each actor in Akka is considered independent, and actors communicate with each other asynchronously. Figure 2.3 illustrates the different actors in a Flink system and their interactions.

Figure 2.3 – Flink System : Actors and Interactions [4]

Job Client

The client module is the user-facing component of the system. The JobClient fulfills multiple responsibilities, including job submission, communication with the JobManager, and querying and receiving the status of the currently running job. The user program received by the JobClient is compiled and optimized into a logical graph representation that can be submitted to the JobManager for execution.

Job Manager

The JobManager is the central process responsible for Flink job execution. It takes care of physical translation, resource allocation, and task scheduling, and allocates the work across the different TaskManagers.

Task Manager

The TaskManager is responsible for the actual execution of a task. Upon receiving a task from the JobManager, the TaskManager spawns a thread that executes it, and state updates are sent back to the JobManager by the TaskManager. Tasks are deployed lazily to the TaskManager, i.e., a task that depends on the output of another task is only deployed after the task it depends on has completed.

2.3 State Management in Apache Flink

State management and state backends are among the significant features of Apache Flink, as of any other stream processing system. Flink supports both stateful and stateless computation. A stateful operation remembers information across multiple events; the most commonly used operations, such as windows and joins, rely on state. Storing state is also essential for making a Flink application fault-tolerant using checkpoints and savepoints. State further allows rescaling and reconfiguration of the application, as Flink takes care of redistributing the state across the instances. Flink provides different state backends that define how and where the state is stored.

A keyed state is bound to a key and is used on a keyed stream. Keyed state in Flink is organized into key groups. Key groups are the atomic unit by which Flink can redistribute keyed state, and hence there are as many key groups as the specified maximum parallelism. During execution, each parallel instance of a keyed operator works with the keys of one or more key groups.

In Flink, a keyBy() transformation is used to transform a data stream into a keyed stream. Keyed state exists in two forms: raw and managed.

Raw state is seen as raw bytes by Flink, which knows nothing about its data structure; operators keep this state in their own data structures. Managed state is represented by data structures controlled by the Flink runtime. Using managed state is recommended, because Flink can then automatically redistribute the state when the parallelism is changed, and it also has better memory management.

2.3.1 ValueState<T>

A ValueState keeps a value that can be updated and retrieved. As it is a keyed state, the value is scoped to the key of the input element, and hence there is exactly one value for each key that the operation sees. This state interface provides two methods: the value can be updated using update(T) and retrieved using T value().
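As a concrete sketch of this interface, the function below counts events per key; the class name, input type, and state name are illustrative assumptions, while update(T) and value() are the methods described above.

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

// Counts events per key; the state handle is registered once in open().
public class CountPerKey extends RichFlatMapFunction<Tuple2<String, Long>, Tuple2<String, Long>> {

    private transient ValueState<Long> count;

    @Override
    public void open(Configuration parameters) {
        count = getRuntimeContext().getState(
                new ValueStateDescriptor<>("count", Types.LONG));
    }

    @Override
    public void flatMap(Tuple2<String, Long> input,
                        Collector<Tuple2<String, Long>> out) throws Exception {
        // value() returns null the first time a key is seen.
        Long current = count.value();
        long updated = (current == null) ? 1L : current + 1L;
        count.update(updated);  // persist the new count for the current key
        out.collect(Tuple2.of(input.f0, updated));
    }
}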


2.3.2 ListState<T>

A ListState keeps a list of elements, where T is the type of the values that the list state holds. Values can be appended to the list, and an Iterable over the current list can be retrieved. Elements are added using add(T) or addAll(List<T>), the Iterable can be retrieved using Iterable<T> get(), and the existing list can be replaced with update(List<T>). The code snippet below, in Java, shows an example of list state.
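A sketch of such usage follows: a keyed function that buffers values and flushes them in groups of five. The class name, threshold, and state name are illustrative assumptions; add(T), get(), and clear() are the interface methods, and update(List<T>) could equally be used to replace the stored list.

import java.util.ArrayList;
import java.util.List;

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

// Buffers values per key and emits the buffered list once five elements have arrived.
public class BufferPerKey extends RichFlatMapFunction<Tuple2<String, Long>, List<Long>> {

    private transient ListState<Long> buffer;

    @Override
    public void open(Configuration parameters) {
        buffer = getRuntimeContext().getListState(
                new ListStateDescriptor<>("buffer", Types.LONG));
    }

    @Override
    public void flatMap(Tuple2<String, Long> input, Collector<List<Long>> out) throws Exception {
        buffer.add(input.f1);              // add(T): append to the per-key list

        List<Long> elements = new ArrayList<>();
        for (Long v : buffer.get()) {      // get(): Iterable over the current list
            elements.add(v);
        }
        if (elements.size() >= 5) {
            out.collect(elements);
            buffer.clear();                // drop the buffered elements for this key
        }
    }
}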

A ListState can be a keyed list state or an operator list state. When it is a keyed list state, the key is automatically supplied by the system, so the function always sees the value mapped to the current element's key; in this way, the system can handle stream and state partitioning consistently together. When it is an operator list state, the list is a collection of state items that are independent of each other and eligible for redistribution across operator instances when the operator parallelism changes.

2.3.3 MapState<UK, UV>

A MapState stores a set of key-value mappings whose elements can be added, updated, and retrieved. Elements are added using put(UK, UV) or putAll(Map<UK, UV>). isEmpty() can be used to check whether the map contains any key-value mappings, whereas contains(UK) checks whether a given mapping exists. The value associated with a user key is retrieved using UV get(UK), and iterable views over the mappings, keys, and values are available via entries(), keys(), and values(), respectively. remove(UK) deletes the mapping for a given key. All of this state can be accessed and modified by user functions and is checkpointed consistently by the system as part of the distributed snapshots. All the state types mentioned above also have a clear() method that removes the value mapped under the current key.
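A brief sketch of these methods inside a keyed rich function; the state name, the "clicks" key, and the types are illustrative assumptions, and the surrounding processing method is assumed to declare throws Exception, since the MapState methods are checked.

import java.util.Map;

import org.apache.flink.api.common.state.MapState;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;

// In open(): one map of per-item counters is kept for every stream key.
MapState<String, Long> counts = getRuntimeContext().getMapState(
        new MapStateDescriptor<>("counts", Types.STRING, Types.LONG));

// In the processing method (which declares throws Exception):
if (!counts.contains("clicks")) {                          // contains(UK)
    counts.put("clicks", 0L);                              // put(UK, UV)
}
counts.put("clicks", counts.get("clicks") + 1L);           // get(UK) then put(UK, UV)
for (Map.Entry<String, Long> entry : counts.entries()) {   // entries(): iterable view
    System.out.println(entry.getKey() + " -> " + entry.getValue());
}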

2.3.4 ReducingState<T>

A ReducingState keeps a single value that represents the aggregation of all values added to the state. ReducingState is similar to ListState, but the state's elements are combined using a reduce function. Elements are added using the add(T) method and retrieved using T get().
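For illustration, a sketch of a per-key running sum; the state name and element type are assumptions, and the calls sit inside a rich function whose methods declare throws Exception.

import org.apache.flink.api.common.state.ReducingState;
import org.apache.flink.api.common.state.ReducingStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;

// In open(): the reduce function is applied each time add(T) is called.
ReducingState<Long> sum = getRuntimeContext().getReducingState(
        new ReducingStateDescriptor<>("sum", (a, b) -> a + b, Types.LONG));

// In the processing method:
sum.add(5L);               // state holds 5
sum.add(7L);               // reduce(5, 7): state holds 12
Long current = sum.get();  // returns 12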


2.3.5 AggregatingState<IN, OUT>

An AggregatingState is also similar to ListState, but the elements added to this type of state are eagerly pre-aggregated using a specified AggregateFunction. The difference between ReducingState and AggregatingState is that, for AggregatingState, the aggregate type may differ from the type of the elements added to the state. Elements are added using add(IN) and retrieved using OUT get().
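To make the IN/OUT distinction concrete, the sketch below maintains a per-key running average: Long elements go in, a (sum, count) accumulator is kept, and a Double comes out. The names and types are illustrative assumptions.

import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.api.common.state.AggregatingState;
import org.apache.flink.api.common.state.AggregatingStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;

// AggregateFunction<IN, ACC, OUT>: Longs in, (sum, count) accumulator, average out.
AggregateFunction<Long, Tuple2<Long, Long>, Double> averageFn =
        new AggregateFunction<Long, Tuple2<Long, Long>, Double>() {
            @Override public Tuple2<Long, Long> createAccumulator() {
                return Tuple2.of(0L, 0L);
            }
            @Override public Tuple2<Long, Long> add(Long value, Tuple2<Long, Long> acc) {
                return Tuple2.of(acc.f0 + value, acc.f1 + 1);
            }
            @Override public Double getResult(Tuple2<Long, Long> acc) {
                return acc.f1 == 0 ? 0.0 : (double) acc.f0 / acc.f1;
            }
            @Override public Tuple2<Long, Long> merge(Tuple2<Long, Long> a, Tuple2<Long, Long> b) {
                return Tuple2.of(a.f0 + b.f0, a.f1 + b.f1);
            }
        };

// In open(): the descriptor also needs the accumulator's type information.
AggregatingState<Long, Double> average = getRuntimeContext().getAggregatingState(
        new AggregatingStateDescriptor<>("avg", averageFn, Types.TUPLE(Types.LONG, Types.LONG)));

// In the processing method:
average.add(4L);
average.add(8L);
Double result = average.get();  // 6.0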

2.4 Flink State Backends

A state backend defines how and where the state of a streaming application is stored and checkpointed. Different state backends store the state in different forms and use different data structures to hold the state of a running application.

Apache Flink provides the following three state backends to store application state information.

2.4.1 Memory State Backend

The memory state backend is the default state backend used by Flink. It makes use of the TaskManager's heap memory to store the state information. The state is persisted as objects, which makes read/write operations quick. Upon checkpointing, this state backend snapshots the state, and the snapshot is stored in the JobManager's heap memory. The memory state backend is mainly used for local development and debugging.

2.4.2 File System State Backend

Like the memory state backend, the file system state backend stores the state information in the TaskManager's heap memory. Upon checkpointing, the snapshot data is saved to a file system such as HDFS, an S3 bucket, or the TaskManager's local filesystem. The file system state backend is mainly used in high-availability setups and is encouraged for jobs with large state, long windows, or large key/value states.

2.4.3 RocksDB State Backend

The RocksDB state backend stores the state information in a RocksDB database. RocksDB is an ordered key-value store based on LSM (Log-Structured Merge) trees [20], a data structure designed to increase write throughput. RocksDB offers a very high read/write rate and is optimized for fast, low-latency storage.

Figure 2.4 – RocksDB Architecture [5]

RocksDB consists of three main parts: the memtable, the sstfile, and the logfile (Figure 2.4). The memtable is an in-memory data structure into which all writes are inserted; writes are optionally also written to the write-ahead log. Once the memtable is full, it becomes read-only and is replaced by a new active memtable. The read-only memtables are periodically flushed to disk as sstables (Sorted String Tables). Reads are first served from the memtable; if that fails, the read operation accesses the read-only memtables in reverse chronological order, and if the entry is still not found, it accesses the sstables starting from the most recent. This means that reads might take somewhat more time than writes to complete. To speed up the process, sstables are compacted and guarded by bloom filters to avoid unnecessary scans.

The RocksDB state backend has a layered architecture (Figure 2.5): the memtable of RocksDB acts as a transient in-memory layer, and the sstables act as the persistent storage layer. RocksDB stores values as serialized bytes, so every read/write operation goes through value serialization or deserialization. A read operation first checks the memtable for the entry and, if it is not found there, checks the sstables; once the data is fetched, it has to be deserialized. A write operation, in contrast, is applied directly to the memtable after value serialization. Hence, writes in RocksDB are much faster than reads.


Figure 2.5 – RocksDB State Backend Architecture [1]

Each Flink TaskManager maintains its own RocksDB instance. On checkpointing, the state data is saved to the configured persistent file system, such as HDFS. The state size with RocksDB is limited only by the disk space; hence, the RocksDB state backend can be used for jobs with huge state. The state is stored in RocksDB as serialized byte arrays, making state read/write operations expensive compared to those of the memory and file system state backends, which also reduces throughput.
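The serialized-bytes interface described above is visible directly in RocksDB's own Java API; a minimal sketch follows, where the database path, key, and value are illustrative assumptions.

import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

public class RocksDbDemo {
    public static void main(String[] args) throws RocksDBException {
        RocksDB.loadLibrary();
        try (Options options = new Options().setCreateIfMissing(true);
             RocksDB db = RocksDB.open(options, "/tmp/rocksdb-demo")) {
            byte[] key = "user-42".getBytes();
            db.put(key, "7".getBytes());  // write: goes to the memtable first
            byte[] value = db.get(key);   // read: memtable, then read-only memtables, then sstables
            System.out.println(new String(value));
        }
    }
}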

2.5 NDB - Network Database

NDB [21], popularly known as MySQL Cluster, is an in-memory database with a shared-nothing architecture adapted for distributed computing. NDB comes with a wide range of capabilities, such as high availability, data persistence, transactional consistency, and, more importantly, user-defined partitioning. An NDB cluster provides transactional guarantees, supporting the READ COMMITTED transaction isolation level. Each component in an NDB cluster (Figure 2.6) is known as a node. Usually, the term node indicates a computer, but in the context of NDB, a cluster node refers to a process. There are three different types of nodes in an NDB cluster: the Management node, the Data node, and the SQL node.

Figure 2.6 – NDB Cluster [6]

The management node manages the cluster and the other nodes within it; the management nodes are responsible for configuration and for arbitration in case of network partitions. One can claim that the core of NDB is its data nodes, as they store the actual cluster data; cluster tables are stored in memory rather than on disk. The maximum number of nodes an NDB cluster can have is 255, of which at most 48 can be data nodes. It is recommended to keep at least two replicas of the data to achieve redundancy and high availability. The SQL node is used to access the cluster data on the data nodes and can be considered a traditional MySQL server that uses the NDBCLUSTER storage engine. Data can be accessed from NDB in two ways:

1. Directly from the data nodes using the NoSQL API.

2. Using a MySQL connector, the MySQL command-line client, or an API to connect to an SQL node, which in turn fetches data from the data nodes.

Operations such as sharding and partitioning of the data are performed automatically by NDB and are transparent to the user. NDB can also handle failures automatically.
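The first access path is the one used by the FlinkNDB implementation in Chapter 3, through the ClusterJ API; a minimal connection sketch follows, where the connect string and database name are illustrative assumptions.

import java.util.Properties;

import com.mysql.clusterj.ClusterJHelper;
import com.mysql.clusterj.Session;
import com.mysql.clusterj.SessionFactory;

public class NdbConnect {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Address of the NDB management node (illustrative).
        props.setProperty("com.mysql.clusterj.connectstring", "ndb-mgm-host:1186");
        props.setProperty("com.mysql.clusterj.database", "flinkndb");

        SessionFactory factory = ClusterJHelper.getSessionFactory(props);
        // A Session corresponds to the dbSession handle used in Chapter 3;
        // it offers find(), newInstance(), savePersistent(), and similar operations.
        Session session = factory.getSession();
        session.close();
    }
}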


2.6 Apache Kafka

Apache Kafka is a distributed event streaming platform consisting of servers and clients that communicate via a high-performance TCP network protocol. Kafka comes with multiple capabilities, such as publishing (writing) and subscribing to (reading) streams of events, storing streams of events durably and reliably, and processing streams of events as they occur or retrospectively. In Kafka, events are organized and durably stored in topics.

Producers and consumers are the main actors in Kafka: producers are the applications that publish events to Kafka, and consumers are the applications that subscribe to these events to read and process them. Apache Flink comes with a Kafka connector [22] for reading data from and writing data to Kafka topics with exactly-once guarantees.
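A minimal sketch of consuming a topic from Flink, using the connector API available at the time of writing; the broker address, group id, and topic name are illustrative assumptions.

import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class KafkaSourceExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); // illustrative broker
        props.setProperty("group.id", "flink-demo");              // illustrative group

        // Read the "events" topic as a stream of strings and print it.
        env.addSource(new FlinkKafkaConsumer<>("events", new SimpleStringSchema(), props))
           .print();

        env.execute("Kafka Source Example");
    }
}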

2.7 Benchmarking SPS

Technology has been advancing rapidly, and as a result a large amount of data (big data) is being generated. More and more stream processing systems have been developed in the recent past to address the need for fast and reliable big data processing, and each of these frameworks comes with diverse capabilities and performance characteristics. Hence, benchmarking stream processing systems is an active area of research, and various studies have been conducted in the recent past. The most critical performance indicators for stream processing systems, included in most previous benchmarks, are latency and throughput.

Recent works on benchmarking stream processing systems, such as Benchmarking Distributed Stream Data Processing Systems [23] and Evaluation of Stream Processing Frameworks [24], have built their own benchmarking frameworks to analyze the performance of stream processing systems. The former evaluates the performance of Apache Storm, Apache Spark, and Apache Flink by measuring the throughput and latency of windowed operations. The latter evaluates emerging frameworks such as Structured Streaming and Kafka Streams, along with Flink and Spark Streaming, also measuring latency, throughput, and resource consumption as performance indicators. Benchmarking Streaming Computation Engines: Storm, Flink, and Spark Streaming [25] evaluated Storm, Flink, and Spark using latency and throughput as the performance indicators.


With existing benchmarking tools, however, we lack a robust system to evaluate different stream processing systems on individual capabilities: for example, how different systems perform for different state sizes, how much time each processing engine takes to snapshot or to recover from failure, and which one performs best for which kind of data.

2.7.1 NEXMark

Figure 2.7 – NEXMark Entities [7]

NEXMark [26], or the Niagara Extension to XMark, is a benchmark for running queries over continuous data streams [1]. The paper presents an online auction as a business case with four users. The online auction schema consists of three streams (Person, Auction, Bid) and one stored relation (Category). Eight queries are described over the business case. The benchmark also provides a data generator that can produce continuous streams of Person, Auction, and Bid events.

Apache Beam [27] adopted NEXMark as an integration test case that supports all of the Beam runners (e.g., Flink, Spark, Direct, Apex). The suite of queries can be run in both batch mode and streaming mode, regardless of the runner. Test data is finite in both modes; however, batch mode uses a BoundedSource, whereas streaming mode uses an UnboundedSource, which can trigger streaming mode in the runners.

The main components of the NEXMark implementation are the Generator and the Launcher. The Generator produces random time-stamped events while preserving the correlation between Bid, Auction, and Bidder. The Launcher creates the sources (Direct, Avro, PubSub, Kafka), which use the Generator to generate events.


Figure 2.8 – NEXMark Components [7]

2.7.2 Apache Beam NEXMark Benchmark Suite

Apache Beam is a unified platform for both batch and stream processing pipelines. It allows users to define a model to represent and transform datasets irrespective of any specific data processing platform. Once defined, a pipeline can be run on any of the supported runtime frameworks (runners), including Apache Flink, Apache Apex, Apache Spark, and Google Cloud Dataflow. Apache Beam also comes with different SDKs that let users write pipelines in programming languages such as Java, Python, and Go. Apache Beam has implemented its own NEXMark benchmark suite [28]: the implementation defines 13 queries [0-12] over the three entity models (Person, Bid, and Auction) representing the online auction system, as shown in Figure 2.7.


Chapter 3

Design and Implementation

This chapter outlines the design decisions and the implementation of the new state backend for Apache Flink.

3.1 Design Decisions

State is a first-class citizen in every stream processing system, and how the state is stored and accessed is defined by state backends. Apache Flink comes with different state backends, which implement how state is stored, retrieved, updated, snapshotted, and restored. State is an essential aspect of fault tolerance in long-running distributed applications; thus, state backends are an essential part of stream processing systems for handling failures, recovering from them, and reconfiguring the system. State backends must also be capable of handling applications that involve considerable state.

With Flink, for a high-performance application, an in-memory state backend such as the memory state backend or the file system state backend is preferred, whereas, if the state is large and does not fit in memory, a state backend such as RocksDB is preferred.

A common factor among all existing state backends officially supported by Flink is that the state is always available locally to compute tasks, as in Figure 3.1(a). This is not a necessary restriction but rather a design choice of each particular backend that makes deployment and state access a bit more convenient while creating other challenges.

This system's design is based on the observation that, in situations where quick system reconfiguration is needed, the best fit for a state backend is an external distributed database. We call the new state backend FlinkNDB.


Figure 3.1 – (a) Existing Flink state backend (b) FlinkNDB state backend [1]

FlinkNDB decouples state and compute by using a distributed database to store the state. Figure 3.1 depicts the significant design difference of FlinkNDB compared with the existing Flink state backends. FlinkNDB is designed to save users from the trouble of waiting for pipelines to recover: state in FlinkNDB is not embedded but external to the compute nodes.

3.2 FlinkNDB Design

The FlinkNDB state backend is designed to decouple compute and storage. Externalizing state from compute reduces both the reconfiguration delay and the recovery time when a failure happens in the system. FlinkNDB uses NDB as the external database, for two major reasons: NDB is one of the fastest in-memory databases, and it gives the best performance when primary-key operations dominate. Also, Flink's key groups can be utilized for partitioning in NDB. More details on NDB and the schema used for FlinkNDB are given in the next section. The initial design of FlinkNDB has two layers: the application layer and the persistent database layer (NDB), as shown in Figure 3.2.

The Flink application layer abstracts the complexities of the different state backends and manages each read and write request according to the state backend configured by the user.

Figure 3.2 – FlinkNDB Initial Architecture

At the database level, we have mainly three different tables per state: the active table, which stores the most recent value; the committed table, which stores the versioned changes over epochs; and the snapshot table, which stores the completed checkpoint details, consisting of metadata that can be utilized for recovery and reconfiguration. As reconfiguration and recovery are out of this thesis's scope, more details on them are included in the thesis FlinkNDB: Guaranteed Data Streaming Using External State [8]. In this initial design, each read and write request from the application is served by the active table. Hence, every read and write ends with a network call to the database (Figure 3.3), and we also pay a value serialization/deserialization cost.

FlinkNDB uses the key group as the partitioning key for NDB. A composite key that consists of the key, key group, state name, and namespace serves as the active table's primary key; the primary key may vary according to the state type, as discussed in detail below. The key is the partition key used by the keyed stream and is unique across the keyed stream. Flink assigns the key group, and a single key group can have multiple keys. As FlinkNDB uses a single database for all state across the application, we need state names to differentiate between different states with the same key.


Figure 3.3 – Activity Diagram

3.3 NDB: Decoupled Storage

FlinkNDB uses NDB, or Network Database, as the external data storage system. FlinkNDB uses Flink key groups to define user-defined partitioning that colocates each key group's data on an individual NDB data node instead of storing it across multiple nodes of the distributed database. FlinkNDB also leverages primary-key operations at all times to get the best performance out of NDB.

Each state has three different tables to store its state information: the active table, the committed table, and the snapshot table. The active table stores the most recent value for the currently running epoch; hence, the active state values are updated in every epoch. The committed table stores the versioned changes over epochs; hence, for the same key, we will have different entries in the table for different epochs. The snapshot table stores the completed checkpoint details, consisting of the metadata used for recovery and reconfiguration.

The metadata from the snapshot table is used to fetch the actual data for the completed checkpoint from the committed table. All these tables in FlinkNDB use the key group as the partition key.

In FlinkNDB, we have separate tables to store this information for each state type; FlinkNDB has a total of 11 tables to store all three state types. The tables for each state and their schemas are discussed in the following sections.

3.3.1 Value State

Value state has three different tables, as shown in Figure 3.4. The active table stores the state values for the currently running epoch. For value state, FlinkNDB stores the key, key group, namespace, state name, epoch, and value, of which the key, key group, namespace, and state name form the composite primary key. The value is stored in the form of serialized bytes, and the epoch number indicates when the particular key was last updated. All database reads and writes go to the active table.

Figure 3.4 – Value State Schema [1]

The committed table stores the versioned changes over epochs. It has the same attributes as the active table, but the epoch is included in the composite primary key along with the key, key group, namespace, and state name. This enables us to keep multiple entries for the same key across different epochs.

The snapshot table stores the metadata for the last completed checkpoint, which includes the key group, namespace, epoch, and status. With this, the application can recover from the last completed checkpoint before a failure. The primary key is the composite of the key group, namespace, and epoch.
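To make the schema concrete, the sketch below shows how the value-state active table could be mapped as a ClusterJ persistent interface of the kind used by the implementation in Section 3.4.1; the table name and column names are assumptions inferred from the description above, not the thesis's actual DDL.

import com.mysql.clusterj.annotation.Column;
import com.mysql.clusterj.annotation.PartitionKey;
import com.mysql.clusterj.annotation.PersistenceCapable;
import com.mysql.clusterj.annotation.PrimaryKey;

// Hypothetical mapping of the value-state active table described above;
// the key group is the NDB partition key, per the FlinkNDB design.
@PersistenceCapable(table = "value_state_active")
@PartitionKey(column = "key_group")
public interface KeyValue {

    @PrimaryKey
    @Column(name = "key_group")
    int getKeyGroup();
    void setKeyGroup(int keyGroup);

    @PrimaryKey
    @Column(name = "k")
    int getKey();
    void setKey(int key);

    @PrimaryKey
    @Column(name = "namespace")
    int getNameSpace();
    void setNameSpace(int nameSpace);

    @PrimaryKey
    @Column(name = "state_name")
    String getStateName();
    void setStateName(String stateName);

    @Column(name = "epoch")
    long getEpoch();
    void setEpoch(long epoch);

    @Column(name = "value")
    byte[] getValue();
    void setValue(byte[] value);
}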


3.3.2 Map State

Map state stores a hashmap comprising key-value pairs as its value. Hence, each entry of this map (referred to as the user map, and its keys as user keys) forms a unique entry in the active and committed state tables. Figure 3.5 shows the different tables for map state.

Figure 3.5 – Map State Schema [1]

The active table for map state stores the most recent values. We store an additional user-key column, and the value field stores the value of this particular user key. Hence, for a particular key, we will have multiple user keys and multiple entries in the table. The key, key group, namespace, state name, and user key form the composite primary key for this table.

The structure of the committed table is similar to that of the active table; however, along with the key, key group, namespace, state name, and user key, the epoch is part of the primary key to maintain versioning.

All snapshot tables maintain the same metadata for their particular state. Hence, the map state snapshot table has the same structure as the value state snapshot table and stores the key group, namespace, epoch, and update status.

3.3.3 List State

The structure of the list state tables is quite different from that of the value state and map state tables. As the list state stores a list of values, it is essential to persist the list values and their indices, as in Figure 3.6; hence, we have introduced an additional table to store the index information. When we add an element to a list, it is crucial to know the existing list's size, and the additional list attribute table (Figure 3.7) stores this information, i.e., the current size of the list.

Figure 3.6 – List State Schema [1]

Figure 3.7 – List State Attribute Schema [1]

The list state stores a list of values, and in the active table we have a unique entry for each list element, with an additional list index attribute that indicates the index of the particular value. The key, key group, namespace, state name, and list index form the composite primary key for this table.

The list attribute active table stores the current length of the list and holds the most recent value for the currently running epoch. The key, key group, namespace, and state name form the composite primary key for this table. This table is queried before each write to a list state table to determine the list's current index.

Similar to the other committed tables, the committed table stores the versioned values of the list state. The schema is similar to that of the active table, but the epoch number is also part of the primary key to maintain versioning.

The list attribute committed table stores the list length for a particular key over different epochs. It has the same structure as the list attribute active table, but the epoch is also part of the primary key.

The snapshot table maintains the same metadata as for value state and map state. Hence, the schema is the same as that of the value state and map state snapshot tables, storing the key group, namespace, epoch, and status.

3.4 Implementation

In Flink, the keyed state interface provides access to different state types that are all scoped to the key of the current input element. We have implemented three of these state types, namely value state, map state, and list state. The implementation details of these state types are explained below.

3.4.1 Value State

The NDBValueState<K, N, V> class, which is the implementation of value state, extends AbstractNDBState<K, N, V> and implements InternalValueState<K, N, V>, where K is the type of the key, N is the type of the namespace, and V is the type of the value that the state stores. AbstractNDBState<K, N, V> is the base class for FlinkNDB state implementations that store state in NDB. All the auxiliary methods used by the runtime are kept in the internal state hierarchy and are not intended to be used by user applications; internal state classes offer access to getters and setters of the namespace and to additional features, such as access to the raw value or state merging. The value state mainly has two methods to retrieve or update the value. V value() returns the current value of the state; the system automatically supplies the key, so the function always sees the value mapped to the current element's key, and in this manner the system can manage stream and state partitioning consistently.


public V value() {
    try {
        byte[] valueBytes;
        Object[] pk = new Object[]{
                backend.getCurrentKeyGroupIndex(),
                this.backend.getCurrentKey().hashCode(),  // hash of the current key, as written by update()
                this.getCurrentNamespace().hashCode(),
                this.stateName
        };

        KeyValue result = this.backend.dbSession.find(KeyValue.class, pk);

        if (result != null && result.getValue() != null) {
            valueBytes = result.getValue();
        } else {
            return getDefaultValue();
        }

        dataInputView.setBuffer(valueBytes);
        return valueSerializer.deserialize(dataInputView);

    } catch (IOException e) {
        throw new FlinkRuntimeException("Error while retrieving data from NDB.", e);
    }
}

void update(V value) updates the operator state accessible by V value() to the given value. The next time V value() is called (for the same state partition), the returned state will represent the updated value. When a partitioned state is set to null, the current key is deleted, and the next access returns the default value.

public void update(V value) {
    if (value == null) {
        clear();
        return;
    }

    try {
        KeyValue kv = backend.dbSession.newInstance(KeyValue.class);

        kv.setKeyGroup(backend.getCurrentKeyGroupIndex());
        kv.setKey(this.backend.getCurrentKey().hashCode());
        kv.setStateName(this.stateName);
        kv.setNameSpace(this.getCurrentNamespace().hashCode());
        kv.setValue(serializeValue(value));
        backend.dbSession.savePersistent(kv);  // upsert into the active table
    } catch (Exception e) {
        throw new FlinkRuntimeException("Error while adding data to NDB", e);
    }
}


3.4.2 Map State

The NDBMapState<K, N, UK, UV> class, which is the implementation of map state, extends AbstractNDBState<K, N, Map<UK, UV>> and implements InternalMapState<K, N, UK, UV>, where K is the type of the key, N is the type of the namespace, UK is the type of the keys in the map state, and UV is the type of the values in the map state. The methods of map state are quite different from those of value state. UV get(UK key) returns the current value associated with the given key.

public UV get(UK userKey) throws ClusterJException {
    MapKeyValue mkv = this.backend.dbSession.find(MapKeyValue.class, getMapStateValuesPK(userKey));
    if (mkv == null) {  // no mapping stored for this user key
        return null;
    }
    byte[] rawValueBytes = mkv.getValue();
    return (rawValueBytes == null
            ? null
            : deserializeUserValue(dataInputView, rawValueBytes, userValueSerializer));
}

void put(UK key, UV value) associates a new value with the given key.

public void put(UK userKey, UV userValue) throws IOException, ClusterJException {
    MapKeyValue mkv = getMapKeyValueObject();
    mkv.setUserKey(serializeValue(userKey, userKeySerializer));
    mkv.setValue(serializeValue(userValue, userValueSerializer));
    this.backend.dbSession.makePersistent(mkv);
}

void putAll(Map<UK, UV> map) copies all of the mappings from the given map into the state.

public void putAll(Map<UK, UV> map) throws IOException, ClusterJException {
    if (map == null) {
        return;
    }

    List<MapKeyValue> mkvList = new ArrayList<>();

    for (Map.Entry<UK, UV> entry : map.entrySet()) {
        MapKeyValue mkv = getMapKeyValueObject();
        mkv.setUserKey(serializeValue(entry.getKey(), userKeySerializer));
        mkv.setValue(serializeValue(entry.getValue(), userValueSerializer));
        mkvList.add(mkv);
    }

    this.backend.dbSession.makePersistentAll(mkvList);
}

void remove(UK key) deletes the mapping for the given key.

public void remove(UK userKey) throws IOException, ClusterJException {
    this.backend.dbSession.deletePersistent(MapKeyValue.class, getMapStateValuesPK(userKey));
}

boolean contains(UK key) returns whether a mapping for the given key exists.


public boolean contains(UK userKey) throws IOException, ClusterJException {
    MapKeyValue mkv = this.backend.dbSession.find(MapKeyValue.class, getMapStateValuesPK(userKey));
    return (mkv != null && mkv.getValue() != null);  // guard against a missing row
}

3.4.3 List State

The NDBListState<K, N, V> class, which is the implementation of list state, extends AbstractNDBState<K, N, List<V>> and implements InternalListState<K, N, V>, where K is the type of the key, N is the type of the namespace, and V is the type of the values in the list state. public void add(V value) adds a value to the current list: the method retrieves the current list index from the list attribute table and inserts the element at the next index.

public void add(V value) throws IOException {
    Preconditions.checkNotNull(value, "You cannot add null to a ListState.");

    try {
        int index = updateListStateAttributes(1);  // next free index from the list attribute table
        ListKeyValue lkv = getListKeyValueObject();
        lkv.setListIndex(index);
        lkv.setValue(serializeValue(value, elementSerializer));
        backend.dbSession.savePersistent(lkv);
    } catch (ClusterJException e) {
        throw new FlinkRuntimeException("Error while adding value to list in NDB", e);
    }
}

The void update(List<V> values) method replaces the existing values with the given list of values; the next time get() is called (for the same state partition), the returned state will represent the updated list. void addAll(List<V> values) instead appends the given values to the existing list; again, the next time get() is called, the returned state will represent the updated list.

public void update(List<V> values) {
    Preconditions.checkNotNull(values, "List of values to add cannot be null.");

    if (!values.isEmpty()) {
        try {
            // update() replaces the whole list, so the stored length is reset
            // (the reset flag is assumed true here).
            int index = updateListStateAttributes(values.size(), true);

            List<ListKeyValue> lkvList = new ArrayList<>();

            for (int j = 0; j < values.size(); j++) {
                ListKeyValue lkv = getListKeyValueObject();
                lkv.setListIndex(index + j);
                lkv.setValue(serializeValue(values.get(j), elementSerializer));
                lkvList.add(lkv);
            }
            backend.dbSession.makePersistentAll(lkvList);

        } catch (IOException | ClusterJException e) {
            throw new FlinkRuntimeException("Error while updating data to NDB", e);
        }
    }
}

The Iterable<V> get() method internally calls List<V> getInternal(), which fetches the internally stored value.

public List<V> getInternal() {
    try {
        ListStateAttrib lengthObj = this.backend.dbSession.find(
                ListStateAttrib.class, getListStateAttribPK());

        // If there is no attribute row or the list is empty, return the default value.
        if (lengthObj == null || lengthObj.getListLength() == 0) {
            return getDefaultValue();
        }

        V element;
        byte[] valueBytes;
        List<V> resultArr = new ArrayList<>();
        ListKeyValue lkv = getListKeyValueObject();

        for (int i = 0; i < lengthObj.getListLength(); i++) {
            lkv.setListIndex(i);  // set the index for the primary key

            ListKeyValue result = this.backend.dbSession.load(lkv);
            if (result != null && result.getValue() != null) {
                valueBytes = result.getValue();
                dataInputView.setBuffer(valueBytes);
                element = elementSerializer.deserialize(dataInputView);
                resultArr.add(element);
            } else {
                return getDefaultValue();
            }
        }

        return resultArr;

    } catch (ClusterJException e) {
        throw new FlinkRuntimeException("Error while retrieving data from NDB", e);
    } catch (IOException e) {
        throw new FlinkRuntimeException("Unexpected list element deserialization failure", e);
    }
}

References
