
Fault Tolerant Distributed Complex Event Processing on Stream Computing Platforms

PARIS CARBONE

Master’s Thesis at ICT

External supervisors: Konstantinos Vandikas, Farjola Zaloshnja
Examiner: Johan Montelius

TRITA xxx yyyy-nn


Abstract

Recent advances in reliable distributed computing have made it possible to provide high availability and scalability to traditional systems and thus offer them as reliable services. For some systems, such as distributed storage, on-line data analysis, batch processing and distributed stream processing, their parallel nature in addition to weak consistency requirements allowed a relatively straightforward transition. On the other hand, systems such as Complex Event Processing (CEP) still maintain a monolithic architecture, offering high expressiveness at the expense of low distribution. In this work, we address the main challenges of providing a highly available Distributed CEP service with a focus on reliability, since it is the most crucial and least explored aspect of that transition.

The experimental solution presented targets low average detection latency and leverages the event delegation mechanisms present in existing stream execution platforms, together with in-memory logging, to provide availability of any complex event processing abstraction on top via redundancy and partial recovery.


First and foremost I would like to thank my wife Maria for the ongoing support and encouragement that kept me going with this project. I would also like to thank my supervisors, Konstantinos and Farjola, for their motivation and ideas that inspired my work and for offering me whatever resources I needed for my implementation. Many thanks go to my good friend and classmate Lars Kroll for his ongoing motivation, corrections and discussions that improved the quality of my work significantly. Last but not least, I would like to thank my professor and examiner Johan Montelius for his professional insight, advice and support that helped clarify a lot of issues and questions around this project.


Contents

1 Introduction
  1.1 Background
  1.2 Motivation
  1.3 Goals
  1.4 Limitations
  1.5 Report Outline

2 Background
  2.1 Complex Event Processing
    2.1.1 Architecture
    2.1.2 CEP Fault Tolerance
  2.2 Fault Tolerance in Distributed Stream Processing
    2.2.1 Active Replication
    2.2.2 Passive Standby and Upstream Backup
    2.2.3 Parallel Recovery
  2.3 Distributed Stream Computing Platforms
    2.3.1 Twitter Storm
    2.3.2 Yahoo S4
  2.4 Component Based Model and Communication
    2.4.1 Kompics
  2.5 Distributed In-memory Storage
    2.5.1 Riak

3 Solution Model
  3.1 Model Assumptions
    3.1.1 Events and Streams
    3.1.2 Query Processor Model
    3.1.3 Query Parallelism
    3.1.4 Query Planning
  3.2 Distributed CEP Execution
    3.2.1 Execution Environment Assumptions
    3.2.2 Stream Handling
  3.3 Fault Tolerance
    3.3.1 Query Processing Pairs
  3.4 Recovery
    3.4.1 Upstream Logging and Partial Recovery
    3.4.2 Log Pruning

4 Solution Design
  4.1 Architecture Overview
    4.1.1 Query Processors
    4.1.2 Event Sources
    4.1.3 Event Delegation
  4.2 Integration and Binding
  4.3 Storage Concerns
  4.4 Improvements

5 Experiments and Evaluation
  5.1 Goals
  5.2 Evaluation Approach
  5.3 SIP Flooding Detection Scenario
  5.4 CDR Cloning Detection Scenario
  5.5 Detection Rate
  5.6 Recovery Latency

6 Conclusions
  6.1 Future Work

Bibliography

A Algorithms

B Scenarios
  B.0.1 SIP Flooding Detection
  B.0.2 CDR Cloning Detection


Chapter 1

Introduction

1.1 Background

There is an increasing potential today for on-line data analysis in a society that becomes more and more interconnected. The vision of future sustainable large cities that can be smart and proactive when it comes to dangers or critical situations is not necessarily far-fetched. Streams of generated events can be correlated to give further insight into ongoing situations. Examples of useful applications include accident or fire detection by sensor networks, fraud and attack detection in cellular telephony networks, financial market monitoring and on-line trading, among others.

The traditional way of detecting patterns in such data streams is to first store everything in databases and then run analysis jobs, perhaps using a big data framework like MapReduce [1]. One of the main issues with this approach is storage capacity, since available storage becomes insufficient to accumulate such large-scale streams. In addition, critical situations preferably demand quick detection, so that actions can be taken while they are happening. On-line stream processing has progressed throughout the years; however, its progress has been fragmented into two schools [2]: Distributed Stream Processing (DSP), which targets scalability of mainly aggregate operations, and Complex Event Processing (CEP), which targets high expressiveness and advanced pattern matching. The combination of these two schools of stream processing with sufficient reliability guarantees could offer a big advantage towards efficient on-line stream processing.

1.2 Motivation

The usage of Complex Event Processing (CEP) [3, 2] engines today is limited, since such systems are mainly deployed in-house in corporations, configured to handle low to medium rate streams of sensor or enterprise events. Their execution model allows high-level expressions, deployed as continuous queries, that analyse on-line streams and report on them without the need to store the streams in any form of intermediate storage.


While existing CEP solutions suffice for the demands of corporations and institutions handling reasonable traffic, the critical industries that process the majority of sensor traffic, such as telecommunications, need systems with similar expressiveness that can be deployed in data-intensive clusters, so that they are scalable enough to handle the incoming traffic and react quickly to critical situations. A further major requirement is to eliminate any possibility of failed detections due to lack of availability.

Stream execution platforms such as Twitter's Storm [4] and Yahoo's S4 [5] have attracted significant interest recently, since they allow a straightforward transition of stream-oriented applications towards cloud environments. By hiding the complexity of resource management, event delegation, flow control and fault tolerance, it is easier today to deploy custom computational graphs to such systems and choose the amount of parallelism depending on the required bandwidth. There is thus clear potential for efficient, fault tolerant CEP deployments of existing engines on such stream execution platforms, provided that care is taken not to violate any consistency requirements.

1.3 Goals

This project is an effort to take existing monolithic CEP engines into a distributed setup that is both sustainable and consistent, and it contains the following tasks:

• Offer a modular system design and implementation that achieves distribution of CEP engines through context-aware routing of events.

• Provide fault tolerance and a recovery mechanism that is independent of the CEP technology used.

The research method used is quantitative, since the solution's cost and performance are evaluated through experimental measurements and comparisons.

This work is part of the Services and Software Research department's operations at Ericsson around complex event processing, which target, but are not limited to, cellular telephony data analysis for the purpose of offering improved quality of service.

1.4 Limitations

Since some parts mentioned in this work are the subject of other projects carried out at Ericsson or are planned as future work, the following were not considered:

• CEP query analysis, re-writing and planning.

• Support for CEP engines other than Esper.


1.5 Report Outline

In chapter 2 we provide a brief description of the nature of complex event processing, an overview of fault tolerance mechanisms employed in stream processing and a detailed description of distributed stream processing platforms with a main focus on Storm. Additionally, we further outline the main features of other systems used in our approach for component definitions and distributed storage.

Chapter 3 offers a formal definition of all the abstractions, limitations and assumptions of the problem domain. Furthermore, it presents the key mechanisms considered for fault tolerance and the incentives behind them.

In chapter 4 we sum up all design and implementation considerations and the general architecture.

The performance costs and benefits of the solution are presented in chapter 5, along with comments on the benefits and drawbacks of the design considerations.

Finally, chapter 6 underlines the outcomes of this work and gives further recommendations and plans for future work.


Chapter 2

Background

2.1 Complex Event Processing

Event processing, by definition [6], is a method of tracking and analysing streams composed of events, inferring situations and drawing conclusions from them. Complex event processing (CEP) is a type of event processing that involves multiple sources (e.g. sensors) in order to infer composite events or complex patterns based on events that occurred within a specified sliding time frame. The common purpose of complex event processing systems is to extract meaningful or critical situations (intrusions, frauds, catastrophic incidents etc.) and allow for quick reaction to them.

Consider an oil leakage detection system, deployed by an oil company, that needs to process events from several thousands of sensors spread over multiple base stations that continuously broadcast their measurements (temperature, oil levels etc.). Since severe oil leakage can have disastrous outcomes for both the company and the environment, such a system should be carefully designed.

To start with, it should offer the capability of expressing what a critical situation is in a high-level query language. The wider the range of situations that can be described in that language, the more expressive it is considered to be. In our example, an oil leakage detection rule could be described by the following statement:

    Trigger a WARNING message that contains the base station name and time
    when the following happens over an hour in a base station:
    any container's pressure drops under 2.3 atm,
    followed by an average temperature metric of the same area
    of more than 20 degrees or a volume decrease to
    less than 50% of that same container's capacity.

From this example we can extract several requirements that a target system should satisfy in terms of expressiveness and functionality:

1. Sliding Window Operation: The ability to consider only events within a specific time window at a time. In this example we have a sliding window of 1 hour.

2. Partitioned State: The ability to maintain separate partitioned states. In this example the system should maintain a different state per base station. This kind of state partitioning is the key to distribution, as we will see later.

3. Multiple Streams Processing: The ability to consider more than one stream in the definition of a detection situation. In this case we have to combine temperature with pressure and volume sensor streams.

4. Aggregations: The ability to aggregate over the events of a stream. For example, here we need to compute the average temperature per base station.

5. Pattern Detection: The ability to detect situations in a specific order. In the example above we have to detect whether a pressure drop was followed by either a temperature increase or a volume decrease.

To address most of the above requirements, a specific family of processing engines, called "Complex Event Processing" (CEP) engines [3, 2], has been developed throughout the years, along with an associated research field. Such engines are powerful enough to process situations as complex as the example above, and far more complex than that. They generally offer high expressiveness and the ability to scale vertically by holding many states simultaneously, given enough hardware support. Current commercial and open source CEP engines such as Esper are lightweight and can utilise the resources of a single host machine quite well. This section gives a brief overview of their general architecture and features.

2.1.1 Architecture

While each CEP engine adopts its own query language and data model, there are some shared characteristics that illustrate the general behaviour. Informally, they operate in an inverse-DBMS fashion: instead of storing data and retrieving it for processing, they run operators over received data on the fly. Their model is maintained in data structures that keep event history or computed data such as aggregations.

In addition, query rules are applied on the fly in the form of state transitions, most commonly implemented with non-deterministic finite automata.

Figure 2.1: CEP state

Sliding Time Windows and Non-Determinism

The source of non-determinism in most CEP systems is their dependency on real clocks for sliding windows. Clock ticks might trigger state changes or the discarding of previously considered events, which leads to non-determinism if we expect the same output for the same input of events. However, in many available solutions it is possible to enforce externally timed windows, based on event timestamps, for historical processing, such that the local clock of the engine can be shifted forwards or backwards based on the timestamp of the last event received.

This feature makes it possible to treat such CEP engines as having deterministic behaviour. This work is also limited to CEP engines that support externally timed windows, since it is the only sane approach for investigating reliability in a distributed setting.

Partitioned State

As mentioned before, CEP engines can maintain a partitioned state for sliding window operations. To illustrate that functionality for externally timed windows, consider the following example in EPL (Event Processing Language):

    select avg(duration) from PhoneCall
    .std.groupwin(country).win:ext_timed(timestamp, 1 hour)

This query simply triggers the average duration of all phone calls per country, considering only the phone calls that belong to the last hour measured from the time point of the last event received in each country. As seen in Figure 2.2, if a phone call event arrives for consumption with country=Italy and timestamp=14:00, a phone call event previously received for Italy at 12:00 would not participate in the average.

State partitioning is an important feature, since it allows CEP engines to correlate data from different contexts and requires only partially ordered streams, where events are ordered per partition.

Figure 2.2: Partitioned State
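To make the externally timed, partitioned window semantics above concrete, the following is a minimal sketch (not taken from the thesis implementation, and with hypothetical class and field names) of how a per-partition average over an externally timed sliding window could be maintained: the window of each partition is driven by the timestamp of the last event received in that partition, not by a wall clock.

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.HashMap;
    import java.util.Map;

    /** Illustrative per-partition average over an externally timed sliding window. */
    public class PartitionedWindowAverage {
        private final long windowMillis;
        // partition key (e.g. country) -> window of [timestamp, value] pairs
        private final Map<String, Deque<long[]>> partitions = new HashMap<>();

        public PartitionedWindowAverage(long windowMillis) {
            this.windowMillis = windowMillis;
        }

        /** Consume an event and return the current average for its partition. */
        public double onEvent(String partitionKey, long timestamp, long value) {
            Deque<long[]> window = partitions.computeIfAbsent(partitionKey, k -> new ArrayDeque<>());
            window.addLast(new long[]{timestamp, value});
            // Evict events older than the window, relative to the last event's timestamp.
            while (!window.isEmpty() && window.peekFirst()[0] < timestamp - windowMillis) {
                window.removeFirst();
            }
            double sum = 0;
            for (long[] e : window) {
                sum += e[1];
            }
            return sum / window.size();
        }
    }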

2.1.2 CEP Fault Tolerance

Some commercial engines support state checkpointing and recovery from serialised state. However, in this work we will only regard CEP engines as black boxes, without any further access to their internal state whatsoever. That means that when an engine has been reset it starts from scratch, and the only option left for recovery is to replay logged events.

2.2 Fault Tolerance in Distributed Stream Processing

Early attempts at scaling real-time event stream processing generally resulted in optimisations of existing DBMSs to handle sequential data, also known as Stream Database systems [7, 8, 9]. However, their real-time performance was limited due to the high cost of the database operations involved. Their lack of more sophisticated pattern matching also made them unsuitable for numerous cases of enterprise usage.

Stream processing solutions favoured in academia involved the application of group communication techniques, such as the publish-subscribe paradigm [10], to disseminate events through multiple operators that were quite simplistic in nature. This effort led to a new set of graph-based systems, often called Distributed Stream Processing (DSP) systems [2], targeting highly scalable real-time stream processing, such as Aurora [11], Borealis [12], STREAM and Spark [13]. The common feature of those systems is the notion of distributing different operators in a computation graph of processors while disseminating different parts of streams on each of them. Furthermore, several interesting stream fault tolerance solutions [14] came out of the DSP field, some of which are briefly described in this section.

2.2.1 Active Replication

Active replication relies on the synchronised consumption of events between equal replicas that operate as state machines, and it is among the primary techniques employed in DSP systems [15, 16, 17] to offer fault tolerance. Total ordering of events using consensus protocols such as Paxos [18] is widely accepted as the most fault-tolerant approach. However, several variants have been proposed [14], some of them achieving consensus indirectly, others even considering looser synchronisation requirements between operator replicas [16]. In most proposals, extensive local event logging is involved in producer operators, and two-way communication is required between consumers and producers for periodic acknowledgements that eventually lead to local log truncation, as seen in Figure 2.3.

Figure 2.3: Consumers-Producers in Active Replication

Processing Pairs and Flux

In many solutions the concept of "processing pairs" is adopted, where there are only two stream operator replicas, based on the assumption that at least one of them is correct [19]. To further assist a potential operator recovery, while also offering consistent output (ordered events, no duplicates) out of processing pairs, FLUX [16] employs the notion of "boundary operators" by interposing two new operators, before and after the replicas. The approach in FLUX also builds on the assumption that boundary operators are fault tolerant and that they maintain local buffers that never overflow.

2.2.2 Passive Standby and Upstream Backup

There are two kinds of passive replication considered in distributed stream processing systems: passive standby and upstream backup [20]. Passive standby typically requires operator serialisation support and relies on periodic state deltas applied to passive standbys. Furthermore, all primaries have to log their output buffers between state updates to assist in the complete recovery of a backup after a primary failure. Passive standby is considered a rather slow technique for fault tolerance and is rarely used in distributed stream processing. Instead, upstream backup [20] is better suited in most cases.

With upstream backup, each upstream server buffers the input of downstream nodes until it is acknowledged that the corresponding "operation" is done. Upon failure, the upstream backup sends all buffered events back to the takeover standby node to recover the state. Since a history of events needs to be replayed, this technique leads to slower recovery times; however, it has reduced runtime cost and is better suited for systems with a limited number of considered events.

2.2.3 Parallel Recovery

A recent and interesting recovery approach, employed in Spark [13], is parallel recovery. It is based on the idea of splitting a stream computation into a series of small, deterministic batches. Spark uses a storage abstraction called Resilient Distributed Datasets (RDDs), which contain historical batch operator state and the operator history involved; whenever lost RDDs are detected due to failures, they are reassigned for computation to available nodes in the cluster. Generally, parallel recovery is restricted to fine-grained stream operations; however, it offers an interesting approach to stream processing in general, inspired by systems like Hadoop [21].

2.3 Distributed Stream Computing Platforms

Several parallel processing systems like Hadoop [21] are mainly in use today for distributed processing; however, their architecture does not offer the flexibility to process on-line data. For that purpose, a new class of systems serving as distributed stream computing platforms has emerged, offering functionality suited for parallel processing of continuous streams of data, such as Twitter's Storm [4] and Yahoo's S4 [5]. In this section we describe their architecture, benefits and usage, with a main focus on Storm, since it is a crucial part of this project's implementation.

2.3.1 Twitter Storm

One system that has gained a lot of hype during the last two years is Twitter's Storm [4], a distributed stream computing platform written in Clojure that is meant to simplify distributed stream processing. Storm abstracts the execution plan of a processing graph into a concept called a Topology. Once deployed, a topology translates into endlessly running tasks handled by worker processes allocated throughout a cluster, which follow a given routing scheme. Storm also handles failing tasks and processes and reallocates them reliably via configuration information maintained in Zookeeper [22]. The main deployment decisions are made by a single fail-fast process called Nimbus, which also implements an Apache Thrift RPC [23] interface.

Storm Topologies

A topology is a Thrift [23] object that is submitted to Storm for execution and contains all instructions needed in order to deploy a distributed execution. A representation of a possible topology can be seen in Figure 2.4. The main abstractions contained in a topology are:

Figure 2.4: Storm Topology Example

• Spout definitions: Spout components, as their name suggests, are responsible for emitting streams into the topology and are the main sources of tuples. Storm implements the needed flow control and frequently polls spouts to emit new tuples. Typical implementations of spouts involve listening to external in-memory queues such as Kafka or querying a database.

• Bolt definitions: Bolt components consume, process and emit streams of tuples provided by Storm. They are the main processing elements and are typically implemented to do filtering, aggregations, joins and even transactions with external databases.

• Parallelism: The amount of parallelism per bolt or spout definition instructs Storm to maintain a certain number of instances (tasks) of each. Since it is explicitly given before deployment and is impossible to change at runtime, parallelism should be proportional to the amount of scalability that is approximately needed for processing a stream.

• Streams: Streams are unbounded series of tuples that follow a specific schema. Each stream emitted by a component should have a unique id and a specified schema. Storm handles primitive field type serialisation and provides support for custom types via a wrapper using Kryo [24].

• Subscriptions and Groupings: Since Storm uses subscription message queues for tuple dissemination between tasks, it is important to know beforehand how streams need to be disseminated between them. Thus, it should be specified in the topology which bolts subscribe to other bolts or spouts. In addition, groupings should be given as well, which are in fact routing schemes that define how tuples should be disseminated between the actual tasks of connected components. There are several typical groupings already implemented by Storm to choose from. Some interesting groupings include:

– Shuffle Grouping does random routing between component tasks and is preferred for filtering operations.

– Fields Grouping applies content-aware routing based upon given tuple fields by mapping each value to one responsible target task. Fields grouping is used mainly for partitioning state-driven computations such as aggregations.

– All Grouping replicates and disseminates the stream to each subscribed target task.

– Global Grouping chooses exactly one task as the receiver of the whole stream.

– Direct Grouping gives the routing choice to the component itself.

Most importantly, Storm offers components the ability to implement custom groupings based on topology information at runtime. By default, shuffle grouping is picked by Storm for dissemination.

Storm API

There is a Storm API available in Java and Clojure that clearly reflects Storm's pull-based mechanism for data pipelining. As seen in Figure 2.5, the Storm API coordinates flow control in a pull-based fashion: first, Storm notifies a spout to push the next tuple to a given output queue managed by Storm, and it finally triggers an execute callback in the receiving bolt(s) for consumption. Bolts can emit tuples at any time depending on their functionality. Even though such a mechanism suffices for the purposes of managed event dissemination, it would be clearer to adopt a message-based approach. To make the API more elegant, a message-based component subscription framework was integrated on top to mask the communication scheme, as will be discussed later.

Figure 2.5: Storm API
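A bolt that fits this pull-based scheme could, for instance, look like the following sketch against the Java API of that Storm generation; the stream fields and the filtering condition are hypothetical and only illustrate where the execute callback and emissions take place.

    import backtype.storm.task.OutputCollector;
    import backtype.storm.task.TopologyContext;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.base.BaseRichBolt;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Tuple;
    import backtype.storm.tuple.Values;

    import java.util.Map;

    /** Illustrative filtering bolt: Storm calls execute() for every delegated tuple. */
    public class FilterBolt extends BaseRichBolt {
        private OutputCollector collector;

        public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
            this.collector = collector;
        }

        public void execute(Tuple input) {
            // A bolt may emit new tuples at any point while processing.
            if ("sweden".equals(input.getStringByField("country"))) {
                collector.emit(new Values(input.getValueByField("temp"),
                                          input.getValueByField("time")));
            }
            collector.ack(input); // relevant for the at-least-once tracking described later
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("temp", "time"));
        }
    }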


Figure 2.6: Storm Architecture Overview

Storm Architecture

Storm abstractions themselves do not say much about the architecture so it is important to give a clear view of how it works under the hood. As depicted in Figure 2.6 there are several communication and processing mechanisms involved that will be explained in a little more detail.

Apache Zookeeper: A middleware for distributed resource coordination. It is widely adopted by many distributed frameworks, such as Apache Hadoop, to achieve leader election, distributed locking and barrier establishment. It is feature-light and requires a live ensemble of at least three nodes to achieve a quorum. Writes are coordinated through one master node into data structures called z-nodes, and thus it is mainly used for storing infrequently changing data such as configurations or state checkpoints. Storm uses Zookeeper to store the status of the cluster so that it can recover from an outage in any of its distributed component services. This allows the main component services (Nimbus, Supervisors) to be stateless and simply sync with Zookeeper when configuration data is needed.

Nimbus and Supervisors: They are both daemons that form the backbone of Storm and coordinate the whole resource allocation through Zookeeper. Nimbus is the main entry point to Storm and offers an Apache Thrift RPC interface through which client applications interact and submit or kill topologies. Its role is similar to Hadoop's "JobTracker". The cluster's resources are allocated in a "fair" slot-based fashion. Each supervisor monitors a physical node in the cluster and offers a number of available slots. Each slot represents a potential JVM process called a worker that handles tasks for a specific topology (Figure 2.7). Supervisors further update their resources' state in Zookeeper when a change occurs.

Figure 2.7: Nimbus and task allocation

Nimbus allocates slots and pushes into Zookeeper all relevant files needed by workers for the execution of tasks. Assigned workers are further notified by supervisors to fetch the needed files and to spawn and handle task threads, each executing the functionality of a topology task (bolt or spout instance). Each worker uses a specific port for message communication and is also responsible for handling message queue subscriptions by delegating events to task threads through assigned virtual ports. It is important to note that each task has a unique id from the moment it is defined by Nimbus, and it is guaranteed by Storm that there is only one active instance in the cluster for each task id. Even if the task is restarted or reallocated in another worker, it keeps the same task id.

Fault tolerance of the aforementioned components is straightforward. Each worker process writes heartbeats through file IO that are monitored by the respective supervisor. If the worker stops writing heartbeats, its supervisor will detect it and try to restart it. Upon failure of a supervisor, worker processes keep running, but they will be unable to recover if they crash until the supervisor daemon starts again. Also, upon the failure of Nimbus, no resources can be allocated or deallocated until it is alive again; however, the running topologies will not be affected.

ZeroMQ: A popular messaging library implemented in C++, known for its good performance, that offers multiple communication patterns through simple socket interfaces. It supports multiple senders and receivers on a single reliable connection and further allows senders to push messages even before any recipient binds to its port. The two types of sockets used in Storm are:

Push/Pull: A pair of unidirectional sockets for multiple senders and multiple receivers. Sender messages are queued in a round-robin fashion for fairness in a multiple sender-receiver setup. To avoid flooding, sending blocks when buffers are full until messages are pulled. By using push/pull sockets, ZeroMQ can take the role of a fair scheduler in a batch-consuming worker scheme. In Storm, worker processes bind virtual ports with pull sockets and further delegate the respective tuples to tasks internally via pair sockets.

Pair: A simple one-to-one bidirectional socket used mainly for local communication.

Message guarantees (delivery, ordering) follow the respective underlying protocol used (e.g. TCP) and apply only if both the sender and the receiver are correct processes (quasi-reliable). Mind that persistence of messages is not considered in ZeroMQ, unlike in some higher-level messaging middleware implementations (e.g. RabbitMQ), and thus needs to be handled in the application layer if needed.

At the time of writing, there is a community effort to replace ZeroMQ with a custom messaging layer built on Netty, since ZeroMQ hides buffer visibility and thus prevents more customised flow control.

At least once delivery

One of the main advantages of Storm over Yahoo's S4 is its "at least once delivery" property. Even though this feature was not used in our design, it is worth mentioning how it works.

Since workers might fail, tuples can get lost. An application could also be implemented in a way that tuples get lost before they are processed. If tasks cannot afford lost tuples but can work with replays of tuples, this mechanism can be beneficial. In this case Storm can be configured to maintain some extra tasks underneath the topology, called "acker tasks", that transparently track the evolution of each tuple, from the point it first gets emitted by a spout task, via Directed Acyclic Graphs (DAGs). The "acker tasks" also persist their state in Zookeeper to be able to recover it if they fail. Each original tuple is given a unique random 64-bit id at its creation. Once an original tuple is emitted, a DAG is created for it at the respective "acker task" and is used for tracking its progress. Whenever a tuple is further consumed by a target bolt, the Storm API provides a callback for an acknowledgement, which ends up at the respective "acker" task that in turn marks that tuple as processed in the DAG. In most cases a tuple leads to the generation and further dissemination of new tuples.


Figure 2.8: DAG per original tuple

As shown in the example in Figure 2.8, tuple A leads to the creation of tuples B and C in bolt B1. Upon emission, bolt B1 notifies an "anchor" A → B, C via the API. The DAG is then updated by the "acker" task by adding the tuple ids of B and C and tagging A as processed. Once all nodes in the DAG are processed, the DAG is discarded.

In the case of an unprocessed tuple, each topology has a configured timeout that signifies a delivery failure, and the respective spout is then notified via a callback in order to replay the lost tuple. In this example, if B3 fails before processing tuple B and the timeout occurs, spout S will eventually be notified to replay A. Note that this will further lead to bolts B1, B2 and B4 processing the same tuples once again. Generally, this mechanism suffices for applications that do not have strict ordering requirements. It can also be used for applications with stateless operations such as stream filtering. To offer transactional processing, the application itself has to take care of such guarantees.
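The spout side of this mechanism can be sketched as follows (an illustrative sketch, not the thesis code; the source access is a hypothetical placeholder): emitting with a message id opts the tuple into DAG tracking, and the ack and fail callbacks let the spout discard or replay the original tuple.

    import backtype.storm.spout.SpoutOutputCollector;
    import backtype.storm.task.TopologyContext;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.base.BaseRichSpout;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Values;

    import java.util.Map;
    import java.util.UUID;
    import java.util.concurrent.ConcurrentHashMap;

    /** Illustrative spout taking part in Storm's at-least-once delivery. */
    public class ReplayableSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;
        private final Map<Object, Object> pending = new ConcurrentHashMap<>(); // msgId -> payload

        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
        }

        public void nextTuple() {
            Object payload = fetchNextFromSource(); // hypothetical source access
            if (payload == null) return;
            Object msgId = UUID.randomUUID();
            pending.put(msgId, payload);
            // Emitting with a message id makes the acker tasks build a DAG for this tuple.
            collector.emit(new Values(payload), msgId);
        }

        public void ack(Object msgId) {
            pending.remove(msgId); // whole DAG processed, no replay needed
        }

        public void fail(Object msgId) {
            // Timeout or explicit failure: replay the original tuple.
            collector.emit(new Values(pending.get(msgId)), msgId);
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("payload"));
        }

        private Object fetchNextFromSource() {
            return null; // placeholder: pull from an external queue or database
        }
    }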

Further Comments

Storm offers useful abstractions and hides the complexity of resource allocation and flow control for distributed stream-oriented applications. However, its API is rather verbose and does not encourage support for concurrent applications. Among other things, it is hard to offer fail-fast behaviour to tasks if needed. For that reason, in this work we extended its functionality with Kompics [26], a component messaging framework, in order to offer a more elegant and reliable interface for distributed applications on top. The Kompics framework is further described in a following section. Furthermore, resource allocation in Storm is almost static and does not allow manual configuration; thus, it is hard to guarantee the same setup for strictly valid experimental comparisons. However, its good overall performance, in addition to its flexible routing capabilities and parallelisation simplicity, made it our preferred choice on which to build our distributed complex event processing setup.


2.3.2 Yahoo S4

Yahoo was among the first adopters of the concept of distributed stream computing platforms with S4 [5], and its architecture inspired systems such as Twitter Storm and IBM InfoSphere [25]. It is likewise based on Zookeeper for node coordination and employs an actor-based architectural design. The actors of S4 are called Processing Elements (PEs) and listen to unique streams identified by their key. Communication management, such as routing and dispatching of events, is mainly handled by underlying logical hosts called Processing Nodes (PNs), one in each host. Routing is key-based per event, and resource allocation and parallelism are dynamic, based on content variation, in a similar fashion to MapReduce.

The main architectural advantage of S4 over Storm is that it is more dynamic and transparent when it comes to resource allocation. Its PEs are lightweight and more simplistic than Storm's bolts, and the actor programming model makes it more elegant to build distributed applications. However, S4 does not allow for custom content-aware routing and manual configuration. Furthermore, the processing graph design is XML based, which makes it cumbersome to create customised processing graphs. When it comes to fault tolerance, no mechanisms are considered at all, and finally its push-based approach could potentially create flooding issues resulting in lost events and unavailability of PNs.

2.4 Component Based Model and Communication

The general orientation of reliable distributed system design today encourages modular designs that do not rely on locking and shared resource access, since shared access can lead to several unexpected behaviours and to inflexibility for further distribution. Furthermore, modular constructs lead to better reusability, testing and clear designs that allow future reconfigurations. Message-passing concurrency is favoured for scalable distributed computing, and there are several approaches that take advantage of it to offer reliable services with clearly specified guarantees.

2.4.1 Kompics

Kompics [26] is a simple component-based message passing framework designed at the Swedish ICT (SICS) that shows good performance; it is briefly described here since it was used in our implementation. To date, bindings are available in Java and Scala. The main abstractions that Kompics is built upon, and which are strictly bound together, are Events, Ports, Components and Channels, described briefly below.

Event: A basic immutable construct carrying expected attributes. An event is meant to be transferred between component entities as a basic means of providing communication context.


Port: A component interface definition that specifies its expected interactions. A port contains definitions of the expected input (requested) and output (indicated) event types, commonly denoted with the symbols − and + respectively. Contextually, a port is directly associated with a specific abstraction of a service, mechanism or protocol definition such as Network or Timer. For example, given a Message event definition, a Network port type would look like Figure 2.9, since it both requests and indicates a Message.

Figure 2.9: Kompics Network Port Type

Component: A basic processing unit that provides the implemented behaviour of one or more ports. Components might offer the behaviour of specified port(s) and also require the behaviour of other ports that they rely upon. Implementation-wise, a component should implement a required set of Event Handlers and bind them to the ports that are used for delegation purposes. This binding is also called Subscription in Kompics.

Channels: They serve as the glue between components, binding them together via their offered or required ports. Such bindings are made explicitly upon the components' pre-configuration.
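To give a feel for these abstractions, the sketch below declares a port type and a component in the style of the Kompics Java tutorials; the exact base classes and method names (PortType, ComponentDefinition, requires, subscribe) may differ slightly between Kompics versions, so this should be read as an assumed illustration rather than the API used in this work.

    import se.sics.kompics.ComponentDefinition;
    import se.sics.kompics.Handler;
    import se.sics.kompics.KompicsEvent;
    import se.sics.kompics.PortType;
    import se.sics.kompics.Positive;

    // Event: a basic construct carrying the communication context.
    class Message implements KompicsEvent { }

    // Port type: declares the requested (input) and indicated (output) event types.
    class Network extends PortType {{
        request(Message.class);
        indication(Message.class);
    }}

    // Component: requires the Network port and subscribes a handler to it.
    class MessagePrinter extends ComponentDefinition {
        Positive<Network> net = requires(Network.class);

        Handler<Message> handleMessage = new Handler<Message>() {
            @Override
            public void handle(Message message) {
                System.out.println("received a Message indication");
            }
        };

        {
            subscribe(handleMessage, net); // the Subscription described above
        }
    }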

In typical setups there is one Kompics scheduler operating per runtime (i.e. one per JVM) with a specified pool of worker threads, each of them serving a number of component instances. Job allocation between worker threads functions efficiently by further allowing work stealing in case workers run out of ready components to execute. This feature boosts the overall performance and provides efficient utilisation in multi-core environments.

Further Comments

Kompics is a valuable tool, and the implemented message passing concurrency offers a great advantage, especially for Java applications, since Java is not designed for and commonly not used for this type of interaction between entities. It further encourages design simplicity for distributed applications and results in good performance, since it transparently handles component execution efficiently.

An expected side effect observed in the Java version is somewhat verbose configuration, since each message type should have its own definition and bindings must be exhaustively defined. Furthermore, managing Kompics runtimes can be non-trivial and sometimes inelegant if execution is managed by an external service like Storm. Finally, some nice additions to Kompics that are currently missing would be Finite State Machine (FSM) API support and event priority queueing.

2.5 Distributed In-memory Storage

There is an increasing number of distributed storage solutions available today. Their performance and consistency guarantees usually depend on their design goals. Many of them achieve eventual consistency and good load balancing among participating nodes. Often, in-memory storage is preferred, especially when data is only temporarily needed; several such solutions are available today, for example Riak, Redis and memcached. Since Riak was part of the implementation in this project, we give a brief overview of its architecture.

2.5.1 Riak

Riak [27] is an open source, fully distributed key value store that is clearly inspired by Amazon Dynamo's [28] architecture. It is written in Erlang and is widely deployed commercially today.

Data Model: Since Riak is a key value store, data is partitioned into key/value pairs. Furthermore, multiple keys can be grouped hierarchically into buckets. Buckets can hold their own configuration, which applies to all keys contained, and can be seen as sets of keys belonging to the same namespace. Values can be stored in Riak as raw byte arrays or as JSON strings.

Data Distribution: A typical 160-bit key range is equally partitioned in a ring structure into vnodes, and each physical node (participant in the Riak cluster) holds an equal number of vnodes. For improved rebalancing performance, consistent hashing is employed as well.

Fault Tolerance and CAP Tuning: Key/value elements can be replicated onto N vnodes, where N is configurable. Riak is also known for its tuneable CAP properties: write and read operations can be configured to require a specific number of replicas to agree. By default, a quorum-based majority is used for both reads and writes. Mind that consistency requirements can also be made bucket specific.

Storage backend: The storage backend of Riak is also configurable. Some common storage backends used are:

Memory: In-memory storage suited for very fast access of temporary objects.

Bitcask: A log-structured storage backend for fast and persistent access.


LevelDB: Persistent storage with efficient data compression (using Google Snappy) suited for large amounts of data but with slower access speeds.


Chapter 3

Solution Model

3.1 Model Assumptions

A more formal definition of the core abstractions and their required properties gives better insight into the reasoning behind our design. Such abstractions are events and event streams.

3.1.1 Events and Streams

Let V be the set of all possible value sets. For example, if events can have integer fields then Z ∈ V, and similarly for other types like strings, floats etc.

Let e ∈ V1 × ... × Vn for n ∈ N, V1, ..., Vn ∈ V. We call such an e an event, and the set of all possible such events we call Ω. Also let κ : Ω → N, (v1, ..., vn) ↦ n, and let t : Ω → N represent the externally set time.

Furthermore, let a schema S ⊂ Ω be defined such that ∀ e1, e2 ∈ S: κ(e1) = n = κ(e2) and e1, e2 ∈ V1 × ... × Vn for V1, ..., Vn ∈ V. Also, for any schema S, a function σS : X → N with X ⊆ S is called a stream over S.

No total ordering is assumed in σS: for e1, e2 ∈ dom σS with σS(e1) < σS(e2), the statement t(e1) < t(e2) does not necessarily hold.

Definition 1. Given a stream σS and a function p : S → N such that

    ∀ e1, e2 ∈ dom σS: σS(e1) < σS(e2) and p(e1) = p(e2) ⟹ t(e1) < t(e2),    (3.1)

such a p we call a partitioning function.

In practice, the mapping applied by p can be a hash function that regards the event values over a subset of attributes that we call partitioning attributes.
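A partitioning function of this kind could be sketched as follows (illustrative only; the attribute handling and hashing scheme are assumptions, not the thesis implementation): two events with equal values on the partitioning attributes always map to the same natural number.

    import java.util.Arrays;
    import java.util.List;
    import java.util.Map;

    /** Illustrative partitioning function: hashes the values of selected attributes. */
    public class AttributePartitioner {
        private final List<String> partitioningAttributes;

        public AttributePartitioner(String... attributes) {
            this.partitioningAttributes = Arrays.asList(attributes);
        }

        /** Maps an event (attribute name -> value) to a partition id in N. */
        public int partition(Map<String, Object> event) {
            int h = 17;
            for (String attr : partitioningAttributes) {
                Object v = event.get(attr);
                h = 31 * h + (v == null ? 0 : v.hashCode());
            }
            return h & 0x7fffffff; // non-negative: equal partitioning attributes give equal partitions
        }
    }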

3.1.2 Query Processor Model

We further model a Query Processing Engine Q = (q, p, S, ˆσS, ˆσ'S), where S is a schema, ˆσS is the set of all possible input streams over S and ˆσ'S is the set of all possible output streams. Furthermore, we define p as a partitioning function for all σS ∈ ˆσS, and q as a mapping function q : ˆσS → ˆσ'S for which the following properties apply:

• Correctness: Given σS ∈ ˆσS, the result of the consumption of σS by Q is σ'S = q(σS) ∈ ˆσ'S.

• Determinism: For σS, ρS ∈ ˆσS, σS = ρS ⟹ q(σS) = q(ρS).

• Partition Consistency: p is also a partitioning function for all σ'S ∈ ˆσ'S.

• No duplicate emissions: For the consumption of an event e ∈ dom σS, at most one output event e' ∈ dom σ'S will be emitted.

3.1.3 Query Parallelism

Figure 3.1: Query Parallelism Overview

We adopted a query parallelism model for CEP similar to the one defined in distributed DBMS theory [29]. Under this model, intra-query parallelism can be further decomposed into inter-operator and intra-operator parallelism, as depicted in Figure 3.1:

• Inter-operator parallelism: The analysis and decomposition of a high-level query Q into a series T = q0, q1, q2, ..., qn of subqueries that we call base queries, typically each regarding one operator. For a CEP query that process is language specific and involves re-writing techniques [30]. Such base queries could be, for example, a single filter, pattern matching or aggregation operation that could run in separate engines in a pipeline. The benefit of inter-operator parallelism in CEP is mainly that it alleviates the additional state data accumulated in one engine, which can instead be distributed over multiple engines. There is also an additional cost of event delegation delays between engines.

• Intra-operator parallelism: The amount of parallelism per base query, i.e. the number of engines running the same query. Intra-operator parallelism involves the partitioning and dissemination of events between multiple operator instances per query for linear scalability.

3.1.4 Query Planning

Query planning is not the main focus of this work; however, a formal definition of its purpose and of what a query plan is falls within our scope. The two main components, as seen in Figure 3.2, are:

Query planner: A service that receives a CEP query and a stream definition as input and generates a distributed execution plan P.

Stream Computing Platform: A unit that simply takes a plan and sets up a distributed computation based on it.

Figure 3.2: Query Planner

A distributed execution plan P = (T, σS, λ, p, w) consists of:

T: a series of n base queries q0, q1, ..., qn

σS: the input stream identifier used in the execution

λ: a given amount of intra-operator parallelism per operator

p: the partitioning function for σS, where p : S → N

w: the sliding window value that applies to all qi ∈ T, where w ∈ N


With the exception of λ, which should reflect the rate of σS, all other elements can be inferred from Q via query analysis techniques.

To further illustrate how a plan looks in practice, consider the following query in EPL, which emits the average temperature per area in Sweden over the last 10 seconds:

    select avg(temp) as avgtemp, time as when
    from sensor_evt(country='sweden')
    .groupwin(location).win:ext_timed(time, 10 seconds)

The resulting query plan could be the following:

    ----
    q0 - filter::
        select temp, time
        from sensor_evt(country='sweden')
        .groupwin(location).win:ext_timed(time, 10 seconds)

    q1 - aggregate::
        select avg(temp) as avgtemp, time as when
        from sensor_evt
        .groupwin(location).win:ext_timed(time, 10 seconds)

    S = sensors_temp
    parallelism = [3, 3]
    p: p(e) -> hash(e.location)
    w = 10000
    ----

In this example there are two base operators, one for filtering and one for aggregating the events. The stream id is an identifier of the stream and is implementation specific; it could be, for example, a table name in a database. The planner in this example also decides that the approximate number of replicas should be 3 for both base queries. Finally, the partitioning function is defined as the hash of the location attribute, and the window is 10000 milliseconds, as specified by the query.
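For illustration, a plan such as the one above could be carried around in a simple data structure along the following lines; this is a hypothetical representation, since the actual plan format is implementation specific.

    import java.util.List;
    import java.util.Map;
    import java.util.function.ToIntFunction;

    /** Hypothetical in-memory representation of a plan P = (T, σS, λ, p, w). */
    public class ExecutionPlan {
        public final List<String> baseQueries;             // T: base queries q0..qn (EPL statements)
        public final String streamId;                      // σS: input stream identifier, e.g. "sensors_temp"
        public final List<Integer> parallelism;            // λ: intra-operator parallelism per base query
        public final ToIntFunction<Map<String, Object>> partitioner; // p: event -> partition
        public final long windowMillis;                    // w: sliding window in milliseconds

        public ExecutionPlan(List<String> baseQueries, String streamId, List<Integer> parallelism,
                             ToIntFunction<Map<String, Object>> partitioner, long windowMillis) {
            this.baseQueries = baseQueries;
            this.streamId = streamId;
            this.parallelism = parallelism;
            this.partitioner = partitioner;
            this.windowMillis = windowMillis;
        }
    }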

3.2 Distributed CEP Execution

Once a query plan has been submitted by the query planner to a stream computing platform it translates into a processing graph of stratified query processors, one for each query operator instance that will run in parallel.


3.2.1 Execution Environment Assumptions

Here we will sum up all model assumptions of the execution environment.

Process model

Each task has fail-stop behaviour. That further signifies that whenever a dysfunction occurs in any of its subcomponents, the task stops functioning.

Topology View

The topology information of the actual execution is passed to each node upon its initialisation. Thus, each task is expected to have access to the following information from the moment it starts:

• Its base operator id and query

• The partitioning function p

• Its task id

• All task ids of the instances belonging to the directly preceding, succeeding and current base operator

Mind that topology information is never reconfigured. Thus, all task ids stay the same even after potential node restarts.

Links

We further make the following assumptions regarding communication:

• Each task can address and send events to any other task of the directly suc- ceeding base operators

• Channels are quasi-reliable and preserve ordering. That is, any event sent between two correct tasks will eventually be delivered in the correct order

• Message buffers cannot overflow or cause node failures

Event Routing

Event routing should preserve the partitioned ordering property between all incoming and generated streams throughout the processing graph. That leads to the following two invariants:

1. Every stream produced by any query processing task preserves the partitioned ordering property of Definition 1.


2. Every query processing task processes streams that have the partitioned ordering property.

Invariant 1 follows from the partition consistency property of the query processor definition above. In order to satisfy the second invariant, and given that ordering is preserved by the communication channels, the problem can be converted to the following:

3. Every event belonging to the same partition traverses through a single path in the processing graph.

This can be satisfied through partition-aware routing. Since p provides a mapping from an event's partition to N, and p is also used by the query processor, a function that directly maps a partition to a target task of the next operator can be trivially chosen, such as mod hashing. This way we can guarantee that there will be no inconsistencies in any transferred stream partition, since events of the same partition will always be routed to the same query processor, as shown in the example in Figure 3.3.

Figure 3.3: Partition Aware Routing Example
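A minimal sketch of such partition-aware routing is shown below (illustrative only): given the task ids of the directly succeeding base operator, mod hashing maps every partition to exactly one of them, so all events of a partition traverse a single path.

    import java.util.List;

    /** Routes every event of a partition to the same downstream task via mod hashing. */
    public class PartitionAwareRouter {
        private final List<Integer> downstreamTaskIds; // task ids of the next base operator

        public PartitionAwareRouter(List<Integer> downstreamTaskIds) {
            this.downstreamTaskIds = downstreamTaskIds;
        }

        /** partition = p(e); the same partition always maps to the same task id. */
        public int targetTask(int partition) {
            return downstreamTaskIds.get(partition % downstreamTaskIds.size());
        }
    }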


3.2.2 Stream Handling

The source of input streams, their consumption scheme and the scheme of the generated streams are implementation specific; however, it is important to mention the general idea at this point.

Input Streams

A stream source could be any type of data provider from a variety of sources: for example, a database containing historical events such as sensor metrics, an event queue holding real-time events, or any other storage or distributed queue. In the general design we considered the existence of a main stream source provider to which input tasks with the required adapters can connect, pulling each event sequentially and further delegating it depending on whether they are responsible for it, as seen in Figure 3.4. To determine whether a source task is responsible for a given event, a technique such as mod hashing can be employed as well, based on the source task id space.

We further assume that source tasks can embed monotonically increasing sequence numbers into events, per partition, if needed.

Figure 3.4: Stream Input

3.3 Fault Tolerance

To guarantee a continuously uninterrupted flow of events, several ideas were considered. From our link assumptions, only tasks of the next base operator are addressable, and thus group communication techniques to achieve total ordering would not work in an active replication setting. However, since engine state is partitioned, their output would be consistent if only partition ordering were guaranteed between replicas. Furthermore, introducing redundancy could ensure that there is at least one continuous flow path for a partition throughout the processing graph.

3.3.1 Query Processing Pairs

For redundancy we adopted processing pairs of query operators, with the assumption that at least one of the pair replicas is correct. We did not employ any leader election algorithm and instead allowed both replicas to emit their generated streams. Since both replicas are responsible for the same partitions, at least one of them will eventually receive the emitted events of any common partition.

This technique introduces quadruple traffic in the system, as seen in Figure 3.5; however, through this redundancy we guarantee that there will always be at least one node to process any generated stream. Furthermore, we assume that the events of each partition are always ordered with increasing timestamps or sequence numbers, to allow duplicate discarding.
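Because both replicas of a pair emit the same partition-ordered stream, a consumer can discard duplicates with a small per-partition filter along these lines (an illustrative sketch under the assumption, stated above, of increasing timestamps or sequence numbers per partition):

    import java.util.HashMap;
    import java.util.Map;

    /** Per-partition duplicate discarding for replicated, partition-ordered streams. */
    public class DuplicateFilter {
        private final Map<Integer, Long> lastSeen = new HashMap<>(); // partition -> last timestamp/sequence

        /** Returns true if the event should be consumed, false if it is a duplicate (or stale). */
        public boolean accept(int partition, long timestampOrSequence) {
            Long last = lastSeen.get(partition);
            if (last != null && timestampOrSequence <= last) {
                return false; // already received from the other replica
            }
            lastSeen.put(partition, timestampOrSequence);
            return true;
        }
    }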


Figure 3.5: Query Processing Pairs

3.4 Recovery

When a query processor is up again following a failure, it can wait until it has consumed a whole partition window before allowing further emissions in that partition. That could be acceptable in cases where windows are short; for long windows, however, it is more reasonable to employ a recovery method. Since state serialisation is not regarded in our setting, the only option available is to reproduce the missing state of the query processor. Since state is partitioned, upon the restart of a query processor we need to guarantee the following:

Before the consumption of an event e belonging to a partition p1, all previous events of that partition p1 have been consumed prior to e in the correct order.

As a means to achieve that, we considered transactional upstream logging to an external store and partial, per-partition recovery.

3.4.1 Upstream Logging and Partial Recovery

The upstream backup technique, in its basic form, considers the logging of the overall input of a downstream node into an upstream node before emitting it in order to replay it to a downstream replacement for the purposes of recovery. However, this basic approach:


1. Assumes sufficient buffer capacity of the upstream node in order to store the overall input

2. Incurs increased recovery overhead for the retransmission and consumption of the overall buffered input, even though only the events belonging to one stream partition are necessary for the ordered consumption of an event in that same partition.

Thus, the recovering consumer needs to replay the whole history of events buffered by an upstream node, as seen in Figure 3.6.

Figure 3.6: Upstream Recovery

As a first experimental approach we considered using a fault tolerant distributed key value store that serves as an upstream log for each partition. Thus, for an output event e where p(e) = µ, µ ∈ N, the producer itself will first append e to a unique register identified by a key derived from the producer's id, which we call pId, pId ∈ N, and from µ. For example, a hash function h : N → M could be used to derive the key as h(pId + µ), assuming M is the key space of the store.

This allows for partial, on-the-fly recovery of a consumer at the point it receives an event of an “unrecovered partition” since the consumer can directly fetch all events of only that partition as seen in Figure 3.7.

Figure 3.7: Partial Recovery


3.4.2 Log Pruning

Since the current state of a sliding window based CEP engine depends solely on the events received inside the current window range, it can be inferred that not all logged events will be needed for recovery. Thus, we define a log pruning operation, given a schema S ⊂ Ω, as follows:

• Let L be a set over schema S of the ordered logged events of a partition.

• Let R ⊂ L be the pruning set: the set of events to be removed from L.

• Let w ∈ N be the given sliding window.

• Let en be the last event stored in L.

Then, for e ∈ L: e ∈ R ⟺ ts(e) < ts(en) − w.

Furthermore, given the query processor pairs scheme above, with two producers and two consumers, we adopt the following conventions:

1. Only one consumer node of a query processor pair prunes the logs, for example the node with the highest id. If that node is down, we rely on the assumption that it will eventually be restored and prune the partition logs.

2. Pruning is initiated asynchronously upon the consumption of an event of a partition, based on a given pruning factor z ∈ R that denotes the affordable degree of expired events. A larger z means fewer storage operations but a higher number of expired events in a partition log (see the sketch after this list).

3. A consumer initiates a pruning operation on the logs of both producers of a partition.

4. During recovery of a partition, both of its producer logs are fetched and the one starting from the event with the highest timestamp is used for recovery.

5. Recovery and pruning of logged events must always be performed in order.

6. Upon recovery, a consumer should also prune its own log for that partition so that it can resume logging from the recovered state.
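
The sketch below illustrates conventions 1 and 2: only the designated consumer of a pair prunes, and pruning is only triggered once the expired portion of a partition log exceeds the pruning factor z, interpreted here as a fraction of the window size. The interpretation of z and all names are assumptions for illustration, not the exact policy of the implementation.

public class LogPrunerSketch {
    private final boolean designatedPruner; // convention 1: e.g. the pair member with the highest id
    private final long windowMillis;        // sliding window w (timestamps assumed to be epoch millis)
    private final double pruningFactor;     // z: affordable fraction of the window kept as expired events
    private long lastPrunedTs = 0;

    public LogPrunerSketch(boolean designatedPruner, long windowMillis, double pruningFactor) {
        this.designatedPruner = designatedPruner;
        this.windowMillis = windowMillis;
        this.pruningFactor = pruningFactor;
    }

    /**
     * Called after consuming an event with timestamp ts from a partition.
     * Returns the cut-off timestamp if pruning should be issued on both
     * producer logs of the partition (convention 3), or -1 otherwise.
     */
    public long maybePrune(long ts) {
        long cutOff = ts - windowMillis;    // events with ts(e) < cutOff are outside the window
        if (!designatedPruner || cutOff - lastPrunedTs < pruningFactor * windowMillis) {
            return -1;                      // not the pruner, or too few expired events so far
        }
        lastPrunedTs = cutOff;
        return cutOff;
    }
}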

This type of log pruning resembles upstream backup pruning; in our case, however, a separate log is maintained per partition for each event-producing replica. Since both logs are used during recovery to fetch the most up-to-date window, it is also more efficient to prune the logs of all replicas based on the largest timestamp received.

This event logging scheme is not ideal in general, since it relies on per-event transactional logging, which can degrade the throughput of the system significantly. However, this could be improved further by introducing batch logging of tuples.
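
A minimal sketch of such a batching layer is shown below, building on the PartitionLogger sketch from Section 3.4.1; the batch size and flush policy are arbitrary choices for illustration.

import java.util.ArrayList;
import java.util.List;

// Buffers serialised events and appends them to the partition log in
// bulk once the batch is full, reducing the number of storage round trips.
public class BatchingLogger {
    private final PartitionLogger delegate;
    private final int batchSize;
    private final List<byte[]> buffer = new ArrayList<byte[]>();
    private int partition = -1;

    public BatchingLogger(PartitionLogger delegate, int batchSize) {
        this.delegate = delegate;
        this.batchSize = batchSize;
    }

    public void log(int partition, byte[] event) {
        if (this.partition != partition) {
            flush();               // partition changed: flush the previous batch first
            this.partition = partition;
        }
        buffer.add(event);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    public void flush() {
        for (byte[] e : buffer) {  // a store with a native batch API could do this in one call
            delegate.logBeforeEmit(partition, e);
        }
        buffer.clear();
    }
}

Note that batching trades recovery precision for throughput: events that sit in an unflushed batch when the producer fails are missing from the log and can only be recovered by waiting out the window, as described at the start of Section 3.4.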


Chapter 4

Solution Design

In this chapter we explain the general architecture, components and tools used to implement our reliable distributed deployment of CEP engines, based on the model described in Chapter 3. The design aims to offer a modular, component-based approach to executing the distributed CEP engines that allows subcomponents, such as the CEP engine itself, to be replaced in the future.

4.1 Architecture Overview

Our architecture builds on Storm functionality and features to deploy query processing engines. The main reason behind this choice is Storm's simplicity in handling the deployment of, and routing between, given processing entities (bolts). We further composed a component stack on each bolt using the Kompics framework, in order to provide a cleaner and more extensible interface for future integrations and efficient resource utilisation between components. From a top-down view, as seen in Figure 4.1, a client application called Query Topology Builder translates input execution plans into a Storm topology and submits it to the cluster master (Nimbus) for deployment; a sketch of this submission step follows Figure 4.1.

Figure 4.1: System Architecture
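
To make the translation step concrete, the following is a minimal sketch of a topology submission using the standard Storm API of that era (backtype.storm packages). The spout and bolt class names, the parallelism values and the "partition" field are hypothetical placeholders for what a translated execution plan would produce; in particular, duplicating each partition to both members of a query processor pair would require a custom stream grouping, so the fields grouping below only illustrates partition-keyed routing.

import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.tuple.Fields;

public class QueryTopologySubmissionSketch {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();

        // Event sources (spouts) partition and sequence the raw input streams.
        builder.setSpout("event-source", new EventSourceSpoutSketch(), 2);

        // Query processors: partition-keyed routing keeps a partition's events
        // on the same task; pairing replicas would need a custom grouping.
        builder.setBolt("query-processor", new QueryProcessorBolt(), 2)
               .fieldsGrouping("event-source", new Fields("partition"));

        Config conf = new Config();
        conf.setNumWorkers(4);

        // Submit the translated plan to the cluster master (Nimbus).
        StormSubmitter.submitTopology("cep-query", conf, builder.createTopology());
    }
}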


Once submitted and allocated via the master server (Nimbus), the main tasks of the topology (sources and query processors) are configured during their preparation phase.

4.1.1 Query Processors

Figure 4.2: Query Processor Implementation

The query processors are Storm bolts with fail-fast behaviour that encapsulate the query processing requirements modelled in Chapter 3. Their role is to:

• Discard event duplicates

• Delegate events from/to a CEP engine

• Enforce the partition window state of the engines for recovery purposes

• Implement transactional logging of events and pruning if needed

To achieve this we introduced a component stack, with respective implementations, on top of the Storm bolt, as seen in Figure 4.2. The window manager serves as the decision core, maintaining an up-to-date state of the engine and performing storage reads and writes by interacting with the CEP and bolt components. We explain each component in more detail below:

Bolt: The bolt component simply wraps the functionality of a Storm bolt. It forwards any received tuples to bound components wrapped in a STRM_EMIT message and delegates its topology initialisation parameters through a STRM_PREPARE message.
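
As an illustration of the Kompics wiring, a sketch of the Bolt port type and of a component subscribing to it is given below. The event classes, the port directions and the base classes are assumptions inferred from the description above, and the exact Kompics API (for example, the Event base class versus the KompicsEvent interface) differs between framework versions.

import se.sics.kompics.ComponentDefinition;
import se.sics.kompics.Event;
import se.sics.kompics.Handler;
import se.sics.kompics.PortType;
import se.sics.kompics.Positive;

// Hypothetical Kompics events exchanged between the Bolt component
// and the Window Manager; payloads omitted for brevity.
class StrmEmit extends Event { }
class StrmPrepare extends Event { }
class StrmExecute extends Event { }
class StrmFail extends Event { }

// Port type of the Bolt component: received tuples and preparation
// parameters flow out of the bolt, emissions and failures flow back in.
class BoltPort extends PortType {
    {
        positive(StrmEmit.class);
        positive(StrmPrepare.class);
        negative(StrmExecute.class);
        negative(StrmFail.class);
    }
}

// Minimal consumer of the Bolt port, in the style of the Window Manager.
public class BoltConsumerSketch extends ComponentDefinition {
    Positive<BoltPort> bolt = requires(BoltPort.class);

    Handler<StrmEmit> onTuple = new Handler<StrmEmit>() {
        @Override
        public void handle(StrmEmit emit) {
            // duplicate discarding, logging and CEP delegation would start here
        }
    };

    {
        subscribe(onTuple, bolt);
    }
}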

CEP: The CEP component wraps the functionality of a CEP engine in its simplest form, which includes the consumption of CEP_CONSUME events containing an actual event for processing and the emission of CEP_EMIT events containing events generated by the CEP engine. Our original implementation of the CEP component uses Esper as its engine (a sketch of the Esper wiring follows Figure 4.4).


Figure 4.3: Bolt component type

Figure 4.4: CEP component type
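
For illustration, the following sketch shows the kind of Esper wiring the CEP component performs internally: registering an event type, installing a sliding-window EPL statement whose output corresponds to CEP_EMIT, and feeding events into the runtime as CEP_CONSUME. The event schema and the EPL statement are purely illustrative and not taken from the actual implementation.

import com.espertech.esper.client.Configuration;
import com.espertech.esper.client.EPServiceProvider;
import com.espertech.esper.client.EPServiceProviderManager;
import com.espertech.esper.client.EPStatement;
import com.espertech.esper.client.EventBean;
import com.espertech.esper.client.UpdateListener;

import java.util.HashMap;
import java.util.Map;

public class EsperCepSketch {
    public static void main(String[] args) {
        // Register a simple map-based event type (illustrative schema).
        Configuration config = new Configuration();
        Map<String, Object> schema = new HashMap<String, Object>();
        schema.put("partition", Integer.class);
        schema.put("price", Double.class);
        config.addEventType("Tick", schema);

        EPServiceProvider esper = EPServiceProviderManager.getProvider("cep", config);

        // A sliding-window query; generated rows correspond to CEP_EMIT events.
        EPStatement stmt = esper.getEPAdministrator().createEPL(
                "select partition, avg(price) as avgPrice "
              + "from Tick.win:time(30 sec) group by partition");
        stmt.addListener(new UpdateListener() {
            @Override
            public void update(EventBean[] newEvents, EventBean[] oldEvents) {
                // Here the CEP component would wrap newEvents into CEP_EMIT messages.
            }
        });

        // CEP_CONSUME: feed an incoming event into the engine.
        Map<String, Object> tick = new HashMap<String, Object>();
        tick.put("partition", 1);
        tick.put("price", 42.0);
        esper.getEPRuntime().sendEvent(tick, "Tick");
    }
}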

WM: The Window Manager (WM) component binds to and interacts with all other components, serving as the core of a query processor; its behaviour is given in Algorithms 1 and 2 in the Appendix. As can be observed from the algorithms, store operations are synchronous, since reads and writes are transactional. We explain our store implementation further below.

WP: The Window Pruner (WP) component handles the log pruning requests of the WM. As seen in Algorithm 3 in the Appendix, it maintains the last pruned timestamp per partition window and initiates a pruning procedure based on a fixed window size and a given pruning factor.

4.1.2 Event Sources

Event sources are Storm spouts that filter and delegate each input event based on whether they are responsible for its partition. Furthermore, they attach a monotonically increasing per-partition sequence number to events, to aid duplicate discarding through explicit ordering.
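
A minimal sketch of such an event source, based on the standard Storm spout API, is given below; the hard-coded input, the responsibleFor check and the output field names are hypothetical placeholders.

import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;

import java.util.HashMap;
import java.util.Map;

public class EventSourceSpoutSketch extends BaseRichSpout {
    private SpoutOutputCollector collector;
    private final Map<Integer, Long> seqPerPartition = new HashMap<Integer, Long>();

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void nextTuple() {
        // Hypothetical input; a real source would read from an external feed.
        int partition = 1;
        Object payload = "event-payload";

        if (!responsibleFor(partition)) {
            return; // filter out partitions handled by another source instance
        }
        Long seq = seqPerPartition.get(partition);
        seq = (seq == null) ? 0L : seq + 1;
        seqPerPartition.put(partition, seq);

        // Attach the per-partition sequence number used downstream for duplicate discarding.
        collector.emit(new Values(partition, seq, payload));
    }

    // Hypothetical partition ownership check (e.g. modulo over source instances).
    private boolean responsibleFor(int partition) {
        return true;
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("partition", "sequence", "payload"));
    }
}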
