
Degree Project in Computer Science and Engineering, Second Cycle, 30 Credits
Stockholm, Sweden 2020

Improving the performance of stream processing pipeline for vehicle data

GU WENYU

KTH ROYAL INSTITUTE OF TECHNOLOGY

SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE


Improving the performance of stream processing pipeline for vehicle data

Gu Wenyu

2020-10-27

Master’s Thesis

Examiner: Gerald Q. Maguire Jr.
Academic adviser: Anders Västberg
Industrial adviser: Lasse Öberg

KTH Royal Institute of Technology

School of Electrical Engineering and Computer Science (EECS) Department of Computer Science

SE-100 44 Stockholm, Sweden


Abstract

The growing amount of position-dependent data (containing both geographic position data (i.e., latitude and longitude) and vehicle/driver-related information) collected from sensors on vehicles poses a challenge to the computer programs that must process the aggregate data from many vehicles. While handling this growing amount of data, the computer programs that process this data need to exhibit low latency and high throughput, as otherwise the value of the results of this processing will be reduced. As a solution, big data and cloud computing technologies have been widely adopted by industry.

This thesis examines a cloud-based processing pipeline that processes vehicle location data. The system receives real-time vehicle data and processes the data in a streaming fashion. The goal is to improve the performance of this streaming pipeline, mainly with respect to latency and cost.

The work began by looking at the current solution using AWS Kinesis and AWS Lambda. A benchmarking environment was created and used to measure the current system’s performance.

Additionally, a literature study was conducted to find a processing framework that best meets both industrial and academic requirements. After a comparison, Flink was chosen as the new framework.

A new solution was designed using Flink. Next, the performance of the current solution and the new Flink solution were compared using the same benchmarking environment. The conclusion is that the new Flink solution has 86.2% lower latency while supporting triple the throughput of the current system at almost the same cost.

Keywords

Cloud computing, Stream processing, Flink, AWS


Sammanfattning

Den växande mängden positionsberoende data (som innehåller både geo-positionsdata (dvs. latitud, longitud) och även fordons-/förarrelaterad information) som samlats in från sensorer på fordon utgör en utmaning för datorprogram att bearbeta den totala mängden data från många fordon. Medan den här växande mängden data hanteras måste datorprogrammen som behandlar dessa data uppvisa låg latens och hög genomströmning, annars minskar värdet på resultaten av denna bearbetning. Som en lösning har big data- och cloud computing-tekniker använts i stor utsträckning av industrin.

Denna avhandling undersöker en molnbaserad bearbetningspipeline som bearbetar fordonsplatsdata. Systemet tar emot fordonsdata i realtid och behandlar data på ett strömmande sätt. Målet är att förbättra prestanda för denna strömmande pipeline, främst med avseende på latens och kostnad.

Arbetet började med att titta på den nuvarande lösningen med AWS Kinesis och AWS Lambda. En benchmarking-miljö skapades och användes för att mäta det aktuella systemets prestanda.

Dessutom genomfördes en litteraturstudie för att hitta en bearbetningsram som bäst uppfyller både industriella och akademiska krav. Efter en jämförelse valdes Flink som det nya ramverket. En ny lösning designades för att använda Flink. Därefter jämfördes prestandan för den nuvarande lösningen och den nya Flink-lösningen i samma benchmarking-miljö. Slutsatsen är att den nya Flink-lösningen har 86,2 % lägre latens samtidigt som den stöder tre gånger så hög genomströmning som det nuvarande systemet till nästan samma kostnad.

Nyckelord

Molntjänster, Strömbearbetning, Flink, AWS


Acknowledgments

This thesis was written in collaboration with the Scania ECCE group. I would like to first thank my supervisors Lasse and Pei for their careful guidance and support. Working with cloud products turned out to be very challenging and I encountered more difficulties than expected along the journey. I am grateful to have had them as my supervisors and to have received great and immediate help from them. Additionally, I appreciate all the help from my team, no matter whether it was a technical discussion or simply mental encouragement.

Also, I would like to thank my examiner Gerald Q. Maguire Jr. and my advisor Anders Västberg for their valuable feedback and insight on my thesis. Their comments helped me to improve my thesis, inspired new ideas, and showed me how to improve a thesis.

Last, I want to thank my parents and my friends for being there with me.

Stockholm, October 2020
Wenyu Gu


Table of contents

Abstract
Keywords
Sammanfattning
Nyckelord
Acknowledgments
Table of contents
List of Figures
List of Tables
List of acronyms and abbreviations
1 Introduction
1.1 Background
1.2 Problem
1.3 Purpose
1.4 Goals
1.5 Research Methodology
1.6 Delimitations
1.7 Structure of the thesis
2 Background
2.1 Data Processing
2.1.1 Batch processing
2.1.2 Micro batch processing
2.1.3 Stream processing
2.1.4 Time Domains
2.2 Stream Processing Architecture
2.2.1 Lambda architecture
2.2.2 Kappa architecture
2.3 Cloud Computing and Cloud Products
2.3.1 Cloud computing
2.3.2 Amazon Web Services
2.4 Related work
3 Current solution
3.1 Protocol buffers
3.2 Current pipeline breakdown
3.2.1 Decoder
3.2.2 Enricher
3.2.3 Cleaner
3.2.4 Normalizer
3.3 Position messages
4 Analysis of stream processing frameworks
4.1 Requirements for stream processing applications
4.1.1 Fault tolerance
4.1.2 Delivery guarantee
4.2 Frameworks overview
4.2.1 Spark Streaming
4.2.2 Flink
4.2.3 Storm
4.3 Framework comparison
4.3.1 Latency
4.3.2 Fault tolerance
4.3.3 Scalability
4.3.4 Delivery guarantee
4.4 Discussion
5 Implementation of New solution and Benchmark Environment
5.1 Flink project design
5.2 Benchmark environment
5.2.1 Benchmark environment design
5.2.2 Data generator
5.2.3 Benchmark environment setup
6 Experiments and Results
6.1 Evaluation metrics
6.2 Evaluating current processing pipeline
6.2.1 Testing plans
6.2.2 Throughput of Lambda function
6.2.3 Lambda warm-up period
6.2.4 Evaluation results
6.3 Evaluating Flink project
6.4 Cost calculation
7 Conclusion and Future Work
7.1 Conclusion
7.2 Future work
References


List of Figures

Figure 1-1: Data flow diagram of trucks processing pipeline in Scania
Figure 2-1: Batch processing with a time-based batch interval
Figure 2-2: Stream processing
Figure 2-3: Lambda architecture
Figure 2-4: Kappa architecture
Figure 2-5: Kinesis Stream Architecture
Figure 3-1: Current pipeline overview
Figure 4-1: Spark architecture in cluster mode
Figure 4-2: Workflow of a Spark Streaming job
Figure 4-3: Example of Flink's Dataflow graph
Figure 4-4: Flink architecture
Figure 4-5: Storm topology
Figure 4-6: Storm architecture
Figure 5-1: Dataflow graph of new architecture design (the abbreviations are explained in the text below)
Figure 5-2: Benchmark environment overview
Figure 5-3: Calculating latency of the current pipeline
Figure 5-4: Calculating latency of the new solution
Figure 6-1: Current processing pipeline in subsystems
Figure 6-2: Decoder subsystem latency for different message arrival rates
Figure 6-3: Fitting the queue time of decoder subsystem
Figure 6-4: Cleaner subsystem latency for different message arrival rates
Figure 6-5: Lambda cold start problem
Figure 6-6: Latency of the current solution under different loads
Figure 6-7: Subsystem latency in the Flink solution
Figure 6-8: Fitting the queue time of the new solution
Figure 6-9: Latency of Flink solution under different loads
Figure 6-10: Histogram of RPS
Figure 6-11: Best fit of RPS


List of Tables

Table 3-3-1: Example of a raw position message structure
Table 4-1: Comparison of candidate frameworks
Table 5-1: Configuration settings of the current pipeline in the benchmark environment
Table 6-1: Throughput of each component in the current solution
Table 6-2: Fitting with various distributions
Table 7-1: Comparison results of the two solutions


List of acronyms and abbreviations

AWS    Amazon Web Services
API    Application programming interface
CAPEX  capital expenditures
ESP    event stream processing
GPS    global positioning system
IaaS   Infrastructure as a Service
JSON   JavaScript Object Notation
JVM    Java virtual machine
OPEX   operating expenses
PaaS   Platform as a Service
RDD    resilient distributed datasets
RPS    requests/second
SaaS   Software as a Service
VPN    virtual private network
XML    Extensible Markup Language


1 Introduction

This chapter describes the specific problem that this thesis addresses, the context of the problem, and the goals of this thesis project, and it outlines the structure of the thesis.

1.1 Background

The growing volume of data generated from vehicles opens up new possibilities to understand and explore the transportation domain. Global positioning system (GPS) sensors on trucks continuously push real-time location data to data centers, connecting individual vehicles to a central system. By receiving information from all transport vehicles, the central system can track each vehicle and derive a better understanding of the overall traffic flow in which these vehicles are located. However, the increasing amount of data brings new challenges to computer systems. One important aspect is latency. For example, real-time position data from a moving truck can be used to infer the traffic situation of this vehicle. If one truck moves slowly on a section of road, there is a high possibility that it might be stuck in traffic. If the traffic jam is likely to be long and severe, then route rescheduling would be desirable. However, the value of rerouting decreases with time. If it takes a long time to process the location data of the set of trucks in order to determine that one or more of these trucks is in a traffic jam, then the latency of rerouting increases and its value diminishes. The result is that location data from ten minutes in the past is not meaningful or effective with regard to planning what a vehicle should do now.

Scania is a world-leading manufacturer of commercial transport vehicles, especially trucks. The Connected Service Group in Scania offers preprocessing services for truck data. At Scania, there are several pipelines for processing different types of truck data. Their general workflow can be found in Figure 1-1. Sensors installed on the trucks periodically collect information about the trucks, including their location, dashboard data, and data from other components. These data are sent via a radio basestation and routed over the Internet to a cloud data center. Depending on the type of data, the data are either sent separately or grouped into one message. When data arrives at the destination servers at the cloud data center, the service running in the cloud manipulates and transforms the raw data into a clean and suitable format. One processing pipeline is responsible for each type of data. The location pipeline handles location data, while another pipeline handles vehicles' dashboard data. A dashboard service at the edge of the cloud monitors the whole pipeline and shows evaluation metrics via a graphical interface. After leaving the cloud, the processed data is passed to customers for deeper analysis. One such example is a fleet management system that accurately shows the driving route of trucks and provides analytics of the driving process.

To facilitate maintenance of the processing pipelines, all processing blocks are deployed in Amazon's public cloud using "serverless computing". Latency is one of the most important performance metrics, and the system should provide near-real-time results, i.e., low latency processing.

Additionally, Scania is also interested in lowering their operational costs. Using serverless computing technology reduces the workload of managing physical servers, but this comes at great expense. Considering that this processing will be needed for a long time and even more data will be processed in the future, it is worth considering whether there is a more economical option than the current one. In this thesis, the focus is on the pipeline that handles position data in the cloud, particularly the processing blocks (for location data) shown in Figure 1-1.


Figure 1-1: Data flow diagram of trucks processing pipeline in Scania

1.2 Problem

Scania runs their current processing pipeline for vehicle data in the Amazon cloud, using Amazon Web Services (AWS)* serverless computing services. The current technology stack was chosen because it allows developers to focus on application development without needing to worry about provisioning or managing servers. The downside is that it is costly, as it runs on a function-as-a-service platform, i.e., AWS Lambda (see Section 2.3.2.2). The latency of the current pipeline could be improved to approach real-time. The system works well with the current amount of data; however, its capacity to handle an increasing amount of data is in doubt. To cope with a varying (and increasing) number of messages, the system should be scalable. Overall, the performance of the current solution needs improvement.

* Further details of AWS are given in Section 2.3.2.

The research question in this thesis is:

How could the latency of stream processing be improved in a cloud-based pipeline while maintaining high throughput and low cost? Here, latency means the time taken for a message to pass through the complete processing pipeline in the cloud.

1.3 Purpose

The purpose of this degree project is to improve the current pipeline by using a more suitable framework. Scania pays a lot of attention to efficiency; hence, a new solution with lower cost and competitive performance is of great interest to the company. Moreover, this pipeline does preliminary work for subsequent analytics. Therefore, by shortening the processing time, the company can offer a better cloud-based service to its customers, while the customers receive results faster and have greater potential to use their data to improve their own processes.

1.4 Goals

The main goal of this degree project is to decrease the latency of the processing pipeline without sacrificing other attributes, such as throughput and cost. Narrowing the scope of the project, the goal is to reduce the latency of the processing pipeline running in a public cloud. Therefore, our primary goal can be summarized as:

Goal 1: Design and evaluate a new architecture to replace the current Scania stream processing pipeline.

Scania’s Connected Service Group currently uses Amazon Kinesis Data Streams (see Section 2.3.2.1) and Amazon Lambda functions (see Section 2.3.2.2) to build the current service.

Unfortunately, the growing amount of data makes it more and more expensive to run the service using the current infrastructure. At the same time, in academia there are other big data frameworks that could potentially be applied to streaming data, such as Apache Flink and Apache Spark Streaming. A new design could use one of these frameworks. However, before a comparison can be made of alternative stream processing architectures, there is a need to collect data for such a comparison.

Goal 2: Create a benchmark environment to measure the performance of two alternative pipelines.

The current pipeline is deployed in Scania's real-world production environment. AWS provides a dashboard and basic monitoring facilities to measure the performance of a service, such as the number of messages processed and the number of failures. To assess the system and estimate its maximum capacity, we need to collect data on other metrics, such as throughput and latency. Unfortunately, Amazon does not provide this data. Additionally, the experiments need to be run a number of times so that the latency distribution can be measured; I believe that a probability density plot, rather than a single number, better describes the system's latency and throughput. Therefore, the second goal is to create a benchmark environment that can be used to assess the system's capacity.

1.5 Research Methodology

In this thesis, an empirical research method is used. Initially, a literature study will be conducted to study stream processing and big data frameworks. With the knowledge obtained from this, one framework that meets the project's requirements will be chosen. A benchmark environment will be created in which performance measurements of the current stream processing can be made. Data used for evaluation will be synthetically generated to mimic the real data stream, i.e., a stream of fake messages will be generated and injected at different rates and following different distributions. To assess the performance of the new solution, it will be deployed and tested in the same benchmarking environment that was used to measure the performance of the current stream processing pipeline. After obtaining results for the two solutions, their performance will be compared using a number of different metrics (specifically latency and cost).

1.6 Delimitations

This thesis focuses on the processing pipeline in the cloud. As shown in Figure 1-1, the processing blocks in the cloud computing service delimit the scope of the thesis. Data transmission from the sensors on the trucks until the data arrives at the destination servers is outside the scope of this thesis. The data analytics performed after the data leaves the cloud are also out of scope.


1.7 Structure of the thesis

Chapter 2 presents relevant background information about data processing techniques. The second part of the background chapter introduces the concept of cloud computing and several popular cloud products. Chapter 3 introduces the current solution used at Scania to process position messages. Chapter 4 gives an overview of three popular stream processing frameworks and the requirements for the selected framework. Chapter 5 presents the design of the new solution as well as the benchmarking environment. Chapter 6 presents an evaluation of the performance of the two solutions and shows the results. Finally, in Chapter 7, some conclusions based on these findings are drawn.


2 Background

This chapter provides background knowledge that is related to this thesis project. Section 2.1 introduces different types of data processing methods, including stream processing, and describes the differences between them. Section 2.2 describes the types of streaming architectures. Finally, Section 2.3 describes the concept of cloud computing and cloud products which are relevant to this project.

2.1 Data Processing

Over the last decade, the term “big data” has attracted increasing interest, mainly due to the amount of information that has become available. However, big data is not simply about the amount of data.

Other aspects of big data, namely variety and velocity, also contribute to a comprehensive definition of big data [1]. Considering the growing demands for making use of the increasingly available data, big data cannot be simply handled via traditional batch processing methods. As a result, new data processing methods are being introduced. This section presents different types of data processing schemas and discusses which one is most suitable for this project.

2.1.1 Batch processing

In batch processing, data is collected after arrival into a group called a batch. The batch is regarded as a unit and processed as a whole. In general, two ways of grouping data into batches can be distinguished: time-based and event-based. A time-based approach collects the data at regular intervals. For example, the system might wait for incoming data for one hour and afterward process the last hour's worth of data together. Figure 2-1 shows an example of a time-based batch processing paradigm. An event-based batch is triggered by some specific condition; for example, when the number of received messages exceeds a threshold or a specific message is received.
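To make the two triggering strategies concrete, the following minimal Python sketch (illustrative only, not taken from any production code) flushes a batch either when a configured time interval has elapsed (time-based) or when the number of buffered messages reaches a threshold (event-based).

```python
import time

class Batcher:
    """Illustrative batcher: flushes on a time interval or on a count threshold."""

    def __init__(self, interval_s=3600.0, max_size=1000):
        self.interval_s = interval_s   # time-based trigger, e.g. one hour
        self.max_size = max_size       # event-based trigger: message count
        self.buffer = []
        self.last_flush = time.time()

    def add(self, message, process_batch):
        """Buffer one message and flush the batch when either trigger fires."""
        self.buffer.append(message)
        time_trigger = (time.time() - self.last_flush) >= self.interval_s
        count_trigger = len(self.buffer) >= self.max_size
        if time_trigger or count_trigger:
            process_batch(self.buffer)  # the batch is processed as one unit
            self.buffer = []
            self.last_flush = time.time()

# Example: a batch of at most 3 messages, printed when the count trigger fires.
batcher = Batcher(interval_s=60.0, max_size=3)
for i in range(7):
    batcher.add({"msg": i}, process_batch=print)
```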

Batch processing is an efficient way to process a large amount of data when the program has a loose latency requirement. Batch processing executes the job at a specific time interval; as a result, it helps to save energy and reduce operational costs. Batch processing is frequently encountered in daily life. For example, in finance departments many reports are generated on a daily, monthly, or yearly basis, hence the batch interval would be one day, one month, or one year. Additionally, in the financial world the batch processing must be completed by a specific time limit. For example, in banking and market trading the daily transactions have to be balanced by a specific hour following the close of business.

Figure 2-1: Batch processing with a time-based batch interval

2.1.2 Micro batch processing

Similar to batch processing, micro batch processing collects and processes data in groups, but at a more frequent interval than traditional batch processing. Sometimes the batch is created from data collected over a period of minutes or seconds. By adopting a smaller time increment, the system can output results faster; for example, with a delay of seconds or even milliseconds.


Micro batch processing supports near real-time processing, but the results are not exactly up-to-date, thus inducing a certain amount of latency. Micro batch processing is widely used by e-commerce companies to track user activities online and append the aggregated and processed data to logs.

2.1.3 Stream processing

Unlike batch processing, stream processing performs actions on the data immediately after it arrives. As illustrated in Figure 2-2, there is no waiting time between the arrival and processing times. This means that rather than being processed in a group, data is processed individually in stream processing.

Scania is currently working to provide a service for vehicle locations with low latency. The main priority of the project is to provide near real-time results. Unfortunately, in batch and micro-batch processing, the data needs to wait for a short or long period to be processed, thus failing to meet the desire for near real-time data. Providing near real-time data matches the characteristics of the stream processing pattern, therefore stream processing was chosen and applied in the current project. Chapter 4 examines several popular streaming frameworks and compares them in different dimensions.

Figure 2-2: Stream processing

2.1.4 Time Domains

Time is an important notion when processing data. Two different types of time domains can be considered in data processing: event time and processing time [2]. Event time describes the time when an event actually happens. Processing time is the time between when data is received by the stream processor and when data is output. For example, if an online shopping website runs a service recording user activities in the backend, the event time of a user checking out an item is the time the user clicks on the item with their mouse. In this case, processing time captures the time between when the click is observed and when there is output from the system based upon this click. An important difference between these two notions of time is that event time is tied to when an event happens and never changes. However, the processing time is variable and depends on many factors.

Ideally, event time and processing time should be aligned. However, in reality, many aspects of the system need to be considered, including communication time over the network, queuing time (i.e., waiting to be processed), and the actual service time for processing the request. Therefore, in the real world, it is fairly common to have a substantial time gap between event time and processing time.
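As a small illustration (with invented timestamps rather than data from the pipeline), the snippet below computes the gap between event time and processing time for a record that carries its own event timestamp:

```python
import time

# A record as it might arrive at a stream processor; event_time is set by the
# source (e.g. a GPS sensor) when the event actually happened. The 7.5 s offset
# is made up purely for illustration.
record = {"vehicle_id": "truck-42", "event_time": time.time() - 7.5}

processing_time = time.time()               # when the processor handles the record
lag_seconds = processing_time - record["event_time"]

# In an ideal system the lag would be close to zero; network delay, queuing,
# and service time make it larger in practice.
print(f"event-time lag: {lag_seconds:.1f} s")
```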

2.2 Stream Processing Architecture

A streaming architecture is a set of technologies that handle incoming streaming data. Some action or actions are performed when data arrives. There are two kinds of modern stream processing architectures: (1) the Lambda architecture, which takes advantage of both stream and batch processing and supports both workflows, and (2) the Kappa architecture, which purely focuses on stream processing. This section introduces both architectures and discusses their differences.

2.2.1 Lambda architecture

The Lambda architecture was introduced by Marz and Warren [3]. It is designed to handle a huge quantity of data via two processing methods, stream and batch processing, at the same time. It has now become a well-known framework and has been widely adopted by companies such as Twitter [4].


Figure 2-3 shows an overview of the Lambda architecture. It is composed of three different layers: batch layer, streaming layer, and serving layer. When new data arrives, it is simultaneously fed into two layers: batch and streaming. The streaming layer is also called the speed layer.

Figure 2-3: Lambda architecture

The goal of the streaming layer is to process the data immediately and produce a result in a near real-time fashion.

In contrast, the batch layer stores the data when it arrives and computes results over batches. By combining the current data with historical data, the batch layer operates on more data than the speed layer and generates a global result. This global result provides additional information to customers that cannot be seen from an immediate result. However, this result is delivered with high latency due to the time required to collect the data.

The serving layer receives queries and gives the customer access to the results from both the speed layer and the batch layer.

A huge advantage of the Lambda architecture is that users can get a complete view of their data via the batch view, at the cost of high latency, while from the speed layer users get the most up-to-date information with low latency. The batch layer accumulates data in the data center and batches can be recomputed in a matter of days or weeks. The batch view gives a more accurate and less skewed result, which is a great supplement to the nearly instant processing of the speed layer. However, using the Lambda architecture still poses noticeable challenges. Since it contains two data processing workflows, two different sub-systems are needed. No single framework and working environment can be used for both systems. The effort to design, prepare, build, and operate two subsystems is high. For enterprises using the Lambda architecture, this effort is even greater when the system is distributed [5].

2.2.2 Kappa architecture

The Kappa architecture was introduced by Jay Kreps as a simplification of the Lambda architecture. He shared his experiences of working with the Lambda architecture in a post [6] and proposed an alternative method to implement real-time processing, which he called the Kappa architecture. The idea of the Kappa architecture is to abstract both batch and stream processing jobs into a single workflow.

Figure 2-4 shows the workflow of the Kappa architecture. When data arrives, it first flows into the streaming layer, where some immediate processing is done. After that, the processed data goes into the serving layer and is saved. The serving layer is responsible for receiving user queries and responding with the latest information.

The Kappa architecture simplifies the Lambda architecture by removing the batch layer. To replace the batch processing, data is simply fed into the streaming layer and recomputed. If the system needs a modification and re-deployment, all historical data in raw format is replayed through the pipeline after re-deployment and the results are updated. The biggest improvement of the Kappa architecture is that only one instance of the transformation logic is needed, rather than maintaining two different sets of operational logic. However, this design introduces additional work for the database, as it must read and re-write a large amount of data.

Figure 2-4: Kappa architecture

2.3 Cloud Computing and Cloud Products

Since the year 2000, cloud computing has been frequently mentioned and it has become a popular choice for operating modern information technology services. Cloud computing is a general term describing any services that can be accessed via a network. Cloud-based products are widely used in Scania's processing services. This section introduces the concept of cloud computing and then describes the cloud products used by Scania.

2.3.1 Cloud computing

The underlying idea of cloud computing dates back to the 1960s, when IBM first introduced a new time-sharing schema for using a computer [7]. A major problem at the time was that computers needed to wait for user tasks and their computing power was wasted for some of the time. To tackle this problem, IBM invented a new computing paradigm, which allowed multiple programs to execute simultaneously on the same machine. One computer was connected to several typewriter-like remote consoles, where users could submit their tasks to the central computer. Instead of waiting for only one user, the computer processed the tasks of multiple users, which greatly reduced the idle time and improved efficiency.

In the 1990s, resource sharing was adopted by telecommunication companies. Historically, they offered point-to-point connections. However, when a connection was not used for some fraction of the time, the result was inefficient resource usage. As a solution, they started offering virtual private networks (VPNs). A VPN could provide the same bandwidth as a point-to-point connection but at a much lower cost. These VPNs also supported dynamic routing, enabling routers to balance the load over different routes. This is where the term "cloud" originates [8].

In the 2000s, this idea of sharing resources was extended to computers and led to the term "cloud computing". Cloud computing provides virtualized resources to users, including storage, processing power, and network connectivity. These resources reside in multiple physical servers*, instead of one physical computer. Users access cloud services via a common interface, without knowing the complex infrastructure behind it. To achieve this, a cloud service provider deploys a large number of resources in a data center. In each computer, virtualization enables multiple system images to run on top of the hardware, making it possible to serve several customers at the same time, each of whom sees their own virtual machine.

* Typically realized using high density commodity servers.

Cloud computing has been widely adopted by business. There are three basic types of cloud computing models customers can utilize: Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS). IaaS provides fundamental building blocks to deploy applications, including network, storage, and servers. IaaS gives users complete control of the virtual machine and offers the highest flexibility to configure their implementation in detail. In PaaS, users cannot touch the underlying infrastructure. In this model, they get access to an operating system and thus focus mainly on program or service development, deployment, and operation. Finally, SaaS offers clients a complete service that is ready for use. SaaS removes the need for knowledge of the underlying architecture or any concerns about resource management. The only thing users need to think about is how to use the provided service. A typical example of a SaaS service is Google email. With all functions deployed in the Google cloud, users can send and receive emails through an easy-to-use web page, without worrying about network connections, proxy settings, storage resources, etc.

One huge appeal of cloud computing is that cloud services are managed by cloud providers and users no longer need to worry about maintenance and provisioning. Moreover, it is impractical for some companies to spend the capital needed to build up their own infrastructure. Cloud computing offers an on-demand service and it is more flexible than having dedicated physical servers. It is on-demand and has a "pay-as-you-go" subscription, thus customers are charged simply on the basis of resource usage (usually CPU time). This shifts costs from capital expenditures (CAPEX) to operating expenses (OPEX). Another noticeable benefit is that cloud computing provides the potential for scaling when facing varying demands (i.e., loads). Instead of purchasing more hardware, users upgrade or downgrade their subscription level in the cloud. These benefits of using cloud computing services motivate companies to move their current computing architecture to the cloud, nourishing the rapid development of cloud providers. Well-known cloud operators include Amazon Web Services, Microsoft Azure [9], and Google Cloud [10]. The next section will give further details of Amazon Web Services.

2.3.2 Amazon Web Services

Amazon Web Services (AWS) [11] is one of the biggest cloud platforms in the world. AWS is provided by Amazon. AWS offers 175 cloud products and services, including computing power, cloud storage, network connectivity, and security services. Scania currently runs a streaming pipeline on the AWS cloud. The following subsections introduce three of the main AWS products used in Scania’s current streaming project: Amazon Kinesis Streams (Section 2.3.2.1), Lambda function (Section 2.3.2.2) and DynamoDB (Section 2.3.2.3).

2.3.2.1 Amazon Kinesis Data Streams

Amazon Kinesis Data Streams [12] is a streaming platform for receiving real-time and continuous data. A Kinesis stream can be used as a connecting service between producers and consumers. It receives data records from producers, processes the data, and feeds results to a consumer. When the data arrives at Kinesis Streams, the Streams application starts to consume it. Usually, it takes less than a second from start to finish before the final data is delivered.

Figure 2-5 shows an example of the Kinesis Stream architecture. Common data sources include website logs, stock price updates, social media feeds, and IoT devices. As for consumers, AWS services (such as AWS Lambda and Amazon Kinesis Data Analytics) or other streaming frameworks (such as Apache Spark) can be used to build custom processing applications. In this project, GPS receivers installed on vehicles collect vehicle locations (at rates ranging from 1 Hz to 400 Hz, typically 1 Hz), place this data into messages, and send these messages to our Kinesis Streams applications.

One benefit of Amazon Kinesis Streams is that it is a fully managed service. Therefore, it eliminates the need to manage the infrastructure and provision servers. Moreover, the service can scale up or down within a minute.

There are some important notions used in Kinesis Streams. A Kinesis Stream is essentially composed of a set of shards. Each shard contains a sequence of messages, also known as data records. A data record is the fundamental unit for storing data in Kinesis Streams. Each data record is comprised of a sequence number, a partition key, and a data blob. The data blob stores the actual data. The partition key is used to group data into shards within a Kinesis Stream. Data records with the same partition key will be assigned to the same shard. The sequence number is a unique, monotonically increasing identifier for data records within a shard. When a producer sends data to a Kinesis Stream, it pushes the data together with a partition key; the data is then encrypted by AWS and written to a data blob. The sequence number is added when the data record is sent to a shard. On the consumer's side, there is one worker per shard to process messages, and it processes messages one by one. Kinesis does not guarantee strict ordering of data.

Figure 2-5: Kinesis Stream Architecture
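To illustrate the producer side, the hedged sketch below uses boto3 to put a single record into a Kinesis stream. The stream name, region, and the choice of the vehicle identifier as partition key are assumptions made for this example, not details of Scania's actual setup.

```python
import boto3

# Both the region and the stream name are assumptions for this sketch.
kinesis = boto3.client("kinesis", region_name="eu-west-1")

def send_position(vehicle_id: str, payload: bytes) -> None:
    """Put one serialized position message into a (hypothetical) Kinesis stream.

    Records with the same PartitionKey are assigned to the same shard, so using
    the vehicle id keeps one vehicle's positions together.
    """
    kinesis.put_record(
        StreamName="vehicle-positions",  # hypothetical stream name
        Data=payload,                    # the data blob (raw bytes)
        PartitionKey=vehicle_id,         # used to group records into shards
    )
```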

2.3.2.2 AWS Lambda

Amazon Lambda is a serverless computing service [13]. Users upload their function to AWS Lambda, where it will be managed and executed by Amazon. AWS Lambda supports multiple languages, such as Python and JavaScript. Users choose their preferred language and write functions in that language.

AWS Lambda is a serverless computing service, hence developers do not need to think in terms of discrete servers when developing their programs. The service provider takes care of all server-related work, including preparing a working environment and managing server status.

Amazon Lambda works in the following way. Each function is executed in a separate container, simplifying the preparation of the environment and reducing the trouble of environment conflicts. When a user uploads a function, Lambda instantiates a container with the necessary memory and computing capacity, then loads the function into the container and executes it. The container is hosted in a multi-tenant machine using virtualization technology. Therefore, when using the Lambda service, users do not need to know how the underlying system works or how the infrastructure is built. Amazon Lambda is a PaaS product that minimizes the maintenance work and allows programmers to focus on program development.
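To illustrate how a pipeline component such as the Decoder can be triggered by Kinesis, the following is a minimal Python Lambda handler. The event structure is the standard one AWS passes for Kinesis triggers; the actual decoding step is left as a placeholder, since the real message schema is internal to Scania.

```python
import base64

def lambda_handler(event, context):
    """Minimal sketch of a Lambda function consuming a batch of Kinesis records."""
    processed = []
    for record in event["Records"]:
        # Kinesis delivers the data blob base64-encoded inside the trigger event.
        raw_bytes = base64.b64decode(record["kinesis"]["data"])
        # Placeholder for the real decoding step (e.g. protobuf parsing).
        processed.append({
            "partition_key": record["kinesis"]["partitionKey"],
            "size_bytes": len(raw_bytes),
        })
    return {"processed": len(processed)}
```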


2.3.2.3 Amazon DynamoDB

Amazon DynamoDB is a NoSQL database [14]. DynamoDB is a fully-managed database service, thus users focus only on their applications. AWS takes care of database security and data backup. In the current processing pipeline, DynamoDB is used as the database because it works well with other AWS products and has great functionality (such as the ability to scale).

Since DynamoDB is a non-relational database, it stores data in a key-value format. Basic concepts in DynamoDB are tables, items, and attributes. A table is a collection of data records, called items in DynamoDB. Each item can be described by a set of attributes, which are similar to columns in a relational database system. Inside a table, a primary key distinguishes items. There are two ways to construct a primary key: a partition key comes from one single attribute, while a composite primary key is a combination of two attributes. A secondary index is another attribute in the table, other than the primary key, that can be used in queries. DynamoDB supports the use of secondary indexes to speed up query processing.

In addition, DynamoDB offers two reading/writing capacity modes to address performance predictability issues. When a cloud-based project faces a burst of requests, customers want to lower the impact of this burst on the system's overall performance and get the burst processed as soon as possible. To do this, DynamoDB offers an on-demand reading/writing capacity mode, in which the capacity of the database can be instantly adjusted depending on the system workload. This mode is suitable for a system that has an unpredictable load pattern and wants to achieve a steady low-latency service time. In this on-demand mode, DynamoDB charges users by the number of requests. For services that face a regular and consistent workload, DynamoDB has a provisioned reading/writing capacity mode. When using this mode, customers specify the expected number of read/write requests their services will have. DynamoDB can also apply an auto-scaling policy in the face of varying demand. For the provisioned reading/writing capacity mode, DynamoDB charges users monthly based upon the level of capacity they request.
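A hedged boto3 sketch of the basic read and write operations against a DynamoDB table is shown below; the table name, key attribute, and region are hypothetical, and the real Enricher tables are defined by Scania.

```python
import boto3

# The region, table name, and key attribute are assumptions for this sketch.
dynamodb = boto3.resource("dynamodb", region_name="eu-west-1")
drivers = dynamodb.Table("driver-info")

def get_driver(driver_id: str):
    """Read one item by its primary key; returns None if it is not in the table."""
    response = drivers.get_item(Key={"driver_id": driver_id})
    return response.get("Item")

def put_driver(driver_id: str, name: str) -> None:
    """Insert (or overwrite) one item."""
    drivers.put_item(Item={"driver_id": driver_id, "name": name})
```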

2.4 Related work

Processing location data from vehicles has been an active research area and benefits from improvements in hardware sensors as well as in software technologies. These technologies allow geoprocessing systems to process large volumes of data in a reasonable time.

Many researchers have looked into the problem of data placement in geo-distributed systems [16, 17]. Geoprocessing systems receive location data from around the world and have distributed data centers. To minimize the response time and fully utilize the resources in all data centers, one problem that arises is how to move data to the data center that is most suitable to process it. Qifan et al. proposed a low-latency system to analyze geographical datasets worldwide [15]. They designed an algorithm to determine which data to move and where to move it. They preferred to move datasets with a high value-per-byte, i.e., datasets associated with many queries.

Another challenge for geo-distributed systems is choosing a proper storage service that can replicate data and provide data consistency [16]. Zhe et al. presented a cost-effective database for geographically distributed data [17]. They found that using multiple cloud products can achieve lower latencies as well as reduce the cost. They designed a set of replication policies considering the requirements of geoprocessing applications and their workload characteristics.


3 Current solution

This chapter introduces the current stream processing service running at Scania and its technology stack. Section 3.1 introduces Protocol Buffers, the data structure used to store and serialize position data. Section 3.2 gives a brief introduction to the current streaming pipeline; a detailed explanation of each component is given in its subsections. Section 3.3 presents the structure of the position data, i.e., what the data looks like.

3.1 Protocol buffers

Protocol buffers is a method to serialize structured data. The concept of protocol buffers was published in 2008 by Google and aims to improve the efficiency of data communication [18]. Some applications, such as Internet of Things (IoT) services, are built around constant or frequent message exchange; in our example, vehicle positions from sensors. When services generate a great number of messages, those messages consume a lot of bandwidth and other resources. Therefore, an efficient way of transferring data is needed. Protocol buffers are similar to JSON (JavaScript Object Notation) and XML (Extensible Markup Language). However, protocol buffers have a smaller size after serialization and provide high-speed serialization of data into a binary format, unlike the two other options [16, 17].

With protocol buffers, users define the data structure in a proto description file (.proto). Within the proto file, users can flexibly describe their data schema, for example, which fields are required for a message and which fields are optional. A protocol buffer compiler compiles the proto file into the language that the user is programming in. Over the years of development, it has come to support languages such as Java, Python, and C++. With the compiled protobuf file in the user's programming language, they can easily construct messages and convert them to and from the binary format.
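As a rough illustration of this workflow, assume a proto file defining a PositionMessage has been compiled with protoc into a Python module named position_pb2 (the message name and its fields are hypothetical). Serialization and parsing then look like this:

```python
# position_pb2 is the module that `protoc --python_out=. position.proto` would
# generate; the PositionMessage type and its fields are assumptions here.
import position_pb2

msg = position_pb2.PositionMessage()
msg.vehicle_id = "truck-42"
msg.latitude = 59.3293
msg.longitude = 18.0686

# Serialize to a compact binary string (what would travel over the network) ...
payload = msg.SerializeToString()

# ... and parse it back on the receiving side.
received = position_pb2.PositionMessage()
received.ParseFromString(payload)
print(received.latitude, received.longitude)
```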

3.2 Current pipeline breakdown

Currently, Scania has a processing pipeline for vehicle location data, as shown in Figure 3-1. It is designed using AWS services, including AWS Lambda (see Section 2.3.2.2), AWS Kinesis Streams (Section 2.3.2.1), DynamoDB (Section 2.3.2.3), and Protocol Buffers (Section 3.1). Messages generated by the vehicles are received by a Kinesis Stream, which is the beginning of the processing pipeline. Afterward, those messages go through four components: Decoder, Enricher, Cleaner, and Normalizer. Another Kinesis Stream receives the outbound data after it leaves the pipeline and transfers the data to external customers. Every component in the pipeline is a Lambda function and is executed in its own container. A Kinesis Stream is used to connect two consecutive Lambda functions. The following subsections introduce each component.

Figure 3-1: Current pipeline overview


3.2.1 Decoder

The Decoder is used to decode raw position messages into a specified schema that is used internally for the whole pipeline. This schema regulates the structure of the message, including the message header, message metadata, and position data. It also specifies the data type of each field.

Currently, Scania has different types of trucks connected to the data center, and the GPS sensors used in these different types of trucks differ. Different versions of the sensors generate position messages with slight differences. For example, messages created by older sensors contain data fields that a newer version of the sensor does not collect. Ideally, Scania wants to create a general message schema that can describe all of the different versions of the messages. In this way, no matter the version of the truck and its messages, all position messages can be processed in the same way; hence, no separate processing services are needed for different types of position messages. The Decoder reads different types of messages and converts them into a unified format, thus making later processing easier. It also verifies that the decoded message is complete and that no required fields are missing after decoding.

3.2.2 Enricher

The Enricher adds additional information to messages, such as external driver information and vehicle subscription information. That information is needed by Scania's external customers but is not contained in the original message's content. These data are stored in separate tables in DynamoDB, and some are also stored in a local cache.

The workflow of fetching and adding external information is as follows: When a message arrives at the Enricher, the Enricher checks if driver-related information is in the local cache. If so, the Enricher retrieves this information and adds it to the message. If not, the Enricher sends a request to DynamoDB. If DynamoDB does not contain the driver information, the Enricher makes a batch request to the original API for the list of information not in DynamoDB. After getting this data, the Enricher inserts it into DynamoDB to make it complete.
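The lookup order described above (local cache, then DynamoDB, then a batch request to the original API) can be sketched as follows. The helper functions query_dynamodb, fetch_from_api, and store_in_dynamodb stand in for calls that are not shown in this thesis and are purely illustrative.

```python
local_cache = {}  # in-memory cache kept by the Enricher between messages

def enrich(message, driver_id, query_dynamodb, fetch_from_api, store_in_dynamodb):
    """Illustrative three-level lookup: local cache -> DynamoDB -> original API."""
    info = local_cache.get(driver_id)
    if info is None:
        info = query_dynamodb(driver_id)                    # second level: DynamoDB
        if info is None:
            info = fetch_from_api([driver_id])[driver_id]   # last resort: batch API call
            store_in_dynamodb(driver_id, info)              # backfill DynamoDB
        local_cache[driver_id] = info
    message["driver_info"] = info
    return message
```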

3.2.3 Cleaner

The Cleaner is used to clean dirty position messages. These could be messages with errors in their content, empty fields, or late messages. Inside the Cleaner, a set of rules and errors are defined to check the correctness of each message. For example, the position rule checks whether the vehicle position data is complete and no field (such as longitude or latitude) is missing. The timestamp rule checks that the message timestamp is both available and reasonable, i.e., not from a future time nor a timestamp from one year ago. The Cleaner uses the vehicle's previous locations to check whether the message is consistent. For example, if the incoming message has a much earlier timestamp than its previous position message, an error will occur. Also, a new position at a large distance from its previous position is unrealistic and should be logged and thrown away. Distance is calculated with the Haversine formula [19], i.e., the distance between two points on a sphere. The previous vehicle location data is stored in DynamoDB and it is updated with the new position after the Cleaner validates the new position.
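For reference, a standard Python implementation of the Haversine great-circle distance is shown below; this is the textbook formula, and the thresholds that the Cleaner applies to the resulting distance are not reproduced here.

```python
from math import asin, cos, radians, sin, sqrt

def haversine_m(lat1, lon1, lat2, lon2, earth_radius_m=6_371_000.0):
    """Great-circle distance in meters between two (latitude, longitude) points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * earth_radius_m * asin(sqrt(a))

# Example: roughly the straight-line distance between Stockholm and Uppsala.
print(round(haversine_m(59.3293, 18.0686, 59.8586, 17.6389) / 1000), "km")
```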

3.2.4 Normalizer

The Normalizer is used to convert vehicle messages into a normalized format. For example, the speed of the vehicle in the original messages is expressed in different units, such as knots or meters per second (m/s). For the position messages, the Normalizer transforms the speed field to m/s, making it easier for future analysis.
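A minimal sketch of such a normalization step follows; the exact set of input units occurring in the raw messages is an assumption for this example.

```python
# Conversion factors to meters per second; which units actually occur in the
# raw messages is an assumption made for this sketch.
TO_M_PER_S = {
    "m/s": 1.0,
    "knot": 0.514444,   # 1 international knot = 1852 m / 3600 s
    "km/h": 1.0 / 3.6,
}

def normalize_speed(value: float, unit: str) -> float:
    """Return the speed in m/s regardless of the unit used in the raw message."""
    return value * TO_M_PER_S[unit]

print(normalize_speed(10.0, "knot"))  # 5.14444 m/s
```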

3.3 Position messages

In this thesis project, our service processes vehicle positions. It is important to understand what vehicle position data looks like. One example of a position message is shown in Table 3-3-1. Due to privacy issues, the table shows only part of the data fields from an actual position message. The most important position information that the GPS sensors collect includes GPS latitude, longitude, and altitude. GPS sensors installed on Scania's trucks use the World Geodetic System (WGS84) as a reference coordinate system. In this system, latitudes and longitudes are represented in a decimal degree format. For example, Stockholm is at (59.3293° N, 18.0686° E).

Table 3-3-1: Example of a raw position message structure

Data field            Description
Driver Id             <not included for reasons of privacy>
Vehicle Id            <not included for reasons of privacy>
Num of positions      Number of positions contained in this message
<positions>
  GPS Latitude sign   North or South
  GPS Latitude        Range from 0° to 90°, with accuracy in increments of 0.00001 degrees*.
  GPS Longitude sign  East or West
  GPS Longitude       Range from 0° to 180°, with accuracy in increments of 0.00001 degrees.
  Altitude sign       Below sea-level or above sea-level.
  Altitude            Altitude value in meters, with an accuracy of 1 meter.
  Timestamp           Epoch timestamp since January 1, 1970, when this position was collected
</positions>

Each position message is a set of position data collected over a period of time. The position message starts with some basic data, such as driver and vehicle-related information. Then it has a position-related part, starting with the number of positions carried in this message, followed by the set of positions. Each position contains geometric information (latitude, longitude, and altitude) and a timestamp indicating when this position was recorded. When the number of positions is 1, there is only one position included in this message. As mentioned earlier, different GPS sensors generate position messages with slight differences. These differences mainly come from different fields in the basic data part, while the position-related part is essentially the same. Regardless of the version of the GPS sensors, vehicles' positions are represented in the same coordinate system with the same accuracy.

The raw position messages are encoded in bit fields. Each field in Table 3-3-1 takes up a specific number of bits. By concatenating the bit strings of each field, a raw position message (as a bit string) is obtained. Base64 encoding is then applied to the data.
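The decoding direction can be sketched as follows. The field widths used here are invented for illustration (the real bit layout is Scania-internal), but the mechanics of Base64-decoding the message and then slicing fixed-width bit fields out of the result are the same.

```python
import base64

def read_bits(bits: str, offset: int, width: int) -> int:
    """Interpret bits[offset:offset+width] as an unsigned integer."""
    return int(bits[offset:offset + width], 2)

def decode_position(b64_message: str) -> dict:
    """Illustrative decoder: the 32/8/25/26-bit field widths are made up."""
    raw = base64.b64decode(b64_message)
    bits = "".join(f"{byte:08b}" for byte in raw)   # the raw message as a bit string
    vehicle_id = read_bits(bits, 0, 32)
    num_positions = read_bits(bits, 32, 8)
    latitude = read_bits(bits, 40, 25) / 100_000    # 0.00001-degree increments
    longitude = read_bits(bits, 65, 26) / 100_000
    return {"vehicle_id": vehicle_id, "num_positions": num_positions,
            "latitude": latitude, "longitude": longitude}
```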

The frequency at which the GPS sensors collect positions and the number of positions in each message depend on the truck's subscription level. For trucks that need an accurate mapping service, the position data is collected every few seconds and each position message contains very few positions. In this way, the vehicle's location is frequently updated in the system, hence more valuable insights can be provided to Scania's customers. For trucks that subscribe to a basic level of service, the position is collected every minute and more positions are collected in each position message than for trucks that subscribe to a premium service.

* Thus it is precise to 1.1132 m at the equator.


4 Analysis of stream processing frameworks

To determine the new architecture for the position pipeline, it is important to understand current stream processing frameworks. Section 4.1 defines the requirements that the new framework should meet. Three popular stream processing frameworks were selected and an overview of them is given in Section 4.2. A comparison of these frameworks is given in Section 4.3. Section 4.4 summarizes our findings and gives the final choice to be used for implementation and evaluation.

4.1 Requirements for stream processing applications

Scania provides customers with transportation solutions. The company manufactures physical vehicles and makes digital applications to monitor and improve the transportation process. This concept is known as digital twin technology. Within Scania, more than 600,000 vehicles are equipped with sensors and communicate with a cloud data center. These sensors push vehicles' real-time data to Scania's cloud-based applications, imposing high requirements on processing speed and message processing capacity.

A number of researchers have published requirements for stream-based products. Stonebraker et al. identified eight requirements that a good real-time stream application should meet [20]. For example, one of these requirements is that a stream processing system should generate predictable and repeatable results. With this high-level guidance, combined with the company's practical requirements, the following criteria are proposed for comparing existing stream processing engines:

1. Support stream processing with low latency, i.e., the overall latency to process a single message is preferably lower than 500 ms.
2. A fault-tolerant framework (more details in Section 4.1.1).
3. The potential to scale up or down. The framework should be scalable in the face of varying demand. Also, the scaling process should not disrupt ongoing data processing.
4. Delivery guarantee control (more details in Section 4.1.2).
5. Able to work with Protocol Buffers. The vehicle's message schema is defined in Protocol Buffers internally at Scania. Protocol Buffers files can be compiled into several languages, including C++, C#, Java, and Python. To use the defined message schemas, the selected framework should support one of the common languages supported by Protocol Buffers.

4.1.1 Fault tolerance

According to Waldemar Hummer et al. [21], two common faults in event stream processing (ESP) systems are buffer overflow and node failure. A buffer overflow error occurs when the system cannot allocate enough memory for the incoming event. This can occur when the system encounters a burst of queries that exceeds its current capacity. One solution is load shedding, i.e., dropping excess data, thus lowering the demand for memory [22]. However, this means that some queries will not be served, so the system's accuracy is sacrificed to ensure the system continues to operate with its current resources. There has been considerable study of load shedding, including when to shed load, where to shed load, and how much load to shed [23].

Another common failure is node failure, when a processing node supporting the system becomes unavailable, possibly due to hardware problems or power failure. In this case, the system needs to immediately diagnose the failure and recover the data that have been sent to the failed node but not yet delivered.

Our desired framework should be resilient to both types of failures. Even if a failure happens, it should recover from the fault quickly with minimal loss.

4.1.2 Delivery guarantee

One important task for distributed frameworks is to ensure data is received and processed by distributed nodes. The term delivery guarantee describes how data is handled in the system in case of failures. There are in total three types of delivery guarantees: at-most-once, at-least-once, and exactly-once. The at-most-once guarantee means the message will be delivered at most once. If delivery fails, there is no second attempt; under an at-most-once guarantee, there is a possibility that the message will be lost. At-most-once is also the easiest mechanism to achieve among the three options, as the system makes no extra effort to check the result of the work or to recover data. The next level, at-least-once, means the message is delivered at least one time; there is a possibility that data will be delivered more than once. Exactly-once is the strongest guarantee as it ensures messages will be delivered exactly one time even if a failure happens.

In our case, we prefer a framework with an exactly-once guarantee. If a position message is delivered twice in the system, our customers receive duplicate processed positions. Consider a mapping service as an example: duplicate positions make a vehicle look as if it is going back and forth on the map, which in turn affects other services provided to the vehicle, such as the routing service. Conversely, if a position message is lost, the tracking service misses a position and valuable insights about the vehicle cannot be updated.
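
If only an at-least-once source were available, duplicates would have to be removed downstream. The following Java sketch (illustrative only, not the pipeline's actual code) shows one simple approach: remembering a bounded window of recently seen message ids and discarding repeats. The Deduplicator class name and the assumption that every position message carries a unique id are ours:

    import java.util.Collections;
    import java.util.LinkedHashMap;
    import java.util.Map;
    import java.util.Set;

    // Drops duplicate messages by remembering the ids of recently processed
    // messages in a bounded, insertion-ordered window (oldest ids are evicted first).
    public class Deduplicator {
        private final Set<String> seen;

        public Deduplicator(int windowSize) {
            this.seen = Collections.newSetFromMap(
                    new LinkedHashMap<String, Boolean>() {
                        @Override
                        protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                            return size() > windowSize;
                        }
                    });
        }

        // Returns true only the first time a given message id is seen.
        public boolean isFirstDelivery(String messageId) {
            return seen.add(messageId);
        }
    }

The limitation is clear: ids that fall outside the window can no longer be deduplicated, which is one reason a framework-level exactly-once guarantee is preferred.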

4.2 Frameworks overview

There have been extensive studies of big data processing technologies. Early frameworks, such as Hadoop and MapReduce [24], run tasks in parallel and process large-scale datasets. They constitute the first generation of batch processing frameworks and greatly increased the amount of data that could be handled. In the last couple of years, as an abundance of information can be generated in a short period, a need has emerged to process information quickly. Modern frameworks, built on these earlier solutions, facilitate data processing with more refined designs. For example, Google's Percolator [25] is designed to perform incremental processing on a large dataset, e.g., updating the PageRank scores of web pages as new web pages appear. LinkedIn uses Samza to process massive datasets in a streaming fashion [26].

Based on the requirements mentioned above, we selected several stream processing frameworks as candidates and then narrowed the scope of the new architecture to three options: Spark Streaming, Apache Flink, and Apache Storm. A final decision was made by considering several factors, such as the popularity of a framework, the size of its user community, and the information available on the Internet. A relatively new framework with little available information and poor documentation was not preferred. The following subsections give an overview of each framework in terms of its motivation, key concepts, and architectural design.

4.2.1 Spark Streaming

Spark is a batch-processing framework designed for large-scale data-intensive tasks [27]. Spark was started at the University of California, Berkeley in 2009 and became open source in 2010. It is written in Scala and provides application programming interfaces (APIs) in Java, Python, and R. Spark is viewed as a successor to MapReduce and aims to improve MapReduce's performance by keeping data in memory. MapReduce takes data from a file as input and writes the final result to a file. This programming model is inefficient for iterative tasks, such as using gradient descent to train a neural network: at each round of training, the MapReduce task needs to reload data from storage, which incurs unnecessary delay. Spark avoids this by storing intermediate data in memory instead of on disk, so data is loaded from storage only once, at the beginning.

The key concept in Spark is the resilient distributed dataset (RDD). An RDD is the fundamental operation unit in Spark and stores a collection of data records. An RDD is distributed and can be partitioned across several nodes. Spark can be deployed in a standalone mode, meaning that a Spark application runs within one machine, or in cluster mode, as shown in Figure 4-1. In cluster mode, Spark follows a master/slave architecture. Developers start their Spark application, i.e., a driver program, on a master node. When launched, the driver program creates a Spark context, which is the gateway to all Spark functionality. It reads the user code and creates an execution graph, which describes how RDDs are derived from previous RDDs through Spark operations. Next, the Spark job is split into different stages in the Spark context and later distributed to workers. Spark relies on a cluster manager to work with multiple worker nodes. A scheduler in the cluster manager assigns tasks to workers. The cluster manager also monitors the resource usage of each worker; if the workload of one worker is too high, the cluster manager can allocate more resources to that worker.

Figure 4-1: Spark architecture in cluster mode

Spark Streaming [28] is an extension of Spark released in 2013 and designed for streaming data. Data can come either from streaming sources (e.g., Amazon Kinesis, Kafka, or a network socket) or from Spark intermediate data, i.e., RDDs. The main idea of Spark Streaming is to chunk the streaming data into small batches and hand these batches to the Spark engine for execution. Figure 4-2 shows the workflow of a Spark Streaming job. The incoming stream feeds into Spark Streaming and is converted into discretized streams (D-Streams), each of which is a collection of RDDs. The RDDs arrive at the Spark engine in order, in the same sequence as in the input stream, and are processed as batches. Final results can be written to a file or monitored in a dashboard application.

Figure 4-2: Workflow of a Spark Streaming job
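
A minimal Spark Streaming job in Java could look like the sketch below (illustrative only; the host, port, and batch interval are placeholders). It reads lines from a network socket, slices the stream into one-second micro-batches, and counts the records in each batch:

    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;

    public class SparkStreamingSketch {
        public static void main(String[] args) throws InterruptedException {
            SparkConf conf = new SparkConf()
                    .setAppName("position-stream-sketch")
                    .setMaster("local[2]");   // two local threads: one receiver, one processor
            JavaStreamingContext ssc =
                    new JavaStreamingContext(conf, Durations.seconds(1));

            // Each 1-second chunk of the socket stream becomes one RDD in the D-Stream.
            JavaDStream<String> lines = ssc.socketTextStream("localhost", 9999);
            lines.count().print();   // every micro-batch is executed as a small Spark job

            ssc.start();
            ssc.awaitTermination();
        }
    }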

4.2.2 Flink

Flink is an open-source big data analytics platform for processing both batch and streaming data. Flink is based on a research project called Stratosphere [29], whose goal was to build a parallel computing framework; the project was donated to Apache in 2014 under the name Flink. Flink is written in Java and Scala, and it also supports some Python APIs.


The main characteristic of Flink is that it unifies stream processing and batch processing in a single programming model, obviating the need to maintain two different systems for two data types. To achieve this, the core of Flink is a distributed streaming engine: Flink views batch data as a special case of streaming data, one with finite length, so batch jobs can also run on the streaming engine. On top of the streaming engine, Flink provides two groups of APIs: the DataSet API for bounded datasets and the DataStream API for unbounded datasets. Regardless of which API is chosen, all Flink programs are converted to a dataflow graph, and this graph is executed by Flink's streaming engine.

Figure 4-3 shows an example of Flink's dataflow graph. It consists of stateful operators with connecting streams in between. The operators are executed in parallel, each computed by several parallel instances; the number of parallel instances is controlled by the parallelism factor. In Figure 4-3, the parallelism is 1 for the source operator and 2 for the map operator. Each parallel instance takes part of the data and pushes its result to the next operator, either by one-to-one forwarding or by redistribution, depending on the type of operator. A Flink job has three stages: reading data from a source, operating on the data, and writing results to a data sink.

Figure 4-3: Example of Flink’s Dataflow graph
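
Expressed in Flink's Java DataStream API, a dataflow of this shape could look like the following sketch (illustrative only; the socket source host and port are placeholders). The source runs with parallelism 1 and the map operator with parallelism 2, matching Figure 4-3:

    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class FlinkDataflowSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.getExecutionEnvironment();

            env.socketTextStream("localhost", 9999)        // source operator, parallelism 1
               .map(String::toUpperCase).setParallelism(2) // map operator, two parallel instances
               .print();                                   // sink operator

            // The program above is turned into a dataflow graph and executed by the streaming engine.
            env.execute("dataflow-sketch");
        }
    }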

Flink’s architecture is depicted in Figure 4-4. A Flink cluster consists of a client, a job manager, and multiple task managers. When the client starts a Flink program, it generates a dataflow graph from the user code and sends it to the job manager, which administers the execution of the dataflow graph. The scheduler in the job manager determines how tasks are distributed, and the checkpoint coordinator inside the job manager manages the state of the cluster and reacts to node failures using a checkpointing mechanism. The actual task execution is done by the task managers. Inside a task manager, the smaller execution unit is called a task slot; the number of task slots indicates how many tasks a task manager can execute concurrently.


Figure 4-4: Flink architecture

4.2.3 Storm

Storm [30] is a stream processing framework. It was initially created at a company called BackType, which was acquired by Twitter in 2011. Storm became open source in 2012 and an Apache Top-Level Project in 2014. Storm is mainly written in Clojure. Since Storm runs on top of a Java virtual machine (JVM), it supports all JVM languages as well as non-JVM languages that have adapters to the JVM, such as Ruby and Python.

The data processing procedure in Storm is called a topology; an example is depicted in Figure 4-5. A topology is a directed graph consisting of nodes and edges. Nodes represent data operation tasks, and edges represent how data passes between operators. There are two types of nodes: spouts and bolts. A spout is the entry point of the topology and receives the actual input stream. Bolts are processing nodes that receive data from a spout (or from another bolt), perform operations on the incoming data stream, and then emit a new stream.

Figure 4-5: Storm topology
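
Wired together with Storm's Java API, a small topology of this shape could be defined as in the sketch below (illustrative only). The PositionSpout, EnrichBolt, and StoreBolt classes are hypothetical placeholders for user-written spout/bolt implementations; only TopologyBuilder, LocalCluster, and Config come from Storm itself:

    import org.apache.storm.Config;
    import org.apache.storm.LocalCluster;
    import org.apache.storm.topology.TopologyBuilder;

    public class StormTopologySketch {
        public static void main(String[] args) throws Exception {
            TopologyBuilder builder = new TopologyBuilder();
            // One spout feeds two bolts; the trailing numbers are parallelism hints.
            builder.setSpout("positions", new PositionSpout(), 1);
            builder.setBolt("enrich", new EnrichBolt(), 2)
                   .shuffleGrouping("positions");
            builder.setBolt("store", new StoreBolt(), 1)
                   .shuffleGrouping("enrich");

            // Run the topology in-process for local testing.
            LocalCluster cluster = new LocalCluster();
            cluster.submitTopology("position-topology", new Config(), builder.createTopology());
            Thread.sleep(60_000);
            cluster.shutdown();
        }
    }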

Figure 4-6 shows the architecture of Storm. Two types of nodes can be distinguished in this architecture: the master node and the worker nodes. A background process called Nimbus runs on the master node and is responsible for distributing tasks to workers and monitoring their performance. The worker nodes run a daemon process called the Supervisor, which receives the work assigned by Nimbus and executes it. Each Supervisor executes a subset of a topology, and one topology can be handled by several worker nodes. Because Storm itself cannot manage the nodes’ state, it relies on Apache ZooKeeper [31] to handle this.
