Degree Project, 30 credits, June 2016
Handling Data Flows of Streaming Internet of Things Data
Yonatan Kebede Serbessa
Master's Programme in Computer Science
Faculty of Science and Technology, UTH Unit
Streaming data in various formats is generated at a very high rate, and it must be processed and analyzed before it loses its value. Existing technology provides the tools to process such data and extract meaningful information from it. This thesis has two parts: a theoretical part and a practical part. The theoretical part investigates which tools are suitable for processing and analyzing streaming data flows. It starts by studying one of the main sources of large volumes of streaming data, the Internet of Things, covering the technologies behind it, common use cases, challenges, and solutions.
This is followed by an overview of three selected tools, namely Apache NiFi, Apache Spark Streaming, and Apache Storm, examining their key features, main components, and architecture. After the tools are studied, five parameters are selected and the review examines how each tool handles them, which can be useful when choosing a tool given the parameters and the use case at hand. The second part of the thesis is a Twitter data analysis carried out with Apache NiFi, one of the tools studied. Its purpose is to show how NiFi can be used for processing data from ingestion all the way to storage, and how it communicates with external storage, search, and indexing systems.
Examiner: Edith Ngai
Subject reader: Matteo Magnani
Supervisor: Markus Nilsson
It is with great honor that I express my gratitude to the Swedish Institute for awarding me the Swedish Institute Study Scholarship for my Master's studies at Uppsala University, Uppsala, Sweden.
I would also like to extend my gratitude to my supervisor Markus Nilsson for giving me the chance to carry out this thesis at Granditude AB and for his important feedback on this report, and to my reviewer Matteo Magnani from Uppsala University for following my progress throughout. My gratitude also goes to the whole team at Granditude for being supportive and providing a good working environment.
Last but not least, I would like to thank my family and friends for their prayers and support. Thank you!
1 Introduction
  1.1 Problem Formulation and Goal
  1.2 Scope and Method
  1.3 Structure of the Report
  1.4 Literature Review
2 Internet of Things Overview
  2.1 Technologies in IoT
    2.1.1 Radio Frequency Identification (RFID)
    2.1.2 Wireless Sensor Network (WSN)
    2.1.3 TCP/IP (IPv4, IPv6)
    2.1.4 Visualization Component
  2.2 Application Areas
    2.2.1 Smart Home
    2.2.2 Wearable
    2.2.3 Smart City
    2.2.4 IoT in Agriculture - Smart Farming and Animals
    2.2.5 IoT in Health/Connected Health
  2.3 Challenges and Solutions
    2.3.1 Challenges
    2.3.2 Solutions
3 Overview of Tools
  3.1 Apache NiFi History and Overview
    3.1.1 NiFi Architecture
    3.1.2 Key Features
    3.1.3 NiFi UI Components
    3.1.4 NiFi Elements
  3.2 Apache Spark Streaming
    3.2.1 Key Features
    3.2.2 Basic Concepts and Main Operations
    3.2.3 Architecture
  3.3 Apache Storm
    3.3.4 Features
4 Review and Comparison of the Tools
  4.1 Review
    4.1.1 Apache NiFi
    4.1.2 Spark Streaming
    4.1.3 Apache Storm
  4.2 Differences and Similarities
  4.3 Discussion of the Parameters
  4.4 How Each Tool Handles the Use Case
  4.5 Summary
5 Practical Analysis/Twitter Data Analysis
  5.1 Problem Definition
  5.2 Setup
  5.3 Analysis
    5.3.1 Data Ingestion
    5.3.2 Data Processing
    5.3.3 Data Storage
    5.3.4 Data Indexing & Visualization
    5.3.5 Data Result & Discussion
    5.3.6 Data Analysis in Solr
6 Evaluation
7 Conclusion and Future Work
  7.1 Future Work
Appendix - Apache License, 2.0
Chapter 1 Introduction
The number of devices connected to the internet is increasing every year at a remarkable rate. According to Cisco, 50 billion devices are expected to be connected to the internet by 2020, most of them Internet of Things (IoT) devices such as wearables, smart home appliances, connected cars, and many more.
These devices produce large volumes of data at a very high rate, and the data needs to be processed in real time to gain insight from it. Different kinds of tools exist: some are designed to process only one form of data, either static or real-time, while others are designed to process both. This thesis project mainly deals with the handling and processing of real-time data flows, after a thorough study of some selected stream analytics tools has been made.
The thesis project was carried out at Granditude AB, a company that provides advanced data analytics and big data solutions built on open source software to satisfy the needs of its customers. The company mainly uses open source frameworks and projects in the Hadoop ecosystem.
1.1 Problem Formulation and Goal
There are two main types of data sources: real-time and static. Data produced by real-time sources is fast, continuous, very large, and structured or unstructured. Data from static sources is stored historical data, which is also very large and is used to enrich the real-time data. Because real-time data is produced quickly, it has to be processed at the rate it is produced, before it perishes; one problem streaming systems face is that the data may not be processed fast enough. The data from these two sources must be combined, processed, and analyzed to provide meaningful information, which in turn is vital for making better decisions. This is another problem area for stream data flow processing: when the static and real-time sources, or data coming from different mobile devices, are poorly integrated, the data is not analyzed properly, is not enriched with historical data, and hence produces poor results.
Another problem that makes handling and processing streaming data difficult is the inability to adapt to changing real-time conditions, for example when errors occur.
There are many tools that mainly process stream data, but studying, understanding, and using all of these platforms as they appear does not scale and is not covered in this work.
This project aims to process flows of streaming data using one tool. To achieve this, an overview of selected tools in this area is made first, and then the tool to be used in the analysis is chosen after a review and discussion of the tools based on certain parameters and a use case. This thesis project generally tries to answer questions such as:
• What tools currently exist for data extraction, processing, and analysis? This involves studying some of the selected tools in this area: their architecture, key features, and components.
• Based on the study, which tool is good for a particular use case?
• Which tool best handles both static and real-time data produced for analysis?
• Which tool makes it easy to change the flow?
The defined use case consists of both real-time and static data to be processed and analyzed. The real-time data is tweets from the Twitter API, and the static data is tweets initially stored in the NoSQL database HBase. The two data sources are combined and filtered based on given properties. Based on the filtering result, incorrect data is logged to a separate file, while correct data is stored in HBase.
Finally, some of the filtered data is indexed into Solr, an enterprise search platform.
In this process, we will see what happens to each input source before and after they are combined. What techniques are used to merge and filter, and what priority levels should be given to each source, are also among the questions answered at this stage. The basis for separating the data into correct and incorrect is also defined.
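The routing logic of this use case can be sketched in plain Python (this is an illustration of the idea only, not NiFi code; the record fields "id", "text", and "lang" are assumptions for the example, since the actual filtering properties are defined later in the thesis):

```python
# Sketch of the use-case logic: combine real-time and static tweet
# records, validate them against required properties, and route valid
# records to "storage" and invalid ones to a log.

REQUIRED_FIELDS = {"id", "text", "lang"}

def is_valid(tweet):
    """A record is 'correct' if it carries all required properties."""
    return REQUIRED_FIELDS.issubset(tweet) and tweet["text"].strip() != ""

def route(realtime, static):
    """Merge the two sources and split into (store, log) lists."""
    store, log = [], []
    for tweet in list(realtime) + list(static):
        (store if is_valid(tweet) else log).append(tweet)
    return store, log

realtime = [{"id": 1, "text": "hello", "lang": "en"},
            {"id": 2, "text": ""}]                    # missing "lang", empty text
static = [{"id": 3, "text": "old tweet", "lang": "sv"}]

store, log = route(realtime, static)
print(len(store), len(log))  # 2 valid records, 1 logged as incorrect
```

In NiFi itself this split would be expressed visually with processors and connections rather than code, but the merge-validate-route structure is the same.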
1.2 Scope and Method
The project is divided into two main parts: a theoretical part and a practical/analysis part. In the theoretical part, IoT is studied, as it involves many devices that produce large amounts of data at a high rate; its challenges, possible solutions, and common use cases are also covered. Next, an overview of selected tools/platforms is given, covering their main components, features, and common use cases. The tools are then reviewed further by defining a use case and certain parameters and examining how each tool handles them. Finally, based on the discussion, one tool is selected for the analysis part of the project.
The tools were chosen based on the requirement that they should be data processing or streaming tools within the Hadoop ecosystem. Based on this requirement, the tools chosen are:
• Apache NiFi, Version 0.6.0
• Apache Spark Streaming, Version 1.6.0
• Apache Storm, Version 0.9.6
1.3 Structure of the report
Here the structure of the report is briefly outlined. Chapter 2 gives an overview of the Internet of Things, comprising the technologies that make up IoT and common use cases; the challenges and solutions of IoT are also discussed briefly. Chapter 3 deals with the overview of the selected tools (Apache NiFi, Apache Spark Streaming, Apache Storm), discussing the key features of each tool, their architecture, and the different components and elements they have. Chapter 4 is a continuation of the previous chapter; it defines certain parameters and a use case to discuss the characteristics of the tools and see how each of them behaves. Based on this discussion, one tool is selected for the practical part. Chapter 5 discusses the practical phase of the project, using the chosen tool for Twitter data analysis. Chapter 6 evaluates the tool's performance. Finally, conclusions and future work are outlined in Chapter 7.
1.4 Literature Review
Many of the papers discuss the technologies involved, common use cases, the challenges IoT is facing, and solutions to them. For example, the technology company Ericsson is engaged in the IoT Initiative (IoT-i), with the objective of increasing the benefits and possibilities of IoT and identifying and proposing solutions to its challenges, with a team comprising both industry and academia. Miorandi et al. present a survey of technologies, applications, and research challenges for the IoT; the survey also suggests RFID as the basis for spreading IoT technology widely.
A Cisco white paper defines IoT as the Internet of Objects that changes everything, considering the different ways it impacts our lives, such as in education, communication, business, science, and government.
Different IoT application areas are also discussed in the McKinsey Global Institute report "Unlocking the Potential of the Internet of Things", which describes a broad range of potential applications with homes, vehicles, humans, cities, and factories as settings. Another white paper discusses how IoT is being used in health care to improve access to care, increase quality, and reduce cost; the products described include "Massimo radical-7" for clinical care and the "Sonamba Daily Monitoring" solution for early intervention/prevention, which can be used as wearable devices. Weber approaches IoT from the perspective of an internet-based global architecture and discusses its significant impact on the privacy and security of all stakeholders involved.
Spark Streaming uses Discretized Streams (DStreams), defined by Zaharia et al. as a stream programming model that is capable of integrating with batch systems and provides consistent and efficient fault recovery.
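The core of the DStream model can be illustrated without Spark at all (a plain-Python sketch, not the Spark API): the continuous stream is discretized into small batches, and the same deterministic batch computation, here a word count, runs on each interval.

```python
# Toy illustration of the discretized-stream idea: chop an event stream
# into fixed-size micro-batches and apply the same batch computation
# (a word count) to every batch.
from collections import Counter

def discretize(events, batch_size):
    """Yield fixed-size micro-batches from an event list."""
    for i in range(0, len(events), batch_size):
        yield events[i:i + batch_size]

stream = ["iot", "spark", "iot", "storm", "nifi", "iot"]
counts_per_batch = [Counter(batch) for batch in discretize(stream, 3)]
print(counts_per_batch[0]["iot"])  # "iot" appears twice in the first batch
```

In Spark Streaming the batches are RDDs produced on a fixed interval by a `StreamingContext`, which is what allows streaming and batch code to share the same operations.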
Since Apache NiFi is a new framework for data flow management and processing, papers studying its features, programming model, and so on could not readily be found. The study of this tool is therefore based mostly on its project page and documentation.
Chapter 2 Internet of Things Overview
The Internet of Things (IoT), as defined by the International Telecommunication Union (ITU), is a global infrastructure for the information society, enabling advanced services by interconnecting things based on existing and evolving interoperable Information and Communication Technology (ICT). The term was first coined by Kevin Ashton in 1999 at the MIT Auto-ID Labs. As the name stands, Internet of Things is a combination of two words: Internet and Things.
The Internet is a network of networks interconnecting millions of computers globally using a standard communication protocol, TCP/IP. A Thing is any physical or virtual entity that can be identified, distinguished, and given an address. Examples of Things include humans, cars, food, different machines, and electronic devices that can be sensed and connected. Combined, Internet of Things refers to a technology that seamlessly interconnects these "Things" anywhere and anytime, using existing and evolving communication technologies and standards, so that they can exchange information, data, and resources. IoT aims at making these things smarter, enabling them to obtain information with little or no human intervention. It thereby allows communication between Human-to-Human (H2H), Human-to-Things (H2T), and Things-to-Things (T2T), providing a unique identity to each and every object.
In the subsequent subsections, the technologies that are used in IoT, common use cases and the challenges that IoT is facing currently and their solutions are discussed.
2.1 Technologies in IoT
Different kinds of technologies are used in IoT applications. They can be categorized into hardware, middleware, and a presentation component. The hardware components include things such as embedded sensors, while the middleware consists of application tools for analysis. The presentation component concerns how the analyzed data is presented to the end user, i.e., visualization on different platforms.
Below are some of the main technologies behind IoT implementations.
2.1.1 Radio Frequency Identification (RFID)
RFID uses a wireless microchip that uniquely identifies "Things"; the Auto-ID lab founded at MIT in 1999 promoted its use for this purpose. It is an easy, reliable, efficient, and secure technology, and it is cheap compared to other devices. An RFID system consists of a reader and one or more tags, which can be active, passive, or semi-passive depending on their computational power and sensing capability. Passive tags do not use a battery, while active tags have their own. RFID has various uses, such as personal identification, distribution management, tracking, patient monitoring, and vehicle management.
2.1.2 Wireless Sensor Network (WSN)
WSN is another of the main technologies used in IoT; it can communicate information remotely in different ways. It consists of smart sensors with microcontrollers that can gather, process, analyze, and distribute measurements such as temperature fluctuations, sound, pressure, and heart rates instantly, in real time.
2.1.3 TCP/IP (IPv4,IPv6)
TCP/IP is the protocol suite that identifies computers on a network, and there are two versions of the IP protocol in use, IPv4 and IPv6. IPv4 is currently the most widely used, but most of its address space has been depleted. For an IoT that interconnects anything, IPv4 is not a good choice because of its limited address space. The newer IPv6 is a good solution for a future in which everything is connected, because its very large address space can provide an address for, and uniquely identify, almost anything. Even though it is not yet widely used, IPv6 is the future for IoT when one thinks about connecting almost anything.
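The address-space argument can be made concrete with a two-line calculation: IPv4 addresses are 32 bits long, IPv6 addresses 128 bits.

```python
# Compare the IPv4 and IPv6 address spaces directly.
ipv4_space = 2 ** 32   # 32-bit addresses
ipv6_space = 2 ** 128  # 128-bit addresses

print(ipv4_space)                # 4294967296 (~4.3 billion addresses)
print(ipv6_space // ipv4_space)  # IPv6 offers 2**96 times more addresses
```

With roughly 4.3 billion IPv4 addresses for an expected 50 billion devices, the need for IPv6's vastly larger space follows immediately.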
2.1.4 Visualization Component
The visualization component is also important in IoT, because without good visualization the user's interaction with the environment is not achievable. When designing any kind of visualization for IoT, the way the analyzed data is presented matters for making better decisions. This means designing easy-to-understand, user-friendly interfaces that use existing technologies, such as the touch screens of smartphones, tablets, and other devices, according to the needs of the end user.
2.2 Application Areas
IoT is a future of technology where all things are interconnected to exchange data and provide information for the good of society. There are many application areas that have already entered the IoT market, and some that are not widely deployed yet. Common IoT application areas include transportation, with domains such as traffic management, vehicle parking, highway and road construction, and smart vehicles for the public. IoT can also make infrastructure available at reduced cost and resource usage by providing smart metering for water and electricity utilities and smart grid systems. These and many other applications show that IoT is being applied in all kinds of areas for better services, and it promises to be used even more widely in the future.
In the next subsections, selected IoT use cases are discussed briefly.
2.2.1 Smart Home
Smart Home is a technology that enables almost all home appliances used on a daily basis to be connected to the internet or to each other. This helps provide better services and act according to the preferences of the owner. The home appliances may include Heating, Ventilation, and Air Conditioning (HVAC) systems, microwave ovens, lighting systems, refrigerators, garages, smart TVs, and so on. Examples include controlling the temperature of the house and the lighting in the rooms, or checking whether the oven is on or off. These things can be deployed in a smart home environment and can also be monitored by voice control from a smartphone (for example Siri and HomeKit from Apple).
2.2.2 Wearable
This area of IoT is also getting popular as more and more wearable devices are being manufactured. A wearable is a small mobile electronic device with wireless sensor communication capability for gathering and processing information.
Wearable devices can work by themselves or by being connected to a smartphone via Bluetooth. Examples include smart watches, wrist-band sensors, and rings, to mention a few. Smart watches, for instance, provide a variety of uses such as email notifications and alerts for messages and incoming calls while connected via Bluetooth. Another kind of wearable in wide use is the wrist-band sensor, applied for interactive exercise and activity tracking (heart beats, pulse rates, etc.).
Examples include Apple smart watch, Samsung Gear smart watch, and Google Glass.
2.2.3 Smart City
It is a technology that delivers smart urban services to the general public, maintaining a safer environment and minimizing cost. It aims at using the available resources wisely. It can also be used to reduce the pollution that arises from traffic congestion in bigger cities, hence playing a vital role in the sustainability of the city.
2.2.4 IoT in Agriculture - Smart Farming and Animals
This IoT application area is promising, especially in countries whose economies depend mainly on agricultural production. Traditional agricultural equipment such as tractors is fitted with smart sensors that measure temperature, soil humidity, and water distribution. It also includes animals on farms, which are identified using RFID tags; this enables animals to be traced and detected in real time when an outbreak of a contagious disease occurs. The technology can also be used for preventive maintenance of equipment well in advance. It revolutionizes traditional farming by using the data generated from the embedded smart sensors to improve production and support decisions such as which seeds to plant, the expected crop yields, and water utilization levels. It also enables farmers to deliver their products directly to consumers.
2.2.5 IoT in Health/Connected Health
This is one of the most widely used IoT use cases: a technology that connects hospitals and patients remotely. Connected health keeps patients connected 24/7, enabling their health conditions to be monitored and data to be sent to the hospital, which in turn helps doctors flexibly control and monitor their patients' well-being. This can be achieved using smartphones and wearables embedded as implantables in the body or palm, so that these devices transfer the generated data to the doctor's end for further processing, notify about emergency conditions, and trace symptoms of health threats well in advance. This is vital for both hospitals and patients. For hospitals, the ratio of doctors to patients is not evenly distributed, so this technology enables doctors to follow more patients from wherever they are, which was not possible before. Another benefit is that, since the data is gathered by devices, errors are less likely than with manual data entry, and the data is readily available to doctors, speeding up decision making. On the patients' side, it is good for emergency cases and enables preventive care, especially for the elderly.
2.3 Challenges and Solutions
As there are many emerging applications and evolving technologies in the IoT field, its challenges have also grown with these trends.
In the following subsections, major challenges and solutions are discussed.
2.3.1 Challenges
There are many challenges the IoT field is currently facing. Bandwidth and battery problems in small devices, power disruptions, and device configuration are some of them. Beyond these, the major challenges facing IoT can be generalized as: data security, data control & access, lack of uniform standards/structures, and the large volume of data produced.
1. Data Security: In IoT, data security means ensuring the availability and continuity of an application and avoiding potential operational failures and interruptions of internet-connected devices. Threats can come at different levels, such as the device, network, or system/application level. They also come in a variety of ways, such as arbitrary attacks like Distributed Denial of Service (DDoS) and malicious software. Different devices such as sensors, RFID tags, and cameras, or network services (WSN, Bluetooth), could be vulnerable to such attacks and in turn be used as botnets; home appliances such as refrigerators and TVs can also be used as botnets to attack these and similar devices.
2. Data Control & Access/Privacy: IoT applications produce large volumes of data at a high rate from different devices, and these smart devices collect and process personal information. Knowing what risks these devices pose, how the data is produced and used, who owns and controls it, and who has access to it are privacy questions one needs to ask when using their services. The concerns usually come in two forms: first, personal information about an individual is collected and identified without the owner knowing who accesses it; second, an individual's physical location can be traced and their whereabouts known, violating privacy. Privacy is thus one of the basic challenges in the IoT field, as it is everywhere in IT.
3. No Uniform Standards/Structures: IoT comprises different components such as hardware devices, sensors, and applications, manufactured and developed by different industries. When these components are combined in IoT solutions, they need to exchange data, and problems arise because the standard used in one product is not used in another; sometimes ad-hoc protocols from different vendors are used, for example in wireless communication. The absence of uniform standards and structures for the different technologies used in IoT is one challenge for the field.
4. Large Volume of Data Produced: The data produced from various sensors and mobile devices is heterogeneous, continuous, very large, and fast, and it needs to be processed instantly before it expires. Managing such data is beyond the capacity of traditional databases. As the number of connected devices is expected to increase, the data produced will grow exponentially, and good analytics platforms and storage systems are needed.
2.3.2 Solutions
As the challenges of IoT grow, solutions that address them must be developed to provide services trusted by all parties, such as users and companies. One solution is to use standard encryption technologies suited to IoT; since the devices are mobile, the encryption must be fast and energy-efficient, because energy consumption is another constraint of IoT devices. Using authentication and authorization schemes to control access to the data is another measure to consider when designing IoT applications.
Some of the solutions to the problems discussed include:
1. Having Uniform Shared Standards/Structures: A standard protocol or structure lets vendors follow one structure, avoiding problems when the parts developed by different organizations need to be integrated. For example, if hardware and sensor designers, network service providers, and application developers all follow an IoT standard, it will greatly reduce integration and compatibility issues.
2. Using Standard Encryption and Authentication: As noted above, fast, energy-efficient encryption technologies, combined with authentication and authorization schemes for controlling access to the data, address the data security challenge.
3. Using Anonymization: Anonymization is a method of modifying personal data so that nothing can be learned about the individual. It does not only mean de-identification by removing certain attributes; the records must also be unlinkable, because a large volume of data is produced all the time. Methods such as k-anonymity can be used.
4. Robust Storage Systems: As the data produced from IoT devices is large in volume, fast and powerful storage mechanisms are needed, such as fault-tolerant NoSQL databases that can handle data even larger than is currently required.
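The k-anonymity method mentioned above can be sketched in a few lines (an assumption-level illustration, not a production anonymizer): a table is k-anonymous if every combination of quasi-identifier values, such as postal code and age range, is shared by at least k records, so no individual can be singled out by those attributes.

```python
# Minimal k-anonymity check over a list of records.
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """True if every quasi-identifier combination occurs at least k times."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())

records = [
    {"zip": "751", "age": "20-30", "condition": "flu"},
    {"zip": "751", "age": "20-30", "condition": "cold"},
    {"zip": "752", "age": "30-40", "condition": "flu"},
]
# The third record forms a group of size 1, so the table is not 2-anonymous.
print(is_k_anonymous(records, ["zip", "age"], 2))  # False
```

A real anonymizer would then generalize or suppress values (e.g. widening the age range) until the check passes.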
Chapter 3 Overview of Tools
In this chapter, three tools mainly used in the analysis of streaming data are studied: Apache NiFi, Apache Spark Streaming, and Apache Storm.
Their general overview and features are reviewed, which serves as a basis for the study of their similarities and differences in the next chapter.
3.1 Apache NiFi History and Overview
Apache NiFi, originally named "Niagara Files", was first developed by the National Security Agency (NSA) in the United States in 2006 and was used there for 8 years.
It was built to automate data flow between systems. In November 2014, the NSA donated it to the Apache Software Foundation (ASF) through its Technology Transfer Program. In July 2015 it became a Top Level Project of the ASF, and six releases of NiFi exist at the time this paper is written, the latest being 0.6.0.
Data flow is an automated and managed flow of data between systems: information flows from one system, which can be considered the producer, to another, the consumer. These flows need to be guaranteed to reach the intended parties at the time needed. Today, however, data flow between systems faces many challenges, more than in earlier times: organizations used to have only one or two systems exchanging information, which were not too complex to integrate, whereas now data flow management systems must handle many different sets of data.
The major problems of data flow are:
• Integration problem: The different systems existing in an organization have different architectures, and even newly built systems may not take the architectures of the existing ones into account. Integrating the different systems benefits both the organization and the users. For the organization, integrated systems mean information can easily flow between them, which supports better decision making. For the users, they can get what they request quickly and easily without knowing exactly where each module or function resides; an integrated system with good data flow provides this efficiently and effectively.
• Priorities of organizations change over time: What was considered of little value at one time may be considered valuable later and need to be taken into account in decisions. Under such conditions, the data flow system must be robust and fast enough to handle new changes as they occur and adapt to existing ones without affecting other flows.
• Compliance and security: Whenever organizational policies or business decisions change, there is a risk that data security is neglected while trying to adhere to the new rules. Systems must remain secure for users regardless of changes in organizational policies or business decisions, which again strengthens data flow management.
NiFi supports running environments ranging from a laptop to many enterprise servers, depending on the size and nature of the data flow involved. It requires sufficient disk space, as it has several repositories (content, flow file, provenance) whose contents are stored on disk. It can run on any machine with a major operating system (Windows, Linux, Unix, Mac OS), and its web interface renders in the latest major browsers such as Internet Explorer, Firefox, and Google Chrome.
3.1.1 NiFi Architecture
NiFi supports both standalone and cluster mode processing. Their features are discussed below.
NiFi requires Java and runs inside a JVM, so the amount of memory it uses depends on the JVM. A web server inside the JVM exposes its components in a user-friendly UI. The flow file, content, and provenance repositories are all stored in local storage.
The different parts of the architecture, as shown in Figure 3.1, are:
• Flow Controller: the main part of the NiFi architecture; it controls thread allocation for the different components.
• Processor: the main building block of NiFi; it is controlled by the Flow Controller.
• Extensions: operate within the JVM and hold the different extension points in NiFi.
• Flow File Repository: where NiFi keeps track of the state of the active flow files. It uses Write-Ahead Logging and lives on a specified disk partition.
• Content Repository: holds the actual content of a given flow file, stored in the file system.
• Provenance Repository: holds information about the data: what happened to it, and how and where it moved over time, beginning from its origin. All of this information is indexed, which makes searching easy.
Figure 3.1: NiFi standalone Architecture - source 
NiFi can also be used as a cluster, where the NiFi Cluster Manager (NCM) is the master and the other NiFi instances connected to it are the nodes (slaves). In this model, the nodes perform the actual processing of the data, while the NCM manages and monitors the changes.
A NiFi cluster uses a Site-to-Site protocol, which enables it to communicate with other NiFi
Figure 3.2: NiFi Cluster Architecture - source 
instances, other clusters, or other systems such as Apache Spark. Figure 3.2 shows that the nodes communicate only with the NCM and not with each other. The communication between the nodes and the NCM can be by unicast or multicast.
When a node fails, the other nodes do not automatically pick up its load; rather, it is the NCM that recalculates the load balance and distributes the work to another node. The other functions of the NCM are to communicate data flow changes to all the nodes and to receive health information (whether they are working properly) and status information from the nodes. The master regularly checks the nodes for load balancing, so that they are given flow files to process according to their load. Nodes can be added horizontally to the cluster whenever needed, as long as the NCM is up and operating.
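The NCM's load-balancing role can be sketched as follows. The greedy least-loaded policy below is purely illustrative, not NiFi's actual algorithm: a master assigns each incoming flow file to whichever node currently carries the least load.

```python
# Illustrative sketch only (not NiFi's actual algorithm): the master
# assigns each incoming flow file to the currently least-loaded node,
# the way the NCM balances work across the cluster nodes.
def assign(flow_file_sizes, num_nodes):
    loads = [0] * num_nodes        # bytes currently queued per node
    assignment = []                # node chosen for each flow file
    for size in flow_file_sizes:
        node = min(range(num_nodes), key=lambda n: loads[n])
        loads[node] += size
        assignment.append(node)
    return assignment, loads
```

Running `assign([50, 30, 20, 40, 10], 2)` spreads the five flow files so that the two nodes end up with roughly comparable loads.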
3.1.2 Key Features
Apache NiFi has many useful features that provide better flow management mechanisms compared with other systems. It can be said to have been designed by learning from the drawbacks those systems have. These features are also considered advantages it has over other systems.
The points below describe some of the main features of NiFi.
• Flow-specific Quality of Service (QoS): This comprises guaranteed delivery vs. loss tolerance and latency vs. throughput. The QoS of a flow depends on how it is configured to balance high throughput, low latency, and tolerance for loss. For flows where data loss is unacceptable, NiFi achieves guaranteed delivery by using both the content repository and persistent Write-Ahead Logging (WAL): it keeps track of changes made to a flow file's attributes and to the connection the flow file belongs to, writes these changes to the log, and only then writes the contents to the actual disk. This is important for recovery and prevents data loss.
Latency is the time required to process a flow file from beginning to end, while throughput is the amount of work completed in a given time.
Throughput also describes how many flow files are processed at once in a given time, i.e., micro-batching within a specified interval. Every time a processor finishes processing a flow file, the repository must be updated before the flow file is sent to the next component, and this update is expensive and time-consuming. Since the update is expensive, it pays to do more work per update, i.e., to micro-batch more flow files for processing at once. The drawback is that the next component or processor cannot start until the repository is updated and the whole batch is processed, which produces latency. This trade-off has to be balanced in order to provide both better processing speed and better throughput.
NiFi lets the user tune this trade-off between lower latency and higher throughput when configuring a processor in its Settings tab, so a suitable operating point can be chosen to get the best result according to the need.
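The latency/throughput trade-off above can be made concrete with a small numeric sketch. The costs below are made up for illustration: assume each repository update takes 10 ms and each flow file 1 ms to process. Larger micro-batches amortize the fixed update cost (better throughput) but delay the hand-off to the next processor (worse latency).

```python
# Rough sketch of the micro-batching trade-off, with made-up costs:
# a fixed 10 ms repository update per batch, plus 1 ms per flow file.
UPDATE_MS, PER_FILE_MS = 10.0, 1.0

def per_file_cost(batch_size):
    """Average milliseconds of work per flow file (throughput side)."""
    return (UPDATE_MS + batch_size * PER_FILE_MS) / batch_size

def batch_latency(batch_size):
    """Milliseconds before the whole batch reaches the next processor."""
    return UPDATE_MS + batch_size * PER_FILE_MS
```

With a batch of 1 the cost is 11 ms per flow file; with a batch of 100 it drops to 1.1 ms per file, but the downstream processor waits 110 ms for the batch, which is exactly the tension the Settings tab lets the user balance.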
• Security: One of the concerning issues in other flow management systems is security. NiFi provides security in two forms: system-to-system and user-to-system mechanisms. For the first, it enables encryption and decryption of each of the flows involved, and when communicating with other NiFi instances or other systems it can use encryption protocols such as two-way SSL.
For the second, i.e., user-to-system, it provides two-way SSL authentication and also controls users' access through privilege levels such as Read Only, Data Flow Manager (DFM), Provenance, and Admin.
• Dynamic Prioritization: NiFi has a queuing mechanism that retrieves and processes flow files according to specified queue prioritization schemes. Prioritization can be based on size or time, and custom prioritization schemes are also allowed. The need to prioritize queues arises from constraints on bandwidth or other resources, or from how critical an event is. Priorities set at one time may not be good enough at another, and a badly set priority affects decisions; NiFi therefore allows priorities to be set dynamically for different scenarios according to the need.
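The idea of swappable prioritization schemes can be sketched with a small priority queue. The structure below is illustrative, not NiFi's API: a prioritizer is just a key function over flow files, so replacing it (size-based, time-based, or custom) re-orders the queue without touching the rest of the flow.

```python
import heapq
import itertools

# Illustrative pluggable prioritizers in the spirit of NiFi's
# (names and structure are assumptions, not NiFi's actual API).
_counter = itertools.count()   # tie-breaker keeps heap comparisons total

def put(queue, flow_file, prioritizer):
    heapq.heappush(queue, (prioritizer(flow_file), next(_counter), flow_file))

def get(queue):
    return heapq.heappop(queue)[2]

smallest_first = lambda ff: ff["size"]      # size-based scheme
oldest_first = lambda ff: ff["created"]     # time-based scheme

queue = []
for ff in ({"name": "a", "size": 300, "created": 1},
           {"name": "b", "size": 100, "created": 2},
           {"name": "c", "size": 200, "created": 3}):
    put(queue, ff, smallest_first)
```

With `smallest_first`, the queue releases "b", then "c", then "a"; swapping in `oldest_first` would instead release them in arrival order.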
• Data Provenance: Data provenance is one of the most important features of NiFi. It manages and controls the flow of data from beginning to end by automatically recording each performed action. From the data provenance page, the user/DFM can see what happened to the data: where it came from, where it went, what was done with it, and so on. This is useful when problems occur because it increases traceability and helps track down the issue. It also makes it possible to see the lineage, or flow hierarchy, of the data.
• Extensibility: Another feature NiFi provides is extensibility of its various components, such as Processors, Reporting Tasks, Controller Services, and Prioritizers. This is useful because it enables users or organizations to design their own extension points/components and embed them in NiFi to gain better service in their own specializations. The most widely extended component is the processor: many organizations design their own processors to ingest or egress data with NiFi. For example, in IoT applications data is produced by different devices in different formats, and these formats need to be ingested before they can be processed for insight; NiFi's extensibility can be used to design processors that ingest these formats into NiFi, where its built-in processors then process the ingested data according to the need. This makes extensibility one of NiFi's key features.
3.1.3 NiFi UI components
NiFi provides visual command and control for creating, managing, and monitoring data flows. After the application is started, entering the URL https://<hostname>:8080/nifi in a web browser brings up a blank NiFi canvas the first time. The <hostname> is the name of the server or the address the NiFi instance is running on, and 8080 is the default port number for NiFi. The points below describe the different components of the UI shown in Figure 3.3.
• Default URL address: As shown in Figure 3.3, since the machine is running locally, the hostname is "localhost" with the default port number 8080, which can be changed in the "nifi.properties" file in the NiFi directory.
• System Toolbars: NiFi has four system toolbars, namely the Component, Action, Search, and Management toolbars, as shown in Figure 3.3.
Figure 3.3: NiFi UI canvas
– Action: consists of buttons to perform Actions on a particular component.
Some of the actions are Enable, Disable, and Start if the process is not running; Stop if it is running; Copy to copy a particular component; Group to group different components together; and so on.
– Search: consists of the search field to search components existing on the canvas.
– Management: consists of buttons used by different users (DFMs, Admins) according to their privilege levels. It includes the Bulletin Board, the Summary page, Provenance, and so on.
• Status Bar: In Figure 3.3, the status bar includes the Status and Component Info areas labeled in the figure. Status shows the active threads, if threads are in use, and the total number of flow files queued between components; it also shows the existing clusters, how many nodes are connected, and a timestamp of the last refresh. Component Info shows how many processors or other components are running, stopped, invalid, or disabled, and so on.
• Navigation Pane and Bird's-Eye View: The navigation pane lets the user navigate, zoom in, and zoom out on the components in the canvas, while the Bird's-Eye View allows the user to view the whole data flow easily and quickly.
3.1.4 NiFi Elements
NiFi has different elements; some of them are discussed further in the subsections that follow, and Figure 3.4 shows the main components it supports.
1. User Management: NiFi provides mechanisms for user management and for controlling privileged access. It supports user authentication either by client certificates or by a username/password mechanism. Authenticated users access data flows in a browser over HTTPS. In order to use the username/password mechanism,
Figure 3.4: NiFi main components
a login identity provider and the "nifi.properties" file need to be configured: the configuration file must be specified, and the provider property set to indicate which provider should be used.
Likewise, for controlling access levels, NiFi provides a pluggable authorization mechanism that grants users access to the system and assigns them different roles. For this, the "nifi.properties" file is configured with these two properties:
• nifi.authority.provider.configuration.file - specifies the configuration file for authorization providers
• nifi.security.user.authority.provider - specifies which provider to use among the configured ones
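As a rough illustration of the two properties above, an excerpt from "nifi.properties" might look like the following. The values shown are placeholders for illustration, not defaults to rely on; consult the NiFi administration guide for a real deployment.

```
# illustrative excerpt from nifi.properties (placeholder values)
nifi.authority.provider.configuration.file=./conf/authority-providers.xml
nifi.security.user.authority.provider=file-provider
```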
It also provides Roles for controlling authorization; some of them are listed below, and users can have different Roles assigned to them.
• Administrator: configures user accounts and the size of thread pools
• Data Flow Manager (DFM): manipulates the data flow: designing, ingesting, routing, ...
• Read Only: may only view the data flow, not change it
• Provenance: able to query provenance repository and view lineage. Able to
finally output the data into other systems. It is also the main extension point that can be designed to enable organizations to input/output their flow files using NiFi.
Figure 3.5 below shows its anatomy:
Figure 3.5: NiFi Processor Anatomy
• Processor Type and Name: As the name implies, the processor type specifies the kind of processor used; in this example it is a "PutFile" processor, which is responsible for writing flow files to disk. The processor name appears in bold; by default it takes the type name as its name, but it can be renamed in the Settings tab of the processor's configuration page.
So in this example, the name is "Save Matched Tweets", a PutFile processor that stores matched tweets to disk.
• Status Indicator: This is the icon at the top left corner of the processor, showing its current status. Different status indicators are available depending on the state of the processor:
– Running: the processor is running. It has a green play icon.
– Stopped: the processor is currently stopped. It has a red icon.
– Invalid: the processor cannot be started because required properties are missing. The missing properties can be seen by hovering over the icon, which is a triangle with an exclamation mark inside it.
– Disabled: the processor is disabled and cannot be started until enabled.
• Flow Statistics: shows the statistics of the data flow over the past 5 minutes in four fields. In shows the number and total size of flow files ingested into the processor; Read/Write shows the total size of flow file content read from and written to disk; Out shows the number and total size of flow files transferred to the next processor/component; and Tasks/Time shows the number of tasks this processor performed and the time taken to perform them, all over the past 5 minutes.
3. Input/Output Ports: An Input Port is a NiFi component used for transferring data coming from other components or systems into a Process Group.
An Output Port is used for transferring data from a Process Group to destinations outside it, or to other components/systems such as Apache Spark.
4. Process Group and Remote Process Group: A Process Group is a NiFi component that logically groups a set of components, which makes maintenance easier. It prompts the user for a unique name and provides a level of abstraction.
A Remote Process Group (RPG), on the other hand, follows the same idea as a Process Group, but it connects to another NiFi instance remotely. Instead of a unique name, it asks for the URL of the remote instance, so that a connection is created between the RPG and that NiFi instance. It uses the Site-to-Site protocol to communicate with remote instances or other systems.
5. Template: A Template enables re-use of the components created inside it. Users can create Templates and export them in XML format; a Template can then be imported into other NiFi instances for use. It is thus the feature that makes NiFi data flows reusable.
6. Funnel: A Funnel is a component used for combining different components or processors into one, and it makes prioritization easier. In data flows with many processors, setting priorities at each processor hurts performance; NiFi instead provides the possibility to set priorities, and change them dynamically, at a single point, i.e., the Funnel.
7. Provenance and Lineage: Data provenance is both a key feature and an element of NiFi, keeping very great detail about each piece of data it ingests. The provenance repository stores everything that happens to the data from beginning to end, such as ingesting, routing, transforming, cloning, etc.
This means that everything that passes through NiFi is recorded and indexed, which makes it easy to search, to track down problems that occur and provide solutions, and to monitor the overall data for compliance.
There is a Provenance icon in the Management toolbar at the top right corner of the NiFi UI; it displays everything that has happened in the data flow and enables searching and filtering by Component Name, UUID, and Component Type. When the
"View Details" icon is clicked, the details of that particular event are displayed in 3 tabs, as in Figure 3.6: the Details tab lists the time, type of event, UUID, and so on; the Attributes tab lists all the attributes that existed at the time the event occurred, with their previous values; and the Content tab enables downloading or viewing the content. NiFi also makes it possible to see the provenance data for each processor by right-clicking on the processor and choosing Data Provenance.
Figure 3.6: NiFi Provenance
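The search-and-filter behavior of the provenance page can be mimicked with a small in-memory index. The event structure below is hypothetical (not NiFi's internal format); it only illustrates why indexing events by their attributes makes filtering by component name, UUID, or component type cheap.

```python
# Illustrative in-memory provenance store; field names are assumptions,
# not NiFi's actual event schema.
events = [
    {"uuid": "a1", "component_name": "Save Matched Tweets",
     "component_type": "PutFile", "event_type": "SEND"},
    {"uuid": "b2", "component_name": "GetTwitter",
     "component_type": "GetTwitter", "event_type": "RECEIVE"},
]

def search(events, **filters):
    """Return the events matching every given attribute filter."""
    return [e for e in events
            if all(e.get(k) == v for k, v in filters.items())]
```

For instance, filtering on `component_type="PutFile"` returns only the first event, just as filtering by Component Type does on the provenance page.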
Each event also offers a lineage view, "Show Lineage", which shows a detailed graphical representation of what happened to the data. It enables viewing details, seeing parents, and expanding a particular event as needed. It has a slider that makes it possible to see which event was created at what time, and how long it took to create, by dragging the slider. The lineage graph can also be downloaded, as shown in Figure 3.7.
Figure 3.7: NiFi Lineage
3.2 Apache Spark Streaming
Apache Spark is an open-source, fast, general engine for large-scale data processing. It was originally developed at the AMPLab at UC Berkeley, California, and is currently a top-level Apache project. Spark's core abstraction is called the Resilient Distributed Dataset (RDD), an immutable collection of elements. Apache Spark is the main API, with different components comprising Spark SQL, MLlib, GraphX, and Spark Streaming.
• Spark SQL :- is one of the modules in the Spark general core API that enables the user to work with traditional structured data .
• GraphX :- is another Spark API for graphs and graph-related operations .
• MLlib :- is a Spark API for machine learning which consists of various kinds of machine learning algorithms .
• Spark Streaming :- is a Spark API mainly dealing with computation and analysis of live streams of data arriving at specified intervals.
Apache Spark Streaming is the component of the Spark core API that processes streams of data as micro-batches. Other components of the Spark API, such as MLlib and Spark SQL, can also be used together with it for further processing.
3.2.1 Key Features
As one of the components of the Spark API, Spark Streaming shares the main features that Spark provides and adds others on top. Some of the main features are listed below.
• Spark Streaming provides a high-level abstraction called Discretized Streams (DStreams), which are built on Resilient Distributed Datasets (RDDs), Spark's main abstraction.
• It makes integration of streaming data with batch processing easy because it is part of the Spark API.
• It receives data from diﬀerent sources such as HDFS, Flume, and Kafka; it also enables custom made receivers.
• It supports the use of different programming languages such as Java, Scala, and Python.
• Fault tolerance - it has "exactly-once" semantics, which ensure that data is not lost and is processed exactly one time, avoiding duplicates; this is also advantageous for data consistency.
• Provides stateful transformations that maintain state even if one of the nodes fails.
3.2.2 Basic Concepts and Main Operations
The main programming model of Spark Streaming is its abstraction called Discretized Streams (DStreams). A DStream is a continuous stream of data, internally represented as a series of Resilient Distributed Datasets (RDDs).
A DStream can be created by ingesting data streams from different sources such as Kafka, Flume, or Twitter, or by applying transformations to other DStreams.
An RDD is Spark's main abstraction: a fault-tolerant collection of elements that can be executed on in parallel. Figure 3.8 shows the DStream as a continuous
Figure 3.8: Continuous RDDs form DStream - source 
stream of batches of RDDs at a specified time interval; when all these batches of RDDs are combined, they form the DStream. DStreams support transformations similar to those of RDDs in the Spark API; these transformations allow the data from input DStreams to be modified. Examples of such transformation functions include map, filter, and reduce.
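The per-batch model can be illustrated without a Spark cluster with a plain-Python sketch: each inner list stands for the RDD of one batch interval, and a word count is applied independently to every batch, mirroring what map and reduceByKey do on each RDD of a DStream.

```python
# Plain-Python sketch of per-micro-batch transformation (illustrative
# only; not the Spark Streaming API).
def dstream_word_count(batches):
    out = []
    for batch in batches:              # one RDD per batch interval
        counts = {}
        for word in batch:             # map + reduceByKey, per batch
            counts[word] = counts.get(word, 0) + 1
        out.append(counts)
    return out
```

Note that each interval produces its own independent result, which is exactly why window operations (below) are needed to compute across intervals.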
Spark Streaming also provides various kinds of operations on DStreams. The main ones are the Transform, Window, Join, and Output operations.
• Transform and Join :- The Transform operation allows RDD-to-RDD operations over DStreams, such as joining a data stream with other datasets. Spark Streaming enables different DStreams to be joined with other DStreams: stream-stream joins enable streams of RDDs to be joined with streams from other RDDs, while stream-dataset joins enable streams to be joined with datasets via the Transform operation.
• Window Operation :- Since the live streams of data coming from various sources are continuous, they cannot be computed as one batch of files, and traditional batch operations cannot be performed on them. Spark Streaming solves this with the Window operation, which enables these streams of data to be processed, transformed, and computed within a specified time range over a sliding window. Every window operation must specify a Window Length and a Sliding Interval: the Window Length is the duration of the whole window, while the Sliding Interval is the rate or interval at which the operation is performed.
There are many Window operations that Spark Streaming supports such as window, countByWindow, reduceByKeyAndWindow and so on.
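The interplay of window length and sliding interval can be sketched in plain Python (illustrative of countByWindow-style semantics, not the Spark API): with a window length of 3 intervals and a sliding interval of 2, every second batch boundary triggers a count over the last three batches.

```python
# Sliding-window sketch over micro-batches (illustrative only).
def count_by_window(batches, window_length, sliding_interval):
    results = []
    for end in range(sliding_interval, len(batches) + 1, sliding_interval):
        window = batches[max(0, end - window_length):end]
        results.append(sum(len(batch) for batch in window))
    return results
```

For batches of sizes 1, 2, 1, and 3 records, a window of length 3 sliding by 2 yields counts of 3 and then 6, because consecutive windows overlap by one interval.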
• Output Operation :- Spark Streaming supports many output operations that ensure the processed streams and data are stored in external storage such as HDFS, file systems, or databases, or even displayed on live dashboards.
print, saveAsTextFiles, saveAsHadoopFiles, and foreachRDD are some of the output operations that Spark Streaming provides.
How Spark Streaming operates can be summarized as:
• Receiving the input - the sources can be Kafka, Twitter, log data, etc.; the input is divided into small batches
• Spark engine - processes the received data held in Spark's memory
• Output batches of processed data to storage systems
Tasks are assigned dynamically to the nodes based on the available resources, which enables fast recovery from failures and better load balancing between the nodes.
Its ability to divide the input streams into small batches enables it to process the data in batches, reducing the latency compared with computing records one by one.
Figure 3.9: Spark Cluster - source 
In addition, Spark Streaming runs on a cluster, as in Figure 3.9. The main program in a Spark cluster (also known as the driver program) has a SparkContext that coordinates the Spark application running on the cluster. The first step is creating a connection to an available cluster manager, which allocates resources to individual applications. Once the connection is created, Spark acquires executors on worker nodes, which run application code in the cluster; Spark then sends the code to the executors, which run the tasks.
3.3 Apache Storm
The other tool studied in this chapter is Apache Storm. An overview of the tool, its features, its components, and its main use cases is presented briefly.
Apache Storm is a distributed, resilient, real-time computation system. It was developed by Nathan Marz and became open source in September 2011. It works in ways quite similar to Hadoop, except that Apache Storm targets real-time streaming data while Hadoop targets batch processing.
3.3.2 Basic Concepts and Architecture
In this subsection, the different components and concepts of Storm are discussed and its architecture is presented.
• Tuple :- is the primary data structure in Storm: a list of values that supports any data type.
• Streams :- is a core abstraction in Storm: an unbounded sequence of tuples. A stream can be formed by transforming another stream.
Streams support primitive types such as longs, strings, and byte arrays, and also custom types defined by users, provided they implement their own serializers.
• Spouts :- are the main entry point of streams into Storm. Different external sources such as Kafka and the Twitter API ingest their data through spouts. Spouts can be reliable, where replaying a lost tuple is possible if failures occur, or unreliable, where replaying is not possible and the data will be lost.
• Bolts :- are where the main processing takes place. A bolt takes input from spouts and processes it, and finally the processed tuples are emitted to downstream bolts or to storage such as databases. The processing part includes stream transformation, running functions, aggregating, filtering, joining data, or sending it to databases.
• Topology :- is the main abstraction of Storm: a network of spouts and bolts connected by stream groupings. Each node of the graph/network represents either a spout or a bolt, and the edges represent which bolts are subscribed to which component, i.e., a spout or a bolt.
In Figure 3.10, the nodes are spouts (S1, S2) and bolts (B1, B2, B3, B4): B1, B2, and B4 are subscribed to streams coming from S1, and B4 is additionally subscribed to streams coming from S2. This shows that in a topology, tuples are streamed only to the components subscribed to them.
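The subscription rule can be sketched as a toy model of the topology in Figure 3.10. This is illustrative only, not the Storm API; B3's subscription to B1 is an assumption, since the text states only the subscriptions of B1, B2, and B4.

```python
# Toy subscription table for the topology in Figure 3.10 (illustrative;
# B3 -> B1 is an assumed edge).
subscriptions = {
    "B1": ["S1"], "B2": ["S1"], "B3": ["B1"], "B4": ["S1", "S2"],
}

def receivers(source):
    """Bolts that receive a tuple emitted by `source`."""
    return sorted(b for b, subs in subscriptions.items() if source in subs)
```

A tuple emitted by S1 reaches B1, B2, and B4, while one emitted by S2 reaches only B4, matching the figure.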
• Trident :- is an API which is part of Storm that is built on top of it. It supports
Figure 3.10: Storm Topology - source 
• Stream Grouping :- Storm has different built-in stream groupings and also supports custom-made stream groupings. The main ones include Shuffle Grouping and Fields Grouping: Shuffle Grouping randomly distributes the tuples among the tasks of a bolt, while Fields Grouping groups together tuples having the same value for a given field.
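The two main groupings can be sketched as follows (illustrative only, not the Storm API): fields grouping hashes the chosen field deterministically, so tuples with the same field value always land on the same bolt task, keeping per-key state on one worker; shuffle grouping simply spreads tuples at random.

```python
import random
import zlib

# Illustrative grouping functions; names and signatures are assumptions.
def fields_grouping(tup, field, num_tasks):
    """Deterministic hash of one field -> same value, same task."""
    return zlib.crc32(str(tup[field]).encode()) % num_tasks

def shuffle_grouping(num_tasks, rng=random.Random()):
    """Random task choice, spreading load evenly on average."""
    return rng.randrange(num_tasks)
```

Calling `fields_grouping` twice with the same field value always returns the same task index, which is the property that makes per-key aggregation in a bolt possible.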
• Task :- refers to a thread of execution.
• Worker :- executes a subset of all the tasks existing in the topology.
Storm supports both local and remote modes of operation: local mode is mainly useful for developing and testing topologies, while in remote mode,
topologies are submitted for execution on a cluster. There are two kinds of nodes in a Storm cluster: the master node and the worker nodes. The Storm architecture has three main components: Nimbus, a daemon that runs on the master node; the Supervisor, a daemon running on each worker node; and Zookeeper, which mainly handles the communication between Nimbus and the Supervisors, as shown in Figure 3.11.
Their functionality is summarized in Table 3.1:
Figure 3.11: Storm Cluster -source 
• Nimbus: assigns tasks to worker nodes; monitors for failures; distributes code among the cluster components.
• Supervisor: receives the work assigned to its worker node; starts and stops worker processes as required.
• Zookeeper: handles the communication between Nimbus and the Supervisors; keeps the state of the topology.
Table 3.1: Storm architecture components and their functionality
The features of Storm also show its advantages and why it is popular nowadays for stream data processing. Some of the main features include:
• Reliability :- Storm provides guaranteed message processing, using "exactly-once" semantics from the Trident API or "at-least-once" semantics from core Storm. It also makes sure that specific messages are replayed in case a failure occurs on them.
• Fast and Scalable :- supports adding machines in parallel, horizontally, and scales quickly with an increasing number of machines.
• Fault-Tolerant :- Failure in Storm occurs, for example, when a worker dies or when the node itself dies. In the first case, the Supervisor handles the failure by automatically restarting the worker; in the second case, the tasks time out and are assigned to another machine or node.
• Support for many Languages :- Storm uses a Thrift API that makes it possible to support many programming languages, such as Scala, Java, Python, etc.
Review and Comparison of the Tools
In this chapter, the tools studied in the previous chapter are further reviewed and then compared based on some selected parameters. The parameters are not selected based on any particular model, but rather from the characteristics of the tools. It is important to answer questions such as:
• Which tool is preferable when one parameter matters more than another?
• What would the complexity be if we use a given tool for a given case?
• How does each tool respond to the specified parameters?
The selected parameters include:
(i) Ease of use (ii) Security (iii) Reliability
(iv) Queued data/data buffering (v) Extensibility
4.1.1 Apache NiFi
• Ease of Use : NiFi's ease of use comes from its friendly drag-and-drop user interface, from which the activity and the flows are controlled. With complex data flows of different types, handling everything from the command line is very complex and gives little insight; NiFi solves this by allowing all flows to be designed in a UI, which reduces complexity, allows fast recovery from problems, and makes maintenance easy. Another feature that makes NiFi easy to use is that its flows can be changed and customized on the fly without affecting other parts of the flow. It also accepts data from a variety of sources in different formats.
• Security : NiFi has built-in security and supports different security schemes at both the user and system levels. Each data flow can be encrypted/decrypted using processors provided for this purpose. It offers both certificate and username/password authentication mechanisms, using two-way SSL authentication in which a specific user is granted access if the certificate used is legitimate, with acknowledgments exchanged between the client/browser and the server. It also has an access-level authorization scheme where users are assigned different roles. This is important for use cases where security is critical, such as the financial, governmental, and similar sectors.
• Reliability : The reliability of a system is its ability to function properly for its intended purpose without failure, including the ability to provide guaranteed delivery of the processes at hand. NiFi is a reliable system: it provides this through the content repository and the Write-Ahead Log (WAL) mechanism, where changes are recorded in the log files before the content is written to disk. Hence, if a problem occurs, the data can be recovered from the log files without affecting the flow.
• Queued Data/Buffering : A data buffer is a memory area where data is stored temporarily. Data queues up when it is not processed at a given time or when a node fails, and this queued data has to be held in some buffer. If queued data is always kept, it consumes memory, so there has to be an efficient way to handle such cases without exhausting resources. In this regard, NiFi buffers queued data efficiently, keeping the queued data in memory. It has a back pressure mechanism: a limit on queued data is specified, and once that limit is reached, no more data is accepted until the queued data is processed and memory space is released. Through these features, NiFi handles queued data efficiently.
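The back pressure behavior can be sketched with a bounded queue (illustrative only, not NiFi's implementation): a connection accepts new flow files only while its queue is below the configured threshold, forcing the upstream processor to pause until queued data is drained.

```python
from collections import deque

# Minimal back-pressure sketch; class and method names are illustrative.
class Connection:
    def __init__(self, limit):
        self.limit = limit
        self.queue = deque()

    def offer(self, flow_file):
        """Accept a flow file, or refuse it when back pressure applies."""
        if len(self.queue) >= self.limit:
            return False
        self.queue.append(flow_file)
        return True

    def take(self):
        """Drain one flow file, releasing queue space."""
        return self.queue.popleft()
```

With a limit of 2, a third `offer` is refused until a `take` drains the queue; the refusal is what propagates "slow down" upstream.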
• Extensibility : NiFi's extensibility feature has various uses. It has many extension points that users can implement according to their needs, such as processors, reporting tasks, and controller services, to mention a few.
Flows can be changed in real time without affecting other flows. There is no need to recompile the whole flow: when a new part of a flow is created or an old one removed, the effect is visible in the UI in real time without compilation.
4.1.2 Spark Streaming
• Ease of Use : Spark Streaming's ease of use comes from its core Spark API, which has APIs for different programming languages: it supports Scala, Java, and Python. This is useful for users familiar with those languages, and it becomes more flexible and reaches more users as the number of supported languages increases. It also has an interactive shell and supports different APIs.
• Security : Spark supports authentication through Kerberos and through a Shared Secret. Kerberos authentication requires creating a principal and keytab file and configuring the Spark history server to use Kerberos; it is only supported when Spark runs on a YARN cluster, not in standalone mode. The second mechanism is shared-secret authentication, where Spark and the other system perform a handshake and can only communicate if both hold the same secret key. For this authentication to work, the “spark.authenticate” parameter must be set to true.
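A minimal configuration fragment for shared-secret authentication might look as follows (property names are from the Spark configuration documentation; the secret value is a placeholder):

```properties
# Enable Spark's internal authentication via a shared secret
spark.authenticate         true
# Secret key that both communicating parties must share
# (set manually in standalone mode; on YARN the secret is
# generated automatically)
spark.authenticate.secret  changeMeToARandomSecret
```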
• Reliability : Spark Streaming is a reliable, fault-tolerant computation framework in which data processing is guaranteed. It uses several mechanisms to provide fault tolerance and guaranteed delivery, such as Exactly-Once delivery semantics and Write Ahead Logging (WAL). Exactly-Once semantics means that every record is processed exactly one time, so no duplicates are produced. Failure may occur in two forms: node (executor) failure and driver (main program) failure. When a node fails, it is automatically restarted and normal operation continues, because the data blocks held by the receivers are replicated; once data has been ingested by a node, it is guaranteed to be processed. When the driver dies, all executors fail and the received blocks are lost. If DStream checkpointing is enabled, the main program can be restarted from the last checkpoint, after which all executors are restarted.
DStream checkpointing periodically stores the streaming state in a fault-tolerant directory such as HDFS. Failure may also occur while input data is being received; in that case Spark Streaming recovers some of the data, but not all of it.
To recover all of it, Spark provides the WAL, where ingested data is written synchronously to fault-tolerant storage such as HDFS or S3 before being processed. If the data is received correctly, an acknowledgment is sent and the data is processed. If no acknowledgment is sent, a failure has occurred, so Spark reads the log files and the data is sent again to be processed from there. Together, these mechanisms make Spark Streaming a reliable and fault-tolerant processing framework.
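The log-before-processing idea behind the WAL can be sketched as follows (plain Python, not Spark’s actual implementation): each record is appended to a durable log before processing and marked done afterwards, so that after a crash the records that were logged but never processed can be replayed.

```python
import json
import os
import tempfile

# Illustrative sketch (not Spark's WAL): append records to a log file
# before processing them; on restart, replay whatever was logged but
# never marked as processed.
class WriteAheadLog:
    def __init__(self, path):
        self.path = path

    def append(self, record):
        with open(self.path, "a") as f:
            f.write(json.dumps({"record": record, "done": False}) + "\n")

    def mark_done(self, record):
        with open(self.path, "a") as f:
            f.write(json.dumps({"record": record, "done": True}) + "\n")

    def pending(self):
        logged, done = [], set()
        if not os.path.exists(self.path):
            return []
        with open(self.path) as f:
            for line in f:
                entry = json.loads(line)
                if entry["done"]:
                    done.add(entry["record"])
                else:
                    logged.append(entry["record"])
        return [r for r in logged if r not in done]

log_path = os.path.join(tempfile.mkdtemp(), "wal.log")
wal = WriteAheadLog(log_path)

# Receive three records; log each one synchronously before processing.
for rec in ["a", "b", "c"]:
    wal.append(rec)
wal.mark_done("a")
wal.mark_done("b")
# Crash here -- "c" was logged but never processed.

# On restart, only the unprocessed record is replayed.
print(wal.pending())   # -> ['c']
```

The key property is that a record reaches the log before any processing happens, so a crash can lose at most work that is still recoverable from the log.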
• Queued Data/Buﬀering : In real-time data processing, a queue builds up whenever data is not processed within the specified time interval, i.e. when processing is slower than the rate at which data is received. This data is queued in a buﬀer, which keeps growing if nothing is processed or removed. In Spark Streaming, the data is likewise queued as DStreams in memory, and the queue can keep increasing. To overcome this, Spark Streaming provides configuration parameters that limit the rate at which data is received and processed. It also oﬀers other measures, such as reducing batch processing times or choosing a batch size that allows batches to be processed as fast as they arrive.
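The rate-limiting parameters mentioned above can be set, for example, in spark-defaults.conf (property names are from the Spark Streaming configuration documentation; the values are illustrative):

```properties
# Upper bound on records ingested per second per receiver
spark.streaming.receiver.maxRate       10000
# Let Spark adapt the ingestion rate dynamically to the
# current batch scheduling delays and processing times
spark.streaming.backpressure.enabled   true
```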
• Extensibility : When new application code exists and needs to replace the old application code, Spark Streaming provides two ways, one of which is shutting