Academic year: 2022

Data Cleaning Extension on IoT Gateway

An Extended ThingsBoard Gateway

David Adolfsson, Fredrik Hallström

Faculty of Health, Science and Technology
Computer Science

C-Dissertation, 15 HP

Supervisors: Mohammad Rajiullah, Andreas Kassler
Examiner: Per Hurtig

Date: 2021-05-31


Forewords

We would like to extend our gratitude to our supervisor Andreas Kassler for much-valued input and help with the direction of the thesis work. We would also like to thank Mohammad Rajiullah for providing feedback on the dissertation. Finally, we thank Bestoun S. Ahmed Al-Beywanee for providing us with the necessary resources to get started.


Abstract

Machine learning algorithms that run on Internet of Things sensor data require high data quality to produce relevant output. By cleaning data at the edge, cloud infrastructures performing AI computations are relieved of pre-processing. The main problem with edge cleaning is its dependency on unsupervised pre-processing, which gives no guarantee of high-quality output data. In this thesis, an IoT gateway is extended to provide cleaning and live configuration of cleaning parameters before forwarding the data to a server cluster. Live configuration makes it possible to fit the parameters to a given time series and thereby mitigate quality issues. The performance of the gateway framework and the resource usage of its container were benchmarked using an MQTT stress tester. The gateway's performance was below expectations: with high-frequency data streams, the throughput dropped below 50%. However, these issues do not affect its Glava Energy Center connector, as that sensor data is generated at a slower pace.


Contents

Forewords
Abstract
List of Figures

1 Introduction
1.1 Background
1.2 Objectives of the thesis
1.3 Method
1.4 Stakeholders
1.5 Delimitations
1.6 Disposition
1.7 Work distribution

2 Background
2.1 Introduction
2.2 IoT and big data
2.3 Data pre-processing
2.3.1 Data characteristics
2.4 Related work
2.5 ThingsBoard platform
2.6 ThingsBoard gateway
2.6.1 Connectors
2.6.2 Converters
2.6.3 Event storage
2.7 Setup
2.7.1 Docker
2.7.2 Jetson Nano
2.8 Summary

3 Design
3.1 Introduction
3.2 Glava implementation
3.3 Glava batch implementation
3.4 Connector used for Glava
3.5 Cleaning implementation
3.6 Live configuration
3.7 Summary

4 Implementation
4.1 Introduction
4.2 Prerequisites
4.3 Environment configuration
4.3.1 Docker container
4.3.2 ThingsBoard configuration
4.4 Request connector implementation
4.4.1 Endpoint configuration
4.4.2 Error handling
4.4.3 Real time implementation
4.4.4 Batch implementation
4.4.5 Request converter
4.5 Data cleaning
4.5.1 Live configuration initialization
4.5.2 Cleaning configuration
4.5.3 Adding new devices
4.5.4 Adding telemetry
4.5.5 Cleaning algorithms
4.6 ThingsBoard gateway service
4.7 Summary

5 Results
5.1 Introduction
5.2 Benchmarking environment
5.2.1 Environment
5.2.2 Conditions
5.3 Comparisons
5.4 Benchmarking results
5.5 Discussion
5.5.1 Gateway performance
5.5.2 Objective completion
5.6 Summary

6 Conclusion
6.1 Future work
6.2 Personal project evaluation

References

List of Figures

2.1 Architectural overview of ThingsBoard gateway components[1]
3.1 Original design of ThingsBoard gateway
3.2 Updated design of ThingsBoard gateway
4.1 Definition of the function install
4.2 Snippet from setup.py
4.3 Docker command to pull image
4.4 Setup command for ThingsBoard IoT gateway
4.5 Opening bash in container
4.6 Copying files from container
4.7 Copying files to container
4.8 Building live environment for latest code
4.9 Installing live environment for latest code
4.10 ThingsBoard cluster connection.py
4.11 Active connectors
4.12 Check configuration file
4.13 Generating date for GEC's URL
4.14 Formatting date and time for GEC's URL
4.15 Resending requests upon bad responses
4.16 Code for reconstructing telemetry from GEC
4.17 Batch implementation
4.18 Detailed device name in uplink converter
4.19 Update cleaning configuration
4.20 Cleaning configuration parameters
4.21 Retrieving cleaning parameters
4.22 Looping through the array to check if the device exists
4.23 Creating device and adding the sensor data
4.24 Adding device array
4.25 Adding telemetry and applying cleaning
4.26 Removing the first data point of a specific sensor
4.27 Exponential Smoother
4.28 Gateway service calling cleaning methods
5.1 Default gateway
5.2 Cleaning framework with cleaning set to off
5.3 Cleaning framework with cleaning set to exponential smoothing
5.4 Cleaning framework with cleaning set to KMeans

Chapter 1

Introduction

1.1 Background

Sensor data generated in industry has become more reliable over time, but missing values and faulty data in the form of outliers remain an issue when analyzing and evaluating data. This becomes a considerable problem when machine learning algorithms perform computations on corrupt data. By gathering the data before it reaches its final destination, it is possible to identify and correct anomalous data. Data correction at such an early stage, in combination with AI moving from a centralized cloud model to the edge, provides lower latency and is thus critical for the future of IoT.

ThingsBoard is an IoT platform where sensor data can be processed, monitored and visualized. The platform also provides an IoT gateway that is able to capture, process and forward data from multiple sources. By having various kinds of connectors on the gateway, it can utilize several industry protocols for data transmission, such as MQTT, OPC-UA and HTTP. The gateway is designed to run on systems with constrained resources and is written in Python. The gateway framework does not support data cleaning out of the box, but it is open source and can be extended to suit the needs of the user.


1.2 Objectives of the thesis

The problem this thesis aims to resolve is the lack of a cleaning framework for missing data and outliers at the edge that is configurable without disrupting the production flow of data. The main goal is to develop a data cleaning framework that is easily configurable and can detect outliers and missing values in time series data. The framework should be built upon the existing ThingsBoard IoT gateway, which currently does not provide data cleaning, and should be usable by Glava Energy Center (GEC). The secondary goal is to develop the framework with Bharat Forge Kilsta (BFK) in mind, as they are a potential user.

1.3 Method

A preliminary plan was initially made, with different focus areas over various time spans. The direction of the project proved hard to anticipate, and a more flexible, agile-like approach was therefore adopted. Meetings were held every other week with the supervisors and other stakeholders. Through these meetings, the next steps in the project were discussed and determined throughout the thesis work.

1.4 Stakeholders

The work presented in this thesis is part of the AI4Energy project, a collaboration between Glava Energy Center and Karlstad University funded by the Swedish Energy Agency. AI4Energy aims to optimize energy management schemes for smart grids and predict uncertainties in renewable energy production. While GEC has been the main target in developing the framework, it may also be used in the Smart Forge project, where machine learning algorithms are used to automate the heat control system in BFK's forging line to reduce scrap. The gateway provides a data cleaning framework to correct faulty data in the time series upon which the machine learning algorithms run.

1.5 Delimitations

The focus of this thesis is to develop a framework where cleaning can be easily implemented and configured. As there is a large number of different time series with varying characteristics, determining which cleaning algorithm is most suitable lies outside the scope of the thesis. That being so, a few algorithms have been implemented as a proof of concept.

1.6 Disposition

This thesis is structured as follows. Chapter 2 presents the ThingsBoard platform and gateway, which are used to process the data, along with data pre-processing and why it is relevant. Chapter 3 discusses the various design choices and arguments for specific implementations towards GEC and possible live configuration integration. Furthermore, this chapter weighs the benefits and drawbacks of where the cleaning framework could be integrated in the original structure. In chapter 4, the integration of the framework is presented with code and descriptive text. The fifth chapter describes the benchmark tools and test environment and presents the results collected from the performance test of the framework. Chapter 6 concludes the dissertation with a discussion of the results and proposes future changes to the framework.

1.7 Work distribution

The project has mainly been built using a pair programming approach, though some areas have required more attention from one of the students due to experience, curiosity, interest or convenience. Fredrik has focused mainly on the structure of how the device arrays used in cleaning are built and integrated into the existing framework.

David has focused on integrating the cleaning framework and the data cleaning algorithms on top of said structure. When both those areas were completed, we reviewed each other's code and fixed any bugs. This way we gained considerable insight into the functionality of the code we were not "in charge" of.

As for the benchmarking part, Fredrik has worked on adjusting the MQTT stresser to suit our needs. David deployed the data cleaning gateway framework on the Jetson Nano and installed monitoring tools to evaluate the running container.

All areas of the dissertation have been written and reviewed together.

Chapter 2

Background

2.1 Introduction

The main objective of this thesis is to construct an IoT gateway framework based on the ThingsBoard gateway architecture. The gateway is responsible for pre-processing data from IoT devices before forwarding it to the IoT device management platform ThingsBoard.

This implementation is part of two projects: SmartForge, where BFK wants to automate control of their forging line to reduce the amount of metallic waste from faulty forging, and AI4ENERGY, in which GEC aims to use machine learning for renewable energy prediction. The machine learning algorithms that consume data from the ThingsBoard platform require high quality data to operate as intended. Section 2.2 describes what IoT and big data are and how pre-processing ties into the subject. Section 2.3 discusses data pre-processing in general and some data characteristics of importance in the thesis. Section 2.4 reviews related work, and sections 2.5 and 2.6 introduce the ThingsBoard platform together with the gateway and its components. Section 2.7 concludes the background by describing the environment setup.


2.2 IoT and big data

The Internet of Things (IoT) is estimated to have been "born" between 2008 and 2009, as the number of "things" per person increased from 0.08 in 2003 to 1.84 by 2010[2]. IoT can be described as a network of "things" ranging from embedded sensors and smart home equipment to control systems. These devices have the purpose of exchanging data with other devices and systems over the internet. The wide range of applications that exist for IoT devices is divided into four main categories: consumer, commercial, industrial and infrastructure[3]. This thesis focuses on the industrial and infrastructural categories, not the consumer and commercial aspects. The industrial applications of IoT are aimed at gathering and analyzing data from connected equipment to monitor and regulate industrial systems. Similarly, the infrastructural applications can be used to increase productivity by aiding decision making[3].

The field of analyzing and processing large amounts or complex sets of data is called big data[4]. As the amount of data collected from IoT devices is increasing, big data is necessary to detect unseen patterns and hidden correlations[5]. [6] recognizes the necessity for big data analysis as the volume exceeds the limit for human interpretation.

Conventional recurring characteristics of big data are the volume of the generated and stored data, the variety of the data managed, and the velocity at which the data is generated and processed[4][7].

Collected data can be inconsistent and incorrect in different ways. When analyzing data, especially for machine learning purposes, low quality input results in low quality output. Machine learning algorithms applied to low quality data compromise subsequent decision making and may have costly consequences[8]. One solution to this problem is data pre-processing, and more specifically data cleaning (data cleansing).

Data cleaning is the process of detecting and correcting faulty data. A few methods for correcting faulty data are discarding, tuning or replacing the data [8].
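The correction methods above can be illustrated with a toy routine; the function name, parameters, and the use of the last good value as a stand-in for "tuning or replacing" are our own, not the thesis's implementation:

```python
def correct(series, is_faulty, strategy="replace", fallback=0.0):
    """Correct flagged points: either discard them, or replace them with
    the previous good value (a simple stand-in for tuning/replacing)."""
    out = []
    for x in series:
        if not is_faulty(x):
            out.append(x)
        elif strategy == "discard":
            continue                      # drop the faulty point entirely
        else:
            out.append(out[-1] if out else fallback)  # carry last good value
    return out

data = [1.0, 2.0, None, 3.0]
print(correct(data, lambda v: v is None))             # [1.0, 2.0, 2.0, 3.0]
print(correct(data, lambda v: v is None, "discard"))  # [1.0, 2.0, 3.0]
```

Which strategy is appropriate depends on the downstream consumer: discarding shortens the series, while replacement preserves its length at the cost of introducing synthetic values.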


2.3 Data pre-processing

Data pre-processing consists of four underlying subjects: Data cleaning, data integra- tion, data reduction and data transformation, which are described below. The main purpose of pre-processing data is to improve data quality in terms of accuracy and completeness, among others[9].

• Data Integration - Data may come from several sources and could be structured differently. For the data to be useful it has to be transformed into a uniform format[10].

• Data Reduction - Data reduction decreases the volume of gathered data without losing the information sought. Expansive data collections may hold irrelevant data which can be excluded[10].

• Data Transformation - Transformation of data consists of reformatting and stan- dardization of data[10].

• Data Cleaning - Data is bound to have anomalies, as the quality of data is never perfect. This is due to factors such as signal interference in data transmission, equipment faults and many other causes. These anomalies can be split into different categories[11] depending on their characteristics.

2.3.1 Data characteristics

The main data quality characteristics presented in this thesis are accuracy and com- pleteness. These two characteristics represent the anomalies reappearing throughout the historical data.

Accuracy: Data is considered accurate if it corresponds to the real-world value that it is supposed to represent. Accuracy is a measure of how close a value v is to v', the value considered correct[12]. A data point is considered an outlier if it surpasses a certain deviation from a reference value. Validity can be considered a separate characteristic; however, in some cases it is included in the accuracy characteristic[13].

Data validation checks whether the data meets the requirements of the expected metadata type[14]. In this thesis it is considered part of the accuracy characteristic.

Completeness: Completeness is the extent to which data is present in a data collection compared to the source data. It is measured as the percentage of real-world information entered compared to the source data[12]. Values missing from a data collection are represented by either a "NaN" string or a null value.
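As a small illustration of the completeness measure, the fraction of present entries in a series can be computed by treating both representations above as missing; the function name is our own:

```python
def completeness(values):
    """Fraction of usable entries in a telemetry series.

    An entry counts as missing if it is None (a null value) or the
    string "NaN", matching the two representations described above.
    """
    if not values:
        return 0.0
    present = sum(1 for v in values if v is not None and v != "NaN")
    return present / len(values)

series = [21.3, None, 21.7, "NaN", 22.0]
print(completeness(series))  # 3 of 5 entries present -> 0.6
```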

There are other characteristics worth mentioning that are of less relevance to this thesis, as no observations of the related quality dimensions are affected in the sample data. These are:

• Consistency - to what extent the data is of the same format as previous data[12].

• Timeliness - delay between a real world state change to that of the change in the information system[12].

• Uniqueness - measurement of duplicates of entities in a data set, if an entity is unique in a data set there is only one of that logical entity that exists within the data set[15].

2.4 Related work

[16] compares several anomaly detection algorithms for real-time big data, such as HTM, MAD, Twitter ADVec and ARIMA, among others. The algorithms are run on several time series datasets, including energy consumption, CPU utilization and taxi rides in New York, which have been manually inspected for outliers. The study was looking for algorithms that performed well under a variety of conditions, were fast, and had a satisfactory rate of true positive outliers. It showed that of all tested algorithms, ARIMA (autoregressive integrated moving average) had the best true positive rating, the lowest false positives and almost no false negatives. ARIMA is used for predicting (forecasting) future values in time series.

[17] uses simple and exponential smoothing to study prediction effectiveness on data generated in real time from an advanced mechanical operating system. The simple moving average (SMA) and exponential moving average (EMA) are tested with different periods or spans (the window length of values to perform smoothing on). The comparison between observed and predicted data is carried out by calculating the Root Mean Squared Deviation (RMSD), Split Error (SE) and Average Deviation (AD), respectively. Sharp deviations are detected by AD and RMSD; however, the methods do not handle deviations that continue to occur over time.
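As a rough sketch of the two smoothers compared in [17] (not the paper's implementation), SMA averages a sliding window while EMA weights recent values more heavily; window and span values are illustrative:

```python
def sma(series, window):
    """Simple moving average: mean of the last `window` observations."""
    out = []
    for i in range(len(series)):
        lo = max(0, i - window + 1)
        out.append(sum(series[lo:i + 1]) / (i + 1 - lo))
    return out

def ema(series, span):
    """Exponential moving average with smoothing factor 2 / (span + 1)."""
    alpha = 2.0 / (span + 1)
    out = [series[0]]
    for x in series[1:]:
        out.append(alpha * x + (1 - alpha) * out[-1])
    return out

data = [10, 10, 10, 30, 10, 10]  # one sharp deviation
print(sma(data, 3))
print(ema(data, 3))
```

Both smoothers dampen the spike at 30 but react to it with a lag, which hints at why persistent deviations are harder to handle than sharp ones.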

2.5 ThingsBoard platform

ThingsBoard is an open source IoT management platform which provides data collection, visualization and processing for IoT devices. It provides out-of-the-box support for industry standard IoT protocols such as HTTP, MQTT and CoAP while being highly customizable. The platform features real-time dashboards in its WebUI and Remote Procedure Calls (RPC) to directly execute commands on connected devices[18]. Data can either be streamed directly to the cluster or through an IoT gateway. ThingsBoard comes in two different architectures: monolithic and microservices.

In the monolithic mode, ThingsBoard's different components share the same operating system resources and are launched in a single Java Virtual Machine. The small memory footprint of the monolithic architecture is an advantage, as it can run in an environment with constrained resources. The monolithic architecture consists of the transport components, the rule engine component and the core services. In comparison, the microservices architecture is preferable when scalability and high availability are required[19].

Figure 2.1: Architectural overview of ThingsBoard gateway components[1]

2.6 ThingsBoard gateway

The ThingsBoard cluster can connect to and utilize IoT gateways in order to, among other things, unburden itself. Gateways can be used to gather device clusters and perform filtering and analysis on the data[1].

The ThingsBoard gateway (figure 2.1) and the cluster communicate over MQTT through the ThingsBoard Gateway Client, which is one of several core components of the ThingsBoard IoT gateway. Other components include Connectors, Converters, Event Storage and the Gateway Service, some of which are introduced in the coming sections. The ThingsBoard Gateway Client works by polling the Event Storage and delivering the stored telemetry data to the ThingsBoard cluster. The Gateway API enables sending multi-device data to the platform over a single MQTT connection[1].

The gateway is a software component written in Python and is designed to run on Linux based microcomputers that support Python 3.5[1].

2.6.1 Connectors

Data from external sources flows through the appropriately configured connector into the gateway. Depending on the protocol defined in each connector, it will either poll or subscribe to updates from devices. Connectors are highly customizable, which allows for a wide range of device connectivity; however, in this work only the OPC UA and Request connectors are used in production[1]. When benchmarking the gateway framework, an MQTT simulation server is set up that stress tests the gateway by sending MQTT messages. For this, the MQTT connector is used.

2.6.2 Converters

Converters are attached to connectors on the device side and serve the purpose of transforming the data into a format interpretable by both the device and the ThingsBoard cluster. Downlink converters are necessary for calls from ThingsBoard to devices, whereas uplink converters transform device data into a format recognized by ThingsBoard[1].
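To illustrate, a minimal uplink conversion might map a raw device reading into a telemetry message. The structure below only approximates the general shape of gateway telemetry messages; the field names are illustrative, not the exact ThingsBoard schema:

```python
import time

def uplink_convert(device_name, raw):
    """Map a raw reading like {"temp": "21.4"} into a gateway-style
    telemetry message (illustrative structure, not the exact schema)."""
    return {
        "deviceName": device_name,
        "telemetry": [{
            "ts": int(time.time() * 1000),  # timestamp in epoch milliseconds
            "values": {k: float(v) for k, v in raw.items()},
        }],
    }

msg = uplink_convert("gec-sensor-1", {"temp": "21.4", "hum": "40"})
print(msg["telemetry"][0]["values"])
```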

2.6.3 Event storage

The gateway uses event storage to store data temporarily before it is forwarded to the ThingsBoard server. Data can be stored using either in-memory storage or persistent file storage, which stores data in base64 encoding[1].
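A sketch of the idea: each event can be serialized and base64-encoded before being appended to the storage file, and decoded again when the gateway client polls the storage. The real gateway defines its own record layout; this only shows the encoding round trip:

```python
import base64
import json

def encode_event(event):
    """Serialize a telemetry event to a base64 line, as persistent file
    storage keeps its records base64-encoded (layout is illustrative)."""
    return base64.b64encode(json.dumps(event).encode("utf-8")).decode("ascii")

def decode_event(line):
    """Recover the original event from its base64 line."""
    return json.loads(base64.b64decode(line))

evt = {"ts": 1622457600000, "values": {"power": 3.2}}
line = encode_event(evt)
assert decode_event(line) == evt  # the round trip is lossless
```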


2.7 Setup

A Jetson Nano device will host a containerized environment which runs the IoT gateway.

The connections towards BFK and GEC will be established through the OPC UA and Request connectors. The request connector towards GEC will be referred to as the Glava connector.

2.7.1 Docker

The finished gateway framework will run in a Docker container to ensure reliability through pre-compiled images. A container is an environment with all the required dependencies for a specific application and can therefore be deployed anywhere Docker is available. Containers generally require less computational power than virtual machines by virtualizing the operating system rather than the hardware, which makes them suitable for running on hardware with limited resources[20].

2.7.2 Jetson Nano

An Nvidia Jetson Nano is a microcomputer targeted at AI development in a compact form factor. The device has a separate graphics processing unit (GPU) with 128 cores[21].

2.8 Summary

In this chapter we introduced the ThingsBoard platform, big data characteristics and why data pre-processing is important. The gateway was then introduced as a way to reduce the load on the ThingsBoard cluster. This chapter provides the background needed to understand the design decisions in the following chapter.

Chapter 3

Design

3.1 Introduction

In this chapter the design of the gateway framework is discussed. The main design decision is whether the gateway should collect batches or one continuous stream of data; this is discussed in sections 3.2 and 3.3. Section 3.4 discusses the specific Glava connector. Sections 3.5 and 3.6 bring up different choices regarding the cleaning architecture and how it can be configured in real time.

3.2 Glava implementation

The data we are interested in is stored by GEC on their web server and is continuously being updated with more data. HTTP requests are sent to the web server to retrieve the data. The data can be collected either by requesting a batch of data or by sending more frequent requests in which only a single measurement point is requested. When weighing these two designs, a few key features were taken into consideration.

A batch implementation would reduce the load on GEC's server in comparison to real time, as the server would be handling fewer requests. It was observed during the comparison that the server struggled to respond to a larger number of requests but not to send an equivalent amount of data in a single response. Batching requires the gateway to store extra data in memory or storage; however, regardless of implementation, the gateway itself needs to do the same computations when cleaning the data. Batch requests were in general much more likely to succeed than sending a request for each measurement point, which also required a delay between every request to avoid being rejected by the server. The server was not capable of receiving a high flow of requests while responding to them within reasonable time, so a batch implementation was superior in this case.

Another feature that had to be taken into consideration is the simulation of real time data in the batch implementation, as the ThingsBoard cluster only accepts single points of data. Simulating real time data in a batch environment requires extra processing and therefore lags slightly in time. This should be taken into account depending on the requirements on strict timeliness. In the case of GEC, there are no such requirements.
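The replay step can be sketched as follows; `send` stands in for the gateway's forwarding call, and the optional delay that mimics the source rate is illustrative:

```python
import time

def replay_batch(batch, send, delay=0.0):
    """Replay a batch response as single telemetry points, since the
    cluster only accepts one point at a time (sketch; `send` stands in
    for the gateway's forwarding call)."""
    for point in batch:
        send(point)
        if delay:
            time.sleep(delay)  # optional pacing to simulate real time

sent = []
replay_batch([{"ts": 1, "v": 10}, {"ts": 2, "v": 11}], sent.append)
print(len(sent))  # 2 points forwarded individually
```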

The last feature taken into consideration was the processing and limited resources on the Jetson Nano device. In a batch implementation the device requires more RAM or storage, depending on how the batch of data is temporarily stored. This would not be a problem in a real time implementation; however, a higher load would instead be put on the device's network hardware. As both implementations have advantages and disadvantages, both were implemented, and which one is used is configurable.

3.3 Glava batch implementation

When developing the Glava batch implementation, we decided to integrate it with the connector rather than the uplink converter, which was the other alternative. The reasoning for integrating the solution in the connector was that the connector is responsible for sending the HTTP requests to the server. In GEC's case, the URL that is used has to be altered, as it requires timestamps for the interval of the data requested. Another reason to integrate it in the connector was that the server sometimes responds with the body "no valid license", and if several requests were made at the same time the HTTP response code would be 500, indicating an unknown error.
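The interval-stamped URL construction can be sketched like this; the base URL, query parameter names and timestamp format are placeholders rather than GEC's actual API:

```python
from datetime import datetime, timedelta

def build_url(base, start, end):
    """Insert the requested data interval into the endpoint URL.
    The parameter names and timestamp format are placeholders;
    GEC's real endpoint is not reproduced here."""
    fmt = "%Y-%m-%d %H:%M:%S"
    return f"{base}?from={start.strftime(fmt)}&to={end.strftime(fmt)}"

# request one hour of data ending at a fixed point in time
end = datetime(2021, 5, 31, 12, 0, 0)
start = end - timedelta(hours=1)
print(build_url("http://example.com/data", start, end))
```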

3.4 Connector used for Glava

The Glava implementation requires HTTP requests to the web server in order to get the data. This leaves two options for the choice of connector in ThingsBoard: the Request connector or the REST connector. The REST connector relies on connecting to the API's host URL and port before requesting data; however, this does not work for GEC, as their API consists of further subdirectories after the port. As the connector tries to connect to the API through an incomplete URL, the connector itself will fail. The Request connector configuration allows the port to be integrated with the host URL and mappings to be created for different subdirectory endpoints. Equivalent functionality and native support for GEC's API led to the Request connector being our choice.


3.5 Cleaning implementation

Figure 3.1 shows the original design of the ThingsBoard gateway.

Figure 3.1: Original design of ThingsBoard gateway

The design of the cleaning implementation would affect the flow of data in different ways depending on where the cleaning is integrated. In the case of an early implementation in the connectors or converters, duplicated code would be present in the project. By having the cleaning in a separate connector, the time series would already be organized; this solution could be considered if only a single connector was active. If multiple connectors were used, there would be either duplicated code or functionality inherited from a superclass. Either way, it would require repetition of code. As seen in figure 3.1, tb_gateway_service.py acts as a junction for the connectors and provides an alternative solution. By integrating the code at this point, cleaning can be applied to all active connectors; however, the data is unorganized and would require separation of the time series. This design utilizes the predefined flow of data, as the script passes the data on to the event storage, from where it is forwarded to the ThingsBoard cluster. Theoretically, the data cleaning could also be implemented directly in the event storage. This approach would require dealing with encoded and unorganized data without separated time series. Our choice was to implement the cleaning at the junction, as it provides coverage for all connectors and does not require encoding/decoding. The updated design of the ThingsBoard gateway can be seen in figure 3.2.

Figure 3.2: Updated design of ThingsBoard gateway

Regardless of whether a batch or real time implementation is used, the cleaning requires an unsupervised approach. In a batch implementation supervised methods are possible; however, these require dynamically adjusted training sets which must be manually controlled for labeling. If training is done on a data set where anomalies exist without a label, the algorithm assumes that the anomaly is part of the normal behavior. Depending on the current training data, this may lead to an algorithm that expects anomalies and would therefore not acknowledge them. With the cleaning arrays stored in memory, this problem would recur with every gateway restart.

Techniques that operate in unsupervised mode do not require training data and are thus more widely applicable. The techniques in this category make the implicit assumption that normal instances are far more frequent than anomalies in the data. If this assumption does not hold, such techniques suffer from a high false alarm rate. Unsupervised learning finds patterns in unlabeled data, hoping that the algorithm, through mimicry, will build a compact representation of its world[22].
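A toy example of such an unsupervised deviation test (not one of the algorithms implemented in the gateway): flag points far from the median, measured in units of the median absolute deviation. It needs no training data, but it only works while normal points dominate, exactly the assumption described above:

```python
import statistics

def flag_outliers(series, threshold=3.0):
    """Flag points whose deviation from the median exceeds `threshold`
    median absolute deviations (MAD). Assumes normal points dominate,
    as unsupervised detection requires; threshold is illustrative."""
    med = statistics.median(series)
    mad = statistics.median(abs(x - med) for x in series) or 1e-9
    return [abs(x - med) / mad > threshold for x in series]

data = [10.1, 9.9, 10.0, 10.2, 55.0, 9.8]
print(flag_outliers(data))  # only 55.0 is flagged
```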

3.6 Live configuration

For the gateway to be as usable as possible, even in production environments where a restart could interrupt other important processes, there has to be a way to change the cleaning algorithms or their parameters during runtime without restarting the gateway. To enable such live configuration there are a few options. We could configure RPC calls to be sent to the gateway, which then executes code to change the behavior. Another solution, simpler in practice, is to alter a configuration file that the gateway reads at set intervals to overwrite the configuration currently in use. The RPC calls would be a more reactive solution; however, it would require more research compared to loading a configuration file at a set interval, which we already knew how to do. We settled on implementing a configuration file that a separate thread loads at a set interval.
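The chosen approach can be sketched as a small class whose background timer re-reads a JSON file; the class name, file layout and key names are illustrative, not the gateway's actual code:

```python
import json
import threading

class LiveConfig:
    """Periodically reload cleaning parameters from a JSON file so they
    can change at runtime without a gateway restart (sketch; the real
    gateway's file layout and key names may differ)."""

    def __init__(self, path, interval=10.0):
        self.path = path            # configuration file inside the container
        self.interval = interval    # seconds between reloads
        self.params = {}
        self._lock = threading.Lock()

    def reload(self):
        """Read the file once and atomically swap in the new parameters."""
        with open(self.path) as f:
            new_params = json.load(f)
        with self._lock:
            self.params = new_params

    def start(self):
        """Reload now, then schedule the next reload on a background timer."""
        self.reload()
        timer = threading.Timer(self.interval, self.start)
        timer.daemon = True         # do not block gateway shutdown
        timer.start()

    def get(self, key, default=None):
        with self._lock:
            return self.params.get(key, default)
```

A cleaning routine would then call `get("algorithm")` on every batch, picking up edits to the file within one reload interval.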

To alter the configuration, one needs a way to access the configuration file or upload a new one to the container. Both options are available, and it is up to the user to decide which to use. If the user decides to upload a new file, they can download the original configuration file, alter it and then upload it to the gateway again. Otherwise, the user can access the container, navigate to the file and alter it directly.

3.7 Summary

In this chapter we have evaluated different design decisions regarding the gateway framework. Distinct designs regarding sections 3.2 and 3.3 do not always provide an advantage; it depends on the situation, which is why the behavior should be configurable. The cleaning section discusses the integration of the data cleaning and different approaches to cleaning. The chapter ends with a brief argument for live configuration.


Chapter 4

Implementation

4.1 Introduction

In this chapter we introduce all the code that has been added to extend the gateway to enable configurable data cleaning. In section 4.2 the necessary libraries and packages are presented. Section 4.3 describes the environment setup. Section 4.4 goes through the Request connector configuration and integration. Sections 4.5 and 4.6 conclude the chapter by explaining how the time series are separated and how the cleaning is performed.

4.2 Prerequisites

For the added code and scripts to work properly, the gateway needs the Python libraries scipy and tsmoothie installed. This is handled by the gateway on first boot through the install method in figure 4.1.

For the cleaning part to work properly, the line 'thingsboard_gateway.cleaning' was added to the script setup.py, which is located in the default home folder in the docker environment. Figure 4.2 shows a snippet of the end result of the setup.py script.


Figure 4.1: Definition of the function install

Figure 4.2: Snippet from setup.py
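A minimal sketch of what such a first-boot install step could look like, assuming the packages are installed with pip through subprocess (the helper below is illustrative, not the exact code in figure 4.1):

```python
import subprocess
import sys

# Packages required by the cleaning extension (from section 4.2).
REQUIRED = ["scipy", "tsmoothie"]

def build_install_command(package):
    # Use the interpreter running the gateway so the packages land
    # in the correct environment.
    return [sys.executable, "-m", "pip", "install", package]

def install_missing():
    # Only install a package if importing it fails.
    for package in REQUIRED:
        try:
            __import__(package)
        except ImportError:
            subprocess.check_call(build_install_command(package))
```

Running the install through the gateway's own interpreter avoids installing into a different Python environment than the one the container actually uses.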

4.3 Environment configuration

4.3.1 Docker container

When setting up the environment a docker container is set up running the ThingsBoard IoT gateway. First the docker image was pulled using the command in figure 4.3 followed by the initialization using the command in figure 4.4.

Figure 4.3: Docker command to pull image

Figure 4.4: Setup command for ThingsBoard IoT gateway

When the gateway was up and running it could be accessed either by entering the command in figure 4.5 or by copying the files from the docker container and editing them outside of the container.

To get the files outside of the container the command in figure 4.6 was used.


Figure 4.5: Opening bash in container

Figure 4.6: Copying files from container

After editing the files they can be uploaded to the gateway again using the same command as before but with the paths swapped as seen in figure 4.7.

Figure 4.7: Copying files to container

When the files have been replaced or updated in the gateway, the command in figure 4.5 is used to execute commands inside of the container. Inside the container, the setup.py script is used to first build a live environment that runs the code. This is done using the command in figure 4.8. After that, the command in figure 4.9 is used to install the live environment. Finally, a restart of the container is necessary for the new live environment, running the updated code, to be used.

Figure 4.8: Building live environment for latest code

Figure 4.9: Installing live environment for latest code

4.3.2 ThingsBoard configuration

The gateway is created as a gateway device on the ThingsBoard server web interface. The access token that is used to authenticate the gateway is available through the Device details tab. The connection to the ThingsBoard server is configured in the file 'tb_gateway.yaml', located in the configuration folder. In this main configuration file the hostname or IP address of the ThingsBoard server is specified, along with the port of the MQTT service on the server. The access token is pasted underneath the security label, as seen in figure 4.10. Memory storage is used for storing incoming data before it is sent to the server.

Figure 4.10: ThingsBoard cluster connection.py

The gateway provides support for a variety of connectors using different protocols. A connector is activated by removing the hashes that comment out its configuration. Each connector requires name, type and configuration file parameters. It is possible to have multiple connectors active at the same time as long as they have different names and configuration files.

4.4 Request connector implementation

4.4.1 Endpoint configuration

To receive a continuous stream of data from GEC's web server, the ending of the URL subdirectory needs to be configured with each request sent. While the host and the major part of the subdirectory for each mapping remain the same, the interval time needs adjusting. This is necessary for requests sent towards GEC, hence the code in figure 4.12. If another request connector is implemented it will simply avoid using the GEC-specific methods.

Figure 4.11: Active connectors

Figure 4.12: Check configuration file

In format_url, the request configuration file is read to determine whether a batch or real-time implementation is currently in use. The method also reads the manually configured scan period from the same configuration file. The scan period should be set according to the rate at which the endpoint publishes data. When gathering a batch of data, the time interval is simply the scan period times the batch size defined in the configuration file. For real time, the batch size is set to 1. The time itself is retrieved and adjusted in format_url and format_time_to_endpoint using the datetime module to match the required format.


Figure 4.13: Generating date for GEC’s URL

Figure 4.14: Formatting date and time for GEC’s URL
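As a rough illustration of the logic described above, the following sketch computes the request interval and formats the time window for the URL. The timestamp format and query-parameter names are assumptions, as the actual GEC endpoint format is not reproduced here:

```python
from datetime import datetime, timedelta

def compute_interval(scan_period_s, batch_size=1):
    # Real time uses a batch size of 1; a batch request covers
    # scan_period * batch_size seconds, as described in the text.
    return scan_period_s * batch_size

def format_time_to_endpoint(dt):
    # Hypothetical timestamp format for the GEC URL; the format the
    # real endpoint expects may differ.
    return dt.strftime("%Y-%m-%dT%H:%M:%S")

def format_url(base_url, scan_period_s, batch_size, now):
    # Append a time window covering the last interval to the URL.
    start = now - timedelta(seconds=compute_interval(scan_period_s, batch_size))
    return "{}?from={}&to={}".format(
        base_url, format_time_to_endpoint(start), format_time_to_endpoint(now))
```

With a 6-second scan period and a batch size of 10, the window covers the last 60 seconds; with batch size 1 it covers a single scan period, matching the real-time case.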

4.4.2 Error handling

Since the requests made to GEC sometimes result in "no valid license" or code 500 responses when the server has no data to send, a solution is required to deal with such responses without breaking the gateway. The implemented code seen in figure 4.15 therefore tries to read the JSON payload; if it fails, the failure is caught by the except statement. The loop tries to resend the request up to five times in the hope of a better response. If it still does not succeed, another try-except statement deals with the unsuccessful request by logging the occurrence.


Figure 4.15: Resending requests upon bad responses
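The retry logic can be sketched as follows; the response shape and the exact exception handling are assumptions modeled on the description of figure 4.15, not the thesis code itself:

```python
import json
import logging

log = logging.getLogger("request_connector")

def fetch_with_retries(send_request, max_attempts=5):
    # send_request is assumed to return a response object with a
    # .text attribute. Bad payloads ("no valid license", code 500
    # bodies) make json.loads raise and trigger a resend.
    for _ in range(max_attempts):
        try:
            return json.loads(send_request().text)
        except (ValueError, AttributeError):
            continue
    try:
        # Final attempt; on failure, log instead of breaking the
        # gateway, and return nothing.
        return json.loads(send_request().text)
    except (ValueError, AttributeError):
        log.warning("Request failed after %d attempts", max_attempts + 1)
        return None
```

Catching the error locally and logging it keeps a single bad endpoint from crashing the connector thread.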

4.4.3 Real time implementation

When the request.json file specifies that the connector is configured for GEC and real-time data is used, the data will be reconstructed by the method real_time_glava(), which changes the structure of the telemetry data. The timestamp is added to the telemetry received from GEC to ensure that the ThingsBoard cluster "knows" the correct timestamp of the data. The code for this method can be seen in figure 4.16.


Figure 4.16: Code for reconstructing telemetry from GEC
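A simplified sketch of the telemetry reconstruction; the dictionary layout is assumed from the gateway's converter output format rather than taken verbatim from figure 4.16:

```python
import time

def reconstruct_telemetry(device_name, values, ts_ms=None):
    # Rebuild a GEC payload into the gateway's telemetry format,
    # attaching the timestamp explicitly so the ThingsBoard cluster
    # stores the data under the correct time rather than the time
    # of arrival.
    if ts_ms is None:
        ts_ms = int(time.time() * 1000)
    return {
        "deviceName": device_name,
        "telemetry": [{"ts": ts_ms, "values": values}],
    }
```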

4.4.4 Batch implementation

As the ThingsBoard server does not accept batches of data, the data needs to be forwarded one point at a time. This is implemented by first specifying in the request.json file that the connector is configured for GEC and that it should use the batch implementation.

The flow of data is then directed to the method real_time_sim_of_batch(), where separate dictionaries are created for each group of data points. A group of data points is considered to be all data points from a specific device at a certain time. This can be seen in figure 4.17. The dictionaries that are created will be of the same format as if single data points had been requested, not altering the flow of data. The batch of data is simply split into individual dictionaries and forwarded to the method send_to_storage() in the script tb_gateway_service.py. By dealing with the batch implementation in this way it is possible to send all data to the cleaning segment instantly and free occupied memory.


Figure 4.17: Batch implementation
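The batch-splitting step can be illustrated as below; the input layout — a list of timestamped value groups per device — is an assumption:

```python
def split_batch(device_name, batch):
    # Split a batch response into per-timestamp dictionaries of the
    # same shape as a single real-time message, so the downstream
    # flow through send_to_storage() is unchanged.
    messages = []
    for ts_ms, values in batch:
        messages.append({
            "deviceName": device_name,
            "telemetry": [{"ts": ts_ms, "values": values}],
        })
    return messages
```

Because each resulting message is indistinguishable from a real-time one, the cleaning segment needs no special batch handling.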

4.4.5 Request converter

For the GEC implementation to work properly it is necessary to separate the different devices, as there could exist time series with the same name. To distinguish between two time series, two parameters were added to request.json: first_part_of_name and second_part_of_name. The parameters are only used when the configuration states that the connector is used for GEC, and have the purpose of constructing a more detailed name for the device. The parameter first_part_of_name is used as the device's main name and the second_part_of_name parameter is used to identify the sensor group.

This implementation provides a reliable way of keeping track of which data point belongs to which time series. To build the final device name, the if-statement in figure 4.18 is used in the script json_request_uplink_converter.py.

Figure 4.18: Detailed device name in uplink converter
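A sketch of the name-building if-statement; the join format between the two parameters is an assumption:

```python
def build_device_name(config, datapoint_key):
    # Combine the two request.json parameters into a unique device
    # name so that identically named time series from different
    # sensor groups do not collide. When the parameters are absent
    # (non-GEC connectors), fall back to the original key.
    if config.get("first_part_of_name") and config.get("second_part_of_name"):
        return "{} {}".format(config["first_part_of_name"],
                              config["second_part_of_name"])
    return datapoint_key
```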

4.5 Data cleaning

4.5.1 Live configuration initialization

Upon initialization of the cleaning class a separate thread responsible for live config- uration is created. The method loads the latest saved configuration file and waits 60 seconds before calling itself again, repeating the procedure.

Figure 4.19: Update cleaning configuration

4.5.2 Cleaning configuration

The configuration file cleaning.json holds the parameters of interest when applying cleaning methods. It is possible to configure the cleaning algorithm, window size and standard deviation for a specific sensor. As seen in figure 4.21, specific sensors can be tailored according to the time series data. Additional parameters can be added if required. This is done by first specifying the name and value in the configuration file and then adjusting the return statement of get_cleaning_method accordingly.

Figure 4.20: Cleaning configuration parameters

Figure 4.21: Retrieving cleaning parameters
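The parameter lookup can be sketched as below; the default values are illustrative assumptions, not the gateway's actual defaults:

```python
# Fallback parameters applied when a sensor has no specific entry in
# cleaning.json; the names follow the text, the values are assumed.
DEFAULTS = {"algorithm": None, "window_size": 50, "std": 2.0}

def get_cleaning_method(config, sensor_name):
    # Per-sensor settings override the defaults, so individual time
    # series can be tailored as in figure 4.21.
    params = dict(DEFAULTS)
    params.update(config.get(sensor_name, {}))
    return params["algorithm"], params["window_size"], params["std"]
```

Adding a new parameter then only requires a new key in the configuration file and an extra element in the returned tuple.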


4.5.3 Adding new devices

Data sent to the method send_to_storage in TBGatewayService goes through a check to see if the device already exists in an array of devices. This check is performed in the class DataCleaning with the method doesDeviceExist(). If the device has not already been added to the array, the method returns the value -1 to represent that the device could not be found; otherwise it returns the position in the array. This can be seen in figure 4.22.

Figure 4.22: Looping through the array to check if the device exists

When a new device has been detected, the data point(s) for the device are sent to the method createDevice() in the DataCleaning class. Here a new array is initialized and all the sensors in the device get added to the array. For this method to add all necessary data for every sensor, the method getTelemetryData() is called, seen in figure 4.23. Here the check_type() method is called to make sure that the data points received are of a valid data type and not NaN. If a data point is NaN, an obvious outlier is returned, which will later be detected by the data cleaning algorithm.

Figure 4.23: Creating device and adding the sensor data
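A sketch of the device lookup and the type check; the sentinel used as an "obvious outlier" is a hypothetical value, not the one used in the thesis code:

```python
import math

def does_device_exist(devices, device_name):
    # Linear scan over the device array; -1 signals that the device
    # is unknown and must be created first (figure 4.22).
    for index, device in enumerate(devices):
        if device["name"] == device_name:
            return index
    return -1

OBVIOUS_OUTLIER = 10 ** 9  # hypothetical sentinel for invalid values

def check_type(value):
    # Replace NaN or non-numeric values with an obvious outlier that
    # the cleaning algorithm will later flag and bound.
    if isinstance(value, (int, float)) and not math.isnan(value):
        return value
    return OBVIOUS_OUTLIER
```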

The structure of the device array is shown in figure 4.24. Each sensor in the device array holds a telemetry array with data points, a time series array with timestamps, and a series dictionary responsible for cleaning data and putting the correct value inside the telemetry array.

Figure 4.24: Adding device array

4.5.4 Adding telemetry

Now that devices exist in an organized array and new data keeps coming in, the check in figure 4.22 returns the device's position. The method to add the telemetry to that position, addTelemetry(), loops through the different sensors in that device and adds the data points to the correct time series. If the number of data points exceeds a certain threshold defined in the cleaning.json configuration file, the cleaning initiates. If there are enough data points and a cleaning method is defined in the configuration file, cleaning methods will be applied to the time series, as seen in figure 4.25.

To not run out of memory in the container, the time series is limited to the specified window length by removing the first elements with the help of removeFirstElement(). The method removeFirstElement() loops over the sensors to find a specific sensor and removes the first data points of that sensor, which can be seen in figure 4.26.


Figure 4.25: Adding telemetry and applying cleaning

Figure 4.26: Removing the first data point of a specific sensor
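The bounded-window bookkeeping can be sketched as follows; the data structure is illustrative, not the exact arrays used in the gateway:

```python
def add_telemetry(sensor, value, ts, window_length):
    # Append a new data point, then drop the oldest points so the
    # arrays never exceed the configured window length, keeping the
    # container's memory use bounded (removeFirstElement in
    # figure 4.26).
    sensor["telemetry"].append(value)
    sensor["timestamps"].append(ts)
    while len(sensor["telemetry"]) > window_length:
        sensor["telemetry"].pop(0)
        sensor["timestamps"].pop(0)
```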


4.5.5 Cleaning algorithms

The series dictionary holds four arrays: original, smooth, up and low. The smoothing is performed on the original series once its length reaches the window length. Upper and lower bounds are obtained from the smoothed time series. If a data point is tagged as an anomaly, the returned value will correspond to the value of the bound it exceeds; if not, the observed value is returned. At the end of the method the window is shifted so that its length always remains constant. The exponential smoother uses an exponential weight decay, making more recent values more important than past ones. A convolutional smoother is also implemented; it acts in a similar fashion but enables different window functions than the exponential smoother.

Figure 4.27: Exponential Smoother
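The thesis relies on tsmoothie's smoothers; as a pure-Python illustration of the same idea — exponentially weighted smoothing, bounds a configurable number of standard deviations away, and clamping anomalies to the exceeded bound — consider the following sketch (the smoothing constant and bound computation are assumptions):

```python
import math

def clean_point(window, value, alpha=0.3, n_std=2.0):
    # Exponentially weighted smoothing over the current window:
    # recent values carry more weight than past ones.
    smooth = window[0]
    for v in window[1:]:
        smooth = alpha * v + (1 - alpha) * smooth
    # Bounds derived from the spread of the window around the
    # smoothed level.
    var = sum((v - smooth) ** 2 for v in window) / len(window)
    std = math.sqrt(var)
    low, up = smooth - n_std * std, smooth + n_std * std
    # Anomalies are replaced by the bound they exceed; normal
    # observations pass through unchanged.
    return min(max(value, low), up)
```

A spike far outside the recent behavior is thus clipped to the nearest bound instead of being forwarded, while values inside the bounds are untouched.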


4.6 ThingsBoard gateway service

Besides performing and verifying installs and instantiating the cleaning class, the gateway service is also responsible for delivering messages from the connectors to the cleaning segment and afterwards re-formatting the message into a JSON formatted string. The latter is performed in the send_to_storage method, which is called from each connector thread. Figure 4.28 shows the code that calls the cleaning methods.

Figure 4.28: Gateway service calling cleaning methods
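The hand-off can be sketched as below; the storage and cleaner interfaces are simplified assumptions, standing in for the gateway's actual storage backend and DataCleaning class:

```python
import json

def send_to_storage(storage, cleaner, message):
    # Mirror of the flow described above: the connector thread hands
    # its message to the cleaning segment, and the cleaned result is
    # re-formatted into a JSON string before being queued for the
    # ThingsBoard server.
    cleaned = cleaner(message)
    storage.append(json.dumps(cleaned, separators=(",", ":")))
```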

4.7 Summary

In this chapter the implementation of the gateway extension has been presented according to the design decisions discussed in the prior chapter. Mainly, the Glava request connector implementation and the data cleaning have been explained in such a way that a reader may build further upon them.


Chapter 5

Results

5.1 Introduction

In this chapter the benchmarking procedure is presented along with the results. Section 5.2 describes the environment in which the tests were run. Section 5.3 describes benchmarking parameters and conditions, bringing additional insight. The next section, 5.4, presents the results of the benchmark. Concluding the chapter, section 5.5 adds a discussion of the results and the completion of the thesis objective.

5.2 Benchmarking environment

5.2.1 Environment

The programs for performing, monitoring and processing data are spread out over three separate devices on a local network. The gateway is installed on a Jetson Nano device together with the docker monitoring tools cAdvisor, Prometheus, and Grafana. These tools are used to record the CPU and memory usage of running Docker containers.

When performing the benchmarking of the gateway, two virtual machines (VMs) were utilized to simulate a production environment. The first VM was set up with an MQTT broker and an MQTT stress tester on two containers. For each benchmark, a new MQTT stress testing container was built with a new configuration. The configuration defines the number of messages, devices, and seconds between each message sent.

When the messages from the stress tester arrive at the broker, they are forwarded to the gateway. The second VM hosts a ThingsBoard server to which the gateway forwards its data.

5.2.2 Conditions

In the case where cleaning is specified, the cleaning is performed on all time series. For the exponential smoother, the smoothing is always performed to create its boundaries and the comparison against the observed value. For the clustering method, the clustering is performed on all values and a conclusion is drawn from that. The computational resources required are therefore the same whether an outlier is detected or not. This also means that the computation will be the same regardless of the incoming time series value. That being so, the value passed from the MQTT stresser is the same each time. The data is still separated into different time series; they simply hold the original value.

5.3 Comparisons

To determine the performance of the gateway, several parameters are tested against each other. In the gateway the cleaning method and window length are varied, while in the MQTT stresser the time between each message is changed along with the number of devices and the number of messages generated by each device. Each device corresponds to a time series. Changing the devices and messages adjusts the amount of data being produced. The cleaning methods are varied to reveal how the computation matters. Said computation is performed on a set array called the window length; a bigger window length implies more computation. Through a Python script, the number of messages successfully passing through the gateway to the ThingsBoard server is recorded.

5.4 Benchmarking results

The results of this thesis can be seen in figures 5.1, 5.2, 5.3 and 5.4. The figures are constructed as follows: the number of devices connected represents real sensors connected as separate sources in ThingsBoard, and the number of messages per device represents the number of data points sent in a time series. The total messages column sums up the total number of messages sent to the gateway during the benchmark. "Time between messages" is the set amount of time between each message sent (in each device). For each benchmark either exponential smoothing (exp), KMeans clustering (clust), or no cleaning is used, as seen in the column "Cleaning Method". Furthermore, the number of historical data points kept in memory is defined in the column "Window size". The number of messages received by the gateway is then displayed, followed by the time used by the gateway for our cleaning framework implementation. Lastly, the container's used resources can be seen in the last two columns, showing memory and CPU usage.

In figure 5.1, the bare minimum and starting point of the cleaning framework implementation is tested on a container hosted on a system running Windows 10 with an Intel i7-7700K processor and 16 GB of RAM, resulting in several lost messages.

Figure 5.1: Default gateway


Figure 5.2: Cleaning framework with cleaning set to off

Figure 5.3: Cleaning framework with cleaning set to exponential smoothing

Figure 5.4: Cleaning framework with cleaning set to KMeans


5.5 Discussion

5.5.1 Gateway performance

The throughput of all the benchmarks was considerably worse than expected; even the result seen in figure 5.1 was poor, as not even the default gateway provides 100% throughput. With that in mind, the results of the cleaning framework seem completely fine, as figures 5.3 and 5.4 show. The time for the complete framework with cleaning methods activated is relatively low and does not appear to be the bottleneck in the gateway's performance. As the gateway was initially tested on a much more powerful machine with similarly unacceptable results, hardware constraints can also be ruled out as a bottleneck. This implies that the constraints lie in the original framework's architecture.

As for the algorithms, the exponential smoother is, as expected, faster in every case, being the more lightweight algorithm. The memory and CPU usage for each method are more or less indistinguishable. Because of the high drop rate, the window sizes are irrelevant for this benchmark: with the window size of 50, no time series reaches that limit, which explains the fast cleaning for the greater window size. The same applies to the rows containing a device count of 100. This is further supported by figure 5.2, where the times are similar.

5.5.2 Objective completion

The objective of the thesis was to develop a cleaning framework upon the ThingsBoard IoT gateway, which should be configurable without disrupting the flow of data. By that description the objective is met; however, for the gateway to be considered production ready the performance would have to be better and the cleaning would have to be fitted to each time series. The framework currently has no problems when connected to GEC, as all the data is cleaned and forwarded within a reasonable time frame. When connected to an MQTT stress tester the throughput of the data is below 50%, which is far from acceptable. The main result of this thesis can however be considered somewhat successful, as the framework has been developed to primarily suit GEC's needs and not BFK's.

5.6 Summary

In this chapter we presented a stress testing environment in which the gateway performance was measured on several points. The cleaning adds an acceptable computational overhead to the existing framework; however, the framework performance proved to be worse than anticipated. For relatively slower streams of data the performance issues can be overlooked.


Chapter 6

Conclusion

An IoT gateway has successfully been extended with a data cleaning framework and live configuration of cleaning parameters, which was the objective set in chapter 1. Some less complicated cleaning methods have been added to test and verify functionality. While cleaning accuracy has not been the focus of the thesis work, the algorithms are adequate for several time series if fitted correctly.

The resulting framework provides a good quality of service for limited data streams, such as GEC's data, which is generated every 6 seconds. For higher-frequency data generation the framework will drop data, leading to missing values in the target host. To successfully utilize the gateway it is important to know the limitations of both the data and the gateway itself.

The gateway is ready for implementation of more sophisticated cleaning methods, making it an alternative for GEC to build upon as their data pre-processing framework.

6.1 Future work

In addition to the work presented in this thesis, the cleaning methods used could be completely replaced by better suited algorithms, such as specifically trained ML models.


As for the framework itself, the historical data storage of each time series can be further developed, as the current solution is rough and built around the structure of GEC's data. Further developing the framework, detection of missing values should be considered, as only data points with incorrect or "NaN" values are currently detected, and not missing values in the sense of data intervals not arriving.

6.2 Personal project evaluation

The project has introduced us to a lot of interesting new subjects and environments. While it has been educative, we feel that we would have needed more time to test, optimize and further increase our understanding of the environment. A lot of time has been spent on setting up environments for several sets of hardware requiring special compilations and configuration. Another thing that has taken a lot of debugging time is checking why data from GEC does not arrive at the gateway, which has been out of our control. Lastly, the benchmarking required a lot of extra work simply to set up an environment that would work. Even though it worked, we had not really tested the MQTT connector and had no experience in setting up the MQTT stress testing environment. While it has been tedious work to set it all up, we feel that we have learnt a lot of relevant technologies. Regarding the development of the cleaning framework, it has been interesting working in a modern multi-threaded code base.


Bibliography

[1] What is thingsboard iot gateway?, thingsboard, inc., https://thingsboard.io/docs/iot-gateway/what-is-iot-gateway/, 2021-02-12.

[2] Dave Evans. The internet of things: How the next evolution of the internet is changing everything. 2011.

[3] Internet of things, https://en.wikipedia.org/wiki/internet_of_things, 2021-02-05.

[4] Big data, https://en.wikipedia.org/wiki/big_data, 2021-02-05.

[5] M. Marjani, F. Nasaruddin, A. Gani, A. Karim, I. A. T. Hashem, A. Siddiqa, and I. Yaqoob. Big iot data analytics: Architecture, opportunities, and open research challenges. IEEE Access, 5:5247–5261, 2017.

[6] James Taylor. Real-time responses with big data. Decision Management Solutions, vol. 53, pages 1–22, 2014.

[7] M. Al-Mekhlal and A. Ali Khwaja. A synthesis of big data definition and characteristics. In 2019 IEEE International Conference on Computational Science and Engineering (CSE) and IEEE International Conference on Embedded and Ubiquitous Computing (EUC), pages 314–322, 2019.

[8] Aytac Ozkan. Big data and advanced analytics: Improving data quality for big data using advanced analytics. 2019.

[9] Z. Guan, T. Ji, X. Qian, Y. Ma, and X. Hong. A survey on big data pre-processing. In 2017 5th Intl Conf on Applied Computing and Information Technology / 4th Intl Conf on Computational Science/Intelligence and Applied Informatics / 2nd Intl Conf on Big Data, Cloud Computing, Data Science (ACIT-CSII-BCD), pages 241–247, 2017.

[10] V. Desai and D. H A. A hybrid approach to data pre-processing methods. In 2020 IEEE International Conference for Innovation in Technology (INOCON), pages 1–4, 2020.

[11] M. Chen, Z. Huang, Q. Wu, W. Xu, and B. Xiong. Pre-processing and audit of power consumption data based on composite mathematical statistics model. In 2018 2nd IEEE Conference on Energy Internet and Energy System Integration (EI2), pages 1–4, 2018.

[12] F. Sidi, P. H. Shariat Panahy, L. S. Affendey, M. A. Jabar, H. Ibrahim, and A. Mustapha. Data quality: A survey of data quality dimensions. In 2012 International Conference on Information Retrieval Knowledge Management, pages 300–304, 2012.

[13] Caihua Liu, Patrick Nitschke, Susan P. Williams, and Didar Zowghi. Data quality and the internet of things. Computing, 102(2):573 – 599, 2020.

[14] David Plotkin. Data stewardship. In David Plotkin, editor, Data Stewardship, pages 127–162. Morgan Kaufmann, Boston, 2014.

[15] David Loshin. Master data management. In David Loshin, editor, Master Data Management, The MK/OMG Press, pages 87–103. Morgan Kaufmann, Boston, 2009.

[16] Z. Hasani. Robust anomaly detection algorithms for real-time big data: Comparison of algorithms. In 2017 6th Mediterranean Conference on Embedded Computing (MECO), pages 1–6, 2017.

[17] Amar Kumar, Alka Srivastava, Nita Bansal, and Alok Goel. Real time data anomaly detection in operating engines by statistical smoothing technique. In 2012 25th IEEE Canadian Conference on Electrical and Computer Engineering (CCECE), pages 1–5, 2012.

[18] What is thingsboard?, thingsboard, inc., https://thingsboard.io/docs/getting-started-guides/what-is-thingsboard/, 2021-02-12.

[19] Thingsboard monolithic architecture, thingsboard, inc., https://thingsboard.io/docs/reference/monolithic/, 2021-02-26.

[20] What is a container?, docker inc, https://www.docker.com/resources/what-container, 2021-02-12.

[21] Nvidia announces jetson nano: $99 tiny, yet mighty nvidia cuda-x ai computer that runs all ai models, nvidia, inc., https://nvidianews.nvidia.com/news/nvidia-announces-jetson-nano-99-tiny-yet-mighty-nvidia-cuda-x-ai-computer-that-runs-all-ai-models, 2021-02-12.

[22] Unsupervised learning, https://en.wikipedia.org/wiki/Unsupervised_learning, 2021-02-26.
