STOCKHOLM, SWEDEN 2021
Performance Analysis
of Distributed Spatial
Interpolation for Air
Quality Data
KTH Master Thesis Report
Albert Asratyan
Albert Asratyan <asratyan@kth.se> / <asratyan.albert@gmail.com> School of Electrical Engineering and Computer Science (EECS) KTH Royal Institute of Technology
Place
Stockholm, Sweden
Examiner
Vladimir Vlassov <vladv@kth.se> KTH Royal Institute of Technology
Supervisors
Sina Sheikholeslami <sinash@kth.se> KTH Royal Institute of Technology
Kartik Karuna <kartik.karuna@cabinair.com> CabinAir Sweden AB
Emil Helin <emil.helin@cabinair.com> CabinAir Sweden AB
Deteriorating air quality is a growing concern that has been linked to many health-related issues. Monitoring it is a good first step to understanding the problem. However, it is not always possible to collect air quality data from every location. Various data interpolation techniques are used to assist with populating sparse maps with more context, but many of these algorithms are computationally expensive. This work presents a three-step chain mail algorithm that uses kriging (without any modifications to the kriging algorithm itself) and achieves up to ×100 execution time improvement with minimal accuracy loss (relative RMSE of 3%) by parallelizing the load for the locally tested data sets. This approach can be described as a multiple-step parallel interpolation algorithm that includes region-specific border data manipulation for achieving greater accuracy. It does so by interpolating geographically defined data chunks in parallel and sharing the results with their neighboring nodes to provide context and compensate for the lack of knowledge of the surrounding areas. Combined with the cloud serverless function architecture, this approach opens doors to interpolating data sets of huge sizes in a matter of minutes while remaining cost efficient. The effectiveness of the three-step chain mail approach depends on the equal point distribution among all regions and the resolution of the parallel configuration, but in general, it offers a good balance between execution speed and accuracy.
Keywords
Distributed Computing, Parallel Execution, Data Interpolation, Kriging, Apache Ray, Geostatistics, Python, Cloud Services, AWS, Air Quality
Deteriorating air quality is a growing concern that has been linked to many health-related issues. Monitoring it is a good first step toward understanding the problem. However, it is not always possible to collect air quality data from all locations. Various interpolation methods are used to help fill sparse maps with more context, but many of these algorithms are computationally expensive. This work presents a three-step 'chain mail algorithm' that uses kriging (without any modifications to the kriging algorithm itself) and achieves up to ×100 improvement in execution time with minimal accuracy loss (relative RMSE of 3%) by parallelizing the execution for the locally tested data sets. This approach can be described as a multiple-step parallel interpolation algorithm that includes region-specific border data manipulation for achieving greater accuracy. This is done by interpolating geographically defined data chunks in parallel and sharing the results with their neighboring nodes to provide context and compensate for the lack of knowledge of the surrounding areas.
In combination with the serverless cloud function architecture, this approach opens doors to interpolating data sets of large sizes in a few minutes while remaining cost effective. The effectiveness of the three-step chain mail algorithm depends on equal point distribution among all regions and the resolution of the parallel configuration, but in general it offers a good balance between execution speed and accuracy.
Keywords
Distributed Computing, Parallel Execution, Data Interpolation, Kriging, Apache Ray, Geostatistics, Python, Cloud Services, AWS, Air Quality
I would like to thank my university supervisor, Sina Sheikholeslami, and my examiner, Vladimir Vlassov, for all of the feedback and guidance that they have provided. Their knowledge and insight have contributed greatly to the quality of this work.
I would also like to thank Kartik Karuna and Emil Helin, my colleagues and supervisors from CabinAir Sweden AB, for giving me a lot of room for creativity for this thesis. I would like to thank my family and friends for always supporting me and listening to all of my monologues about the thesis progress.
Special thanks go to Attila Fodor, my former colleague, who has taught me everything I know about air qualityrelated engineering. Without him, I would not have had enough domain knowledge to undertake this project in the first place.
1 Introduction . . . 1
1.1 Motivation . . . 1
1.2 Research Area and Context . . . 2
1.3 Problem Statement . . . 4
1.4 Goals and Requirements . . . 5
1.5 Methodology . . . 5
1.6 Benefits, Ethics and Sustainability . . . 5
1.6.1 Benefits . . . 6
1.6.2 Ethics . . . 6
1.6.3 Sustainability . . . 7
1.7 Delimitations . . . 7
1.8 Thesis Contributions . . . 8
1.9 Outline of the Thesis . . . 8
2 Background . . . 9
2.1 Air Quality . . . 9
2.2 Spatial Analysis and Interpolation . . . 12
2.2.1 Proximity Interpolation . . . 13
2.2.2 Inverse Distance Weighted Interpolation . . . 14
2.2.3 Kriging Interpolation . . . 16
2.2.4 Summary and Comparison . . . 20
2.3 Platforms and Frameworks . . . 20
2.3.1 MapReduce . . . 21
2.3.2 Apache Hadoop . . . 21
2.3.3 Apache Spark . . . 22
2.3.5 Hadoop vs Spark vs Ray . . . 24
2.4 Cloud Computing and Amazon Web Services . . . 25
2.5 Related Work . . . 27
3 Algorithms and Frameworks . . . 29
3.1 Data Structure Choice . . . 29
3.1.1 Air Quality Data Format . . . 29
3.1.2 Decimal Degree Precision . . . 30
3.1.3 Grid Representation . . . 31
3.2 Interpolation Algorithm Choice . . . 32
3.3 Big Data Framework Choice . . . 34
3.4 Framework Summary . . . 35
4 Implementation . . . 36
4.1 Single Process Interpolation . . . 36
4.1.1 System Overview . . . 36
4.1.2 Time Complexity Example . . . 37
4.2 Simple Parallel Approach . . . 37
4.2.1 Hardware Use Case . . . 37
4.2.2 System Diagram . . . 38
4.2.3 Time Complexity Example . . . 39
4.3 Parallel Approach Problem . . . 40
4.3.1 2x2 Grid Parallel Execution . . . 42
4.3.2 3x3 Grid Parallel Execution . . . 43
4.4 Chain Mail Approach . . . 44
4.5 System Diagrams . . . 50
4.5.1 Two-Step Interpolation . . . 50
4.5.2 Three-Step Interpolation . . . 53
4.6 Software Design . . . 56
4.6.1 Data Format . . . 56
4.6.2 Splitting Data . . . 56
4.6.3 Retrieving Edges . . . 57
4.6.4 Merging Edges . . . 58
4.6.5 Generating New Edges . . . 59
4.7 Cloud Integration Potential . . . 60
4.7.1 High-Level Cloud Design . . . 61
5 Results and Evaluation . . . 63
5.1 Test Bench . . . 63
5.2 Two-Step Kriging (Synthetic) . . . 64
5.3 Three-Step Kriging (Synthetic) . . . 64
5.3.1 Root Mean Square Error Evaluation . . . 66
5.3.2 Testing Accuracy . . . 67
5.4 Air Quality Data Interpolation . . . 70
5.5 Execution Time Evaluation . . . 71
5.5.1 Edge Step Performance Impact Evaluation . . . 71
5.5.2 Data Set Size Performance Evaluation . . . 73
5.6 Cloud (AWS) Performance and Cost . . . 76
5.6.1 Testing Cloud Scalability . . . 76
5.6.2 Master Node Performance . . . 78
5.6.3 Cost . . . 79
5.7 Performance Summary . . . 82
5.7.1 Parallel Approach Advantages . . . 82
5.7.2 Parallel Approach Downsides . . . 83
5.7.3 Note on Two-Step Chain Mail Approach . . . 84
6 Conclusions . . . 85
6.1 Discussion and Conclusion . . . 85
6.2 Limitations . . . 86
1.2.1 Simple example of spatial interpolation . . . 3
1.3.1 Parallel interpolation edge collision example . . . 4
2.2.1 Interpolated temperature map of Sweden [30] . . . 12
2.2.2 Proximity (Voronoi) interpolation example . . . 13
2.2.3 Inverse distance weighted interpolation example [36] . . . 15
2.2.4 Kriging interpolation example . . . 17
3.2.1 Spherical and exponential models [63] . . . 33
4.1.1 Single thread kriging execution system diagram . . . 37
4.2.1 A distributed system with a master node . . . 38
4.2.2 Simple distributed kriging execution system diagram . . . 39
4.3.1 A synthetically generated surface . . . 41
4.3.2 A surface generated by a single process interpolation . . . 42
4.3.3 A surface generated by the simple parallel approach (2×2 grid) . . . . . 43
4.3.4 A surface generated by the simple parallel approach (3×3 grid) . . . . . 43
4.4.1 Transforming chain mail into a grid . . . 45
4.4.2 Chain mail interpolation . . . 46
4.4.3 Chain mail interpolation explanation, initial state . . . 47
4.4.4 Chain mail interpolation, generating edges (in red) . . . 47
4.4.5 Chain mail interpolation, resolving overlaps and forming new regions . . . 48
4.4.6 Chain mail interpolation, second step results . . . 49
4.4.7 Chain mail interpolation, before the final interpolation . . . 50
4.5.1 Two-step chain mail interpolation system diagram . . . 51
4.5.2 Three-step chain mail interpolation system diagram . . . 54
4.6.1 Edge points that have to be duplicated and shifted . . . 59
4.7.1 Three-step chain mail interpolation high-level serverless cloud design . . . 62
5.2.1 Two-step chain mail interpolation result . . . 64
5.3.1 Two-step kriging interpolation result for 2×2 (a) and 3×3 (b) grids . . . 65
5.3.2 Comparison of single process (a), simple parallel (b), and three-step chain mail (c) approaches . . . 65
5.4.1 Comparison of single process kriging (a), simple parallel approach (b), and three-step chain mail approach (c) for air quality data in the Greater London area . . . 71
5.5.1 Comparison of execution times for different added edge data point numbers for the 2×2 three-step chain mail approach . . . 72
5.5.2 Comparison of execution times (in seconds) for different interpolation methods . . . 74
5.5.3 Comparison of execution times (in seconds) for 2×2 interpolation methods . . . 75
5.5.4 Comparison of execution times (in seconds) for 3×3 interpolation methods . . . 75
5.6.1 Serverless cloud interpolation system performance comparison . . . 77
5.6.2 Master node performance for different data set sizes . . . 79
2.1.1 Common Air Quality Index . . . 10
2.2.1 Feature comparison of different interpolation techniques . . . 20
2.3.1 Hadoop vs Spark vs Ray overview . . . 24
3.1.1 Example of collected air quality data . . . 30
3.1.2 Coordinate precision . . . 31
5.3.1 rRMSE for parallel approaches relative to single process interpolation (100 points, 20% of edges taken) . . . 67
5.3.2 rRMSE for parallel approaches relative to single process interpolation at the different edge steps . . . 68
5.3.3 rRMSE for parallel approaches relative to single process interpolation at different data set sizes . . . 69
5.3.4 rRMSE for parallel approaches relative to single process interpolation at different parallel grid configurations . . . 69
5.5.1 Execution time (seconds) for different data set sizes at different edge step values . . . 72
5.5.2 Execution time (seconds) of all interpolation approaches (100 to 500 points) . . . 73
5.5.3 Execution time (seconds) of all interpolation approaches (600 to 1000 points) . . . 74
5.6.1 Cloud scalability test . . . 76
5.6.2 Master node execution times (cloud) . . . 78
Introduction
This chapter provides the reader with an overview of the whole thesis. First, Section 1.1 introduces the area of application together with the company where the research has taken place. Section 1.2 presents the research area that the work builds upon. Section 1.3 explains the problem that this thesis is addressing. Section 1.4 presents the end goal of the project. Section 1.5 describes the methodology employed in the work. Section 1.6 touches on the ethical and sustainability-related impacts of the project. Section 1.7 explains the scope and the constraints of the conducted work. Section 1.8 highlights what the thesis adds to academia. Section 1.9 gives a short outline of the rest of the document.
1.1 Motivation
In recent years, the Internet of Things (IoT) has gained a lot of traction, and there are no signs of its growth slowing down. One of the main strengths of IoT is its ubiquitous data gathering possibilities. Standard civic applications of IoT data collection sensors include traffic monitoring and water management [1]. A particularly interesting health-related area to look at is air quality, especially considering the ever-growing global industrialization and production volumes affecting the environment around us. Even though some progress has recently been made in reducing exposure to unhealthy air in more developed areas, global pollution is still rising, with up to 99% of the population in some regions living in areas with dangerous pollution levels, primarily in the Southern, Eastern, and South-Eastern regions of Asia [2]. The first step in tackling this problem is understanding its scale and raising awareness about the health risks associated with unhealthy air.
Air quality can be characterized by many variables, but one of the most important (and the most used) is particulate matter (PM) [3]. PM, sometimes also called particle pollution, is a mixture of liquid droplets and solid particles found in the air [3]. Some of these particles are dark or large enough to be visible to the human eye (examples are dust, smoke, and soot). However, the smaller the particle, the higher the health risk it introduces. Some particles can get into the lungs and, subsequently, into the bloodstream, causing or increasing the probability of non-fatal heart attacks, aggravated asthma, decreased lung functionality, or irregular heartbeat [4]. From the environmental perspective, increased PM quantities have been linked to making streams and lakes more acidic, changing the nutrient balance in coastal waters and large river basins, depleting soil nutrients, damaging sensitive crops, and worsening acid rain effects [4].
CabinAir Sweden AB is a Stockholm-based company that works with air filtration systems in the automotive industry, collaborating with leading car manufacturers such as Volvo. CabinAir's goal is to improve the health, safety, and well-being of drivers and passengers by producing cutting-edge in-cabin air purification systems [5]. To get closer to achieving this goal, CabinAir would like to use its sensor-collected air quality data to give its users context about the surrounding air and raise awareness of the issue of unhealthy air. However, even with a fleet of cars constantly collecting spatial air quality data, it is nearly impossible to fill every spot on the map. Different spatial interpolation techniques can help with this by estimating the surrounding air quality based on the few points that have been collected. The following section presents this in greater detail.
1.2 Research Area and Context
The method of processing spatial information to derive new data and meaning from the original is called spatial analysis. Sometimes also called spatial statistics, spatial analysis includes many techniques that work with geometric or geographic properties of data. Its origins go back to the first attempts at surveying and cartography, dating to 1400 B.C. in Egypt [6]. Since then, the field has expanded considerably, and today spatial analysis can be split into the following main categories:
• Spatial autocorrelation – measuring and analyzing statistical degree of dependency among observations on a geographic or twodimensional geometric space.
• Spatial regression – capturing spatial dependencies in regression analysis, avoiding statistical problems such as unstable parameters.
• Spatial interaction – estimating the flow of variables, such as material, people, or other information, between different locations in geographic space.
• Spatial interpolation – estimating the variables at unobserved geographical locations based on the available data from the known locations.
This thesis focuses on the last category, spatial interpolation. Also called multivariate interpolation, spatial interpolation is based on interpolating functions of more than one variable. It is an important field in geostatistics, where it can provide context on the surrounding areas based only on a few points collected by sensor networks. Different spatial interpolation techniques will be discussed further in the thesis, and some of their performance in a distributed environment will be analyzed. A generic example of a simple spatial interpolation is shown in Figure 1.2.1. The left figure shows the initial (raw) data points mapped onto a surface, while the right one shows the surface produced as a result of interpolation. Brighter colors represent higher values, and darker colors represent lower values. In terms of air quality data, the darker the resulting surface, the better the air quality of the region.
Figure 1.2.1: Simple example of spatial interpolation
Spatial interpolation can be computationally expensive, with time complexities for some of the widely used algorithms (such as kriging interpolation) going as high as O(N⁴) [7]. Given a large data set, the execution time of a single process execution of an algorithm with O(N⁴) time complexity can get out of control fairly quickly. Addressing this problem by running the interpolation in parallel instances can lower the execution times substantially, since the initially large input will be distributed among many smaller nodes.
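The core idea of distributing the load can be sketched as a geographic split of the input: points are binned into a grid of regions, and each region is then interpolated by a separate worker. The helper below is a minimal illustration only; the function name and the flat `(lat, lon, value)` tuple format are assumptions, not the thesis code:

```python
from collections import defaultdict

def split_into_regions(points, rows, cols):
    """Partition (lat, lon, value) points into a rows x cols grid of regions,
    so that each region can be interpolated by a separate worker."""
    lats = [p[0] for p in points]
    lons = [p[1] for p in points]
    lat_min, lat_max = min(lats), max(lats)
    lon_min, lon_max = min(lons), max(lons)
    # Fall back to a step of 1.0 if all points share a coordinate
    lat_step = (lat_max - lat_min) / rows or 1.0
    lon_step = (lon_max - lon_min) / cols or 1.0
    regions = defaultdict(list)
    for lat, lon, value in points:
        # min() keeps points on the max boundary inside the last region
        r = min(int((lat - lat_min) / lat_step), rows - 1)
        c = min(int((lon - lon_min) / lon_step), cols - 1)
        regions[(r, c)].append((lat, lon, value))
    return regions

points = [(0.0, 0.0, 1.0), (0.0, 10.0, 2.0), (10.0, 0.0, 3.0), (10.0, 10.0, 4.0)]
regions = split_into_regions(points, rows=2, cols=2)  # each point in its own region
```

Each dictionary entry can then be handed to an independent process (or, later in the thesis, to a Ray task or a serverless function), which is what makes the large input tractable.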
1.3 Problem Statement
Collected air quality data can and will be sparse in some geographic regions. This problem can be addressed by interpolating the existing values to the neighboring unknown areas. However, spatial interpolation is a computationally expensive task, so its execution time is an important aspect to consider. For systems handling large data sets, parallelization should be considered as a means of improving performance. Otherwise, computations may take hours, because execution time grows steeply (polynomially, as fast as N⁴ for kriging) with a linear growth in the number of points.
Figure 1.3.1: Parallel interpolation edge collision example
The purpose of this research is to improve real-world spatial interpolation performance by introducing a distributed approach to interpolation. The algorithms in use will be the same as in conventional single process interpolation. However, parallel execution has its own issues. With the parallel approach, individual executions will not know the context of their neighbors, which will result in colliding values at regional edges. Consider Figure 1.3.1. The four regions have been interpolated in parallel, but since their algorithms do not know about the existence of their neighbors, there can be a great value difference at neighboring edge coordinates. It would be desirable to have a simple solution to this edge inconsistency problem that requires no modifications to the original interpolation algorithms and instead focuses on the distributed configuration.
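The edge collision can be reproduced with a toy example. Below, two regions are interpolated independently with a simple inverse distance weighted (IDW) estimator; each worker only sees its own points, so the two estimates at the shared border coordinate disagree sharply. This is an illustrative sketch, not the thesis implementation:

```python
def idw(known, x, y, power=2):
    """Inverse distance weighted estimate at (x, y) from known (x, y, v) points."""
    num = den = 0.0
    for kx, ky, v in known:
        d2 = (x - kx) ** 2 + (y - ky) ** 2
        if d2 == 0:
            return v  # query coincides with a known point
        w = 1.0 / d2 ** (power / 2)
        num += w * v
        den += w
    return num / den

# Two regions interpolated independently, unaware of each other's points.
left_points = [(0.0, 5.0, 10.0)]    # one low reading in the left region
right_points = [(10.0, 5.0, 90.0)]  # one high reading in the right region

# Both workers estimate the same border coordinate (5, 5):
left_border = idw(left_points, 5.0, 5.0)    # sees only the left data
right_border = idw(right_points, 5.0, 5.0)  # sees only the right data
# The two estimates at the identical coordinate differ by roughly 80 units,
# producing a visible seam between the regions.
```

The chain mail approach developed later in the thesis mitigates exactly this discontinuity by sharing border data between neighboring regions.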
1.4 Goals and Requirements
This thesis will try to solve the stated problem by addressing the following goals:
• Improve the execution speed of spatial interpolation algorithms, without losing accuracy, by parallelizing their execution;
• Design a parallel interpolation system in such a way that the edge collision issue described above (Section 1.3) is removed or reduced to an acceptable level.
1.5 Methodology
The start of the project consists of a literature review with related work and key concepts in the field of spatial analysis, spatial interpolation, distributed computing, and specific interpolation algorithms. First, qualitative research methodology is used for the literature study to get a good understanding of previous and ongoing research in the related fields. This is used as a framework for understanding how spatial interpolation can be efficiently parallelized from a theoretical point of view.
Then, both qualitative and quantitative methods will be used for an evaluation of the implementations for single and distributed interpolation systems. More qualitative methodology will be applied for analysis of the distributed approach and its accuracy.
1.6 Benefits, Ethics and Sustainability
This section covers the societal impact of the work, which would include benefits together with an overview of ethics and sustainability.
1.6.1 Benefits
When initially discussing project prospects, one of the use cases presented by CabinAir was the ability to show customers an estimate of the surrounding air quality in an accessible way, improving their general knowledge about air quality in their daily lives and their understanding of the consequences of being exposed to unhealthy air. A system of this type could provide health benefits to its users, who would be aware of the general quality of the air around them and able to adjust their actions accordingly. For example, some areas with bad air quality could be avoided, provided that users know about them in advance.
1.6.2 Ethics
There are a couple of ethical aspects, such as data anonymity and data estimation, that should be considered in this work.
Data Anonymity
Since the General Data Protection Regulation (GDPR) and similar rules were introduced in Europe and elsewhere in the world, data privacy and sensitivity have become an important consideration for any business [8]. GDPR requires certain compliance in data processing, and it must be mentioned that all of the data provided by CabinAir has been anonymized and stripped of all user-identifiable information, making it impossible to track or analyze user behavior with any malicious intent. The main use of the data in this project is to provide global context for surrounding air quality based on sensor readings [9].
Data Estimation
It is important to highlight that the developed system, like all spatial interpolation techniques, uses statistical (or deterministic) methods that only estimate what the real-world measurements would look like. Thus, a source of data like this might sometimes be inaccurate or incorrect, and its end users should clearly understand that it provides only an estimation, with all the associated shortcomings and risks of not having actual data for some parts of the covered area. Notably, a system of this type should not be used for critical tasks where extreme accuracy is a requirement.
1.6.3 Sustainability
The United Nations Sustainable Development Goals have been used as a foundation for creating a sustainable future for everybody. Some of the challenges addressed by these goals include poverty, hunger, good health, responsible consumption and production, sustainable cities and communities, climate action, industry, innovation, infrastructure, and life on land [10]. In total, 17 main objectives have been defined by the United Nations organization, aimed to be completed by 2030. Of course, not all of these are addressed in this work, but here are some of the specific goals affected by the thesis:
• Sustainable goal #3 is about ensuring healthy lives and promoting well-being for all [11]. Health risks considered by the UN include maternal mortality and mortality from non-communicable diseases, which have been linked to unhealthy air [4]. Subgoal 3.9 specifically mentions reducing the number of deaths and illnesses from hazardous air. The result of this work will indirectly contribute to advancing this goal by raising awareness about surrounding air quality.
• Sustainable goal #11 focuses on making cities and human settlements sustainable [12]. The source also claims that air pollution caused 4.2 million premature deaths in 2016. Subgoal 11.6 states that air quality and other waste management should be heavily controlled to reduce the environmental impact of cities. This goal will indirectly benefit from the practical results of this work, as the designed system's use case is aimed at big cities, allowing good monitoring of air quality in urban areas.
• Lastly, sustainable goal #13 aims to combat climate change [13]. Climate change has been identified as a cause of gradually worsening air quality [14]. Efforts to solve climate change problems are still at a relatively early stage. Therefore, it is important to keep raising awareness about this issue by showing the deteriorating air quality to the general public in order to change opinions.
1.7 Delimitations
Due to the large scope of a complete global interpolation system, research and experiments will be performed over a limited geographic area. Moreover, the raw data used in this work is the property of CabinAir, so only the visualization and performance results will be shown. The performance of the system may heavily depend on the available data and its characteristics (such as point distribution in space), so this should be considered when trying to replicate the results.
The focus of this work is on the distributed aspect of spatial interpolation, so no changes are made to the actual interpolation algorithms. Moreover, while the provided data is spatio-temporal, the research focuses only on its spatial aspect, and the resulting system does not depend on the time of data recording at all. If two data points have the same geographical coordinates but different timestamps, only the more recent one is considered. Because of this, the resulting system performs snapshot data analysis rather than data forecasting.
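The snapshot rule described above (same coordinates, keep only the latest reading) amounts to a small deduplication pass over the raw records. The sketch below assumes a hypothetical `(lat, lon, timestamp, value)` tuple layout, not the actual CabinAir data format:

```python
def snapshot(records):
    """Collapse spatio-temporal records (lat, lon, timestamp, value) into a
    spatial snapshot: for duplicated coordinates, keep only the latest reading."""
    latest = {}
    for lat, lon, ts, value in records:
        key = (lat, lon)
        if key not in latest or ts > latest[key][0]:
            latest[key] = (ts, value)
    return [(lat, lon, v) for (lat, lon), (ts, v) in latest.items()]

records = [
    (59.33, 18.06, 1609459200, 12.0),  # older reading
    (59.33, 18.06, 1609545600, 15.0),  # newer reading at the same coordinates
    (59.34, 18.07, 1609459200, 20.0),
]
deduplicated = snapshot(records)  # the older duplicate is dropped
```

After this pass, the data is purely spatial and can be fed directly to the interpolation stage.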
1.8 Thesis Contributions
The contributions of the thesis include the design and development of a chain mail parallel interpolation algorithm that greatly improves the execution speed of standard data interpolation algorithms (specifically, kriging) without any modifications to the data interpolation itself, while also allowing the use of multiple threads/cores for more efficient resource utilization. The results satisfy the goals defined in Section 1.4, and this work should provide a perspective on the direction that the development of distributed data interpolation algorithms could take.
1.9 Outline of the Thesis
The report has the following structure:
• Chapter 2 presents key concepts and related work necessary for understanding the scope of the problem and the proposed solution;
• Chapter 3 introduces the framework of the project, including the choice of big data software, data structures, and interpolation algorithms;
• Chapter 4 is about design and implementation of the system;
• Chapter 5 presents the results of the performed experiments and their evaluation;
• And Chapter 6 concludes the work with a discussion of the results and potential future work.
Background
This chapter will introduce all of the relevant concepts and background required for understanding the work in more detail. Air quality, spatial interpolation algorithms, and big data frameworks will be covered.
Section 2.1 presents a short discussion about the characteristics of air quality and how it can be represented. Section 2.2 introduces the concept of spatial interpolation and talks about different interpolation algorithms, specifically Voronoi interpolation, Inverse Distance Weighted interpolation, and kriging. Section 2.3 will discuss different big data frameworks available for building a distributed interpolation system. Section 2.4 covers the concept of cloud computing and introduces some services that are going to be used in this work. Section 2.5 will provide a short overview of the research related to this study.
2.1 Air Quality
There are many different ways of measuring air quality, but one of the most common methods is to use an Air Quality Index (AQI) [15]. AQI measurement can differ from country to country, but the main idea is the same: the lower the concentration of certain unhealthy gases and particles in the air, the better. Some of the most common pollutants that an AQI considers are:
• NO2 – nitrogen dioxide. With regard to health risks, NO2 exposure may cause a change in lung function [16]. Long-term and chronic exposure to NO2 may cause respiratory symptoms in people with asthma [17]. Environmental risks of high NO2 concentrations include contributions to acid rain, which is harmful to vulnerable ecosystems such as forests or lakes. Moreover, a correlation between rising levels of NO2 and reduced crop yields has been found [18].
• PM2.5 and PM10 – particulate matter of different sizes: particles with diameters of less than 2.5 μm and less than 10 μm, respectively. Particulate matter is the sum of all hazardous solid and liquid particles in the air. Both organic and inorganic particles, such as dust, smoke, and pollen, are included in this metric [19]. Similar to NO2, PM can affect human health and, in the worst cases, even reduce the life expectancy of people with existing lung and heart conditions by a few months [20]. The World Health Organization (WHO) also reports that PM has shown additive effects in combination with other pollutants [20].
• O3 – ozone, another pollutant, with a distinct pungent smell. Ozone has been found to be harmful at concentrations so far observed only in urban areas, where it is produced the most [21]. Numerous studies have shown that ozone affects respiration, the central nervous system, and the heart. In some rare cases, it may even cause early death and problems with reproductive health [22].
There are two established AQIs used in Europe: the Common Air Quality Index (CAQI) and the European Air Quality Index (EAQI) [23] [24]. The index compares the current hourly concentrations of PM, NO2, and O3 in μg/m³ against proposed thresholds and calculates a score. An example of CAQI is shown in the table below:
Table 2.1.1: Common Air Quality Index
Qualitative Name   Index    NO2 (µg/m³)   PM10 (µg/m³)   O3 (µg/m³)
Very low           0–25     0–50          0–25           0–60
Low                25–50    50–100        25–50          60–120
Medium             50–75    100–200       50–90          120–180
High               75–100   200–400       90–180         180–240
Very high          >100     >400          >180           >240
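As an illustration, a CAQI score can be derived from the band boundaries in Table 2.1.1 by linearly interpolating within each band and taking the worst per-pollutant sub-index. This is a sketch under that assumption; the official CAQI definition may differ in its details:

```python
# Breakpoint concentrations (µg/m³) at CAQI grid index values 0, 25, 50, 75, 100,
# following the bands of Table 2.1.1.
CAQI_BREAKPOINTS = {
    "NO2": [0, 50, 100, 200, 400],
    "PM10": [0, 25, 50, 90, 180],
    "O3": [0, 60, 120, 180, 240],
}
INDEX_GRID = [0, 25, 50, 75, 100]

def caqi_subindex(pollutant, concentration):
    """Linearly interpolate a CAQI sub-index from a pollutant concentration."""
    bps = CAQI_BREAKPOINTS[pollutant]
    if concentration >= bps[-1]:
        return 100  # the "very high" band is open-ended
    for i in range(len(bps) - 1):
        lo, hi = bps[i], bps[i + 1]
        if concentration <= hi:
            frac = (concentration - lo) / (hi - lo)
            return INDEX_GRID[i] + frac * (INDEX_GRID[i + 1] - INDEX_GRID[i])

def caqi(readings):
    """Overall CAQI is driven by the worst per-pollutant sub-index."""
    return max(caqi_subindex(p, c) for p, c in readings.items())

score = caqi({"NO2": 75, "PM10": 25, "O3": 60})  # NO2 dominates this reading
```

A score below 25 would fall in the "very low" band, below 50 in "low", and so on, matching the qualitative names in the table.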
However, there are other useful metrics that describe air quality. For instance, CabinAir also measures volatile organic compounds (VOCs). VOCs are responsible for scents, odors, and pollutants. Natural VOCs are not dangerous by themselves. However, man-made VOCs are considered pollutants, and they have caused allergic and respiratory problems in children [25]. Some of these compounds can even react with ozone (O3), which can cause serious sensory irritation [26]. Finally, VOCs may contribute to smog development as well [27].
All of these variables may be used together for providing a good overview of the current air quality situation. CabinAir’s air quality data also contains a timestamp, latitude, and longitude. The following subsection will consider how to represent such data.
Spatial Air Quality Data Representation
Spatial data, also called geospatial data, is a type of data that can be represented in a geographic coordinate system [28]. Air quality data falls under this definition because air contents and air quality are heavily dependent on their location and surroundings. There are two main ways of representing such spatial data: vector data structures and raster data structures.
Vector Data
Vector data structures are good for representing data mapped as polygons or lines. A good use case for vector representation would be raw data that can be represented as a graph with specific node dependencies [29]. On the flip side, vector data structures are complex and less intuitive for users of cartographic data. Moreover, the biggest disadvantage of vectors is that most cartographic operations on them are computationally expensive. Another big problem with vector structures is that they are not suited for area filling, which is one of the most important features for a data interpolation system [29].
Raster Data
The biggest advantage of raster data structures is their simplicity, both in terms of computation and representation. A grid is a prime example of a raster data structure and has been around since the birth of cartography. Analyzing grid data structures is easier than analyzing vector structures: variogram calculation, autocorrelation statistics, interpolation, and filtering all benefit from the structural simplicity of a grid [29]. The biggest disadvantage associated with raster data is the required volume of data. The volume can be reduced, but at the expense of using lower resolution grids. This results in accuracy loss, and potentially in the loss of whole cartographic entities if the resolution decrease is stark enough. Therefore, for most use cases there is a trade-off between the accuracy and the data size of a grid.
2.2
Spatial Analysis and Interpolation
Spatial interpolation is the process of using known points to estimate values at unknown locations. A common example is a temperature map. There are only so many weather stations available around a region, so it is almost impossible to completely fill the map with real data. However, spatial interpolation can estimate the temperatures using only the known values from a relatively small sample of geographical coordinates (often supported by other available data for greater accuracy, such as wind, topography, or historical data). Such an interpolated surface is called a statistical surface. An example of a temperature map with interpolated values is shown in Figure 2.2.1 below.
Extensive data collection can be quite an expensive task, if even possible at all. Because of this, it makes sense to collect data only from the most significant regions. Then it is up to the interpolation algorithms to fill the gaps. There are multiple existing spatial interpolation algorithms and some of the most popular are proximity interpolation, Inverse Distance Weighted (IDW) interpolation, and Kriging interpolation [31].
2.2.1
Proximity Interpolation
Proximity interpolation, also known as Thiessen polygons or Voronoi tessellation,
was created by Georgy Voronoy, a Ukrainian mathematician [32]. The core idea of this approach is to partition a plane into regions based on their proximity to each point from a given data set. Figure 2.2.2 below illustrates a simple case of Voronoi interpolation.
Figure 2.2.2: Proximity (Voronoi) interpolation example
The figure represents a finite set of initial points (black dots) {p1, ..., pn} in the Euclidean space. Every other point in the plane is colored according to the Voronoi cell that it belongs to. In turn, each Voronoi cell Ri consists of every point in the Euclidean space whose distance to pi is less than or equal to the distance to any other initial point.
Formally, the Voronoi algorithm can be defined as follows. Let X be a metric space with a distance function d. Let K be a set of indices and (Pk)k∈K be a collection of nonempty subsets in the space X. The Voronoi cell Rk associated with the site Pk is the set of all points in X whose distance to Pk is smaller than or equal to the distance to the other sites Pj, where j is any index not equal to k. Then, any Voronoi cell would be [33]:

Rk = {x ∈ X | d(x, Pk) ≤ d(x, Pj) for all j ≠ k} (2.1)
This is the simplest and one of the oldest interpolation methods. The biggest downside of this approach is that regional surfaces (or grid values) change abruptly across the boundaries. This is not an accurate representation of most real-world phenomena [34]. On the upside, its simplicity means easy computations (its time complexity can be either O(N log N) or O(N²)), and it was used widely even before computer hardware allowed for more complex calculations [35].
Pseudocode
Listing 1 presents a pseudocode for the O(N²) implementation of Voronoi interpolation. The surface variable is the resulting interpolated surface, and data_set is the array of initial points provided by the user.
# surface is a 2D array filled with empty points
for interpolated_point in surface:
    interpolated_point.min_distance = 999999
    interpolated_point.value = -1
    for raw_point in data_set:
        current_distance = distance_between(interpolated_point, raw_point)
        if current_distance < interpolated_point.min_distance:
            interpolated_point.min_distance = current_distance
            interpolated_point.value = raw_point.value
return surface
Listing 1: Python-styled pseudocode for Voronoi interpolation
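As a runnable companion to the pseudocode, the following is a minimal Python sketch of the same O(N²) nearest-neighbor assignment. The sample points and grid cells are purely illustrative, not taken from the thesis data:

```python
import math

def voronoi_interpolate(data_set, grid):
    """Assign each grid cell the value of its nearest known point (O(N^2))."""
    surface = {}
    for cell in grid:
        # The Voronoi cell a location falls into is decided purely by distance.
        nearest = min(data_set, key=lambda p: math.dist(cell, p[:2]))
        surface[cell] = nearest[2]
    return surface

# Illustrative points: (latitude, longitude, measured value)
data_set = [(51.37, 0.07, 4.1), (51.55, 0.12, 4.0), (51.77, 0.45, 5.9)]
grid = [(51.40, 0.10), (51.70, 0.40)]
print(voronoi_interpolate(data_set, grid))  # {(51.4, 0.1): 4.1, (51.7, 0.4): 5.9}
```

Note how each estimated value is copied verbatim from the closest sample, which is exactly why the resulting surface changes abruptly across cell boundaries.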
2.2.2
Inverse Distance Weighted Interpolation
Inverse Distance Weighted (IDW) is a deterministic type of multivariate interpolation based on a known scattered set of points, where the values at unknown points are calculated with a weighted average of the values from the available points. The name of the method comes from the application of the weighted average: the algorithm uses the inverse of the distance to each known point for weight assignment. An example of an IDW interpolated area is shown in Figure 2.2.3.
Figure 2.2.3: Inverse distance weighted interpolation example [36]
Since the weights are proportional to the proximity of the known points to the unsampled location, they can be specified by the IDW power coefficient. The bigger the power coefficient, the higher the weight of nearby points will be. A value Z of the unsampled location j can be estimated by the following equation:
Zj = (∑i Zi / dij^n) / (∑i 1 / dij^n) (2.2)
where n is a weight parameter that is used as an exponent to the distance, which makes the difference in distances much starker for different points. This means that for large n, the importance of nearby points will be much greater, and the farther points will have a minimal impact on the value. On the other hand, a small n will make sure that even the farther points will be able to influence the estimated value. In other words, a big n will result in the output resembling the Voronoi interpolation to a certain degree, since the unknown points will be influenced only by the closest known point, and a small n will provide an image with smoothed out edges.
This method is more popular, as it is both relatively accurate and has decent computational performance. IDW interpolation time complexity is also O(N²). It is a bit slower than Voronoi interpolation but has better accuracy. The main advantage of IDW is that it is easy to understand, which means that it is easy to draw conclusions from the produced output. A feature to keep in mind concerning real-world data is that IDW is a purely deterministic algorithm. This means that if there are any statistical biases in the data, the accuracy of the final result will suffer because they are not accounted for. IDW's accuracy can also suffer if the data distribution is uneven [37]. Furthermore, both minimum and maximum values can only occur at the initially sampled locations.
Pseudocode
Listing 2 presents a pseudocode implementation of Inverse Distance Weighted interpolation. Where possible, the pseudocode uses the same set of notations as the previous example.
radius = 10  # defines how far to search, defined by user
p = 2        # power coefficient, defined by user
for interpolated_point in surface:
    in_radius = []
    for raw_point in data_set:
        current_distance = distance_between(interpolated_point, raw_point)
        if current_distance < radius:
            raw_point.weight = 1 / (current_distance ** p)
            in_radius.append(raw_point)
    interpolated_point.value = 0
    total_weight = 0
    for point in in_radius:
        interpolated_point.value = interpolated_point.value + point.value * point.weight
        total_weight = total_weight + point.weight
    interpolated_point.value = interpolated_point.value / total_weight
return surface
Listing 2: Python-styled pseudocode for IDW interpolation
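The same logic can be expressed as a small runnable Python function. This is a sketch with illustrative data; the default search radius and power coefficient are arbitrary choices:

```python
import math

def idw_interpolate(data_set, grid, p=2, radius=2.0):
    """IDW estimate for each grid cell from known points within `radius`."""
    surface = {}
    for cell in grid:
        numerator, denominator = 0.0, 0.0
        for lat, lon, value in data_set:
            d = math.dist(cell, (lat, lon))
            if d == 0:  # cell coincides with a sampled location
                numerator, denominator = value, 1.0
                break
            if d < radius:
                weight = 1.0 / d ** p
                numerator += value * weight
                denominator += weight
        surface[cell] = numerator / denominator if denominator else None
    return surface

# Two samples equidistant from the cell pull the estimate to their mean.
print(idw_interpolate([(0.0, 0.0, 2.0), (1.0, 0.0, 4.0)], [(0.5, 0.0)]))
# {(0.5, 0.0): 3.0}
```

Raising `p` makes the nearest sample dominate (approaching Voronoi behavior), while a small `p` lets distant samples smooth the estimate, as discussed above.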
2.2.3
Kriging Interpolation
In geostatistics, kriging is another algorithm for spatial interpolation, first published in 1951 and named after its inventor, Danie Krige. The technique is also known as Gaussian process regression or Wiener-Kolmogorov prediction. As one of its names suggests, the method is modeled by a Gaussian process, which is governed by prior covariances. A Gaussian process is a concept in probability theory and statistics that describes a stochastic process where every finite collection of random variables
has a normal distribution. Kriging is one of the most used methods in the field of spatial analysis and an example of kriging interpolation is presented in Figure 2.2.4 below.
Figure 2.2.4: Kriging interpolation example
The brightest (or the darkest) points indicate where the raw points were, whereas the smoothened surface around them is the result of interpolation based on the numerical values from the initial points. Given an initial set of known points s and their values Z(s) at their locations, the estimate at an unknown location, Z*(s0), is a weighted mean equal to:

Z*(s0) = ∑ (i=1 to N) λi Z(si) (2.3)

where N is the size of s, and λ is the array of weights.
It differs from simpler interpolation methods (such as IDW or Voronoi interpolation) in that it uses the spatial correlation between known points to interpolate the values in the spatial field. Kriging interpolation is also capable of generating an estimate of the uncertainty around every interpolated value [38]. Kriging weights are calculated in such a way that nearby points are given more weight/importance than far away ones. In this regard, kriging is similar to IDW, but kriging additionally takes into account the clustering of points. Clustered points are given less weight than single random points because the points of a cluster reflect more or less the same value. This is done to reduce clustering bias in the final estimation. It is important to highlight that kriging will be less effective if there are no spatial correlations between points [38].
Kriging can be described as a twostep process [39]:
1. Determine the spatial covariance structure of the known points by fitting a variogram for these points, where a variogram (or semivariogram) is a function describing the degree of spatial dependence of a spatial random field.
2. Use the weights derived from the covariance structure to interpolate values for the unknown points or blocks across the required space/grid.
There are several subtypes of kriging. Some of the most common are:
• Ordinary kriging. This is one of the simplest forms of kriging and assumes point stationarity, meaning that the mean and variance of the values are constant across space. The problem of this subtype is that the stationarity assumption can be hard to meet, which is especially relevant for air pollution distributions [38].
• Universal kriging. The difference between universal kriging and ordinary kriging is that the universal one relaxes the stationarity requirement by allowing the mean of the values to differ in a deterministic way for different locations, where only the variance is kept constant across space. This type is more suitable for environmental applications [38].
• Block kriging. In contrast to the two previous variations, block kriging estimates averaged values over gridded blocks instead of single known points. The blocks often have a smaller prediction error than individual points.
• Cokriging. Extra known values are used to enhance the precision of interpolation of the variables of interest at all locations.
Kriging starts with the calculation of a variogram for further weight assignment. There are also multiple types of variograms that could be used for this: linear, exponential, spherical, and Gaussian. All of these options modify the point covariance, producing different estimations. All in all, there are many available combinations of kriging and variograms, which allow tailoring the results to the raw data.
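As an illustration of how one of the variogram models named above shapes the covariance, here is a sketch of the spherical model; the nugget, sill, and range parameters are hypothetical values that would normally come from model fitting:

```python
def spherical_variogram(h, nugget=0.0, sill=1.0, rng=10.0):
    """Semivariance at lag distance h under the spherical model."""
    if h >= rng:           # beyond the range, points are uncorrelated
        return nugget + sill
    ratio = h / rng
    return nugget + sill * (1.5 * ratio - 0.5 * ratio ** 3)

# Semivariance grows with distance and flattens out at the range.
print([round(spherical_variogram(h), 3) for h in (0.0, 5.0, 10.0, 20.0)])
# [0.0, 0.688, 1.0, 1.0]
```

Swapping in a linear, exponential, or Gaussian formula here is what produces the different estimations mentioned above.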
All kriging types assume isotropy, meaning that there is uniformity in all directions. Kriging has other limitations as well: it can be sensitive to choosing the correct model since the weights are heavily dependent on variograms. Generally, kriging accuracy may suffer if:
• there are not enough observed points available;
• the data is limited to only specific areas of the space;
• or the data is not spatially correlated.
In these cases, it is difficult to generate an accurate variogram, and other methods like IDW may be preferred instead [40].
Kriging's time complexity is the worst of the three algorithms, being O(N⁴) for N points of input data [7]. It has been shown by Srinivasan et al. that kriging complexity can be reduced to O(N³) [7]. However, kriging would still not scale well with a rapidly growing number of input coordinates, and it should be used only when the importance of accuracy outweighs the potential execution time bottlenecks.
Pseudocode
Listing 3 presents a pseudocode implementation for kriging interpolation. The pseudocode omits the model-fitting details, only specifying the model type.
semivariogram_data = []
for raw_point_1 in data_set:
    for raw_point_2 in data_set:
        distance = distance_between(raw_point_1, raw_point_2)
        correlation = 0.5 * (raw_point_1.value - raw_point_2.value) ** 2
        semivariogram_data.append((distance, correlation))

semivariogram_model = fit_model(semivariogram_data, model_type='spherical')

for interpolated_point in surface:
    interpolated_point.value = 0
    for raw_point in data_set:
        current_distance = distance_between(interpolated_point, raw_point)
        weight = semivariogram_model.get_weight(current_distance)
        interpolated_point.value = interpolated_point.value + raw_point.value * weight
return surface

Listing 3: Python-styled pseudocode for kriging interpolation
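The first stage of kriging, building the empirical semivariogram, can also be sketched as runnable Python by binning the squared value differences per distance lag. The data and bin width below are illustrative:

```python
import math
from collections import defaultdict

def empirical_semivariogram(data_set, bin_width=1.0):
    """Average semivariance per distance bin over all point pairs."""
    bins = defaultdict(list)
    for i, (x1, y1, v1) in enumerate(data_set):
        for x2, y2, v2 in data_set[i + 1:]:          # visit each pair once
            lag = math.dist((x1, y1), (x2, y2))
            bins[int(lag // bin_width)].append(0.5 * (v1 - v2) ** 2)
    return {b: sum(g) / len(g) for b, g in sorted(bins.items())}

print(empirical_semivariogram([(0, 0, 1.0), (1, 0, 2.0), (3, 0, 5.0)]))
# {1: 0.5, 2: 4.5, 3: 8.0}
```

A variogram model such as the spherical one would then be fitted to these binned averages to provide weights for the prediction stage.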
2.2.4
Summary and Comparison
Table 2.2.1 below presents a quick summary of the discussed interpolation methods:
Table 2.2.1: Feature comparison of different interpolation techniques

Interpolation method | Accuracy | Execution speed | Type | Special requirements
Voronoi | Low | Medium, O(N²) | Deterministic | No statistical data bias
Inverse Distance Weighted | Medium | Medium, O(N²) | Deterministic | No statistical data bias
Kriging | Good | Slow, O(N⁴) | Statistical | Data stationarity, data isotropy, data can't be too sparse
From the table above, it is apparent that the choice of the algorithm should be made based on which of the two characteristics is more important: accuracy or execution speed. Inverse distance weighted interpolation stands out as a more balanced choice that can be used when it is difficult to assess accuracy and time complexity requirements. Voronoi interpolation should generally not be used, as its execution time gain does not justify its large accuracy loss compared to IDW. Kriging should be used if the raw data has spatial correlations and accuracy is important. However, no matter which algorithm is used, all of them take an array of spatially (or, for real-world data, geographically) defined points and return a surface. This means that in a system capable of parallelization without algorithm modification, the algorithms can be hot-swapped.
2.3
Platforms and Frameworks
Implementation of a system capable of parallel data interpolation would be a simple task if it were used for small-scale research purposes. However, this is not the case here, as this project is done in collaboration with industry, where a more practical and maintainable approach is required. This section will go over software platforms and frameworks useful for this project.
2.3.1
MapReduce
MapReduce is a programming model for distributed computing that consists of two main procedures: map and reduce. The map function filters and sorts data, whereas the reduce function performs a summary (aggregation) operation. This model was initially presented by Google and has since become the backbone of many distributed computing algorithms [41]. The biggest advantage of MapReduce is that the logic is executed on the servers where the data already exists, instead of sending the data to where the application is, which saves computing resources [42].
Even though MapReduce is not directly used in this project, it is good to have an understanding of the key technology that has allowed other frameworks like Apache Hadoop or Apache Spark to emerge.
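A toy single-machine sketch of the model helps make the two procedures concrete; in a real framework the three phases below run distributed across nodes:

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word."""
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle: group emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate each group, here by summing the counts."""
    return {key: sum(values) for key, values in groups.items()}

documents = ["air quality data", "air quality sensors"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
print(reduce_phase(shuffle(pairs)))
# {'air': 2, 'quality': 2, 'data': 1, 'sensors': 1}
```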
2.3.2
Apache Hadoop
Apache Hadoop is an open-source platform that can handle large batch data sets in a distributed fashion. Hadoop uses Google's MapReduce under the hood to split the data into blocks and distribute the chunks of data to nodes across clusters. Then, the MapReduce algorithm processes the data in parallel on each of the nodes [43].
Every node of the cluster both processes and stores data. Hadoop stores its data on disks using HDFS – the Hadoop Distributed File System. Its greatest advantage over other distributed file systems is that it is highly fault-tolerant and has been designed to be deployed on low-cost hardware. At great scale, hardware failure is the norm rather than the exception, making Hadoop a solid choice for distributed computing [44].
The Apache Hadoop framework consists of four main components [45]:
• Apache Hadoop Common, also called Hadoop Core, provides a set of common utilities and libraries that other modules are dependent on.
• HDFS – Hadoop Distributed File System, which is capable of storing both structured and unstructured data.
• MapReduce – the main processing component of the system. Takes the data fragments from the HDFS and assigns them to mapping tasks in the cluster. It processes the data chunks to add the pieces together into the desired result in
parallel.
• YARN – Yet Another Resource Negotiator, which is responsible for managing computing resources and job scheduling.
Apache Hadoop does its processing by accessing data stored locally on HDFS. It is not as fast as in-memory solutions, but it is favorable for large-scale computing because this design makes the Hadoop infrastructure cheaper to run: large amounts of RAM are not required. The purpose of Hadoop is to store data on disks and only then analyze it in parallel batches, and it is well suited for linear data processing [46].
2.3.3
Apache Spark
Apache Spark is a unified open-source analytics framework for large-scale data processing, which provides high-level APIs in multiple programming languages, such as Java, Scala, Python, and R. On top of this, it can be used interactively from the Scala, Python, or R shells. Apache Spark has great integration options: it can run on Apache Hadoop, Apache Mesos, Kubernetes, cloud services, or even as a standalone system [47].
Spark has been specifically designed for fast performance, and it uses RAM for data caching and processing. Compared to one of its distributed computing alternatives, Apache Hadoop, Spark yields up to 100 times better in-memory performance, and up to 10 times better performance when working with data from disk storage [46]. Spark was created to improve performance while keeping the benefits of MapReduce. It can be noted that since Spark depends on RAM computations, it is more expensive to run than Hadoop because the initial hardware investment is greater. But once this is resolved, Spark has superior performance because in-memory computations are much faster than using disks for reading/writing data.
Apache Spark is a versatile tool that consists of the following components [47]:
• Apache Spark Core – the heart of the project responsible for scheduling, parallel task dispatching, and I/O operations.
• Apache Spark Streaming – component responsible for handling streaming data.
• Apache Spark SQL – component for structured data processing.
• MLlib – machine learning library that helps with the scalability of machine learning algorithms on distributed machines.
• GraphX – a set of APIs for handling graph data.
To sum up, Apache Spark can be described as an open-source framework for distributed computing with fast in-memory performance suitable for iterative data analysis. It works well with Resilient Distributed Datasets (RDDs – a Spark-specific data structure) by providing a level of fault tolerance, but may be challenging to scale because of its reliance on in-memory computations.
2.3.4
Apache Ray
Apache Ray is the newest of the three distributed computing frameworks presented here. It is also an open-source system, but it has been specifically designed for scaling Python applications from single-machine execution to large clusters. According to the official documentation, its design is architected with machine learning and artificial intelligence use cases in mind, but it is also generic enough to be useful for other distributed problems [48].
The original MapReduce model is a solid choice for most big data workloads that excels at cleaning, preparing, and analyzing data, but some workloads require better communication between components and better support for distributed and mutable state. Apache Ray introduces a concept of actors that are able to provide state to stateless tasks, thus simplifying cluster management. This would also require fewer code modifications to transition an existing program/system to a distributed state, in stark contrast to Hadoop and Spark, where the whole execution revolves around the concepts of HDFS and RDDs. Additionally, when communicating between actors or tasks on the same machine (for example, running on parallel threads of a single processor), the application state is transparently managed via shared memory, without any copying between actors and tasks for performance optimization [49].
A more practical advantage of Apache Ray worth mentioning is its Pythonic approach, allowing for rapid development and focusing on solving the actual problem instead of writing large amounts of utility code to get a basic distributed application running.
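To give a flavor of this style of parallelism, here is a minimal stdlib-only sketch; Ray's @ray.remote tasks play a role analogous to the futures below, and interpolate_region is a hypothetical stand-in for a real interpolation call:

```python
from concurrent.futures import ThreadPoolExecutor

def interpolate_region(region_id):
    """Stand-in for interpolating one geographic chunk of the map."""
    return region_id, f"surface-{region_id}"

regions = [0, 1, 2, 3]
# Submit one independent task per region and gather the results.
with ThreadPoolExecutor(max_workers=4) as pool:
    surfaces = dict(pool.map(interpolate_region, regions))
print(surfaces)  # {0: 'surface-0', 1: 'surface-1', 2: 'surface-2', 3: 'surface-3'}
```

The appeal of Ray is that essentially this same program shape scales from one machine to a cluster with only a decorator and a scheduler change, rather than a rewrite around HDFS or RDDs.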
2.3.5
Hadoop vs Spark vs Ray
Seemingly, Apache Hadoop, Apache Spark, and Apache Ray are all open-source, all made for big data processing, and all can reach similar results. However, there are some key differences, which are summarized in Table 2.3.1.
Table 2.3.1: Hadoop vs Spark vs Ray overview

Category | Hadoop | Spark | Ray
Performance | Slower performance due to using disk storage | Fast performance due to using RAM for storing data | Fast performance due to using RAM for storing data
Cost | Open-source, less expensive to run due to wide availability of cheap storage | Open-source, but more expensive to set up due to higher RAM prices | Open-source, but more expensive to set up due to higher RAM prices
Data processing | Suited for batch data processing | Suited for iterative and stream data processing | Suited for batch data processing and ML applications
Fault tolerance | Highly fault-tolerant, data is replicated across disks | Fault-tolerant, can rebuild data by tracking the RDD block creation process | Fault-tolerant, reruns crashed tasks automatically
Scalability | Easily scalable by adding new storage disks and nodes, supports tens of thousands of nodes | A bit harder to scale due to relying on RAM for computing, supports thousands of nodes | More difficult to scale, can be used together with Hadoop or Spark for better scalability
Language support | Java, Python | Java, Scala, Python, R, Spark SQL | Python
Resource management | Uses YARN for scheduling and resource allocation | Has built-in tools for resource allocation/scheduling/monitoring | Built-in resource management for allocation/scaling/monitoring
Because of these characteristics, typical Apache Hadoop applications are [50]:

• Processing large data sets where data size exceeds memory size;
• Building data analytics systems on limited budgets;
• Completing tasks where real-time performance is not a requirement;
• Historical and archive data analysis.

Typical Apache Spark applications are [46]:

• Real-time systems, where time is of the essence, with in-memory processing;
• Machine learning applications;
• Graph data processing;
• Stream data processing.

Typical Apache Ray applications are:

• Machine learning;
• Artificial intelligence;
• Batch data processing;
• Reinforcement learning applications.
To sum up, both Spark and Ray have better performance than Hadoop, but it comes at a cost. Spark has been created as a successor to Hadoop, but, in general, Hadoop and Spark should be combined and used together for advanced systems. By combining the two, Hadoop can complement Spark's lack of a file system, and Spark can bring its improved raw performance to accommodate real-time processing requirements. Ray can be run inside a Spark cluster for better control of the application flow, as its API sits one step below MapReduce on the system level [51].
2.4
Cloud Computing and Amazon Web Services
This section covers cloud computing – an alternative to managing big data frameworks yourself. Cloud computing can be defined as an on-demand service that provides computer system resources, including data storage and computing power, without direct active user administration of the hardware. Essentially, cloud providers rent out data center capacity to their clients. Clouds can either be private and limited to a single organization (private/enterprise clouds) or be shared among many users (public clouds, such as Amazon Web Services).
Cloud services heavily rely on economies of scale due to their nature of sharing resources. In the industry, cloud computing has won over the more traditional hosting methods due to its payment model. The "pay-as-you-go" model charges users only for the resources that have been consumed. A client may rent a set amount of virtual machines or API calls if the needed resources can be estimated in advance. However, the biggest advantage of the cloud payment model becomes apparent when on-demand service provision is used, where businesses do not need to micromanage the cloud resources because that is already included in the price [52].
The possibility of always having available high-speed networks able to provide resources for any sort of computing capacity is extremely attractive to many businesses. CabinAir Sweden AB has entered the software market relatively recently and can be considered a prime example of such a business. Because of this, the company chose Amazon Web Services (AWS) as its cloud provider.
AWS has over 170 services available, but only a few are relevant for the scope of this work. These services are described below.
AWS EC2
AWS EC2 is the oldest of the AWS services and one of the most used. It offers resizable virtual machines (VMs) in all performance ranges. The VM compute capacity ranges from very basic ARM-based single-core VMs for executing simple single tasks all the way to 64-core x86 VMs that are able to do complicated processing. These VMs are available globally, and the main benefit of EC2 is being able to upgrade or downgrade (due to varying loads) VM configurations in a couple of clicks and with a minimal amount of server downtime [53]. Combined with the previously mentioned Spark integration, AWS EC2 is a great tool for big data cloud computing, where a locally developed system can be deployed on an EC2 cluster with minimal modification.
AWS Lambda
AWS Lambda is a serverless function compute service that allows running code without managing or provisioning any servers. Lambda runs program code only when it is called and its scaling is handled by AWS [54]. Because of this, Lambda is billed for the execution time, and it is usually more expensive than a virtual machine hosting service, such as AWS EC2. However, its biggest advantage is with its intended use case. It offers an efficient way to handle burst parallel processing without having to worry about costs associated with idling hardware.
AWS DynamoDB
AWS DynamoDB is a highperformance serverless NoSQL database that is a good choice for any keyvalue data at any scale. It was made with mobile, web, and IoT applications in mind, thus making it a good candidate for storing freshly collected sensor data [55]. DynamoDB is a serverless database. Since there is no need to actively monitor the load of the database, it has a smaller management overhead. This is possible due to its ondemand features, where the database can allocate more resources automatically if it detects that the load is too high.
2.5
Related Work
In previous years, a lot of effort has been put into improving data interpolation techniques through different approaches, mainly algorithm optimizations. This thesis focuses on the parallelization aspect of data interpolation, in which not much relevant work has been done. However, it is still insightful to understand the general findings in similar areas and which key aspects should be considered. Therefore, this section will summarize some of the scientific publications relevant to the scope of the thesis, even if they do not involve parallelization.
A study on the efficiency of kriging for real-time spatiotemporal interpolation was done by Srinivasan et al., where an attempt was made to reduce the kriging time complexity to O(N³). The work used atmospheric data, and the authors state that substantial performance increases were achieved, with some test execution times improving from 2 days to under 7 minutes [7]. The kriging algorithm was changed by introducing iterative solvers like conjugate gradient, and GPU acceleration was then used to improve performance even further.
Another study, performed by Li et al., focused on improving the accuracy of IDW interpolation [56]. A new variant called the Adjusted Inverse Distance Weighted (AIDW) algorithm was introduced. The algorithm's advantage over standard IDW interpolation is that it takes into account the influence of the relative distance and positions of known points on the unknown point. This was done by adding a coefficient to the IDW formula, which is used to adjust the distance weight of sampled points. As a result, the generated interpolated surfaces are more reasonable and resemble manual professional interpolation by diminishing IDW interpolation errors for data sets with nonuniform distributions of sampled points.
Armstrong and Marciano have investigated parallel data interpolation on Multiple Instruction Multiple Data (MIMD) processors [57]. They have used the IDW interpolation algorithm on multiple processors, and only the execution time was measured. The study has shown that there was almost a linear decrease in execution time with a linear increase in the number of processors. However, the performance gains would gradually slow down with more added hardware. The study has also not addressed the accuracy of the computations and how the partitions would relate to each other.
Henneböhl et al. have compared CPU and GPU performance differences for the IDW algorithm [58]. The study found that, for small ratios of known data points to prediction locations, the GPU-based implementation outperforms the CPU-based one by a factor of 2. However, as this ratio grows, GPU-based execution times start growing exponentially, falling far behind CPU times, which were shown to not be affected by this metric. The study concludes by suggesting that the solution should be a hybrid CPU/GPU system that is able to combine the strengths of both approaches.
Summary
From the literature covered above, it can be highlighted that there have been multiple attempts to improve interpolation performance. Some focus on algorithm optimizations, whereas others resort to hardware solutions (GPUs). But as Henneböhl et al. have suggested, the answer lies somewhere in the middle, where all of these techniques can be combined. This thesis will focus on improving execution times by means of parallelization, with consideration of interpolated edges together with global accuracy.
Algorithms and Frameworks
This chapter covers the engineering-related contents of the thesis work before going into the actual implementation of the project. It will explain why certain approaches have been chosen over others and will go into more detail about the concepts presented in Chapter 2.
Section 3.1 dives into the features of the data used in the thesis, as well as the best format for storing such data. Section 3.2 focuses on why kriging interpolation was chosen for this work, and which specific type of kriging is preferable. Section 3.3 explains why Apache Ray is the big data framework of choice, and Section 3.4 summarizes the decisions before moving on to the implementation stage.
3.1
Data Structure Choice
Air quality data collected by CabinAir has a very specific format. This section will explain it in detail and cover why the raster data representation format was chosen for handling air quality data.
3.1.1
Air Quality Data Format
Air quality data available for this project was provided and is owned by CabinAir AB. The data has the following fields:
• Timestamp – when the data point was recorded, represented as standard Unix time. Also known as Epoch time, Unix time counts the number of seconds that have passed since the beginning of the Unix Epoch on 1 January 1970 [59].
• Longitude – geographic coordinate that specifies the east-west position of a point on the surface of the Earth. Its values range from −180° to 180°; 0° is defined by the prime meridian, which passes through the Royal Observatory in Greenwich, England [60].
• Latitude – geographic coordinate that specifies the north-south position of a point on the surface of the Earth. Its values range from −90° to 90°; 0° is defined by the equator.
• PM2.5 – concentration of PM2.5 particles, also called fine particulate matter, measured in µg/m³ (see Table 3.1.1). Gives a good estimate of how polluted the surrounding air is.
• PM10 – concentration of PM10 particles, also called coarse particulate matter. Serves a similar purpose as PM2.5.
• CO – carbon monoxide; its concentration is measured in parts-per-million.
PM10 and CO data have not been used for the experiments, since their interpolation process is an exact repetition of the PM2.5 process. Table 3.1.1 below shows some sample points.
Table 3.1.1: Example of collected air quality data
Timestamp (s) Latitude (°) Longitude (°) PM2.5 (µg/m³)
1611302808 51.3706 0.0662 4.0625
1611302819 51.5453 0.1179 4.0000
1611302829 51.7712 0.4482 5.8750
1611302840 51.6852 0.0235 7.2500
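As a minimal illustration of this record format, the first row of Table 3.1.1 can be decoded as follows; the dictionary field names are illustrative, and the conversion shows the calendar date hidden in the Unix timestamp.

```python
from datetime import datetime, timezone

# One record in the format of Table 3.1.1 (values taken from its first row)
record = {"timestamp": 1611302808, "latitude": 51.3706,
          "longitude": 0.0662, "pm25": 4.0625}

# Unix time is the number of seconds since 1970-01-01 00:00:00 UTC
recorded_at = datetime.fromtimestamp(record["timestamp"], tz=timezone.utc)
print(recorded_at.isoformat())  # 2021-01-22T08:06:48+00:00
```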
3.1.2
Decimal Degree Precision
Five decimal places of precision in geographical coordinates using decimal degree notation corresponds to roughly 1-meter resolution: each 0.00001 difference is about 1 meter in length. For example, Google Maps imagery uses 15 cm resolution, or roughly 6 decimals [61]. Six decimal places give a precision of about 10 cm, and seven decimal places about 1 cm. Eight decimal places would provide millimeter precision, and since each extra decimal place increases the amount of data stored by a factor of 10, using more than five decimals becomes suboptimal for most use cases. Table 3.1.2 below shows the decimal degrees and the objects suitable for identification at those scales:
Table 3.1.2: Coordinate precision
Decimal places Decimal degrees Distance (m) Scale
1 0.100000 11,057.43 City
2 0.010000 1,105.74 District
3 0.001000 110.57 Street
4 0.000100 11.06 House
5 0.000010 1.11 Person
6 0.000001 0.11 Person
It should be noted that these decimal degree values reflect distances near the equator. Near the north and south poles, the same decimal degree delta in longitude represents a much smaller distance: at 4 decimal places, it shrinks to roughly 20 cm. Nevertheless, this is not critical, since those extreme regions will not produce any data.
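The figures in Table 3.1.2 and the polar shrinkage follow from a simple approximation: one degree of latitude is always about 111 km, while one degree of longitude scales with the cosine of the latitude. A short sketch (the function name and rounded constant are the author's illustration, not part of the data set):

```python
import math

M_PER_DEG = 111_320.0  # approximate meters per degree of latitude

def degree_delta_in_meters(delta_deg, latitude_deg=0.0, axis="lat"):
    """Approximate ground distance of a decimal-degree delta.
    Longitude degrees shrink by cos(latitude); latitude degrees do not."""
    if axis == "lon":
        return delta_deg * M_PER_DEG * math.cos(math.radians(latitude_deg))
    return delta_deg * M_PER_DEG

print(round(degree_delta_in_meters(0.0001), 2))               # ~11 m (4 decimals, equator)
print(round(degree_delta_in_meters(0.0001, 89.0, "lon"), 2))  # ~0.19 m (4 decimals, near a pole)
```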
3.1.3
Grid Representation
Section 2.1 has presented two distinct ways of storing data: vectors and rasters. The air quality data used for this project has a couple of features that are fitting for raster (grid) data representation. These features are:
• Coordinates – the data is represented by geographical coordinates. Longitude can be considered the X-axis, and latitude the Y-axis. This means the data can easily be mapped onto a 2D surface in Euclidean space, such as a simple grid.
• CabinAir targets vehicles and vehicular travel as their main market. This means that the grid resolution does not have to be extremely high: 4 decimal places are sufficient for retaining good data granularity without using unnecessary computing resources. Using more than 4 decimals would be impractical, because tracking car movement does not require person-level precision, whereas using fewer than 4 would aggregate too much information into the same cells and lose detail.
• There is no direct relationship between any two collected data points. This means that there are no benefits to employing the vector approach, since there is no obvious way of constructing a graph out of the initial data set.
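The raster choice described above amounts to snapping every reading onto a fixed grid by rounding its coordinates and aggregating readings that fall into the same cell. A minimal sketch of this idea, assuming a simple per-cell average (the helper name and sample values are illustrative, not the thesis implementation):

```python
from collections import defaultdict

def to_grid(points, decimals=4):
    """Snap (lat, lon, value) readings onto a raster grid by rounding the
    coordinates to `decimals` places and averaging values per cell."""
    cells = defaultdict(list)
    for lat, lon, value in points:
        cells[(round(lat, decimals), round(lon, decimals))].append(value)
    return {cell: sum(v) / len(v) for cell, v in cells.items()}

data = [(51.37061, 0.06622, 4.0), (51.37059, 0.06618, 6.0),  # same 4-decimal cell
        (51.54530, 0.11790, 4.0)]
grid = to_grid(data)
print(grid)  # two cells; the first two readings are averaged into one
```

Rounding to 4 decimal places merges readings taken within roughly 11 m of each other, which matches the house-scale resolution chosen for vehicular data.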