Student Thesis


Level: Master

The Impact of the COVID-19 Lockdown on the Urban Air Quality: A Machine Learning Approach.

Author: Srinivas Bobba

Supervisor: Yves Rybarczyk

Examiner: Moudud Alam

Subject/main field of study: Data Science

Course code: MI4001

Credits: 30 ECTS

Date of examination: 2021-06-09

At Dalarna University it is possible to publish the student thesis in full text in DiVA. The publishing is open access, which means the work will be freely accessible to read and download on the internet. This will significantly increase the dissemination and visibility of the student thesis.

Open access is becoming the standard route for spreading scientific and academic information on the internet.

Dalarna University recommends that both researchers, as well as students, publish their work open access.

I give my/we give our consent for full-text publishing (freely accessible on the internet, open access):

Yes ☒ No ☐

Dalarna University – SE-791 88 Falun – Phone +4623-77 80 00


ABSTRACT

SARS-CoV-2, the virus responsible for the current COVID-19 pandemic, was first reported from Wuhan, China, on 31 December 2019. Since then, a set of rapid and strict countermeasures has been taken to prevent its propagation around the world. While many researchers have studied the effect of the COVID-19 lockdown on air quality and concluded that pollution decreased, the most reliable methods for quantifying the reduction of pollutants in the air are still debated. In this study, we analyze how COVID-19 lockdown procedures affected air quality in four selected cities: New Delhi, Diepkloof, Wuhan, and London. The results show that the air quality index (AQI) improved by 43% in New Delhi, 18% in Wuhan, 15% in Diepkloof, and 12% in London during the initial lockdown from 19 March 2020 to 31 May 2020, compared with the four-year pre-lockdown period. Furthermore, the concentrations of four main pollutants (NO2, CO, SO2, and PM2.5) were analyzed before and during the lockdown in India. The quantification of the pollution drop is supported by statistical tests, namely the ANOVA test and the permutation test.

Overall, decreases of 58%, 61%, 18%, and 55% are observed in NO2, CO, SO2, and PM2.5 concentrations, respectively. To check whether changes in weather played any role in the reduction of pollution levels, we analyzed how weather factors correlate with pollutants using a correlation matrix. Finally, machine learning regression models are constructed to assess the lockdown impact on air quality in India by incorporating weather data. Gradient Boosting performed well in predicting the drop in PM2.5 concentration for individual cities in India. The feature importance ranking of the regression models is compared with the correlation of each factor with PM2.5. This study concludes that the COVID-19 lockdown had a significant effect on the natural environment and on air quality improvement.

Keywords: COVID-19, air pollutants, lockdown, ANOVA, permutation test, machine learning, feature importance.


ACKNOWLEDGEMENTS

I would like to express my gratitude to my supervisor, Professor Yves Rybarczyk, for his continuous guidance, support, and timely advice throughout this master thesis work. I would like to extend my appreciation to Professor Kenneth Carling for all his help and support throughout the Data Science program. I would like to take this opportunity to thank Moudud Alam for teaching statistical learning, which allowed me to learn statistical modeling. I would also like to thank Professor Hasan Fleyeh and Professor Arend Hintze for sharing their thoughts on machine learning concepts, which helped me acquire the knowledge that led to this thesis.

Furthermore, I would like to take this opportunity to thank all my teachers for making me gain knowledge in Microdata Analysis. I sincerely thank all the staff at Dalarna University for supporting me with the required tools to move ahead during this thesis. I would like to express my gratitude and appreciation to my family members and friends for extending their full support during the course and the thesis preparation.

LIST OF ABBREVIATIONS

ANN Artificial Neural Networks

AQI Air Quality Index

ANOVA Analysis of variance

BTX Benzene, Toluene, and Xylene

COVID-19 Coronavirus disease of 2019.

CO Carbon monoxide

CSV Comma-separated value

CART Classification and Regression Tree

EDA Exploratory Data Analysis

EPA Environmental Protection Agency

GBRT Gradient Boosted Regression Tree

LASSO Least absolute shrinkage and selection operator

MSE Mean squared error

NO2 Nitrogen Dioxide

O3 Ozone

PM2.5 Particulate matter (particles with a diameter of 2.5 micrometres or less)

R2 R-squared

SO2 Sulfur dioxide

VOC Volatile organic compound

WHO World Health Organization


LIST OF FIGURES

Figure 1.2: Types of Air Pollutants

Figure 2.1: Research design process

Figure 3.1.1: Drop in pollution in Wuhan, China due to Covid-19 lockdown

Figure 3.1.2: Drop in pollution in Delhi, India due to Covid-19 lockdown

Figure 3.1.3: Drop in pollution in London, the United Kingdom due to Covid-19 lockdown

Figure 3.1.4: Geographic distribution of AQI change in selected cities around the World

Figure 3.2.1.1: Visualizing pollutants (NO2, SO2) trend and seasonality maps

Figure 3.2.1.2: Visualizing pollutants (PM2.5, PM10) trend and seasonality maps

Figure 3.2.2.1: Correlation between Weather factors and Pollutants

Figure 3.2.2.2: Weather Line Graph for PM2.5

Figure 3.2.2.3: Weather Line Graph for NO2

Figure 3.2.2.4: Weather Line Graph for CO

Figure 3.2.3.1: Quantification of pollution drop in major cities in India

Figure 3.2.4.1: Permutation Test

Figure 3.2.5.1: Performance of the best model and feature importance ranking on all major cities

Figure 3.2.5.2: Performance of the best model and feature importance ranking on the individual city

Figure 3.2.6.1: Correlation between PM2.5 and weather factors

LIST OF TABLES

Table 2.3: Relation between AQI value and Air Quality remark

Table 2.7: Tools used

Table 3.2.4.1: Results from ANOVA Test

Table 3.2.4.2: ANOVA table for PM2.5

Table 3.2.5.1: Comparison of various regression models on major cities in India

Table 3.2.5.2: Comparison of various regression models on the individual city (Delhi)

CHAPTER 1

INTRODUCTION

1.1. BACKGROUND

Early last year, after an outbreak in China, WHO identified a new type of coronavirus, SARS-CoV-2, as one of the seven coronaviruses known to infect humans (WHO, 2020c). COVID-19, the disease caused by SARS-CoV-2, is a respiratory disease that can affect both the upper and lower respiratory tracts (Landrigan, 2017). The sickness quickly spread around the world. Because of the disease's severity, WHO declared it a pandemic on March 11, 2020 (WHO, 2020b). To fight the spread of the virus, almost every major city in the world has been locked down and people have been asked to maintain social distancing. As expected, the economy has crashed, and people in the lower-income band are affected the most. Governments and other officials are trying hard to improve the situation, but experts predict it will take quite some time before we get back to normal. Countless people have lost their lives and millions are infected. Every day, health officials work extremely hard to save those infected. Although the lockdown has cost many people their jobs, it has substantially reduced pollution levels. Be it air, water, or soil, the cities under lockdown have seen a significant reduction in their pollution levels and an improvement in air quality.

1.2. TYPES OF AIR POLLUTANTS

Air pollution is one of the main determinants of human health (WHO, 2020b). Dealing with air pollution is even more challenging for the world in current times. It has increased considerably since the industrial revolution, with the rapid growth in the number of vehicles and changes in lifestyle. It is one of the major concerns for developing countries like India, which is considered one of the most polluted countries in the world. Developing countries like India use fossil fuels for domestic and industrial purposes; incomplete combustion of these fuels emits PM2.5, SO2, CO, NO2, and other pollutants into the air (Kandlikar & Ramachandran, 2000). These air pollutants have contributed to a large increase in global warming and to irregular climate changes. To interpret the datasets, it is necessary to understand the various types of air pollutants they contain. On a broader level, these pollutants can be classified as shown in Figure 1.2 below:

Figure 1.2: Types of Air Pollutants

Particulate matter (PM2.5 and PM10) is a mix of solids and liquids, including carbon, complex organic chemicals, sulphates, nitrates, mineral dust, and water, suspended in the air. Nitrogen Oxides (NO, NO2, NOx) are a group of seven gases and compounds composed of nitrogen and oxygen, sometimes collectively known as NOx gases. Sulphur Dioxide (SO2) is a colourless gas with a strong odour. Carbon Monoxide (CO) is a colourless, highly poisonous gas. Under pressure, it becomes a liquid. It is produced by burning gasoline, natural gas, charcoal, wood, and other fuels. Benzene, Toluene, and Xylene (BTX) are well-known indoor air pollutants, especially after house decoration. They are also common pollutants in the workplaces of the plastic, chemical, and leather industries. Ozone (O3) is a colourless and highly irritating gas that forms just above the earth's surface. It is called a "secondary" pollutant because it is produced when two primary pollutants, nitrogen oxides (NOx) and volatile organic compounds (VOCs), react in sunlight and stagnant air (Yoo et al., 2015).

1.2.1. CAUSES FOR THESE POLLUTANTS

• Vehicles mainly emit Carbon Monoxide (CO) and Nitrogen Oxide (NO). They also emit smaller amounts of Ozone (O3) and Particulate Matter (PM2.5 and PM10) (Pérez-Martínez et al., 2014).

• Industries mainly emit Sulphur Dioxide (SO2), Carbon Monoxide (CO), Ammonia (NH3), Particulate Matter (PM2.5 and PM10), and BTX (Benzene, Toluene, Xylene) (Arulprakasajothi et al., 2020).

1.3. RELATED RESEARCH

Before starting our study, we reviewed several research papers related to it. In the last few decades, many machine learning techniques have been proposed for analyzing and predicting air pollution, and several of them predict air quality using classification techniques. Athanasiadis (2003) used an s-fuzzy lattice neurocomputing classifier to classify O3 concentrations as low, mid, or high using meteorological features and some pollutants. Kurt and Oktay (2010) used a neural network model to predict the daily air quality value by classifying it using concentration values of SO2, CO, and PM10. The drawback of converting a regression problem into a classification problem is that it ignores the magnitude of the numeric data and produces a discretized output (class label), losing the resolution that can be achieved using numeric data. Dan Wei (2014) used Naive Bayes classification and a support vector machine to predict air quality in Beijing.

In another study, Carbajal-Hernández et al. (2012) used a fuzzy inference model to classify parameters through a reasoning process and integrate them into an air quality index. Shaban et al. (2016) created a wireless system to monitor and predict air quality and identify the highly polluted areas in each city using support vector machines, M5P model trees, and artificial neural networks (ANN). Jain et al. (2018), in their work "Scalable measurement of air pollution using COTS IoT devices", created a system to predict air quality by considering traffic conditions and the available greenery, with data collected when users take a trip. This approach ignores the contribution of the pollutants that degrade air quality and focuses only on the greenery present in the surroundings. Md Nazmul Hoq (2019), in the work “Prediction of a possible asthma attack from air pollutants: towards a high-density air pollution map for smart cities to improve living”, created a mobile application to predict asthma attacks in highly populated cities.

Later, Aditya C. R et al. (2018) used logistic regression to detect whether a data sample is polluted or not polluted, and autoregression to predict future values of PM2.5 from previous PM2.5 readings, and thus predicted air quality.

Machine learning offers the computational and predictive power to forecast future values, so it has found applications almost everywhere. The air quality prediction problem has been addressed by many researchers using different algorithms. Some of these research works are closely related to our main goal. For instance, the research work "Applying Machine learning Techniques in Air Quality Prediction" (Kalapanida & Nikolaos, 1999) attempts to solve the same problem using Decision Tree and Naive Bayes algorithms. The shortcoming of this methodology is that a decision tree is not a good classifier for time series. The authors also approached the problem with a limited amount of data, which increases the chances of over-fitting.

In another research work, "A Machine Learning Approach for Air Quality Prediction: Model Regularization and Optimization" (Zhu Dixian et al., 2018), the authors used ozone, SO2, and PM2.5 for AQI prediction. They also used regularization and optimization to predict future pollutant values, with linear regression as the underlying model. The limitation of this research work is its lack of generality, as the model was developed using data collected by only two data stations.

In the research work "Comparative Analysis of Machine Learning Techniques for Predicting Air Quality in Smart Cities" (Ameer et al., 2020), the authors considered a dataset of five Chinese cities, namely Beijing, Shanghai, Shenyang, Guangzhou, and Chengdu. They compared different regression techniques and evaluated them based on Mean Absolute Error (MAE) and Root Mean Square Error (RMSE). Regression techniques such as Decision Tree regression, Random Forest regression, Gradient Boosting regression, and Multilayer Perceptron regression (an artificial neural network) were applied to this complex problem. Decision tree regression was used to develop a predictive model with the help of simple decision rules. The random forest ensures that every tree in the ensemble is generated from a sample drawn with replacement (bootstrapping) from the training set. The authors concluded that random forest regression performs best among all the regression techniques considered.

In the recent publication "Assessing the COVID-19 Impact on Air Quality: A Machine Learning Approach" (Rybarczyk et al., 2020), machine learning models based on a Gradient Boosting algorithm were built to assess the impact of the outbreak on air quality in Quito, Ecuador. First, the precision of the prediction was evaluated by cross-validation on the four pre-lockdown years, showing high accuracy in estimating the real pollution levels.

1.4. RESEARCH PROBLEM

In summary, many researchers around the world have studied the effect of the COVID-19 lockdown on air quality and concluded that pollution decreased, but the most reliable methods for quantifying the reduction of pollutants in the air are still debated. To tackle this problem, statistical methods are applied and machine learning models are built to assess the impact of COVID-19 on air quality around the world.

1.5. RESEARCH QUESTIONS

Based on the problem outlined earlier, the research questions this thesis addresses are thus formulated as:

• How did the weather condition and the pollutants affect the quality of urban air during COVID-19 lockdown?

• Can a predictive model be used for a reasonable prediction of the individual pollutants, affecting the quality of urban air, during COVID-19 lockdown?

• Was there a significant improvement in the air quality after the adoption of lockdown due to Covid-19?

1.6. AIM AND OBJECTIVES

This study aims to estimate the pollution drop in selected cities around the world and, using machine learning regression models that incorporate weather data, to assess the lockdown impact on air quality in India. The following objectives have been set:

• Estimate the pollution drop in selected cities around the world.

• Estimate the correlation between the weather factors (temperature, pressure, humidity, dew point, wind speed, precipitation) and the concentration of individual pollutants.

• Infer on the variations of the air pollutant concentrations using ANOVA and Permutation Test.

• Predict the individual pollutant concentration drops, that affect air quality, using machine learning regression models.

CHAPTER 2

METHODOLOGY

2.1. RESEARCH DESIGN

The research approach adopted in this study is predominantly quantitative: a hypothesis is developed based on existing theory and tested empirically or by means of mathematical proof (Hanson et al., 2005). The research can moreover be characterized as deductive: a hypothesis is formulated and then verified empirically and quantitatively through hypothesis testing.

Conclusions are drawn via statistical inference in hypothesis testing. The null hypothesis is either rejected or not rejected at a chosen level of significance. When comparing two models, statistical significance tests simply assess whether the reported difference in performance (according to some metric) is significant at a particular level of confidence (Alpaydin, 2020).

Figure 2.1: Research design process

In the upper part of Figure 2.1, a literature review was conducted as the primary method: the relevant literature from conference proceedings, journals, theses, reports, and other sources was searched, categorized, reviewed, analyzed, and summarized. During this exploratory process, the research gap was identified and the research questions were formulated.

Research questions in machine learning are usually addressed by deductive reasoning (Alpaydin, 2020). The corresponding aim and objectives of this thesis were then identified. The standard method for inferring causal effects is the randomized controlled trial, and the statistical tests used here (ANOVA and the permutation test) follow the same logic. The data are split into two groups, treatment and control; treatment is administered to one group and nothing to the other, and the outcome of both groups is measured. Assuming that the treatment and control groups are otherwise similar, we infer whether the treatment was effective from the difference in outcome between the two groups.

In the lower part of Figure 2.1, an experimental design was conducted to define the independent and dependent variables. In this case, the independent variables are the predictor variables, and the dependent variable is not measured directly; instead, it is given as an output of the model.

Missing values are removed from the raw data. The modeling combines techniques from several disciplines, including machine learning and statistics, to develop an effective predictive model (Bhattacharya, 2015). Analytical analysis was conducted to establish the theoretical merits of the proposed approach, and numerical illustrations were applied to validate it (Montgomery, 2017). As a final step, the work was written up and presented as this thesis.

2.2. DATA COLLECTION

The data needed for this thesis were extracted from the World Air Quality Index project, a non-profit project started in 2007 with the mission of promoting air pollution awareness among citizens and providing unified, worldwide air quality information. The dataset consists of 435,742 observations and 9 features for major cities around the world, based on the average (median) of several stations, from 2015 through 2020. The dataset provides the min, max, median, and standard deviation for each of the air pollutant species PM2.5, PM10, NO2, SO2, CO, and O3, as well as meteorological data (wind speed, temperature, pressure, dew point, humidity). All air pollutant species are converted to the US EPA standard, dates are based on UTC (Coordinated Universal Time), and the count column is the number of samples used for calculating the median and standard deviation (AQICN; https://aqicn.org).

2.2.1. DATA WRANGLING

The raw dataset was converted into CSV files, imported, and concatenated into a single Pandas DataFrame. The concatenated raw dataset combines time-related records and various measurements taken at multiple stations in each of five cities in India from 2015 to 2020. The raw data frame is divided into two separate subsets for wrangling, EDA, and modeling, and three new columns are added to the appropriate subsets:

Two separate subsets:

pm_clean: this is the main dataset containing the measurement data for all five cities, here the average pm2.5 reading for each city is used instead of pm2.5 data from individual stations; this subset containing the station_average pm2.5 reading is used for building predictive models.

pm_stations: this is a supplementary dataset containing station-specific individual PM2.5 readings, serving the purpose of validating the measurement consistency among stations in each city, which is the underlying foundation for taking the average pm2.5 reading across multiple stations.

Three added columns:

'date_time': time-related information is recorded in separated columns as 'year', 'month', 'day', 'hour' and 'season' in the raw data. For EDA and modeling purposes, a DateTime formatted column is created by parsing the time-related columns and added to the 'pm_clean' and 'pm_sr' subsets.

'pm_average': Hourly PM2.5 readings from multiple stations are recorded for each city in the raw data. The PM2.5 readings are reasonably consistent among stations in the same city and there is no reason to choose the PM2.5 reading from one station over others; therefore it is most representative to use the average PM2.5 readings in EDA and modeling. A 'pm_average' column representing the average of the PM2.5 readings from multiple stations of the same city is computed and used in the cleaned main dataset 'pm_clean' instead of the PM2.5 readings from the individual stations.

'ws': The 'iws' column records the cumulated wind speed over time, one of the meteorological weather parameters.
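The two derived columns can be sketched as follows. This is a dependency-free illustration rather than the thesis code, which used Pandas: the station column names are hypothetical, and averaging only over the stations that reported a value is an assumption.

```python
from datetime import datetime
from statistics import mean

# Hypothetical raw rows: the time columns follow the names given in the text,
# while the station column names are invented for this example.
raw_rows = [
    {"year": 2020, "month": 3, "day": 19, "hour": 0,
     "pm_station_a": 80.0, "pm_station_b": 90.0, "pm_station_c": None},
    {"year": 2020, "month": 3, "day": 19, "hour": 1,
     "pm_station_a": 70.0, "pm_station_b": None, "pm_station_c": 74.0},
]
STATION_COLUMNS = ["pm_station_a", "pm_station_b", "pm_station_c"]


def add_derived_columns(row):
    """Return a copy of `row` with 'date_time' and 'pm_average' added."""
    out = dict(row)
    # Parse the separate time columns into one datetime value.
    out["date_time"] = datetime(row["year"], row["month"], row["day"], row["hour"])
    # Assumption: average only over stations that reported a reading.
    readings = [row[c] for c in STATION_COLUMNS if row[c] is not None]
    out["pm_average"] = mean(readings) if readings else None
    return out


pm_clean = [add_derived_columns(r) for r in raw_rows]
```

In Pandas, the same two steps would typically be a datetime conversion of the time columns followed by a row-wise mean over the station columns.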

Missing data:

In total, 36% of the rows contain missing data, 93% of which are due to missing PM2.5 values. Given that PM2.5 (the air quality indicator) is the focus of the EDA and the predictive modeling, records without valid PM2.5 values are of little use to this study. These records are dropped.

Outliers:

Unrealistic values (0.025%) are spotted in three columns ('dewp', 'humi' and 'ws'). A seasonal component from the STL (seasonal-trend decomposition procedure based on LOESS) fit is removed, and linear interpolation is performed to replace the outliers.

2.3. AIR QUALITY INDEX (AQI) CALCULATION
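The interpolation step alone can be illustrated with the sketch below. It deliberately omits the STL decomposition, and it flags outliers with a simple physical range check, which is an assumption made for this example and not necessarily the rule used in the thesis.

```python
def interpolate_outliers(values, is_outlier):
    """Replace flagged points by linear interpolation between valid neighbours."""
    cleaned = list(values)
    valid = [i for i, v in enumerate(values) if not is_outlier(v)]
    if not valid:
        raise ValueError("no valid points to interpolate from")
    for i, v in enumerate(values):
        if not is_outlier(v):
            continue
        left = max((j for j in valid if j < i), default=None)
        right = min((j for j in valid if j > i), default=None)
        if left is None:        # leading outlier: copy the first valid value
            cleaned[i] = values[right]
        elif right is None:     # trailing outlier: copy the last valid value
            cleaned[i] = values[left]
        else:                   # interior outlier: interpolate linearly
            weight = (i - left) / (right - left)
            cleaned[i] = values[left] + weight * (values[right] - values[left])
    return cleaned


# Hypothetical humidity series (%): values outside 0..100 are unrealistic.
humidity = [40.0, 42.0, -999.0, 46.0, 250.0, 250.0, 52.0]
cleaned = interpolate_outliers(humidity, lambda h: not 0.0 <= h <= 100.0)
```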

We calculate the AQI and the pollutant index of each pollutant using the EPA method, as described below (Ilvessalo, 1995):

Pollutants (independent variables): NO2, SO2, CO, and PM2.5

Target (dependent variable): AQI (Air Quality Index)

$$AQI_P = \frac{AQI_{Max} - AQI_{Min}}{P_{Max} - P_{Min}} \left(P_{Obs} - P_{Min}\right) + AQI_{Min}$$

where

P_Obs = observed 24-hour average concentration in micrograms per cubic metre

P_Max = maximum concentration of the AQI color category that contains P_Obs

P_Min = minimum concentration of the AQI color category that contains P_Obs

AQI_Max = maximum AQI value for the color category that corresponds to P_Obs

AQI_Min = minimum AQI value for the color category that corresponds to P_Obs

We calculate the pollutant index (AQI_P) of each pollutant, namely si, ni, ci, and pi, and append these indexes to the dataset.

Then, we find the AQI by taking the maximum of the pollutant indexes:

AQI = max(AQI_NO2, AQI_SO2, AQI_CO, AQI_PM2.5)
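The sketch below illustrates the piecewise-linear index calculation and the final maximum. The breakpoint table holds the US EPA 24-hour PM2.5 breakpoints that were in force before the 2024 revision and is included only for illustration; the function names and the example sub-index values are hypothetical, and the thesis may have used different breakpoints.

```python
# (P_Min, P_Max, AQI_Min, AQI_Max) for each AQI category of 24-hour PM2.5
# in micrograms per cubic metre (pre-2024 US EPA table, for illustration).
PM25_BREAKPOINTS = [
    (0.0, 12.0, 0, 50),
    (12.1, 35.4, 51, 100),
    (35.5, 55.4, 101, 150),
    (55.5, 150.4, 151, 200),
    (150.5, 250.4, 201, 300),
    (250.5, 350.4, 301, 400),
    (350.5, 500.4, 401, 500),
]


def sub_index(p_obs, breakpoints):
    """Pollutant index for one concentration, rounded to an integer.

    Assumes p_obs is already truncated to the precision of the table.
    """
    for p_min, p_max, aqi_min, aqi_max in breakpoints:
        if p_min <= p_obs <= p_max:
            slope = (aqi_max - aqi_min) / (p_max - p_min)
            return round(slope * (p_obs - p_min) + aqi_min)
    raise ValueError(f"concentration {p_obs} is outside the breakpoint table")


def overall_aqi(sub_indexes):
    """The AQI is the maximum of the individual pollutant indexes."""
    return max(sub_indexes.values())


# Hypothetical sub-indexes for the other pollutants on the same day.
indexes = {"NO2": 40, "SO2": 12, "CO": 9, "PM2.5": sub_index(35.9, PM25_BREAKPOINTS)}
aqi = overall_aqi(indexes)
```

Taking the maximum means that the pollutant with the worst index on a given day determines the reported AQI.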

We created a column 'AQI' in our dataset; the prediction of the pollution drop is done by applying a machine learning model to this dataset.

Further, for classification purposes, two new columns were introduced namely, AQI_label that divides the AQI into 5 major categories as shown in Table 2.3 below, and AQI_Binary_Range which divides the AQI into two broad categories namely - Good or Bad regarding the air quality.

Table 2.3: Relation between AQI value and Air Quality remark

Source: https://www.epa.gov

2.4. STATISTICAL METHODS

2.4.1. DATA TRANSFORMATION

Data transformation was performed using a power transform. It removes skew from a data distribution to make the distribution more normal (Gaussian). In other words, it makes the probability distribution of a variable more Gaussian and standardizes the result, centering the values on a mean of 0 with a standard deviation of 1.0. This can have the effect of removing a change in variance over time, especially in time series data. The Box-Cox transform (which requires positive values) was applied with the boxcox() function from the SciPy library. The function invert_boxcox() takes a transformed value and the lambda value and returns the original value (or a value close to it). The optimal lambda found on the training data is reused to transform new data, such as a test dataset. The transformed training dataset is used later in the ANOVA test, the permutation test, and the machine learning regression models (Piepho, 2009).

2.4.2. ANOVA (ANALYSIS OF VARIANCE) TEST

ANOVA is a statistical test for comparing the means of several groups. Its null hypothesis is that all group means are equal (Cuevas et al., 2004). If the result is statistically significant, at least one group mean differs from the others. In this study, we visualized the distributions of the pollutants in each year and ran an ANOVA test to find out whether there is a significant difference in the mean pollutant levels during the lockdown period in 2018, 2019, and 2020. During this process, we check the assumptions of ANOVA for PM2.5:
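For a fixed lambda, the transform and its inverse reduce to the two small functions sketched below. This is an illustration under stated assumptions: lambda is supplied by hand here (SciPy's boxcox() estimates it when it is not given), and the body of invert_boxcox() is a plausible reading of the helper named in the text, not the thesis code.

```python
import math


def boxcox_transform(x, lam):
    """Box-Cox transform of one positive value for a given lambda."""
    if x <= 0:
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1.0) / lam


def invert_boxcox(y, lam):
    """Map a transformed value back to the original scale."""
    if lam == 0:
        return math.exp(y)
    return (lam * y + 1.0) ** (1.0 / lam)


# Hypothetical PM2.5 readings and an assumed lambda, for illustration only.
pm25 = [35.0, 80.0, 150.0, 310.0]
lam = 0.3
transformed = [boxcox_transform(v, lam) for v in pm25]
restored = [invert_boxcox(t, lam) for t in transformed]
```

Reusing the lambda estimated on the training data for the test data keeps both sets on the same scale.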

• Residuals are normally distributed.

• Homogeneity of variances.

• Observations are sampled independently from each other.
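The F statistic at the heart of a one-way ANOVA can be computed as in the sketch below. The yearly readings are hypothetical, and the p-value is left out because it requires the F distribution; in practice a library routine such as SciPy's f_oneway returns both the statistic and the p-value.

```python
def one_way_anova_f(groups):
    """Return (F statistic, df between groups, df within groups)."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Variation of the observations around their own group mean.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)
    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical transformed PM2.5 values for the same period in three years.
year_2018 = [5.0, 6.0, 7.0]
year_2019 = [4.0, 5.0, 6.0]
year_2020 = [1.0, 2.0, 3.0]
f_stat, df_between, df_within = one_way_anova_f([year_2018, year_2019, year_2020])
```

A large F value means the differences between the yearly means are large relative to the spread within each year.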

2.4.3. PERMUTATION TEST

In the permutation test, we assess whether the pollutant in question comes from two different distributions. We want to check whether being in lockdown affected the level of pollutants recorded. Therefore, we create an extra Boolean column in the transformed data indicating whether Delhi was in lockdown on that date. We use March 19th as the date on which the lockdown started. We compare the data for when Delhi was in lockdown with the data for when it was not. To do this, we take the average AQI level of each group (outside lockdown and during lockdown) and use the difference of their means as the test statistic. Computed on the actual data, this is called the observed test statistic. Next, we simulate data under the null hypothesis (Good et al., 2013), which is that being in lockdown does not affect the AQI levels of the pollutants. To do this, we shuffle the Boolean lockdown column so that the labels are assigned to random data points, and we calculate the difference in means of the two groups once again. We repeat this process 1000 times to accumulate 1000 simulated test statistics formed under the null hypothesis. Finally, we calculate a p-value as the proportion of simulated test statistics that are greater than or equal to the observed test statistic (difference in means). We use a significance level of 0.05 to determine whether there is a statistically significant difference between the two groups. To help visualize this process, we plot a histogram of the distribution of the test statistic simulated under the null hypothesis. The red dot on the histogram represents where the observed test statistic lies. The only assumption of the permutation test is that the observations are sampled independently from each other, which we already know is true.
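The procedure described above can be sketched as follows. The AQI values are hypothetical, the random seed is fixed only to make the example reproducible, and the statistic is taken as the mean outside lockdown minus the mean during lockdown so that a pollution drop gives a positive value.

```python
import random


def mean(values):
    return sum(values) / len(values)


def permutation_test(values, in_lockdown, n_permutations=1000, seed=0):
    """Return (observed difference in means, one-sided p-value)."""
    def statistic(labels):
        inside = [v for v, flag in zip(values, labels) if flag]
        outside = [v for v, flag in zip(values, labels) if not flag]
        return mean(outside) - mean(inside)

    observed = statistic(in_lockdown)
    rng = random.Random(seed)
    labels = list(in_lockdown)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(labels)          # break any link between label and value
        if statistic(labels) >= observed:
            hits += 1
    return observed, hits / n_permutations


# Hypothetical daily AQI values: eight days outside and eight during lockdown.
aqi_values = [150, 162, 148, 171, 155, 166, 158, 149,
              92, 101, 88, 97, 105, 90, 95, 99]
lockdown_flags = [False] * 8 + [True] * 8
observed, p_value = permutation_test(aqi_values, lockdown_flags)
```

If p_value is below 0.05, the difference between the two groups would be judged statistically significant at the level used in this study.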

2.5. MACHINE LEARNING CONCEPTS USED

2.5.1. LINEAR REGRESSION

Linear regression is useful for finding the relationship between two continuous variables. One is the predictor or independent variable, and the other is the response or dependent variable. It looks for a statistical relationship, not a deterministic one. The core idea is to obtain the line that best fits the data. The best-fit line is the one for which the total prediction error (over all data points) is as small as possible. In this section, we take a probabilistic approach and explicitly model noise using a likelihood function. Thereafter, we find the optimal parameters θ of the model using techniques such as Maximum Likelihood Estimation and Maximum a Posteriori Estimation, and analyze ways to reduce overfitting in the model (Montgomery et al., 2021).
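For a single predictor, the best-fit line has a closed-form solution, sketched below with hypothetical data. Under the common assumption of Gaussian noise, this least-squares solution coincides with the maximum likelihood estimate mentioned above.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for one predictor."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept


# Hypothetical predictor (for example wind speed) and response values.
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(x, y)
```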

2.5.2. LASSO REGRESSION

The acronym “LASSO” stands for Least Absolute Shrinkage and Selection Operator. Lasso regression is a type of shrinkage-based linear regression. Shrinkage means that estimates are pulled towards a central point; in the lasso, the L1 penalty shrinks the coefficient estimates towards zero and can set some of them exactly to zero. The lasso therefore encourages simple, sparse models (i.e., models with fewer parameters) and is well suited to data with a high level of multicollinearity (Ranstam et al., 2021).

2.5.3. DECISION TREE REGRESSION

The main objective of decision trees is to produce a predictive model for the values of the outcome variable using simple decision rules derived from the features of the dataset. The classification and regression trees (CART) algorithm builds binary trees by choosing, at each node, the feature and threshold that yield the largest information gain (Loh et al., 2014).

2.5.4. RANDOM FOREST REGRESSION

The random forest ensures that every tree in the ensemble is generated from a sample drawn with replacement from the training set. It is a bagging technique, not a boosting technique. It constructs a multitude of decision trees at training time and outputs the mean prediction of the individual trees in the case of regression, or the mode of the classes in the case of classification (Liaw et al., 2002).

2.5.5. GRADIENT BOOSTING REGRESSION

The generalization of boosting to an arbitrary differentiable loss function is called Gradient Boosted Regression Trees (GBRT). It produces a prediction model in the form of an ensemble of weak prediction models, such as decision trees. It is an effective method that can be used for classification as well as for regression problems (Friedman et al., 2001).

2.5.6. MULTILAYER PERCEPTRON REGRESSION

The Multilayer Perceptron (MLP) regressor trains iteratively: at each step, the partial derivatives of the loss function with respect to the model parameters are computed and used to update the parameters. A regularization term can also be added to the loss function to prevent overfitting by shrinking the model parameters (Murtagh et al., 1991).
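The idea of building an ensemble from weak learners can be illustrated with the toy sketch below, which boosts one-split regression trees ("stumps") on a single feature under squared-error loss, where the negative gradient is simply the residual. It is not the model used in the thesis; the data, the number of rounds, and the learning rate are arbitrary choices made for the example.

```python
def fit_stump(x, residuals):
    """Best single-threshold split of a 1-D feature under squared error."""
    best = None
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    if best is None:
        raise ValueError("need at least two distinct feature values")
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean


def gradient_boost(x, y, n_rounds=50, learning_rate=0.1):
    """Fit an additive model of stumps to the residuals, round by round."""
    base = sum(y) / len(y)                 # initial constant prediction
    predictions = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        # For squared error, the negative gradient is the residual.
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        predictions = [pi + learning_rate * stump(xi)
                       for xi, pi in zip(x, predictions)]
    return lambda xi: base + learning_rate * sum(s(xi) for s in stumps)


def mse(y_true, y_pred):
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)


# Hypothetical one-feature data with a step-like pattern.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [5.0, 6.0, 7.0, 20.0, 21.0, 22.0, 40.0, 41.0]
model = gradient_boost(x, y)
baseline_mse = mse(y, [sum(y) / len(y)] * len(y))
boosted_mse = mse(y, [model(xi) for xi in x])
```

Each round corrects part of the error left by the previous rounds, which is why the training error of the ensemble falls below that of the constant baseline.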

2.5.7. IMPLEMENTATION

We applied the above-mentioned regression models to analyze the drop in PM2.5 concentration in the Indian cities included in our dataset: Delhi, Ahmedabad, Chennai, Mumbai, and Bengaluru. The data were pre-processed as described in the data acquisition step. The regression models were then fitted after splitting the dataset into a training set (80% of the whole dataset) and a testing set (20% of the whole dataset), using the features available for each city as inputs.
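
An 80/20 split of this kind can be sketched as follows. Whether the rows were shuffled before splitting is not stated in this thesis, so the shuffling step, the seed, and the function name are assumptions made for illustration.

```python
import random

def split_80_20(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and hold out the given fraction for testing."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical row identifiers standing in for the pre-processed observations.
train_rows, test_rows = split_80_20(range(100))
print(len(train_rows), len(test_rows))
```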

2.6. PERFORMANCE METRICS

The effectiveness of the learned predictive models can be evaluated with various performance metrics (Alpaydin, 2020). In an experimental setting, one system is compared to another, which is often treated as the baseline. The hypothesis is then framed as follows: System X will perform better than System Y on a given Task Z as measured by performance metric M. The metrics used in this study are Mean Squared Error (MSE) and R-Squared.

Mean Squared Error (MSE): The mean squared error is the average of the squared differences between the actual values and the predicted values. It is also known as the mean squared deviation (MSD) (Das et al., 2004).

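It is mathematically represented as:

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2$$

where $n$ is the number of data points, $y_i$ are the observed values, and $\hat{y}_i$ are the predicted values.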

R-Squared: R-Squared is the ratio of the Sum of Squares Regression (SSR) to the Sum of Squares Total (SST), where SSR is the amount of variance explained by the regression line. It measures the goodness of fit: the higher the R-Squared score, the better the regression model fits the data (Miles, 2014).
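
Both metrics can be computed in a few lines. The sketch below uses the equivalent form R-Squared = 1 - SSE/SST, which coincides with SSR/SST for a least-squares fit with an intercept. The function names and the four example values are hypothetical.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between observed and predicted values."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)

def r_squared(y_true, y_pred):
    """1 - SSE/SST: the share of the total variance explained by the predictions."""
    y_bar = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - y_bar) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical observed and predicted values.
observed = [3, 5, 7, 9]
predicted = [2, 5, 8, 9]
print(mean_squared_error(observed, predicted))  # 0.5
print(r_squared(observed, predicted))           # 0.9
```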

2.7. TOOLS USED

Table 2.7: Tools used

2.8. ETHICAL ISSUES

The data for this thesis were extracted from the World Air Quality Index project, whose mission is to promote air pollution awareness among citizens and to provide unified air quality information. After registering with AQICN using the official Dalarna University e-mail, the combined air quality and meteorological data became available for download. The data are protected by law and available for any public research purpose. There was no conflict of interest related to the study.

RESULTS

In this chapter, the results of the analysis are presented in two parts:

• The drop of pollution estimated from monitoring stations in selected cities around the world.

• Case study approach to the drop of pollution estimated in India by considering various methods and influence factors.

3.1. DROP IN POLLUTION ESTIMATED IN SELECTED CITIES AROUND THE WORLD.

This section gives a holistic view of how pollutant levels dropped in selected cities around the world during the COVID-19 lockdown.

Figure 3.1.1: Drop in pollution in Wuhan, China due to Covid-19 lockdown

The line graph, bar charts, and stacked area graphs in Figure 3.1.1 show that pollution dropped during the lockdown in Wuhan, China.

Figure 3.1.2: Drop in pollution in Delhi, India due to Covid-19 lockdown

The line graph, bar charts, and stacked area graphs in Figure 3.1.2 show that pollution dropped during the lockdown in Delhi, India.

Figure 3.1.3: Drop in pollution in London, the United Kingdom due to Covid-19 lockdown

The line graph, bar charts, and stacked area graphs in Figure 3.1.3 show that pollution dropped during the lockdown in London, the United Kingdom.

Figure 3.1.4: Geographic distribution of AQI change in selected cities around the world during the Covid-19 lockdown period

The map in Figure 3.1.4 shows how the AQI (Air Quality Index) changed during the lockdown period. The AQI improved by 18% in Wuhan, 43% in New Delhi, 12% in London, and 15% in Diepkloof.

3.2. ESTIMATION OF POLLUTION DROP IN INDIA BY CONSIDERING VARIOUS INFLUENCE FACTORS.

The previous analysis shows that the drop in pollution was larger in India than in the other countries considered. India is also one of the ten most polluted countries in the world (Statista, 2021). For these reasons, we selected India for further analysis.

3.2.1. ANALYSIS OF POLLUTANT LEVELS IN INDIA DURING 2015-2020

Figure 3.2.1.1: Visualizing pollutants(NO2, SO2) trend and seasonality maps

Figure 3.2.1.2: Visualizing pollutants(PM2.5, PM10) trend and seasonality maps

From the yearly and monthly plots in Figure 3.2.1.1 and Figure 3.2.1.2, we can make the following observations:

• There is a clear trend that the pollution level in India falls in July and August. This might be mainly because the monsoon season sets in during these months.

• The pollution level then starts rising and reaches its highest levels in the winter months. It is during these months that a lot of crop residue burning takes place, especially in the northern parts of India.

• The SO2 level started increasing after 2017, although it had also seen a sudden rise in 2015.

• The median values for 2020 are generally lower than in other years, which suggests that there might have been a recent reduction in pollution.

3.2.2. CORRELATION BETWEEN WEATHER FACTORS AND POLLUTANTS

Prior research shows that weather factors affect pollution levels (Dastoorpoor et al., 2016), so we examined which of those factors are strongly correlated with the pollutants. This allows us to check whether there was a significant change in those highly correlated weather factors during the 2020 lockdown period. This is important because a change in weather, rather than the lockdown regulations themselves, may be the reason for a change in pollution levels.

Figure 3.2.2.1: Correlation between Weather factors and Pollutants

In Figure 3.2.2.1, the heatmap displays high positive correlations in dark red and high negative correlations in dark blue; hence, the stronger the correlation, the darker the colour. From the correlation table and heatmap, we can see that some pollutants have a notable correlation with certain weather factors. For example, the highest correlation we see is about 0.31 with temperature. O3, on the other hand, has a fairly strong positive correlation with temperature, measuring about 0.55. Finally, NO2 and CO have a notable negative correlation with wind speed, measuring about -0.45.
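
Each cell of such a correlation matrix is a pairwise correlation coefficient. The thesis does not state which coefficient was used, so the sketch below assumes the Pearson coefficient. The function name and the wind-speed and NO2 values are hypothetical.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical daily values: NO2 tends to be lower on windier days.
wind_speed = [1.0, 2.5, 3.0, 4.5, 6.0, 7.5]
no2 = [48.0, 41.0, 44.0, 33.0, 30.0, 22.0]
print(round(pearson_r(wind_speed, no2), 2))
```

A negative value, as in this toy example, means that higher wind speeds go together with lower pollutant readings.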

Figure 3.2.2.2: Weather line graph for PM2.5

Figure 3.2.2.2 shows no notable weather fluctuations that would account for the drop in PM2.5. Looking at the green line, we can see that PM2.5 dropped significantly when the lockdown started (March 19th). None of the three weather lines shows any major change throughout 2020. Hence, we can reasonably assume that weather did not play a role in the drop in PM2.5 pollution.

Figure 3.2.2.3: Weather line graph for NO2

Figure 3.2.2.3 shows no significant change in wind that would account for the drop in NO2. From the correlation heatmap, we know that wind speed has a notable negative correlation with NO2. Hence, if wind speed had gone up during the lockdown period, this could have been a reason why NO2 levels dropped.

However, the graph shows that the purple line (representing wind speed) stays consistent throughout 2020. Hence, we can reasonably assume that weather did not play a role in the drop in NO2 pollution.

Figure 3.2.2.4: Weather line graph for CO

Figure 3.2.2.4 likewise shows no significant change in wind that would account for the drop in CO.

3.2.3. QUANTIFICATION OF POLLUTION DROP IN MAJOR CITIES IN INDIA

Figure 3.2.3.1: Quantification of pollution drop in major cities in India

Figure 3.2.3.1 shows that the major pollutants decreased during the lockdown. The Air Quality Index (AQI) of the major cities in India clearly improved due to the lockdown, as most industries were closed and vehicle usage was very low.

3.2.4. INFERRING THE VARIATIONS OF THE AIR POLLUTANT CONCENTRATIONS USING ANOVA AND PERMUTATION TEST

To test for a change in pollutant levels during Covid-19, we used the ANOVA test, first checking the following assumptions:

• Residuals are normally distributed

• Homogeneity of variances

• Observations are sampled independently from each other

Table 3.2.4.1: Results from ANOVA Test

Table 3.2.4.1 shows that the pollutants NO2, SO2, O3, and CO did not pass the first assumption, so we were unable to perform the test for them. Although the ANOVA test could not be applied to these pollutants, we were still able to reach an important conclusion for PM2.5, for which the second assumption also passed. Since we know that the observations were sampled independently of each other, we performed the test for PM2.5. The results of the ANOVA test for PM2.5 are presented below.

Table 3.2.4.2: ANOVA table for PM2.5

            sum_sq         df      F           PR(>F)
C(year)     9738.447500    2.0     18.257658   8.277684e-08
Residual    39204.145833   147.0   NaN         NaN

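Where,

• sum_sq: sum of squares (the explained variance for C(year)).

• df: degrees of freedom.

• mean_sq: the mean squares, obtained by dividing each sum of squares (SSM and SSR) by its degrees of freedom.

• F: the test statistic, i.e. the ratio of the mean square for C(year) to the residual mean square.

• PR(>F): the p-value associated with F.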

In Table 3.2.4.2, the p-value obtained from the ANOVA test is significant (p < .01), so we can conclude that there are significant differences among the years. To identify which pairs of years differ significantly, we performed a multiple pairwise comparison analysis using the Tukey HSD test.
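
As a consistency check, the F statistic can be recomputed from the sums of squares and degrees of freedom reported in Table 3.2.4.2. The variable names below are chosen for illustration only; the numbers are taken from the table.

```python
# Values copied from Table 3.2.4.2.
ss_year, df_year = 9738.447500, 2.0
ss_residual, df_residual = 39204.145833, 147.0

# Mean squares: each sum of squares divided by its degrees of freedom.
ms_year = ss_year / df_year
ms_residual = ss_residual / df_residual

# F is the ratio of the two mean squares.
f_statistic = ms_year / ms_residual
print(round(f_statistic, 4))  # about 18.26, in line with the F value in the table
```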

Table 3.2.4.3: Multiple pairwise comparison (Tukey HSD)

Group1   Group2   meandiff   p-adj    Lower      Upper      Reject
2018     2019     -10.7451   0.0032   -18.4028   -3.0874    True
2018     2020     -19.8002   0.001    -27.5766   -12.0239   True
2019     2020     -9.0551    0.0179   -16.8315   -1.2788    True

Multiple Comparison of Means - Tukey HSD, FWER (familywise error rate) = 0.05

In Table 3.2.4.3, the Tukey HSD test rejects the null hypothesis for all pairwise comparisons of the years, which indicates statistically significant differences. We can conclude that the 2020 lockdown period has PM2.5 pollution levels that are significantly different from those of the previous two years. We then conducted a Permutation Test, as this test has fewer assumptions to satisfy.

Figure 3.2.4.1: Permutation Test

After running the Permutation Test on each of the four pollutants (Figure 3.2.4.1), we concluded that PM2.5, NO2, and CO have significantly different pollution levels during the lockdown compared with outside the lockdown. These results indicate that the lockdown regulations had an impact on air pollution in Delhi. This could be explained by the many businesses and companies that were forced to shut down under the lockdown regulations. Because of this shutdown, as well as the discouragement of gatherings, fewer people were driving. Therefore, the primary sources of pollution suddenly declined, which most likely contributed to the drastic change in pollution levels.
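
A two-sample permutation test on the difference in means can be sketched as follows. The exact test statistic and number of permutations used in this thesis are not stated, so the absolute difference in means, the 2000 permutations, the function name, and the PM2.5 readings are assumptions made for illustration.

```python
import random

def permutation_test(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided permutation p-value for the difference in group means."""
    rng = random.Random(seed)
    n_a = len(group_a)
    observed = abs(sum(group_a) / n_a - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # randomly reassign the group labels
        a, b = pooled[:n_a], pooled[n_a:]
        if abs(sum(a) / len(a) - sum(b) / len(b)) >= observed:
            extreme += 1
    # Add one to numerator and denominator so the p-value is never exactly zero.
    return (extreme + 1) / (n_permutations + 1)

# Hypothetical PM2.5 readings before and during a lockdown.
before = [62, 58, 65, 70, 61, 59, 66, 63]
during = [35, 31, 38, 40, 33, 29, 36, 34]
print(permutation_test(before, during))
```

The test only assumes that the observations are exchangeable under the null hypothesis, which is why it can be applied where the ANOVA assumptions fail.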

3.2.5. PREDICTION OF DROP IN PM2.5 CONCENTRATION USING MACHINE LEARNING ALGORITHMS

Supervised machine learning regression models were constructed from the available meteorological data (temperature, pressure, dew point, wind direction, wind speed, precipitation) and the date-time information of the time series (represented as year, month, day, hour, and season) for major cities in India. Following this methodology, various regression models were constructed and fine-tuned using the data of all five cities from 2015 to 2019 as the training set, and then tested on the data of the five cities in 2020 (the lockdown period) as the holdout test set.
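
The date-time features and the year-based holdout can be sketched as follows. The mapping from month to season is an assumption (the thesis does not define its seasons), and the function and field names are hypothetical.

```python
from datetime import datetime

# Assumed month-to-season mapping; the definition used in the thesis is not stated.
SEASONS = {12: "winter", 1: "winter", 2: "winter",
           3: "spring", 4: "spring", 5: "spring",
           6: "summer", 7: "summer", 8: "summer",
           9: "autumn", 10: "autumn", 11: "autumn"}

def datetime_features(ts):
    """Represent a timestamp as year, month, day, hour, and season."""
    return {"year": ts.year, "month": ts.month, "day": ts.day,
            "hour": ts.hour, "season": SEASONS[ts.month]}

def split_by_year(rows, test_year=2020):
    """Train on all years before test_year and hold out test_year itself."""
    train = [r for r in rows if r["year"] < test_year]
    test = [r for r in rows if r["year"] == test_year]
    return train, test

# Hypothetical timestamps.
stamps = [datetime(2015, 1, 1, 0), datetime(2019, 7, 15, 12), datetime(2020, 3, 19, 14)]
rows = [datetime_features(ts) for ts in stamps]
train_rows, test_rows = split_by_year(rows)
print(len(train_rows), len(test_rows))  # 2 1
```

Holding out the most recent year keeps any information from the lockdown period out of the training data.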

The results obtained from the grid search optimized regression models are summarized below:

Table 3.2.5.1: Comparison of various regression models on major cities in India

Model                         R2_train   R2_test    MSE_train   MSE_test
Linear Regression             0.237894   0.205384   3802.51     2893.23
LASSO Regression              0.234133   0.209052   3821.28     2879.87
Ridge Regression              0.235065   0.213904   3816.63     2862.21
Random Forest Regressor       0.972667   0.417547   136.377     2120.73
Gradient Boosting Regressor   0.693764   0.443874   1527.96     2024.88
KNeighbors Regressor          1          0.346882   0           2378.03
MLP Regressor                 0.523008   0.388481   2379.94     2226.57

Figure 3.2.5.1: Performance of the best model and feature importance ranking on major cities in India

Table 3.2.5.1 and Figure 3.2.5.1 summarize the machine learning models constructed from the available meteorological data (temperature, pressure, dew point, wind direction, wind speed, precipitation):

• The simpler linear models (Linear Regression, LASSO, and Ridge) only achieved an R-squared of about 0.21 on the test set.

• Among the more advanced models, the KNeighbors Regressor achieved an R-squared of 0.35.

• Ensemble methods, such as the Random Forest Regressor and the Gradient Boosting Regressor, achieved an R-squared of 0.42-0.44. In addition, the neural-network-based MLP Regressor was tried and achieved an R-squared of 0.39.
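
The gap between training and test R-squared in Table 3.2.5.1 gives a rough indication of overfitting. The short sketch below ranks the models using the values from the table. The variable names are illustrative, and reading the gap as an overfitting indicator is an interpretation added here rather than a result reported in the thesis.

```python
# (R2_train, R2_test) pairs copied from Table 3.2.5.1.
results = {
    "Linear Regression": (0.237894, 0.205384),
    "LASSO Regression": (0.234133, 0.209052),
    "Ridge Regression": (0.235065, 0.213904),
    "Random Forest Regressor": (0.972667, 0.417547),
    "Gradient Boosting Regressor": (0.693764, 0.443874),
    "KNeighbors Regressor": (1.0, 0.346882),
    "MLP Regressor": (0.523008, 0.388481),
}

# Generalization gap: how much R-squared is lost between training and test data.
gaps = {name: train - test for name, (train, test) in results.items()}

best_on_test = max(results, key=lambda name: results[name][1])
largest_gap = max(gaps, key=gaps.get)

print(best_on_test)  # Gradient Boosting Regressor
print(largest_gap)   # KNeighbors Regressor
```

The Gradient Boosting Regressor has the highest test score, while the KNeighbors Regressor and the Random Forest Regressor fit the training data almost perfectly but lose most of that fit on the 2020 holdout.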

Table 3.2.5.2: Comparison of various regression models on the individual city (Delhi)

Model                         R2_train   R2_test    MSE_train     MSE_test
Linear Regression             0.320638   0.273828   5470.79979    6169.28494
LASSO Regression              0.317638   0.265134   5494.9569     6243.14661
Ridge Regression              0.301852   0.251477   5622.075407   6359.16548
Random Forest Regressor       0.975413   0.489435   197.99461     4337.571074
Gradient Boosting Regressor   0.849636   0.504908   1210.858609   4206.114009
KNeighbors Regressor          1          0.323342   0             5748.634294
MLP Regressor                 0.472161   0.388163   4250.601153   5197.938513

Figure 3.2.5.2: Performance of the best model and feature importance ranking for the individual city (Delhi)

Table 3.2.5.2 and Figure 3.2.5.2 show that, among all meteorological parameters, wind speed (ws), temperature (temp), humidity (humi), and dew point (dewp) are the top influencers; among all datetime-related parameters, the month is the most important factor, followed by the hour and the day.

• In addition, separate machine learning models were built for individual cities. By separating out the 'city' feature into individual models, R-squared improved from 0.4-0.45 to 0.45-0.5. However, further improvements are hard without additional feature engineering.

• These results suggest that time-related information and weather conditions can explain the variations in air quality (PM2.5 value) only to a limited extent. Other underlying factors are driving the PM2.5 trends and variations.

3.2.6. A CLOSER LOOK AT THE RELATIONSHIP BETWEEN PM2.5 AND WEATHER FACTORS

Figure 3.2.6.1: Correlation between PM2.5 and weather factors

Figure 3.2.6.2: Relationship between PM2.5 and weather factors using scatter plot

Comparing the above graphs (Figure 3.2.6.1 and Figure 3.2.6.2) with the feature importance ranking obtained from the regression models, we can conclude that PM2.5 has a positive correlation with pressure and humidity.

Temperature: There is a negative correlation between PM2.5 and temperature. Higher PM2.5 levels are mostly associated with temperatures below 10 degrees Celsius (cold weather).

Dew point: As with temperature, there is a negative correlation between PM2.5 and dew point. Higher PM2.5 levels are mostly associated with dew points below 5 degrees Celsius. This is expected, as temperature and dew point are strongly positively correlated.

Pressure: There is a positive correlation between PM2.5 and pressure. Higher PM2.5 levels are mostly associated with higher atmospheric pressures. This is also expected as pressure and temperature are negatively correlated.

Humidity: There is a weak positive correlation between PM2.5 readings and humidity, but it is still statistically significant. Higher PM2.5 levels are more likely to occur at higher humidity levels.

DISCUSSION

4.1. MAIN FINDINGS

This study presented and evaluated methods for assessing the influence of the COVID-19 lockdown on air quality, using machine learning regression algorithms to build predictive models. The thesis addressed various technical challenges encountered while using the air quality data publicly available from the World Air Quality Index project. Initially, we performed a descriptive analysis of how the Covid-19 lockdown procedures impacted the air quality in selected cities around the world, i.e. New Delhi, Diepkloof, Wuhan, and London. The results show that the air quality index (AQI) improved by 43% in New Delhi, 18% in Wuhan, 15% in Diepkloof, and 12% in London during the initial lockdown from 19 March 2020 to 31 May 2020, compared with the same period in previous years. Furthermore, we conducted a deeper analysis of the concentrations of four main pollutants, i.e., NO2, CO, SO2, and PM2.5, during the lockdown in India. The quantification of the pollution drop is supported by statistical tests, namely the ANOVA Test and the Permutation Test. Overall, decreases of 58%, 61%, 18%, and 55% were observed in NO2, CO, SO2, and PM2.5 concentrations, respectively. To check whether a change in weather played any role in the reduction of pollution levels, we analyzed how the weather factors are correlated with the pollutants using a correlation matrix.

Finally, seven machine learning regression models were constructed to assess the lockdown impact on air quality in India by incorporating weather data. We evaluated the models using the performance metrics MSE (mean squared error) and R-squared. Gradient Boosting performed well in predicting the drop in PM2.5 concentration for individual cities in India. The feature importance ranking obtained from the regression models is supported by the correlations between the weather factors and PM2.5.

4.2. ADDRESSING RESEARCH QUESTION(S)

With respect to the study's aim and objectives, this study provides answers to the research questions by developing descriptive analyses, statistical significance tests, and predictive models built with various supervised machine learning regression algorithms. Although the results were comparatively good in terms of the predictive strength of the models, there are certain limitations behind these figures, which are highlighted in the limitations section.

4.3. DISCUSSION ON METHODS

In this study, an experimental approach was used to identify any impact of the COVID-19 lockdown on urban air quality by applying various machine learning algorithms and comparing the prediction models using statistical methods. This methodology was successful in addressing the research questions, and numerical illustrations were used to validate the theoretical merits. It also led to the final step, the written thesis.

4.4. DISCUSSION ON RESULTS

Seven machine learning regression models were constructed to assess the lockdown impact on air quality in India by incorporating weather data. We evaluated the models using the performance metrics MSE (mean squared error) and R-squared. Gradient Boosting performed well in predicting the drop in PM2.5 concentration for individual cities in India.

In addition, separate machine learning regression models were built for individual cities. By separating out the 'city' feature into individual models, R-squared improved from 0.4-0.45 to 0.45-0.5. However, further improvements are hard without additional feature engineering. These results suggest that time-related information and weather conditions can explain the variations in air quality (PM2.5 value) only to a limited extent. Other underlying factors are driving the PM2.5 trends and variations.

Aditya et al. (2018) used logistic regression to detect whether a data sample is polluted or not, and autoregression to predict future values of PM2.5 based on previous PM2.5 readings, and thereby predicted air quality. In another research work, "Comparative Analysis of Machine Learning Techniques for Predicting Air Quality in Smart Cities" (Ameer et al., 2019), the authors considered a dataset of five Chinese cities, namely Beijing, Shanghai, Shenyang, Guangzhou, and Chengdu. They compared different regression techniques and evaluated them based on Mean Absolute Error (MAE) and Root Mean Square Error (RMSE).

Regression techniques such as Decision Tree Regression, Random Forest Regression, Gradient Boosting Regression, and Artificial Neural Network Multilayer Perceptron Regression were used to address this complex problem. In the recent publication "Assessing the COVID-19 Impact on Air Quality: A Machine Learning Approach" (Rybarczyk & Zalakeviciute, 2021), machine learning models based on a Gradient Boosting Machine algorithm were built to assess the impact of the outbreak on air quality in Quito, Ecuador. Compared with the studies in the literature, our study has given positive results, with scope to improve accuracy by adding new features.

4.5. LIMITATIONS

When separate machine learning regression models were constructed for individual cities in India, by separating the 'city' feature into individual models, R-squared improved from 0.4-0.45 to 0.45-0.5 only for a few cities. We found that this mismatch in model performance was due to the availability of data for Ahmedabad and Bengaluru. When a model appears to overfit a particular dataset, it is because the test harness is not as robust as it should be, not because of hill-climbing on the test dataset. In addition, results can vary between runs because of stochastic elements such as random initialization, shuffling of the data, and different sequences of random numbers. Further improvements are hard without additional feature engineering. These results suggest that time-related information and weather conditions can explain the variations in air quality (PM2.5 value) only to a limited extent. Other underlying factors are driving the PM2.5 trends and variations.

CONCLUSION

In summary, this study set out to analyze the data effectively and efficiently and to apply machine learning regression algorithms to construct prediction models. The results of the experiments contributed towards investigating the four objectives of this study and answering the research questions of the thesis.

In short, this thesis provided methods for assessing the impact of the COVID-19 lockdown on urban air quality by applying different machine learning concepts to a major environmental problem. In this work, we analyzed an air quality dataset consisting of pollutant concentrations and meteorological variables for different states and cities around the world, taken over different time frames. The thesis also contributes in part towards an improved way of assessing the COVID-19 impact on air quality.

The first objective was to estimate the pollution drop in selected cities around the world. To address it, we performed a descriptive analysis of how the Covid-19 lockdown procedures impacted the air quality in selected cities around the world, i.e. New Delhi, Diepkloof, Wuhan, and London. The results show that the air quality index (AQI) improved by 43% in New Delhi, 18% in Wuhan, 15% in Diepkloof, and 12% in London during the initial lockdown from 19 March 2020 to 31 May 2020, compared with the same period in previous years.

The second objective was to estimate the correlation between weather factors and pollutants. To address it, we produced the correlation matrix between pollutants and weather factors. O3 has a positive correlation with temperature, while NO2 and CO have a notable negative correlation with wind speed. On the other hand, PM2.5 has a positive correlation with pressure and humidity.

The third objective was to infer the variations in air pollutant concentrations using ANOVA and the Permutation Test. After running the permutation tests on each of the four pollutants, we concluded that PM2.5, NO2, and CO have significantly different pollution levels during the lockdown compared with outside the lockdown.

The fourth objective was to predict the individual pollutant concentrations using machine learning regression models. To address it, machine learning regression models were constructed to assess the lockdown impact on air quality in India by incorporating weather data. We evaluated the models using the performance metrics MSE (mean squared error) and R-squared. Gradient Boosting performed well in predicting the drop in PM2.5 concentration for individual cities in India.

In the long run, the predictive model could be implemented in the Central Pollution Control Board system, where it could help government authorities set rules and regulations for industries on the emission of dangerous pollutants and, subsequently, improve air quality and public health.

FUTURE DIRECTIONS

This study suggests that it may be possible to predict the concentration drop in individual pollutants using Artificial Neural Networks (ANN). To forecast PM2.5 concentrations, the extracted features could be used to build a deep ensemble network (EN) model that integrates recurrent neural network (RNN), long short-term memory (LSTM), and gated recurrent unit (GRU) networks.

REFERENCES

Arulprakasajothi, M., Chandrasekhar, U., Yuvarajan, D., & Teja, M. B. (2020). An analysis of the implications of air pollutants in Chennai. International Journal of Ambient Energy, 41(2), 209-213.

Athanasiadis, I. N., Kaburlasos, V. G., Mitkas, P. A., & Petridis, V. (2003). Applying machine learning techniques on air quality data for real-time decision support. In First Int. Symp. Inf. Technol. Environ. Eng (pp. 2-7).

Aditya, C. R., Deshmukh, C. R., Nayana, D. K., & Vidyavastu, P. G. (2018). Detection and prediction of air pollution using machine learning models. International Journal of Engineering Trends and Technology (IJETT), 59(4).

Ameer, S., Shah, M. A., Khan, A., Song, H., Maple, C., Islam, S. U., & Asghar, M. N. (2019). Comparative analysis of machine learning techniques for predicting air quality in smart cities. IEEE Access, 7, 128325-128338.

Alpaydin, E. (2020). Introduction to machine learning. MIT press.

Bhattacharya, M. (2015). Bioclimatic modelling: a machine learning perspective. Innovations and advances in computing, informatics, systems sciences, networking and engineering, 413-421.

Carbajal-Hernández, J. J., Sánchez-Fernández, L. P., Carrasco-Ochoa, J. A., & Martínez-Trinidad, J. F. (2012). Assessment and prediction of air quality using fuzzy logic and autoregressive models. Atmospheric Environment, 60, 37-50.

Cuevas, A., Febrero, M., & Fraiman, R. (2004). An anova test for functional data. Computational statistics & data analysis, 47(1), 111-122.

Dalipi, F., Yildirim Yayilgan, S., & Gebremedhin, A. (2016). Data-driven machine-learning model in district heating system for heat load prediction: A comparison study. Applied Computational Intelligence and Soft Computing, 2016.

Das, K., Jiang, J., & Rao, J. N. K. (2004). Mean squared error of empirical predictor. Annals of Statistics, 32(2), 818-840.

Dastoorpoor, M., Idani, E., Khanjani, N., Goudarzi, G., & Bahrampour, A. (2016). Relationship between air pollution, weather, traffic, and traffic-related mortality. Trauma monthly, 21(4).

Friedman, J. H. (2001). Greedy function approximation: a gradient boosting machine. Annals of statistics, 1189-1232.

Good, P. (2013). Permutation tests: a practical guide to resampling methods for testing hypotheses. Springer Science & Business Media.

Hoq, M. N., Alam, R., & Amin, A. (2019, February). Prediction of possible asthma attack from air pollutants: Towards a high density air pollution map for smart cities to improve living. In 2019 International Conference on Electrical, Computer and Communication Engineering (ECCE) (pp. 1-5). IEEE.

Hanson, W. E., Creswell, J. W., Clark, V. L. P., Petska, K. S., & Creswell, J. D. (2005). Mixed methods research designs in counseling psychology. Journal of counseling psychology, 52(2), 224.

Ilvessalo, P. (1995). A new method for calculation of an air quality index.

Jain, V., Goel, M., Maity, M., Naik, V., & Ramjee, R. (2018, January). Scalable measurement of air pollution using COTS IoT devices. In 2018 10th international conference on communication systems & networks (COMSNETS) (pp. 553-556). IEEE.

Kandlikar, M., & Ramachandran, G. (2000). The causes and consequences of particulate air pollution in urban India: a synthesis of the science. Annual review of energy and the environment, 25(1), 629-684.

Kalapanidas, E., & Avouris, N. (1999, September). Applying machine learning techniques in air quality prediction. In Proc. ACAI (Vol. 99, pp. 58-64).

Kurt, A., & Oktay, A. B. (2010). Forecasting air pollutant indicator levels with geographic models 3 days in advance using neural networks. Expert Systems with Applications, 37(12), 7986-7992.

Landrigan, P. J. (2017). Air pollution and health. The Lancet Public Health, 2(1), e4-e5.

Loh, W. Y. (2014). Classification and regression tree methods. Wiley StatsRef: Statistics Reference Online.

Liaw, A., & Wiener, M. (2002). Classification and regression by randomForest. R news, 2(3), 18-22.

Montgomery, D. C. (2017). Design and analysis of experiments. John wiley & sons.

Montgomery, D. C., Peck, E. A., & Vining, G. G. (2021). Introduction to linear regression analysis. John Wiley & Sons.

Murtagh, F. (1991). Multilayer perceptrons for classification and regression. Neurocomputing, 2(5-6), 183-197.

Miles, J. (2014). R squared, adjusted R squared. Wiley StatsRef: Statistics Reference Online.

Pérez-Martínez, P. J., Miranda, R. M., Nogueira, T., Guardani, M. L., Fornaro, A., Ynoue, R., & Andrade, M. F. (2014). Emission factors of air pollutants from vehicles measured inside road tunnels in São Paulo: case study comparison. International Journal of Environmental Science and Technology, 11(8), 2155-2168.

Piepho, H. P. (2009). Data transformation in statistical analysis of field trials with changing treatment variance. Agronomy Journal, 101(4), 865-869.

Rybarczyk, Y., & Zalakeviciute, R. (2021). Assessing the COVID‐19 impact on air quality: A machine learning approach. Geophysical Research Letters, 48(4), e2020GL091202

Ranstam, J., & Cook, J. A. (2018). LASSO regression. Journal of British Surgery, 105(10), 1348-1348.

Shaban, K. B., Kadri, A., & Rezk, E. (2016). Urban air pollution monitoring system with forecasting models. IEEE Sensors Journal, 16(8), 2598-2606.

Statista. (2021, March 18). Average PM2.5 levels in most polluted countries worldwide 2019–2020. https://www.statista.com/statistics/1135356/most-polluted-countries-in-the-
