Assessment Based on Indicators of the Sustainable Development Goals in Spain: A Data Science Approach

Degree project in Computer Science and Engineering, second cycle, 30 credits

Stockholm, Sweden 2020

CARLOS DE MIGUEL RAMOS

KTH ROYAL INSTITUTE OF TECHNOLOGY

Master of Science Thesis TRITA-ITM-EX 2020:167 KTH Industrial Engineering and Management

Machine Design

SE-100 44 STOCKHOLM

Master of Science Thesis TRITA-ITM-EX 2020:167

Approved: 2020-06-12
Examiner: Sofia Ritzén
Supervisor: Rafael Laurenti

Abstract

Global sustainable development has been shaped by United Nations plans for more than two decades. These plans have been adopted by most developed and developing countries to achieve the 2030 Agenda, currently formed by the 17 Sustainable Development Goals (SDGs). Governments and policy-makers cannot make informed decisions regarding sustainability progress without knowing how well the country is performing along this path.

This study assessed the evolution of each SDG in Spain through its indicators, and whether correlation and dependency exist between the stated targets. Goals 1, 2, 6, 8 and 11 were the least developed: they were progressing more slowly or showed a negative evolution over the years. The correlation analysis delivered a quick guide to the relationships among targets, to help the appropriate ministries make prompt decisions knowing which fields will be most affected. Goal 3 (Good health and well-being) was strongly linked with indicators from Goal 4 (Quality education) and also Goal 6 (Clean water and sanitation). Furthermore, indicators from Goal 7 (Affordable and clean energy) shared a high correlation with those from Goal 12 (Responsible consumption and production) and Goal 15 (Life on land). Overall, positive interactions had a 60% share and almost 80% of the interplays between the targets were significant.

Correlation does not imply causality, so multiple linear regression analysis was used to establish numerical relationships and to reveal how certain targets could be enhanced by leveraging others. The less developed indicators were taken as dependent variables, and the final independent variables were defined using shrinkage methods. The procedure used to reach these expressions could be applied to establish the dependency between other relevant indicators and subsequently to assess the performance of the country.

Keywords: SDG, correlation, regression

Sammanfattning

Den globala hållbara utvecklingen har präglats av FN:s planer i mer än två decennier.

Dessa planer har antagits av de flesta av de utvecklade länderna och utvecklingsländerna för att uppnå agenda 2030, som för närvarande bildas av de 17 globala målen för hållbar utveckling (SDG). Regeringar och beslutsfattare kan inte fatta medvetna beslut om hållbarhetsframsteg utan kunskap om hur väl landet presterar denna väg.

Denna studie undersökte utvecklingen av varje SDG i Spanien genom deras indikatorer och huruvida korrelation och beroende finns mellan de angivna målen. Mål 1, 2, 6, 8 och 11 var de mindre utvecklade. De genomgick en långsammare process eller hade negativ utveckling under åren. Korrelationsanalysen levererade en snabb guide över relationer förhållandet bland mål för att hjälpa de berörda ministerierna att fatta snabba beslut om att veta vilka områden som i hög grad kommer att påverkas. Mål 3 (God hälsa och välbefinnande) var starkt kopplat till indikatorer från mål 4 (Kvalitetsutbildning) och även mål 6 (Rent vatten och sanitet). Dessutom hade indikatorer från mål 7 (prisvärd och ren energi) en hög korrelation med indikatorer från mål 12 (Ansvarsfull konsumtion och produktion) och mål 15 (Liv på land). Tillsammans erhöll 60% positiva interaktioner och nästan 80% betydande samspel mellan målen.

Korrelation innebär inte orsakssamband, så flera linjära regressionsanalyser satte riktiga numeriska förhållanden och avslöjade hur man kan förbättra vissa mål genom att utnyttja andra. Mindre utvecklade indikatorer togs som beroende variabler och de slutliga oberoende variablerna definierades med krympningsmetoder. Tillvägagångssättet för att nå dessa uttryck kan användas för att fastställa beroendet mellan andra relevanta indikatorer och få en utvärdering av landets resultat.

Nyckelord: SDG, korrelation, regression

Acknowledgements

This study is presented as my final master’s degree project in Innovation Management and Product Development. First, I wish to express my gratitude to my supervisor, Rafael Laurenti, for supporting and guiding me in this unfamiliar subject. Second, I would like to give special thanks to my family, who allowed me to live abroad for this period and learn as much as I did. Last, I would like to thank Jenny Janhager for helping me out with the Swedish translation, as well as those students who gave me feedback and very helpful tips to keep working on this project.

I would like to recognize the invaluable opportunity that my home university (Universidad Politécnica de Madrid, UPM) offered me by allowing me to study this academic year abroad, especially at the Royal Institute of Technology of Stockholm, which gave me the chance to get involved in this study.

Carlos de Miguel Ramos

Stockholm, June 2020

List of Figures

1.1 Sustainable Development Goals logo.
2.1 Bias-Variance Trade-off.
3.1 Number of registered observations per each goal.
3.2 Completed share per SDG and average line, 62.5%.
3.3 Map of interactions presented by Nilsson et al., 2016.
3.4 Representation of the growth each goal has had between 2000 and 2019, compared to their initial value.
4.1 Share of correlation ranges considering the Spearman’s correlation coefficient.
4.2 Correlation matrix using the Spearman’s correlation coefficient, significant values only.
4.3 Observed synergies, trade-offs and non-classified interactions between targets at the goal level. Share in (%).
4.4 Observed synergies and trade-offs globally for the entire dataset. Share in (%).
4.5 Observed synergies and trade-offs between targets at the target level. Share in (%).
4.6 The standardized lasso coefficients on the Target 8.6 as a function of λ.
4.7 Plots of the R² values for training (blue) and testing (red) data as a function of λ. The green line represents the average Mean Squared Error obtained by cross-validation. The black line is the optimal tuning parameter λ for the lasso regression.
4.8 The standardized ridge coefficients on the Target 8.6 as a function of λ.
4.9 Plots of the R² values for training (blue) and testing (red) data as a function of λ. The green line represents the average Mean Squared Error obtained by cross-validation. The black line is the tuning parameter λ for the ridge regression.
4.10 The standardized coefficients for the Target 8.6 acting as a dependent variable, as a function of α on the Elastic Network Regression.
4.11 Plots of the R² values for training (blue) and testing (red) data as a function of α. The green line represents the average Mean Squared Error obtained by cross-validation. The black line represents the optimal value for the tuning parameter λ for the Elastic Network regression.
4.12 Actual values of the Target 8.6 extracted from the database compared to the predicted figures obtained from each regression model. Proportion of youth not in education, employment or training (%), as a function of the year. Soft approximation to this proportion for 2020.
A.1 Evolution of the selected targets from 2000 to 2019, used for the exploratory analysis I.
A.2 Evolution of the selected targets from 2000 to 2019, used for the exploratory analysis II.
A.3 Evolution of the selected targets from 2000 to 2019, used for the exploratory analysis III.
A.4 Evolution of the selected targets from 2000 to 2019, used for the exploratory analysis IV.
A.5 Evolution of the selected targets from 2000 to 2019, used for the exploratory analysis V.
A.6 Evolution of the selected targets from 2000 to 2019, used for the exploratory analysis VI.

List of Tables

3.1 Targets selected for Spain. Short and official description I.
3.2 Targets selected for Spain. Short and official description II.
4.1 Most representative values after the correlation analysis.
4.2 Value ranges of the correlation coefficient.
4.3 Real coefficient magnitudes for ridge regression, including the intercept. Their interpretation has already been explained in the previous section.
4.4 Minimum MSE values for each regression method: lasso, ridge and Elastic Network.
A.1 Positive and negative signs assigned to the SDG indicators I.
A.2 Positive and negative signs assigned to the SDG indicators II.
A.3 Number of interactions per goal classified as synergies, trade-offs and non-classified.

Glossary

AEMET Agencia Estatal de Meteorología (State Meteorology Agency)
AFDB African Development Bank
CV Cross-Validation
EDA Exploratory Data Analysis
GDP Gross Domestic Product
IAEG-SDGs Inter-Agency and Expert Group on Sustainable Development Goals Indicators
ICT Information and Communications Technology
INE Instituto Nacional de Estadística (National Statistics Institute)
IUCN International Union for Conservation of Nature
KBAs Marine Key Biodiversity Areas
MDG Millennium Development Goal
MSE Mean Squared Error
NPLs Non-performing Loans
ODA Official Development Assistance
OECD-DAC The Organisation for Economic Co-operation and Development’s Development Assistance Committee
RSS Residual Sum of Squares
SDG Sustainable Development Goal
UHC Universal Health Coverage
UN United Nations

Chapter 1 Introduction

In this chapter, the conceptual background, the objectives and purpose of the research are presented.

1.1 Conceptual framework

In 2015, all United Nations (UN) Member States pledged to fulfil the 17 Sustainable Development Goals (SDGs), which make up the 2030 Agenda. These goals aim to tackle multiple worldwide challenges, such as ending poverty, eradicating hunger, ensuring well-being, protecting the environment, fostering positive economic development, improving access to health and education, and building strong institutions and partnerships. Each goal is divided into several targets which, in turn, are split into different indicators. The 193 UN Member States recognize that these objectives must be accomplished as a whole rather than by focusing on certain targets alone, since progress on one target may have not only beneficial impacts but also harmful consequences in other fields. That is why the SDGs became a significant milestone on the path of sustainable development [1].

Figure 1.1: Sustainable Development Goals logo.

It is crucial to consider how the SDGs interact and affect each other by identifying positive correlations between indicators (synergies) and negative correlations between them (trade-offs). These synergies and trade-offs were ranked globally by Pradhan et al., 2017 [2], and the goals were said to need to act as “...a system of interacting cogwheels that together move the global system into the safe and just operating space...”.

Many studies have focused on analysing a specific SDG and exploring its links with the rest; others have addressed water, energy or health-related targets, but only a few projects have analysed the interactions across goals.

Spain is one of the Member States that pledged to address the SDGs and is ranked 21st (of 162) by the Sustainable Development Report from 2019 [3]. It was also one of the countries that adopted the Millennium Development Goals (MDGs) in 2000: eight international development goals for the year 2015. In 2007, Spain made the largest contribution in the history of the United Nations to the MDGs fund. The fund had two main objectives: (i) to advance progress towards the MDGs and (ii) to do it in such a way that the UN system becomes stronger on the ground. Building on the success of this fund, the SDGs fund was set up in 2014 with the same aim, initially funded by Spain but open to other states and the private sector (Jose Manuel García Margallo, 2015) [4].

Due to the author’s closeness to the country, this research focussed on SDG progress in Spain from 2000 to 2019. Synergies and trade-offs were analysed and assessed, and the most influenced and most influential targets were identified in order to foster the potential of certain indicators over the coming years. Although examining these interactions between targets was useful, it was a qualitative method that only described how they have evolved in comparison with others. This research therefore provides a correlation analysis and a multiple linear regression study to quantify the influence of each predictor (independent variable) on a target of interest (dependent variable). This made it possible to unravel interactions among the SDGs that could be useful for implementing new policies that seek to ensure effective and coherent decisions.

Based on the evaluation of how SDGs influence each other, this report goes beyond the qualitative results published in similar studies [5]. It also offers a first approximation to quantitative predictive models, examining mathematical approaches and assessing the dependency between possibly correlated predictors. The findings highlight how enhancing certain targets could lead to a roadmap for prioritizing the right indicators and achieving the greatest progress in Spain in the implementation of the 2030 Agenda.

This report used the United Nations Development Program database [6]. The observations were compiled through the UN System in preparation for the Secretary-General’s annual report “Progress towards the Sustainable Development Goals”. Researchers can select the desired dataset from its interface and filter the observations by geographical area and year. This source was used in previous studies on SDG progress and interactions, such as Kroll et al., 2019 [5], Pradhan et al., 2017 [2], and Weitz et al., 2018 [7], whose results and discussions based on this source were compelling and satisfactory.

1.2 Purpose and objectives

This study aims to explore the interdependence among the indicators of the SDGs for Spain. Understanding these interplays could help the population and policymakers to appreciate the need to move towards sustainable development.

Identifying the most influenced and most influential targets, and which ones should be prioritized in Spain, helps to allocate investments to certain sectors and to establish possible links between government ministries. To this end, this report sets the following objectives:

• Objective 1 - Data quality. Ascertain the quality of the data available for the research. Assess the evolution of the SDGs over time and identify those with the poorest performance.

• Objective 2 - Correlation analysis. Find the correlation coefficients between the indicators and analyse their significance. Determine significant synergies or trade-offs between pairs of representative indicators and classify them in terms of influence.

• Objective 3 - Multiple linear regression. Suggest which targets should be enhanced in the coming years using multiple linear regression. To do so, an indicator with low or scarce evolution over the years is taken as the dependent variable, and the independent variables are set using variable selection and model building methods.

Chapter 2 Theoretical background

This chapter explains the frame of reference on which the forthcoming analyses are based: correlation, variable and model selection, and regression analysis. More detailed information on some sections can be found in Appendix A.1, which provides additional explanations that are helpful, though not indispensable, for understanding some calculations.

2.1 Correlation analysis

Correlation analysis is one of the most widely used techniques to investigate the relationship between two variables [8]. This kind of study aims to quantify the strength of the link between two different features using correlation coefficients. Generally, their values vary from -1 (corresponding to a perfect negative relationship) through 0 (no relationship) to +1 (perfect positive relationship).

In statistics, there are two types of correlation: parametric and non-parametric. Parametric correlation is often used when the user knows how the variables are distributed and related (linear, monotonic, coincidence...), so normally the first step is to plot the observations on a scatter diagram and summarize their main characteristics using exploratory data analysis (EDA).

2.1.1 Pearson’s correlation coefficient

The product-moment correlation coefficient (commonly known as the Pearson’s correlation coefficient), is the most popular sort of parametric correlation analysis and measures the strength of an association between two different variables. It is frequently defined as the covariance of the two variables divided by the product of their standard deviations. This coefficient is usually denoted by r [9].

Given a dataset containing n observations for each of the p variables, x denotes the first correlated variable and y the second. Pearson’s correlation coefficient is calculated as follows:

$$ r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}} \tag{2.1} $$

where the means of the x and y values are denoted by x̄ and ȳ respectively. The value of r lies between -1 and +1. If it is close to +1, a strong positive linear relationship exists, i.e. one variable increases in the same proportion as the other does. On the other hand, if the coefficient is close to -1, there is a strong negative linear relationship, i.e. one variable decreases as the other increases. If r is near 0, there is no linear relationship.
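As an illustration only, Equation 2.1 can be sketched in a few lines of Python using the standard library. The function name `pearson_r` and the toy series are assumptions introduced for this example; they are not taken from the SDG database or from the code used in this study.

```python
import math


def pearson_r(x, y):
    """Pearson's r as in Equation 2.1 (hypothetical helper for illustration)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must be paired and hold at least two values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Numerator: sum of the cross-products of the deviations from the means.
    cross = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    # Denominator: square root of the product of the sums of squared deviations.
    ss_x = sum((xi - mean_x) ** 2 for xi in x)
    ss_y = sum((yi - mean_y) ** 2 for yi in y)
    return cross / math.sqrt(ss_x * ss_y)  # undefined if a series is constant


# Toy series (assumed values, not real indicators).
years = [2000, 2001, 2002, 2003, 2004]
rising = [10.0, 12.0, 14.0, 16.0, 18.0]   # increases linearly with the year
falling = [9.0, 7.5, 6.0, 4.5, 3.0]       # decreases linearly with the year

r_rising = pearson_r(years, rising)
r_falling = pearson_r(years, falling)
```

Because both toy series are exactly linear in the year, `r_rising` and `r_falling` sit at the two ends of the range described above, up to floating-point rounding.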

This coefficient is symmetric and is applied to normally distributed variables with a linear relationship. The observations must be paired (no missing values) and the variables must have the same scatter (be homoscedastic; see footnote I). Outliers may cause misleading results because Pearson’s coefficient is very sensitive to them, so it is assumed that no significant outliers exist.

In order to keep a common method for evaluating all the relationships between targets, the study moved to a non-parametric correlation method, because not every pair of variables could be assumed to have a linear relationship.

2.1.2 Spearman’s correlation coefficient

Spearman’s rank correlation is a non-parametric correlation analysis (Spearman, 1904). Non-parametric correlation methods are commonly applied to pairs of variables whose distribution is unknown a priori. In contrast to Pearson’s, Spearman’s correlation does not assume normally distributed variables and does not require the variables to be measured on interval scales; it can also be used for variables measured at the ordinal level [10].

The rank correlation uses the ordinal association between the values, instead of the values themselves, to quantify the strength of the association. Spearman’s correlation coefficient is normally referred to as rho (ρ) and is given by the following expression:

$$ \rho = 1 - \frac{6 \sum_{i=1}^{n} D_i^2}{n (n^2 - 1)} \tag{2.2} $$

Footnote I: Homoscedasticity means that the data values are equally scattered, i.e. they have the same finite variance.

where D_i is the difference between the ranks assigned to the two components of the i-th observation. For example, if an observation is (2000, 15.5), the value x = 2000 is ranked 6th among the x values and y = 15.5 is ranked 2nd among the y values, then D_i is equal to 4. n is the number of pairs of observations. All in all, this statistical method quantifies the degree to which the tested variables are related by a monotonic function, i.e. a consistently increasing or decreasing path. Besides, Spearman’s correlation coefficient is less sensitive to outliers than Pearson’s [10].

This coefficient is interpreted like Pearson’s: it ranges between -1 and +1, indicating negative or positive associations. A coefficient of 0 means that no monotonic correlation exists, but it does not mean that the variables are independent. To check that the correlation is not due to chance, the null hypothesis H₀ (the variables are uncorrelated) is tested against the alternative hypothesis H₁ (the variables are correlated). A correlation coefficient is considered significant when H₀ is rejected. The p-value is the probability of observing a correlation at least as strong as the one found if H₀ were true: if it is below the significance level (α = 0.05, i.e. 95% confidence), such a correlation would be very unlikely for uncorrelated data, so H₀ is rejected in favour of H₁ [11].
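The rank-based calculation and the significance check can be sketched as follows. This is a minimal sketch under stated assumptions: the helper names (`ranks`, `spearman_rho`, `permutation_p_value`) and the toy data are hypothetical, and the p-value is approximated with a permutation test, which is a technique swapped in for illustration and not necessarily the procedure used in this study.

```python
import random


def ranks(values):
    """Ranks starting at 1 for the smallest value; ties share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = average_rank
        pos = end + 1
    return result


def spearman_rho(x, y):
    """Equation 2.2; the formula is exact only when there are no tied ranks."""
    n = len(x)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


def permutation_p_value(x, y, n_permutations=2000, seed=0):
    """Share of random re-pairings whose |rho| is at least the observed |rho|."""
    rng = random.Random(seed)
    observed = abs(spearman_rho(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # breaks any real association between x and y
        if abs(spearman_rho(x, shuffled)) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_permutations + 1)


# Toy data (assumed): a monotonic but clearly non-linear series.
years = list(range(2000, 2010))
indicator = [2.0 ** k for k in range(10)]

rho = spearman_rho(years, indicator)
p_value = permutation_p_value(years, indicator)
significant = p_value < 0.05  # significance level alpha = 0.05
```

The toy series grows exponentially, so it is not linear, yet its ranks agree perfectly with the ranks of the years; this is the situation in which a rank correlation is more informative than Pearson’s r.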

2.1.3 Kendall’s rank correlation

Another non-parametric correlation method is Kendall’s, which is used as an alternative to Pearson’s and Spearman’s correlation. The coefficient is often referred to as tau (τ) and is commonly used when the sample size is small and there are tied ranks. Kendall’s correlation coefficient uses pairs of observations to establish the strength of the relationship based on concordance or discordance. Following the example offered by Magiya et al., 2019 [12], a pair of observations is concordant if both observations are ordered in the same way, i.e. if (x2 - x1) and (y2 - y1) have the same sign; otherwise the pair is discordant.

The τ value is usually smaller than the ρ obtained with Spearman’s correlation, and the p-values are more accurate with smaller sample sizes. When calculating this correlation coefficient, it is preferable that the data show a monotonic relationship, and the features must be measured on a continuous or ordinal scale. Ordinal refers to ranked, non-numeric concepts such as feelings or satisfaction. Finally, the method makes the same assumptions as Pearson’s does in terms of a statistical hypothesis test.
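The counting of concordant and discordant pairs can be sketched as below. The function `kendall_tau_a` is a hypothetical helper that implements the simplest variant (tau-a), which ignores tied pairs instead of correcting for them; a tie-corrected variant would be needed for data with many ties. The toy values are assumptions.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs."""
    concordant = 0
    discordant = 0
    for i, j in combinations(range(len(x)), 2):
        product = (x[j] - x[i]) * (y[j] - y[i])
        if product > 0:      # both differences have the same sign
            concordant += 1
        elif product < 0:    # the differences have opposite signs
            discordant += 1
        # product == 0 is a tie in x or y; tau-a counts it as neither
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Toy data (assumed): one swapped pair in an otherwise increasing series.
x_values = [1, 2, 3, 4]
y_values = [1, 3, 2, 4]
tau = kendall_tau_a(x_values, y_values)  # 5 concordant, 1 discordant pair
```

With four observations there are six pairs; five are concordant and one is discordant, so the coefficient is (5 − 1) / 6.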

2.2 Cross-Validation

Cross-validation (CV) is a resampling method that makes it possible to compare different statistical learning methods and get a sense of how well they fit the data. It works with an initial set of observations that is used for two purposes: estimating the parameters of each method using part of the data (training set), and evaluating how well the method represents the other part of the data (testing set). This split is needed because using all the available data to fit a model would leave no data to evaluate its behaviour on an independent dataset, so there would be no evidence of its reliability.

Cross-validation keeps track of how well the method does with the test data. The resulting validation set error is typically assessed using the Mean Squared Error (MSE). The most common forms of cross-validation are Leave-One-Out CV and k-Fold CV.

Leave-One-Out Cross-Validation (LOOCV) splits a set of n data points into a training set containing all the observations but one (n − 1 observations) and a validation set containing the single remaining observation. The process is run n times, leaving out one data point at a time, and the test error is estimated as the average of the n resulting MSEs. LOOCV can be used in a wide range of predictive modelling settings but is very time-consuming, because the statistical learning method must be fitted n times, with a different held-out observation each time.

When performing k-Fold Cross-Validation, the data are randomly divided into k groups (folds) of approximately equal size. The value of k is usually 5 or 10 (see footnote II). The process fits the method on k − 1 folds, computes the MSE on the held-out fold, and is run k times with a different validation fold each time. Finally, it computes the average of all the MSE values as:

$$ CV_{(k)} = \frac{1}{k} \sum_{i=1}^{k} MSE_i \tag{2.3} $$

Apart from the computational savings, k-fold CV has an advantage over LOOCV related to the Bias-Variance Trade-off explained at the beginning of Section 2.3. LOOCV yields approximately unbiased estimates of the test error because each training set has n − 1 observations, but it has higher variance than k-fold CV: the n fitted models are trained on almost identical training sets, so their outputs are highly correlated.

k-fold CV is very useful when working with tuning parameters (values that must be chosen by the user rather than estimated directly by the statistical learning method), since it can help to find the best value for that setting. For ridge and lasso regression, the tuning parameter giving the smallest cross-validated MSE is sought.
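The procedure described in this section can be sketched for a model with a single predictor. The helper names `fit_line` and `k_fold_cv`, the fixed random seed and the toy data are assumptions made for this example; setting k equal to the number of observations turns the same function into LOOCV.

```python
import random


def fit_line(xs, ys):
    """Least-squares intercept and slope for one predictor."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def k_fold_cv(xs, ys, k, seed=0):
    """Average held-out MSE over k folds (Equation 2.3); k = n gives LOOCV."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)        # random assignment to folds
    folds = [indices[i::k] for i in range(k)]   # k groups of near-equal size
    fold_mse = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        intercept, slope = fit_line([xs[i] for i in train], [ys[i] for i in train])
        errors = [(ys[i] - (intercept + slope * xs[i])) ** 2 for i in held_out]
        fold_mse.append(sum(errors) / len(errors))
    return sum(fold_mse) / k


# Toy data (assumed): a straight line with a small alternating disturbance.
xs = [float(i) for i in range(20)]
ys = [3.0 + 0.5 * x + (0.2 if i % 2 else -0.2) for i, x in enumerate(xs)]

cv_5 = k_fold_cv(xs, ys, k=5)            # 5-fold cross-validation
cv_loo = k_fold_cv(xs, ys, k=len(xs))    # leave-one-out as the special case k = n
```

The 5-fold run fits the line five times, whereas the leave-one-out run fits it twenty times, which illustrates the computational cost mentioned above.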

Footnote II: These values have been shown empirically to yield test error estimates that suffer neither from high bias nor from high variance [13].

2.3 Regression analysis

A regression analysis is a statistical method used to quantify the dependency between one or more variables called predictors, covariates or independent variables, and the response or dependent variable [14]. At this point, it is crucial to explain certain concepts that guide most of the decisions throughout this study.

The Bias-Variance Trade-off is a concept that should be kept in mind throughout the reading. Ideally, a regression model is expected to fit the training data accurately (low bias) and to perform well on an independent test set (low variance) [15]. These two properties are not typically possible to achieve simultaneously.

A high-variance method may represent its training set accurately but fail to fit the test data properly, because it over-fits to noise (useless and pointless data). In contrast, models with high bias often lead to simpler expressions that do not over-fit but may under-fit the training data, failing to capture important regularities [13]. The bias-variance decomposition is commonly used to illustrate this matter.

$$ E(y_0 - \hat{y}_0)^2 = \mathrm{Var}(\hat{y}_0) + [\mathrm{Bias}(\hat{y}_0)]^2 + \mathrm{Var}(\epsilon) \tag{2.4} $$

Figure 2.1: Bias-Variance Trade-off.

In Equation 2.4, the term E(y₀ − ŷ₀)² represents the expected test MSE. To reduce this error, the chosen method should have low bias and low variance at the same time; in any case, the expected test error can never fall below the irreducible error Var(ε). In this context, an algorithm with low bias reflects the training data well and is a flexible model, which leads to high variance and over-fitting, as observed in Figure 2.1, taken from [16]. The aim is to find the best (or one of the best) possible models, i.e. one with the smallest prediction error (MSE).
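For readers who want to see where the three terms of the decomposition come from, the following derivation sketches the standard argument. It assumes that the response is generated as y₀ = f(x₀) + ε with E[ε] = 0, that ε is independent of the prediction ŷ₀, and that the bias is defined as E[ŷ₀] − f(x₀); these assumptions are stated here for the sketch and are not spelled out in the text above.

```latex
% Assumptions: y_0 = f(x_0) + \epsilon, E[\epsilon] = 0,
% \epsilon independent of \hat{y}_0, Bias(\hat{y}_0) = E[\hat{y}_0] - f(x_0).
\begin{align*}
E\big[(y_0 - \hat{y}_0)^2\big]
  &= E\Big[\big( (f(x_0) - E[\hat{y}_0]) + (E[\hat{y}_0] - \hat{y}_0) + \epsilon \big)^2\Big] \\
  &= \big(f(x_0) - E[\hat{y}_0]\big)^2
     + E\big[(\hat{y}_0 - E[\hat{y}_0])^2\big]
     + E[\epsilon^2] \\
  &= [\mathrm{Bias}(\hat{y}_0)]^2 + \mathrm{Var}(\hat{y}_0) + \mathrm{Var}(\epsilon)
\end{align*}
```

The three cross-products vanish in the second step because E[ŷ₀ − E[ŷ₀]] = 0, E[ε] = 0 and ε is independent of ŷ₀. Since none of the three remaining terms can be negative, the expected test error is bounded below by Var(ε).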

2.3.1 Linear regression

The most common form of regression is linear regression, a parametric regression method (see footnote III) whose result is the equation of a straight line (called the regression line if there is just one predictor); the variance of y is assumed to be constant. Regression models usually adopt the general form of Equation 2.5.

$$ y = f(x) + \epsilon \tag{2.5} $$

where y is the output variable, f(x) is an unknown function and ε is called the irreducible error, which is independent of x and is not determined by regression analysis. The form of f(x) depends on the number of predictors that take part in the regression; if the model has just one predictor, it is called simple linear regression. More information about the coefficient calculation is available in Appendix A.1.

In linear regression, the coefficient estimates are usually calculated using the least-squares method. This aims to minimise the sum of squared differences between the real observations and the values given by the estimated function, the so-called Residual Sum of Squares (RSS). Other regression analyses include multiple linear, polynomial, and non-linear regression. This research performed multiple linear regression, where the set of predictors has to be chosen carefully to avoid useless information from noisy variables. Non-linear models were not considered because their added complexity is reported not to improve significantly on the results obtained by simpler linear models [13].
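A least-squares fit with several predictors can be sketched with the normal equations, solved here by Gaussian elimination so that only the standard library is needed. The helper names (`solve`, `least_squares`, `rss`) and the toy data are assumptions for illustration; the sketch does not handle perfectly collinear predictors, for which the system has no unique solution.

```python
def solve(a, b):
    """Solve the linear system a * coef = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    coef = [0.0] * n
    for r in range(n - 1, -1, -1):                  # back substitution
        tail = sum(m[r][c] * coef[c] for c in range(r + 1, n))
        coef[r] = (m[r][n] - tail) / m[r][r]
    return coef


def least_squares(rows, y):
    """Coefficients [b0, b1, ..., bp] that minimise the RSS (normal equations)."""
    design = [[1.0] + list(row) for row in rows]    # leading 1 carries the intercept
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(p)]
    return solve(xtx, xty)


def rss(coef, rows, y):
    """Residual Sum of Squares for a given coefficient vector."""
    total = 0.0
    for row, yi in zip(rows, y):
        fitted = coef[0] + sum(b * x for b, x in zip(coef[1:], row))
        total += (yi - fitted) ** 2
    return total


# Toy data (assumed): the response is built from known coefficients 1, 2 and -0.5.
rows = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (4.0, 3.0), (5.0, 6.0), (6.0, 5.0)]
y = [1.0 + 2.0 * x1 - 0.5 * x2 for x1, x2 in rows]

coef = least_squares(rows, y)
perturbed = [coef[0], coef[1] + 0.1, coef[2]]
```

Comparing `rss(coef, rows, y)` with `rss(perturbed, rows, y)` shows that moving any coefficient away from the least-squares solution increases the RSS.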

2.3.2 Variable selection and model building.

In the regression setting, the standard linear model

$$ y = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p + \epsilon \tag{2.6} $$

is commonly used to describe the relationship between a response y and a set of variables x₁, x₂, ..., x_p. In this study, some approaches for extending the linear model framework were studied. Hereinafter, n refers to the number of observations and p indicates the number of variables. The description below contains only the basic knowledge needed to understand the methodology followed; it includes neither detailed explanations nor every possible procedure.

The simple linear model can be improved by replacing plain least-squares fitting with alternative fitting procedures; three important classes of methods are discussed here. The reason for using other fitting procedures is that they can deliver better prediction accuracy and model interpretability.

Footnote III: Parametric regression methods are those where a fixed equation form determines the relation between the predictors and the response variable.

• Subset Selection. Also called wrapper methods, these approaches involve identifying a subset of the p predictors that are believed to influence the response. The model is then fitted using least squares on the reduced set of variables. The problem of selecting the “best” model from among the 2^p combinations is not trivial, especially if p = 33, which makes a total of 8,589,934,592 possible models. The “best” model would be the one having the smallest RSS or, equivalently, the largest R².

• Shrinkage. These methods consist of fitting a model considering all p predictors. However, the coefficient estimates are shrunk towards zero relative to the least-squares estimates. This shrinkage, or regularization, has the effect of reducing the variance of over-fitted models and balances the Bias-Variance Trade-off.

• Dimension Reduction. This approach involves projecting the p predictors into an M-dimensional subspace, where M < p. This results from computing M different linear combinations (or projections) of the variables. Then these M projections are used as predictors to fit a new linear regression model by least-squares.
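The size of the search space in subset selection can be checked with a short sketch. The function names and the three predictor labels are hypothetical and serve only to show how quickly the number of candidate models grows.

```python
from itertools import combinations


def count_subsets(p):
    """Number of candidate models when each of p predictors is either in or out."""
    return 2 ** p


def all_subsets(predictors):
    """Yield every subset of predictors, from the null model to the full model."""
    for size in range(len(predictors) + 1):
        for subset in combinations(predictors, size):
            yield subset


# Hypothetical predictor labels, used only to enumerate a tiny search space.
small_space = list(all_subsets(["x1", "x2", "x3"]))
models_for_33_predictors = count_subsets(33)
```

With three predictors the enumeration is trivial (eight models), whereas with 33 predictors an exhaustive search would require fitting more than eight billion models, which motivates the shrinkage methods discussed next.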

2.3.3 Shrinkage methods. Regularization

The main goal of shrinkage methods is making predictions less sensitive to certain variables. These techniques choose a model containing all p predictors and constrain or shrink the coefficient estimates towards zero. It may not be immediately obvious why such a constraint should improve the fit, but it turns out that shrinking the coefficient estimates can significantly reduce their variance by increasing the bias.

Two well-known techniques for shrinking the regression coefficients are ridge regression and the lasso, and there is a third alternative that blends the benefits of both methods.

Ridge Regression

Also known as Tikhonov regularization, this method helps to mitigate the effect of multicollinearity in linear regression analysis. This problem usually occurs in statistical models with a large number of predictors. To understand how this method acts, one should consider the least-squares fitting procedure, which estimates $\beta_0, \beta_1, \ldots, \beta_p$ by minimizing Equation A.8. Ridge regression estimates slightly different coefficients by adding a shrinkage penalty to the loss function L, which pulls the estimates towards zero.

$$L(\beta_j, \beta_0) = \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 = \mathrm{RSS} + \lambda \sum_{j=1}^{p} \beta_j^2 \tag{2.7}$$


λ ≥ 0 is called the tuning parameter. When λ = 0, ridge regression reproduces the least-squares estimates since the penalty term is null. When λ → ∞, the shrinkage penalty has a larger effect and the regression coefficients approach zero, corresponding to the null model that contains no predictors but the intercept. A good value of λ is selected by computing the cross-validation error for each value of λ on a grid and choosing the tuning parameter for which the cross-validation error is smallest. Further information about cross-validation is available in Section 2.2.

The shrinkage penalty is applied to $\beta_1, \ldots, \beta_p$ but not to the intercept $\beta_0$. This is because the intercept is just a measure of the mean value of the response when all the predictors are equal to 0. Assuming centred variables, the intercept estimate will be $\hat{\beta}_0 = \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$.
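The behaviour of the ridge penalty and the selection of λ on a grid can be sketched for the simplest case of a single centred predictor, where minimizing Equation 2.7 gives the closed form β = S_xy / (S_xx + λ). The code below is a minimal illustration under that assumption; the data, the grid and the function names are invented for the example, and leave-one-out cross-validation is used only because it needs no random splitting.

```python
def fit_ridge_1d(x, y, lam):
    """Ridge fit with one predictor; the intercept (mean of y) is not penalized."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    beta = sxy / (sxx + lam)  # lam = 0 gives the least-squares slope
    return beta, x_mean, y_mean


def predict(model, x_new):
    beta, x_mean, y_mean = model
    return y_mean + beta * (x_new - x_mean)


def loocv_error(x, y, lam):
    """Leave-one-out cross-validation mean squared error for a given lambda."""
    total = 0.0
    for i in range(len(x)):
        x_train = x[:i] + x[i + 1:]
        y_train = y[:i] + y[i + 1:]
        model = fit_ridge_1d(x_train, y_train, lam)
        total += (y[i] - predict(model, x[i])) ** 2
    return total / len(x)


def select_lambda(x, y, grid):
    """Return the grid value with the smallest cross-validation error."""
    return min(grid, key=lambda lam: loocv_error(x, y, lam))


# Hypothetical data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1, 6.1]
grid = [0.0, 0.1, 1.0, 10.0, 100.0]

for lam in grid:
    print(lam, fit_ridge_1d(x, y, lam)[0], loocv_error(x, y, lam))
print("selected lambda:", select_lambda(x, y, grid))
```

The printed slopes shrink monotonically towards zero as λ grows, while the intercept stays at the mean of the response, as described above.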

Lasso Regression

Lasso stands for least absolute shrinkage and selection operator. One disadvantage of ridge regression is that it does not perform variable selection, i.e. none of the estimated coefficients is exactly zero (unless λ = ∞), so all p predictors are included in the final model and interpreting the results becomes more difficult. Lasso regression overcomes this disadvantage by using an $\ell_1$ penalty instead of an $\ell_2$ penalty (see footnote IV).

$$L(\beta_j, \beta_0) = \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j| = \mathrm{RSS} + \lambda \sum_{j=1}^{p} |\beta_j| \tag{2.8}$$

Lasso regression can shrink coefficient estimates exactly to zero, excluding useless correlated variables from the equation and making the final model simpler and easier to interpret. As in ridge regression, choosing a good value of the tuning parameter λ is crucial, so cross-validation is performed. The intercept $\beta_0$ is the same as the one used in ridge regression.

IV. The $\ell_1$ norm is given as $\|\beta\|_1 = \sum_j |\beta_j|$, which forces some of the $\beta$ coefficient estimates to be exactly zero, while the $\ell_2$ norm, given as $\|\beta\|_2 = \sqrt{\sum_j \beta_j^2}$, will never set a value to zero.
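The difference between the two penalties can be seen in the single-predictor case with centred data. Minimizing Equation 2.8 then leads to a soft-thresholding rule, β = S(S_xy, λ/2) / S_xx, whereas the ridge solution is β = S_xy / (S_xx + λ). The following sketch uses invented data and hypothetical function names and is not the implementation used in this study.

```python
def centred_sums(x, y):
    """Return (S_xx, S_xy) computed on mean-centred data."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    return sxx, sxy


def soft_threshold(z, t):
    """Shrink z towards zero by t and return exactly 0.0 when |z| <= t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_slope(x, y, lam):
    sxx, sxy = centred_sums(x, y)
    return soft_threshold(sxy, lam / 2.0) / sxx


def ridge_slope(x, y, lam):
    sxx, sxy = centred_sums(x, y)
    return sxy / (sxx + lam)


# Hypothetical data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1, 6.1]

for lam in [0.0, 1.0, 10.0, 40.0]:
    print(lam, lasso_slope(x, y, lam), ridge_slope(x, y, lam))
```

Once λ/2 reaches |S_xy| the lasso slope is exactly zero, while the ridge slope only becomes small; this is the variable-selection property described in the text.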


Elastic Network Regression

The elastic net regression adds the quadratic $\ell_2$ penalty used in ridge regression to the $\ell_1$ penalty used in lasso regression. The coefficient estimates of the elastic net method are calculated by minimizing the following loss function.

$$L(\beta_j, \beta_0) = \mathrm{RSS} + \lambda_1 \sum_{j=1}^{p} |\beta_j| + \lambda_2 \sum_{j=1}^{p} \beta_j^2 \tag{2.9}$$

Here $\lambda_1 \geq 0$ controls the lasso shrinkage and $\lambda_2 \geq 0$ controls the ridge shrinkage. The hybrid elastic net regression is especially good at dealing with situations where there is a high correlation between predictors. Lasso regression tends to pick just one of the correlated predictors and eliminate the others, whereas the elastic net is a blend of both methods that overcomes the weak points of each one.

The best values for λ ₁ and λ ₂ are calculated using cross-validation.
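The contrast with the lasso can be illustrated with a small coordinate-descent sketch that minimizes Equation 2.9 for centred data. Two identical predictors are used as an extreme case of correlation. The routine, its name and the data are assumptions made for this illustration only; they are not the software used in the study.

```python
def soft_threshold(z, t):
    """Shrink z towards zero by t and return 0.0 when |z| <= t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def elastic_net_cd(columns, y, lam1, lam2, sweeps=500):
    """Coordinate descent for RSS + lam1 * sum|b_j| + lam2 * sum b_j^2.

    `columns` is a list of mean-centred predictor columns and `y` is the
    mean-centred response, so no intercept is needed.
    """
    p, n = len(columns), len(y)
    beta = [0.0] * p
    for _ in range(sweeps):
        for j in range(p):
            # Partial residual that leaves predictor j out of the fit.
            resid = [
                y[i] - sum(beta[k] * columns[k][i] for k in range(p) if k != j)
                for i in range(n)
            ]
            rho = sum(columns[j][i] * resid[i] for i in range(n))
            sjj = sum(v * v for v in columns[j])
            beta[j] = soft_threshold(rho, lam1 / 2.0) / (sjj + lam2)
    return beta


# Hypothetical centred data with two perfectly correlated predictors.
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [-4.1, -1.9, 0.1, 2.0, 3.9]
columns = [x, list(x)]

print("lasso      :", elastic_net_cd(columns, y, lam1=2.0, lam2=0.0))
print("elastic net:", elastic_net_cd(columns, y, lam1=2.0, lam2=1.0))
```

In this example the lasso (λ₂ = 0) loads essentially the whole effect on the first predictor it visits, whereas the elastic net shares the effect equally between the two correlated predictors.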

2.3.4 Dimension Reduction. Principal Component Analysis

Principal Component Analysis (PCA) is a popular approach for compressing a large amount of data into a low-dimensional set of variables that captures the essence of the original dataset. It is used as a dimension reduction technique for regression, and the new variables are called principal components. These are the directions along which the original variables vary the most; each direction can be represented by a straight line. The original data points are projected onto this line to obtain the amount of variation accounted for by each principal component. In conclusion, the principal component estimators reduce the effects of multicollinearity by using a subset of the principal components in the model [13].
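For two variables the first principal component can be computed by hand, because the 2 × 2 covariance matrix has closed-form eigenvalues. The sketch below shows the direction of largest variance, the projected scores and the share of variance explained; it is a simplified illustration with invented data and a hypothetical function name, not the PCA routine used in the study.

```python
import math


def first_principal_component(xs, ys):
    """Return (direction, scores, explained share) for two variables."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

    # Eigenvalues of the covariance matrix [[sxx, sxy], [sxy, syy]].
    half_trace = (sxx + syy) / 2.0
    det = sxx * syy - sxy ** 2
    disc = math.sqrt(max(half_trace ** 2 - det, 0.0))
    largest, smallest = half_trace + disc, half_trace - disc

    # Eigenvector that belongs to the largest eigenvalue.
    if abs(sxy) > 1e-12:
        direction = (largest - syy, sxy)
    elif sxx >= syy:
        direction = (1.0, 0.0)
    else:
        direction = (0.0, 1.0)
    norm = math.hypot(direction[0], direction[1])
    direction = (direction[0] / norm, direction[1] / norm)

    # Projection of each centred point onto the principal direction.
    scores = [(x - mx) * direction[0] + (y - my) * direction[1]
              for x, y in zip(xs, ys)]
    share = largest / (largest + smallest)
    return direction, scores, share


# Hypothetical, strongly correlated variables.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
direction, scores, share = first_principal_component(xs, ys)
print(direction, share)
```

When the two variables are highly correlated, a single component retains almost all of the variance, which is why replacing them by that component reduces multicollinearity.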


2.4 Considerations in high dimensions

Nowadays, it is common to find fields where the number of features p is very large compared with the number of observations n, such as medicine, marketing and finance. Data sets whose number of features exceeds the number of observations (p > n) are commonly known as high-dimensional. Even if some variables seem relevant, adding them may increase the variance by more than it reduces the bias.

When least-squares regression is applied to a high-dimensional data set, the coefficient estimates fit the training data perfectly, so that the residuals are zero. Although the training data are fitted perfectly, the resulting linear model performs poorly on an independent test set and is therefore useless. This is a case of an overly flexible model that over-fits the data (Bias-Variance Trade-off). Generally, as the number of features increases, the test set error tends to increase as well.
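The perfect training fit can be reproduced with the smallest possible example: a model with two parameters (intercept and slope) fitted to only two observations. This is a simplified stand-in for the p > n situation, with invented numbers and hypothetical function names.

```python
def fit_least_squares_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return intercept, slope


def mean_squared_error(model, x, y):
    intercept, slope = model
    return sum((yi - (intercept + slope * xi)) ** 2
               for xi, yi in zip(x, y)) / len(x)


# Two training observations, two parameters: the line passes through both.
train_x, train_y = [1.0, 2.0], [3.0, 1.0]
# Independent (hypothetical) test observations from the same process.
test_x, test_y = [3.0, 4.0], [2.2, 1.9]

model = fit_least_squares_line(train_x, train_y)
train_mse = mean_squared_error(model, train_x, train_y)
test_mse = mean_squared_error(model, test_x, test_y)
print("training MSE:", train_mse)
print("test MSE    :", test_mse)
```

The training error is zero because the model has as many parameters as observations, yet the test error is large, which is the behaviour described above.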

This last statement needs two clarifications: (i) adding features that are truly related to the response variable will improve the model, and (ii) adding noise variables will deteriorate the model and increase the test error. However, when building a high-dimensional regression model, one cannot know in advance whether the added variables are noise or useful, so regularization plays a key role in this sort of problem.

Besides, high dimensions usually come together with the problem of multicollinearity, which is due to the correlation that may exist among the variables. There is no unique best subset of variables that gives the most accurate regression: any variable can be rewritten as a combination of the rest, so each result obtained must be interpreted as one of many possible models. Finally, as stated in [13], “one should never use sum of squared errors, p-values, R² statistics, or other traditional measures of model fit on the training data as evidence of a good model fit in the high-dimensional setting” (page 244). For this reason, the results of the high-dimensional regression analysis are reported as cross-validation errors computed on independent test sets, instead of the least-squares error.


Chapter 3 Data and methodology

The following chapter describes the research setting and how the study was conducted. First, it explains how the data were extracted from the sources and treated. Then, the qualitative and quantitative methods applied are described, following the theoretical procedures.

3.1 Data

The data used for this study were taken from the Sustainable Development Goals official webpage and the United Nations Development Programme, which provide the observations of each target and indicator for a wide range of geographical areas and years. To tackle the research, it was necessary to delimit the amount of data and to clarify the expected results. The study focussed on the evolution of the SDGs in Spain from 2000 to 2019 [1].

Intending to facilitate the data selection, the SDG Global Indicator Platform offers a classification made by the IAEG-SDGs (Inter-Agency and Expert Group on Sustainable Development Goals Indicators) split into three tiers based on their level of methodological development and the availability of data at the global level.

• Tier I: Indicator is conceptually clear, has an internationally established methodology and available standards, and data are regularly produced by countries for at least 50 per cent of countries and of the population in every region where the indicator is relevant [17].

• Tier II: Indicator is conceptually clear, has an internationally established methodology and available standards, but data are not regularly generated by countries.

• Tier III: No internationally established methodology or standards are yet available for the indicator, but methodology/standards are being (or will be) developed or tested.


[Bar chart "Number of observations per Goal": x-axis, Goals 1 to 17; y-axis, number of observations from 0 to 1800.]

Figure 3.1: Number of registered observations per each goal.

A crucial step when analysing datasets is to evaluate the quality of the observations, i.e. to find the number of NaN values, repeated data points or missing values. In total, 6130 observations were taken from the source for the selected period and geographical region. Each observation contained many details: goal, target, indicator, period, value, maximum and minimum registered values, series description, sex, region of study and more. The initial treatment aimed to keep as much information as possible while discarding those data points that did not add any information. To do so, an exploratory analysis of the data was necessary so that important characteristics could be identified.
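A quality check of this kind can be sketched as follows. The record layout (the field names `goal`, `indicator`, `year`, `sex` and `value`) and the example values are assumptions made for illustration; they do not reproduce the actual export format of the data source or any value from the dataset.

```python
import math
from collections import Counter


def is_missing(value):
    """True for None or a floating-point NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def quality_report(records, key_fields=("goal", "indicator", "year", "sex")):
    """Count missing values, repeated data points and observations per goal."""
    missing = sum(1 for r in records if is_missing(r.get("value")))
    seen = Counter(tuple(r.get(f) for f in key_fields) for r in records)
    duplicates = sum(count - 1 for count in seen.values())
    per_goal = Counter(r.get("goal") for r in records)
    return {
        "total": len(records),
        "missing": missing,
        "duplicates": duplicates,
        "per_goal": dict(per_goal),
    }


# Made-up records that only mimic the structure described in the text.
records = [
    {"goal": 8, "indicator": "8.6.1", "year": 2015, "sex": "BOTHSEX", "value": 19.4},
    {"goal": 8, "indicator": "8.6.1", "year": 2015, "sex": "BOTHSEX", "value": 19.4},
    {"goal": 8, "indicator": "8.6.1", "year": 2016, "sex": "BOTHSEX", "value": float("nan")},
    {"goal": 13, "indicator": "13.1.1", "year": 2016, "sex": None, "value": None},
]
print(quality_report(records))
```

The per-goal counts correspond to the kind of summary plotted in Figure 3.1, while the missing and duplicate counts indicate how much cleaning is needed before the analysis.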

Figure 3.1 displays the number of observations belonging to each goal: almost 1800 of the 6130 points (29%) belong to Goal 8 (Decent work and economic growth). In contrast, Goal 13 (Climate action) and Goal 14 (Life below water) hardly reach 200 observations together. Although Goal 8 comprises 16 unique indicators (available in the official list of SDG indicators by the United Nations) while Goals 13 and 14 gather 9 unique indicators together, the difference is still large. These unbalanced measurements complicated the research.

3.1.1 Major indicator and data gaps for the SDGs

Although the SDGs were established several years ago, there are still many gaps in the data currently available. That is why governments and the international community must increase investments in SDG data and monitoring systems to close these gaps as much as possible [18]. The importance of this issue is referred to further on in the text.


[Bar chart "Completed share per Goal": x-axis, Goals 1 to 17; y-axis, share (%) from 0 to 100.]

Figure 3.2: Completed share per SDG and average line, 62.5%.

As each goal had a different number of indicators, it was essential to know the level of completion of each one. The bars in Figure 3.2 display the completed share of every SDG according to the number of indicators it contains. The red line indicates the average share of completed points over the studied period (2000-2019): the data contained 62.5% of the observations for every indicator and every year. This graph should be compared with Figure 3.1, where Goal 8 stands out with almost 1800 registered values, corresponding to a completed share of 70%. On the other hand, Goal 13 has barely 100 observations, but its indicators cover the whole studied period.
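One possible way to compute such a completed share is sketched below. It assumes that the share is the percentage of (indicator, year) cells that hold at least one value; this definition, the function name and the example data are assumptions for illustration and may differ from the exact calculation behind Figure 3.2.

```python
def completed_share(observed, indicators, years):
    """Percentage of (indicator, year) cells that have at least one value.

    `observed` is a set of (indicator, year) pairs found in the data.
    """
    total_cells = len(indicators) * len(years)
    filled_cells = sum(
        1 for indicator in indicators for year in years
        if (indicator, year) in observed
    )
    return 100.0 * filled_cells / total_cells


years = list(range(2000, 2020))  # the 20 years of the studied period

# Hypothetical goal with two indicators: one complete, one half complete.
indicators = ["indicator_a", "indicator_b"]
observed = {("indicator_a", year) for year in years}
observed |= {("indicator_b", year) for year in years[:10]}

print(completed_share(observed, indicators, years))
```

Under this definition a goal with few observations can still be fully complete, as noted for Goal 13, because the share depends on coverage of indicators and years rather than on the raw number of records.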

3.2 Methodology

In this section, the specific methods and treatments are explained; Appendices A.2 and A.3 give deeper explanations and details. The analysis was done at the level of targets: two targets per goal were selected, with one indicator per target, because targets provide more specific descriptions than an evaluation at the level of goals. These two indicators per goal were intended to be significant and descriptive within each target, considering the Spanish context. There were two reasons to select just two: the computational cost (if all the targets and indicators were considered, the task would become more time-consuming and would not be feasible for this project), and the available material (many missing points would require imputation and add inaccuracy to the dataset), so targets with Tier I indicators were preferred. In the end, a total of 34 targets were covered instead of 169.


The selected targets are shown in Table 3.1 and Table 3.2. A brief description of each one is placed next to the official one, given by the SDGs website [1]. The selection was based on the relevance of each target for Spain; however, some issues need to be discussed. Firstly, the selection was made with the current weak spots of Spanish society in mind: unemployment, job insecurity, social inequality, the balance of the energy mix and issues disputed among the political groups. Secondly, missing data points, or targets and years without any registered values, compelled the study to dismiss strong topics that might have fostered deeper discussion. Some examples are:

• Target 3.8 (Achieve universal health coverage, including financial risk protection, access to quality essential health-care services and access to safe, effective, quality and affordable essential medicines and vaccines for all). It affects social well-being and economic growth, but there were not enough values for Universal Health Coverage (UHC), so this target was discarded.

• Target 8.3 (Promote development-oriented policies that support productive activities, decent job creation, entrepreneurship, creativity and innovation, and encourage the formalization and growth of micro-, small- and medium-sized enterprises, including through access to financial services). This target presented a significant lack of data points. Its interactions with employment and economy-related targets could have been remarkable, so its absence is noteworthy.

• Target 9.1 (Develop quality, reliable, sustainable and resilient infrastructure, including regional and trans-border infrastructure, to support economic development and human well-being, with a focus on affordable and equitable access for all). Numerous people fled from countries at war and reached the southern Spanish border. This target was expected to interact significantly with those of Goal 11 (Sustainable cities and communities) and with Target 10.7 (Facilitate orderly, safe, regular and responsible migration and mobility of people, including through the implementation of planned and well-managed migration policies), but no observations were registered.

• Target 16.5 (Substantially reduce corruption and bribery in all their forms). It would have been interesting to analyse the impact that this target could have on economic growth, but no data were recorded.

The selected targets were intended to have a homogeneous set of observations and to belong to different goals, so that all 17 SDGs were covered in some way. The reasoning behind the selection and the purpose of each target are carefully explained in Appendix A.2. Moreover, changes within the structure of the targets were considered, and the data were adapted so that they could be analysed practically afterwards, i.e. through an exploratory analysis of each target.


| Target | Short description | Official description |
|---|---|---|
| 1.3 | Social protection measures | Implement nationally appropriate social protection systems and measures for all, including floors, and by 2030 achieve substantial coverage of the poor and the vulnerable |
| 1.5 | Social resilience | By 2030, build the resilience of the poor and those in vulnerable situations and reduce their exposure and vulnerability to climate-related extreme events and other economic, social and environmental shocks and disasters |
| 1.a | Mobilization of resources | Ensure significant mobilization of resources from a variety of sources, including through enhanced development cooperation, in order to provide adequate and predictable means for developing countries, in particular least developed countries, to implement programmes and policies to end poverty in all its dimensions |
| 2.5 | Maintain the genetic diversity | By 2020, maintain the genetic diversity of seeds, cultivated plants and farmed and domesticated animals and their related wild species, including through soundly managed and diversified seed and plant banks at the national, regional and international levels, and promote access to and fair and equitable sharing of benefits arising from the utilization of genetic resources and associated traditional knowledge, as internationally agreed |
| 2.a | International cooperation | Increase investment, including through enhanced international cooperation, in rural infrastructure, agricultural research and extension services, technology development and plant and livestock gene banks in order to enhance agricultural productive capacity in developing countries, in particular least developed countries |
| 3.4 | Premature mortality | By 2030, reduce by one-third premature mortality from non-communicable diseases through prevention and treatment and promote mental health and well-being |
| 3.9 | Reduce pollution | By 2030, substantially reduce the number of deaths and illnesses from hazardous chemicals and air, water and soil pollution and contamination |
| 4.1 | Primary education | By 2030, ensure that all girls and boys complete free, equitable and quality primary and secondary education leading to relevant and effective learning outcomes |
| 4.4 | Vocational education | By 2030, substantially increase the number of youth and adults who have relevant skills, including technical and vocational skills, for employment, decent jobs and entrepreneurship |
| 5.4 | Domestic work | Recognize and value unpaid care and domestic work through the provision of public services, infrastructure and social protection policies and the promotion of shared responsibility within the household and the family as nationally appropriate |
| 5.5 | Women opportunities | Ensure women’s full and effective participation and equal opportunities for leadership at all levels of decision-making in political, economic and public life |
| 6.1 | Water services | By 2030, achieve universal and equitable access to safe and affordable drinking water for all |
| 6.2 | Sanitation access | By 2030, achieve access to adequate and equitable sanitation and hygiene for all and end open defecation, paying special attention to the needs of women and girls and those in vulnerable situations |
| 6.6 | Protect ecosystems | By 2020, protect and restore water-related ecosystems, including mountains, forests, wetlands, rivers, aquifers and lakes |
| 7.2 | Renewable sources | By 2030, increase substantially the share of renewable energy in the global energy mix |
| 7.3 | Energy efficiency | By 2030, double the global rate of improvement in energy efficiency |
| 8.4 | Resource efficiency | Improve progressively, through 2030, global resource efficiency in consumption and production and endeavour to decouple economic growth from environmental degradation, in accordance with the 10-Year Framework of Programmes on Sustainable Consumption and Production, with developed countries taking the lead |

Table 3.1: Targets selected for Spain. Short and official description I.


| Target | Short description | Official description |
|---|---|---|
| 9.5 | Scientific research | Enhance scientific research, upgrade the technological capabilities of industrial sectors in all countries, in particular developing countries, including, by 2030, encouraging innovation and substantially increasing the number of research and development workers per 1 million people and public and private research and development spending |
| 10.4 | Social equality | Adopt policies, especially fiscal, wage and social protection policies, and progressively achieve greater equality |
| 10.5 | Financial regulation | Improve the regulation and monitoring of global financial markets and institutions and strengthen the implementation of such regulations |
| 10.6 | Voice for developing countries | Ensure enhanced representation and voice for developing countries in decision-making in global international economic and financial institutions in order to deliver more effective, credible, accountable and legitimate institutions |
| 11.5 | Vulnerable housing | By 2030, significantly reduce the number of deaths and the number of people affected and substantially decrease the direct economic losses relative to global gross domestic product caused by disasters, including water-related disasters, with a focus on protecting the poor and people in vulnerable situations |
| 12.2.1 | Natural resources efficiency | By 2030, achieve the sustainable management and efficient use of natural resources |
| 12.2.2 | Natural resources efficiency | By 2030, achieve the sustainable management and efficient use of natural resources |
| 14.5 | Marine conservation | By 2020, conserve at least 10 per cent of coastal and marine areas, consistent with national and international law and based on the best available scientific information |
| 14.a | Marine knowledge | Increase scientific knowledge, develop research capacity and transfer marine technology, taking into account the Intergovernmental Oceanographic Commission Criteria and Guidelines on the Transfer of Marine Technology, in order to improve ocean health and to enhance the contribution of marine biodiversity to the development of developing countries, in particular, small island developing States and least developed countries |
| 15.2 | Forests | By 2020, promote the implementation of sustainable management of all types of forests, halt deforestation, restore degraded forests and substantially increase afforestation and reforestation globally |
| 15.5 | Biodiversity | Take urgent and significant action to reduce the degradation of natural habitats, halt the loss of biodiversity and, by 2020, protect and prevent the extinction of threatened species |
| 16.1 | Reduce violence | Significantly reduce all forms of violence and related death rates everywhere |
| 16.8 | People participation | Broaden and strengthen the participation of developing countries in the institutions of global governance |
| 17.2 | Assistance commitments | Developed countries to implement fully their official development assistance commitments, including the commitment by many developed countries to achieve the target of 0.7 per cent of gross national income for official development assistance (ODA/GNI) to developing countries and 0.15 to 0.20 per cent of ODA/GNI to least developed countries; ODA providers are encouraged to consider setting a target to provide at least 0.20 per cent of ODA/GNI to least developed countries |
| 17.3 | Financial resources | Mobilize additional financial resources for developing countries from multiple sources |

Table 3.2: Targets selected for Spain. Short and official description II.

Many articles have used interaction scores as the method to evaluate synergies and trade-offs. The scoring technique was first presented by Nilsson et al., 2016 [19], who proposed a typology that rates these interactions on a seven-point scale; this classification is displayed in Figure 3.3 [19]. The authors distinguished three levels of positive interaction and three levels of negative interaction. Independence between targets was labelled as a “Consistent” interaction.


GOALS SCORING

The influence of one Sustainable Development Goal or target on another can be summarized with this simple scale.

| Interaction | Name | Explanation | Example |
|---|---|---|---|
| +3 | Indivisible | Inextricably linked to the achievement of another goal. | Ending all forms of discrimination against women and girls is indivisible from ensuring women’s full and effective participation and equal opportunities for leadership. |
| +2 | Reinforcing | Aids the achievement of another goal. | Providing access to electricity reinforces water-pumping and irrigation systems. Strengthening the capacity to adapt to climate-related hazards reduces losses caused by disasters. |
| +1 | Enabling | Creates conditions that further another goal. | Providing electricity access in rural homes enables education, because it makes it possible to do homework at night with electric lighting. |
| 0 | Consistent | No significant positive or negative interactions. | Ensuring education for all does not interact significantly with infrastructure development or conservation of ocean ecosystems. |
| –1 | Constraining | Limits options on another goal. | Improved water efficiency can constrain agricultural irrigation. Reducing climate change can constrain the options for energy access. |
| –2 | Counteracting | Clashes with another goal. | Boosting consumption for growth can counteract waste reduction and climate mitigation. |
| –3 | Cancelling | Makes it impossible to reach another goal. | Fully ensuring public transparency and democratic accountability cannot be combined with national-security goals. Full protection of natural reserves excludes public access for recreation. |

Figure 3.3: Map of interactions presented by Nilsson et al., 2016.

Even though the objectives of these articles were similar to those of this project, these ranges are susceptible to subjective interpretation, which makes the resulting cross-impact matrix inexact and dependent on who established the ranges, i.e. some authors may assign the interaction between a pair of targets to a particular level while others may not. Qualitative assessment can cause misinterpretations or confusing evaluations, which is why quantitative methods were preferred to gauge interactions. Apart from that, assessing each pair of targets individually would have required an individual analysis that was not practical in a data science project.


3.2.1 Correlation analysis

Correlation analysis was the first tool used for assessing the relationship between targets that belong to the same or different SDGs. To undertake this task, a 34 × 34 symmetric correlation matrix was computed using a non-parametric correlation method. Spearman’s correlation coefficient was chosen, since Kendall’s rank correlation would be applied to non-numeric concepts and was therefore not useful in this context. The matrix made it easy to recognize which targets were positively related, negatively related or unrelated, and then to discuss the synergies and trade-offs found.

Before tackling the correlation analysis, it was crucial to consider the conditions required by the method. Spearman’s correlation coefficient requires the ranked variables to be continuous and of equal size, i.e. all targets must have an observation for each of the studied years. As the matrix resulting from the target selection had missing values (only 67% of its cells were numerical), the remaining 33% of the data were imputed or estimated.
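The calculation can be sketched in plain Python: Spearman's coefficient is the Pearson correlation of the ranks, with tied values receiving their average rank. The series names and values below are invented, the functions are hypothetical helpers, and p-values are omitted; the study relied on statistical software for the actual 34 × 34 matrix.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def pearson(a, b):
    """Pearson correlation; both inputs must have non-zero variance."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    var_a = sum((u - ma) ** 2 for u in a)
    var_b = sum((v - mb) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


def spearman(a, b):
    return pearson(average_ranks(a), average_ranks(b))


def correlation_matrix(series):
    """Symmetric matrix of Spearman coefficients, keyed by series name."""
    names = list(series)
    return {r: {c: spearman(series[r], series[c]) for c in names} for r in names}


# Hypothetical complete yearly series for three targets.
series = {
    "target_a": [1.0, 2.0, 3.5, 4.0, 6.0],
    "target_b": [10.0, 12.0, 15.0, 15.5, 30.0],
    "target_c": [9.0, 7.0, 6.5, 3.0, 1.0],
}
for name, row in correlation_matrix(series).items():
    print(name, row)
```

Since all series must have the same length and no gaps, the imputation step described next has to be completed before a matrix of this kind can be computed.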

Incomplete time series

Time series often have missing points, owing to issues when recording or reading data from huge datasets and different sources. Most Machine Learning models and coefficients require data without missing values. Hence, the cells with non-numerical points were either dropped or filled with appropriate values. This procedure was performed following these methods [20]:

• Imputing the time series using the mean/median values. The mean or the median of the non-missing values in a column is calculated and used to replace the missing ones, separately and independently from the other columns. This is simple and works well for small datasets, but it ignores relationships between columns and distorts the variance of the imputed series, which can affect the later model.

• Imputing using Most Frequent or Zero/Constant values. The missing points are replaced with the most frequent value within each column, or with zero or another constant.

• Imputing the time-series using the rolling average.

• Imputation using k-NN (k-Nearest Neighbours). The desired point is assigned a value based on how closely it matches the points in the training set. This can be very useful for making predictions: the k closest neighbours of the observation with missing data are found, and the missing value is imputed from the non-missing values in that neighbourhood.

• Using a curve fitting tool to set the intermediate values. This is commonly performed when the time series presents a defined shape, so one can estimate the coefficients for a given function form and then estimate those intermediate points.
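Three of the strategies listed above can be sketched for a single yearly series in which missing values are stored as `None`. The function names are hypothetical, the rolling average is assumed to use the preceding values (observed or already imputed), and straight-line interpolation stands in for the simplest possible curve fit; the imputation actually applied is documented in Appendix A.3.

```python
def impute_mean(series):
    """Replace every missing value with the mean of the observed values."""
    known = [v for v in series if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in series]


def impute_rolling_mean(series, window=3):
    """Replace a missing value with the mean of the preceding `window` values.

    Leading gaps stay missing because there is nothing to average yet.
    """
    filled = []
    for value in series:
        if value is None:
            recent = [v for v in filled[-window:] if v is not None]
            value = sum(recent) / len(recent) if recent else None
        filled.append(value)
    return filled


def impute_linear(series):
    """Fill interior gaps on the straight line between neighbouring values."""
    filled = list(series)
    known = [i for i, v in enumerate(series) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (series[right] - series[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = series[left] + step * (i - left)
    return filled


# Hypothetical indicator values for seven consecutive years.
series = [10.0, None, 14.0, 15.0, None, None, 21.0]
print(impute_mean(series))
print(impute_rolling_mean(series))
print(impute_linear(series))
```

Mean imputation flattens the trend of a time series, whereas the rolling average and interpolation follow it more closely, which is why the choice of method matters for indicators that evolve steadily over the years.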


Appendix A.3 contains the results used to build the matrix of 34 targets and 20 observed years. In addition, all the imputed values are explained and cross-checked with other software.

Once this matrix was set and checked, the software returned the matrix of Spearman’s correlation coefficients together with the p-value of each coefficient. However, correlation between indicators does not mean that causality exists, which is why a regression analysis was carried out. In such a study, correlation between predictors usually causes multicollinearity, so useless variables must be removed to obtain a reliable model for predicting future figures and examining dependency.

3.2.2 Regression analysis

The first step of this analysis was to determine which target was going to be the dependent variable, so the evolution over 20 years was assessed; it is displayed in Figure 3.4. Five of the 17 SDGs were still growing slowly. Among the targets belonging to these particular SDGs, Target 8.6 (Proportion of youth not in education, employment or training (%)) has not followed a negative tendency but is still worrying and seems to be poorly correlated (Figure 4.2). This conclusion matches that of the Sustainable Development Report 2019 [3], made for all countries, in which the indicator representing this target was assessed with a negative ranking but a positive trend. Many other indicators classified with a negative rating could also have been considered for this study.

Figure 3.4: Representation of the growth each goal has had between 2000 and 2019, compared to their initial value.
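A growth measure of this kind can be sketched as the percentage change between the first and the last year relative to the initial value. The exact formula behind Figure 3.4 is not given here, so the function below, its name, its `higher_is_better` flag and the example values are assumptions made for illustration.

```python
def relative_growth(series, higher_is_better=True):
    """Percentage change between the first and last value of a series.

    The sign is flipped when a lower value means progress, so that a
    positive result always indicates improvement. The first value must
    be non-zero.
    """
    first, last = series[0], series[-1]
    change = 100.0 * (last - first) / abs(first)
    return change if higher_is_better else -change


# Hypothetical yearly values of two indicators.
renewable_share = [10.0, 12.0, 15.0]      # higher is better
youth_not_in_education = [20.0, 15.0]     # lower is better

print(relative_growth(renewable_share))
print(relative_growth(youth_not_in_education, higher_is_better=False))
```

Expressing the change relative to the initial value puts indicators with different units on a comparable percentage scale, although indicators that start close to zero can produce very large percentages.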


Abstract

Correlation does not imply causality, so multiple linear regression analysis set true numerical relationships and revealed how to enhance certain targets by leveraging others.

Keywords: SDG, correlation, regression


Summary (Sammanfattning)

Global sustainable development has been shaped by the plans of the United Nations for more than two decades.

Keywords: SDG, correlation, regression

Acknowledgements

I would like to recognize the invaluable opportunity that my home university (Universidad Politécnica de Madrid, UPM) offered me, allowing me to study this academic year abroad, especially at the Royal Institute of Technology of Stockholm, which gave me the chance to get involved in this study.

Carlos de Miguel Ramos

Stockholm, June 2020

Contents

1 Introduction
1.1 Conceptual framework
1.2 Purpose and objectives
2 Theoretical background
2.1 Correlation analysis
2.1.1 Pearson’s correlation coefficient
2.1.2 Spearman’s correlation coefficient
2.1.3 Kendall’s rank correlation
2.2 Cross-Validation
2.3 Regression analysis
2.3.1 Linear regression

4.7 Plots of the R² values for training (blue) and testing (red) data as function of λ. The green line represents the average Mean Squared Error obtained by cross-validation. The black line is the optimal tuning parameter λ for the lasso regression.

4.9 Plots of the R² values for training (blue) and testing (red) data as

4.10 The standardized coefficients for the Target 8.6 acting as a dependent variable, as function of α on the Elastic Network Regression.

4.11 Plots of the R² values for training (blue) and testing (red) data as