
Linköpings universitet

Linköping University | Department of Computer and Information Science

Master’s thesis, 30 ECTS | Datateknik

21 | LIU-IDA/STAT-A--21/027--SE

Classification of Corporate Social

Performance

Klassificering av företagens sociala resultat

Erik Anders

Supervisor: Josef Wilzén
Examiner: Jose M. Peña


Upphovsrätt

Detta dokument hålls tillgängligt på Internet - eller dess framtida ersättare - under 25 år från publiceringsdatum under förutsättning att inga extraordinära omständigheter uppstår.

Tillgång till dokumentet innebär tillstånd för var och en att läsa, ladda ner, skriva ut enstaka kopior för enskilt bruk och att använda det oförändrat för ickekommersiell forskning och för undervisning. Överföring av upphovsrätten vid en senare tidpunkt kan inte upphäva detta tillstånd. All annan användning av dokumentet kräver upphovsmannens medgivande. För att garantera äktheten, säkerheten och tillgängligheten finns lösningar av teknisk och administrativ art.

Upphovsmannens ideella rätt innefattar rätt att bli nämnd som upphovsman i den omfattning som god sed kräver vid användning av dokumentet på ovan beskrivna sätt samt skydd mot att dokumentet ändras eller presenteras i sådan form eller i sådant sammanhang som är kränkande för upphovsmannens litterära eller konstnärliga anseende eller egenart.

För ytterligare information om Linköping University Electronic Press se förlagets hemsida http://www.ep.liu.se/.

Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for a period of 25 years starting from the date of publication barring exceptional circumstances.

The online availability of the document implies permanent permission for anyone to read, to down-load, or to print out single copies for his/hers own use and to use it unchanged for non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.


Abstract

Over the past few years, attention in finance towards socially responsible investments has increased rapidly, which creates a need to determine whether a company is socially responsible or not. The ESG ratings often used for this purpose are based on Environmental, Social and Governance related data about the companies and have many flaws. This thesis proposes to instead model companies by their controversies as discussed in the media. It tries to answer the question of whether it is possible to predict a company's future controversies from its past controversies and ESG indicators, and to isolate the predictors which influence these. This has not been done before and offers a new way of rating companies without falling for the biases of conventional ESG ratings. The chosen method to approach this issue is Zero Inflated Poisson regression with random intercepts. A selection of variables was determined by Lasso and projection predictive variable selection. This method discovered new connections in the data between ESG indicators and the number of controversies, but also made it apparent that it is difficult to make predictions for future years. Nevertheless, the coefficients of the selected indicators can give valuable insight into the potential risk of an investment.


Acknowledgments

I would like to thank my family for always supporting me and making it possible for me to do the things I love.

I would like to thank my supervisor Josef Wilzén for his continued support and many insightful comments and suggestions.

Also, I would like to thank my external supervisor Tohid Ardeshiri from Högskolan i Gävle for his advice and for defining this thesis problem.

Finally, I would like to thank Länsförsäkringsbolagens Forskningsfond for their financial support.


Contents

Abstract

Acknowledgments

Contents

List of Figures

List of Tables

1 Introduction
  1.1 Motivation
  1.2 Aim
  1.3 Research questions
  1.4 Delimitations
2 Related Work
3 Data
  3.1 The GDELT Project
  3.2 Refinitiv ESG
  3.3 Data Preparation
  3.4 Summary
4 Theory
  4.1 Zero-inflated Count Models
  4.2 Hierarchical regression models
  4.3 Hamiltonian Monte Carlo
  4.4 QR Decomposition
  4.5 Variable Selection
5 Method
  5.1 Model
  5.2 Variable Selection
  5.3 Zero Inflated Poisson
6 Results
  6.1 Variable Selection
  6.2 Zero Inflated Poisson
7 Discussion
  7.1 Results
  7.2 Method
8 Conclusion


List of Figures

3.1 Distribution of All Controversy Counts
3.2 Distribution of Non-Zero Controversy Counts
5.1 Illustration of the different steps taken
5.2 Distribution of Student t (3, -2.3, 2.5)
5.3 Distribution of Logistic (0, 1)
5.4 Distribution of Student t (3, 0, 2.5)
6.1 Cross-validation curve of Binomial model with LASSO
6.2 Cross-validation curve of Poisson model with LASSO
6.3 Cross-validation curve for the projection predictive variable selection of Binomial model
6.4 Cross-validation curve for the projection predictive variable selection of Poisson model
6.5 Zoomed in predicted vs. observed plot for prediction of training data
6.6 Predicted vs. observed plot for prediction of training data
6.7 Density plots of intercepts and random effect standard deviation
6.8 Density plots of coefficients modeling zero inflation probability
6.9 Density plots of coefficients modeling zero inflation probability
6.10 Density plots of coefficients modeling Poisson mean
6.11 Density plots of coefficients modeling Poisson mean
6.12 NUTS acceptance plot of ZIP
6.13 Zoomed in predicted vs. observed plot for 2020 prediction


List of Tables

3.1 Count of controversies with each row representing a company with at least one controversy in the entire dataset
6.1 Confusion matrix of zero/non-zero values of the fitted model
6.2 Confusion matrix of zero/non-zero values of prediction for 2020


1 Introduction

1.1 Motivation

Over the past few years there has been rapidly increasing attention in finance towards socially responsible investments [15]. These investments can include many other investment categories, with the only overlap being that they have a positive social impact. Such investments presuppose the capacity of the investor to distinguish good from bad corporate social performance (CSP). To do so, investors especially have to look at the company under the three aspects environmental, social and corporate governance, as information relating to these aspects builds the foundation of CSP. Because of this increasing attention, agencies specifically collect data that falls under these aspects. On top of that, they offer a combined rating based on them. This standard method to assess CSP is called ESG ratings (Environmental, Social and Governance). ESG ratings have some weaknesses. It has been shown in the literature that ESG-type CSP ratings lack validity. These weaknesses stem from the fact that most ESG metrics are very diverse in application and in terms of indicators measured, methodology used, and weights applied [13].

Despite these shortcomings, total assets in sustainable investment strategies in the United States had risen to $17 trillion USD by year-end 2019, according to the US SIF 2020 Trends Report [67]. The amount is up 42% from 2018, with more double-digit growth expected in the years ahead. As the empirical evidence of increased financial performance of well scoring ESG investments [21] grows, so will the demand for high quality and comparable ESG data [3] [36] [19].

As a solution to the current rating void, we investigate a new method to develop a CSP rating. We extract controversies from the GDELT Project database, which is a collection of articles from worldwide news outlets that is naturally grouped into events and offers a number of metrics to analyze them. The controversies of a company are then used as a measure of its CSP rating, and we study the correlation between controversies and ESG indicators. This gives investors information on which companies to invest in and on which features to look for and which to avoid in a company when making a financial decision.

Because controversies are rare by nature, they cannot be modelled with a standard count distribution like the Poisson. Instead, zero-inflated models were used to account for the high number of zeros and to accurately model the data.


1.2 Aim

The aim of this thesis is to extract company controversies from the GDELT Project online database. The extracted controversies are then counted on a yearly basis to describe a company's CSP for that year. On the basis of these yearly counts and a dataset of yearly ESG indicators, a statistical model is fit and then used to predict the controversy counts of the following years, as well as to provide inference on which indicators contributed to the prediction and in what way. The distribution of this model should accurately represent the yearly distribution of controversy counts. It should account for the longitudinal effect of the data and connect a company's changes in the indicators over the years to controversies in later years.

1.3 Research questions

The research questions this thesis tries to answer are: Is it possible to predict a company's future controversies from its past controversies and ESG indicators, and to isolate the predictors which influence these? What is the appropriate regression model to use in this instance?

1.4 Delimitations

The data of controversy counts is somewhat noisy. One reason for this is the varying names with which companies are addressed by the media, which means that some controversies are missed by the search algorithm used to filter them out of the news data. On the same note, the way an event is counted has some weaknesses, in that its importance is factored by the number of articles written about the incident, which strongly depends on how big a company is. The volume of reports on a specific company also depends on its location, as the database is worldwide but covers more English-speaking sources. Because of this, the controversy counts derived here are not the most accurate representation of the companies.


2 Related Work

In academic research, ESG and a company's social performance are mostly studied with regard to financial returns, the creation of an ESG portfolio and the technologies used to do so. Other areas of research are the qualitative benefits of ESG related corporate efforts and the definition of ESG scores in itself. A more recent field of research on ESG scores is natural language processing (NLP).

Since it is difficult to measure Corporate Social Performance, this is often done with the help of ESG ratings. Because they are used in a similar context, CSP and ESG are often used synonymously. For the sake of differentiating between the social and the economic approach to this topic, the two terms are kept distinct in this thesis.

A social introduction to the topic is given by [60], which discusses the relationship between Corporate Social Responsibility (CSR) and corporate reputation from the viewpoint of value theory. To go one step further, [70] tries to measure the business impact of CSR activities from a company perspective and, similarly to [71], discusses whether a company's sustainability efforts are rewarded on the financial market.

To quantify the extent to which ESG-related conversations are carried out by companies, [56] detects historical trends in ESG discussions by analyzing the transcripts of corporate earnings calls.

As described in Chapter 1, there are many unregulated approaches to ESG scoring, which makes the resulting scores unreliable and difficult to compare. These and other shortcomings, together with the different scoring approaches, are discussed in [18]. Because of this nature and the fact that information on ESG scoring is highly valuable and usually produced by financial agencies instead of academics, not a lot of work is published concerning this task. Nevertheless, the few articles that are published predict ESG ratings with the help of NLP algorithms. One exception is [23], which forecasts the ESG ratings of firms by using corporate financial performance variables.

As an example of the implementation of NLP, [61] converts unstructured text data into ESG scores. Most of the publications utilizing NLP are based on media reports. [29] proposes a model that can learn to predict firms that are more likely to be added to an investment exclusion list in the near future and therefore are not complying with ESG standards. [34] focuses on the micro-foundations of news reports to elaborate how an atmosphere of negative news reports following an initial exposure of corporate pollution activity can help stop such activity through their impact on corporate managers.


As a subcategory of this approach, there are volumes of scientific papers on stock prediction based on news media. One of them is [14], which examines the predictive power of media-expressed tone in single-stock option markets and equity markets. [32] predicts the stock trend based on the sequence of recent related news. [24] discusses the response of stock volatility to unusual news. It finds that "unusual" news with negative sentiment predicts an increase in stock market volatility and that "unusual" positive news forecasts lower volatility. These articles don't focus on ESG as a concept, but the underlying idea is the same. This becomes very apparent in [64], which suggests that linguistic media content captures otherwise hard-to-quantify aspects of firms' fundamentals and that investors quickly incorporate these aspects into stock prices. [59] used financial news to predict stock price movements and surprisingly found an inverse relation between sentiment polarity and stock price movement. Instead of just looking at the news, [35] also used social media data to do a comparative analysis of systems predicting the financial markets. It states that there is no well-rounded theoretical and technical framework for approaching the problem and that this is because of its interdisciplinary nature, which involves at its core both behavioral-economic topics as well as artificial intelligence.

Only a few of them consider the ESG aspect. [26] does, and specifically explores the predictive power of ESG related financial news on stock volatility.

This thesis uses the average tone value given by the GDELT Project, but the tone in news articles is a research field in itself. One interesting article on that is [41], which states that almost three-fourths of the words identified as negative by the widely used Harvard Dictionary are words typically not considered negative in financial contexts. As a solution it offers an alternative negative word list, along with five other word lists, that are supposed to better reflect tone in financial text.

The GDELT Project database offers valuable data for this thesis and was also utilized in other publications. [33] investigates whether media sentiment, retrieved from the GDELT Project, can be used as a predictor of stock price. [55] builds a Hidden Markov Model (HMM) based framework to predict indicators associated with country instability based on the GDELT events dataset. Their framework utilizes the temporal burst patterns in GDELT event streams to uncover the underlying event development mechanics and formulates social unrest event prediction as a sequence classification problem based on Bayes decision.

Count models like the ones utilized in this thesis have many applications. An overview of count data models in econometrics, including hurdle and zero-inflated models, is provided in [9] and [10]. [58] considers the problem of modelling count data with excess zeros in more detail and reviews some possible models under the aspects of model fitting and inference. A publication closer to the statistical modelling done in this thesis is [45], which developed a multilevel approach for Zero Inflated Negative Binomial (ZINB) models. As they are based on the same regression coefficient structure, the same formulation is also applicable to Zero Inflated Poisson (ZIP) models. In a similar fashion, [27] introduced random effects into the portion of the ZIP and Zero Inflated Binomial models that describes the dependence of the nonzero-state mean on the covariates. Another publication introducing a model for repeated measures of zero-inflated count data is [48]. Similar to this thesis, they decided to go with a Bayesian approach. [20] applied the multilevel ZINB to repeated measurements of two microbial organisms important in oesophagitis and [16] to crash frequencies at intersections. Similar to this thesis, [54] uses Stan to sample from a multilevel ZIP. A general framework for the modelling of count data exhibiting zero inflation (ZI) in Stan is given by [28].

Variable selection for zero-inflated models is a very specific topic and therefore not covered to a great extent in the literature. Because of its joint structure, a zero-inflated model is more challenging to fit than most other regression models, as there is a high number of possible models. Nonetheless, [11] presents an approach based on a stochastic search through the space of all possible models without screening every single one of them. Because of this increased complexity, the use of the Taylor approximation of the log-likelihood function can be numerically unstable, because it requires inverting the Hessian matrix [7]. Therefore regularization methods have been proposed, like the EM Lasso in [69]. However, [43] suggests that EM Lasso may not be fully efficient and that the model selection may be inconsistent. [2] proposes Adaptive Lasso as the solution to these shortcomings for ZIP models.

In a Bayesian context, [50] compares several widely used model selection methods. They state that from a predictive viewpoint, the best results are obtained by accounting for model uncertainty. They propose to do this by forming the full encompassing model, such as the Bayesian model averaging solution over the candidate models. For models that are too complex they suggest simplification by the projection method, which projects the information of the full model onto the submodels. According to them, this approach is less prone to overfitting than selection based on the CV score.


3 Data

3.1 The GDELT Project

The response variable for the following models is the count of yearly controversies involving a certain company. This number will also be used as a covariate, giving each observation the controversy count of the previous year. The underlying data to calculate these controversy counts has been acquired through the GDELT Project database [65].

GDELT, short for Global Database of Events, Language, and Tone, is the largest, most comprehensive, and highest resolution open database of human society ever created. Its vast archives of more than a quarter billion georeferenced records covering the entire world over 30 years, coupled with massive networks that connect all of the people, organizations, locations, themes, and emotions underlying those events, offer unprecedented opportunities to understand and interact with our world in fundamentally new ways.

GDELT 2.0 is the newest version of the project and reaches back to January 2015. It displays data in three tables, which get updated every 15 minutes and can be queried individually and subset by date using Google BigQuery [25]. These tables are Events, Mentions and Global Knowledge Graph, out of which Events and Mentions are being used to calculate the yearly controversies.

The Events table displays distinct events, selected by an NLP algorithm, when they first appear. It includes a number of measurements about the articles mentioning the event in the first 15 minute interval after its creation.

The Mentions table stores every mention of every event, one row per mention of an event. On top of that it includes a selection of measurements about each mention.

Out of the companies of interest, the Events table included 238, which were evaluated from January 2015 to December 2020. The original list of companies given by the company supervisor was over 2500 entries long, but only the companies that corresponded to at least one event in the GDELT Events table were included.

To calculate the significance of a certain event e, the following formula was applied to give the event a relative score RS_e, with n_e being the number of articles mentioning event e:

$$RS_e = \log(n_e) \cdot \frac{1}{n_e} \sum_{i=1}^{n_e} \text{tone}_{ie}$$


The variable tone_ie describes the tone measurement of article i corresponding to event e. This measurement is a numerical score given by the Mentions table that indicates the tone of an article by subtracting the percentage of all words in the article that were found to have a negative emotional connotation from the percentage of words with a positive emotional connotation.

To make sure that controversy counts don't just scale linearly with the attention a company gets in the media, but focus more on the extent to which an event is negative, the logarithm of the number of mentions was taken. This makes companies of different sizes more comparable and highlights truly negative reporting on an event.

The yearly controversies involving a specific company were then calculated by counting the number of events over a specific threshold in the timespan of a year. These yearly controversies were used as response variables in the models of this thesis. The threshold was chosen manually by looking into companies with known controversies and making sure that these get registered, while still ensuring that not every small criticism of bigger companies gets counted as a controversy. After trying multiple thresholds, it was set to 30. This value captured known controversies without automatically including slightly negative events involving bigger companies.
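To make the construction concrete, the scoring and thresholding can be sketched in a few lines of R. This is illustrative only; the data frame `mentions` and its columns `event_id`, `company`, `year` and `tone`, as well as the sign convention of the threshold, are assumptions rather than the thesis's actual code.

```r
# Illustrative sketch of the controversy count construction (assumed data
# layout, not the original code): one row per article mention with the
# columns event_id, company, year and tone.
library(dplyr)

controversies <- mentions %>%
  group_by(company, year, event_id) %>%
  summarise(n_articles = n(), mean_tone = mean(tone), .groups = "drop") %>%
  # relative score RS_e = log(n_e) * average tone of the event's articles
  mutate(rs = log(n_articles) * mean_tone) %>%
  # events whose score magnitude exceeds the threshold of 30 count as controversies
  filter(abs(rs) >= 30) %>%
  count(company, year, name = "controversies")
```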

With this threshold, 89% of the yearly controversy counts are still zero. This seems like a reasonable distribution, as not many companies are expected to be involved in controversies on a yearly basis. The distribution of the controversy counts can be seen in Figure 3.1. This distribution accounts for all five years that were inspected. Because the overwhelming amount of zeros makes it difficult to see the distribution of nonzero counts, they are displayed again in Figure 3.2. The counts of all companies with at least one controversy over the span of the entire data are listed in Table 3.1, with each row representing a company. The first 4 years (2016-2019) were used as training data and the 5th year (2020) was used to test the model for future predictions.


Figure 3.2: Distribution of Non-Zero Controversy Counts

2015  2016  2017  2018  2019
   0     6     0     0     0
  17     2     1     2     2
   0     0     0     2     0
  20    14     6    30     7
  72    26    26     9    43
   7     0     0     2     0
   0     0     2     6     2
   0     0     2     0     1
   1     0     0     0     0
   0     0     0     2     0
   0     0     0     0     1
   0     0     1    11     0
   4     8     0     1     9
   6     0     6     5     0
   0     1     0     2     0
   1     0     0     0     0
   0     0     3     0     6
   1     0    14    10     2
  10     0     0     0     0
   9    15     5    14     1
   0     0     4     0     0
   7     1    11     0     0
   0     0     2     5     0
   1     0     0     0     0
   3    18    12     7     4
   8     0     2    15     0
   3     1     1     0     1
  57    26    26    29    22
   0     2     0     0     0
   2     3     0     0     0
   0     1     0     0     0
  19     4     0     1     0
   3     0     0     1     4
   0     0    11     0     0
   0     1     0     0     0
   2     0     7     4     1
   2     1     4     2     5
  39    15    19     0     0
  41     4    40    13     5
   0     0     0     1     0
   2     2     7     3     1
   3     0     1     0     0
   0     1     0     0     1
  18     2     0     0     0
   0     0     0     2     2
   0     1     0     0     0
  39     9     5     6     5
   0     0     0     9     0
   0     0     0     5     0
   1     0     2     0     0

Table 3.1: Count of controversies with each row representing a company with at least one controversy in the entire dataset


3.2 Refinitiv ESG

Almost all covariates of the models have been acquired through Refinitiv Eikon [57]. The only one not from this dataset is the controversy count of the previous year, which was added as a covariate for the model to benefit from its autoregressive nature.

Refinitiv offers one of the most comprehensive ESG databases in the industry, covering over 80% of the global market cap across more than 450 different ESG metrics, with a history going back to 2002. Those metrics are given on a yearly basis, are split into Environmental, Social and Governance, and include Refinitiv's own ESG metric. The data is based on publicly available sources such as company websites, annual reports and corporate social responsibility reports, or is contributed by firms and then audited and standardized.

Most of the covariates are binary, measuring for example whether, in a certain year, a company is involved in a certain business or not. In addition to that, the data also includes categorical and numerical values. Because a large amount of the data is based on publicly available and voluntary information given by the companies, it contains a high number of missing values. The percentage of missing values and the different datatypes in the data can be seen in the table below.

% of missing values: 39.26
% of binary columns: 53.28
% of categorical columns: 2.98
% of numerical columns: 43.74

3.3 Data Preparation

As described in Section 3.2, the original data downloaded from Refinitiv Eikon was in a very raw format, meaning it had a high number of missing values and some inconsistencies in the notation of binary values. The implementation used here does allow missing value imputation using mice [8], yet it was decided to replace the numerical predictors which included missing values with binary values based on whether the value was missing or not. For binary predictors, missing values were set to zero and, in addition to that, a new predictor was created stating whether the values were missing or not. This can be seen as transforming the binary variables into categorical variables with the labels "0", "1" and "NA", which are then dummy coded to be used by the model. This way no information was lost, as all three scenarios are represented by the two predictors.

This approach was chosen because the columns with missing values only had small numbers of values available and therefore did not give enough ground to make reasonable imputations. The fact that a company doesn't publish a certain value can also be seen as a reason to believe that the value would put the company in an unfavorable light. Because of this, knowing whether a value was published or not can add valuable information to the model.

The inconsistencies in the binary values were solved by converting them all to the logical datatype of R. Some of the binary predictors only had one value and were therefore removed from the dataset.

In addition to the predictors given by Refinitiv Eikon, each observation in the dataset has been given the name of the company it belongs to, the year the data corresponds to, as well as the value of the response variable in the previous year.
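The missing-value handling described above can be sketched as follows. This is an illustration under assumptions, not the thesis's actual preprocessing code; the data frame `esg` and the generated `*_missing` column names are placeholders.

```r
# Illustrative sketch of the preprocessing described above (assumed handling,
# not the original code). `esg` is a data frame of raw Refinitiv Eikon predictors.
prepare_esg <- function(esg) {
  out <- esg
  for (col in names(esg)) {
    x <- esg[[col]]
    if (is.logical(x) || (is.numeric(x) && all(x %in% c(0, 1, NA)))) {
      # binary predictor: missing values become FALSE, plus an explicit flag
      out[[col]] <- as.logical(x) & !is.na(x)
      if (any(is.na(x))) out[[paste0(col, "_missing")]] <- is.na(x)
    } else if (is.numeric(x) && any(is.na(x))) {
      # numerical predictor with missing values: keep only a missingness flag
      out[[col]] <- NULL
      out[[paste0(col, "_missing")]] <- is.na(x)
    }
  }
  # drop predictors that take only a single value
  out[, vapply(out, function(v) length(unique(v)) > 1, logical(1)), drop = FALSE]
}
```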

3.4 Summary

The steps described above result in observations of 231 companies over the span of 5 years. The first 4 years have been used to train the model and the fifth to test its prediction capability. Each observation has a total of 331 covariates, consisting of the selected and processed covariates from Refinitiv Eikon, the extra binary covariates indicating whether a value is missing or not, and the number of controversies in the year previous to the observation. Each observation has also been given a response variable that represents the number of controversies involving that company in that year. The percentage of missing values and the different datatypes after the preprocessing can be seen below.

% of missing values: 0
% of binary columns: 87.61
% of categorical columns: 3.32


4 Theory

The controversy counts introduced in Section 3.1 cannot be described by a traditional count distribution. This chapter introduces zero-inflated count models, specifically the Zero-Inflated Poisson, which is necessary to model this data appropriately. Because of the high correlation between the covariates, a QR decomposition has been applied.

Because the data consists of yearly measurements and the same company appears multiple times in the data, hierarchical modelling is introduced to account for the correlation of repeated measurements of the same company.

To do Bayesian inference, Hamiltonian Monte Carlo and its variant No-U-Turn Sampling are introduced as the preferred choice of sampler. Utilizing them, efficient sampling can be achieved, which allows deeper insights into the inference of the model.

Based on the literature reviewed in Chapter 2, Lasso and projection predictive variable selection have been chosen as the methods to conduct variable selection.

4.1 Zero-inflated Count Models

Zero-inflated count models contain a mixture of a point mass at zero and an untruncated count distribution [48]. Because both parts of the mixture accommodate zeros, zeros get explicitly partitioned in zero-inflated models. The partitioned types are structural zeros and chance zeros. The first occur, for example, because a company is too small to ever have a significant controversy in the media. The latter occur by chance among bigger companies with sufficient media coverage, due to non-controversial behavior. [58] offers a comprehensive review of zero-inflated models.

Zero-Inflated Poisson Regression

The first model of this kind is the zero-inflated Poisson (ZIP), which was introduced by Lambert in [37].

As the Poisson model itself is inadequate in the case of an excess of zeros because of the violation of the equidispersion assumption, Lambert split the model into two processes. One process models the zero inflation, by including a proportion π of extra zeros and a proportion (1 − π) exp(−λ_i) of zeros coming from the Poisson distribution, and the second models the nonzero counts using the zero-truncated Poisson model. This can be written as


$$\Pr(Y_i = y_i) = \begin{cases} \pi_i + (1 - \pi_i)\exp(-\lambda_i) & \text{if } y_i = 0 \\[4pt] (1 - \pi_i)\dfrac{\lambda_i^{y_i}\exp(-\lambda_i)}{y_i!} & \text{if } y_i > 0 \end{cases}$$

with π_i being the zero-inflation probability and λ_i the Poisson mean. This means that if the probability of zero inflation π is zero, the model reduces to the ordinary Poisson. Because of this characteristic, this component is later referred to as the Poisson part of the model.

The parameters λ = (λ_1, ..., λ_n) and π = (π_1, ..., π_n) satisfy

$$\log(\lambda) = B\beta \qquad \text{and} \qquad \operatorname{logit}(\pi) = \log\left(\frac{\pi}{1 - \pi}\right) = G\gamma$$

for the covariate matrices B and G with rows b_1, ..., b_n and g_1, ..., g_n, each representing an observation, and the regression coefficients β and γ. λ is defined by a log link function, as would be the case in a regular Poisson regression. π is defined by a logit function and therefore modeled as a logistic regression. Because of this, π follows a Binomial distribution, and this component is later referred to as the Binomial part or the first part of the model.

Therefore the mean of the ZIP is (1 − π)λ and its variance is λ(1 − π)(1 + λπ).

Under the assumption that y_1, ..., y_n are independent and that π is not related to λ, the likelihood function of y_i is defined as:

$$L = \prod_{y_i = 0}\left[\pi_i(g_i) + (1 - \pi_i(g_i))\exp(-\lambda_i)\right] \prod_{y_i \neq 0}\left[(1 - \pi_i(g_i))\frac{e^{-\lambda_i}\lambda_i^{y_i}}{y_i!}\right]$$

The log-likelihood function is:

$$LL = \sum_{y_i = 0}\log\left[\exp(g_i'\gamma) + \exp\left(-\exp(b_i'\beta)\right)\right] + \sum_{y_i \neq 0}\left[y_i b_i'\beta - \exp(b_i'\beta) - \log(y_i!)\right] - \sum_{i=1}^{n}\log\left[1 + \exp(g_i'\gamma)\right]$$

Let D_i be an indicator function defined as follows:

$$D_i = \begin{cases} 1 & \text{if } y_i = 0 \\ 0 & \text{otherwise} \end{cases}$$

The log-likelihood function then becomes:

$$LL = \sum_{i=1}^{n} D_i \log\left[\exp(g_i'\gamma) + \exp\left(-\exp(b_i'\beta)\right)\right] + \sum_{i=1}^{n}(1 - D_i)\left[y_i b_i'\beta - \exp(b_i'\beta) - \log(y_i!)\right] - \sum_{i=1}^{n}\log\left[1 + \exp(g_i'\gamma)\right]$$
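The same quantity can be written compactly in its mixture form. The small R function below is an illustration, not code from the thesis; the argument names are placeholders.

```r
# Illustrative ZIP log-likelihood in mixture form (cf. the equations above).
# y: observed counts; B, G: covariate matrices (with intercept columns);
# beta, gamma: coefficients of the Poisson and Binomial parts.
zip_loglik <- function(y, B, G, beta, gamma) {
  lambda <- as.vector(exp(B %*% beta))      # Poisson mean, log link
  p_zero <- as.vector(plogis(G %*% gamma))  # zero-inflation probability, logit link
  ll_zero    <- log(p_zero + (1 - p_zero) * exp(-lambda))
  ll_nonzero <- log(1 - p_zero) + dpois(y, lambda, log = TRUE)
  sum(ifelse(y == 0, ll_zero, ll_nonzero))
}
```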

4.2 Hierarchical regression models

A traditional regression model is defined by

$$Y_i = \beta_0 + \beta_1 X_i + e_i,$$

where Y_i is the response variable of subject i, β_0 is the intercept, β_1 the regression coefficient, X_i the covariates of i and e_i its error. It assumes that the subjects are independent of each other. Because of this assumption, this model ignores the fact that subjects of a certain group may not be independent of each other. This may result in incorrect estimates for the standard errors of the coefficients. This also applies to the effect of a certain covariate on a specific group. Therefore, the traditional linear model is of limited utility and may give very deceptive results when used in a hierarchical situation [1]. Traditional regression models also don't allow for the incorporation of group level effects. This means that they are using fixed parameters, which are constant over all the groups. The alternative to a fixed parameter is called a random parameter and differs in that it has a different value for each group.

In a Bayesian context fixed parameters get estimated independently (with independently specified priors) and therefore are all drawn from individual distributions while all levels of random parameters are modelled as being drawn from a common distribution, which is usually Normal.

The following models include increasingly more random parameters to display the different modelling options.

One approach to address at least some of these issues is the random intercept model

$$Y_{ij} = \beta_{0j} + \beta_1 X_{ij} + e_{ij},$$

which starts to incorporate the hierarchical nature of the data. This is done by fitting a regression line to each group j, while being constrained to have the same slopes. A random intercept model would be fit if only subject level characteristics were of interest, and if one believed that the average response value varied across groups, whereas the effect of each subject level characteristic was the same for each subject of a group.

The next, more complex model, which factors in varying slopes between groups, is the random coefficients model [40] [53]

$$Y_{ij} = \beta_{0j} + \beta_{1j} X_{ij} + e_{ij}.$$

In this model a separate regression model is fit to each group, making it possible for the regression coefficients to vary across group specific models. As in the random intercept model, each group's model is shrunk towards the average regression model, making it possible to include small groups. This model also makes it possible to examine the correlation between estimated regression coefficients. A random coefficients model would be fit if only subject level characteristics were of interest and if one wanted to allow the effect of at least some of the characteristics to vary across groups.

The most complex model presented here is the full multilevel model

$$Y_{ij} = \beta_{0j} + \beta_{1j} X_{ij} + e_{ij}$$
$$\beta_{0j} = \alpha_{00} + \alpha_{01} w_j + \omega_{0j}$$
$$\beta_{1j} = \alpha_{10} + \alpha_{11} w_j + \omega_{1j},$$

which extends the random coefficients model by incorporating explanatory variables measured at different levels of the hierarchy. This makes it possible to use the group level variable to explain the variation in the group specific regression models. Multilevel models enable an examination of effects of characteristics measured at different levels of the hierarchy, enabling the drawing of valid inferences from the data. Traditional regression models applied to data with hierarchical correlations will tend to produce misleading results [1].

Because the data presented in Chapter 3 is in a yearly format of several measurements of the same company, it can be modeled as repeated measurements. When modeling random effects as repeated measurements, instead of interpreting i and j as subjects and groups they can be seen as observation and individual, giving the model a two-level structure with measurements nested within individuals. This extension is very natural for this type of data, as repeated measurements are inherently dependent responses.


This mixed model with multiple levels allows y_ij to be the response variable for the j-th individual at the i-th repeated measurement.

Repeated measurements are almost always obtained for the assessment of change [39]. This often concerns a kind of growth or development. As this development happens over time the expected values of the observations can be modelled as a function of time.
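In the R formula syntax used for the implementation later on (lme4/brms style), the random intercept and random coefficients models above differ only in the group-level term. The snippet below is purely illustrative; `y`, `x` and `company` are placeholder names.

```r
# Illustrative model formulas (placeholder variable names):
f_random_intercept    <- y ~ x + (1 | company)      # common slope, company-specific intercept
f_random_coefficients <- y ~ x + (1 + x | company)  # intercept and slope both vary by company
```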

4.3 Hamiltonian Monte Carlo

Bayesian inference is the process of drawing statistical conclusions from a Bayesian point of view. Bayesian inference of parameters assumes a model where data x is generated from a probability distribution depending on an unknown parameter θ, and prior knowledge about the parameter θ that can be expressed as a probability distribution p(θ). Observed data x can be used to update the prior knowledge about this parameter using Bayes' theorem as follows.

$$p(\theta \mid x) = \frac{p(x \mid \theta)\, p(\theta)}{p(x)}$$

The updated knowledge about the parameter, p(θ|x), is called the posterior, and the probability distribution of the observed data given a parameter value, p(x|θ), is called the likelihood. The probability distribution of the observed data, independent of any parameter value, p(x), requires the computation of the integral

$$p(x) = \int_{\theta} p(x \mid \theta)\, p(\theta)\, d\theta.$$

This computation can be done in lower dimensions but becomes intractable for data with higher dimensions, like the data presented in Section 3.2. One way to overcome this issue is to use Markov chain Monte Carlo (MCMC) sampling. Instead of trying to deal with intractable computations involving the posterior, MCMC can draw samples from this distribution (using only its unnormalised definition) and use these samples to compute various point statistics such as the mean and variance, or even to approximate the distribution by kernel density estimation.

Hamiltonian Monte Carlo, also known as HMC, is a Markov chain Monte Carlo (MCMC) algorithm which is able to explore the target distribution much more efficiently, resulting in faster convergence than many other MCMC algorithms. This is because it gets rid of their random-walk behavior by adopting physical system dynamics rather than a probability distribution to propose future states in the Markov chain.

Hamiltonian Monte Carlo was initially presented in [17] as Hybrid Monte Carlo to perform calculations in lattice quantum chromodynamics. It was later introduced in the context of statistics in [47] and got the name Hamiltonian Monte Carlo in [42].

HMC takes a series of steps informed by first-order gradient information. By doing this it avoids the random walk behavior and the sensitivity towards correlated parameters that many MCMC algorithms struggle with. It also lets HMC converge more quickly to high-dimensional target distributions.


The Hamiltonian Monte Carlo algorithm depends on three parameters: the discretization time ε, the number of leapfrog steps L and the number of samples M. It can be defined as follows, with an auxiliary momentum variable r_d for each model variable θ_d.

Algorithm 1: Hamiltonian Monte Carlo

Input: θ^0, ε, L, ℒ, M
for m = 1 to M do
    Sample r^0 ~ N(0, I)
    Set θ^m ← θ^(m−1), θ̃ ← θ^(m−1), r̃ ← r^0
    for i = 1 to L do
        Set θ̃, r̃ ← Leapfrog(θ̃, r̃, ε)
    end for
    With probability α = min{1, exp(ℒ(θ̃) − ½ r̃ · r̃) / exp(ℒ(θ^(m−1)) − ½ r^0 · r^0)},
        set θ^m ← θ̃ and r^m ← −r̃
end for

function Leapfrog(θ, r, ε):
    Set r̃ ← r + (ε/2) ∇_θ ℒ(θ)
    Set θ̃ ← θ + ε r̃
    Set r̃ ← r̃ + (ε/2) ∇_θ ℒ(θ̃)
    return θ̃, r̃

As these momentum variables are drawn independently from the standard normal distribution, they yield the unnormalized joint density

$$p(\theta, r) \propto \exp\left(\mathcal{L}(\theta) - \tfrac{1}{2}\, r \cdot r\right),$$

with ℒ being the logarithm of the joint density of the variables of interest θ, and x · y being the scalar product of the vectors x and y.

The algorithm first resamples the momentum variables of each sample m, which can be interpreted as a Gibbs sampling update from a standard multivariate normal. It then applies L leapfrog updates to the position and momentum variables θ and r, returning a proposal-momentum pair θ̃, r̃. It proposes to set θ^m ← θ̃ and r^m ← −r̃ and decides to accept or reject the proposal based on the Metropolis algorithm [44]. In practice the negation of r̃ can be ignored if only sampling from p(θ) is of interest, but it is necessary in theory to obtain time-reversibility. The acceptance probability depends on the term log[p(θ̃, r̃)/p(θ, r)], which is the negative change in energy of the simulated Hamiltonian system up to this point in time. In principle the discretization error can grow without bound as a function of L, but this typically doesn't happen because of the symplecticness of the leapfrog discretization [30]. As a result HMC is able to run with many leapfrog steps, generating proposals for θ that have a high probability of acceptance despite being distant from the previous sample.

Choosing the correct values for ε and L has a strong impact on the performance of HMC. ε has to be large enough not to waste time taking small steps, but also not so large that the simulation becomes inaccurate and yields low acceptance rates. L has to be large enough so that successive samples are not too close to each other and don't create a random walk behavior with slow mixing. If L is too large, however, HMC will create trajectories that loop back and retrace their steps. In the worst case L is chosen so that the trajectory jumps from one side of the space to the other with each iteration, in which case the Markov chain may not even be ergodic [46].
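For illustration, a single HMC transition can be written compactly in R. This is a didactic sketch, not code from the thesis; `log_p` and `grad_log_p` are assumed to be user-supplied functions returning the log target density and its gradient.

```r
# Minimal sketch of one HMC transition (didactic, not the thesis's code).
hmc_step <- function(theta, eps, L, log_p, grad_log_p) {
  r <- rnorm(length(theta))                 # resample momentum r ~ N(0, I)
  theta_prop <- theta
  r_prop <- r
  for (i in seq_len(L)) {                   # L leapfrog updates
    r_prop <- r_prop + (eps / 2) * grad_log_p(theta_prop)
    theta_prop <- theta_prop + eps * r_prop
    r_prop <- r_prop + (eps / 2) * grad_log_p(theta_prop)
  }
  # Metropolis accept/reject based on the change in Hamiltonian energy
  log_alpha <- (log_p(theta_prop) - 0.5 * sum(r_prop^2)) -
               (log_p(theta)      - 0.5 * sum(r^2))
  if (log(runif(1)) < log_alpha) theta_prop else theta
}
```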

No-U-Turn Sampler

The original No-U-Turn Sampler, also called NUTS, as introduced in [30], is a dynamic implementation of Hamiltonian Monte Carlo. NUTS differs not just by employing the No-U-Turn termination criterion but also by using a multiplicative expansion of each trajectory and a slice sampler to sample states from them. It utilizes adaptive integration times, where the number of leapfrog steps can vary. Because of this, L is controlled automatically, which is crucial for the reasons mentioned above for HMC. A more detailed explanation can be found in [30] and [4].

One of the popular implementations of the No-U-Turn Sampler is Stan [62]. Stan was first developed around the original No-U-Turn Sampler [12] but has since changed to using multinomial sampling from each trajectory instead of slice sampling.

4.4 QR Decomposition

QR decomposition is often used to solve linear least squares problems and helps models perform better, often both in terms of wall time and in terms of effective sample size. In this case it was used to increase numerical stability so that the NUTS is able to compute the posterior. Because of its benefits, the Stan User's Guide [63] generally recommends its application for linear and generalized linear models in Stan whenever the number of covariates K > 1 and an informative prior on the location of the regression coefficients is unavailable.

The QR decomposition of a matrix A is a product A=QR of an orthogonal matrix Q and an upper triangular matrix R. It is often used to solve the linear least squares problem and is the basis for a particular eigenvalue algorithm, the QR algorithm.

If A ∈ ℝ^(m×n) has linearly independent columns, it can be factored as A = QR, with Q being m × n with orthonormal columns (QᵀQ = I). This means that if A is square with m = n, then Q is orthogonal (QᵀQ = QQᵀ = I). R is an n × n upper triangular matrix with nonzero diagonal elements.

The matrices Q and R can be computed in different ways, with Gram-Schmidt being the most popular one. The Gram-Schmidt procedure is not recommended in practice, as it can lead to cancellation that causes inaccuracy in the computation of q_j, which may result in a non-orthogonal Q matrix.

Instead it is more common to use Householder reflections [31] in practice, which are another method of orthogonal transformation, transforming a vector x into a unit vector y parallel with x. It works by finding appropriate H matrices and multiplying them from the left by the original matrix A to construct the upper triangular matrix R.

Unlike the Gram-Schmidt procedure, the Householder reflection approach does not explicitly form the Q matrix. However, the Q matrix can be found by taking the product of the successively formed Householder matrices.

$$Q = H_1 H_2 \cdots H_{m-2} H_{m-1}$$

These Householder reflection matrices with normal vector v take the form H = I − 2vvᵀ.

Because of this, H needs to fulfill Hx = αe_1 for some constant α and e_1 = [1, 0, ..., 0]ᵀ.

Since H is orthogonal, ‖Hx‖ = ‖x‖ and ‖αe_1‖ = |α| ‖e_1‖ = |α|. Therefore α = ±‖x‖. The sign is selected so that it is the opposite of the sign of x_1 (for the remaining definitions + will be used). The vector u is thus defined as

$$u = \begin{bmatrix} x_1 + \operatorname{sign}(x_1)\lVert x \rVert \\ x_2 \\ \vdots \\ x_n \end{bmatrix}.$$


With the unit vector v defined as v = u/‖u‖, the corresponding Householder reflection can then be written as

$$H(x) = I - 2vv^T = I - 2\frac{uu^T}{u^T u}.$$

From this the Householder algorithm can be formed.

Algorithm 2: Householder Tridiagonalization

Input: A ∈ ℝ^(m×n)
for k = 1 to n do
    x ← A_(k:m, k)
    Compute the (m − k + 1)-vector u_k: u = x + sign(x_1)‖x‖ e_1,  v_k = u / ‖u‖
    Multiply A_(k:m, k:n) with the reflector I − 2 v_k v_kᵀ:
        A_(k:m, k:n) ← A_(k:m, k:n) − 2 v_k (v_kᵀ A_(k:m, k:n))
end for
return the upper triangular matrix R = A_(1:n, 1:n) and v_1, ..., v_n, with v_k of length m − k + 1

This algorithm was introduced in [5] and returns the matrix R, which is sufficient for most calculations. If there is a need to compute Q, it can be done with the vectors v_1, ..., v_n that define

$$Q = H_1 H_2 \cdots H_n,$$

as they are an economical representation of Q.
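In R, the same factorization is available through the base `qr()` function, which also uses Householder reflections internally. The following is a brief illustration rather than part of the thesis; the matrix `A` is an arbitrary example.

```r
# Brief illustration of the QR decomposition in base R (Householder-based).
set.seed(1)
A <- matrix(rnorm(12), nrow = 4, ncol = 3)   # example 4 x 3 matrix

qr_A <- qr(A)        # compact Householder representation
Q <- qr.Q(qr_A)      # 4 x 3 matrix with orthonormal columns
R <- qr.R(qr_A)      # 3 x 3 upper triangular matrix

all.equal(A, Q %*% R)              # A = QR up to numerical precision
all.equal(crossprod(Q), diag(3))   # Q'Q = I
```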

4.5 Variable Selection

Variable selection is the process of finding the "best" subset of predictors in a model. It can be used to explain the data in the simplest way, to get rid of unnecessary predictors that add noise to the estimation, to lower collinearity and to save on computational cost.

Lasso

Lasso stands for Least Absolute Shrinkage and Selection Operator and is a regression analysis method that was first introduced by [66]. It performs both variable selection and regularization, with the goal of enhancing the prediction accuracy and interpretability of the resulting statistical model. Lasso was chosen to conduct a preselection of the predictors because it is well suited for models showing high levels of multicollinearity. This is the case for the ESG indicators, as predictors are often identical with each other or at least highly correlated. Lasso ensures that of such predictors only one gets selected.

When a given response variable does not follow a normal distribution it is not appropriate to apply a linear model. Instead a Generalized Linear Model is applied, so that the probability distribution can be from the exponential family.

The first part of a ZIP model follows a binomial distribution and can therefore be modelled as

$$\log \frac{\Pr(G = 1 \mid X = x)}{\Pr(G = 0 \mid X = x)} = \beta_0 + \beta^T x$$

with the possible values of the response variable G = 0, 1 when using the log-odds transformation.

The objective function for the penalized logistic regression uses the negative of the binomial log-likelihood and is

$$\min_{(\beta_0, \beta) \in \mathbb{R}^{p+1}} -\left[\frac{1}{N}\sum_{i=1}^{N} y_i \cdot \left(\beta_0 + x_i^T\beta\right) - \log\left(1 + e^{\beta_0 + x_i^T\beta}\right)\right] + \lambda\left[(1 - \alpha)\lVert\beta\rVert_2^2 / 2 + \alpha\lVert\beta\rVert_1\right].$$

The second part of the ZIP model follows a Poisson distribution, and its mean can be modelled on the log scale as log μ(x) = β_0 + β^T x.

The log-likelihood for observations {x_i, y_i}_1^N is given by

$$l(\beta \mid X, Y) = \sum_{i=1}^{N}\left[y_i\left(\beta_0 + \beta^T x_i\right) - e^{\beta_0 + \beta^T x_i}\right].$$

Correspondingly, the penalized log-likelihood is optimized by

$$\min_{\beta_0, \beta}\; -\frac{1}{N}\, l(\beta \mid X, Y) + \lambda\left[(1 - \alpha)\sum_{j=1}^{p}\beta_j^2 / 2 + \alpha\sum_{j=1}^{p}\lvert\beta_j\rvert\right].$$

The optimization of the penalized log-likelihoods of the Binomial and the Poisson part has a lasso penalization when α = 1.

Projection Predictive Variable Selection

Most common Bayesian approaches to variable selection attempt to solve the two problems, prediction and feature selection, simultaneously. [49] argues that in many situations it can be beneficial to solve these problems in two stages. Step one is to construct the best possible prediction model, which potentially includes a lot of features. In this context this model is called the reference model.

Step two is to find a simpler model (with acceptable complexity) that gives similar predictions to the reference model. Given a complexity (number of features), the model with the smallest predictive discrepancy to the reference model is selected.

Posterior projection replaces the posterior distribution p(θ*|D) of the reference model, with D as training data, with a simpler distribution q⊥(θ) that is restricted in some way. The domain of the projected parameters θ ∈ Θ typically differs from the domain of the reference model parameters θ* ∈ Θ*. In a feature selection context for GLMs, this would mean constraining some regression coefficients to be exactly zero. The projection is defined by the discrepancy between the induced predictive distributions.

$$\begin{aligned}
\mathrm{KL}\left(p(\tilde{y} \mid D)\,\|\,q(\tilde{y})\right) &= \mathrm{E}_{\tilde{y}}\left(\log p(\tilde{y} \mid D) - \log q(\tilde{y})\right) \\
&= -\mathrm{E}_{\tilde{y}}\left(\log q(\tilde{y})\right) + \text{const.} \\
&= -\mathrm{E}_{\tilde{y}}\left(\log \mathrm{E}_{\theta}\left(p(\tilde{y} \mid \theta)\right)\right) + \text{const.} \\
&= -\mathrm{E}_{\theta^*}\left(\mathrm{E}_{\tilde{y} \mid \theta^*}\left(\log \mathrm{E}_{\theta}\left(p(\tilde{y} \mid \theta)\right)\right)\right) + \text{const.}
\end{aligned}$$

Here E_{θ*}(·), E_{ỹ|θ*}(·) and E_θ(·) denote expectations over p(θ*|D), p(ỹ|θ*) and q⊥(θ), respectively.

The optimal projection of the posterior p(θ*|D) from parameter space Θ* to Θ in terms of minimal predictive loss would then be the distribution q⊥(θ) that minimizes the functional equation above [49]. This equation is difficult to solve even for simple models and therefore only serves as the ideal when reformulating the projection in a more tractable way.

In practice, different projection techniques like the draw-by-draw, single point and clustered projection are used. Going further into these techniques is beyond the scope of this thesis.


5 Method

The practical part of this thesis was to fit a zero-inflated regression model with the objective to predict controversies and perform inference on predictors. The multilevel ZIP model shown in Section 5.1 was fitted using NUTS through Stan. In order to achieve explainability and decrease complexity, two steps of variable selection were applied beforehand. Because the ZIP model is very complex, instead of using the entire model for the variable selection it was split into a Binomial part and a Poisson part. The variable selection was then conducted for both parts individually. These steps are illustrated in Figure 5.1 and explained in more detail in Section 5.2.


Figure 5.1: Illustration of the different steps taken

5.1 Model

As introduced in Section 4.2, there are many ways to model a regression problem. The hierarchical regression model chosen in this thesis was formulated in the following way. It is a multilevel model with fixed coefficients and random intercepts and was used to model both the π and λ introduced in Section 4.1. Level 1 of the hierarchical model resembles the components of a traditional fixed effects model, while level 2 specifies the components they are calculated from. In this case it is only the intercept that has a second level, which consists of the population intercept and the group intercept specific to every company. This form of random intercept was chosen because it accounts for the fact that there is correlation within the repeated measurements of a company.

For the Binomial part this results in π being modeled as

Level 1:
$$\pi_{ij} = \gamma_{0j} + \gamma_1 G_i + e_{ij}$$
Level 2:
$$\gamma_{0j} = \delta_{00} + \omega_{0j}$$

and for the Poisson part λ being modeled as

Level 1:
$$\lambda_{ij} = \beta_{0j} + \beta_1 B_i + e_{ij}$$
Level 2:
$$\beta_{0j} = \zeta_{00} + \omega_{0j}$$

with i being the observation and j the group the observation belongs to. In this case the observation is a certain controversy count and the group is the company that controversy count belongs to. On the second level, δ_00 and ζ_00 are the population intercepts, while the ω_0j are the group-level terms that determine the deviation of each company's intercept from the population intercept.

The prior distributions used in the fitting of the model are mostly uninformative. The prior distribution of the intercept of the Poisson part was defined as Student t(3, -2.3, 2.5) and that of the intercept of the Binomial part as Logistic(0, 1); these are displayed in Figures 5.2 and 5.3. The other coefficients of both parts were given flat priors. The standard deviations of the random intercepts in both parts were given the prior distribution Student t(3, 0, 2.5), which is displayed in Figure 5.4.

Figure 5.2: Distribution of Student t (3, -2.3, 2.5)


Figure 5.4: Distribution of Student t (3, 0, 2.5)
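For reference, densities like those in Figures 5.2 to 5.4 can be reproduced with a few lines of R; the plotting range below is an arbitrary choice, not taken from the thesis.

```r
# Sketch of the prior densities shown above (illustrative plotting code).
x <- seq(-15, 15, length.out = 500)

# Student t(3, -2.3, 2.5): location-scale transform of a standard t with 3 df
dens_t_intercept <- dt((x - (-2.3)) / 2.5, df = 3) / 2.5
# Logistic(0, 1)
dens_logistic <- dlogis(x, location = 0, scale = 1)
# Student t(3, 0, 2.5), used for the random-effect standard deviations
dens_t_sd <- dt(x / 2.5, df = 3) / 2.5

plot(x, dens_t_intercept, type = "l", ylab = "density", xlab = "value")
lines(x, dens_logistic, lty = 2)
lines(x, dens_t_sd, lty = 3)
```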

5.2 Variable Selection

When the goal is to make the most accurate prediction possible, all predictors should be included, as this maximizes the amount of information utilized for the prediction. [50] and [51] recommend including all available prior information and integrating over all uncertainties if prediction accuracy is the sole goal of the model.

The prediction itself doesn't benefit from variable selection, but selection can be used to reduce measurement or computation cost in the future and to improve explainability. This was needed to find the most influential predictors and their impact on the response variable. On top of that, a model with such a high number of predictors quickly runs into numerical stability issues, which were avoided by downsizing the model.

To decrease the complexity of the variable selection and to be able to draw from a bigger variety of methods, it was decided to conduct it for the Binomial and Poisson part individually instead of for the full ZIP model. For the Binomial part, a Binomial regression was fitted to predict whether a controversy count is zero or nonzero, as this simulates the calculation of π in the ZIP model. For the Poisson part, a Poisson regression was fitted to predict the value of the nonzero counts. This step would only be identical to the ZIP model if the π value of these observations were zero. This is usually not true, and the Poisson regression is therefore only an approximation of this step.

The preferred method for the variable selection was the projection predictive variable selection, but it requires a fitted Stan model. Because the full model was too big to be fitted, a preselection of predictors had to be done. For this, Lasso was chosen. All steps of the variable selection were performed on the training data, that is, the data of the years 2016-2019.

Lasso

To narrow down the selection of variables, a Lasso generalized linear model was implemented for the Binomial and the Poisson part individually. Both parts were run on all 331 predictors. The lasso penalty was applied to discard correlated predictors. It was implemented with the R package glmnet [22], utilizing k-fold cross-validation to find the optimal regularization factor λ. For this, the package's default settings were used. For the Binomial part, the most regularized model such that the error is within one standard error of the minimum was selected, as this guarantees a model without too high an error that has a small enough selection of variables to be sampled with Stan. For the Poisson part, the model at the λ minimizing the cross-validated error was selected. This was possible because it included fewer covariates than for the Binomial part.
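A hedged sketch of this preselection step is shown below; the object names `X`, `X_nonzero`, `y_binary` and `y_counts` are illustrative placeholders, not the thesis's actual code.

```r
# Hedged sketch of the Lasso preselection (placeholder names): X is the model
# matrix of the 331 predictors, y_binary / y_counts the two responses.
library(glmnet)

nonzero_predictors <- function(cvfit, s) {
  cf <- coef(cvfit, s = s)
  setdiff(rownames(cf)[as.vector(cf) != 0], "(Intercept)")
}

# Binomial part: most regularized model within one SE of the minimum (lambda.1se)
cv_bin <- cv.glmnet(X, y_binary, family = "binomial")
sel_bin <- nonzero_predictors(cv_bin, "lambda.1se")   # 25 predictors in the thesis

# Poisson part, fitted on the nonzero counts: lambda at the CV minimum (lambda.min)
cv_poi <- cv.glmnet(X_nonzero, y_counts, family = "poisson")
sel_poi <- nonzero_predictors(cv_poi, "lambda.min")   # 30 predictors in the thesis

plot(cv_bin)   # cross-validation curve as in Figure 6.1
plot(cv_poi)   # cross-validation curve as in Figure 6.2
```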


Projection predictive variable selection

The projection predictive variable selection for generalized linear models was implemented with the R package projpred [52]. The implemented method is described in detail in [49]. The package is, among others, compatible with brms models of the type Gaussian, Binomial and Poisson. The brms package provides an interface to fit Bayesian generalized (non-)linear multivariate multilevel models using Stan [62].

Stan is a comprehensive software ecosystem aimed at facilitating the application of Bayesian inference. It features an expressive probabilistic programming language for specifying sophisticated Bayesian models, backed by extensive math and algorithm libraries to support automated computation. To deliver this functionality, Stan translates its model code into a high-performance C++ program which implements the No-U-Turn Sampler introduced in Section 4.3.

The reason brms [6] was chosen as a means of implementation is its wide variety of features needed for this problem.

To get the most influential predictors, a Binomial model was fitted in brms as a multilevel model, giving individual intercepts to the companies. To increase sampling efficiency and numerical stability, the QR decomposition of Stan was used, which utilizes the Householder algorithm. The model was fitted on whether a company has zero or above zero controversies, to mimic the first step of the ZIP. This was done including the selection of 25 predictors given by Lasso, over 6000 iterations with non-informative priors. Then the cross-validation method of projpred was applied, which computes a LOO-CV estimate of the predictive performance for the best models with a certain number of variables. The same was done for the Poisson model with the nonzero counts of controversies and the 30 predictors given by Lasso.
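A hedged sketch of this step for the Binomial part is given below. It is not the thesis's exact code; the data frame `d`, the response `any_controversy`, the predictors `x1`-`x3` (standing in for the 25 Lasso-selected ones) and the grouping factor `company` are placeholders.

```r
# Hedged sketch of the projection predictive step for the Binomial part.
library(brms)
library(projpred)

ref_bin <- brm(
  any_controversy ~ x1 + x2 + x3 + (1 | company),  # 25 predictors in practice
  data = d, family = bernoulli(), iter = 6000
)

# LOO cross-validated search over submodel sizes; the suggested size is read
# off the resulting cross-validation curve (cf. Figure 6.3).
cvs <- cv_varsel(ref_bin, cv_method = "LOO")
plot(cvs, stats = c("elpd", "rmse"))
suggest_size(cvs)
```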

The cross-validation curves given by projpred were then evaluated and the number of the most relevant predictors chosen accordingly. These selections of variables for the Binomial and the Poisson part were then given to the ZIP model.

5.3 Zero Inflated Poisson

The model introduced in 5.1 was implemented in R with the brms [6] package. It was fitted with the variables selected by the projection predictive variable selection. The variables selected for the Binomial part were used to model the zero inflation probability π and the variables selected for the Poisson part were used to model the Poisson mean λ. The model was run for 6000 iterations on the data of the years 2016-2019 and then used to predict the controversies of the year 2020.

It was implemented using the QR decomposition via the Householder algorithm to model both the Poisson mean and the zero inflation probability of the observations.
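A sketch of how such a model can be specified in brms is given below, with proj_bin and proj_pois as illustrative names for the predictor sets chosen by projpred for the two parts; the exact formula of the thesis is not reproduced.

library(brms)

# Zero inflated Poisson with company-specific random intercepts in both parts.
# The main formula models the Poisson mean lambda, the zi formula the zero
# inflation probability pi; decomp = "QR" again triggers the QR decomposition.
form_zip <- bf(reformulate(c(proj_pois, "(1 | company)"), response = "y"),
               reformulate(c(proj_bin,  "(1 | company)"), response = "zi"),
               decomp = "QR")

fit_zip <- brm(form_zip, data = train,
               family = zero_inflated_poisson(), iter = 6000)

# Posterior mean predictions for the held-out year 2020, rounded to whole counts.
pred_2020 <- round(colMeans(posterior_predict(fit_zip, newdata = test)))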


6 Results

This chapter presents the results of the variable selection and the results gained by running the ZIP model on the selected variables.

6.1 Variable Selection

Lasso

This section visualizes the results of the Lasso regression for the Binomial and the Poisson part of the model. The images include the cross-validation curve (red dotted line) and the upper and lower standard deviation curves along the λ sequence (error bars). The two vertical dotted lines indicate two particular values of λ: the one to the left is the value of λ that gives the minimum mean cross-validated error, and the one to the right is the value of λ that gives the most regularized model such that the error is within one standard error of the minimum.


Figure 6.1: Cross-validation curve of Binomial model with LASSO

For the Binomial model, the model at the error-minimizing λ included 63 predictors and the most regularized model within one standard error included 25 predictors. Because 63 predictors are too many to fit a model with in Stan on this data, the 25 predictors of the most regularized model within one standard error were chosen instead and used for the Binomial part in 5.2.

Figure 6.2: Cross-validation curve of Poisson model with LASSO

For the Poisson model, the model at the error-minimizing λ included 30 predictors, while the most regularized model within one standard error only included 1 predictor. Because 30 predictors are still manageable for Stan on this data, this selection was chosen and used for the Poisson part in 5.2.


Projection predictive variable selection

The images show the estimated predictive performance of smaller models compared to the full model. The statistics shown are the expected log predictive density (ELPD) and the root mean squared error (RMSE), computed on the training data as a function of the number of variables added. The variables added to the submodel are ordered by their relevance to the predictive performance. The two measurements are given for each submodel individually and in relation to the full model. The red dotted lines at the top and bottom represent the full model performance, and the points show how much the performance of the submodel differs from it. The black bars through each point represent its uncertainty, indicating the standard error of the submodel for this particular measurement.

Figure 6.3: Cross-validation curve for the projection predictive variable selection of Binomial model


Figure 6.4: Cross-validation curve for the projection predictive variable selection of Poisson model

After analyzing the cross-validation curves the following predictors were selected to be included in the ZIP model.

Set of predictors selected by projpred for the Binomial part of the final model:
1. y_prev
2. Consumer.Complaints.Controversies
3. Product.Quality.Controversies
4. Wages.Working.Condition.Controversies.Count_missing
5. Strikes
6. Bribery..Corruption.and.Fraud.Controversies
7. Business.Ethics.Controversies_missing
8. Carbon.Offsets.Credits_missing
9. Number.of.Analysts

Set of predictors selected by projpred for the Poisson part of the final model:
1. Policy.Emissions
2. Elimination.of.Cumulative.Voting.Rights_missing
3. Consumer.Complaints.Controversies
4. Fundamental.Human.Rights.ILO.UN
5. Board.Meeting.Attendance.Average_missing
6. Toxic.Chemicals.Reduction
7. Renewable.Energy.Produced_missing
8. Number.of.Analysts


6.2 Zero Inflated Poisson

This section displays the results of the ZIP model fitted on the years 2016-2019 and the results of the prediction for the year 2020.

Fitted Model

                 Actual y>0   Actual y=0
Fitted y>=0.5        92           136
Fitted y<0.5         15           681

Table 6.1: Confusion matrix of zero/non-zero values of the fitted model

Table 6.1 shows that the ZIP model was able to predict most of the yearly controversy counts that were zero correctly, but more than half of the counts predicted as non-zero were actually zero. It did however cover 92 out of the 107 non-zero counts. In this confusion matrix a prediction counts as zero if it is below 0.5, as the count gets rounded to the closest whole number. As this is a very strict cut, Figure 6.5 visualizes the fitted and actual values slightly outside these boundaries. It can be seen that the vast majority of the observations that are actually zero were predicted to be between 0 and 1.7. This is also the case for observations with actual values of 1 or 2.
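For illustration, such a confusion matrix can be computed from the fitted model along the following lines (fit_zip and train as in the earlier sketches, both assumed names rather than the original code):

# Posterior mean of the fitted counts on the training data; a value below 0.5
# rounds to zero and is therefore counted as a predicted zero.
fit_mean <- colMeans(posterior_predict(fit_zip))
table(Fitted = ifelse(fit_mean >= 0.5, "y>=0.5", "y<0.5"),
      Actual = ifelse(train$y > 0, "y>0", "y=0"))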

Figure 6.5: Zoomed in predicted vs. observed plot for prediction of training data

Figure 6.6 shows that the yearly controversy counts that were predicted to be above zero are reasonably close to their actual values.


Figure 6.6: Predicted vs. observed plot for prediction of training data

The density plots are computed from posterior draws with all chains merged, with uncertainty intervals shown as shaded areas under the curves. The plots show the density of the posterior parameters. The light blue area in the plots marks the 50% intervals and the dark blue lines the posterior medians. Variables beginning with "b_" are the regression coefficients, and variables starting with "sd_" display the standard deviation of the random effects. If one of these prefixes is followed by "zi_", the parameter belongs to the Binomial part of the ZIP model; otherwise it belongs to the Poisson part.

Even though the implementation utilized the QR decomposition to fit the model, the coefficients were transformed back to their original form at the end of each sampling iteration.
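Plots of this kind can, for example, be produced through brms' bayesplot interface; the exact plotting code used here is not shown, so the call below is only a sketch.

# Area/density plots of the posterior draws with all chains merged; prob = 0.5
# shades the 50% intervals around the posterior medians.
mcmc_plot(fit_zip, type = "areas", prob = 0.5)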


Figure 6.7: Density plots of intercepts and random effect standard deviation


Figure 6.9: Density plots of coefficients modeling zero inflation probability


Figure 6.11: Density plots of coefficients modeling Poisson mean

The plot in the top left corner of Figure 6.12 is a histogram of the average acceptance probabilities of all possible samples in the proposed tree. The solid line indicates the mean and the dashed line the median of the acceptance probabilities.

The plot to its right displays the histogram of the draws from the log-posterior. The solid line indicates the mean and the dashed line the median of the log-posterior.

The x-axis named accept_stat__ refers to the average acceptance probabilities in the proposed tree. The y-axis named lp__ refers to the draws of the log-posterior. As a result, the plot shows the average acceptance probabilities for different values of the log-posterior.
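These diagnostics can be recreated from the fitted model with bayesplot's NUTS helpers, assuming fit_zip is the fitted brms model from the sketches above:

library(bayesplot)

# Sampler diagnostics and log-posterior draws extracted from the brms/Stan fit;
# mcmc_nuts_acceptance() combines the accept_stat__ histogram, the lp__ histogram
# and the acceptance-vs-log-posterior panel shown in Figure 6.12.
np <- nuts_params(fit_zip)
lp <- log_posterior(fit_zip)
mcmc_nuts_acceptance(np, lp)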


Figure 6.12: NUTS acceptance plot of ZIP

Prediction

The visualization of the prediction on the test data is intended to show whether the model is suited for forecasting.

                    Actual y>0   Actual y=0
Predicted y>=0.5        17            29
Predicted y<0.5          4           181

Table 6.2: Confusion matrix of zero/non-zero values of prediction for 2020

As Table 6.2 shows, the model was able to predict most of the companies with zero controversies in the following year correctly. The prediction of non-zero controversies is less reliable: it did manage to classify 17 out of 21 non-zero counts correctly, but also predicted 29 additional counts as non-zero that were actually zero.


Figure 6.13: Zoomed in predicted vs. observed plot for 2020 prediction

Figure 6.13 shows that the vast majority of the observations that are actually zero were predicted to be between 0 and 1. Looking at Figure 6.14, however, it is visible that there are several extreme outliers whose actual values are zero or very low while their predictions reach up to 41. It also shows one outlier that was predicted to be around 4 while its actual value is over 40.


7 Discussion

7.1 Results

The results of the model that was trained on the final selection of variables for the Binomial and the Poisson part are given in Section 6.2. The confusion matrix of the fitted model on the training data is given in Table 6.1 and makes it obvious that this is not a good fit: more than half of the counts classified as non-zero are actually zero. Looking at Figure 6.6 it can be seen that the model also struggles to correctly predict the number of yearly controversies. Because these predictions were done on the training data itself, a better fit would have been expected.

As the fit on the training data was not of the highest quality, it is not surprising that the model has issues predicting the controversies of a year it has not been trained on. Table 6.2 shows that the model's differentiation between zero and non-zero counts is less accurate than on the training data, but this is to be expected and the decline in accuracy is still reasonable. The performance of the Poisson part can be seen in Figure 6.14; it suffers from a number of outliers that are very far off the actual values.

The model appears to be underfitted, which typically indicates that there is not enough data or that non-linear data is being modeled with a linear model. Both of these possibilities could be true in this case. The amount of data might be too low for the model to achieve a better fit. There also might be some non-linear features in this data which cannot be captured by the ZIP model used. To utilize possible non-linear features, a random forest regression was fitted to the training data. In contrast to the ZIP regression, the random forest regression overfitted on the training data, achieving 100% accuracy in differentiating between zero and non-zero counts and a far better fit overall. In the prediction of the test data, however, the performance of the random forest was similar to that of the ZIP, despite utilizing all 331 predictors. A problem many machine learning approaches face with data like this is that they often assume a continuous target value. Their loss functions do not account for zero-inflated data and therefore struggle to produce good results. A potential solution for this problem, which could also be applied here to improve the result of the random forest regression, is introduced in [38]: it proposes a decision tree for zero-inflated count data that uses the maximum of the zero-inflated Poisson likelihood as the split criterion.
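The comparison fit can be sketched as follows; the thesis does not state which random forest implementation was used, so the randomForest package and the object names X_train, y_train, X_test and y_test are assumptions:

library(randomForest)

# Random forest regression on the yearly counts, using all 331 predictors.
rf_fit  <- randomForest(x = X_train, y = y_train, ntree = 500)

# Rounded point predictions for the held-out year 2020 and the resulting
# zero/non-zero classification.
rf_pred <- round(predict(rf_fit, newdata = X_test))
table(Predicted = rf_pred > 0, Actual = y_test > 0)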

It has to be acknowledged that this is not a typical split of training and test data, which would sample observations from the entire population. Here the observations were chosen by year, with the years 2016-2019 used for training and the year 2020 for testing.
