Linköping University | Department of Computer and Information Science
Master's thesis, 30 ECTS | Datateknik
2020 | LIU-IDA/STAT-A--20/026--SE

Subgroup identification in classification scenario with multiple treatments

Héctor Andrés Plata Santos

Supervisor: Oleg Sysoev
Examiner: Fredrik Lindsten

Linköpings universitet, SE–581 83 Linköping


Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for a period of 25 years starting from the date of publication barring exceptional circumstances.

The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purposes. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.


Abstract

The subgroup identification field, which is sometimes called personalized medicine, tries to group individuals such that the effects of a treatment are the most beneficial for them. One of the methods developed for this purpose is called PSICA. Currently, this method works in a setting of multiple treatments and real-valued response variables. In this thesis, the methodology is extended so that it can also handle ordinal response variables that take a finite number of values. It is also compared to a competitor method, which results in similar performance but with the added value of a probabilistic output and a model that is interpretable and ready for policy making; this is achieved at the expense of a higher execution time. Finally, this extension is applied to a longitudinal study done in Nicaragua in the los Cuatro Santos population, in which some interventions were applied in order to reduce poverty. The results showed which treatments were the most beneficial for different population subgroups.


Acknowledgments

I'd like to give special thanks to my supervisor Oleg Sysoev for giving me the opportunity to do this thesis with him, for his continued support during its development, and for his overall work for the Master's programme.

I also want to express my gratitude towards my opponent Maria Treesa Sebastian and my examiner Fredrik Lindsten for their valuable feedback and insights which led this work to become a better version of itself.

Lastly, I would like to thank my family, my friends and my partner Carol, who supported me throughout these two years and without whom this work wouldn't have been possible.


Contents

Abstract
Acknowledgments
Contents
List of Figures
List of Tables
1 Introduction
  1.1 Motivation
  1.2 Aim
  1.3 Ethical considerations
2 Data
  2.1 Interventions
  2.2 Unsatisfied basic needs
  2.3 Pre-processing
3 Method
  3.1 Decision trees
  3.2 PSICA tree
  3.3 Modifications to PSICA
  3.4 General Statistical Framework for Subgroup Identification: personalized package
  3.5 Evaluation
4 Simulations
5 Results
  5.1 Simulations
  5.2 Los Cuatro Santos: Classification PSICA results
6 Discussion
  6.1 Simulations
  6.2 Los Cuatro Santos results
  6.3 Future research
7 Conclusion
Bibliography


List of Figures

2.1 Histogram of the unsatisfied basic needs in the los Cuatro Santos population
2.2 Histogram of the encoded treatments in the los Cuatro Santos population
3.1 A PSICA tree forming subgroups by dividing the covariate space, with the probabilities of a treatment being the best on each node and the labels containing the most likely optimal treatments
5.1 Personalized training time by model and number of observations with standard error bars
5.2 Classification PSICA tree without bootstrapping (without UBN of 4)
5.3 Classification PSICA tree with bootstrapping (without UBN of 4)
5.4 Classification PSICA tree without bootstrapping (with UBN of 4)


List of Tables

2.1 Encoding of the treatments
2.2 New encoding of the treatments
3.1 Confusion matrix for a binary classification problem
5.1 Accuracy and execution time (seconds) results. The main number is the mean of the simulations and in parentheses the standard error of the mean.
5.2 Sensitivity results by class. The main number is the mean of the simulations and in parentheses the standard error of the mean. Variables without entries mean that there were no true positive nor false negative cases.
5.3 Specificity simulation results. The main number is the mean of the simulations and in parentheses the standard error of the mean.
A.1 Variables in the los Cuatro Santos dataset


1 Introduction

1.1 Motivation

A randomized clinical trial is an experimental setting in which a group of interest is given different interventions or treatments, where the assignment of those treatments does not depend on the characteristics of the group of interest. This is done in order to measure how a variable of interest, usually called the potential outcome, responds to each one of those interventions. The purpose of giving these treatments independently of the characteristics of the group is to minimize any kind of bias introduced by the assignment of treatments to individuals.

When studying how certain treatments affect a population, the classical measure to establish whether a treatment had an impact on a variable of interest is the average treatment effect. Unfortunately, this measure ignores that the populations sampled in these studies are often not homogeneous [13] and present different responses to the same treatment.

The need to account for heterogeneous treatment effects has given rise to the field of subgroup identification, where the goal is to stratify the population into subgroups such that their treatment effects deviate from the average [12, 1]. The methods developed under this umbrella are important because they make it possible to identify which part of the population benefits from certain treatments, and because they maximize the positive effects that the treatments might provide to individuals. In contexts like clinical trials, where new drugs are being tested, or in policy making, where treatments are applied to a population in order to incentivize certain behaviours, the correct identification of heterogeneous treatment effects is imperative, since decisions made based on these results will have a sizable impact on the population, and in most cases those effects might not be reversible.

These methods are primarily used in clinical trials for precision medicine as exploratory or confirmatory tools [13, 8, 2, 18, 12]. They fall into two main groups. The first is the guidance-driven approaches, where the subgroups need to be pre-specified, while the second category comprises the data-driven approaches, where the strategy for discovering the subgroups needs to be pre-specified [2].

There are four categories under the data-driven strategies specified by Lipkovich et al. [10]. The first one is global outcome modeling; in this setting, the methods directly model the potential outcome given a treatment and the characteristics of the individual. One example of this approach is the Virtual Twins method [6], in which random forests are used to model the potential outcome function. Once the estimation of the potential outcome is done, classification and regression trees are used to estimate a treatment effect, which is a measure of how beneficial a treatment is to an individual compared to a control. The second category is called global treatment effect modeling; the difference of this approach is that it directly estimates the treatment effect. One model under this category is Model-Based Recursive Partitioning for Stratified and Personalised Medicine [16]. This method uses ensembles of trees that can detect similarities between patients in terms of treatment effects and uses a similarity measure to estimate personalised treatment effects. The third category of methods models individual treatment regimes. This means that they model what are called optimal treatment regimes, which try to assign the best treatment for a given patient. An example of a method in this category is Outcome Weighted Learning (OWL) [19], in which a weighted classifier is used to estimate the optimal treatment assignment. The final group falls under the category of local treatment effect modeling. These methods identify subgroups which have an enhanced treatment effect, meaning that their treatment effect is above some defined threshold. An example that falls under this category is SIDES [11], which uses a partitioning-based method to identify subjects with enhanced treatment effects.

In this thesis, we are going to focus on the data-driven methods. More specifically, the work is built upon a method called PSICA [14]. PSICA stands for Probabilistic Subgroup Identification with Categorical treatments. This method is the first one among the data-driven approaches that is able to handle categorical treatments in a setting where the potential outcome is a continuous variable. Its purpose is to create subgroups and suggest which treatments are more beneficial to each subgroup, together with the respective probability of each treatment being the best. Under the above categorization, PSICA places itself as a mixture of global outcome modeling and local treatment effect modeling, since it uses random forests to estimate the potential outcome and then looks for subgroups with enhanced treatment effects making use of the potential outcome estimator. Unfortunately, as mentioned above, this method only works for real-valued potential outcomes. Therefore, this thesis extends the methodology in such a way that it can handle categorical variables. In this new setting, it is assumed that the number of categories is limited (2 to 5 categories) and that the categories are comparable, meaning that one category might be preferred over another, but there is no actual value behind the categories.

1.2 Aim

The main goals of this thesis are the following:

• Generalize the PSICA method to a classification setting and modify the current implementation appropriately.

• Investigate the efficiency of the classification PSICA and its performance in terms of execution time. Also compare it to an alternative method.

• Apply the classification PSICA to the Nicaragua data and analyse which interventions are most efficient in reducing poverty for different population groups.

1.3 Ethical considerations

The Nicaragua dataset used in this thesis was provided by Carina Källestål from Uppsala University. It comes from a longitudinal study done in the area of Los Cuatro Santos in Nicaragua. The data was derived from project records and a Health and Demographic Surveillance System that was initiated in 2004 [3].

The data doesn’t contain any personal information that can lead to the identification of particular individuals or households.

The results presented in Section 6.2 shouldn't be used as a basis for policymaking, but rather as exploratory results that should be investigated further.


2 Data

The dataset comes from a longitudinal study that took place in los Cuatro Santos in Nicaragua from 1990 until 2014 [3]. During this period of time, a group of interventions was applied to households in this population with the goal of reducing poverty. There were a total of 10 interventions during this period, of which 4 were at the household level rather than the population level.

In addition to this, several variables related to each household were recorded. Some of them are related to the house itself, like the wall type (whether it is made of ceramic brick, adobe/wattle, or other), whether the house has water available, whether it has a toilet, the type of floor (ceramic brick, brick of mud, cement, tiling, mud floor), or whether it has a stove. Other variables are related to food insecurity, for example: whether there is anxiety in the household for lack of food, few kinds of food consumed due to lack of it, reduction of portion sizes of meals, no food to eat due to lack of resources, etc. There are also variables related to tools that a household might have, like a car, a bicycle, horses or a computer. The last category of variables in this dataset is about the people in the household. These variables reflect socio-economic status, like the highest education in the household, immigration, emigration, illiterate people in the household and working status. A full list of the variables in the dataset can be seen in Table A.1.

Furthermore, a variable called unsatisfied basic needs (UBN) was recorded. This variable is used to define the poverty level of a household. If a household has zero or one UBNs it is considered non-poor, if it has two or three UBNs it is considered poor, and if it has four UBNs the household is considered extremely poor.

2.1 Interventions

As mentioned above, the data collected contains 4 interventions at the household level that are going to be used as treatments throughout this thesis.

Microcredits: Microcredits were given to households to finance agricultural and small industrial activities.

Safe water: Potable water was given to some households via piped water systems. The households who got clean water also got a water meter in order to start the appropriate billing. Thus, this treatment is measured by whether or not the household has a water meter.

Technical training: Technical training was given to young men and women in different areas like carpentry, house construction, handicraft, welding, electricity, sewing, computer science, baking, cooking, agriculture, animal health and solar energy techniques.

Home gardening: A program of farm planning was put in place in los Cuatro Santos, where households were invited to participate in home gardening activities for self-consumption or for profit.

2.2 Unsatisfied basic needs

The UBN is a proxy variable for poverty. In this thesis, this variable is going to be the variable of interest, or potential outcome. It is an ordinal variable that has 5 levels and is composed of 4 items. The first component adds to the UBN when the wall is made of wood, cartons or plastic and the house has a mud floor. The second component adds to the UBN when the household only has access to water from rivers, wells, or bought in barrels, and has no latrine. The third component adds to the UBN when there are children aged between 7 and 14 years old who are not attending school. The final component adds to the UBN when the head of the family is illiterate or didn't complete primary school and there is a dependency ratio greater than 2. The dependency ratio is calculated as the number of people outside the working age (from 0 to 14 years and from 65 years and over) divided by the number of people of working age (between 15 and 64 years).
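To make the composition concrete, the following is a minimal sketch of the UBN scoring described above; all input names are hypothetical (they are not the dataset's actual variable names), and each argument is a logical flag for one component.

```r
# Minimal sketch of the UBN scoring described above; input names are
# hypothetical and each argument is a logical flag for one component.
dependency_ratio <- function(ages) {
  # people outside working age (0-14 and 65+) over people of working age (15-64)
  sum(ages < 15 | ages >= 65) / sum(ages >= 15 & ages < 65)
}

ubn_score <- function(fragile_house, unsafe_water_no_latrine,
                      child_7_14_not_in_school, head_low_education, dep_ratio) {
  fragile_house +
    unsafe_water_no_latrine +
    child_7_14_not_in_school +
    (head_low_education & dep_ratio > 2)
}

# Example household: solid house, unsafe water and no latrine, children in
# school, head without completed primary education, dependency ratio 4 -> UBN = 2
ubn_score(FALSE, TRUE, FALSE, TRUE, dependency_ratio(c(1, 3, 8, 30, 70)))
```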

2.3 Pre-processing

The first step in the pre-processing pipeline was to drop observations that had missing values in any of their features. This reduced the total number of observations (households) from 6803 to 5253. The second step was to remove from the dataset the observations with a UBN of 4, since there were only 7 observations for this level; however, another dataset in which these 7 observations were included is also used as input for the PSICA method. This assumes that there is some sort of dependency between the different levels of UBN; for example, the lower the highest education level in a household is, the higher the UBN is going to be. The next step was to reverse the order of the UBN, since PSICA works with the assumption that a higher potential outcome is better. The distribution of the UBN can be seen in Figure 2.1. The third step was to encode the treatments. This is done because each of the treatments described in Section 2.1 can be applied to the same household at the same time, which doesn't fit into the PSICA framework, so each unique combination of treatments is treated as its own treatment. The encoding of the treatments can be seen in Table 2.1. The final step in the pre-processing pipeline was to remove all the encoded treatments that had less than 100 observations, which leaves a total of 7 unique treatments; the distribution of the encoded treatments can be seen in Figure 2.2. One thing to point out is that once these treatments were removed, a new encoding was set in place, which can be seen in Table 2.2.
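As an illustration of the encoding step, the sketch below collapses the four intervention indicators into a single categorical treatment. The column names are hypothetical, and the resulting integer labels are arbitrary; what matters is that each unique combination gets exactly one code, as in Table 2.1.

```r
# Sketch of the treatment-encoding step: each unique combination of the four
# 0/1 intervention indicators becomes one categorical treatment. Column
# names are hypothetical.
treatments <- data.frame(
  home_gardening     = c(0, 0, 1, 0, 0),
  microcredit        = c(0, 1, 1, 0, 1),
  safe_water         = c(0, 1, 1, 1, 1),
  technical_training = c(0, 0, 1, 1, 0)
)
combo <- interaction(treatments, drop = TRUE)  # one factor level per combination
encoded <- as.integer(combo) - 1               # 0-based integer codes
table(encoded)                                 # distribution of encoded treatments (cf. Figure 2.2)
```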


Figure 2.1: Histogram of the unsatisfied basic needs in the los Cuatro Santos population


Home gardening | Microcredit | Safe water | Technical training | Encoding
0 | 0 | 0 | 0 | 0
0 | 0 | 1 | 0 | 1
0 | 1 | 1 | 0 | 2
1 | 0 | 1 | 0 | 3
0 | 0 | 1 | 1 | 4
0 | 0 | 0 | 1 | 5
0 | 1 | 0 | 0 | 6
1 | 1 | 1 | 1 | 7
0 | 1 | 1 | 1 | 8
0 | 1 | 0 | 1 | 9
1 | 0 | 0 | 0 | 10
1 | 0 | 0 | 1 | 11
1 | 1 | 0 | 0 | 12
1 | 1 | 1 | 0 | 13
1 | 1 | 0 | 1 | 14
1 | 0 | 1 | 1 | 15

Table 2.1: Encoding of the treatments

Old encoding | New encoding
0 | 0
1 | 1
2 | 2
4 | 3
5 | 4
6 | 5
10 | 6

Table 2.2: New encoding of the treatments


3 Method

This chapter first describes how decision trees work; second, it describes the original PSICA method; third, the modifications needed to make it work in a classification setting; fourth, the personalized framework; and finally, the evaluation procedure used in the thesis.

3.1 Decision trees

A decision tree is a method that creates binary partitions of a dataset in order to group observations given a certain criterion [4]. Given a starting dataset $\Delta = \{(X_i), i \in 1, \dots, n\}$, a binary splitting rule $R_j = (x_j, t_m)$, which contains a covariate $x_j$ (where $j$ is the index of the covariate) and a threshold $t_m$, will partition the dataset $\Delta$ into $\Delta_1$ and $\Delta_2$ in the following way:

$$\Delta_1(R_j) = \{x : x_j \le t_m\}, \qquad \Delta_2(R_j) = \Delta \setminus \Delta_1(R_j)$$

This partition can be evaluated such that it minimizes some measure, sometimes called impurity, or such that it maximizes information gain. In the latter case, the information gain is represented as

$$g(\Delta, R_j) = L(\Delta) - \big(L(\Delta_1) + L(\Delta_2)\big)$$

where $L(\cdot)$ is an arbitrary function that represents a loss on a given dataset. Given this measure, a tree will partition the dataset using the optimal splitting rule, i.e. the one that maximizes information gain:

$$R_j^* = \operatorname*{arg\,max}_{R_j} g(\Delta, R_j)$$

This partition of the dataset is applied recursively until a pre-defined stopping criterion is met, for example a minimum number of observations in a split. The subsets generated once the stopping criterion takes effect are commonly called leaves.
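As a small illustration (not the psica implementation), the sketch below evaluates one splitting rule by information gain, taking the node loss $L(\cdot)$ to be the empirical cross-entropy of the class labels in the node:

```r
# Illustrative sketch (not the psica implementation): evaluate a binary
# splitting rule R_j = (x_j, t_m) by information gain, with the node loss
# L(.) taken as the cross-entropy of the class labels in the node.
node_loss <- function(y) {
  p <- prop.table(table(y))
  p <- p[p > 0]            # drop empty classes so 0 * log(0) is treated as 0
  -sum(p * log(p))
}

split_gain <- function(x_j, y, t_m) {
  left  <- y[x_j <= t_m]   # Delta_1(R_j)
  right <- y[x_j >  t_m]   # Delta_2(R_j)
  node_loss(y) - (node_loss(left) + node_loss(right))
}

# Pick the threshold on one covariate that maximizes the gain
set.seed(1)
x <- runif(100)
y <- factor(x > 0.5)
candidates <- head(sort(unique(x)), -1)  # exclude max so both children are non-empty
gains <- sapply(candidates, function(t) split_gain(x, y, t))
best_threshold <- candidates[which.max(gains)]
```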


3.2 PSICA tree

The following notation is used in the PSICA methodology. There is a dataset $D = \{(X_i, Y_i, t_i), i \in 1, \dots, n\}$, where $X_i = (X_{i1}, \dots, X_{ip})$ is a set of characteristics corresponding to each individual $i$, $Y_i$ is the outcome of a given treatment $t_i$, and the treatment $t_i$ belongs to a set $\mathcal{T} = \{\tau_1, \dots, \tau_m\}$. This dataset is collected using randomized clinical trials, which means that the random allocation of treatments does not depend on the characteristics of the individuals. The potential outcome $Y(x, \tau)$ is the outcome of using the treatment $\tau$ on an individual with characteristics (covariates) $x$. It is assumed that each treatment can be applied to each individual; however, in practice usually only the response to one treatment is observed. The relation between the observed outcome $Y_i$ and the potential outcome $Y(x, \tau)$ is given by

$$Y_i = \sum_{j=1}^{m} Y(X_i, \tau_j) \cdot I(\tau_j = t_i)$$

where $I(z)$ is an indicator function which is equal to one when $z$ is true and zero otherwise. Another assumption in this setting is that the treatment status of an individual doesn't affect the potential outcomes of others and that there are no hidden versions of the treatments [14]. The functional form of the potential outcome is defined as follows:

$$Y(X_i, \tau_j) = f(X_i, \tau_j) + \epsilon$$

where $f(X_i, \tau_j)$ is the expected observed outcome per treatment, i.e. $E(Y_i \mid x = X_i, \tau = t_i) = E(Y(X_i, t_i))$, and $\epsilon$ are the error terms, which are assumed to be independent between patients and between treatments of the same patient. Finally, higher potential outcomes are assumed to be better, without loss of generality.

Given this setting, the main goal of the PSICA tree is to find subgroups by dividing the covariate space $\mathcal{X}$ into non-overlapping regions and to give, in each region, the probability of each treatment $\tau_k$ being the best. This problem of identifying subgroups is defined as follows: identify groups $G$ and subsets of treatments $T \subset \mathcal{T}$ such that

$$P\big(Y(x, \tau_1) > Y(x, \tau_2) \mid x \in G,\ \tau_1 \in T,\ \tau_2 \in \mathcal{T} \setminus T\big) > 1 - \alpha$$

Here $\alpha$ is a risk level. This ensures that subgroups are found such that a treatment is expected to be better than the others. Also, $T$ would be the set of treatments that are useful for this subgroup, while $\mathcal{T} \setminus T$ are the useless treatments, which should not be used on the individuals belonging to the subgroup $G$. The PSICA tree computation is divided into two steps. The first one consists of calculating the distribution of a treatment being the best for a patient, and the second step summarizes this distribution in a tree; the resulting tree can be observed in Figure 3.1. These two steps can be described as follows.

First step

As mentioned above, the first step calculates $\pi_k(x)$, $k = 1, \dots, m$, which is the probability of a treatment $\tau_k$ being the best for a given patient $x$. To estimate this probability, samples from the joint distribution $p(Y(x, \tau_1), \dots, Y(x, \tau_m))$ need to be taken. Unfortunately, in practice, as mentioned above, only one potential outcome $Y(x, \tau_i)$ is observed and none of the counterfactuals are. Thus, this probability is estimated by simulation. Assuming that $B$ samples from this distribution can be obtained, $Y^b = \big(Y^b(x, \tau_1), \dots, Y^b(x, \tau_m)\big)$, $b = 1, \dots, B$, then $\pi_k(x)$ can be estimated as

$$\pi_k(x) = \frac{1}{B} \sum_{b=1}^{B} I\Big(Y^b(x, \tau_k) > \max_{j=1,\dots,m,\ j \ne k} Y^b(x, \tau_j)\Big) \tag{3.1}$$

Figure 3.1: A PSICA tree forming subgroups by dividing the covariate space, with the probabilities of a treatment being the best on each node and the labels containing the most likely optimal treatments

The generation of samples from $p(Y(x, \tau_1), \dots, Y(x, \tau_m))$ is done in the following way. First, the dataset $D$ is divided into subsets $D_k = \{(X_i, Y_i, t_i) : t_i = \tau_k\}$, such that each subset only contains observations with a given treatment $k$. Each subset is then used to generate samples $Y^b(x, \tau_k)$ using one of two methods.

Method 1 constructs $B$ datasets $D_k^b$, $b = 1, \dots, B$, by bootstrapping over the dataset $D_k$, i.e. $D_k^b \sim \text{Bootstrap}(D_k)$. Afterwards, a machine learning model $M_k^b(x)$ is trained on each dataset $D_k^b$, using $Y$ as the response and $X$ as the covariates. The current implementation of the algorithm uses conditional inference random forests [7] as its machine learning model. These models are then used to generate samples $Y^b(x, \tau_k) = M_k^b(x)$, $k = 1, \dots, m$, $b = 1, \dots, B$. The samples are generated by using the covariates $X$ from the original dataset $D$ as inputs to $M_k^b(x)$, and the out-of-bag predictions are obtained for $Y^b(x, \tau_k)$. Finally, $\pi_k(x)$ is calculated using the generated samples and formula 3.1.

Method 2 samples from the distribution $Y^b(X_i, \tau_k) \sim \mathcal{N}\big(M_k(X_i), \sigma_k^2(X_i)\big)$, where $M_k(X_i)$ is a machine learning model trained on $D_k$ and $\sigma_k^2(X_i)$ is the variance of the prediction, obtained using the bias-corrected infinitesimal jackknife [17]. Given the samples obtained above, $\pi_k(x)$ is calculated using formula 3.1.
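The sketch below illustrates formula 3.1 under the assumption that the $B$ draws for one individual have already been generated, for instance as the out-of-bag predictions from the bootstrapped forests of method 1:

```r
# Sketch of formula (3.1): estimate pi_k(x) from a B x m matrix `samples`
# whose b-th row holds the draws (Y^b(x, tau_1), ..., Y^b(x, tau_m)) for one
# individual, e.g. out-of-bag predictions from the bootstrapped forests.
pi_best <- function(samples) {
  m <- ncol(samples)
  sapply(seq_len(m), function(k) {
    best_other <- apply(samples[, -k, drop = FALSE], 1, max)
    mean(samples[, k] > best_other)  # indicator of tau_k strictly beating the rest
  })
}

# Example: B = 1000 draws for m = 3 treatments
set.seed(42)
S <- cbind(rnorm(1000, mean = 1.0), rnorm(1000, mean = 1.2), rnorm(1000, mean = 0.8))
round(pi_best(S), 2)  # treatment 2 should get the largest probability
```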

Second step

Once the probabilities of each treatment being the best are calculated, a PSICA tree is trained on them in order to find interesting subgroups. To this end, a dataset $\Delta_0 = \{(X_i, P_i)\}$, $P_i = (\pi_1(X_i), \dots, \pi_m(X_i))$, is constructed, in which the $X_i$ are the covariates for the tree and $P_i$ is the response vector. Two methods for growing the tree are proposed.


Method A involves growing a large tree on this new dataset and letting the user prune it until it is interpretable for policy making. This method has the downfall that spurious subgroups could be discovered. Thus, method B is proposed as a solution, which involves pre-pruning of the tree.

In both methods, the standard decision tree growing principles [4] are used. This means that a splitting rule that maximizes information gain is applied. In the case of method A, a dataset $\Delta$ (before splitting) and the datasets $\Delta_1$ and $\Delta_2$ (after the split) are considered. With these datasets the following loss functions are computed: $L_1 = L(\Delta)$, $L_2 = L(\Delta_1)$ and $L_3 = L(\Delta_2)$. Here $L(\Delta)$ is the deviance or cross-entropy of the dataset $\Delta$:

$$L(\Delta) = -\frac{1}{|\Delta|} \sum_{x \in \Delta} \sum_{k=1}^{m} \pi_k(x) \log \pi_k(x)$$

where $|\Delta|$ is the cardinality of $\Delta$. The information gain is computed as

$$g(\Delta, R_j) = L_1 - (L_2 + L_3) \tag{3.2}$$

where $R_j$ is a binary splitting rule. In the case of method B, the information gain $g$ is modified as $g'(\Delta, \Delta_1, \Delta_2) = g(\Delta, \Delta_1, \Delta_2) \cdot G(\Delta_1, \Delta_2)$, where $G(\Delta_1, \Delta_2)$ is equal to one if the distributions $\Pi_1 = \{\pi_k(\Delta_1), k = 1, \dots, m\}$ and $\Pi_2 = \{\pi_k(\Delta_2), k = 1, \dots, m\}$ differ significantly, and zero otherwise.

The computation of $G$ is as follows: a chi-square test is done to compare $\Pi_1$ and $\Pi_2$. For each $\Pi_j$, the following counts are computed:

$$\Big\{ n_{kj} = \big\lceil \pi_k(\Delta_j) \cdot |\Delta_j| \cdot \omega_j \big\rceil,\ k = 1, \dots, m \Big\}$$

where $\omega_j$ is an inflation factor defined as the standard deviation of the uniform distribution $U[0, 1]$ (which is equal to $1/\sqrt{12}$) divided by the standard deviation of $\{\pi_k(X_i) : (X_i, P_i) \in \Delta_j\}$. The inflation factor plays the role of giving a higher weight to the distribution $\pi_k(X_i)$ with lower variance. After the counts are computed for both distributions, a two-way table is formed and the standard chi-square test is performed. If its p-value is lower than a risk level $\alpha$, $G$ is set to 1, and otherwise to 0.
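An illustrative sketch of this significance gate is given below. It assumes the per-observation probability vectors of each child node are stacked in matrices, and it pools all entries when computing the standard deviation inside $\omega_j$ (one reading of the definition above):

```r
# Illustrative sketch of the gate G in method B. pi1 and pi2 are matrices
# whose rows are the vectors (pi_1(X_i), ..., pi_m(X_i)) of the observations
# in Delta_1 and Delta_2. Pooling all entries in the sd is an assumption.
significance_gate <- function(pi1, pi2, alpha = 0.05) {
  inflated_counts <- function(pm) {
    omega <- (1 / sqrt(12)) / sd(as.vector(pm))    # inflation factor omega_j
    ceiling(colMeans(pm) * nrow(pm) * omega)       # n_kj, k = 1, ..., m
  }
  tab <- rbind(inflated_counts(pi1), inflated_counts(pi2))  # two-way table
  p_value <- suppressWarnings(chisq.test(tab)$p.value)
  as.numeric(p_value < alpha)                      # G = 1 if the split is significant
}
```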

3.3 Modifications to PSICA

As stated in the introduction, the goal is to extend the PSICA tree such that it can handle categorical variables. The first difference compared to the original method is that the potential outcome $Y(X_i, \tau_j)$ is not real-valued, but is represented by categories. These categories are encoded into numbers such that a higher number represents a better outcome, without loss of generality. This means that $Y(X_i, \tau_j) \in \{1, \dots, C\}$, where $C$ is the total number of categories.

The first problem that appears is that method 1 and method 2, as well as $\pi_k(x)$, are built and defined assuming real-valued potential outcomes. In this case, using a regression model $M_k^b(X_i)$ (for method 1) to model the potential outcome is not reasonable, since this could potentially yield more categories than the existing ones, and it won't be able to adequately represent the distribution of the data. As for method 2, using a discretized normal distribution is not reasonable, since the bias-corrected infinitesimal jackknife variance $\sigma_k^2(x)$ used in this method was developed for the case where the data, or in this case the potential outcome, follows a normal distribution [17], not a discretized version of it. Given this, the samples $Y^b$ can't be calculated, and neither can $\pi_k$.

In order to solve this problem, a modification to method 1 is done. This modification changes the regression model $M_k^b(X_i)$ to an ordinal classification model, which will correctly represent the distribution of the potential outcomes. As in the original implementation, conditional inference random forests [7] are proposed, but in an ordinal classification setting. Given this, the model $M_k^b(X_i)$ will have a probabilistic output, which is represented as a vector of probabilities of each potential outcome given a treatment $k$:

$$\vec{P}_k^b = \big[P(Y^b(X_i, \tau_k) = 1), \dots, P(Y^b(X_i, \tau_k) = C)\big] = M_k^b(X_i) \tag{3.3}$$

Given this probabilistic output, the definition of $\pi_k$ must be changed in order to account for this. This quantity is defined as the probability that the treatment $k$ is the best for a given observation, meaning that treatment $k$ offers a higher or equal potential outcome than all the other treatments. Thus, this probability can be represented as follows:

$$\pi_k(x_i) = P\big(Y(x_i, \tau_k) \ge Y(x_i, \tau_1), \dots, Y(x_i, \tau_k) \ge Y(x_i, \tau_{k-1}),\ Y(x_i, \tau_k) \ge Y(x_i, \tau_{k+1}), \dots, Y(x_i, \tau_k) \ge Y(x_i, \tau_m)\big)$$

For simplicity, the above definition is rewritten as:

$$\pi_k(x_i) = P\big(Y(x_i, \tau_k) \ge Y(x_i, \tau_{-k})\big)$$

where $\tau_{-k}$ in $Y(x_i, \tau_{-k})$ represents all the other potential outcomes $j$ such that $j \ne k$, $j \in \mathcal{T}$. This expression can be developed even further:

$$\pi_k(x_i) = \sum_{c=1}^{C} P\big(Y(x_i, \tau_k) \ge Y(x_i, \tau_{-k}) \mid Y(x_i, \tau_k) = c\big)\, P\big(Y(x_i, \tau_k) = c\big)$$

$$\pi_k(x_i) = \sum_{c=1}^{C} P\big(c \ge Y(x_i, \tau_{-k}) \mid Y(x_i, \tau_k) = c\big)\, P\big(Y(x_i, \tau_k) = c\big)$$

Since $Y(x_i, \tau_k)$ is independent of $Y(x_i, \tau_l)$ for all $k \ne l$ (because the potential outcome observed for treatment $\tau_k$ won't affect the potential outcome of the same individual when a treatment $\tau_l$ is applied), the above formula can be rewritten as:

$$\pi_k(x_i) = \sum_{c=1}^{C} \prod_{l=1,\, l \ne k}^{m} P\big(c \ge Y(x_i, \tau_l)\big)\, P\big(Y(x_i, \tau_k) = c\big)$$

When bootstrapping is added to the procedure, the above equation becomes

$$\pi_k(x_i) = \frac{1}{B} \sum_{b=1}^{B} \sum_{c=1}^{C} \prod_{l=1,\, l \ne k}^{m} P\big(c \ge Y^b(x_i, \tau_l)\big)\, P\big(Y^b(x_i, \tau_k) = c\big) \tag{3.4}$$

The equation above can be calculated using the classification models $M_k^b(X_i)$. For example, the probability $P(Y^b(x_i, \tau_k) = c)$ can be obtained from the vector of probabilities $\vec{P}_k^b$ defined in 3.3. As for the probability $P(c \ge Y^b(x_i, \tau_l))$, it can be calculated from the corresponding vector of probabilities $\vec{P}_l^b$ by summing all the probabilities such that $Y^b(X_i, \tau_l) \le c$:

$$P\big(c \ge Y^b(x_i, \tau_l)\big) = \sum_{j=1}^{c} \big[\vec{P}_l^b\big]_j = P(Y^b(X_i, \tau_l) = 1) + \dots + P(Y^b(X_i, \tau_l) = c)$$
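The sketch below evaluates formula 3.4 for a single bootstrap replicate ($B = 1$); `P` stands for the stacked probability vectors of equation 3.3 for one individual:

```r
# Sketch of formula (3.4) for one individual and one bootstrap replicate
# (B = 1). P is an m x C matrix whose k-th row is the probability vector of
# equation (3.3) for treatment tau_k over the C ordered categories.
pi_best_ordinal <- function(P) {
  m <- nrow(P); C <- ncol(P)
  cum <- t(apply(P, 1, cumsum))   # cum[l, c] = P(Y(x, tau_l) <= c) = P(c >= Y(x, tau_l))
  sapply(seq_len(m), function(k) {
    sum(sapply(seq_len(C), function(c) {
      prod(cum[-k, c]) * P[k, c]  # product over l != k, times P(Y(x, tau_k) = c)
    }))
  })
}

# Three treatments, three ordered categories (each row sums to one)
P <- rbind(c(0.6, 0.3, 0.1),
           c(0.2, 0.5, 0.3),
           c(0.1, 0.3, 0.6))
round(pi_best_ordinal(P), 3)  # treatment 3 gets the largest pi_k
```

Note that, because ties are counted with "greater or equal", these $\pi_k$ need not sum to one.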

The second step of PSICA doesn't need any modifications, since the dataset $\Delta_0 = \{(X_i, P_i)\}$ is constructed in the same way. In the experiments, this new modification of method 1 is going to be compared with the same strategy but without bootstrapping, since it is interesting to see whether bootstrapping is actually needed in this setting.

3.4 General Statistical Framework for Subgroup Identification: personalized package

A general statistical framework for subgroup identification and comparative treatment scoring was proposed by Chen et al. [5] and later implemented in an R package called personalized [9]. This framework aims to collect the various methods in the subgroup identification literature and put them under the same umbrella. The following notation is used in this framework. A dataset $\{(Y_i, T_i, X_i), i = 1, \dots, n\}$ is observed. This dataset comes from either a randomized clinical trial or an observational study (a study where the assignment of the treatments is out of the control of the researcher). Here $Y$ is the response variable, $X$ is a set of covariates, and $T \in \mathcal{T} = \{-1, 1\}$ is a binary treatment. In this framework, $E(Y \mid T, X)$ can be decomposed as

$$E(Y \mid T, X) = m(X) + T \Delta(X)/2$$

where $m(X) = \frac{1}{2}\big(E(Y \mid T = 1, X) + E(Y \mid T = -1, X)\big)$ reflects the main effect of $X$ on $Y$, and $\Delta(X) = E(Y \mid T = 1, X) - E(Y \mid T = -1, X)$ is a contrast function that expresses the treatment effects given $X$. Not all subgroup identification methods target $\Delta(X)$ directly; they usually target some transformation of it. This transformation $f(X)$ is called the benefit score. It possesses two properties: the first is that it is monotone in the treatment effect $\Delta(X)$, and the second is that it has a meaningful, known cutpoint value $c$ such that $f(x) > c$ implies the treatment is more effective than the control, while $f(x) \le c$ implies that the control is more effective than the treatment. Given this definition, $\Delta(x)$ is itself a benefit score. The main objective of the framework is estimating the contrast function $\Delta(X)$, since it is the one dictating the changes in the response variable given the treatments and covariates, and it can be used to determine an individualized treatment rule (ITR). An ITR is a function $d(X) : \mathcal{X} \to \mathcal{T}$ that maps individuals to a treatment. In order for this function to be meaningful, it must assign the treatments that benefit an individual the most. Thus, an optimal ITR is a function $d^*(X)$ that maximizes the value function

$$V(d) = E_d(Y(d)), \qquad V(d^*) \ge V(d),\ \forall d \in \mathcal{D}$$

where $\mathcal{D}$ is a class of ITRs and $E_d$ is the expected value given the distribution of the data $(X, T, Y)$ under the ITR $d$. An example of an optimal ITR is $d^* = \text{sign}(\Delta(X))$. The maximization of the value function can be seen as the minimization of a weighted misclassification rate [19]. In order to show this, some notation must be introduced first. The distribution of the data $(X, T, Y)$ under a randomized clinical trial is defined as $P$. The likelihood of $(X, T, Y)$ under $P$ is then $f_0(X) P(T \mid X) f_1(Y \mid X, T)$, where $f_0$ is the unknown density of $X$, $f_1$ is the unknown density of $Y$ conditioned on $(X, T)$, and $P(T \mid X)$ is the propensity score function, i.e. the probability of assigning treatment $T$ to the individual $x$. This function is assumed to be known for randomized clinical trials (RCTs). For example, if in an RCT the treatment and control are assigned evenly at random, then $P(T \mid X) = 1/2$. For an ITR $d$, denote the distribution of $(X, T, Y)$ as $P_d$. Then the likelihood of $(X, T, Y)$ under $P_d$ is $f_0(X) I(T = d(X)) f_1(Y \mid X, T)$. Assuming that any of the treatments can be assigned to $x$, $P_d$ is absolutely continuous with respect to $P$, and a version of the Radon–Nikodym derivative is $dP_d/dP = I(T = d(X))/P(T \mid X)$ [15]. With this at hand, the value function can be rewritten as:

$$V(d) = E_d[Y] = \int Y \, dP_d = \int Y \frac{dP_d}{dP} \, dP = E\left[\frac{I(T = d(X))}{P(T \mid X)}\, Y\right]$$

and the maximization of $V(d)$ is equivalent to the minimization of

$$E\left[\frac{Y}{P(T \mid X)}\, I(T \ne d(X))\right]$$

which is a misclassification error weighted by $Y/P(T \mid X)$. Note that maximizing $V(d)$ or minimizing the error is not feasible due to the discrete nature of $d(x)$. Thus, the weighting method [19, 5] introduces a surrogate loss function in order to estimate the benefit score $f(X)$ and consequently $\Delta(X)$. To this end, a convex loss function $M(y, v)$ is proposed. This loss function must have the following two properties: i) $M_v(y, v) = \partial M(y, v)/\partial v$ is increasing in $v$ for any given $y$; ii) $U(y) = M_v(y, 0)$ is monotone in $y$. The first condition makes it possible to "order" the expected utility under the comparative treatments to form an ITR. The second condition simply makes the transformed quantity, that is, $U(y)$, an interpretable endpoint. One example of a loss function that satisfies these two conditions is $M(y, v) = (y - v)^2$, for which $M_v(y, v) = 2v - 2y$ and $U(y) = -2y$. Given this transformation of $Y$, the value function can be rewritten as [5]:

$$V_U(d) = -E\big(\big[U(Y(T = 1)) - U(Y(T = -1))\big]\, d(X)\big)$$

Also, the misclassification error, or the weighted loss function that we want to minimize, is defined as follows:

$$L_W(f) = E\left[\frac{M(Y, T f(X))}{P(T \mid X)} \,\middle|\, X = x\right]$$

The sample estimator of this loss function is defined as:

$$L_W(f) = \frac{1}{n} \sum_{i=1}^{n} \frac{M(Y_i, T_i f(x_i))}{P(T_i \mid X_i)}$$

The weighting estimator of the benefit score function is

$$\hat{f}_W = \operatorname*{arg\,min}_{f} L_W(f)$$

For this estimator, Chen et al. [5] showed that $d = \text{sign}(\hat{f}_W)$ maximizes the value function $V_U(d)$, making $d^* = \text{sign}(\hat{f}_W)$ an optimal ITR. This estimator is used to assign treatments to patients, and it can also be used to identify subgroups by splitting the covariate space $\mathcal{X}$ given the treatment recommendation, as done in the personalized package [9]. It is worth noting that the above optimization problem is complex, since the minimization is done over a space of functions. Because of this, the space of functions is reduced, in the sense that a modeling choice must be made. For example, $f(X)$ can be a linear model $f(X) = X^T \beta$, a regression tree, smoothing splines, or another non-parametric model. With this in mind, in practice the minimization of the loss function $L_W(f)$ can be handed over to the model selected for $f(X)$ given the reduced space. An important result of this procedure is that even if the space of functions is reduced, the ITR $d^*$ obtained is "Fisher-consistent" [5]. Furthermore, for given forms of $M(y, v)$, a relation between $f_W(X)$ and $\Delta(X)$ can be obtained [9]. For example, for $M(y, v) = (y - v)^2$ the relation between the benefit score estimator and the contrast function is $2 f_W(X) = \Delta(X)$.


In the case of multiple treatments, the contrast function becomes $\Delta_{kl}(X) = E(Y \mid T = k, X = x) - E(Y \mid T = l, X = x)$. If $\Delta_{kl} > 0$, then treatment $k$ is preferable to treatment $l$ for a patient $x$. The optimization problem then becomes:

$$\min L_W(f_1, \dots, f_K) = \frac{1}{n} \sum_{i=1}^{n} \frac{M\Big(Y_i, \sum_{k=1}^{K} I(T_i = k) f_k(x_i)\Big)}{Pr(T = T_i \mid X = x_i)} \quad \text{s.t.} \quad \sum_{k=1}^{K} f_k(x_i) = 0$$

The above restriction is put in place for identifiability. It is worth noting that under the multiple treatment scenario, the framework only supports contrast functions that have a linear form, i.e. $\Delta_{kl} = X^T \beta_k$.

3.5 Evaluation

In order to investigate the efficiency of the classification PSICA, as stated in the aims (Section 1.2), the evaluation procedure for the corresponding simulations (Chapter 4) is defined here.

Data splits

Every dataset is divided into a training set and a test set. The training set consists of 80% of the observations, while the test set consists of the other 20%. All the models are trained using the training set, and the evaluation metrics are calculated using the test set. As for the performance in terms of execution time, it is measured as the number of seconds it took to train the model.
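A minimal sketch of this split, assuming the observations are stored in a data frame `data` (a hypothetical name):

```r
# Minimal sketch of the 80/20 train/test split; `data` is a hypothetical
# data frame holding all observations.
set.seed(123)
train_idx <- sample(seq_len(nrow(data)), size = floor(0.8 * nrow(data)))
train <- data[train_idx, ]
test  <- data[-train_idx, ]
```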

Evaluation metrics

To evaluate the efficiency of the classification PSICA method, a set of simulations is carried out. The basic setting is that to a population $X$, some treatments $T$ are applied, which can potentially interact with the covariates and modify the potential outcome $Y$. The idea is to measure the efficiency of PSICA and of the personalized package described above by their capability of assigning the treatment that maximizes the potential outcome $Y$. So this is considered as a classification problem, where each algorithm should match an observation $x_i$ with the treatment $t_i$ that maximizes the potential outcome $y_i$. For this purpose, three metrics are used: accuracy, sensitivity and specificity. In the binary case, there are two treatments $\tau \in \{0, 1\}$, where 0 symbolizes a negative outcome and 1 a positive one. In this setting, a true positive occurs when the best treatment for the individual $i$ is $\tau_i^* = 1$ and the method in question predicts the same treatment, $\hat{t}_i = 1$. A false negative occurs when the best treatment is $\tau_i^* = 1$ but the method wrongly predicts $\hat{t}_i = 0$. A false positive happens when the best treatment is $\tau_i^* = 0$ but the method predicts $\hat{t}_i = 1$. Finally, a true negative happens when the best treatment is $\tau_i^* = 0$ and the method predicts $\hat{t}_i = 0$. Given these definitions, a confusion matrix can be constructed, as shown in Table 3.1, from which the three aforementioned metrics can be calculated.

For a binary classification problem as depicted in Table 3.1, the metrics are defined as:

$$\text{Accuracy} = \frac{\text{True positive} + \text{True negative}}{\text{Total number of observations}}$$

            | Predicted: Yes | Predicted: No
Actual: Yes | True positive  | False negative
Actual: No  | False positive | True negative

Table 3.1: Confusion matrix for a binary classification problem

$$\text{Sensitivity} = \frac{\text{True positive}}{\text{True positive} + \text{False negative}} \qquad \text{Specificity} = \frac{\text{True negative}}{\text{True negative} + \text{False positive}}$$

The accuracy is the percentage of correctly classified observations, the sensitivity is the percentage of correctly classified positive cases, and the specificity is the percentage of correctly classified negative cases. For the multinomial case, the accuracy is the number of correctly classified observations over the total number of observations. As for the sensitivity and specificity metrics, each one yields a value per class $k$:

$$\text{Sensitivity}_k = \frac{\text{True positive}_k}{\text{True positive}_k + \text{False negative}_k} \qquad \text{Specificity}_k = \frac{\text{True negative}_k}{\text{True negative}_k + \text{False positive}_k}$$

where $\text{True positive}_k$ is the number of observations correctly classified as class $k$, $\text{False negative}_k$ is the number of observations from class $k$ that were misclassified, $\text{True negative}_k$ is the number of observations not of class $k$ that were classified as not belonging to class $k$, and $\text{False positive}_k$ is the number of observations not of class $k$ that were classified as belonging to class $k$.
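The sketch below computes all three metrics in the multinomial case from the vectors of true best treatments and predicted treatments (a hypothetical helper, not part of either package):

```r
# Sketch of the evaluation metrics for the multinomial case: accuracy plus
# per-class sensitivity and specificity from the confusion matrix.
class_metrics <- function(truth, pred) {
  classes <- sort(unique(c(truth, pred)))
  cm <- table(Actual    = factor(truth, levels = classes),
              Predicted = factor(pred,  levels = classes))
  sens <- diag(cm) / rowSums(cm)  # TP_k / (TP_k + FN_k); NaN when class k is absent
  spec <- sapply(seq_along(classes), function(k) {
    tn <- sum(cm[-k, -k])
    fp <- sum(cm[-k, k])
    tn / (tn + fp)                # TN_k / (TN_k + FP_k)
  })
  list(accuracy = sum(diag(cm)) / sum(cm), sensitivity = sens, specificity = spec)
}

class_metrics(truth = c(1, 1, 2, 3, 3, 3), pred = c(1, 2, 2, 3, 3, 1))
```

The NaN behaviour for absent classes mirrors the blank entries in Table 5.2, where a class had no true positives nor false negatives.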


4 Simulations

To evaluate the efficiency of the classification PSICA method in both versions (bootstrapping and non-bootstrapping), an alternative method must be selected as a baseline. For this thesis, the general framework described in Section 3.4 is used; more specifically, the personalized package implementation [9]. There are several reasons for this choice. The first is that this framework has a package readily available for subgroup identification. The second is that the implementation of this framework, like PSICA, is able to recommend treatments to new observations, rather than just splitting the covariate space or calculating the average treatment effect per subgroup. Furthermore, both methods are able to handle categorical treatments. Finally, as the name implies, the general framework for subgroup identification is a flexible framework that covers many previous methods under one single umbrella [5, 9], making it a reasonable choice as a benchmark.

For the purpose of evaluating the performance, four models are proposed. The first three belong to the binary classification setting, since the general framework only supports this setting for classification. The last model is an ordinal classification one, for which only the PSICA algorithm is used. To create the binary classification setting for the potential outcome, a $\text{logit}(p)$ function is used, where $p = p(Y = 1 \mid X)$ is the probability that $Y$ belongs to class 1 given the covariates $X$. The logit function is defined as

$$\text{logit}(p) = \log \frac{p}{1 - p}$$

and the models described are

$$M_1 : \text{logit}(p) = -2\, I(\tau = \tau_1) + \frac{1}{3} I(\tau = \tau_2) + 2 x_1 I(\tau = \tau_3) + e$$

$$M_2 : \text{logit}(p) = \frac{1}{2} I(x_1 \ge 0, x_2 \ge 0, \tau = \tau_1) + \frac{1}{2} I(x_1 < 0, x_2 < 0, \tau = \tau_2) + e$$

$$M_3 : \text{logit}(p) = \sum_{i=1}^{40} x_i + \frac{1}{2} x_1 I(x_1 > 0, \tau = \tau_1) + \frac{1}{2} x_2 I(x_2 > 0, \tau = \tau_2) + \frac{1}{2} x_3 I(x_3 > 0, \tau = \tau_3) + e$$

$$M_4 : Y = \begin{cases} 0 & Y^* \le -\frac{1}{3} \\ 1 & -\frac{1}{3} < Y^* \le \frac{1}{3} \\ 2 & \frac{1}{3} < Y^* \end{cases} \qquad Y^* = \sum_{i=1}^{40} x_i - 2 x_1 I(\tau = \tau_1) + \frac{1}{2} I(x_1 \ge 0, x_2 \ge 0, \tau = \tau_2) + (x_1 + x_2) I(\tau = \tau_3) + e$$

All of the experiments consist of three treatments. The first model (M1) is a simple one, where the first treatment is never desired, the second treatment has a fixed effect, and the third is the preferred one when $x_1$ is greater than $1/6$. The second model (M2) contains interactions between a treatment and two variables at the same time, and the third treatment has no impact. The third model (M3) involves many variables in creating the main effect, and only a few interact with the treatments in a positive way. The final model (M4) involves multiple categories of $Y$. For all the experiments, $e$ is an error term that follows a normal distribution $\mathcal{N}(0, 0.2^2)$. All the variables $x_i$ are sampled from a uniform distribution $U(-1, 1)$. A total of 50 variables were used as covariates for each simulation, $\{x_1, x_2, \dots, x_{50}\}$. Each of these experiments is performed 200 times and is done for $n$ observations, where $n \in \{300, 900, 1800\}$. The PSICA tree with bootstrapping is only trained on $n = 300$ due to computational resources. For the PSICA trees, the parameters Boot = 100 and mtry = 7 were used, both set due to computational constraints. The first parameter indicates the number of bootstrap samples drawn in the first step, and the second one is the number of variables to be selected at each split; the value 7 is selected since it is the recommended value for a large number of variables [14] ($\sqrt{50} \approx 7$). As for the personalized package, the default parameters were used with a logistic_loss_lasso loss.
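For concreteness, here is a sketch of one simulated dataset from data model M1. Treatment assignment uniformly at random is an assumption (the exact assignment scheme is not spelled out above), as is the oracle-label construction used for scoring:

```r
# Sketch of one simulated dataset from data model M1; uniformly random
# treatment assignment and the oracle best-treatment label are assumptions.
simulate_M1 <- function(n = 300, p = 50, sd_e = 0.2) {
  X   <- matrix(runif(n * p, -1, 1), n, p)
  trt <- sample(1:3, n, replace = TRUE)
  e   <- rnorm(n, 0, sd_e)
  eta <- -2 * (trt == 1) + (1 / 3) * (trt == 2) + 2 * X[, 1] * (trt == 3) + e
  y   <- rbinom(n, 1, plogis(eta))      # invert logit(p) = log(p / (1 - p))
  best <- ifelse(X[, 1] > 1 / 6, 3, 2)  # tau_3 wins when 2 * x1 > 1/3
  list(X = X, trt = trt, y = y, best = best)
}
sim <- simulate_M1()
```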


5 Results

In this section, the results for the simulations are introduced first, and then the results of applying the classification PSICA method to the los Cuatro Santos dataset are presented.

5.1 Simulations

The results of the simulations are presented below. Psica boot refers to the PSICA method with bootstrapping and Psica no boot to the method without bootstrapping. The accuracy and execution time metrics are presented in Table 5.1. On overall accuracy, the personalized package performs better than both PSICA models on the data model M1; however, it underperforms for both data models M2 and M3. As for execution time, there is no clear relation between the training time of the personalized package and the number of observations, since it takes more time to train for a lower number of observations than for a higher number; this can be seen in Figure 5.1. However, most of the time it is faster than both of the PSICA trees. Psica boot always outperforms Psica no boot in every data model except M4, and for n = 300 Psica boot always takes more time than Psica no boot. Another observation is that the more complex the model, the wider the gap in execution time between Psica boot and Psica no boot. This can be seen as the data models grow in complexity from M1 to M4; it is most noticeable in M4, where the execution time of Psica boot is more than 25 times that of Psica no boot.

The results for the sensitivity metrics are presented by class in Table 5.2. None of the subgroup identification methods for the data model M1 have true positives or false negatives for the first treatment τ1, which is why those entries are blank. The same happens for the third treatment τ3 in the model M2. For the data model M1, all methods for all numbers of observations have a high sensitivity for the second treatment; however, for the third treatment, all have a sensitivity lower than 50%. In the data model M2, all of the models increase their sensitivity when they have more observations. For the data model M3, which has more noise and is more complex than the other two, increasing the number of observations doesn't seem to affect the sensitivity for any treatment. One interesting fact is that PSICA holds the same sensitivity for all the treatments in the data model M3, which is expected, since all of the treatments have the same probability of being the best for each observation. Meanwhile, personalized has a higher sensitivity for the second and third treatments and a lower one for the first, for all values of n. As for the data model M4, it seems that the higher the number of observations, the better PSICA can classify positive cases.

Figure 5.1: Personalized training time by model and number of observations with standard error bars.

Finally, the results for the specificity are presented in Table 5.3. There is no clear evidence on whether the PSICA method with bootstrapping outperforms the one without bootstrapping in terms of specificity for any of the data models. The personalized method seems to perform better than the PSICA method for the data model M1, since it has a higher percentage of correctly classified negative cases for treatment 2. In the data model M3, both Psica boot and Psica no boot present the expected specificity for all the treatments, since all of them have the same probability of being the best. However, the personalized method has a higher specificity for treatment 1 for n = 300 and n = 900, but it diminishes from 0.8257 to 0.3929 for n = 1800. Given these results, it seems that Psica boot offers slightly better results than Psica no boot, with similar variance in its predictions. As for the personalized package, it appears to behave similarly at times; however, its results seem to be unstable, changing with the number of observations, also in terms of execution time.

5.2 Los Cuatro Santos: Classification PSICA results

The classification PSICA method, with and without bootstrapping, was trained on both versions of the los Cuatro Santos dataset (one without the 7 observations that had a UBN of 4, and another one containing them). For the PSICA trees, the parameters Boot = 100 and mtry = 7 were used for both datasets. The results for the PSICA tree without bootstrapping can be seen in Figure 5.2 (without the observations with a UBN of 4) and in Figure 5.4 (with all the UBN levels). The results with bootstrapping can be seen in Figure 5.3 (without a UBN of 4) and in Figure 5.5 (with all the UBN levels).


Data Model | No. Observations | Model Name | Accuracy | Execution time (seconds)
M1 | 300 | Psica boot | 0.5907 (0.0604) | 26.5420 (0.0960)
M1 | 300 | Psica no boot | 0.5835 (0.0661) | 16.3071 (0.1701)
M1 | 300 | Personalized | 0.6379 (0.1020) | 0.2772 (0.0051)
M1 | 900 | Psica no boot | 0.5839 (0.0354) | 50.5889 (0.2255)
M1 | 900 | Personalized | 0.7212 (0.0792) | 4.3291 (0.1424)
M1 | 1800 | Psica no boot | 0.5797 (0.0259) | 113.6384 (0.9266)
M1 | 1800 | Personalized | 0.7792 (0.0401) | 1.8034 (0.1125)
M2 | 300 | Psica boot | 0.5154 (0.2566) | 18.4544 (0.1141)
M2 | 300 | Psica no boot | 0.4962 (0.2690) | 7.6937 (0.1191)
M2 | 300 | Personalized | 0.4705 (0.1837) | 7.2676 (0.5158)
M2 | 900 | Psica no boot | 0.6270 (0.2097) | 46.9646 (0.7122)
M2 | 900 | Personalized | 0.5973 (0.0888) | 0.6310 (0.0062)
M2 | 1800 | Psica no boot | 0.7168 (0.1265) | 151.9421 (0.9966)
M2 | 1800 | Personalized | 0.6134 (0.0573) | 0.9124 (0.0093)
M3 | 300 | Psica boot | 0.3417 (0.0880) | 19.6710 (0.0661)
M3 | 300 | Psica no boot | 0.3315 (0.0828) | 7.8464 (0.0841)
M3 | 300 | Personalized | 0.3170 (0.0656) | 8.3134 (0.6666)
M3 | 900 | Psica no boot | 0.3380 (0.0712) | 32.7897 (0.7272)
M3 | 900 | Personalized | 0.3239 (0.0581) | 1.2984 (0.0563)
M3 | 1800 | Psica no boot | 0.3290 (0.0661) | 85.6175 (1.7410)
M3 | 1800 | Personalized | 0.3249 (0.0528) | 1.8962 (0.0835)
M4 | 300 | Psica boot | 0.3224 (0.1462) | 256.9788 (1.6497)
M4 | 300 | Psica no boot | 0.3263 (0.1542) | 10.0903 (0.1092)
M4 | 900 | Psica no boot | 0.2965 (0.1507) | 49.2378 (0.8565)
M4 | 1800 | Psica no boot | 0.3439 (0.1920) | 132.0090 (1.4503)

Table 5.1: Accuracy and execution time (seconds) results. The main number is the mean of the simulations and in parentheses the standard error of the mean.

Without UBN of 4

For the non-bootstrapping version of PSICA, the common theme is that every subgroup has the first, second and third treatments among its recommended treatments. The first treatment contains only the safe water intervention. The second one contains the safe water and microcredit interventions. The third treatment contains the safe water and technical training interventions. Another interesting finding is that the control treatment 0 belongs to the recommended groups when the household has a floor that isn't made out of mud and the highest education in the household is secondary school or university level. One final interesting observation is that the probability of a treatment being the best is concentrated on treatments 1, 2 and 3, and that from the leftmost subgroup to the rightmost subgroup there seems to be a shift in the allocation of the probabilities, from treatment 3 being the best (leftmost subgroup) to treatment 1 being the best (rightmost subgroup). Following this pattern across the tree suggests that when the conditions of a household are better (leftmost subgroup), access to water and technical training take the most important roles, but when the conditions of the household are at their worst (rightmost subgroup), access to safe water has the highest probability of being the best treatment.

The results of the classification PSICA tree with bootstrapping present similar characteristics to the one without bootstrapping. Both methods recommend treatments 1, 2 and 3 for every subgroup. In addition, the probability shift mentioned above can also be seen here, from the leftmost subgroup to the rightmost one. One of the differences between the two is that the latter doesn't use the highest education level in the household, nor the limited variation of food in the household due to the lack of it; it also incorporates the house wall type into its split nodes. One interesting observation that might seem counter-intuitive is that in the rightmost subgroup, the probability of treatment 1 being the best is higher than that of treatment 2. It seems counter-intuitive, since treatment 1 contains only the safe water intervention, while treatment 2 contains the safe water and microcredit interventions. This might indicate that for this particular subgroup, microcredits have a negative effect on the UBN.

Data Model | No. Observations | Model Name | Sensitivity 1 | Sensitivity 2 | Sensitivity 3
M1 | 300 | Psica boot | - | 1.0000 (0.0000) | 0.0000 (0.0000)
M1 | 300 | Psica no boot | - | 1.0000 (0.0000) | 0.0000 (0.0000)
M1 | 300 | Personalized | - | 0.9994 (0.0002) | 0.1486 (0.0137)
M1 | 900 | Psica no boot | - | 1.0000 (0.0000) | 0.0000 (0.0000)
M1 | 900 | Personalized | - | 1.0000 (0.0000) | 0.3291 (0.0128)
M1 | 1800 | Psica no boot | - | 1.0000 (0.0000) | 0.0000 (0.0000)
M1 | 1800 | Personalized | - | 1.0000 (0.0000) | 0.4705 (0.0064)
M2 | 300 | Psica boot | 0.5136 (0.0348) | 0.4980 (0.0353) | -
M2 | 300 | Psica no boot | 0.4784 (0.0348) | 0.5223 (0.0352) | -
M2 | 300 | Personalized | 0.3370 (0.0191) | 0.8894 (0.0166) | -
M2 | 900 | Psica no boot | 0.5814 (0.0243) | 0.7749 (0.0277) | -
M2 | 900 | Personalized | 0.4644 (0.0081) | 0.9954 (0.0016) | -
M2 | 1800 | Psica no boot | 0.6454 (0.0243) | 0.9297 (0.0153) | -
M2 | 1800 | Personalized | 0.4835 (0.0052) | 0.9992 (0.0002) | -
M3 | 300 | Psica boot | 0.3800 (0.0344) | 0.2850 (0.0320) | 0.3350 (0.0334)
M3 | 300 | Psica no boot | 0.3438 (0.0334) | 0.3503 (0.0336) | 0.3113 (0.0323)
M3 | 300 | Personalized | 0.1431 (0.0188) | 0.4649 (0.0304) | 0.3970 (0.0299)
M3 | 900 | Psica no boot | 0.3530 (0.0336) | 0.3362 (0.0332) | 0.3090 (0.0323)
M3 | 900 | Personalized | 0.1862 (0.0212) | 0.4286 (0.0305) | 0.4098 (0.0307)
M3 | 1800 | Psica no boot | 0.3150 (0.0329) | 0.3750 (0.0343) | 0.3100 (0.0327)
M3 | 1800 | Personalized | 0.2310 (0.0236) | 0.3830 (0.0293) | 0.4010 (0.0291)
M4 | 300 | Psica boot | 0.2800 (0.0318) | 0.3588 (0.0339) | 0.3632 (0.0339)
M4 | 300 | Psica no boot | 0.3508 (0.0336) | 0.3936 (0.0344) | 0.2642 (0.0312)
M4 | 900 | Psica no boot | 0.2539 (0.0298) | 0.4984 (0.0347) | 0.2688 (0.0302)
M4 | 1800 | Psica no boot | 0.3220 (0.0320) | 0.4789 (0.0349) | 0.3036 (0.0312)

Table 5.2: Sensitivity results by class. The main number is the mean of the simulations and in parentheses the standard error of the mean. Variables without entries mean that there were no true positive nor false negative cases.

With UBN of 4

For the non-bootstrapping version of PSICA with the 7 added observations, a similar behaviour is seen regarding the subgroups and the allocation of the probability of the treatments being the best: the worse the conditions of a household are, the more the probability of a treatment being the best shifts from treatment 3 to treatment 1. One important thing to notice is that the structure and selection of the subgroups change, since the variables that form the splits differ between Figure 5.2 and Figure 5.4.

The results for the classification PSICA tree with bootstrapping and the 7 added observations, seen in Figure 5.5, are almost the same as the results of the model without a UBN of 4 seen in Figure 5.3. The same variables are selected at the same levels for the splits of the dataset, and the same split value is used in each node. The only difference is that the probabilities in some cases differ by ±0.01. This suggests that using bootstrapping in the first step of the method might give some robustness to the results given changes in the dataset.


Data Model  No. Observations  Model Name     Specificity1     Specificity2     Specificity3
M1          300               Psica boot     1.0000 (0.0000)  0.0000 (0.0000)  1.0000 (0.0000)
                              Psica no boot  1.0000 (0.0000)  0.0000 (0.0000)  1.0000 (0.0000)
                              Personalized   1.0000 (0.0000)  0.1486 (0.0137)  0.9994 (0.0002)
            900               Psica no boot  1.0000 (0.0000)  0.0000 (0.0000)  1.0000 (0.0000)
                              Personalized   1.0000 (0.0000)  0.3291 (0.0128)  1.0000 (0.0000)
            1800              Psica no boot  1.0000 (0.0000)  0.0000 (0.0000)  1.0000 (0.0000)
                              Personalized   1.0000 (0.0000)  0.4705 (0.0064)  1.0000 (0.0000)
M2          300               Psica boot     0.5080 (0.0353)  0.5236 (0.0348)  0.9900 (0.0070)
                              Psica no boot  0.5423 (0.0351)  0.4984 (0.0349)  0.9800 (0.0099)
                              Personalized   0.9228 (0.0143)  0.4012 (0.0214)  0.9439 (0.0089)
            900               Psica no boot  0.7749 (0.0277)  0.5814 (0.0243)  1.0000 (0.0000)
                              Personalized   0.9970 (0.0008)  0.4763 (0.0085)  0.9907 (0.0020)
            1800              Psica no boot  0.9297 (0.0153)  0.6454 (0.0134)  1.0000 (0.0000)
                              Personalized   0.9993 (0.0002)  0.4864 (0.0053)  0.9978 (0.0005)
M3          300               Psica boot     0.6200 (0.0344)  0.7150 (0.0320)  0.6650 (0.0334)
                              Psica no boot  0.6587 (0.0333)  0.6513 (0.0336)  0.6926 (0.0322)
                              Personalized   0.8621 (0.0189)  0.5325 (0.0303)  0.6079 (0.0294)
            900               Psica no boot  0.6465 (0.0336)  0.6631 (0.0332)  0.6895 (0.0324)
                              Personalized   0.8257 (0.0207)  0.5813 (0.0301)  0.6050 (0.0303)
            1800              Psica no boot  0.6850 (0.0329)  0.6250 (0.0343)  0.6900 (0.0327)
                              Personalized   0.3929 (0.0236)  0.6246 (0.0292)  0.6078 (0.0290)
M4          300               Psica boot     0.7200 (0.0318)  0.6405 (0.0338)  0.6410 (0.0338)
                              Psica no boot  0.6574 (0.0333)  0.6045 (0.0344)  0.7454 (0.0306)
            900               Psica no boot  0.7634 (0.0288)  0.4901 (0.0343)  0.7668 (0.0283)
            1800              Psica no boot  0.7690 (0.0287)  0.5312 (0.0341)  0.7729 (0.0268)

Table 5.3: Specificity simulation results. The main number is the mean over the simulations, with the standard error of the mean in parentheses.


6 Discussion

This chapter discusses the simulation results, the results for the PSICA method applied to the los Cuatro Santos data set and future research.

6.1 Simulations

One of the results that stands out is the execution time of the personalized method. In particular, for the data models M2 and M3, the execution time of this method is higher for n = 300 than for n = 900 and n = 1800. This seems counter-intuitive. However, it is due to the optimization procedure that personalized uses: during those runs it warns that the convergence criterion could not be met. Thus, for a low number of observations the optimization procedure had to run more iterations in order to try to converge. Another important fact to highlight is that the personalized package is faster than both PSICA methods. However, it is worth noting that the personalized package for multiple treatments fits a linear model, while PSICA has to fit at a minimum k random forests, where k is the total number of treatments, plus one additional decision tree.
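A minimal sketch of this cost structure is given below: one random forest per treatment arm, using the CRAN package randomForest and an assumed data frame dat holding a factor outcome y, a treatment column trt and the covariates. This only mirrors where the computational effort goes; it is not the internals of the PSICA implementation.

    library(randomForest)

    fit_arm_models <- function(dat) {
      # One forest per treatment arm, so the cost grows linearly with the
      # number of treatments k (the final decision tree comes on top of this).
      lapply(split(dat, dat$trt), function(arm) {
        arm$trt <- NULL  # constant within an arm, so drop it
        randomForest(y ~ ., data = arm)
      })
    }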

The results of the simulations are a good baseline for framing the PSICA method among the subgroup identification methods. However, the results must be interpreted with care. Both PSICA and personalized can make a treatment recommendation for a given patient. Unfortunately, these metrics ignore the fact that PSICA is a probabilistic model, and thus the results only incorporate the prediction with the highest probability, while personalized is deterministic, since it compares the values of the benefit score function between treatments. These metrics therefore completely ignore the probabilistic dimension of PSICA, which can come into play when there is uncertainty about which treatment is the best. For example, two treatments might give the same potential outcome, but one of them has a lower cost or is preferable to the other one for some reason. Another thing to consider is that PSICA, unlike personalized and other subgroup identification methods, can handle ordinal variables in the classification setting. In this case the simulations were done using a binary potential outcome (except for model M4).
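The sketch below illustrates the kind of rule this probabilistic dimension enables, using made-up probabilities of each treatment being the best for one subgroup and assumed per-treatment costs: a plain argmax recommendation versus a rule that, among near-tied treatments, prefers the cheaper one.

    # Hypothetical probabilities of each treatment being the best for one
    # subgroup (made-up numbers), together with assumed per-treatment costs.
    pi_hat <- c(t1 = 0.42, t2 = 0.40, t3 = 0.18)
    cost   <- c(t1 = 100,  t2 = 250,  t3 = 150)

    # Argmax rule: recommend the single most probable best treatment.
    names(which.max(pi_hat))

    # Probability-aware rule: among treatments within a tolerance of the top
    # probability, recommend the cheapest one.
    tol <- 0.05
    candidates <- names(pi_hat)[pi_hat >= max(pi_hat) - tol]
    candidates[which.min(cost[candidates])]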


6.2 Los Cuatro Santos results

Without UBN of 4

The results from PSICA with bootstrapping and PSICA without bootstrapping seem similar, and both appear plausible. They share the same top two variables, floor type and fridge, and they also share similar top treatments (1, 2, 3) for all the subgroups. Unfortunately, these results are obtained by breaking two assumptions on which the PSICA method relies. The first one assumes that the observations come from a randomized clinical trial; however, the data in this case comes from an observational study, meaning that the assignment of the treatments is out of the control of the researcher. The other broken assumption is that the treatment of one individual should not affect the potential outcome of another individual. In this case, the data is at a household level and collected over a community. This assumption is not met, since the social dynamics that a community might have, and the feedback loops that applying a treatment to one household could have on another, cannot be ignored. For example, if a household receives technical training, then other households might have access to this new specialized workforce. This could lead to better goods and services in the community and to a generalized decline in poverty. These two broken assumptions might lead to biased results.

One limitation of applying the PSICA method to this observational study is that the possible treatments are bound to the ones applied to the population. So, for example, in this particular case it is not possible to determine whether giving access to safe water alone is a good enough treatment for some subgroup, since in the data there are no instances where only the safe water treatment was applied to a household. There is also an added bias in that some of the treatments were cut from the final model due to the low number of households that had received them. Thus, in both resulting trees, where treatment 3 seems to be among the top ones in every subgroup, it cannot be determined whether this is because of the combination of access to water and technical training, or due to the technical training alone. This must be taken into account if policies are to be created or explored given these results.

With UBN of 4

As in the case where the dataset without a UBN of 4 is used, both generated trees present a similar structure and share all but one variable (Refrigerator in HH). In addition, these results are also subject to some bias introduced by breaking some modeling assumptions. As mentioned in Section 5.2, comparing the results on these two datasets suggests that the model with bootstrapping provides more reliable results than the one without. So extra care should be taken when using the results of the non-bootstrapping version of the classification PSICA.

6.3 Future research

One limitation of the work done, which is left as future research, is that method 2 for generating samples of Y to calculate the probabilities πk(x) of a treatment being the best was not implemented. This is because a solution for obtaining such samples from a classification tree was not found.
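For completeness, one conceivable direction is sketched below: draw each observation's ordinal outcome from its row of predicted class probabilities (as produced by, e.g., predict(..., type = "prob") for many tree-based fits in R). Whether this adequately reproduces method 2 within the PSICA framework is precisely the open question noted above.

    sample_y <- function(prob_matrix, levels) {
      # prob_matrix: n x m matrix of class probabilities; levels: the m
      # ordinal values Y can take. Draw one outcome per row.
      apply(prob_matrix, 1, function(p) sample(levels, size = 1, prob = p))
    }

    # Example with made-up probabilities for three observations, Y in {0, 1, 2}.
    P <- rbind(c(0.7, 0.2, 0.1),
               c(0.1, 0.3, 0.6),
               c(0.2, 0.5, 0.3))
    sample_y(P, levels = 0:2)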

As for the results of the PSICA method applied to the Nicaragua dataset, further research can be done regarding the possible negative effect that microcredits seem to have on a certain subgroup. In addition, further research should be done regarding the instability of the results of the classification PSICA given changes in the data used as input.


7 Conclusion

The main goals of this thesis were the following:

1. Generalize the PSICA method to a classification setting and modify the current implementation appropriately.

2. Investigate the efficiency of the classification PSICA and its performance in terms of execution time, and compare it to an alternative method.

3. Apply the classification PSICA to the Nicaragua data and analyse which interventions are most efficient in reducing poverty for different population groups.

The work done in this thesis defined a natural extension of the PSICA method to a classification setting where an ordinal potential outcome is of interest, and the current implementation was modified to support this new feature. In terms of efficiency, we conclude that the classification PSICA method presents a performance similar to competing methods, with the added value that it can handle ordinal potential outcomes, has a probabilistic output, and yields a resulting tree that is interpretable and usable for policymaking. However, these benefits come at the expense of a higher execution time. Finally, from the results of applying the PSICA tree to the los Cuatro Santos data set, we conclude that there are around 5 to 6 subgroups that benefit the most from microcredits, safe water access and technical training, or combinations of those treatments to different degrees. Given this, we see that the home gardening program was not successful in reducing poverty in the area. An important remark about these results is that when the conditions of a household are at the lowest end (for example, mud floor and no education), the treatment that benefits it the most is safe access to water.

